
@dawadk/import-util
This package contains utilities for importing data into the DAWA database.
DAWA is based on the PostgreSQL database. All importing logic is essentially SQL based. Data from the different data sources is streamed to the database, and is then operated on using SQL statements. Thus, this library is primarily a library for streaming data to the database and generating SQL statements to operate on these data.
The DAWA database is large. It consists of more than a hundred tables, originating from around 10 different sources. Furthermore, the database is highly denormalized in order to provide efficient queries and maintain API compatibility. An additional challenge is that all data is provided to consumers via the replication API. Therefore, events must be generated and stored reliably whenever the data is changed.
For (almost) every table in DAWA, an associated change table stores all changes to the table. We may distinguish between base tables, which contain normalized data in a format similar to the data provided by the data sources, and derived tables, which combine data from multiple base tables and add columns such as PostgreSQL term search vectors (TSVs) for full text searching.
Derived tables are usually specified using SQL views.
Data importing follows the following pattern:
Some importers are capable of running incrementally without recomputing the entire state of the table.
All data manipulation is transactional. For each transaction, a unique transaction ID (txid) is generated. Transactions are stored in the transactions table.
Each event describes a single operation on a table, either an insert, update or delete operation.
Events, changes and operations are generally used interchangeably.
Every change is associated with a transaction. The table tx_operation_counts stores the number of changes performed in each transaction.
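As a rough illustration of the bookkeeping described above, the sketch below counts events per transaction in memory. The event shape, the table name streets, and the grouping by table and operation are assumptions made for the example; the actual layout of tx_operation_counts is not described here.

```javascript
// Hypothetical events: each one is a single insert, update or delete
// belonging to exactly one transaction (txid).
const events = [
  { txid: 1, table: 'streets', operation: 'insert' },
  { txid: 1, table: 'streets', operation: 'insert' },
  { txid: 1, table: 'streets', operation: 'update' },
  { txid: 2, table: 'streets', operation: 'delete' },
];

// Count events per (txid, table, operation). The grouping is an assumption.
function countOperations(eventList) {
  const counts = new Map();
  for (const e of eventList) {
    const key = JSON.stringify([e.txid, e.table, e.operation]);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts].map(([key, count]) => {
    const [txid, table, operation] = JSON.parse(key);
    return { txid, table, operation, count };
  });
}

console.log(countOperations(events));
```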
Events are stored in change tables. There is one change table for each primary table. It is conventionally named by the name of the primary table suffixed by "_changes". It has the same columns as the primary table and the following additional columns:
txid: The transaction in which the change was performed.
operation: Whether the change is an insert, update or delete.
public: A boolean indicating whether the change is public (visible on the Replication API) or private (not visible; only internal columns such as TSVs are modified). The concept of non-public events is deprecated. Instead, it is preferred to create additional derived tables with any non-public columns.
changeid: The sequence number (sekvensnummer) for the event. Sequence numbers are deprecated, and generated for backwards compatibility purposes only.
The PostgreSQL database is able to receive streaming data in CSV format directly into a table using the COPY command. We use Node.js streams to stream directly from input files to the database. The file src/postgres-streaming.js contains utility functions around the COPY command.
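The streaming idea can be sketched with Node.js core streams alone. The helper names below (csvField, csvLine, csvEncoder, copySql) are hypothetical and are not the functions in src/postgres-streaming.js. In a real importer the encoder would be piped into a writable stream provided by the PostgreSQL client's COPY support; here it is piped to standard output instead.

```javascript
const { Readable, Transform } = require('stream');

// Encode one value using PostgreSQL's default CSV conventions:
// NULL is an unquoted empty field, other values are quoted with "" escaping.
function csvField(value) {
  if (value === null || value === undefined) return '';
  return '"' + String(value).replace(/"/g, '""') + '"';
}

function csvLine(row, columns) {
  return columns.map((c) => csvField(row[c])).join(',') + '\n';
}

// Transform stream: row objects in, CSV text out.
function csvEncoder(columns) {
  return new Transform({
    writableObjectMode: true,
    transform(row, _encoding, callback) {
      callback(null, csvLine(row, columns));
    },
  });
}

// The COPY statement that would receive the CSV stream.
function copySql(table, columns) {
  return `COPY ${table} (${columns.join(', ')}) FROM STDIN WITH (FORMAT csv)`;
}

// Usage: the table and rows are made up for the example.
const columns = ['nr', 'navn'];
const rows = [{ nr: 8000, navn: 'Aarhus C' }, { nr: 9999, navn: null }];
console.log(copySql('postnumre', columns));
Readable.from(rows).pipe(csvEncoder(columns)).pipe(process.stdout);
```

Because each row is encoded as it passes through the stream, the full input never has to be held in memory.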
Comparing tables is an essential part of the importing process. The implementation of this functionality is in the file src/table-diff.js. The comparison process is also capable of deriving new data, such as TSVs, bounding boxes, visual centers and timestamps.
The comparison function requires a description of the table structure in order to perform comparisons with a source table or view. This description is called a table model. The table model specifies:
Each column of a table model must implement the protocol specified in src/table-diff-protocol.js; see the source code docs for details. Most columns simply compare the value in the table with the value from the source, but other columns implement complex behaviors, such as uploading their content to S3.
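To make the comparison step concrete, here is a minimal in-memory sketch. It assumes a simplified table model with a primary key and a list of columns, where each column may supply its own equal function. This shape is hypothetical and is not the protocol in src/table-diff-protocol.js; the library itself performs the comparison with SQL inside the database.

```javascript
// Hypothetical, simplified table model.
const tableModel = {
  table: 'postnumre',
  primaryKey: ['nr'],
  columns: [{ name: 'nr' }, { name: 'navn' }],
};

function keyOf(model, row) {
  return JSON.stringify(model.primaryKey.map((k) => row[k]));
}

// A column may override equality; the default is strict equality.
function rowsDiffer(model, a, b) {
  return model.columns.some((col) => {
    const equal = col.equal || ((x, y) => x === y);
    return !equal(a[col.name], b[col.name]);
  });
}

// Compare the current table contents with the source and return the
// insert, update and delete events needed to make them identical.
function diffTable(model, current, source) {
  const currentByKey = new Map(current.map((r) => [keyOf(model, r), r]));
  const sourceKeys = new Set();
  const changes = [];
  for (const row of source) {
    const key = keyOf(model, row);
    sourceKeys.add(key);
    const existing = currentByKey.get(key);
    if (!existing) changes.push({ operation: 'insert', row });
    else if (rowsDiffer(model, existing, row)) changes.push({ operation: 'update', row });
  }
  for (const [key, row] of currentByKey) {
    if (!sourceKeys.has(key)) changes.push({ operation: 'delete', row });
  }
  return changes;
}

const current = [{ nr: 1, navn: 'A' }, { nr: 2, navn: 'B' }];
const source = [{ nr: 2, navn: 'C' }, { nr: 3, navn: 'D' }];
console.log(diffTable(tableModel, current, source));
```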
Changing data in tables simply involves applying the operations in the change table to the primary table.
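One way to picture this step is as three SQL statements per table, generated from the table's column list. The function applyChangesSql and the model shape are hypothetical. The sketch assumes that each row has at most one change per transaction, that the table has at least one non-key column, and that txid is passed as query parameter $1.

```javascript
// Build the statements that replay one transaction's changes from
// <table>_changes onto the primary table.
function applyChangesSql(model) {
  const { table, primaryKey, columns } = model;
  const changes = `${table}_changes`;
  const nonKey = columns.filter((c) => !primaryKey.includes(c));
  const keyMatch = primaryKey.map((k) => `t.${k} = c.${k}`).join(' AND ');
  const assignments = nonKey.map((c) => `${c} = c.${c}`).join(', ');
  const columnList = columns.join(', ');
  return [
    `DELETE FROM ${table} t USING ${changes} c ` +
      `WHERE c.txid = $1 AND c.operation = 'delete' AND ${keyMatch}`,
    `UPDATE ${table} t SET ${assignments} FROM ${changes} c ` +
      `WHERE c.txid = $1 AND c.operation = 'update' AND ${keyMatch}`,
    `INSERT INTO ${table} (${columnList}) SELECT ${columnList} FROM ${changes} ` +
      `WHERE txid = $1 AND operation = 'insert'`,
  ];
}

const statements = applyChangesSql({
  table: 'postnumre',
  primaryKey: ['nr'],
  columns: ['nr', 'navn'],
});
statements.forEach((sql) => console.log(sql));
```

Running deletes before inserts is a design choice in this sketch: it avoids transient unique-key conflicts when a key is removed and reused within the same batch.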
In addition to comparing an entire table against a source table, it is also possible to perform such a comparison on a subset of the table. Before this can happen, a list of dirty rows must be computed, that is, rows that may potentially be changed. The primary keys of the dirty rows are stored in a temporary dirty table.
Sequence numbers are generated at the end of the importing transaction. The code generating the sequence numbers can be found in the file src/transaction.js.
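The effect of assigning sequence numbers late can be illustrated with a small in-memory sketch. The function name, the event shape, and the rule that numbers continue from the last one issued in event order are all assumptions; the actual logic lives in src/transaction.js and may differ.

```javascript
// Assign consecutive changeid values to the events of a finished
// transaction, continuing from the last sequence number already issued.
function assignSequenceNumbers(events, lastSequenceNumber) {
  let next = lastSequenceNumber;
  return events.map((event) => ({ ...event, changeid: ++next }));
}

const numbered = assignSequenceNumbers(
  [
    { txid: 7, operation: 'insert' },
    { txid: 7, operation: 'update' },
  ],
  100
);
console.log(numbered);
```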
Derived tables are also called materializations because they resemble SQL materialized views. The file src/materialize.js contains functions to update derived tables.
A materialization consists of:
Some materializations can be computed incrementally. In order for this to happen, it is necessary to compute the set of dirty rows, that is, rows that may potentially be modified. This is not possible to do efficiently in the general case, but there is support for doing it in the cases where the derived table contains a foreign key reference to the table it is derived from.
A derived table may have both incremental and non-incremental dependencies.
An incremental dependency is possible if and only if there is a foreign key relation between the derived table and the dependency table. The foreign key dependency is part of the materialization model.
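The role of the foreign key can be sketched as a generated query that collects dirty primary keys for a derived table. Everything named here is hypothetical: the function dirtyRowsSql, the model fields, the table and view names, and the convention that the temporary dirty table is called <table>_dirty and already exists. The query looks at both the view (rows that should exist) and the derived table (rows that currently exist), so that inserts, updates and deletes are all covered.

```javascript
// Build a statement that fills <table>_dirty with the primary keys of
// derived rows referencing a dependency row changed in transaction $1.
function dirtyRowsSql(materialization, dependency) {
  const { table, view, primaryKey } = materialization;
  const { table: depTable, primaryKey: depKey, foreignKey } = dependency;
  const keys = primaryKey.join(', ');
  // foreignKey[i] in the derived table references depKey[i] in the dependency.
  const join = foreignKey.map((fk, i) => `s.${fk} = c.${depKey[i]}`).join(' AND ');
  const select = (source) =>
    `SELECT ${primaryKey.map((k) => `s.${k}`).join(', ')} FROM ${source} s ` +
    `JOIN ${depTable}_changes c ON ${join} WHERE c.txid = $1`;
  return `INSERT INTO ${table}_dirty (${keys}) ${select(view)} UNION ${select(table)}`;
}

console.log(
  dirtyRowsSql(
    { table: 'adresser_mat', view: 'adresser_mat_view', primaryKey: ['id'] },
    { table: 'postnumre', primaryKey: ['nr'], foreignKey: ['postnr'] }
  )
);
```

Without such a foreign key there is no cheap way to map a changed dependency row to the derived rows it affects, which is why the non-incremental case falls back to recomputing the whole table.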
FAQs
We found that @dawadk/import-util demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 2 open source maintainers collaborating on the project.