ETL
The code in this repo is part of the TEDective project. It defines an ETL
pipeline to transform European public procurement data from Tenders Electronic
Daily (TED) into a format that's easier to handle and analyse. Primarily, the
TED XMLs (and eForms, WIP) are transformed into Open Contracting Data
Standard (OCDS) JSON and parquet files to ease importing the data into a:
- Graph database (KuzuDB in our case, but the processed data should be generic
  enough to support any graph database) and a
- Search engine (Meilisearch in our case)
Organizations are deduplicated using Splink and linked to their GLEIF
identifiers (WIP) before they are imported into the graph database.
Background
The TEDective project aims to make European public procurement data explorable
for non-experts. This transformation is more or less based on the Open
Contracting Data Standard (OCDS) EU Profile.
As such, this pipeline can be used standalone or as part of your project that
does something interesting with TED data. We use it ourselves for the
TEDective API that powers the TEDective
UI.
Install
:construction: Disclaimer: install instructions are working as of 12th of April 2024, but they may be subject to change.
The ETL consists of two parts: the pipeline and the Luigi server (scheduler).
Using the PyPI package
The easiest way to install TEDective ETL is to use the PyPI package via pipx:
pipx install tedective-etl
pipx ensurepath
run-pipeline --help
Using Nix:
nix profile install git+https://git.fsfe.org/TEDective/etl
run-pipeline --help
Alternatively, you can clone this repository and build it via Nix yourself:
git clone https://git.fsfe.org/TEDective/etl && cd etl
nix build \
  --extra-experimental-features 'nix-command flakes' \
  --accept-flake-config
Manually
Another way is to use poetry directly.
After cloning this repo:
poetry install
poetry run run-pipeline --help
Usage
:construction: Disclaimer: usage instructions are working as of 12th of April 2024, but they may be subject to change.
General usage options
run-pipeline [-h] [--first-month FIRST_MONTH] [--last-month LAST_MONTH]
[--meilisearch-url MEILISEARCH_URL] [--in-dir IN_DIR]
[--output-dir OUTPUT_DIR] [--graph-dir GRAPH_DIR] [--local-scheduler]
options:
-h, --help show this help message and exit
--first-month FIRST_MONTH
The first month to process. Defaults to '2017-01'.
--last-month LAST_MONTH
The last month to process. Defaults to the last month.
--meilisearch-url MEILISEARCH_URL
The URL of the Meilisearch server. Defaults to
'http://localhost:7700'
--in-dir IN_DIR The directory to store the TED XMLs. Defaults to '/tmp/ted_notices'
--output-dir OUTPUT_DIR
The directory to store the output data. Defaults to '/tmp/output'
--graph-dir GRAPH_DIR
The name of the KuzuDB graph. Defaults to '/tmp/graph'
--local-scheduler Use the local scheduler.
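For example, a run that only processes the first two months of 2023 and keeps all data under ./data could look like this (an illustrative sketch using only the flags documented above; the paths are placeholders):
run-pipeline --first-month 2023-01 --last-month 2023-02 --in-dir ./data/ted_notices --output-dir ./data/output --graph-dir ./data/graph --local-scheduler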
Using the PyPI package
After installation you should be able to run both the Luigi scheduler and the pipeline:
run-server
run-pipeline
Additionally, you can run a Meilisearch instance so that the search indexes can be built.
It is NOT provided with the PyPI package; you can install it using your favourite package manager. Installing it is recommended if you plan to use the parsed data with the TEDective API.
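A typical session could look like the sketch below (assuming the meilisearch binary is on your PATH and each long-running process is started in its own terminal or backgrounded; port 7700 is Meilisearch's default and also the pipeline's default):
meilisearch --http-addr localhost:7700
run-server
run-pipeline --meilisearch-url http://localhost:7700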
Using Nix
result/bin/run-pipeline --help
result/bin/run-server
run-pipeline --last-month 2017-02
In this case you can also run Meilisearch to build the search indexes. That can be done inside the devenv; more on that further down.
Manually (using poetry)
Running the pipeline requires a running Luigi daemon. It is included in the
project and you can run it with the following commands:
poetry run run-server
poetry run run-pipeline
It is recommended to run Meilisearch as well; if using this method, you will have to install it manually too.
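If you just want a quick local run without keeping the Luigi daemon around, the --local-scheduler flag lets Luigi schedule tasks in-process (a sketch, not the recommended setup for full runs):
poetry run run-pipeline --local-scheduler --last-month 2017-02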
Maintainers
@linozen
@micgor32
Contributing
1. Nix development environment
The easiest way to start developing, if you are using Nix, is to use devenv via the provided flake.nix.
nix develop --impure \
  --extra-experimental-features 'nix-command flakes' \
  --accept-flake-config
kuzu-up
kuzu-down
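A typical development loop inside the dev shell could then look like this (a sketch using only the commands shown in this README; kuzu-up and kuzu-down are assumed to start and stop the local KuzuDB instance):
kuzu-up
run-pipeline --last-month 2017-02 --local-scheduler
kuzu-down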
2. Editing documentation
Small note: If editing the README, please conform to the standard-readme
specification. Also, please ensure that documentation is kept in sync with the
code. Please note that the main documentation repository is added to this
repository via git-subrepo. To
update the documentation, please use the following commands:
git-subrepo pull docs
cd ./docs
git commit -am "docs: update documentation for new feature"
pnpm install
pnpm run dev
git-subrepo push docs
License
EUPL-1.2 © 2024 Free Software Foundation Europe e.V.