BeETL: Extensible Python/Polars-based ETL Framework
BeETL grew out of work as an integration developer, where most of the integrations we build follow the same pattern: get data here, transform it a little, put it there (with the middle step frequently missing altogether).
After building our 16th integration between the same two systems from yet another manual template, we decided to build BeETL. Each sync is currently limited to one data source as its source and one as its destination, but this will be expanded in the future. One configuration can contain multiple syncs.
Note: Even though most of the configuration below is in YAML format, you can also use JSON or a Python dictionary.
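As a small illustration of that equivalence, the sketch below parses a JSON document into the same kind of Python dictionary used in the examples that follow. It relies only on the standard library `json` module; the configuration is a hypothetical, trimmed-down fragment rather than a complete one, and how BeETL loads configuration files internally is not shown here.

```python
import json

# Hypothetical, trimmed-down configuration written as JSON text.
config_json = """
{
  "version": "V1",
  "sources": [],
  "sync": []
}
"""

# json.loads turns the JSON text into plain Python dicts and lists.
config_dict = json.loads(config_json)

# The same configuration written directly as a Python dictionary.
expected = {"version": "V1", "sources": [], "sync": []}

# Both forms describe identical data.
print(config_dict == expected)
```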
Installation
From PyPI
pip3 install beetl
From Source
git clone https://
python3 setup.py install
Quick Start
The following is the minimum configuration needed to get started with a simple sync:
from src.beetl.beetl import Beetl, BeetlConfig

sync_config = {
    "version": "V1",
    "sources": [
        {
            "name": "mysql_db",
            "type": "Mysql",
            "connection": {
                "settings": {
                    "connection_string": "mysql://user:password@host:3306/database"
                }
            }
        },
        {
            "name": "postgres_db",
            "type": "Postgres",
            "connection": {
                "settings": {
                    "connection_string": "postgresql://user:password@host:5432/database"
                }
            }
        }
    ],
    "sync": [
        {
            "source": "mysql_db",
            "destination": "postgres_db",
            "sourceConfig": {
                "query": "SELECT field1, field2, field3 FROM table1",
                "columns": [
                    {
                        "name": "field1",
                        "type": "Int32",
                        "unique": True
                    },
                    {
                        "name": "field2",
                        "type": "Utf8",
                        "unique": False
                    },
                    {
                        "name": "field3",
                        "type": "Utf8",
                        "unique": False
                    }
                ]
            },
            "destinationConfig": {
                "table": "table1",
                "columns": [
                    {
                        "name": "field1",
                        "type": "Int32",
                        "unique": True
                    },
                    {
                        "name": "field2",
                        "type": "Utf8",
                        "unique": False
                    },
                    {
                        "name": "field3",
                        "type": "Utf8",
                        "unique": False,
                        "skip_update": True
                    }
                ]
            },
            "sourceTransformers": {},
            "insertionTransformers": {}
        }
    ]
}
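Because each sync refers to its source and destination by name, a misspelled name is an easy mistake to make. The sketch below is a hypothetical pre-flight check and not part of BeETL: `find_unknown_sources` is a name made up for this example, and it assumes only the dictionary layout shown above.

```python
def find_unknown_sources(config: dict) -> list:
    """Return names used in "sync" entries that are not defined in "sources"."""
    # Collect every name declared in the "sources" section.
    defined = {source["name"] for source in config.get("sources", [])}
    unknown = []
    for sync in config.get("sync", []):
        # Each sync names exactly one source and one destination.
        for key in ("source", "destination"):
            name = sync[key]
            if name not in defined and name not in unknown:
                unknown.append(name)
    return unknown


# Trimmed-down example with a deliberately misspelled destination.
example_config = {
    "version": "V1",
    "sources": [
        {"name": "mysql_db", "type": "Mysql"},
        {"name": "postgres_db", "type": "Postgres"},
    ],
    "sync": [
        {"source": "mysql_db", "destination": "postgress_db"},
    ],
}

print(find_unknown_sources(example_config))
```

Running a check like this before starting a sync surfaces naming mistakes without opening any database connections.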
Secrets from Environment Variables
If you want to keep your secrets in environment variables instead of in the YAML configuration file, you can store the sources as JSON in an environment variable and replace the "sources" section with the "sourcesFromEnv" setting, whose value is the name of that variable.
Note that the "sources" and "sourcesFromEnv" options are mutually exclusive.
Python dictionary:
sync_config = {
    "version": "V1",
    "sourcesFromEnv": "BEETL_SOURCES",
    "sync": [
        .....
YAML:
version: "V1"
sourcesFromEnv: "BEETL_SOURCES"
sync:
  - ......
JSON:
{
    "version": "V1",
    "sourcesFromEnv": "BEETL_SOURCES",
    "sync": [
        ......
The value of the environment variable uses the same format as the "sources" section:
[
    {
        "name": "mysql_db",
        "type": "Mysql",
        "connection": {
            "settings": {
                "connection_string": "mysql://user:password@host:3306/database"
            }
        }
    },
    {
        "name": "postgres_db",
        "type": "Postgres",
        "connection": {
            "settings": {
                "connection_string": "postgresql://user:password@host:5432/database"
            }
        }
    }
]
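To show how such a variable could be populated, the sketch below serializes a sources list to JSON, stores it under `BEETL_SOURCES`, and reads it back to confirm the round trip. It uses only the standard library. Setting `os.environ` from Python affects the current process only, so in practice the variable would more likely be set by the shell or a deployment tool. How BeETL itself parses the variable is not shown here.

```python
import json
import os

# The same structure that would otherwise live under "sources".
sources = [
    {
        "name": "mysql_db",
        "type": "Mysql",
        "connection": {
            "settings": {
                "connection_string": "mysql://user:password@host:3306/database"
            }
        },
    },
    {
        "name": "postgres_db",
        "type": "Postgres",
        "connection": {
            "settings": {
                "connection_string": "postgresql://user:password@host:5432/database"
            }
        },
    },
]

# Store the list as a JSON string in the environment (current process only).
os.environ["BEETL_SOURCES"] = json.dumps(sources)

# The configuration then names the variable instead of listing the sources.
sync_config = {
    "version": "V1",
    "sourcesFromEnv": "BEETL_SOURCES",
    "sync": [],  # sync definitions omitted in this sketch
}

# Reading the variable back yields the original list.
restored = json.loads(os.environ[sync_config["sourcesFromEnv"]])
print(restored == sources)
```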