What is @dataform/core?
@dataform/core is the core library of Dataform, an open-source framework for managing SQL-based data transformations in a data warehouse. It lets you define tables and views, declare dependencies between them, and compile a project into a dependency graph that can be executed in order. It is particularly useful for teams maintaining transformation pipelines on large-scale data warehouses.
What are @dataform/core's main functionalities?
Defining Data Models
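This feature lets you define data models as SQL queries. In a JavaScript file under the project's definitions/ directory, the `publish` function declares a dataset; its config sets the type (for example `table`, `view`, or `incremental`), and `.query()` supplies the SQL that defines its content. The context argument passed to the query callback provides `ref`.
// definitions/my_table.js
// publish, declare, operate and assert are globals that Dataform injects
// when it compiles the project, so no require() call is needed here.
publish('my_table', { type: 'table' }).query(ctx => `
  SELECT *
  FROM ${ctx.ref('source_table')}
`);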
Scheduling Data Transformations
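Transformations are typically run on a schedule expressed in cron syntax. Scheduling is not done through a function exported by @dataform/core: the core library compiles the project, and runs are triggered by whatever executes it, for example the Dataform CLI invoked from a cron job or the schedule settings of a hosted Dataform service. What the project itself defines is the work to run, such as an operation created with `operate`.
// definitions/update_table.js
// Defines the operation to run; the schedule lives outside the project code.
operate('update_table').queries('CALL update_table_procedure();');

// Example crontab entry that runs the project daily at midnight.
// Assumes the Dataform CLI is installed; /path/to/project is a placeholder.
// 0 0 * * * cd /path/to/project && dataform run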
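To make the cron expression `0 0 * * *` concrete, here is a minimal sketch of how a five-field expression can be checked against a timestamp. The `matchesCron` helper is hypothetical and is not part of @dataform/core or any scheduler; it assumes UTC and supports only `*` and single integers in each field (no ranges, lists, or steps).

```javascript
// Hypothetical helper, for illustration only: supports "*" and single integers.
function matchesCron(expression, date) {
  const fields = expression.trim().split(/\s+/);
  if (fields.length !== 5) {
    throw new Error('Expected 5 cron fields: minute hour day-of-month month day-of-week');
  }
  // Values of the given date, in the same order as the cron fields (UTC assumed).
  const actual = [
    date.getUTCMinutes(),
    date.getUTCHours(),
    date.getUTCDate(),
    date.getUTCMonth() + 1, // cron months are 1-12, JavaScript months are 0-11
    date.getUTCDay(),       // 0 = Sunday
  ];
  // A field matches if it is a wildcard or equals the corresponding value.
  return fields.every((field, i) => field === '*' || Number(field) === actual[i]);
}

const midnight = new Date(Date.UTC(2024, 0, 15, 0, 0));
const noon = new Date(Date.UTC(2024, 0, 15, 12, 0));

console.log(matchesCron('0 0 * * *', midnight)); // true
console.log(matchesCron('0 0 * * *', noon));     // false
```

Read left to right, the fields are minute, hour, day of month, month, and day of week, so `0 0 * * *` fires once a day at 00:00.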
Managing Dependencies
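This feature lets you manage dependencies between datasets and operations. Referencing another dataset through `ref` does two things: it resolves to that dataset's fully qualified name in the generated SQL, and it registers the dataset as a dependency so that it is built first. Dependencies that do not appear in the query can be listed by name in the `dependencies` config option.
// definitions/dependent_table.js
// ctx.ref('base_table') alone registers base_table as a dependency;
// a separate dependencies entry is not needed for it.
publish('dependent_table', { type: 'table' }).query(ctx => `
  SELECT *
  FROM ${ctx.ref('base_table')}
`);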
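The way `ref` calls turn into an execution order can be sketched with a small, self-contained example. This is not Dataform's implementation: `createRegistry`, `define`, and `executionOrder` are hypothetical names, and the ordering is a plain depth-first topological sort with cycle detection.

```javascript
// Hypothetical mini-registry, for illustration only; not part of @dataform/core.
function createRegistry() {
  const actions = new Map(); // name -> { deps: Set<string>, sql: string }

  function define(name, buildQuery) {
    const deps = new Set();
    // ref records the dependency and returns the name for use in the SQL text.
    const ref = (target) => {
      deps.add(target);
      return target;
    };
    const sql = buildQuery(ref);
    actions.set(name, { deps, sql });
  }

  function executionOrder() {
    const order = [];
    const state = new Map(); // name -> 'visiting' | 'done'

    function visit(name, path) {
      if (state.get(name) === 'done') return;
      if (state.get(name) === 'visiting') {
        throw new Error(`Circular dependency: ${[...path, name].join(' -> ')}`);
      }
      state.set(name, 'visiting');
      const action = actions.get(name);
      // Names without a definition are treated as external source tables.
      if (action) {
        for (const dep of action.deps) visit(dep, [...path, name]);
      }
      state.set(name, 'done');
      order.push(name); // pushed only after all of its dependencies
    }

    for (const name of actions.keys()) visit(name, []);
    return order;
  }

  return { define, executionOrder };
}

const registry = createRegistry();
// Defined out of order on purpose: the sort still puts base_table first.
registry.define('dependent_table', (ref) => `SELECT * FROM ${ref('base_table')}`);
registry.define('base_table', (ref) => `SELECT * FROM ${ref('source_table')}`);

const order = registry.executionOrder();
console.log(order); // [ 'source_table', 'base_table', 'dependent_table' ]
```

Because dependencies are collected as a side effect of building the query, the order in which files are defined does not matter; only the references between them do.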
Other packages similar to @dataform/core
dbt
dbt (data build tool) is a command-line tool that lets data analysts and engineers transform data inside their warehouse. Like @dataform/core, it lets you define data models, manage dependencies between them, and test and document the results. The main difference is in authoring: dbt models are written in SQL with Jinja templating, whereas Dataform projects use SQL with JavaScript.
airflow
Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. It is more general-purpose than @dataform/core, as it can be used for a wide range of workflow automation tasks beyond just data transformations.
luigi
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, and more. Like @dataform/core, it is designed for managing data workflows, but it is more focused on batch processing.