Datagit is a Git-based metric store library.
>>> from datagit import github_connector
>>> from github import Github
>>> from google.cloud import bigquery
>>> # query holds the SQL text to run
>>> dataframe = bigquery.Client().query(query).to_dataframe()
{"unique_key": ['2022-01-01_FR', '2022-01-01_GB'...
>>> github_connector.store_metric(ghClient=Github("Token"), dataframe=dataframe, filepath="Samox/datagit/data/act_metrics_finance/mrr.csv", assignees=["Samox"])
'🎉 data/act_metrics_finance/mrr.csv Successfully stored!'
'💩 Historical data change detected, Samox was assigned to it'
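Data that is supposed to be non-moving rarely is: in reality, data moves, or drifts. The purpose of this library is to store metrics in a Git repository so that changes to historical data can be detected and tracked over time.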
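To get started with Datagit, follow these steps:
1. Create a new GitHub repository named datagit (or whatever other name you prefer) with a README file.
2. Generate a personal access token with access to the datagit repository. You can do this by going to your GitHub settings, selecting "Developer settings", and then "Personal access tokens". Click "Generate new token" and give it the necessary permissions (content and pull requests).
3. Call store_metric with the following parameters:
   - ghClient: a PyGithub client created with your token
   - dataframe: the pandas dataframe containing the metric
   - filepath: the destination file, in the form Organisation/Repository/path/to/file.csv
   - assignees: the GitHub users to assign when a historical data change is detected
For instance: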
>>> from github import Github
>>> from datagit import github_connector
>>> github_connector.store_metric(ghClient=Github("Token"), dataframe=dataframe, filepath="Samox/datagit/data/act_metrics_finance/mrr.csv", assignees=["Samox"])
That's it! With these steps, you can start using Datagit to store and track your metrics over time.
>>> githubToken = "github_pat****"
>>> githubRepo = "ReplaceOrgaName/ReplaceRepoName"
>>> import pandas as pd
>>> from datetime import datetime
>>> dataframe = pd.DataFrame({'unique_key': ['a', 'b', 'c'], 'date': [datetime(2023,9,1), datetime(2023,9,1), datetime(2023,9,1)], 'amount': [1001, 1002, 1003], 'is_active': [True, False, True]})
>>> from github import Github
>>> from datagit import github_connector
>>> github_connector.store_metric(ghClient=Github(githubToken), dataframe=dataframe, filepath=githubRepo+"/data/act_metrics_finance/mrr.csv")
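The filepath values in the examples above appear to combine the repository name (Organisation/Repository) with the path of the file inside that repository. The sketch below only illustrates that reading of the format. The helper name split_filepath is hypothetical and is not part of datagit.

```python
def split_filepath(filepath):
    """Split 'Org/Repo/path/to/file.csv' into ('Org/Repo', 'path/to/file.csv').

    Hypothetical helper: it illustrates the filepath layout used in the
    examples and is not a datagit function.
    """
    parts = filepath.split("/", 2)
    if len(parts) < 3 or not all(parts):
        raise ValueError(
            "expected 'Organisation/Repository/path/to/file.csv', got %r" % filepath
        )
    owner, repo, path = parts
    return owner + "/" + repo, path


repo_name, file_path = split_filepath("Samox/datagit/data/act_metrics_finance/mrr.csv")
print(repo_name)
print(file_path)
```

A check like this can catch a missing "/" between the repository name and the file path before anything is sent to GitHub.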
Datagit is based on the standard dataframe format from Pandas. Any library can be used to get the data, as long as the format fits the following requirements:
- a unique_key column
The granularity of the dataframe depends on the use case.
The unique_key is used to detect a modification in historical data.
If several rows share the same unique_key, datagit automatically renames the duplicates by appending -duplicate-n to the key.
Before:
  unique_key  value
0          A     10
1          B     20
2          C     30
3          B     40
4          C     50
5          C     60
6          D     70
After:
      unique_key  value
0              A     10
1              B     20
2              C     30
3  B-duplicate-1     40
4  C-duplicate-1     50
5  C-duplicate-2     60
6              D     70
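The renaming shown above can be reproduced with plain pandas. The following is a minimal sketch of the idea, not datagit's own implementation. The function name rename_duplicates is an assumption made for illustration.

```python
import pandas as pd


def rename_duplicates(df, key="unique_key"):
    """Append '-duplicate-n' to the 2nd, 3rd, ... occurrence of each key.

    Illustrative sketch only; datagit's internal logic may differ.
    """
    out = df.copy()
    # cumcount() numbers the rows within each key group: 0, 1, 2, ...
    occurrence = out.groupby(key).cumcount()
    suffix = occurrence.map(lambda n: "-duplicate-%d" % n if n > 0 else "")
    out[key] = out[key].astype(str) + suffix
    return out


df = pd.DataFrame({
    "unique_key": ["A", "B", "C", "B", "C", "C", "D"],
    "value": [10, 20, 30, 40, 50, 60, 70],
})
print(rename_duplicates(df))
```

The first occurrence of each key is left untouched, so keys that were already unique are not affected.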
The date column is used to detect added or deleted historical data.
Datagit provides a simple query builder to store a table:
>>> from datagit import query_builder
>>> query = query_builder.build_query(table_id="my_table", unique_key_columns=["organisation_id", "date_month"], date="date_month")
>>> query
"SELECT CONCAT(organisation_id, '__', date_month) AS unique_key, date_month as date, * FROM my_table WHERE TRUE ORDER BY 1"
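To make the shape of the generated query explicit, here is a small sketch that rebuilds the same SQL string by hand. It is not the library's query_builder code. The function name build_query_sketch is hypothetical, and the only behaviour it reproduces is the single example shown above.

```python
def build_query_sketch(table_id, unique_key_columns, date):
    """Rebuild the example query: key columns are concatenated with '__'.

    Hypothetical re-implementation for illustration; only checked against
    the single example in this README.
    """
    key_expr = "CONCAT(" + ", '__', ".join(unique_key_columns) + ")"
    return (
        "SELECT " + key_expr + " AS unique_key, "
        + date + " as date, * FROM " + table_id + " WHERE TRUE ORDER BY 1"
    )


query = build_query_sketch(
    table_id="my_table",
    unique_key_columns=["organisation_id", "date_month"],
    date="date_month",
)
print(query)
```

The point of the sketch is that unique_key is derived from the columns that define the granularity of the table, and date is an alias of an existing date column.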
For tables with more than 1M rows, partitioning is recommended, using the partition_and_store_table function.
>>> from datagit import github_connector
>>> from github import Github
>>> from google.cloud import bigquery
>>> very_large_dataframe = bigquery.Client().query(query).to_dataframe()
{"unique_key": ['2022-01-01_FR', '2022-01-01_GB'...
>>> github_connector.partition_and_store_table(ghClient=Github("Token"), dataframe=very_large_dataframe, filepath="Samox/datagit/data/act_metrics_finance/mrr.csv")
'🎁 Partitionning data/act_metrics_finance/mrr.csv...'
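How partition_and_store_table splits the data is not described here. As a rough illustration of the idea of partitioning, the sketch below splits a dataframe by calendar month of its date column. Both the monthly granularity and the function name partition_by_month are assumptions, not documented datagit behaviour.

```python
from datetime import datetime

import pandas as pd


def partition_by_month(df, date_col="date"):
    """Split a dataframe into one chunk per calendar month.

    Assumption for illustration: partitions are monthly. The actual
    strategy used by datagit may differ.
    """
    months = pd.to_datetime(df[date_col]).dt.strftime("%Y-%m")
    return {month: part for month, part in df.groupby(months)}


df = pd.DataFrame({
    "unique_key": ["a", "b", "c"],
    "date": [datetime(2023, 9, 1), datetime(2023, 9, 15), datetime(2023, 10, 1)],
    "amount": [1001, 1002, 1003],
})
partitions = partition_by_month(df)
for month, part in partitions.items():
    print(month, len(part))
```

Each chunk could then be stored as its own file, which keeps individual files and diffs small.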
A drift is a modification of historical data. It can be a modification, addition or deletion in a table that is supposed to be "non-moving data".
When a drift is detected, the default behaviour is to trigger an alert and prompt the user to explain the drift before merging it into the dataset. A custom function can be used instead to decide whether an alert should be triggered or the drift should be merged automatically.
The default drift evaluator opens a pull request with a message containing the number of additions, modifications and deletions in the drift.
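The three counts mentioned above can be computed by comparing the previously stored dataframe with the newly computed one on unique_key. This is a minimal sketch of that comparison, assuming unique keys in both dataframes. The function name summarize_drift and its return format are hypothetical and do not come from datagit.

```python
import pandas as pd


def summarize_drift(reported, computed, key="unique_key"):
    """Count added, modified and deleted rows between two dataframes.

    Illustrative sketch: assumes `key` is unique in both dataframes.
    NaN values compare as different, so a real implementation would
    need to handle them explicitly.
    """
    old = reported.set_index(key)
    new = computed.set_index(key)
    added = new.index.difference(old.index)
    deleted = old.index.difference(new.index)
    common = old.index.intersection(new.index)
    shared_cols = old.columns.intersection(new.columns)
    # A row is modified when any shared column differs for the same key.
    changed = (old.loc[common, shared_cols] != new.loc[common, shared_cols]).any(axis=1)
    return {
        "additions": len(added),
        "modifications": int(changed.sum()),
        "deletions": len(deleted),
    }


reported = pd.DataFrame({"unique_key": ["a", "b", "c"], "amount": [1, 2, 3]})
computed = pd.DataFrame({"unique_key": ["a", "b", "d"], "amount": [1, 20, 4]})
summary = summarize_drift(reported, computed)
print(summary)
```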
You can provide a custom evaluator, which is a function with the following properties:
- If the result is True, a pull request will be opened.
- If the result is False, the drift will be merged automatically.
If you just want to store the metric in a Git branch, you can use a drift evaluator that merges the drift into the reported branch without any alert.
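As an illustration of the True/False decision described above, here is a sketch of two evaluators. The exact signature that datagit expects from a custom evaluator is not shown in this document, so the parameters below (three counts and a threshold) and both function names are assumptions chosen for the example.

```python
def threshold_drift_evaluator(additions, modifications, deletions, max_changes=10):
    """Return True to alert (open a pull request), False to merge silently.

    Hypothetical signature: the real evaluator interface may receive
    different arguments.
    """
    total = additions + modifications + deletions
    return total > max_changes


def always_merge_evaluator(additions, modifications, deletions):
    """Never alert: every drift is merged without a pull request."""
    return False


print(threshold_drift_evaluator(1, 2, 3))
print(threshold_drift_evaluator(5, 5, 5))
```

A threshold like this keeps small, expected corrections from generating alerts while still flagging large rewrites of historical data.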