A simple data ingestion library to guide data flows from some places to other places
Documentation: https://dyvenia.github.io/viadot/
Source Code: https://github.com/dyvenia/viadot
Viadot supports several API and RDBMS sources, private and public. Currently, we support the UK Carbon Intensity public API and base the examples on it.
```python
from viadot.sources.uk_carbon_intensity import UKCarbonIntensity

ukci = UKCarbonIntensity()
ukci.query("/intensity")
df = ukci.to_df()
df
```
Output:

|   | from | to | forecast | actual | index |
|---|------|----|----------|--------|-------|
| 0 | 2021-08-10T11:00Z | 2021-08-10T11:30Z | 211 | 216 | moderate |
The `df` above is a pandas `DataFrame` containing the data downloaded by viadot from the UK Carbon Intensity API.
Depending on the source, viadot provides different methods of uploading data. For instance, for SQL sources, this would be bulk inserts. For data lake sources, it would be a file upload. We also provide ready-made pipelines including data validation steps using Great Expectations.
An example of loading data into SQLite from a pandas `DataFrame` using the `SQLiteInsert` Prefect task:

```python
from viadot.tasks import SQLiteInsert

insert_task = SQLiteInsert()
insert_task.run(
    table_name=TABLE_NAME,
    dtypes=dtypes,
    db_path=database_path,
    df=df,
    if_exists="replace",
)
```
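For orientation, the effect of such an insert can be sketched with plain pandas and sqlite3. This is a rough stand-in for what the task achieves, not viadot's actual implementation; the table name, file path, and sample data are made up:

```python
import sqlite3

import pandas as pd

# Sample data shaped like the Carbon Intensity output above
df = pd.DataFrame({"forecast": [211], "actual": [216], "index": ["moderate"]})

# Replace the table if it already exists, mirroring if_exists="replace"
with sqlite3.connect("example.db") as conn:
    df.to_sql("carbon_intensity", conn, if_exists="replace", index=False)
    rows = conn.execute("SELECT COUNT(*) FROM carbon_intensity").fetchone()[0]

print(rows)  # one row inserted
```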
Note: if you're running on Unix, after cloning the repo you may need to make the `update.sh` and `run.sh` scripts executable:

```shell
sudo chmod +x viadot/docker/update.sh && \
sudo chmod +x viadot/docker/run.sh
```
Clone the `main` branch, enter the `docker` folder, and set up the environment:

```shell
git clone https://github.com/dyvenia/viadot.git && \
cd viadot/docker && \
./update.sh
```

Run the environment:

```shell
./run.sh
```
Clone the `dev` branch, enter the `docker` folder, and set up the environment:

```shell
git clone -b dev https://github.com/dyvenia/viadot.git && \
cd viadot/docker && \
./update.sh -t dev
```

Run the environment:

```shell
./run.sh -t dev
```
Install the library in development mode (repeat for the `viadot_jupyter_lab` container if needed):

```shell
docker exec -it viadot_testing pip install -e . --user
```
To run tests, log into the container and run pytest:

```shell
docker exec -it viadot_testing bash
pytest
```
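New tests follow standard pytest conventions: plain functions whose names start with `test_`, using bare `assert` statements. A minimal, purely illustrative example (the file name and function are made up, not part of viadot):

```python
# tests/test_example.py - a minimal pytest-style test (illustrative only)
def double(x: int) -> int:
    return 2 * x


def test_double():
    assert double(21) == 42
```

pytest discovers any `test_*.py` file under the tests directory automatically, so no registration step is needed.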
You can run the example flows from the terminal:

```shell
docker exec -it viadot_testing bash
FLOW_NAME=hello_world; python -m viadot.examples.$FLOW_NAME
```
However, when developing, the easiest way is to use the provided Jupyter Lab container, available in the browser at http://localhost:9000/.
To begin using Spark, you must first make the following environment variables available (here read into Python with `os.getenv()`):

```python
import os

DATABRICKS_HOST = os.getenv("DATABRICKS_HOST")
DATABRICKS_API_TOKEN = os.getenv("DATABRICKS_API_TOKEN")
DATABRICKS_ORG_ID = os.getenv("DATABRICKS_ORG_ID")
DATABRICKS_PORT = os.getenv("DATABRICKS_PORT")
DATABRICKS_CLUSTER_ID = os.getenv("DATABRICKS_CLUSTER_ID")
```
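For example, the variables can be exported in your shell before starting Python. The values below are placeholders, not real credentials; substitute your own workspace details:

```shell
# Placeholder values - replace with your own Databricks workspace details
export DATABRICKS_HOST="adb-1234567890123456.7.azuredatabricks.net"
export DATABRICKS_API_TOKEN="dapi-placeholder-token"
export DATABRICKS_ORG_ID="1234567890123456"
export DATABRICKS_PORT="15001"
export DATABRICKS_CLUSTER_ID="0123-456789-abcdefgh"
```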
Alternatively, you can create a file called `.databricks-connect` in the root directory of viadot and add the required variables there, in the following format:

```json
{
    "host": "",
    "token": "",
    "cluster_id": "",
    "org_id": "",
    "port": ""
}
```
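A quick sketch of how a file with that shape could be consumed, writing a sample with placeholder values and then parsing it back. This is illustrative only, not viadot's actual loading code, and all values are made up:

```python
import json
from pathlib import Path

# Write a sample .databricks-connect with placeholder values
sample = {
    "host": "adb-1234567890123456.7.azuredatabricks.net",
    "token": "dapi-placeholder-token",
    "cluster_id": "0123-456789-abcdefgh",
    "org_id": "1234567890123456",
    "port": "15001",
}
path = Path(".databricks-connect")
path.write_text(json.dumps(sample, indent=2))

# Parse it back, as a tool consuming the file might
config = json.loads(path.read_text())
```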
To retrieve the values, follow step 2 in this link.
To begin using Spark, you must first create a Spark session:

```python
spark = SparkSession.builder.appName('session_name').getOrCreate()
```

`spark` will be used to access all the Spark methods. Here is a list of commonly used Spark methods (WIP):

- `spark.createDataFrame(df)`: create a Spark DataFrame from a pandas DataFrame.
- `sparkdf.write.saveAsTable("schema.table")`: take a Spark DataFrame and save it as a table in Databricks.
- `table = spark.sql("select * from schema.table")`: an example of a simple query run through Python.

When contributing, remember to run `pytest`, update `CHANGELOG.md`, and update the documentation (`viadot/docs`).

The general flow of working with this repository via a fork:
1. Create a branch: `git checkout -b <name>`
2. Stage your changes: `git add <files>`
3. Commit them: `git commit -m <message>`

   Note: see our Style Guidelines for more information about commit messages and PR names.

4. Fetch and check out the target remote branch: `git fetch <remote> <branch>`, then `git checkout <remote>/<branch>`
5. Push your branch: `git push origin <name>`
6. Merge: `git checkout <where_merging_to>`, then `git merge <branch_to_merge>`
Please follow the standards and best practices used within the library (e.g., when adding tasks, see how other tasks are constructed). For any questions, please reach out to us here on GitHub.
Contributed code should be formatted with Black. To set up Black in Visual Studio Code, follow the instructions below:

1. Install `black` in your environment by typing in the terminal: `pip install black`
2. Open Settings, or press "Ctrl" + ",".
3. Find the `Format On Save` setting and check the box.
4. Find `Python Formatting Provider` and select "black" in the drop-down list.