spark-pipelines
The spark-pipelines package is designed to facilitate data quality validation for Spark DataFrames. By using decorators, users can enforce data quality rules directly on their data loading functions, choosing whether to report, drop, quarantine, or fail on records that violate those rules. This package is particularly useful for data engineers and data scientists who work with large datasets and need to maintain high data quality standards.
To install the spark-pipelines package, you can use pip:
pip install spark-pipelines
To get started, you can define your data quality rules in a dictionary format, where the keys are human-readable descriptions of the rules, and the values are SQL filter expressions. Then, you can decorate your data loading functions with the appropriate validation decorator.
Here is a simple example:
from pyspark.sql import SparkSession
from spark_pipelines import DataQuality

# Assumes an active SparkSession (e.g. in a notebook or a spark-submit job)
spark = SparkSession.builder.getOrCreate()

dq = DataQuality(
    spark=spark,
    job_name="employee_ingestion_job"
)

# Define your data quality rules
rules = {
    "Employee ID should be greater than 2": "emp_id > 2",
    "Name should not be null": "fname IS NOT NULL"
}

# Use the expect decorator to validate the DataFrame
@dq.expect(rules)
def load_employee_df():
    return spark.read.table("employee")

# Load the DataFrame and validate it
df = load_employee_df()
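The rule values are plain Spark SQL filter expressions, so anything you could pass to DataFrame.filter() should work as a rule. A short sketch of other typical expressions (the column names here are illustrative assumptions, not part of the package):

more_rules = {
    "Salary must be non-negative": "salary >= 0",
    "Department must be a known value": "dept IN ('HR', 'ENG', 'SALES')",
    "Hire date must not be in the future": "hire_date <= current_date()"
}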
The package provides several decorators for different validation strategies:
expect: This decorator validates the DataFrame and prints a report of the validation results. If any rule fails, the DataFrame is still returned.
@dq.expect(rules)
def load_employee_df():
    return spark.read.table("employee")
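Because expect only reports failures and still returns the full DataFrame, it is handy during development when you want visibility into data issues without interrupting the pipeline. A minimal sketch of that flow (the target table name is an assumption for illustration):

# The validation report is printed as a side effect; all rows are returned
# even when some rules fail, so downstream steps keep running.
df = load_employee_df()
df.write.mode("overwrite").saveAsTable("employee_bronze")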
expect_drop: This decorator validates the DataFrame and returns a filtered DataFrame containing only the valid records. Failed records are excluded.
@dq.expect_drop(rules)
def load_employee_df():
    return spark.read.table("employee")
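Because expect_drop returns only the records that pass every rule, a quick way to see how much data is being filtered out is to compare row counts against the raw source (a sketch, assuming the same employee table as above):

raw_count = spark.read.table("employee").count()
valid_df = load_employee_df()           # only rows that satisfy all rules
dropped = raw_count - valid_df.count()  # rows excluded by the rules
print(f"{dropped} record(s) failed validation and were dropped")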
expect_fail: This decorator validates the DataFrame and raises an exception if any validation rule fails. This is useful for stopping the pipeline execution when data quality issues are detected.
@dq.expect_fail(rules)
def load_employee_df():
    return spark.read.table("employee")
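Since expect_fail raises as soon as any rule is violated, a typical pattern is to let the exception propagate and abort the job, or to catch it when you want to log before failing. A sketch of the latter (catching the broad Exception type, because the specific exception class raised by the package is not documented here):

try:
    df = load_employee_df()
except Exception as exc:  # exact exception type is an assumption
    print(f"Data quality check failed, aborting pipeline: {exc}")
    raise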
expect_quarantine: This decorator validates the DataFrame and optionally quarantines failed records to a specified location or table. You can specify either a path or a table name for storing the quarantined records.
from spark_pipelines import DataQuality

dq = DataQuality(
    spark=spark,
    job_name="employee_ingestion_job",
    dq_table_name="data_quality.default.employee_dq",
    quarantine_table="data_quality.default.employee_qr",
    quarantine_format="delta"
)

@dq.expect_quarantine(rules)
def load_employee_df():
    return spark.read.table("employee")
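With quarantining enabled, failed records are written to the configured quarantine table rather than being discarded, so they can be inspected or reprocessed later. A sketch of reading them back (the table name matches the quarantine_table configured above; the exact schema of quarantined records is not documented here):

quarantined = spark.read.table("data_quality.default.employee_qr")
quarantined.show(truncate=False)  # inspect the records that failed validation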
For more detailed usage examples and explanations of each validation method, refer to the comprehensive documentation available in the docs directory of the package.
Contributions to the spark-pipelines package are welcome! Please check the CHANGELOG.md for the history of changes and the LICENSE file for licensing information. If you have suggestions or improvements, feel free to submit a pull request.
By following these instructions, you can use the spark-pipelines package to maintain high data quality in your Spark applications.
FAQs
A Python package for validating Spark DataFrames against data quality rules.
We found that spark-pipelines demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.