This project helps you run data quality rules in flight, while the Spark job is running.

Spark Expectations is a specialized tool designed with the primary goal of maintaining data integrity within your processing pipeline. By identifying and preventing malformed or incorrect data from reaching the target destination, it ensures that only quality data is passed through. Erroneous records are not simply ignored: they are filtered into a separate error table, allowing for detailed analysis and reporting. Additionally, Spark Expectations provides statistics on the filtered content, giving you insight into your data quality.
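To make the routing idea concrete, here is a minimal, framework-free sketch of it in plain Python. This is NOT the spark-expectations implementation; the function and field names are invented for illustration only:

```python
# Illustrative sketch only: split rows into "target" and "error" sets based on
# a predicate rule, and report simple stats on what was filtered out.

def apply_rule(rows, rule):
    """Split rows into (passed, failed) according to a single predicate rule."""
    passed, failed = [], []
    for row in rows:
        (passed if rule(row) else failed).append(row)
    return passed, failed

orders = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": -5.0},   # malformed: negative amount
    {"order_id": 3, "amount": 80.0},
]

# Rule: amount must be non-negative
target_table, error_table = apply_rule(orders, lambda r: r["amount"] >= 0)

# Simple run statistics, analogous to what Spark Expectations records
stats = {
    "input_count": len(orders),
    "error_count": len(error_table),
    "error_pct": 100.0 * len(error_table) / len(orders),
}
```

The real library does this with Spark SQL expressions over DataFrames and persists the failed records to a dedicated error table, but the pass/fail split and the accompanying statistics follow the same shape.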
The documentation for spark-expectations can be found here
Thanks to all the contributors who have helped ideate, develop, and bring Spark Expectations to its current state.

We're delighted that you're interested in contributing to our project! To get started, please read and follow the guidelines in our contributing document.
Please find the spark-expectations flow and feature diagrams below.

To set the global configuration parameters for Spark Expectations, define a variable and populate all the required fields:
```python
from spark_expectations.config.user_config import Constants as user_config

se_user_conf = {
    user_config.se_notifications_enable_email: False,
    user_config.se_notifications_email_smtp_host: "mailhost.nike.com",
    user_config.se_notifications_email_smtp_port: 25,
    user_config.se_notifications_email_from: "<sender_email_id>",
    user_config.se_notifications_email_to_other_nike_mail_id: "<receiver_email_id's>",
    user_config.se_notifications_email_subject: "spark expectations - data quality - notifications",
    user_config.se_notifications_enable_slack: True,
    user_config.se_notifications_slack_webhook_url: "<slack-webhook-url>",
    user_config.se_notifications_on_start: True,
    user_config.se_notifications_on_completion: True,
    user_config.se_notifications_on_fail: True,
    user_config.se_notifications_on_error_drop_exceeds_threshold_breach: True,
    user_config.se_notifications_on_error_drop_threshold: 15,
    # Optional: the two params below must be enabled to capture detailed stats
    # in the <stats_table_name>_detailed table.
    # user_config.enable_query_dq_detailed_result: True,
    # user_config.enable_agg_dq_detailed_result: True,
}
```
For all the examples below, the following import and `SparkExpectations` class instantiation are mandatory. `SparkExpectations` is the class that provides all the functions required for running data quality rules.

```python
from spark_expectations.core.expectations import SparkExpectations, WrappedDataFrameWriter
from pyspark.sql import SparkSession

spark: SparkSession = SparkSession.builder.getOrCreate()

writer = WrappedDataFrameWriter().mode("append").format("delta")
# writer = WrappedDataFrameWriter().mode("append").format("iceberg")

# product_id should match the "product_id" in the rules table
se: SparkExpectations = SparkExpectations(
    product_id="your_product",
    rules_df=spark.table("dq_spark_local.dq_rules"),
    stats_table="dq_spark_local.dq_stats",
    stats_table_writer=writer,
    target_and_error_table_writer=writer,
    debugger=False,
    # stats_streaming_options={user_config.se_enable_streaming: False},
)
```
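The `rules_df` above is read from a rules table keyed by `product_id`. As an illustration of what such a row might contain, here is a hypothetical rule expressed as a plain Python dict; the column names and values below are assumptions for illustration only, so consult the project documentation for the authoritative rules-table schema:

```python
# Hypothetical rules-table rows. Field names (rule_type, expectation,
# action_if_failed, ...) are illustrative assumptions, not the documented schema.

rules = [
    {
        "product_id": "your_product",           # must match SparkExpectations(product_id=...)
        "table_name": "dq_spark_local.customer_order",
        "rule_type": "row_dq",                  # a row-level check
        "rule": "order_id_is_not_null",
        "expectation": "order_id IS NOT NULL",  # SQL expression evaluated per record
        "action_if_failed": "drop",             # what to do with failing records
        "is_active": True,
    },
]

# Every rule's product_id must line up with the SparkExpectations instance,
# otherwise the rule would not be picked up for this run.
assert all(r["product_id"] == "your_product" for r in rules)
```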
Wrap the function that builds your DataFrame with the `@se.with_expectations` decorator:

```python
from spark_expectations.config.user_config import *
from pyspark.sql import DataFrame
import os


@se.with_expectations(
    target_table="dq_spark_local.customer_order",
    write_to_table=True,
    user_conf=se_user_conf,
    target_table_view="order",
)
def build_new() -> DataFrame:
    # Return the dataframe on which Spark-Expectations needs to be run
    _df_order: DataFrame = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv(os.path.join(os.path.dirname(__file__), "resources/order.csv"))
    )
    _df_order.createOrReplaceTempView("order")
    return _df_order
```
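Conceptually, the decorator intercepts the DataFrame your function returns, runs the configured rules on it, and only then writes clean records to the target table. A toy, framework-free sketch of that pattern (all names here are invented for illustration; the real decorator evaluates Spark SQL rules and writes to the configured target and error tables):

```python
# Toy sketch of the decorator pattern: wrap a function that builds a dataset,
# check its output against a rule, and route failing records aside.

def with_checks(rule):
    def decorator(build_fn):
        def wrapper(*args, **kwargs):
            rows = build_fn(*args, **kwargs)
            clean = [r for r in rows if rule(r)]
            errors = [r for r in rows if not rule(r)]
            # In spark-expectations, `clean` would be written to the target
            # table and `errors` to the error table.
            return clean, errors
        return wrapper
    return decorator

@with_checks(rule=lambda r: r.get("qty", 0) > 0)
def build_orders():
    return [{"qty": 2}, {"qty": 0}, {"qty": 5}]

clean, errors = build_orders()
```

The key point the sketch shows: the function body stays focused on building the data, while the decorator owns validation and routing, so quality checks are declared once rather than scattered through pipeline code.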