
A lightweight library for managing and validating data schemas from YAML specifications
yads: YAML-Augmented Data Specification ("Yet Another Data Spec") is a Python library for managing data specs using YAML. It helps you define and manage your data warehouse tables, schemas, and documentation in a structured, version-controlled way. With yads, you can define your data assets once in YAML and then generate various outputs: DDL statements for different databases, data schemas for tools like Avro or PyArrow, and human-readable, LLM-ready documentation.
The modern data stack is complex, with data assets defined across a multitude of platforms and tools. This often leads to fragmented and inconsistent documentation, making data discovery and governance a challenge. yads was created to address this by providing a centralized, version-controllable, and extensible way to manage metadata for modern data platforms.
The main goal of yads is to provide a single source of truth for your data assets using simple YAML files. These files can capture everything from table schemas and column descriptions to governance policies and usage notes. From these specifications, yads can transpile the information into various formats, such as DDL statements for different SQL dialects and Avro or PyArrow schemas, and generate documentation that is ready for both humans and Large Language Models (LLMs).
pip install yads
To include support for PySpark DataFrame schema generation, install the pyspark extra:
pip install 'yads[pyspark]'
Create a YAML file to define your table schema and properties. For example, dim_user.yaml:
# specs/dim_user.yaml
table_name: "dim_user"
database: "dm_product_performance"
database_schema: "curated"
description: "Dimension table for users."
dimensional_table_type: "dimension"
owner: "data_engineering"
version: "1.0.0"
scd_type: 2
location: "s3://lakehouse/dm_product_performance/curated/dim_user"
partitioning:
  - column: "created_date"
    strategy: "month"
properties:
  table_type: "ICEBERG"
  format: "parquet"
  write_compression: "snappy"
table_schema:
  - name: "id"
    type: "integer"
    description: "Unique identifier for the user"
    constraints:
      - not_null: true
  - name: "username"
    type: "string"
    description: "Username for the user"
    constraints:
      - not_null: true
  - name: "email"
    type: "string"
    description: "Email address for the user"
    constraints:
      - not_null: true
  - name: "preferences"
    type: "map"
    key_type: "string"
    value_type: "string"
  - name: "created_at"
    type: "timestamp"
    description: "Timestamp of user creation"
    constraints:
      - not_null: true
You can generate a Spark CREATE TABLE DDL statement from the specification:
from yads import TableSpecification
# Load the specification
spec = TableSpecification("specs/dim_user.yaml")
# Generate the DDL
ddl = spec.to_ddl(dialect="spark")
print(ddl)
CREATE OR REPLACE TABLE dm_product_performance.curated.dim_user (
  `id` INTEGER NOT NULL,
  `username` STRING NOT NULL,
  `email` STRING NOT NULL,
  `preferences` MAP<STRING, STRING>,
  `created_at` TIMESTAMP NOT NULL
)
USING ICEBERG
PARTITIONED BY (month(`created_date`))
LOCATION 's3://lakehouse/dm_product_performance/curated/dim_user'
TBLPROPERTIES (
  'table_type' = 'ICEBERG',
  'format' = 'parquet',
  'write_compression' = 'snappy'
);
You can generate a pyspark.sql.types.StructType schema for a PySpark DataFrame:
from yads import TableSpecification
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Load the specification
spec = TableSpecification("specs/dim_user.yaml")
# Generate the PySpark schema
spark_schema = spec.to_spark_schema()
df = spark.createDataFrame([], schema=spark_schema)
df.printSchema()
root
 |-- id: integer (nullable = false)
 |-- username: string (nullable = false)
 |-- email: string (nullable = false)
 |-- preferences: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- created_at: timestamp (nullable = false)
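The nullability in the printed schema follows directly from the spec's constraints: a column with a not_null constraint becomes nullable = false. A hypothetical sketch of that mapping, in plain Python without pyspark (the `field_description` helper and `PYSPARK_TYPE_NAMES` table are illustrative, not yads' internals):

```python
# Hypothetical sketch: describe the StructField a spec column would map to.
# Illustrative only; the real to_spark_schema() builds actual pyspark types.

PYSPARK_TYPE_NAMES = {
    "integer": "IntegerType",
    "string": "StringType",
    "timestamp": "TimestampType",
}

def field_description(col: dict) -> str:
    """Return the StructField a spec column would correspond to."""
    # nullable is True unless the spec declares a not_null constraint.
    nullable = not any(c.get("not_null") for c in col.get("constraints", []))
    return f"StructField('{col['name']}', {PYSPARK_TYPE_NAMES[col['type']]}(), nullable={nullable})"

print(field_description({"name": "id", "type": "integer",
                         "constraints": [{"not_null": True}]}))
# StructField('id', IntegerType(), nullable=False)
```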
Contributions are welcome! Please feel free to open an issue or submit a pull request.