Simple declarative data extraction and loading in Python, featuring:
- 🍰 Ease of use: Data extraction is defined with simple, declarative types.
- ⚙ XML / HTML / JSON Extraction: Extraction can be performed across a wide array of structured data formats.
- 🐼 Pandas Integration: Results are easily castable to Pandas DataFrames and Series.
- 😀 Custom Output Classes: Results can be automatically loaded into autogenerated dataclasses, or custom model types.
- 🚀 Performance: XML loading is backed by the fast lxml library. JSON parsing uses UltraJSON for speed, with jsonpath_ng for flexible data extraction.
Quick Start
To extract data from XML, use this import statement, and see the example below:
from yankee.xml.schema import Schema, fields as f, CSSSelector
To extract data from JSON, use this import statement, and see the example below:
from yankee.json.schema import Schema, fields as f, JSONPath
To extract data from HTML, use this import statement:
from yankee.html.schema import Schema, fields as f, CSSSelector
To extract data from Python objects (either objects or dictionaries), use this import statement:
from yankee.base.schema import Schema, fields as f
Complete documentation is available on Read the Docs.
Data extraction from XML. By default, data keys are XPath expressions, but can also be CSS selectors.
Take this:
<name>Johnny Appleseed</name>
Do this:
from yankee.xml.schema import Schema, fields as f, CSSSelector
class XmlExample(Schema):
    name = f.String("./name")
    birthday = f.Date(CSSSelector("birthdate"))
    deep_data = f.Int("./something/many/levels/deep")
Get this:
{
    "name": "Johnny Appleseed",
    "birthday": datetime.date(2000, 1, 1),
    "deep_data": 123
}
Data extraction from JSON. By default, data keys are implied from the field names, but can also be JSONPath expressions.
Take this:
{
    "name": "Johnny Appleseed",
    "birthdate": "2000-01-01",
    "something": [
        {"many": {
            "levels": {
                "deep": 123
            }
        }}
    ]
}
Do this:
from yankee.json.schema import Schema, fields as f
class JsonExample(Schema):
    name = f.String()
    birthday = f.Date("birthdate")
    deep_data = f.Int("something.0.many.levels.deep")
Get this:
{
    "name": "Johnny Appleseed",
    "birthday": datetime.date(2000, 1, 1),
    "deep_data": 123
}