mojap-metadata
This Python package lets users read and alter our metadata schemas (using the metadata module) and convert them to the schema definitions used by other tools. These conversions live in the converters module and are referred to as Converters.
Metadata
Converters
Converter Systems
Installation
Make sure you are using a new version of pip (>=20.0.0)
pip install git+https://github.com/moj-analytical-services/mojap-metadata
To install the additional dependencies used by the converters (e.g. the etl-manager and arrow extras):
pip install 'mojap-metadata[etl-manager,arrow] @ git+https://github.com/moj-analytical-services/mojap-metadata'
Metadata
This module provides a class called Metadata, which allows you to interact with our agnostic metadata schemas. The Metadata class deals with parsing, manipulating and validating metadata JSON schemas.
The Schema
Our metadata schemas are used to define a table. The idea of these schemas is to describe the contents of a table in a generic, tool-agnostic way. If you want to use a schema with Oracle, PyArrow or AWS Glue, for example, you can create a Converter class that takes the metadata and converts it to a schema that works with that tool (or vice versa).
When adding a parameter to the metadata config, first check whether it already exists in JSON Schema. For example, enum, pattern and type are parameters in our column types, but their names come from the JSON Schema naming definitions.
An example of a basic metadata schema:
{
    "$schema": "https://moj-analytical-services.github.io/metadata_schema/mojap_metadata/v1.0.0.json",
    "name": "employees",
    "description": "table containing employee information",
    "file_format": "parquet",
    "columns": [
        {
            "name": "employee_id",
            "type": "int64",
            "type_category": "integer",
            "description": "an ID for each employee",
            "minimum": 1000,
            "maximum": 9999
        },
        {
            "name": "employee_name",
            "type": "string",
            "type_category": "string",
            "description": "name of the employee"
        },
        {
            "name": "employee_dob",
            "type": "date64",
            "type_category": "timestamp",
            "description": "date of birth for the employee in ISO format",
            "pattern": "^\\d{4}-([0]\\d|1[0-2])-([0-2]\\d|3[01])$"
        }
    ]
}
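As a rough illustration of how the column constraints in the example above might be applied to data, the sketch below checks single values against minimum, maximum, pattern and enum. The helper value_satisfies is hypothetical and is not part of mojap-metadata; it uses only the standard library.

```python
import re

# Two columns copied from the example schema above.
employee_id = {"name": "employee_id", "type": "int64", "minimum": 1000, "maximum": 9999}
employee_dob = {
    "name": "employee_dob",
    "type": "date64",
    "pattern": r"^\d{4}-([0]\d|1[0-2])-([0-2]\d|3[01])$",
}


def value_satisfies(column: dict, value) -> bool:
    """Hypothetical helper: check one value against a column's constraints."""
    if "minimum" in column and value < column["minimum"]:
        return False
    if "maximum" in column and value > column["maximum"]:
        return False
    if "pattern" in column and re.search(column["pattern"], str(value)) is None:
        return False
    if "enum" in column and value not in column["enum"]:
        return False
    return True


print(value_satisfies(employee_id, 1234))           # True
print(value_satisfies(employee_id, 42))             # False (below minimum)
print(value_satisfies(employee_dob, "1990-07-31"))  # True
print(value_satisfies(employee_dob, "1990-13-01"))  # False (month 13)
```

Note that the pattern only checks the shape of the string, so a value such as "1990-02-31" would still pass.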
Schema Properties
- name: String naming the table. Any name is accepted, but it is best to avoid spaces because most systems handle them poorly.
- file_format: String denoting the file format.
- columns: List of objects where each object describes a column in your table. Each column object must have at least a name and either a type or a type_category.
  - name: String denoting the name of the column.
  - type: String specifying the type of the data. We use the type names from the Apache Arrow project because they comprehensively cover most of the data types we deal with. Note: in our naming convention we allow bool (equivalent to bool_) and list (equivalent to list_).
  - type_category: Groups different sets of type values into a single superset. The categories are: integer, float, string, timestamp, bool, list, struct. For example, we class int8, int16, int32, int64, uint8, uint16, uint32, uint64 as integer. This lets users give a more generic type when the data does not come from a system or output with strict types (e.g. data exported from Excel or of unknown origin). The Metadata class has a default type for each type_category; see the default_type_category_lookup attribute of the Metadata class for these defaults. This field is required if type is not set.
  - description: Description of the column.
  - enum: List of the values the column can take. (Same as the standardised JSON Schema keyword.)
  - pattern: Regex pattern that a value has to match (for string type_categories only). (Same as the standardised JSON Schema keyword.)
  - minLength / maxLength: The minimum and maximum length of the string (for string type_categories only). (Same as the standardised JSON Schema keyword.)
  - minimum / maximum: The minimum and maximum value a numerical type can take (for integer and float type_categories only).
- partitions: List of the columns in your dataset that are partitions.
- table_location: The location of the table. This is a string that can represent a file path, directory, URL, etc.
- database_name: The name of the database this table belongs to.
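To make the relationship between type and type_category more concrete, here is a minimal sketch that guesses a category from an Arrow-style type name. The regular expressions and the function guess_type_category are assumptions written for this example; the package's own rules may differ, and timestamp is left out for brevity.

```python
import re

# Hypothetical patterns for illustration only; not taken from the package.
CATEGORY_PATTERNS = {
    "integer": r"^u?int(8|16|32|64)$",
    "float": r"^float(16|32|64)$",
    "string": r"^(large_)?string$",
    "bool": r"^bool_?$",
    "list": r"^(large_)?list_?<.+>$",
    "struct": r"^struct<.+>$",
}


def guess_type_category(type_name: str):
    """Return the first category whose pattern matches, or None."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if re.match(pattern, type_name):
            return category
    return None


print(guess_type_category("uint16"))       # integer
print(guess_type_category("bool_"))        # bool
print(guess_type_category("list<int64>"))  # list
```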
Additional Schema Parameters
We allow users to add additional parameters to the table schema object or to any of the columns in the schema. If there are specific parameters or tags you want to add, the schema should still pass validation, as long as the additional parameters do not share a name with ones already used in the schema.
Usage
from mojap_metadata import Metadata

# Create basic table metadata directly from keyword arguments
meta1 = Metadata(name="test", columns=[{"name": "c1", "type": "int64"}, {"name": "c2", "type": "string"}])
print(meta1.name)
print(meta1.columns[0])
print(meta1.description)

# Create metadata from a dictionary
d = {
    "name": "test",
    "columns": [
        {"name": "c1", "type": "int64"},
        {"name": "c2", "type": "string"}
    ]
}
meta2 = Metadata.from_dict(d)

# Read metadata from a JSON file, rename the table and write it back out
meta3 = Metadata.from_json("path/to/metadata_schema.json")
meta3.name = "new_table"
meta3.to_json("path/to/new_metadata_schema.json")
Added Class methods and properties
The Metadata class has some methods and properties that are not part of the schema but help organise and manage it.
Column Methods
The class has multiple methods to alter the columns list.
- column_names: Get a list of column names.
- update_column: If a column with a matching name exists, replace it; otherwise add the column to the end.
- remove_column: Remove the column that matches the given name. Note: if a name in the partitions property matches that name, it is also removed from partitions.
meta = Metadata(columns=[
    {"name": "a", "type": "int8"},
    {"name": "b", "type": "string"},
    {"name": "c", "type": "date32"},
])
meta.column_names  # ["a", "b", "c"]
meta.update_column({"name": "a", "type": "int64"})  # "a" exists, so it is replaced
meta.columns[0]["type"]  # "int64"
meta.update_column({"name": "d", "type": "string"})  # "d" is new, so it is appended
meta.column_names  # ["a", "b", "c", "d"]
meta.remove_column("d")
assert meta.column_names == ["a", "b", "c"]
The Metadata class is a subclass of MutableMapping, where keys are column names and values are column metadata.
- A metadata column can be accessed using the column name as a key.
- A new or existing column can be updated using the column name as a key. The key must match the column name. Calls update_column.
- A column can be deleted. Calls remove_column.
- Columns of metadata can be iterated over.
- The length of metadata is defined as the number of columns.
meta["a"]
meta["d"] = {"name": "d", "type": "bool"}
del meta["d"]
for col in meta:
    print(f"column name:{col['name']}, column type:{col['type']}")
len(meta)
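The mapping behaviour listed above can be sketched with the standard library alone. ColumnMapping below is a hypothetical stand-in written for this example, not the package's Metadata class. It mirrors the loop above by yielding column dictionaries when iterated, which departs from the usual mapping convention of yielding keys.

```python
from collections.abc import MutableMapping


class ColumnMapping(MutableMapping):
    """Hypothetical stand-in for the mapping behaviour described above."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __getitem__(self, name):
        for col in self.columns:
            if col["name"] == name:
                return col
        raise KeyError(name)

    def __setitem__(self, name, column):
        if column["name"] != name:
            raise ValueError("key must match the column name")
        for i, col in enumerate(self.columns):
            if col["name"] == name:
                self.columns[i] = column  # replace the existing column
                return
        self.columns.append(column)  # otherwise add it to the end

    def __delitem__(self, name):
        col = self[name]  # raises KeyError if the column is missing
        self.columns.remove(col)

    def __iter__(self):
        # Yields column dictionaries rather than keys.
        return iter(self.columns)

    def __len__(self):
        return len(self.columns)


cols = ColumnMapping([{"name": "a", "type": "int8"}, {"name": "b", "type": "string"}])
cols["c"] = {"name": "c", "type": "bool"}
del cols["a"]
print(len(cols))                      # 2
print([col["name"] for col in cols])  # ['b', 'c']
```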
force_partition_order Property
By default this is set to None, but it can also be set to "start" or "end". When set to None, the Metadata class will not track column order relative to partitions.
Note: For Athena we normally set partitions at the end.
meta = Metadata(columns=[
    {"name": "a", "type": "int8"},
    {"name": "b", "type": "string"},
    {"name": "c", "type": "date32"},
])
meta.partitions = ["b"]
meta.column_names
If set to "start" or "end", any changes to partitions will affect the column order.
meta.force_partition_order = "start"
meta.column_names
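The effect of force_partition_order described above can be sketched as a pure function. order_columns is a hypothetical helper, not the package's implementation, and it assumes that partition columns keep the order given in the partitions list.

```python
def order_columns(column_names, partitions, force_partition_order=None):
    """Hypothetical helper: return column names with partitions moved."""
    if force_partition_order is None:
        return list(column_names)  # order is left untouched
    if force_partition_order not in ("start", "end"):
        raise ValueError('force_partition_order must be None, "start" or "end"')
    non_partitions = [c for c in column_names if c not in partitions]
    # Assumption: partitions keep the order in which they are listed.
    ordered_partitions = [p for p in partitions if p in column_names]
    if force_partition_order == "start":
        return ordered_partitions + non_partitions
    return non_partitions + ordered_partitions


print(order_columns(["a", "b", "c"], ["b"]))           # ['a', 'b', 'c']
print(order_columns(["a", "b", "c"], ["b"], "start"))  # ['b', 'a', 'c']
print(order_columns(["a", "b", "c"], ["b"], "end"))    # ['a', 'c', 'b']
```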
Generating Metadata objects
Converters
Converters take a Metadata object and generate something else from it (or convert something else into a Metadata object). Most of the time a converter will convert our schema into another system's schema.
How to use the Converters
For example, the ArrowConverter takes our schemas and converts them to a pyarrow schema:
from mojap_metadata import Metadata
from mojap_metadata.converters.arrow_converter import ArrowConverter

d = {
    "name": "test",
    "columns": [
        {"name": "c1", "type": "int64"},
        {"name": "c2", "type": "string"},
        {"name": "c3", "type": "struct<k1: string, k2:list<int64>>"}
    ],
    "file_format": "jsonl"
}
meta = Metadata.from_dict(d)
ac = ArrowConverter()
arrow_schema = ac.generate_from_meta(meta)
print(arrow_schema)
Another use for the ArrowConverter is to convert an Arrow schema back into our metadata. This is especially useful if you have nested data types for which the full STRUCT / LIST definition would be difficult to write out by hand. Instead, you can let Arrow infer it for you and then pass the agnostic Metadata object into something like the GlueConverter to generate a schema for AWS Glue.
import pyarrow as pa
import pandas as pd
from mojap_metadata.converters.arrow_converter import ArrowConverter

data = {
    "a": [0, 1],
    "b": [
        {"cat": {"meow": True}, "dog": ["bone", "fish"]},
        {"cat": {"meow": True}, "dog": ["bone", "fish"]},
    ]
}
df = pd.DataFrame(data)
arrow_df = pa.Table.from_pandas(df)
ac = ArrowConverter()
meta = ac.generate_to_meta(arrow_df.schema)
print(meta.columns)
All converter classes are subclasses of mojap_metadata.converters.BaseConverter. This BaseConverter has no actual functionality; it is a boilerplate class that ensures standardised attributes for all added Converters. These are:
- generate_from_meta: (function) takes a Metadata object and returns whatever the converter is producing.
- generate_to_meta: (function) takes any object (normally a schema for another system or package) and returns our Metadata object (i.e. the reverse of generate_from_meta).
- options: (data class) holds the options for the converter. The base options have a suppress_warnings parameter, but that does not mean all converters use it. To get a better understanding of setting options, see the GlueConverter class or tests/test_glue_converter.py.
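The interface described above can be sketched without the package installed. In the example below, SketchOptions and SketchBaseConverter are stand-ins written for illustration (they are not the package's BaseConverter or its options class), and CsvHeaderConverter is a hypothetical converter that works on plain dictionaries instead of Metadata objects.

```python
from dataclasses import dataclass, field


@dataclass
class SketchOptions:
    """Stand-in for converter options (not the package's own class)."""
    suppress_warnings: bool = False


@dataclass
class SketchBaseConverter:
    """Stand-in for BaseConverter: defines the interface only."""
    options: SketchOptions = field(default_factory=SketchOptions)

    def generate_from_meta(self, metadata):
        raise NotImplementedError

    def generate_to_meta(self, item):
        raise NotImplementedError


class CsvHeaderConverter(SketchBaseConverter):
    """Hypothetical converter between metadata dicts and CSV header lines."""

    def generate_from_meta(self, metadata: dict) -> str:
        return ",".join(col["name"] for col in metadata["columns"])

    def generate_to_meta(self, header: str) -> dict:
        # A header carries no type information, so every column is a string.
        names = [name.strip() for name in header.split(",")]
        columns = [{"name": n, "type": "string"} for n in names]
        return {"name": "from_csv_header", "columns": columns}


conv = CsvHeaderConverter()
meta_dict = {
    "name": "test",
    "columns": [{"name": "c1", "type": "int64"}, {"name": "c2", "type": "string"}],
}
print(conv.generate_from_meta(meta_dict))  # c1,c2
print(conv.generate_to_meta("c1, c2")["columns"])
```

The round trip is lossy here (types are not recoverable from a header), which is a common property of converters whose target format is less expressive than the metadata schema.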
Further Usage
See the mojap-aws-tools repo which utilises the converters a lot in different tutorials.
Contributing and Design Considerations
Each new converter (if not expanding on existing converters) should be added as a new submodule within the parent converters module. This is especially true if the new converter has additional package dependencies. By design, the standard install of this package is fairly lightweight. However, if you need the ArrowConverter, you would need to install the additional package dependencies for the arrow converter:
pip install 'mojap-metadata[arrow] @ git+https://github.com/moj-analytical-services/mojap-metadata'
This means we can continually add converters (as submodules) and add optional package dependencies (see pyproject.toml) without making the default install any less lightweight. mojap_metadata would only raise an error if someone tries to import a converter subclass without having the additional dependencies installed.
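One common way to give a clear error when an optional dependency is missing is to check for the module before using it. The sketch below shows that general pattern with the standard library; require_extra is a hypothetical helper and is not taken from this package, which may simply let the import fail.

```python
import importlib.util


def require_extra(module_name: str, extra: str) -> None:
    """Hypothetical guard: raise a helpful error if an optional dependency is missing."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is not installed. "
            f"Install the '{extra}' extra to use this converter."
        )


require_extra("json", "none")  # a stdlib module, so this passes silently

try:
    require_extra("surely_not_installed_pkg", "arrow")
except ImportError as err:
    print(err)
```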
Converter systems
Glue Converter
The GlueConverter takes our schemas and converts them to a dictionary that can be passed to an AWS boto glue client to create a table in the AWS Glue Data Catalogue. Included alongside GlueConverter is GlueTable, which can generate a Glue Table directly from a schema and also generate a Metadata object from a Glue Table.
See Glue Converter for more details.
SQLAlchemy Converter
Uses the Inspector class to extract metadata from database dialects supported by SQLAlchemy.
See SQLAlchemy Converter for more details.
Postgres Converter
The Postgres Converter provides the following functionality:
- Connection to a Postgres database
- Extraction of the metadata from the tables
- Conversion of the extracted output into a Metadata object
It exposes two functions:
- get_object_meta: (function) takes the table name and schema name, extracts the metadata from the Postgres database and converts it into a Metadata object.
- generate_to_meta: (function) takes the database connection and returns a list of Metadata objects for all the tables in the non-system schemas available through the connection.
NOTE: The SQLAlchemy converter is more robust and should be the default method for most databases, but the Postgres converter is retained for compatibility.
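The general idea of reading column definitions from a database and turning them into a metadata-style dictionary can be sketched with the standard library. This example deliberately uses SQLite instead of Postgres or SQLAlchemy so that it runs without extra dependencies; table_to_meta and the SQLITE_TO_META type mapping are hypothetical and are not how the converters above are implemented.

```python
import sqlite3

# Hypothetical mapping from SQLite declared types to agnostic type names.
SQLITE_TO_META = {"INTEGER": "int64", "REAL": "float64", "TEXT": "string"}


def table_to_meta(conn: sqlite3.Connection, table: str) -> dict:
    """Build a metadata-style dictionary from one SQLite table."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    columns = [
        {"name": row[1], "type": SQLITE_TO_META.get(row[2].upper(), "string")}
        for row in rows
    ]
    return {"name": table, "columns": columns}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, employee_name TEXT)")
meta_dict = table_to_meta(conn, "employees")
print(meta_dict)
conn.close()
```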