And these visions of data types, they kept us up past the dawn.
The Semantic Data Library
Visions provides a set of tools for defining and using semantic data types.
Check out the complete
documentation here.
Installation
Source code is available on GitHub, and binary installers are available via pip.
# Pip
pip install visions
Complete installation instructions (including extras) are available in
the docs.
Quick Start Guide
If you want to play immediately, check out the examples folder. Otherwise,
let's get some data:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df.head(2)
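| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |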
The most important abstraction in visions is the Type: types represent semantic notions about data. You have access to
a range of well-tested types like Integer, Float, and Files covering the most common software development use
cases.
Types can be bundled together into typesets. Behind the scenes, visions builds a traversable graph for any collection
of types.
from visions import types, typesets
typeset = typesets.CompleteSet()
typeset.plot_graph()
Note: Plots require pygraphviz to be installed.
Because the graph encodes the relationships between types, it can be used to detect the current type of your data or
infer a more appropriate one.
typeset.detect_type(df)
typeset.infer_type(df)
typeset.infer_type(df.astype(str))
>> {
'PassengerId': Integer,
'Survived': Integer,
'Pclass': Integer,
'Name': String,
'Sex': String,
'Age': Float,
'SibSp': Integer,
'Parch': Integer,
'Ticket': String,
'Fare': Float,
'Cabin': String,
'Embarked': String
}
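To make the difference between detecting and inferring concrete, here is a toy sketch in plain Python. It is not how visions is implemented: the names `detect`, `infer`, and `INFERENCE_EDGES` are hypothetical, and the "graph" is just a dictionary of edges. It only illustrates the idea that detection reports the type the values already have, while inference walks the graph and casts the values whenever a more specific type fits.

```python
from typing import Callable, List, Tuple


def _parses_as(caster: Callable[[str], object]) -> Callable[[List[str]], bool]:
    """Build a check that passes when every string can be cast by `caster`."""
    def check(values: List[str]) -> bool:
        try:
            for value in values:
                caster(value)
        except ValueError:
            return False
        return True
    return check


def detect(values: list) -> str:
    """Detection: report the type the values already have, without casting."""
    if all(isinstance(v, int) for v in values):
        return "Integer"
    if all(isinstance(v, float) for v in values):
        return "Float"
    if all(isinstance(v, str) for v in values):
        return "String"
    return "Object"


# Hypothetical inference edges: source type -> [(target type, check, cast)].
# Integer is listed before Float because it is the more specific target.
INFERENCE_EDGES = {
    "String": [
        ("Integer", _parses_as(int), int),
        ("Float", _parses_as(float), float),
    ],
    "Float": [
        ("Integer", lambda values: all(v.is_integer() for v in values), int),
    ],
}


def infer(values: list) -> Tuple[str, list]:
    """Inference: follow edges from the detected type for as long as one applies."""
    current = detect(values)
    moved = True
    while moved:
        moved = False
        for target, check, cast in INFERENCE_EDGES.get(current, []):
            if check(values):
                values = [cast(v) for v in values]
                current = target
                moved = True
                break
    return current, values


print(detect(["1", "2", "3"]))   # the values are strings right now
print(infer(["1", "2", "3"]))    # but they can be read as integers
print(infer(["1.0", "2.0"]))     # two hops: String, then Float, then Integer
```

This mirrors the `infer_type(df.astype(str))` call above: the stringified columns are detected as strings, but following the edges recovers the numeric types.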
Visions solves many of the most common problems of working with tabular data. For example, sequences of integers are
still recognized as integers whether they have trailing decimal 0's from being cast to float, missing values, or
something else altogether. Much of this cleaning is performed automatically, providing nicely cleaned and processed
data as well.
cleaned_df = typeset.cast_to_inferred(df)
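As a rough illustration of the kind of cleaning described above, the sketch below recovers integers from a column that was widened to float because of missing values. It is a plain-Python approximation under stated assumptions, not visions' own code; `is_missing` and `recover_integers` are hypothetical helper names.

```python
import math
from typing import Optional


def is_missing(value: object) -> bool:
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def recover_integers(values: list) -> Optional[list]:
    """Return the values as ints (None for missing) if every present value
    is a whole number; return None if the column is not integer-like."""
    cleaned = []
    for value in values:
        if is_missing(value):
            cleaned.append(None)
        elif (
            isinstance(value, (int, float))
            and not isinstance(value, bool)
            and float(value).is_integer()
        ):
            cleaned.append(int(value))
        else:
            return None
    return cleaned


# Whole-number floats with a missing value are recovered as integers.
print(recover_integers([1.0, 2.0, float("nan"), 4.0]))
# A genuinely fractional value means the column stays as it is.
print(recover_integers([1.5, 2.0]))
```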
This is only a small taste of everything visions can do, including building your own domain-specific types and
typesets, so please check out the API documentation or the examples/ directory for more info!
Supported frameworks
Thanks to its dispatch-based implementation, visions is able to exploit framework-specific capabilities offered by
libraries like pandas and Spark. Currently it works with the following backends by default:
- Pandas (feature complete)
- NumPy (boolean, complex, date time, float, integer, string, time delta, object)
- Spark (boolean, categorical, date, date time, float, integer, numeric, object, string)
- Python (string, float, integer, date time, time delta, boolean, categorical, object, complex; other datatypes are untested)
If you're using pandas, visions will also take advantage of parallelization tools like
swifter if available.
It also offers a simple annotation-based API for registering new implementations as needed. For example, if you wished
to extend the categorical data type to include a Dask-specific implementation, you might do something like:
from visions.types.categorical import Categorical
from pandas.api import types as pdt
import dask.dataframe as dd  # the dataframe submodule must be imported explicitly

# Register a Dask-specific membership check for the Categorical type
@Categorical.contains_op.register
def categorical_contains(series: dd.Series, state: dict) -> bool:
    return pdt.is_categorical_dtype(series.dtype)
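For readers unfamiliar with dispatch-based registration, the pattern can be sketched with the standard library's `functools.singledispatch`. This is an analogous technique, not visions' own dispatch mechanism, and `contains_categorical` and its toy rule are hypothetical stand-ins. The point is only that the implementation is chosen from the type annotation of the first argument.

```python
from functools import singledispatch


@singledispatch
def contains_categorical(data) -> bool:
    """Fallback used when no implementation is registered for the container."""
    raise NotImplementedError(
        f"No implementation registered for {type(data).__name__}"
    )


@contains_categorical.register
def _(data: list) -> bool:
    # Toy rule for plain lists: non-empty and made up entirely of strings.
    return len(data) > 0 and all(isinstance(value, str) for value in data)


@contains_categorical.register
def _(data: tuple) -> bool:
    # A new container type reuses the list logic without touching existing code.
    return contains_categorical(list(data))


print(contains_categorical(["a", "b", "a"]))  # dispatches to the list version
print(contains_categorical((1, 2, 3)))        # dispatches to the tuple version
```

Supporting another backend then means registering one more function rather than editing the existing ones, which is the property the Dask example above relies on.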
Contributing and support
Contributions to visions are welcome. For more information, please visit the community
contributions page and join us on Slack. The GitHub issues tracker is used for reporting
bugs, feature requests and support questions.
Also, please check out some of the other companies and packages using visions.
If you're currently using visions or would like to be featured here please let us know.
Acknowledgements
This package is part of the dylan-profiler project. The package is a core component
of pandas-profiling. More information can be
found here. This work was partially supported
by SIDN Fonds.
