TensorFlow Data Validation
TensorFlow Data Validation (TFDV) is a library for exploring and validating
machine learning data. It is designed to be highly scalable
and to work well with TensorFlow and TensorFlow Extended (TFX).
TF Data Validation includes:
- Scalable calculation of summary statistics of training and test data.
- Integration with a viewer for data distributions and statistics, as well
as faceted comparison of pairs of features (Facets).
- Automated data-schema generation to describe expectations about data,
such as required values, ranges, and vocabularies.
- A schema viewer to help you inspect the schema.
- Anomaly detection to identify problems such as missing features,
out-of-range values, or wrong feature types.
- An anomalies viewer so that you can see which features have anomalies and
learn more in order to correct them.
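The pieces listed above fit together roughly as follows. This is a minimal sketch rather than the canonical workflow: it assumes the data is already loaded into pandas DataFrames, and the helper name `find_anomalies` is hypothetical. The TFDV calls it relies on are `generate_statistics_from_dataframe`, `infer_schema`, and `validate_statistics`.

```python
def find_anomalies(train_df, eval_df):
    """Return {feature_name: short description} for anomalies found in eval_df.

    Both arguments are assumed to be pandas DataFrames (an assumption of this
    sketch, not a requirement of TFDV, which also reads CSV and TFRecord files).
    """
    # Imported lazily so this module can be loaded without TFDV installed.
    import tensorflow_data_validation as tfdv

    # 1. Compute summary statistics for each dataset.
    train_stats = tfdv.generate_statistics_from_dataframe(train_df)
    eval_stats = tfdv.generate_statistics_from_dataframe(eval_df)

    # 2. Infer a schema (types, domains, presence) from the training statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # 3. Check the evaluation statistics against that schema.
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

    # anomaly_info maps each offending feature name to an AnomalyInfo message.
    return {
        name: info.short_description
        for name, info in anomalies.anomaly_info.items()
    }
```

In a notebook, `tfdv.visualize_statistics`, `tfdv.display_schema`, and `tfdv.display_anomalies` render the corresponding viewers for the objects computed in steps 1 to 3.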
For instructions on using TFDV, see the get started guide
and try out the example notebook.
Some of the techniques implemented in TFDV are described in a
technical paper published in SysML'19.
Installing from PyPI
The recommended way to install TFDV is using the
PyPI package:
pip install tensorflow-data-validation
Nightly Packages
TFDV also hosts nightly packages on Google Cloud. To install the latest nightly
package, please use the following command:
export TFX_DEPENDENCY_SELECTOR=NIGHTLY
pip install --extra-index-url https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation
This will install the nightly packages for the major dependencies of TFDV such
as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).
TFDV sometimes depends on changes in those libraries that have not been
released yet, so it is safer to pair nightly TFDV with nightly versions of its
dependencies. Exporting the TFX_DEPENDENCY_SELECTOR environment variable, as
in the command above, does this.
NOTE: These nightly packages are unstable and breakages are likely.
A fix can take a week or more, depending on the complexity involved.
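After installing, it can help to confirm which versions of TFDV and its major dependencies actually ended up in the environment. The following is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are assumed to be the ones published on PyPI.

```python
from importlib import metadata

# Distribution names as assumed to be published on PyPI.
PACKAGES = ("tensorflow-data-validation", "tfx-bsl", "tensorflow-metadata")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

If TFDV reports a nightly version while TFX-BSL or TFMD reports a released one, the environment variable was probably not exported before running pip.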
Build with Docker
This is the recommended way to build TFDV under Linux, and is continuously
tested at Google.
1. Install Docker
First install docker and docker-compose by following the installation
directions for each.
2. Clone the TFDV repository
git clone https://github.com/tensorflow/data-validation
cd data-validation
Note that these instructions will install the latest master branch of TensorFlow
Data Validation. If you want to install a specific branch (such as a release
branch), pass -b <branchname> to the git clone command.
3. Build the pip package
Then, run the following at the project root:
sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010
where PYTHON_VERSION is one of {39, 310, 311}.
A wheel will be produced under dist/.
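The PYTHON_VERSION values look like the interpreter's major and minor version numbers concatenated (39 for Python 3.9, and so on). Under that assumption, which the instructions above do not state explicitly, the value for the current interpreter can be derived as sketched below. The helper names are hypothetical.

```python
import sys

# The values listed in the build instructions above.
SUPPORTED_TAGS = ("39", "310", "311")


def python_version_tag(version_info=None):
    """Return major and minor digits joined together, e.g. (3, 10) -> '310'."""
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    return f"{major}{minor}"


def is_supported(tag):
    """True if the tag is one of the documented PYTHON_VERSION values."""
    return tag in SUPPORTED_TAGS


if __name__ == "__main__":
    tag = python_version_tag()
    status = "supported" if is_supported(tag) else "not in the documented set"
    print(f"PYTHON_VERSION={tag} ({status})")
```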
4. Install the pip package
pip install dist/*.whl
Build from source
1. Prerequisites
To compile and use TFDV, you need to set up some prerequisites.
Install NumPy
If NumPy is not installed on your system, install it now by following these
directions.
Install Bazel
If Bazel is not installed on your system, install it now by following these
directions.
2. Clone the TFDV repository
git clone https://github.com/tensorflow/data-validation
cd data-validation
Note that these instructions will install the latest master branch of TensorFlow
Data Validation. If you want to install a specific branch (such as a release
branch), pass -b <branchname> to the git clone command.
3. Build the pip package
The TFDV wheel is Python version dependent. To build a pip package that
works for a specific Python version, use that Python binary to run:
python setup.py bdist_wheel
You can find the generated .whl file in the dist subdirectory.
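Because the wheel is tied to a Python version, a quick check of the filename can catch a mismatch before installing. The sketch below assumes a CPython interpreter and the standard wheel filename layout (name-version[-build]-python-abi-platform). The helper names and the example filename in the comment are illustrative, not taken from an actual build.

```python
import sys


def wheel_python_tag(filename):
    """Extract the Python tag (e.g. 'cp310') from a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # Layout: name-version[-build]-python-abi-platform
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename}")
    return parts[-3]


def matches_interpreter(filename, version_info=None):
    """True if the wheel's Python tag names the given CPython version."""
    if version_info is None:
        version_info = sys.version_info
    expected = f"cp{version_info[0]}{version_info[1]}"
    # A tag may list several interpreters separated by dots.
    return expected in wheel_python_tag(filename).split(".")


# Illustrative filename only:
# matches_interpreter(
#     "tensorflow_data_validation-1.16.0-cp310-cp310-linux_x86_64.whl", (3, 10))
```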
4. Install the pip package
pip install dist/*.whl
Supported platforms
TFDV is tested on the following 64-bit operating systems:
- macOS 12.5 (Monterey) or later.
- Ubuntu 20.04 or later.
Notable Dependencies
TensorFlow is required.
Apache Beam is required; TFDV relies on it for efficient distributed
computation. By default, Apache Beam runs in local mode, but it can also run in
distributed mode using Google Cloud Dataflow and other Apache Beam runners.
Apache Arrow is also required. TFDV uses Arrow to
represent data internally in order to make use of vectorized NumPy functions.
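Switching between local and distributed execution comes down to the Beam pipeline options passed to TFDV. The sketch below builds those options as command-line style flags and hands them to `tfdv.generate_statistics_from_tfrecord` through its `pipeline_options` parameter. The helper names are hypothetical, and any project, region, or bucket values you pass are your own.

```python
def beam_flags(runner="DirectRunner", **options):
    """Build Beam flags, e.g. ['--runner=DirectRunner']."""
    flags = [f"--runner={runner}"]
    # Sort for a deterministic flag order.
    flags += [f"--{key}={value}" for key, value in sorted(options.items())]
    return flags


def generate_statistics(data_location, output_path, flags):
    """Compute statistics over TFRecord files with the given Beam flags."""
    # Imported lazily so the flag helper works without TFDV or Beam installed.
    import tensorflow_data_validation as tfdv
    from apache_beam.options.pipeline_options import PipelineOptions

    return tfdv.generate_statistics_from_tfrecord(
        data_location=data_location,
        output_path=output_path,
        pipeline_options=PipelineOptions(flags),
    )
```

With no arguments, `beam_flags()` selects the local DirectRunner. A Dataflow run would use something like `beam_flags("DataflowRunner", project=..., region=..., temp_location=...)`, plus whatever additional options your Dataflow setup requires.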
Compatible versions
The following table shows the package versions that are
compatible with each other. This is determined by our testing framework, but
other untested combinations may also work.
tensorflow-data-validation | apache-beam[gcp] | pyarrow | tensorflow | tensorflow-metadata | tensorflow-transform | tfx-bsl |
--- | --- | --- | --- | --- | --- | --- |
GitHub master | 2.59.0 | 10.0.1 | nightly (2.x) | 1.16.0 | n/a | 1.16.0 |
1.16.0 | 2.59.0 | 10.0.1 | 2.16 | 1.16.0 | n/a | 1.16.0 |
1.15.1 | 2.47.0 | 10.0.0 | 2.15 | 1.15.0 | n/a | 1.15.1 |
1.15.0 | 2.47.0 | 10.0.0 | 2.15 | 1.15.0 | n/a | 1.15.0 |
1.14.0 | 2.47.0 | 10.0.0 | 2.13 | 1.14.0 | n/a | 1.14.0 |
1.13.0 | 2.40.0 | 6.0.0 | 2.12 | 1.13.1 | n/a | 1.13.0 |
1.12.0 | 2.40.0 | 6.0.0 | 2.11 | 1.12.0 | n/a | 1.12.0 |
1.11.0 | 2.40.0 | 6.0.0 | 1.15 / 2.10 | 1.11.0 | n/a | 1.11.0 |
1.10.0 | 2.40.0 | 6.0.0 | 1.15 / 2.9 | 1.10.0 | n/a | 1.10.1 |
1.9.0 | 2.38.0 | 5.0.0 | 1.15 / 2.9 | 1.9.0 | n/a | 1.9.0 |
1.8.0 | 2.38.0 | 5.0.0 | 1.15 / 2.8 | 1.8.0 | n/a | 1.8.0 |
1.7.0 | 2.36.0 | 5.0.0 | 1.15 / 2.8 | 1.7.0 | n/a | 1.7.0 |
1.6.0 | 2.35.0 | 5.0.0 | 1.15 / 2.7 | 1.6.0 | n/a | 1.6.0 |
1.5.0 | 2.34.0 | 5.0.0 | 1.15 / 2.7 | 1.5.0 | n/a | 1.5.0 |
1.4.0 | 2.32.0 | 4.0.1 | 1.15 / 2.6 | 1.4.0 | n/a | 1.4.0 |
1.3.0 | 2.32.0 | 2.0.0 | 1.15 / 2.6 | 1.2.0 | n/a | 1.3.0 |
1.2.0 | 2.31.0 | 2.0.0 | 1.15 / 2.5 | 1.2.0 | n/a | 1.2.0 |
1.1.1 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.1.0 | n/a | 1.1.1 |
1.1.0 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.1.0 | n/a | 1.1.0 |
1.0.0 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.0.0 | n/a | 1.0.0 |
0.30.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.30.0 | n/a | 0.30.0 |
0.29.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.29.0 | n/a | 0.29.0 |
0.28.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.28.0 | n/a | 0.28.1 |
0.27.0 | 2.27.0 | 2.0.0 | 1.15 / 2.4 | 0.27.0 | n/a | 0.27.0 |
0.26.1 | 2.28.0 | 0.17.0 | 1.15 / 2.3 | 0.26.0 | 0.26.0 | 0.26.0 |
0.26.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.26.0 | 0.26.0 | 0.26.0 |
0.25.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.25.0 | 0.25.0 | 0.25.0 |
0.24.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.1 | 0.24.1 |
0.24.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.0 | 0.24.0 |
0.23.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0 |
0.23.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0 |
0.22.2 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1 |
0.22.1 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1 |
0.22.0 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.0 |
0.21.5 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3 |
0.21.4 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3 |
0.21.2 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
0.21.1 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
0.21.0 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
0.15.0 | 2.16.0 | 0.14.0 | 1.15 / 2.0 | 0.15.0 | 0.15.0 | 0.15.0 |
0.14.1 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a |
0.14.0 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a |
0.13.1 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a |
0.13.0 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a |
0.12.0 | 2.10.0 | n/a | 1.12 | 0.12.1 | 0.12.0 | n/a |
0.11.0 | 2.8.0 | n/a | 1.11 | 0.9.0 | 0.11.0 | n/a |
0.9.0 | 2.6.0 | n/a | 1.9 | n/a | n/a | n/a |
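The table can also be consulted programmatically. The sketch below copies three rows from it into a dictionary and reports where a set of installed versions differs from the tested combination. The helper name `mismatches` is hypothetical, and the prefix match for TensorFlow (so that 2.16 accepts 2.16.1) is an assumption about how the table's shortened versions should be read.

```python
# A few rows copied from the table above; extend as needed.
COMPATIBLE = {
    "1.16.0": {
        "apache-beam": "2.59.0", "pyarrow": "10.0.1", "tensorflow": "2.16",
        "tensorflow-metadata": "1.16.0", "tfx-bsl": "1.16.0",
    },
    "1.15.1": {
        "apache-beam": "2.47.0", "pyarrow": "10.0.0", "tensorflow": "2.15",
        "tensorflow-metadata": "1.15.0", "tfx-bsl": "1.15.1",
    },
    "1.14.0": {
        "apache-beam": "2.47.0", "pyarrow": "10.0.0", "tensorflow": "2.13",
        "tensorflow-metadata": "1.14.0", "tfx-bsl": "1.14.0",
    },
}


def mismatches(tfdv_version, installed):
    """Return {package: (tested, installed)} where the two differ."""
    result = {}
    for package, want in COMPATIBLE[tfdv_version].items():
        have = installed.get(package)
        # Accept an exact match, or a longer version under the same prefix.
        if have is None or not (have == want or have.startswith(want + ".")):
            result[package] = (want, have)
    return result
```

A non-empty result is a hint rather than an error, since untested combinations may also work.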
Questions
Please direct any questions about working with TF Data Validation to
Stack Overflow using the
tensorflow-data-validation
tag.
Links