The example code shown in the explanation below can also be found in this example Jupyter notebook.
The Overview visualization is powered by the feature statistics protocol buffer. The feature statistics protocol buffer messages store summary statistics for individual feature columns of a set of input data for an ML system (although the format is general enough to be used for summary statistics of any set of data).
The top-level proto is DatasetFeatureStatisticsList, which is a list of DatasetFeatureStatistics. Each DatasetFeatureStatistics represents the feature statistics for a single dataset and contains a list of FeatureNameStatistics, each of which contains the statistics for a single feature in a single dataset.
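The nesting described above can be pictured with a minimal sketch. The dataclasses below are illustrative stand-ins that only reuse the message names; they are not the generated protocol buffer classes, and the fields shown (name, type, features, datasets) and the helper feature_names_by_dataset are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureNameStatistics:
    """Stand-in for the statistics of one feature in one dataset."""
    name: str
    type: str  # e.g. "numeric", "string" or "bytes" (assumed labels)


@dataclass
class DatasetFeatureStatistics:
    """Stand-in for the feature statistics of a single dataset."""
    name: str
    features: List[FeatureNameStatistics] = field(default_factory=list)


@dataclass
class DatasetFeatureStatisticsList:
    """Stand-in for the top-level message: one entry per dataset."""
    datasets: List[DatasetFeatureStatistics] = field(default_factory=list)


def feature_names_by_dataset(stats: DatasetFeatureStatisticsList) -> Dict[str, List[str]]:
    """Map each dataset name to the names of the features it holds."""
    return {d.name: [f.name for f in d.features] for d in stats.datasets}


stats = DatasetFeatureStatisticsList(datasets=[
    DatasetFeatureStatistics(name="train", features=[
        FeatureNameStatistics(name="num", type="numeric"),
        FeatureNameStatistics(name="str", type="string"),
    ]),
    DatasetFeatureStatistics(name="test", features=[
        FeatureNameStatistics(name="num", type="numeric"),
    ]),
])
print(feature_names_by_dataset(stats))
```

Walking the structure is always the same three steps: list, then dataset, then feature.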
The feature statistics differ depending on the feature data type (numeric, string, or raw bytes). For numeric features, the statistics include metrics such as min, mean, median, max and standard deviation. For string features, the statistics include metrics such as average length, number of unique values and mode.
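To make the two families of metrics concrete, here is a rough sketch that computes them with the Python standard library only. It is not the package's generator code. The function names are hypothetical, missing values are assumed to be None, and the population standard deviation is an assumption about which variant is meant.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Summary metrics for a numeric column, ignoring missing (None) values."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "max": max(present),
        # Assumption: population standard deviation.
        "std_dev": statistics.pstdev(present),
    }


def string_summary(values):
    """Summary metrics for a string column, ignoring missing (None) values."""
    present = [v for v in values if v is not None]
    return {
        "avg_length": statistics.mean(len(v) for v in present),
        "unique": len(set(present)),
        # most_common(1) returns a list holding the top (value, count) pair.
        "mode": Counter(present).most_common(1)[0][0],
    }


print(numeric_summary([1, 2, 3, 4]))
print(string_summary(["a", "a", "b", None]))
```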
The feature statistics proto includes an optional field for weighted statistics. If the dataset has an example weight feature, it can be used to calculate weighted statistics for every feature in addition to standard statistics. If a proto contains weighted fields, then the visualization will show the weighted statistics and the user will be able to toggle between unweighted and weighted versions of the charts per feature.
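As an illustration of what a weighted statistic means here, the sketch below compares a plain mean with a mean weighted by a per-example weight column. The function name weighted_mean and the sample weights are made up for the example; the package's own weighted computations may differ in detail.

```python
def weighted_mean(values, weights):
    """Mean of values where each example counts in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


values = [1, 2, 3, 4]
weights = [1.0, 1.0, 2.0, 4.0]  # hypothetical example weight feature

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
print(unweighted, weighted)
```

Examples with larger weights pull the weighted mean toward their values, which is why the two versions of a chart can look different.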
The feature statistics proto includes an optional field for custom statistics. If there are additional statistics for features in a dataset that a team wants to track and visualize, they can be added to the custom stats field, which is a map of custom stat names to custom stat values (either numbers or strings). These custom stats will be displayed alongside the standard statistics.
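A map from stat names to number-or-string values can be sketched in plain Python as below. The stat names negative_count and sign_summary are invented for illustration, and the dictionary is only a stand-in for the proto field, not the proto API itself.

```python
from typing import Dict, Union

# Stand-in for the custom stats map: names to numbers or strings.
CustomStats = Dict[str, Union[float, str]]


def custom_stats_for(values) -> CustomStats:
    """Compute two hypothetical custom stats for a numeric column."""
    present = [v for v in values if v is not None]
    negatives = sum(1 for v in present if v < 0)
    return {
        "negative_count": float(negatives),
        "sign_summary": "has negatives" if negatives else "all non-negative",
    }


print(custom_stats_for([1, 2, 3, 4]))
print(custom_stats_for([-1, 2, None]))
```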
The feature statistics protocol buffer can be created for datasets by the python code provided in the facets_overview/facets-overview directory.
This code can be installed through pip install facets-overview. TensorFlow should also be installed but is not included as a pip dependency, so that a user can depend on either the tensorflow or tensorflow-gpu package as necessary.
Datasets can be analyzed either from TfRecord files of tensorflow Example protocol buffers or from pandas DataFrames.
As of version 1.1.0, the facets-overview package requires protobuf version 3.20.0 or later.
To create the proto from a pandas DataFrame, use the ProtoFromDataFrames method of the GenericFeatureStatisticsGenerator class.
To create the proto from a TfRecord file, use the ProtoFromTfRecordFiles method of the FeatureStatisticsGenerator class.
These generators have dependencies on the numpy and pandas python libraries.
Use of the FeatureStatisticsGenerator class also requires having tensorflow installed.
See those files for further documentation.
Example code:
from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
import pandas as pd
df = pd.DataFrame({'num' : [1, 2, 3, 4], 'str' : ['a', 'a', 'b', None]})
proto = GenericFeatureStatisticsGenerator().ProtoFromDataFrames([{'name': 'test', 'table': df}])
The python code in this repository for generating feature stats only works on datasets that are small enough to fit into memory on your local machine. For distributed generation of feature stats for large datasets, check out the independently-developed Facets Overview Spark project.
A proto can easily be visualized in a Jupyter notebook using the installed nbextension.
The proto is stringified and then provided as input to a facets-overview Polymer web component, via the protoInput property on the element.
The web component is then displayed in the output cell of the notebook.
Example code (continued from above example):
from IPython.display import display, HTML
import base64
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")
HTML_TEMPLATE = """
<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html" >
<facets-overview id="elem"></facets-overview>
<script>
document.querySelector("#elem").protoInput = "{protostr}";
</script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))
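The reason for the base64 step can be seen in isolation with a small sketch. The bytes below are a made-up placeholder standing in for proto.SerializeToString(), and the one-line template is a shortened, hypothetical version of the script line above.

```python
import base64

# Placeholder bytes standing in for proto.SerializeToString();
# they are not a real serialized proto.
serialized = bytes([0x0A, 0x04, 0x74, 0x65, 0x73, 0x74, 0xFF])

# b64encode returns bytes; decoding as UTF-8 yields a plain ASCII str.
protostr = base64.b64encode(serialized).decode("utf-8")

# The base64 alphabet has no quotes or braces, so the string can sit
# inside a double-quoted JavaScript literal filled in by str.format.
snippet = 'document.querySelector("#elem").protoInput = "{protostr}";'.format(
    protostr=protostr)

# Decoding recovers the original bytes exactly.
roundtrip = base64.b64decode(protostr)
print(snippet)
```

Raw serialized bytes may contain values that are not valid text, so they could not be pasted into the HTML template directly.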
The protoInput property accepts any of the following three forms of the DatasetFeatureStatisticsList protocol buffer:
The visualization contains two tables: one for numeric features and one for categorical (string) features. Each table contains a row for each feature of that type. The rows contain calculated statistics and charts showing the distribution of values for that feature across the dataset(s).
Potentially problematic statistics, such as a feature that is missing (has no value) for a large number of the examples in a dataset, are shown in red and bolded.
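The kind of check behind such a highlight can be sketched as follows. The text does not say what counts as "a large number", so the 0.2 threshold is purely an assumption, as are the function names; the data matches the DataFrame example earlier in this document.

```python
# Hypothetical cutoff; the visualization's real threshold is not stated here.
MISSING_THRESHOLD = 0.2


def missing_fraction(values):
    """Fraction of examples with no value (None) for a feature."""
    if not values:
        return 0.0
    return sum(1 for v in values if v is None) / len(values)


def flag_missing(columns, threshold=MISSING_THRESHOLD):
    """Names of features whose missing fraction exceeds the threshold."""
    return [name for name, values in columns.items()
            if missing_fraction(values) > threshold]


columns = {"num": [1, 2, 3, 4], "str": ["a", "a", "b", None]}
print(flag_missing(columns))
```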
At the top of the visualization are controls that affect the individual tables.
The sort-by dropdown changes the sort order for the features in each table. The options are:
The name filter input box allows filtering the tables by feature names that match the text provided.
The currently-set filter is exposed as the property searchString.
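Name filtering of this kind can be approximated with a short sketch. How the component actually matches names is not described here, so the case-insensitive substring match below is an assumption, and filter_features is a hypothetical helper rather than part of the package.

```python
def filter_features(names, search_string):
    """Keep feature names containing search_string, ignoring case.

    An empty search string keeps every name.
    """
    needle = search_string.lower()
    return [name for name in names if needle in name.lower()]


names = ["age", "capital-gain", "capital-loss", "Education"]
print(filter_features(names, "capital"))
print(filter_features(names, "edu"))
```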
The feature checkboxes allow filtering by the type of value for each feature, such as float, int or string.
Which chart is displayed for the features in a table is controlled by a dropdown above the charts. The options for numeric features are:
The options for string features are:
Additionally, the feature statistics proto allows for custom charts to be stored for any feature. If the input proto to the visualization contains any custom charts, they will be listed in the dropdown as well.
Checkboxes next to the dropdown control some other features of the charts:
There are multiple demos of Overview that can be used as functional tests to ensure new builds are working correctly.
These demos are all found under facets_overview/functional_tests.
To run one, for example the "simple" test, run bazel run facets_overview/functional_tests/simple:devserver and then navigate your browser to "localhost:6006/facets-overview/functional-tests/simple/index.html" to see the resulting visualization.
Run bazel run facets_overview/common/test:devserver and then navigate your browser to "localhost:6006/facets-overview/facets-overview/common/test/runner.html".
The output from the tests can be seen in the developer console.
To build the pip package, run python setup.py bdist_wheel --universal and then twine upload dist/* to upload it to PyPI.
After installing the python package, run python -m feature_statistics_generator_test and python -m generic_feature_statistics_generator_test.