RDT (Reversible Data Transforms) is a Python library that transforms raw data into fully numerical data, ready for data science. The transforms are reversible, allowing you to convert from numerical data back into your original format.
Install RDT using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.
pip install rdt
conda install -c conda-forge rdt
For more information about using reversible data transformations, visit the RDT Documentation.
This short series of tutorials walks you through the steps needed to get started using RDT to transform columns, tables and datasets.
After you have installed RDT, you can get started using the demo dataset.
from rdt import get_demo
customers = get_demo()
This dataset contains some randomly generated values that describe the customers of an online marketplace.
  last_login email_optin credit_card  age  dollars_spent
0 2021-06-26       False        VISA   29          99.99
1 2021-02-10       False        VISA   18            NaN
2        NaT       False        AMEX   21           2.50
3 2020-09-26        True         NaN   45          25.00
4 2020-12-22         NaN    DISCOVER   32          19.99
Let's transform this data so that each column is converted to fully numerical data, ready for data science.
The HyperTransformer is capable of transforming multi-column datasets.
from rdt import HyperTransformer
ht = HyperTransformer()
The HyperTransformer needs to know about the columns in your dataset and which transformers to apply to each. These are described by a config. We can ask the HyperTransformer to automatically detect it based on the data we plan to use.
ht.detect_initial_config(data=customers)
This will create and set the config.
Config:
{
    "sdtypes": {
        "last_login": "datetime",
        "email_optin": "boolean",
        "credit_card": "categorical",
        "age": "numerical",
        "dollars_spent": "numerical"
    },
    "transformers": {
        "last_login": "UnixTimestampEncoder()",
        "email_optin": "BinaryEncoder()",
        "credit_card": "FrequencyEncoder()",
        "age": "FloatFormatter()",
        "dollars_spent": "FloatFormatter()"
    }
}
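To make the idea of detecting sdtypes concrete, here is a minimal standard-library sketch that guesses a semantic type for each column of the demo data. The helper `guess_sdtype` and the dictionary layout of `customers` are hypothetical and exist only for illustration; this is not how the HyperTransformer is implemented, and RDT's own detection works on DataFrames and may apply different rules.

```python
from datetime import date


def guess_sdtype(values):
    """Guess a semantic type from the non-missing values of one column.

    Hypothetical helper for illustration only, not part of RDT.
    """
    present = [v for v in values if v is not None]
    # bool must be checked before numbers because bool is a subclass of int.
    if present and all(isinstance(v, bool) for v in present):
        return "boolean"
    if present and all(isinstance(v, date) for v in present):
        return "datetime"
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return "numerical"
    # Anything else (for example strings) is treated as a category.
    return "categorical"


# The demo rows from above, written as plain Python lists (None marks a missing value).
customers = {
    "last_login": [date(2021, 6, 26), date(2021, 2, 10), None,
                   date(2020, 9, 26), date(2020, 12, 22)],
    "email_optin": [False, False, False, True, None],
    "credit_card": ["VISA", "VISA", "AMEX", None, "DISCOVER"],
    "age": [29, 18, 21, 45, 32],
    "dollars_spent": [99.99, None, 2.50, 25.00, 19.99],
}

sdtypes = {column: guess_sdtype(values) for column, values in customers.items()}
print(sdtypes)
```

Under these assumptions, the guessed types agree with the "sdtypes" section of the config shown above.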
The sdtypes dictionary describes the semantic data types of each of your columns and the transformers dictionary describes which transformer to use for each column. You can customize the transformers and their settings. (See the Transformers Glossary for more information.)
The HyperTransformer references the config while learning the data during the fit stage.
ht.fit(customers)
Once the transformer is fit, it's ready to use. Use the transform method to transform all columns of your dataset at once.
transformed_data = ht.transform(customers)
   last_login.value  email_optin.value  credit_card.value  age.value  dollars_spent.value
0      1.624666e+18                0.0                0.2         29                99.99
1      1.612915e+18                0.0                0.2         18                36.87
2      1.611814e+18                0.0                0.5         21                 2.50
3      1.601078e+18                1.0                0.7         45                25.00
4      1.608595e+18                0.0                0.9         32                19.99
The HyperTransformer applied the assigned transformer to each individual column. Each column now contains fully numerical data that you can use for your project!
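The numbers in the transformed table can be reproduced by hand, which may help build intuition for what a transformer does. The sketch below assumes that dates become nanoseconds since the Unix epoch (UTC) and that missing values are replaced with the column mean; both assumptions are inferred from the values shown above rather than taken from RDT's source, and the helper names are hypothetical.

```python
from datetime import datetime, timezone
from statistics import mean


def to_unix_ns(day):
    """Convert a YYYY-MM-DD string to nanoseconds since the Unix epoch (UTC)."""
    moment = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(moment.timestamp()) * 1_000_000_000


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known entries."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


# Two columns of the demo data; None marks a missing value.
last_login = ["2021-06-26", "2021-02-10", None, "2020-09-26", "2020-12-22"]
dollars_spent = [99.99, None, 2.50, 25.00, 19.99]

login_ns = fill_missing_with_mean(
    [None if day is None else to_unix_ns(day) for day in last_login]
)
spent = fill_missing_with_mean(dollars_spent)

for ns, dollars in zip(login_ns, spent):
    # Format like the table: six decimals in scientific notation, two for dollars.
    print(f"{ns:.6e}  {dollars:.2f}")
```

Under these assumptions the printed rows line up with the last_login.value and dollars_spent.value columns, including the filled-in entries for the missing date and the missing amount.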
When you're done with your project, you can also transform the data back to the original format using the reverse_transform method.
original_format_data = ht.reverse_transform(transformed_data)
  last_login email_optin credit_card  age  dollars_spent
0        NaT       False        VISA   29          99.99
1 2021-02-10       False        VISA   18            NaN
2        NaT       False        AMEX   21            NaN
3 2020-09-26        True         NaN   45          25.00
4 2020-12-22       False    DISCOVER   32          19.99
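Reversibility is easiest to see on a single categorical column. The following is a simplified sketch of a frequency-style encoding: each category gets a slice of the interval [0, 1] proportional to how often it appears, values are encoded as the midpoint of their slice, and decoding looks up which slice a number falls into. The function names are hypothetical and the approach is inferred from the credit_card.value numbers shown earlier; it is not RDT's implementation.

```python
from collections import Counter
from fractions import Fraction


def fit_intervals(values):
    """Assign each category a slice of [0, 1] sized by its frequency."""
    intervals = {}
    start = Fraction(0)
    # most_common() sorts by count; ties keep first-seen order.
    for category, count in Counter(values).most_common():
        end = start + Fraction(count, len(values))
        intervals[category] = (start, end)
        start = end
    return intervals


def encode(values, intervals):
    """Map each category to the midpoint of its slice."""
    return [float(sum(intervals[v]) / 2) for v in values]


def decode(numbers, intervals):
    """Map each number back to the first slice whose upper bound covers it."""
    return [
        next(cat for cat, (_, end) in intervals.items() if number <= end)
        for number in numbers
    ]


# The credit_card column of the demo data; None marks a missing value.
credit_card = ["VISA", "VISA", "AMEX", None, "DISCOVER"]

intervals = fit_intervals(credit_card)
encoded = encode(credit_card, intervals)
restored = decode(encoded, intervals)

print(encoded)
print(restored == credit_card)
```

Because every encoded number sits strictly inside its category's slice, decoding recovers the original labels in this sketch, with the missing value treated as a category of its own.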
To learn more about reversible data transformations, visit the RDT Documentation.
The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:
Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.
FAQs
Reversible Data Transforms
We found that rdt demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 9 open source maintainers collaborating on the project.