pydatalens

A Python package for automatic EDA, data cleaning, and visualization.

1.0.0

PyPI

Maintainers: 1

pydatalens

pydatalens is a Python package designed to streamline the process of Exploratory Data Analysis (EDA), data cleaning, and visualization. It enables data scientists and analysts to quickly prepare, explore, and gain insights from datasets with minimal effort.

Features

1. Smart Summarization

Automatically generates a summary of the dataset, including:
- Data types
- Missing values
- Descriptive statistics
- Unique value counts
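
As a rough illustration of what such a summary contains, the sketch below computes the same four items with plain pandas. It is not the pydatalens implementation; the function name `summarize_sketch` and the sample data are hypothetical.

```python
import pandas as pd


def summarize_sketch(df: pd.DataFrame) -> dict:
    """Hypothetical stand-in for a dataset summary (not pydatalens code)."""
    return {
        # Column name -> dtype name, e.g. "float64"
        "dtypes": df.dtypes.astype(str).to_dict(),
        # Column name -> number of missing cells
        "missing": {col: int(n) for col, n in df.isna().sum().items()},
        # Descriptive statistics for every column, numeric or not
        "describe": df.describe(include="all"),
        # Column name -> number of distinct non-missing values
        "unique": {col: int(n) for col, n in df.nunique().items()},
    }


df = pd.DataFrame(
    {
        "age": [25, 32, None, 41],
        "city": ["Pune", "Delhi", "Pune", None],
    }
)
summary = summarize_sketch(df)
print(summary["missing"])
print(summary["unique"])
print(summary["describe"])
```

Returning a dictionary keeps each part of the summary individually accessible; the output format of `eda.summarize` itself may differ.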

2. Data Cleaning

- Detects and handles missing values using a choice of strategies (mean, median, mode).
- Identifies and removes duplicate rows.
- Basic outlier detection is planned for a future update.
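
To make the three strategies concrete, here is a minimal pandas-only sketch of mean, median, and mode imputation followed by duplicate removal. The helper `fill_missing_sketch` is hypothetical and only approximates what `cleaning.handle_missing` might do; in particular, leaving non-numeric columns untouched under mean and median is an assumption of this sketch.

```python
import pandas as pd


def fill_missing_sketch(df: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Hypothetical imputation helper (not pydatalens code)."""
    if strategy not in {"mean", "median", "mode"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if strategy == "mode":
            modes = out[col].mode()
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
        elif pd.api.types.is_numeric_dtype(out[col]):
            fill = out[col].mean() if strategy == "mean" else out[col].median()
            out[col] = out[col].fillna(fill)
        # Assumption: non-numeric columns are left as-is for mean/median.
    return out


df = pd.DataFrame(
    {
        "age": [20.0, None, 40.0, 40.0],
        "city": ["A", "B", "B", "B"],
    }
)
filled = fill_missing_sketch(df, strategy="median")
deduped = filled.drop_duplicates()  # rows that became identical collapse to one
print(filled)
print(deduped)
```

Filling before deduplicating matters here: rows that differ only by a missing value can become identical once imputed.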

3. Correlation Analysis

- Generates a correlation matrix to identify relationships between features.
- Provides heatmaps to visualize the matrix.
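
For readers unfamiliar with correlation matrices, the following sketch builds one with plain pandas and lists the strongly related column pairs. The sample data and the 0.8 threshold are illustrative assumptions, and this is not necessarily how pydatalens computes its matrix.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "hours": [1, 2, 3, 4, 5],
        "score": [52, 58, 61, 70, 74],
        "errors": [9, 7, 6, 3, 2],
        "group": ["a", "b", "a", "b", "a"],  # non-numeric, excluded below
    }
)

# Pearson correlation over the numeric columns only.
corr = df.select_dtypes(include="number").corr()
print(corr.round(3))

# Collect each unordered pair once and keep those above the threshold.
THRESHOLD = 0.8  # illustrative cut-off, not a pydatalens setting
cols = list(corr.columns)
strong = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) >= THRESHOLD
]
print(strong)
```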

4. Automatic Visualizations

Supports generating:
- Histograms
- Box plots
- Correlation heatmaps
- Scatter plots (planned for future updates).
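
The plot types listed above can be reproduced directly with matplotlib, which may help when customizing beyond what the package offers. This sketch is an assumption-laden equivalent, not pydatalens code: it uses `imshow` for the heatmap instead of seaborn, a headless backend, and made-up sample data.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    {
        "age": [22, 25, 31, 35, 41, 58],
        "income": [28, 32, 45, 51, 60, 75],
    }
)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram of a single column
axes[0].hist(df["age"].to_numpy(), bins=5)
axes[0].set_title("Histogram of age")

# Box plot of a single column
axes[1].boxplot(df["income"].to_numpy())
axes[1].set_title("Box plot of income")

# Correlation heatmap drawn from the correlation matrix
corr = df.corr()
image = axes[2].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[2].set_xticks(range(len(corr.columns)))
axes[2].set_xticklabels(corr.columns)
axes[2].set_yticks(range(len(corr.columns)))
axes[2].set_yticklabels(corr.columns)
axes[2].set_title("Correlation heatmap")
fig.colorbar(image, ax=axes[2])

fig.tight_layout()
out_path = os.path.join(tempfile.mkdtemp(), "eda_plots.png")
fig.savefig(out_path)
plt.close(fig)
print(out_path)
```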

5. Report Generation

Exports EDA results and visualizations into a detailed HTML report for easy sharing.
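
The README does not show the report API, so the sketch below only illustrates the general idea of assembling an HTML report from pandas output. The function `build_report_sketch` and its layout are hypothetical.

```python
import html

import pandas as pd


def build_report_sketch(df: pd.DataFrame, title: str = "EDA report") -> str:
    """Hypothetical HTML report builder (not pydatalens code)."""
    safe_title = html.escape(title)
    missing = df.isna().sum().to_frame("missing")
    parts = [
        "<!DOCTYPE html>",
        f"<html><head><meta charset='utf-8'><title>{safe_title}</title></head><body>",
        f"<h1>{safe_title}</h1>",
        "<h2>Descriptive statistics</h2>",
        df.describe().to_html(),  # pandas renders the table markup
        "<h2>Missing values</h2>",
        missing.to_html(),
        "</body></html>",
    ]
    return "\n".join(parts)


df = pd.DataFrame({"age": [25, None, 41], "city": ["Pune", "Delhi", None]})
report = build_report_sketch(df, title="Sample <EDA> report")
print(report[:80])
```

The returned string can be written to disk with `pathlib.Path("report.html").write_text(report, encoding="utf-8")` and opened in any browser.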

Installation

Using pip (from source)

Clone the repository:

```
git clone https://github.com/gopalakrishnanarjun/pydatalens.git
cd pydatalens
```

Install the package:
```
pip install -e .
```

Dependencies

- Python >= 3.6
- pandas >= 1.0
- numpy >= 1.18
- matplotlib >= 3.1
- seaborn >= 0.11

Install dependencies manually:

```
pip install pandas numpy matplotlib seaborn
```
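
To see which of these dependencies are already present, a short standard-library check like the one below can help. It is a sketch only: `importlib.metadata` requires Python 3.8 or newer (later than the minimum listed above), and the script reports versions without comparing them against the minimums.

```python
from importlib import metadata

# Minimum versions as listed in this README.
MINIMUMS = {
    "pandas": "1.0",
    "numpy": "1.18",
    "matplotlib": "3.1",
    "seaborn": "0.11",
}


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, version in installed_versions(MINIMUMS).items():
    status = version if version is not None else "not installed"
    print(f"{name}: {status} (README minimum: {MINIMUMS[name]})")
```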

Quick Start

1. Import the package

```
from pydatalens import eda, cleaning, visualizations
```

2. Load a dataset

```
import pandas as pd
df = pd.read_csv("your_dataset.csv")
```
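
If no CSV file is at hand, a small synthetic dataset is enough to try the remaining steps. The sketch below is purely illustrative: the column names, value ranges, and temporary file location are assumptions, chosen so that an `age` column with a few missing values exists for the later steps.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed for repeatable data

sample = pd.DataFrame(
    {
        "age": rng.integers(18, 70, size=50).astype(float),
        "income": rng.normal(50_000, 12_000, size=50).round(2),
    }
)
# Blank out a few ages so the missing-value step has something to do.
sample.loc[[3, 17, 42], "age"] = np.nan

csv_path = os.path.join(tempfile.mkdtemp(), "your_dataset.csv")
sample.to_csv(csv_path, index=False)

df = pd.read_csv(csv_path)
print(df.shape)
print(df.isna().sum())
```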

3. Summarize the dataset

```
print(eda.summarize(df))
```

4. Handle missing values

```
df_cleaned = cleaning.handle_missing(df, strategy="mean")
```

5. Visualize the data

```
visualizations.plot_histogram(df_cleaned, column="age")
visualizations.correlation_heatmap(df_cleaned)
```

Examples

Summarizing the Data

```
from pydatalens import eda

# df is a pandas DataFrame, loaded as in the Quick Start
summary = eda.summarize(df)
print(summary)
```

Cleaning the Data

```
from pydatalens import cleaning

# Fill missing values with the median, then drop duplicate rows
df = cleaning.handle_missing(df, strategy="median")
df = cleaning.drop_duplicates(df)
```
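
Outlier detection is listed as planned rather than available, so until it ships, outliers can be flagged by hand. The sketch below uses the common 1.5 × IQR rule with plain pandas; the helper `iqr_outlier_mask` is hypothetical and says nothing about how pydatalens will eventually implement the feature.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 120], name="age")
mask = iqr_outlier_mask(ages)
print(ages[mask])   # values flagged as outliers
print(ages[~mask])  # values kept
```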

Visualizing the Data

```
from pydatalens import visualizations

# Replace "column_name" with a numeric column from your DataFrame
visualizations.plot_histogram(df, "column_name")
visualizations.correlation_heatmap(df)
```

Future Enhancements

- Advanced anomaly detection
- Support for time series analysis
- Enhanced visualization options (e.g., scatter plots, pair plots)
- Integration with machine learning pipelines

Contributing

Contributions are welcome! If you'd like to contribute, please fork the repository and submit a pull request.

License

pydatalens is licensed under the MIT License. See the LICENSE file for more details.
