
A powerful Python library designed to simplify data analysis by providing one-line solutions for cleaning, transformation, and visualization. Eliminate boilerplate code with intuitive, feature-rich functions tailored for analysts, researchers, and developers. Streamline workflows with advanced preprocessing and insightful visualizations, all in a single, user-friendly package.
DataAnalysts is a robust and versatile Python library meticulously designed to simplify and enhance the data analysis process. It caters to users across diverse domains, including students, professional data analysts, researchers, and enthusiasts. The library integrates powerful modules for data cleaning, transformation, and visualization, enabling seamless handling of datasets with minimal coding effort.
Key features:

- CSV and Excel file loading
- Robust exception handling
- Data cleaning
- Data transformation
- Data visualization
To use the library in Google Colab or your local environment, install it directly from PyPI:

```bash
pip install dataanalysts
```

In a notebook cell, prefix the command with `!`: `!pip install dataanalysts`.
```python
import dataanalysts as da
import pandas as pd

# Load a CSV file
df = da.csv('data.csv')

# Load a specific sheet from an Excel workbook
df_excel = da.excel('data.xlsx', sheet_name='Sheet1')
```
The **Data Summary** module simplifies the exploration of datasets by providing a comprehensive summary of your DataFrame in a single step. This module is designed to give users a complete overview of their data, including column-level statistics and metadata, in a tabular format.
The summary includes:

- Column overview
- Data completeness (non-null counts)
- Uniqueness (distinct values per column)
- Descriptive statistics for numeric columns
- Descriptive statistics for categorical columns
Use the single-line `da.summary()` function to generate a summary of your DataFrame.
Syntax:
da.summary(df)
Example:
```python
import pandas as pd
import dataanalysts as da

# Sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Age': [25, 30, 35, None, 30],
    'Gender': ['F', 'M', 'M', 'F', None],
    'Score': [85, 90, 78, 92, 88]
}
df = pd.DataFrame(data)

# Generate and display summary
summary_df = da.summary(df)
print(summary_df)
```
Output:
| Column | Data Type | Non-Null Count | Unique Values | Min | Max | Mean | Median | Top | Frequency |
|---|---|---|---|---|---|---|---|---|---|
| Name | object | 5 | 3 | None | None | None | None | Alice | 2 |
| Age | float64 | 4 | 3 | 25.0 | 35.0 | 30.0 | 30.0 | None | None |
| Gender | object | 4 | 2 | None | None | None | None | F | 2 |
| Score | int64 | 5 | 5 | 78 | 92 | 86.6 | 88.0 | None | None |
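For readers who want to see what such a summary amounts to, a rough plain-pandas equivalent is sketched below (an illustration only, not the library's actual implementation; the Gender column is omitted for brevity):

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Age': [25, 30, 35, None, 30],
    'Score': [85, 90, 78, 92, 88]
}
df = pd.DataFrame(data)

rows = []
for col in df.columns:
    s = df[col]
    row = {
        'Column': col,
        'Data Type': str(s.dtype),
        'Non-Null Count': int(s.notna().sum()),
        'Unique Values': int(s.nunique()),
    }
    if pd.api.types.is_numeric_dtype(s):
        # Numeric columns get min/max/mean/median
        row.update(Min=s.min(), Max=s.max(), Mean=s.mean(), Median=s.median())
    else:
        # Categorical columns get the most frequent value and its count
        row.update(Top=s.mode().iloc[0], Frequency=int(s.value_counts().iloc[0]))
    rows.append(row)

summary = pd.DataFrame(rows)
print(summary)
```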
The summary covers numeric columns, categorical columns, and general DataFrame information in a single table.
This module provides a quick and easy way to understand your data's structure and key characteristics, making it ideal for analysts, data scientists, and developers.
The Data Cleaning module simplifies common data preprocessing tasks like handling missing values, removing duplicates, fixing structural errors, and more. With this module, you can efficiently prepare your data for analysis or modeling using intuitive and flexible one-line commands.
Remove duplicate rows from the dataset.
Syntax:
da.clean(df, strategy='remove_duplicates')
Example:
cleaned_df = da.clean(df, strategy='remove_duplicates')
Fill or drop missing values using various strategies.
Syntax:
da.clean(df, strategy='handle_missing', missing_strategy='mean')
Options:
- `missing_strategy`: 'mean', 'median', 'mode', or 'fill'
- `value`: Custom value for filling (required if `missing_strategy='fill'`)

Example:
```python
# Fill missing values with mean
cleaned_df = da.clean(df, strategy='handle_missing', missing_strategy='mean')

# Fill missing values with custom values
cleaned_df = da.clean(df, strategy='handle_missing', missing_strategy='fill', value={'Age': 25, 'Gender': 'Unknown'})
```
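For reference, here is roughly what the 'mean' and 'fill' strategies amount to in plain pandas (a sketch, not the library's internal code):

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, None, 35], 'Gender': ['F', 'M', None]})

# 'mean' strategy: numeric gaps take the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# 'fill' strategy: gaps take a user-supplied value
df['Gender'] = df['Gender'].fillna('Unknown')
```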
Standardize text data by fixing structural inconsistencies.
Syntax:
da.clean(df, strategy='fix_structural', column='Category', fix_strategy='lowercase')
Options:
- `column`: Column to clean.
- `fix_strategy`: 'lowercase' or 'uppercase'.

Example:
cleaned_df = da.clean(df, strategy='fix_structural', column='Category', fix_strategy='lowercase')
Detect and handle outliers in numerical columns.
Syntax:
da.clean(df, strategy='handle_outliers', column='Score')
Options:
- `column`: Column to handle outliers in.

Example:
cleaned_df = da.clean(df, strategy='handle_outliers', column='Score')
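The README does not specify which detection rule the library uses; one common approach is IQR-based clipping, sketched here in plain pandas as an illustration:

```python
import pandas as pd

df = pd.DataFrame({'Score': [78, 85, 90, 92, 300]})  # 300 is a clear outlier

# Tukey fences: 1.5 * IQR beyond the quartiles
q1, q3 = df['Score'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip extreme values to the fences instead of dropping rows
df['Score'] = df['Score'].clip(lower, upper)
```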
Convert columns to specific data types.
Syntax:
da.clean(df, strategy='convert_dtype', column='Age', dtype='int')
Options:
- `column`: Column to convert.
- `dtype`: Target data type ('int', 'float', 'str').

Example:
cleaned_df = da.clean(df, strategy='convert_dtype', column='Age', dtype='int')
Perform one-hot encoding for categorical variables.
Syntax:
da.clean(df, strategy='encode_categorical', columns=['Category'])
Options:
- `columns`: List of categorical columns.

Example:
cleaned_df = da.clean(df, strategy='encode_categorical', columns=['Category'])
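One-hot encoding of this kind can be reproduced with pandas' `get_dummies`; a minimal sketch for comparison (illustrative only, not the library's internals):

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C']})

# One indicator column per category level
encoded = pd.get_dummies(df, columns=['Category'])
```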
Normalize or standardize numerical columns.
Syntax:
da.clean(df, strategy='scale', columns=['Age'], scaler='minmax')
Options:
- `columns`: List of numerical columns to scale.
- `scaler`: 'minmax' or 'standard'.

Example:
cleaned_df = da.clean(df, strategy='scale', columns=['Age'], scaler='minmax')
Filter rows based on conditions.
Syntax:
da.clean(df, strategy='filter', condition="Age > 30")
Options:
- `condition`: String condition to filter rows.

Example:
cleaned_df = da.clean(df, strategy='filter', condition="Age > 30")
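String conditions like this behave much like pandas' `DataFrame.query` (an assumed equivalence, not confirmed by the README); a plain-pandas sketch of the same filter:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'], 'Age': [25, 35, 40]})

# Evaluate the condition string against the column names
filtered = df.query("Age > 30")
```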
Split a single column into multiple columns using a specified delimiter.
Syntax:
da.clean(df, strategy='split_column', column='FullName', new_columns=['FirstName', 'LastName'], delimiter=' ')
Options:
- `column`: Column to split.
- `new_columns`: List of new column names.
- `delimiter`: Delimiter to use for splitting.

Example:
cleaned_df = da.clean(df, strategy='split_column', column='FullName', new_columns=['FirstName', 'LastName'], delimiter=' ')
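The same split can be done with pandas' `str.split`; a minimal sketch for comparison (not the library's code):

```python
import pandas as pd

df = pd.DataFrame({'FullName': ['Alice Smith', 'Bob Johnson']})

# expand=True spreads the split parts across new columns
df[['FirstName', 'LastName']] = df['FullName'].str.split(' ', expand=True)
```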
Ensure numerical values are within specified ranges.
Syntax:
da.clean(df, strategy='validate', column='Score', min_value=0, max_value=100)
Options:
- `column`: Column to validate.
- `min_value`: Minimum acceptable value.
- `max_value`: Maximum acceptable value.

Example:
cleaned_df = da.clean(df, strategy='validate', column='Score', min_value=0, max_value=100)
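The README does not say whether out-of-range values are clipped or dropped; assuming clipping, the same validation can be sketched in plain pandas:

```python
import pandas as pd

df = pd.DataFrame({'Score': [85, -5, 78, 120, 88]})

# Force every value into the [0, 100] range
df['Score'] = df['Score'].clip(lower=0, upper=100)
```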
Perform interactive cleaning step by step using a menu-based approach.
Syntax:
da.interactive_clean(df)
Example:
cleaned_df = da.interactive_clean(df)
Here’s how you can use the `clean` function to perform multiple cleaning operations:
```python
import dataanalysts as da
import pandas as pd

# Sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, None, 25],
    'Gender': ['F', 'M', None],
    'FullName': ['Alice Smith', 'Bob Johnson', 'Alice Smith']
}
df = pd.DataFrame(data)

# Remove duplicates
cleaned_df = da.clean(df, strategy='remove_duplicates')

# Handle missing values
cleaned_df = da.clean(cleaned_df, strategy='handle_missing', missing_strategy='fill', value={'Age': 30, 'Gender': 'Unknown'})

# Fix structural errors
cleaned_df = da.clean(cleaned_df, strategy='fix_structural', column='Gender', fix_strategy='uppercase')

# Split column
cleaned_df = da.clean(cleaned_df, strategy='split_column', column='FullName', new_columns=['FirstName', 'LastName'], delimiter=' ')

# Interactive cleaning
cleaned_df = da.interactive_clean(cleaned_df)
```
All cleaning operations are logged in the `cleaner.log` file. This module simplifies data cleaning, making it accessible and efficient for analysts, researchers, and developers alike.
The Data Transformation Module enables comprehensive data preprocessing and transformation for datasets, including scaling, dimensionality reduction, encoding, and more. The module supports both direct and interactive transformation methods.
Scales numeric columns using the selected strategy: 'standard', 'minmax', or 'robust'.
Syntax:
```python
import dataanalysts as da

# Standard Scaling
df_transformed = da.transform(df, strategy='standard')

# Min-Max Scaling
df_transformed = da.transform(df, strategy='minmax')

# Robust Scaling
df_transformed = da.transform(df, strategy='robust')
```
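The three strategies correspond to standard, min-max, and robust (median/IQR) scaling; the underlying formulas can be sketched directly in NumPy (an illustration of the math, not the library's code):

```python
import numpy as np

x = np.array([25.0, 30.0, 35.0, 40.0, 45.0])

# 'standard': zero mean, unit variance
standard = (x - x.mean()) / x.std()

# 'minmax': rescale to [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# 'robust': centre on the median, scale by the interquartile range
q1, med, q3 = np.percentile(x, [25, 50, 75])
robust = (x - med) / (q3 - q1)
```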
Encodes categorical columns into numeric values using label encoding. This is particularly useful for machine learning models that require numeric data.
Syntax:
```python
# Encode categorical columns
df_transformed = da.transform(df, encode_categorical=True)
```
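Label encoding maps each category to an integer code; pandas' `factorize` does the same thing, sketched here for comparison:

```python
import pandas as pd

df = pd.DataFrame({'Department': ['HR', 'IT', 'Finance', 'IT', 'HR']})

# Each distinct category gets an integer code, in order of appearance
codes, uniques = pd.factorize(df['Department'])
df['Department'] = codes
```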
Automatically removes duplicate rows from the dataset.
Syntax:
```python
# Remove duplicate rows
df_transformed = da.transform(df, remove_duplicates=True)
```
Removes features with variance below a specified threshold to reduce noise in the data.
Syntax:
```python
# Remove features with variance below 0.01
df_transformed = da.transform(df, remove_low_variance=True, variance_threshold=0.01)
```
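Low-variance filtering simply drops columns whose variance falls below the threshold; a plain-pandas sketch (assuming sample variance, which may differ from the library's exact rule):

```python
import pandas as pd

df = pd.DataFrame({'constant': [1, 1, 1, 1], 'varying': [1, 2, 3, 4]})

# Keep only columns whose variance meets the threshold
threshold = 0.01
kept = df.loc[:, df.var() >= threshold]
```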
Uses Principal Component Analysis to reduce the number of features while retaining most of the variance in the dataset.
Syntax:
```python
# Apply PCA to retain 3 components
df_pca = da.transform(df_transformed, reduce_dimensionality=True, n_components=3)
```
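PCA projects the centred data onto its leading singular vectors; a NumPy sketch of the idea (not the library's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 samples, 5 features

# Centre the data, then project onto the top right-singular vectors
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_components = 3
X_reduced = Xc @ Vt[:n_components].T

# Fraction of total variance retained by the kept components
explained = (S[:n_components] ** 2).sum() / (S ** 2).sum()
```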
Provides an interactive menu for selecting transformation steps one at a time.
Menu Options:
Syntax:
```python
# Perform interactive transformation
df_interactive_transform = da.interactive_transform(df)
```
Here’s an end-to-end example combining multiple transformations:
```python
import dataanalysts as da
import pandas as pd

# Sample dataset
data = {
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR']
}
df = pd.DataFrame(data)

# Step 1: Apply standard scaling
df_transformed = da.transform(df, strategy='standard')

# Step 2: Apply PCA to reduce dimensions to 2 components
df_pca = da.transform(df_transformed, reduce_dimensionality=True, n_components=2)

# Step 3: Perform additional transformations interactively
df_final = da.interactive_transform(df_pca)

print(df_final)
```
All transformations are logged in the `transformer.log` file.

The **Data Visualization** module provides advanced tools for creating insightful and customized visual representations of your dataset. With this module, you can generate a variety of plots, including histograms, scatter plots, heatmaps, and more, with customization options for size, titles, and styles.
Visualize the distribution of a single numeric column.
Syntax:
da.histogram(df, column='age', bins=30, kde=True)
Customization Options:
- `bins`: Number of bins for the histogram.
- `kde`: Whether to display the Kernel Density Estimate.
- `size`: Tuple specifying figure size.
- `title_fontsize`: Font size for the title.
- `axis_fontsize`: Font size for axis labels.
- `custom_title`: Custom title for the chart.

Compare values across categories.
Syntax:
da.barchart(df, x_col='city', y_col='population')
Customization Options:
- `size`: Tuple specifying figure size.
- `title_fontsize`: Font size for the title.
- `axis_fontsize`: Font size for axis labels.
- `custom_title`: Custom title for the chart.

Display trends over time or sequential data.
Syntax:
da.linechart(df, x_col='date', y_col='sales')
Customization Options:
- `size`: Tuple specifying figure size.
- `title_fontsize`: Font size for the title.
- `axis_fontsize`: Font size for axis labels.
- `custom_title`: Custom title for the chart.

Show relationships between two numeric columns.
Syntax:
da.scatter(df, x_col='height', y_col='weight', hue='gender')
Customization Options:
- `hue`: Column for color encoding.
- `size`: Tuple specifying figure size.
- `title_fontsize`: Font size for the title.
- `axis_fontsize`: Font size for axis labels.
- `custom_title`: Custom title for the chart.

Visualize correlations between numeric columns.
Syntax:
da.heatmap(df)
Customization Options:
- `annot`: Whether to annotate the heatmap with correlation values.
- `cmap`: Colormap for the heatmap.
- `size`: Tuple specifying figure size.
- `title_fontsize`: Font size for the title.
- `custom_title`: Custom title for the chart.

Display pairwise relationships in a dataset.
Syntax:
da.pairplot(df, hue='category')
Customization Options:
- `hue`: Column for color encoding.
- `size`: Tuple specifying figure size for each subplot.
- `title_fontsize`: Font size for the title.
- `custom_title`: Custom title for the chart.

Compare distributions of a numeric column across categories.
Syntax:
da.boxplot(df, x_col='region', y_col='sales')
Customization Options:
- `size`: Tuple specifying figure size.
- `title_fontsize`: Font size for the title.
- `axis_fontsize`: Font size for axis labels.
- `custom_title`: Custom title for the chart.

Combine box plot and density plot for richer insights.
Syntax:
da.violinplot(df, x_col='region', y_col='sales')
Customization Options:
- `size`: Tuple specifying figure size.
- `title_fontsize`: Font size for the title.
- `axis_fontsize`: Font size for axis labels.
- `custom_title`: Custom title for the chart.

Provides an interactive menu for generating various plots one at a time.
Menu Options:
Syntax:
```python
# Perform interactive visualization
da.interactive_plot(df)
```
Here’s how you can use the `visualizer` functions to create multiple plots:
```python
import dataanalysts as da
import pandas as pd

# Sample dataset
data = {
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000],
    'city': ['NY', 'LA', 'SF', 'CHI', 'HOU'],
    'gender': ['M', 'F', 'F', 'M', 'M']
}
df = pd.DataFrame(data)

# Histogram
da.histogram(df, column='age', bins=20, kde=True)

# Bar Chart
da.barchart(df, x_col='city', y_col='salary')

# Scatter Plot
da.scatter(df, x_col='age', y_col='salary', hue='gender')

# Heatmap
da.heatmap(df)

# Interactive Visualization
da.interactive_plot(df)
```
All visualization operations are logged in the `visualizer.log` file. This module provides highly customizable and interactive visualizations to help you gain insights from your data effectively.
Contributions are welcome! Please submit a pull request via our GitHub Repository.
This project is licensed under the MIT License. See the LICENSE file for details.
If you encounter any issues, feel free to open an issue on our GitHub Issues page.