DataProbe

DataProbe is a comprehensive Python toolkit for debugging, profiling, and optimizing data pipelines. It provides tools to track data lineage, identify bottlenecks, monitor memory usage, and render enterprise-grade visualizations of pipeline execution flow.

🎨 NEW: Enterprise-Grade Visualizations

DataProbe v1.0.0 introduces comprehensive pipeline debugging capabilities with professional-quality visualizations, intelligent optimization recommendations, advanced memory profiling, data lineage tracking, and enterprise-grade reporting.

Dashboard Features

🏢 Enterprise Dashboard

  • KPI Panels: Real-time success rates, duration, memory usage
  • Pipeline Flowchart: Interactive operation flow with status indicators
  • Performance Analytics: Memory usage timelines with peak detection
  • Data Insights: Comprehensive lineage and transformation tracking
# Generate enterprise dashboard
debugger.visualize_pipeline()

🌐 3D Pipeline Network

  • 3D Visualization: Interactive network showing operation relationships
  • Performance Mapping: Z-axis represents operation duration
  • Status Color-coding: Visual error and bottleneck identification
# Create 3D network visualization
debugger.create_3d_pipeline_visualization()

📊 Executive Reports

  • Multi-page Reports: Professional stakeholder-ready documentation
  • Performance Trends: Dual-axis charts showing duration and memory patterns
  • Optimization Recommendations: AI-powered suggestions for improvements
  • Data Quality Metrics: Comprehensive pipeline health scoring
# Generate executive report
debugger.generate_executive_report()

Color-Coded Status System

  • 🟢 Success: Operations completed without issues
  • 🟡 Warning: Performance bottlenecks detected
  • 🔴 Error: Failed operations requiring attention
  • 🟦 Info: Data flow and transformation indicators
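
For reference, the mapping behaves like the small lookup below. This is an illustrative sketch only; STATUS_COLORS is a hypothetical name, not part of the dataprobe API:

# Illustrative only: how statuses map to colors in the visualizations.
# STATUS_COLORS is hypothetical, not an actual dataprobe symbol.
STATUS_COLORS = {
    "success": "green",   # completed without issues
    "warning": "yellow",  # performance bottleneck detected
    "error": "red",       # failed operation requiring attention
    "info": "blue",       # data flow / transformation indicator
}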

🚀 Features

PipelineDebugger

  • 🔍 Operation Tracking: Automatically track execution time, memory usage, and data shapes for each operation
  • 📊 Enterprise-Grade Visualizations: Professional dashboards, 3D networks, and executive reports
  • 💾 Memory Profiling: Monitor memory usage and identify memory-intensive operations
  • 🔗 Data Lineage: Track data transformations and column changes throughout the pipeline
  • ⚠️ Bottleneck Detection: Automatically identify slow operations and memory peaks
  • 📈 Performance Reports: Generate comprehensive debugging reports with optimization suggestions
  • 🎯 Error Tracking: Capture and track errors with full traceback information
  • 🌳 Nested Operations: Track nested function calls and their relationships (see the sketch after this list)
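
As a minimal sketch of nested tracking, using the same decorator API shown in the Quick Start (function names here are illustrative):

from dataprobe import PipelineDebugger
import pandas as pd

debugger = PipelineDebugger(name="Nested_Example")

@debugger.track_operation("Clean Data")
def clean_data(df):
    # Inner operation: invoked from within another tracked operation
    return df.dropna()

@debugger.track_operation("Prepare Data")
def prepare_data(df):
    # Outer operation: calling another tracked function records the nesting
    return clean_data(df)

df = prepare_data(pd.DataFrame({"value": [1.0, None, 3.0]}))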

📦 Installation

pip install dataprobe

For development installation:

git clone https://github.com/santhoshkrishnan30/dataprobe.git
cd dataprobe
pip install -e ".[dev]"
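
To sanity-check the install, you can import the package and print its version (this assumes dataprobe exposes __version__, which is common but not confirmed here):

# Quick post-install check; __version__ is an assumption
import dataprobe
print(dataprobe.__version__)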

🎯 Quick Start

Basic Usage with Enhanced Visualizations

from dataprobe import PipelineDebugger
import pandas as pd

# Initialize the debugger with enhanced features
debugger = PipelineDebugger(
    name="My_ETL_Pipeline",
    track_memory=True,
    track_lineage=True
)

# Use decorators to track operations
@debugger.track_operation("Load Data")
def load_data(file_path):
    return pd.read_csv(file_path)

@debugger.track_operation("Transform Data")
def transform_data(df):
    df['new_column'] = df['value'] * 2
    return df

# Run your pipeline
df = load_data("data.csv")
df = transform_data(df)

# Generate enterprise-grade visualizations
debugger.visualize_pipeline()              # Enterprise dashboard
debugger.create_3d_pipeline_visualization() # 3D network view  
debugger.generate_executive_report()       # Executive report

# Get AI-powered optimization suggestions
suggestions = debugger.suggest_optimizations()
for suggestion in suggestions:
    print(f"💡 {suggestion['suggestion']}")

# Print summary and reports
debugger.print_summary()
report = debugger.generate_report()

Memory Profiling

import numpy as np

@debugger.profile_memory
def memory_intensive_operation():
    # Allocate a large DataFrame, then aggregate it down to 1,000 group means
    large_df = pd.DataFrame(np.random.randn(1000000, 50))
    result = large_df.groupby(large_df.index % 1000).mean()
    return result
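
Calling the decorated function executes it normally while the debugger records its memory footprint:

# Memory usage of the call is captured for reports and summaries
result = memory_intensive_operation()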

DataFrame Analysis

# Analyze DataFrames for potential issues
debugger.analyze_dataframe(df, name="Sales Data")

📊 Example Output

Enterprise Dashboard

Professional KPI dashboard with real-time metrics, pipeline flowchart, memory analytics, and performance insights.

Pipeline Summary
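
This is the tree rendered by debugger.print_summary() (shown in the Quick Start) after the pipeline runs: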

Pipeline Summary: My_ETL_Pipeline
├── Execution Statistics
│   ├── Total Operations: 5
│   ├── Total Duration: 2.34s
│   └── Total Memory Used: 125.6MB
├── Bottlenecks (1)
│   └── Transform Data: 1.52s
└── Memory Peaks (1)
    └── Load Large Dataset: +85.3MB

Optimization Suggestions

💡 OPTIMIZATION RECOMMENDATIONS:

1. [PERFORMANCE] Transform Data
   Issue: Operation took 1.52s
   💡 Consider optimizing this operation or parallelizing if possible

2. [MEMORY] Load Large Dataset  
   Issue: High memory usage: +85.3MB
   💡 Consider processing data in chunks or optimizing memory usage

🔧 Advanced Features

Multiple Visualization Options

# Enterprise dashboard - Professional KPI dashboard
debugger.visualize_pipeline()

# 3D network visualization - Interactive operation relationships  
debugger.create_3d_pipeline_visualization()

# Executive report - Multi-page stakeholder documentation
debugger.generate_executive_report()

Data Lineage Tracking

# Export data lineage information
lineage_json = debugger.export_lineage(format="json")

# Track column changes automatically
@debugger.track_operation("Add Features")
def add_features(df):
    df['feature_1'] = df['value'].rolling(7).mean()
    df['feature_2'] = df['value'].shift(1)
    return df
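
One quick way to inspect the exported lineage, assuming export_lineage(format="json") returns a JSON-encoded string (an assumption, not confirmed above):

import json

# Assumption: export_lineage(format="json") returns a JSON string
lineage = json.loads(lineage_json)
print(list(lineage))  # top-level keys of the lineage record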

Custom Metadata

@debugger.track_operation("Process Batch", batch_id=123, source="api")
def process_batch(data):
    # Operation metadata is stored and included in reports
    return processed_data

Checkpoint Saving

# Auto-save is enabled by default
debugger = PipelineDebugger(name="Pipeline", auto_save=True)

# Manual checkpoint
debugger.save_checkpoint()

📈 Performance Tips

  • Production Use: The debugger adds minimal overhead, but for production pipelines you can disable tracking:
   debugger = PipelineDebugger(name="Pipeline", track_memory=False, track_lineage=False)
  • Batch Operations: Group small operations together to reduce tracking overhead
  • Memory Monitoring: Set an appropriate memory threshold to catch issues early:
   debugger = PipelineDebugger(name="Pipeline", memory_threshold_mb=500)

💼 Enterprise Features

✅ Professional Styling: Modern design matching enterprise standards
✅ Executive Ready: Suitable for stakeholder presentations
✅ Performance Insights: AI-powered optimization recommendations
✅ Export Options: High-resolution PNG outputs
✅ Responsive Design: Scales from detailed debugging to executive overview
✅ Real-time Metrics: Live performance and memory tracking
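
As an illustrative sketch of PNG export: if visualize_pipeline() renders via Matplotlib (the acknowledgments below credit Matplotlib and Seaborn, though this workflow is an assumption), the active figure can be saved manually at high resolution:

import matplotlib.pyplot as plt

# Assumption: the dashboard is drawn on the current Matplotlib figure
debugger.visualize_pipeline()
plt.savefig("pipeline_dashboard.png", dpi=300, bbox_inches="tight")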

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with Rich for beautiful terminal output
  • Uses NetworkX for pipeline visualization
  • Enhanced with Matplotlib and Seaborn for enterprise-grade visualizations
  • Inspired by the need for better data pipeline debugging tools

📞 Support

For questions, bug reports, or feature requests, please open an issue on the GitHub repository.

🗺️ Roadmap

  • Enterprise-grade dashboard visualizations
  • 3D pipeline network views
  • Executive-level reporting capabilities
  • Support for distributed pipeline debugging
  • Integration with popular orchestration tools (Airflow, Prefect, Dagster)
  • Real-time pipeline monitoring dashboard
  • Advanced anomaly detection in data flow
  • Support for streaming data pipelines

Made with ❤️ by Santhosh Krishnan R
