pyrecdp

A data processing bundle for Spark-based recommender system operations.



RecDP - a one-stop toolkit for AI data processing

We provide Intel-optimized solutions for:

  • Auto Feature Engineering - Provides an automatic way to generate new features for any tabular dataset containing numerical, categorical, and text features. It takes only 3 lines of code to automatically enrich features based on data analysis, statistics, clustering, and multi-feature interaction.
  • LLM Data Preparation - Provides a parallelized, easy-to-use data pipeline for LLM data processing. It supports multiple data sources such as JSON Lines, PDFs, images, and audio/video. Users can perform data extraction, deduplication (near-dedup, ROUGE, exact), splitting, special-character fixing, various types of filtering (length, perplexity, profanity, etc.), and quality analysis (diversity, GPT-3 quality, toxicity, perplexity, etc.). The tool can also save output as JSON Lines or Parquet files, or insert it into vector stores (FaissStore, ChromaStore, ElasticSearchStore).

How it works

Install this tool through pip.

DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre graphviz
pip install pyrecdp[all] --pre
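
A quick way to confirm the installation succeeded is a simple import check. This is just a sanity test, not an official step, and assumes nothing beyond the package name.

import pyrecdp
# If the import succeeds, the package and its Python-side dependencies resolved correctly.
print("pyrecdp imported from", pyrecdp.__file__)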

RecDP - Tabular

learn more

  • Auto Feature Engineering Pipeline

Only 3 lines of code are needed to generate new features for your tabular data. Usually about 5x new features can be found, with up to a 1.2x accuracy boost.

from pyrecdp.autofe import AutoFE

# train_data: a pandas DataFrame holding the raw tabular data
# target_label: the name of the label column to predict
pipeline = AutoFE(dataset=train_data, label=target_label, time_series='Day')
transformed_train_df = pipeline.fit_transform()
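
For context, a minimal end-to-end version of the example above might look like the following sketch; the CSV path, the 'Day' column, and the 'label' column name are illustrative assumptions, not values prescribed by the package.

import pandas as pd
from pyrecdp.autofe import AutoFE

# Hypothetical input file and column names, for illustration only
train_data = pd.read_csv("train.csv")   # tabular data that includes a 'Day' time column
target_label = "label"                  # the column to predict

pipeline = AutoFE(dataset=train_data, label=target_label, time_series='Day')
transformed_train_df = pipeline.fit_transform()

# The enriched DataFrame can be handed to any downstream model
print(transformed_train_df.shape)
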
  • High performance on terabyte-scale tabular data processing

RecDP - LLM

learn more

  • Low-code, Fault-tolerant, Auto-scaling Parallel Pipeline
from pyrecdp.primitives.operations import *
from pyrecdp.LLM import ResumableTextPipeline

# urls: a list of web page URLs to use as the pipeline's input
pipeline = ResumableTextPipeline()
ops = [
    UrlLoader(urls, max_depth=2),        # crawl and load documents from the URLs
    DocumentSplit(),                     # split documents into chunks
    ProfanityFilter(),                   # drop chunks containing profanity
    PIIRemoval(),                        # scrub personally identifiable information
    ...
    PerfileParquetWriter("ResumableTextPipeline_output")   # write Parquet output per input file
]
pipeline.add_operations(ops)
pipeline.execute()
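
After the run, the Parquet output can be inspected with ordinary tooling. The sketch below assumes the writer leaves .parquet files somewhere under the ResumableTextPipeline_output directory, which is an assumption about the layout rather than documented behavior.

import glob
import pandas as pd

# Assumption: PerfileParquetWriter placed Parquet files under this directory
for path in glob.glob("ResumableTextPipeline_output/**/*.parquet", recursive=True):
    df = pd.read_parquet(path)                     # requires pyarrow or fastparquet
    print(path, len(df), list(df.columns)[:5])     # quick look at size and schema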

LICENSE

  • Apache 2.0

Dependencies

  • Spark 3.4.*
  • Python 3.*
  • Ray 2.7.*
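
To check that a given environment matches these pins, a quick version printout such as the sketch below can help; it relies only on the standard __version__ attributes exposed by PySpark and Ray.

import sys
import pyspark
import ray

# Print the versions pyrecdp will see at runtime
print("Python:", sys.version.split()[0])   # expect 3.x
print("Spark :", pyspark.__version__)      # expect 3.4.x
print("Ray   :", ray.__version__)          # expect 2.7.x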
