df-to-rs 0.1.24 · PyPI
df_to_rs

df_to_rs is a Python package that provides efficient methods to upload, upsert and manage Pandas DataFrames in Amazon Redshift using S3 as an intermediary.

Key Features

  • Direct DataFrame to Redshift upload
  • Upsert functionality (update + insert)
  • Delete and insert operations
  • Large dataset handling with chunking
  • Support for JSON/dict/list columns (Redshift SUPER)
  • AWS IAM Role support for secure authentication
  • Automatic cleanup of temporary S3 files

Installation

pip install df_to_rs

Usage

1. Initialize with Explicit AWS Credentials

from df_to_rs import df_to_rs
import psycopg2

# Connect to Redshift
redshift_conn = psycopg2.connect(
    dbname='your_db',
    host='your-cluster.region.redshift.amazonaws.com',
    port=5439,  # Redshift default port
    user='your_user',
    password='your_password'
)
redshift_conn.set_session(autocommit=True)

# Initialize with explicit credentials
uploader = df_to_rs(
    region_name='ap-south-1',
    s3_bucket='your-s3-bucket',
    aws_access_key_id='your-access-key-id',
    aws_secret_access_key='your-secret-access-key',
    redshift_c=redshift_conn
)

2. Initialize with AWS Instance Role

When running on AWS compute with an attached IAM role, no explicit AWS credentials are needed:

# Uses the instance role for AWS authentication; no access keys required
uploader = df_to_rs(
    region_name='ap-south-1',
    s3_bucket='your-s3-bucket',
    redshift_c=redshift_conn
)

3. Basic Upload

Upload a DataFrame to a Redshift table:

# Simple upload
uploader.upload_to_redshift(
    df=your_dataframe,
    dest='schema.table_name'
)

4. Upsert Operation

Update existing records and insert new ones based on key columns:

# Upsert based on specific columns
uploader.upsert_to_redshift(
    df=your_dataframe,
    dest_table='schema.table_name',
    upsert_columns=['id', 'unique_key'],  # Columns to match existing records
    clear_dest_table=False  # Set True to truncate table before insert
)

5. Delete and Insert

Delete records matching a condition and insert new data:

# Delete and insert with condition
uploader.delete_and_insert_to_redshift(
    df=your_dataframe,
    dest_table='schema.table_name',
    filter_cond="date >= CURRENT_DATE - 7"  # SQL condition for deletion
)

Special Data Types

JSON/Dictionary Columns

The package automatically handles JSON/dict/list columns for Redshift SUPER type:

# DataFrame with JSON column
df = pd.DataFrame({
    'id': [1, 2],
    'json_data': [{'key': 'value'}, {'other': 'data'}]
})

# Will be automatically converted for Redshift SUPER column
uploader.upload_to_redshift(df, 'schema.table_name')

Large Dataset Handling

The package automatically handles large datasets by:

  • Chunking data into 1 million row segments
  • Streaming to S3 in memory
  • Automatic cleanup of temporary files
  • Progress tracking with timestamps
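
Uploading a large DataFrame looks the same as uploading a small one; the chunking described above happens inside the package. A minimal sketch, assuming the `uploader` from the Usage section and an illustrative multi-million-row DataFrame:

import numpy as np
import pandas as pd

# ~2.5 million rows; df_to_rs splits this into 1-million-row chunks internally
big_df = pd.DataFrame({
    'id': np.arange(2_500_000),
    'value': np.random.rand(2_500_000),
})

# Each chunk is streamed to S3 in memory and loaded into Redshift,
# and the temporary S3 objects are cleaned up afterwards
uploader.upload_to_redshift(
    df=big_df,
    dest='schema.large_table'
)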

Error Handling

  • Automatic transaction rollback on errors
  • S3 temporary file cleanup
  • Detailed error messages and timestamps
  • Safe staging table management for upserts
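
Because rollback and S3 cleanup happen inside the package, caller-side handling can stay simple. A minimal sketch, assuming the `uploader` and `your_dataframe` from the Usage section (the logging is illustrative, not part of the package):

try:
    uploader.upsert_to_redshift(
        df=your_dataframe,
        dest_table='schema.table_name',
        upsert_columns=['id'],
        clear_dest_table=False
    )
except Exception as exc:
    # By this point the package has rolled back the transaction and
    # removed its temporary S3 files; log and decide whether to retry
    print(f"Upsert failed: {exc}")
    raise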

AWS IAM Role Requirements

When using instance roles, ensure your role has these permissions:

  • S3: PutObject, GetObject, DeleteObject on the specified bucket
  • Redshift: COPY command permissions
  • IAM: AssumeRole permissions if needed
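
As a rough illustration of the S3 portion of such a role policy, expressed here as a Python dict (the bucket name and statement layout are assumptions, not something the package defines):

import json

# Hypothetical IAM policy document covering the S3 actions listed above
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::your-s3-bucket/*",
        }
    ],
}

print(json.dumps(s3_policy, indent=2))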

Best Practices

  1. Use instance roles instead of access keys when possible
  2. Set appropriate column types in Redshift, especially for SUPER columns
  3. Create tables with appropriate sort and dist keys before uploading (see the sketch after this list)
  4. Monitor the Redshift query logs for performance optimization
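
A hedged sketch of practice 3: pre-creating the destination table with a SUPER column and explicit dist/sort keys via the `redshift_conn` from the Usage section. Table and column names are illustrative.

create_sql = """
CREATE TABLE IF NOT EXISTS schema.table_name (
    id          BIGINT,
    unique_key  VARCHAR(64),
    json_data   SUPER,
    created_at  TIMESTAMP
)
DISTKEY (id)
SORTKEY (created_at);
"""

with redshift_conn.cursor() as cur:
    cur.execute(create_sql)  # autocommit was enabled on this connection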

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

All notable changes to df_to_rs will be documented in this file.

[0.1.24] - 2025-01-26

Added

  • Documentation improvements

[0.1.23] - 2025-01-26

Added

  • Support for instance role-based authentication in AWS
  • Handling of JSON/dict/list objects for Redshift SUPER columns
  • Proper cleanup of S3 temporary files

Changed

  • Made AWS credentials optional in constructor
  • Optimized DataFrame processing with unified applymap operations
  • Improved string column handling for better type safety

Fixed

  • S3 resource cleanup in error scenarios
  • Transaction handling in delete_and_insert_to_redshift
