df_to_rs
df_to_rs is a Python package that provides efficient methods to upload, upsert, and manage Pandas DataFrames in Amazon Redshift, using S3 as an intermediary.
Key Features
- Direct DataFrame to Redshift upload
- Upsert functionality (update + insert)
- Delete and insert operations
- Large dataset handling with chunking
- Support for JSON/dict/list columns (Redshift SUPER)
- AWS IAM Role support for secure authentication
- Automatic cleanup of temporary S3 files
Installation
pip install df_to_rs
Usage
1. Initialize with AWS Credentials
from df_to_rs import df_to_rs
import psycopg2
redshift_conn = psycopg2.connect(
    dbname='your_db',
    host='your-cluster.region.redshift.amazonaws.com',
    port=5439,  # Redshift's default port
    user='your_user',
    password='your_password'
)
redshift_conn.set_session(autocommit=True)
uploader = df_to_rs(
    region_name='ap-south-1',
    s3_bucket='your-s3-bucket',
    aws_access_key_id='your-access-key-id',
    aws_secret_access_key='your-secret-access-key',
    redshift_c=redshift_conn
)
2. Initialize using EC2 Instance Role (Recommended)
uploader = df_to_rs(
    region_name='ap-south-1',
    s3_bucket='your-s3-bucket',
    redshift_c=redshift_conn
)
3. Basic Upload
Upload a DataFrame to a Redshift table:
uploader.upload_to_redshift(
    df=your_dataframe,
    dest='schema.table_name'
)
4. Upsert Operation
Update existing records and insert new ones based on key columns:
uploader.upsert_to_redshift(
    df=your_dataframe,
    dest_table='schema.table_name',
    upsert_columns=['id', 'unique_key'],
    clear_dest_table=False
)
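Conceptually, an upsert like this follows the standard staging-table pattern: new rows are staged, matching rows are deleted from the destination, and the staged rows are inserted. The snippet below sketches that pattern with plain psycopg2 and SQL, reusing the table and key columns from the example above; it is illustrative only, not the package's exact implementation (df_to_rs loads the staging data for you via an S3 COPY).

# Illustrative staging-table upsert pattern (not the package's exact SQL).
with redshift_conn.cursor() as cur:
    # 1. Stage the incoming rows (df_to_rs does this via an S3 COPY).
    cur.execute("CREATE TEMP TABLE stage (LIKE schema.table_name);")
    # ... load your_dataframe into "stage" here ...
    # 2. Delete destination rows that match on the upsert key columns.
    cur.execute("""
        DELETE FROM schema.table_name
        USING stage
        WHERE schema.table_name.id = stage.id
          AND schema.table_name.unique_key = stage.unique_key;
    """)
    # 3. Insert all staged rows and drop the staging table.
    cur.execute("INSERT INTO schema.table_name SELECT * FROM stage;")
    cur.execute("DROP TABLE stage;")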
5. Delete and Insert
Delete records matching a condition and insert new data:
uploader.delete_and_insert_to_redshift(
    df=your_dataframe,
    dest_table='schema.table_name',
    filter_cond="date >= CURRENT_DATE - 7"
)
Special Data Types
JSON/Dictionary Columns
The package automatically handles JSON/dict/list columns for the Redshift SUPER type:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'json_data': [{'key': 'value'}, {'other': 'data'}]
})
uploader.upload_to_redshift(df, 'schema.table_name')
Large Dataset Handling
The package automatically handles large datasets by (see the sketch after this list):
- Chunking data into 1 million row segments
- Streaming to S3 in memory
- Automatic cleanup of temporary files
- Progress tracking with timestamps
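The following is a minimal sketch of that flow, assuming a CSV-over-S3 path; the function name, chunk-size parameter, and key prefix are illustrative, not part of the df_to_rs API.

import io
import boto3
import pandas as pd

def stream_in_chunks(df, bucket, prefix, chunk_rows=1_000_000):
    """Split a DataFrame into row chunks and stream each chunk to S3 as an in-memory CSV."""
    s3 = boto3.client('s3')
    keys = []
    for i, start in enumerate(range(0, len(df), chunk_rows)):
        chunk = df.iloc[start:start + chunk_rows]
        buf = io.StringIO()
        chunk.to_csv(buf, index=False)  # serialized in memory, no temp files on disk
        key = f"{prefix}/part_{i:05d}.csv"
        s3.put_object(Bucket=bucket, Key=key, Body=buf.getvalue().encode('utf-8'))
        keys.append(key)
        print(f"{pd.Timestamp.now()}: uploaded {key} ({len(chunk)} rows)")
    return keys  # the caller would COPY these into Redshift, then delete them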
Error Handling
- Automatic transaction rollback on errors
- S3 temporary file cleanup
- Detailed error messages and timestamps
- Safe staging table management for upserts
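The safeguards listed above roughly correspond to the pattern sketched below; the helper function and its arguments are hypothetical, shown only to illustrate how rollback and S3 cleanup fit together.

import boto3

def copy_with_cleanup(redshift_conn, copy_sql, bucket, temp_keys):
    """Run a COPY statement, roll back on failure, and always remove the temporary S3 objects."""
    s3 = boto3.client('s3')
    try:
        with redshift_conn.cursor() as cur:
            cur.execute(copy_sql)
    except Exception as exc:
        redshift_conn.rollback()  # undo any open transaction (relevant when autocommit is off)
        print(f"Load failed and was rolled back: {exc}")
        raise
    finally:
        if temp_keys:  # temporary files are deleted whether the load succeeded or not
            s3.delete_objects(
                Bucket=bucket,
                Delete={'Objects': [{'Key': k} for k in temp_keys]},
            )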
AWS IAM Role Requirements
When using instance roles, ensure your role has these permissions (an example policy follows the list):
- S3: PutObject, GetObject, DeleteObject on the specified bucket
- Redshift: COPY command permissions
- IAM: AssumeRole permissions if needed
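As a concrete reference for the S3 portion, an inline policy along these lines can be attached to the instance role with boto3; the role and policy names below are placeholders, and the bucket ARN should be scoped to your own bucket.

import json
import boto3

# Hypothetical role and policy names; adjust to your environment.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
        "Resource": "arn:aws:s3:::your-s3-bucket/*",
    }],
}

boto3.client('iam').put_role_policy(
    RoleName='your-ec2-instance-role',
    PolicyName='df-to-rs-s3-access',
    PolicyDocument=json.dumps(s3_policy),
)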
Best Practices
- Use instance roles instead of access keys when possible
- Set appropriate column types in Redshift, especially for SUPER columns
- Create tables with appropriate sort and dist keys before uploading (see the example DDL below)
- Monitor the Redshift query logs for performance optimization
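As one way to follow the table-design advice above, the destination table for the earlier JSON example could be created with an explicit SUPER column and dist/sort keys. The DDL below is a hand-written sketch, not something the package generates.

# Hypothetical DDL for the JSON example above; pick keys that match your query patterns.
with redshift_conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS schema.table_name (
            id        BIGINT,
            json_data SUPER
        )
        DISTKEY (id)
        SORTKEY (id);
    """)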
License
This project is licensed under the MIT License - see the LICENSE file for details.
Changelog
All notable changes to df_to_rs will be documented in this file.
[0.1.24] - 2025-01-26
Added
[0.1.23] - 2025-01-26
Added
- Support for instance role-based authentication in AWS
- Handling of JSON/dict/list objects for Redshift SUPER columns
- Proper cleanup of S3 temporary files
Changed
- Made AWS credentials optional in constructor
- Optimized DataFrame processing with unified applymap operations
- Improved string column handling for better type safety
Fixed
- S3 resource cleanup in error scenarios
- Transaction handling in delete_and_insert_to_redshift