
rtx50-compat
Enable NVIDIA RTX 50-series GPU support (sm_120) in PyTorch and the entire Python AI ecosystem with a single import.
The RTX 5090 features the new sm_120 compute capability, which isn't recognized by current PyTorch/CUDA libraries. This package provides a runtime patch that makes your RTX 5090 work seamlessly with existing AI frameworks.
```bash
# Recommended: use uv
uv pip install rtx50-compat

# Or with pip
pip install rtx50-compat
```
```python
import rtx50_compat  # Must be imported before PyTorch!
import torch

# Verify GPU is recognized
print(torch.cuda.get_device_name(0))  # NVIDIA GeForce RTX 5090
print(torch.cuda.is_available())      # True

# Now use PyTorch normally
model = torch.nn.Linear(1024, 1024).cuda()
```
Estimated throughput, based on the RTX 5090's 32GB of GDDR7 VRAM and roughly 70 TFLOPS of compute:
| Model | RTX 5090 | i9-14900K | Speedup |
|---|---|---|---|
| Llama 3-8B | 180-250 tokens/s | 8-12 tokens/s | ~20x |
| Llama 3-13B | 120-180 tokens/s | 4-6 tokens/s | ~30x |
| Stable Diffusion XL | 40-60 img/min | 0.5 img/min | ~100x |

| Model | RTX 5090 (with offload) | i9-14900K | Speedup |
|---|---|---|---|
| Llama 3-70B Q4 | 25-35 tokens/s | 1-3 tokens/s | ~15x |
| Mixtral 8x7B | 40-60 tokens/s | 2-4 tokens/s | ~20x |
Note: 70B models require ~35GB for Q4 quantization, exceeding the RTX 5090's 32GB VRAM. Performance depends on offloading efficiency.
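
As a rough sanity check on that figure (a back-of-the-envelope sketch, not a measurement): 4-bit weights take half a byte each, so the weights of a 70B-parameter model alone come to roughly 35GB before the KV cache and runtime overhead are counted.

```python
# Back-of-the-envelope estimate of weight memory for a Q4-quantized 70B model
# (weights only; KV cache and quantization overhead add several more GB)
params = 70e9            # Llama 3-70B parameter count
bits_per_weight = 4      # Q4 quantization
weight_bytes = params * bits_per_weight / 8
print(f"~{weight_bytes / 1e9:.0f} GB of weights")  # ~35 GB, more than the card's 32GB of VRAM
```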
```python
import rtx50_compat
import torch

# Check if patch was applied
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

    # Quick performance test
    x = torch.randn(10000, 10000, device='cuda')
    y = torch.matmul(x, x)
    print("✅ CUDA operations working!")
else:
    print("❌ CUDA not available")
```
```python
import rtx50_compat
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model - fits entirely in 32GB VRAM
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda"
)

# Generate at 180-250 tokens/s!
inputs = tokenizer("The future of AI is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0]))
```
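
To check the quoted throughput on your own card, a rough timing sketch (illustrative only; a real benchmark would warm up first and average several runs) is to divide the number of newly generated tokens by the elapsed wall-clock time:

```python
import time

# Rough tokens/s measurement for the model loaded above
inputs = tokenizer("The future of AI is", return_tensors="pt").to("cuda")
torch.cuda.synchronize()
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```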
```python
# First convert to GGUF format for efficient memory usage
# pip install llama-cpp-python
import rtx50_compat
from llama_cpp import Llama

# Load 70B model with automatic GPU/CPU splitting
llm = Llama(
    model_path="llama-3-70b-q4_k_m.gguf",
    n_gpu_layers=-1,  # Offload all layers that fit
    n_ctx=4096,
    verbose=False
)

# Generate at 25-35 tokens/s with partial offloading
response = llm("The meaning of life is", max_tokens=100)
print(response['choices'][0]['text'])
```
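
If `n_gpu_layers=-1` overflows the 32GB of VRAM, one option is to pin an explicit number of layers to the GPU and leave the rest on the CPU; the layer count below is an assumption to tune for your quantization and context length:

```python
# Keep a fixed number of transformer layers on the GPU; remaining layers run on the CPU.
# 60 is an illustrative starting point - lower it if you still run out of VRAM.
llm = Llama(
    model_path="llama-3-70b-q4_k_m.gguf",
    n_gpu_layers=60,
    n_ctx=4096,
    verbose=False
)
```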
```python
import rtx50_compat
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16"
).to("cuda")

# Generate at 40-60 images per minute!
images = pipe(
    "A majestic mountain landscape at sunset, highly detailed, 8k",
    num_images_per_prompt=4,
    guidance_scale=7.5
).images

# Save the results
for i, image in enumerate(images):
    image.save(f"landscape_{i}.png")
```
```python
import torch

# The patch intercepts CUDA capability queries
original_get_device_capability = torch.cuda.get_device_capability

def patched_get_device_capability(device=None):
    major, minor = original_get_device_capability(device)
    if major == 12 and minor == 0:  # sm_120 (RTX 50-series)
        return (9, 0)  # Masquerade as sm_90 (H100)
    return (major, minor)

torch.cuda.get_device_capability = patched_get_device_capability
```
For subtle operations:

```bash
export RTX50_BATMAN_MODE=1
python your_script.py
```

Output:

```
🦇 I am Batman - at your local jujitsu establishment
RTX 5090 successfully disguised as H100
You didn't see anything... 🌙
```
```
rtx50-compat/
├── rtx50_compat.py          # Main compatibility layer
├── patches/                 # PyTorch & vLLM patches (for reference)
│   ├── pytorch_rtx5090.patch
│   ├── vllm_rtx5090.patch
│   └── README.md            # Patch application guide
├── benchmarks/              # Benchmark scripts
│   ├── benchmark_8b.py      # Llama 3-8B benchmark
│   ├── benchmark_70b.py     # Llama 3-70B with offloading
│   └── benchmark_sd.py      # Stable Diffusion benchmark
├── examples/                # Usage examples
│   ├── hello_world.py
│   ├── comfyui_integration.py
│   └── llama_cpp_example.py
├── tests/                   # Unit tests
│   └── test_compatibility.py
├── LICENSE
├── README.md
└── setup.py
```
Ensure rtx50_compat is imported before any other CUDA/PyTorch imports:
```python
import rtx50_compat  # MUST be first
import torch         # Now this works
```
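
If you suspect something else pulled in PyTorch earlier in the process, a quick check (an illustrative snippet, not part of the package) is to look for torch in `sys.modules` before importing rtx50_compat:

```python
import sys

# If torch is already loaded, the compatibility patch would come too late -
# restart the process and import rtx50_compat before anything CUDA-related.
if "torch" in sys.modules:
    print("⚠️ torch was imported before rtx50_compat; restart and import rtx50_compat first")
```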
For models exceeding 32GB VRAM, use quantization and offloading:
```python
from transformers import AutoModelForCausalLM

# Use 4-bit quantization (requires the bitsandbytes package)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map="auto"  # Automatic CPU/GPU splitting
)
```
Check that the NVIDIA driver detects the card:

```bash
nvidia-smi
```

```bash
# Install the package and verify it works
claude "I have an RTX 5090. Help me set up rtx50-compat and run a Llama 3-8B model for maximum performance"

# Optimize for 70B models with offloading
claude "Show me how to run Llama 3-70B on my RTX 5090 using llama.cpp with optimal settings"

# Debug performance issues
claude "My RTX 5090 is only getting 10 tokens/s on Llama 3-13B. Help me diagnose and fix this"

# Integration with existing projects
claude "Add rtx50-compat support to my ComfyUI installation at ~/ComfyUI"
```
```bash
# Setup and verification
gemini -p "I have an RTX 5090 with 32GB VRAM. Guide me through installing rtx50-compat and running a benchmark"

# Model recommendations
gemini -p "What's the largest LLM I can run entirely in VRAM on my RTX 5090? Include quantization options"

# Performance optimization
gemini -p "Analyze my RTX 5090 setup and suggest optimizations for running Mixtral 8x7B at maximum speed"

# Troubleshooting
gemini -p "Getting 'no kernel image' error with RTX 5090 in PyTorch. Show me how to fix with rtx50-compat"
```
```
Help me set up a complete local AI workstation with RTX 5090:
1. Install rtx50-compat
2. Configure vLLM for serving
3. Set up Stable Diffusion XL
4. Create benchmarks for both text and image generation
```
```
I need to deploy a Llama 3-70B model on RTX 5090 for production use:
- Optimize for throughput (multiple users)
- Set up proper memory management
- Configure monitoring and logging
- Handle model switching between 8B/13B/70B based on load
```
PRs welcome! We're also working on getting the PyTorch and vLLM patches merged upstream.
MIT License - see LICENSE
Note: This is a community compatibility layer. Once PyTorch officially supports sm_120, this package will become obsolete. Until then, enjoy running large models locally at impressive speeds! 🚀