turbopipe 1.2.3 • PyPI • 1 maintainer

TurboPipe

Faster ModernGL Buffers inter-process data transfers for subprocesses


šŸ”„ Description

TurboPipe speeds up sending raw bytes from moderngl.Buffer objects, primarily to an FFmpeg subprocess

The optimizations involved are:

  • Zero-copy: Avoids unnecessary memory copies and allocations (no intermediate buffer.read())
  • C++: The core of TurboPipe is written in C++ for speed, efficiency, and low-level control
  • Threaded:
    • Doesn't block Python code execution, so the next frame can be rendered meanwhile
    • Decouples the main thread from the I/O thread for performance
  • Chunks: Writes in chunks of 4096 bytes (the RAM page size), so the hardware is happy (Unix)
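To illustrate the page-sized chunking idea in pure Python (a sketch only; TurboPipe's real write loop lives in C++):

```python
import os

PAGE = 4096  # RAM page size assumed above

def write_chunked(fd: int, data: bytes) -> None:
    """Push a buffer to a file descriptor one page at a time
    (illustrative stand-in for TurboPipe's C++ write loop)."""
    view = memoryview(data)
    while len(view) > 0:
        # os.write may accept fewer bytes than requested;
        # advance by however many actually went through
        written = os.write(fd, view[:PAGE])
        view = view[written:]
```

Partial writes are normal on pipes, hence the loop over the returned byte count rather than a single `os.write` call.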

āœ… Don't worry, there's proper safety in place. TurboPipe will block Python if a memory address is already queued for writing, and it guarantees the order of writes per file descriptor. Just call .sync() when done šŸ˜‰

Also check out ShaderFlow, where TurboPipe shines! šŸ˜‰


šŸ“¦ Installation

It couldn't be easier! Just install the turbopipe package from PyPI:

# With pip (https://pip.pypa.io/)
pip install turbopipe

# With Poetry (https://python-poetry.org/)
poetry add turbopipe

# With PDM (https://pdm-project.org/en/latest/)
pdm add turbopipe

# With Rye (https://rye.astral.sh/)
rye add turbopipe

šŸš€ Usage

See also the Examples folder for comparisons, and ShaderFlow's usage of it!

import subprocess

import moderngl
import turbopipe

# Create ModernGL objects and proxy buffers
ctx = moderngl.create_standalone_context()
width, height, duration, fps = (1920, 1080, 10, 60)
buffers = [
    ctx.buffer(reserve=(width*height*3))
    for _ in range(nbuffers := 2)
]

# Create your FBO, Textures, Shaders, etc. As an illustration,
# a simple framebuffer matching the buffer size:
fbo = ctx.simple_framebuffer((width, height), components=3)
fbo.use()

# Make sure the resolution and pixel format match!
ffmpeg = subprocess.Popen((
    "ffmpeg",
    "-f", "rawvideo",
    "-pix_fmt", "rgb24",
    "-r", str(fps),
    "-s", f"{width}x{height}",
    "-i", "-",
    "-f", "null",
    "output.mp4"
), stdin=subprocess.PIPE)

# Rendering loop of yours
for frame in range(duration*fps):
    buffer = buffers[frame % nbuffers]

    # Wait for queued writes on this buffer before overwriting it
    turbopipe.sync(buffer)
    fbo.read_into(buffer)

    # Doesn't lock the GIL, writes in parallel
    turbopipe.pipe(buffer, ffmpeg.stdin.fileno())

# Wait for queued writes, clean memory
for buffer in buffers:
    turbopipe.sync(buffer)
    buffer.release()

# Signal stdin stream is done
ffmpeg.stdin.close()

# Wait for the encoding to finish
ffmpeg.wait()

# Warning: albeit rare, only call close() when no other data
# write is pending, as it might skip a frame or halt
turbopipe.close()

ā­ļø Benchmarks

[!NOTE] The test conditions are as follows:

  • Results are the average of 3 runs to ensure consistency, with 5 GB of the same data being piped
  • These aren't tests of render speed, but of the throughput of the GPU -> CPU -> RAM -> IPC path
  • All resolutions are wide-screen (16:9) with 3 components (RGB) and 3 bytes per pixel (SDR)
  • The data is random noise per buffer, with values between 128 and 135; multi-buffer runs are thus a noise video
  • Multi-buffer runs cycle through a list of buffers (e.g. 1, 2, 3, 1, 2, 3… for 3 buffers)
  • All FFmpeg outputs are discarded with -f null - to avoid any disk I/O bottlenecks
  • The gain column is the percentage increase over the standard method
  • When x264 is Null, no encoding took place (passthrough)
  • The test case emojis signify:
    • 🐢: Standard ffmpeg.stdin.write(buffer.read()) on just the main thread, pure Python
    • šŸš€: Threaded ffmpeg.stdin.write(buffer.read()) with a queue (similar to turbopipe)
    • šŸŒ€: The magic of turbopipe.pipe(buffer, ffmpeg.stdin.fileno())

Also see benchmark.py for the implementation
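As a point of reference, the šŸš€ threaded baseline from the legend above can be sketched in pure Python as follows (class and parameter names are illustrative, not part of turbopipe):

```python
import queue
import threading

class ThreadedWriter:
    """Sketch of the šŸš€ method: a worker thread drains a queue of frames
    into a writable stream (e.g. ffmpeg.stdin), so the render loop isn't
    blocked on I/O. turbopipe.pipe() plays a similar role, but in C++
    and without copying the buffer contents."""

    def __init__(self, stream, depth: int = 2):
        self.stream = stream
        self.frames = queue.Queue(maxsize=depth)
        self.thread = threading.Thread(target=self._drain, daemon=True)
        self.thread.start()

    def _drain(self):
        # FIFO queue and a single worker: writes stay in order
        while (frame := self.frames.get()) is not None:
            self.stream.write(frame)

    def write(self, frame: bytes):
        self.frames.put(frame)  # blocks if the queue is full

    def close(self):
        self.frames.put(None)  # sentinel: stop the worker
        self.thread.join()
```

Note the extra `buffer.read()` copy this method still pays on every frame, which is exactly what turbopipe's zero-copy path avoids.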

āœ… Check out the benchmarks on a couple of systems below:

šŸ“¦ TurboPipe v1.0.4:

Desktop • (AMD Ryzen 9 5900x) • (NVIDIA RTX 3060 12 GB) • (DDR4 2x32 GB 3200 MT/s) • (Arch Linux)

Note: I have noted inconsistency across tests, especially at lower resolutions. Some 720p runs might peak at 2900 fps and stay there, while others are limited to 1750 fps. I'm not sure whether the Linux EEVDF scheduler or the CPU topology causes this. Nevertheless, results are stable on Windows 11 on the same machine.

| 720p | x264 | Buffers | Framerate | Bandwidth | Gain |
|---|---|---|---|---|---|
| 🐢 | Null | 1 | 882 fps | 2.44 GB/s | |
| šŸš€ | Null | 1 | 793 fps | 2.19 GB/s | -10.04% |
| šŸŒ€ | Null | 1 | 1911 fps | 5.28 GB/s | 116.70% |
| 🐢 | Null | 4 | 857 fps | 2.37 GB/s | |
| šŸš€ | Null | 4 | 891 fps | 2.47 GB/s | 4.05% |
| šŸŒ€ | Null | 4 | 2309 fps | 6.38 GB/s | 169.45% |
| 🐢 | ultrafast | 4 | 714 fps | 1.98 GB/s | |
| šŸš€ | ultrafast | 4 | 670 fps | 1.85 GB/s | -6.10% |
| šŸŒ€ | ultrafast | 4 | 1093 fps | 3.02 GB/s | 53.13% |
| 🐢 | slow | 4 | 206 fps | 0.57 GB/s | |
| šŸš€ | slow | 4 | 208 fps | 0.58 GB/s | 1.37% |
| šŸŒ€ | slow | 4 | 214 fps | 0.59 GB/s | 3.93% |

| 1080p | x264 | Buffers | Framerate | Bandwidth | Gain |
|---|---|---|---|---|---|
| 🐢 | Null | 1 | 410 fps | 2.55 GB/s | |
| šŸš€ | Null | 1 | 399 fps | 2.48 GB/s | -2.60% |
| šŸŒ€ | Null | 1 | 794 fps | 4.94 GB/s | 93.80% |
| 🐢 | Null | 4 | 390 fps | 2.43 GB/s | |
| šŸš€ | Null | 4 | 391 fps | 2.43 GB/s | 0.26% |
| šŸŒ€ | Null | 4 | 756 fps | 4.71 GB/s | 94.01% |
| 🐢 | ultrafast | 4 | 269 fps | 1.68 GB/s | |
| šŸš€ | ultrafast | 4 | 272 fps | 1.70 GB/s | 1.48% |
| šŸŒ€ | ultrafast | 4 | 409 fps | 2.55 GB/s | 52.29% |
| 🐢 | slow | 4 | 115 fps | 0.72 GB/s | |
| šŸš€ | slow | 4 | 118 fps | 0.74 GB/s | 3.40% |
| šŸŒ€ | slow | 4 | 119 fps | 0.75 GB/s | 4.34% |

| 1440p | x264 | Buffers | Framerate | Bandwidth | Gain |
|---|---|---|---|---|---|
| 🐢 | Null | 1 | 210 fps | 2.33 GB/s | |
| šŸš€ | Null | 1 | 239 fps | 2.64 GB/s | 13.84% |
| šŸŒ€ | Null | 1 | 534 fps | 5.91 GB/s | 154.32% |
| 🐢 | Null | 4 | 219 fps | 2.43 GB/s | |
| šŸš€ | Null | 4 | 231 fps | 2.56 GB/s | 5.64% |
| šŸŒ€ | Null | 4 | 503 fps | 5.56 GB/s | 129.75% |
| 🐢 | ultrafast | 4 | 141 fps | 1.56 GB/s | |
| šŸš€ | ultrafast | 4 | 150 fps | 1.67 GB/s | 6.92% |
| šŸŒ€ | ultrafast | 4 | 226 fps | 2.50 GB/s | 60.37% |
| 🐢 | slow | 4 | 72 fps | 0.80 GB/s | |
| šŸš€ | slow | 4 | 71 fps | 0.79 GB/s | -0.70% |
| šŸŒ€ | slow | 4 | 75 fps | 0.83 GB/s | 4.60% |

| 2160p | x264 | Buffers | Framerate | Bandwidth | Gain |
|---|---|---|---|---|---|
| 🐢 | Null | 1 | 81 fps | 2.03 GB/s | |
| šŸš€ | Null | 1 | 107 fps | 2.67 GB/s | 32.26% |
| šŸŒ€ | Null | 1 | 213 fps | 5.31 GB/s | 163.47% |
| 🐢 | Null | 4 | 87 fps | 2.18 GB/s | |
| šŸš€ | Null | 4 | 109 fps | 2.72 GB/s | 25.43% |
| šŸŒ€ | Null | 4 | 212 fps | 5.28 GB/s | 143.72% |
| 🐢 | ultrafast | 4 | 59 fps | 1.48 GB/s | |
| šŸš€ | ultrafast | 4 | 67 fps | 1.68 GB/s | 14.46% |
| šŸŒ€ | ultrafast | 4 | 95 fps | 2.39 GB/s | 62.66% |
| 🐢 | slow | 4 | 37 fps | 0.94 GB/s | |
| šŸš€ | slow | 4 | 43 fps | 1.07 GB/s | 16.22% |
| šŸŒ€ | slow | 4 | 44 fps | 1.11 GB/s | 20.65% |
Desktop • (AMD Ryzen 9 5900x) • (NVIDIA RTX 3060 12 GB) • (DDR4 2x32 GB 3200 MT/s) • (Windows 11)
| 720p | x264 | Buffers | Framerate | Bandwidth | Gain |
|---|---|---|---|---|---|
| 🐢 | Null | 1 | 981 fps | 2.71 GB/s | |
| šŸš€ | Null | 1 | 1145 fps | 3.17 GB/s | 16.74% |
| šŸŒ€ | Null | 1 | 1504 fps | 4.16 GB/s | 53.38% |
| 🐢 | Null | 4 | 997 fps | 2.76 GB/s | |
| šŸš€ | Null | 4 | 1117 fps | 3.09 GB/s | 12.08% |
| šŸŒ€ | Null | 4 | 1467 fps | 4.06 GB/s | 47.14% |
| 🐢 | ultrafast | 4 | 601 fps | 1.66 GB/s | |
| šŸš€ | ultrafast | 4 | 616 fps | 1.70 GB/s | 2.57% |
| šŸŒ€ | ultrafast | 4 | 721 fps | 1.99 GB/s | 20.04% |
| 🐢 | slow | 4 | 206 fps | 0.57 GB/s | |
| šŸš€ | slow | 4 | 206 fps | 0.57 GB/s | 0.39% |
| šŸŒ€ | slow | 4 | 206 fps | 0.57 GB/s | 0.13% |

| 1080p | x264 | Buffers | Framerate | Bandwidth | Gain |
|---|---|---|---|---|---|
| 🐢 | Null | 1 | 451 fps | 2.81 GB/s | |
| šŸš€ | Null | 1 | 542 fps | 3.38 GB/s | 20.31% |
| šŸŒ€ | Null | 1 | 711 fps | 4.43 GB/s | 57.86% |
| 🐢 | Null | 4 | 449 fps | 2.79 GB/s | |
| šŸš€ | Null | 4 | 518 fps | 3.23 GB/s | 15.48% |
| šŸŒ€ | Null | 4 | 614 fps | 3.82 GB/s | 36.83% |
| 🐢 | ultrafast | 4 | 262 fps | 1.64 GB/s | |
| šŸš€ | ultrafast | 4 | 266 fps | 1.66 GB/s | 1.57% |
| šŸŒ€ | ultrafast | 4 | 319 fps | 1.99 GB/s | 21.88% |
| 🐢 | slow | 4 | 119 fps | 0.74 GB/s | |
| šŸš€ | slow | 4 | 121 fps | 0.76 GB/s | 2.46% |
| šŸŒ€ | slow | 4 | 121 fps | 0.75 GB/s | 1.90% |

| 1440p | x264 | Buffers | Framerate | Bandwidth | Gain |
|---|---|---|---|---|---|
| 🐢 | Null | 1 | 266 fps | 2.95 GB/s | |
| šŸš€ | Null | 1 | 308 fps | 3.41 GB/s | 15.87% |
| šŸŒ€ | Null | 1 | 402 fps | 4.45 GB/s | 51.22% |
| 🐢 | Null | 4 | 276 fps | 3.06 GB/s | |
| šŸš€ | Null | 4 | 307 fps | 3.40 GB/s | 11.32% |
| šŸŒ€ | Null | 4 | 427 fps | 4.73 GB/s | 54.86% |
| 🐢 | ultrafast | 4 | 152 fps | 1.68 GB/s | |
| šŸš€ | ultrafast | 4 | 156 fps | 1.73 GB/s | 3.02% |
| šŸŒ€ | ultrafast | 4 | 181 fps | 2.01 GB/s | 19.36% |
| 🐢 | slow | 4 | 77 fps | 0.86 GB/s | |
| šŸš€ | slow | 4 | 79 fps | 0.88 GB/s | 3.27% |
| šŸŒ€ | slow | 4 | 80 fps | 0.89 GB/s | 4.86% |

| 2160p | x264 | Buffers | Framerate | Bandwidth | Gain |
|---|---|---|---|---|---|
| 🐢 | Null | 1 | 134 fps | 3.35 GB/s | |
| šŸš€ | Null | 1 | 152 fps | 3.81 GB/s | 14.15% |
| šŸŒ€ | Null | 1 | 221 fps | 5.52 GB/s | 65.44% |
| 🐢 | Null | 4 | 135 fps | 3.36 GB/s | |
| šŸš€ | Null | 4 | 151 fps | 3.76 GB/s | 11.89% |
| šŸŒ€ | Null | 4 | 220 fps | 5.49 GB/s | 63.34% |
| 🐢 | ultrafast | 4 | 66 fps | 1.65 GB/s | |
| šŸš€ | ultrafast | 4 | 70 fps | 1.75 GB/s | 6.44% |
| šŸŒ€ | ultrafast | 4 | 82 fps | 2.04 GB/s | 24.31% |
| 🐢 | slow | 4 | 40 fps | 1.01 GB/s | |
| šŸš€ | slow | 4 | 43 fps | 1.09 GB/s | 9.54% |
| šŸŒ€ | slow | 4 | 44 fps | 1.10 GB/s | 10.15% |
Laptop • (Intel Core i7 11800H) • (NVIDIA RTX 3070) • (DDR4 2x16 GB 3200 MT/s) • (Windows 11)

Note: You must select the NVIDIA GPU in the NVIDIA Control Panel instead of the Intel iGPU

| 720p | x264 | Buffers | Framerate | Bandwidth | Gain |
|---|---|---|---|---|---|
| 🐢 | Null | 1 | 786 fps | 2.17 GB/s | |
| šŸš€ | Null | 1 | 903 fps | 2.50 GB/s | 14.91% |
| šŸŒ€ | Null | 1 | 1366 fps | 3.78 GB/s | 73.90% |
| 🐢 | Null | 4 | 739 fps | 2.04 GB/s | |
| šŸš€ | Null | 4 | 855 fps | 2.37 GB/s | 15.78% |
| šŸŒ€ | Null | 4 | 1240 fps | 3.43 GB/s | 67.91% |
| 🐢 | ultrafast | 4 | 484 fps | 1.34 GB/s | |
| šŸš€ | ultrafast | 4 | 503 fps | 1.39 GB/s | 4.10% |
| šŸŒ€ | ultrafast | 4 | 577 fps | 1.60 GB/s | 19.37% |
| 🐢 | slow | 4 | 143 fps | 0.40 GB/s | |
| šŸš€ | slow | 4 | 145 fps | 0.40 GB/s | 1.78% |
| šŸŒ€ | slow | 4 | 151 fps | 0.42 GB/s | 5.76% |

| 1080p | x264 | Buffers | Framerate | Bandwidth | Gain |
|---|---|---|---|---|---|
| 🐢 | Null | 1 | 358 fps | 2.23 GB/s | |
| šŸš€ | Null | 1 | 427 fps | 2.66 GB/s | 19.45% |
| šŸŒ€ | Null | 1 | 566 fps | 3.53 GB/s | 58.31% |
| 🐢 | Null | 4 | 343 fps | 2.14 GB/s | |
| šŸš€ | Null | 4 | 404 fps | 2.51 GB/s | 17.86% |
| šŸŒ€ | Null | 4 | 465 fps | 2.89 GB/s | 35.62% |
| 🐢 | ultrafast | 4 | 191 fps | 1.19 GB/s | |
| šŸš€ | ultrafast | 4 | 207 fps | 1.29 GB/s | 8.89% |
| šŸŒ€ | ultrafast | 4 | 234 fps | 1.46 GB/s | 22.77% |
| 🐢 | slow | 4 | 62 fps | 0.39 GB/s | |
| šŸš€ | slow | 4 | 67 fps | 0.42 GB/s | 8.40% |
| šŸŒ€ | slow | 4 | 74 fps | 0.47 GB/s | 20.89% |

| 1440p | x264 | Buffers | Framerate | Bandwidth | Gain |
|---|---|---|---|---|---|
| 🐢 | Null | 1 | 180 fps | 1.99 GB/s | |
| šŸš€ | Null | 1 | 216 fps | 2.40 GB/s | 20.34% |
| šŸŒ€ | Null | 1 | 264 fps | 2.92 GB/s | 46.74% |
| 🐢 | Null | 4 | 178 fps | 1.97 GB/s | |
| šŸš€ | Null | 4 | 211 fps | 2.34 GB/s | 19.07% |
| šŸŒ€ | Null | 4 | 250 fps | 2.77 GB/s | 40.48% |
| 🐢 | ultrafast | 4 | 98 fps | 1.09 GB/s | |
| šŸš€ | ultrafast | 4 | 110 fps | 1.23 GB/s | 13.18% |
| šŸŒ€ | ultrafast | 4 | 121 fps | 1.35 GB/s | 24.15% |
| 🐢 | slow | 4 | 40 fps | 0.45 GB/s | |
| šŸš€ | slow | 4 | 41 fps | 0.46 GB/s | 4.90% |
| šŸŒ€ | slow | 4 | 43 fps | 0.48 GB/s | 7.89% |

| 2160p | x264 | Buffers | Framerate | Bandwidth | Gain |
|---|---|---|---|---|---|
| 🐢 | Null | 1 | 79 fps | 1.98 GB/s | |
| šŸš€ | Null | 1 | 95 fps | 2.37 GB/s | 20.52% |
| šŸŒ€ | Null | 1 | 104 fps | 2.60 GB/s | 32.15% |
| 🐢 | Null | 4 | 80 fps | 2.00 GB/s | |
| šŸš€ | Null | 4 | 94 fps | 2.35 GB/s | 17.82% |
| šŸŒ€ | Null | 4 | 108 fps | 2.70 GB/s | 35.40% |
| 🐢 | ultrafast | 4 | 41 fps | 1.04 GB/s | |
| šŸš€ | ultrafast | 4 | 48 fps | 1.20 GB/s | 17.67% |
| šŸŒ€ | ultrafast | 4 | 52 fps | 1.30 GB/s | 27.49% |
| 🐢 | slow | 4 | 17 fps | 0.43 GB/s | |
| šŸš€ | slow | 4 | 19 fps | 0.48 GB/s | 13.16% |
| šŸŒ€ | slow | 4 | 19 fps | 0.48 GB/s | 13.78% |

šŸŒ€ Conclusion

TurboPipe significantly increases the speed of feeding data to FFmpeg, especially at higher resolutions. However, when little CPU compute is available, or the video is hard to encode (slow preset), the gains over the other methods are insignificant (the encoder becomes the bottleneck). Multi-buffering didn't prove to have an advantage: debugging shows that the TurboPipe C++ thread is often starved of data to write (most likely because the file stream is buffered by the OS), and context switching or cache misses might cause the slowdown.

The theory supports the threaded method being faster: writing to a file descriptor is a blocking operation for Python, but under the hood it is a syscall that doesn't necessarily lock the GIL, only the calling thread. TurboPipe speeds this up even further by avoiding an unnecessary copy of the buffer data and writing directly to the file descriptor from a C++ thread. After the optimizations, Linux shows better performance than Windows on the same system, but Windows wins on the standard method.

Interestingly, either due to Linux's scheduler on AMD Ryzen CPUs or to their operating philosophy, it was experimentally seen that Ryzen's frenetic thread switching slightly degrades single-thread performance, which can be "fixed" by prepending the command with taskset --cpu-list 0,2 (not recommended at all), comparatively speaking to Windows performance on the same system (Linux šŸš€ = Windows 🐢). This may also be due to the topology of the tested CPUs having more than one Core Complex Die (CCD). Intel CPUs seem to stick to the same thread for longer, which makes the Python threaded method often slightly faster.

Personal experience

Under realistic loads, like ShaderFlow's default lightweight shader export, TurboPipe increases rendering speed on my system from roughly 260 fps to 360 fps at 1080p, with CPU usage in the mid-80% range instead of the low 60%s. For DepthFlow's default depth video export, no gains are seen, as the CPU is almost saturated encoding at about 130 fps at 1080p.


šŸ“š Future work

  • Disable/investigate performance degradation on Windows iGPUs
  • Maybe use mmap instead of chunked writes on Linux
  • Split the code into a libturbopipe? Not sure where it would be useful šŸ˜…
