# alma

A Python library for benchmarking PyTorch model speed for different conversion options 🚀
With just one function call, you can get a full report on how fast your PyTorch model runs for inference across over 40 conversion options, such as JIT tracing, torch.compile, torch.export, torchao, ONNX, OpenVINO, TensorRT, and many more!
This lets you find the best option for your model, data, and hardware. See the Conversion Options Summary below for all supported options.
## Getting Started

### Installation
`alma` is available as a Python package. One can install it from the Python Package Index (PyPI) by running:

```bash
pip install alma-torch
```
Alternatively, it can be installed from the root of this repository by running:

```bash
pip install -e .
```
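As a quick sanity check, you can confirm that the package imports correctly (a minimal sketch; the import path matches the Basic usage example below):

```bash
python -c "from alma import benchmark_model; print('alma imported successfully')"
```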
### Docker

We recommend that you build the provided Dockerfile to ensure an easy installation of all of the system dependencies and the `alma` pip packages.

#### Working with the Docker image
- **Build the Docker image**

  ```bash
  bash scripts/build_docker.sh
  ```

- **Run the Docker container**

  Create and start a container named `alma`:

  ```bash
  bash scripts/run_docker.sh
  ```

- **Access the running container**

  Enter the container's shell:

  ```bash
  docker exec -it alma bash
  ```

- **Mount your repository**

  By default, the `run_docker.sh` script mounts your `/home` directory to `/home` inside the container. If your `alma` repository is in a different location, update the bind mount, for example:

  ```bash
  -v /Users/myuser/alma:/home/alma
  ```
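Putting the steps above together, a typical first session (using the default script paths and container name from this repository) might look like:

```bash
# Build the image and start a container named "alma"
bash scripts/build_docker.sh
bash scripts/run_docker.sh

# Open a shell inside the running container
docker exec -it alma bash
```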
## Basic usage
The core API is `benchmark_model`, which is used to benchmark the speed of a model for different conversion options. The usage is as follows:
```python
import torch

from alma import benchmark_model
from alma.benchmark import BenchmarkConfig
from alma.benchmark.log import display_all_results

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model = ...        # your torch.nn.Module
data_loader = ...  # your torch.utils.data.DataLoader

config = BenchmarkConfig(
    n_samples=2048,
    batch_size=64,
    device=device,
)

conversions = ["EAGER", "TORCH_SCRIPT", "COMPILE_INDUCTOR_MAX_AUTOTUNE", "COMPILE_OPENXLA"]

results = benchmark_model(model, config, conversions, data_loader=data_loader)
display_all_results(results)
```
The results will look like this, depending on one's model, dataloader, and hardware:

```
EAGER results:
Device: cuda
Total elapsed time: 0.0206 seconds
Total inference time (model only): 0.0074 seconds
Total samples: 2048 - Batch size: 64
Throughput: 275643.45 samples/second

TORCH_SCRIPT results:
Device: cuda
Total elapsed time: 0.0203 seconds
Total inference time (model only): 0.0043 seconds
Total samples: 2048 - Batch size: 64
Throughput: 477575.34 samples/second

COMPILE_INDUCTOR_MAX_AUTOTUNE results:
Device: cuda
Total elapsed time: 0.0159 seconds
Total inference time (model only): 0.0035 seconds
Total samples: 2048 - Batch size: 64
Throughput: 592801.70 samples/second

COMPILE_OPENXLA results:
Device: xla:0
Total elapsed time: 0.0146 seconds
Total inference time (model only): 0.0033 seconds
Total samples: 2048 - Batch size: 64
Throughput: 611865.07 samples/second
```
See the examples for discussion of design choices and for more advanced usage, e.g. controlling the multiprocessing setup, handling failures gracefully, setting default device fallbacks if a conversion option is incompatible with your specified device, memory-efficient usage of `alma`, etc.
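For illustration, a minimal end-to-end setup might look like the sketch below. The toy model, input shapes, and random data are hypothetical placeholders (whether your data loader should yield plain tensors or (input, target) tuples depends on your model); the `benchmark_model` call and `BenchmarkConfig` fields follow the usage shown above.

```python
import torch
from torch.utils.data import DataLoader

from alma import benchmark_model
from alma.benchmark import BenchmarkConfig
from alma.benchmark.log import display_all_results

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# A toy model and random inputs, purely for illustration
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
inputs = torch.randn(2048, 128)
data_loader = DataLoader(inputs, batch_size=64)  # yields plain input tensors

config = BenchmarkConfig(n_samples=2048, batch_size=64, device=device)

# Benchmark a couple of conversion options from the table below
results = benchmark_model(model, config, ["EAGER", "JIT_TRACE"], data_loader=data_loader)
display_all_results(results)
```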
## Conversion Options

### Naming conventions
The naming convention for conversion options is as follows:
- Short but descriptive names for each technique, e.g. `EAGER`, `EXPORT`, etc.
- Underscores `_` are used within each technique name to separate the words for readability, e.g. `AOT_INDUCTOR`, `COMPILE_CUDAGRAPHS`, etc.
- If multiple "techniques" are used in a conversion option, then the names are separated by a `+` sign in chronological order of operation. For example, `EXPORT+EAGER` and `EXPORT+COMPILE_INDUCTOR_MAX_AUTOTUNE`: in both cases, `EXPORT` is the first operation, followed by `EAGER` or `COMPILE_INDUCTOR_MAX_AUTOTUNE`.
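These names are what you pass to `benchmark_model` as the list of conversions. For example (a sketch that reuses the `model`, `config`, and `data_loader` from the Basic usage section, with option names taken from the table below):

```python
from alma import benchmark_model

# Composite option names: the technique before "+" runs first
conversions = [
    "EXPORT+EAGER",                          # torch.export, then eager execution
    "EXPORT+COMPILE_INDUCTOR_MAX_AUTOTUNE",  # torch.export, then torch.compile (inductor, max-autotune)
    "FP16+EAGER",                            # cast to FP16, then eager execution
]

# `model`, `config`, and `data_loader` are as defined in the Basic usage example above
results = benchmark_model(model, config, conversions, data_loader=data_loader)
```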
### Conversion Options Summary
Below is a table summarizing the currently supported conversion options and their identifiers:

ID | Conversion Option | Device Support | Project |
---|---|---|---|
0 | EAGER | CPU, MPS, GPU | PyTorch |
1 | EXPORT+EAGER | CPU, MPS, GPU | torch.export |
2 | ONNX_CPU | CPU | ONNXRT |
3 | ONNX_GPU | GPU | ONNXRT |
4 | ONNX+DYNAMO_EXPORT | CPU | ONNXRT |
5 | COMPILE_CUDAGRAPHS | GPU (CUDA) | torch.compile |
6 | COMPILE_INDUCTOR_DEFAULT | CPU, MPS, GPU | torch.compile |
7 | COMPILE_INDUCTOR_REDUCE_OVERHEAD | CPU, MPS, GPU | torch.compile |
8 | COMPILE_INDUCTOR_MAX_AUTOTUNE | CPU, MPS, GPU | torch.compile |
9 | COMPILE_INDUCTOR_EAGER_FALLBACK | CPU, MPS, GPU | torch.compile |
10 | COMPILE_ONNXRT | CPU, MPS, GPU | torch.compile + ONNXRT |
11 | COMPILE_OPENXLA | XLA_GPU | torch.compile + OpenXLA |
12 | COMPILE_TVM | CPU, MPS, GPU | torch.compile + Apache TVM |
13 | EXPORT+AI8WI8_FLOAT_QUANTIZED | CPU, MPS, GPU | torch.export |
14 | EXPORT+AI8WI8_FLOAT_QUANTIZED+RUN_DECOMPOSITION | CPU, MPS, GPU | torch.export |
15 | EXPORT+AI8WI8_STATIC_QUANTIZED | CPU, MPS, GPU | torch.export |
16 | EXPORT+AI8WI8_STATIC_QUANTIZED+RUN_DECOMPOSITION | CPU, MPS, GPU | torch.export |
17 | EXPORT+AOT_INDUCTOR | CPU, MPS, GPU | torch.export + aot_inductor |
18 | EXPORT+COMPILE_CUDAGRAPHS | GPU (CUDA) | torch.export + torch.compile |
19 | EXPORT+COMPILE_INDUCTOR_DEFAULT | CPU, MPS, GPU | torch.export + torch.compile |
20 | EXPORT+COMPILE_INDUCTOR_REDUCE_OVERHEAD | CPU, MPS, GPU | torch.export + torch.compile |
21 | EXPORT+COMPILE_INDUCTOR_MAX_AUTOTUNE | CPU, MPS, GPU | torch.export + torch.compile |
22 | EXPORT+COMPILE_INDUCTOR_DEFAULT_EAGER_FALLBACK | CPU, MPS, GPU | torch.export + torch.compile |
23 | EXPORT+COMPILE_ONNXRT | CPU, MPS, GPU | torch.export + torch.compile + ONNXRT |
24 | EXPORT+COMPILE_OPENXLA | XLA_GPU | torch.export + torch.compile + OpenXLA |
25 | EXPORT+COMPILE_TVM | CPU, MPS, GPU | torch.export + torch.compile + Apache TVM |
26 | NATIVE_CONVERT_AI8WI8_STATIC_QUANTIZED | CPU | CPU (PyTorch) |
27 | NATIVE_FAKE_QUANTIZED_AI8WI8_STATIC | CPU, GPU | CPU (PyTorch) |
28 | COMPILE_TENSORRT | GPU (CUDA) | torch.compile + NVIDIA TensorRT |
29 | EXPORT+COMPILE_TENSORRT | GPU (CUDA) | torch.export + torch.compile + NVIDIA TensorRT |
30 | COMPILE_OPENVINO | CPU (Intel) | torch.compile + OpenVINO |
31 | JIT_TRACE | CPU, MPS, GPU | PyTorch |
32 | TORCH_SCRIPT | CPU, MPS, GPU | PyTorch |
33 | OPTIMUM_QUANTO_AI8WI8 | CPU, MPS, GPU | optimum quanto |
34 | OPTIMUM_QUANTO_AI8WI4 | CPU, MPS, GPU (not all GPUs supported) | optimum quanto |
35 | OPTIMUM_QUANTO_AI8WI2 | CPU, MPS, GPU (not all GPUs supported) | optimum quanto |
36 | OPTIMUM_QUANTO_WI8 | CPU, MPS, GPU | optimum quanto |
37 | OPTIMUM_QUANTO_WI4 | CPU, MPS, GPU (not all GPUs supported) | optimum quanto |
38 | OPTIMUM_QUANTO_WI2 | CPU, MPS, GPU (not all GPUs supported) | optimum quanto |
39 | OPTIMUM_QUANTO_Wf8E4M3N | CPU, MPS, GPU | optimum quanto |
40 | OPTIMUM_QUANTO_Wf8E4M3NUZ | CPU, MPS, GPU | optimum quanto |
41 | OPTIMUM_QUANTO_Wf8E5M2 | CPU, MPS, GPU | optimum quanto |
42 | OPTIMUM_QUANTO_Wf8E5M2+COMPILE_CUDAGRAPHS | GPU (CUDA) | optimum quanto + torch.compile |
43 | FP16+EAGER | CPU, MPS, GPU | PyTorch |
44 | BF16+EAGER | CPU, MPS, GPU (not all GPUs natively supported) | PyTorch |
45 | COMPILE_INDUCTOR_MAX_AUTOTUNE+TORCHAO_AUTOQUANT_DEFAULT | GPU | torch.compile + torchao |
46 | COMPILE_INDUCTOR_MAX_AUTOTUNE+TORCHAO_AUTOQUANT_NONDEFAULT | GPU | torch.compile + torchao |
47 | COMPILE_CUDAGRAPHS+TORCHAO_AUTOQUANT_DEFAULT | GPU (CUDA) | torch.compile + torchao |
48 | COMPILE_INDUCTOR_MAX_AUTOTUNE+TORCHAO_QUANT_I4_WEIGHT_ONLY | GPU (requires bf16 support) | torch.compile + torchao |
49 | TORCHAO_QUANT_I4_WEIGHT_ONLY | GPU (requires bf16 support) | torchao |
50 | FP16+COMPILE_CUDAGRAPHS | GPU (CUDA) | PyTorch + torch.compile |
51 | FP16+COMPILE_INDUCTOR_DEFAULT | CPU, MPS, GPU | PyTorch + torch.compile |
52 | FP16+COMPILE_INDUCTOR_REDUCE_OVERHEAD | CPU, MPS, GPU | PyTorch + torch.compile |
53 | FP16+COMPILE_INDUCTOR_MAX_AUTOTUNE | CPU, MPS, GPU | PyTorch + torch.compile |
54 | FP16+COMPILE_INDUCTOR_EAGER_FALLBACK | CPU, MPS, GPU | PyTorch + torch.compile |
55 | FP16+COMPILE_ONNXRT | CPU, MPS, GPU | PyTorch + torch.compile + ONNXRT |
56 | FP16+COMPILE_OPENXLA | XLA_GPU | PyTorch + torch.compile + OpenXLA |
57 | FP16+COMPILE_TVM | CPU, MPS, GPU | PyTorch + torch.compile + Apache TVM |
58 | FP16+COMPILE_TENSORRT | GPU (CUDA) | PyTorch + torch.compile + NVIDIA TensorRT |
59 | FP16+COMPILE_OPENVINO | CPU (Intel) | PyTorch + torch.compile + OpenVINO |
60 | FP16+EXPORT+COMPILE_CUDAGRAPHS | GPU (CUDA) | torch.export + torch.compile |
61 | FP16+EXPORT+COMPILE_INDUCTOR_DEFAULT | CPU, MPS, GPU | torch.export + torch.compile |
62 | FP16+EXPORT+COMPILE_INDUCTOR_REDUCE_OVERHEAD | CPU, MPS, GPU | torch.export + torch.compile |
63 | FP16+EXPORT+COMPILE_INDUCTOR_MAX_AUTOTUNE | CPU, MPS, GPU | torch.export + torch.compile |
64 | FP16+EXPORT+COMPILE_INDUCTOR_DEFAULT_EAGER_FALLBACK | CPU, MPS, GPU | torch.export + torch.compile |
65 | FP16+EXPORT+COMPILE_ONNXRT | CPU, MPS, GPU | torch.export + torch.compile + ONNXRT |
66 | FP16+EXPORT+COMPILE_OPENXLA | XLA_GPU | torch.export + torch.compile + OpenXLA |
67 | FP16+EXPORT+COMPILE_TVM | CPU, MPS, GPU | torch.export + torch.compile + Apache TVM |
68 | FP16+EXPORT+COMPILE_TENSORRT | GPU (CUDA) | torch.export + torch.compile + NVIDIA TensorRT |
69 | FP16+EXPORT+COMPILE_OPENVINO | CPU (Intel) | torch.export + torch.compile + OpenVINO |
70 | FP16+JIT_TRACE | CPU, MPS, GPU | PyTorch |
71 | FP16+TORCH_SCRIPT | CPU, MPS, GPU | PyTorch |
These conversion options are also all hard-coded in the conversion options
file, which is the source of truth.
## Testing
We use pytest for testing. Simply run:

```bash
pytest
```
We currently don't have comprehensive tests, but we are working on adding more tests to ensure that
the conversion options are working as expected in known environments (e.g. the Docker container).
## Future work

- Add more conversion options. This is a work in progress, and we are always looking for more conversion options.
- Multi-device benchmarking. Currently `alma` only supports single-device benchmarking, but ideally a model could be split across multiple devices.
- Integrating conversion options beyond PyTorch, e.g. HuggingFace, JAX, llama.cpp, etc.
## How to contribute

Contributions are welcome! If you have a new conversion option, feature, or other improvement you would like to add, so that the whole community can benefit, please open a pull request! We are always looking for new conversion options, and we are happy to help you get started with adding a new conversion option or feature.
See the CONTRIBUTING.md file for more detailed information on how to contribute.
## Citation

```bibtex
@Misc{alma,
  title = {Alma: PyTorch model speed benchmarking across all conversion types},
  author = {Oscar Savolainen and Saif Haq},
  howpublished = {\url{https://github.com/saifhaq/alma}},
  year = {2024}
}
```