IMPORTANT:
After gathering feedback from our partners and the community, we have decided that quanto will not continue as a standalone project, but will instead be merged into the optimum project.
External contributions to quanto will be suspended until the merge is complete.
DISCLAIMER: This package is still in beta. Expect breaking changes in API and serialization.
🤗 Quanto is a python quantization toolkit that provides several features that are either not supported or limited by the base pytorch quantization tools:
- serialization compatible with pytorch weight_only and 🤗 safetensors

Features yet to be implemented:
Thanks to a seamless propagation mechanism through quantized tensors, only a few modules, which act as quantized-tensor insertion points, are actually required.
The following modules can be quantized:
At the heart of quanto is a Tensor subclass that corresponds to the projection of a source Tensor into the optimal range of a destination type, followed by the mapping of the projected values to that type:
For floating-point destination types, the mapping is done by the native pytorch cast (i.e. Tensor.to()).
For integer destination types, the mapping is a simple rounding operation (i.e. torch.round()).
The goal of the projection is to increase the accuracy of the conversion by minimizing the number of saturated and zeroed values.
The projection is symmetric, i.e. it does not use a zero-point. This makes quantized Tensors compatible with many operations.
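To make this concrete, here is a minimal, hypothetical sketch of a symmetric per-tensor projection to int8. It is illustrative only and is not quanto's internal implementation:

```python
import torch

def symmetric_quantize(x: torch.Tensor):
    # Toy sketch of a symmetric per-tensor projection to int8:
    # a single positive scale, no zero-point.
    qmax = 127                                   # int8 range is [-128, 127]
    scale = x.abs().max() / qmax                 # map the largest magnitude to qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax).to(torch.int8)
    return q, scale

x = torch.randn(4, 8)
q, scale = symmetric_quantize(x)
x_hat = q.float() * scale                        # dequantize: a multiplication, no offset
```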
One of the benefits of using a lower-bitwidth representation is that you will be able to take advantage of accelerated operations for the destination type, which are typically faster than their higher-precision equivalents.
The current implementation, however, falls back to float32 for many operations because of the lack of dedicated kernels (only int8 matrix multiplication is available).
Note: the fallback cannot use float16, because that format is very bad at representing integers and would likely lead to overflows in intermediate calculations.
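As a quick illustration of why float16 is a poor integer accumulator (the specific values below are just examples):

```python
import torch

# float16 has a 10-bit mantissa, so not all integers above 2048 are representable:
print(torch.tensor(2049, dtype=torch.float16))        # tensor(2048., dtype=torch.float16)

# and accumulating int8 x int8 products quickly exceeds the float16 maximum (65504):
acc = torch.full((16,), 127.0 * 127.0, dtype=torch.float16).sum()
print(acc)                                            # tensor(inf, dtype=torch.float16)
```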
Quanto does not support the conversion of a Tensor using mixed destination types.
Quanto provides a generic mechanism to replace torch modules with quanto modules that are able to process quanto tensors.
Quanto modules dynamically convert their weights until a model is frozen, which slows down inference a bit but is required if the model needs to be tuned.
Biases are not converted because, to preserve the accuracy of a typical addmm operation, they would have to be converted with a scale equal to the product of the input and weight scales, which leads to a ridiculously small scale and, conversely, requires a very high bitwidth to avoid clipping. Typically, with int8 inputs and weights, biases would need to be quantized with at least 12 bits, i.e. in int16. Since most biases are today float16, this is a waste of time.
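As a rough back-of-the-envelope illustration with made-up scales (not values taken from quanto):

```python
# Illustrative numbers only: in an addmm with int8 inputs and weights, a bias
# added to the integer accumulator must use the scale input_scale * weight_scale.
input_scale = 0.02                        # e.g. activations roughly in [-2.5, 2.5]
weight_scale = 0.004                      # e.g. weights roughly in [-0.5, 0.5]
bias_scale = input_scale * weight_scale   # 8e-05, far smaller than either scale
# A bias value around 1.0 then needs an integer code of 1.0 / 8e-05 = 12500,
# which does not fit in int8 (max 127) but does fit in int16 (max 32767).
print(bias_scale, round(1.0 / bias_scale))
```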
Activations are dynamically quantized using static scales (defaulting to the range [-1, 1]). The model needs to be calibrated to evaluate the best activation scales (using a momentum).
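The sketch below shows one plausible way such a momentum-based scale estimate could work; it is purely illustrative and does not reproduce quanto's actual calibration code:

```python
import torch

def update_scale(scale, batch, momentum=0.9):
    # Hypothetical sketch: keep an exponential moving average of the scale
    # implied by each batch of activations.
    observed = batch.abs().max() / 127.0          # scale that maps the batch max to int8
    return momentum * scale + (1.0 - momentum) * observed

scale = torch.tensor(1.0 / 127.0)                 # default activation range [-1, 1]
for _ in range(10):
    scale = update_scale(scale, torch.randn(32, 64))
```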
In a nutshell: int8/float8 weights and float8 activations are very close to the 16-bit models, but 2x slower than the 16-bit models due to the lack of optimized kernels (for now).

These figures are just an example. Please refer to the bench folder for detailed results per use case and model.
Quanto is available as a pip package.
```bash
pip install quanto
```
Quanto does not make a clear distinction between dynamic and static quantization: models are always dynamically quantized, but their weights can later be "frozen" to integer values.
A typical quantization workflow would consist of the following steps:
1. Quantize
The first step converts a standard float model into a dynamically quantized model.
```python
quantize(model, weights=quanto.qint8, activations=quanto.qint8)
```
At this stage, only the inference of the model is modified to dynamically quantize the weights.
2. Calibrate (optional if activations are not quantized)
Quanto supports a calibration mode that records the activation ranges while representative samples are passed through the quantized model.
```python
with calibration(momentum=0.9):
    model(samples)
```
This automatically activates the quantization of the activations in the quantized modules.
3. Tune, aka Quantization-Aware-Training (optional)
If the performance of the model degrades too much, one can tune it for a few epochs to recover the float model performance.
```python
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data).dequantize()
    loss = torch.nn.functional.nll_loss(output, target)
    loss.backward()
    optimizer.step()
```
4. Freeze integer weights
When freezing a model, its float weights are replaced by quantized integer weights.
```python
freeze(model)
```
Please refer to the examples for instantiations of that workflow.
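For orientation, here is a minimal end-to-end sketch of the four steps on a toy model, assuming quantize, calibration and freeze are importable from the top-level quanto package as in the snippets above:

```python
import torch
import quanto
from quanto import quantize, calibration, freeze

# Hypothetical toy model; any torch.nn module with Linear layers would do.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
)

# 1. Quantize: mark weights and activations for int8 quantization.
quantize(model, weights=quanto.qint8, activations=quanto.qint8)

# 2. Calibrate: record activation ranges on representative samples.
with calibration(momentum=0.9):
    for _ in range(8):
        model(torch.randn(16, 64))

# 3. (Optional) tune the model here with a regular training loop.

# 4. Freeze: replace the float weights by integer weights.
freeze(model)
```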
Activations are always quantized per-tensor because most linear algebra operations in a model graph are not compatible with per-axis inputs: you simply cannot add numbers that are not expressed in the same base (you cannot add apples and oranges).
Weights involved in matrix multiplications are, on the contrary, always quantized along their first axis, because all output features are evaluated independently from one another.
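The following sketch illustrates what per-first-axis (per output feature) weight quantization looks like; it is a simplified illustration, not quanto's internal code:

```python
import torch

# Per-axis quantization of a Linear weight keeps one scale per output feature,
# i.e. one scale per row of the weight matrix.
w = torch.randn(128, 64)                              # (out_features, in_features)
scales = w.abs().amax(dim=1, keepdim=True) / 127      # shape (128, 1)
q = torch.clamp(torch.round(w / scales), -127, 127).to(torch.int8)
w_hat = q.float() * scales                            # row-wise dequantization
```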
The outputs of a quantized matrix multiplication will anyway always be dequantized, even if activations are quantized, because:
- the accumulated result is expressed with a much higher bitwidth (typically int32) than the activation bitwidth (typically int8),
- it may need to be combined with a float bias.

Quantizing activations per-tensor to int8 can lead to serious quantization errors if the corresponding tensors contain large outlier values. Typically, this will lead to quantized tensors with most values set to zero (except the outliers).
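A tiny, artificial example of the effect:

```python
import torch

# One large outlier dominates the per-tensor scale, so almost every other value
# rounds to zero after quantization.
x = torch.cat([torch.randn(1000) * 0.01, torch.tensor([100.0])])
scale = x.abs().max() / 127                            # ~0.79, dictated by the outlier
q = torch.round(x / scale)
print((q == 0).float().mean())                         # close to 1.0
```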
A possible solution to work around that issue is to 'smooth' the activations statically as illustrated by SmoothQuant. You can find a script to smooth some model architectures under external/smoothquant.
A better option is to represent activations using float8.