# Larq Compute Engine
Larq Compute Engine (LCE) is a highly optimized inference engine for deploying
extremely quantized neural networks, such as
Binarized Neural Networks (BNNs). It currently supports various mobile platforms
and has been benchmarked on a Pixel 1 phone and a Raspberry Pi.
LCE provides a collection of hand-optimized TensorFlow Lite
custom operators for supported instruction sets, developed in inline assembly or in C++
using compiler intrinsics. LCE leverages optimization techniques
such as tiling to maximize the number of cache hits, vectorization to maximize
computational throughput, and multi-threaded parallelization to take
advantage of modern multi-core desktop and mobile CPUs.
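The core computation these kernels optimize is the binary dot product: with weights and activations constrained to {-1, +1} and packed one bit per element, a dot product of length n reduces to an XOR followed by a popcount, since matching bits contribute +1 and differing bits contribute -1. Below is a minimal NumPy sketch of this trick, for illustration only; LCE's real kernels implement it with SIMD instructions in C++ and assembly:

```python
import numpy as np

def pack(signs: np.ndarray) -> np.ndarray:
    """Pack a {-1, +1} vector into 64-bit words, one bit per element (+1 -> 1)."""
    bits = (signs > 0).astype(np.uint8)
    bits = np.pad(bits, (0, -bits.size % 64))  # zero-pad to a multiple of 64 bits
    return np.packbits(bits).view(np.uint64)

def binary_dot(a: np.ndarray, b: np.ndarray) -> int:
    """Dot product of two equal-length {-1, +1} vectors via XOR + popcount."""
    # A set bit in the XOR marks a sign mismatch; the zero padding is identical
    # in both operands, so it never contributes a mismatch.
    mismatches = sum(bin(int(w)).count("1") for w in pack(a) ^ pack(b))
    return a.size - 2 * mismatches  # matches contribute +1, mismatches -1

rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=300)
b = rng.choice([-1, 1], size=300)
assert binary_dot(a, b) == int(a @ b)
```

Tiling and multi-threading then distribute many such packed dot products across cache-friendly blocks and CPU cores.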
Larq Compute Engine is part of a family of libraries for BNN development; you can also check out Larq for building and training BNNs and Larq Zoo for pre-trained models.
## Performance
The table below presents the single-threaded performance of Larq Compute Engine on
different versions of QuickNet, a novel BNN model family trained on the ImageNet dataset and released on Larq Zoo,
measured on a Raspberry Pi 4 Model B board (BCM2711 at 1.5 GHz), a Pixel 1 Android phone (2016), and a Mac Mini with an M1 ARM CPU:
| Model         | Top-1 Accuracy | RPi 4B 1.5GHz, 1 thread (ms) | Pixel 1, 1 thread (ms) | Mac Mini M1, 1 thread (ms) |
|---------------|----------------|------------------------------|------------------------|----------------------------|
| QuickNetSmall | 59.4%          | 27.7                         | 16.8                   | 4.0                        |
| QuickNet      | 63.3%          | 45.0                         | 25.5                   | 5.8                        |
| QuickNetLarge | 66.9%          | 77.0                         | 44.2                   | 9.9                        |
For reference, dabnn (the other main BNN library) reports an inference time of 61.3 ms for Bi-RealNet (56.4% accuracy) on the Pixel 1 phone,
while LCE achieves an inference time of 41.6 ms for Bi-RealNet on the same device.
The dabnn authors furthermore present a modified version, BiRealNet-Stem, which achieves the same accuracy of 56.4% in 43.2 ms.
The following table presents the multi-threaded performance of Larq Compute Engine on
the same three devices:
| Model         | Top-1 Accuracy | RPi 4B 1.5GHz, 4 threads (ms) | Pixel 1, 4 threads (ms) | Mac Mini M1, 4 threads (ms) |
|---------------|----------------|-------------------------------|-------------------------|-----------------------------|
| QuickNetSmall | 59.4%          | 12.1                          | 8.9                     | 1.8                         |
| QuickNet      | 63.3%          | 20.8                          | 12.6                    | 2.5                         |
| QuickNetLarge | 66.9%          | 31.7                          | 22.8                    | 3.9                         |
Benchmarked on 2021-06-11 (Pixel 1), 2021-06-13 (Mac Mini M1), and 2022-04-20 (RPi 4B) with LCE's custom build of the
TFLite Model Benchmark Tool, with XNNPack enabled, using BNN models with randomized inputs.
## Getting started
Follow these steps to deploy a BNN with LCE:
1. **Pick a Larq model**

   You can use Larq to build and train your own model or pick a pre-trained model from Larq Zoo.

2. **Convert the Larq model**

   LCE is built on top of TensorFlow Lite and uses the TensorFlow Lite FlatBuffer format to convert and serialize Larq models for inference. We provide an LCE Converter with additional optimization passes that increase the execution speed of Larq models on the supported target platforms; a conversion sketch follows this list.

3. **Build LCE**

   The LCE documentation provides build instructions for Android and for 64-bit ARM-based boards such as the Raspberry Pi. Please follow the provided instructions to create a native LCE build or to cross-compile for one of the supported targets.

4. **Run inference**

   LCE uses the TensorFlow Lite Interpreter to perform inference. In addition to the built-in TensorFlow Lite operators, the optimized LCE operators are registered with the interpreter to execute the Larq-specific subgraphs of the model. An example of creating and building an LCE-compatible TensorFlow Lite interpreter for your own application is provided in the LCE documentation; a Python inference sketch also follows this list.
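A short sketch of step 2, assuming the `larq-compute-engine` and `larq-zoo` pip packages are installed; `convert_keras_model` is the LCE Converter's Python entry point:

```python
import larq_compute_engine as lce
import larq_zoo as lqz

# Grab a pre-trained QuickNet from Larq Zoo (any Larq/Keras model works).
model = lqz.sota.QuickNet(weights="imagenet")

# Convert to a TensorFlow Lite FlatBuffer with LCE's extra optimization passes.
tflite_model = lce.convert_keras_model(model)

with open("quicknet.tflite", "wb") as f:
    f.write(tflite_model)
```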
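For step 4 on a desktop machine, LCE also ships a Python wrapper around the TensorFlow Lite interpreter with the custom operators pre-registered, which is convenient for quick checks before deploying a C++ build. The import path and constructor arguments below are assumptions based on LCE's Python utilities and may differ between versions:

```python
import numpy as np
# Assumed import path; check the LCE docs for your installed version.
from larq_compute_engine.tflite.python.interpreter import Interpreter

with open("quicknet.tflite", "rb") as f:
    flatbuffer = f.read()

# The LCE interpreter registers the optimized binary ops automatically.
interpreter = Interpreter(flatbuffer, num_threads=4)  # num_threads is assumed

# Run a single forward pass on a random input (NHWC, float32).
x = np.random.uniform(-1, 1, size=(1, 224, 224, 3)).astype(np.float32)
preds = interpreter.predict(x)
```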
## About
Larq Compute Engine is being developed by a team of deep learning researchers and engineers at Plumerai to help accelerate both our own research and the general adoption of Binarized Neural Networks.