Speech Command Recognizer
The Speech Command Recognizer is a JavaScript module that enables
recognition of spoken commands consisting of simple isolated English
words from a small vocabulary. The default vocabulary includes the following
words: the ten digits from "zero" to "nine", "up", "down", "left", "right",
"go", "stop", "yes", and "no", as well as the additional categories of
"unknown word" and "background noise".
It uses the web browser's
WebAudio API.
It is built on top of TensorFlow.js and can
perform inference and transfer learning entirely in the browser, using
WebGL GPU acceleration.
The underlying deep neural network has been trained using the
TensorFlow Speech Commands Dataset.
For more details on the dataset, see:
Warden, P. (2018) "Speech commands: A dataset for limited-vocabulary
speech recognition" https://arxiv.org/pdf/1804.03209.pdf
API Usage
A speech command recognizer can be used in two ways:
- Online streaming recognition, during which the library automatically
opens an audio input channel using the browser's getUserMedia and WebAudio
APIs (requesting permission from the user) and performs real-time recognition
on the audio input.
- Offline recognition, in which you provide a pre-constructed TensorFlow.js
Tensor object or a
Float32Array and the recognizer will return the recognition results.
Online streaming recognition
To use the speech-command recognizer, first create a recognizer instance,
then start the streaming recognition by calling its listen() method.
import * as tf from '@tensorflow/tfjs';
import * as speechCommands from '@tensorflow-models/speech-commands';
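// When calling create(), you must provide the type of the audio input.
// 'BROWSER_FFT' uses the browser's native Fourier transform.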
const recognizer = speechCommands.create('BROWSER_FFT');
await recognizer.ensureModelLoaded();
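// See the array of words that the recognizer is trained to recognize.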
console.log(recognizer.wordLabels());
recognizer.listen(result => {
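  // - result.scores contains the probability scores that correspond to
  //   recognizer.wordLabels().
  // - result.spectrogram contains the spectrogram of the recognized audio
  //   (included because includeSpectrogram is set to true below).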
}, {
includeSpectrogram: true,
probabilityThreshold: 0.75
});
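// Stop the recognition in 10 seconds.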
setTimeout(() => recognizer.stopListening(), 10e3);
Vocabularies
When calling speechCommands.create(), you can specify the vocabulary
the loaded model will be able to recognize. This is specified as the second,
optional argument to speechCommands.create(). For example:
const recognizer = speechCommands.create('BROWSER_FFT', 'directional4w');
Currently, the supported vocabularies are:
- '18w' (default): The 20-item vocabulary, consisting of:
'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
'eight', 'nine', 'up', 'down', 'left', 'right', 'go', 'stop',
'yes', and 'no', in addition to 'background_noise' and 'unknown'.
- 'directional4w': The four directional words: 'up', 'down', 'left', and
'right', in addition to 'background_noise' and 'unknown'.
'18w' is the default vocabulary.
Parameters for online streaming recognition
As the example above shows, you can specify optional parameters when calling
listen(). The supported parameters are listed below; a combined example
follows the list.
- overlapFactor: Controls how often the recognizer performs prediction on
spectrograms. Must be >= 0 and < 1 (default: 0.5). For example,
if each spectrogram is 1000 ms long and overlapFactor is set to 0.25,
a prediction will happen every 750 ms.
- includeSpectrogram: Whether the callback function is invoked with the
spectrogram data included in the argument. Default: false.
- probabilityThreshold: The callback function will be invoked if and only if
the maximum probability score over all the words is greater than this
threshold. Default: 0.
- invokeCallbackOnNoiseAndUnknown: Whether the callback function will be
invoked if the "word" with the maximum probability score is the "unknown"
or "background noise" token. Default: false.
- includeEmbedding: Whether an internal activation from the underlying model
will be included in the callback argument, in addition to the probability
scores. Note: if this field is set to true, the value of
invokeCallbackOnNoiseAndUnknown will be overridden to true and the
value of probabilityThreshold will be overridden to 0.
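For instance, a listen() call that combines several of these parameters might
look like the following sketch (the specific values are illustrative):
recognizer.listen(result => {
  // result.scores contains the probability scores that correspond to
  // recognizer.wordLabels().
  console.log(result.scores);
}, {
  overlapFactor: 0.25,         // With 1000-ms spectrograms, predict every 750 ms.
  includeSpectrogram: true,    // Pass the spectrogram data to the callback.
  probabilityThreshold: 0.75,  // Suppress low-confidence callback invocations.
  invokeCallbackOnNoiseAndUnknown: false
});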
Offline recognition
To perform offline recognition, you need to have obtained the spectrogram
of an audio snippet by some means, e.g., by loading the data
from a .wav file or by synthesizing the spectrogram programmatically.
Assuming you have the spectrogram stored in an Array of numbers or
a Float32Array, you can create a tf.Tensor object. Note that the
shape of the Tensor must match the expectation of the recognizer instance.
E.g.,
import * as tf from '@tensorflow/tfjs';
import * as speechCommands from '@tensorflow-models/speech-commands';
const recognizer = speechCommands.create('BROWSER_FFT');
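// Inspect the input shape of the recognizer's underlying tf.Model,
// e.g., [null, 43, 232, 1]: an undetermined batch dimension, the number of
// audio frames, the number of frequency points per frame, and a fixed
// channel dimension of 1.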
console.log(recognizer.modelInputShape());
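// Inspect the sampling frequency and FFT size: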
console.log(recognizer.params().sampleRateHz);
console.log(recognizer.params().fftSize);
tf.tidy(() => {
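  // mySpectrogramData is assumed to hold your spectrogram values as a
  // Float32Array or Array of numbers (see above).
  // Replace the undetermined batch dimension of modelInputShape() with 1,
  // i.e., a batch containing a single example.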
const x = tf.tensor4d(
mySpectrogramData, [1].concat(recognizer.modelInputShape().slice(1)));
const output = recognizer.recognize(x);
});
Note that you must provide a spectrogram value to the recognize() call
in order to perform the offline recognition. If recognize() is called
without a first argument, it will perform one-shot online recognition
by collecting a frame of audio via WebAudio.
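For instance, based on the behavior described above, a one-shot online
recognition can be performed as follows:
// Collect a single frame of audio via WebAudio and recognize it.
const output = await recognizer.recognize();
console.log(output.scores);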
Preloading model
By default, a recognizer object will load the underlying
tf.Model via HTTP requests to a centralized location when its
listen() or recognize() method is called for the first time.
You can pre-load the model to reduce the latency of these first calls
by using the ensureModelLoaded() method of the recognizer object.
ensureModelLoaded() also "warms up" the model after it is loaded.
"Warming up" means running a few dummy examples through the
model for inference to make sure that the necessary states are set up, so that
subsequent inferences can be fast.
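For example (a minimal sketch of the usage described above):
import * as speechCommands from '@tensorflow-models/speech-commands';

const recognizer = speechCommands.create('BROWSER_FFT');
// Download the model and warm it up before the first call to
// listen() or recognize().
await recognizer.ensureModelLoaded();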
Transfer learning
Transfer learning is the process of taking a model trained
previously on a dataset (say dataset A) and applying it to a
different dataset (say dataset B).
To achieve transfer learning, the model needs to be slightly modified and
re-trained on dataset B. However, thanks to the training on
the original dataset (A), the training on the new dataset (B) takes much less
time and computational resources, and requires a much smaller amount of
data than the original training did. The modification process involves
removing the top (output) dense layer of the original model and keeping the
"base" of the model. Due to its previous training, the base can be used as a
good feature extractor for any data similar to the original training data.
The removed dense layer is replaced with a new dense layer configured
specifically for the new dataset.
The speech-commands model is suitable for transfer learning on
previously unseen spoken words. The original model has been trained on a
relatively large dataset (~50k examples from 20 classes), so it can be used
for transfer learning on words outside the original vocabulary. We provide an
API to perform this type of transfer learning. The steps are shown in the
example code snippet below:
const baseRecognizer = speechCommands.create('BROWSER_FFT');
await baseRecognizer.ensureModelLoaded();
const transferRecognizer = baseRecognizer.createTransfer('colors');
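// Call collectExample() with a word label to collect an example of that word
// from the microphone; repeated calls build up the training dataset.
// '_background_noise_' is the reserved label for background-noise examples.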
await transferRecognizer.collectExample('red');
await transferRecognizer.collectExample('green');
await transferRecognizer.collectExample('blue');
await transferRecognizer.collectExample('red');
await transferRecognizer.collectExample('_background_noise_');
await transferRecognizer.collectExample('green');
await transferRecognizer.collectExample('blue');
await transferRecognizer.collectExample('_background_noise_');
console.log(transferRecognizer.countExamples());
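// Start training of the transfer-learning model.
// You can specify `epochs` and `callback` as in tf.Model.fit().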
await transferRecognizer.train({
epochs: 25,
callback: {
onEpochEnd: async (epoch, logs) => {
console.log(`Epoch ${epoch}: loss=${logs.loss}, accuracy=${logs.acc}`);
}
}
});
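// After the transfer learning completes, start online listening using the
// vocabulary of the transfer recognizer.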
await transferRecognizer.listen(result => {
const words = transferRecognizer.wordLabels();
for (let i = 0; i < words.length; ++i) {
console.log(`score for word '${words[i]}' = ${result.scores[i]}`);
}
}, {probabilityThreshold: 0.75});
setTimeout(() => transferRecognizer.stopListening(), 10e3);
Serialize examples from a transfer recognizer
Once examples have been collected with a transfer recognizer,
you can export the examples in serialized form with the serializeExamples()
method, e.g.,
const serialized = transferRecognizer.serializeExamples();
serialized is a binary ArrayBuffer amenable to storage and transmission.
It contains the spectrogram data of the examples, as well as metadata such
as word labels.
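For instance, one way to store the serialized examples in the browser is to
trigger a download of a binary file (a sketch using standard web APIs; the
file name is arbitrary):
const blob = new Blob([serialized], {type: 'application/octet-stream'});
const anchor = document.createElement('a');
anchor.href = URL.createObjectURL(blob);
anchor.download = 'colors-examples.bin';  // Hypothetical file name.
anchor.click();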
You can also serialize the examples from a subset of the words in the
transfer recognizer's vocabulary, e.g.,
const serializedWithOnlyFoo = transferRecognizer.serializeExamples('foo');
const serializedWithOnlyFooAndBar = transferRecognizer.serializeExamples(['foo', 'bar']);
The serialized examples can later be loaded into another instance of
transfer recognizer with the loadExamples() method, e.g.,
const clearExisting = false;
newTransferRecognizer.loadExamples(serialized, clearExisting);
With clearExisting set to false, the examples that newTransferRecognizer
already holds are preserved and the loaded examples are added to them.
If clearExisting is true, the existing examples will be cleared first.
If clearExisting is not specified, it defaults to false.
Live demo
A developer-oriented live demo is available at
this address.
How to run the demo from source code
The demo/ folder contains a live demo of the speech-command recognizer.
To run it, do:
cd speech-commands
yarn
yarn publish-local
cd demo
yarn
yarn link-local
yarn watch