
A set of tools for integration with the Spokestack API in Node.js 
Installation
$ npm install spokestack --save
Features
Spokestack has all the tools you need to build amazing user experiences for speech. Here are some of the features included in node-spokestack:
- Automatic Speech Recognition (ASR): We provide multiple ways to hook up either Spokestack ASR or Google Cloud Speech to your node/express server, including asr functions for one-off ASR requests and websocket server integrations for ASR streaming. Or, use the ASR services directly for more advanced integrations.
- Text-to-Speech: Through the use of our GraphQL API (see below), Spokestack offers multiple ways to generate voice audio from text. Send raw text, speech markdown, or SSML and get back a URL for audio to play in the browser.
- Wake word and Keyword: Wake word and keyword processing are supported through the use of our speech pipeline (see startPipeline). One of the most powerful features we provide is the ability to define and train custom wake word and keyword models directly on spokestack.io. When training is finished, we host the model files for you on a CDN. Pass the CDN URLs to startPipeline() and the Speech Pipeline will start listening. These same models can be used in spokestack-python, spokestack-android, spokestack-ios, and react-native-spokestack. The pipeline uses a web worker in the browser to keep all of the speech processing off the main thread so your UI never gets blocked. NOTE: The speech pipeline (specifically tensorflow's webgl backend) currently only works in Blink browsers (Chrome, Edge, Opera, Vivaldi, Brave, and most Android browsers) as it requires the use of the experimental OffscreenCanvas API. Firefox is close to full support for that API, and we'll look into supporting Firefox when that's available.
- Natural Language Understanding (NLU): The GraphQL API (see below) also provides a way to convert the text from ASR to actionable "intents", or functions that apps can understand. For instance, if a user says, "Find a recipe for chocolate cake", an NLU might return a "SEARCH_RECIPE" intent. To use the NLU, you'll need an NLU model. While we have plans to release an NLU editor, the best way right now to create an NLU model is to use Alexa, DialogFlow, or Jovo and upload the exported model to your Spokestack account. We support exports from all of those platforms.
This repo includes an example app that demonstrates ASR, text-to-speech, and wake word and keyword processing. It also includes a route for viewing live docs (or "introspection") of the Spokestack API (/graphql).
The GraphQL API
Text-to-speech and NLU are available through Spokestack's GraphQL API at https://api.spokestack.io/v1. It requires Spokestack credentials to access (creating an account is quick and free).
To use the GraphQL API, node-spokestack includes Express middleware to help integrate a proxy into any node/express server. A proxy is necessary to avoid exposing your Spokestack credentials.
The API is used to synthesize text-to-speech using various methods including raw text, speech markdown, and SSML.
It can also be used for NLU classification.

Automatic Speech Recognition (ASR)
ASR is accomplished through the use of a websocket (rather than GraphQL). node-spokestack includes functions to use either Spokestack ASR or Google Cloud Speech, and there are two functions for each platform:
- A helper function for adding a websocket to a node server (express or otherwise). This is the main way to use ASR.
- A function for processing speech into text in one-off requests. This is useful if you have all of the speech up-front.
Using Google ASR instead of Spokestack ASR
If you'd prefer to use Google ASR, follow these instructions for setting up Google Cloud Speech. Ensure GOOGLE_APPLICATION_CREDENTIALS is set in your environment, and then use the googleASR and googleASRSocketServer functions instead of their Spokestack equivalents.
Wake Word and Keyword (Speech Pipeline)
The speech pipeline uses a custom build of Tensorflow JS in a Web Worker to process speech. It notifies the user when something matches the specified wake word or keyword models. The main function for this is the startPipeline() function. To use startPipeline(), you'll need to serve the web worker and tensorflow from your node/express server. Our example next.js app demonstrates how you might accomplish this in express:
app.use(
  '/spokestack-web-worker.js',
  express.static('./node_modules/spokestack/dist/spokestack-web-worker.min.js')
)
With these made available to your front-end, the speech pipeline can be started.
Another option is to copy the file from node_modules to your static/public folder during your build process.
"scripts": {
  "copy:spokestack": "cp node_modules/spokestack/dist/spokestack-web-worker.min.js public/spokestack-web-worker.js",
  "build": "npm run copy:spokestack && next build"
}
Setup
Go to spokestack.io and create an account. Create a token at spokestack.io/account/settings#api. Note that you'll only be able to see the token secret once. If you accidentally leave the page, create another token. Once you have a token, set the following environment variables in your .bash_profile or .zshenv:
export SS_API_CLIENT_ID=
export SS_API_CLIENT_SECRET=
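A server can check for these variables at startup so that a missing credential shows up as a clear warning rather than a failed request later. The following is a minimal sketch; `missingSpokestackEnv` is a hypothetical helper written for this example and is not part of spokestack.

```javascript
// Names of the environment variables from the setup step above.
const REQUIRED_ENV_VARS = ['SS_API_CLIENT_ID', 'SS_API_CLIENT_SECRET']

// Hypothetical helper (not part of spokestack): returns the names of the
// required variables that are unset or empty in the given environment.
function missingSpokestackEnv(env = process.env) {
  return REQUIRED_ENV_VARS.filter((name) => !env[name])
}

const missing = missingSpokestackEnv()
if (missing.length > 0) {
  console.warn(`Missing Spokestack credentials: ${missing.join(', ')}`)
}
```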
Convenience functions for Node.js servers
spokestackMiddleware
▸ spokestackMiddleware(): function
Express middleware for adding a proxy to the Spokestack GraphQL API.
A proxy is necessary to avoid exposing your Spokestack token secret on the client.
Once a graphql route is in place, your client
can use that with GraphQL.
import { spokestackMiddleware } from 'spokestack'
import bodyParser from 'body-parser'
import express from 'express'
const expressApp = express()
expressApp.post('/graphql', bodyParser.json(), spokestackMiddleware())
This is also convenient for setting up graphiql introspection.
An example fetcher for graphiql on the client (browser only) might look like this:
const graphQLFetcher = (graphQLParams) =>
  fetch('/graphql', {
    method: 'post',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(graphQLParams)
  })
    .then((response) => response.json())
    .catch((response) => response.text())
Returns: (req: Request, res: Response) => void
Defined in: server/expressMiddleware.ts:37
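Outside of graphiql, a client can post queries to the proxied route in the same way. The sketch below builds the fetch options for such a request. `graphQLRequestInit` is a hypothetical helper, and the query is the standard GraphQL introspection query rather than a Spokestack-specific one, so it makes no assumptions about field names in the Spokestack schema.

```javascript
// Hypothetical helper (not part of spokestack): builds the options for a
// fetch() call that posts a GraphQL query to the proxied /graphql route.
function graphQLRequestInit(query, variables = {}) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query, variables })
  }
}

// A standard GraphQL introspection query; it does not depend on any
// Spokestack-specific field names.
const introspection = '{ __schema { queryType { name } } }'
const init = graphQLRequestInit(introspection)

// In the browser: fetch('/graphql', init).then((res) => res.json())
console.log(init.body)
```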
asrSocketServer
▸ asrSocketServer(serverConfig: WebSocket.ServerOptions, asrConfig?: Omit<SpokestackASRConfig, sampleRate>): void
Adds a web socket server to the given HTTP server
to stream ASR using Spokestack ASR.
This uses the "ws" node package for the socket server.
import { createServer } from 'http'
import { asrSocketServer } from 'spokestack'
const port = parseInt(process.env.PORT || '3000', 10)
const server = createServer()
// Attach the ASR websocket server to the HTTP server
asrSocketServer({ server })
server.listen(port, () => {
  console.log(`Listening at http://localhost:${port}`)
})
Parameters:
serverConfig | WebSocket.ServerOptions |
asrConfig | Omit<SpokestackASRConfig, sampleRate> |
Returns: void
Defined in: server/socketServer.ts:23
SpokestackASRConfig
format
• Optional format: LINEAR16
Defined in: server/spokestackASRService.ts:9
language
• Optional language: en
Defined in: server/spokestackASRService.ts:10
limit
• Optional limit: number
Defined in: server/spokestackASRService.ts:11
sampleRate
• sampleRate: number
Defined in: server/spokestackASRService.ts:12
spokestackUrl
• Optional spokestackUrl: string
Set a different location for the Spokestack socket URL.
This is very rarely needed.
Default: 'wss://api.spokestack.io/v1/asr/websocket'
Defined in: server/spokestackASRService.ts:27
timeout
• Optional timeout: number
Reset speech recognition and clear the transcript every timeout
milliseconds.
When no new data comes in for the given timeout,
the auth message is sent again to begin a new ASR transaction.
Set to 0 to disable.
Default: 3000
Defined in: server/spokestackASRService.ts:21
asr
▸ asr(content: string | Uint8Array, sampleRate: number): Promise<string | null>
A one-off method for processing speech to text
using Spokestack ASR.
import fileUpload from 'express-fileupload'
import { asr } from 'spokestack'
import express from 'express'
const expressApp = express()
expressApp.post('/asr', fileUpload(), (req, res) => {
  const sampleRate = Number(req.body.sampleRate)
  const audio = req.files.audio
  if (isNaN(sampleRate)) {
    res.status(400)
    res.send('Parameter required: "sampleRate"')
    return
  }
  if (!audio) {
    res.status(400)
    res.send('Parameter required: "audio"')
    return
  }
  asr(Buffer.from(audio.data.buffer), sampleRate)
    .then((text) => {
      res.status(200)
      res.json({ text })
    })
    .catch((error: Error) => {
      console.error(error)
      res.status(500)
      res.send('Unknown error during speech recognition. Check server logs.')
    })
})
Parameters:
content | string | Uint8Array |
sampleRate | number |
Returns: Promise<string | null>
Defined in: server/asr.ts:43
googleASRSocketServer
▸ googleASRSocketServer(serverConfig: WebSocket.ServerOptions): void
Adds a web socket server to the given HTTP server
to stream ASR using Google Speech.
This uses the "ws" node package for the socket server.
import { createServer } from 'http'
import { googleASRSocketServer } from 'spokestack'
const port = parseInt(process.env.PORT || '3000', 10)
const server = createServer()
// Attach the Google ASR websocket server to the HTTP server
googleASRSocketServer({ server })
server.listen(port, () => {
  console.log(`Listening at http://localhost:${port}`)
})
Parameters:
serverConfig | WebSocket.ServerOptions |
Returns: void
Defined in: server/socketServer.ts:108
googleASR
▸ googleASR(content: string | Uint8Array, sampleRate: number): Promise<string | null>
A one-off method for processing speech to text
using Google Speech.
import fileUpload from 'express-fileupload'
import { googleASR } from 'spokestack'
import express from 'express'
const expressApp = express()
expressApp.post('/asr', fileUpload(), (req, res) => {
  const sampleRate = Number(req.body.sampleRate)
  const audio = req.files.audio
  if (isNaN(sampleRate)) {
    res.status(400)
    res.send('Parameter required: "sampleRate"')
    return
  }
  if (!audio) {
    res.status(400)
    res.send('Parameter required: "audio"')
    return
  }
  googleASR(Buffer.from(audio.data.buffer), sampleRate)
    .then((text) => {
      res.status(200)
      res.json({ text })
    })
    .catch((error: Error) => {
      console.error(error)
      res.status(500)
      res.send('Unknown error during speech recognition. Check server logs.')
    })
})
Parameters:
content | string | Uint8Array |
sampleRate | number |
Returns: Promise<string | null>
Defined in: server/asr.ts:108
spokestackASRService
▸ spokestackASRService(config: SpokestackASRConfig, onData: (response: SpokestackResponse) => void): Promise<WebSocket>
A low-level utility for working with the Spokestack ASR service directly.
This should not be used most of the time. It is only for
custom, advanced integrations.
See asr for one-off ASR and asrSocketServer for ASR streaming using
a websocket server that can be added to any node server.
Parameters:
config | SpokestackASRConfig |
onData | (response: SpokestackResponse) => void |
Returns: Promise<WebSocket>
Defined in: server/spokestackASRService.ts:74
SpokestackResponse
error
• Optional error: string
When the status is "error", the error message is available here.
Defined in: server/spokestackASRService.ts:48
final
• final: boolean
The final key is used to indicate that
the highest confidence transcript for the utterance is sent.
However, this will only be set to true after
signaling to Spokestack ASR that no more audio data is incoming.
Signal this by sending an empty buffer (e.g. socket.send(Buffer.from(''))).
See the source for asr for an example.
Defined in: server/spokestackASRService.ts:57
hypotheses
• hypotheses: ASRHypothesis[]
This is a list of transcripts, each associated with their own
confidence level from 0 to 1.
It is an array to allow for the possibility of multiple
transcripts in the API, but is almost always a list of one.
Defined in: server/spokestackASRService.ts:64
status
• status: ok | error
Defined in: server/spokestackASRService.ts:46
ASRHypothesis
confidence
• confidence: number
A number between 0 and 1 to indicate the
tensorflow confidence level for the given transcript.
Defined in: server/spokestackASRService.ts:41
transcript
• transcript: string
Defined in: server/spokestackASRService.ts:42
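An onData callback typically needs to pull a transcript out of each response and tell interim results apart from the final one. The sketch below works on plain objects shaped like SpokestackResponse and ASRHypothesis as described above. `bestTranscript` and the `onData` handler are hypothetical names for this example, and choosing the highest-confidence hypothesis is one reasonable policy rather than something the library prescribes.

```javascript
// Hypothetical helper (not part of spokestack): picks the transcript with the
// highest confidence from an object shaped like SpokestackResponse.
function bestTranscript(response) {
  if (response.status === 'error') {
    throw new Error(response.error || 'Unknown ASR error')
  }
  const hypotheses = response.hypotheses || []
  if (hypotheses.length === 0) {
    return null
  }
  const best = hypotheses.reduce((a, b) => (b.confidence > a.confidence ? b : a))
  return best.transcript
}

// Example onData callback: log interim transcripts, keep the final one.
let finalTranscript = null
function onData(response) {
  const transcript = bestTranscript(response)
  if (response.final) {
    finalTranscript = transcript
  } else {
    console.log('Interim:', transcript)
  }
}

// Simulated responses standing in for messages from the ASR service.
onData({
  status: 'ok',
  final: false,
  hypotheses: [{ confidence: 0.4, transcript: 'find a' }]
})
onData({
  status: 'ok',
  final: true,
  hypotheses: [
    { confidence: 0.6, transcript: 'find a recipe' },
    { confidence: 0.9, transcript: 'find a recipe for chocolate cake' }
  ]
})
console.log('Final:', finalTranscript)
```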
ASRFormat
• LINEAR16: = "PCM16LE"
Defined in: server/spokestackASRService.ts:5
encryptSecret
▸ encryptSecret(body: string): string
This is a convenience method for properly authorizing
requests to the Spokestack graphql API.
Note: Do not expose your key's secret on the client.
This should only be done on the server.
See server/expressMiddleware.ts
for example usage.
Parameters:
body | string |
Returns: string
Defined in: server/encryptSecret.ts:13
Convenience functions for the client
These functions are available exports from spokestack/client.
record
▸ record(config?: RecordConfig): Promise<AudioBuffer>
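A method to record audio for a given number of seconds.
import { record } from 'spokestack/client'
// Record for the default 3 seconds and return an AudioBuffer
let buffer = await record()
// Record for 5 seconds, logging the time remaining each second
buffer = await record({
  time: 5,
  onProgress: (remaining) => {
    console.log(`Recording..${remaining}`)
  }
})
// Record for 5 seconds, logging when recording starts
buffer = await record({
  time: 5,
  onStart: () => {
    console.log('Recording started')
  }
})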
Then create a file for uploading.
See googleASR for an example of how
to process the resulting audio file.
import { convertFloat32ToInt16 } from 'spokestack/client'
const sampleRate = buffer.sampleRate
const file = new File(
  [convertFloat32ToInt16(buffer.getChannelData(0))],
  'recording.raw'
)
The file can then be uploaded using FormData:
const formData = new FormData()
formData.append('sampleRate', sampleRate + '')
formData.append('audio', file)
fetch('/asr', {
  method: 'POST',
  body: formData,
  headers: { Accept: 'application/json' }
})
  .then((res) => {
    if (!res.ok) {
      console.log(`Response status: ${res.status}`)
    }
    return res.json()
  })
  .then(({ text }) => console.log('Processed speech', text))
  .catch(console.error.bind(console))
Parameters:
config | RecordConfig |
Returns: Promise<AudioBuffer>
Defined in: client/record.ts:84
RecordConfig
onProgress
• Optional onProgress: (remaining: number) => void
A callback function to be called each second of recording.
Parameters:
remaining | number |
Returns: void
Defined in: client/record.ts:16
onStart
• Optional onStart: () => void
A callback function to be called when recording starts
Returns: void
Defined in: client/record.ts:14
time
• Optional time: number
The total time to record, in seconds. Default: 3
Defined in: client/record.ts:12
startStream
▸ startStream(__namedParameters: StartStreamOptions): Promise<[WebSocket, ProcessorReturnValue]>
Starts recording and streams the audio over a native WebSocket.
This assumes the socket is hosted on the current server.
import { startStream } from 'spokestack/client'
try {
  const [ws] = await startStream({
    isPlaying: () => this.isPlaying
  })
  ws.addEventListener('open', () => console.log('Recording started'))
  ws.addEventListener('close', () => console.log('Recording stopped'))
  ws.addEventListener('message', (e) => console.log('Speech processed: ', e.data))
} catch (e) {
  console.error(e)
}
Parameters:
__namedParameters | StartStreamOptions |
Returns: Promise<[WebSocket, ProcessorReturnValue]>
Defined in: client/recordStream.ts:43
stopStream
▸ stopStream(): void
Stop the current recording stream if one exists.
import { stopStream } from 'spokestack/client'
stopStream()
Returns: void
Defined in: client/recordStream.ts:96
convertFloat32ToInt16
▸ convertFloat32ToInt16(fp32Samples: Float32Array): Int16Array
A utility method to convert Float32Array audio
to an Int16Array to be passed directly to Speech APIs
such as Google Speech
import { convertFloat32ToInt16, record } from 'spokestack/client'
const buffer = await record()
const file = new File([convertFloat32ToInt16(buffer.getChannelData(0))], 'recording.raw')
Parameters:
fp32Samples | Float32Array |
Returns: Int16Array
Defined in: client/convertFloat32ToInt16.ts:16
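For readers unfamiliar with the conversion, the following sketch shows one common way to turn floating-point samples in the range [-1, 1] into 16-bit PCM. It is an illustrative re-implementation under that assumption; `floatToInt16` is a hypothetical name, and the library's own convertFloat32ToInt16 may scale or round differently.

```javascript
// Illustrative re-implementation; the library's own code may differ.
function floatToInt16(samples) {
  const out = new Int16Array(samples.length)
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] before scaling so out-of-range input cannot overflow.
    const s = Math.max(-1, Math.min(1, samples[i]))
    // Negative values scale to -32768, positive values to 32767.
    // Assigning to an Int16Array truncates any fractional part.
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff
  }
  return out
}

const pcm = floatToInt16(new Float32Array([0, 1, -1, 0.5, 2]))
console.log(Array.from(pcm)) // [0, 32767, -32768, 16383, 32767]
```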
startPipeline
▸ startPipeline(config: PipelineConfig): Promise<SpeechPipeline>
Create and immediately start a SpeechPipeline to process user
speech using the specified configuration.
To simplify configuration, preset pipeline profiles are provided and
can be passed in the config object's profile key. See
PipelineProfile for more details.
NOTE: The speech pipeline (specifically tensorflow's webgl backend)
currently only works in Blink browsers
(Chrome, Edge, Opera, Vivaldi, Brave, and most Android browsers)
as it requires the use of the experimental
OffscreenCanvas API.
First make sure to serve the web worker and tensorflow.js
from your node server at the expected locations.
For example, with express:
// On the server: serve the web worker
app.use(
  '/spokestack-web-worker.js',
  express.static(`./node_modules/spokestack/dist/spokestack-web-worker.min.js`)
)
// On the client: start the pipeline and respond to its events
try {
  await startPipeline({
    profile: PipelineProfile.Wakeword,
    baseUrls: { wakeword: 'https://s.spokestack.io/u/hgmYb/js' },
    onEvent: (event) => {
      switch (event.type) {
        case SpeechEventType.Activate:
          this.setState({ wakeword: { error: '', result: true } })
          break
        case SpeechEventType.Timeout:
          this.setState({ wakeword: { error: 'timeout' } })
          break
        case SpeechEventType.Error:
          console.error(event.error)
          break
      }
    }
  })
} catch (e) {
  console.error(e)
}
Parameters:
config | PipelineConfig |
Returns: Promise<SpeechPipeline>
Defined in: client/pipeline.ts:161
SpeechPipeline
Spokestack's speech pipeline comprises a voice activity detection (VAD)
component and a series of stages that manage voice interaction.
Audio is processed off the main thread, currently via a
ScriptProcessorNode and web worker. Each chunk of audio samples is
passed to the worker along with an indication of speech activity, and
each of the stages processes it in order to, e.g., detect whether the user
said a wakeword or transcribe an occurrence of a keyword. See documentation
for the individual stages for more information on their purpose.
+ new SpeechPipeline(config: SpeechPipelineConfig): SpeechPipeline
Create a new speech pipeline.
Parameters:
config | SpeechPipelineConfig | A SpeechPipelineConfig object describing basic pipeline configuration as well as options specific to certain stages (URLs to models, classes for keyword models, etc.). |
Returns: SpeechPipeline
Defined in: client/SpeechPipeline.ts:40
Methods
▸ start(): Promise<SpeechPipeline>
Start processing audio with the pipeline. If this is the first use of the
pipeline, the microphone permission will be requested from the user if
they have not already granted it.
Returns: Promise<SpeechPipeline>
Defined in: client/SpeechPipeline.ts:85
▸ stop(): void
Stop the pipeline, destroying the internal audio processors and
relinquishing the microphone.
Returns: void
Defined in: client/SpeechPipeline.ts:206
SpeechPipelineConfig
onEvent
• Optional onEvent: PipelineEventHandler
Defined in: client/SpeechPipeline.ts:19
speechConfig
• speechConfig: SpeechConfig
Defined in: client/SpeechPipeline.ts:16
stages
• stages: Stage[]
Defined in: client/SpeechPipeline.ts:17
workerUrl
• Optional workerUrl: string
Defined in: client/SpeechPipeline.ts:18
PipelineProfile
Preset profiles for use with startPipeline that include both
default configuration and lists of processing stages. Individual
stages may require additional configuration that cannot be provided
automatically, so see each stage for more details. The stages used
by each profile are as follows:
- Keyword: VadTrigger and KeywordRecognizer:
actively listens for any user speech and delivers a transcript if
a keyword is recognized.
- Wakeword: WakewordTrigger:
listens passively until a wakeword is recognized, then activates the
pipeline so that ASR can be performed.
• Keyword: = "KEYWORD"
A profile that activates on voice activity and transcribes speech
using pretrained keyword recognizer models that support a limited
vocabulary.
Defined in: client/pipeline.ts:30
• Wakeword: = "WAKEWORD"
A profile that sends an Activate event when a wakeword is detected
by a set of pretrained wakeword models. Once that event is received,
subsequent audio should be sent to a speech recognizer for transcription.
Defined in: client/pipeline.ts:36
SpeechEventType
• Activate: = "ACTIVATE"
Defined in: client/types.ts:83
• Deactivate: = "DEACTIVATE"
Defined in: client/types.ts:84
• Error: = "ERROR"
Defined in: client/types.ts:87
• Recognize: = "RECOGNIZE"
Defined in: client/types.ts:86
• Timeout: = "TIMEOUT"
Defined in: client/types.ts:85
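An onEvent handler often reduces to mapping each event type to a UI state change. The sketch below expresses that as a pure function so it can be used with any framework. The string values are copied from the listing above. `reduceSpeechEvent`, the state shape, and the `transcript` field on Recognize events are assumptions made for this example, as is the reading of Deactivate as "stop listening".

```javascript
// String values copied from the SpeechEventType listing above.
const SpeechEventType = {
  Activate: 'ACTIVATE',
  Deactivate: 'DEACTIVATE',
  Error: 'ERROR',
  Recognize: 'RECOGNIZE',
  Timeout: 'TIMEOUT'
}

// Hypothetical reducer: returns the next UI state for a pipeline event.
function reduceSpeechEvent(state, event) {
  switch (event.type) {
    case SpeechEventType.Activate:
      return { ...state, listening: true, error: '' }
    case SpeechEventType.Deactivate:
      return { ...state, listening: false }
    case SpeechEventType.Timeout:
      return { ...state, listening: false, error: 'timeout' }
    case SpeechEventType.Recognize:
      // Assumes the event carries a transcript field.
      return { ...state, transcript: event.transcript }
    case SpeechEventType.Error:
      return { ...state, listening: false, error: String(event.error) }
    default:
      return state
  }
}

let state = { listening: false, error: '', transcript: '' }
state = reduceSpeechEvent(state, { type: SpeechEventType.Activate })
state = reduceSpeechEvent(state, { type: SpeechEventType.Recognize, transcript: 'play' })
console.log(state)
```

Keeping the handler pure means the same function can back a React setState call, a store update, or a plain variable.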
Stage
• KeywordRecognizer: = "keyword"
Defined in: client/types.ts:100
• VadTrigger: = "vadTrigger"
Defined in: client/types.ts:98
• WakewordTrigger: = "wakeword"
Defined in: client/types.ts:99
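The relationship between profiles and stages described under PipelineProfile can be written down as a lookup table. The sketch below uses local constants whose values are copied from the listings above. `PROFILE_STAGES` and `stagesFor` are hypothetical names, and the mapping follows this document's description rather than the library source.

```javascript
// Values copied from the Stage and PipelineProfile listings above.
const Stage = {
  KeywordRecognizer: 'keyword',
  VadTrigger: 'vadTrigger',
  WakewordTrigger: 'wakeword'
}
const PipelineProfile = { Keyword: 'KEYWORD', Wakeword: 'WAKEWORD' }

// Stages per profile, as described in the PipelineProfile section.
const PROFILE_STAGES = {
  [PipelineProfile.Keyword]: [Stage.VadTrigger, Stage.KeywordRecognizer],
  [PipelineProfile.Wakeword]: [Stage.WakewordTrigger]
}

// Hypothetical helper: returns a copy of the stage list for a profile.
function stagesFor(profile) {
  const stages = PROFILE_STAGES[profile]
  if (!stages) {
    throw new Error(`Unknown pipeline profile: ${profile}`)
  }
  return [...stages]
}

console.log(stagesFor(PipelineProfile.Keyword))
```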
stopPipeline
▸ stopPipeline(): void
Stop the speech pipeline and relinquish its resources,
including the microphone.
stopPipeline()
Returns: void
Defined in: client/pipeline.ts:195
countdown
▸ countdown(time: number, progress: (remaining: number) => void, complete: () => void): void
Countdown a number of seconds.
This is used by record() to record a certain number of seconds.
Parameters:
time | number | Number of seconds |
progress | (remaining: number) => void | Callback for each second (includes first second) |
complete | () => void | Callback for completion |
Returns: void
Defined in: client/countdown.ts:8
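To make the callback order concrete, here is a rough sketch of a countdown with the same parameters. It is not the library's implementation. `countdownSketch` and its extra `schedule` parameter are hypothetical, added so the example can run instantly with an immediate scheduler instead of waiting on real timers.

```javascript
// Illustrative sketch only; the library's countdown() may be implemented
// differently. The schedule parameter is an addition for this example.
function countdownSketch(time, progress, complete, schedule = setTimeout) {
  let remaining = time
  const tick = () => {
    if (remaining <= 0) {
      complete()
      return
    }
    // Called once per second, starting immediately with the full time.
    progress(remaining)
    remaining -= 1
    schedule(tick, 1000)
  }
  tick()
}

// Demonstration with an immediate scheduler so the example finishes at once.
const seen = []
countdownSketch(
  3,
  (remaining) => seen.push(remaining),
  () => seen.push('done'),
  (fn) => fn()
)
console.log(seen) // [3, 2, 1, 'done']
```

With the default setTimeout scheduler, the same call reports 3, 2, and 1 at one-second intervals and calls complete after three seconds.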
Low-level processor functions
These are low-level functions if you need to work with your own audio processors, available from spokestack/client.
startProcessor
▸ startProcessor(): Promise<[Error] | [null, ProcessorReturnValue]>
Underlying utility method for recording audio,
used by the record and recordStream methods.
While createScriptProcessor is deprecated, the replacement (AudioWorklet)
does not yet have broad support (currently only supported in Blink browsers).
See https://caniuse.com/#feat=mdn-api_audioworkletnode
We'll switch to AudioWorklet when it does.
Returns: Promise<[Error] | [null, ProcessorReturnValue]>
Defined in: client/processor.ts:22
ProcessorReturnValue
context
• context: AudioContext
Defined in: client/processor.ts:8
processor
• processor: ScriptProcessorNode
Defined in: client/processor.ts:9
stopProcessor
▸ stopProcessor(): void
Underlying utility method to stop the current processor
if it exists and disconnect the microphone.
Returns: void
Defined in: client/processor.ts:53
concatenateAudioBuffers
▸ concatenateAudioBuffers(buffer1: AudioBuffer | null, buffer2: AudioBuffer | null, context: AudioContext): null | AudioBuffer
A utility method to concatenate two AudioBuffers
Parameters:
buffer1 | AudioBuffer | null |
buffer2 | AudioBuffer | null |
context | AudioContext |
Returns: null | AudioBuffer
Defined in: client/concatenateAudioBuffers.ts:4
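The core of concatenating audio is copying one channel's samples after another's. The sketch below shows that step on raw Float32Array channel data, which runs outside the browser. `concatChannelData` is a hypothetical helper, and its null handling mirrors the signature above as an assumption. The real function works on AudioBuffer objects and needs an AudioContext.

```javascript
// Hypothetical helper: concatenates two channels of samples.
// Returns the other argument when one is null, and null when both are.
function concatChannelData(a, b) {
  if (!a) return b
  if (!b) return a
  const out = new Float32Array(a.length + b.length)
  out.set(a, 0) // first buffer's samples at the start
  out.set(b, a.length) // second buffer's samples immediately after
  return out
}

const joined = concatChannelData(
  new Float32Array([0.25, 0.5]),
  new Float32Array([0.75])
)
console.log(Array.from(joined)) // [0.25, 0.5, 0.75]
```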