Cartesia JavaScript Client
This client provides convenient access to Cartesia's TTS models. Sonic is the fastest text-to-speech model around—it can generate a second of audio in just 650ms, and it can stream out the first audio chunk in just 135ms. Alongside Sonic, we also offer an extensive prebuilt voice library for a variety of use cases.
The JavaScript client is a thin wrapper around the Cartesia API. You can view docs for the API at docs.cartesia.ai.
Installation
npm install @cartesia/cartesia-js
yarn add @cartesia/cartesia-js
pnpm add @cartesia/cartesia-js
bun add @cartesia/cartesia-js
Usage
CRUD on Voices
import Cartesia from "@cartesia/cartesia-js";

const cartesia = new Cartesia({
  apiKey: "your-api-key",
});

// List all voices.
const voices = await cartesia.voices.list();
console.log(voices);

// Get a single voice by ID.
const voice = await cartesia.voices.get("<voice-id>");
console.log(voice);

// Clone a voice from an audio clip. `myFile` is a placeholder for your clip file.
const clonedVoiceEmbedding = await cartesia.voices.clone({
  mode: "clip",
  clip: myFile,
});

// Mix voices together.
const mixedVoiceEmbedding = await cartesia.voices.mix({
  voices: [
    { id: "<voice-id-1>", weight: 0.6 },
    { id: "<voice-id-2>", weight: 0.4 },
  ],
});

// Localize a voice to another language.
const localizedVoiceEmbedding = await cartesia.voices.localize({
  embedding: Array(192).fill(1.0),
  original_speaker_gender: "female",
  language: "es",
});

// Create a voice from an embedding.
const newVoice = await cartesia.voices.create({
  name: "Tim",
  description: "A deep, resonant voice.",
  embedding: Array(192).fill(1.0),
});
console.log(newVoice);
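The mix example above uses weights that add up to 1. If you start from arbitrary relative weights, a small helper can rescale them first. This is only a sketch: `normalizeMixWeights` is a hypothetical name, not part of the client, and whether the API actually requires weights that sum to 1 is an assumption you should check against the API docs.

```javascript
// Hypothetical helper, not part of @cartesia/cartesia-js.
// Rescales relative weights so that they sum to 1 (assumed, not confirmed,
// to be what voices.mix expects).
function normalizeMixWeights(voices) {
  const total = voices.reduce((sum, v) => sum + v.weight, 0);
  if (!(total > 0)) {
    throw new RangeError("weights must sum to a positive number");
  }
  return voices.map((v) => ({ ...v, weight: v.weight / total }));
}
```

The result could then be passed as the `voices` field, for example `cartesia.voices.mix({ voices: normalizeMixWeights(myVoices) })`.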
TTS over WebSocket
import Cartesia from "@cartesia/cartesia-js";

const cartesia = new Cartesia({
  apiKey: "your-api-key",
});

// Initialize the WebSocket. Make sure the output format you specify is supported.
const websocket = cartesia.tts.websocket({
  container: "raw",
  encoding: "pcm_f32le",
  sampleRate: 44100,
});

try {
  await websocket.connect();
} catch (error) {
  console.error(`Failed to connect to Cartesia: ${error}`);
}

// Create a stream.
const response = await websocket.send({
  model_id: "sonic-english",
  voice: {
    mode: "id",
    id: "a0e99841-438c-4a64-b679-ae501e7d6091",
  },
  transcript: "Hello, world!",
});

// Option 1: register a listener. It only sees messages that arrive
// after the listener is registered.
response.on("message", (message) => {
  console.log("Received message:", message);
});

// Option 2: consume messages with a for-await-of loop.
for await (const message of response.events("message")) {
  console.log("Received message:", message);
}
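The WebSocket above is configured for raw `pcm_f32le` output at 44100 Hz, that is, 32-bit little-endian floating-point samples. As a rough sketch of what that format means, the helpers below decode such bytes into samples and compute their duration. The names `decodePcmF32le` and `durationSeconds` are hypothetical and not part of the client. The sketch assumes you already hold the audio as a `Uint8Array` and that it is mono; how the client delivers each chunk is not covered here.

```javascript
// Hypothetical helpers, not part of @cartesia/cartesia-js.

// Decodes raw pcm_f32le bytes (4 bytes per sample, little-endian) into floats.
function decodePcmF32le(bytes) {
  if (bytes.byteLength % 4 !== 0) {
    throw new RangeError("pcm_f32le data must be a multiple of 4 bytes");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const samples = new Float32Array(bytes.byteLength / 4);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getFloat32(i * 4, true); // true = little-endian
  }
  return samples;
}

// Duration in seconds of mono audio at the given sample rate (assumed mono).
function durationSeconds(sampleCount, sampleRate = 44100) {
  return sampleCount / sampleRate;
}
```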
Input Streaming with Contexts
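// Options shared by every request in the same context.
const contextOptions = {
  context_id: "my-context",
  model_id: "sonic-english",
  voice: {
    mode: "id",
    id: "a0e99841-438c-4a64-b679-ae501e7d6091",
  },
};

// Start the context with the first part of the transcript.
const response = await websocket.send({
  ...contextOptions,
  transcript: "Hello, world!",
});

// Continue the same context with more text.
await websocket.continue({
  ...contextOptions,
  transcript: " How are you today?",
});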
See the input streaming docs for more information.
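When the transcript arrives in several pieces, the send-then-continue pattern can be wrapped in a loop. The helper below is a minimal sketch: `streamTranscript` is a hypothetical name, and it only relies on the `send` and `continue` methods used in this section, called with the same context options for every chunk.

```javascript
// Hypothetical helper, not part of @cartesia/cartesia-js.
// Sends the first chunk with send() and every later chunk with continue(),
// reusing the same context options each time.
async function streamTranscript(websocket, contextOptions, chunks) {
  if (chunks.length === 0) {
    throw new Error("at least one transcript chunk is required");
  }
  const [first, ...rest] = chunks;
  const response = await websocket.send({ ...contextOptions, transcript: first });
  for (const chunk of rest) {
    await websocket.continue({ ...contextOptions, transcript: chunk });
  }
  // The response returned by send() is the one to listen on.
  return response;
}
```

With the objects defined earlier, a call might look like `await streamTranscript(websocket, contextOptions, ["Hello, world!", " How are you today?"])`.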
Timestamps
To receive timestamps in responses, set the add_timestamps field in the request object to true.
const response = await websocket.send({
  model_id: "sonic-english",
  voice: {
    mode: "id",
    id: "a0e99841-438c-4a64-b679-ae501e7d6091",
  },
  transcript: "Hello, world!",
  add_timestamps: true,
});
You can then listen for timestamps on the returned response object.
// Option 1: register a listener.
response.on("timestamps", (timestamps) => {
  console.log("Received timestamps for words:", timestamps.words);
  console.log("Words start at:", timestamps.start);
  console.log("Words end at:", timestamps.end);
});

// Option 2: consume timestamps with a for-await-of loop.
for await (const timestamps of response.events("timestamps")) {
  console.log("Received timestamps for words:", timestamps.words);
  console.log("Words start at:", timestamps.start);
  console.log("Words end at:", timestamps.end);
}
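The examples above read `words`, `start`, and `end` from each timestamps object. Assuming these are parallel arrays with one entry per word (an assumption based on how the fields are used here, not a confirmed schema), a hypothetical helper such as `zipWordTimestamps` can pair each word with its timing:

```javascript
// Hypothetical helper, not part of @cartesia/cartesia-js.
// Assumes words, start and end are parallel arrays of equal length.
function zipWordTimestamps(timestamps) {
  const { words, start, end } = timestamps;
  if (words.length !== start.length || words.length !== end.length) {
    throw new RangeError("words, start and end must have the same length");
  }
  return words.map((word, i) => ({ word, start: start[i], end: end[i] }));
}
```

Inside either handler you could then iterate over `zipWordTimestamps(timestamps)` to get one object per word, for example to highlight words during playback.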
Speed and emotion controls [Alpha]
The API has experimental support for speed and emotion controls. This support is not subject to semantic versioning and may change without notice. You can control the speed and emotion of the synthesized speech by setting the speed and emotion fields under voice.__experimental_controls in the request object.
const response = await websocket.send({
  model_id: "sonic-english",
  voice: {
    mode: "id",
    id: "a0e99841-438c-4a64-b679-ae501e7d6091",
    __experimental_controls: {
      speed: "fastest",
      emotion: ["sadness", "surprise:high"],
    },
  },
  transcript: "Hello, world!",
});
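In the request above, each emotion is either a bare name ("sadness") or a name followed by a colon and a level ("surprise:high"). As a small sketch, the hypothetical helpers below build the controls object from name and level pairs. They only format strings in that shape and do not validate names or levels, because the set of accepted values is not listed here.

```javascript
// Hypothetical helpers, not part of @cartesia/cartesia-js.

// Formats one emotion as "name" or "name:level".
function emotionTag(name, level) {
  return level ? `${name}:${level}` : name;
}

// Builds an object shaped like voice.__experimental_controls in the example.
// `emotions` is a list of [name] or [name, level] pairs.
function buildControls(speed, emotions) {
  return {
    speed,
    emotion: emotions.map(([name, level]) => emotionTag(name, level)),
  };
}
```

For instance, `buildControls("fastest", [["sadness"], ["surprise", "high"]])` yields the same controls object as the request above.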
Multilingual TTS [Alpha]
You can define the language of the text you want to synthesize by setting the language field in the request object. Make sure that you are using model_id: "sonic-multilingual" in the request object.
Supported languages are listed at docs.cartesia.ai.
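A multilingual request might be assembled as sketched below. `multilingualRequest` is a hypothetical helper, not part of the client. It places `language` at the top level of the request object as described above and reuses the request shape from the earlier examples.

```javascript
// Hypothetical helper, not part of @cartesia/cartesia-js.
// Builds a request object for the multilingual model with a language code.
function multilingualRequest(voiceId, language, transcript) {
  return {
    model_id: "sonic-multilingual",
    voice: { mode: "id", id: voiceId },
    language,
    transcript,
  };
}
```

The result can be passed to `websocket.send`, for example `await websocket.send(multilingualRequest("<voice-id>", "es", "Hola, mundo!"))`, provided "es" is among the supported languages.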
Playing audio in the browser
(The WebPlayer class only supports playing audio in the browser, and only the raw PCM format with pcm_f32le encoding.)
import { WebPlayer } from "@cartesia/cartesia-js";
console.log("Playing stream...");
const player = new WebPlayer();
await player.play(response.source);
console.log("Done playing.");
React
We export a React hook that simplifies the process of using the TTS API. The hook manages the WebSocket connection and provides a simple interface for buffering, playing, pausing and restarting audio.
import { useState } from "react";
import { useTTS } from "@cartesia/cartesia-js/react";

function TextToSpeech() {
  const tts = useTTS({
    apiKey: "your-api-key",
    sampleRate: 44100,
  });

  const [text, setText] = useState("");

  const handlePlay = async () => {
    // Buffer the audio for the current text.
    await tts.buffer({
      model_id: "sonic-english",
      voice: {
        mode: "id",
        id: "a0e99841-438c-4a64-b679-ae501e7d6091",
      },
      transcript: text,
    });

    // Play the buffered audio.
    await tts.play();
  };

  return (
    <div>
      <input type="text" value={text} onChange={(event) => setText(event.target.value)} />
      <button onClick={handlePlay}>Play</button>
      <div>
        {/* String() is used because React does not render boolean values. */}
        {tts.playbackStatus} | {tts.bufferStatus} | {String(tts.isWaiting)}
      </div>
    </div>
  );
}