Cartesia
Cartesia Sonic text-to-speech with SSML, voice cloning, and audio tags.
| Prefix | cartesia |
| Default model | sonic-3 |
| Env var | CARTESIA_API_KEY |
| Official docs | docs.cartesia.ai |
Models
| Model | Streaming | Audio Tags | Voice Cloning | Notes |
|---|---|---|---|---|
| sonic-3 | Yes | Yes (via SSML) | Yes | Current flagship; emotion tags |
| sonic-2 | Yes | No | No | Previous generation |
Usage
```typescript
import { generateSpeech } from "@speech-sdk/core"

const result = await generateSpeech({
  model: "cartesia/sonic-3",
  text: "Hello from SpeechSDK!",
  voice: "a0e99841-438c-4a64-b679-ae501e7d6091",
})
```

The default output is audio/wav at 44.1 kHz.
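To confirm the sample rate of a returned WAV payload, the header can be inspected directly. The following is a minimal sketch that assumes a canonical 44-byte header, with the `fmt ` chunk immediately after the RIFF header; `readWavInfo` and `WavInfo` are hypothetical names, and how the audio bytes are exposed on `result` is not covered here, so the function simply takes a `Uint8Array`.

```typescript
// Minimal sketch: read format fields from a canonical 44-byte WAV header.
// Assumes the "fmt " chunk directly follows the RIFF header, which is common
// but not guaranteed by the RIFF format.
interface WavInfo {
  channels: number
  sampleRate: number
  bitsPerSample: number
}

function readWavInfo(bytes: Uint8Array): WavInfo {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)

  // Read a four-character chunk tag at the given byte offset.
  const tag = (offset: number): string =>
    String.fromCharCode(
      view.getUint8(offset),
      view.getUint8(offset + 1),
      view.getUint8(offset + 2),
      view.getUint8(offset + 3),
    )

  if (bytes.byteLength < 44) {
    throw new Error("too short for a canonical WAV header")
  }
  if (tag(0) !== "RIFF" || tag(8) !== "WAVE" || tag(12) !== "fmt ") {
    throw new Error("not a canonical WAV header")
  }

  // All multi-byte WAV header fields are little-endian.
  return {
    channels: view.getUint16(22, true),
    sampleRate: view.getUint32(24, true),
    bitsPerSample: view.getUint16(34, true),
  }
}
```

For the default output described above, `sampleRate` would be expected to read 44100.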
Audio Tags
sonic-3 supports audio tags through two paths:
- Emotion tags ([happy], [sad], [angry], [excited], etc.) are converted to Cartesia's SSML <emotion> elements.
- [laughter] is passed through natively.
- Unknown tags are stripped with a warning.
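The rules above can be sketched as a small text transform. This is an illustration only, not the SDK's actual implementation: `convertAudioTags`, the `EMOTIONS` set (limited to the four tags named above), and the exact `<emotion value="..."/>` attribute shape are all assumptions made for the example.

```typescript
// Illustrative only: not the SDK's real conversion logic.
// EMOTIONS is a hypothetical subset; the text above lists these four plus "etc.".
const EMOTIONS = new Set(["happy", "sad", "angry", "excited"])
// Tags forwarded to the provider unchanged.
const PASSTHROUGH = new Set(["laughter"])

function convertAudioTags(text: string): { text: string; warnings: string[] } {
  const warnings: string[] = []
  const converted = text.replace(/\[([a-z_]+)\]/gi, (match: string, name: string) => {
    const tag = name.toLowerCase()
    // Emotion tags become SSML elements (attribute shape is an assumption).
    if (EMOTIONS.has(tag)) return `<emotion value="${tag}"/>`
    // Natively supported tags are left as written.
    if (PASSTHROUGH.has(tag)) return match
    // Anything else is removed and reported.
    warnings.push(`Unknown audio tag stripped: ${match}`)
    return ""
  })
  return { text: converted, warnings }
}
```

Collecting warnings in the return value, rather than logging them, keeps the sketch free of side effects; a real integration might surface them differently.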
```typescript
await generateSpeech({
  model: "cartesia/sonic-3",
  text: "[happy] What a lovely day! [laughter] I can't believe it.",
  voice: "a0e99841-438c-4a64-b679-ae501e7d6091",
})
```

Voice Cloning
sonic-3 supports inline voice cloning from reference audio. See Voice Cloning for details.
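The reference audio is supplied as a base64 string. As a rough sketch of producing one from raw audio bytes, assuming a runtime with the global `btoa` (browsers and recent Node.js versions): `toBase64` is a hypothetical helper, and whether the SDK expects plain base64 or some other wrapping is not stated here.

```typescript
// Sketch: encode raw audio bytes as a base64 string.
// Relies on the global btoa, which operates on "binary strings"
// where each character code is one byte.
function toBase64(bytes: Uint8Array): string {
  let binary = ""
  bytes.forEach((byte) => {
    binary += String.fromCharCode(byte)
  })
  return btoa(binary)
}
```

Building the string byte by byte avoids spreading a large array into `String.fromCharCode`, which can exceed argument limits for long recordings.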
```typescript
await generateSpeech({
  model: "cartesia/sonic-3",
  text: "Hello in a cloned voice!",
  voice: { audio: "base64-encoded-audio..." },
})
```

Provider Options
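The `output_format` option used in this section's example sets the encoding and sample rate, which together determine the raw audio data rate. As a worked example, assuming mono audio (an assumption, not stated on this page): `rawAudioBytes` and the `BYTES_PER_SAMPLE` table are hypothetical, and the table covers only two encodings whose sample widths follow directly from their names.

```typescript
// Bytes per sample for two PCM encodings (widths implied by the names:
// s16le = signed 16-bit, f32le = 32-bit float). Illustrative, not exhaustive.
const BYTES_PER_SAMPLE: Record<string, number> = {
  pcm_s16le: 2,
  pcm_f32le: 4,
}

// Estimate the size of raw PCM data, excluding any container header.
function rawAudioBytes(
  encoding: string,
  sampleRate: number,
  seconds: number,
  channels = 1, // mono is an assumption
): number {
  const width = BYTES_PER_SAMPLE[encoding]
  if (width === undefined) {
    throw new Error(`unknown encoding: ${encoding}`)
  }
  return sampleRate * width * channels * seconds
}

// pcm_s16le at 44.1 kHz, mono: 44100 samples/s * 2 bytes = 88200 bytes/s
const perSecond = rawAudioBytes("pcm_s16le", 44_100, 1)
```

Under these assumptions, ten seconds of speech is roughly 882 kB of sample data before the WAV header is added.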
```typescript
await generateSpeech({
  model: "cartesia/sonic-3",
  text: "Hello!",
  voice: "a0e99841-438c-4a64-b679-ae501e7d6091",
  providerOptions: {
    language: "en",
    output_format: {
      container: "wav",
      encoding: "pcm_s16le",
      sample_rate: 44_100,
    },
    speed: "normal",
  },
})
```

Custom Configuration
```typescript
import { generateSpeech } from "@speech-sdk/core"
import { createCartesia } from "@speech-sdk/core/providers"

const cartesia = createCartesia({
  apiKey: process.env.CARTESIA_API_KEY,
})

const result = await generateSpeech({
  model: cartesia("sonic-3"),
  text: "Hello!",
  voice: "a0e99841-438c-4a64-b679-ae501e7d6091",
})
```