Cartesia is the speed demon of the voice API world, trading the heavy Transformer architecture of competitors for State Space Models (SSMs) to achieve blistering sub-100ms latency. It is the definitive choice for developers building real-time voice agents where 'awkward silence' is the enemy, offering a blazing-fast TTS engine and an aggressively priced STT model ($0.13/hour). However, if you need offline, audiobook-grade narration where cost is king and latency is irrelevant, OpenAI or cheaper bulk providers might be a better fit.
Cartesia charges approximately $38 per 1 million characters for text-to-speech (on their Scale plan) and a rock-bottom $0.13 per hour for speech-to-text. For a voice agent handling 20,000 minutes of conversation a month, you'd spend a little over $40 on transcription and roughly $850 on voice generation. Compared to ElevenLabs, which can cost 4x that amount for similar volume on public plans, Cartesia offers a compelling middle ground: cheaper than the boutique labs, but more expensive than commodity providers like OpenAI.
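For the curious, here is the back-of-envelope math. The characters-per-minute figure is a rough assumption for an agent that does most of the talking, not a number Cartesia publishes, so tune it to your own transcripts.

# Back-of-envelope monthly cost for a voice agent on the Scale plan.
TTS_USD_PER_MILLION_CHARS = 38.0  # ~Scale plan rate
STT_USD_PER_HOUR = 0.13

minutes_per_month = 20_000
chars_per_minute = 1_100  # rough assumption: agent speech rendered as text

tts_cost = minutes_per_month * chars_per_minute / 1_000_000 * TTS_USD_PER_MILLION_CHARS
stt_cost = minutes_per_month / 60 * STT_USD_PER_HOUR

print(f"TTS ~ ${tts_cost:,.0f}/mo, STT ~ ${stt_cost:,.0f}/mo")  # ~$836 and ~$43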
The defining feature here is the architecture. While the rest of the industry brute-forces latency down with optimized Transformers, Cartesia built its Sonic models on State Space Models (SSMs). The result is a Time-to-First-Audio (TTFA) of roughly 40ms on the Turbo model. In practice, this eliminates the "thinking pause" that breaks immersion in voice bots. The API is WebSocket-first and feels purpose-built for full-duplex conversations. You send text chunks, you get audio bytes instantly. It’s snappy, stable, and genuinely feels like talking to a human over a good phone line.
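To give a feel for that flow, here is a minimal streaming sketch. It assumes the Python SDK's WebSocket helper as documented in its 1.x releases (client.tts.websocket() and ws.send(..., stream=True)); method names and the voice parameter have shifted between SDK versions, so check the current reference before copying it verbatim.

# Minimal WebSocket streaming sketch (1.x-era SDK surface; verify against current docs).
import os
from cartesia import Cartesia

client = Cartesia(api_key=os.environ.get("CARTESIA_API_KEY"))
ws = client.tts.websocket()  # persistent connection, reused across utterances

# stream=True yields audio chunks as soon as the model emits them
for output in ws.send(
    model_id="sonic-english",
    transcript="Streaming this sentence as it is generated.",
    voice_id="694f9389-aac1-45b6-b726-9d9369183238",  # Example voice ID
    output_format={"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
    stream=True,
):
    chunk = output["audio"]  # raw PCM bytes; push straight into your playback buffer

ws.close()

Keeping the socket open is part of what keeps TTFA low: there is no connection or TLS handshake in the hot path of each reply.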
However, the tradeoff for speed is nuance. While Sonic is light-years ahead of robotic AWS Polly voices, it lacks the cinematic emotional range of ElevenLabs. It handles helpful support agents perfectly but struggles with the subtle breathiness and complex intonation that audiobook narration demands. Additionally, the credit-based billing model (1 char = 1 credit, but cloning = 1.5 credits) adds friction to cost forecasting. You also get only ~15 languages on the base model, whereas competitors cover 50+.
The real sleeper hit is their "Ink" STT model. At $0.13/hour, it is arguably the cheapest reliable transcription on the market, undercutting OpenAI's Whisper API ($0.36/hr) and Deepgram ($0.26/hr). It’s fast enough to handle interruptions in real-time agents, making Cartesia a viable single-vendor solution for the entire voice stack.
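At the per-hour rates quoted above (which will drift, so check the pricing pages), the monthly transcription bill for that same 20,000-minute agent works out as follows.

# Monthly STT cost at the per-hour rates quoted in this review.
hours_per_month = 20_000 / 60  # ~333 hours of audio

rates_per_hour = {
    "Cartesia Ink": 0.13,
    "Deepgram": 0.26,
    "OpenAI Whisper API": 0.36,
}
for provider, rate in rates_per_hour.items():
    print(f"{provider}: ${hours_per_month * rate:,.2f}/month")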
Skip Cartesia if you are generating long-form content where latency doesn't matter and cost is paramount—OpenAI's standard TTS is half the price. But if you are building a customer service bot, a roleplay companion, or an NPC where 500ms of lag kills the vibe, Cartesia is currently the best price-to-performance engine available.
The "freemium" tier is essentially a sandbox: 10,000 credits (approx. 10k characters) is barely 5 minutes of audio, just enough to verify the API works. The real pricing starts at the "Scale" tier ($299/mo for 8M credits), which works out to $37 per 1 million characters. This places Cartesia in a strategic gap: it is significantly more expensive than OpenAI ($15/1M chars) but vastly cheaper than ElevenLabs' public tiers ($165/1M chars).
Hidden Cost: Watch out for "Pro Voice Cloning." It burns 1.5 credits per character, effectively raising your price by 50% if you use custom voices. STT is a loss leader at $0.13/hour—almost negligible in your total bill.
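In concrete terms, the 1.5x multiplier moves the Scale plan's effective rate from roughly $37 to roughly $56 per million characters.

# Effective per-character rate with Pro Voice Cloning (1.5 credits per character).
SCALE_PLAN_USD = 299
SCALE_PLAN_CREDITS = 8_000_000

usd_per_million_credits = SCALE_PLAN_USD / SCALE_PLAN_CREDITS * 1_000_000
standard_voice = usd_per_million_credits * 1.0  # ~$37 per 1M characters
cloned_voice = usd_per_million_credits * 1.5    # ~$56 per 1M characters

print(f"Standard: ${standard_voice:.0f}/1M chars, cloned: ${cloned_voice:.0f}/1M chars")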
Cartesia lives up to the SSM hype. The WebSocket API is robust, handling full-duplex streaming with ease. The Python SDK (cartesia) is thin and pythonic, handling the WebSocket handshake and audio buffering for you. Latency is consistently sub-100ms, often hitting ~40ms on Turbo, which is perceptible only as "instant." Documentation is clean but focuses heavily on the WebSocket implementation; REST users might feel like second-class citizens. Integration is trivial for anyone who has used Deepgram or OpenAI Realtime.
# pip install cartesia
import os

from cartesia import Cartesia

client = Cartesia(api_key=os.environ.get("CARTESIA_API_KEY"))

# One-shot synchronous request: returns the complete audio payload as bytes.
data = client.tts.bytes(
    model_id="sonic-english",
    transcript="Hello world! This is generated in under 100ms.",
    voice_id="694f9389-aac1-45b6-b726-9d9369183238",  # Example voice ID
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)

# The "wav" container means the bytes are already a playable file.
with open("hello.wav", "wb") as f:
    f.write(data)

print(f"Received {len(data)} bytes of audio.")