Voice & transcription
Speech-to-text and text-to-speech — the audio edges of an AI system.
★ Whisper
OpenAI's open-source speech-to-text model — the default starting point for transcription.
AssemblyAI
A commercial speech-to-text API pairing transcription with audio-intelligence models.
Cartesia
A commercial provider of very low-latency, realistic text-to-speech for real-time voice apps.
Deepgram
A commercial speech API built for fast, accurate transcription at scale, including real-time.
ElevenLabs
A leading text-to-speech and voice platform known for highly natural, expressive synthetic speech.
faster-whisper
A reimplementation of Whisper on the CTranslate2 inference engine, up to 4 times faster than the reference implementation at the same accuracy, with lower memory use.
Kokoro
A small, open-weight text-to-speech model that produces natural voices on modest hardware.
Piper
A fast, local open-source text-to-speech system designed to run well on small devices.
Vapi
A platform for building, testing, and deploying real-time voice agents over the phone.
WhisperX
An open-source extension of Whisper that adds accurate word-level timestamps through forced alignment, plus speaker diarization.