ElevenLabs is a comprehensive AI audio platform built for creators, developers, and enterprises to generate, transcribe, localize, and automate voice experiences at scale.
Transcribe audio and video with top-tier accuracy in 90+ languages, including keyterm prompting (up to 100 terms), entity detection, smart language detection, and precise word-level timestamps. Supports speaker diarization (up to 32 speakers) and low-latency realtime transcription (~150 ms) for live experiences.
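As a minimal sketch of calling the transcription endpoint: the code below builds the request separately from sending it. The endpoint path (`/v1/speech-to-text`), the `model_id` value `scribe_v1`, and the `diarize`/`language_code` field names follow the public API reference at the time of writing and should be verified against the current docs before use.

```python
from typing import Optional

API_URL = "https://api.elevenlabs.io/v1/speech-to-text"  # assumed endpoint path

def build_transcription_request(api_key: str, audio_path: str,
                                diarize: bool = True,
                                language_code: Optional[str] = None) -> dict:
    """Assemble the pieces of a speech-to-text request without sending it.

    Field names (model_id, diarize, language_code) follow the public docs
    at the time of writing; verify against the current API reference.
    """
    data = {"model_id": "scribe_v1", "diarize": str(diarize).lower()}
    if language_code:
        data["language_code"] = language_code
    return {
        "url": API_URL,
        "headers": {"xi-api-key": api_key},
        "data": data,
        "file_path": audio_path,  # opened at send time
    }

def transcribe(api_key: str, audio_path: str) -> dict:
    import requests  # pip install requests
    req = build_transcription_request(api_key, audio_path)
    with open(req["file_path"], "rb") as f:
        resp = requests.post(req["url"], headers=req["headers"],
                             data=req["data"], files={"file": f})
    resp.raise_for_status()
    return resp.json()  # transcript text plus word-level timestamps
```

Separating request construction from transport keeps the limit checks and field handling testable without network access.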
Translate audio/video into 32 languages while retaining each speaker’s tone, timing, and emotional nuance. Automatically separates speakers from background audio, keeps music/FX, and lets you manually edit transcripts and translations in Dubbing Studio.
Generate controllable, expressive speech across 70+ languages using models optimized for latency, consistency, or emotional control. Clone voices, design new ones from prompts, and explore a large voice library for instant production-ready results.
Deploy natural, human-sounding agents that talk, type, and take action across phone, chat, email, and WhatsApp. Measure outcomes with analytics, validate behavior via testing, and enforce policy and compliance with robust guardrails and workflows.
Build with the ElevenLabs API for text-to-speech, speech-to-text, music, and more, including batch and realtime streaming options. Integrate webhook callbacks, work with large files, and leverage detailed docs and cookbooks for quick, reliable implementation.
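A minimal text-to-speech sketch in the same style: the endpoint path, the `/stream` suffix for streaming responses, and the default `model_id` of `eleven_multilingual_v2` are taken from the public API reference and should be checked against current documentation.

```python
TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"  # assumed path

def build_tts_request(api_key: str, voice_id: str, text: str,
                      model_id: str = "eleven_multilingual_v2",
                      stream: bool = False) -> dict:
    """Assemble a text-to-speech request without sending it.

    Endpoint path, model_id, and the /stream suffix follow the public API
    reference at the time of writing; verify against the current docs.
    """
    url = TTS_URL.format(voice_id=voice_id) + ("/stream" if stream else "")
    return {
        "url": url,
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        "json": {"text": text, "model_id": model_id},
    }

def synthesize(api_key: str, voice_id: str, text: str) -> bytes:
    import requests  # pip install requests
    req = build_tts_request(api_key, voice_id, text)
    resp = requests.post(req["url"], headers=req["headers"], json=req["json"])
    resp.raise_for_status()
    return resp.content  # audio bytes (MP3 by default)
```

For realtime use cases, pass `stream=True` and consume the response incrementally instead of buffering the whole file.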
Yes. The API supports both audio and video uploads for transcription.
Files up to 3 GB are supported. Standard mode (use_multi_channel=false) supports up to 10 hours; Multichannel mode (use_multi_channel=true) supports up to 1 hour.
Audio formats include AAC, AIFF, OGG, MP3, OPUS, WAV, WEBM, FLAC, M4A, and more. Video formats include MP4, AVI, MKV, MOV/QuickTime, WMV, FLV, WEBM, MPEG, and 3GPP.
Yes. Asynchronous transcription results can be delivered via webhooks configured in the UI.
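Webhook consumers should verify that deliveries actually came from ElevenLabs. The sketch below assumes a Stripe-style signature header of the form `t=<unix_ts>,v0=<hex_hmac>` signed over `<unix_ts>.<payload>` with HMAC-SHA256; the exact header name and signing scheme must be confirmed in the ElevenLabs webhooks documentation before relying on this.

```python
import hashlib
import hmac
import time
from typing import Optional

def verify_webhook(payload: bytes, signature_header: str, secret: str,
                   tolerance_s: int = 300, now: Optional[int] = None) -> bool:
    """Verify an HMAC-SHA256 webhook signature.

    Assumes a header of the form "t=<unix_ts>,v0=<hex_hmac>" computed over
    "<unix_ts>.<payload>"; confirm the exact scheme in the webhooks docs.
    """
    parts = dict(p.split("=", 1) for p in signature_header.split(","))
    ts, sig = parts["t"], parts["v0"]
    if abs((now or int(time.time())) - int(ts)) > tolerance_s:
        return False  # stale delivery; possible replay
    expected = hmac.new(secret.encode(), f"{ts}.".encode() + payload,
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

`hmac.compare_digest` prevents timing attacks, and the timestamp tolerance rejects replayed deliveries.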
Yes. Multichannel STT can process audio with up to 5 channels, assigning a speaker ID per channel.
Billing is based on the duration of the audio sent for transcription, calculated per hour. Rates vary by tier and model.
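A pro-rated cost estimate can be computed from the audio duration; the rate used below is purely illustrative, not a real price, and whether your tier rounds up to whole hours should be checked against your plan's terms.

```python
def transcription_cost(duration_seconds: float, rate_per_hour: float) -> float:
    """Estimate the cost of a transcription job, pro-rated by duration.

    rate_per_hour is your plan's published STT rate; the example value in
    the test below is illustrative only. Check whether your tier rounds
    partial hours up before relying on this estimate.
    """
    hours = duration_seconds / 3600
    return round(hours * rate_per_hour, 4)
```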
You can dub short- and long-form video or audio. For best quality, ElevenLabs recommends a maximum of about 9 unique speakers per project.
Yes. The model recreates tone, pace, and style in the target language to match the original delivery.
The system uses advanced source separation to isolate individual voices from ambient sound, handling overlapping speakers on separate tracks.
Yes. The UI supports files up to 45 minutes (documented file-size limits range from 500 MB to 1 GB), while the API supports files up to 1 GB and 2.5 hours.
Yes. In Dubbing Studio, you can select portions to dub, edit transcripts/translations, and regenerate individual segments.
Yes, but organizations requiring HIPAA must contact ElevenLabs Sales to sign a Business Associate Agreement (BAA) before HIPAA-related integrations or deployments.
Yes. You can add additional languages, set language-specific voices, and optionally enable a language detection tool. Note that language selection is fixed for the duration of a call and cannot be changed mid-conversation.
Yes. A Creator plan or higher is required to dub audio files. For videos, you can optionally add a watermark to reduce credit usage.
Join thousands of developers who are already using ElevenLabs to enhance their workflow and productivity.
OpenClaw is an open-source personal AI assistant that runs on your own hardware — a Mac Mini, Raspberry Pi, VPS, or any computer you control.
Agent IA Vocal is a Quebec-based AI voice receptionist designed specifically for small and medium-sized businesses (PMEs) in Quebec and Canada.