Speech Synthesis

SpeechSynthesisUtterance turns a string into spoken audio in one call. No permission, no cloud, no library -- but voices come from the OS, which means they vary by device and browser.

June 10, 2026 · 3 min read

The browser already has a speaker. The OS already knows how to produce speech. The Web Speech API's synthesis interface connects those two things: give it a string, and it speaks -- no permission dialog, no cloud service, no library.

This is the other half of the Web Speech API. Speech recognition listens; speech synthesis speaks.

One Call to Speak

JavaScript
const utterance = new SpeechSynthesisUtterance('Thinly slice the red onion.')
window.speechSynthesis.speak(utterance)

window.speechSynthesis is the global controller. SpeechSynthesisUtterance holds both the text and the reading configuration.

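No permission dialog appears. Some browsers, Chrome among them, ignore speak() until the user has interacted with the page, so trigger the first utterance from a click or tap handler. After that, the browser speaks immediately.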

Configuring the Utterance

JavaScript
const utterance = new SpeechSynthesisUtterance()
utterance.text = 'Preheat the oven to 200 degrees.'
utterance.lang = 'en-US' // BCP 47 language tag
utterance.volume = 1     // 0 (silent) to 1 (full)
utterance.pitch = 1      // 0 (low) to 2 (high)
utterance.rate = 1       // 0.1 (slow) to 10 (fast)
window.speechSynthesis.speak(utterance)

lang sets the language and accent: 'es-AR' for Argentine Spanish, 'pt-BR' for Brazilian Portuguese, 'ja-JP' for Japanese. If no installed voice matches the tag, the browser falls back to its default voice.

rate and pitch are accessibility levers: slower speech for language learners, higher pitch for audio alerts. For standard narration, leave both at 1.
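When these values come from user settings (a speed slider, for example), it can help to clamp them before assigning them to the utterance. The following is a minimal sketch, assuming the ranges given in the Web Speech API specification (volume 0 to 1, pitch 0 to 2, rate 0.1 to 10). The names clamp and normalizeSpeechOptions are hypothetical helpers, not part of the API.

```javascript
// Clamp a number into [min, max]; anything that is not a finite number
// falls back to `fallback`.
function clamp(value, min, max, fallback) {
  if (typeof value !== 'number' || !Number.isFinite(value)) return fallback
  return Math.min(max, Math.max(min, value))
}

// Hypothetical helper: normalise user-supplied settings to the assumed ranges.
// Missing values default to 1, the neutral setting for all three properties.
function normalizeSpeechOptions({ volume, pitch, rate } = {}) {
  return {
    volume: clamp(volume, 0, 1, 1),
    pitch: clamp(pitch, 0, 2, 1),
    rate: clamp(rate, 0.1, 10, 1),
  }
}

// In a browser, the result can be copied onto an utterance:
// Object.assign(utterance, normalizeSpeechOptions({ rate: 0.8 }))
```

Keeping the helper free of browser objects means it can be unit-tested outside the browser.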

Getting Available Voices

The most important fact about speech synthesis: the voices come from the OS, not the browser.

JavaScript
const voices = window.speechSynthesis.getVoices()

The array depends on the user's device and operating system. macOS ships with voices like "Samantha" and "Alex". Android has its own set. Windows has another. Chrome, Safari, and Edge all expose whatever the underlying OS provides; desktop Chrome and Edge may also list a few network voices of their own.

Chrome loads the voice list asynchronously, so calling getVoices() on page load can return an empty array. Wait for the voiceschanged event:

JavaScript
// voiceschanged can fire more than once, so handle only the first event.
window.speechSynthesis.addEventListener('voiceschanged', () => {
  const voices = window.speechSynthesis.getVoices()
  const esVoice = voices.find(v => v.lang === 'es-AR')

  const utterance = new SpeechSynthesisUtterance('Cortá la cebolla.')
  utterance.lang = 'es-AR'
  utterance.voice = esVoice ?? null // null falls back to default
  window.speechSynthesis.speak(utterance)
}, { once: true })
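In browsers that populate the list synchronously, the event may already have fired (or may never fire) by the time a listener is attached. One way to cover both cases is to check getVoices() first and listen only if the list is empty. This is a sketch, not a definitive implementation: whenVoicesReady is a hypothetical helper, and it assumes the synth argument behaves like window.speechSynthesis and supports addEventListener.

```javascript
// Hypothetical helper: call `onReady(voices)` once a non-empty voice list exists.
// `synth` is assumed to look like window.speechSynthesis.
function whenVoicesReady(synth, onReady) {
  const voices = synth.getVoices()
  if (voices.length > 0) {
    onReady(voices) // list already populated, no event needed
    return
  }

  const handler = () => {
    const loaded = synth.getVoices()
    if (loaded.length === 0) return // still empty, keep waiting
    synth.removeEventListener('voiceschanged', handler)
    onReady(loaded)
  }
  synth.addEventListener('voiceschanged', handler)
}

// In a browser:
// whenVoicesReady(window.speechSynthesis, voices => console.log(voices.length))
```

Passing the synthesizer in as a parameter keeps the helper testable with a stub object.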

Each SpeechSynthesisVoice has .name, .lang, .localService (true if OS-native, false if network-dependent), and .default (true for the browser's default choice).
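These four properties are enough to write a small voice picker. The sketch below is one possible policy, not the only reasonable one: it looks for an exact language match, falls back to any voice in the same base language, and prefers on-device voices. The names normalizeLang and pickVoice are hypothetical, and the underscore handling is an assumption based on reports that some platforms return tags like 'es_AR'.

```javascript
// Normalise tags such as 'es_AR' to 'es-ar' so comparisons ignore
// case and separator differences (assumed platform quirk).
function normalizeLang(tag) {
  return String(tag).replace(/_/g, '-').toLowerCase()
}

// Hypothetical helper: choose a voice for `lang`, or return null
// so the browser picks its default.
function pickVoice(voices, lang) {
  const wanted = normalizeLang(lang)
  const base = wanted.split('-')[0]

  const exact = voices.filter(v => normalizeLang(v.lang) === wanted)
  const sameLanguage = voices.filter(
    v => normalizeLang(v.lang).split('-')[0] === base
  )
  const candidates = exact.length > 0 ? exact : sameLanguage

  // Prefer on-device voices, then the browser default, then the first match.
  return (
    candidates.find(v => v.localService) ??
    candidates.find(v => v.default) ??
    candidates[0] ??
    null
  )
}

// In a browser:
// utterance.voice = pickVoice(window.speechSynthesis.getVoices(), 'es-AR')
```

Returning null when nothing matches lines up with the fallback behaviour of utterance.voice.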

[Figure: Speech Synthesis: text to utterance, OS TTS engine, speaker output, and the voice selection model]

iOS: All Browsers, One Voice Pool

On iOS, Chrome, Firefox, Edge, and Brave are all built on top of Safari's WebKit engine. They all use Apple's TTS system.

Every browser on iOS gives you the same voices. There is no way to get Google TTS voices in Chrome on an iPhone -- the OS sits below the browser vendor.

On Android, browsers can use different TTS engines, so voice quality and selection can differ between Chrome and Firefox.

This Is Not AI Voice

What you hear from speechSynthesis is the same engine your OS uses for accessibility features like VoiceOver or Narrator. It often sounds robotic compared to AI services like ElevenLabs or OpenAI TTS -- many built-in voices rely on older concatenative or parametric synthesis rather than a large neural model.

The tradeoff is exactly right for many use cases. No API key, no network request, no latency. For reading recipe steps aloud or announcing form validation errors, that is the correct tool.

Green Tier

Speech synthesis is green tier -- Chrome, Firefox, Safari, and Edge all support it. Unlike speech recognition, no webkit prefix is required, and voices with localService set to true run without a vendor cloud.
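Broad support still leaves embedded webviews and older browsers, so a feature check before speaking is cheap insurance. A minimal sketch follows; supportsSpeechSynthesis is a hypothetical helper, and it takes the global object as a parameter so it can be exercised outside a browser.

```javascript
// Returns true when `scope` (normally `window`) exposes both halves
// of the synthesis API: the controller and the utterance constructor.
function supportsSpeechSynthesis(scope) {
  return (
    typeof scope === 'object' &&
    scope !== null &&
    'speechSynthesis' in scope &&
    typeof scope.SpeechSynthesisUtterance === 'function'
  )
}

// In a browser:
// if (supportsSpeechSynthesis(window)) {
//   window.speechSynthesis.speak(new SpeechSynthesisUtterance('Ready.'))
// }
```

When the check fails, the page can skip the audio and show the text instead.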

The next step beyond audio is visual intelligence: the Shape Detection API lets the browser decode a QR code, detect a face, or extract text from a photo -- all using the OS vision framework, no library needed.

The Essentials

  1. Zero permissions needed. Call window.speechSynthesis.speak(utterance) from a click or tap handler, because some browsers block speech until the user has interacted with the page.
  2. Voices are OS-dependent. In Chrome, getVoices() can return an empty array until voiceschanged fires, so wait for the event before picking a voice.
  3. iOS: all browsers share the same voice pool (Apple TTS). Android browsers can differ.
  4. Not AI voice. Uses the OS text-to-speech engine: fast, local, and free, but it sounds robotic.
  5. Green tier -- Chrome, Firefox, Safari, Edge. No prefix required, no cloud audio.

Further Reading