Speech Synthesis

The browser already has a speaker. The OS already knows how to produce speech. The Web Speech API's synthesis interface connects those two things: give it a string, and it speaks -- no permission dialog, no cloud service, no library.

This is the other half of the Web Speech API. Speech recognition listens; speech synthesis speaks.

One Call to Speak

JavaScript

const utterance = new SpeechSynthesisUtterance('Thinly slice the red onion.')
window.speechSynthesis.speak(utterance)

window.speechSynthesis is the global controller. SpeechSynthesisUtterance holds both the text and the reading configuration.

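No permission dialog appears. Some browsers, Chrome among them, ignore speak() until the user has interacted with the page, so start the first utterance from a click or key press. After that, the browser speaks immediately.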
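A minimal sketch of wiring the call to a button, with feature detection first. The speakText helper and the '#read-step' button id are hypothetical names used only for illustration. Starting speech from a click is a conservative choice, since some browsers ignore speak() before any user interaction.

```javascript
// Hypothetical helper: queues `text` for speech and reports whether it could.
// `scope` defaults to the global object so the function can be exercised with a stub.
function speakText(text, scope = globalThis) {
  const synth = scope.speechSynthesis
  const Utterance = scope.SpeechSynthesisUtterance
  if (!synth || !Utterance) return false  // API not available in this environment

  synth.cancel()                          // drop anything still queued or speaking
  synth.speak(new Utterance(text))
  return true
}

// Browser wiring; '#read-step' is an assumed button id, not part of any API.
if (typeof document !== 'undefined') {
  document.querySelector('#read-step')?.addEventListener('click', () => {
    speakText('Thinly slice the red onion.')
  })
}
```

Calling cancel() before speak() means a second click restarts the sentence instead of queueing it behind the first. Drop that line if queued playback is what you want.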

Configuring the Utterance

JavaScript

const utterance = new SpeechSynthesisUtterance()
utterance.text = 'Preheat the oven to 200 degrees.'
utterance.lang = 'en-US'    // BCP 47 language tag
utterance.volume = 1        // 0 (silent) to 1 (full)
utterance.pitch = 1         // 0 (lowest) to 2 (highest)
utterance.rate = 1          // 0.1 (slow) to 10 (fast)

window.speechSynthesis.speak(utterance)

lang sets the language and accent. 'es-AR' for Argentine Spanish, 'pt-BR' for Brazilian Portuguese, 'ja-JP' for Japanese. If the requested voice is unavailable, the browser falls back to its default.

rate and pitch are accessibility levers: slower speech for language learners, higher pitch for audio alerts. For standard narration, leave both at 1.
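When these values come from user settings, it can help to clamp them before assigning. The sketch below assumes the ranges defined by the Web Speech API specification (volume 0 to 1, pitch 0 to 2, rate 0.1 to 10); normalizeSpeechOptions is a hypothetical helper name, not part of the API.

```javascript
// Allowed ranges for each property, as defined by the Web Speech API specification.
const SPEECH_LIMITS = {
  volume: [0, 1],
  pitch: [0, 2],
  rate: [0.1, 10]
}

// Keeps `value` inside the inclusive range [min, max].
function clamp(value, [min, max]) {
  return Math.min(max, Math.max(min, value))
}

// Hypothetical helper: fills in defaults of 1 and clamps each value to its range.
function normalizeSpeechOptions({ volume = 1, pitch = 1, rate = 1 } = {}) {
  return {
    volume: clamp(volume, SPEECH_LIMITS.volume),
    pitch: clamp(pitch, SPEECH_LIMITS.pitch),
    rate: clamp(rate, SPEECH_LIMITS.rate)
  }
}

// In a browser, the result can be copied onto an utterance:
// Object.assign(utterance, normalizeSpeechOptions({ rate: 0.8 }))
```

Engines may further restrict the usable range in practice, so treat the clamped values as upper and lower bounds rather than a guarantee of audible difference.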

Getting Available Voices

The most important fact about speech synthesis: the voices come from the OS, not the browser.

JavaScript

const voices = window.speechSynthesis.getVoices()

The array depends on the user's device and operating system. macOS ships with voices like "Samantha" and "Alex". Android has its own set. Windows has another. Chrome, Safari, and Edge expose what the underlying OS provides, and some browsers add network-backed voices of their own (these report localService as false).

Chrome loads the voice list asynchronously, so getVoices() can return an empty array when called on page load. Wait for the voiceschanged event:

JavaScript

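function speakInSpanish() {
  const voices = window.speechSynthesis.getVoices()
  const esVoice = voices.find(v => v.lang === 'es-AR')

  const utterance = new SpeechSynthesisUtterance('Cortá la cebolla.')
  utterance.lang = 'es-AR'
  utterance.voice = esVoice ?? null  // null falls back to default
  window.speechSynthesis.speak(utterance)
}

if (window.speechSynthesis.getVoices().length > 0) {
  speakInSpanish()  // voices were already loaded
} else {
  // once: true keeps the sentence from repeating if the event fires again
  window.speechSynthesis.addEventListener('voiceschanged', speakInSpanish, { once: true })
}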
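
The same waiting logic can be wrapped in a promise so callers do not care whether the list arrived synchronously or through the event. This is a sketch under assumptions: loadVoices and its 2000 ms timeout are hypothetical choices, and the timeout exists only as a safety net for environments where voiceschanged never fires.

```javascript
// Hypothetical helper: resolves with the voice list once it is available.
// `synth` is expected to be window.speechSynthesis (or a stub with the same methods).
function loadVoices(synth, timeoutMs = 2000) {
  return new Promise(resolve => {
    const voices = synth.getVoices()
    if (voices.length > 0) {
      resolve(voices)  // list was already populated
      return
    }

    // Safety net: resolve with whatever is there if the event never arrives.
    const timer = setTimeout(finish, timeoutMs)

    function finish() {
      clearTimeout(timer)
      synth.removeEventListener?.('voiceschanged', finish)
      resolve(synth.getVoices())
    }

    synth.addEventListener?.('voiceschanged', finish)
  })
}

// Browser usage:
// loadVoices(window.speechSynthesis).then(voices => console.log(voices.length))
```

The promise always resolves, possibly with an empty array, so callers should still handle the case where no voice matches.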

Each SpeechSynthesisVoice has .name, .lang, .localService (true if OS-native, false if network-dependent), and .default (true for the browser's default choice).
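Those properties are enough to build a small voice picker. The sketch below is one possible policy, not a standard algorithm: pickVoice is a hypothetical helper that prefers an exact language-tag match, falls back to any voice with the same primary language, and favors local voices within the chosen group.

```javascript
// Hypothetical helper: chooses a voice for `lang` from a list of voice-like objects.
// Returns null when nothing matches, which lets the browser use its default.
function pickVoice(voices, lang) {
  const wanted = lang.toLowerCase()
  const primary = wanted.split('-')[0]  // 'es-ar' -> 'es'

  const exact = voices.filter(v => v.lang.toLowerCase() === wanted)
  const sameLanguage = voices.filter(v => v.lang.toLowerCase().split('-')[0] === primary)

  // Use exact matches when there are any, otherwise the broader language group.
  const pool = exact.length > 0 ? exact : sameLanguage

  // Prefer a local voice, then any voice in the pool, then null.
  return pool.find(v => v.localService) ?? pool[0] ?? null
}

// Browser usage:
// utterance.voice = pickVoice(window.speechSynthesis.getVoices(), 'es-AR')
```

Falling back to the primary language trades accent for availability: a request for 'es-MX' on a device with only 'es-ES' still gets a Spanish voice instead of the default one.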

Figure: Speech Synthesis: text to utterance, OS TTS engine, speaker output, and the voice selection model

iOS: All Browsers, One Voice Pool

On iOS, Chrome, Firefox, Edge, and Brave are all built on top of Safari's WebKit engine. They all use Apple's TTS system.

Every browser on iOS gives you the same voices. There is no way to get Google TTS voices in Chrome on an iPhone -- the OS sits below the browser vendor.

On Android, browsers can use different TTS engines, so voice quality and selection can differ between Chrome and Firefox.

This Is Not AI Voice

What you hear from speechSynthesis is the same engine your OS uses for accessibility features like VoiceOver or Narrator. It often sounds robotic compared to AI services like ElevenLabs or OpenAI TTS -- because it typically relies on older concatenative or parametric synthesis rather than a large neural model.

The tradeoff is exactly right for many use cases. No API key and, with a local voice, no network request and no network latency. For reading recipe steps aloud or announcing form validation errors, that is the correct tool.
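For the recipe case, several steps can be queued in one go, because speak() appends each utterance to the browser's speech queue. The sketch below assumes a hypothetical speakSteps helper that returns a promise settling when the last step has been spoken; the `scope` parameter exists only so the function can be tried with a stub.

```javascript
// Hypothetical helper: speaks each string in `steps` in order.
// Resolves when every utterance has ended, rejects on the first speech error.
function speakSteps(steps, scope = globalThis) {
  const synth = scope.speechSynthesis
  const Utterance = scope.SpeechSynthesisUtterance
  if (!synth || !Utterance) {
    return Promise.reject(new Error('Speech synthesis is not available'))
  }

  synth.cancel()  // start from an empty queue

  const finished = steps.map(step => new Promise((resolve, reject) => {
    const utterance = new Utterance(step)
    utterance.onend = () => resolve()
    utterance.onerror = event => reject(new Error(`Speech failed: ${event.error}`))
    synth.speak(utterance)  // appended to the queue, spoken in call order
  }))

  return Promise.all(finished).then(() => undefined)
}

// Browser usage, from a click handler:
// speakSteps(['Preheat the oven to 200 degrees.', 'Thinly slice the red onion.'])
```

Relying on the built-in queue keeps the code short. If you need pauses or user confirmation between steps, speaking one utterance at a time from each onend handler is the more flexible design.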

Green Tier

Speech synthesis is green tier -- Chrome, Firefox, Safari, and Edge all support it. Unlike speech recognition, no webkit prefix is required and no vendor cloud is involved.

The next step beyond audio is visual intelligence: the Shape Detection API lets the browser decode a QR code, detect a face, or extract text from a photo -- all using the OS vision framework, no library needed.

The Essentials

Zero permissions needed. window.speechSynthesis.speak(utterance) starts speaking right away, though some browsers (Chrome among them) require a prior user interaction with the page.
Voices are OS-dependent. In Chrome, getVoices() returns an empty array until voiceschanged fires, so wait for the event when the list is empty.
iOS: all browsers share the same voice pool (Apple TTS). Android browsers can differ.
Not AI voice. Uses the OS text-to-speech engine. Fast, local, free, and sounds robotic.
Green tier -- Chrome, Firefox, Safari, Edge. No prefix required, no cloud audio.
