Web Speech Recognition API: No Library, No API Key Needed

Here is something many front-end developers do not know: speech recognition has been built into the browser for over a decade. It is free and requires no API key and no library.

The API is called the Web Speech API.

It is green tier -- Chrome, Safari, and Edge all support it. Firefox has an implementation behind a flag. You get a microphone input, a confidence score, and a text transcript. That is the whole API.

The webkit Prefix Quirk

This is one of the few APIs that still requires a vendor prefix in 2026.

Browsers like Chrome that are no longer WebKit-based still expose the API as window.webkitSpeechRecognition. It is a historical artifact from when the feature was experimental and shipped under the vendor prefix. The prefix stayed.

The standard pattern handles both:

JavaScript

const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition

if (!SpeechRecognition) {
  // Browser does not support speech recognition
}

Always check both. window.SpeechRecognition is the unprefixed name from the specification, which a browser may expose instead of or alongside the prefixed one. window.webkitSpeechRecognition is what Chrome still uses.
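
The check can be wrapped in a small helper so the rest of the code deals with a single value. This is a minimal sketch: the name getSpeechRecognitionCtor and the scope parameter are illustrative and not part of the API. Passing the global object in as a parameter is only a convenience so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: returns the recognition constructor, or null when the
// browser exposes neither the standard nor the prefixed name.
function getSpeechRecognitionCtor(scope = globalThis) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null
}

// Example use in a page:
// const Ctor = getSpeechRecognitionCtor()
// if (!Ctor) { /* hide the mic button and fall back to typed input */ }
```

Returning null rather than throwing lets the caller decide how to degrade, for example by hiding the microphone button.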

Creating a Recognition Instance

JavaScript

const recognition = new SpeechRecognition()
recognition.lang = 'en-US'         // BCP 47 language tag
recognition.continuous = false     // stop after first phrase
recognition.interimResults = false // only final results

Three options matter most.

lang -- a BCP 47 language tag like 'en-US', 'es-ES', or 'fr-FR'. The available languages are not defined by the spec. They depend on which browser you are using and what languages the underlying OS or cloud service supports. The original specification has no standard method for listing them, so treat any browser-specific helper such as SpeechRecognition.getAvailableLanguages() as optional and feature-detect it before calling it.

continuous -- false means recognition stops after the first phrase and fires one result. true means it keeps listening and fires a result event each time it detects a complete phrase. For voice command systems, true is what you want.

interimResults -- false means you only get final, committed transcripts. true means you also get partial results while the voice is still being processed. Interim results have lower confidence and the text is still changing. Most command-detection use cases leave this off.
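
The three options can be grouped into a small factory so each call site states only what differs from the defaults. This is a sketch under assumptions: createRecognizer and the two preset names are hypothetical, and RecognitionCtor stands for whichever constructor feature detection returned.

```javascript
// Hypothetical factory: applies the three options described above.
// Defaults mirror the single-phrase, final-results-only configuration.
function createRecognizer(RecognitionCtor, options = {}) {
  const recognition = new RecognitionCtor()
  recognition.lang = options.lang ?? 'en-US'
  recognition.continuous = options.continuous ?? false
  recognition.interimResults = options.interimResults ?? false
  return recognition
}

// Illustrative presets for two common cases.
const VOICE_COMMANDS = { continuous: true, interimResults: false }
const LIVE_DICTATION = { continuous: true, interimResults: true }

// Example: createRecognizer(SpeechRecognition, { ...VOICE_COMMANDS, lang: 'es-ES' })
```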

Starting and Listening

JavaScript

recognition.addEventListener('start', () => {
  showMicActiveIndicator()
})

recognition.addEventListener('result', (event) => {
  const results = event.results
  const latest = results[results.length - 1][0]  // last phrase, first alternative

  const text = latest.transcript.trim().replace(/\.$/, '')
  const confidence = latest.confidence  // 0 to 1

  handleCommand(text, confidence)
})

recognition.addEventListener('end', () => {
  hideMicActiveIndicator()
})

recognition.addEventListener('error', (event) => {
  if (event.error === 'not-allowed') {
    showPermissionDeniedMessage()
  }
})

recognition.start()

The result event accumulates. event.results is an array-like list (a SpeechRecognitionResultList) of everything recognized since recognition started, not just the latest phrase. To get the newest phrase, always take results[results.length - 1]. Each entry is itself an array-like list of alternatives -- [0] is the highest-confidence alternative.
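
When interimResults is on, or when you only want what changed, the event also carries resultIndex (the index of the first entry that changed) and each entry has an isFinal flag. The sketch below walks only the changed entries and separates committed text from text that is still changing. The function name splitResults is illustrative.

```javascript
// Illustrative helper: collect the text that changed in this result event.
// event.resultIndex is the first changed entry; result.isFinal marks
// committed phrases. Entries are array-like, so index access works.
function splitResults(event) {
  const finals = []
  const interims = []
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i]
    const text = result[0].transcript // first alternative
    if (result.isFinal) {
      finals.push(text)
    } else {
      interims.push(text)
    }
  }
  return { finals, interims }
}
```

A typical use is to pass the finals to command matching and show the interims as live feedback in the UI.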

Two cleanup steps make command matching reliable. Trim whitespace -- the transcript sometimes has leading or trailing spaces. Remove the trailing dot -- some browsers add a period at the end of a recognized phrase; others don't. Stripping it before matching makes the behavior consistent.
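
The two cleanup steps can feed a simple phrase-to-handler lookup. This is a sketch, not a definitive matcher: normalizeTranscript and runCommand are hypothetical names, lowercasing is an extra step added here so matching ignores capitalization, and the 0.6 confidence threshold is an arbitrary starting point to tune against real results.

```javascript
// Illustrative cleanup: trim, drop one trailing period, lowercase.
function normalizeTranscript(raw) {
  return raw.trim().replace(/\.$/, '').toLowerCase()
}

// Hypothetical matcher: commands is a Map from spoken phrase to handler.
// Returns true if a handler ran, false if the phrase was ignored.
function runCommand(raw, confidence, commands, minConfidence = 0.6) {
  if (confidence < minConfidence) return false
  const handler = commands.get(normalizeTranscript(raw))
  if (!handler) return false
  handler()
  return true
}

// Example:
// const commands = new Map([['lights on', () => setLights(true)]])
// runCommand(latest.transcript, latest.confidence, commands)
```

A Map is used instead of a plain object so that a spoken word such as "constructor" cannot accidentally match an inherited property.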

Figure: Web Speech Recognition API: the mic-to-result flow, options, and lifecycle events

The end Event and Continuous Mode

Here is a non-obvious behavior: even with continuous = true, the recognition session can end on its own. Long silence, network interruption, or the user navigating to a different tab will fire the end event and stop the session.

If you want truly continuous listening, restart from inside end:

JavaScript

recognition.addEventListener('end', () => {
  if (stillWantingToListen) {
    recognition.start()  // immediately restart
  }
})

Without this, a gap in audio stops the session and you never hear another word.
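
One caveat with restarting unconditionally: after a permission error the session also ends, and restarting would only fail again in a loop. The sketch below guards against that. The names keepListening and shouldListen are hypothetical; shouldListen is a caller-supplied function that returns true while the app still wants audio.

```javascript
// Illustrative wrapper around the restart-on-end pattern.
// Restarting stops after a permission error, since calling start()
// again would fail the same way.
function keepListening(recognition, shouldListen) {
  let blocked = false

  recognition.addEventListener('error', (event) => {
    if (event.error === 'not-allowed' || event.error === 'service-not-allowed') {
      blocked = true
    }
  })

  recognition.addEventListener('end', () => {
    if (blocked || !shouldListen()) return
    try {
      recognition.start()
    } catch (err) {
      // start() throws if a session is already running; nothing to restart.
      console.warn('Could not restart recognition:', err)
    }
  })
}

// Example: keepListening(recognition, () => stillWantingToListen)
```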

Where the Audio Goes

By default, this API does not process audio locally. When you call recognition.start(), audio is streamed to the browser vendor's speech service. In Chrome, that is Google's speech servers. In Safari, it is Apple's.

The browser handles the network call and returns the transcript. You never see the raw audio data or the cloud endpoint. From your code's perspective, it is just an event. But the audio does leave the device.

Microphone Permission

The permission model applies here the same way it does for geolocation. The browser shows a permission dialog the first time. If the user denies it, the error event fires with event.error === 'not-allowed'. The denial persists until the user manually re-enables the microphone in browser settings -- your code cannot prompt again.
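
Since 'not-allowed' is only one of several error codes, it can help to translate them into user-facing text in one place. The codes below are the ones listed in the Web Speech API specification; the messages and the name describeRecognitionError are placeholders to adapt.

```javascript
// Illustrative mapping from error codes to user-facing messages.
const ERROR_MESSAGES = new Map([
  ['no-speech', 'No speech was detected. Try again.'],
  ['audio-capture', 'No microphone was found.'],
  ['not-allowed', 'Microphone access is blocked. Enable it in your browser settings.'],
  ['service-not-allowed', 'Speech recognition is not permitted in this browser.'],
  ['network', 'The speech service could not be reached.'],
  ['language-not-supported', 'The selected language is not supported.'],
])

function describeRecognitionError(code) {
  return ERROR_MESSAGES.get(code) ?? 'Speech recognition stopped unexpectedly.'
}

// Example:
// recognition.addEventListener('error', (event) => {
//   showMessage(describeRecognitionError(event.error)) // showMessage is hypothetical
// })
```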

On Safari specifically, recognition will not start unless triggered from a user gesture (a click or tap). Calling recognition.start() from a timeout or DOMContentLoaded will fail silently.
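
A simple way to satisfy the gesture requirement is to call start() only from a click handler. The sketch below assumes a button element and an existing recognition instance; startOnClick and the '#mic' selector in the usage comment are hypothetical.

```javascript
// Illustrative wiring: start recognition only inside a click handler,
// so the call happens as part of a user gesture.
function startOnClick(button, recognition) {
  button.addEventListener('click', () => {
    try {
      recognition.start()
    } catch (err) {
      // Thrown when a session is already active, e.g. on a double click.
      console.warn('Recognition already running:', err)
    }
  })
}

// In a page: startOnClick(document.querySelector('#mic'), recognition)
```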

The flip side of this API needs no permission at all: speech synthesis makes the browser speak a string aloud using the OS text-to-speech engine.

The Essentials

Green tier -- Chrome, Safari, Edge. No library, no API key. Audio goes to vendor cloud servers and is not processed locally.
Both names are needed: window.SpeechRecognition || window.webkitSpeechRecognition. The prefix is a historical artifact, not a bug.
continuous = true keeps the session alive between phrases. It still needs a restart-on-end pattern for true long-running listening.
event.results[event.results.length - 1][0].transcript -- always take the last result, first alternative. Trim and strip the trailing dot before matching commands.
Safari requires a user gesture before recognition.start() will work. Calling it on page load fails silently.
