Real Time Offline Speech to Text.

In this post, I’ll walk you through building a real-time offline speech-to-text program in Python with OpenAI Whisper. We’ll break everything down step by step and explain the “why” behind each part, so you actually understand how it works instead of just copy-pasting the code (you can do that too if you want; the full source is at the end). It’s going to be fun and surprisingly simple!

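Here is what we will be covering in this article/blog/tutorial (idk, call it whatever you want):

  • Setting up the Environment
  • Overall Logic of the Program
  • Setting Up the Audio Capture
  • Listening for the Wake Word
  • Transcribing Speech in Real-Time
  • Handling Silence and Stopping
  • Putting It All Together
  • Overall Source Code
  • Final Thoughts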


Setting up the Environment

Here’s what you need to follow along:

  1. Python Installed: Version 3.8 or later, which is what current openai-whisper releases require. (If you don’t have it, download Python here.)
  2. Libraries:
    • openai-whisper: This handles speech-to-text transcription. The tiny model is lightweight, and everything runs offline once the model weights have been downloaded on first use.
    • sounddevice: A library to capture real-time audio from your microphone.
    • numpy: Used for processing audio data efficiently.
    • queue: To manage audio chunks as they come in. It ships with Python’s standard library, so there is nothing to install.
    • Depending on your system, you might also need FFmpeg, which you can download from here: FFmpeg

To install all the libraries, just run:
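
The install command did not survive in this copy of the post, so here is a likely version, assuming a standard pip setup. The package names are the ones listed above; queue is left out because it comes with Python.

```shell
python -m pip install openai-whisper sounddevice numpy
```

If the install complains about FFmpeg later on, install it through your system's package manager and try again.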


Before diving into the code: if you want to do this in Discord, you might want to start with Discord TTS Bot: Create Your Own TTS Bot for Discord. If you want to pass the generated text to your own model, see How to Customize Your own Ollama Model.

Also, here is a cool, fun project to improve your coding skills: Your Own Random Anime Image Generator using Nekosia API


Overall Logic of the Program

Before diving in, here’s the basic logic behind this real-time offline speech-to-text program:

  1. Listens for a Wake Word: The program stays in a listening state and only starts transcribing when it hears a specific wake word (e.g., “Hey Bot” or “Listen”).
  2. Transcribes Speech in Real-Time: Once activated, it processes audio chunks and converts them to text almost instantly. How fast depends on your system, but the tiny model runs well even on a relatively low-spec machine. You can learn more about the available models at openai-whisper · PyPI.
  3. Handles Silence and Stops Gracefully: It detects silence to pause transcription and even lets you stop everything by saying, “stop listening.”

Got all that? Great! Let’s jump into the code.


Real Time Offline Speech to Text Program Code

First we need to import the packages we installed and load the model we want to use. In this tutorial I’m using tiny.en, the lightest of the models, which handles English only. You can check out the other models here: openai-whisper · PyPI

Importing Packages and Loading Model
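
Here is a minimal sketch of the model-loading step. The helper names load_model and transcribe are mine, not necessarily what the original code used, and I import whisper inside the function (an assumption on my part) so the file still loads on a machine where Whisper is not installed yet. The first call downloads the tiny.en weights; after that they come from the local cache.

```python
MODEL_NAME = "tiny.en"  # smallest English-only Whisper model


def load_model(name=MODEL_NAME):
    """Load a Whisper model by name (downloaded once, then cached locally)."""
    import whisper  # imported here so this file loads even without Whisper installed

    return whisper.load_model(name)


def transcribe(model, audio):
    """Turn a 1-D float32 NumPy array sampled at 16 kHz into text."""
    # fp16=False avoids the half-precision warning when running on a CPU.
    result = model.transcribe(audio, fp16=False)
    return result["text"].strip()
```

Typical use would be model = load_model() once at startup, then transcribe(model, audio) for every piece of audio you want turned into text.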

Now that the packages and the model are loaded, let’s create a system to capture our sound.


Setting Up the Audio Capture

The first step is capturing audio from your microphone. We’ll use the sounddevice library for this.

How Audio Streaming Works

Think of audio streaming like this: the microphone continuously captures sound waves (i.e., your voice), which are broken into small chunks of data. Your program processes these chunks in real time.

Here’s how we set it up:

  • Sampling Rate: The number of audio samples taken per second. 16 kHz is enough for speech while keeping processing fast, and it is also the rate Whisper expects when you hand it raw audio.
  • Queue: Think of it like a conveyor belt where each chunk of audio waits to be processed.

We also define a callback function to handle incoming audio:
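
A sketch of the setup and the callback, assuming the names SAMPLE_RATE, audio_queue and audio_callback (the originals may differ). The four-argument signature is the one sounddevice uses when it calls an input-stream callback.

```python
import queue
import sys

import numpy as np

SAMPLE_RATE = 16000          # samples per second, enough for speech
audio_queue = queue.Queue()  # the "conveyor belt" of audio chunks


def audio_callback(indata, frames, time, status):
    """Called by sounddevice for every new block of microphone audio."""
    if status:
        # Report overflows and similar stream problems without crashing.
        print(status, file=sys.stderr)
    # sounddevice reuses its buffer, so store a copy rather than indata itself.
    audio_queue.put(np.copy(indata))
```

The copy matters: without it, every item in the queue would point at the same buffer, which gets overwritten on the next callback.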

This function does three things:

  1. Checks for errors in the audio stream (status).
  2. Takes the audio data (indata) and copies it into our queue.
  3. Returns right away, so the audio stream never waits on slow work like transcription.

Oh, and here is the actual line of code to get the input stream, more on that later.
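
The line itself is missing here, so this is my best guess at it, wrapped in a small helper. The half-second block size and the open_input_stream name are assumptions; the sd.InputStream parameters are real sounddevice ones.

```python
SAMPLE_RATE = 16000            # samples per second
BLOCK_SIZE = SAMPLE_RATE // 2  # assumed: half a second of audio per callback


def open_input_stream(callback):
    """Create a mono microphone stream that feeds float32 chunks to callback."""
    import sounddevice as sd  # imported here so the constants load without audio hardware

    return sd.InputStream(
        samplerate=SAMPLE_RATE,
        blocksize=BLOCK_SIZE,
        channels=1,
        dtype="float32",
        callback=callback,
    )
```

Used as with open_input_stream(audio_callback): ..., the stream starts when the block is entered and stops when it exits.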


Listening for the Wake Word

A wake word is like the program’s “on” button. It tells the program, “Hey, I’m ready to talk now.” Sort of like “Hey, Siri” if you are one of those iPhone users…. 🤨

Here’s how the wake word detection works:

  1. Collect recent audio chunks into a buffer.
  2. Process the buffer to check if it contains the wake word.
  3. If the wake word is detected, start transcribing speech.
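
For step 1, one way to keep only the recent audio is a fixed-length deque, which drops the oldest chunk automatically. This is a sketch: the window of four half-second chunks and the names wake_buffer, remember and recent_audio are all assumptions of mine.

```python
from collections import deque

import numpy as np

WINDOW_CHUNKS = 4  # assumed: four half-second chunks, about two seconds of audio

wake_buffer = deque(maxlen=WINDOW_CHUNKS)  # old chunks fall off the front


def remember(chunk):
    """Add the newest chunk, silently discarding the oldest when full."""
    wake_buffer.append(chunk)


def recent_audio():
    """Return the buffered chunks as one flat float32 array."""
    if not wake_buffer:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(list(wake_buffer)).flatten().astype(np.float32)
```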

Detecting the Wake Word

The logic is simple:

  • Combine the last few chunks of audio.
  • Use the Whisper model to transcribe them.
  • Check if the transcription includes the wake word (e.g., “listen”).
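
The three bullets above can be sketched like this. To keep it testable without a microphone, the transcribe argument is any function that takes audio and returns text (with Whisper, something like lambda audio: model.transcribe(audio, fp16=False)["text"]). The normalize helper is my addition, since Whisper tends to return text with capitals and punctuation.

```python
import re

import numpy as np

WAKE_WORD = "listen"


def normalize(text):
    """Lowercase the text and replace punctuation with single spaces."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def heard_wake_word(chunks, transcribe, wake_word=WAKE_WORD):
    """Join the chunks, transcribe them, and look for the wake word."""
    if not chunks:
        return False
    audio = np.concatenate(chunks).flatten().astype(np.float32)
    return wake_word in normalize(transcribe(audio))
```

Note that this is a plain substring match, so a word like "listening" also triggers it. That is fine for a hobby project, but worth knowing.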

Transcribing Speech in Real-Time

Once the wake word is detected, the program starts transcribing your speech.

Continuous Transcription

Here’s how the transcription process works:
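
A rough sketch of one transcription step, assuming the buffer is a plain list of chunks and that two seconds of audio is "enough" (the real threshold may differ). As before, transcribe is any function that maps audio to text, so you can plug in Whisper or a fake for testing.

```python
import numpy as np

SAMPLE_RATE = 16000
MIN_SECONDS = 2.0  # assumed: wait for two seconds of audio before transcribing


def transcribe_buffer(buffer, transcribe, min_samples=int(SAMPLE_RATE * MIN_SECONDS)):
    """Transcribe and clear the buffer once it holds enough audio, else return None."""
    buffered = sum(len(chunk) for chunk in buffer)
    if buffered < min_samples:
        return None  # not enough audio yet, keep collecting
    audio = np.concatenate(buffer).flatten().astype(np.float32)
    buffer.clear()  # start fresh for the next stretch of speech
    text = transcribe(audio).strip()
    if text:
        print(text, flush=True)  # show the result right away
    return text
```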

Here’s why this works so well:

  • The audio buffer ensures you process enough data for accurate transcription.
  • Clearing the buffer after each transcription keeps things efficient.
  • Results are printed instantly, creating that real-time feel (well, sort of: there is some delay, but it works).

Handling Silence and Stopping (Using a Stop Command)

Nobody wants a program that keeps listening forever. That’s why we add two important features:

  1. Silence Detection: Pause transcription when no one is speaking.
  2. Stop Command: End the program by saying, “stop listening.”

Detecting Silence

Silence detection is based on the average absolute amplitude of the audio (samples range from -1.0 to 1.0):

  • Threshold: If the average absolute amplitude of a chunk is below 0.01, we treat it as silence.
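
In code, the check could look like this. The function name is_silent is my own; the 0.01 threshold is the one from above. Taking the absolute value first matters, because raw samples swing above and below zero and would average out to roughly nothing even for loud speech.

```python
import numpy as np

SILENCE_THRESHOLD = 0.01  # mean absolute amplitude below this counts as silence


def is_silent(chunk, threshold=SILENCE_THRESHOLD):
    """Return True when a chunk of float audio is quieter than the threshold."""
    if chunk.size == 0:
        return True  # an empty chunk has nothing to hear
    return bool(np.mean(np.abs(chunk)) < threshold)
```

If your microphone is noisy, you may need to nudge the threshold up a little.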

Stopping the Program

In the transcription loop, we check for a stop command:
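
A small sketch of that check, assuming a helper called is_stop_command. It lowercases the text and strips punctuation first, so "Stop listening." still matches.

```python
import re

STOP_PHRASE = "stop listening"


def is_stop_command(text, stop_phrase=STOP_PHRASE):
    """Return True if the transcribed text contains the stop phrase."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    cleaned = " ".join(cleaned.split())  # collapse repeated spaces
    return stop_phrase in cleaned
```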

When the stop phrase is detected, the program exits transcription mode gracefully.


Putting It All Together

Here’s the main function that ties everything together:
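
The original main function is not shown here, so this is a sketch of the core loop under my own assumptions: chunks come from any iterable (in the real program, the audio queue), transcribe is any audio-to-text function, four chunks are transcribed per pass, and the stop phrase ends the loop entirely.

```python
import numpy as np

WAKE_WORD = "listen"
STOP_PHRASE = "stop listening"
SILENCE_THRESHOLD = 0.01
CHUNKS_PER_PASS = 4  # assumed: transcribe every four chunks


def is_silent(audio):
    return audio.size == 0 or float(np.mean(np.abs(audio))) < SILENCE_THRESHOLD


def run(chunks, transcribe):
    """Yield transcribed lines between the wake word and the stop phrase."""
    active = False  # False: waiting for the wake word, True: transcribing
    buffer = []
    for chunk in chunks:
        if active and is_silent(chunk):
            continue  # pause while nobody is speaking
        buffer.append(chunk)
        if len(buffer) < CHUNKS_PER_PASS:
            continue
        audio = np.concatenate(buffer).flatten().astype(np.float32)
        buffer.clear()
        if is_silent(audio):
            continue  # a fully quiet window is not worth a model call
        text = transcribe(audio).strip()
        lowered = text.lower()
        if not active:
            active = WAKE_WORD in lowered
        elif STOP_PHRASE in lowered:
            return  # stop command: end the loop
        elif text:
            yield text
```

Writing it as a generator keeps the logic separate from the microphone and the model, which makes it easy to try out with fake audio before wiring in the real thing.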


Overall Source Code

And below is the overall source code. If you have the required packages, just copy and paste it, and hopefully it will work with relatively good accuracy. There are some minor hiccups here and there, but it will be fine (hopefully).
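
The original listing is missing from this copy, so what follows is a reconstruction from the steps described in this post, not the author's exact code. The constants, helper names, half-second chunk size and two-second pass length are all assumptions. The sounddevice and whisper imports live inside main() so the helpers can be imported and tested without a microphone or the model.

```python
import queue
import re
import sys

import numpy as np

SAMPLE_RATE = 16000
CHUNK_SECONDS = 0.5      # assumed chunk length per callback
PASS_SECONDS = 2.0       # assumed amount of audio per transcription pass
WAKE_WORD = "listen"
STOP_PHRASE = "stop listening"
SILENCE_THRESHOLD = 0.01

audio_queue = queue.Queue()


def audio_callback(indata, frames, time, status):
    """Called by sounddevice with each new block of microphone audio."""
    if status:
        print(status, file=sys.stderr)
    audio_queue.put(indata.copy())


def is_silent(audio):
    return audio.size == 0 or float(np.mean(np.abs(audio))) < SILENCE_THRESHOLD


def clean(text):
    """Lowercase and strip punctuation so phrases match reliably."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def handle_text(text, active):
    """Return (new_active, keep_running, line_to_print) for one transcription."""
    cleaned = clean(text)
    if not active:
        return (WAKE_WORD in cleaned), True, None
    if STOP_PHRASE in cleaned:
        return False, False, None
    return True, True, (text.strip() or None)


def main():
    import sounddevice as sd  # needs a working microphone
    import whisper            # downloads tiny.en on first use

    model = whisper.load_model("tiny.en")
    min_samples = int(SAMPLE_RATE * PASS_SECONDS)
    buffer, buffered, active = [], 0, False

    print(f"Say '{WAKE_WORD}' to start and '{STOP_PHRASE}' to quit.")
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32",
                        blocksize=int(SAMPLE_RATE * CHUNK_SECONDS),
                        callback=audio_callback):
        while True:
            chunk = audio_queue.get()
            if active and is_silent(chunk):
                continue  # pause while nobody is speaking
            buffer.append(chunk)
            buffered += len(chunk)
            if buffered < min_samples:
                continue
            audio = np.concatenate(buffer).flatten()
            buffer, buffered = [], 0
            if is_silent(audio):
                continue  # skip the model for a fully quiet window
            text = model.transcribe(audio, fp16=False)["text"]
            was_active = active
            active, keep_running, line = handle_text(text, active)
            if active and not was_active:
                print("Wake word heard, transcribing...")
            if line:
                print(line)
            if not keep_running:
                print("Stopping.")
                break
```

To start listening, call main() at the bottom of the file or from a Python shell.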


Final Thoughts

Well, there you have it: a working (hopefully) real-time offline speech-to-text program. It might sound like rocket science, but as you’ve seen, it’s not. With the right tools and a bit of patience (by a bit I mean a lot), anyone can create one. This project is perfect for anyone looking to dive into audio processing or build something practical, like a personal voice assistant or a TTS Discord bot: Discord TTS Bot: Create Your Own TTS Bot for Discord
