In this post, I’ll walk you through how to build a real-time offline speech-to-text program with OpenAI Whisper using Python. We’ll break everything down step by step, explaining the “why” behind each part so you can actually understand how it works, not just copy-paste the code (although you can do that too, the full source is provided at the end). It’s going to be fun and surprisingly simple!
Here is the list of things we will be covering in this article/blog/tutorial (whatever you wanna call it):
Table of Contents
- Setting up the Environment
- Overall Logic of the Program
- Real Time Offline Speech to Text Program Code
- Overall Source Code
- Final Thoughts
Setting up the Environment
Here’s what you need to follow along:
- Python Installed: Version 3.7 or later. (If you don’t have it, download Python here.)
- Libraries:
  - openai-whisper: Handles the speech-to-text transcription. It’s lightweight and works offline.
  - sounddevice: A library to capture real-time audio from your microphone.
  - numpy: Used for processing audio data efficiently.
  - queue: To manage audio chunks as they come in (it’s part of Python’s standard library, so there’s nothing to install).
- Depending on your system you might also need FFmpeg; you can download it from here: FFmpeg
To install all the libraries, just run:
pip install openai-whisper sounddevice numpy
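If you want to make sure everything installed correctly before going further, a quick sanity check like this (just an optional sketch, not part of the final program) should run without errors, list your audio devices, and download the tiny model on its first run:
import whisper
import sounddevice as sd

print(sd.query_devices())               # lists the audio devices sounddevice can see, so you know a mic is available
model = whisper.load_model("tiny.en")   # downloads the tiny English-only model the first time it runs
print("Whisper tiny.en loaded successfully")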
Before diving into the code: if you want to know how to do this in Discord, you might want to start with Discord TTS Bot: Create Your Own TTS Bot for Discord, or if you want to pass the generated text to your own model, check out How to Customize Your own Ollama Model.
Also, here’s a fun project to improve your coding skills: Your Own Random Anime Image Generator using Nekosia API
Overall Logic of the Program
Before diving in, here’s what the program does; this is the basic logic behind how this real-time offline speech-to-text program works:
- Listens for a Wake Word: The program stays in a listening state and only starts transcribing when it hears a specific wake word (e.g., “Hey Bot” or “Listen”).
- Transcribes Speech in Real-Time: Once activated, it processes audio chunks and converts them to text almost instantly (speed depends on your system, but the tiny model works well even on relatively low-spec hardware; you can learn more about the available models at openai-whisper · PyPI).
- Handles Silence and Stops Gracefully: It detects silence to pause transcription and even lets you stop everything by saying, “stop listening.”
Got all that? Great! Let’s jump into the code.
Real Time Offline Speech to Text Program Code
First we need to import the packages we installed and load the model we want to use. In this tutorial I’m using tiny.en, which is the lightest of the models and is English-only; you can check out the other models here: openai-whisper · PyPI
Importing Packages and Loading Model
import whisper
import sounddevice as sd
import numpy as np
import queue
# Load a smaller Whisper model for faster processing
model = whisper.load_model("tiny.en")
Now that the packages are imported and the model is loaded, let’s set up a system to capture our sound.
Setting Up the Audio Capture
The first step is capturing audio from your microphone. We’ll use the sounddevice library for this.
How Audio Streaming Works
Think of audio streaming like this: the microphone continuously captures sound waves (i.e., your voice), which are broken into small chunks of data. These chunks are then processed by your program in real time.
Here’s how we set it up:
fs = 16000 # Sampling rate (16 kHz, which is standard for voice)
chunk_duration = 1 # Each chunk is 1 second of audio
q = queue.Queue() # A queue to hold audio chunks
- Sampling Rate: The number of audio samples taken per second. 16 kHz is enough for speech while keeping processing fast.
- Queue: Think of it like a conveyor belt where each chunk of audio waits to be processed.
We also define a callback function to handle incoming audio:
def audio_callback(indata, frames, time, status):
    if status:
        print(f"Status: {status}")
    q.put(indata.copy())
This function does three things:
- Checks for errors in the audio stream (status).
- Takes the audio data (indata) and copies it into our queue.
- Keeps things moving smoothly in real time.
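If the conveyor-belt idea feels abstract, here is a tiny, mic-free sketch (my own illustration, not part of the final program) of the same put/get flow the callback and the main loop will use, with a fake numpy chunk standing in for real audio:
import queue
import numpy as np

demo_q = queue.Queue()
fake_chunk = np.zeros((16000, 1), dtype=np.float32)  # stands in for 1 second of audio at 16 kHz

demo_q.put(fake_chunk.copy())   # what audio_callback does with each incoming chunk
chunk = demo_q.get()            # what the main loop does when it is ready to process
print(chunk.shape)              # (16000, 1)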
Oh, and here’s the actual line of code that opens the input stream; more on that later.
with sd.InputStream(samplerate=fs, channels=1, callback=audio_callback, blocksize=int(fs * chunk_duration)):
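To make that concrete, here is a minimal sketch (assuming your default input device works, and reusing fs, chunk_duration, q, and audio_callback from above) that opens the stream for a few chunks and inspects what the callback pushes onto the queue; with these settings, each chunk should come out as a (16000, 1) float32 array:
with sd.InputStream(samplerate=fs, channels=1, callback=audio_callback, blocksize=int(fs * chunk_duration)):
    for _ in range(3):                   # grab ~3 seconds of audio, then stop
        chunk = q.get()                  # blocks until the callback delivers a 1-second chunk
        print(chunk.shape, chunk.dtype)  # expected: (16000, 1) float32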
Listening for the Wake Word
A wake word is like the program’s “on” button. It tells the program, “Hey, I’m ready to talk now.” Sort of like “Hey, Siri” if you are one of those iPhone users…. 🤨
Here’s how the wake word detection works:
- Collect recent audio chunks into a buffer.
- Process the buffer to check if it contains the wake word.
- If the wake word is detected, start transcribing speech.
Detecting the Wake Word
def detect_wake_word(audio_buffer):
    combined_audio = np.concatenate(audio_buffer)
    audio_as_float = combined_audio.flatten()
    result = model.transcribe(audio_as_float, fp16=False)
    return wake_word in result["text"].lower()
The logic is simple:
- Combine the last few chunks of audio.
- Use the Whisper model to transcribe them.
- Check if the transcription includes the wake word (e.g., “listen”).
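The wake word check itself is just a substring test on the transcription, so this rough example (my own, not part of the program) shows exactly what it matches; note that it would also fire on words that contain “listen”, like “listening”:
wake_word = "listen"
sample_text = " Okay, listen up!"               # pretend this is what Whisper transcribed
print(wake_word in sample_text.lower())         # True - "listen" appears in the text
print(wake_word in "I was listening".lower())   # also True - it's a substring match, not a whole-word match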
Transcribing Speech in Real-Time
Once the wake word is detected, the program starts transcribing your speech.
Continuous Transcription
Here’s how the transcription process works:
def transcribe_continuously():
    print("Transcribing... Speak now!")
    audio_buffer = []
    with sd.InputStream(samplerate=fs, channels=1, callback=audio_callback, blocksize=int(fs * chunk_duration)):
        while True:
            audio_data = q.get()
            audio_buffer.append(audio_data)
            combined_audio = np.concatenate(audio_buffer)
            audio_as_float = combined_audio.flatten()
            result = model.transcribe(audio_as_float, fp16=False)
            print(f"You said: {result['text']}")
            audio_buffer.clear()
Here’s why this works so well:
- The audio buffer ensures you process enough data for accurate transcription.
- Clearing the buffer after each transcription keeps things efficient.
- Results are printed almost instantly, creating that real-time feel (well, sort of; there is a bit of delay, but it works).
Handling Silence and Stopping (with a stop command)
Nobody wants a program that keeps listening forever. That’s why we add two important features:
- Silence Detection: Pause transcription when no one is speaking.
- Stop Command: End the program by saying, “stop listening.”
Detecting Silence
Silence detection is based on the average amplitude of the audio:
def is_silent(audio_chunk):
    return np.abs(audio_chunk).mean() < silence_threshold
- Threshold: If the average amplitude is below 0.01, we treat it as silence.
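To see the threshold in action without a microphone, here is a rough illustration with synthetic audio (my own example, reusing numpy and the is_silent function from above):
silence_threshold = 0.01
quiet = np.full((16000, 1), 0.001, dtype=np.float32)                 # barely any signal
loud = np.random.uniform(-0.5, 0.5, (16000, 1)).astype(np.float32)   # clearly audible noise

print(is_silent(quiet))  # True  - mean absolute amplitude is well below 0.01
print(is_silent(loud))   # False - plenty of energy above the threshold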
Stopping the Program
In the transcription loop, we check for a stop command:
if "stop listening" in result["text"].lower():
print("Stopping transcription mode.")
break
When the stop phrase is detected, the program exits transcription mode gracefully.
Putting It All Together
Here’s the main function that ties everything together:
def process_live_audio():
    print("Listening for wake word...")
    audio_buffer = []
    with sd.InputStream(samplerate=fs, channels=1, callback=audio_callback, blocksize=int(fs * chunk_duration)):
        while True:
            audio_data = q.get()
            audio_buffer.append(audio_data)
            if len(audio_buffer) >= 3 and detect_wake_word(audio_buffer[-3:]):
                print("Wake word detected! Listening for speech...")
                transcribe_continuously()
                print("Listening for wake word...")
                audio_buffer.clear()
Overall Source Code
And below is the overall source code. If you have the required packages installed, just copy-paste it and it should work with relatively good accuracy; there are some minor hiccups here and there, but it’ll be fine (hopefully).
import whisper
import sounddevice as sd
import numpy as np
import queue
# Load a smaller Whisper model for faster processing
model = whisper.load_model("tiny.en")
# Audio settings
fs = 16000 # Sampling rate
chunk_duration = 1 # Duration of each audio chunk in seconds
silence_threshold = 0.01 # Silence threshold (adjust as necessary)
silence_duration = 1 # Duration to consider as silence (1 second)
wake_word = "listen" # Define your wake word
q = queue.Queue()
def audio_callback(indata, frames, time, status):
    """Callback function to capture audio data."""
    if status:
        print(f"Status: {status}")
    q.put(indata.copy())
def detect_wake_word(audio_buffer):
    """Detect the wake word in the audio buffer."""
    combined_audio = np.concatenate(audio_buffer)
    audio_as_float = combined_audio.flatten()
    result = model.transcribe(audio_as_float, fp16=False)
    return wake_word in result["text"].lower()
def is_silent(audio_chunk):
    """Check if the audio chunk is silent."""
    return np.abs(audio_chunk).mean() < silence_threshold
def process_live_audio():
    """Continuously listen and process audio."""
    print("Listening for wake word...")
    audio_buffer = []
    with sd.InputStream(
        samplerate=fs, channels=1, callback=audio_callback, blocksize=int(fs * chunk_duration)
    ):
        while True:
            # Collect audio chunks
            audio_data = q.get()
            audio_buffer.append(audio_data)
            # Check for wake word
            if len(audio_buffer) >= 3:  # Use ~3 seconds of audio for wake word detection
                if detect_wake_word(audio_buffer[-3:]):
                    print("Wake word detected! Listening for speech...")
                    transcribe_continuously()
                    print("Listening for wake word...")
                audio_buffer.clear()
def transcribe_continuously():
    """Transcribe speech in fixed chunks for consistent feedback."""
    print("Transcribing... Speak now!")
    audio_buffer = []
    silence_counter = 0  # Counter for silent chunks
    silence_sample_count = int(silence_duration * fs)  # Convert silence duration to samples
    with sd.InputStream(
        samplerate=fs, channels=1, callback=audio_callback, blocksize=int(fs * chunk_duration)
    ):
        while True:
            # Collect audio chunks
            audio_data = q.get()
            audio_buffer.append(audio_data)
            # Check for silence
            if is_silent(audio_data):
                silence_counter += len(audio_data)  # Increment silence counter by number of samples
            else:
                silence_counter = 0  # Reset silence counter if sound is detected
            # Process the audio buffer when there's sufficient data and silence is detected
            if silence_counter >= silence_sample_count and len(audio_buffer) > 0:
                combined_audio = np.concatenate(audio_buffer)
                audio_as_float = combined_audio.flatten()
                result = model.transcribe(audio_as_float, fp16=False)
                print(f"You said: {result['text']}")
                # Clear the buffer after processing
                audio_buffer.clear()
                silence_counter = 0  # Reset silence counter after processing
                # Stop on a specific phrase
                if "stop listening" in result["text"].lower():
                    print("Stopping transcription mode.")
                    break
if __name__ == "__main__":
process_live_audio()
Final Thoughts
Well, there you have it: a working (hopefully) real-time offline speech-to-text program. It might sound like rocket science, and as you’ve seen it kind of is, but with the right tools and a bit of patience (by “a bit” I mean a lot) anyone can create a real-time offline speech-to-text program. This project is perfect for anyone looking to dive into audio processing or build something practical like a personal voice assistant or a TTS Discord bot: Discord TTS Bot: Create Your Own TTS Bot for Discord