In this post, I’ll walk you through how to build a real-time offline speech-to-text program with OpenAI Whisper using Python. We’ll break everything down step by step, explaining the “why” behind each part so you can actually understand how it works, not just copy-paste the code (although you can do that too, the full source is provided at the end). It’s going to be fun and surprisingly simple!
Here is the list of things we will be covering in this article/blog/tutorial (whatever you wanna call it):
Table of Contents
- Setting up the Environment
- Overall Logic of the Program
- Real Time Offline Speech to Text Program Code
- Overall Source Code
- Final Thoughts
Setting up the Environment
Here’s what you need to follow along:
- Python Installed: Version 3.7 or later. (If you don’t have it, download Python here.)
- Libraries:
  - openai-whisper: Handles the speech-to-text transcription. It’s lightweight and works offline.
  - sounddevice: A library to capture real-time audio from your microphone.
  - numpy: Used for processing audio data efficiently.
  - queue: To manage audio chunks as they come in (it’s part of Python’s standard library, so there’s nothing to install).
- Depending on your system you might also need FFmpeg; you can download it from here: FFmpeg
To install all the libraries, just run:
pip install openai-whisper sounddevice numpy
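If you want to make sure everything installed correctly before going further, a quick sanity check like this (just an optional sketch, not part of the final program) should run without errors, list your audio devices, and download the tiny model on its first run:
import whisper
import sounddevice as sd

print(sd.query_devices())               # lists the audio devices sounddevice can see, so you know a mic is available
model = whisper.load_model("tiny.en")   # downloads the tiny English-only model the first time it runs
print("Whisper tiny.en loaded successfully")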
Before diving into the code: if you want to know how to do this in Discord, you might want to start with Discord TTS Bot: Create Your Own TTS Bot for Discord, or if you want to pass the generated text to your own model, check out How to Customize Your own Ollama Model.
Also, here’s a fun project to improve your coding skills: Your Own Random Anime Image Generator using Nekosia API
Overall Logic of the Program
Before diving in, here’s what the program does; this is the basic logic behind how this real-time offline speech-to-text program works:
- Listens for a Wake Word: The program stays in a listening state and only starts transcribing when it hears a specific wake word (e.g., “Hey Bot” or “Listen”).
- Transcribes Speech in Real-Time: Once activated, it processes audio chunks and converts them to text almost instantly (speed depends on your system, but the tiny model works well even on relatively low-spec hardware; you can learn more about the available models at openai-whisper · PyPI).
- Handles Silence and Stops Gracefully: It detects silence to pause transcription and even lets you stop everything by saying, “stop listening.”
Got all that? Great! Let’s jump into the code.
Real Time Offline Speech to Text Program Code
First we need to import the packages we installed and load the model we want to use. In this tutorial I’m using tiny.en, which is the lightest of the models and is English-only; you can check out the other models here: openai-whisper · PyPI
Importing Packages and Loading Model
import whisper
import sounddevice as sd
import numpy as np
import queue
# Load a smaller Whisper model for faster processing
model = whisper.load_model("tiny.en")
Now that the packages are imported and the model is loaded, let’s set up a system to capture our sound.
Setting Up the Audio Capture
The first step is capturing audio from your microphone. We’ll use the sounddevice library for this.
How Audio Streaming Works
Think of audio streaming like this: the microphone continuously captures sound waves (i.e., your voice), which are broken into small chunks of data. These chunks are then processed by your program in real time.
Here’s how we set it up:
fs = 16000 # Sampling rate (16 kHz, which is standard for voice)
chunk_duration = 1 # Each chunk is 1 second of audio
q = queue.Queue() # A queue to hold audio chunks
- Sampling Rate: The number of audio samples taken per second. 16 kHz is enough for speech while keeping processing fast.
- Queue: Think of it like a conveyor belt where each chunk of audio waits to be processed.
We also define a callback function to handle incoming audio:
def audio_callback(indata, frames, time, status):
    if status:
        print(f"Status: {status}")
    q.put(indata.copy())
This function does three things:
- Checks for errors in the audio stream (status).
- Takes the audio data (indata) and copies it into our queue.
- Keeps things moving smoothly in real time.
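If the conveyor-belt idea feels abstract, here is a tiny, mic-free sketch (my own illustration, not part of the final program) of the same put/get flow the callback and the main loop will use, with a fake numpy chunk standing in for real audio:
import queue
import numpy as np

demo_q = queue.Queue()
fake_chunk = np.zeros((16000, 1), dtype=np.float32)  # stands in for 1 second of audio at 16 kHz

demo_q.put(fake_chunk.copy())   # what audio_callback does with each incoming chunk
chunk = demo_q.get()            # what the main loop does when it is ready to process
print(chunk.shape)              # (16000, 1)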
Oh, and here’s the actual line of code that opens the input stream; more on that later.
with sd.InputStream(samplerate=fs, channels=1, callback=audio_callback, blocksize=int(fs * chunk_duration)):
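To make that concrete, here is a minimal sketch (assuming your default input device works, and reusing fs, chunk_duration, q, and audio_callback from above) that opens the stream for a few chunks and inspects what the callback pushes onto the queue; with these settings, each chunk should come out as a (16000, 1) float32 array:
with sd.InputStream(samplerate=fs, channels=1, callback=audio_callback, blocksize=int(fs * chunk_duration)):
    for _ in range(3):                   # grab ~3 seconds of audio, then stop
        chunk = q.get()                  # blocks until the callback delivers a 1-second chunk
        print(chunk.shape, chunk.dtype)  # expected: (16000, 1) float32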
Listening for the Wake Word
A wake word is like the program’s “on” button. It tells the program, “Hey, I’m ready to talk now.” Sort of like “Hey, Siri” if you are one of those iPhone users…. 🤨
Here’s how the wake word detection works:
- Collect recent audio chunks into a buffer.
- Process the buffer to check if it contains the wake word.
- If the wake word is detected, start transcribing speech.
Detecting the Wake Word
def detect_wake_word(audio_buffer):
    combined_audio = np.concatenate(audio_buffer)
    audio_as_float = combined_audio.flatten()
    result = model.transcribe(audio_as_float, fp16=False)
    return wake_word in result["text"].lower()
The logic is simple:
- Combine the last few chunks of audio.
- Use the Whisper model to transcribe them.
- Check if the transcription includes the wake word (e.g., “listen”).
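The wake word check itself is just a substring test on the transcription, so this rough example (my own, not part of the program) shows exactly what it matches; note that it would also fire on words that contain “listen”, like “listening”:
wake_word = "listen"
sample_text = " Okay, listen up!"               # pretend this is what Whisper transcribed
print(wake_word in sample_text.lower())         # True - "listen" appears in the text
print(wake_word in "I was listening".lower())   # also True - it's a substring match, not a whole-word match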
Transcribing Speech in Real-Time
Once the wake word is detected, the program starts transcribing your speech.
Continuous Transcription
Here’s how the transcription process works:
def transcribe_continuously():
    print("Transcribing... Speak now!")
    audio_buffer = []
    with sd.InputStream(samplerate=fs, channels=1, callback=audio_callback, blocksize=int(fs * chunk_duration)):
        while True:
            audio_data = q.get()
            audio_buffer.append(audio_data)
            combined_audio = np.concatenate(audio_buffer)
            audio_as_float = combined_audio.flatten()
            result = model.transcribe(audio_as_float, fp16=False)
            print(f"You said: {result['text']}")
            audio_buffer.clear()
Here’s why this works so well:
- The audio buffer ensures you process enough data for accurate transcription.
- Clearing the buffer after each transcription keeps things efficient.
- Results are printed almost instantly, creating that real-time feel (well, sort of; there is a bit of delay, but it works).
Handling Silence and Stopping (with a stop command)
Nobody wants a program that keeps listening forever. That’s why we add two important features:
- Silence Detection: Pause transcription when no one is speaking.
- Stop Command: End the program by saying, “stop listening.”
Detecting Silence
Silence detection is based on the average amplitude of the audio:
def is_silent(audio_chunk):
    return np.abs(audio_chunk).mean() < silence_threshold
- Threshold: If the average amplitude is below 0.01, we treat it as silence.
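To see the threshold in action without a microphone, here is a rough illustration with synthetic audio (my own example, reusing numpy and the is_silent function from above):
silence_threshold = 0.01
quiet = np.full((16000, 1), 0.001, dtype=np.float32)                 # barely any signal
loud = np.random.uniform(-0.5, 0.5, (16000, 1)).astype(np.float32)   # clearly audible noise

print(is_silent(quiet))  # True  - mean absolute amplitude is well below 0.01
print(is_silent(loud))   # False - plenty of energy above the threshold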
Stopping the Program
In the transcription loop, we check for a stop command:
if "stop listening" in result["text"].lower():
print("Stopping transcription mode.")
break
When the stop phrase is detected, the program exits transcription mode gracefully.
Putting It All Together
Here’s the main function that ties everything together:
def process_live_audio():
    print("Listening for wake word...")
    audio_buffer = []
    with sd.InputStream(samplerate=fs, channels=1, callback=audio_callback, blocksize=int(fs * chunk_duration)):
        while True:
            audio_data = q.get()
            audio_buffer.append(audio_data)
            if len(audio_buffer) >= 3 and detect_wake_word(audio_buffer[-3:]):
                print("Wake word detected! Listening for speech...")
                transcribe_continuously()
                print("Listening for wake word...")
                audio_buffer.clear()
Overall Source Code
And below is the overall source code. If you have the required packages installed, just copy-paste it and it should work with relatively good accuracy; there are some minor hiccups here and there, but it’ll be fine (hopefully).
import whisper
import sounddevice as sd
import numpy as np
import queue
# Load a smaller Whisper model for faster processing
model = whisper.load_model("tiny.en")
# Audio settings
fs = 16000 # Sampling rate
chunk_duration = 1 # Duration of each audio chunk in seconds
silence_threshold = 0.01 # Silence threshold (adjust as necessary)
silence_duration = 1 # Duration to consider as silence (1 second)
wake_word = "listen" # Define your wake word
q = queue.Queue()
def audio_callback(indata, frames, time, status):
    """Callback function to capture audio data."""
    if status:
        print(f"Status: {status}")
    q.put(indata.copy())
def detect_wake_word(audio_buffer):
    """Detect the wake word in the audio buffer."""
    combined_audio = np.concatenate(audio_buffer)
    audio_as_float = combined_audio.flatten()
    result = model.transcribe(audio_as_float, fp16=False)
    return wake_word in result["text"].lower()
def is_silent(audio_chunk):
    """Check if the audio chunk is silent."""
    return np.abs(audio_chunk).mean() < silence_threshold
def process_live_audio():
    """Continuously listen and process audio."""
    print("Listening for wake word...")
    audio_buffer = []
    with sd.InputStream(
        samplerate=fs, channels=1, callback=audio_callback, blocksize=int(fs * chunk_duration)
    ):
        while True:
            # Collect audio chunks
            audio_data = q.get()
            audio_buffer.append(audio_data)
            # Check for wake word
            if len(audio_buffer) >= 3:  # Use ~3 seconds of audio for wake word detection
                if detect_wake_word(audio_buffer[-3:]):
                    print("Wake word detected! Listening for speech...")
                    transcribe_continuously()
                    print("Listening for wake word...")
                audio_buffer.clear()
def transcribe_continuously():
    """Transcribe speech in fixed chunks for consistent feedback."""
    print("Transcribing... Speak now!")
    audio_buffer = []
    silence_counter = 0  # Counter for silent chunks
    silence_sample_count = int(silence_duration * fs)  # Convert silence duration to samples
    with sd.InputStream(
        samplerate=fs, channels=1, callback=audio_callback, blocksize=int(fs * chunk_duration)
    ):
        while True:
            # Collect audio chunks
            audio_data = q.get()
            audio_buffer.append(audio_data)
            # Check for silence
            if is_silent(audio_data):
                silence_counter += len(audio_data)  # Increment silence counter by number of samples
            else:
                silence_counter = 0  # Reset silence counter if sound is detected
            # Process the audio buffer when there's sufficient data and silence is detected
            if silence_counter >= silence_sample_count and len(audio_buffer) > 0:
                combined_audio = np.concatenate(audio_buffer)
                audio_as_float = combined_audio.flatten()
                result = model.transcribe(audio_as_float, fp16=False)
                print(f"You said: {result['text']}")
                # Clear the buffer after processing
                audio_buffer.clear()
                silence_counter = 0  # Reset silence counter after processing
                # Stop on a specific phrase
                if "stop listening" in result["text"].lower():
                    print("Stopping transcription mode.")
                    break
if __name__ == "__main__":
process_live_audio()
Final Thoughts
Well, there you have it: a working (hopefully) real-time offline speech-to-text program. It might sound like rocket science, and as you’ve seen it kind of is, but with the right tools and a bit of patience (by “a bit” I mean a lot) anyone can create a real-time offline speech-to-text program. This project is perfect for anyone looking to dive into audio processing or build something practical like a personal voice assistant or a TTS Discord bot: Discord TTS Bot: Create Your Own TTS Bot for Discord