Voice Assistants and Audio Processing with Python
Learn how to build voice assistants and process audio data using Python, from basic speech recognition to advanced natural language understanding and audio analysis.
Introduction
Voice assistants and audio processing technologies have become increasingly prevalent in our daily lives. From smart speakers like Amazon Echo and Google Home to voice-controlled applications on our phones and computers, these technologies are transforming how we interact with devices and access information.
In this tutorial, we'll explore how to build voice assistants and process audio data using Python. We'll cover everything from basic audio processing and speech recognition to advanced natural language understanding and audio analysis techniques.
What You'll Learn
Audio Processing
- Working with audio files in Python
- Audio feature extraction
- Audio filtering and enhancement
- Spectral analysis techniques
Speech Technologies
- Speech recognition with various libraries
- Text-to-speech synthesis
- Voice activity detection
- Speaker identification
Voice Assistants
- Building a complete voice assistant
- Intent recognition and NLU
- Contextual understanding
- Deployment strategies
Prerequisites
Before starting this tutorial, you should have:
- Basic knowledge of Python programming
- Familiarity with installing Python packages using pip
- Basic understanding of machine learning concepts
- A development environment with Python 3.7+ installed
- A microphone for testing voice input (optional but recommended)
Setting Up Your Environment
We'll be using several Python libraries throughout this tutorial. You can install them all at once with the following command:
pip install numpy scipy matplotlib librosa soundfile pyaudio SpeechRecognition pyttsx3 transformers torch
Note: pyaudio might require additional system dependencies depending on your operating system:
- On Windows: You might need to install the Visual C++ Build Tools
- On macOS: brew install portaudio
- On Linux: sudo apt-get install python3-pyaudio or sudo apt-get install portaudio19-dev
Applications of Voice and Audio Processing
Voice assistants and audio processing technologies have a wide range of applications across various domains:
Consumer Applications
- Smart home assistants (Alexa, Google Assistant)
- Voice-controlled applications and devices
- Accessibility tools for people with disabilities
- Voice-based authentication systems
Business Applications
- Customer service chatbots and voice bots
- Meeting transcription and summarization
- Voice analytics for call centers
- Voice-based health diagnostics
By the end of this tutorial, you'll have the skills to build your own voice assistant applications and process audio data for various purposes.
Audio Processing Basics
Before diving into voice assistants, it's essential to understand the fundamentals of audio processing. In this section, we'll explore how to work with audio files in Python, extract features from audio signals, and perform basic audio manipulations.
Understanding Audio Data
Audio is a continuous signal that represents variations in air pressure over time. When we work with audio in computers, we need to convert this continuous signal into a discrete representation through a process called sampling.
Key Audio Properties
- Sampling Rate: Number of samples per second (Hz). Common rates include 44.1 kHz (CD quality) and 16 kHz (speech).
- Bit Depth: Number of bits used to represent each sample. Higher bit depth means better amplitude resolution.
- Channels: Number of audio channels (mono = 1, stereo = 2).
- Duration: Length of the audio in seconds.
Common Audio Formats
- WAV: Uncompressed audio format with high quality but large file size.
- MP3: Compressed format with smaller file size but some quality loss.
- FLAC: Lossless compressed format that preserves audio quality.
- OGG: Open-source compressed format.
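If you need to convert between these formats, soundfile can read and write most of them directly (MP3 support depends on your libsndfile version). Here is a minimal sketch, with a placeholder input path:
import soundfile as sf

# Read the source WAV file (data is a NumPy array, sr is the sampling rate)
data, sr = sf.read('path/to/your/audio_file.wav')

# Write a lossless FLAC copy and a compressed OGG copy of the same audio
sf.write('audio_copy.flac', data, sr)
sf.write('audio_copy.ogg', data, sr)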
The Nyquist-Shannon Sampling Theorem
This fundamental theorem states that to accurately represent a signal, the sampling rate must be at least twice the highest frequency in the signal. Human hearing ranges from about 20 Hz to 20 kHz, which is why CD audio uses a 44.1 kHz sampling rate (slightly more than twice 20 kHz).
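To make sampling concrete, the short sketch below generates one second of a 440 Hz sine wave at a 16 kHz sampling rate and writes it to disk; the frequency, duration, and filename are arbitrary choices for illustration:
import numpy as np
import soundfile as sf

sample_rate = 16000   # 16 kHz is plenty for a 440 Hz tone (well below the Nyquist limit of 8 kHz)
duration = 1.0        # seconds
frequency = 440.0     # Hz (concert A)

# One sample every 1/sample_rate seconds
t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * frequency * t)

sf.write('sine_440hz.wav', tone, sample_rate)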
Working with Audio Files in Python
Python offers several libraries for working with audio data. We'll focus on librosa, a powerful library for audio and music analysis, and soundfile for reading and writing audio files.
Loading and Playing Audio
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf
from IPython.display import Audio
# Load an audio file
file_path = 'path/to/your/audio_file.wav'
audio, sample_rate = librosa.load(file_path, sr=None) # sr=None preserves the original sample rate
# Display basic information
print(f"Audio shape: {audio.shape}")
print(f"Sample rate: {sample_rate} Hz")
print(f"Duration: {len(audio) / sample_rate:.2f} seconds")
# Play the audio (in Jupyter notebooks)
Audio(data=audio, rate=sample_rate)
# Visualize the waveform
plt.figure(figsize=(12, 4))
librosa.display.waveshow(audio, sr=sample_rate)
plt.title('Audio Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.tight_layout()
plt.show()
Recording Audio
To record audio from a microphone, we can use the pyaudio library:
import pyaudio
import wave
import numpy as np
def record_audio(filename, duration=5, sample_rate=16000, channels=1):
"""
Record audio from the microphone and save to a file.
Parameters:
- filename: Output file name (WAV format)
- duration: Recording duration in seconds
- sample_rate: Sampling rate in Hz
- channels: Number of audio channels
"""
# Initialize PyAudio
p = pyaudio.PyAudio()
# Open stream
stream = p.open(format=pyaudio.paInt16,
channels=channels,
rate=sample_rate,
input=True,
frames_per_buffer=1024)
print(f"Recording for {duration} seconds...")
frames = []
# Record audio in chunks
for i in range(0, int(sample_rate / 1024 * duration)):
data = stream.read(1024)
frames.append(data)
print("Recording finished.")
# Stop and close the stream
stream.stop_stream()
stream.close()
p.terminate()
# Save the recorded audio to a WAV file
wf = wave.open(filename, 'wb')
wf.setnchannels(channels)
wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
wf.setframerate(sample_rate)
wf.writeframes(b''.join(frames))
wf.close()
print(f"Audio saved to {filename}")
# Example usage
record_audio('recorded_audio.wav', duration=5)
Audio Feature Extraction
Audio features are numerical representations that capture different aspects of audio signals. These features are essential for tasks like speech recognition, music classification, and audio analysis.
Time-Domain Features
import librosa
import numpy as np
import matplotlib.pyplot as plt
# Load audio file
audio, sample_rate = librosa.load('path/to/your/audio_file.wav', sr=None)
# Calculate energy
energy = np.sum(audio**2) / len(audio)
print(f"Energy: {energy:.6f}")
# Calculate zero-crossing rate
zero_crossings = librosa.feature.zero_crossing_rate(audio)[0]
print(f"Zero-crossing rate: {np.mean(zero_crossings):.6f}")
# Calculate root mean square energy (RMS)
rms = librosa.feature.rms(y=audio)[0]
print(f"RMS energy: {np.mean(rms):.6f}")
# Visualize RMS energy over time
plt.figure(figsize=(12, 8))
plt.subplot(2, 1, 1)
librosa.display.waveshow(audio, sr=sample_rate)
plt.title('Audio Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.subplot(2, 1, 2)
frames = np.arange(len(rms))
t = librosa.frames_to_time(frames, sr=sample_rate)
plt.plot(t, rms)
plt.title('RMS Energy Over Time')
plt.xlabel('Time (s)')
plt.ylabel('RMS Energy')
plt.tight_layout()
plt.show()
Frequency-Domain Features
The frequency domain provides insights into the spectral content of audio signals. The Short-Time Fourier Transform (STFT) is a common technique to analyze how frequency content changes over time.
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
# Load audio file
audio, sample_rate = librosa.load('path/to/your/audio_file.wav', sr=None)
# Compute the Short-Time Fourier Transform (STFT)
stft = librosa.stft(audio)
magnitude = np.abs(stft) # Magnitude of the STFT
phase = np.angle(stft) # Phase of the STFT
# Convert to decibels (dB)
magnitude_db = librosa.amplitude_to_db(magnitude, ref=np.max)
# Visualize the spectrogram
plt.figure(figsize=(12, 8))
plt.subplot(2, 1, 1)
librosa.display.waveshow(audio, sr=sample_rate)
plt.title('Audio Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.subplot(2, 1, 2)
librosa.display.specshow(magnitude_db, sr=sample_rate, x_axis='time', y_axis='hz')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.tight_layout()
plt.show()
# Extract spectral features
spectral_centroid = librosa.feature.spectral_centroid(y=audio, sr=sample_rate)[0]
spectral_bandwidth = librosa.feature.spectral_bandwidth(y=audio, sr=sample_rate)[0]
spectral_rolloff = librosa.feature.spectral_rolloff(y=audio, sr=sample_rate)[0]
print(f"Spectral Centroid (mean): {np.mean(spectral_centroid):.2f} Hz")
print(f"Spectral Bandwidth (mean): {np.mean(spectral_bandwidth):.2f} Hz")
print(f"Spectral Rolloff (mean): {np.mean(spectral_rolloff):.2f} Hz")
Mel-Frequency Cepstral Coefficients (MFCCs)
MFCCs are one of the most widely used features in speech and audio processing. They capture the short-term power spectrum of a sound and are particularly useful for speech recognition.
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
# Load audio file
audio, sample_rate = librosa.load('path/to/your/audio_file.wav', sr=None)
# Extract MFCCs
mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)
# Visualize MFCCs
plt.figure(figsize=(12, 4))
librosa.display.specshow(mfccs, sr=sample_rate, x_axis='time')
plt.colorbar(format='%+2.0f')
plt.title('MFCCs')
plt.xlabel('Time (s)')
plt.ylabel('MFCC Coefficients')
plt.tight_layout()
plt.show()
# Calculate statistics of MFCCs
mfcc_means = np.mean(mfccs, axis=1)
mfcc_vars = np.var(mfccs, axis=1)
print("MFCC Means:")
for i, mean in enumerate(mfcc_means):
print(f"MFCC {i+1}: {mean:.4f}")
Why MFCCs?
MFCCs are designed to mimic how the human ear perceives sound. They use the Mel scale, which is a perceptual scale of pitches judged by listeners to be equal in distance from one another. This makes MFCCs particularly effective for speech recognition and other audio classification tasks.
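To get a feel for the Mel scale, the small sketch below converts a few frequencies from Hertz to mels and builds the Mel filter bank that underlies MFCC computation, using librosa's helper functions (the sampling rate and filter counts are illustrative choices):
import librosa

# Equal steps in Hertz are not equal steps in perceived pitch
for hz in [100, 500, 1000, 4000, 8000]:
    print(f"{hz} Hz -> {librosa.hz_to_mel(hz):.1f} mel")

# The triangular Mel filter bank used when computing MFCCs
mel_basis = librosa.filters.mel(sr=16000, n_fft=2048, n_mels=40)
print(f"Mel filter bank shape: {mel_basis.shape}")  # (n_mels, 1 + n_fft // 2)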
Basic Audio Manipulations
Now that we understand how to load and analyze audio, let's explore some basic manipulations we can perform on audio signals.
Changing Volume
import librosa
import soundfile as sf
import numpy as np
# Load audio file
audio, sample_rate = librosa.load('path/to/your/audio_file.wav', sr=None)
# Increase volume (multiply by a factor > 1)
audio_louder = audio * 1.5
# Decrease volume (multiply by a factor < 1)
audio_quieter = audio * 0.5
# Save the modified audio
sf.write('louder_audio.wav', audio_louder, sample_rate)
sf.write('quieter_audio.wav', audio_quieter, sample_rate)
Changing Speed and Pitch
import librosa
import soundfile as sf
import numpy as np
import pyrubberband as pyrb
# Load audio file
audio, sample_rate = librosa.load('path/to/your/audio_file.wav', sr=None)
# Change speed (without changing pitch)
# Factor > 1 speeds up, factor < 1 slows down
speed_factor = 1.5
audio_fast = pyrb.time_stretch(audio, sample_rate, speed_factor)
# Change pitch (without changing speed)
# Positive semitones increase pitch, negative decrease
pitch_shift = 4 # Shift up by 4 semitones
audio_high_pitch = pyrb.pitch_shift(audio, sample_rate, pitch_shift)
# Save the modified audio
sf.write('fast_audio.wav', audio_fast, sample_rate)
sf.write('high_pitch_audio.wav', audio_high_pitch, sample_rate)
Note: The pyrubberband library requires the Rubber Band library to be installed on your system. If you encounter issues, you can use librosa's built-in functions instead:
# Using librosa for time stretching and pitch shifting
audio_fast = librosa.effects.time_stretch(audio, rate=speed_factor)
audio_high_pitch = librosa.effects.pitch_shift(audio, sr=sample_rate, n_steps=pitch_shift)
Applying Filters
import librosa
import soundfile as sf
import numpy as np
from scipy import signal
# Load audio file
audio, sample_rate = librosa.load('path/to/your/audio_file.wav', sr=None)
# Low-pass filter (keeps frequencies below the cutoff)
cutoff_low = 1000 # 1000 Hz cutoff
b, a = signal.butter(4, cutoff_low, 'low', fs=sample_rate)
audio_low_pass = signal.filtfilt(b, a, audio)
# High-pass filter (keeps frequencies above the cutoff)
cutoff_high = 1000 # 1000 Hz cutoff
b, a = signal.butter(4, cutoff_high, 'high', fs=sample_rate)
audio_high_pass = signal.filtfilt(b, a, audio)
# Band-pass filter (keeps frequencies between the cutoffs)
cutoff_low = 500 # 500 Hz lower cutoff
cutoff_high = 2000 # 2000 Hz upper cutoff
b, a = signal.butter(4, [cutoff_low, cutoff_high], 'band', fs=sample_rate)
audio_band_pass = signal.filtfilt(b, a, audio)
# Save the filtered audio
sf.write('low_pass_audio.wav', audio_low_pass, sample_rate)
sf.write('high_pass_audio.wav', audio_high_pass, sample_rate)
sf.write('band_pass_audio.wav', audio_band_pass, sample_rate)
Noise Reduction
Noise reduction is a common preprocessing step for speech recognition and other audio applications. Here's a simple approach using spectral subtraction:
import librosa
import soundfile as sf
import numpy as np
import matplotlib.pyplot as plt
# Load audio file
audio, sample_rate = librosa.load('path/to/your/noisy_audio.wav', sr=None)
# Assume the first 1 second is noise (adjust as needed)
noise_sample = audio[:int(sample_rate)]
# Compute the noise profile
noise_stft = librosa.stft(noise_sample)
noise_power = np.mean(np.abs(noise_stft)**2, axis=1)
noise_power = noise_power[:, np.newaxis]
# Compute the STFT of the audio
audio_stft = librosa.stft(audio)
audio_power = np.abs(audio_stft)**2
# Perform spectral subtraction
gain = 1 - (noise_power / audio_power)
gain = np.maximum(0, gain) # Ensure non-negative values
gain = gain**0.5 # Apply square root for magnitude
# Apply the gain to the STFT
audio_stft_denoised = audio_stft * gain
# Convert back to time domain
audio_denoised = librosa.istft(audio_stft_denoised)
# Save the denoised audio
sf.write('denoised_audio.wav', audio_denoised, sample_rate)
# Visualize the results
plt.figure(figsize=(12, 8))
plt.subplot(2, 1, 1)
librosa.display.waveshow(audio, sr=sample_rate)
plt.title('Original Noisy Audio')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.subplot(2, 1, 2)
librosa.display.waveshow(audio_denoised, sr=sample_rate)
plt.title('Denoised Audio')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.tight_layout()
plt.show()
Practice Exercise: Audio Feature Extraction Pipeline
Let's create a complete audio feature extraction pipeline that can be used for various audio analysis tasks:
import librosa
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
import os
def extract_features(file_path, n_mfcc=13, n_chroma=12, n_spectral=7):
"""
Extract audio features from a file.
Parameters:
- file_path: Path to the audio file
- n_mfcc: Number of MFCCs to extract
- n_chroma: Number of chroma features
- n_spectral: Number of spectral features
Returns:
- Dictionary of features
"""
# Load the audio file
try:
audio, sample_rate = librosa.load(file_path, sr=None, res_type='kaiser_fast')
except Exception as e:
print(f"Error loading {file_path}: {e}")
return None
# Initialize the feature dictionary
features = {}
# Basic properties
features['duration'] = librosa.get_duration(y=audio, sr=sample_rate)
features['sample_rate'] = sample_rate
# Time-domain features
features['zero_crossing_rate'] = np.mean(librosa.feature.zero_crossing_rate(audio)[0])
features['energy'] = np.mean(librosa.feature.rms(y=audio)[0])
# Spectral features
if n_spectral > 0:
spectral_centroid = librosa.feature.spectral_centroid(y=audio, sr=sample_rate)[0]
features['spectral_centroid_mean'] = np.mean(spectral_centroid)
features['spectral_centroid_var'] = np.var(spectral_centroid)
spectral_bandwidth = librosa.feature.spectral_bandwidth(y=audio, sr=sample_rate)[0]
features['spectral_bandwidth_mean'] = np.mean(spectral_bandwidth)
features['spectral_bandwidth_var'] = np.var(spectral_bandwidth)
spectral_rolloff = librosa.feature.spectral_rolloff(y=audio, sr=sample_rate)[0]
features['spectral_rolloff_mean'] = np.mean(spectral_rolloff)
features['spectral_rolloff_var'] = np.var(spectral_rolloff)
# MFCCs
if n_mfcc > 0:
mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc)
for i in range(n_mfcc):
features[f'mfcc{i+1}_mean'] = np.mean(mfccs[i])
features[f'mfcc{i+1}_var'] = np.var(mfccs[i])
# Chroma features
if n_chroma > 0:
chroma = librosa.feature.chroma_stft(y=audio, sr=sample_rate, n_chroma=n_chroma)
for i in range(n_chroma):
features[f'chroma{i+1}_mean'] = np.mean(chroma[i])
features[f'chroma{i+1}_var'] = np.var(chroma[i])
return features
def extract_features_from_directory(directory, extension='.wav'):
"""
Extract features from all audio files in a directory.
Parameters:
- directory: Path to the directory containing audio files
- extension: File extension to filter by
Returns:
- DataFrame of features
"""
features_list = []
file_paths = []
# Get all audio files in the directory
for root, _, files in os.walk(directory):
for file in files:
if file.endswith(extension):
file_path = os.path.join(root, file)
file_paths.append(file_path)
# Extract features from each file
for file_path in tqdm(file_paths, desc="Extracting features"):
features = extract_features(file_path)
if features is not None:
features['file_path'] = file_path
features['file_name'] = os.path.basename(file_path)
features_list.append(features)
# Create a DataFrame from the features
df = pd.DataFrame(features_list)
return df
# Example usage
if __name__ == "__main__":
# Extract features from a single file
features = extract_features('path/to/your/audio_file.wav')
print(features)
# Extract features from all WAV files in a directory
df = extract_features_from_directory('path/to/your/audio_directory')
print(df.head())
# Save the features to a CSV file
df.to_csv('audio_features.csv', index=False)
This pipeline extracts a comprehensive set of audio features that can be used for various tasks like speech recognition, music genre classification, and emotion detection. Try running it on different types of audio files and explore how the features vary across different sounds.
Speech Recognition
Speech recognition is the technology that enables computers to convert spoken language into text. In this section, we'll explore different approaches to speech recognition using Python libraries.
Understanding Speech Recognition
Speech recognition systems typically follow a pipeline that includes:
- Audio Capture: Recording audio from a microphone or loading from a file
- Preprocessing: Noise reduction, normalization, and feature extraction
- Feature Extraction: Converting audio into features like MFCCs
- Acoustic Modeling: Mapping audio features to phonetic units
- Language Modeling: Determining the most likely sequence of words
- Decoding: Converting the model output into text
Modern speech recognition systems use deep learning models like recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformer-based architectures.
Speech Recognition Challenges
Speech recognition faces several challenges:
- Accent and Dialect Variations: Different accents and dialects can affect recognition accuracy
- Background Noise: Environmental noise can interfere with speech recognition
- Homonyms: Words that sound the same but have different meanings
- Continuous Speech: Recognizing words in continuous speech without clear pauses
- Speaker Independence: Recognizing speech from different speakers
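One practical way to mitigate background noise with the SpeechRecognition library is to calibrate and tune the recognizer's energy and pause thresholds. The values below are illustrative starting points, not recommended settings:
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    # Measure the ambient noise floor and set the energy threshold accordingly
    recognizer.adjust_for_ambient_noise(source, duration=1)
    recognizer.dynamic_energy_threshold = True   # keep adapting as the noise level changes
    recognizer.pause_threshold = 0.8             # seconds of silence that ends a phrase
    print(f"Calibrated energy threshold: {recognizer.energy_threshold:.0f}")
    audio = recognizer.listen(source, phrase_time_limit=5)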
Speech Recognition in Python
Python offers several libraries for speech recognition, ranging from simple API wrappers to complete deep learning frameworks.
Using the SpeechRecognition Library
The SpeechRecognition library provides a simple interface to various speech recognition APIs and engines.
import speech_recognition as sr
def recognize_from_file(audio_file_path, language='en-US'):
"""
Recognize speech from an audio file using Google's speech recognition API.
Parameters:
- audio_file_path: Path to the audio file
- language: Language code (default: 'en-US')
Returns:
- Recognized text
"""
# Initialize recognizer
recognizer = sr.Recognizer()
# Load audio file
with sr.AudioFile(audio_file_path) as source:
# Record the audio data
audio_data = recognizer.record(source)
try:
# Recognize speech using Google Speech Recognition
text = recognizer.recognize_google(audio_data, language=language)
print(f"Google Speech Recognition thinks you said: {text}")
return text
except sr.UnknownValueError:
print("Google Speech Recognition could not understand audio")
return None
except sr.RequestError as e:
print(f"Could not request results from Google Speech Recognition service; {e}")
return None
def recognize_from_microphone(language='en-US', duration=5):
"""
Recognize speech from the microphone using Google's speech recognition API.
Parameters:
- language: Language code (default: 'en-US')
- duration: Recording duration in seconds (default: 5)
Returns:
- Recognized text
"""
# Initialize recognizer
recognizer = sr.Recognizer()
# Use the microphone as source
with sr.Microphone() as source:
print("Adjusting for ambient noise...")
recognizer.adjust_for_ambient_noise(source, duration=1)
print(f"Listening for {duration} seconds...")
audio_data = recognizer.listen(source, timeout=duration)
try:
# Recognize speech using Google Speech Recognition
text = recognizer.recognize_google(audio_data, language=language)
print(f"Google Speech Recognition thinks you said: {text}")
return text
except sr.UnknownValueError:
print("Google Speech Recognition could not understand audio")
return None
except sr.RequestError as e:
print(f"Could not request results from Google Speech Recognition service; {e}")
return None
# Example usage
if __name__ == "__main__":
# Recognize speech from a file
text = recognize_from_file('path/to/your/audio_file.wav')
# Recognize speech from the microphone
# text = recognize_from_microphone(duration=5)
Note: The SpeechRecognition library supports multiple speech recognition engines:
- recognize_google: Google Web Speech API (requires internet connection)
- recognize_google_cloud: Google Cloud Speech API (requires API key)
- recognize_bing: Microsoft Bing Speech API (requires API key)
- recognize_ibm: IBM Speech to Text API (requires API key)
- recognize_sphinx: CMU Sphinx (offline, no internet required)
- recognize_wit: Wit.ai API (requires API key)
- recognize_azure: Microsoft Azure Speech API (requires API key)
- recognize_houndify: Houndify API (requires API key)
Using CMU Sphinx for Offline Recognition
If you need offline speech recognition, you can use CMU Sphinx through the pocketsphinx library:
import speech_recognition as sr
def recognize_offline(audio_file_path):
"""
Recognize speech from an audio file using CMU Sphinx (offline).
Parameters:
- audio_file_path: Path to the audio file
Returns:
- Recognized text
"""
# Initialize recognizer
recognizer = sr.Recognizer()
# Load audio file
with sr.AudioFile(audio_file_path) as source:
# Record the audio data
audio_data = recognizer.record(source)
try:
# Recognize speech using Sphinx
text = recognizer.recognize_sphinx(audio_data)
print(f"Sphinx thinks you said: {text}")
return text
except sr.UnknownValueError:
print("Sphinx could not understand audio")
return None
except sr.RequestError as e:
print(f"Sphinx error; {e}")
return None
# Example usage
text = recognize_offline('path/to/your/audio_file.wav')
Offline vs. Online Speech Recognition
When choosing a speech recognition solution, consider the trade-offs:
- Online Services (Google, Azure, etc.): Higher accuracy, support for many languages, but require internet connection and may have usage limits or costs
- Offline Solutions (Sphinx, Vosk, etc.): Work without internet, no privacy concerns, but typically lower accuracy and limited language support
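If you want to try an offline engine other than Sphinx, Vosk is one option. The sketch below assumes you have downloaded a Vosk model directory from the Vosk website and have a 16 kHz, 16-bit mono WAV file; both paths are placeholders:
from vosk import Model, KaldiRecognizer  # pip install vosk
import wave
import json

wf = wave.open('path/to/your/audio_16k_mono.wav', 'rb')
model = Model('path/to/vosk-model-small-en-us')      # downloaded Vosk model directory
rec = KaldiRecognizer(model, wf.getframerate())

# Feed the audio to the recognizer in chunks
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

# FinalResult() returns a JSON string with the transcription
result = json.loads(rec.FinalResult())
print(result.get('text', ''))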
Advanced Speech Recognition with Deep Learning
For more advanced speech recognition tasks, you can use deep learning libraries like TensorFlow or PyTorch with pre-trained models.
Using Hugging Face Transformers
The Hugging Face Transformers library provides access to state-of-the-art speech recognition models like Wav2Vec2:
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa
import numpy as np
def recognize_with_wav2vec2(audio_file_path, model_name="facebook/wav2vec2-base-960h"):
"""
Recognize speech from an audio file using Wav2Vec2.
Parameters:
- audio_file_path: Path to the audio file
- model_name: Name of the pre-trained model
Returns:
- Recognized text
"""
# Load the model and processor
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
# Load audio file
audio, sample_rate = librosa.load(audio_file_path, sr=16000)
# Process the audio
inputs = processor(audio=audio, sampling_rate=16000, return_tensors="pt", padding=True)
# Perform inference
with torch.no_grad():
logits = model(**inputs).logits
# Decode the predicted IDs
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
return transcription[0]
# Example usage
text = recognize_with_wav2vec2('path/to/your/audio_file.wav')
print(f"Wav2Vec2 thinks you said: {text}")
Using Whisper
OpenAI's Whisper is a powerful speech recognition model that supports multiple languages and can handle noisy environments:
import whisper
def recognize_with_whisper(audio_file_path, model_size="base"):
"""
Recognize speech from an audio file using OpenAI's Whisper.
Parameters:
- audio_file_path: Path to the audio file
- model_size: Size of the Whisper model ('tiny', 'base', 'small', 'medium', 'large')
Returns:
- Recognized text
"""
# Load the model
model = whisper.load_model(model_size)
# Transcribe the audio
result = model.transcribe(audio_file_path)
return result["text"]
# Example usage
text = recognize_with_whisper('path/to/your/audio_file.wav', model_size="base")
print(f"Whisper thinks you said: {text}")
Choosing the Right Whisper Model
Whisper offers models of different sizes, each with a trade-off between accuracy and computational requirements:
- tiny: Fastest, lowest accuracy, ~39M parameters
- base: Good balance of speed and accuracy, ~74M parameters
- small: Better accuracy, slower, ~244M parameters
- medium: High accuracy, slower, ~769M parameters
- large: Highest accuracy, slowest, ~1.5B parameters
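Whichever size you pick, the dictionary returned by transcribe() also includes the detected language and segment-level timestamps, which are useful for subtitles or aligning text with audio. A small sketch with placeholder paths:
import whisper

model = whisper.load_model("base")                       # pick a size that fits your hardware
result = model.transcribe('path/to/your/audio_file.wav')

print(f"Detected language: {result['language']}")

# Each segment carries start/end times in seconds plus the recognized text
for segment in result["segments"]:
    print(f"[{segment['start']:6.2f} - {segment['end']:6.2f}] {segment['text'].strip()}")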
Real-Time Speech Recognition
For voice assistants, real-time speech recognition is essential. Here's how to implement it using the SpeechRecognition library:
import speech_recognition as sr
import time
def real_time_speech_recognition(timeout=None, phrase_time_limit=None):
"""
Perform real-time speech recognition from the microphone.
Parameters:
- timeout: Maximum number of seconds to wait for speech (None for no timeout)
- phrase_time_limit: Maximum number of seconds for a phrase (None for no limit)
Returns:
- Generator yielding recognized text
"""
# Initialize recognizer
recognizer = sr.Recognizer()
# Use the microphone as source
with sr.Microphone() as source:
print("Adjusting for ambient noise...")
recognizer.adjust_for_ambient_noise(source, duration=1)
print("Listening... (Press Ctrl+C to stop)")
try:
while True:
try:
print("Waiting for speech...")
audio_data = recognizer.listen(source, timeout=timeout, phrase_time_limit=phrase_time_limit)
try:
# Recognize speech using Google Speech Recognition
text = recognizer.recognize_google(audio_data)
print(f"Recognized: {text}")
yield text
except sr.UnknownValueError:
print("Could not understand audio")
except sr.RequestError as e:
print(f"Request error: {e}")
except sr.WaitTimeoutError:
print("Timeout waiting for speech")
except KeyboardInterrupt:
print("Stopped by user")
return
# Example usage
if __name__ == "__main__":
for text in real_time_speech_recognition(timeout=5, phrase_time_limit=5):
# Process the recognized text
if text.lower() == "stop":
print("Stopping...")
break
# Respond to the recognized text
print(f"You said: {text}")
Voice Activity Detection (VAD)
Voice Activity Detection (VAD) is a technique to detect the presence of speech in an audio signal. It's useful for real-time speech recognition to determine when to start and stop recording:
import numpy as np
import librosa
import pyaudio
import wave
import webrtcvad
import struct
import collections
def vad_collector(sample_rate=16000, frame_duration_ms=30, padding_duration_ms=300, vad=None, frames=None):
"""
Generator that yields series of consecutive audio frames comprising each utterance.
Parameters:
- sample_rate: Audio sample rate in Hz
- frame_duration_ms: Duration of each frame in milliseconds
- padding_duration_ms: Amount of padding to include before and after each utterance
- vad: Voice activity detector
- frames: Audio frames
Returns:
- Generator yielding utterances as a list of frames
"""
if vad is None:
vad = webrtcvad.Vad(3) # Aggressiveness mode (0-3)
num_padding_frames = int(padding_duration_ms / frame_duration_ms)
ring_buffer = collections.deque(maxlen=num_padding_frames)
triggered = False
for frame in frames:
is_speech = vad.is_speech(frame, sample_rate)
if not triggered:
ring_buffer.append((frame, is_speech))
num_voiced = len([f for f, speech in ring_buffer if speech])
if num_voiced > 0.9 * ring_buffer.maxlen:
triggered = True
for f, s in ring_buffer:
yield f
ring_buffer.clear()
else:
yield frame
ring_buffer.append((frame, is_speech))
num_unvoiced = len([f for f, speech in ring_buffer if not speech])
if num_unvoiced > 0.9 * ring_buffer.maxlen:
triggered = False
yield None # Signal the end of an utterance
ring_buffer.clear()
def record_with_vad(output_file, duration=10, sample_rate=16000, frame_duration_ms=30):
"""
Record audio with voice activity detection.
Parameters:
- output_file: Output WAV file
- duration: Maximum recording duration in seconds
- sample_rate: Audio sample rate in Hz
- frame_duration_ms: Duration of each frame in milliseconds
Returns:
- List of utterances (each utterance is a list of frames)
"""
# Initialize PyAudio
p = pyaudio.PyAudio()
# Open stream
stream = p.open(format=pyaudio.paInt16,
channels=1,
rate=sample_rate,
input=True,
frames_per_buffer=int(sample_rate * frame_duration_ms / 1000))
# Initialize VAD
vad = webrtcvad.Vad(3) # Aggressiveness mode (0-3)
print(f"Recording for {duration} seconds with VAD...")
frames = []
utterances = []
current_utterance = []
# Record audio in chunks
for i in range(0, int(sample_rate / (sample_rate * frame_duration_ms / 1000) * duration)):
frame = stream.read(int(sample_rate * frame_duration_ms / 1000))
frames.append(frame)
# Stop and close the stream
stream.stop_stream()
stream.close()
p.terminate()
# Process frames with VAD
for frame in vad_collector(sample_rate, frame_duration_ms, 300, vad, frames):
if frame is None:
if current_utterance:
utterances.append(current_utterance)
current_utterance = []
else:
current_utterance.append(frame)
# Add the last utterance if it exists
if current_utterance:
utterances.append(current_utterance)
# Save the utterances to a WAV file
if utterances:
with wave.open(output_file, 'wb') as wf:
wf.setnchannels(1)
wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
wf.setframerate(sample_rate)
for utterance in utterances:
for frame in utterance:
wf.writeframes(frame)
print(f"Recorded {len(utterances)} utterances")
print(f"Audio saved to {output_file}")
return utterances
# Example usage
if __name__ == "__main__":
utterances = record_with_vad('vad_recording.wav', duration=10)
Practice Exercise: Building a Simple Voice Command System
Let's create a simple voice command system that can recognize and respond to basic commands:
import speech_recognition as sr
import time
import os
import webbrowser
import datetime
import random
import pyttsx3
class VoiceCommandSystem:
def __init__(self):
# Initialize recognizer
self.recognizer = sr.Recognizer()
# Initialize text-to-speech engine
self.engine = pyttsx3.init()
# Define commands
self.commands = {
"hello": self.hello,
"time": self.get_time,
"date": self.get_date,
"open browser": self.open_browser,
"search": self.search_web,
"weather": self.get_weather,
"joke": self.tell_joke,
"exit": self.exit_program
}
# Running flag
self.running = True
def speak(self, text):
"""Speak the given text"""
print(f"Assistant: {text}")
self.engine.say(text)
self.engine.runAndWait()
def listen(self, timeout=None, phrase_time_limit=None):
"""Listen for a command"""
with sr.Microphone() as source:
print("Listening...")
self.recognizer.adjust_for_ambient_noise(source, duration=1)
try:
audio = self.recognizer.listen(source, timeout=timeout, phrase_time_limit=phrase_time_limit)
try:
text = self.recognizer.recognize_google(audio).lower()
print(f"You said: {text}")
return text
except sr.UnknownValueError:
self.speak("Sorry, I didn't understand that.")
return None
except sr.RequestError:
self.speak("Sorry, I'm having trouble accessing the recognition service.")
return None
except sr.WaitTimeoutError:
return None
def process_command(self, command):
"""Process the recognized command"""
if not command:
return
# Check for exact command matches
if command in self.commands:
self.commands[command]()
return
# Check for commands that start with specific phrases
if command.startswith("search for "):
query = command[len("search for "):]
self.search_web(query)
return
# Check for partial matches
for cmd, func in self.commands.items():
if cmd in command:
func()
return
self.speak("Sorry, I don't understand that command.")
# Command functions
def hello(self):
"""Respond to hello command"""
responses = ["Hello there!", "Hi!", "Greetings!", "Hello, how can I help you?"]
self.speak(random.choice(responses))
def get_time(self):
"""Tell the current time"""
current_time = datetime.datetime.now().strftime("%I:%M %p")
self.speak(f"The current time is {current_time}")
def get_date(self):
"""Tell the current date"""
current_date = datetime.datetime.now().strftime("%A, %B %d, %Y")
self.speak(f"Today is {current_date}")
def open_browser(self):
"""Open the web browser"""
self.speak("Opening web browser")
webbrowser.open("https://www.google.com")
def search_web(self, query=None):
"""Search the web for a query"""
if not query:
self.speak("What would you like to search for?")
query = self.listen(timeout=5, phrase_time_limit=5)
if not query:
self.speak("Sorry, I didn't catch that.")
return
self.speak(f"Searching for {query}")
webbrowser.open(f"https://www.google.com/search?q={query.replace(' ', '+')}")
def get_weather(self):
"""Get the weather (placeholder)"""
self.speak("I'm sorry, I don't have access to weather information at the moment.")
def tell_joke(self):
"""Tell a joke"""
jokes = [
"Why don't scientists trust atoms? Because they make up everything!",
"Why did the scarecrow win an award? Because he was outstanding in his field!",
"What do you call a fake noodle? An impasta!",
"Why couldn't the bicycle stand up by itself? It was two tired!",
"What do you call a fish with no eyes? Fsh!"
]
self.speak(random.choice(jokes))
def exit_program(self):
"""Exit the program"""
self.speak("Goodbye!")
self.running = False
def run(self):
"""Run the voice command system"""
self.speak("Voice command system is ready. Say 'hello' to start.")
while self.running:
command = self.listen(timeout=5)
if command:
self.process_command(command)
time.sleep(0.1)
# Example usage
if __name__ == "__main__":
voice_system = VoiceCommandSystem()
voice_system.run()
This simple voice command system can recognize basic commands like "hello", "time", "date", "open browser", "search for [query]", "tell me a joke", and "exit". Try extending it with more commands and functionality!
Text-to-Speech Synthesis
Text-to-speech (TTS) synthesis is the technology that converts written text into spoken voice output. In this section, we'll explore different approaches to TTS using Python libraries.
Understanding Text-to-Speech Synthesis
Text-to-speech synthesis typically involves several steps:
- Text Analysis: Parsing and normalizing the input text
- Phonetic Conversion: Converting text to phonetic representations
- Prosody Generation: Determining rhythm, stress, and intonation
- Waveform Generation: Generating the audio waveform
Modern TTS systems use deep learning models to generate natural-sounding speech. These models can be categorized into several types:
- Concatenative TTS: Combines pre-recorded speech segments
- Parametric TTS: Uses statistical models to generate speech parameters
- Neural TTS: Uses neural networks to generate speech directly
TTS Quality Factors
The quality of TTS systems is evaluated based on several factors:
- Naturalness: How natural and human-like the speech sounds
- Intelligibility: How easily the speech can be understood
- Expressiveness: The ability to convey emotions and emphasis
- Voice Variety: The range of different voices available
- Pronunciation: Accuracy of word and phoneme pronunciation
Text-to-Speech in Python
Python offers several libraries for text-to-speech synthesis, ranging from simple wrappers to advanced neural TTS systems.
Using pyttsx3 (Offline TTS)
pyttsx3 is a text-to-speech conversion library that works offline. It uses the speech engines available on your system (e.g., SAPI5 on Windows, NSSpeechSynthesizer on macOS, eSpeak on Linux).
import pyttsx3
def text_to_speech(text, voice_id=None, rate=200, volume=1.0, save_to_file=None):
"""
Convert text to speech using pyttsx3.
Parameters:
- text: Text to convert to speech
- voice_id: Voice ID to use (None for default)
- rate: Speech rate (words per minute)
- volume: Volume (0.0 to 1.0)
- save_to_file: Path to save the speech to a file (None to play directly)
"""
# Initialize the TTS engine
engine = pyttsx3.init()
# Set properties
engine.setProperty('rate', rate)
engine.setProperty('volume', volume)
# Set voice if specified
if voice_id is not None:
engine.setProperty('voice', voice_id)
# Get available voices
voices = engine.getProperty('voices')
print(f"Available voices: {len(voices)}")
for i, voice in enumerate(voices):
print(f"Voice {i}: {voice.id} - {voice.name}")
# Save to file or speak directly
if save_to_file:
engine.save_to_file(text, save_to_file)
engine.runAndWait()
print(f"Speech saved to {save_to_file}")
else:
engine.say(text)
engine.runAndWait()
# Example usage
if __name__ == "__main__":
# Speak directly
text_to_speech("Hello, this is a test of text-to-speech synthesis using pyttsx3.")
# Save to file
text_to_speech("This is a test of saving speech to a file.", save_to_file="tts_output.mp3")
# Use a different voice (if available)
# text_to_speech("This is a test with a different voice.", voice_id="HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Speech\\Voices\\Tokens\\TTS_MS_EN-US_ZIRA_11.0")
Note: The available voices depend on your operating system. On Windows, you can install additional voices through the Windows settings. On macOS, you can use the built-in voices. On Linux, you can install additional voices for eSpeak.
Using gTTS (Google Text-to-Speech)
gTTS (Google Text-to-Speech) is a Python library and CLI tool that interfaces with Google Translate's text-to-speech API. It requires an internet connection but provides high-quality speech.
from gtts import gTTS
import os
from io import BytesIO
from pydub import AudioSegment
from pydub.playback import play
def text_to_speech_gtts(text, lang='en', slow=False, save_to_file=None, play_audio=True):
"""
Convert text to speech using Google Text-to-Speech.
Parameters:
- text: Text to convert to speech
- lang: Language code (e.g., 'en', 'fr', 'es')
- slow: Whether to speak slowly
- save_to_file: Path to save the speech to a file (None to play directly)
- play_audio: Whether to play the audio (if save_to_file is None)
"""
# Create gTTS object
tts = gTTS(text=text, lang=lang, slow=slow)
# Save to file or play directly
if save_to_file:
tts.save(save_to_file)
print(f"Speech saved to {save_to_file}")
elif play_audio:
# Save to a BytesIO object
fp = BytesIO()
tts.write_to_fp(fp)
fp.seek(0)
# Convert to AudioSegment and play
audio = AudioSegment.from_file(fp, format="mp3")
play(audio)
# Example usage
if __name__ == "__main__":
# Speak directly
text_to_speech_gtts("Hello, this is a test of text-to-speech synthesis using Google Text-to-Speech.")
# Save to file
text_to_speech_gtts("This is a test of saving speech to a file.", save_to_file="gtts_output.mp3")
# Use a different language
text_to_speech_gtts("Bonjour, comment ça va?", lang='fr')
gTTS Language Support
gTTS supports a wide range of languages. You can get a list of supported languages using the following code:
from gtts.lang import tts_langs
print(tts_langs())
Advanced Text-to-Speech with Deep Learning
For more advanced text-to-speech tasks, you can use deep learning libraries like TensorFlow or PyTorch with pre-trained models.
Using Mozilla TTS
Mozilla TTS is an open-source text-to-speech framework that provides high-quality speech synthesis using deep learning models.
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer
import numpy as np
import soundfile as sf
def text_to_speech_mozilla(text, model_name="tts_models/en/ljspeech/tacotron2-DDC", vocoder_name="vocoder_models/en/ljspeech/multiband-melgan", save_to_file=None):
"""
Convert text to speech using Mozilla TTS.
Parameters:
- text: Text to convert to speech
- model_name: Name of the TTS model
- vocoder_name: Name of the vocoder model
- save_to_file: Path to save the speech to a file (None to return the audio array)
Returns:
- Audio array if save_to_file is None, otherwise None
"""
# Initialize model manager
model_manager = ModelManager()
# Download models if not already downloaded
model_path, config_path, model_item = model_manager.download_model(model_name)
vocoder_path, vocoder_config_path, _ = model_manager.download_model(vocoder_name)
# Initialize synthesizer
synthesizer = Synthesizer(
tts_checkpoint=model_path,
tts_config_path=config_path,
vocoder_checkpoint=vocoder_path,
vocoder_config=vocoder_config_path
)
# Synthesize speech
outputs = synthesizer.tts(text)
# Save to file or return the audio array
if save_to_file:
sf.write(save_to_file, outputs["wav"], outputs["sampling_rate"])
print(f"Speech saved to {save_to_file}")
return None
else:
return outputs["wav"], outputs["sampling_rate"]
# Example usage
if __name__ == "__main__":
# Synthesize speech and save to file
text_to_speech_mozilla("Hello, this is a test of text-to-speech synthesis using Mozilla TTS.", save_to_file="mozilla_tts_output.wav")
# Synthesize speech and get the audio array
audio, sample_rate = text_to_speech_mozilla("This is another test.")
print(f"Audio shape: {audio.shape}, Sample rate: {sample_rate}")
Using Hugging Face Transformers
The Hugging Face Transformers library provides access to state-of-the-art text-to-speech models like SpeechT5:
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import soundfile as sf
def text_to_speech_huggingface(text, speaker_embedding=None, save_to_file=None):
"""
Convert text to speech using Hugging Face Transformers.
Parameters:
- text: Text to convert to speech
- speaker_embedding: Speaker embedding for voice cloning (None for default)
- save_to_file: Path to save the speech to a file (None to return the audio array)
Returns:
- Audio array if save_to_file is None, otherwise None
"""
# Load models and processor
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
# Process text
inputs = processor(text=text, return_tensors="pt")
# Generate speech
if speaker_embedding is None:
# Use a random speaker embedding
speaker_embedding = torch.randn(1, 512)
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
# Save to file or return the audio array
if save_to_file:
sf.write(save_to_file, speech.numpy(), 16000)
print(f"Speech saved to {save_to_file}")
return None
else:
return speech.numpy(), 16000
# Example usage
if __name__ == "__main__":
# Synthesize speech and save to file
text_to_speech_huggingface("Hello, this is a test of text-to-speech synthesis using Hugging Face Transformers.", save_to_file="huggingface_tts_output.wav")
Voice Cloning with SpeechT5
SpeechT5 supports voice cloning by providing a speaker embedding. You can extract a speaker embedding from a reference audio file using the SpeechT5 model:
from transformers import SpeechT5Processor, SpeechT5ForSpeechToSpeech
import librosa
# Load the model and processor
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
# Load a reference audio file
audio, sample_rate = librosa.load("reference_audio.wav", sr=16000)
# Extract the speaker embedding
inputs = processor(audio=audio, sampling_rate=sample_rate, return_tensors="pt")
speaker_embedding = model.get_speaker_embeddings(inputs["input_values"])
# Use the speaker embedding for text-to-speech
text_to_speech_huggingface("This is voice cloning with SpeechT5.", speaker_embedding=speaker_embedding, save_to_file="voice_cloning_output.wav")
Customizing Text-to-Speech Output
Text-to-speech systems often provide ways to customize the output, such as changing the voice, speed, pitch, and adding pauses or emphasis.
Using SSML (Speech Synthesis Markup Language)
SSML is a markup language that allows you to control how text is spoken. It's supported by many TTS systems, including Google Cloud Text-to-Speech and Amazon Polly.
from google.cloud import texttospeech
import os
# Set up Google Cloud credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your/credentials.json"
def text_to_speech_ssml(ssml_text, language_code="en-US", voice_name="en-US-Wavenet-D", save_to_file="output.mp3"):
"""
Convert SSML text to speech using Google Cloud Text-to-Speech.
Parameters:
- ssml_text: SSML text to convert to speech
- language_code: Language code
- voice_name: Voice name
- save_to_file: Path to save the speech to a file
"""
# Initialize client
client = texttospeech.TextToSpeechClient()
# Set the input
synthesis_input = texttospeech.SynthesisInput(ssml=ssml_text)
# Set the voice
voice = texttospeech.VoiceSelectionParams(
language_code=language_code,
name=voice_name
)
# Set the audio config
audio_config = texttospeech.AudioConfig(
audio_encoding=texttospeech.AudioEncoding.MP3
)
# Perform the synthesis
response = client.synthesize_speech(
input=synthesis_input,
voice=voice,
audio_config=audio_config
)
# Save the audio to a file
with open(save_to_file, "wb") as out:
out.write(response.audio_content)
print(f"Audio content written to {save_to_file}")
# Example usage
if __name__ == "__main__":
ssml_text = """
Here's an example of SSML .
You can add pauses, change the speaking rate ,
adjust the pitch ,
and even add SSML .
"""
text_to_speech_ssml(ssml_text, save_to_file="ssml_output.mp3")
Common SSML Tags
Here are some common SSML tags and their usage:
- <speak>: Root element for SSML
- <break>: Adds a pause (e.g., <break time="1s"/>)
- <emphasis>: Adds emphasis (e.g., <emphasis level="strong">important</emphasis>)
- <prosody>: Controls rate, pitch, and volume (e.g., <prosody rate="slow">slow speech</prosody>)
- <say-as>: Specifies how to interpret text (e.g., <say-as interpret-as="characters">ABC</say-as>)
- <audio>: Inserts an audio file (e.g., <audio src="sound.mp3">fallback text</audio>)
- <voice>: Changes the voice (e.g., <voice name="en-US-Wavenet-F">female voice</voice>)
Practice Exercise: Building a Text-to-Speech Converter
Let's create a simple text-to-speech converter that supports multiple TTS engines and customization options:
import pyttsx3
from gtts import gTTS
import os
import tempfile
import pygame
import time
import threading
class TextToSpeechConverter:
def __init__(self, engine="pyttsx3"):
"""
Initialize the text-to-speech converter.
Parameters:
- engine: TTS engine to use ("pyttsx3" or "gtts")
"""
self.engine_name = engine
self.is_speaking = False
self.stop_speaking = False
if engine == "pyttsx3":
self.engine = pyttsx3.init()
self.voices = self.engine.getProperty('voices')
self.engine.setProperty('rate', 200)
self.engine.setProperty('volume', 1.0)
if self.voices:
self.engine.setProperty('voice', self.voices[0].id)
# Initialize pygame for audio playback
pygame.mixer.init()
def list_voices(self):
"""List available voices"""
if self.engine_name == "pyttsx3":
for i, voice in enumerate(self.voices):
print(f"Voice {i}: {voice.id} - {voice.name}")
else:
print("Voice listing is only supported for pyttsx3 engine.")
def set_voice(self, voice_index):
"""Set the voice by index"""
if self.engine_name == "pyttsx3" and 0 <= voice_index < len(self.voices):
self.engine.setProperty('voice', self.voices[voice_index].id)
return True
return False
def set_rate(self, rate):
"""Set the speech rate"""
if self.engine_name == "pyttsx3":
self.engine.setProperty('rate', rate)
return True
return False
def set_volume(self, volume):
"""Set the volume"""
if self.engine_name == "pyttsx3":
self.engine.setProperty('volume', volume)
return True
return False
def speak(self, text, lang='en', slow=False):
"""
Speak the given text.
Parameters:
- text: Text to speak
- lang: Language code (for gtts)
- slow: Whether to speak slowly (for gtts)
"""
self.stop_speaking = False
self.is_speaking = True
if self.engine_name == "pyttsx3":
def speak_thread():
self.engine.say(text)
self.engine.runAndWait()
self.is_speaking = False
threading.Thread(target=speak_thread).start()
elif self.engine_name == "gtts":
# Create a temporary file
with tempfile.NamedTemporaryFile(delete=False, suffix='.mp3') as fp:
temp_filename = fp.name
# Generate speech
tts = gTTS(text=text, lang=lang, slow=slow)
tts.save(temp_filename)
# Play the speech
def play_thread():
pygame.mixer.music.load(temp_filename)
pygame.mixer.music.play()
# Wait for the audio to finish
while pygame.mixer.music.get_busy() and not self.stop_speaking:
time.sleep(0.1)
# Clean up
pygame.mixer.music.stop()
os.remove(temp_filename)
self.is_speaking = False
threading.Thread(target=play_thread).start()
def stop(self):
"""Stop speaking"""
self.stop_speaking = True
if self.engine_name == "gtts":
pygame.mixer.music.stop()
self.is_speaking = False
def save_to_file(self, text, filename, lang='en', slow=False):
"""
Save speech to a file.
Parameters:
- text: Text to convert to speech
- filename: Output filename
- lang: Language code (for gtts)
- slow: Whether to speak slowly (for gtts)
"""
if self.engine_name == "pyttsx3":
self.engine.save_to_file(text, filename)
self.engine.runAndWait()
elif self.engine_name == "gtts":
tts = gTTS(text=text, lang=lang, slow=slow)
tts.save(filename)
print(f"Speech saved to {filename}")
# Example usage
if __name__ == "__main__":
# Create a TTS converter with pyttsx3
tts_pyttsx3 = TextToSpeechConverter(engine="pyttsx3")
# List available voices
tts_pyttsx3.list_voices()
# Set voice, rate, and volume
tts_pyttsx3.set_voice(0) # Use the first voice
tts_pyttsx3.set_rate(180) # Slightly slower than default
tts_pyttsx3.set_volume(0.8) # 80% volume
# Speak some text
tts_pyttsx3.speak("Hello, this is a test of the pyttsx3 engine.")
# Wait for speech to finish
while tts_pyttsx3.is_speaking:
time.sleep(0.1)
# Create a TTS converter with gtts
tts_gtts = TextToSpeechConverter(engine="gtts")
# Speak some text
tts_gtts.speak("Hello, this is a test of the Google Text-to-Speech engine.")
# Wait for speech to finish
while tts_gtts.is_speaking:
time.sleep(0.1)
# Save speech to a file
tts_pyttsx3.save_to_file("This is a test of saving speech to a file with pyttsx3.", "pyttsx3_output.mp3")
tts_gtts.save_to_file("This is a test of saving speech to a file with Google Text-to-Speech.", "gtts_output.mp3")
This text-to-speech converter supports both pyttsx3 (offline) and gTTS (online) engines, and provides options for customizing the voice, rate, and volume. It also supports saving speech to a file and stopping speech playback.
Building a Voice Assistant
Now that we understand the core components of voice technology, let's put everything together to build a complete voice assistant. In this section, we'll create a voice assistant that can understand commands, respond to queries, and perform actions.
Voice Assistant Architecture
A typical voice assistant consists of several key components working together:
Core Components
- Wake Word Detection: Listens for a specific phrase to activate the assistant
- Speech Recognition: Converts spoken language to text
- Natural Language Understanding (NLU): Extracts intent and entities from text
- Dialog Management: Manages conversation flow and context
- Action Execution: Performs tasks based on user intent
- Text-to-Speech: Converts response text to spoken output
Design Considerations
- Privacy: How user data is collected, stored, and processed
- Latency: Response time for user interactions
- Accuracy: Recognition and understanding accuracy
- Personalization: Adapting to individual users
- Multimodality: Combining voice with other interfaces
- Fallback Strategies: Handling errors and misunderstandings
Online vs. Offline Voice Assistants
Voice assistants can be designed to work online (requiring internet connection) or offline (running entirely on-device):
- Online Assistants: Higher accuracy, more capabilities, but require internet and may raise privacy concerns
- Offline Assistants: More privacy-friendly, work without internet, but typically have limited capabilities
- Hybrid Approaches: Basic functionality works offline, advanced features require internet
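Before implementing each component, it helps to see how they fit together. The sketch below is a deliberately simplified skeleton of this pipeline; every helper is a placeholder you would replace with the speech recognition, NLU, and text-to-speech code developed in this tutorial:
def listen_for_wake_word():
    """Placeholder: return True when the wake word is detected."""
    return input("Type 'wake' to simulate the wake word: ").strip().lower() == "wake"

def speech_to_text():
    """Placeholder: return the user's utterance as text."""
    return input("You: ")

def understand(text):
    """Placeholder NLU: map an utterance to (intent, entities)."""
    if "time" in text.lower():
        return "get_time", {}
    return "unknown", {}

def act(intent, entities):
    """Placeholder action execution: return a response string."""
    if intent == "get_time":
        from datetime import datetime
        return datetime.now().strftime("It is %I:%M %p")
    return "Sorry, I can't help with that yet."

def speak(text):
    """Placeholder TTS: just print the response."""
    print(f"Assistant: {text}")

# Wake word -> speech recognition -> NLU -> action -> TTS
while True:
    if listen_for_wake_word():
        utterance = speech_to_text()
        if utterance.strip().lower() == "exit":
            break
        intent, entities = understand(utterance)
        speak(act(intent, entities))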
Natural Language Understanding (NLU)
Natural Language Understanding is a critical component of voice assistants that extracts meaning from the user's speech. It involves identifying the user's intent and extracting relevant entities from their utterances.
Intent Recognition
Intent recognition determines what the user wants to do. For example, "What's the weather like today?" has a weather intent, while "Set an alarm for 7 AM" has an alarm intent.
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import joblib
import numpy as np
class IntentClassifier:
def __init__(self):
self.nlp = spacy.load("en_core_web_sm")
self.vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=1000)
self.classifier = LinearSVC()
self.intents = []
def train(self, training_data):
"""
Train the intent classifier.
Parameters:
- training_data: List of (text, intent) tuples
"""
texts, intents = zip(*training_data)
# Preprocess texts
processed_texts = [self._preprocess(text) for text in texts]
# Vectorize texts
X = self.vectorizer.fit_transform(processed_texts)
# Train classifier
self.classifier.fit(X, intents)
# Store unique intents
self.intents = list(set(intents))
def predict(self, text):
"""
Predict the intent of a text.
Parameters:
- text: Input text
Returns:
- Predicted intent
- Confidence score
"""
processed_text = self._preprocess(text)
X = self.vectorizer.transform([processed_text])
# Get prediction
intent = self.classifier.predict(X)[0]
# Get confidence score
decision_values = self.classifier.decision_function(X)
confidence = np.max(decision_values)
return intent, confidence
def _preprocess(self, text):
"""Preprocess text by lemmatizing and removing stopwords"""
doc = self.nlp(text.lower())
tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
return " ".join(tokens)
def save(self, filepath):
"""Save the model to a file"""
model_data = {
"vectorizer": self.vectorizer,
"classifier": self.classifier,
"intents": self.intents
}
joblib.dump(model_data, filepath)
def load(self, filepath):
"""Load the model from a file"""
model_data = joblib.load(filepath)
self.vectorizer = model_data["vectorizer"]
self.classifier = model_data["classifier"]
self.intents = model_data["intents"]
# Example usage
if __name__ == "__main__":
# Training data: (text, intent) pairs
training_data = [
("What's the weather like today", "weather"),
("What's the forecast for tomorrow", "weather"),
("Will it rain this weekend", "weather"),
("Set an alarm for 7 AM", "set_alarm"),
("Wake me up at 6:30 tomorrow", "set_alarm"),
("Remind me to call mom at 5 PM", "set_reminder"),
("I need to remember to buy milk", "set_reminder"),
("Play some music", "play_music"),
("I want to listen to jazz", "play_music"),
("Tell me a joke", "tell_joke"),
("What time is it", "get_time"),
("What's the current time", "get_time")
]
# Create and train the classifier
intent_classifier = IntentClassifier()
intent_classifier.train(training_data)
# Test the classifier
test_texts = [
"What's the weather going to be like",
"Set an alarm for tomorrow morning",
"I need to remember my appointment",
"Play some rock music",
"Tell me something funny"
]
for text in test_texts:
intent, confidence = intent_classifier.predict(text)
print(f"Text: '{text}'")
print(f"Predicted intent: {intent} (confidence: {confidence:.2f})")
print()
# Save the model
intent_classifier.save("intent_classifier.joblib")
Entity Extraction
Entity extraction identifies specific pieces of information in the user's utterance, such as dates, times, locations, and names. For example, in "Set an alarm for 7 AM," "7 AM" is a time entity.
import spacy
import re
from datetime import datetime, timedelta
class EntityExtractor:
def __init__(self):
self.nlp = spacy.load("en_core_web_sm")
# Regular expressions for common entities
self.time_pattern = re.compile(r'(\d{1,2})(:\d{2})?\s*(am|pm|AM|PM)?')
self.date_pattern = re.compile(r'(today|tomorrow|yesterday|next week|next month)')
def extract_entities(self, text):
"""
Extract entities from text.
Parameters:
- text: Input text
Returns:
- Dictionary of extracted entities
"""
entities = {}
# Process with spaCy
doc = self.nlp(text)
# Extract named entities
for ent in doc.ents:
if ent.label_ not in entities:
entities[ent.label_] = []
entities[ent.label_].append(ent.text)
# Extract time entities
time_matches = self.time_pattern.findall(text)
if time_matches:
entities['TIME'] = []
for match in time_matches:
hour, minute, period = match
hour = int(hour)
minute = int(minute[1:]) if minute else 0
# Handle AM/PM
if period and period.lower() == 'pm' and hour < 12:
hour += 12
elif period and period.lower() == 'am' and hour == 12:
hour = 0
time_str = f"{hour:02d}:{minute:02d}"
entities['TIME'].append(time_str)
# Extract date entities
date_matches = self.date_pattern.findall(text)
if date_matches:
entities['DATE'] = []
for match in date_matches:
if match.lower() == 'today':
date = datetime.now().strftime('%Y-%m-%d')
elif match.lower() == 'tomorrow':
date = (datetime.now() + timedelta(days=1)).strftime('%Y-%m-%d')
elif match.lower() == 'yesterday':
date = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d')
elif match.lower() == 'next week':
date = (datetime.now() + timedelta(weeks=1)).strftime('%Y-%m-%d')
elif match.lower() == 'next month':
# Simple approximation for next month
date = (datetime.now() + timedelta(days=30)).strftime('%Y-%m-%d')
entities['DATE'].append(date)
return entities
# Example usage
if __name__ == "__main__":
entity_extractor = EntityExtractor()
test_texts = [
"Set an alarm for 7 AM tomorrow",
"Remind me to call John at 3:30 PM today",
"What's the weather like in New York next week",
"Schedule a meeting with Sarah for 2 PM next month"
]
for text in test_texts:
entities = entity_extractor.extract_entities(text)
print(f"Text: '{text}'")
print(f"Extracted entities: {entities}")
print()
Using Pre-built NLU Services
Instead of building your own NLU system, you can use pre-built services like:
- Rasa NLU: Open-source NLU library
- Dialogflow: Google's NLU service
- Wit.ai: Facebook's NLU service
- LUIS: Microsoft's Language Understanding service
- Amazon Lex: Amazon's NLU service
These services provide more advanced features and are easier to integrate, but may require internet connectivity and have usage limits.
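As a rough sketch of what integrating one of these services looks like, the snippet below sends an utterance to Wit.ai's HTTP message endpoint with the requests library. The token and API version are placeholders, and the response fields assumed here ("intents" and "entities") depend on the API version configured for your Wit.ai app, so inspect the JSON your own application returns:
import requests
WIT_TOKEN = "YOUR_WIT_SERVER_TOKEN"  # placeholder: create an app at wit.ai to get a token
WIT_API_VERSION = "20240101"         # placeholder API version date
def wit_understand(utterance):
    """Send an utterance to Wit.ai and return the top intent and raw entities."""
    response = requests.get(
        "https://api.wit.ai/message",
        params={"v": WIT_API_VERSION, "q": utterance},
        headers={"Authorization": f"Bearer {WIT_TOKEN}"},
        timeout=10,
    )
    response.raise_for_status()
    data = response.json()
    # Recent API versions return a ranked "intents" list; older versions differ
    intents = data.get("intents", [])
    top_intent = intents[0]["name"] if intents else None
    return top_intent, data.get("entities", {})
if __name__ == "__main__":
    intent, entities = wit_understand("Set an alarm for 7 AM tomorrow")
    print("Intent:", intent)
    print("Entities:", entities)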
Dialog Management
Dialog management is responsible for maintaining the conversation flow and context. It determines how the voice assistant should respond to user inputs based on the current state of the conversation.
State-Based Dialog Management
A simple approach to dialog management is to use a state machine, where the conversation transitions between different states based on user inputs.
class DialogState:
def __init__(self, name, handler):
self.name = name
self.handler = handler
self.transitions = {}
def add_transition(self, intent, next_state):
"""Add a transition to another state based on intent"""
self.transitions[intent] = next_state
def next_state(self, intent):
"""Get the next state based on intent"""
return self.transitions.get(intent, self)
def handle(self, intent, entities):
"""Handle the current state"""
return self.handler(intent, entities)
class DialogManager:
def __init__(self):
self.states = {}
self.current_state = None
self.context = {}
def add_state(self, state):
"""Add a state to the dialog manager"""
self.states[state.name] = state
if self.current_state is None:
self.current_state = state
    def process(self, intent, entities):
        """Process user input and return a response"""
        # Update context with new entities
        for entity_type, values in entities.items():
            self.context[entity_type] = values
        # Transition first so the state that owns this intent produces the response
        # (otherwise every reply would lag one turn behind the user)
        self.current_state = self.current_state.next_state(intent)
        # Handle the (possibly new) current state
        response = self.current_state.handle(intent, self.context)
        return response
# Example usage
def greeting_handler(intent, context):
return "Hello! How can I help you today?"
def weather_handler(intent, context):
location = context.get('GPE', ['your location'])[0]
date = context.get('DATE', ['today'])[0]
return f"The weather in {location} for {date} is sunny with a high of 75°F."
def alarm_handler(intent, context):
time = context.get('TIME', [''])[0]
date = context.get('DATE', ['today'])[0]
if time:
return f"I've set an alarm for {time} on {date}."
else:
return "What time would you like to set the alarm for?"
def reminder_handler(intent, context):
time = context.get('TIME', [''])[0]
date = context.get('DATE', ['today'])[0]
if time:
return f"I'll remind you at {time} on {date}."
else:
return "When would you like to be reminded?"
def fallback_handler(intent, context):
return "I'm not sure how to help with that. Can you try rephrasing?"
# Create dialog states
greeting_state = DialogState("greeting", greeting_handler)
weather_state = DialogState("weather", weather_handler)
alarm_state = DialogState("alarm", alarm_handler)
reminder_state = DialogState("reminder", reminder_handler)
fallback_state = DialogState("fallback", fallback_handler)
# Set up transitions
greeting_state.add_transition("weather", weather_state)
greeting_state.add_transition("set_alarm", alarm_state)
greeting_state.add_transition("set_reminder", reminder_state)
weather_state.add_transition("set_alarm", alarm_state)
weather_state.add_transition("set_reminder", reminder_state)
weather_state.add_transition("weather", weather_state)
alarm_state.add_transition("set_reminder", reminder_state)
alarm_state.add_transition("weather", weather_state)
alarm_state.add_transition("set_alarm", alarm_state)
reminder_state.add_transition("set_alarm", alarm_state)
reminder_state.add_transition("weather", weather_state)
reminder_state.add_transition("set_reminder", reminder_state)
# Create dialog manager
dialog_manager = DialogManager()
dialog_manager.add_state(greeting_state)
dialog_manager.add_state(weather_state)
dialog_manager.add_state(alarm_state)
dialog_manager.add_state(reminder_state)
dialog_manager.add_state(fallback_state)
# Example conversation
print(dialog_manager.process("greeting", {}))
print(dialog_manager.process("weather", {"GPE": ["New York"]}))
print(dialog_manager.process("set_alarm", {"TIME": ["07:00"], "DATE": ["tomorrow"]}))
Frame-Based Dialog Management
Frame-based dialog management uses "frames" or "slots" to track the information needed to complete a task. The system prompts the user for missing information until all required slots are filled.
class DialogFrame:
def __init__(self, name, slots=None, handler=None):
self.name = name
self.slots = slots or {}
self.handler = handler
self.required_slots = set()
def add_slot(self, slot_name, prompt, required=False):
"""Add a slot to the frame"""
self.slots[slot_name] = {
"value": None,
"prompt": prompt
}
if required:
self.required_slots.add(slot_name)
def fill_slot(self, slot_name, value):
"""Fill a slot with a value"""
if slot_name in self.slots:
self.slots[slot_name]["value"] = value
return True
return False
def is_complete(self):
"""Check if all required slots are filled"""
for slot_name in self.required_slots:
if self.slots[slot_name]["value"] is None:
return False
return True
def get_missing_slot(self):
"""Get the first missing required slot"""
for slot_name in self.required_slots:
if self.slots[slot_name]["value"] is None:
return slot_name, self.slots[slot_name]["prompt"]
return None, None
def execute(self):
"""Execute the frame handler with the filled slots"""
if self.handler and self.is_complete():
slot_values = {name: slot["value"] for name, slot in self.slots.items()}
return self.handler(slot_values)
return None
class FrameDialogManager:
def __init__(self):
self.frames = {}
self.active_frame = None
self.entity_slot_mapping = {}
def add_frame(self, frame):
"""Add a frame to the dialog manager"""
self.frames[frame.name] = frame
def map_entity_to_slot(self, entity_type, frame_name, slot_name):
"""Map an entity type to a slot in a frame"""
if entity_type not in self.entity_slot_mapping:
self.entity_slot_mapping[entity_type] = []
self.entity_slot_mapping[entity_type].append((frame_name, slot_name))
def activate_frame(self, frame_name):
"""Activate a frame"""
if frame_name in self.frames:
self.active_frame = self.frames[frame_name]
return True
return False
def process(self, intent, entities):
"""Process user input and return a response"""
# Activate frame based on intent if no active frame
if self.active_frame is None or intent != "continue":
frame_name = intent.replace("get_", "").replace("set_", "")
if frame_name in self.frames:
self.activate_frame(frame_name)
else:
return "I'm not sure how to help with that."
# Fill slots based on entities
for entity_type, values in entities.items():
if entity_type in self.entity_slot_mapping:
for frame_name, slot_name in self.entity_slot_mapping[entity_type]:
if frame_name == self.active_frame.name:
self.active_frame.fill_slot(slot_name, values[0])
# Check if frame is complete
if self.active_frame.is_complete():
response = self.active_frame.execute()
self.active_frame = None
return response
else:
# Prompt for missing slot
slot_name, prompt = self.active_frame.get_missing_slot()
return prompt
# Example usage
def weather_handler(slots):
    # Unfilled optional slots arrive as None, so fall back to sensible defaults
    location = slots.get("location") or "your location"
    date = slots.get("date") or "today"
    return f"The weather in {location} for {date} is sunny with a high of 75°F."
def alarm_handler(slots):
    time = slots.get("time") or ""
    date = slots.get("date") or "today"
    return f"I've set an alarm for {time} on {date}."
# Create frames
weather_frame = DialogFrame("weather", handler=weather_handler)
weather_frame.add_slot("location", "Which location would you like the weather for?", required=True)
weather_frame.add_slot("date", "Which date would you like the weather for?", required=True)
alarm_frame = DialogFrame("alarm", handler=alarm_handler)
alarm_frame.add_slot("time", "What time would you like to set the alarm for?", required=True)
alarm_frame.add_slot("date", "Which date would you like to set the alarm for?", required=False)
# Create frame dialog manager
frame_dialog_manager = FrameDialogManager()
frame_dialog_manager.add_frame(weather_frame)
frame_dialog_manager.add_frame(alarm_frame)
# Map entities to slots
frame_dialog_manager.map_entity_to_slot("GPE", "weather", "location")
frame_dialog_manager.map_entity_to_slot("DATE", "weather", "date")
frame_dialog_manager.map_entity_to_slot("DATE", "alarm", "date")
frame_dialog_manager.map_entity_to_slot("TIME", "alarm", "time")
# Example conversation
print(frame_dialog_manager.process("weather", {"GPE": ["New York"]}))
print(frame_dialog_manager.process("continue", {"DATE": ["tomorrow"]}))
print(frame_dialog_manager.process("set_alarm", {"TIME": ["07:00"]}))
print(frame_dialog_manager.process("continue", {"DATE": ["tomorrow"]}))
Advanced Dialog Management
More advanced dialog management approaches include:
- Information State Update: Maintains a complex representation of the dialog context
- Agenda-Based: Uses a stack of dialog goals to manage the conversation
- Neural Network-Based: Uses deep learning to learn dialog policies from data
- Reinforcement Learning: Learns optimal dialog strategies through trial and error
These approaches can handle more complex conversations but require more sophisticated implementation.
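To make the agenda-based idea more concrete, here is a toy sketch (not tied to any particular framework) in which dialog goals are pushed onto a stack and the assistant keeps working on the topmost goal until its required slots are filled:
class AgendaDialogManager:
    """Toy agenda-based manager: goals live on a stack; the top goal is handled first."""
    def __init__(self):
        self.agenda = []  # stack of (goal_name, required_slots, collected_slots)
    def push_goal(self, goal_name, required_slots):
        self.agenda.append((goal_name, list(required_slots), {}))
    def process(self, entities):
        if not self.agenda:
            return "How can I help you?"
        goal_name, required, collected = self.agenda[-1]
        # Fill whatever required slots the new entities provide
        for slot in required:
            if slot in entities and slot not in collected:
                collected[slot] = entities[slot][0]
        missing = [slot for slot in required if slot not in collected]
        if missing:
            # Stay on this goal and ask for the next missing piece of information
            return f"To {goal_name}, I still need your {missing[0]}."
        # Goal satisfied: pop it and resume whatever goal is underneath
        self.agenda.pop()
        return f"Done: {goal_name} with {collected}."
# Example: booking a table interrupts (is stacked on top of) setting a reminder
manager = AgendaDialogManager()
manager.push_goal("set a reminder", ["TIME"])
manager.push_goal("book a table", ["DATE", "PARTY_SIZE"])
print(manager.process({"DATE": ["2024-06-01"]}))   # asks for PARTY_SIZE
print(manager.process({"PARTY_SIZE": ["4"]}))      # completes the booking
print(manager.process({"TIME": ["19:00"]}))        # returns to the reminder goal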
Complete Voice Assistant Implementation
Now let's put everything together to build a complete voice assistant that can listen for commands, understand them, and respond appropriately.
import speech_recognition as sr
import pyttsx3
import spacy
import re
import datetime
import webbrowser
import random
import time
import threading
import json
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import numpy as np
class VoiceAssistant:
def __init__(self, name="Assistant", wake_word="hey assistant"):
self.name = name
self.wake_word = wake_word.lower()
# Initialize speech recognition
self.recognizer = sr.Recognizer()
self.recognizer.energy_threshold = 4000
self.recognizer.dynamic_energy_threshold = True
# Initialize text-to-speech
self.tts_engine = pyttsx3.init()
self.tts_engine.setProperty('rate', 180)
self.tts_engine.setProperty('volume', 0.9)
# Get available voices
voices = self.tts_engine.getProperty('voices')
if voices:
# Try to find a female voice
female_voice = next((voice for voice in voices if 'female' in voice.name.lower()), None)
if female_voice:
self.tts_engine.setProperty('voice', female_voice.id)
# Initialize NLU components
self.nlp = spacy.load("en_core_web_sm")
self.intent_classifier = IntentClassifier()
self.entity_extractor = EntityExtractor()
# Initialize dialog manager
self.dialog_manager = DialogManager()
self._setup_dialog_manager()
# Running flag
self.running = False
self.listening_for_wake_word = False
# Load training data
self._load_training_data()
def _load_training_data(self):
"""Load and train the intent classifier with training data"""
training_data = [
("what's the weather like", "weather"),
("what's the forecast for today", "weather"),
("how's the weather", "weather"),
("will it rain today", "weather"),
("what's the temperature", "weather"),
("set an alarm", "set_alarm"),
("wake me up at", "set_alarm"),
("set a timer for", "set_alarm"),
("remind me to wake up", "set_alarm"),
("remind me to", "set_reminder"),
("i need to remember to", "set_reminder"),
("don't let me forget to", "set_reminder"),
("set a reminder for", "set_reminder"),
("what time is it", "get_time"),
("tell me the time", "get_time"),
("what's the current time", "get_time"),
("what day is it", "get_date"),
("what's today's date", "get_date"),
("tell me the date", "get_date"),
("play some music", "play_music"),
("i want to listen to", "play_music"),
("play", "play_music"),
("open", "open_app"),
("launch", "open_app"),
("start", "open_app"),
("search for", "web_search"),
("look up", "web_search"),
("find information about", "web_search"),
("tell me a joke", "tell_joke"),
("say something funny", "tell_joke"),
("make me laugh", "tell_joke"),
("who are you", "assistant_info"),
("what's your name", "assistant_info"),
("tell me about yourself", "assistant_info"),
("thank you", "gratitude"),
("thanks", "gratitude"),
("that's helpful", "gratitude"),
("goodbye", "goodbye"),
("bye", "goodbye"),
("see you later", "goodbye"),
("exit", "goodbye"),
("stop", "goodbye")
]
self.intent_classifier.train(training_data)
def _setup_dialog_manager(self):
"""Set up the dialog manager with states and handlers"""
# Define handlers
def greeting_handler(intent, context):
responses = [
f"Hello! I'm {self.name}. How can I help you today?",
f"Hi there! I'm {self.name}. What can I do for you?",
f"Greetings! I'm {self.name}. How may I assist you?"
]
return random.choice(responses)
def weather_handler(intent, context):
location = context.get('GPE', ['your location'])[0]
date = context.get('DATE', ['today'])[0]
# In a real implementation, you would call a weather API here
weather_conditions = ["sunny", "partly cloudy", "cloudy", "rainy", "snowy"]
temp_range = (65, 85) if random.choice(weather_conditions) in ["sunny", "partly cloudy"] else (45, 65)
temp = random.randint(*temp_range)
condition = random.choice(weather_conditions)
return f"The weather in {location} for {date} is {condition} with a high of {temp}°F."
def alarm_handler(intent, context):
time_entity = context.get('TIME', [''])[0]
date_entity = context.get('DATE', ['today'])[0]
if time_entity:
# In a real implementation, you would set an actual alarm here
return f"I've set an alarm for {time_entity} on {date_entity}."
else:
return "What time would you like to set the alarm for?"
def reminder_handler(intent, context):
time_entity = context.get('TIME', [''])[0]
date_entity = context.get('DATE', ['today'])[0]
# Try to extract what to remind about
reminder_text = ""
if 'REMINDER' in context:
reminder_text = f" to {context['REMINDER'][0]}"
if time_entity:
# In a real implementation, you would set an actual reminder here
return f"I'll remind you{reminder_text} at {time_entity} on {date_entity}."
else:
return f"When would you like to be reminded{reminder_text}?"
def time_handler(intent, context):
current_time = datetime.datetime.now().strftime("%I:%M %p")
return f"The current time is {current_time}."
def date_handler(intent, context):
current_date = datetime.datetime.now().strftime("%A, %B %d, %Y")
return f"Today is {current_date}."
def music_handler(intent, context):
genre = context.get('GENRE', [''])[0]
artist = context.get('PERSON', [''])[0]
if genre:
return f"Playing {genre} music for you."
elif artist:
return f"Playing music by {artist}."
else:
return "Playing some music for you."
def app_handler(intent, context):
app_name = context.get('APP', [''])[0]
if app_name:
return f"Opening {app_name} for you."
else:
return "Which application would you like to open?"
def search_handler(intent, context):
query = context.get('QUERY', [''])[0]
if query:
# In a real implementation, you would open a browser with the search query
return f"Searching the web for {query}."
else:
return "What would you like to search for?"
def joke_handler(intent, context):
jokes = [
"Why don't scientists trust atoms? Because they make up everything!",
"Why did the scarecrow win an award? Because he was outstanding in his field!",
"What do you call a fake noodle? An impasta!",
"Why couldn't the bicycle stand up by itself? It was two tired!",
"What do you call a fish with no eyes? Fsh!"
]
return random.choice(jokes)
def assistant_info_handler(intent, context):
return f"I'm {self.name}, your voice assistant. I can help you with weather, alarms, reminders, and more."
def gratitude_handler(intent, context):
responses = [
"You're welcome!",
"Happy to help!",
"Anytime!",
"My pleasure!"
]
return random.choice(responses)
def goodbye_handler(intent, context):
responses = [
"Goodbye!",
"See you later!",
"Have a great day!",
"Bye for now!"
]
self.running = False
return random.choice(responses)
def fallback_handler(intent, context):
responses = [
"I'm not sure how to help with that.",
"I didn't understand. Could you try rephrasing?",
"I'm still learning and don't know how to respond to that yet.",
"I'm not sure what you mean. Can you try asking differently?"
]
return random.choice(responses)
# Create states
greeting_state = DialogState("greeting", greeting_handler)
weather_state = DialogState("weather", weather_handler)
alarm_state = DialogState("set_alarm", alarm_handler)
reminder_state = DialogState("set_reminder", reminder_handler)
time_state = DialogState("get_time", time_handler)
date_state = DialogState("get_date", date_handler)
music_state = DialogState("play_music", music_handler)
app_state = DialogState("open_app", app_handler)
search_state = DialogState("web_search", search_handler)
joke_state = DialogState("tell_joke", joke_handler)
assistant_info_state = DialogState("assistant_info", assistant_info_handler)
gratitude_state = DialogState("gratitude", gratitude_handler)
goodbye_state = DialogState("goodbye", goodbye_handler)
fallback_state = DialogState("fallback", fallback_handler)
# Add states to dialog manager
self.dialog_manager.add_state(greeting_state)
self.dialog_manager.add_state(weather_state)
self.dialog_manager.add_state(alarm_state)
self.dialog_manager.add_state(reminder_state)
self.dialog_manager.add_state(time_state)
self.dialog_manager.add_state(date_state)
self.dialog_manager.add_state(music_state)
self.dialog_manager.add_state(app_state)
self.dialog_manager.add_state(search_state)
self.dialog_manager.add_state(joke_state)
self.dialog_manager.add_state(assistant_info_state)
self.dialog_manager.add_state(gratitude_state)
self.dialog_manager.add_state(goodbye_state)
self.dialog_manager.add_state(fallback_state)
        # Set up transitions (simplified - every state can transition to every intent's state;
        # in a real implementation, you would define more selective transitions)
        for state in self.dialog_manager.states.values():
            for intent, target_state in self.dialog_manager.states.items():
                state.add_transition(intent, target_state)
def speak(self, text):
"""Speak the given text"""
print(f"{self.name}: {text}")
self.tts_engine.say(text)
self.tts_engine.runAndWait()
def listen(self, timeout=None, phrase_time_limit=None):
"""Listen for a command"""
with sr.Microphone() as source:
print("Listening...")
self.recognizer.adjust_for_ambient_noise(source, duration=0.5)
try:
audio = self.recognizer.listen(source, timeout=timeout, phrase_time_limit=phrase_time_limit)
try:
text = self.recognizer.recognize_google(audio).lower()
print(f"You said: {text}")
return text
except sr.UnknownValueError:
return None
except sr.RequestError:
self.speak("Sorry, I'm having trouble accessing the recognition service.")
return None
except sr.WaitTimeoutError:
return None
def listen_for_wake_word(self):
"""Listen for the wake word"""
self.listening_for_wake_word = True
while self.listening_for_wake_word:
with sr.Microphone() as source:
print("Listening for wake word...")
self.recognizer.adjust_for_ambient_noise(source, duration=0.5)
try:
audio = self.recognizer.listen(source, timeout=10, phrase_time_limit=3)
try:
text = self.recognizer.recognize_google(audio).lower()
print(f"Heard: {text}")
if self.wake_word in text:
print("Wake word detected!")
self.speak(f"Yes, I'm here.")
self.process_commands()
except sr.UnknownValueError:
pass
except sr.RequestError:
print("Could not request results from Google Speech Recognition service")
                except Exception:
                    # Timeouts and other transient microphone errors: keep listening
                    pass
def process_commands(self):
"""Process voice commands"""
self.running = True
while self.running:
command = self.listen(timeout=5, phrase_time_limit=5)
if command:
# Predict intent
intent, confidence = self.intent_classifier.predict(command)
print(f"Intent: {intent} (confidence: {confidence:.2f})")
# Extract entities
entities = self.entity_extractor.extract_entities(command)
print(f"Entities: {entities}")
# Extract query for web search
if intent == "web_search" and 'QUERY' not in entities:
query = command.replace("search for", "").replace("look up", "").replace("find information about", "").strip()
entities['QUERY'] = [query]
# Extract reminder text
if intent == "set_reminder" and 'REMINDER' not in entities:
reminder_text = command.replace("remind me to", "").replace("i need to remember to", "").replace("don't let me forget to", "").replace("set a reminder for", "").strip()
entities['REMINDER'] = [reminder_text]
# Process with dialog manager
response = self.dialog_manager.process(intent, entities)
# Speak response
self.speak(response)
# If goodbye intent, exit the loop
if intent == "goodbye":
break
time.sleep(0.1)
def run(self):
"""Run the voice assistant"""
self.speak(f"Hello, I'm {self.name}. Say '{self.wake_word}' to activate me.")
try:
self.listen_for_wake_word()
except KeyboardInterrupt:
self.speak("Goodbye!")
self.listening_for_wake_word = False
# Intent classifier class
class IntentClassifier:
def __init__(self):
self.nlp = spacy.load("en_core_web_sm")
self.vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=1000)
self.classifier = LinearSVC()
self.intents = []
def train(self, training_data):
"""Train the intent classifier"""
texts, intents = zip(*training_data)
processed_texts = [self._preprocess(text) for text in texts]
X = self.vectorizer.fit_transform(processed_texts)
self.classifier.fit(X, intents)
self.intents = list(set(intents))
def predict(self, text):
"""Predict the intent of a text"""
processed_text = self._preprocess(text)
X = self.vectorizer.transform([processed_text])
intent = self.classifier.predict(X)[0]
decision_values = self.classifier.decision_function(X)
confidence = np.max(decision_values)
return intent, confidence
def _preprocess(self, text):
"""Preprocess text by lemmatizing and removing stopwords"""
doc = self.nlp(text.lower())
tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
return " ".join(tokens)
# Entity extractor class
class EntityExtractor:
def __init__(self):
self.nlp = spacy.load("en_core_web_sm")
self.time_pattern = re.compile(r'(\d{1,2})(:\d{2})?\s*(am|pm|AM|PM)?')
self.date_pattern = re.compile(r'(today|tomorrow|yesterday|next week|next month)')
def extract_entities(self, text):
"""Extract entities from text"""
entities = {}
# Process with spaCy
doc = self.nlp(text)
# Extract named entities
for ent in doc.ents:
if ent.label_ not in entities:
entities[ent.label_] = []
entities[ent.label_].append(ent.text)
# Extract time entities
time_matches = self.time_pattern.findall(text)
if time_matches:
entities['TIME'] = []
for match in time_matches:
hour, minute, period = match
hour = int(hour)
minute = int(minute[1:]) if minute else 0
# Handle AM/PM
if period and period.lower() == 'pm' and hour < 12:
hour += 12
elif period and period.lower() == 'am' and hour == 12:
hour = 0
time_str = f"{hour:02d}:{minute:02d}"
entities['TIME'].append(time_str)
# Extract date entities
date_matches = self.date_pattern.findall(text)
if date_matches:
entities['DATE'] = []
for match in date_matches:
if match.lower() == 'today':
date = datetime.datetime.now().strftime('%Y-%m-%d')
elif match.lower() == 'tomorrow':
date = (datetime.datetime.now() + datetime.timedelta(days=1)).strftime('%Y-%m-%d')
elif match.lower() == 'yesterday':
date = (datetime.datetime.now() - datetime.timedelta(days=1)).strftime('%Y-%m-%d')
elif match.lower() == 'next week':
date = (datetime.datetime.now() + datetime.timedelta(weeks=1)).strftime('%Y-%m-%d')
elif match.lower() == 'next month':
date = (datetime.datetime.now() + datetime.timedelta(days=30)).strftime('%Y-%m-%d')
entities['DATE'].append(date)
return entities
# Dialog state class
class DialogState:
def __init__(self, name, handler):
self.name = name
self.handler = handler
self.transitions = {}
def add_transition(self, intent, next_state):
"""Add a transition to another state based on intent"""
self.transitions[intent] = next_state
def next_state(self, intent):
"""Get the next state based on intent"""
return self.transitions.get(intent, self)
def handle(self, intent, entities):
"""Handle the current state"""
return self.handler(intent, entities)
# Dialog manager class
class DialogManager:
def __init__(self):
self.states = {}
self.current_state = None
self.context = {}
def add_state(self, state):
"""Add a state to the dialog manager"""
self.states[state.name] = state
if self.current_state is None:
self.current_state = state
def process(self, intent, entities):
"""Process user input and return a response"""
# Update context with new entities
for entity_type, values in entities.items():
self.context[entity_type] = values
# Find the appropriate state for the intent
if intent in self.states:
self.current_state = self.states[intent]
# Handle current state
response = self.current_state.handle(intent, self.context)
# Transition to next state
self.current_state = self.current_state.next_state(intent)
return response
# Example usage
if __name__ == "__main__":
assistant = VoiceAssistant(name="Aria", wake_word="hey aria")
assistant.run()
This implementation includes all the components we've discussed: speech recognition, text-to-speech, intent classification, entity extraction, and dialog management. The voice assistant can:
- Listen for a wake word to activate
- Recognize various intents like weather, alarms, reminders, time, date, music, web search, etc.
- Extract entities like times, dates, locations, and people
- Maintain context across turns in the conversation
- Respond appropriately to user queries
Note: This is a simplified implementation for educational purposes. A production-ready voice assistant would include more robust error handling, better NLU capabilities, integration with external services (weather APIs, calendar APIs, etc.), and more sophisticated dialog management.
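For instance, the canned weather_handler above could be swapped for a real lookup. The sketch below uses the free Open-Meteo forecast API via requests; the endpoint and the current_weather response fields reflect Open-Meteo's documentation but should be verified against it, and the hard-coded coordinates are a placeholder for New York since a production assistant would geocode the spoken location first:
import requests
def get_current_weather(latitude, longitude):
    """Fetch current conditions from the free Open-Meteo API (no API key required)."""
    response = requests.get(
        "https://api.open-meteo.com/v1/forecast",
        params={
            "latitude": latitude,
            "longitude": longitude,
            "current_weather": "true",
            "temperature_unit": "fahrenheit",
        },
        timeout=10,
    )
    response.raise_for_status()
    current = response.json()["current_weather"]
    return current["temperature"], current["windspeed"]
def weather_handler(intent, context):
    """Drop-in replacement for the simulated weather handler."""
    location = context.get("GPE", ["your location"])[0]
    try:
        # Placeholder coordinates (New York); a real assistant would geocode the location
        temperature, windspeed = get_current_weather(40.71, -74.01)
        return (f"The current temperature in {location} is {temperature}°F "
                f"with winds of {windspeed} mph.")
    except requests.RequestException:
        return "Sorry, I couldn't reach the weather service right now."
if __name__ == "__main__":
    print(weather_handler("weather", {"GPE": ["New York"]}))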
Extending the Voice Assistant
You can extend this voice assistant in several ways:
- Add More Intents: Expand the training data to recognize more types of user requests
- Improve Entity Extraction: Add more sophisticated entity extraction for complex queries
- Integrate External APIs: Connect to weather services, calendar APIs, music streaming services, etc.
- Add Contextual Understanding: Improve the dialog manager to handle more complex conversations
- Implement Personalization: Store user preferences and adapt responses accordingly
- Add Multi-turn Conversations: Handle follow-up questions and references to previous turns
- Implement Proactive Features: Add reminders, notifications, and other proactive behaviors
Practice Exercise: Extending the Voice Assistant
Try extending the voice assistant with a new feature. For example, you could add a calculator functionality that can perform basic arithmetic operations.
- Add training data for the "calculate" intent
- Implement a handler for the "calculate" intent that can parse and evaluate arithmetic expressions
- Add entity extraction for numbers and operators
- Test the new functionality with various arithmetic queries
Here's a starting point for the calculator functionality:
# Add to training data
training_data.extend([
("calculate", "calculate"),
("what is", "calculate"),
("compute", "calculate"),
("add", "calculate"),
("subtract", "calculate"),
("multiply", "calculate"),
("divide", "calculate")
])
# Add entity extraction for numbers and operators
def extract_calculation(text):
"""Extract a calculation from text"""
# Remove words like "calculate", "what is", etc.
text = re.sub(r'calculate|what is|compute', '', text, flags=re.IGNORECASE).strip()
# Replace words with symbols
text = text.replace('plus', '+').replace('minus', '-').replace('times', '*').replace('divided by', '/')
# Extract the calculation
calculation = re.sub(r'[^0-9+\-*/().]', '', text)
return calculation
# Add handler for calculate intent
def calculate_handler(intent, context):
    calculation = context.get('CALCULATION', [''])[0]
    if not calculation:
        return "What would you like me to calculate?"
    try:
        # eval is only acceptable here because extract_calculation has already
        # stripped the input down to digits, operators, and parentheses
        result = eval(calculation)
        return f"The result of {calculation} is {result}."
    except Exception:
        return "Sorry, I couldn't calculate that. Please try again."
# Add to entity extraction
calculation = extract_calculation(command)
if calculation:
entities['CALCULATION'] = [calculation]
# Create and add state
calculate_state = DialogState("calculate", calculate_handler)
self.dialog_manager.add_state(calculate_state)
This is just one example of how you can extend the voice assistant. You could also add features like setting timers, controlling smart home devices, playing games, or providing news updates.
Advanced Audio Analysis
Beyond speech recognition and synthesis, there are many other ways to analyze and process audio data. In this section, we'll explore advanced audio analysis techniques, including music information retrieval, audio classification, and more.
Audio Classification
Audio classification involves categorizing audio samples into predefined classes. This can be used for environmental sound recognition, music genre classification, or identifying specific audio events.
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
class AudioClassifier:
def __init__(self):
self.model = RandomForestClassifier(n_estimators=100, random_state=42)
def extract_features(self, file_path):
"""Extract audio features from a file."""
# Load audio file
y, sr = librosa.load(file_path, sr=None)
# Extract features
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
zero_crossing_rate = librosa.feature.zero_crossing_rate(y)
# Compute statistics for each feature
features = []
for feature in [mfccs, spectral_centroid, chroma, zero_crossing_rate]:
features.extend([
np.mean(feature),
np.std(feature),
np.min(feature),
np.max(feature)
])
return np.array(features)
def train(self, file_paths, labels):
"""Train the classifier on audio files."""
features = []
for file_path in file_paths:
features.append(self.extract_features(file_path))
X_train, X_test, y_train, y_test = train_test_split(
features, labels, test_size=0.2, random_state=42
)
self.model.fit(X_train, y_train)
# Evaluate
y_pred = self.model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
return accuracy
def predict(self, file_path):
"""Predict the class of an audio file."""
features = self.extract_features(file_path)
return self.model.predict([features])[0]
# Example usage
if __name__ == "__main__":
# Example with environmental sounds
classifier = AudioClassifier()
# Assuming you have a dataset of audio files with labels
file_paths = ["dog_bark.wav", "car_horn.wav", "siren.wav", "rain.wav", "thunder.wav"]
labels = ["animal", "vehicle", "alert", "nature", "nature"]
classifier.train(file_paths, labels)
# Predict a new sound
prediction = classifier.predict("unknown_sound.wav")
print(f"The sound is classified as: {prediction}")
Pre-trained Audio Classification Models
For more advanced audio classification, consider using pre-trained deep learning models:
- PANNs (Pre-trained Audio Neural Networks): Trained on AudioSet, these models excel at general audio classification.
- VGGish: A model by Google trained on YouTube audio for audio event recognition.
- YAMNet: Another Google model that can identify 521 audio classes from the AudioSet ontology.
- Wav2Vec2: While primarily for speech recognition, it can be fine-tuned for audio classification tasks.
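If you have the transformers and torch packages from the setup section installed, the quickest way to try a pre-trained model is the audio-classification pipeline. The checkpoint below is only an example (a SUPERB keyword-spotting model); substitute any audio-classification model from the Hugging Face Hub, and note that the first run downloads the weights:
from transformers import pipeline
# Example checkpoint; swap in any audio-classification model from the Hub
classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-ks")
# Passing a file path requires ffmpeg for decoding; you can also pass a NumPy array of samples
predictions = classifier("dog_bark.wav", top_k=5)
for prediction in predictions:
    print(f"{prediction['label']}: {prediction['score']:.3f}")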
Speaker Recognition
Speaker recognition involves identifying who is speaking based on voice characteristics. This can be used for voice authentication, multi-speaker transcription, or personalized responses in voice assistants.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture
import pickle
import os
class SpeakerRecognizer:
def __init__(self):
self.speakers = {}
self.models = {}
def extract_features(self, file_path):
"""Extract MFCC features for speaker recognition."""
y, sr = librosa.load(file_path, sr=None)
# Extract MFCCs
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
# Transpose to get time as first dimension
mfccs = mfccs.T
return mfccs
def train_speaker_model(self, speaker_name, audio_files):
"""Train a GMM model for a specific speaker."""
all_features = []
for file in audio_files:
features = self.extract_features(file)
all_features.extend(features)
# Convert to numpy array
all_features = np.array(all_features)
# Train Gaussian Mixture Model
gmm = GaussianMixture(n_components=16, covariance_type='diag', random_state=42)
gmm.fit(all_features)
# Save the model
self.models[speaker_name] = gmm
self.speakers[speaker_name] = audio_files
print(f"Trained model for speaker: {speaker_name}")
def identify_speaker(self, audio_file):
"""Identify the speaker in an audio file."""
features = self.extract_features(audio_file)
best_score = float('-inf')
best_speaker = None
for speaker, model in self.models.items():
score = model.score(features)
if score > best_score:
best_score = score
best_speaker = speaker
return best_speaker, best_score
def save_models(self, directory="speaker_models"):
"""Save all speaker models to disk."""
if not os.path.exists(directory):
os.makedirs(directory)
for speaker, model in self.models.items():
model_path = os.path.join(directory, f"{speaker}.pkl")
with open(model_path, 'wb') as f:
pickle.dump(model, f)
print(f"Saved {len(self.models)} speaker models to {directory}")
def load_models(self, directory="speaker_models"):
"""Load speaker models from disk."""
if not os.path.exists(directory):
print(f"Directory {directory} does not exist.")
return
self.models = {}
for file in os.listdir(directory):
if file.endswith(".pkl"):
speaker = file[:-4] # Remove .pkl extension
model_path = os.path.join(directory, file)
with open(model_path, 'rb') as f:
self.models[speaker] = pickle.load(f)
print(f"Loaded {len(self.models)} speaker models from {directory}")
# Example usage
if __name__ == "__main__":
recognizer = SpeakerRecognizer()
# Train models for different speakers
recognizer.train_speaker_model("alice", ["alice_sample1.wav", "alice_sample2.wav", "alice_sample3.wav"])
recognizer.train_speaker_model("bob", ["bob_sample1.wav", "bob_sample2.wav", "bob_sample3.wav"])
recognizer.train_speaker_model("charlie", ["charlie_sample1.wav", "charlie_sample2.wav"])
# Save models
recognizer.save_models()
    # Later, load the saved models and identify the speaker in a new recording
    # recognizer.load_models()
speaker, score = recognizer.identify_speaker("unknown_speaker.wav")
print(f"Identified speaker as {speaker} with confidence score: {score:.2f}")
Emotion Detection from Speech
Emotion detection from speech analyzes vocal characteristics to determine the emotional state of the speaker. This can enhance voice assistants by enabling them to respond appropriately to the user's emotional state.
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import os
class EmotionDetector:
def __init__(self):
self.model = RandomForestClassifier(n_estimators=100, random_state=42)
self.emotions = ['angry', 'happy', 'sad', 'neutral', 'fearful', 'disgusted', 'surprised']
def extract_features(self, file_path):
"""Extract features for emotion detection."""
y, sr = librosa.load(file_path, sr=None)
# Extract various features
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
mel = librosa.feature.melspectrogram(y=y, sr=sr)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)
# Extract statistics from each feature
features = []
for feature in [mfccs, chroma, mel, contrast, tonnetz]:
features.extend([
np.mean(feature),
np.std(feature),
np.max(feature),
np.min(feature),
np.median(feature),
np.quantile(feature, 0.25),
np.quantile(feature, 0.75)
])
# Add zero crossing rate
zcr = librosa.feature.zero_crossing_rate(y)
features.extend([np.mean(zcr), np.std(zcr)])
# Add energy
energy = np.sum(y**2) / len(y)
features.append(energy)
return np.array(features)
def train(self, data_dir):
"""Train the emotion detector on a directory of audio files.
The directory should have subdirectories named after emotions,
each containing audio samples of that emotion.
"""
features = []
labels = []
for emotion in self.emotions:
emotion_dir = os.path.join(data_dir, emotion)
if not os.path.exists(emotion_dir):
continue
for file in os.listdir(emotion_dir):
if file.endswith('.wav'):
file_path = os.path.join(emotion_dir, file)
feature = self.extract_features(file_path)
features.append(feature)
labels.append(emotion)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
features, labels, test_size=0.2, random_state=42
)
# Train model
self.model.fit(X_train, y_train)
# Evaluate
y_pred = self.model.predict(X_test)
print(classification_report(y_test, y_pred))
    def predict_emotion(self, file_path):
        """Predict the emotion in an audio file."""
        features = self.extract_features(file_path)
        emotion = self.model.predict([features])[0]
        # Get probability scores, keyed by the classes the model was actually trained on
        # (some emotion folders may be missing, so don't assume all of self.emotions)
        probs = self.model.predict_proba([features])[0]
        emotion_probs = dict(zip(self.model.classes_, probs))
        return emotion, emotion_probs
# Example usage
if __name__ == "__main__":
detector = EmotionDetector()
# Train on a dataset like RAVDESS or TESS
detector.train("path/to/emotion_dataset")
# Predict emotion in a new recording
emotion, probs = detector.predict_emotion("user_speech.wav")
print(f"Detected emotion: {emotion}")
print("Emotion probabilities:")
for emotion, prob in sorted(probs.items(), key=lambda x: x[1], reverse=True):
print(f" {emotion}: {prob:.2f}")
Emotion Detection Datasets
To train emotion detection models, you can use these publicly available datasets:
- RAVDESS: The Ryerson Audio-Visual Database of Emotional Speech and Song contains recordings of professional actors expressing different emotions.
- TESS: Toronto Emotional Speech Set contains recordings of actresses saying phrases with different emotions.
- CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset includes audio-visual recordings of actors expressing emotions.
- IEMOCAP: Interactive Emotional Dyadic Motion Capture Database includes audio-visual recordings of actors in dyadic sessions.
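The EmotionDetector above expects a directory with one subfolder per emotion, while datasets such as RAVDESS ship as flat folders of actor recordings whose emotion is encoded in the file name. The helper below reorganizes a RAVDESS download into that layout; the emotion-code mapping follows the dataset's published naming convention, so double-check it against the copy you download ("calm" has no counterpart in our label set and is skipped):
import os
import shutil
# RAVDESS file names look like "03-01-05-01-02-01-12.wav"; the third field is the emotion code.
RAVDESS_EMOTIONS = {
    "01": "neutral", "03": "happy", "04": "sad", "05": "angry",
    "06": "fearful", "07": "disgusted", "08": "surprised",
}
def organize_ravdess(ravdess_dir, output_dir):
    """Copy RAVDESS .wav files into emotion-named subfolders for EmotionDetector.train()."""
    for root, _, files in os.walk(ravdess_dir):
        for filename in files:
            if not filename.endswith(".wav"):
                continue
            emotion_code = filename.split("-")[2]
            emotion = RAVDESS_EMOTIONS.get(emotion_code)
            if emotion is None:
                continue  # skip "calm" (code 02) and anything unexpected
            target_dir = os.path.join(output_dir, emotion)
            os.makedirs(target_dir, exist_ok=True)
            shutil.copy(os.path.join(root, filename), target_dir)
if __name__ == "__main__":
    organize_ravdess("path/to/RAVDESS", "path/to/emotion_dataset")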
Music Information Retrieval
Music Information Retrieval (MIR) involves extracting meaningful information from music, such as beat detection, chord recognition, genre classification, and music recommendation.
import librosa
import librosa.display  # needed for waveshow/specshow in visualize()
import numpy as np
import matplotlib.pyplot as plt
class MusicAnalyzer:
def __init__(self, file_path):
"""Initialize with an audio file."""
self.y, self.sr = librosa.load(file_path, sr=None)
self.file_path = file_path
def detect_beats(self):
"""Detect beats in the music."""
tempo, beat_frames = librosa.beat.beat_track(y=self.y, sr=self.sr)
beat_times = librosa.frames_to_time(beat_frames, sr=self.sr)
print(f"Estimated tempo: {tempo:.2f} BPM")
print(f"Number of beats detected: {len(beat_times)}")
return tempo, beat_times
def extract_pitch(self):
"""Extract pitch information using chroma features."""
chroma = librosa.feature.chroma_cqt(y=self.y, sr=self.sr)
# Get the dominant pitch class for each frame
dominant_pitches = np.argmax(chroma, axis=0)
pitch_names = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
# Count occurrences of each pitch class
pitch_counts = np.bincount(dominant_pitches, minlength=12)
pitch_distribution = {pitch_names[i]: pitch_counts[i] for i in range(12)}
# Determine the most common pitch class (key)
key = pitch_names[np.argmax(pitch_counts)]
print(f"Estimated key: {key}")
print("Pitch distribution:")
for pitch, count in sorted(pitch_distribution.items(), key=lambda x: x[1], reverse=True):
print(f" {pitch}: {count}")
return key, pitch_distribution
    def detect_structure(self):
        """Detect the structure of the song (verse, chorus, etc.)."""
        # Compute the mel spectrogram
        mel_spec = librosa.feature.melspectrogram(y=self.y, sr=self.sr)
        # Compute the structural features (MFCCs from the log-mel spectrogram)
        mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel_spec), n_mfcc=13)
        # Use agglomerative clustering on the MFCC features to split the track into 10 segments
        boundaries = librosa.segment.agglomerative(mfcc, 10)
        segment_times = librosa.frames_to_time(boundaries, sr=self.sr)
        # Append the track's end time so the final segment also has an end point
        segment_times = np.append(segment_times, librosa.get_duration(y=self.y, sr=self.sr))
        print("Detected structural segments:")
        for i, (start, end) in enumerate(zip(segment_times[:-1], segment_times[1:])):
            print(f"  Segment {i+1}: {start:.2f}s - {end:.2f}s (duration: {end-start:.2f}s)")
        return segment_times
def visualize(self):
"""Visualize various aspects of the music."""
plt.figure(figsize=(12, 8))
# Plot waveform
plt.subplot(3, 1, 1)
librosa.display.waveshow(self.y, sr=self.sr)
plt.title('Waveform')
# Plot spectrogram
plt.subplot(3, 1, 2)
S = librosa.feature.melspectrogram(y=self.y, sr=self.sr)
S_dB = librosa.power_to_db(S, ref=np.max)
librosa.display.specshow(S_dB, sr=self.sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram')
# Plot chromagram
plt.subplot(3, 1, 3)
chroma = librosa.feature.chroma_cqt(y=self.y, sr=self.sr)
librosa.display.specshow(chroma, sr=self.sr, x_axis='time', y_axis='chroma')
plt.colorbar()
plt.title('Chromagram')
plt.tight_layout()
plt.savefig(f"{self.file_path.split('.')[0]}_analysis.png")
plt.close()
print(f"Visualization saved as {self.file_path.split('.')[0]}_analysis.png")
# Example usage
if __name__ == "__main__":
analyzer = MusicAnalyzer("song.mp3")
# Analyze the music
tempo, beats = analyzer.detect_beats()
key, pitch_dist = analyzer.extract_pitch()
segments = analyzer.detect_structure()
# Create visualizations
analyzer.visualize()
These advanced audio analysis techniques can significantly enhance your voice assistant or be used to build specialized audio processing applications. By combining these techniques with the voice assistant framework we built earlier, you can create more sophisticated and context-aware voice interfaces.
Practice Exercise: Audio Event Detection System
Build a system that can detect specific audio events (like glass breaking, dog barking, or a doorbell) and send notifications. Use the audio classification techniques covered in this section.
- Collect or find a dataset of common household sounds
- Extract features and train a classifier
- Implement a real-time detection system that listens for these sounds
- Add a notification system (console output, email, or mobile notification)
Bonus: Integrate this with your voice assistant to enable commands like "Alert me if you hear glass breaking."
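Here is a minimal sketch of the real-time detection step, assuming you have already trained the AudioClassifier from this section on your own dataset. The alert class names are placeholders for whatever labels your dataset uses, and each two-second window is written to a temporary WAV file because AudioClassifier.extract_features expects a file path:
import os
import tempfile
import numpy as np
import pyaudio
import soundfile as sf
def monitor(classifier, alert_classes=("glass_break", "dog_bark", "doorbell"),
            sample_rate=16000, window_seconds=2.0):
    """Continuously classify short microphone windows with a trained AudioClassifier."""
    chunk = 1024
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=sample_rate,
                     input=True, frames_per_buffer=chunk)
    print("Listening for audio events... (Ctrl+C to stop)")
    try:
        while True:
            # Collect roughly window_seconds of audio from the microphone
            frames = [stream.read(chunk, exception_on_overflow=False)
                      for _ in range(int(sample_rate / chunk * window_seconds))]
            samples = np.frombuffer(b"".join(frames), dtype=np.int16).astype(np.float32) / 32768.0
            # Write the window to a temporary WAV so the classifier can load it
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
                tmp_path = tmp.name
            sf.write(tmp_path, samples, sample_rate)
            label = classifier.predict(tmp_path)
            os.unlink(tmp_path)
            if label in alert_classes:
                print(f"ALERT: detected '{label}'")  # swap in email or push notifications here
    except KeyboardInterrupt:
        print("Stopping.")
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()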
Deployment Strategies
Once you've built your voice assistant or audio processing application, you'll need to deploy it for real-world use. In this section, we'll explore different deployment strategies for voice and audio applications.
Packaging Your Voice Assistant
Before deploying your voice assistant, you need to package it properly to ensure it can be easily installed and run on different systems.
# File: setup.py
from setuptools import setup, find_packages
setup(
name="my_voice_assistant",
version="0.1.0",
packages=find_packages(),
install_requires=[
"SpeechRecognition>=3.8.1",
"pyttsx3>=2.90",
"PyAudio>=0.2.11",
"scikit-learn>=0.24.0",
"numpy>=1.19.5",
"spacy>=3.0.0",
"librosa>=0.8.1",
"pydub>=0.25.1",
"requests>=2.25.1",
],
python_requires=">=3.7",
entry_points={
"console_scripts": [
"voice-assistant=my_voice_assistant.main:main",
],
},
include_package_data=True,
package_data={
"my_voice_assistant": ["data/*.json", "models/*.pkl"],
},
)
Create a proper package structure for your voice assistant:
my_voice_assistant/
├── LICENSE
├── README.md
├── setup.py
├── my_voice_assistant/
│ ├── __init__.py
│ ├── main.py
│ ├── speech_recognition.py
│ ├── text_to_speech.py
│ ├── intent_classifier.py
│ ├── entity_extractor.py
│ ├── dialog_manager.py
│ ├── audio_processor.py
│ ├── data/
│ │ ├── intents.json
│ │ └── responses.json
│ └── models/
│ ├── intent_model.pkl
│ └── speaker_models.pkl
└── tests/
├── __init__.py
├── test_speech_recognition.py
├── test_intent_classifier.py
└── test_dialog_manager.py
Desktop Application Deployment
You can convert your voice assistant into a standalone desktop application using tools like PyInstaller or cx_Freeze.
# File: build_app.py
import PyInstaller.__main__
PyInstaller.__main__.run([
'my_voice_assistant/main.py',
'--name=VoiceAssistant',
'--onefile',
'--windowed',
    # Note: on Windows, PyInstaller expects ';' instead of ':' as the --add-data separator
    '--add-data=my_voice_assistant/data:data',
    '--add-data=my_voice_assistant/models:models',
'--hidden-import=sklearn.neighbors._partition_nodes',
'--hidden-import=pyttsx3.drivers',
'--hidden-import=pyttsx3.drivers.sapi5',
])
For a more polished desktop application, you can create a GUI using frameworks like PyQt or Tkinter:
# File: gui.py
import tkinter as tk
import threading
from my_voice_assistant.main import VoiceAssistant
class VoiceAssistantGUI:
def __init__(self, root):
self.root = root
self.root.title("Voice Assistant")
self.root.geometry("400x500")
self.root.resizable(False, False)
self.assistant = VoiceAssistant()
self.is_listening = False
self.setup_ui()
def setup_ui(self):
# Title
title_label = tk.Label(self.root, text="Voice Assistant", font=("Arial", 24))
title_label.pack(pady=20)
# Status display
self.status_var = tk.StringVar()
self.status_var.set("Ready")
status_label = tk.Label(self.root, textvariable=self.status_var, font=("Arial", 12))
status_label.pack(pady=10)
# Conversation history
self.conversation_text = tk.Text(self.root, width=40, height=15)
self.conversation_text.pack(pady=10)
self.conversation_text.config(state=tk.DISABLED)
# Listen button
self.listen_button = tk.Button(
self.root,
text="Start Listening",
command=self.toggle_listening,
width=15,
height=2,
bg="#4CAF50",
fg="white",
font=("Arial", 12, "bold")
)
self.listen_button.pack(pady=20)
def toggle_listening(self):
if self.is_listening:
self.is_listening = False
self.listen_button.config(text="Start Listening", bg="#4CAF50")
self.status_var.set("Ready")
else:
self.is_listening = True
self.listen_button.config(text="Stop Listening", bg="#F44336")
self.status_var.set("Listening...")
threading.Thread(target=self.listen_loop, daemon=True).start()
def listen_loop(self):
while self.is_listening:
self.status_var.set("Listening...")
command = self.assistant.listen()
if command and self.is_listening:
self.status_var.set("Processing...")
self.add_to_conversation(f"You: {command}")
                # Assumes the packaged VoiceAssistant exposes a process_command(text) method
                # (intent classification, entity extraction, dialog management) returning a reply string
                response = self.assistant.process_command(command)
self.add_to_conversation(f"Assistant: {response}")
self.assistant.speak(response)
self.status_var.set("Listening...")
def add_to_conversation(self, message):
self.conversation_text.config(state=tk.NORMAL)
self.conversation_text.insert(tk.END, message + "\n")
self.conversation_text.see(tk.END)
self.conversation_text.config(state=tk.DISABLED)
if __name__ == "__main__":
root = tk.Tk()
app = VoiceAssistantGUI(root)
root.mainloop()
Web Service Deployment
You can deploy your voice assistant as a web service using frameworks like Flask or FastAPI. This allows you to access your assistant from any device with a web browser.
# File: app.py
from flask import Flask, request, jsonify, render_template
import base64
import tempfile
import os
from pydub import AudioSegment
from my_voice_assistant.main import VoiceAssistant
app = Flask(__name__)
assistant = VoiceAssistant()
@app.route('/')
def index():
return render_template('index.html')
@app.route('/api/process-audio', methods=['POST'])
def process_audio():
# Get audio data from request
audio_data = request.json.get('audio')
# Decode base64 audio data
audio_bytes = base64.b64decode(audio_data.split(',')[1])
# Save to temporary file
with tempfile.NamedTemporaryFile(suffix='.webm', delete=False) as f:
f.write(audio_bytes)
temp_filename = f.name
# Convert to WAV (if needed)
wav_filename = temp_filename + '.wav'
AudioSegment.from_file(temp_filename).export(wav_filename, format='wav')
# Process with voice assistant
command = assistant.recognize_speech_from_file(wav_filename)
if command:
response = assistant.process_command(command)
else:
response = "I couldn't understand what you said."
# Clean up temporary files
os.unlink(temp_filename)
os.unlink(wav_filename)
return jsonify({
'command': command,
'response': response
})
@app.route('/api/text-command', methods=['POST'])
def text_command():
command = request.json.get('command')
response = assistant.process_command(command)
return jsonify({'response': response})
if __name__ == '__main__':
app.run(debug=True)
Create a simple HTML interface for your web-based voice assistant: record microphone audio in the browser (for example with the MediaRecorder API), base64-encode the recording, POST it to the /api/process-audio endpoint above, and display the returned command and response.
Cloud Deployment
For scalable deployment, you can host your voice assistant on cloud platforms like AWS, Google Cloud, or Azure.
Cloud Deployment Options
- AWS Lambda + API Gateway: Serverless deployment for processing voice commands
- Google Cloud Run: Container-based deployment for your voice assistant API
- Azure App Service: Platform as a Service (PaaS) for hosting your web-based voice assistant
- Heroku: Simple deployment platform for small to medium-sized applications
Example Docker configuration for containerized deployment:
# Dockerfile
FROM python:3.9-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
portaudio19-dev \
libsndfile1 \
ffmpeg \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Expose port
EXPOSE 5000
# Run the application
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "app:app"]
# docker-compose.yml
version: '3'
services:
voice-assistant:
build: .
ports:
- "5000:5000"
volumes:
- ./models:/app/models
environment:
- FLASK_ENV=production
- MODEL_PATH=/app/models
Mobile Integration
You can integrate your voice assistant with mobile devices using frameworks like React Native or Flutter for the frontend, and your Python backend as an API.
Mobile Integration Approaches
- Web App: Deploy your voice assistant as a progressive web app (PWA) that can be accessed from any mobile browser
- Hybrid App: Use frameworks like React Native or Flutter to create a mobile app that communicates with your Python backend
- Native App with API: Build native Android/iOS apps that connect to your voice assistant API
Embedded Systems Deployment
For IoT and smart home applications, you can deploy your voice assistant on embedded systems like Raspberry Pi.
#!/bin/bash
# setup_raspberry_pi.sh
# Update system
sudo apt-get update
sudo apt-get upgrade -y
# Install dependencies
sudo apt-get install -y \
python3-pip \
python3-pyaudio \
portaudio19-dev \
libffi-dev \
libssl-dev \
libatlas-base-dev \
libopenjp2-7 \
libtiff5 \
libsndfile1
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install Python packages
pip install wheel
pip install -r requirements.txt
# Set up autostart (writing to /etc requires root, hence sudo tee)
sudo tee /etc/systemd/system/voice-assistant.service > /dev/null << EOF
[Unit]
Description=Voice Assistant Service
After=network.target
[Service]
ExecStart=/home/pi/voice-assistant/venv/bin/python /home/pi/voice-assistant/main.py
WorkingDirectory=/home/pi/voice-assistant
StandardOutput=inherit
StandardError=inherit
Restart=always
User=pi
[Install]
WantedBy=multi-user.target
EOF
# Enable service
sudo systemctl enable voice-assistant.service
sudo systemctl start voice-assistant.service
echo "Voice assistant installed and started!"
Performance Optimization
Before deploying your voice assistant, optimize its performance to ensure it runs efficiently on your target platform.
Performance Optimization Techniques
- Model Quantization: Reduce the precision of model weights to decrease memory usage and improve inference speed
- Model Pruning: Remove unnecessary connections in neural networks to reduce model size
- Caching: Cache frequently used responses or computation results
- Asynchronous Processing: Use asynchronous programming to handle multiple tasks concurrently
- Offline Processing: Implement offline processing for non-critical tasks to reduce latency
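As a small illustration of the caching technique above, the sketch below wraps the IntentClassifier and EntityExtractor from earlier in a memoizing layer so that repeated utterances ("what time is it", "stop") skip the spaCy and TF-IDF work. Be selective about what you cache: resolved DATE entities such as "today" go stale at midnight:
from functools import lru_cache
# Assumes the IntentClassifier and EntityExtractor classes defined earlier in this tutorial
class CachedNLU:
    def __init__(self, intent_classifier, entity_extractor):
        self.intent_classifier = intent_classifier
        self.entity_extractor = entity_extractor
        # Memoize the expensive spaCy + TF-IDF work per distinct utterance
        self._understand_cached = lru_cache(maxsize=256)(self._understand)
    def _understand(self, text):
        intent, confidence = self.intent_classifier.predict(text)
        entities = self.entity_extractor.extract_entities(text)
        return intent, confidence, entities
    def understand(self, text):
        intent, confidence, entities = self._understand_cached(text)
        # Return a copy of the entity dict so callers can't mutate the cached value
        return intent, confidence, dict(entities)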
Practice Exercise: Deploy Your Voice Assistant
Choose one of the deployment strategies discussed in this section and deploy your voice assistant.
- Package your voice assistant code into a proper Python package
- Choose a deployment strategy (desktop app, web service, or embedded system)
- Implement the necessary deployment code
- Test your deployed voice assistant on the target platform
- Optimize performance based on your deployment environment
Next Steps & Resources
Congratulations on completing this tutorial on voice assistants and audio processing! Here are some resources and next steps to continue your learning journey.
Further Learning Resources
Books
- Voice User Interface Design by Michael H. Cohen, James P. Giangola, and Jennifer Balogh
- Designing Voice User Interfaces by Cathy Pearl
- Audio Signal Processing and Coding by Andreas Spanias, Ted Painter, Venkatraman Atti
- Fundamentals of Music Processing by Meinard Müller
- Speech and Language Processing by Daniel Jurafsky and James H. Martin
Online Courses
- Audio Signal Processing for Music Applications (Coursera) - Covers fundamentals of audio signal processing with a focus on music applications
- Natural Language Processing Specialization (Coursera) - Comprehensive course on NLP techniques used in voice assistants
- Deep Learning for Audio (Udemy) - Focuses on applying deep learning to audio processing tasks
- Building Voice AI with Alexa Skills (Pluralsight) - Teaches how to build skills for Amazon Alexa
- Actions on Google: Build Applications for Google Assistant (Google) - Learn to build applications for Google Assistant
Libraries and Tools
- Librosa - Python library for audio and music analysis
- PyAudio - Python bindings for PortAudio, a cross-platform audio I/O library
- SpeechRecognition - Library for performing speech recognition with various engines
- Pyttsx3 - Text-to-speech conversion library in Python
- Rasa - Open source machine learning framework for building conversational AI
- Kaldi - Speech recognition toolkit
- ESPnet - End-to-End Speech Processing Toolkit
- Transformers - Hugging Face's library with state-of-the-art models for speech recognition and NLP
Project Ideas
Here are some project ideas to apply what you've learned in this tutorial:
Beginner Projects
- Smart Home Assistant - Build a voice assistant that can control smart home devices
- Voice-Controlled Music Player - Create an application that plays music based on voice commands
- Meeting Transcriber - Develop a tool that transcribes meetings and identifies speakers
- Voice Memo App - Build an application that records, transcribes, and organizes voice memos
- Audio Classification System - Create a system that can classify different types of sounds
Advanced Projects
- Multilingual Voice Assistant - Build a voice assistant that can understand and respond in multiple languages
- Emotion-Aware Voice Interface - Create a voice interface that adapts its responses based on detected emotions
- Voice Cloning System - Develop a system that can clone a person's voice from a few samples
- Real-time Audio Enhancement - Build a tool that enhances audio quality in real-time (noise reduction, echo cancellation)
- Multimodal Assistant - Create an assistant that combines voice, vision, and text for more natural interactions
Community and Forums
Join these communities to connect with other developers working on voice assistants and audio processing:
- Stack Overflow - Tags: speech-recognition, text-to-speech, voice-assistant
- Reddit - r/MachineLearning, r/speechrecognition, r/voiceassistants
- GitHub - Follow repositories related to speech recognition and voice assistants
- Discord - Join AI and ML communities with channels dedicated to speech and audio
- Meetups - Look for local or virtual meetups focused on voice technology and audio processing
Industry Trends
Stay updated on these emerging trends in voice assistants and audio processing:
- Multimodal AI - Integration of voice with other modalities like vision and text
- On-device Processing - Moving speech recognition and NLU to edge devices for privacy and reduced latency
- Conversational AI - More natural and context-aware conversations with voice assistants
- Voice Cloning and Synthesis - Creating more natural and customizable voices
- Emotion Recognition - Detecting and responding to user emotions in voice interactions
- Ambient Computing - Voice interfaces that blend seamlessly into the environment
Final Challenge: Build Your Own Voice Assistant Product
Combine everything you've learned in this tutorial to build a complete voice assistant product.
- Choose a specific use case (e.g., productivity assistant, cooking assistant, fitness coach)
- Design the voice interface with appropriate prompts and responses
- Implement speech recognition, NLU, dialog management, and TTS
- Add domain-specific features relevant to your use case
- Deploy your assistant on your preferred platform
- Test with real users and iterate based on feedback
Share your project with the community and continue to enhance it as you learn more!