🏆 Production-Quality ML Project

Speech Emotion Recognition

6 ML models detect 8 human emotions from speech, powered by MFCC and Mel Spectrogram features, CNN, Bi-LSTM, and a CNN-LSTM Hybrid, all trained on the RAVDESS dataset.

8 Emotions
6 ML Models
89.6% Best Accuracy
0.993 Best AUC
1,440 Audio Samples
🧠 Architecture
End-to-end MLOps pipeline from raw audio to emotion prediction
Step 1: Data Ingestion

Auto-downloads RAVDESS via kagglehub in one line; no API key setup needed. Parses 1,440 speech recordings and builds a labelled catalogue.
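The labelled catalogue can be built from file names alone, because RAVDESS encodes its labels in seven dash-separated fields (modality, vocal channel, emotion, intensity, statement, repetition, actor). The sketch below shows one way to do that; the function names `parse_ravdess_name` and `build_catalogue` and the row layout are illustrative assumptions, not the project's actual code.

```python
from pathlib import Path

# RAVDESS emotion codes (third field of the file name).
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}


def parse_ravdess_name(path):
    """Return one catalogue row for a file such as 03-01-06-01-02-01-12.wav."""
    fields = Path(path).stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"unexpected RAVDESS file name: {path}")
    _modality, _channel, emotion, intensity, _statement, _repetition, actor = fields
    return {
        "path": str(path),
        "emotion": EMOTIONS[emotion],
        "intensity": "strong" if intensity == "02" else "normal",
        "actor": int(actor),
        # In RAVDESS, odd-numbered actors are male and even-numbered are female.
        "gender": "male" if int(actor) % 2 else "female",
    }


def build_catalogue(root):
    """Walk a dataset directory (hypothetical layout) and parse every .wav file."""
    return [parse_ravdess_name(p) for p in sorted(Path(root).rglob("*.wav"))]
```

In practice `build_catalogue` would be called on the directory returned by the kagglehub download, and the rows could be loaded into a DataFrame for splitting.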

Step 2: Preprocessing

Silence trimming, spectral noise gating, resampling to 22 kHz. All configurable via YAML.
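To make the silence-trimming step concrete, here is a dependency-free sketch of energy-based trimming: frames whose RMS falls more than `top_db` decibels below the loudest frame are dropped from both ends. The threshold and frame sizes are placeholder values, since the project's YAML settings are not shown here, and audio libraries typically offer an equivalent built-in routine.

```python
import math


def frame_rms(samples, frame_len, hop):
    """RMS energy of each full frame; a trailing partial frame is ignored."""
    out = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        out.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return out


def trim_silence(samples, top_db=30.0, frame_len=2048, hop=512):
    """Drop leading/trailing frames more than top_db below the loudest frame."""
    rms = frame_rms(samples, frame_len, hop)
    if not rms or max(rms) == 0.0:
        return samples  # too short or entirely silent: nothing to trim
    threshold = max(rms) * 10 ** (-top_db / 20.0)
    loud = [i for i, r in enumerate(rms) if r >= threshold]
    start = loud[0] * hop
    end = min(len(samples), loud[-1] * hop + frame_len)
    return samples[start:end]
```

Because the decision is made per frame, a little silence on either side of the speech survives the trim; a smaller `hop` tightens the cut at the cost of more computation.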

Step 3: Feature Extraction

MFCC + Δ + Δ², Mel Spectrogram, Chroma, ZCR, RMS, Spectral Contrast, Tonnetz → ~315 flat features + 2-D tensors.

Step 4: Model Training

RF, SVM, and XGBoost tuned with RandomizedSearchCV. CNN, Bi-LSTM, and CNN-LSTM trained with early stopping and learning-rate reduction (ReduceLR).
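Early stopping and learning-rate reduction both react to a validation loss that has stopped improving. The sketch below replays a loss history and applies the two rules in plain Python to show how they interact. It is a simplified illustration, not the deep learning framework's callbacks, and the function name and every default (patience values, factor, starting rate) are assumed for the example.

```python
def train_with_plateau_rules(val_losses, lr=1e-3, reduce_patience=2,
                             factor=0.5, stop_patience=4, min_lr=1e-6):
    """Replay a validation-loss history and apply both plateau rules.

    Returns (stopped_epoch, final_lr, best_epoch); epochs are 0-based.
    """
    best, best_epoch = float("inf"), -1
    since_best = 0      # epochs since the best loss (drives early stopping)
    since_reduce = 0    # stagnant epochs since the last learning-rate cut
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
            since_best = since_reduce = 0
            continue
        since_best += 1
        since_reduce += 1
        if since_reduce >= reduce_patience:
            lr = max(lr * factor, min_lr)  # cut the rate, but never below min_lr
            since_reduce = 0
        if since_best >= stop_patience:
            return epoch, lr, best_epoch   # stop; caller restores best weights
    return len(val_losses) - 1, lr, best_epoch
```

Setting the reduction patience lower than the stopping patience gives the optimizer a chance to recover at a smaller learning rate before training is abandoned.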

Step 5: Evaluation

Accuracy, F1-Weighted, F1-Macro, ROC-AUC (one-vs-rest), confusion matrices, SHAP explainability.
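F1-Macro and F1-Weighted differ only in how per-class scores are averaged: macro treats every emotion equally, while weighted scales each class by its number of true samples. The following sketch computes both from scratch to make the difference visible. The function name `f1_scores` is illustrative; in a real pipeline a metrics library would normally do this.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (accuracy, macro_f1, weighted_f1) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[l] * support[l] for l in labels) / len(pairs)
    return accuracy, macro, weighted
```

When classes are imbalanced, a gap between the two averages usually means the rarer emotions are being recognised less reliably than the common ones.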

Step 6: Deployment

Streamlit app with waveform and Mel Spectrogram visualization, real-time prediction, and a model selector.

🎭 8 Emotions Detected
Trained on RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)
😐 Neutral
😌 Calm
😄 Happy
😢 Sad
😡 Angry
😨 Fearful
🤢 Disgust
😲 Surprised
📊 Performance Results
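All models were evaluated on a 20% test holdout. CNN and SVM tie for the best accuracy (89.58%): CNN has the best weighted F1 and SVM the best ROC-AUC. The CNN-LSTM Hybrid scores lowest on all three metrics.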
Model            Type            Accuracy   F1 Weighted   ROC-AUC
CNN (Best F1)    Deep Learning   89.58%     0.8964        0.9894
SVM (Best AUC)   Classical ML    89.58%     0.8949        0.9926
Bi-LSTM          Deep Learning   85.76%     0.8579        0.9761
XGBoost          Classical ML    85.76%     0.8573        0.9905
Random Forest    Classical ML    84.90%     0.8480        0.9914
CNN-LSTM         Deep Learning   81.42%     0.8111        0.9624
🔬 Feature Engineering
~315-dimensional feature vector extracted from each audio clip
🎵 MFCC + Δ + Δ²

40 mel-frequency cepstral coefficients plus their first- and second-order derivatives. Mean and std pooled → 120 values.
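Pooling is what turns a variable-length sequence of frames into a fixed-length vector that classical models can consume. The sketch below shows mean and std pooling over a frames-by-coefficients matrix, plus a plain frame-to-frame difference as a stand-in for the derivative features. Audio libraries such as librosa compute a smoothed delta instead, and the exact pooling that produces the 120 values quoted above is not shown here, so treat both helpers as illustrative.

```python
import math


def delta(frames):
    """Frame-to-frame difference: a crude stand-in for a smoothed delta."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]


def mean_std_pool(frames):
    """Collapse an (n_frames x n_coeffs) matrix into means followed by stds."""
    n = len(frames)
    columns = list(zip(*frames))  # one tuple per coefficient
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n)
            for col, m in zip(columns, means)]
    return means + stds


# Toy input: 3 frames of 2 coefficients (real MFCC matrices are much larger).
mfcc = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
d1 = delta(mfcc)   # first-order differences
d2 = delta(d1)     # second-order differences
features = mean_std_pool(mfcc) + mean_std_pool(d1) + mean_std_pool(d2)
```

The length of `features` depends only on the number of coefficients, never on clip duration, which is the property the flat feature vector relies on.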

🌈 Mel Spectrogram

128-bin log-power mel spectrogram captures frequency distribution matching human auditory perception.
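The "human auditory perception" part comes from the mel scale, which spaces bands densely at low frequencies and sparsely at high ones. As a sketch, the code below uses the common HTK-style formula, mel = 2595 · log10(1 + f / 700), to place 128 band centres. Some libraries default to a different mel variant, and the 11,025 Hz upper limit assumes a 22.05 kHz sample rate, so the exact numbers may differ from the project's.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style mel conversion: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels=128, f_min=0.0, f_max=11025.0):
    """Centre frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]
```

Adjacent centres sit only a few tens of hertz apart at the low end and a few hundred hertz apart near the top, which mirrors how pitch resolution in hearing falls off with frequency.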

🎹 Chroma Features

12-dimensional pitch class profile that captures the harmonic and tonal content of speech.

⚡ ZCR + RMS Energy

Zero-crossing rate measures noisiness. RMS energy captures loudness and intensity patterns.
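Both features are simple enough to compute by hand for a single frame, which makes their meaning easy to check. This is a minimal sketch with illustrative function names; a real pipeline would compute them per frame over the whole clip with an audio library.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
    return crossings / (len(frame) - 1)


def rms_energy(frame):
    """Root-mean-square amplitude of the frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


noisy = [1.0, -1.0, 1.0, -1.0]   # sign flips on every sample
steady = [0.5, 0.5, 0.5, 0.5]    # no sign changes at all
print(zero_crossing_rate(noisy), rms_energy(noisy))
print(zero_crossing_rate(steady), rms_energy(steady))
```

A high zero-crossing rate with low RMS tends to indicate breathy or fricative sound, while low zero-crossing with high RMS is typical of loud voiced speech.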

Spectral Contrast

7 sub-band spectral valley-to-peak ratio that distinguishes different types of vocal sounds.

🎸 Tonnetz

6-D tonal centroid features computed on harmonic audio, capturing musical structure and tension.

Quick Start
Run the app in 3 steps; trained models are already included in the repo
# 1. Clone the repo (models included, no training needed)
git clone https://github.com/rajneeshbabu/speech-emotion-recognition.git
cd speech-emotion-recognition

# 2. Set up virtual environment and install dependencies
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# 3. Launch the app
python -m streamlit run app.py

# To retrain from scratch: the dataset downloads automatically via kagglehub
# No Kaggle API key needed
pip install tensorflow  # for CNN / LSTM models
jupyter notebook notebooks/SER_Training.ipynb

# Inside the notebook (Python), data ingestion is a single call:
import kagglehub
path = kagglehub.dataset_download("uwrfkaggler/ravdess-emotional-speech-audio")