🏆 Production-Quality ML Project

Speech Emotion Recognition

6 ML models detect 8 human emotions from speech — powered by MFCC, Mel Spectrogram, CNN, Bi-LSTM and a CNN-LSTM Hybrid trained on the RAVDESS dataset.


8 Emotions

6 ML Models

89.6% Best Accuracy

0.993 Best AUC

1,440 Audio Samples

🧠 Architecture

End-to-end MLOps pipeline from raw audio to emotion prediction

Data Ingestion

Auto-downloads RAVDESS via kagglehub in one line — no API key setup needed. Parses 1,440 speech recordings and builds a labelled catalogue.
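The cataloguing step described above can be sketched as follows. RAVDESS encodes its labels in each filename as seven two-digit fields, so no separate annotation file is needed. The function names `parse_ravdess_name` and `build_catalogue` are hypothetical and are not taken from the project's code.

```python
from pathlib import Path

# RAVDESS filenames hold seven two-digit fields:
# modality-vocal_channel-emotion-intensity-statement-repetition-actor
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}


def parse_ravdess_name(path):
    """Return a metadata dict for one RAVDESS file, or raise ValueError."""
    fields = Path(path).stem.split("-")
    if len(fields) != 7 or fields[2] not in EMOTIONS:
        raise ValueError(f"not a RAVDESS filename: {path}")
    actor = int(fields[6])
    return {
        "path": str(path),
        "emotion": EMOTIONS[fields[2]],
        "intensity": "strong" if fields[3] == "02" else "normal",
        "actor": actor,
        # Odd-numbered actors are male, even-numbered actors are female.
        "gender": "male" if actor % 2 else "female",
    }


def build_catalogue(root):
    """Parse every .wav file under `root`, keeping one row per filename."""
    rows = {}
    for wav in sorted(Path(root).rglob("*.wav")):
        # Guard against the same recording appearing in more than one folder.
        rows.setdefault(wav.name, parse_ravdess_name(wav))
    return list(rows.values())
```

Calling `build_catalogue(path)` on the downloaded dataset folder would give a list of rows ready to load into a DataFrame.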

Preprocessing

Silence trimming, spectral noise gating, resampling to 22 kHz. All configurable via YAML.
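As an illustration of the silence-trimming step, here is a dependency-free sketch that drops leading and trailing frames whose RMS energy falls more than `top_db` decibels below the loudest frame. The project itself may rely on a library routine instead. The function names and default values here are assumptions, not the project's configuration.

```python
import math


def frame_rms(samples, frame_len, hop):
    """RMS energy of each full frame (a trailing partial frame is dropped)."""
    out = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        out.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return out


def trim_silence(samples, top_db=30.0, frame_len=2048, hop=512):
    """Remove leading and trailing frames quieter than peak RMS - top_db."""
    rms = frame_rms(samples, frame_len, hop)
    if not rms:
        return list(samples)      # clip shorter than one frame: leave as-is
    peak = max(rms)
    if peak == 0.0:
        return []                 # the whole clip is digital silence
    # Convert the dB margin to an amplitude ratio (20 * log10 convention).
    threshold = peak * 10 ** (-top_db / 20)
    keep = [i for i, r in enumerate(rms) if r >= threshold]
    start = keep[0] * hop
    end = min(len(samples), keep[-1] * hop + frame_len)
    return samples[start:end]
```

Exposing `top_db`, `frame_len`, and `hop` as parameters is what makes a step like this easy to drive from a YAML config.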

Feature Extraction

MFCC + Δ + Δ², Mel Spectrogram, Chroma, ZCR, RMS, Spectral Contrast, Tonnetz → ~315 flat features plus 2-D tensors.

Model Training

Random Forest, SVM, and XGBoost tuned with RandomizedSearchCV. CNN, Bi-LSTM, and CNN-LSTM trained with early stopping and ReduceLR learning-rate scheduling.
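To show how early stopping and learning-rate reduction interact, the sketch below replays a list of validation losses through both rules in plain Python. It assumes "ReduceLR" means reduce-on-plateau behaviour. The patience values, factor, and the function name `run_schedule` are illustrative assumptions rather than the project's settings.

```python
def run_schedule(val_losses, lr=1e-3, factor=0.5, lr_patience=2,
                 stop_patience=4, min_lr=1e-6):
    """Replay validation losses through reduce-on-plateau and early stopping."""
    best = float("inf")
    best_epoch = None
    since_best = 0   # epochs since the best loss (drives early stopping)
    since_cut = 0    # epochs since the best loss or the last LR cut
    history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
            since_best = since_cut = 0
        else:
            since_best += 1
            since_cut += 1
            if since_cut >= lr_patience:
                lr = max(lr * factor, min_lr)   # plateau: shrink the step size
                since_cut = 0
        history.append((epoch, loss, lr))
        if since_best >= stop_patience:
            break                               # no progress: stop training
    return {
        "best_epoch": best_epoch,
        "stopped_epoch": history[-1][0] if history else None,
        "final_lr": lr,
        "history": history,
    }
```

With a shorter patience for the learning-rate cut than for stopping, the model gets a chance to recover at a smaller step size before training is abandoned.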

Evaluation

Accuracy, F1-Weighted, F1-Macro, ROC-AUC (OVR), confusion matrices, SHAP explainability.
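The difference between the macro and weighted F1 scores listed above is easiest to see in code. This is a minimal pure-Python sketch. A real pipeline would more likely call a metrics library, and the function name `f1_scores` is hypothetical.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Accuracy, macro F1, and support-weighted F1 for multiclass labels."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)          # number of true samples per class
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    n = len(pairs)
    return {
        "accuracy": sum(t == p for t, p in pairs) / n,
        # Macro: every class counts equally, however rare it is.
        "f1_macro": sum(per_class.values()) / len(labels),
        # Weighted: each class counts in proportion to its support.
        "f1_weighted": sum(per_class[c] * support[c] for c in labels) / n,
    }
```

A large gap between the two F1 values signals that the model does much worse on some of the less frequent emotions.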

Deployment

Streamlit app with waveform + Mel Spectrogram visualization, real-time prediction, model selector.

🎭 8 Emotions Detected

Trained on RAVDESS — Ryerson Audio-Visual Database of Emotional Speech and Song

😐 Neutral

😌 Calm

😄 Happy

😢 Sad

😡 Angry

😨 Fearful

🤢 Disgust

😲 Surprised

📊 Performance Results

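All models were evaluated on a 20% test holdout. CNN and SVM tie for the best accuracy (89.58%): CNN has the highest weighted F1 and SVM the highest ROC-AUC, while the CNN-LSTM Hybrid scores lowest.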

Model	Type	Accuracy	F1 Weighted	ROC-AUC
CNN (Best F1)	Deep Learning	89.58%	0.8964	0.9894
SVM (Best AUC)	Classical ML	89.58%	0.8949	0.9926
Bi-LSTM	Deep Learning	85.76%	0.8579	0.9761
XGBoost	Classical ML	85.76%	0.8573	0.9905
Random Forest	Classical ML	84.90%	0.8480	0.9914
CNN-LSTM	Deep Learning	81.42%	0.8111	0.9624

🔬 Feature Engineering

~315-dimensional feature vector extracted from each audio clip

🎵 MFCC + Δ + Δ²

40 mel-frequency cepstral coefficients + first and second order derivatives. Mean and std pooled → 120 values.

🌈 Mel Spectrogram

128-bin log-power mel spectrogram captures frequency distribution matching human auditory perception.
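The "matching human auditory perception" claim comes from the mel scale, which spaces bands narrowly at low frequencies and widely at high ones. The sketch below computes band edges for 128 mel bins using the HTK-style formula. Audio libraries may default to a different mel variant, and the 11,025 Hz upper limit assumes a 22,050 Hz sample rate, so treat both as assumptions.

```python
import math


def hz_to_mel(hz):
    """HTK-style mel scale: near-linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels=128, fmin=0.0, fmax=11025.0):
    """Return n_mels + 2 frequencies (Hz) spaced evenly on the mel scale.

    Band k of a triangular filterbank spans edges[k] to edges[k + 2].
    fmax=11025.0 assumes a 22,050 Hz sample rate (not confirmed by the source).
    """
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]
```

Comparing the first and last gaps in `mel_band_edges()` shows how much finer the resolution is in the low-frequency range, where much of the pitch and formant information in speech sits.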

🎹 Chroma Features

12-dimensional pitch class profile — captures harmonic and tonal content of speech.

⚡ ZCR + RMS Energy

Zero-crossing rate measures noisiness. RMS energy captures loudness and intensity patterns.
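Both descriptors are simple enough to compute by hand for a single frame, as this minimal sketch shows. Normalisation conventions for the zero-crossing rate vary between libraries, so the exact values may differ slightly from what the project's extractor produces.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
    return crossings / (len(frame) - 1)


def rms_energy(frame):
    """Root-mean-square amplitude of the frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))
```

A noisy, fricative-heavy frame flips sign often and scores a high ZCR, while a loud, voiced frame scores a high RMS with a comparatively low ZCR.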

📐 Spectral Contrast

Peak-to-valley spectral contrast measured in 7 sub-bands. Helps distinguish different types of vocal sounds.

🎸 Tonnetz

6D tonal centroid features on harmonic audio — captures musical structure and tension.

Quick Start

Run the app in 3 steps — models are already included in the repo

# 1. Clone the repo (models included, no training needed)
git clone https://github.com/rajneeshbabu/speech-emotion-recognition.git
cd speech-emotion-recognition

# 2. Set up virtual environment and install dependencies
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# 3. Launch the app
python -m streamlit run app.py

# To retrain from scratch — dataset downloads automatically via kagglehub
# No Kaggle API key needed
pip install tensorflow   # for CNN / LSTM models
jupyter notebook notebooks/SER_Training.ipynb

# Inside the notebook, data ingestion is a single call (Python):
import kagglehub
path = kagglehub.dataset_download("uwrfkaggler/ravdess-emotional-speech-audio")
