Speech Emotion Recognition
Six models classify eight human emotions from speech, using MFCC and Mel Spectrogram features with classical classifiers, a CNN, a Bi-LSTM, and a CNN-LSTM hybrid, all trained on the RAVDESS dataset.
Data Ingestion
Auto-downloads RAVDESS via kagglehub in one line, with no API key setup needed. Parses 1,440 speech recordings and builds a labelled catalogue.
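A minimal sketch of how the labelled catalogue could be built from RAVDESS file names, which encode seven hyphen-separated fields (modality, vocal channel, emotion, intensity, statement, repetition, actor). The function names `parse_ravdess_name` and `build_catalogue` are hypothetical, and the download call appears only as a comment because the dataset handle used by this project is not stated.

```python
from pathlib import Path

# RAVDESS emotion codes (third field of each file name).
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}


def parse_ravdess_name(path):
    """Split a name such as 03-01-06-01-02-01-12.wav into labels."""
    parts = Path(path).stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"unexpected RAVDESS file name: {path}")
    _modality, _channel, emotion, intensity, _stmt, _rep, actor = parts
    return {
        "path": str(path),
        "emotion": EMOTIONS[emotion],
        "intensity": "strong" if intensity == "02" else "normal",
        "actor": int(actor),
        # Odd-numbered actors are male, even-numbered are female.
        "gender": "female" if int(actor) % 2 == 0 else "male",
    }


def build_catalogue(root):
    """Return one record per .wav file found under root."""
    return [parse_ravdess_name(p) for p in sorted(Path(root).rglob("*.wav"))]


# Assumed usage; the handle below is a placeholder, not the project's value:
# import kagglehub
# root = kagglehub.dataset_download("<owner>/<dataset>")
# catalogue = build_catalogue(root)
```

Each record can then be loaded into a table keyed by `path`, with `emotion` as the target label.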
Preprocessing
Silence trimming, spectral noise gating, resampling to 22 kHz. All configurable via YAML.
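The silence-trimming step can be illustrated with a plain frame-energy threshold. This is a simplified stand-in rather than the project's actual routine: the function name `trim_silence`, the 30 dB threshold, and the frame sizes are assumptions, and spectral noise gating and resampling are not shown.

```python
import numpy as np


def trim_silence(y, frame_length=2048, hop_length=512, top_db=30.0):
    """Drop leading/trailing frames more than top_db below the loudest frame."""
    y = np.asarray(y, dtype=float)
    if y.size == 0:
        return y
    n_frames = 1 + max(0, len(y) - frame_length) // hop_length
    rms = np.array([
        np.sqrt(np.mean(y[i * hop_length: i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    if rms.max() == 0.0:
        return y[:0]  # the whole clip is digital silence
    # Frame level in dB relative to the loudest frame.
    db = 20.0 * np.log10(np.maximum(rms, 1e-10) / rms.max())
    keep = np.flatnonzero(db > -top_db)
    start = keep[0] * hop_length
    end = min(len(y), keep[-1] * hop_length + frame_length)
    return y[start:end]
```

In a YAML-driven setup, `top_db`, `frame_length`, and `hop_length` would be the natural values to expose as configuration keys.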
Feature Extraction
MFCC + Δ + Δ², Mel Spectrogram, Chroma, ZCR, RMS, Spectral Contrast, and Tonnetz, yielding ~315 flat features plus 2-D tensors.
Model Training
RF, SVM, XGBoost with RandomizedSearchCV. CNN, Bi-LSTM, CNN-LSTM with early stopping + ReduceLR.
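The early stopping and learning-rate reduction rules can be sketched without any deep learning framework, since the framework used here is not stated. The function below replays a list of validation losses and applies both rules; the name `train_with_callbacks` and all patience, factor, and learning-rate values are hypothetical.

```python
def train_with_callbacks(val_losses, lr=1e-3, reduce_patience=2, factor=0.5,
                         stop_patience=4, min_lr=1e-6):
    """Replay a validation-loss history and apply both plateau rules.

    Returns (epochs_run, final_lr, best_loss).
    """
    best = float("inf")
    since_best = 0    # epochs since the last improvement (early stopping)
    since_reduce = 0  # epochs since an improvement or the last LR cut
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_reduce = loss, 0, 0
            continue
        since_best += 1
        since_reduce += 1
        if since_reduce >= reduce_patience:
            lr = max(lr * factor, min_lr)  # cut the learning rate on a plateau
            since_reduce = 0
        if since_best >= stop_patience:
            return epoch, lr, best  # stop: no improvement for too long
    return len(val_losses), lr, best
```

Keeping the stopping patience larger than the reduction patience gives the lower learning rate a chance to help before training is abandoned.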
Evaluation
Accuracy, F1-Weighted, F1-Macro, ROC-AUC (OVR), confusion matrices, SHAP explainability.
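To make the difference between the listed metrics concrete, the sketch below derives accuracy, macro F1, and weighted F1 from a confusion matrix using NumPy only. The helper names are hypothetical; a library such as scikit-learn would normally be used instead.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def summarize(cm):
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    support = cm.sum(axis=1)    # true samples per class
    predicted = cm.sum(axis=0)  # predictions per class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp),
                          where=predicted > 0)
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp),
                   where=denom > 0)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "f1_macro": f1.mean(),                          # every class counts equally
        "f1_weighted": np.average(f1, weights=support),  # weighted by class size
    }
```

Macro F1 treats all emotions equally, while weighted F1 follows the class distribution, so the two diverge when some emotions have fewer recordings than others.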
Deployment
Streamlit app with waveform + Mel Spectrogram visualization, real-time prediction, model selector.
| Model | Type | Accuracy | F1 Weighted | ROC-AUC |
|---|---|---|---|---|
| CNN (Best F1) | Deep Learning | | 0.8964 | 0.9894 |
| SVM (Best AUC) | Classical ML | | 0.8949 | 0.9926 |
| Bi-LSTM | Deep Learning | | 0.8579 | 0.9761 |
| XGBoost | Classical ML | | 0.8573 | 0.9905 |
| Random Forest | Classical ML | | 0.8480 | 0.9914 |
| CNN-LSTM | Deep Learning | | 0.8111 | 0.9624 |
MFCC + Δ + Δ²
40 mel-frequency cepstral coefficients plus first- and second-order derivatives. Mean and std pooled, giving 120 values.
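A rough sketch of the stacking and pooling step, assuming the MFCCs arrive as a 40 x T matrix. `np.gradient` is used here as a simple stand-in for the smoothed delta filters that audio libraries typically provide, and the function names and random placeholder input are hypothetical.

```python
import numpy as np


def add_deltas(mfcc):
    """Stack MFCCs with first and second differences along the time axis."""
    d1 = np.gradient(mfcc, axis=1)  # first-order derivative over frames
    d2 = np.gradient(d1, axis=1)    # second-order derivative over frames
    return np.vstack([mfcc, d1, d2])


def pool(features):
    """Collapse the time axis into per-row mean and standard deviation."""
    return features.mean(axis=1), features.std(axis=1)


rng = np.random.default_rng(0)
mfcc = rng.normal(size=(40, 200))  # placeholder for real MFCC frames
stacked = add_deltas(mfcc)         # 40 coefficients x 3 = 120 rows
mean_vec, std_vec = pool(stacked)  # one 120-value vector per statistic
```

Pooling removes the time axis, so clips of different lengths map to vectors of the same size, which is what the classical models need.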
Mel Spectrogram
128-bin log-power mel spectrogram captures frequency distribution matching human auditory perception.
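The perceptual spacing of the 128 bins can be illustrated with a mel-scale conversion. This sketch uses the HTK-style formula, which may differ from the variant the project's audio library applies by default; the 11,025 Hz upper limit assumes a sample rate near 22 kHz, and the function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    """HTK-style mel scale (an assumption; other variants exist)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_band_edges(n_mels=128, fmin=0.0, fmax=11025.0):
    """n_mels + 2 edge frequencies, evenly spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    return mel_to_hz(mels)


def power_to_db(power, floor=1e-10):
    """Log compression of a power spectrogram, floored to avoid log(0)."""
    return 10.0 * np.log10(np.maximum(power, floor))
```

The band edges are densely packed at low frequencies and spread out at high frequencies, which mirrors how pitch resolution in hearing falls off as frequency rises.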
Chroma Features
12-dimensional pitch class profile; captures harmonic and tonal content of speech.
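The idea behind a pitch class profile is to fold every frequency onto one of the 12 semitone classes, ignoring the octave. The sketch below shows that folding for a handful of spectral peaks; real chroma extraction works on a full spectrogram with a filter bank, and the function names here are hypothetical.

```python
import numpy as np


def pitch_class(freq_hz, a4=440.0):
    """Map a frequency to its pitch class (0 = C, 9 = A), ignoring octave."""
    midi = 69 + 12 * np.log2(np.asarray(freq_hz, dtype=float) / a4)
    return np.round(midi).astype(int) % 12


def chroma_vector(freqs_hz, magnitudes):
    """Accumulate peak magnitudes into a normalized 12-bin profile."""
    chroma = np.zeros(12)
    np.add.at(chroma, pitch_class(freqs_hz), magnitudes)
    total = chroma.sum()
    return chroma / total if total > 0 else chroma
```

For example, peaks at 220, 440, and 880 Hz all land in the same bin because they are the same note in different octaves.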
ZCR + RMS Energy
Zero-crossing rate measures noisiness. RMS energy captures loudness and intensity patterns.
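Both features are simple per-frame statistics, as the NumPy sketch below shows. The frame and hop sizes are assumed values, and the helper names are hypothetical rather than taken from the project.

```python
import numpy as np


def frame_signal(y, frame_length=2048, hop_length=512):
    """Slice a 1-D signal into overlapping frames (one frame per row)."""
    y = np.asarray(y, dtype=float)
    if len(y) < frame_length:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(y) - frame_length) // hop_length
    idx = (np.arange(frame_length)[None, :]
           + hop_length * np.arange(n_frames)[:, None])
    return y[idx]


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def rms_energy(frames):
    """Root-mean-square amplitude per frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))
```

A pure tone crosses zero only twice per cycle, while broadband noise changes sign roughly every other sample, which is why a high zero-crossing rate signals noisiness.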
Spectral Contrast
Valley-to-peak spectral ratio across 7 sub-bands; distinguishes different types of vocal sounds.
Tonnetz
6-D tonal centroid features computed on the harmonic component of the audio; captures musical structure and tension.