ML Pipeline Stages

The pipeline is composed of five deterministic stages, each implemented as an independent script and tracked by DVC.

1. Data Ingestion

Loads the raw dataset from a remote source
Keeps only rows labeled happiness or sadness
Converts labels to binary format
Performs train/test split
Outputs written to data/raw/
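
The ingestion steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's script: the column names `content` and `sentiment`, the choice of happiness as the positive class, the 20% test size, and the seed are all hypothetical.

```python
import random

# Assumption: happiness is the positive class (1), sadness the negative (0).
LABEL_TO_BINARY = {"happiness": 1, "sadness": 0}


def filter_and_binarize(rows):
    """Keep only happiness/sadness rows and map the label to 0/1."""
    return [
        {"content": r["content"], "label": LABEL_TO_BINARY[r["sentiment"]]}
        for r in rows
        if r["sentiment"] in LABEL_TO_BINARY
    ]


def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle with a fixed seed so the split is reproducible, then cut."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Toy rows standing in for the remote dataset (hypothetical schema).
raw_rows = [
    {"content": "what a great day", "sentiment": "happiness"},
    {"content": "i miss my friends", "sentiment": "sadness"},
    {"content": "nothing special", "sentiment": "neutral"},
    {"content": "best news ever", "sentiment": "happiness"},
    {"content": "feeling down today", "sentiment": "sadness"},
    {"content": "so happy right now", "sentiment": "happiness"},
]

filtered = filter_and_binarize(raw_rows)
train, test = split_rows(filtered)
```

Fixing the seed is what keeps the split repeatable from run to run, which a deterministic, DVC-tracked stage depends on.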

2. Data Preprocessing

Normalizes text data
Steps include lowercasing, stopword removal, lemmatization, and noise removal
Cleaned text stored in data/interim/
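
A minimal sketch of the normalization steps listed above. The stopword set and the lemma lookup are toy stand-ins chosen for illustration; a real pipeline would more likely rely on an NLP library for both, and the exact definition of "noise" (here: URLs, digits, punctuation) is an assumption.

```python
import re

# Toy resources, hypothetical: a real pipeline would load fuller lists.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "at"}
LEMMAS = {"cats": "cat", "running": "run"}


def normalize_text(text):
    """Lowercase, strip noise, drop stopwords, then map tokens to lemmas."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # remove digits and punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    tokens = [LEMMAS.get(t, t) for t in tokens]
    return " ".join(tokens)


cleaned = normalize_text("The CATS are running to http://example.com at 9!!!")
print(cleaned)  # cat run
```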

3. Feature Engineering

Converts text into numerical features using TF-IDF
Feature size controlled through parameters
Outputs written to data/processed/
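
As an illustration of this stage, the sketch below uses scikit-learn's `TfidfVectorizer` and caps the vocabulary with `max_features`. That this is the parameter the pipeline exposes is an assumption, as are the `params` dictionary and the toy texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical parameter block; the real value would come from the
# pipeline's parameter file.
params = {"max_features": 5}

train_texts = ["happy sunny day", "sad rainy day", "happy happy joy"]
test_texts = ["sad day"]

vectorizer = TfidfVectorizer(max_features=params["max_features"])

# Fit on the training split only, then reuse the same vocabulary for the
# test split so no information leaks from test to train.
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```

The training texts contain six distinct terms, so with `max_features` set to 5 both matrices end up with five columns.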

4. Model Building

Trains a Gradient Boosting classifier
Hyperparameters are externally configurable
Trained model saved as models/model.pkl
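
A minimal sketch of training and saving the model, assuming scikit-learn's `GradientBoostingClassifier` and `pickle`. The hyperparameter names and values in `params` are hypothetical, the data is synthetic, and the model is written under a temporary directory here rather than the project's real models/ folder.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical hyperparameters; the pipeline reads its own from configuration.
params = {"n_estimators": 20, "learning_rate": 0.1}

# Synthetic features standing in for the TF-IDF matrix.
rng = np.random.default_rng(0)
X_train = rng.random((40, 5))
y_train = (X_train[:, 0] > 0.5).astype(int)

model = GradientBoostingClassifier(
    n_estimators=params["n_estimators"],
    learning_rate=params["learning_rate"],
    random_state=0,  # fixed seed keeps training repeatable
)
model.fit(X_train, y_train)

# The pipeline saves to models/model.pkl; a temp directory is used here so
# the sketch does not touch a real project tree.
model_path = Path(tempfile.mkdtemp()) / "models" / "model.pkl"
model_path.parent.mkdir(parents=True, exist_ok=True)
with open(model_path, "wb") as f:
    pickle.dump(model, f)
```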

5. Model Evaluation

Computes Accuracy, Precision, Recall, and AUC
Metrics stored in reports/metrics.json
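
To show what the four metrics measure, the sketch below computes them by hand with the standard library and writes them as JSON. The labels, scores, 0.5 decision threshold, and JSON key names are assumptions; the actual stage may well use a metrics library instead.

```python
import json
import tempfile
from pathlib import Path

# Toy ground truth and predicted probabilities (hypothetical values).
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

# AUC as the fraction of (positive, negative) pairs ranked correctly,
# counting ties as half a correct pair.
pos = [p for t, p in zip(y_true, y_prob) if t == 1]
neg = [p for t, p in zip(y_true, y_prob) if t == 0]
wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)

metrics = {
    "accuracy": (tp + tn) / len(y_true),
    "precision": tp / (tp + fp),
    "recall": tp / (tp + fn),
    "auc": wins / (len(pos) * len(neg)),
}

# The pipeline writes reports/metrics.json; a temp directory is used here.
metrics_path = Path(tempfile.mkdtemp()) / "reports" / "metrics.json"
metrics_path.parent.mkdir(parents=True, exist_ok=True)
metrics_path.write_text(json.dumps(metrics, indent=4))
```

On this toy data, accuracy, precision, and recall each come out to 2/3, and AUC is 7/9 because seven of the nine positive/negative pairs are ranked in the right order.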
