ML Pipeline Stages

The pipeline is composed of five deterministic stages, each implemented as an independent script and tracked by DVC.

1. Data Ingestion

  • Loads the raw dataset from a remote source

  • Keeps only rows labeled happiness or sadness

  • Converts labels to binary format

  • Performs a train/test split

  • Outputs written to data/raw/
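
The ingestion steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual script: the function name ingest, the (text, label) row format, the mapping happiness → 1 and sadness → 0, and the test_size and seed parameters are all assumptions made for the example. Loading from the remote source and writing to data/raw/ are left out.

```python
import random


def ingest(rows, test_size=0.2, seed=42):
    """Filter, binarize, and split (text, label) rows.

    Assumptions (not from the source): rows are (text, label) pairs,
    happiness maps to 1 and sadness to 0, and the split is a seeded shuffle.
    """
    label_to_binary = {"happiness": 1, "sadness": 0}

    # Keep only the two target labels and convert them to 0/1.
    kept = [
        (text, label_to_binary[label])
        for text, label in rows
        if label in label_to_binary
    ]

    # A fixed seed keeps the split reproducible across runs.
    rng = random.Random(seed)
    rng.shuffle(kept)

    n_test = round(len(kept) * test_size)
    test, train = kept[:n_test], kept[n_test:]
    return train, test
```

Fixing the seed is what keeps this stage deterministic: the same input rows always yield the same train and test sets.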

2. Data Preprocessing

  • Normalizes text data

  • Steps include lowercasing, stopword removal, lemmatization, and noise removal

  • Cleaned text stored in data/interim/
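
A rough sketch of the normalization steps listed above, using only the standard library. The stopword set and the lemma lookup table are tiny hypothetical stand-ins for whatever resources the real script uses, and "noise" is assumed here to mean URLs, digits, and punctuation; the source does not specify any of these.

```python
import re

# Tiny illustrative stopword list (assumption; a real list is much larger).
STOPWORDS = frozenset({"a", "an", "the", "i", "is", "was", "so", "and", "to", "of"})

# Toy dictionary-based lemmatizer table (assumption; stands in for a real lemmatizer).
LEMMAS = {"cats": "cat", "were": "be", "feeling": "feel", "days": "day"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def normalize(text):
    """Lowercase, strip noise, drop stopwords, and lemmatize one text."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # noise removal: URLs
    text = NON_LETTER_RE.sub(" ", text)   # noise removal: digits, punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(LEMMAS.get(t, t) for t in tokens)
```

Note that the letter filter in this sketch keeps only ASCII a-z, which would discard accented characters; that is a simplification for the example.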

3. Feature Engineering

  • Converts text into numerical features using TF-IDF

  • Feature size controlled through parameters

  • Outputs written to data/processed/
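
To make the TF-IDF step concrete, here is a small pure-Python sketch. It is not the pipeline's implementation: the function names, the max_features parameter name, the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, and the L2 row normalization are assumptions chosen to mirror one common TF-IDF variant.

```python
import math
from collections import Counter


def fit_tfidf(docs, max_features=None):
    """Build a vocabulary capped at max_features terms and their IDF weights."""
    tokenized = [doc.split() for doc in docs]

    doc_freq = Counter()   # number of documents containing each term
    totals = Counter()     # total occurrences of each term in the corpus
    for tokens in tokenized:
        doc_freq.update(set(tokens))
        totals.update(tokens)

    # Keep the most frequent terms; ties are broken alphabetically.
    kept = sorted(totals, key=lambda t: (-totals[t], t))[:max_features]
    vocab = {term: i for i, term in enumerate(sorted(kept))}

    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf


def transform(docs, vocab, idf):
    """Turn each document into an L2-normalized TF-IDF row."""
    rows = []
    for doc in docs:
        counts = Counter(t for t in doc.split() if t in vocab)
        row = [0.0] * len(vocab)
        for term, count in counts.items():
            row[vocab[term]] = count * idf[term]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm else row)
    return rows
```

Capping the vocabulary is how a parameter such as max_features bounds the width of the feature matrix, whatever the corpus size.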

4. Model Building

  • Trains a Gradient Boosting classifier

  • Hyperparameters are externally configurable

  • Trained model saved as models/model.pkl
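
The source does not say which library provides the classifier, so the following is only a toy illustration of the technique: gradient boosting of one-split decision stumps on the log-loss gradient, in plain Python. The params dictionary, its keys n_estimators and learning_rate, and all function names are hypothetical; the real stage most likely relies on a library implementation with many more options.

```python
import math
import pickle
from pathlib import Path


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Pick the (feature, threshold) split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    if best is None:  # no valid split: predict the mean residual everywhere
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def train_gradient_boosting(X, y, params):
    """Fit stumps to the residuals y - p; y must contain both classes."""
    n_estimators = params.get("n_estimators", 20)
    learning_rate = params.get("learning_rate", 0.5)
    pos_rate = sum(y) / len(y)
    base = math.log(pos_rate / (1.0 - pos_rate))  # initial log-odds
    scores = [base] * len(X)
    stumps = []
    for _ in range(n_estimators):
        # Negative gradient of log loss with respect to the raw score.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, X)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict_proba(model, row):
    """Probability of the positive class for one feature row."""
    total = sum(stump_predict(s, row) for s in model["stumps"])
    return sigmoid(model["base"] + model["learning_rate"] * total)


def save_model(model, path="models/model.pkl"):
    """Pickle the trained model, creating the parent directory if needed."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)
```

Passing hyperparameters in as a dictionary, rather than hard-coding them, is one way to keep them externally configurable as described above.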

5. Model Evaluation

  • Computes Accuracy, Precision, Recall, and AUC

  • Metrics stored in reports/metrics.json
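
For reference, the four metrics can be computed from labels, hard predictions, and scores as in the sketch below. It assumes "AUC" means the area under the ROC curve, computed here as the fraction of positive/negative pairs ranked correctly with ties counted as half; the function names and JSON key names are illustrative and not taken from the project.

```python
import json
from pathlib import Path


def binary_metrics(y_true, y_pred, y_score):
    """Accuracy, precision, recall, and ROC AUC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # ROC AUC as the probability that a random positive outscores a
    # random negative (requires at least one example of each class).
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "auc": wins / (len(pos) * len(neg)),
    }


def save_metrics(metrics, path="reports/metrics.json"):
    """Write the metrics as JSON, creating the parent directory if needed."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2)
```

The pairwise AUC loop is quadratic in the number of examples, which is fine for a sketch but slower than the sort-based computation a library would use.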