
heyisula/binarytextclassificationNLP

Binary Text Classification - NLP

Classifying News Articles Using TF-IDF & Logistic Regression


Python scikit-learn Pandas NumPy Matplotlib License: MIT


98.1% accuracy (mean over 5-fold cross-validation) on binary classification of Reuters news articles, using a tuned Logistic Regression pipeline with TF-IDF features.



🎯 Overview

This project implements a binary text classification pipeline to categorize Reuters news articles into two classes. The pipeline covers the full ML lifecycle, from raw text preprocessing and feature engineering to hyperparameter tuning and comprehensive model evaluation.

✨ Features

  • 🧹 Text Preprocessing: HTML/URL removal, lemmatization, stopword filtering
  • 📊 TF-IDF Vectorization: Up to 5,000 features extracted from text
  • 🔍 Grid Search CV: Exhaustive hyperparameter tuning with 5-fold cross-validation
  • 📈 Rich Evaluation: ROC/PR curves, confusion matrix, feature importance
  • 📝 Prediction Export: Full test set predictions saved to CSV

🏆 Results at a Glance

| Metric | Score |
|---|---|
| Accuracy | 98.13% |
| ROC-AUC | 0.99 |
| Best C | 10.0 |
| Best Solver | liblinear / lbfgs |
| CV Folds | 5 |
| TF-IDF Features | 5,000 |

🏗️ Project Structure

binarytextclassificationNLP/
│
├── 📓 train.ipynb              # Main training notebook (end-to-end pipeline)
├── 📄 README.md
├── 📜 LICENSE                  # MIT License
│
├── 📂 data/
│   ├── trainset.txt            # Labeled training data (tab-separated)
│   └── testsetwithoutlabels.txt # Unlabeled test data
│
└── 📂 out/
    ├── 📂 cv/
    │   └── results.csv         # Grid search cross-validation results
    ├── 📂 plots/
    │   ├── roc_curve.png       # ROC curve plot
    │   ├── pr_curve.png        # Precision-Recall curve
    │   ├── confusion_matrix.png
    │   ├── feature_importance.png
    │   └── grid_search_heatmap.png
    └── 📂 pred/
        └── final_results_fixed.csv  # Predictions on the test set

⚙️ Pipeline

graph LR
    A[📄 Raw Text] --> B[🧹 Preprocessing]
    B --> C[📊 TF-IDF Vectorizer]
    C --> D[🔍 Grid Search CV]
    D --> E[🤖 Logistic Regression]
    E --> F[📈 Evaluation]
    F --> G[📝 Predictions]

    style A fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style B fill:#1e293b,stroke:#8b5cf6,color:#e2e8f0
    style C fill:#1e293b,stroke:#06b6d4,color:#e2e8f0
    style D fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style E fill:#1e293b,stroke:#10b981,color:#e2e8f0
    style F fill:#1e293b,stroke:#ef4444,color:#e2e8f0
    style G fill:#1e293b,stroke:#ec4899,color:#e2e8f0

1️⃣ Text Preprocessing

  • Remove HTML tags, URLs, punctuation, and numbers
  • Convert to lowercase
  • Remove stopwords and short words (length ≤ 2)
  • Lemmatize tokens using WordNet
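
The steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `preprocess`, the tiny `STOPWORDS` set, and the `identity_lemmatizer` placeholder are hypothetical names, and only the standard library is used so that the snippet runs without any NLTK data. In practice the stopword set would presumably come from `nltk.corpus.stopwords.words('english')` and the lemmatizer from `nltk.stem.WordNetLemmatizer().lemmatize`.

```python
import html
import re

# Tiny illustrative stopword set (assumption); NLTK's English list is far larger.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "are", "was"}


def identity_lemmatizer(token):
    """Placeholder (assumption): swap in WordNetLemmatizer().lemmatize."""
    return token


def preprocess(text, stopwords=STOPWORDS, lemmatize=identity_lemmatizer):
    """Clean one raw article string and return space-joined tokens."""
    text = html.unescape(text)                          # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)                # remove HTML tags
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text) # remove URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # remove punctuation and numbers
    text = text.lower()                                 # convert to lowercase
    tokens = [
        tok for tok in text.split()
        if len(tok) > 2 and tok not in stopwords        # drop short words and stopwords
    ]
    return " ".join(lemmatize(tok) for tok in tokens)   # lemmatize what remains


sample = "<p>The markets rallied 5% on Monday, see http://example.com/x for details.</p>"
print(preprocess(sample))
```

Passing the lemmatizer in as a function keeps the cleaning logic testable on machines where the WordNet corpus has not been downloaded.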

2️⃣ Feature Extraction

  • TF-IDF Vectorizer with a vocabulary cap of 5,000 features
  • Fit on the training data only; the fitted transform is then applied to the validation and test sets
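
As a rough illustration of the fit/transform split described above, the sketch below uses scikit-learn's `TfidfVectorizer` with `max_features=5000`. The two tiny document lists are made up for the example, and any other vectorizer settings the notebook may use are not known from this README.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-preprocessed documents.
train_docs = ["oil prices rise sharply", "bank cuts interest rate", "oil output falls"]
val_docs = ["interest rate rise expected", "unseen token only"]

vectorizer = TfidfVectorizer(max_features=5000)  # vocabulary cap from the README
X_train = vectorizer.fit_transform(train_docs)   # learns vocabulary and IDF weights
X_val = vectorizer.transform(val_docs)           # reuses them without refitting

print(X_train.shape, X_val.shape)
```

Words that never appeared in the training documents are simply ignored at transform time, so the second validation document maps to an all-zero row here.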

3️⃣ Model Training & Tuning

  • Logistic Regression with GridSearchCV

  • Parameter grid:

    | Parameter | Values |
    |---|---|
    | C | 0.01, 0.1, 1.0, 10.0 |
    | solver | liblinear, lbfgs |
  • 5-fold stratified cross-validation

  • Best result: C=10.0 with 98.13% CV accuracy
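
A search over this grid might look like the sketch below. The toy corpus, the `Pipeline` wrapper, the step names `tfidf` and `clf`, and the `random_state` are assumptions made for illustration; the README does not say whether the notebook wraps the vectorizer in a pipeline. Doing so has the advantage that TF-IDF is refit inside each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 10 documents per class, labels +1 / -1.
pos_words = ["oil", "crude", "barrel", "opec", "refinery"]
neg_words = ["bank", "loan", "rate", "credit", "deposit"]
texts = [f"{pos_words[i % 5]} {pos_words[(i + 1) % 5]} market" for i in range(10)]
texts += [f"{neg_words[i % 5]} {neg_words[(i + 1) % 5]} market" for i in range(10)]
labels = [1] * 10 + [-1] * 10

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Same grid as in the table above: 4 values of C x 2 solvers = 8 candidates.
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__solver": ["liblinear", "lbfgs"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, round(search.best_score_, 4))
```

The scores on this toy corpus say nothing about the Reuters data; the snippet only shows how the grid and the stratified folds fit together.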

4️⃣ Evaluation Suite

Comprehensive model evaluation including:

  • Accuracy, Precision, Recall, F1-Score
  • ROC Curve & AUC
  • Precision-Recall Curve & Average Precision
  • Confusion Matrix
  • Log Loss
  • Top 30 most influential words (feature coefficient analysis)
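
Most of the metrics listed above are available in `sklearn.metrics`. The sketch below computes them on a small made-up set of labels and predicted probabilities; the 0.5 decision threshold and the variable names are assumptions, not details taken from the notebook.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical ground truth (+1 / -1) and predicted P(label == +1).
y_true = np.array([1, 1, 1, -1, -1, -1])
y_prob = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.6])
y_pred = np.where(y_prob >= 0.5, 1, -1)  # assumed 0.5 threshold

report = {
    # Threshold-dependent metrics use the hard predictions.
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # Ranking and probability metrics use the scores directly.
    "roc_auc": roc_auc_score(y_true, y_prob),
    "average_precision": average_precision_score(y_true, y_prob),
    "log_loss": log_loss(y_true, y_prob),
}

# Rows are true classes, columns are predicted classes, ordered [-1, +1].
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])

for name, value in report.items():
    print(f"{name}: {value:.4f}")
print(cm)
```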

📊 Results

ROC Curve

[Figure: ROC curve]

Precision-Recall Curve

[Figure: Precision-Recall curve]

Top 30 Most Influential Words

[Figure: Feature importance]
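
A ranking like the one behind this plot can be derived from the coefficients of a fitted linear model. The helper below is a hypothetical sketch that ranks by absolute coefficient; the README does not state whether the notebook ranks by absolute value or lists positive and negative words separately. With scikit-learn, the inputs would typically be `vectorizer.get_feature_names_out()` and `model.coef_[0]`.

```python
import numpy as np


def top_features(feature_names, coefficients, k=30):
    """Return the k (name, coefficient) pairs with the largest |coefficient|."""
    coefficients = np.asarray(coefficients, dtype=float)
    order = np.argsort(-np.abs(coefficients))[:k]  # indices, largest magnitude first
    return [(feature_names[i], float(coefficients[i])) for i in order]


# Made-up vocabulary and weights, for illustration only.
names = ["oil", "bank", "market", "crude", "loan"]
coefs = [2.1, -1.7, 0.05, 1.2, -0.9]
print(top_features(names, coefs, k=3))
```

The sign of each coefficient indicates which class a word pushes the prediction toward, so it is kept in the output instead of being discarded.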

🔬 Grid Search Results

| C | Solver | Mean CV Accuracy | Rank |
|---|---|---|---|
| 0.01 | liblinear | 69.38% | 7 |
| 0.01 | lbfgs | 51.25% | 8 |
| 0.1 | liblinear | 91.25% | 5 |
| 0.1 | lbfgs | 88.13% | 6 |
| 1.0 | liblinear | 97.50% | 3 |
| 1.0 | lbfgs | 97.50% | 3 |
| 10.0 | liblinear | 98.13% | 🥇 1 |
| 10.0 | lbfgs | 98.13% | 🥇 1 |
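
The repeated ranks in the table (two 1s, two 3s, and no 2 or 4) are consistent with competition-style ranking, in which tied scores share the lowest rank. The small sketch below reproduces the Rank column from the accuracies; `competition_rank` is a hypothetical helper, and it is an assumption that the table was produced this way.

```python
def competition_rank(scores):
    """Rank 1 is best; tied scores share the lowest rank of their group."""
    return [1 + sum(other > score for other in scores) for score in scores]


# Mean CV accuracy in table order (C = 0.01, 0.1, 1.0, 10.0; liblinear then lbfgs).
mean_cv_accuracy = [69.38, 51.25, 91.25, 88.13, 97.50, 97.50, 98.13, 98.13]
print(competition_rank(mean_cv_accuracy))
```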

🚀 Getting Started

Prerequisites

pip install numpy pandas matplotlib seaborn scikit-learn nltk

Download NLTK Data

import nltk
nltk.download('stopwords')
nltk.download('wordnet')

Run

  1. Clone the repository:

    git clone https://github.com/heyisula/binarytextclassificationNLP.git
    cd binarytextclassificationNLP
  2. Open and run the notebook:

    jupyter notebook train.ipynb
  3. Results will be saved to the out/ directory.


📋 Data Format

Training Data (trainset.txt)

Tab-separated with 4 fields:

<label> \t <title> \t <date> \t <body>

| Field | Description |
|---|---|
| label | +1 (positive) or -1 (negative) |
| title | Article headline |
| date | Publication date line |
| body | Full article body text |

Test Data (testsetwithoutlabels.txt)

Tab-separated with 3 fields (no label):

<title> \t <date> \t <body>
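
Both layouts can be read with a plain split on tabs. The sketch below is one possible reader, not the notebook's loader: `parse_line` is a hypothetical helper, the sample records are invented, and capping the number of splits (so that any tab inside the body stays in the body) is an assumption about the data.

```python
def parse_line(line, labeled=True):
    """Split one tab-separated record into a dict of named fields."""
    names = ["label", "title", "date", "body"] if labeled else ["title", "date", "body"]
    # Split at most len(names) - 1 times so extra tabs remain part of the body.
    parts = line.rstrip("\n").split("\t", len(names) - 1)
    if len(parts) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(parts)}")
    record = dict(zip(names, parts))
    if labeled:
        record["label"] = int(record["label"])  # "+1" -> 1, "-1" -> -1
    return record


# Invented sample records in the two layouts described above.
train_line = "+1\tOil prices rise\t1-JAN-1987\tCrude oil prices rose on Monday.\n"
test_line = "Bank cuts rate\t2-JAN-1987\tThe central bank cut its key rate.\n"

print(parse_line(train_line))
print(parse_line(test_line, labeled=False))
```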

🛠️ Tech Stack

| Tool | Purpose |
|---|---|
| Python | Core language |
| Jupyter | Interactive development |
| NumPy | Numerical computing |
| Pandas | Data manipulation |
| scikit-learn | ML models & evaluation |
| 📊 Matplotlib & Seaborn | Visualization |
| 📝 NLTK | Text preprocessing |

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.


Made with ❤️ by Isula Dissanayake

⭐ Star this repo if you found it useful!
