98.1% cross-validated accuracy on binary classification of Reuters news articles, using a tuned Logistic Regression pipeline with TF-IDF features and 5-fold cross-validation.
This project implements a binary text classification pipeline to categorize Reuters news articles into two classes. The pipeline covers the full ML lifecycle, from raw text preprocessing and feature engineering to hyperparameter tuning and comprehensive model evaluation.
binarytextclassificationNLP/
│
├── train.ipynb                      # Main training notebook (end-to-end pipeline)
├── README.md
├── LICENSE                          # MIT License
│
├── data/
│   ├── trainset.txt                 # Labeled training data (tab-separated)
│   └── testsetwithoutlabels.txt     # Unlabeled test data
│
└── out/
    ├── cv/
    │   └── results.csv              # Grid search cross-validation results
    ├── plots/
    │   ├── roc_curve.png            # ROC curve plot
    │   ├── pr_curve.png             # Precision-Recall curve
    │   ├── confusion_matrix.png
    │   ├── feature_importance.png
    │   └── grid_search_heatmap.png
    └── pred/
        └── final_results_fixed.csv  # Predictions on the test set
graph LR
A[Raw Text] --> B[Preprocessing]
B --> C[TF-IDF Vectorizer]
C --> D[Grid Search CV]
D --> E[Logistic Regression]
E --> F[Evaluation]
F --> G[Predictions]
style A fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
style B fill:#1e293b,stroke:#8b5cf6,color:#e2e8f0
style C fill:#1e293b,stroke:#06b6d4,color:#e2e8f0
style D fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
style E fill:#1e293b,stroke:#10b981,color:#e2e8f0
style F fill:#1e293b,stroke:#ef4444,color:#e2e8f0
style G fill:#1e293b,stroke:#ec4899,color:#e2e8f0
- Remove HTML tags, URLs, punctuation, and numbers
- Convert to lowercase
- Remove stopwords and short words (length ≤ 2)
- Lemmatize tokens using WordNet
- TF-IDF Vectorizer with a vocabulary cap of 5,000 features
- Fit on training data, transform applied to validation & test sets
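The cleaning steps listed above (HTML, URL, punctuation and number removal, lowercasing, stopword and short-word filtering, lemmatization) can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the function name `clean_text`, the regular expressions, and the tiny `STOPWORDS` set are assumptions, and lemmatization is passed in as a callable so the example runs without NLTK data.

```python
import html
import re
from typing import Callable, Iterable

# Tiny stand-in stopword set (assumption); the project uses NLTK's stopword list.
STOPWORDS = frozenset({"the", "and", "for", "with", "that", "this", "was", "are"})

_TAG_RE = re.compile(r"<[^>]+>")                  # HTML tags
_URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # URLs
_NON_LETTER_RE = re.compile(r"[^a-z\s]")          # punctuation and digits (after lowercasing)


def clean_text(
    text: str,
    stopwords: Iterable[str] = STOPWORDS,
    lemmatize: Callable[[str], str] = lambda token: token,
) -> str:
    """Apply the cleaning steps in order and return a space-joined token string."""
    stop = set(stopwords)
    text = html.unescape(text)
    text = _TAG_RE.sub(" ", text)
    text = _URL_RE.sub(" ", text)
    text = text.lower()
    text = _NON_LETTER_RE.sub(" ", text)
    # Drop stopwords and tokens of length <= 2, then lemmatize what is left.
    tokens = [t for t in text.split() if len(t) > 2 and t not in stop]
    return " ".join(lemmatize(t) for t in tokens)


sample = "<p>The U.S. economy grew 3.2% in 1987, see http://example.com/report</p>"
cleaned = clean_text(sample)
print(cleaned)
```

To get closer to the described pipeline, one could pass `nltk.corpus.stopwords.words("english")` as `stopwords` and `nltk.stem.WordNetLemmatizer().lemmatize` as `lemmatize`, once the NLTK resources have been downloaded.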
- Logistic Regression with GridSearchCV
- Parameter grid:

| Parameter | Values |
|---|---|
| C | 0.01, 0.1, 1.0, 10.0 |
| solver | liblinear, lbfgs |

- 5-fold stratified cross-validation
- Best result: C=10.0 with 98.13% CV accuracy
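As a rough sketch of how this search might be wired up in scikit-learn, the example below combines a 5,000-feature TF-IDF vectorizer with Logistic Regression and searches the grid listed above under 5-fold stratified cross-validation. The toy corpus, the step names `tfidf` and `clf`, `max_iter=1000`, and the shuffle seed are assumptions for illustration. Wrapping the vectorizer in a `Pipeline` is also a choice made here so that it is re-fit on each training fold; the notebook may fit it separately.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy corpus (assumption) standing in for the cleaned Reuters text; labels are +1 / -1.
positive = [
    "oil prices rise sharply",
    "crude oil output cut",
    "opec raises oil quota",
    "oil exports increase",
    "refinery boosts crude output",
]
negative = [
    "bank raises interest rates",
    "central bank cuts rates",
    "interest rates hold steady",
    "bank lending slows",
    "rates fall as bank eases",
]
texts = positive + negative
labels = [1] * len(positive) + [-1] * len(negative)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),  # vocabulary cap from the README
    ("clf", LogisticRegression(max_iter=1000)),     # max_iter is an assumption
])

# Same grid as the table above, addressed through the pipeline step name.
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__solver": ["liblinear", "lbfgs"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data, `search.cv_results_` holds the per-combination scores that a file such as `out/cv/results.csv` would typically be built from.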
Comprehensive model evaluation including:
- Accuracy, Precision, Recall, F1-Score
- ROC Curve & AUC
- Precision-Recall Curve & Average Precision
- Confusion Matrix
- Log Loss
- Top 30 most influential words (feature coefficient analysis)
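The scalar metrics in this list map directly onto functions in `sklearn.metrics`. The sketch below computes them on a handful of made-up labels and positive-class probabilities; the data, the 0.5 decision threshold, and the `metrics` dictionary are assumptions for illustration and do not reproduce the project's results.

```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical ground truth (+1 / -1) and predicted probabilities of the +1 class.
y_true = [1, 1, 1, -1, -1, -1, 1, -1]
y_prob = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]

# Hard predictions from a 0.5 threshold (an assumption for this sketch).
y_pred = [1 if p >= 0.5 else -1 for p in y_prob]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, pos_label=1),
    "recall": recall_score(y_true, y_pred, pos_label=1),
    "f1": f1_score(y_true, y_pred, pos_label=1),
    # Ranking-based metrics use the probabilities, not the thresholded labels.
    "roc_auc": roc_auc_score(y_true, y_prob),
    "average_precision": average_precision_score(y_true, y_prob, pos_label=1),
    "log_loss": log_loss(y_true, y_prob),
}

# Rows are true classes and columns are predicted classes, ordered as [-1, +1].
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(cm)
```

With a fitted classifier, `y_prob` would typically come from `predict_proba(X)[:, 1]`, since scikit-learn orders the probability columns by sorted class label.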
| C | Solver | Mean CV Accuracy | Rank |
|---|---|---|---|
| 0.01 | liblinear | 69.38% | 7 |
| 0.01 | lbfgs | 51.25% | 8 |
| 0.1 | liblinear | 91.25% | 5 |
| 0.1 | lbfgs | 88.13% | 6 |
| 1.0 | liblinear | 97.50% | 3 |
| 1.0 | lbfgs | 97.50% | 3 |
| 10.0 | liblinear | 98.13% | 1 |
| 10.0 | lbfgs | 98.13% | 1 |
pip install numpy pandas matplotlib seaborn scikit-learn nltk

import nltk
nltk.download('stopwords')
nltk.download('wordnet')

- Clone the repository:
  git clone https://github.com/heyisula/binarytextclassificationNLP.git
  cd binarytextclassificationNLP
- Open and run the notebook:
  jupyter notebook train.ipynb
- Results will be saved to the out/ directory.
Tab-separated with 4 fields:
<label> \t <title> \t <date> \t <body>
| Field | Description |
|---|---|
| label | +1 (positive) or -1 (negative) |
| title | Article headline |
| date | Publication date line |
| body | Full article body text |
Tab-separated with 3 fields (no label):
<title> \t <date> \t <body>
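Both layouts can be read with one small parser that tells training rows from test rows by field count. The sketch below is an illustration under assumptions: the `Article` type, the `parse_line` helper, and the sample lines are hypothetical, and it assumes the article body contains no tab characters.

```python
from typing import NamedTuple, Optional


class Article(NamedTuple):
    label: Optional[int]  # +1 or -1 for training rows, None for unlabeled test rows
    title: str
    date: str
    body: str


def parse_line(line: str) -> Article:
    """Parse one tab-separated row in either the 4-field or the 3-field layout."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) == 4:
        label, title, date, body = fields
        return Article(int(label), title, date, body)  # int() accepts "+1" and "-1"
    if len(fields) == 3:
        title, date, body = fields
        return Article(None, title, date, body)
    raise ValueError(f"expected 3 or 4 tab-separated fields, got {len(fields)}")


# Made-up sample rows, not taken from the dataset.
train_row = parse_line("+1\tOil prices rise\tLONDON, March 3\tCrude futures climbed today.\n")
test_row = parse_line("Bank holds rates\tTOKYO, April 9\tThe central bank left rates unchanged.\n")

print(train_row.label, train_row.title)
print(test_row.label, test_row.title)
```

For whole files, the same helper could be applied to each line of `data/trainset.txt` and `data/testsetwithoutlabels.txt` opened in text mode.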
This project is licensed under the MIT License; see the LICENSE file for details.
Made with ❤️ by Isula Dissanayake
Star this repo if you found it useful!


