SadiqCodex/Next_Word_Prediction_Model

🧠 AI Next Word & Sentence Predictor: LSTM-Based NLP Project

A complete end-to-end Natural Language Processing (NLP) project that predicts the next word or completes a sentence using a custom-trained LSTM (Long Short-Term Memory) neural network. Built with TensorFlow, Keras, and Streamlit.

The model was trained on Databricks and is deployed as an interactive web app using Streamlit.


📌 What Does This Project Do?

You type a few words → the AI model predicts what word comes next, or completes your entire sentence.

It has two prediction modes:

  • Auto Completion Mode: shows a single "ghost text" suggestion (like VS Code autocomplete). Press TAB to accept.
  • Auto Suggestion Mode: shows multiple full sentence completions as clickable buttons.

🧠 How It Works: The Complete ML Pipeline

Step 1: Dataset

  • Dataset: qoute_dataset.csv, a collection of 3,038 famous quotes with columns quote and Author.
  • Sample quotes come from Albert Einstein, J.K. Rowling, Marilyn Monroe, Jane Austen, and many more.

Step 2: Text Preprocessing (in word_pred_model.ipynb)

Raw Quotes → Lowercase → Remove Punctuation → Tokenize → Create N-gram Sequences
  1. Lowercasing: all quotes are converted to lowercase using str.lower()
  2. Punctuation Removal: a str.maketrans translation table removes all punctuation marks
  3. Tokenization: the Keras Tokenizer assigns a unique integer to every word
    • Total unique words (vocabulary size): 8,978
    • Most frequent words: the(1), you(2), to(3), and(4), a(5)...
  4. N-gram Sequence Generation: for each quote, every prefix of its token sequence becomes a training example:
    • [713] → predict 62
    • [713, 62] → predict 29
    • [713, 62, 29] → predict 19
    • This creates thousands of (input → output) training pairs
  5. Padding: all input sequences are pre-padded to length max_len - 1 so that they have a uniform length
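
The five preprocessing steps can be sketched in plain Python. This is only an illustration under assumptions, not the notebook's code: `build_word_index` and `pad_pre` are hypothetical stand-ins for the Keras Tokenizer and pre-padding, and the two quotes are toy data rather than rows from the dataset.

```python
import string

def clean(text):
    """Lowercase the text and strip ASCII punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)

def build_word_index(texts):
    """Rank words by frequency (1 = most frequent); stand-in for a fitted tokenizer."""
    counts = {}
    for text in texts:
        for word in text.split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so ties keep first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ranked)}

def prefix_sequences(token_ids):
    """Every prefix of length >= 2; the last ID of each prefix is the target."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]

def pad_pre(seq, length):
    """Left-pad with zeros, keeping only the last `length` IDs if too long."""
    seq = seq[-length:]
    return [0] * (length - len(seq)) + seq

# Toy data (assumption): two short quotes instead of the real CSV
quotes = ["Be yourself; everyone else is already taken.", "Be the change."]
cleaned = [clean(q) for q in quotes]
word_index = build_word_index(cleaned)

sequences = []
for text in cleaned:
    ids = [word_index[w] for w in text.split()]
    sequences.extend(prefix_sequences(ids))

max_len = max(len(s) for s in sequences)
padded = [pad_pre(s, max_len) for s in sequences]
X = [row[:-1] for row in padded]  # inputs of length max_len - 1
y = [row[-1] for row in padded]   # next-word target IDs
```

Padding the full sequence to max_len and then splitting off the last ID is one common way to end up with inputs of length max_len - 1; the notebook may arrange this differently.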

Step 3: Model Architecture

Input (token IDs)
      ↓
Embedding Layer  →  8978 vocab → 50 dimensions
      ↓
LSTM Layer       →  128 units (learns word patterns & context)
      ↓
Dense Layer      →  8978 units + Softmax (probability for each word)
      ↓
Output: Next Word Prediction

| Layer | Details |
| --- | --- |
| Embedding | vocab_size=8978, output_dim=50 |
| LSTM | units=128 |
| Dense (Output) | units=8978, activation=softmax |
| Total Parameters | ~1.69 Million |
| Framework | TensorFlow / Keras |
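
The parameter total can be checked by hand from the layer sizes listed above. The sketch below assumes a standard LSTM layer with four gates and bias terms, and that the model contains exactly these three layers; it is an arithmetic check, not output from the trained model.

```python
vocab_size = 8978
embed_dim = 50
lstm_units = 128

# Embedding: one vector of size embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense output layer: weight matrix plus one bias per vocabulary entry
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Under these assumptions the layers contribute 448,900, 91,648 and 1,158,162 parameters, for a total of 1,698,710, which is close to the ~1.69 Million quoted above. Most of the weights sit in the output layer because it has one unit per vocabulary word.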

Step 4: Training

  • Trained on Databricks (cloud GPU environment)
  • Notebook: word_pred_model.ipynb
  • Loss: categorical_crossentropy
  • Optimizer: adam
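
With a one-hot target, categorical cross-entropy reduces to the negative log of the probability the model assigns to the correct next word. The following sketch computes it by hand for a toy 4-word vocabulary; the logits, the target, and the `eps` clipping value are illustrative assumptions, not values from the notebook.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs, eps=1e-7):
    """Loss for one example: -sum(t_i * log(p_i)), with p clipped away from 0."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

logits = [2.0, 1.0, 0.1, -1.0]  # toy model scores for 4 words
probs = softmax(logits)
target = [0, 1, 0, 0]           # the true next word is index 1
loss = categorical_crossentropy(target, probs)
print(f"p(target) = {probs[1]:.3f}, loss = {loss:.3f}")
```

A model that guessed uniformly over this toy vocabulary would have a loss of log(4); training with adam pushes the loss lower by raising the probability of the observed next word.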

Step 5: Saving Artifacts

After training, three files are saved and used during inference:

| File | What it stores |
| --- | --- |
| quote_lstm_model.h5 | The trained LSTM model weights |
| tokenizer.pkl | Word ↔ Integer mapping (fitted on dataset) |
| max_len.pkl | Maximum sequence length for padding |
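
The two pickle files can be written and read back with the standard library. This is a minimal sketch under assumptions: a plain dict stands in for the fitted tokenizer, the `max_len` value of 12 is hypothetical, and the files go to a temporary directory. Saving the .h5 model itself is done through Keras and is not shown here.

```python
import os
import pickle
import tempfile

# Toy stand-ins (assumptions): the real project pickles the fitted tokenizer
word_index = {"the": 1, "you": 2, "to": 3}
max_len = 12

workdir = tempfile.mkdtemp()
tok_path = os.path.join(workdir, "tokenizer.pkl")
len_path = os.path.join(workdir, "max_len.pkl")

# Training side: write both artifacts
with open(tok_path, "wb") as f:
    pickle.dump(word_index, f)
with open(len_path, "wb") as f:
    pickle.dump(max_len, f)

# Inference side: load them back before predicting
with open(tok_path, "rb") as f:
    loaded_index = pickle.load(f)
with open(len_path, "rb") as f:
    loaded_max_len = pickle.load(f)
```

Loading the same tokenizer and max_len that were used in training matters because the model only understands the integer IDs and input length it was trained with.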

Step 6: Inference (in app.py)

When you type something in the app:

User Input Text
      ↓
Lowercase + Tokenize (using saved tokenizer)
      ↓
Pad sequence to max_len - 1
      ↓
LSTM Model → Output probability distribution (8978 values)
      ↓
Temperature Scaling (controls creativity vs accuracy)
      ↓
Top-K sampling → Return top predicted words
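
The flow above can be sketched as a loop that encodes the text, asks the model for the next word, appends it, and repeats. Everything here is illustrative: `encode`, `generate`, and `toy_predict` are hypothetical names, the toy predictor replaces the real LSTM, and dropping unknown words is an assumption about how the app handles out-of-vocabulary input.

```python
def encode(text, word_index, input_len):
    """Lowercase, map known words to IDs, keep the last input_len IDs, pre-pad with 0."""
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[-input_len:]
    return [0] * (input_len - len(ids)) + ids

def generate(text, n_words, word_index, input_len, predict_next):
    """Append up to n_words predicted words, one at a time."""
    index_word = {i: w for w, i in word_index.items()}
    for _ in range(n_words):
        next_id = predict_next(encode(text, word_index, input_len))
        if next_id not in index_word:  # e.g. the padding ID 0
            break
        text += " " + index_word[next_id]
    return text

# Toy vocabulary and predictor (assumptions, not the trained model)
word_index = {"the": 1, "future": 2, "of": 3, "life": 4}

def toy_predict(padded_ids):
    """Stand-in for the model: picks the next ID from the last input ID."""
    return {3: 4, 4: 1, 1: 2, 2: 3}.get(padded_ids[-1], 0)

result = generate("The future of", 3, word_index, 5, toy_predict)
print(result)
```

In the real app the predictor would run the padded sequence through the LSTM and choose a word from the resulting probability distribution.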

Temperature parameter:

  • Low (0.1) → Very focused, safe predictions
  • High (2.0) → More random, creative predictions
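
The effect of temperature and Top-K can be seen on a toy distribution. The sketch below uses a common formulation (divide log-probabilities by the temperature, renormalise, then sample among the K most likely words); the function names are hypothetical and app.py may implement these steps differently.

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale a distribution: proportional to p_i ** (1 / temperature)."""
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return (index, probability) pairs for the k most likely words, best first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

def sample_top_k(probs, k, rng=random):
    """Sample one index among the k most likely words, weighted by probability."""
    candidates = top_k(probs, k)
    indices = [i for i, _ in candidates]
    weights = [p for _, p in candidates]
    return rng.choices(indices, weights=weights, k=1)[0]

probs = [0.6, 0.25, 0.1, 0.05]         # toy model output over 4 words
sharp = apply_temperature(probs, 0.1)  # almost all mass on the top word
flat = apply_temperature(probs, 2.0)   # mass spread more evenly
choice = sample_top_k(flat, k=2)
```

At temperature 0.1 the top word takes nearly all of the probability, while at 2.0 its share drops below the original 0.6, which is why low values feel safe and high values feel more varied.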

🌟 App Features

| Feature | Description |
| --- | --- |
| Auto Completion Mode | Ghost text suggestion, accept with TAB |
| Auto Suggestion Mode | Multiple sentence completions shown as buttons |
| Top-K Control | Choose how many top words to consider (1–10) |
| Temperature Slider | Control prediction creativity (0.1–2.0) |
| Prediction Length | Control how many words to generate (3–20) |
| Auto Predict Toggle | Enable/disable real-time prediction |
| Show Probabilities | Display confidence % of prediction |
| Word/Char Counter | Live word and character count in editor |
| Model Analytics Dashboard | Shows vocab size, parameters, speed, architecture |
| 100% Local Inference | No external APIs used; everything runs on your machine |

📂 Project Structure

Word Prediction Model/
│
├── app.py                  ← Streamlit web app (UI + prediction logic)
├── word_pred_model.ipynb   ← Model training notebook (run on Databricks)
├── qoute_dataset.csv       ← Dataset: 3038 famous quotes
├── quote_lstm_model.h5     ← Trained LSTM model (saved weights)
├── tokenizer.pkl           ← Fitted tokenizer (word-to-index mapping)
├── max_len.pkl             ← Max sequence length for padding
├── requirements.txt        ← All Python dependencies
├── README.md               ← You are here
├── LICENSE                 ← MIT License
└── .gitignore              ← Git ignore rules

🛠️ Tech Stack

| Category | Technology |
| --- | --- |
| Frontend / UI | Streamlit |
| Backend | Python 3.x |
| Deep Learning | TensorFlow 2.21, Keras |
| Data Processing | Pandas, NumPy |
| Model Training Platform | Databricks |
| Model Storage | HDF5 (.h5), Pickle (.pkl) |

⚙️ Installation & Setup

1. Clone the repository:

git clone https://github.com/YOUR_USERNAME/Next_Word_Prediction_Model.git
cd "Next_Word_Prediction_Model"

2. Create and activate a virtual environment:

# Create
python -m venv venv

# Activate (Windows)
venv\Scripts\activate

# Activate (Linux/Mac)
source venv/bin/activate

3. Install dependencies:

pip install -r requirements.txt

▢️ Run Locally

streamlit run app.py

The app opens at: http://localhost:8501


🧪 How to Use the App

  1. Open the app in your browser
  2. Choose a mode: Auto Completion or Auto Suggestion
  3. Type a few words in the writing editor (e.g. "The future of")
  4. In Auto Completion mode: ghost text appears → press TAB or click "Accept Prediction" to accept
  5. In Auto Suggestion mode: multiple sentence completions appear → click any to insert it
  6. Use the sidebar to adjust the Temperature, Top-K, and Prediction Length settings

📊 Model Details

| Parameter | Value |
| --- | --- |
| Architecture | LSTM Neural Network |
| Dataset | 3,038 famous quotes |
| Vocabulary Size | 8,978 unique words |
| Embedding Dimension | 50 |
| LSTM Units | 128 |
| Total Parameters | ~1.69 Million |
| Max Sequence Length | Stored in max_len.pkl |
| Prediction Speed | ~50 ms per inference |
| Training Platform | Databricks |
| Framework | TensorFlow 2.21 / Keras |

🌐 Deployment Options

| Platform | Notes |
| --- | --- |
| Streamlit Community Cloud | Recommended: free and easy |
| Render | Simple Docker-based deployment |
| Railway | Fast deployment via GitHub |
| Hugging Face Spaces | Good for ML model demos |
| AWS EC2 | Full control, production-grade |

Recommended: Streamlit Community Cloud. Just connect your GitHub repo and deploy in minutes.


📜 License

This project is licensed under the MIT License; see LICENSE for details.


👨‍💻 Author

Sadik Rangrej

Built with ❤️ using TensorFlow, Keras, and Streamlit.

Trained on Databricks · Deployed with Streamlit
