🧠 AI Next Word & Sentence Predictor — LSTM Based NLP Project

A complete end-to-end Natural Language Processing (NLP) project that predicts the next word or completes a sentence using a custom-trained LSTM (Long Short-Term Memory) neural network. Built with TensorFlow, Keras, and Streamlit.

The model was trained on Databricks and deployed as an interactive web app using Streamlit.

📌 What Does This Project Do?

You type a few words → the AI model predicts what word comes next, or completes your entire sentence.

It has two prediction modes:

- Auto Completion Mode: shows a single "ghost text" suggestion (like VS Code autocomplete). Press TAB to accept.
- Auto Suggestion Mode: shows multiple full-sentence completions as clickable buttons.

🧠 How It Works — The Complete ML Pipeline

Step 1: Dataset

Dataset: qoute_dataset.csv, a collection of 3,038 famous quotes with the columns quote and Author.
The quotes come from Albert Einstein, J.K. Rowling, Marilyn Monroe, Jane Austen, and many others.

Step 2: Text Preprocessing (in `word_pred_model.ipynb`)

Raw Quotes → Lowercase → Remove Punctuation → Tokenize → Create N-gram Sequences

- Lowercasing: all quotes are converted to lowercase with str.lower().
- Punctuation removal: a translation table built with str.maketrans strips all punctuation marks.
- Tokenization: the Keras Tokenizer assigns a unique integer to every word.
  - Total unique words (vocabulary size): 8,978
  - Most frequent words: the (1), you (2), to (3), and (4), a (5), ...
- N-gram sequence generation: for each quote, every prefix of the token sequence becomes a training input whose target is the token that follows it:
  - [713] → predict 62
  - [713, 62] → predict 29
  - [713, 62, 29] → predict 19
  - This creates thousands of (input → output) training pairs.
- Padding: all input sequences are pre-padded to length max_len - 1 so that they have a uniform length.
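
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the helper names (clean, fit_word_index, prefix_sequences, pad_pre) and the two sample quotes are hypothetical, and a small frequency-ranked dictionary stands in for the Keras Tokenizer.

```python
import string
from collections import Counter

def clean(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def fit_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1.
    Index 0 is left free for padding."""
    counts = Counter(word for t in texts for word in clean(t).split())
    ranked = sorted(counts, key=lambda w: -counts[w])  # ties keep first-seen order
    return {word: i for i, word in enumerate(ranked, start=1)}

def prefix_sequences(texts, word_index):
    """Collect every prefix of length >= 2 of each tokenized text."""
    sequences = []
    for t in texts:
        ids = [word_index[w] for w in clean(t).split()]
        for end in range(2, len(ids) + 1):
            sequences.append(ids[:end])
    return sequences

def pad_pre(seq, length):
    """Left-pad a sequence with zeros up to `length`."""
    return [0] * (length - len(seq)) + list(seq)

# Hypothetical sample data, not taken from the real dataset
quotes = ["Be yourself; everyone else is taken.", "Be the change."]
word_index = fit_word_index(quotes)
sequences = prefix_sequences(quotes, word_index)
max_len = max(len(s) for s in sequences)
padded = [pad_pre(s, max_len) for s in sequences]

# The last token of each padded row is the target; the rest is the input,
# so every input has length max_len - 1.
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]
```

Each row of X is a zero-padded prefix and the matching entry of y is the word ID that follows it, which mirrors the [713] → 62 pattern listed above.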

Step 3: Model Architecture

Input (token IDs)
      ↓
Embedding Layer  →  8978 vocab → 50 dimensions
      ↓
LSTM Layer       →  128 units (learns word patterns & context)
      ↓
Dense Layer      →  8978 units + Softmax (probability for each word)
      ↓
Output: Next Word Prediction

Layer	Details
Embedding	vocab_size=8978, output_dim=50
LSTM	units=128
Dense (Output)	units=8978, activation=softmax
Total Parameters	~1.69 Million
Framework	TensorFlow / Keras
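
The parameter total can be checked with a little arithmetic. The sketch below assumes a standard LSTM layer with bias terms (four gates, each with input weights, recurrent weights, and a bias) and uses 8,978 as both the embedding input size and the output size, as listed in the table.

```python
vocab_size = 8978
embedding_dim = 50
lstm_units = 128

# Embedding: one 50-dimensional vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per (LSTM unit, word) pair plus one bias per word
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(f"{total_params:,}")  # 1,698,710
```

Under these assumptions the total is 1,698,710, in line with the roughly 1.69 million stated above. Most of the parameters sit in the output layer, because it needs one column of weights per vocabulary word.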

Step 4: Training

Trained on Databricks (cloud GPU environment)
Notebook: word_pred_model.ipynb
Loss: categorical_crossentropy
Optimizer: adam
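
To make the loss concrete: with a one-hot target, categorical cross-entropy reduces to the negative log of the probability the model assigns to the correct next word. The function below is a plain-Python illustration of that formula, not the Keras implementation, and the four-word distribution is made up for the example.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Compute -sum(y_true * log(y_pred)), clipping predictions away from 0."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# Hypothetical 4-word vocabulary; the correct next word is index 2
y_true = [0, 0, 1, 0]
y_pred = [0.1, 0.2, 0.6, 0.1]
loss = categorical_crossentropy(y_true, y_pred)
```

The loss shrinks toward zero as the probability assigned to the correct word approaches 1.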

Step 5: Saving Artifacts

After training, three files are saved and used during inference:

File	What it stores
`quote_lstm_model.h5`	The trained LSTM model weights
`tokenizer.pkl`	Word ↔ Integer mapping (fitted on dataset)
`max_len.pkl`	Maximum sequence length for padding
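
A minimal sketch of the pickle round trip for the two .pkl artifacts, using only the standard library. A plain dictionary stands in for the fitted tokenizer, the max_len value is illustrative, and a temporary directory is used so the example does not touch the real files. The model file itself would be written separately by Keras and is not covered here.

```python
import os
import pickle
import tempfile

# Stand-ins for the real objects (assumptions for illustration only)
word_index = {"the": 1, "you": 2, "to": 3}
max_len = 12

artifact_dir = tempfile.mkdtemp()
tokenizer_path = os.path.join(artifact_dir, "tokenizer.pkl")
max_len_path = os.path.join(artifact_dir, "max_len.pkl")

# Save after training
with open(tokenizer_path, "wb") as f:
    pickle.dump(word_index, f)
with open(max_len_path, "wb") as f:
    pickle.dump(max_len, f)

# Load again at inference time
with open(tokenizer_path, "rb") as f:
    loaded_index = pickle.load(f)
with open(max_len_path, "rb") as f:
    loaded_max_len = pickle.load(f)
```

Loading the same tokenizer and max_len that were used in training matters because the model only understands the word IDs and input length it was trained with.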

Step 6: Inference (in `app.py`)

When you type something in the app:

User Input Text
      ↓
Lowercase + Tokenize (using saved tokenizer)
      ↓
Pad sequence to max_len - 1
      ↓
LSTM Model → Output probability distribution (8978 values)
      ↓
Temperature Scaling (controls creativity vs accuracy)
      ↓
Top-K sampling → Return top predicted words

Temperature parameter:

Low (0.1) → Very focused, safe predictions
High (2.0) → More random, creative predictions
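
The effect of temperature can be illustrated in plain Python. The sketch assumes one common formulation (divide the log-probabilities by the temperature, then renormalize with a softmax); app.py may apply it differently. The function names and the four-value distribution are hypothetical.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution to p_i ** (1 / T), renormalized."""
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return the k most likely (index, probability) pairs, highest first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

probs = [0.5, 0.3, 0.15, 0.05]           # made-up model output
sharp = apply_temperature(probs, 0.1)    # almost all mass on the top word
flat = apply_temperature(probs, 2.0)     # mass spread more evenly
candidates = top_k(flat, 2)              # the two most likely word indices
```

A low temperature pushes nearly all probability onto the single most likely word, while a high temperature flattens the distribution so that less likely words are chosen more often.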

🌟 App Features

Feature	Description
Auto Completion Mode	Ghost text suggestion, accept with TAB
Auto Suggestion Mode	Multiple sentence completions shown as buttons
Top-K Control	Choose how many top words to consider (1–10)
Temperature Slider	Control prediction creativity (0.1–2.0)
Prediction Length	Control how many words to generate (3–20)
Auto Predict Toggle	Enable/disable real-time prediction
Show Probabilities	Display confidence % of prediction
Word/Char Counter	Live word and character count in editor
Model Analytics Dashboard	Shows vocab size, parameters, speed, architecture
100% Local Inference	No external APIs used — all runs on your machine
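
The Prediction Length setting suggests that sentence completion repeats next-word prediction several times, feeding each predicted word back in as context. A minimal sketch of such a loop follows, assuming that design. The complete function and the toy lookup table that stands in for the LSTM are hypothetical.

```python
def complete(text, predict_next, num_words):
    """Append up to `num_words` predicted words to `text`, one at a time."""
    words = text.lower().split()
    for _ in range(num_words):
        next_word = predict_next(words)
        if next_word is None:  # the predictor has no suggestion
            break
        words.append(next_word)
    return " ".join(words)

# Toy stand-in for the model: a fixed lookup keyed by the last word
toy_table = {"the": "future", "future": "of", "of": "ai"}

def toy_predict(words):
    return toy_table.get(words[-1]) if words else None

result = complete("The future", toy_predict, 3)
```

In the real app, predict_next would tokenize and pad the current words, run the model, and pick a word from the temperature-scaled top-K candidates.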

📂 Project Structure

Word Prediction Model/
│
├── app.py                  ← Streamlit web app (UI + prediction logic)
├── word_pred_model.ipynb   ← Model training notebook (run on Databricks)
├── qoute_dataset.csv       ← Dataset: 3038 famous quotes
├── quote_lstm_model.h5     ← Trained LSTM model (saved weights)
├── tokenizer.pkl           ← Fitted tokenizer (word-to-index mapping)
├── max_len.pkl             ← Max sequence length for padding
├── requirements.txt        ← All Python dependencies
├── README.md               ← You are here
├── LICENSE                 ← MIT License
└── .gitignore              ← Git ignore rules

🛠️ Tech Stack

Category	Technology
Frontend / UI	Streamlit
Backend	Python 3.x
Deep Learning	TensorFlow 2.21, Keras
Data Processing	Pandas, NumPy
Model Training Platform	Databricks
Model Storage	HDF5 (.h5), Pickle (.pkl)

⚙️ Installation & Setup

1. Clone the repository:

git clone https://github.com/YOUR_USERNAME/Next_Word_Prediction_Model.git
cd "Next_Word_Prediction_Model"

2. Create and activate virtual environment:

# Create
python -m venv venv

# Activate (Windows)
venv\Scripts\activate

# Activate (Linux/Mac)
source venv/bin/activate

3. Install dependencies:

pip install -r requirements.txt

▶️ Run Locally

streamlit run app.py

App will open at: http://localhost:8501

🧪 How to Use the App

1. Open the app in your browser.
2. Choose a mode: Auto Completion or Auto Suggestion.
3. Type a few words in the writing editor (e.g. "The future of").
4. In Auto Completion mode, ghost text appears; press TAB or click "Accept Prediction" to accept it.
5. In Auto Suggestion mode, multiple sentence completions appear; click any of them to insert it.
6. Use the sidebar to adjust the Temperature, Top-K, and Prediction Length settings.

📊 Model Details

Parameter	Value
Architecture	LSTM Neural Network
Dataset	3,038 inspirational quotes
Vocabulary Size	8,978 unique words
Embedding Dimension	50
LSTM Units	128
Total Parameters	~1.69 Million
Max Sequence Length	Stored in `max_len.pkl`
Prediction Speed	~50ms per inference
Training Platform	Databricks
Framework	TensorFlow 2.21 / Keras

🌐 Deployment Options

Platform	Notes
Streamlit Community Cloud	Recommended — free and easy
Render	Simple Docker-based deployment
Railway	Fast deployment via GitHub
Hugging Face Spaces	Good for ML model demos
AWS EC2	Full control, production-grade

Recommended: Streamlit Community Cloud — just connect your GitHub repo and deploy in minutes.

📜 License

This project is licensed under the MIT License — see LICENSE for details.

👨‍💻 Author

Sadik Rangrej

Built with ❤️ using TensorFlow, Keras, and Streamlit.

Trained on Databricks · Deployed with Streamlit
