A complete end-to-end Natural Language Processing (NLP) project that predicts the next word or completes a sentence using a custom-trained LSTM (Long Short-Term Memory) neural network. Built with TensorFlow, Keras, and Streamlit.
The model was trained on Databricks and deployed as an interactive web app using Streamlit.
You type a few words, and the model predicts which word comes next or completes your entire sentence.
It has two prediction modes:
- Auto Completion Mode: shows a single "ghost text" suggestion (like VS Code autocomplete). Press `TAB` to accept.
- Auto Suggestion Mode: shows multiple full-sentence completions as clickable buttons.
- Dataset: `qoute_dataset.csv`, a collection of 3,038 famous quotes with columns `quote` and `Author`.
- Sample quotes are from Albert Einstein, J.K. Rowling, Marilyn Monroe, Jane Austen, and many more.
Raw Quotes → Lowercase → Remove Punctuation → Tokenize → Create N-gram Sequences
- Lowercasing: all quotes are converted to lowercase using `str.lower()`
- Punctuation Removal: a `str.maketrans` translation table removes all punctuation marks
- Tokenization: Keras `Tokenizer` assigns a unique integer to every word
  - Total unique words (vocabulary size): 8,978
  - Most frequent words: `the` (1), `you` (2), `to` (3), `and` (4), `a` (5)...
- N-gram Sequence Generation: for each quote, every possible prefix subsequence is created:
  - `[713]` → predict `62`
  - `[713, 62]` → predict `29`
  - `[713, 62, 29]` → predict `19`
  - This creates thousands of (input → output) training pairs
- Padding: all input sequences are padded to `max_len - 1` using `pre` padding so they have a uniform length
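The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the helper names (`clean`, `build_vocab`, `prefix_pairs`, `pad_pre`) are hypothetical, and the hand-rolled vocabulary and padding only approximate what Keras `Tokenizer` and `pad_sequences(padding="pre")` do.

```python
import string


def clean(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocab(quotes):
    """Map each word to an integer ID, most frequent word first (IDs start at 1)."""
    counts = {}
    for quote in quotes:
        for word in clean(quote).split():
            counts[word] = counts.get(word, 0) + 1
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: idx for idx, word in enumerate(ranked, start=1)}


def prefix_pairs(token_ids):
    """Return (prefix, next_token) for every proper prefix of the sequence."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


def pad_pre(seq, length, pad_value=0):
    """Left-pad with pad_value; if too long, keep only the last `length` tokens."""
    if len(seq) > length:
        seq = seq[len(seq) - length:]
    return [pad_value] * (length - len(seq)) + list(seq)


print(prefix_pairs([713, 62, 29, 19]))
# [([713], 62), ([713, 62], 29), ([713, 62, 29], 19)]
print(pad_pre([713, 62], 5))  # [0, 0, 0, 713, 62]
```

ID 0 is reserved for padding here, which is why the vocabulary IDs start at 1.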
Input (token IDs)
↓
Embedding Layer: 8978 vocab → 50 dimensions
↓
LSTM Layer: 128 units (learns word patterns & context)
↓
Dense Layer: 8978 units + Softmax (probability for each word)
↓
Output: Next Word Prediction
| Layer | Details |
|---|---|
| Embedding | vocab_size=8978, output_dim=50 |
| LSTM | units=128 |
| Dense (Output) | units=8978, activation=softmax |
| Total Parameters | ~1.69 Million |
| Framework | TensorFlow / Keras |
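The parameter total in the table can be checked by hand. The sketch below assumes a single standard LSTM layer with bias terms and a Dense output layer with bias, which is what the table suggests but the source does not spell out. The function names are illustrative.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


vocab_size, embed_dim, lstm_units = 8978, 50, 128
total = (
    embedding_params(vocab_size, embed_dim)   # 448,900
    + lstm_params(embed_dim, lstm_units)      # 91,648
    + dense_params(lstm_units, vocab_size)    # 1,158,162
)
print(total)  # 1698710
```

Under these assumptions the total is 1,698,710, in line with the ~1.69 million quoted above. Most of the parameters sit in the output layer, because it needs one column of weights per vocabulary word.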
- Trained on Databricks (cloud GPU environment)
- Notebook: `word_pred_model.ipynb`
- Loss: `categorical_crossentropy`
- Optimizer: `adam`
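To make the loss concrete: `categorical_crossentropy` expects one-hot targets, so for a single training pair it reduces to the negative log of the probability the model assigned to the correct next word. The following is a plain-Python sketch with made-up probabilities over a 4-word toy vocabulary, not output from the real model.

```python
import math


def one_hot(index, size):
    """Return a one-hot vector with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions away from 0 so log() stays finite.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = one_hot(2, 4)                  # the correct next word has ID 2
confident = [0.05, 0.05, 0.85, 0.05]    # model strongly favors ID 2
unsure = [0.25, 0.25, 0.25, 0.25]       # model has no preference

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident < loss_unsure)  # True
```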
After training, three files are saved and used during inference:
| File | What it stores |
|---|---|
| `quote_lstm_model.h5` | The trained LSTM model weights |
| `tokenizer.pkl` | Word → integer mapping (fitted on dataset) |
| `max_len.pkl` | Maximum sequence length for padding |
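The two `.pkl` files are ordinary pickle artifacts. The round trip below is a sketch of how such files are typically written and read back. It uses a plain dict as a stand-in for the fitted tokenizer and an illustrative `max_len` value, since the real values are not given here. Loading the `.h5` model would presumably go through `tensorflow.keras.models.load_model`, which is not shown.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the fitted tokenizer mapping and sequence length.
word_index = {"the": 1, "you": 2, "to": 3, "and": 4, "a": 5}
max_len = 12  # illustrative value only; the real one lives in max_len.pkl


def save_pickle(obj, path):
    """Serialize obj to path in binary mode."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Deserialize and return the object stored at path."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    tok_path = os.path.join(tmp, "tokenizer.pkl")
    len_path = os.path.join(tmp, "max_len.pkl")
    save_pickle(word_index, tok_path)
    save_pickle(max_len, len_path)
    restored_index = load_pickle(tok_path)
    restored_len = load_pickle(len_path)
```

Only unpickle files you created yourself, because loading a pickle can execute arbitrary code.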
When you type something in the app:
User Input Text
↓
Lowercase + Tokenize (using saved tokenizer)
↓
Pad sequence to max_len - 1
↓
LSTM Model → Output probability distribution (8978 values)
↓
Temperature Scaling (controls creativity vs accuracy)
↓
Top-K sampling → Return top predicted words
Temperature parameter:
- Low (0.1): very focused, safe predictions
- High (2.0): more random, creative predictions
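Temperature scaling and Top-K sampling can be sketched as follows. This shows one common formulation: divide the log-probabilities by the temperature, renormalize, then sample among the K most likely words. The app's exact implementation is not shown in this README, so the function names and details are assumptions.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Sharpen (T < 1) or flatten (T > 1) a probability distribution."""
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (index, renormalized probability) for the k most likely entries."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in ranked)
    return [(i, probs[i] / mass) for i in ranked]


def sample_top_k(probs, k, temperature, rng=random):
    """Apply temperature, keep the top k candidates, and sample one index."""
    candidates = top_k(apply_temperature(probs, temperature), k)
    indices = [i for i, _ in candidates]
    weights = [p for _, p in candidates]
    return rng.choices(indices, weights=weights, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]        # toy distribution over 4 word IDs
cold = apply_temperature(probs, 0.1)  # almost all mass on ID 0
hot = apply_temperature(probs, 2.0)   # mass spread more evenly
```

With a temperature of 0.1 the most likely word receives more than 99% of the probability mass in this toy example, while at 2.0 its share drops below the original 50%.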
| Feature | Description |
|---|---|
| Auto Completion Mode | Ghost text suggestion, accept with TAB |
| Auto Suggestion Mode | Multiple sentence completions shown as buttons |
| Top-K Control | Choose how many top words to consider (1–10) |
| Temperature Slider | Control prediction creativity (0.1–2.0) |
| Prediction Length | Control how many words to generate (3–20) |
| Auto Predict Toggle | Enable/disable real-time prediction |
| Show Probabilities | Display confidence % of prediction |
| Word/Char Counter | Live word and character count in editor |
| Model Analytics Dashboard | Shows vocab size, parameters, speed, architecture |
| 100% Local Inference | No external APIs used; everything runs on your machine |
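Generating several words, as the Prediction Length setting implies, amounts to feeding each predicted word back into the input. The loop below is a sketch of that idea under simplifying assumptions. `toy_predict` is a hypothetical stand-in for the trained model, the vocabulary is made up, punctuation stripping is omitted, and greedy argmax replaces the temperature and Top-K step so the result is deterministic.

```python
def encode(text, word_index, length):
    """Lowercase, map known words to IDs, and left-pad with zeros to `length`."""
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[max(0, len(ids) - length):]
    return [0] * (length - len(ids)) + ids


def complete(text, predict_fn, word_index, length, n_words):
    """Append up to n_words predicted words to text, one at a time."""
    index_word = {i: w for w, i in word_index.items()}
    for _ in range(n_words):
        probs = predict_fn(encode(text, word_index, length))
        best = max(range(len(probs)), key=probs.__getitem__)
        word = index_word.get(best)
        if word is None:  # index 0 is padding and maps to no word
            break
        text = f"{text} {word}"
    return text


# Hypothetical toy vocabulary and model, for illustration only.
word_index = {"the": 1, "future": 2, "of": 3, "work": 4}


def toy_predict(padded_ids):
    """Stand-in for the model: puts all probability on the ID after the last token."""
    probs = [0.0] * (len(word_index) + 1)
    probs[padded_ids[-1] % len(word_index) + 1] = 1.0
    return probs


print(complete("The future of", toy_predict, word_index, length=5, n_words=1))
# The future of work
```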
Word Prediction Model/
│
├── app.py                  ← Streamlit web app (UI + prediction logic)
├── word_pred_model.ipynb   ← Model training notebook (run on Databricks)
├── qoute_dataset.csv       ← Dataset: 3,038 famous quotes
├── quote_lstm_model.h5     ← Trained LSTM model (saved weights)
├── tokenizer.pkl           ← Fitted tokenizer (word-to-index mapping)
├── max_len.pkl             ← Max sequence length for padding
├── requirements.txt        ← All Python dependencies
├── README.md               ← You are here
├── LICENSE                 ← MIT License
└── .gitignore              ← Git ignore rules
| Category | Technology |
|---|---|
| Frontend / UI | Streamlit |
| Backend | Python 3.x |
| Deep Learning | TensorFlow 2.21, Keras |
| Data Processing | Pandas, NumPy |
| Model Training Platform | Databricks |
| Model Storage | HDF5 (.h5), Pickle (.pkl) |
1. Clone the repository:
git clone https://github.com/YOUR_USERNAME/Next_Word_Prediction_Model.git
cd "Next_Word_Prediction_Model"

2. Create and activate a virtual environment:
# Create
python -m venv venv
# Activate (Windows)
venv\Scripts\activate
# Activate (Linux/Mac)
source venv/bin/activate

3. Install dependencies:
pip install -r requirements.txt

4. Run the app:
streamlit run app.py

The app will open at: http://localhost:8501
- Open the app in your browser
- Choose a mode: Auto Completion or Auto Suggestion
- Type a few words in the writing editor (e.g. `"The future of"`)
- In Auto Completion mode: ghost text appears; press `TAB` or click "Accept Prediction" to accept
- In Auto Suggestion mode: multiple sentence completions appear; click any to insert it
- Use the sidebar to adjust the Temperature, Top-K, and Prediction Length settings
| Parameter | Value |
|---|---|
| Architecture | LSTM Neural Network |
| Dataset | 3,038 inspirational quotes |
| Vocabulary Size | 8,978 unique words |
| Embedding Dimension | 50 |
| LSTM Units | 128 |
| Total Parameters | ~1.69 Million |
| Max Sequence Length | Stored in max_len.pkl |
| Prediction Speed | ~50 ms per inference |
| Training Platform | Databricks |
| Framework | TensorFlow 2.21 / Keras |
| Platform | Notes |
|---|---|
| Streamlit Community Cloud | Recommended: free and easy |
| Render | Simple Docker-based deployment |
| Railway | Fast deployment via GitHub |
| Hugging Face Spaces | Good for ML model demos |
| AWS EC2 | Full control, production-grade |
Recommended: Streamlit Community Cloud. Connect your GitHub repo and deploy in minutes.
This project is licensed under the MIT License; see LICENSE for details.
Sadik Rangrej
Built with ❤️ using TensorFlow, Keras, and Streamlit.
Trained on Databricks · Deployed with Streamlit