Website Clickstream Trend Analysis 📊

A complete big data pipeline demonstrating ETL (Extract, Transform, Load) and analytics using Apache Flume, Pig, and Hive. It processes website clickstream data to identify user behavior trends and patterns, similar to how Amazon, Netflix, and Facebook analyze user interactions.

MIT License · Python 3.7+ · Docker


🎯 What This Project Does

This is a production-inspired big data pipeline that:

  1. Ingests clickstream data in real time (Apache Flume)
  2. Cleans messy raw logs, removing errors and static assets (Apache Pig)
  3. Analyzes clean data using SQL queries to find trends (Apache Hive)

Key Achievement: Processes 100 raw logs → filters to 89 quality records → generates 8 analytics insights

Raw Logs (100) ──Flume──> HDFS Raw ──Pig──> Cleaned Data (89) ──Hive──> Analytics Results
  8,012 bytes                     MapReduce    5,809 bytes                 8 Queries

🚀 Quick Start (5 Minutes with Docker)

Prerequisites

  • Docker installed (silicoflare/hadoop:amd image pre-pulled)
  • 4GB+ RAM available
  • Linux/Mac terminal

Step 1: Start Docker Container

# Clone this repo
git clone <your-repo-url>
cd Clickstream_analysis

# Start Docker container with all services
sudo docker run -d --name clickstream \
  -p 9870:9870 \
  -p 8088:8088 \
  -p 9864:9864 \
  -v "$(pwd):/clickstream" \
  --entrypoint /bin/bash \
  silicoflare/hadoop:amd \
  -c "sleep infinity"

# Enter container
sudo docker exec -it clickstream /bin/bash

Step 2: Start Hadoop Services

# Inside the container, start services
/usr/local/hadoop/bin/hdfs namenode -format -force
/usr/local/hadoop/bin/hdfs namenode &
/usr/local/hadoop/bin/hdfs datanode &
/usr/local/hadoop/bin/yarn resourcemanager &
/usr/local/hadoop/bin/yarn nodemanager &

# Verify services running
jps
# Should show: NameNode, DataNode, ResourceManager, NodeManager
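
The `jps` check can also be scripted. The following is a minimal sketch, assuming `jps` prints one `<pid> <ClassName>` pair per line; the helper name `missing_daemons` and the PIDs in the sample text are illustrative and not part of this repository.

```python
# Daemons that Step 2 expects to see in the `jps` listing.
EXPECTED = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}

def missing_daemons(jps_output: str) -> set:
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # "<pid> <ClassName>"
            running.add(parts[1])
    return EXPECTED - running

# Illustrative output in which the two YARN daemons failed to start.
sample = """4321 NameNode
4410 DataNode
4582 Jps
"""
print(sorted(missing_daemons(sample)))  # ['NodeManager', 'ResourceManager']
```

Inside the container, the live listing could be passed in instead, for example via `subprocess.run(["jps"], capture_output=True, text=True).stdout`.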

Step 3: Setup Pipeline

# Create HDFS directories
hdfs dfs -mkdir -p /user/root/clickstream/{raw,processed}

# Generate sample logs (100 entries)
python3 << 'EOF'
import random
from datetime import datetime, timedelta

pages = ['/index.html', '/products/laptop', '/products/phone', '/cart', '/checkout']
ips = ['192.168.1.100', '192.168.1.101', '192.168.1.102', '10.0.0.1']

with open('/clickstream/logs/access.log', 'w') as f:
    current_time = datetime(2026, 4, 5, 16, 31, 52)
    for i in range(100):
        ip = random.choice(ips)
        page = random.choice(pages)
        status = random.choices([200, 404], weights=[90, 10])[0]
        size = random.randint(1000, 10000)
        timestamp = current_time.strftime('%d/%b/%Y:%H:%M:%S +0000')
        log = f'{ip} - - [{timestamp}] "GET {page} HTTP/1.1" {status} {size}\n'
        f.write(log)
        current_time += timedelta(seconds=random.randint(1, 10))

print("Generated 100 sample logs")
EOF

# Upload to HDFS
hdfs dfs -put /clickstream/logs/access.log /user/root/clickstream/raw/
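
Before running the cleaning phase, it can help to know how many records should survive it. The sketch below counts HTTP status codes in lines shaped like the ones the generator writes; `status_counts` is a hypothetical helper and the three sample lines are made up for illustration.

```python
from collections import Counter

def status_counts(lines):
    """Count HTTP status codes in Common Log Format lines."""
    counts = Counter()
    for line in lines:
        # Everything after the request's closing quote is "<status> <size>".
        tail = line.rsplit('"', 1)[-1].split()
        if len(tail) == 2 and tail[0].isdigit():
            counts[int(tail[0])] += 1
    return counts

sample = [
    '192.168.1.100 - - [05/Apr/2026:16:31:52 +0000] "GET /cart HTTP/1.1" 200 5234',
    '10.0.0.1 - - [05/Apr/2026:16:31:58 +0000] "GET /checkout HTTP/1.1" 404 1200',
    '192.168.1.101 - - [05/Apr/2026:16:32:03 +0000] "GET /index.html HTTP/1.1" 200 2048',
]
print(dict(status_counts(sample)))  # {200: 2, 404: 1}
```

Run against `open('/clickstream/logs/access.log')`, the count for status 200 is the number of records the cleaning step should keep, since the generator emits only 200 and 404 responses and no static assets.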

Step 4: Run Data Pipeline

Phase 1 & 2: Ingestion & Cleaning (Pig)

# Delete old output
hdfs dfs -rm -r /user/root/clickstream/processed

# Run Pig script (ETL/cleaning)
pig -x local /clickstream/phase2_cleaning/clean_logs.pig
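# Result in the sample run: 89 clean records (11 404 responses filtered)
# Exact counts vary between runs because the log generator is random and unseeded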

Phase 3: Start Hive MetaStore

# Start MetaStore service
nohup hive --service metastore > /tmp/metastore.log 2>&1 &

# Wait 5 seconds for startup
sleep 5
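
A fixed `sleep 5` may be too short on a slow machine. As an alternative, here is a minimal sketch that polls a TCP port until it accepts connections, assuming the MetaStore listens on port 9083 as in the thrift URI used in this README. `wait_for_port` is a hypothetical helper, and the demo uses a throwaway local listener so that it runs without Hive.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not listening yet; retry shortly
    return False

# Demo against a temporary local listener (port 0 lets the OS pick a free port).
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
print(wait_for_port("127.0.0.1", server.getsockname()[1], timeout=5))  # True
server.close()
```

For the pipeline itself, the call would be `wait_for_port("localhost", 9083)` before running the Hive scripts.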

Create Table & Run Analytics

# Run Hive queries
hive -hiveconf hive.metastore.uris=thrift://localhost:9083 \
     -f /clickstream/phase3_analysis/create_table.hql

hive -hiveconf hive.metastore.uris=thrift://localhost:9083 \
     -f /clickstream/phase3_analysis/trend_queries.hql

📁 Project Structure

clickstream-analysis/
│
├── README.md                          ← You are here
├── DOCKER_QUICKSTART.md               ← Detailed Docker setup
├── ARCHITECTURE.md                    ← System design & diagrams
├── INTERVIEW_GUIDE.md                 ← Interview preparation
│
├── phase1_ingestion/
│   └── flume-conf.properties          ← Real-time log ingestion config
│
├── phase2_cleaning/
│   └── clean_logs.pig                 ← ETL/data cleaning script
│
├── phase3_analysis/
│   ├── create_table.hql               ← Hive table creation
│   └── trend_queries.hql              ← 8 analytics queries
│
├── docs/
│   ├── README.md                      ← Additional documentation
│   └── DOCKER_SETUP.md                ← Detailed Docker reference
│
└── logs/                              ← Local log directory (empty template)

🔄 Data Pipeline Explanation

Phase 1: Ingestion (Apache Flume) 📥

  • Purpose: Collect logs in real time from web servers
  • Source: File directory monitoring (spooling directory)
  • Destination: HDFS distributed storage
  • Config: phase1_ingestion/flume-conf.properties
  • Why: Simulates enterprise log aggregation (Netflix, Amazon scale)
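
To make the spooling-directory idea concrete, the sketch below imitates it in plain Python: files dropped into a watched directory are read line by line, handed to a sink, and then renamed with a `.COMPLETED` suffix (the default suffix Flume's spooling directory source applies) so they are not ingested twice. This is only an illustration of the pattern, not Flume's implementation and not the contents of flume-conf.properties; `drain_spool` is a hypothetical name and a Python list stands in for the HDFS sink.

```python
import tempfile
from pathlib import Path

def drain_spool(spool_dir: Path, sink: list, suffix: str = ".COMPLETED") -> int:
    """Append the lines of every unprocessed file to sink, then mark the file done.

    Returns the number of files processed in this pass.
    """
    processed = 0
    for path in sorted(spool_dir.iterdir()):
        if not path.is_file() or path.name.endswith(suffix):
            continue  # skip subdirectories and files already ingested
        with path.open() as f:
            sink.extend(line.rstrip("\n") for line in f)
        path.rename(path.with_name(path.name + suffix))
        processed += 1
    return processed

with tempfile.TemporaryDirectory() as tmp:
    spool = Path(tmp)
    (spool / "access.log").write_text("line one\nline two\n")
    events = []
    print(drain_spool(spool, events), events)       # 1 ['line one', 'line two']
    print(drain_spool(spool, events))               # 0 (nothing new to ingest)
    print(sorted(p.name for p in spool.iterdir()))  # ['access.log.COMPLETED']
```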

Phase 2: Cleaning & Transformation (Apache Pig) 🧹

  • Purpose: Extract, transform, load (ETL) - remove bad data
  • Input: 100 raw logs (8,012 bytes)
  • Processing:
    • Parse Apache Common Log Format using regex
    • Remove HTTP 404/500 errors
    • Remove static assets (.jpg, .css, .js files)
    • Extract: IP, timestamp, URL
  • Output: 89 clean records (5,809 bytes, 11% filtered)
  • Script: phase2_cleaning/clean_logs.pig
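
The processing steps listed above can be sketched in Python for readers who want to follow the logic without reading Pig Latin. This is an illustrative equivalent, not a translation of clean_logs.pig: the regular expression, the `clean` function, and the sample lines are assumptions, while the filter rules (drop 404/500, drop .jpg/.css/.js, keep IP, timestamp, URL) come from the list above.

```python
import re

# Common Log Format: host ident authuser [time] "request" status size
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)$')
ERROR_STATUSES = {404, 500}
STATIC_EXT = (".jpg", ".css", ".js")

def clean(lines):
    """Yield 'ip,timestamp,url' rows for successful, non-static requests."""
    for line in lines:
        m = LOG_RE.match(line.strip())
        if not m:
            continue  # line does not parse as Common Log Format
        ip, ts, request, status, _size = m.groups()
        if int(status) in ERROR_STATUSES:
            continue  # HTTP error response
        parts = request.split()
        path = parts[1] if len(parts) >= 2 else ""
        if path.lower().endswith(STATIC_EXT):
            continue  # static asset
        yield f"{ip},{ts},{request}"

raw = [
    '192.168.1.100 - - [05/Apr/2026:16:31:52 +0000] "GET /products/laptop HTTP/1.1" 200 5234',
    '192.168.1.101 - - [05/Apr/2026:16:31:55 +0000] "GET /cart HTTP/1.1" 404 1200',
    '10.0.0.1 - - [05/Apr/2026:16:31:59 +0000] "GET /static/site.css HTTP/1.1" 200 900',
]
for row in clean(raw):
    print(row)  # 192.168.1.100,05/Apr/2026:16:31:52 +0000,GET /products/laptop HTTP/1.1
```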

Phase 3: Analytics (Apache Hive) 📊

  • Purpose: SQL-based analysis of clickstream trends
  • Language: HiveQL (SQL-like interface to MapReduce)
  • Queries: 8 pre-built analysis questions
  • Results: Business insights on user behavior
  • Scripts:
    • phase3_analysis/create_table.hql - Schema definition
    • phase3_analysis/trend_queries.hql - Analytics queries
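
For intuition about what the HiveQL queries compute, here is a small Python sketch of two of them over cleaned `ip,timestamp,url` rows. The SQL in the docstrings is an assumption about the query shape, not the text of trend_queries.hql, and the function names and sample rows are illustrative.

```python
import csv
from collections import Counter

def top_pages(rows, n=5):
    """Roughly: SELECT url, COUNT(*) AS clicks ... GROUP BY url ORDER BY clicks DESC LIMIT n"""
    return Counter(url for _ip, _ts, url in rows).most_common(n)

def unique_visitors(rows):
    """Roughly: SELECT COUNT(DISTINCT ip) ..."""
    return len({ip for ip, _ts, _url in rows})

cleaned = """\
192.168.1.100,05/Apr/2026:16:31:52 +0000,GET /products/laptop HTTP/1.1
192.168.1.101,05/Apr/2026:16:31:57 +0000,GET /cart HTTP/1.1
192.168.1.100,05/Apr/2026:16:32:04 +0000,GET /cart HTTP/1.1
"""
rows = list(csv.reader(cleaned.splitlines()))
print(top_pages(rows, 2))     # [('GET /cart HTTP/1.1', 2), ('GET /products/laptop HTTP/1.1', 1)]
print(unique_visitors(rows))  # 2
```

Hive runs the same grouping as distributed jobs, which is what lets the approach carry over to data that does not fit in one process.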

📊 Sample Results

Sample output from five of the 8 analytics queries:

Query 1: Top 5 Pages by Click Count
┌───────────────────────────────┬──────────┐
│ url                           │ clicks   │
├───────────────────────────────┼──────────┤
│ GET /checkout HTTP/1.1        │ 20       │
│ GET /products/laptop HTTP/1.1 │ 19       │
│ GET /cart HTTP/1.1            │ 18       │
│ GET /products/phone HTTP/1.1  │ 17       │
│ GET /index.html HTTP/1.1      │ 15       │
└───────────────────────────────┴──────────┘

Query 2: Daily Traffic Trends
Date: 05/Apr/2026 → Total Clicks: 89

Query 3: Unique Visitors
Total Unique IPs: 4

Query 4: Traffic by IP (Top Visitors)
192.168.1.101 → 30 visits
192.168.1.100 → 27 visits
192.168.1.102 → 16 visits
10.0.0.1 → 16 visits

Query 5: URL Pattern Categories
Product Pages → 36 clicks (40%)
Checkout → 20 clicks (22%)
Shopping Cart → 18 clicks (20%)
Other → 15 clicks (17%)
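
The Query 5 figures can be cross-checked against the Query 1 counts. The sketch below groups the five page counts into categories and recomputes the percentages; the `category` rule is a guess at the grouping that reproduces the numbers above, since the actual rule lives in trend_queries.hql.

```python
# Click counts copied from the Query 1 sample output.
clicks = {
    "/checkout": 20,
    "/products/laptop": 19,
    "/cart": 18,
    "/products/phone": 17,
    "/index.html": 15,
}

def category(path):
    """Hypothetical URL-to-category rule (an assumption, not taken from the HQL)."""
    if path.startswith("/products/"):
        return "Product Pages"
    if path == "/checkout":
        return "Checkout"
    if path == "/cart":
        return "Shopping Cart"
    return "Other"

totals = {}
for path, n in clicks.items():
    totals[category(path)] = totals.get(category(path), 0) + n

grand_total = sum(totals.values())  # 89, matching the cleaned record count
for name, n in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {n} clicks ({round(100 * n / grand_total)}%)")
# Product Pages: 36 clicks (40%)
# Checkout: 20 clicks (22%)
# Shopping Cart: 18 clicks (20%)
# Other: 15 clicks (17%)
```

The rounded percentages sum to 99 rather than 100, which is a rounding effect and not a missing category.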

💼 Technologies Used

| Layer      | Technology   | Version      | Purpose                 |
|------------|--------------|--------------|-------------------------|
| Ingestion  | Apache Flume | 1.x          | Real-time log streaming |
| Storage    | Hadoop HDFS  | 3.3.6        | Distributed file system |
| Processing | Apache Pig   | 0.17.0       | Data transformation/ETL |
| Analytics  | Apache Hive  | 3.1.3        | SQL interface to data   |
| Compute    | MapReduce    | Hadoop 3.3.6 | Distributed computation |
| Deployment | Docker       | Latest       | Containerization        |
| OS         | Linux        | Ubuntu 20.04 | Container base          |

🏗️ Architecture Overview

┌───────────────────────────────────────────────────────────────┐
│                 CLICKSTREAM DATA PIPELINE                     │
└───────────────────────────────────────────────────────────────┘

         Web Server Logs
               ↓
    ┌─────────────────────┐
    │  PHASE 1: INGESTION │
    │  Apache Flume       │
    │  (Real-time stream) │
    └──────────┬──────────┘
               ↓
    ┌─────────────────────┐
    │   HDFS Storage      │
    │   /raw/access.log   │
    │   (100 records)     │
    └──────────┬──────────┘
               ↓
    ┌─────────────────────┐
    │ PHASE 2: CLEANING   │
    │ Apache Pig (ETL)    │
    │ MapReduce jobs      │
    └──────────┬──────────┘
               ↓
    ┌─────────────────────┐
    │   HDFS Storage      │
    │   /processed/ (CSV) │
    │   (89 records)      │
    └──────────┬──────────┘
               ↓
    ┌─────────────────────┐
    │ PHASE 3: ANALYTICS  │
    │ Apache Hive (SQL)   │
    │ 8 pre-built queries │
    └──────────┬──────────┘
               ↓
    ┌─────────────────────┐
    │ Business Insights   │
    │ • Top pages         │
    │ • Traffic trends    │
    │ • Unique visitors   │
    │ • Bot detection     │
    └─────────────────────┘

🎓 Learning Outcomes

Working through this project demonstrates:

✅ Big Data Engineering

  • Real-time data ingestion architecture
  • Distributed ETL processing
  • Data quality and validation

✅ Data Processing

  • Log parsing and regex patterns
  • Complex transformations
  • Scalable design (demonstrated here on 100 records; Pig and Hive distribute the same work across a Hadoop cluster for larger volumes)

✅ Data Analytics

  • SQL-based analysis
  • Answering business questions
  • Trend identification

✅ DevOps Skills

  • Docker containerization
  • Distributed systems
  • Service configuration and debugging

✅ Problem Solving

  • Debugging distributed systems
  • Handling real-world messy data
  • Performance optimization

📚 Additional Documentation


🔧 Troubleshooting

Services Won't Start

# Check what services are running
jps

# If any missing, start manually
/usr/local/hadoop/bin/hdfs namenode &
/usr/local/hadoop/bin/hdfs datanode &
/usr/local/hadoop/bin/yarn resourcemanager &
/usr/local/hadoop/bin/yarn nodemanager &

Hive Connection Fails

# Start MetaStore service
nohup hive --service metastore > /tmp/metastore.log 2>&1 &

# Use explicit MetaStore URI
hive -hiveconf hive.metastore.uris=thrift://localhost:9083

Pig Output Directory Error

# Pig won't overwrite - delete first
hdfs dfs -rm -r /user/root/clickstream/processed
# Then re-run Pig script

View Service Logs

# NameNode logs
tail -f /usr/local/hadoop/logs/hadoop-root-namenode-*.log

# Pig job logs
tail -f /tmp/pig*.log

# Hive MetaStore logs
cat /tmp/metastore.log

🌐 Web UI Access

Once services are running, access dashboards:

| Component       | URL                   | Port |
|-----------------|-----------------------|------|
| NameNode        | http://localhost:9870 | 9870 |
| ResourceManager | http://localhost:8088 | 8088 |
| DataNode        | http://localhost:9864 | 9864 |

💡 Real-World Applications

Large platforms apply the same ingest, clean, and analyze pattern to clickstream data:

  • Amazon: Track product clicks → Recommend similar items
  • Netflix: Analyze viewing patterns → Personalize recommendations
  • Facebook: Process billions of events → Ad targeting
  • Airbnb: Study search/booking flows → Optimize listings
  • Spotify: Analyze listening behavior → Suggest playlists

🚀 Next Steps / Enhancements

  • Use real website clickstream data (Kaggle dataset)
  • Add real-time Kafka streaming instead of batch Flume
  • Implement Spark instead of MapReduce for faster processing
  • Add Hive partitioning for daily data (optimization)
  • Create dashboard (Grafana/Superset) for visualization
  • Automate pipeline with Airflow/Oozie scheduler
  • Add anomaly detection for bot/attack patterns

📋 Requirements

Environment

  • Docker with the silicoflare/hadoop:amd image
  • 4GB+ RAM
  • 20GB+ disk space
  • Linux/Mac/WSL environment

Software (Inside Docker)

  • Hadoop 3.3.6
  • Apache Flume 1.x
  • Apache Pig 0.17.0
  • Apache Hive 3.1.3
  • Java 8+

📖 Data Format

Input: Apache Common Log Format

192.168.1.100 - - [05/Apr/2026:16:31:52 +0000] "GET /products/laptop HTTP/1.1" 200 5234
IP           USER AUTH [TIMESTAMP]                 METHOD PAGE VERSION         STATUS SIZE

Output: CSV (Cleaned)

192.168.1.100,05/Apr/2026:16:31:52 +0000,GET /products/laptop HTTP/1.1
IP,timestamp,url
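
A cleaned record can be split into typed fields with the standard library. The sketch below parses the example row above; `parse_row` and the dictionary keys are hypothetical names, and the `strptime` pattern is written to match the timestamp layout shown in this section.

```python
import csv
from datetime import datetime

def parse_row(line):
    """Split one cleaned CSV record into ip, calendar date, and request parts."""
    ip, ts, request = next(csv.reader([line]))
    when = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z")
    method, path, version = request.split(" ", 2)
    return {
        "ip": ip,
        "date": when.date().isoformat(),  # useful for daily traffic grouping
        "method": method,
        "path": path,
        "version": version,
    }

row = parse_row("192.168.1.100,05/Apr/2026:16:31:52 +0000,GET /products/laptop HTTP/1.1")
print(row["date"], row["path"])  # 2026-04-05 /products/laptop
```

Note that the `url` column holds the full request line (method, path, and protocol version), so any per-page analysis needs to extract the path as shown.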

🀝 Contributing

Contributions welcome! Areas:

  • Add more analytics queries
  • Improve data generation (more realistic patterns)
  • Add visualization dashboards
  • Performance optimizations
  • Documentation improvements

📄 License

MIT License - see LICENSE file for details


👤 Author

Created as a portfolio project demonstrating big data engineering skills.

Skills Demonstrated: Apache Flume, Pig, Hive, Hadoop, Docker, ETL, Data Analysis, Distributed Systems


📞 Questions?

For detailed walkthroughs:


⭐ If This Helped

If you found this project helpful for learning big data engineering, consider starring it! ⭐


Last Updated: April 2026
Status: Complete & Production-Ready
