A complete big data pipeline demonstrating ETL (Extract-Transform-Load) and analytics using Apache Flume, Pig, and Hive. Processes website clickstream data to identify user behavior trends and patterns, similar to how Amazon, Netflix, and Facebook analyze user interactions.
This is a production-inspired big data pipeline that:
- Ingests clickstream data in real-time (Apache Flume)
- Cleans messy raw logs, removing errors and static assets (Apache Pig)
- Analyzes clean data using SQL queries to find trends (Apache Hive)
Key Achievement: Processes 100 raw logs → filters to 89 quality records → generates 8 analytics insights

Raw Logs (100) ──Flume──> HDFS Raw ──Pig──> Cleaned Data (89) ──Hive──> Analytics Results
 8,012 bytes                      MapReduce  5,809 bytes                 8 Queries
- Docker installed (silicoflare/hadoop:amd image pre-pulled)
- 4GB+ RAM available
- Linux/Mac terminal
# Clone this repo
git clone <your-repo-url>
cd "ClickSteam analysis"
# Start Docker container with all services
sudo docker run -d --name clickstream \
-p 9870:9870 \
-p 8088:8088 \
-p 9864:9864 \
-v "$(pwd):/clickstream" \
--entrypoint /bin/bash \
silicoflare/hadoop:amd \
-c "sleep infinity"
# Enter container
docker exec -it clickstream /bin/bash

# Inside the container, start services
/usr/local/hadoop/bin/hdfs namenode -format -force
/usr/local/hadoop/bin/hdfs namenode &
/usr/local/hadoop/bin/hdfs datanode &
/usr/local/hadoop/bin/yarn resourcemanager &
/usr/local/hadoop/bin/yarn nodemanager &
# Verify services running
jps
# Should show: NameNode, DataNode, ResourceManager, NodeManager

# Create HDFS directories
hdfs dfs -mkdir -p /user/root/clickstream/{raw,processed}
# Generate sample logs (100 entries)
python3 << 'EOF'
import random
from datetime import datetime, timedelta

pages = ['/index.html', '/products/laptop', '/products/phone', '/cart', '/checkout']
ips = ['192.168.1.100', '192.168.1.101', '192.168.1.102', '10.0.0.1']

with open('/clickstream/logs/access.log', 'w') as f:
    current_time = datetime(2026, 4, 5, 16, 31, 52)
    for i in range(100):
        ip = random.choice(ips)
        page = random.choice(pages)
        # Roughly 90% of requests succeed; the rest are 404s
        status = random.choices([200, 404], weights=[90, 10])[0]
        size = random.randint(1000, 10000)
        # Apache Common Log Format timestamp, e.g. 05/Apr/2026:16:31:52 +0000
        timestamp = current_time.strftime('%d/%b/%Y:%H:%M:%S +0000')
        log = f'{ip} - - [{timestamp}] "GET {page} HTTP/1.1" {status} {size}\n'
        f.write(log)
        # Advance the clock 1-10 seconds between requests
        current_time += timedelta(seconds=random.randint(1, 10))

print("Generated 100 sample logs")
EOF
# Upload to HDFS
hdfs dfs -put /clickstream/logs/access.log /user/root/clickstream/raw/

Phase 1 & 2: Ingestion & Cleaning (Pig)
# Delete old output
hdfs dfs -rm -r /user/root/clickstream/processed
# Run Pig script (ETL/cleaning)
pig -x local /clickstream/phase2_cleaning/clean_logs.pig
# Result in the documented run: 89 clean records (11 404s filtered).
# The sample generator is unseeded, so your counts will vary.

Phase 3: Start Hive MetaStore
# Start MetaStore service
nohup hive --service metastore > /tmp/metastore.log 2>&1 &
# Wait 5 seconds for startup
sleep 5

Create Table & Run Analytics
# Run Hive queries
hive -hiveconf hive.metastore.uris=thrift://localhost:9083 \
-f /clickstream/phase3_analysis/create_table.hql
hive -hiveconf hive.metastore.uris=thrift://localhost:9083 \
-f /clickstream/phase3_analysis/trend_queries.hql

clickstream-analysis/
│
├── README.md                      ← You are here
├── DOCKER_QUICKSTART.md           ← Detailed Docker setup
├── ARCHITECTURE.md                ← System design & diagrams
├── INTERVIEW_GUIDE.md             ← Interview preparation
│
├── phase1_ingestion/
│   └── flume-conf.properties      ← Real-time log ingestion config
│
├── phase2_cleaning/
│   └── clean_logs.pig             ← ETL/data cleaning script
│
├── phase3_analysis/
│   ├── create_table.hql           ← Hive table creation
│   └── trend_queries.hql          ← 8 analytics queries
│
├── docs/
│   ├── README.md                  ← Additional documentation
│   └── DOCKER_SETUP.md            ← Detailed Docker reference
│
└── logs/                          ← Local log directory (empty template)
- Purpose: Collect logs in real-time from web servers
- Source: File directory monitoring (spooling directory)
- Destination: HDFS distributed storage
- Config: phase1_ingestion/flume-conf.properties
- Why: Simulates enterprise log aggregation (Netflix, Amazon scale)
- Purpose: Extract, transform, load (ETL) - remove bad data
- Input: 100 raw logs (8,012 bytes)
- Processing:
  - Parse Apache Common Log Format using regex
  - Remove HTTP 404/500 errors
  - Remove static assets (.jpg, .css, .js files)
  - Extract: IP, timestamp, URL
- Output: 89 clean records (5,809 bytes, 11% filtered)
- Script: phase2_cleaning/clean_logs.pig
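The cleaning rules listed above can be sketched in Python for readers who want to check them without running Pig. This is only an illustration under stated assumptions, not the contents of clean_logs.pig: the regex, the `clean_line` helper, and the exact suffix and status sets are hypothetical and inferred from the bullet list.

```python
import re

# Apache Common Log Format: IP, identity, user, [timestamp], "request", status, size
CLF_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)$'
)

# Assumed filter rules, taken from the bullet list (not from the Pig script itself)
ERROR_STATUSES = {"404", "500"}
STATIC_SUFFIXES = (".jpg", ".css", ".js")


def clean_line(line):
    """Return 'ip,timestamp,request' for a line worth keeping, else None."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None  # line does not follow Common Log Format
    if match["status"] in ERROR_STATUSES:
        return None  # drop error responses
    parts = match["request"].split()
    path = parts[1] if len(parts) >= 2 else ""
    if path.lower().endswith(STATIC_SUFFIXES):
        return None  # drop static assets
    return ",".join([match["ip"], match["timestamp"], match["request"]])
```

The suffix check is deliberately simple and would miss assets requested with a query string (for example `/app.js?v=2`); a real cleaning job may need a stricter rule.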
- Purpose: SQL-based analysis of clickstream trends
- Language: HiveQL (SQL-like interface to MapReduce)
- Queries: 8 pre-built analysis questions
- Results: Business insights on user behavior
- Scripts:
  - phase3_analysis/create_table.hql - Schema definition
  - phase3_analysis/trend_queries.hql - Analytics queries
Running the analytics queries produces output like the following (Queries 1-5 of 8 shown):
Query 1: Top 5 Pages by Click Count
┌──────────────────────────────┬──────────┐
│ url                          │ clicks   │
├──────────────────────────────┼──────────┤
│ GET /checkout HTTP/1.1       │ 20       │
│ GET /products/laptop HTTP/1.1│ 19       │
│ GET /cart HTTP/1.1           │ 18       │
│ GET /products/phone HTTP/1.1 │ 17       │
│ GET /index.html HTTP/1.1     │ 15       │
└──────────────────────────────┴──────────┘

Query 2: Daily Traffic Trends
Date: 05/Apr/2026 → Total Clicks: 89

Query 3: Unique Visitors
Total Unique IPs: 4

Query 4: Traffic by IP (Top Visitors)
192.168.1.101 → 30 visits
192.168.1.100 → 27 visits
192.168.1.102 → 16 visits
10.0.0.1 → 16 visits

Query 5: URL Pattern Categories
Product Pages → 36 clicks (40%)
Checkout → 20 clicks (22%)
Shopping Cart → 18 clicks (20%)
Other → 15 clicks (17%)
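The Query 5 percentages follow directly from the Query 1 click counts, which the short Python sketch below reproduces. It is a cross-check, not the HiveQL in trend_queries.hql: the `categorize` and `category_shares` helpers and the prefix rules are assumptions chosen to match the category names in the output.

```python
from collections import Counter


def categorize(request):
    """Map a request string such as 'GET /cart HTTP/1.1' to a category label."""
    parts = request.split()
    path = parts[1] if len(parts) >= 2 else request
    # Assumed prefix rules; the real query may use different patterns
    if path.startswith("/products/"):
        return "Product Pages"
    if path.startswith("/checkout"):
        return "Checkout"
    if path.startswith("/cart"):
        return "Shopping Cart"
    return "Other"


def category_shares(click_counts):
    """Return {category: (clicks, rounded percent)} from {request: clicks}."""
    totals = Counter()
    for request, clicks in click_counts.items():
        totals[categorize(request)] += clicks
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {
        name: (clicks, round(100 * clicks / grand_total))
        for name, clicks in totals.most_common()
    }


# Click counts copied from the Query 1 output
query1 = {
    "GET /checkout HTTP/1.1": 20,
    "GET /products/laptop HTTP/1.1": 19,
    "GET /cart HTTP/1.1": 18,
    "GET /products/phone HTTP/1.1": 17,
    "GET /index.html HTTP/1.1": 15,
}
shares = category_shares(query1)
for name, (clicks, percent) in shares.items():
    print(f"{name}: {clicks} clicks ({percent}%)")
```

Because each share is rounded independently, the percentages sum to 99 rather than 100, which matches the figures listed for Query 5.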
| Layer | Technology | Version | Purpose |
|---|---|---|---|
| Ingestion | Apache Flume | 1.x | Real-time log streaming |
| Storage | Hadoop HDFS | 3.3.6 | Distributed file system |
| Processing | Apache Pig | 0.17.0 | Data transformation/ETL |
| Analytics | Apache Hive | 3.1.3 | SQL interface to data |
| Compute | MapReduce | Hadoop 3.3.6 | Distributed computation |
| Deployment | Docker | Latest | Containerization |
| OS | Linux | Ubuntu 20.04 | Container base |
┌──────────────────────────────────────────────────────────────┐
│                  CLICKSTREAM DATA PIPELINE                   │
└──────────────────────────────────────────────────────────────┘

    Web Server Logs
           ↓
┌─────────────────────┐
│ PHASE 1: INGESTION  │
│ Apache Flume        │
│ (Real-time stream)  │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│ HDFS Storage        │
│ /raw/access.log     │
│ (100 records)       │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│ PHASE 2: CLEANING   │
│ Apache Pig (ETL)    │
│ MapReduce jobs      │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│ HDFS Storage        │
│ /processed/ (CSV)   │
│ (89 records)        │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│ PHASE 3: ANALYTICS  │
│ Apache Hive (SQL)   │
│ 8 pre-built queries │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│ Business Insights   │
│ • Top pages         │
│ • Traffic trends    │
│ • Unique visitors   │
│ • Bot detection     │
└─────────────────────┘
Working through this project demonstrates:

Big Data Engineering:
- Real-time data ingestion architecture
- Distributed ETL processing
- Data quality and validation

Data Processing:
- Log parsing and regex patterns
- Complex transformations
- A design intended to scale to large volumes (> 1 billion records), exercised here on a 100-record sample

Data Analytics:
- SQL-based analysis
- Answering business questions
- Trend identification

DevOps Skills:
- Docker containerization
- Distributed systems
- Service configuration and debugging

Problem Solving:
- Debugging distributed systems
- Handling real-world messy data
- Performance optimization

- DOCKER_QUICKSTART.md - Step-by-step Docker setup guide
- ARCHITECTURE.md - Detailed system design and data flow diagrams
- INTERVIEW_GUIDE.md - Interview preparation and talking points
- docs/DOCKER_SETUP.md - Complete Docker command reference
# Check what services are running
jps
# If any missing, start manually
/usr/local/hadoop/bin/hdfs namenode &
/usr/local/hadoop/bin/hdfs datanode &
/usr/local/hadoop/bin/yarn resourcemanager &
/usr/local/hadoop/bin/yarn nodemanager &

# Start MetaStore service
nohup hive --service metastore > /tmp/metastore.log 2>&1 &
# Use explicit MetaStore URI
hive -hiveconf hive.metastore.uris=thrift://localhost:9083

# Pig won't overwrite - delete first
hdfs dfs -rm -r /user/root/clickstream/processed
# Then re-run Pig script

# NameNode logs
tail -f /usr/local/hadoop/logs/hadoop-root-namenode-*.log
# Pig job logs
tail -f /tmp/pig*.log
# Hive MetaStore logs
cat /tmp/metastore.log

Once services are running, access dashboards:
| Component | URL | Port |
|---|---|---|
| NameNode | http://localhost:9870 | 9870 |
| ResourceManager | http://localhost:8088 | 8088 |
| DataNode | http://localhost:9864 | 9864 |
Pipelines with this ingest, clean, analyze shape support use cases such as:
- Amazon: Track product clicks → Recommend similar items
- Netflix: Analyze viewing patterns → Personalize recommendations
- Facebook: Process billions of events → Ad targeting
- Airbnb: Study search/booking flows → Optimize listings
- Spotify: Analyze listening behavior → Suggest playlists
- Use real website clickstream data (Kaggle dataset)
- Add real-time Kafka streaming in place of the file-based Flume spooling source
- Implement Spark instead of MapReduce for faster processing
- Add Hive partitioning for daily data (optimization)
- Create dashboard (Grafana/Superset) for visualization
- Automate pipeline with Airflow/Oozie scheduler
- Add anomaly detection for bot/attack patterns
- Docker with silicoflare/hadoop image
- 4GB+ RAM
- 20GB+ disk space
- Linux/Mac/WSL environment
- Hadoop 3.3.6
- Apache Flume 1.x
- Apache Pig 0.17.0
- Apache Hive 3.1.3
- Java 8+
192.168.1.100 - - [05/Apr/2026:16:31:52 +0000] "GET /products/laptop HTTP/1.1" 200 5234
IP USER AUTH [TIMESTAMP] METHOD PAGE VERSION STATUS SIZE
192.168.1.100,05/Apr/2026:16:31:52 +0000,GET /products/laptop HTTP/1.1
IP,timestamp,url
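To show how the cleaned `IP,timestamp,url` record can be consumed downstream, the sketch below splits one record and parses its timestamp into a date, which is the grouping key a daily-traffic query needs. The `parse_clean_record` helper is hypothetical and assumes the url column holds the full request string (method, path, HTTP version), as in the sample record.

```python
from datetime import datetime


def parse_clean_record(record):
    """Split one cleaned CSV record into (ip, datetime, method, path)."""
    # maxsplit=2 keeps any commas inside the request string intact
    ip, timestamp, request = record.split(",", 2)
    # Common Log Format timestamp, e.g. 05/Apr/2026:16:31:52 +0000
    when = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z")
    method, path, _version = request.split(" ", 2)
    return ip, when, method, path


record = "192.168.1.100,05/Apr/2026:16:31:52 +0000,GET /products/laptop HTTP/1.1"
ip, when, method, path = parse_clean_record(record)
day = when.date().isoformat()  # grouping key for per-day counts
print(ip, day, method, path)
```

Note that `%b` matches month abbreviations in the current locale, so this assumes an English or C locale for names such as `Apr`.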
Contributions welcome! Areas:
- Add more analytics queries
- Improve data generation (more realistic patterns)
- Add visualization dashboards
- Performance optimizations
- Documentation improvements
MIT License - see LICENSE file for details
Created as a portfolio project demonstrating big data engineering skills.
Skills Demonstrated: Apache Flume, Pig, Hive, Hadoop, Docker, ETL, Data Analysis, Distributed Systems
For detailed walkthroughs:
- Check ARCHITECTURE.md for system design
- See INTERVIEW_GUIDE.md for deeper explanations
- Review script comments in phase folders
If you found this project helpful for learning big data engineering, consider starring it!
Last Updated: April 2026
Status: Complete & Production-Ready