Dynamic ETL pipeline that pulls data from the CollegeFootballData.com REST API and loads it into PostgreSQL.
Version: 0.2.0 Author: Tyler Shepherd
For each configured endpoint (teams, drives, plays, …) the pipeline:
- Retrieves JSON from `api.collegefootballdata.com/{endpoint}` with bearer-token auth.
- Transforms the nested response into a flat pandas DataFrame (recursively expanding dict-valued columns).
- Pushes the DataFrame into PostgreSQL, auto-creating the destination table on first run.
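
The "recursively expanding dict-valued columns" step can be illustrated with a small sketch. This is not the project's transformer: `flatten_record`, the `_` separator, and the sample payload are all assumptions made for illustration, and the sketch works on plain dicts so it runs without pandas.

```python
from typing import Any


def flatten_record(
    record: dict[str, Any], parent_key: str = "", sep: str = "_"
) -> dict[str, Any]:
    """Recursively expand dict-valued fields into prefixed top-level keys."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Nested object: recurse and merge its flattened keys.
            flat.update(flatten_record(value, column, sep))
        else:
            # Scalars and lists are kept unchanged.
            flat[column] = value
    return flat


# Hypothetical payload, shaped loosely like a nested API response.
payload = [
    {
        "id": 1,
        "school": "Example State",
        "location": {"city": "Springfield", "geo": {"lat": 1.5, "lon": -2.5}},
    },
]
rows = [flatten_record(item) for item in payload]
print(rows[0])
```

Passing `rows` to `pandas.DataFrame(rows)` would then give one column per flattened key. `pandas.json_normalize` offers similar behavior out of the box and may be the simpler choice when its defaults fit.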
┌────────────────────┐   ┌──────────────────────┐   ┌─────────────────────┐
│   CFBD REST API    │──▶│   RetrieverService   │──▶│                     │
└────────────────────┘   │   (data_retriever_   │   │  TransformService   │
                         │       service)       │   │ (data_transformer_  │
                         └──────────────────────┘   │      service)       │
                                                    └──────────┬──────────┘
                                                               │
                                                               ▼
┌────────────────────┐   ┌──────────────────────┐   ┌─────────────────────┐
│     PostgreSQL     │◀──│    PusherService     │◀──│  pandas DataFrame   │
└────────────────────┘   │    (data_pusher_     │   └─────────────────────┘
                         │       service)       │
                         └──────────────────────┘
                                    ▲
                                    │
                         ┌──────────────────────┐
                         │   EndpointRequest    │
                         │       Service        │
                         │    (endpoints/*)     │
                         └──────────────────────┘
                                    ▲
                                    │
                         ┌──────────────────────┐
                         │   request_manager    │
                         │    (orchestrator)    │
                         └──────────────────────┘
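
The last hop in the diagram (rows into the database, with the table created on first run) might look roughly like the sketch below. It is an illustration only: `push_rows` is a hypothetical helper, the stdlib `sqlite3` module stands in for PostgreSQL so the example runs without a server, and columns are created untyped, which a real PostgreSQL table would not allow.

```python
import sqlite3
from typing import Any


def push_rows(conn: sqlite3.Connection, table: str, rows: list[dict[str, Any]]) -> int:
    """Create the table if it is missing, then insert all rows in one batch."""
    if not rows:
        return 0
    columns = list(rows[0])
    # Identifiers cannot be bound as parameters, so they are quoted here.
    # This assumes table and column names come from trusted code, not user input.
    quoted = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({quoted})')
    # executemany runs the same INSERT statement once per row tuple.
    conn.executemany(
        f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})',
        [tuple(row.get(c) for c in columns) for row in rows],
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = push_rows(
    conn,
    "teams",
    [{"id": 1, "school": "Example State"}, {"id": 2, "school": "Sample Tech"}],
)
count = conn.execute('SELECT COUNT(*) FROM "teams"').fetchone()[0]
print(inserted, count)
```

Because the `CREATE TABLE IF NOT EXISTS` statement is a no-op once the table exists, the same function covers both the first run and every later run.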
cfb-data/
├── data_retriever_service/ # HTTP client for CFBD
├── data_transformer_service/ # JSON -> flat DataFrame
├── data_pusher_service/ # DataFrame -> PostgreSQL
├── endpoints/ # One file per API endpoint
│ ├── base.py # EndpointRequestService base class
│ ├── teams.py
│ ├── game_drives.py
│ └── play_by_play.py
├── request_manager/ # Orchestration / CLI entry point
├── tests/ # pytest unit tests
├── pyproject.toml # Packaging + ruff config
├── requirements.txt # Runtime deps
└── requirements-dev.txt # + dev deps (pytest, ruff, responses)
- Python 3.10+
- PostgreSQL 14+ (local or remote)
- A free CFBD API key: https://collegefootballdata.com/key
git clone https://github.com/TylerShep/cfb-data.git
cd cfb-data
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
# then edit .env with your CFBD key + Postgres creds

The pipeline reads the following environment variables:
| Variable | Purpose |
|---|---|
| `CFBD_DATA_API_KEY` | Bearer token for the CFBD API |
| `DB_HOST` | Postgres host |
| `DB_PORT` | Postgres port (default 5432) |
| `DB_NAME` | Target database name |
| `DB_USER` | Postgres user |
| `DB_PASSWORD` | Postgres password |
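
As a rough sketch of how these variables could be consumed, the snippet below reads them into a small settings object and builds the bearer-token header. The names `Settings`, `load_settings`, and `auth_headers` are hypothetical and not taken from the project, and treating every variable except `DB_PORT` as required is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_key: str
    db_host: str
    db_port: int
    db_name: str
    db_user: str
    db_password: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables, failing early with the names that are missing."""
    # Assumption: everything except DB_PORT is mandatory.
    required = ["CFBD_DATA_API_KEY", "DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        api_key=env["CFBD_DATA_API_KEY"],
        db_host=env["DB_HOST"],
        db_port=int(env.get("DB_PORT", "5432")),  # documented default port
        db_name=env["DB_NAME"],
        db_user=env["DB_USER"],
        db_password=env["DB_PASSWORD"],
    )


def auth_headers(settings: Settings) -> dict[str, str]:
    """Bearer-token header to send with API requests."""
    return {"Authorization": f"Bearer {settings.api_key}"}


# Example with placeholder values instead of the real environment.
example_env = {
    "CFBD_DATA_API_KEY": "example-key",
    "DB_HOST": "localhost",
    "DB_NAME": "cfb",
    "DB_USER": "postgres",
    "DB_PASSWORD": "secret",
}
settings = load_settings(example_env)
print(settings.db_port, auth_headers(settings))
```

Validating everything up front means a missing credential surfaces as one clear error at startup rather than as a failed request or connection partway through a run.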
# Run every configured endpoint:
python -m request_manager.manager
# Or (once installed via pip):
cfb-data --log-level DEBUG

Create a new file in `endpoints/` subclassing `EndpointRequestService`:
from dataclasses import dataclass, field
from typing import Any

from endpoints.base import EndpointRequestService


@dataclass
class GamesEndpoint(EndpointRequestService):
    endpoint: str = "games"
    default_params: dict[str, Any] = field(
        default_factory=lambda: {"year": 2023, "seasonType": "regular"}
    )

Then add an instance to the `ENDPOINTS` list in `request_manager/manager.py`.
For endpoints that need multiple calls (e.g. once per week), override params_iter() — see endpoints/play_by_play.py for an example.
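
A per-week override might look something like the following. The real base class lives in `endpoints/base.py` and its interface is not shown here, so this sketch defines a stand-in `EndpointRequestService` whose `params_iter()` is assumed to yield one parameter dict per API call. `WeeklyPlaysEndpoint`, `first_week`, `last_week`, and the `week` parameter are likewise illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class EndpointRequestService:
    """Stand-in for endpoints.base.EndpointRequestService (interface assumed)."""

    endpoint: str = ""
    default_params: dict[str, Any] = field(default_factory=dict)

    def params_iter(self) -> Iterator[dict[str, Any]]:
        # Assumed default behavior: a single call using default_params.
        yield dict(self.default_params)


@dataclass
class WeeklyPlaysEndpoint(EndpointRequestService):
    endpoint: str = "plays"
    default_params: dict[str, Any] = field(
        default_factory=lambda: {"year": 2023, "seasonType": "regular"}
    )
    first_week: int = 1  # hypothetical bounds for the weekly loop
    last_week: int = 3

    def params_iter(self) -> Iterator[dict[str, Any]]:
        # One parameter set per week; the orchestrator would issue one call each.
        for week in range(self.first_week, self.last_week + 1):
            yield {**self.default_params, "week": week}


calls = list(WeeklyPlaysEndpoint().params_iter())
for params in calls:
    print(params)
```

Each yielded dict is a fresh copy, so a caller that mutates one call's parameters does not affect `default_params` or the other calls.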
# Lint + format:
ruff check .
ruff format .
# Run tests:
pytest

- Rewrote retriever/transformer/pusher services to use proper connection management, error handling, and batch inserts.
- Renamed the `requests/` package to `endpoints/` to avoid shadowing the third-party `requests` library.
- Added a base `EndpointRequestService` class with a common retrieve → transform → push flow.
- Added `pyproject.toml`, a real `requirements.txt` (with pins, minus stdlib entries), a `--log-level` CLI, and an `.env.example`.
- Removed committed `.idea/` IntelliJ metadata.
- Fixed several runtime bugs: missing `numpy` import, `getConnection` vs `getPostgresConnection` typo, bound-method call on class, empty `games.py`.
Initial release (classroom project scaffolding).
- CFBD's official Python client: https://github.com/CFBD/cfbd-python
- API docs (endpoints + schema): https://api.collegefootballdata.com/api/docs/?url=/api-docs.json
- Raw data exporter: https://collegefootballdata.com/exporter