A production-ready distributed task scheduling system built with Java and Spring Boot. Supports scheduled task execution, automatic retries with exponential backoff, task dependency resolution, optimistic locking for concurrency safety, and full execution history logging.
- Scheduled Execution — Tasks are polled every 5 seconds and dispatched when due
- Async Thread Pool — Tasks execute concurrently via a configurable
ThreadPoolTaskExecutor - Exponential Backoff Retry — Failed tasks are retried with delay
2^retryCountseconds, up to a configurablemaxRetrieslimit - Task Dependency Graph — Tasks can declare a dependency on another task; blocked tasks automatically unblock when their dependency completes
- Optimistic Locking —
@Version-based concurrency control prevents double-execution under concurrent dispatcher polls - Execution History Logging — Every completed task generates a
TaskExecutionHistoryrecord with completion time, retry count, and final status - Docker + PostgreSQL — Fully containerised with
docker-compose, no manual database setup required
| Layer | Technology |
|---|---|
| Language | Java 21 |
| Framework | Spring Boot 3.5.14 |
| Database | PostgreSQL 16 |
| ORM | Spring Data JPA / Hibernate |
| Containerisation | Docker + Docker Compose |
| Build Tool | Maven |
┌─────────────────────────────────────────────────────┐
│ TaskService │
│ │
│ @Scheduled (every 5s) │
│ dispatchTasks() │
│ │ │
│ ▼ │
│ getDueTasks() ──► findByStatusAndScheduledAt... │
│ │ (PENDING tasks only) │
│ ▼ │
│ For each task: │
│ setStatus(RUNNING) │
│ taskRepository.save(t) ◄── @Version check │
│ │ OptimisticLockException│
│ ▼ if already running │
│ taskExecutor.execute(λ) │
│ │ │
│ ▼ (thread pool thread) │
│ executeTasks(task) │
│ ├── SUCCESS │
│ │ setStatus(DONE) │
│ │ save() ◄── per-save OptimisticLock handling │
│ │ log to TaskExecutionHistory │
│ │ unblock dependent tasks (BLOCKED → PENDING) │
│ │ │
│ └── FAILURE │
│ retryCount < maxRetries? │
│ YES → setStatus(PENDING) │
│ scheduledAt = now + 2^retryCount │
│ NO → setStatus(FAILED) │
└─────────────────────────────────────────────────────┘
1. Allow-list state guards in cancelTask :
Only PENDING and BLOCKED tasks can be cancelled. Using an allow-list (permit known-safe states, reject everything else) rather than a deny-list means any future status added to the enum fails safely — rejected by default rather than accidentally permitted.
2. Effectively-final lambda capture :
The dispatcher reassigns t = taskRepository.save(t) to capture the updated @Version. A separate Task x = t alias is introduced before the lambda so the thread pool captures a stable, non-reassigned reference. Without this, a thread could execute against a stale variable pointing to a different task entirely.
3. Per-save OptimisticLockException handling :
Each taskRepository.save() call in executeTasks is individually wrapped. A version conflict on the DONE-path triggers a re-fetch-and-reapply pattern — fetch the current row with its actual version, reapply the DONE status, and save again. This prevents completed tasks from being silently orphaned as permanently RUNNING in the database.
4. Exponential backoff via existing dispatcher :
Backoff doesn't need a separate timer or scheduler. Setting scheduledAt = now + 2^retryCount and returning the task to PENDING means the existing poll naturally picks it up when the delay has elapsed — zero new infrastructure.
5. Inter-container communication :
Inside Docker Compose, the datasource URL uses the service name (postgres) rather than localhost. Containers on the same Compose network resolve each other by service name; localhost inside a container refers to that container itself, not the host or sibling containers.
- Docker Desktop installed and running
# Clone the repository
git clone https://github.com/CapnHazard/distributed-task-scheduler
cd distributed-task-scheduler
# Build the jar
mvn package -DskipTests
# Start both containers (PostgreSQL + app)
docker compose up --buildThe app will be available at http://localhost:8080.
To stop:
docker compose downRequires a running PostgreSQL instance on port 5432 with a database named taskdb.
mvn spring-boot:runPOST /tasks
Request body:
{
"name": "Send report email",
"scheduledAt": "2026-07-02T10:00:00",
"priority": "HIGH",
"maxRetries": 3,
"dependsOn": null
}scheduledAt— optional. Defaults tonowif not provided. Cannot be more than 60 seconds in the past.priority— optional. Values:HIGH,MEDIUM,LOW.maxRetries— number of retry attempts before marking asFAILED.dependsOn— optional. ID of another task this task depends on. Task will beBLOCKEDuntil the dependency completes.
Response: 200 OK
{
"id": 1,
"name": "Send report email",
"status": "PENDING",
"priority": "HIGH",
"scheduledAt": "2026-07-02T10:00:00",
"retryCount": 0,
"maxRetries": 3,
"dependsOn": null,
"completedAt": null
}GET /tasks
Response: 200 OK
[
{
"id": 1,
"name": "Send report email",
"status": "DONE",
"retryCount": 0,
"completedAt": "2026-07-02T10:00:07.123"
}
]GET /tasks/{id}
Response: 200 OK — task object as above.
Error: 404 NOT FOUND if task does not exist.
PUT /tasks/{id}/cancel
Only PENDING or BLOCKED tasks can be cancelled. Attempting to cancel a RUNNING, DONE, FAILED, or already CANCELLED task returns a conflict error.
Response: 200 OK
{
"id": 1,
"status": "CANCELLED"
}Error: 409 CONFLICT if task cannot be cancelled.
PENDING ──► RUNNING ──► DONE
▲ │
│ └──► FAILED (max retries exceeded)
│
└──────────────── PENDING (retry with backoff)
BLOCKED ──► PENDING (when dependency completes)
PENDING ──► CANCELLED
BLOCKED ──► CANCELLED
Key properties in application.properties:
| Property | Default | Description |
|---|---|---|
server.port |
8080 |
Server port |
spring.jpa.hibernate.ddl-auto |
update |
Schema strategy |
spring.jpa.show-sql |
true |
Log SQL queries |
Thread pool settings in TaskExecutorConfig.java:
| Setting | Value |
|---|---|
| Core pool size | 5 |
| Max pool size | 10 |
| Queue capacity | 20 |
Dispatcher poll interval: 5000ms (hardcoded via @Scheduled(fixedDelay = 5000)).
1. Multithreading is not just about performance — it's about failure isolation.
Threads don't share your try/catch. Handing work to a thread pool via taskExecutor.execute(λ) breaks the call stack. A try/catch in dispatchTasks cannot catch exceptions thrown inside executeTasks because by the time the thread pool runs that lambda, dispatchTasks has already returned. The catch block I wrote there does nothing for failures happening on a pool thread. Every thread handles its own mess.
2. Optimistic locking requires discipline at every save, not just the first one.
I assumed optimistic locking was more magic than it actually is.@Version doesn't protect you automatically everywhere, it only fires when you call save(). I learned to wrap each save() call individually and handle OptimisticLockException at the exact point it can occur, rather than one broad catch that can't distinguish between a concurrency conflict and a real task failure.
3. Allow-lists are safer than deny-lists for state transition guards.
When writing cancelTask, I could've blocked known-bad states. Instead I only permitted known-safe ones. A forgotten entry in an allow-list silently rejects an unhandled state — safe failure. A forgotten entry in a deny-list silently permits it — dangerous failure.
4. Think through what the data model already gives you before adding complexity.
I almost thought retry scheduling needed its own infrastructure. Then I realised: set scheduledAt = now + 2^retryCount, return the task to PENDING, and the existing dispatcher poll handles it on the next cycle. No new code needed. The right data model can eliminate entire features.
5. Lambda capture rules exist for a real reason.
Took me time to understand and accept this one. If the loop reassigns t while a pool thread is still holding a reference to it via the lambda, that thread executes the wrong task! The effectively-final constraint isn't arbitrary compiler pedantry — it prevents a genuine race condition where a lambda running on a pool thread captures a variable that the main thread has already moved to point at a different object. Introducing Task x = t as a frozen alias after the last reassignment made the risk concrete, not just theoretical.
6. localhost doesn't mean what you think inside Docker.
Inside Docker Compose, localhost inside a container refers to that container itself. Not the host machine, not a sibling container. Sibling containers communicate by service name. Switching the datasource URL from localhost to the Compose service name (postgres) was a one-word fix, but understanding why it was wrong took longer.