Distributed Task Scheduler

A production-ready distributed task scheduling system built with Java and Spring Boot. Supports scheduled task execution, automatic retries with exponential backoff, task dependency resolution, optimistic locking for concurrency safety, and full execution history logging.

Features

Scheduled Execution — Tasks are polled every 5 seconds and dispatched when due
Async Thread Pool — Tasks execute concurrently via a configurable ThreadPoolTaskExecutor
Exponential Backoff Retry — Failed tasks are retried with delay 2^retryCount seconds, up to a configurable maxRetries limit
Task Dependency Graph — Tasks can declare a dependency on another task; blocked tasks automatically unblock when their dependency completes
Optimistic Locking — @Version-based concurrency control prevents double-execution under concurrent dispatcher polls
Execution History Logging — Every completed task generates a TaskExecutionHistory record with completion time, retry count, and final status
Docker + PostgreSQL — Fully containerised with docker-compose, no manual database setup required

Tech Stack

Layer	Technology
Language	Java 21
Framework	Spring Boot 3.5.14
Database	PostgreSQL 16
ORM	Spring Data JPA / Hibernate
Containerisation	Docker + Docker Compose
Build Tool	Maven

Architecture

┌─────────────────────────────────────────────────────┐
│                     TaskService                      │
│                                                      │
│  @Scheduled (every 5s)                               │
│  dispatchTasks()                                     │
│       │                                              │
│       ▼                                              │
│  getDueTasks()  ──►  findByStatusAndScheduledAt...   │
│       │              (PENDING tasks only)            │
│       ▼                                              │
│  For each task:                                      │
│    setStatus(RUNNING)                                │
│    taskRepository.save(t)  ◄── @Version check        │
│       │                       OptimisticLockException│
│       ▼                        if already running    │
│  taskExecutor.execute(λ)                             │
│       │                                              │
│       ▼  (thread pool thread)                        │
│  executeTasks(task)                                  │
│    ├── SUCCESS                                       │
│    │     setStatus(DONE)                             │
│    │     save() ◄── per-save OptimisticLock handling │
│    │     log to TaskExecutionHistory                 │
│    │     unblock dependent tasks (BLOCKED → PENDING) │
│    │                                                 │
│    └── FAILURE                                       │
│          retryCount < maxRetries?                    │
│            YES → setStatus(PENDING)                  │
│                  scheduledAt = now + 2^retryCount    │
│            NO  → setStatus(FAILED)                   │
└─────────────────────────────────────────────────────┘

Key Design Decisions

1. Allow-list state guards in cancelTask : Only PENDING and BLOCKED tasks can be cancelled. Using an allow-list (permit known-safe states, reject everything else) rather than a deny-list means any future status added to the enum fails safely — rejected by default rather than accidentally permitted.

2. Effectively-final lambda capture : The dispatcher reassigns t = taskRepository.save(t) to capture the updated @Version. A separate Task x = t alias is introduced before the lambda so the thread pool captures a stable, non-reassigned reference. Without this, a thread could execute against a stale variable pointing to a different task entirely.

3. Per-save OptimisticLockException handling : Each taskRepository.save() call in executeTasks is individually wrapped. A version conflict on the DONE-path triggers a re-fetch-and-reapply pattern — fetch the current row with its actual version, reapply the DONE status, and save again. This prevents completed tasks from being silently orphaned as permanently RUNNING in the database.

4. Exponential backoff via existing dispatcher : Backoff doesn't need a separate timer or scheduler. Setting scheduledAt = now + 2^retryCount and returning the task to PENDING means the existing poll naturally picks it up when the delay has elapsed — zero new infrastructure.

5. Inter-container communication : Inside Docker Compose, the datasource URL uses the service name (postgres) rather than localhost. Containers on the same Compose network resolve each other by service name; localhost inside a container refers to that container itself, not the host or sibling containers.

Getting Started

Prerequisites

Docker Desktop installed and running

Run with Docker

# Clone the repository
git clone https://github.com/CapnHazard/distributed-task-scheduler
cd distributed-task-scheduler

# Build the jar
mvn package -DskipTests

# Start both containers (PostgreSQL + app)
docker compose up --build

The app will be available at http://localhost:8080.

To stop:

docker compose down

Run Locally (without Docker)

Requires a running PostgreSQL instance on port 5432 with a database named taskdb.

mvn spring-boot:run

API Endpoints

Create a Task

POST /tasks

Request body:

{
  "name": "Send report email",
  "scheduledAt": "2026-07-02T10:00:00",
  "priority": "HIGH",
  "maxRetries": 3,
  "dependsOn": null
}

scheduledAt — optional. Defaults to now if not provided. Cannot be more than 60 seconds in the past.
priority — optional. Values: HIGH, MEDIUM, LOW.
maxRetries — number of retry attempts before marking as FAILED.
dependsOn — optional. ID of another task this task depends on. Task will be BLOCKED until the dependency completes.

Response: 200 OK

{
  "id": 1,
  "name": "Send report email",
  "status": "PENDING",
  "priority": "HIGH",
  "scheduledAt": "2026-07-02T10:00:00",
  "retryCount": 0,
  "maxRetries": 3,
  "dependsOn": null,
  "completedAt": null
}

Get All Tasks

GET /tasks

Response: 200 OK

[
  {
    "id": 1,
    "name": "Send report email",
    "status": "DONE",
    "retryCount": 0,
    "completedAt": "2026-07-02T10:00:07.123"
  }
]

Get Task by ID

GET /tasks/{id}

Response: 200 OK — task object as above.

Error: 404 NOT FOUND if task does not exist.

Cancel a Task

PUT /tasks/{id}/cancel

Only PENDING or BLOCKED tasks can be cancelled. Attempting to cancel a RUNNING, DONE, FAILED, or already CANCELLED task returns a conflict error.

Response: 200 OK

{
  "id": 1,
  "status": "CANCELLED"
}

Error: 409 CONFLICT if task cannot be cancelled.

Task Lifecycle

PENDING ──► RUNNING ──► DONE
   ▲            │
   │            └──► FAILED (max retries exceeded)
   │
   └──────────────── PENDING (retry with backoff)

BLOCKED ──► PENDING (when dependency completes)
PENDING ──► CANCELLED
BLOCKED ──► CANCELLED

Configuration

Key properties in application.properties:

Property	Default	Description
`server.port`	`8080`	Server port
`spring.jpa.hibernate.ddl-auto`	`update`	Schema strategy
`spring.jpa.show-sql`	`true`	Log SQL queries

Thread pool settings in TaskExecutorConfig.java:

Setting	Value
Core pool size	5
Max pool size	10
Queue capacity	20

Dispatcher poll interval: 5000ms (hardcoded via @Scheduled(fixedDelay = 5000)).

What I Learned

1. Multithreading is not just about performance — it's about failure isolation. Threads don't share your try/catch. Handing work to a thread pool via taskExecutor.execute(λ) breaks the call stack. A try/catch in dispatchTasks cannot catch exceptions thrown inside executeTasks because by the time the thread pool runs that lambda, dispatchTasks has already returned. The catch block I wrote there does nothing for failures happening on a pool thread. Every thread handles its own mess.

2. Optimistic locking requires discipline at every save, not just the first one. I assumed optimistic locking was more magic than it actually is.@Version doesn't protect you automatically everywhere, it only fires when you call save(). I learned to wrap each save() call individually and handle OptimisticLockException at the exact point it can occur, rather than one broad catch that can't distinguish between a concurrency conflict and a real task failure.

3. Allow-lists are safer than deny-lists for state transition guards. When writing cancelTask, I could've blocked known-bad states. Instead I only permitted known-safe ones. A forgotten entry in an allow-list silently rejects an unhandled state — safe failure. A forgotten entry in a deny-list silently permits it — dangerous failure.

4. Think through what the data model already gives you before adding complexity. I almost thought retry scheduling needed its own infrastructure. Then I realised: set scheduledAt = now + 2^retryCount, return the task to PENDING, and the existing dispatcher poll handles it on the next cycle. No new code needed. The right data model can eliminate entire features.

5. Lambda capture rules exist for a real reason. Took me time to understand and accept this one. If the loop reassigns t while a pool thread is still holding a reference to it via the lambda, that thread executes the wrong task! The effectively-final constraint isn't arbitrary compiler pedantry — it prevents a genuine race condition where a lambda running on a pool thread captures a variable that the main thread has already moved to point at a different object. Introducing Task x = t as a frozen alias after the last reassignment made the risk concrete, not just theoretical.

6. localhost doesn't mean what you think inside Docker. Inside Docker Compose, localhost inside a container refers to that container itself. Not the host machine, not a sibling container. Sibling containers communicate by service name. Switching the datasource URL from localhost to the Compose service name (postgres) was a one-word fix, but understanding why it was wrong took longer.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.mvn/wrapper		.mvn/wrapper
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
HELP.md		HELP.md
README.md		README.md
docker-compose.yml		docker-compose.yml
mvnw		mvnw
mvnw.cmd		mvnw.cmd
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Distributed Task Scheduler

Table of Contents

Features

Tech Stack

Architecture

Key Design Decisions

Getting Started

Prerequisites

Run with Docker

Run Locally (without Docker)

API Endpoints

Create a Task

Get All Tasks

Get Task by ID

Cancel a Task

Task Lifecycle

Configuration

What I Learned

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Distributed Task Scheduler

Table of Contents

Features

Tech Stack

Architecture

Key Design Decisions

Getting Started

Prerequisites

Run with Docker

Run Locally (without Docker)

API Endpoints

Create a Task

Get All Tasks

Get Task by ID

Cancel a Task

Task Lifecycle

Configuration

What I Learned

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages