CapnHazard/distributed-task-scheduler

Distributed Task Scheduler

A production-ready distributed task scheduling system built with Java and Spring Boot. Supports scheduled task execution, automatic retries with exponential backoff, task dependency resolution, optimistic locking for concurrency safety, and full execution history logging.


Features

  • Scheduled Execution — Tasks are polled every 5 seconds and dispatched when due
  • Async Thread Pool — Tasks execute concurrently via a configurable ThreadPoolTaskExecutor
  • Exponential Backoff Retry — Failed tasks are retried with delay 2^retryCount seconds, up to a configurable maxRetries limit
  • Task Dependency Graph — Tasks can declare a dependency on another task; blocked tasks automatically unblock when their dependency completes
  • Optimistic Locking — @Version-based concurrency control prevents double-execution under concurrent dispatcher polls
  • Execution History Logging — Every completed task generates a TaskExecutionHistory record with completion time, retry count, and final status
  • Docker + PostgreSQL — Fully containerised with docker-compose, no manual database setup required

Tech Stack

Layer             Technology
Language          Java 21
Framework         Spring Boot 3.5.14
Database          PostgreSQL 16
ORM               Spring Data JPA / Hibernate
Containerisation  Docker + Docker Compose
Build Tool        Maven

Architecture

┌─────────────────────────────────────────────────────┐
│                     TaskService                      │
│                                                      │
│  @Scheduled (every 5s)                               │
│  dispatchTasks()                                     │
│       │                                              │
│       ▼                                              │
│  getDueTasks()  ──►  findByStatusAndScheduledAt...   │
│       │              (PENDING tasks only)            │
│       ▼                                              │
│  For each task:                                      │
│    setStatus(RUNNING)                                │
│    taskRepository.save(t)  ◄── @Version check        │
│       │                       OptimisticLockException│
│       ▼                        if already running    │
│  taskExecutor.execute(λ)                             │
│       │                                              │
│       ▼  (thread pool thread)                        │
│  executeTasks(task)                                  │
│    ├── SUCCESS                                       │
│    │     setStatus(DONE)                             │
│    │     save() ◄── per-save OptimisticLock handling │
│    │     log to TaskExecutionHistory                 │
│    │     unblock dependent tasks (BLOCKED → PENDING) │
│    │                                                 │
│    └── FAILURE                                       │
│          retryCount < maxRetries?                    │
│            YES → setStatus(PENDING)                  │
│                  scheduledAt = now + 2^retryCount    │
│            NO  → setStatus(FAILED)                   │
└─────────────────────────────────────────────────────┘
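
The FAILURE branch of the diagram can be sketched in plain Java as below. This is a minimal illustration, not the repository's code: the FailureBranch class, the Outcome record, and the cap on the shift are all assumptions, and it assumes retryCount is the number of attempts already retried when the failure is handled.

```java
import java.time.LocalDateTime;

// Hypothetical sketch of the FAILURE branch; names are illustrative only.
final class FailureBranch {
    // nextRun is null when the task is marked FAILED and will not run again.
    record Outcome(String status, LocalDateTime nextRun) {}

    static Outcome onFailure(int retryCount, int maxRetries, LocalDateTime now) {
        if (retryCount < maxRetries) {
            // 2^retryCount seconds; the exponent is clamped (an assumption)
            // so the shift can never overflow a long.
            int exponent = Math.min(Math.max(retryCount, 0), 30);
            long delaySeconds = 1L << exponent;
            return new Outcome("PENDING", now.plusSeconds(delaySeconds));
        }
        return new Outcome("FAILED", null);
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.of(2026, 7, 2, 10, 0, 0);
        for (int retry = 0; retry <= 3; retry++) {
            System.out.println(retry + " -> " + onFailure(retry, 3, now));
        }
    }
}
```

Returning the task to PENDING with a later scheduledAt is all the branch has to do; the regular poll picks the task up once that time has passed.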

Key Design Decisions

1. Allow-list state guards in cancelTask: Only PENDING and BLOCKED tasks can be cancelled. Using an allow-list (permit known-safe states, reject everything else) rather than a deny-list means any future status added to the enum fails safely: it is rejected by default rather than accidentally permitted.

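2. Effectively-final lambda capture: The dispatcher reassigns t = taskRepository.save(t) to pick up the updated @Version. Because t is reassigned, it is not effectively final, and Java will not let a lambda capture it. A separate alias, Task x = t, is introduced after the last reassignment so the lambda handed to the thread pool captures a stable reference that is never reassigned. The rule exists so that a pool thread can never observe a local variable that the dispatching thread has since pointed at a different task.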

3. Per-save OptimisticLockException handling: Each taskRepository.save() call in executeTasks is individually wrapped. A version conflict on the DONE path triggers a re-fetch-and-reapply pattern: fetch the current row with its actual version, reapply the DONE status, and save again. This prevents completed tasks from being silently orphaned as permanently RUNNING in the database.

4. Exponential backoff via existing dispatcher: Backoff doesn't need a separate timer or scheduler. Setting scheduledAt = now + 2^retryCount seconds and returning the task to PENDING means the existing poll naturally picks it up once the delay has elapsed, with no new infrastructure.

5. Inter-container communication: Inside Docker Compose, the datasource URL uses the service name (postgres) rather than localhost. Containers on the same Compose network resolve each other by service name; localhost inside a container refers to that container itself, not the host or sibling containers.
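
The re-fetch-and-reapply pattern from decision 3 can be sketched without Spring. Everything here is a stand-in chosen for illustration: Store imitates a repository with @Version semantics, VersionConflict imitates the optimistic-lock exception, and markDone is a hypothetical helper rather than a method from the repository.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model of a versioned save; not the project's code.
final class RefetchSketch {
    record Row(long id, String status, int version) {}

    // Stand-in for the exception raised on a version mismatch.
    static final class VersionConflict extends RuntimeException {}

    static final class Store {
        private final Map<Long, Row> rows = new ConcurrentHashMap<>();

        void insert(Row row) { rows.put(row.id(), row); }

        Row find(long id) { return rows.get(id); }

        // Succeeds only if the caller's version matches the stored version,
        // then bumps the version, as a @Version column would.
        synchronized Row save(Row row) {
            Row current = rows.get(row.id());
            if (current.version() != row.version()) throw new VersionConflict();
            Row next = new Row(row.id(), row.status(), row.version() + 1);
            rows.put(row.id(), next);
            return next;
        }
    }

    // Mark DONE; on a conflict, re-fetch the current row and reapply once.
    static Row markDone(Store store, Row held) {
        try {
            return store.save(new Row(held.id(), "DONE", held.version()));
        } catch (VersionConflict e) {
            Row fresh = store.find(held.id());
            return store.save(new Row(fresh.id(), "DONE", fresh.version()));
        }
    }
}
```

A real implementation would probably also inspect the re-fetched row before overwriting it, for example to avoid marking a task DONE after another writer has moved it to a terminal state. The sketch retries once and lets a second conflict propagate.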


Getting Started

Prerequisites

  • Docker Desktop installed and running
  • JDK 21 and Maven, needed for the mvn package step below

Run with Docker

# Clone the repository
git clone https://github.com/CapnHazard/distributed-task-scheduler
cd distributed-task-scheduler

# Build the jar
mvn package -DskipTests

# Start both containers (PostgreSQL + app)
docker compose up --build

The app will be available at http://localhost:8080.

To stop:

docker compose down

Run Locally (without Docker)

Requires a running PostgreSQL instance on port 5432 with a database named taskdb.

mvn spring-boot:run

API Endpoints

Create a Task

POST /tasks

Request body:

{
  "name": "Send report email",
  "scheduledAt": "2026-07-02T10:00:00",
  "priority": "HIGH",
  "maxRetries": 3,
  "dependsOn": null
}
  • scheduledAt — optional. Defaults to now if not provided. Cannot be more than 60 seconds in the past.
  • priority — optional. Values: HIGH, MEDIUM, LOW.
  • maxRetries — number of retry attempts before marking as FAILED.
  • dependsOn — optional. ID of another task this task depends on. Task will be BLOCKED until the dependency completes.
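
The scheduledAt rule above (default to now, reject values more than 60 seconds in the past) could look roughly like this. The ScheduleRule class and resolve method are hypothetical names, and the choice of IllegalArgumentException is an assumption; the README does not say how the service reports the rejection.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Hypothetical sketch of the scheduledAt rule; names are illustrative only.
final class ScheduleRule {
    static final Duration MAX_PAST = Duration.ofSeconds(60);

    // Returns the effective schedule time for a new task.
    static LocalDateTime resolve(LocalDateTime requested, LocalDateTime now) {
        if (requested == null) {
            return now; // field omitted: run as soon as the dispatcher polls
        }
        if (requested.isBefore(now.minus(MAX_PAST))) {
            throw new IllegalArgumentException(
                    "scheduledAt is more than 60 seconds in the past");
        }
        return requested;
    }
}
```

Passing now in as a parameter, rather than calling LocalDateTime.now() inside, keeps the rule easy to test with fixed timestamps.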

Response: 200 OK

{
  "id": 1,
  "name": "Send report email",
  "status": "PENDING",
  "priority": "HIGH",
  "scheduledAt": "2026-07-02T10:00:00",
  "retryCount": 0,
  "maxRetries": 3,
  "dependsOn": null,
  "completedAt": null
}

Get All Tasks

GET /tasks

Response: 200 OK

[
  {
    "id": 1,
    "name": "Send report email",
    "status": "DONE",
    "retryCount": 0,
    "completedAt": "2026-07-02T10:00:07.123"
  }
]

Get Task by ID

GET /tasks/{id}

Response: 200 OK — task object as above.

Error: 404 NOT FOUND if task does not exist.


Cancel a Task

PUT /tasks/{id}/cancel

Only PENDING or BLOCKED tasks can be cancelled. Attempting to cancel a RUNNING, DONE, FAILED, or already CANCELLED task returns a conflict error.

Response: 200 OK

{
  "id": 1,
  "status": "CANCELLED"
}

Error: 409 CONFLICT if task cannot be cancelled.
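
The cancel rule is an allow-list, which can be sketched with an EnumSet. The CancelGuard class and its Status enum are illustrative names built from the statuses listed in this README; how the real service maps a rejected cancel to 409 CONFLICT is not shown here.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of an allow-list guard; not the project's cancelTask.
final class CancelGuard {
    enum Status { PENDING, BLOCKED, RUNNING, DONE, FAILED, CANCELLED }

    // Only the states named here may move to CANCELLED.
    private static final Set<Status> CANCELLABLE =
            EnumSet.of(Status.PENDING, Status.BLOCKED);

    static boolean canCancel(Status current) {
        return CANCELLABLE.contains(current);
    }

    public static void main(String[] args) {
        for (Status s : Status.values()) {
            System.out.println(s + " cancellable: " + canCancel(s));
        }
    }
}
```

If a new status is added to the enum later and nobody updates the guard, canCancel returns false for it, so the default is rejection.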


Task Lifecycle

PENDING ──► RUNNING ──► DONE
   ▲            │
   │            └──► FAILED (max retries exceeded)
   │
   └──────────────── PENDING (retry with backoff)

BLOCKED ──► PENDING (when dependency completes)
PENDING ──► CANCELLED
BLOCKED ──► CANCELLED
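
The BLOCKED to PENDING transition can be illustrated with a small in-memory sketch. The UnblockSketch class, its Task type, and the unblockDependents method are hypothetical; the real service presumably queries the database for dependants rather than scanning a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of unblocking dependants; names are illustrative only.
final class UnblockSketch {
    static final class Task {
        final long id;
        final Long dependsOn; // null when the task has no dependency
        String status;

        Task(long id, Long dependsOn, String status) {
            this.id = id;
            this.dependsOn = dependsOn;
            this.status = status;
        }
    }

    // Moves every BLOCKED task that waits on completedId back to PENDING
    // and returns the ids that were unblocked.
    static List<Long> unblockDependents(List<Task> tasks, long completedId) {
        List<Long> unblocked = new ArrayList<>();
        for (Task t : tasks) {
            boolean waitsOnCompleted = t.dependsOn != null && t.dependsOn == completedId;
            if (waitsOnCompleted && "BLOCKED".equals(t.status)) {
                t.status = "PENDING";
                unblocked.add(t.id);
            }
        }
        return unblocked;
    }
}
```

Checking the status as well as the dependency means a task that was cancelled while BLOCKED stays CANCELLED when its dependency finishes.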

Configuration

Key properties in application.properties:

Property                       Default  Description
server.port                    8080     Server port
spring.jpa.hibernate.ddl-auto  update   Schema strategy
spring.jpa.show-sql            true     Log SQL queries
Thread pool settings in TaskExecutorConfig.java:

Setting         Value
Core pool size  5
Max pool size   10
Queue capacity  20

Dispatcher poll interval: 5000ms (hardcoded via @Scheduled(fixedDelay = 5000)).
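
To see how the three thread pool values interact, here is a plain java.util.concurrent equivalent. It is not the contents of TaskExecutorConfig.java: the PoolSketch class is a hypothetical name, and the 60-second keep-alive is an assumption.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical plain-JDK equivalent of the pool settings listed above.
final class PoolSketch {
    static ThreadPoolExecutor newPool() {
        return new ThreadPoolExecutor(
                5,                              // core pool size
                10,                             // max pool size
                60, TimeUnit.SECONDS,           // idle keep-alive (assumed)
                new LinkedBlockingQueue<>(20)); // queue capacity
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = newPool();
        System.out.println("core=" + pool.getCorePoolSize()
                + " max=" + pool.getMaximumPoolSize()
                + " queue=" + pool.getQueue().remainingCapacity());
        pool.shutdown();
    }
}
```

With these numbers the pool starts up to 5 threads, then queues up to 20 tasks, and only grows towards 10 threads once the queue is full. With the default rejection policy, a submission that arrives while 10 threads are busy and 20 tasks are queued is rejected with a RejectedExecutionException.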


What I Learned

1. Multithreading is not just about performance — it's about failure isolation. Threads don't share your try/catch. Handing work to a thread pool via taskExecutor.execute(λ) breaks the call stack. A try/catch in dispatchTasks cannot catch exceptions thrown inside executeTasks because by the time the thread pool runs that lambda, dispatchTasks has already returned. The catch block I wrote there does nothing for failures happening on a pool thread. Every thread handles its own mess.

2. Optimistic locking requires discipline at every save, not just the first one. I assumed optimistic locking was more magic than it actually is. @Version doesn't protect you automatically everywhere; the version check only runs when an update is written, which in this code means each save() call. I learned to wrap each save() call individually and handle OptimisticLockException at the exact point it can occur, rather than one broad catch that can't distinguish between a concurrency conflict and a real task failure.

3. Allow-lists are safer than deny-lists for state transition guards. When writing cancelTask, I could've blocked known-bad states. Instead I only permitted known-safe ones. A forgotten entry in an allow-list silently rejects an unhandled state — safe failure. A forgotten entry in a deny-list silently permits it — dangerous failure.

4. Think through what the data model already gives you before adding complexity. I almost thought retry scheduling needed its own infrastructure. Then I realised: set scheduledAt = now + 2^retryCount, return the task to PENDING, and the existing dispatcher poll handles it on the next cycle. No new code needed. The right data model can eliminate entire features.

5. Lambda capture rules exist for a real reason. Took me time to understand and accept this one. Java lambdas capture the value of a local variable, not the variable itself, so the compiler only lets them capture locals that are effectively final. Because the loop reassigns t, a lambda that references t won't compile. The constraint isn't arbitrary compiler pedantry: if a lambda could see a variable that the main thread later pointed at a different object, a pool thread could end up executing the wrong task. Introducing Task x = t as a frozen alias after the last reassignment made the risk concrete, not just theoretical.

6. localhost doesn't mean what you think inside Docker. Inside Docker Compose, localhost inside a container refers to that container itself. Not the host machine, not a sibling container. Sibling containers communicate by service name. Switching the datasource URL from localhost to the Compose service name (postgres) was a one-word fix, but understanding why it was wrong took longer.
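
Lessons 1 and 5 can be seen together in one small sketch. It uses a plain ExecutorService instead of Spring's executor, and the PoolLessons class, its Task record, and the rule that even ids fail are all made up for the demonstration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo of alias capture and per-thread error handling.
final class PoolLessons {
    record Task(long id, int version) {
        // Stand-in for taskRepository.save(t): a copy with a bumped version.
        Task saved() { return new Task(id, version + 1); }
    }

    // Returns the lines logged by the pool threads (order is not guaranteed).
    static List<String> dispatch(List<Task> due) {
        List<String> log = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            for (Task t : due) {
                t = t.saved();    // reassigned, so t cannot be captured below
                final Task x = t; // assigned once; the lambda copies this value
                pool.execute(() -> {
                    try {
                        if (x.id() % 2 == 0) throw new IllegalStateException("boom");
                        log.add("DONE " + x.id() + "@v" + x.version());
                    } catch (RuntimeException e) {
                        // Handled on the pool thread; the outer catch never sees it.
                        log.add("FAILED " + x.id());
                    }
                });
            }
        } catch (RuntimeException e) {
            // Only reached for problems while submitting, e.g. a rejected task.
            log.add("DISPATCH ERROR");
        } finally {
            pool.shutdown();
        }
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return log;
    }
}
```

Replacing x with t inside the lambda makes the file fail to compile, which is the compiler enforcing lesson 5. Removing the inner try/catch leaves the failure for the pool thread's uncaught-exception handler, and nothing reaches the catch block in dispatch.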


About

Containerized distributed task scheduler made with Java, Spring Boot and Docker.
