This project contains the following sub-projects:
- `fractals`: the Julia Set fractals workload
- `dna`: the DNA genome matcher workload
- `grayscott`: the Gray-Scott reaction-diffusion workload
- `common`: code shared by all workloads (notably metrics: DTOs, how to save them, etc.)
- `webserver`: the web server exposing the functionality of the workloads
- `lb`: the load balancer and autoscaler

Have the following installed:
- Java 11.0.31-amzn (set it in `JAVA_HOME` and check that it is correctly configured by running `java --version`)
- Packer
- Terraform
- AWS CLI
- Maven

Make your own copy of each `.example` file inside the `config/` directory, removing the `.example` suffix from the copy's name.

Local Setup (no AWS infrastructure; runs the webserver locally against a local dockerized DynamoDB instance)
- Run `mvn clean package`
- Run `docker compose`
- Run `LOCAL=TRUE java -cp <web-server-jar> -javaagent:webserver/target/webserver-1.0.0-SNAPSHOT-jar-with-dependencies.jar=<tool-name>:<comma-separated-app-list>:<output-file-name> <main-class>`

  Example: `LOCAL=TRUE java -cp webserver/target/webserver-1.0.0-SNAPSHOT-jar-with-dependencies.jar -javaagent:webserver/target/webserver-1.0.0-SNAPSHOT-jar-with-dependencies.jar=MetricsSnatcher:pt.ulisboa.tecnico.cnv.dna,pt.ulisboa.tecnico.cnv.grayscott,pt.ulisboa.tecnico.cnv.fractals pt.ulisboa.tecnico.cnv.webserver.WebServer`
- Use the provided interface: https://grupos.ist.utl.pt/meic-cnv/project/interface

- `./setup.sh`: builds the load balancer and worker images and launches both. Packer logs can be followed with `tail -f /tmp/packer-worker.log /tmp/packer-lb.log`.
- `./teardown.sh`: removes all resources previously set up.

Note: `setup.sh` writes its output to `logs/setup.log` and `teardown.sh` writes to `logs/teardown.log`. Tail them with `tail -f logs/setup.log` or `tail -f logs/teardown.log`.

Runtime LB logs live on the LB EC2 instance itself: `/var/log/cnv-{lb,as,queue,placement,retrain}.log`.

The system is deployed on AWS and consists of the following components:
Client
└─> Load Balancer EC2 (lb.jar, port 8080) [1 t3.micro]
├─> Cost Estimator (cache → DynamoDB → model → rough)
├─> ClassifiedQueue (LIGHT / MEDIUM / HEAVY / SUPER_HEAVY + retry lane)
├─> Dispatcher + WorkerPool (PACK / SPREAD placement)
│ ├─> EC2 Workers (webserver, port 8000) [1–3 t3.micro]
│ │ └─> DynamoDB (WORKLOAD_METRICS table)
│ └─> Lambda overflow for LIGHT requests when EC2 pool is full
│ ├─ cnv-fractals (java11, 512 MB)
│ ├─ cnv-dna (java11, 512 MB)
│ └─ cnv-grayscott (java11, 512 MB)
├─> AutoScaler (1-min tick, CloudWatch CPU EMA, scale 1..3)
└─> Online retrainer (rebuilds cost models from DynamoDB on threshold)

- Load Balancer (`lb.jar`): A custom Java HTTP server on port 8080. It receives every client request, estimates its cost, classifies it, picks a worker (or a Lambda function, for LIGHT requests when the EC2 pool has no remaining capacity), forwards it, and retries on failure. Forwarding uses `HttpClient.sendAsync` so in-flight requests can be cancelled when a worker is evicted.
- EC2 Workers: Each runs the webserver JAR instrumented with the Javassist agent (`MetricsSnatcher`). Workers are launched and terminated directly by the LB's AutoScaler via the EC2 RunInstances/TerminateInstances API. After completing each request, the worker writes the measured metrics to DynamoDB.
- DynamoDB (`WORKLOAD_METRICS`): Stores per-request metrics. It is read by the LB on startup (to warm the cost cache after an LB shutdown or failure) and by the online retrainer.
- Lambda functions: One per workload (`cnv-fractals`, `cnv-dna`, `cnv-grayscott`). Invoked synchronously by the LB only when (a) the request is classified LIGHT and (b) the EC2 pool has no capacity. Each Lambda call has its own 3-attempt retry with exponential backoff (750 ms × 2^n) for transient AWS-side errors.
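
The Lambda retry policy described above can be sketched as follows. This is a minimal illustration, not the project's code: the class name `BackoffRetry`, the message-based `isRetryable` check, and the choice to start `n` at 0 are assumptions. Only the 3 attempts, the 750 ms base, and the list of retryable error strings come from this document.

```java
import java.util.concurrent.Callable;

final class BackoffRetry {
    static final int MAX_ATTEMPTS = 3;     // from the text: 3 attempts
    static final long BASE_DELAY_MS = 750; // from the text: 750 ms × 2^n

    /** Delay after the n-th failed attempt; n starting at 0 is an assumption. */
    static long delayMs(int n) {
        return BASE_DELAY_MS << n; // 750, 1500, 3000, ...
    }

    /** Hypothetical check for transient errors, matching on the message text. */
    static boolean isRetryable(Exception e) {
        String m = String.valueOf(e.getMessage());
        return m.contains("Rate Exceeded") || m.contains("429")
            || m.contains("503") || m.contains("Read timed out");
    }

    /** Runs the action, sleeping between retryable failures. */
    static <T> T call(Callable<T> action) throws Exception {
        Exception last = null;
        for (int n = 0; n < MAX_ATTEMPTS; n++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (!isRetryable(e)) throw e; // non-transient: fail immediately
                last = e;
                if (n < MAX_ATTEMPTS - 1) Thread.sleep(delayMs(n));
            }
        }
        throw last; // attempts exhausted; the caller would fall back to the EC2 queue
    }
}
```

With these assumptions, a call that keeps failing with a retryable error waits 750 ms and then 1500 ms before giving up on the third attempt.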

- Estimate. `CostEstimator` looks up the `instCount` and `totalMemAccesses` cost via, in order: in-memory `PerfectCostCache` (exact-parameter match) → DynamoDB → `WorkloadModel` (polynomial regression on log features) → `rough` heuristic fallback.
- Classify. Cost is bucketed into LIGHT (≤ 4 000), MEDIUM (≤ 8 000), HEAVY (≤ 16 000), SUPER_HEAVY (> 16 000). The constant `WORKER_CAPACITY = 16 000` is the per-worker projected-load ceiling that placement respects.
- Lambda overflow (LIGHT only). If `pool.hasCapacityFor(cost, 16 000)` is false, the request goes to Lambda. On success it returns; on a retryable failure (`Rate Exceeded`, `429`, `503`, `Read timed out`) it backs off and retries; once the retries are exhausted it falls back to the EC2 queue.
- Enqueue. The request is added to `ClassifiedQueue` (one lane per class plus a separate retry lane). The caller blocks for up to `QUEUE_WAIT_BUDGET_MS = 5 min` per attempt while waiting for an assignment.
- Place. `WorkerPool.reserveForRequest` picks an eligible READY worker in either PACK mode (consolidate onto the fewest workers, when `avgLoad < cap × 0.4`) or SPREAD mode (balance across workers, when `avgLoad > cap × 0.6`), with hysteresis between the two thresholds. Workers whose projected load + cost would exceed `WORKER_CAPACITY` are skipped.
- Forward. `HTTP.sendAsync` to `http://<worker>:8000/<workload>?...`. On cancellation, timeout, or error the result is `RETRYABLE` and the request is re-enqueued (up to `MAX_RETRIES = 3`) with the failed worker added to an excluded set.
- Record. On success, the worker's response headers carry `X-CNV-InstCount` / `X-CNV-TotalAccesses`. The LB records these into the estimator (warming the cache); by this point the worker has already persisted them to DynamoDB.
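
The Classify and Place steps above can be sketched as follows. The thresholds, `WORKER_CAPACITY`, and the 0.4 / 0.6 factors come from this document; the class and method names (`CostClassifier`, `fits`, `PlacementMode`) and the initial PACK mode are assumptions made for illustration.

```java
final class CostClassifier {
    enum CostClass { LIGHT, MEDIUM, HEAVY, SUPER_HEAVY }

    static final long WORKER_CAPACITY = 16_000;

    /** Buckets an estimated cost using the thresholds listed above. */
    static CostClass classify(long cost) {
        if (cost <= 4_000) return CostClass.LIGHT;
        if (cost <= 8_000) return CostClass.MEDIUM;
        if (cost <= 16_000) return CostClass.HEAVY;
        return CostClass.SUPER_HEAVY;
    }

    /** True if a worker with this projected load can accept the request. */
    static boolean fits(long projectedLoad, long cost) {
        return projectedLoad + cost <= WORKER_CAPACITY;
    }
}

final class PlacementMode {
    enum Mode { PACK, SPREAD }

    private Mode mode = Mode.PACK; // starting in PACK is an assumption

    /** Switches mode only outside the 0.4..0.6 band, which gives hysteresis. */
    Mode update(double avgLoad, double cap) {
        if (avgLoad < cap * 0.4) {
            mode = Mode.PACK;
        } else if (avgLoad > cap * 0.6) {
            mode = Mode.SPREAD;
        }
        // Between the two thresholds the current mode is kept.
        return mode;
    }
}
```

The gap between the two thresholds is what prevents the pool from flipping between modes when the average load hovers around a single cut-off.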

The AutoScaler runs a periodic tick (60 s) that:
- Drains any worker in DRAINING with `active == 0` (terminates it via the EC2 API and removes it from the pool).
- Forces a scale-up if `pool.countActive() < MIN_WORKERS` (= 1).
- Reads CloudWatch `CPUUtilization` (1-min window, latest datapoint), averages it across all pool workers, and folds it into an EMA with `α = 0.6`.
- Scales up if `ema_cpu > 85%` for ≥ 2 consecutive ticks AND the queue is non-empty AND the pool is below MAX_WORKERS (= 3). It launches 1 instance, or 2 if the queued cost exceeds one worker's capacity.
- Scales down if `ema_cpu < 20%` for ≥ 5 consecutive ticks AND the pool is above MIN_WORKERS. It marks the least-loaded READY worker as DRAINING.
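
The scaling decision in the tick above can be sketched as follows. The thresholds, tick counts, `α`, and worker limits come from this document; the class `ScalingPolicy`, its method names, seeding the EMA with the first sample, and applying `α` to the newest sample are assumptions. Draining and the "launch 2 instances" case are left out.

```java
final class ScalingPolicy {
    enum Action { NONE, SCALE_UP, SCALE_DOWN }

    static final double ALPHA = 0.6;
    static final int MIN_WORKERS = 1;
    static final int MAX_WORKERS = 3;

    private double ema = Double.NaN; // NaN means no sample has been seen yet
    private int highTicks = 0;
    private int lowTicks = 0;

    /** Folds one averaged CPU sample (in percent) into the EMA and decides. */
    Action tick(double avgCpu, int workers, boolean queueNonEmpty) {
        // Assumption: alpha weights the newest sample; the first sample seeds the EMA.
        ema = Double.isNaN(ema) ? avgCpu : ALPHA * avgCpu + (1 - ALPHA) * ema;
        highTicks = ema > 85.0 ? highTicks + 1 : 0;
        lowTicks = ema < 20.0 ? lowTicks + 1 : 0;

        if (workers < MIN_WORKERS) return Action.SCALE_UP; // forced scale-up
        if (highTicks >= 2 && queueNonEmpty && workers < MAX_WORKERS) return Action.SCALE_UP;
        if (lowTicks >= 5 && workers > MIN_WORKERS) return Action.SCALE_DOWN;
        return Action.NONE;
    }

    double ema() {
        return ema;
    }
}
```

Requiring several consecutive ticks on top of the EMA means a single CPU spike or dip does not trigger a launch or a drain.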

- A health check runs every 15 s against each worker's `/health` endpoint.
- A non-fresh worker that reaches `FAILURES_BEFORE_REMOVAL = 3` consecutive failures is evicted immediately (regardless of in-flight count, via `FAILURES_BEFORE_FORCE_EVICT = 3`). Fresh-booting workers get `FRESH_FAILURES_BEFORE_REMOVAL = 16` failures of grace (~4 min at the 15 s interval) to cover cold start.
- On eviction, `Worker.cancelAllInFlight()` cancels every in-flight `CompletableFuture` request for that worker. Each cancelled request returns `RETRYABLE` and is re-enqueued for a healthy worker, so no requests are dropped.

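
The failure counting and in-flight cancellation described above can be sketched as follows. The two failure limits and the name `cancelAllInFlight` come from this document; `WorkerSketch`, `track`, `recordHealthCheck`, and the fixed `fresh` flag are hypothetical simplifications, not the project's `Worker` class.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

final class WorkerSketch {
    static final int FAILURES_BEFORE_REMOVAL = 3;
    static final int FRESH_FAILURES_BEFORE_REMOVAL = 16;

    private final Set<CompletableFuture<String>> inFlight = ConcurrentHashMap.newKeySet();
    private final boolean fresh; // assumption: fixed at construction for simplicity
    private int consecutiveFailures = 0;

    WorkerSketch(boolean fresh) {
        this.fresh = fresh;
    }

    /** Registers a request future and forgets it once it completes. */
    CompletableFuture<String> track(CompletableFuture<String> f) {
        inFlight.add(f);
        f.whenComplete((result, error) -> inFlight.remove(f));
        return f;
    }

    /** Records one health-check result; returns true if the worker should be evicted. */
    boolean recordHealthCheck(boolean ok) {
        consecutiveFailures = ok ? 0 : consecutiveFailures + 1;
        int limit = fresh ? FRESH_FAILURES_BEFORE_REMOVAL : FAILURES_BEFORE_REMOVAL;
        return consecutiveFailures >= limit;
    }

    /** Cancels every in-flight request; returns how many were cancelled. */
    int cancelAllInFlight() {
        int cancelled = 0;
        // The concurrent set tolerates removals made by whenComplete during iteration.
        for (CompletableFuture<String> f : inFlight) {
            if (f.cancel(true)) cancelled++;
        }
        return cancelled;
    }
}
```

A caller waiting on a cancelled future sees a `CancellationException`, which a forwarding layer could map to a retryable outcome and re-enqueue.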
| Parameter | Value |
|---|---|
| Tick interval | 60 s |
| Min / Max workers | 1 / 3 |
| Scale-up trigger | EMA CPU > 85% for 2 consecutive ticks, queue non-empty |
| Scale-down trigger | EMA CPU < 20% for 5 consecutive ticks |
| CPU EMA α | 0.6 |

| Function | Runtime | Memory |
|---|---|---|
| `cnv-fractals` | java11 | 512 MB |
| `cnv-dna` | java11 | 512 MB |
| `cnv-grayscott` | java11 | 512 MB |

Bytecode instrumentation is performed at class-load time via a Javassist javaagent (MetricsSnatcher). It instruments the following workload classes:
- `pt.ulisboa.tecnico.cnv.dna.Dna`
- `pt.ulisboa.tecnico.cnv.dna.DnaHtmlRenderer`
- `pt.ulisboa.tecnico.cnv.grayscott.GrayScott`
- `pt.ulisboa.tecnico.cnv.fractals.JuliaFractal`

For these classes, we collect the number of instructions executed and the memory accesses/allocations.
These are tracked per-thread, using ThreadLocal<Metrics> to avoid contention across concurrent requests.
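
A minimal sketch of the per-thread tracking idea, assuming a `Metrics` holder with two counters. The names `MetricsSketch`, `addInstructions`, `addAccesses`, and `snapshotAndReset` are hypothetical. The real agent injects its counter updates through Javassist, which is not shown here.

```java
final class MetricsSketch {
    /** Hypothetical per-request counters. */
    static final class Metrics {
        long instCount;
        long totalAccesses;
    }

    // Each thread lazily gets its own Metrics instance, so no locking is needed.
    private static final ThreadLocal<Metrics> CURRENT = ThreadLocal.withInitial(Metrics::new);

    static void addInstructions(long n) {
        CURRENT.get().instCount += n;
    }

    static void addAccesses(long n) {
        CURRENT.get().totalAccesses += n;
    }

    /** Returns this thread's counters and clears them for the next request. */
    static Metrics snapshotAndReset() {
        Metrics m = CURRENT.get();
        CURRENT.remove();
        return m;
    }
}
```

Because request-handling threads are usually pooled, resetting after each request matters: otherwise counts from one request would leak into the next one served by the same thread.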
Because GrayScott's inner loops are too tight to instrument without significant overhead, an analytical formula is used instead of injecting counters:
pixelsProcessed = actualIterations × gridSize²
instCount += pixelsProcessed × 274 (instructions per cell)
totalAccesses += pixelsProcessed × 62 (memory accesses per cell)
These constants were derived empirically by profiling representative GrayScott runs.
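
As a worked example of the formula above, the following sketch computes both metrics for a run of 100 iterations on a 64 × 64 grid. The constants 274 and 62 come from this document; the class `GrayScottCost`, its method names, and the input values are illustrative assumptions.

```java
final class GrayScottCost {
    static final long INSTRUCTIONS_PER_CELL = 274;
    static final long ACCESSES_PER_CELL = 62;

    /** pixelsProcessed = actualIterations × gridSize² */
    static long pixelsProcessed(long actualIterations, long gridSize) {
        return actualIterations * gridSize * gridSize;
    }

    static long instCount(long actualIterations, long gridSize) {
        return pixelsProcessed(actualIterations, gridSize) * INSTRUCTIONS_PER_CELL;
    }

    static long totalAccesses(long actualIterations, long gridSize) {
        return pixelsProcessed(actualIterations, gridSize) * ACCESSES_PER_CELL;
    }

    public static void main(String[] args) {
        System.out.println(pixelsProcessed(100, 64)); // prints 409600
        System.out.println(instCount(100, 64));       // prints 112230400
        System.out.println(totalAccesses(100, 64));   // prints 25395200
    }
}
```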

| Attribute | Type | Description |
|---|---|---|
| `id` | String | UUID per request |
| `type` (PK) | String | Workload type: FRACTAL, DNA, GRAYSCOTT |
| `sortKey` (SK) | String | hash(request_parameters) |
| `parameters` | Map | Request parameters (e.g. width, height, iterations) |
| `metrics` | Map | { instCount, totalAccesses } |
| `startTime` | String | Timestamp |
| `endTime` | String | Timestamp |
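
The table does not say which hash produces `sortKey`. One way to obtain a deterministic key from the request parameters is sketched below, assuming SHA-256 over the parameters sorted by name. The class `SortKeySketch`, the `key=value&` canonical form, and the choice of SHA-256 are all assumptions made for illustration, not the project's scheme.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class SortKeySketch {
    /** Hashes the parameters in sorted-key order so insertion order is irrelevant. */
    static String sortKey(Map<String, String> parameters) {
        StringBuilder canonical = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(parameters).entrySet()) {
            canonical.append(e.getKey()).append('=').append(e.getValue()).append('&');
        }
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(ex);
        }
    }
}
```

Sorting before hashing means two requests with the same parameters map to the same key regardless of query-string order, which is the property an exact-parameter cache lookup relies on.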