This repository was archived by the owner on Jun 11, 2026. It is now read-only.

JoaoRuss0/cnv


Nature@Cloud

This project contains the following sub-projects:

  1. fractals - the Julia Set fractals workload
  2. dna - the DNA Genome matcher workload
  3. grayscott - the Gray-Scott reaction-diffusion workload
  4. common - code shared by all workloads (notably metrics: the DTO, how to save them, etc.)
  5. webserver - the web server exposing the functionality of the workloads
  6. lb - the load balancer and autoscaler

Pre-requisites

Have installed:

  • Java 11.0.31-amzn (point JAVA_HOME at it and verify the configuration by running java --version)
  • Packer
  • Terraform
  • AWS CLI
  • Maven

Make your own copy of each .example file inside the config/ directory, removing the ".example" suffix from the copy's name.

Local Setup (no AWS infrastructure, simply running the webserver locally with a local dockerized DynamoDB instance)

  1. Run mvn clean package

  2. Run docker compose

  3. Run LOCAL=TRUE java -cp <web-server-jar> -javaagent:webserver/target/webserver-1.0.0-SNAPSHOT-jar-with-dependencies.jar=<tool-name>:<comma-separated-app-list>:<output-file-name> <main-class>

    Example:

    LOCAL=TRUE java -cp webserver/target/webserver-1.0.0-SNAPSHOT-jar-with-dependencies.jar -javaagent:webserver/target/webserver-1.0.0-SNAPSHOT-jar-with-dependencies.jar=MetricsSnatcher:pt.ulisboa.tecnico.cnv.dna,pt.ulisboa.tecnico.cnv.grayscott,pt.ulisboa.tecnico.cnv.fractals pt.ulisboa.tecnico.cnv.webserver.WebServer
  4. Use the provided https://grupos.ist.utl.pt/meic-cnv/project/ interface
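
The -javaagent argument in step 3 has the shape <tool-name>:<comma-separated-app-list>:<output-file-name>. A minimal sketch of how such a string could be split is shown below. AgentArgs is a hypothetical class, not the project's actual parser, and it assumes the output file name is optional because the example command omits it.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical holder for the parsed -javaagent argument string. */
final class AgentArgs {
    final String toolName;
    final List<String> packages;
    final String outputFile; // null when the third field is omitted (assumption)

    private AgentArgs(String toolName, List<String> packages, String outputFile) {
        this.toolName = toolName;
        this.packages = packages;
        this.outputFile = outputFile;
    }

    /** Splits "<tool-name>:<comma-separated-app-list>[:<output-file-name>]". */
    static AgentArgs parse(String raw) {
        // Limit 3 keeps any ':' inside the output file name intact.
        String[] parts = raw.split(":", 3);
        if (parts.length < 2) {
            throw new IllegalArgumentException(
                "expected <tool-name>:<app-list>[:<output-file-name>]");
        }
        List<String> packages = Arrays.asList(parts[1].split(","));
        String outputFile = parts.length == 3 ? parts[2] : null;
        return new AgentArgs(parts[0], packages, outputFile);
    }
}
```

With the example command above, parse would yield the tool name MetricsSnatcher, three package names, and no output file.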

Remotely (On AWS)

./setup.sh # builds the loadbalancer and worker images, also launches both. Packer logs can be seen using: tail -f /tmp/packer-worker.log /tmp/packer-lb.log   
./teardown.sh # removes all resources previously setup 

Note: setup.sh writes output to logs/setup.log. teardown.sh writes logs/teardown.log. Tail with:

tail -f logs/setup.log or tail -f logs/teardown.log

Runtime LB logs live on the LB EC2 instance itself: /var/log/cnv-{lb,as,queue,placement,retrain}.log.


Architecture

Overview

The system is deployed on AWS and consists of the following components:

Client
  └─> Load Balancer EC2 (lb.jar, port 8080)                        [1 t3.micro]
        ├─> Cost Estimator (cache → DynamoDB → model → rough)
        ├─> ClassifiedQueue (LIGHT / MEDIUM / HEAVY / SUPER_HEAVY + retry lane)
        ├─> Dispatcher + WorkerPool (PACK / SPREAD placement)
        │     ├─> EC2 Workers (webserver, port 8000)             [1–3 t3.micro]
        │     │     └─> DynamoDB (WORKLOAD_METRICS table)
        │     └─> Lambda overflow for LIGHT requests when EC2 pool is full
        │           ├─ cnv-fractals  (java11, 512 MB)
        │           ├─ cnv-dna       (java11, 512 MB)
        │           └─ cnv-grayscott (java11, 512 MB)
        ├─> AutoScaler  (1-min tick, CloudWatch CPU EMA, scale 1..3)
        └─> Online retrainer (rebuilds cost models from DynamoDB on threshold)
  • Load Balancer (lb.jar): A custom Java HTTP server on port 8080. It receives every client request, estimates its cost, classifies it, picks a worker (or a Lambda function, for LIGHT requests that the EC2 pool has no capacity for), forwards it, and retries on failure. Forwarding uses HttpClient.sendAsync so that in-flight requests can be cancelled when a worker is evicted.
  • EC2 Workers: Each runs the webserver JAR instrumented with the Javassist agent (MetricsSnatcher). Workers are launched and terminated directly by the LB's AutoScaler via the EC2 RunInstances/TerminateInstances API. After completing each request, the worker writes the measured metrics to DynamoDB.
  • DynamoDB (WORKLOAD_METRICS): Stores per-request metrics. Read by the LB on startup (to warm the cost cache in case the LB ever shuts down/fails) and by the online retrainer.
  • Lambda functions: One per workload (cnv-fractals, cnv-dna, cnv-grayscott). Invoked synchronously by the LB only when (a) the request is classified LIGHT and (b) the EC2 pool has no capacity. Each Lambda call has its own 3-attempt retry with exponential backoff (750 ms × 2^n) for transient AWS-side errors.
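
The Lambda retry policy above (3 attempts, 750 ms × 2^n backoff) could look roughly like the following. This is a sketch, not the LB's code: BackoffRetry and its method names are hypothetical, the first retry is assumed to use n = 0, and the retryable error strings are the ones listed under "Request lifecycle".

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: retries a call with exponential backoff. */
final class BackoffRetry {
    static final int MAX_ATTEMPTS = 3;     // from the README
    static final long BASE_DELAY_MS = 750; // from the README

    /** Delay before retry n (assuming n starts at 0): 750, 1500, 3000 ms. */
    static long delayMs(int n) {
        return BASE_DELAY_MS << n;
    }

    /** Assumed check for transient errors, matched on the exception message. */
    static boolean isRetryable(Exception e) {
        String m = String.valueOf(e.getMessage());
        return m.contains("Rate Exceeded") || m.contains("429")
            || m.contains("503") || m.contains("Read timed out");
    }

    /** Runs the task, retrying retryable failures up to MAX_ATTEMPTS times. */
    static <T> T call(Callable<T> task) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (!isRetryable(e)) {
                    throw e; // non-transient errors are not retried
                }
                last = e;
                if (attempt < MAX_ATTEMPTS - 1) {
                    Thread.sleep(delayMs(attempt));
                }
            }
        }
        throw last; // attempts exhausted; the caller falls back to the EC2 queue
    }
}
```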

Request lifecycle

  1. Estimate. CostEstimator looks up the instCount and totalMemAccesses cost via, in order: in-memory PerfectCostCache (exact-parameter match) → DynamoDB → WorkloadModel (polynomial regression on log features) → rough heuristic fallback.
  2. Classify. Cost is bucketed into LIGHT (≤ 4 000), MEDIUM (≤ 8 000), HEAVY (≤ 16 000), SUPER_HEAVY (> 16 000). The constant WORKER_CAPACITY = 16 000 is the per-worker projected-load ceiling that placement respects.
  3. Lambda overflow (LIGHT only). If pool.hasCapacityFor(cost, 16 000) is false, the request goes to Lambda. On success it returns; on retryable failure (Rate Exceeded, 429, 503, Read timed out) it backs off and retries; on exhaustion it falls back to the EC2 queue.
  4. Enqueue. The request is added to ClassifiedQueue (per-class lane + separate retry lane). The caller blocks up to QUEUE_WAIT_BUDGET_MS = 5 min per attempt for an assignment.
  5. Place. WorkerPool.reserveForRequest picks an eligible READY worker in either PACK mode (consolidate onto the fewest workers, used when avgLoad < cap × 0.4) or SPREAD mode (balance across workers, used when avgLoad > cap × 0.6), with hysteresis in between. Workers whose projected load + cost would exceed WORKER_CAPACITY are skipped.
  6. Forward. HttpClient.sendAsync to http://<worker>:8000/<workload>?.... On cancellation/timeout/error the result is RETRYABLE and the request is re-enqueued (up to MAX_RETRIES = 3) with the failed worker added to an excluded set.
  7. Record. On success, the worker's response headers carry X-CNV-InstCount / X-CNV-TotalAccesses. The LB records these into the estimator (warming the cache); the worker has already persisted them to DynamoDB.
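
Steps 2 and 5 can be illustrated with a small sketch. The thresholds and WORKER_CAPACITY come from the list above. The type and method names (RequestClass, PlacementMode, Placement) are hypothetical, and reading the 0.4 / 0.6 factors as the switch points of the hysteresis band is an assumption.

```java
enum RequestClass { LIGHT, MEDIUM, HEAVY, SUPER_HEAVY }

enum PlacementMode { PACK, SPREAD }

/** Hypothetical sketch of cost bucketing and PACK/SPREAD mode selection. */
final class Placement {
    static final long WORKER_CAPACITY = 16_000; // per-worker projected-load ceiling

    /** Buckets an estimated cost using the thresholds from step 2. */
    static RequestClass classify(long cost) {
        if (cost <= 4_000) return RequestClass.LIGHT;
        if (cost <= 8_000) return RequestClass.MEDIUM;
        if (cost <= 16_000) return RequestClass.HEAVY;
        return RequestClass.SUPER_HEAVY;
    }

    /** A worker is eligible only if the request keeps it within capacity. */
    static boolean fits(long projectedLoad, long cost) {
        return projectedLoad + cost <= WORKER_CAPACITY;
    }

    /**
     * Assumed hysteresis: switch to PACK below 40% average load, to SPREAD
     * above 60%, and keep the current mode in between.
     */
    static PlacementMode nextMode(PlacementMode current, double avgLoad) {
        if (avgLoad < WORKER_CAPACITY * 0.4) return PlacementMode.PACK;
        if (avgLoad > WORKER_CAPACITY * 0.6) return PlacementMode.SPREAD;
        return current;
    }
}
```

Keeping the current mode inside the 40% to 60% band is what prevents the placement strategy from flapping when the average load hovers around a single threshold.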

AutoScaler

Runs a periodic tick (60 s) that:

  1. Drains any worker in DRAINING with active == 0 (terminate via EC2 API, remove from pool).
  2. If pool.countActive() < MIN_WORKERS (=1), force a scale-up.
  3. Reads CloudWatch CPUUtilization (1-min window, latest datapoint), averages it across all pool workers, and folds the result into an EMA with α = 0.6.
  4. Scale up if ema_cpu > 85% for ≥ 2 consecutive ticks AND queue is non-empty AND below MAX_WORKERS (=3). Launches 1 instance, or 2 if queued cost exceeds one worker's capacity.
  5. Scale down if ema_cpu < 20% for ≥ 5 consecutive ticks AND above MIN_WORKERS. Marks the least-loaded READY worker as DRAINING.
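
The tick logic in steps 2 to 5 can be sketched as below. ScalingDecider is a hypothetical class, not the AutoScaler itself. It assumes that α weights the newest sample, that the EMA is seeded with the first sample, that streak counters reset after an action, and that the number of launched instances is capped by the remaining headroom. Draining (step 1) and the CloudWatch call are left out.

```java
/** Hypothetical sketch of the AutoScaler's per-tick decision. */
final class ScalingDecider {
    enum Action { NONE, SCALE_UP, SCALE_DOWN }

    static final double ALPHA = 0.6;
    static final int MIN_WORKERS = 1;
    static final int MAX_WORKERS = 3;
    static final long WORKER_CAPACITY = 16_000;

    private double ema = Double.NaN; // NaN until the first sample arrives
    private int highTicks;           // consecutive ticks with EMA > 85%
    private int lowTicks;            // consecutive ticks with EMA < 20%

    double ema() { return ema; }

    /** avgCpu is the pool-wide average CPU utilisation in percent. */
    Action tick(double avgCpu, int activeWorkers, boolean queueNonEmpty) {
        ema = Double.isNaN(ema) ? avgCpu : ALPHA * avgCpu + (1 - ALPHA) * ema;
        highTicks = ema > 85.0 ? highTicks + 1 : 0;
        lowTicks = ema < 20.0 ? lowTicks + 1 : 0;

        if (activeWorkers < MIN_WORKERS) return Action.SCALE_UP; // forced
        if (highTicks >= 2 && queueNonEmpty && activeWorkers < MAX_WORKERS) {
            highTicks = 0;
            return Action.SCALE_UP;
        }
        if (lowTicks >= 5 && activeWorkers > MIN_WORKERS) {
            lowTicks = 0;
            return Action.SCALE_DOWN;
        }
        return Action.NONE;
    }

    /** 2 instances if queued cost exceeds one worker's capacity, else 1. */
    static int instancesToLaunch(long queuedCost, int activeWorkers) {
        int wanted = queuedCost > WORKER_CAPACITY ? 2 : 1;
        return Math.min(wanted, MAX_WORKERS - activeWorkers);
    }
}
```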

WorkerPool health & eviction (inside lb.jar)

  • Health check every 15 s against worker /health.
  • A non-fresh worker that reaches FAILURES_BEFORE_REMOVAL = 3 consecutive failed health checks is evicted immediately, regardless of its in-flight count (FAILURES_BEFORE_FORCE_EVICT is also 3). Fresh-booting workers get a grace of FRESH_FAILURES_BEFORE_REMOVAL = 16 failures (16 × 15 s ≈ 4 min) to cover cold start.
  • On eviction, Worker.cancelAllInFlight() cancels every in-flight CompletableFuture request for that worker. Each cancelled request returns RETRYABLE and is re-enqueued for a healthy worker, so no requests are dropped.
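
A rough sketch of the failure counting and cancellation described above follows. WorkerSketch is a hypothetical stand-in for the real Worker class, and treating a worker as "fresh" until its first successful health check is an assumption.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of per-worker health tracking and eviction. */
final class WorkerSketch {
    static final int FAILURES_BEFORE_REMOVAL = 3;
    static final int FRESH_FAILURES_BEFORE_REMOVAL = 16;

    private final Set<CompletableFuture<String>> inFlight =
        ConcurrentHashMap.newKeySet();
    private int consecutiveFailures;
    private boolean fresh = true; // assumed: fresh until the first healthy check

    /** Registers a forwarded request so it can be cancelled on eviction. */
    CompletableFuture<String> track(CompletableFuture<String> future) {
        inFlight.add(future);
        future.whenComplete((result, error) -> inFlight.remove(future));
        return future;
    }

    /** Records one health-check result; returns true if the worker should be evicted. */
    boolean recordHealthCheck(boolean healthy) {
        if (healthy) {
            consecutiveFailures = 0;
            fresh = false;
            return false;
        }
        consecutiveFailures++;
        int limit = fresh ? FRESH_FAILURES_BEFORE_REMOVAL : FAILURES_BEFORE_REMOVAL;
        return consecutiveFailures >= limit;
    }

    /** Cancels every in-flight request; returns how many were cancelled. */
    int cancelAllInFlight() {
        int cancelled = 0;
        // The concurrent key set tolerates removals made by whenComplete during iteration.
        for (CompletableFuture<String> future : inFlight) {
            if (future.cancel(true)) cancelled++;
        }
        return cancelled;
    }
}
```

A cancelled future completes with a CancellationException, which the forwarding code can map to RETRYABLE before re-enqueueing the request.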

AutoScaler Constants

Parameter          | Value
-------------------|--------------------------------------------------------
Tick interval      | 60 s
Min / Max workers  | 1 / 3
Scale-up trigger   | EMA CPU > 85% for 2 consecutive ticks, queue non-empty
Scale-down trigger | EMA CPU < 20% for 5 consecutive ticks
CPU EMA α          | 0.6

Lambda Functions

Function      | Runtime | Memory
--------------|---------|-------
cnv-fractals  | java11  | 512 MB
cnv-dna       | java11  | 512 MB
cnv-grayscott | java11  | 512 MB

Instrumentation

Bytecode instrumentation is performed at class-load time via a Javassist javaagent (MetricsSnatcher). It instruments the following workload classes:

  • pt.ulisboa.tecnico.cnv.dna.Dna
  • pt.ulisboa.tecnico.cnv.dna.DnaHtmlRenderer
  • pt.ulisboa.tecnico.cnv.grayscott.GrayScott
  • pt.ulisboa.tecnico.cnv.fractals.JuliaFractal

For these classes, we collect the number of instructions executed and the memory accesses/allocations. These are tracked per-thread, using ThreadLocal<Metrics> to avoid contention across concurrent requests.
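
The per-thread bookkeeping could be sketched as follows. MetricsStore and its method names are hypothetical; the fields mirror the instCount and totalAccesses metrics named elsewhere in this README. The injected bytecode would call something like addInstructions and addAccesses, and the request handler would read the totals once the request finishes.

```java
/** Counters collected for one request (field names follow the README's metrics). */
final class Metrics {
    long instCount;
    long totalAccesses;
}

/** Hypothetical per-thread store that instrumented code would call into. */
final class MetricsStore {
    // Each request-handling thread gets its own Metrics, so no locking is needed.
    private static final ThreadLocal<Metrics> CURRENT =
        ThreadLocal.withInitial(Metrics::new);

    static void addInstructions(long n) {
        CURRENT.get().instCount += n;
    }

    static void addAccesses(long n) {
        CURRENT.get().totalAccesses += n;
    }

    /** Returns this thread's counters and clears them for the next request. */
    static Metrics snapshotAndReset() {
        Metrics m = CURRENT.get();
        CURRENT.remove();
        return m;
    }
}
```

Resetting after each request matters when threads are pooled, since otherwise one request's counts would leak into the next request served by the same thread.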

GrayScott Smart Instrumentation

Because GrayScott's inner loops are too tight to instrument without significant overhead, an analytical formula is used instead of injecting counters:

pixelsProcessed = actualIterations × gridSize²
instCount      += pixelsProcessed × 274   (instructions per cell)
totalAccesses  += pixelsProcessed × 62    (memory accesses per cell)

These constants were derived empirically by profiling representative GrayScott runs.
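
As a worked example of the formula above, the following sketch computes both metrics for a run. GrayScottCost is a hypothetical helper; only the constants 274 and 62 come from this README, and the sample inputs are made up.

```java
/** Hypothetical helper applying the analytical GrayScott cost formula. */
final class GrayScottCost {
    static final long INSTRUCTIONS_PER_CELL = 274; // from the README
    static final long ACCESSES_PER_CELL = 62;      // from the README

    static long pixelsProcessed(long actualIterations, long gridSize) {
        return actualIterations * gridSize * gridSize;
    }

    static long instCount(long actualIterations, long gridSize) {
        return pixelsProcessed(actualIterations, gridSize) * INSTRUCTIONS_PER_CELL;
    }

    static long totalAccesses(long actualIterations, long gridSize) {
        return pixelsProcessed(actualIterations, gridSize) * ACCESSES_PER_CELL;
    }
}
```

For 100 iterations on a 64 × 64 grid, pixelsProcessed is 100 × 64² = 409 600, giving an instCount of 112 230 400 and a totalAccesses of 25 395 200.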

DynamoDB Schema (WORKLOAD_METRICS)

Attribute    | Type   | Description
-------------|--------|---------------------------------------------------
id           | String | UUID per request
type (PK)    | String | Workload type: FRACTAL, DNA, GRAYSCOTT
sortKey (SK) | String | hash(request_parameters)
parameters   | Map    | Request parameters (e.g. width, height, iterations)
metrics      | Map    | { instCount, totalAccesses }
startTime    | String | Timestamp
endTime      | String | Timestamp
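
The schema does not say which hash produces sortKey. The sketch below shows one possible construction, assuming SHA-256 over the parameters sorted by name so that the same request always maps to the same key. SortKeys is a hypothetical class and is not taken from the project.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sort-key builder: an order-independent hash of the parameters. */
final class SortKeys {
    static String sortKey(Map<String, String> parameters) {
        // TreeMap sorts by parameter name, so insertion order does not matter.
        StringBuilder canonical = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(parameters).entrySet()) {
            canonical.append(e.getKey()).append('=').append(e.getValue()).append('&');
        }
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(ex);
        }
    }
}
```

A deterministic key is what allows an exact-parameter lookup for a repeated request to find the earlier measurement.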
