GitHub - JoaoRuss0/cnv: Cloud Computing

Nature@Cloud

This project contains the following sub-projects:

fractals - the Julia Set fractals workload
dna - the DNA Genome matcher workload
grayscott - the Gray-Scott reaction-diffusion workload
common - code shared by all workloads (mainly metrics: the DTOs, how they are saved, etc.)
webserver - the web server exposing the functionality of the workloads
lb - the load balancer and autoscaler

Pre-requisites

Have installed:

Java 11.0.31-amzn (set JAVA_HOME to it and confirm it is correctly configured by running java --version)
Packer
Terraform
AWS CLI
Maven

Make your own copy of the .example files inside the config/ directory - simply remove the ".example" part when naming them.

Local Setup (no AWS infrastructure; runs the webserver locally against a local dockerized DynamoDB instance)

Run mvn clean package
Run docker compose

Run LOCAL=TRUE java -cp <web-server-jar> -javaagent:webserver/target/webserver-1.0.0-SNAPSHOT-jar-with-dependencies.jar=<tool-name>:<comma-separated-app-list>:<output-file-name> <main-class>

Example:

LOCAL=TRUE java -cp webserver/target/webserver-1.0.0-SNAPSHOT-jar-with-dependencies.jar -javaagent:webserver/target/webserver-1.0.0-SNAPSHOT-jar-with-dependencies.jar=MetricsSnatcher:pt.ulisboa.tecnico.cnv.dna,pt.ulisboa.tecnico.cnv.grayscott,pt.ulisboa.tecnico.cnv.fractals pt.ulisboa.tecnico.cnv.webserver.WebServer
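The agent argument after `=` is a colon-separated string of the form `<tool-name>:<comma-separated-app-list>:<output-file-name>`. The following is a minimal sketch of how such a string could be split, not the project's actual parser. The class name `AgentArgsSketch` is hypothetical, and treating the output file as optional is an assumption based on the example above, which omits it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical parser for the -javaagent argument string; not the project's actual code.
final class AgentArgsSketch {
    final String toolName;
    final List<String> packages;
    final String outputFile; // null when the third field is omitted (assumption)

    private AgentArgsSketch(String toolName, List<String> packages, String outputFile) {
        this.toolName = toolName;
        this.packages = packages;
        this.outputFile = outputFile;
    }

    // Splits "<tool-name>:<comma-separated-app-list>[:<output-file-name>]".
    static AgentArgsSketch parse(String agentArgs) {
        String[] parts = agentArgs.split(":", 3);
        if (parts.length < 2) {
            throw new IllegalArgumentException("expected <tool>:<packages>[:<output>]");
        }
        List<String> packages = Arrays.asList(parts[1].split(","));
        String output = parts.length == 3 ? parts[2] : null;
        return new AgentArgsSketch(parts[0], packages, output);
    }
}
```

For the example command, `parse` would yield the tool name `MetricsSnatcher` and the three workload packages.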

Use the provided https://grupos.ist.utl.pt/meic-cnv/project/ interface

Remotely (On AWS)

./setup.sh # builds the loadbalancer and worker images, also launches both. Packer logs can be seen using: tail -f /tmp/packer-worker.log /tmp/packer-lb.log   
./teardown.sh # removes all resources previously setup

Note: setup.sh writes output to logs/setup.log. teardown.sh writes logs/teardown.log. Tail with:

tail -f logs/setup.log or tail -f logs/teardown.log

Runtime LB logs live on the LB EC2 itself: /var/log/cnv-{lb,as,queue,placement,retrain}.log.

Architecture

Overview

The system is deployed on AWS and consists of the following components:

Client
  └─> Load Balancer EC2 (lb.jar, port 8080)                        [1 t3.micro]
        ├─> Cost Estimator (cache → DynamoDB → model → rough)
        ├─> ClassifiedQueue (LIGHT / MEDIUM / HEAVY / SUPER_HEAVY + retry lane)
        ├─> Dispatcher + WorkerPool (PACK / SPREAD placement)
        │     ├─> EC2 Workers (webserver, port 8000)             [1–3 t3.micro]
        │     │     └─> DynamoDB (WORKLOAD_METRICS table)
        │     └─> Lambda overflow for LIGHT requests when EC2 pool is full
        │           ├─ cnv-fractals  (java11, 512 MB)
        │           ├─ cnv-dna       (java11, 512 MB)
        │           └─ cnv-grayscott (java11, 512 MB)
        ├─> AutoScaler  (1-min tick, CloudWatch CPU EMA, scale 1..3)
        └─> Online retrainer (rebuilds cost models from DynamoDB on threshold)

Load Balancer (lb.jar): A custom Java HTTP server on port 8080. It receives every client request, estimates its cost, classifies it, picks a worker (or Lambda, for LIGHT requests that would push the workers past their maximum cost), forwards it, and retries on failure. Forwarding uses HttpClient.sendAsync so in-flight requests can be cancelled when a worker is evicted.
EC2 Workers: Each runs the webserver JAR instrumented with the Javassist agent (MetricsSnatcher). Launched and terminated directly by the LB's AutoScaler via the EC2 RunInstances/TerminateInstances API. After completing each request, the worker writes the measured metrics to DynamoDB.
DynamoDB (WORKLOAD_METRICS): Stores per-request metrics. Read by the LB on startup (to warm the cost cache in case the LB ever shuts down/fails) and by the online retrainer.
Lambda functions: One per workload (cnv-fractals, cnv-dna, cnv-grayscott). Invoked synchronously by the LB only when (a) the request is classified LIGHT and (b) the EC2 pool has no capacity. Lambda call has its own 3-attempt retry with exponential backoff (750 ms × 2^n) for transient AWS-side errors.
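The Lambda retry schedule (3 attempts, 750 ms × 2^n) can be sketched as below. This is an illustration under assumptions, not the LB's actual code: the class and method names are hypothetical, n is assumed to start at 0 (giving delays of 750 ms and 1 500 ms between the three attempts), and the sketch retries on any exception, whereas the real LB retries only on transient AWS-side errors.

```java
import java.util.concurrent.Callable;

// Hypothetical helper; only the attempt count and the 750 ms base come from the text.
final class LambdaRetrySketch {
    static final int MAX_ATTEMPTS = 3;
    static final long BASE_BACKOFF_MS = 750;

    // Delay after failed attempt n (assumption: n starts at 0), i.e. 750 * 2^n ms.
    static long backoffMs(int n) {
        return BASE_BACKOFF_MS << n;
    }

    // Runs the call up to MAX_ATTEMPTS times, sleeping between failed attempts.
    static <T> T callWithRetry(Callable<T> call) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // a real implementation would rethrow non-transient errors here
                if (attempt < MAX_ATTEMPTS - 1) {
                    Thread.sleep(backoffMs(attempt));
                }
            }
        }
        throw last; // all attempts exhausted
    }
}
```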

Request lifecycle

Estimate. CostEstimator looks up the instCount and totalMemAccesses cost via, in order: in-memory PerfectCostCache (exact-parameter match) → DynamoDB → WorkloadModel (polynomial regression on log features) → rough heuristic fallback.
Classify. Cost is bucketed into LIGHT (≤ 4 000), MEDIUM (≤ 8 000), HEAVY (≤ 16 000), SUPER_HEAVY (> 16 000). The constant WORKER_CAPACITY = 16 000 is the per-worker projected-load ceiling that placement respects.
Lambda overflow (LIGHT only). If pool.hasCapacityFor(cost, 16 000) is false, the request goes to Lambda. On success it returns; on retryable failure (Rate Exceeded, 429, 503, Read timed out) it backs off and retries; on exhaustion it falls back to the EC2 queue.
Enqueue. Request is added to ClassifiedQueue (per-class lane + separate retry lane). Caller blocks up to QUEUE_WAIT_BUDGET_MS = 5 min per attempt for an assignment.
Place. WorkerPool.reserveForRequest picks an eligible READY worker in either PACK mode (consolidate onto fewest workers - avgLoad < cap × 0.4) or SPREAD mode (balance across workers - avgLoad > cap × 0.6), with hysteresis. Workers whose projected load + cost would exceed WORKER_CAPACITY are skipped.
Forward. HttpClient.sendAsync to http://<worker>:8000/<workload>?.... On cancellation, timeout, or error the attempt is marked RETRYABLE and the request is re-enqueued (up to MAX_RETRIES = 3) with the failed worker added to an excluded set.
Record. On success, the worker's response headers carry X-CNV-InstCount / X-CNV-TotalAccesses. The LB records these into the estimator (warming the cache) and the worker has already persisted them to DynamoDB.
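The Classify and Place steps above can be sketched as follows. Only the thresholds (4 000 / 8 000 / 16 000, and the 0.4 / 0.6 capacity fractions) come from this README; the type and method names are hypothetical, and keeping the current mode between the two fractions is an assumed reading of "with hysteresis".

```java
// Hypothetical names; thresholds taken from the lifecycle description.
enum CostClass { LIGHT, MEDIUM, HEAVY, SUPER_HEAVY }

enum PlacementMode { PACK, SPREAD }

final class LifecycleSketch {
    static final long WORKER_CAPACITY = 16_000;

    // Buckets an estimated cost into its request class.
    static CostClass classify(long cost) {
        if (cost <= 4_000) return CostClass.LIGHT;
        if (cost <= 8_000) return CostClass.MEDIUM;
        if (cost <= 16_000) return CostClass.HEAVY;
        return CostClass.SUPER_HEAVY;
    }

    // Hysteresis (assumed): between 40% and 60% of capacity the current mode is kept,
    // so the pool does not flip modes on small load changes.
    static PlacementMode nextMode(PlacementMode current, double avgLoad) {
        if (avgLoad < WORKER_CAPACITY * 0.4) return PlacementMode.PACK;
        if (avgLoad > WORKER_CAPACITY * 0.6) return PlacementMode.SPREAD;
        return current;
    }

    // A worker is eligible only if its projected load plus the request cost
    // does not exceed WORKER_CAPACITY.
    static boolean fits(long projectedLoad, long cost) {
        return projectedLoad + cost <= WORKER_CAPACITY;
    }
}
```

With these numbers, a worker at projected load 12 000 can still accept a LIGHT request of cost 4 000, but not one of cost 4 001.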

AutoScaler

Runs a periodic tick (60 s) that:

Drains any worker in DRAINING with active == 0 (terminate via EC2 API, remove from pool).
If pool.countActive() < MIN_WORKERS (=1), force a scale-up.
Reads CloudWatch CPUUtilization (1-min window, latest datapoint) averaged across all pool workers and folds it into an EMA with α = 0.6.
Scale up if ema_cpu > 85% for ≥ 2 consecutive ticks AND queue is non-empty AND below MAX_WORKERS (=3). Launches 1 instance, or 2 if queued cost exceeds one worker's capacity.
Scale down if ema_cpu < 20% for ≥ 5 consecutive ticks AND above MIN_WORKERS. Marks the least-loaded READY worker as DRAINING.
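The scale-up and scale-down rules above can be sketched as a small state machine. This is a simplified illustration, not the project's AutoScaler: the class name is hypothetical, the sketch assumes α weights the newest sample and that the first sample seeds the EMA, and it leaves out draining and the choice between launching 1 or 2 instances.

```java
// Hypothetical sketch of one tick's decision; thresholds come from the list above.
final class ScalingDecisionSketch {
    enum Action { NONE, SCALE_UP, SCALE_DOWN }

    static final double ALPHA = 0.6;
    static final int MIN_WORKERS = 1;
    static final int MAX_WORKERS = 3;

    private double emaCpu = Double.NaN;
    private int highTicks = 0; // consecutive ticks with EMA > 85%
    private int lowTicks = 0;  // consecutive ticks with EMA < 20%

    Action tick(double avgCpu, int activeWorkers, boolean queueNonEmpty) {
        // Assumption: alpha weights the newest sample; the first sample seeds the EMA.
        emaCpu = Double.isNaN(emaCpu) ? avgCpu : ALPHA * avgCpu + (1 - ALPHA) * emaCpu;
        highTicks = emaCpu > 85.0 ? highTicks + 1 : 0;
        lowTicks = emaCpu < 20.0 ? lowTicks + 1 : 0;

        if (activeWorkers < MIN_WORKERS) return Action.SCALE_UP; // forced scale-up
        if (highTicks >= 2 && queueNonEmpty && activeWorkers < MAX_WORKERS) {
            return Action.SCALE_UP;
        }
        if (lowTicks >= 5 && activeWorkers > MIN_WORKERS) return Action.SCALE_DOWN;
        return Action.NONE;
    }

    double emaCpu() { return emaCpu; }
}
```

Under these assumptions, samples of 100% then 50% give an EMA of 0.6 × 50 + 0.4 × 100 = 70%.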

WorkerPool health & eviction (inside lb.jar)

Health check every 15 s against worker /health.
Non-fresh worker hitting FAILURES_BEFORE_REMOVAL = 3 consecutive failures is evicted immediately (regardless of in-flight count, via FAILURES_BEFORE_FORCE_EVICT = 3). Fresh-booting workers get FRESH_FAILURES_BEFORE_REMOVAL = 16 failures of grace (~4 min) for cold-start.
On eviction, Worker.cancelAllInFlight() cancels every in-flight CompletableFuture request for that worker. Each cancelled request returns RETRYABLE and is re-enqueued for a healthy worker, so no requests are dropped.
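The bookkeeping behind cancelling in-flight requests can be sketched as below. `WorkerSketch` is a hypothetical stand-in for the Worker class, the `String` payload type is an assumption, and the sketch covers only tracking and cancelling the futures, not the re-enqueue step.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the Worker class mentioned above.
final class WorkerSketch {
    private final Set<CompletableFuture<String>> inFlight = ConcurrentHashMap.newKeySet();

    // Registers a request future and drops it from the set once it settles
    // (normally, exceptionally, or by cancellation).
    CompletableFuture<String> track(CompletableFuture<String> future) {
        inFlight.add(future);
        future.whenComplete((result, error) -> inFlight.remove(future));
        return future;
    }

    // Cancels every tracked future and returns how many were actually cancelled.
    int cancelAllInFlight() {
        int cancelled = 0;
        for (CompletableFuture<String> future : inFlight) {
            if (future.cancel(true)) cancelled++;
        }
        return cancelled;
    }

    int active() { return inFlight.size(); }
}
```

A caller waiting on a cancelled future sees a CancellationException, which is the point where a load balancer could mark the attempt as retryable.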

AutoScaler Constants

| Parameter | Value |
| --- | --- |
| Tick interval | 60 s |
| Min / Max workers | 1 / 3 |
| Scale-up trigger | EMA CPU > 85% for 2 consecutive ticks, queue non-empty |
| Scale-down trigger | EMA CPU < 20% for 5 consecutive ticks |
| CPU EMA α | 0.6 |

Lambda Functions

| Function | Runtime | Memory |
| --- | --- | --- |
| `cnv-fractals` | java11 | 512 MB |
| `cnv-dna` | java11 | 512 MB |
| `cnv-grayscott` | java11 | 512 MB |

Instrumentation

Bytecode instrumentation is performed at class-load time via a Javassist javaagent (MetricsSnatcher). It instruments the following workload classes:

pt.ulisboa.tecnico.cnv.dna.Dna
pt.ulisboa.tecnico.cnv.dna.DnaHtmlRenderer
pt.ulisboa.tecnico.cnv.grayscott.GrayScott
pt.ulisboa.tecnico.cnv.fractals.JuliaFractal

For these classes, we collect the number of instructions executed and the memory accesses/allocations. These are tracked per-thread, using ThreadLocal<Metrics> to avoid contention across concurrent requests.
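A minimal sketch of per-thread counters, assuming a shape similar to what the paragraph above describes. The class name `MetricsSketch` and the `reset` method are hypothetical; the field names follow the metric names used in this README.

```java
// Hypothetical shape of the per-thread metrics holder; not the project's Metrics class.
final class MetricsSketch {
    long instCount;
    long totalAccesses;

    // Each thread lazily gets its own instance, so concurrent requests
    // update their counters without locks or shared state.
    private static final ThreadLocal<MetricsSketch> CURRENT =
            ThreadLocal.withInitial(MetricsSketch::new);

    static MetricsSketch current() {
        return CURRENT.get();
    }

    // Intended to be called at the start of a request so that pooled threads
    // do not carry counts over from a previous request.
    static void reset() {
        CURRENT.remove();
    }
}
```

Injected bytecode would then do the equivalent of `MetricsSketch.current().instCount += n` on whichever thread is serving the request.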

GrayScott Smart Instrumentation

Because GrayScott's inner loops are too tight to instrument without significant overhead, an analytical formula is used instead of injecting counters:

pixelsProcessed = actualIterations × gridSize²
instCount      += pixelsProcessed × 274   (instructions per cell)
totalAccesses  += pixelsProcessed × 62    (memory accesses per cell)

These constants were derived empirically by profiling representative GrayScott runs.
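As a worked example of the formula, the sketch below computes both metrics for a given run. The class and method names are hypothetical; the per-cell constants are the ones stated above.

```java
// Hypothetical helper applying the analytical GrayScott formula.
final class GrayScottCostSketch {
    static final long INSTRUCTIONS_PER_CELL = 274;
    static final long ACCESSES_PER_CELL = 62;

    // pixelsProcessed = actualIterations × gridSize²
    static long pixelsProcessed(long actualIterations, long gridSize) {
        return actualIterations * gridSize * gridSize;
    }

    static long instCount(long actualIterations, long gridSize) {
        return pixelsProcessed(actualIterations, gridSize) * INSTRUCTIONS_PER_CELL;
    }

    static long totalAccesses(long actualIterations, long gridSize) {
        return pixelsProcessed(actualIterations, gridSize) * ACCESSES_PER_CELL;
    }
}
```

For 100 iterations on a 64 × 64 grid, pixelsProcessed = 100 × 64² = 409 600, which gives instCount = 112 230 400 and totalAccesses = 25 395 200.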

DynamoDB Schema (`WORKLOAD_METRICS`)

| Attribute | Type | Description |
| --- | --- | --- |
| `id` | String | UUID per request |
| `type` (PK) | String | Workload type: `FRACTAL`, `DNA`, `GRAYSCOTT` |
| `sortKey` (SK) | String | `hash(request_parameters)` |
| `parameters` | Map | Request parameters (e.g. width, height, iterations) |
| `metrics` | Map | `{ instCount, totalAccesses }` |
| `startTime` | String | Timestamp |
| `endTime` | String | Timestamp |
