Problem
`Client::PutObject` uploads parts strictly sequentially (src/client.cc:497-636). A 10 GiB object with 5 MiB parts takes 2048 sequential HTTP round-trips on a single TCP stream, so throughput is capped at single-stream bandwidth even though the path is now backed by a shared CURL handle pool (PR #215).
Why now
PR #215 made the HTTP layer concurrency-safe (one-time `curl_global_init`, `CURLSH` with per-slot mutex, `region_map_` shared_mutex). The transport is ready to be exercised from multiple threads; the upload pipeline isn't.
Design tradeoffs (need input)
The current loop has a single shared aligned buffer (alloc'd once at `client.cc:914`), reads sequentially from `args.stream` via `utils::ReadPart`, registers ONE buffer for the whole multipart RDMA upload, and has a one-byte read-ahead to detect the last part. Parallelizing means:
- N buffers (page-aligned, each individually cuObj-registered for RDMA)
- Producer/consumer split — one thread drains the stream into the next free buffer; N threads post UploadParts
- Inflight cap as new public API (e.g., `PutObjectArgs::max_inflight_parts`)
- Memory pressure scales linearly with parallelism — large `part_size` × N can blow up RSS
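To make the memory tradeoff concrete, here is a minimal sketch of the arithmetic. `PeakBufferBytes` and `ClampInflight` are hypothetical helpers, not existing client code, and the idea of clamping the requested parallelism to a byte budget is only one possible mitigation, offered as an assumption for discussion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

constexpr std::size_t kMiB = 1024 * 1024;

// Hypothetical helper: peak buffer memory when every inflight slot owns
// one part-sized buffer (ignores per-buffer alignment padding).
std::size_t PeakBufferBytes(std::size_t part_size, std::size_t max_inflight_parts) {
  return part_size * max_inflight_parts;
}

// Hypothetical helper: reduce the requested parallelism so the buffers fit
// in budget_bytes. Always allows at least one part, which matches the
// current sequential behavior.
std::size_t ClampInflight(std::size_t requested, std::size_t part_size,
                          std::size_t budget_bytes) {
  assert(part_size > 0);
  const std::size_t by_budget = budget_bytes / part_size;
  return std::max<std::size_t>(1, std::min(requested, by_budget));
}
```

For example, 8 inflight parts of 64 MiB each hold 512 MiB of buffers, and a 256 MiB budget would clamp a request for 16 such parts down to 4.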
Suggested approach
- Add `PutObjectArgs::max_inflight_parts` (default 1 = current behavior)
- Refactor the loop into a small producer (a single thread reading the stream) plus a bounded executor (N consumers, each calling `UploadPart`)
- For RDMA: register N buffers up front via N `ScopedRDMARegistration` slots
- Preserve part ordering via part numbers (CompleteMultipartUpload doesn't care about completion order)
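The shape of the approach above could look roughly like the following sketch. It is not the client's code: `Part`, `PartResult`, `UploadFn`, and `UploadPartsParallel` are hypothetical names, a `std::string` stands in for a page-aligned registered buffer, and RDMA registration, the one-byte read-ahead, and error handling (an exception thrown from `upload` would terminate the process here) are all left out.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <istream>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real client types.
struct Part { int number = 0; std::string data; };  // 1-based part number
struct PartResult { int number; std::string etag; };
using UploadFn = std::function<std::string(const Part&)>;  // returns an ETag

std::vector<PartResult> UploadPartsParallel(std::istream& in, std::size_t part_size,
                                            std::size_t max_inflight,
                                            const UploadFn& upload) {
  assert(part_size > 0 && max_inflight > 0);
  std::mutex mu;
  std::condition_variable cv;
  std::queue<Part> ready;    // filled parts waiting for a worker
  std::size_t inflight = 0;  // parts read but not yet fully uploaded
  bool done = false;         // set once the producer hits end of stream
  std::vector<PartResult> results;

  auto worker = [&] {
    for (;;) {
      Part part;
      {
        std::unique_lock<std::mutex> lock(mu);
        cv.wait(lock, [&] { return done || !ready.empty(); });
        if (ready.empty()) return;  // producer finished and queue drained
        part = std::move(ready.front());
        ready.pop();
      }
      std::string etag = upload(part);  // network call runs outside the lock
      {
        std::lock_guard<std::mutex> lock(mu);
        results.push_back({part.number, std::move(etag)});
        --inflight;  // frees a slot for the producer
      }
      cv.notify_all();
    }
  };

  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < max_inflight; ++i) workers.emplace_back(worker);

  // Producer: the calling thread is the only reader of the stream.
  int number = 0;
  for (;;) {
    {
      std::unique_lock<std::mutex> lock(mu);
      cv.wait(lock, [&] { return inflight < max_inflight; });  // inflight cap
    }
    Part part;
    part.data.resize(part_size);
    in.read(&part.data[0], static_cast<std::streamsize>(part_size));
    part.data.resize(static_cast<std::size_t>(in.gcount()));
    if (part.data.empty()) break;  // end of stream
    part.number = ++number;
    {
      std::lock_guard<std::mutex> lock(mu);
      ready.push(std::move(part));
      ++inflight;
    }
    cv.notify_all();
  }
  {
    std::lock_guard<std::mutex> lock(mu);
    done = true;
  }
  cv.notify_all();
  for (auto& t : workers) t.join();

  // Uploads may finish in any order; sort by part number before completing.
  std::sort(results.begin(), results.end(),
            [](const PartResult& a, const PartResult& b) { return a.number < b.number; });
  return results;
}
```

Because the producer waits for a free slot before allocating the next buffer, at most `max_inflight` part buffers exist at once, and `max_inflight == 1` degenerates to the current read-then-upload sequence. The final sort is there because parts can finish out of order while the part list passed to CompleteMultipartUpload is expected in ascending part-number order.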
Impact
Large. Cited as the single biggest throughput unlock in the original audit, alongside the now-landed handle pool. For large objects on fast networks, expect throughput to scale roughly with N (the inflight cap) until the network or the server becomes the bottleneck.
Roadmap
T1.2 from the Tier 1 modernization audit. It was paused because of the API and memory design questions above. Related: PR #215.