Skip to content

Bugfix refill#55

Open
mihai145 wants to merge 5 commits into
masterfrom
bugfix-refill
Open

Bugfix refill#55
mihai145 wants to merge 5 commits into
masterfrom
bugfix-refill

Conversation

@mihai145

@mihai145 mihai145 commented Mar 24, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • New Features

    • Refill behavior now accepts an optional threshold so callers can control when receive buffers are replenished.
  • Bug Fixes

    • Corrected request-count tracking to apply updates to the intended connection.
    • Receive buffer sizing now properly scales with the configured number of cores.
    • Resource-manager receive handling now updates and refills work-credits at appropriate points.
  • Chores

    • Increased maximum work-queue send capacity for higher throughput.

@coderabbitai

coderabbitai Bot commented Mar 24, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Changed rdmalib::RecvWorkCompletions::refill() to accept an optional int threshold = -1 and use it when deciding to post receives; updated callers (executor, client, manager) to pass thresholds and adjust request accounting; increased RDMA QP send WR capacity from 40 to 128; ensured executor receive buffer ≥ numcores.

Changes

Cohort / File(s) Summary
RDMA receive queue API & impl
rdmalib/include/rdmalib/queue.hpp, rdmalib/lib/queue.cpp
RecvWorkCompletions::refill() signature changed to bool refill(int threshold = -1); implementation treats -1 as fallback to internal _refill_threshold and posts receives when _requests < threshold.
RDMA QP configuration
rdmalib/lib/rdmalib.cpp
Increased _cfg.attr.cap.max_send_wr from 40 to 128 in RDMAActive and RDMAPassive constructors.
Executor: caller updates & sizing
rfaas/include/rfaas/executor.hpp, rfaas/lib/executor.cpp
Initial refill now calls receive_wcs().refill(numcores); per-connection update_requests(-1) applied correctly to each _connections[i]; receive buffer size initialized to at least numcores (std::max(dev.default_receive_buffer_size, (int16_t)numcores) + 1).
Manager: resource message handler signature
server/executor_manager/manager.hpp, server/executor_manager/manager.cpp
Manager::_handle_res_mgr_message now accepts int update_requests (default 0); handler calls receive_wcs().update_requests(update_requests) before receive_wcs().refill(); polling loop sites updated to pass -1.
Client: lease path accounting
rfaas/include/rfaas/client.hpp
After polling RM response, calls receive_wcs().update_requests(-1) then receive_wcs().refill() to adjust receive-work accounting immediately after each poll.

Sequence Diagram(s)

sequenceDiagram
    rect rgba(135,206,235,0.5)
    participant Client
    end
    rect rgba(144,238,144,0.5)
    participant Manager
    end
    rect rgba(255,182,193,0.5)
    participant RDMA_Lib
    end
    Client->>Manager: send request / wait for reply
    Manager->>RDMA_Lib: poll_wc() -> receives ibv_wc
    Manager->>RDMA_Lib: receive_wcs().update_requests(-1)
    Manager->>RDMA_Lib: receive_wcs().refill(threshold=-1)
    RDMA_Lib-->>Manager: posts receives as needed
    Manager-->>Client: propagate response
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 I tweak a threshold, hop and hum,
Buffers grow where small ones were glum.
More send windows open wide and bright,
Counts adjusted, queues sleep light.
A rabbit's nibble makes the network run right.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Bugfix refill' is vague and generic—it indicates a bug fix to something called 'refill' but provides minimal context about the scope or significance of the changes. Consider using a more descriptive title that clarifies the specific issue being fixed, e.g., 'Fix receive work-queue refill accounting across RDMA connections' or 'Fix refill threshold handling and receive buffer sizing'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch bugfix-refill

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rdmalib/lib/queue.cpp`:
- Around line 145-154: The replenish logic in RecvWorkCompletions::replenish
resets _requests to _rcv_buf_size even if post_batched_empty_recv(...) fails,
corrupting queue accounting; modify replenish to check the return/result of
post_batched_empty_recv(qty) and only set _requests = _rcv_buf_size when that
call succeeds, otherwise leave _requests unchanged (and propagate/log the
failure/return false) so receive-queue depletion is not hidden; locate this
behavior in RecvWorkCompletions::replenish and the helper
post_batched_empty_recv to add the success check and appropriate error handling.

In `@rfaas/lib/executor.cpp`:
- Line 41: The initializer that constructs _state uses a narrowing cast of
numcores to int16_t which can overflow; remove that cast and ensure the std::max
comparison is done in a sufficiently wide integer type (e.g. promote
dev.default_receive_buffer_size to int or use std::max<int>) so you compare two
ints and then add 1, then pass the result (converted if the _state constructor
expects a narrower type) — update the expression in the _state(...) initializer
in executor.cpp to perform the max in an int-sized type instead of casting
numcores to int16_t.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9d393705-3303-4e88-8d40-41614520face

📥 Commits

Reviewing files that changed from the base of the PR and between 4d57ce0 and ecf64e0.

📒 Files selected for processing (7)
  • rdmalib/include/rdmalib/queue.hpp
  • rdmalib/lib/queue.cpp
  • rdmalib/lib/rdmalib.cpp
  • rfaas/include/rfaas/executor.hpp
  • rfaas/lib/executor.cpp
  • server/executor_manager/manager.cpp
  • server/executor_manager/manager.hpp

Comment thread rdmalib/lib/queue.cpp Outdated
Comment on lines +145 to +154
bool RecvWorkCompletions::replenish()
{
if (_requests < _rcv_buf_size) {
SPDLOG_DEBUG("Post {} requests to buffer at QP {}", _rcv_buf_size - _requests, fmt::ptr(this->qp()));
this->post_batched_empty_recv(_rcv_buf_size - _requests);
_requests = _rcv_buf_size;
return true;
}
return false;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Don’t mark requests as replenished when posting fails.

post_batched_empty_recv(...) can fail, but _requests is still reset to full. That corrupts queue accounting and can hide receive-queue depletion.

🔧 Proposed fix
 bool RecvWorkCompletions::replenish()
 {
-  if (_requests < _rcv_buf_size) {
-    SPDLOG_DEBUG("Post {} requests to buffer at QP {}", _rcv_buf_size - _requests, fmt::ptr(this->qp()));
-    this->post_batched_empty_recv(_rcv_buf_size - _requests);
-    _requests = _rcv_buf_size;
-    return true;
-  }
-  return false;
+  if(_requests < 0) {
+    _requests = 0;
+  }
+
+  if(_requests < _rcv_buf_size) {
+    const int missing = _rcv_buf_size - _requests;
+    SPDLOG_DEBUG("Post {} requests to buffer at QP {}", missing, fmt::ptr(this->qp()));
+    if(this->post_batched_empty_recv(missing) < 0) {
+      return false;
+    }
+    _requests = _rcv_buf_size;
+    return true;
+  }
+  return false;
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
bool RecvWorkCompletions::replenish()
{
if (_requests < _rcv_buf_size) {
SPDLOG_DEBUG("Post {} requests to buffer at QP {}", _rcv_buf_size - _requests, fmt::ptr(this->qp()));
this->post_batched_empty_recv(_rcv_buf_size - _requests);
_requests = _rcv_buf_size;
return true;
}
return false;
}
bool RecvWorkCompletions::replenish()
{
if(_requests < 0) {
_requests = 0;
}
if(_requests < _rcv_buf_size) {
const int missing = _rcv_buf_size - _requests;
SPDLOG_DEBUG("Post {} requests to buffer at QP {}", missing, fmt::ptr(this->qp()));
if(this->post_batched_empty_recv(missing) < 0) {
return false;
}
_requests = _rcv_buf_size;
return true;
}
return false;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rdmalib/lib/queue.cpp` around lines 145 - 154, The replenish logic in
RecvWorkCompletions::replenish resets _requests to _rcv_buf_size even if
post_batched_empty_recv(...) fails, corrupting queue accounting; modify
replenish to check the return/result of post_batched_empty_recv(qty) and only
set _requests = _rcv_buf_size when that call succeeds, otherwise leave _requests
unchanged (and propagate/log the failure/return false) so receive-queue
depletion is not hidden; locate this behavior in RecvWorkCompletions::replenish
and the helper post_batched_empty_recv to add the success check and appropriate
error handling.

Comment thread rfaas/lib/executor.cpp

executor::executor(const std::string& address, int port, int numcores, int memory, int lease_id, device_data & dev):
_state(dev.ip_address, dev.port, dev.default_receive_buffer_size + 1),
_state(dev.ip_address, dev.port, std::max(dev.default_receive_buffer_size, (int16_t)numcores) + 1),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Inspect device_data definition and default_receive_buffer_size type.
rg -n "struct device_data|class device_data|default_receive_buffer_size" -C3 --type=cpp --type=hpp

# 2) Inspect executor constructor signature and usage of numcores.
rg -n "executor\\(const std::string& address, int port, int numcores" -C2 --type=cpp --type=hpp
rg -n "std::max\\(.*numcores|\\(int16_t\\)numcores|_numcores" -C2 --type=cpp --type=hpp

# 3) Check for any explicit input validation/range checks for numcores.
rg -nP "\\bnumcores\\b.*(<=|>=|<|>|assert|expect_|throw)" -C2 --type=cpp --type=hpp

Repository: spcl/rFaaS

Length of output: 81


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Inspect device_data definition and default_receive_buffer_size type.
echo "=== Searching for device_data definition ==="
rg -n "struct device_data|class device_data|default_receive_buffer_size" -C3

# 2) Inspect executor constructor signature and usage of numcores.
echo "=== Searching for executor constructor ==="
rg -n "executor\(const std::string& address, int port, int numcores" -C2

echo "=== Searching for numcores usage with std::max or int16_t cast ==="
rg -n "std::max.*numcores|\(int16_t\)numcores|_numcores" -C2

# 3) Check for any explicit input validation/range checks for numcores.
echo "=== Searching for numcores range checks ==="
rg -n "numcores.*(<=|>=|<|>|assert)" -C2

Repository: spcl/rFaaS

Length of output: 15755


Remove explicit narrowing cast from buffer-size calculation.

Casting numcores (int) to int16_t causes integer overflow if numcores exceeds 32,767, producing an undersized receive buffer.

💡 Proposed fix
-    _state(dev.ip_address, dev.port, std::max(dev.default_receive_buffer_size, (int16_t)numcores) + 1),
+    _state(dev.ip_address, dev.port, std::max<int>(dev.default_receive_buffer_size, numcores) + 1),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_state(dev.ip_address, dev.port, std::max(dev.default_receive_buffer_size, (int16_t)numcores) + 1),
_state(dev.ip_address, dev.port, std::max<int>(dev.default_receive_buffer_size, numcores) + 1),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rfaas/lib/executor.cpp` at line 41, The initializer that constructs _state
uses a narrowing cast of numcores to int16_t which can overflow; remove that
cast and ensure the std::max comparison is done in a sufficiently wide integer
type (e.g. promote dev.default_receive_buffer_size to int or use std::max<int>)
so you compare two ints and then add 1, then pass the result (converted if the
_state constructor expects a narrower type) — update the expression in the
_state(...) initializer in executor.cpp to perform the max in an int-sized type
instead of casting numcores to int16_t.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
rfaas/include/rfaas/client.hpp (1)

45-51: ⚠️ Potential issue | 🔴 Critical

Incorrect decrement value causes _requests tracking inconsistency.

poll_wc() can return multiple completions (up to 32), as evidenced by the warning check on line 49. However, update_requests(-1) always decrements by 1, not by the actual response_count. This under-decrements _requests when N > 1 completions are returned, causing the refill logic to malfunction over time.

The correct call should use -response_count:

🐛 Proposed fix
       auto [responses, response_count] = _resource_mgr.connection().poll_wc(rdmalib::QueueType::RECV, true);
-      _resource_mgr.connection().receive_wcs().update_requests(-1);
+      _resource_mgr.connection().receive_wcs().update_requests(-response_count);
       _resource_mgr.connection().receive_wcs().refill();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rfaas/include/rfaas/client.hpp` around lines 45 - 51, The decrement of the
pending request counter is wrong: when poll_wc returns multiple completions you
must decrement by the actual response_count instead of 1; update the call to
receive_wcs().update_requests(...) to pass -response_count (using the
response_count variable returned from poll_wc) so the _requests tracking stays
consistent and refill() behaves correctly; locate this in the block that calls
_resource_mgr.connection().poll_wc(...),
_resource_mgr.connection().receive_wcs().update_requests(...), and
_resource_mgr.connection().receive_wcs().refill().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@rfaas/include/rfaas/client.hpp`:
- Around line 45-51: The decrement of the pending request counter is wrong: when
poll_wc returns multiple completions you must decrement by the actual
response_count instead of 1; update the call to
receive_wcs().update_requests(...) to pass -response_count (using the
response_count variable returned from poll_wc) so the _requests tracking stays
consistent and refill() behaves correctly; locate this in the block that calls
_resource_mgr.connection().poll_wc(...),
_resource_mgr.connection().receive_wcs().update_requests(...), and
_resource_mgr.connection().receive_wcs().refill().

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 03633af2-75ae-445e-b9fa-97a0f7c4a45f

📥 Commits

Reviewing files that changed from the base of the PR and between 04527d5 and 5982753.

📒 Files selected for processing (1)
  • rfaas/include/rfaas/client.hpp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant