Skip to content

Sort incoming device tasks using heap#784

Open
devreal wants to merge 2 commits into
ICLDisco:masterfrom
devreal:device-task-priorities-heap
Open

Sort incoming device tasks using heap#784
devreal wants to merge 2 commits into
ICLDisco:masterfrom
devreal:device-task-priorities-heap

Conversation

@devreal

@devreal devreal commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Currently, sorting of device tasks is broken (they don't actually contain a priority field) and linear over a slice of the input list. Instead, we can use a max-heap to keep a sorted array of device tasks, push all new tasks into the heap, and take out the highest priority task.

Complexity of inserting the ring into the heap is O(N+M) or O(N*logM), depending on the whether we just insert elements one by one or rebuild from scratch (beneficial if we insert more tasks than exist already).

We can detach the full lifo at once to avoid repeated atomic operations by the management thread.

Also, use nolock variants on stream-local lists and change to lifo for the shared queue to avoid the lock of the fifo. They are only modified by the device management thread. Inter-thread communication happens through the pending queue.

Once tasks are trickling through the system they do not need to be sorted anymore.

@devreal devreal requested a review from Copilot June 4, 2026 21:20
@devreal devreal requested a review from a team as a code owner June 4, 2026 21:20

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reworks GPU device task intake to restore proper priority-based selection and reduce overhead by switching from FIFO-based pending queues and ad-hoc sorting to a lock-free LIFO intake plus a manager-private max-heap for prioritized pop.

Changes:

  • Replace shared pending queues in CUDA and Level Zero devices with a lock-free parsec_lifo_t, and introduce a pending_heap (parsec_heap_*) used by the device management thread to pop the highest-priority task.
  • Remove the old “sort pending tasks” mechanism and MCA parameters (sort_pending_tasks) along with the associated function pointer in parsec_device_module_t.
  • Add parsec/class/parsec_heap.{h,c} and integrate it into the build, plus add parsec_lifo_detach_all() helpers to batch-move tasks with fewer atomic operations.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
parsec/mca/device/level_zero/device_level_zero_module.c Switch Level Zero pending queue to LIFO and initialize/finalize the new pending heap.
parsec/mca/device/level_zero/device_level_zero_component.c Remove the Level Zero MCA knob and wiring for pending-task sorting.
parsec/mca/device/device.h Remove the sort_pending_list function pointer from the device module interface.
parsec/mca/device/device_gpu.h Introduce pq_priority, swap pending type to LIFO, and add a manager-private pending heap field.
parsec/mca/device/device_gpu.c Detach all pending tasks from LIFO, assign heap priorities, push into heap, and pop highest-priority task; switch stream-local queues to nolock list ops.
parsec/mca/device/cuda/device_cuda_module.c Switch CUDA pending queue to LIFO and initialize/finalize the new pending heap.
parsec/mca/device/cuda/device_cuda_component.c Remove the CUDA MCA knob and wiring for pending-task sorting.
parsec/CMakeLists.txt Add class/parsec_heap.c to the build.
parsec/class/parsec_heap.h Add new array-based max-heap API (pointer elements, int32_t priority key by offset).
parsec/class/parsec_heap.c Implement heap push/pop and ring batch-push with adaptive heapify strategy.
parsec/class/lifo.h Add helpers to detach the entire LIFO at once and convert the internal chain into a doubly-linked ring.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +25 to +31
typedef struct parsec_heap_s {
parsec_object_t super;
void **nodes; /**< pointer array; nodes[0] is the maximum */
size_t size; /**< current number of elements */
size_t capacity; /**< allocated slots in nodes[] */
size_t comp_offset; /**< byte offset of int32_t priority key */
} parsec_heap_t;
Comment thread parsec/class/parsec_heap.c Outdated
Comment on lines +158 to +159
for (ssize_t i = (ssize_t)(heap->size / 2) - 1; i >= 0; i--)
heap_sift_down(heap, (size_t)i);
@devreal

devreal commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Looking into merging the two heap implementations. Didn't see the existing one because it wasn't in the class/ directory.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

Comment on lines +22 to +27
static inline int heap_cmp(const parsec_heap_t *h,
const parsec_list_item_t *a,
const parsec_list_item_t *b)
{
return COMPARISON_VAL(a, h->comp_offset) - COMPARISON_VAL(b, h->comp_offset);
}
Comment thread parsec/maxheap.c
Comment on lines +92 to +96
char tmp[MAX_TASK_STRLEN];
PARSEC_DEBUG_VERBOSE(20, parsec_debug_output,
"MH:\tStole exec C %s (%p) from heap %p",
parsec_task_snprintf(tmp, MAX_TASK_STRLEN, task), task, heap);
}
Comment thread parsec/maxheap.c
Comment on lines 176 to 180
char tmp[MAX_TASK_STRLEN];
PARSEC_DEBUG_VERBOSE(20, parsec_debug_output, "MH:\tStole exec C %s (%p) from heap %p",
PARSEC_DEBUG_VERBOSE(20, parsec_debug_output,
"MH:\tStole exec C %s (%p) from heap %p",
parsec_task_snprintf(tmp, MAX_TASK_STRLEN, to_use), to_use, heap);
}
Comment thread parsec/mca/device/device_gpu.c Outdated
assert(NULL == task);
if( NULL == stream->tasks[stream->start] ) { /* there is room on the stream */
task = (parsec_gpu_task_t*)parsec_list_pop_front(stream->fifo_pending); /* get the best task */
task = (parsec_gpu_task_t*)parsec_list_nolock_pop_front(stream->fifo_pending); /* get the best task */
@devreal devreal force-pushed the device-task-priorities-heap branch from 524089e to 826f3cf Compare June 5, 2026 02:17

@bosilca bosilca left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall a good addition, and a significant fix.

Few nitpicks:

  • You remove spaces at the end of the lines. This is uncalled for in the context of another commit.
  • I just realized that the stream fifo_pending is not ordered by priority anymore. Indeed, now that tasks can bypass the input stream (which would have guaranteed the original, ordered by priority, order), they will land in the execution streams unordered. I don't think this is a problem, but your last point in the PR description is incorrect.
  • you mention O(N+M) cost, but I don't see where that would be. The current push_ring code is o(N log(N+M)).

Comment thread parsec/maxheap.h Outdated
Comment thread parsec/class/lifo.h Outdated
Comment thread parsec/maxheap.c
@devreal devreal force-pushed the device-task-priorities-heap branch from 826f3cf to 4bcdf09 Compare June 8, 2026 15:18
@devreal

devreal commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

I replaced the ring with a chain. To avoid iterating over it, we can set the priority when scheduling the device task. We only set the priority if the device task priority is negative (set in constructor), otherwise we take what the caller has set.

@devreal

devreal commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

The complexity was a leftover from when I had this implemented using an array. I removed references to the complexity.

devreal added 2 commits June 8, 2026 14:16
Currently, sorting of device tasks is broken (they don't actually contain
a priority field) and linear over a slice of the input list.
Instead, we can use a max-heap to keep a sorted array of device tasks,
push all new tasks into the heap, and take out the highest priority task.

We can detach the full lifo at once to avoid repeated atomic
operations by the management thread.

Also, use nolock variants on stream-local lists and change to lifo
for the shared queue to avoid the lock of the fifo. They are only modified
by the device management thread. Inter-thread communication happens through
the pending queue.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Spotted in CI. Likely introduced and not removed.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
@devreal devreal force-pushed the device-task-priorities-heap branch from 4bcdf09 to ef37d59 Compare June 8, 2026 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants