Feat: Add Model RF-DETR #333

Open
DerrickUnleashed wants to merge 10 commits into mlverse:main from DerrickUnleashed:feat/modelRfdetr

@DerrickUnleashed
Contributor

This PR adds:

  • RF-DETR, a real-time detection transformer for object detection
  • A test suite for the model

Closes #327

@DerrickUnleashed
Contributor Author
=======================================
Running inference with: model_rfdetr_base
========================================
Model weights for <rfdetr_base> (~123 MB) will be downloaded and processed if
not already available.
Loaded pretrained weights for <rfdetr_base> (487/487 keys, 0 skipped).
Top detections:
  score=0.8337 label=17 box=[156,37,1440,1148]
  score=0.1323 label=18 box=[156,37,1440,1148]
  score=0.0865 label=23 box=[156,37,1440,1148]
  score=0.0846 label=20 box=[156,37,1440,1148]
  score=0.0698 label=18 box=[164,23,1440,1148]
Detections above 0.3: 1

========================================
Running inference with: model_rfdetr_base_2
========================================
Model weights for <rfdetr_base_2> (~123 MB) will be downloaded and processed
if not already available.
Loaded pretrained weights for <rfdetr_base_2> (487/487 keys, 0 skipped).
Top detections:
  score=0.6605 label=17 box=[156,38,1440,1152]
  score=0.2869 label=18 box=[156,38,1440,1152]
  score=0.1364 label=23 box=[156,38,1440,1152]
  score=0.0740 label=20 box=[156,38,1440,1152]
  score=0.0676 label=16 box=[156,38,1440,1152]
Detections above 0.3: 1

========================================
Running inference with: model_rfdetr_base_o365
========================================
Model weights for <rfdetr_base_o365> (~127 MB) will be downloaded and
processed if not already available.
Loaded pretrained weights for <rfdetr_base_o365> (487/487 keys, 0 skipped).
Top detections:
  score=0.5500 label=140 box=[158,34,1440,1144]
  score=0.4891 label=93 box=[158,34,1440,1144]
  score=0.1077 label=343 box=[158,34,1440,1144]
  score=0.0971 label=321 box=[158,34,1440,1144]
  score=0.0838 label=84 box=[158,34,1440,1144]
Detections above 0.3: 2

========================================
Running inference with: model_rfdetr_large
========================================
Model weights for <rfdetr_large> (~518 MB) will be downloaded and processed
if not already available.
Loaded pretrained weights for <rfdetr_large> (533/533 keys, 0 skipped).
Top detections:
  score=0.9274 label=18 box=[156,40,1444,1136]
  score=0.0538 label=18 box=[163,36,1440,1137]
  score=0.0460 label=19 box=[156,40,1444,1136]
  score=0.0403 label=17 box=[163,36,1440,1137]
  score=0.0336 label=11 box=[156,40,1444,1136]
Detections above 0.3: 1

========================================
Running inference with: model_rfdetr_medium
========================================
Model weights for <rfdetr_medium> (~116 MB) will be downloaded and processed
if not already available.
Loaded pretrained weights for <rfdetr_medium> (465/465 keys, 0 skipped).
Top detections:
  score=0.3381 label=17 box=[155,35,1460,1158]
  score=0.2565 label=18 box=[155,35,1460,1158]
  score=0.1775 label=23 box=[155,35,1460,1158]
  score=0.1350 label=23 box=[157,34,1461,1156]
  score=0.0925 label=17 box=[157,34,1461,1156]
Detections above 0.3: 1

========================================
Running inference with: model_rfdetr_nano
========================================
Model weights for <rfdetr_nano> (~116 MB) will be downloaded and processed if
not already available.
Loaded pretrained weights for <rfdetr_nano> (465/465 keys, 0 skipped).
Top detections:
  score=0.3381 label=17 box=[155,35,1460,1158]
  score=0.2565 label=18 box=[155,35,1460,1158]
  score=0.1775 label=23 box=[155,35,1460,1158]
  score=0.1350 label=23 box=[157,34,1461,1156]
  score=0.0925 label=17 box=[157,34,1461,1156]
Detections above 0.3: 1

========================================
Running inference with: model_rfdetr_small
========================================
Model weights for <rfdetr_small> (~116 MB) will be downloaded and processed
if not already available.
Loaded pretrained weights for <rfdetr_small> (465/465 keys, 0 skipped).
Top detections:
  score=0.5835 label=17 box=[154,34,1459,1155]
  score=0.4582 label=18 box=[154,34,1459,1155]
  score=0.3348 label=23 box=[154,34,1459,1155]
  score=0.0870 label=64 box=[154,34,1459,1155]
  score=0.0798 label=20 box=[154,34,1459,1155]
Detections above 0.3: 3
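Several of the top detections above share exactly the same box but carry different labels. Assuming RF-DETR keeps the post-processing of Deformable-DETR and LW-DETR (a top-k over sigmoid scores flattened across all (query, class) pairs), a single query can contribute several (label, score) entries, which would explain this pattern. The base-R sketch below illustrates that selection on plain matrices; topk_query_class() is a hypothetical helper, not part of the PR.

```r
# Hypothetical helper (not from the PR): DETR-style top-k selection over
# sigmoid scores flattened across (query, class) pairs.
topk_query_class <- function(logits, k = 5) {
  # logits: num_queries x num_classes matrix of raw class logits
  probs <- 1 / (1 + exp(-logits))          # independent sigmoid per class
  n_classes <- ncol(logits)
  flat <- as.vector(t(probs))              # query-major order: q1c0, q1c1, ...
  idx <- order(flat, decreasing = TRUE)[seq_len(min(k, length(flat)))]
  data.frame(
    query = (idx - 1) %/% n_classes + 1,   # 1-based query (i.e. box) index
    label = (idx - 1) %% n_classes,        # 0-based class index (column - 1)
    score = flat[idx]
  )
}

# Toy input: query 1 is confident about two classes, so it appears twice.
logits <- matrix(-5, nrow = 3, ncol = 4)
logits[1, 2] <- 3
logits[1, 3] <- 1
logits[2, 4] <- 0
top <- topk_query_class(logits, k = 3)
print(top)
sum(top$score > 0.3)  # 3 for this toy input
```

With k = 3, query 1 is returned twice (labels 1 and 2) with the same box, much like the repeated boxes in the log.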

DerrickUnleashed changed the title from "Feat/model rfdetr" to "Feat: Add Model RF-DETR" on Jun 16, 2026
DerrickUnleashed commented Jun 23, 2026
Contributor Author

@cregouby Could we perhaps add a vignette for this model?

cregouby left a comment
Collaborator

praise This is massive, thanks for it
todo See the inline comments.

Comment thread NEWS.md
Collaborator

improvement Could you move the added line into the "## New models" section?

cregouby commented Jun 22, 2026
Collaborator

todo missing Please add a representative example to the model documentation.
todo missing Please mention the attribution in a code comment such as "# This code is modified from ..."
todo Please fix merge conflicts
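For the representative documentation example requested above, a roxygen sketch could reuse the preprocessing of the test helper further down. It assumes that model_rfdetr_base() accepts pretrained = TRUE and that the model output already follows the shared $detections data model; the image path is a placeholder.

```r
#' @examples
#' \dontrun{
#' # Assumption: model_rfdetr_base() accepts `pretrained = TRUE`
#' model <- model_rfdetr_base(pretrained = TRUE)
#' model$eval()
#' # Same preprocessing as the COCO test helper (placeholder image path)
#' input <- base_loader("path/to/image.jpg") %>%
#'   transform_to_tensor() %>%
#'   transform_resize(c(640, 640)) %>%
#'   transform_normalize(mean = c(0.485, 0.456, 0.406), std = c(0.229, 0.224, 0.225)) %>%
#'   torch::torch_unsqueeze(1)
#' torch::with_no_grad({
#'   out <- model(input)
#' })
#' det <- out$detections[[1]]
#' det$boxes   # expected: (N, 4) boxes in xyxy pixel coordinates
#' det$labels
#' det$scores
#' }
```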

cregouby commented Jul 4, 2026
Collaborator

improvement Could each test be more specific, going a little deeper than checking the output shape?
suggestion You should use expect_coco_model_detects_cat(model) for every pretrained model. See code in

expect_coco_model_detects_cat <- function(model, min_score = 0.25) {
  # Load the test cat image and preprocess it with ImageNet normalization
  input <- base_loader("assets/class/cat/cat.2.jpg") %>%
    transform_to_tensor() %>%
    transform_resize(c(640, 640)) %>%
    transform_normalize(mean = c(0.485, 0.456, 0.406), std = c(0.229, 0.224, 0.225)) %>%
    torch::torch_unsqueeze(1)  # add the batch dimension
  model$eval()
  torch::with_no_grad({
    out <- model(input)
  })
  # The output must follow the shared detection data model
  expect_named(out, "detections")
  expect_named(out$detections[[1]], c("boxes", "labels", "scores"), ignore.order = TRUE)
  expect_equal(out$detections[[1]]$boxes$shape[2], 4L)
  labels_vec <- as.integer(out$detections[[1]]$labels$cpu())
  scores_vec <- as.numeric(out$detections[[1]]$scores$cpu())
  # Labels must be valid COCO category ids (0 to 90)
  expect_true(all(labels_vec >= 0 & labels_vec <= 90))
  # The best detection must be a cat (COCO id 17) scoring above min_score
  top <- which.max(scores_vec)
  expect_equal(labels_vec[top], 17L)
  expect_gt(scores_vec[top], min_score)
}

question Could we also test with expect_bbox_is_xyxy() from
expect_bbox_is_xyxy <- function(object, width, height) {
  expect_tensor(object)
  N <- object$shape[1]
  expect_tensor_shape(object, c(N, 4))
  x_min <- object[, 1]
  y_min <- object[, 2]
  x_max <- object[, 3]
  y_max <- object[, 4]
  ## bbox range checks
  expect_true((x_min >= 0)$all()$item(),
              info = "All x_min values must be >= 0.")
  expect_true((y_min >= 0)$all()$item(),
              info = "All y_min values must be >= 0.")
  expect_true((x_max <= torch_tensor(width))$all()$item(),
              info = sprintf("All x_max values must be <= width (%s).", width))
  expect_true((y_max <= torch_tensor(height))$all()$item(),
              info = sprintf("All y_max values must be <= height (%s).", height))
  ## maxima <= 1 suggest normalized coordinates rather than pixels
  expect_true((x_max > torch_tensor(1))$all()$item(),
              info = "x looks like a relative coordinate and should be scaled back to the image width.")
  expect_true((y_max > torch_tensor(1))$all()$item(),
              info = "y looks like a relative coordinate and should be scaled back to the image height.")
  ## bbox ordering checks
  expect_true((x_min <= x_max)$all()$item(),
              info = "Each x_min must not exceed its x_max.")
  expect_true((y_min <= y_max)$all()$item(),
              info = "Each y_min must not exceed its y_max.")
}

Comment on lines +685 to +690
batch_size <- value$size(1)
n_heads <- value$size(2)
head_dim <- value$size(3)
len_query <- sampling_locations$size(2)
n_levels <- sampling_locations$size(4)
n_points <- sampling_locations$size(5)

Collaborator

suggestion You could use zeallot's %<-% for readability.
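A minimal sketch of that suggestion, assuming the zeallot package: %<-% unpacks a shape vector into named variables in a single statement. Plain integer vectors stand in for value$shape and sampling_locations$shape so the snippet runs without torch; the sizes are made up for illustration.

```r
library(zeallot)

# Made-up stand-ins for value$shape and sampling_locations$shape:
# value:              (batch, n_heads, head_dim, len_value)
# sampling_locations: (batch, len_query, n_heads, n_levels, n_points, 2)
value_shape <- c(2L, 8L, 32L, 1000L)
sampling_shape <- c(2L, 300L, 8L, 4L, 4L, 2L)

# One line per tensor instead of one $size() call per dimension;
# batch_size and n_heads are re-assigned with the same values.
c(batch_size, n_heads, head_dim, len_value) %<-% value_shape
c(batch_size, len_query, n_heads, n_levels, n_points, n_coords) %<-% sampling_shape
```

Inside the module the right-hand sides would be value$shape and sampling_locations$shape directly.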

}
)

ms_deform_attn_core_pytorch <- function(value, spatial_shapes, sampling_locations, attention_weights,

cregouby commented Jul 4, 2026
Collaborator

thought Very strange name, given that there is no PyTorch involved at all... OK, this comes from the code in the Roboflow repo. Then please mention the attribution in a code comment.

Comment on lines +722 to +787
ms_deform_attn <- nn_module(
  "ms_deform_attn",
  initialize = function(d_model = 256, n_levels = 4, n_heads = 8, n_points = 4) {
    self$d_model <- d_model
    self$n_levels <- n_levels
    self$n_heads <- n_heads
    self$n_points <- n_points
    self$sampling_offsets <- nn_linear(d_model, n_heads * n_levels * n_points * 2)
    self$attention_weights <- nn_linear(d_model, n_heads * n_levels * n_points)
    self$value_proj <- nn_linear(d_model, d_model)
    self$output_proj <- nn_linear(d_model, d_model)
    self$reset_parameters()
  },
  reset_parameters = function() {
    nn_init_constant_(self$sampling_offsets$weight, 0)
    thetas <- torch_arange(0, self$n_heads - 1, dtype = torch_float32()) * (2 * pi / self$n_heads)
    grid_init <- torch_stack(list(thetas$cos(), thetas$sin()), dim = -1)
    grid_init <- grid_init / grid_init$abs()$max(dim = -1, keepdim = TRUE)[[1]]
    grid_init <- grid_init$view(c(self$n_heads, 1, 1, 2))$'repeat'(c(1, self$n_levels, self$n_points, 1))
    for (i in seq_len(self$n_points)) {
      grid_init[, , i, ] <- grid_init[, , i, ] * i
    }
    self$sampling_offsets$bias <- nn_parameter(grid_init$view(-1))
    nn_init_constant_(self$attention_weights$weight, 0)
    nn_init_constant_(self$attention_weights$bias, 0)
    nn_init_xavier_uniform_(self$value_proj$weight)
    nn_init_constant_(self$value_proj$bias, 0)
    nn_init_xavier_uniform_(self$output_proj$weight)
    nn_init_constant_(self$output_proj$bias, 0)
  },
  forward = function(query, reference_points, input_flatten, input_spatial_shapes,
                     input_level_start_index, input_padding_mask = NULL,
                     input_spatial_shapes_hw = NULL) {
    batch_size <- query$size(1)
    len_query <- query$size(2)
    value <- self$value_proj(input_flatten)
    if (!is.null(input_padding_mask)) {
      value <- value$masked_fill(input_padding_mask$unsqueeze(3), 0)
    }
    sampling_offsets <- self$sampling_offsets(query)$view(c(
      batch_size, len_query, self$n_heads, self$n_levels, self$n_points, 2
    ))
    attention_weights <- self$attention_weights(query)$view(c(
      batch_size, len_query, self$n_heads, self$n_levels * self$n_points
    ))
    if (reference_points$size(-1) == 2) {
      offset_normalizer <- torch_stack(list(
        input_spatial_shapes[, 2], input_spatial_shapes[, 1]
      ), dim = -1)
      sampling_locations <- reference_points$unsqueeze(3)$unsqueeze(5) +
        sampling_offsets / offset_normalizer$unsqueeze(1)$unsqueeze(1)$unsqueeze(4)
    } else {
      sampling_locations <- reference_points[, , NULL, , NULL, 1:2] +
        sampling_offsets / self$n_points * reference_points[, , NULL, , NULL, 3:4] * 0.5
    }
    attention_weights <- nnf_softmax(attention_weights, dim = -1)
    value <- value$transpose(2, 3)$contiguous()$view(c(
      batch_size, self$n_heads, self$d_model %/% self$n_heads, -1
    ))
    output <- ms_deform_attn_core_pytorch(
      value, input_spatial_shapes, sampling_locations, attention_weights,
      input_spatial_shapes_hw
    )
    self$output_proj(output)
  }
)

Collaborator

todo Please factorize this code with

lw_detr_ms_deform_attn <- torch::nn_module(
  initialize = function(d_model = 256L, n_levels = 1L, n_heads = 8L, n_points = 4L) {
    self$n_levels <- n_levels
    self$n_heads <- n_heads
    self$n_points <- n_points
    self$head_dim <- d_model %/% n_heads
    self$sampling_offsets <- torch::nn_linear(d_model, n_heads * n_levels * n_points * 2L)
    self$attention_weights <- torch::nn_linear(d_model, n_heads * n_levels * n_points)
    self$value_proj <- torch::nn_linear(d_model, d_model)
    self$output_proj <- torch::nn_linear(d_model, d_model)
    torch::with_no_grad({
      torch::nn_init_constant_(self$sampling_offsets$weight, 0)
      thetas <- torch::torch_arange(n_heads, dtype = torch::torch_float32()) * (2 * pi / n_heads)
      grid_init <- torch::torch_stack(list(thetas$cos(), thetas$sin()), dim = -1L)
      grid_init <- (grid_init / grid_init$abs()$amax(-1L, keepdim = TRUE))
      grid_init <- grid_init$reshape(c(n_heads, 1L, 1L, 2L))$`repeat`(c(1L, n_levels, n_points, 1L))
      for (i in seq_len(n_points)) {
        grid_init[,, i, ] <- grid_init[,, i, ] * i
      }
      self$sampling_offsets$bias <- torch::nn_parameter(grid_init$reshape(c(-1L)))
      torch::nn_init_constant_(self$attention_weights$weight, 0)
      torch::nn_init_constant_(self$attention_weights$bias, 0)
      torch::nn_init_xavier_uniform_(self$value_proj$weight)
      torch::nn_init_constant_(self$value_proj$bias, 0)
      torch::nn_init_xavier_uniform_(self$output_proj$weight)
      torch::nn_init_constant_(self$output_proj$bias, 0)
    })
  },
  forward = function(query, reference_points, input_flatten, spatial_shapes, level_start_index, mask = NULL) {
    bs <- query$size(1L)
    lenq <- query$size(2L)
    nh <- self$n_heads
    nl <- self$n_levels
    np <- self$n_points
    hd <- self$head_dim
    value <- self$value_proj(input_flatten)
    if (!is.null(mask)) {
      value <- value$masked_fill(mask$logical_not()$unsqueeze(-1L), 0)
    }
    offsets <- self$sampling_offsets(query)$reshape(c(bs, lenq, nh, nl, np, 2L))
    attn_w <- torch::nnf_softmax(
      self$attention_weights(query)$reshape(c(bs, lenq, nh, nl * np)),
      dim = -1L
    )
    ref_xy <- reference_points[,,, 1:2]
    ref_wh <- reference_points[,,, 3:4]
    ref_xy_exp <- ref_xy$unsqueeze(3L)$unsqueeze(5L)
    ref_wh_exp <- ref_wh$unsqueeze(3L)$unsqueeze(5L)
    sampling_locs <- ref_xy_exp + offsets / np * ref_wh_exp * 0.5
    val_split <- list()
    for (lvl in seq_len(nl)) {
      h_l <- as.integer(spatial_shapes[lvl, 1])
      w_l <- as.integer(spatial_shapes[lvl, 2])
      s <- level_start_index[lvl] + 1L
      e <- s + h_l * w_l - 1L
      val_l <- value[, s:e, ]$reshape(c(bs, h_l, w_l, nh, hd))
      val_l <- val_l$permute(c(1L, 4L, 5L, 2L, 3L))$reshape(c(bs * nh, hd, h_l, w_l))
      val_split[[lvl]] <- val_l
    }
    sampling_grids <- 2 * sampling_locs - 1
    out_list <- list()
    for (lvl in seq_len(nl)) {
      grid_l <- sampling_grids[,,, lvl, , ]
      grid_l <- grid_l$permute(c(1L, 3L, 2L, 4L, 5L))
      grid_l <- grid_l$reshape(c(bs * nh, lenq, np, 2L))
      sampled <- torch::nnf_grid_sample(
        val_split[[lvl]],
        grid_l,
        mode = "bilinear",
        padding_mode = "zeros",
        align_corners = FALSE
      )
      out_list[[lvl]] <- sampled
    }
    out_vals <- torch::torch_cat(out_list, dim = -1L)
    attn_w2 <- attn_w$permute(c(1L, 3L, 2L, 4L))$reshape(c(bs * nh, 1L, lenq, nl * np))
    output <- (out_vals * attn_w2)$sum(-1L)$reshape(c(bs, nh * hd, lenq))
    self$output_proj(output$permute(c(1L, 3L, 2L)))
  }
)

as the initialization is the same and forward() here only additionally handles 2D reference_points.
suggestion Rename it detr_ms_deform_attn.
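The branch that a factorized detr_ms_deform_attn would need is how sampling locations are formed. The hypothetical base-R helper below mirrors the two cases of ms_deform_attn$forward() for a single query, level and point: 2-D (x, y) references, where the offset is divided by the level's (width, height), and 4-D (cx, cy, w, h) references, where the offset is divided by n_points and scaled by half the box size. It is an illustration on plain vectors, not code from the PR.

```r
# Hypothetical helper mirroring the two branches of ms_deform_attn$forward()
# for one query, one level and one sampling point (no torch needed).
sampling_location <- function(ref, offset, level_hw, n_points) {
  # ref:      c(x, y) or c(cx, cy, w, h), normalized to [0, 1]
  # offset:   c(dx, dy) as predicted by sampling_offsets
  # level_hw: c(H, W) of the feature level, as stored in spatial_shapes
  if (length(ref) == 2) {
    # 2-D reference points: normalize the offset by (W, H), hence rev()
    ref + offset / rev(level_hw)
  } else if (length(ref) == 4) {
    # 4-D reference boxes: scale the offset by half the box size
    ref[1:2] + offset / n_points * ref[3:4] * 0.5
  } else {
    stop("reference points must have 2 or 4 coordinates")
  }
}

sampling_location(c(0.5, 0.5), c(4, -2), level_hw = c(16, 32), n_points = 4)
# c(0.625, 0.375)
sampling_location(c(0.5, 0.5, 0.2, 0.4), c(1, 2), level_hw = c(16, 32), n_points = 4)
# c(0.525, 0.6)
```

When merging the modules, note that they also use opposite padding-mask conventions: ms_deform_attn zeroes value positions where input_padding_mask is TRUE, whereas lw_detr_ms_deform_attn calls logical_not() first and zeroes positions where mask is FALSE.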

out$pred_boxes <- ref_enc
}
}
out

Collaborator

todo Please make the output data model identical to that of all other object detection models, which helps visualization: we expect out to have a $detections element in which each detection item has names c("boxes", "labels", "scores").
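A minimal base-R sketch of that output data model, assuming the raw head returns per-query class logits and boxes in normalized (cx, cy, w, h) format as in other DETR variants. Each query keeps its best sigmoid class score, boxes are converted to absolute xyxy pixels (the layout expect_bbox_is_xyxy() checks), and each image yields a list with boxes, labels and scores. Plain matrices stand in for torch tensors and to_detections() is a hypothetical name; the PR's actual post-processing may instead use a top-k over (query, class) pairs.

```r
# Hypothetical post-processing into the shared detection data model.
# logits: list of num_queries x num_classes matrices (one per image)
# boxes:  list of num_queries x 4 matrices, normalized (cx, cy, w, h)
to_detections <- function(logits, boxes, width, height, threshold = 0.3) {
  per_image <- Map(function(lg, bx) {
    probs <- 1 / (1 + exp(-lg))                            # sigmoid per class
    scores <- apply(probs, 1, max)                         # best score per query
    labels <- max.col(probs, ties.method = "first") - 1L   # 0-based class index
    xyxy <- cbind(
      (bx[, 1] - bx[, 3] / 2) * width,    # x_min
      (bx[, 2] - bx[, 4] / 2) * height,   # y_min
      (bx[, 1] + bx[, 3] / 2) * width,    # x_max
      (bx[, 2] + bx[, 4] / 2) * height    # y_max
    )
    keep <- scores > threshold
    list(boxes = xyxy[keep, , drop = FALSE], labels = labels[keep], scores = scores[keep])
  }, logits, boxes)
  list(detections = per_image)
}

# One image, two queries, three classes: only the first query passes 0.3.
logits <- list(rbind(c(-4, 2, -4), c(-4, -4, -1)))
boxes <- list(rbind(c(0.5, 0.5, 0.5, 0.5), c(0.2, 0.2, 0.1, 0.1)))
out <- to_detections(logits, boxes, width = 640, height = 480)
out$detections[[1]]$boxes   # one row: 160 120 480 360
```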

Comment on lines +857 to +875
mlp_module <- nn_module(
  "mlp_module",
  initialize = function(input_dim, hidden_dim, output_dim, num_layers) {
    self$num_layers <- num_layers
    h <- rep(hidden_dim, num_layers - 1)
    dims <- c(input_dim, h, output_dim)
    self$layers <- nn_module_list(lapply(seq_len(num_layers), function(i) {
      nn_linear(dims[i], dims[i + 1])
    }))
  },
  forward = function(x) {
    for (i in seq_len(self$num_layers)) {
      x <- self$layers[[i]](x)
      if (i < self$num_layers) x <- nnf_relu(x)
    }
    x
  }
)

Collaborator

suggestion This looks like it should be factorized with (or reused from)

# MLP with $layers nn_module_list
.lw_detr_mlp_layers <- torch::nn_module(
  initialize = function(input_dim, hidden_dim, output_dim, num_layers) {
    dims_in <- c(input_dim, rep(hidden_dim, num_layers - 1L))
    dims_out <- c(rep(hidden_dim, num_layers - 1L), output_dim)
    self$layers <- torch::nn_module_list(mapply(
      function(di, do) torch::nn_linear(di, do),
      dims_in,
      dims_out,
      SIMPLIFY = FALSE
    ))
    self$n <- num_layers
  },
  forward = function(x) {
    for (i in seq_len(self$n)) {
      x <- self$layers[[i]](x)
      if (i < self$n) x <- torch::nnf_relu(x)
    }
    x
  }
)
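Both modules store their nn_linear layers in a $layers nn_module_list with a ReLU between consecutive layers, so reusing .lw_detr_mlp_layers should leave parameter names, and hence pretrained weight loading, unchanged as long as the layer sizes match. The quick base-R check below compares the (in, out) feature pairs each constructor would build; the two helper names are hypothetical and only reproduce each module's dimension arithmetic.

```r
# Dimension arithmetic of mlp_module (from this PR)
mlp_dims_rfdetr <- function(input_dim, hidden_dim, output_dim, num_layers) {
  dims <- c(input_dim, rep(hidden_dim, num_layers - 1), output_dim)
  i <- seq_len(num_layers)
  cbind(in_features = dims[i], out_features = dims[i + 1])
}

# Dimension arithmetic of .lw_detr_mlp_layers
mlp_dims_lw_detr <- function(input_dim, hidden_dim, output_dim, num_layers) {
  cbind(
    in_features = c(input_dim, rep(hidden_dim, num_layers - 1L)),
    out_features = c(rep(hidden_dim, num_layers - 1L), output_dim)
  )
}

# Both constructions agree, including the single-layer edge case
for (n in 1:4) {
  stopifnot(identical(
    mlp_dims_rfdetr(256, 256, 4, n),
    mlp_dims_lw_detr(256, 256, 4, n)
  ))
}
```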


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Object Detection Model] Please implement RF-DETR

2 participants