Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes

Jeongho Noh1, Tai Hyoung Rhee1, Eunho Lee1, Jeongyun Kim1, Sunwoo Lee2, Ayoung Kim1,†

1 Seoul National University    2 Hyundai Motor Company    † Corresponding Author

ICRA 2026


Clutt3R-Seg is a zero-shot sparse-view 3D instance segmentation pipeline that builds hierarchy-based, open-vocabulary 3D instances for language-grounded grasping in cluttered scenes.

It groups noisy RGB-D masks across views, resolves over- and under-segmentation through an instance tree, and updates object correspondences after robot interactions without rescanning the full scene.

Paper: arXiv:2602.11660

[Figure: Clutt3R-Seg pipeline]

Clone

git clone https://github.com/jeonghonoh/clutt3r-seg
cd clutt3r-seg

Build

Build the runtime image locally:

docker build --build-arg INSTALL_DUODUOCLIP=1 -t clutt3r-seg:duoduoclip-local .

This local build downloads and installs DuoduoCLIP from its upstream repository inside the image. Clutt3R-Seg does not redistribute DuoduoCLIP source code, checkpoints, or Docker images with DuoduoCLIP preinstalled.

Use of DuoduoCLIP is subject to its upstream license and dependency licenses. Do not publish or redistribute Docker images with DuoduoCLIP preinstalled unless you have confirmed that all relevant licenses allow it.

Data

This release includes sample sequences from GraspClutter6D, along with custom real-world and synthetic sequences.

Each sequence should follow this layout:

samples/<sequence_name>/
  data/
    transforms.json
    images/
    depth/
    instance_masks/
    instance_tree.json

transforms.json must contain camera intrinsics and per-frame file_path, depth_file_path, and transform_matrix entries. Instance masks should be stored as mask_<frame_id>_<instance_id>.png.
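The naming and per-frame requirements above can be checked before a run. The following is a minimal sketch, not part of the release: the helper names are hypothetical, it assumes frame and instance IDs are non-negative integers, and it assumes the per-frame entries live in a top-level "frames" list (a common NeRF-style convention that this README does not confirm).

```python
import re
from collections import defaultdict

# Assumption: <frame_id> and <instance_id> are non-negative integers.
MASK_RE = re.compile(r"mask_(\d+)_(\d+)\.png")

# Per-frame keys named in this README.
REQUIRED_FRAME_KEYS = ("file_path", "depth_file_path", "transform_matrix")


def parse_mask_name(name):
    """Return (frame_id, instance_id), or None if the name does not match."""
    m = MASK_RE.fullmatch(name)
    return (int(m.group(1)), int(m.group(2))) if m else None


def group_masks_by_frame(names):
    """Map frame_id -> sorted instance ids, ignoring non-matching names."""
    grouped = defaultdict(list)
    for name in names:
        parsed = parse_mask_name(name)
        if parsed is not None:
            grouped[parsed[0]].append(parsed[1])
    return {frame: sorted(ids) for frame, ids in grouped.items()}


def missing_frame_keys(transforms):
    """List (frame_index, key) pairs for required per-frame keys that are absent.

    Assumption: frames are stored under a top-level "frames" list.
    """
    problems = []
    for i, frame in enumerate(transforms.get("frames", [])):
        for key in REQUIRED_FRAME_KEYS:
            if key not in frame:
                problems.append((i, key))
    return problems
```

In practice, names would come from listing data/instance_masks/ and transforms from json.load on data/transforms.json. The intrinsics keys are not checked here because this README does not name them.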

data/depth should contain dense MVSAnywhere inference depth, as used in the paper pipeline. Raw measured depth with invalid pixels can break back-projection and geometry consistency.
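To see whether a depth map is dense enough, one option is to measure the fraction of invalid pixels. This is a rough sketch under stated assumptions: the function name is hypothetical, and "invalid" is taken to mean non-finite or non-positive values, which is a common sensor convention but is not defined by this README. It takes the depth map as rows of floats so that it does not depend on a particular image library.

```python
import math


def invalid_depth_fraction(depth_rows):
    """Fraction of pixels whose depth is non-finite or not strictly positive.

    Assumption: zero, negative, NaN, and infinite values mark invalid depth.
    """
    total = 0
    invalid = 0
    for row in depth_rows:
        for value in row:
            total += 1
            if not math.isfinite(value) or value <= 0.0:
                invalid += 1
    # An empty map has no invalid pixels by convention here.
    return invalid / total if total else 0.0
```

A result near zero suggests dense depth; a large value suggests raw sensor depth with holes, which this section warns against.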

data/instance_tree.json stores precomputed instance-tree assignments for this source-available release. The public release does not include the internal instance-tree builder because parts of that implementation depend on closed or restricted components that cannot be redistributed. The paper describes the instance-tree construction procedure at the level intended for reproduction; new sequences need a compatible precomputed artifact. The artifact must match the exact instance_masks/ files used by the run.

For update, the selected update frame must have RGB, instance masks, and update-frame tree entries in instance_tree.json. Update-frame depth is only used for optional depth evaluation when available.

Included sample sequences:

  • samples/sample_seq1: custom real-world sequence with more than eight frames; supports both initial segmentation and update.
  • samples/sample_seq2: custom real-world sequence with more than eight frames; supports both initial segmentation and update.
  • samples/sample_seq3: difficult sequence from GraspClutter6D.
  • samples/sample_seq4: easy sequence from GraspClutter6D.
  • samples/sample_seq5: custom synthetic sequence captured in Isaac Sim.

See samples/README.md for additional sequence layout notes.
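The sequence layout described in this section can be verified with a short check before launching a container. This is a sketch only: the function name is hypothetical, and it tests for the presence of each entry, not for its contents.

```python
from pathlib import Path

# Entries listed in the sequence layout, relative to samples/<sequence_name>/.
REQUIRED_ENTRIES = (
    "data/transforms.json",
    "data/images",
    "data/depth",
    "data/instance_masks",
    "data/instance_tree.json",
)


def missing_layout_entries(sequence_dir):
    """Return the required entries that do not exist under sequence_dir."""
    root = Path(sequence_dir)
    return [entry for entry in REQUIRED_ENTRIES if not (root / entry).exists()]
```

For example, missing_layout_entries("samples/sample_seq2") returns an empty list when all five entries are present.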

Run

Initial segmentation:

docker run --rm --gpus all \
  -v "$PWD/samples:/workspace/samples" \
  -v "$PWD/.cache:/workspace/.cache" \
  clutt3r-seg:duoduoclip-local \
  bash scripts/run_initial.sh samples/sample_seq2 0,1,2,3,4,5,6,7 "cracker box"

Update segmentation and scene update:

docker run --rm --gpus all \
  -v "$PWD/samples:/workspace/samples" \
  -v "$PWD/.cache:/workspace/.cache" \
  clutt3r-seg:duoduoclip-local \
  bash scripts/run_update.sh samples/sample_seq2 8 1 "chips can"

Run update only after initial segmentation has created samples/<sequence_name>/state.pkl.
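A small pre-flight check can catch a missing state.pkl or missing update-frame masks before the container starts. This is a sketch under assumptions: the function name is hypothetical, and it assumes the frame ID appears in mask file names exactly as passed in. It does not check the update-frame RGB image or the tree entries in instance_tree.json, because this README does not specify their naming or schema.

```python
from pathlib import Path


def update_preflight(sequence_dir, update_frame):
    """Return a list of human-readable problems; empty means the checks passed."""
    root = Path(sequence_dir)
    problems = []

    # state.pkl is created by the initial segmentation run.
    if not (root / "state.pkl").is_file():
        problems.append("state.pkl not found; run initial segmentation first")

    # Assumption: masks for the frame are named mask_<update_frame>_<id>.png.
    mask_dir = root / "data" / "instance_masks"
    pattern = f"mask_{update_frame}_*.png"
    masks = list(mask_dir.glob(pattern)) if mask_dir.is_dir() else []
    if not masks:
        problems.append(f"no instance masks found for frame {update_frame}")

    return problems
```

Calling update_preflight("samples/sample_seq2", 8) before the update command shown above returns an empty list when both checks pass.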

Outputs are written under samples/<sequence_name>/output/:

  • <target_prompt>.ply: prompt-matched target object point cloud.
  • updated_scene_<update_num>.ply: full updated scene from the update step.

The 6-DoF grasp pose estimation stage used in the paper is not included in this release; exported object point clouds can be used as input to external grasp-pose estimators.

For an interactive shell:

docker run --rm -it --gpus all \
  -v "$PWD:/workspace" \
  -v "$PWD/.cache:/workspace/.cache" \
  clutt3r-seg:duoduoclip-local \
  bash

License

This repository is released under the Clutt3R-Seg Non-Commercial Source-Available License. See LICENSE for details.

Third-party code, checkpoints, datasets, models, and generated assets are governed by their own licenses. See THIRD_PARTY.md for third-party notices.

Citation

If you find our work useful, please cite:

@inproceedings{noh2026clutt3rseg,
  title={Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes},
  author={Noh, Jeongho and Rhee, Tai Hyoung and Lee, Eunho and Kim, Jeongyun and Lee, Sunwoo and Kim, Ayoung},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026}
}
