stmetrics

Temporally enforced metrics for 3D / 4D perception.

stmetrics is a set of DDP-compatible PyTorch metrics for perception, understanding, and reconstruction tasks that span multiple temporal observations. When only one timestep is evaluated, each metric collapses to its standard 3D form.

Submodules:

Instances: standard mAP, temporal AP (t-AP), per-timestep AP, per-aux-label recall, instance feature similarity (t-SIM)
Planned: points, tracking, geometry, lidar

Installation

git clone https://github.com/GradientSpaces/stmetrics.git
cd stmetrics
pip install -e .

Requires Python >= 3.8 and PyTorch >= 1.10.

Quickstart — instance segmentation

Each evaluator applies one temporal-enforcement policy, and the list of evaluators passed as heads defines the metric set. Passing recall=True to an evaluator additionally reports per-aux-label recall.

from stmetrics.instances import (
    InstanceMetrics, LegacyAPEvaluator, TemporalEvaluator, SelectTimestepEvaluator,
)

metric = InstanceMetrics(
    dataset="/path/to/my_dataset.yaml",
    heads=[                                          # omit for a single TemporalEvaluator()
        TemporalEvaluator(recall=True, aux="changes"),   # t-AP + t-REC
        LegacyAPEvaluator(),                             # AP (overlap pooled over timesteps)
        SelectTimestepEvaluator(timesteps=[0]),          # stage1-AP
        SelectTimestepEvaluator(timesteps=[1]),          # stage2-AP
    ],
    log_prefix="val",
)

for preds, targets in your_validation_loader:
    metric.update(preds, targets)

results = metric.compute()
print(results["val_mean_AP"], results["val_mean_t-AP"])

See examples/minimal_instances.py for a runnable end-to-end script with synthetic data.

Input format (instances)

InstanceMetrics.update(preds, targets) takes two lists of dicts of the same length — one entry per scene. Tensors can live on CPU or GPU; the metric moves them to its own device.

`preds[i]` — one scene's predictions

| Key | Shape | dtype | Description |
|---|---|---|---|
| `pred_classes` | `(K,)` | int | Predicted class id (matches `valid_class_ids` in the dataset spec) for each of K predicted instances |
| `pred_scores` | `(K,)` | float | Confidence in [0, 1] per instance |
| `pred_masks` | `(N, K)` | bool / int | Per-point predicted masks; N is the total number of points/voxels in the scene |

`targets[i]` — one scene's ground truth

| Key | Shape | dtype | Description |
|---|---|---|---|
| `ids` | `(G,)` | int | GT instance id per GT object (>= 0 valid; negative means ignore) |
| `labels` | `(G,)` | int | GT class id per GT object |
| `masks` | `(G, N)` | bool | Per-point GT masks |
| `timesteps` | `(N,)` | int | Timestep id per point. Optional; if omitted, the scene is treated as a single 3D timestep. The key is configurable via `InstanceMetrics(timestep_key=...)` |
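To make the shape conventions concrete (note that predicted masks are `(N, K)` while GT masks are `(G, N)`), here is a minimal sketch of a consistency check for one scene. `check_scene` is a hypothetical helper, not part of stmetrics, and it uses nested Python lists in place of tensors so it runs without PyTorch; it assumes the default `timesteps` key.

```python
# Hypothetical helper (not part of stmetrics): checks one scene's dicts against
# the documented shapes. Nested lists stand in for tensors.

def check_scene(pred, target):
    """Return (N, K, G) or raise ValueError if a documented shape is violated."""
    K = len(pred["pred_classes"])
    if len(pred["pred_scores"]) != K:
        raise ValueError("pred_scores must have shape (K,)")
    if not all(0.0 <= s <= 1.0 for s in pred["pred_scores"]):
        raise ValueError("pred_scores must lie in [0, 1]")

    G = len(target["ids"])
    if len(target["labels"]) != G or len(target["masks"]) != G:
        raise ValueError("ids, labels and masks must all have G entries")

    # N is the number of points/voxels; GT masks are laid out as (G, N).
    N = len(target["masks"][0]) if G else len(pred["pred_masks"])
    if any(len(row) != N for row in target["masks"]):
        raise ValueError("target masks must have shape (G, N)")

    # Predicted masks use the transposed layout (N, K).
    if len(pred["pred_masks"]) != N or any(len(row) != K for row in pred["pred_masks"]):
        raise ValueError("pred_masks must have shape (N, K)")

    # timesteps is optional; when present it labels every point.
    if "timesteps" in target and len(target["timesteps"]) != N:
        raise ValueError("timesteps must have shape (N,)")
    return N, K, G


# One scene: 4 points over two timesteps, 1 GT object, 2 predictions.
pred = {
    "pred_classes": [5, 7],
    "pred_scores": [0.9, 0.4],
    "pred_masks": [[1, 0], [1, 0], [0, 1], [0, 1]],  # (N=4, K=2)
}
target = {
    "ids": [0],
    "labels": [5],
    "masks": [[True, True, False, False]],  # (G=1, N=4)
    "timesteps": [0, 0, 1, 1],              # (N=4,)
}
print(check_scene(pred, target))  # (4, 2, 1)
```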

Optional temporal annotations

| Key | Shape | dtype | Description |
|---|---|---|---|
| `aux_labels` | `(G,)` | int | Change-type id per GT object (matches `valid_aux_ids` in the spec). The key on the target dict is the one named by `aux` in the dataset spec (e.g. `changes`) |
| `ambiguities` | `list[list[int]]` | - | Optional. Groups of instance ids that are interchangeable across stages (e.g. identical objects swapped between time t and t+1). Empty list if none. |
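The README does not spell out how evaluators consume `ambiguities`. One plausible reading, sketched here purely as an assumption, is that ids in the same group are treated as one identity when checking consistency across stages. `canonical_ids` and `same_instance` are hypothetical names, not stmetrics functions.

```python
# Hypothetical sketch: collapse each ambiguity group to one canonical id so that
# interchangeable instances compare as equal. This is an assumption about how
# the groups could be used, not stmetrics' internal logic.

def canonical_ids(ambiguities):
    """Map every id in a group to the smallest id of that group."""
    mapping = {}
    for group in ambiguities:
        rep = min(group)
        for inst_id in group:
            mapping[inst_id] = rep
    return mapping


def same_instance(a, b, mapping):
    """True if ids a and b are identical or declared interchangeable."""
    return mapping.get(a, a) == mapping.get(b, b)


# Two identical chairs (ids 3 and 4) swapped between t and t+1.
ambiguities = [[3, 4]]
m = canonical_ids(ambiguities)
print(same_instance(3, 4, m))  # True
print(same_instance(3, 5, m))  # False
```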

Optional — for `TSimHead`

To enable pairwise temporal feature similarity (t-SIM), provide:

| Key (configurable) | Shape | dtype | Description |
|---|---|---|---|
| `features` | `(M, D)` | float | Per-segment / per-point embedding of dimension D |
| `ids` | `(M,)` | int | GT instance id each segment belongs to (the head compares same-id features across stages) |
| `timesteps` | `(M,)` | int | Stage id per segment. Must contain only two distinct integer values |
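A minimal sketch of what a same-id, cross-stage similarity could look like, assuming features are mean-pooled per (id, stage) and compared by cosine similarity. The pooling, the cosine measure, and the population standard deviation are all assumptions; `tsim_sketch` is hypothetical and only mirrors the `mean` / `median` / `std` / `n_pairs` statistics that `TSimHead` reports.

```python
import math
import statistics
from collections import defaultdict


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_vec(vecs):
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def tsim_sketch(features, ids, timesteps):
    """Cosine similarity between each GT id's mean feature in the two stages."""
    stages = sorted(set(timesteps))
    if len(stages) != 2:
        raise ValueError("timesteps must contain exactly two distinct values")
    groups = defaultdict(list)  # (instance id, stage) -> list of feature vectors
    for f, i, t in zip(features, ids, timesteps):
        groups[(i, t)].append(f)

    sims = []
    for inst_id in sorted({i for i, _ in groups}):
        a, b = (inst_id, stages[0]), (inst_id, stages[1])
        if a in groups and b in groups:  # id observed in both stages
            sims.append(cosine(mean_vec(groups[a]), mean_vec(groups[b])))
    if not sims:
        return {"mean": float("nan"), "median": float("nan"), "std": float("nan"), "n_pairs": 0}
    return {
        "mean": statistics.mean(sims),
        "median": statistics.median(sims),
        "std": statistics.pstdev(sims),  # population std: an assumption
        "n_pairs": len(sims),
    }


features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ids = [0, 0, 1, 1]
timesteps = [0, 1, 0, 1]
print(tsim_sketch(features, ids, timesteps)["n_pairs"])  # 2
```

Here id 0 has identical features in both stages (similarity 1.0) and id 1 drifts by 45 degrees (similarity about 0.707).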

Dataset spec (YAML)

Each dataset is described by a small YAML file. See examples/rio.yaml for a complete example (an 18-class indoor 4D-scene benchmark).

# my_dataset.yaml
name: my_dataset

# Human-readable class names, in iteration order. Must align with valid_class_ids.
class_labels:
- cabinet
- chair
- table
- door

# Integer class IDs corresponding 1:1 to class_labels. These are the values
# that appear in target.labels / pred.pred_classes. Omit to assume 0..N-1.
valid_class_ids:
- 3
- 5
- 7
- 8

# Key on each target dict holding the per-instance aux label (e.g. "changes").
aux: changes

# Aux-label names, in iteration order. Output keys use these (e.g. val_rigid_REC).
# Datasets with no aux concept can use a single label, e.g. ["all"] / [0].
aux_labels:
- static
- rigid
- nonrigid

# Integer aux IDs corresponding 1:1 to aux_labels. Match target[aux].
valid_aux_ids:
- 0
- 1
- 2

# Optional. Group class names into categories for "mean per-category" output keys:
#   val_mean_head_AP, val_mean_common_AP, val_mean_tail_AP, ...
# Omit the categories block entirely if your dataset has no such structure.
categories:
  head:
  - cabinet
  - chair
  common:
  - table
  tail:
  - door

Load a spec by passing the path to your YAML file:

from stmetrics import load_dataset_spec
spec = load_dataset_spec("/path/to/my_dataset.yaml")

InstanceMetrics(dataset=...) accepts the same form.
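The YAML comments state several invariants (1:1 alignment of names and ids, ids defaulting to 0..N-1, categories naming known classes). A minimal sketch of checking them on an already-parsed spec dict, such as the result of PyYAML's `yaml.safe_load`, follows. `check_spec` is hypothetical and is not `load_dataset_spec`; whether the real loader enforces the same rules is an assumption.

```python
# Hypothetical validator, NOT stmetrics' load_dataset_spec: it enforces the
# invariants stated in the YAML comments on an already-parsed spec dict.

def check_spec(spec):
    """Return a class-id -> class-name map, or raise ValueError."""
    names = spec["class_labels"]
    # Omitting valid_class_ids means the ids are 0..N-1.
    ids = spec.get("valid_class_ids", list(range(len(names))))
    if len(ids) != len(names):
        raise ValueError("valid_class_ids must align 1:1 with class_labels")

    if "valid_aux_ids" in spec and len(spec["valid_aux_ids"]) != len(spec.get("aux_labels", [])):
        raise ValueError("valid_aux_ids must align 1:1 with aux_labels")

    # Every category member must be one of the declared class names.
    for cat, members in spec.get("categories", {}).items():
        unknown = sorted(set(members) - set(names))
        if unknown:
            raise ValueError(f"category {cat!r} lists unknown classes {unknown}")
    return dict(zip(ids, names))


# The example spec above, written as the dict a YAML loader would return.
spec = {
    "name": "my_dataset",
    "class_labels": ["cabinet", "chair", "table", "door"],
    "valid_class_ids": [3, 5, 7, 8],
    "aux": "changes",
    "aux_labels": ["static", "rigid", "nonrigid"],
    "valid_aux_ids": [0, 1, 2],
    "categories": {"head": ["cabinet", "chair"], "common": ["table"], "tail": ["door"]},
}
print(check_spec(spec))  # {3: 'cabinet', 5: 'chair', 7: 'table', 8: 'door'}
```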

Output keys (instances)

compute() returns a dict[str, torch.Tensor] with keys of the form:

<log_prefix>_<class_or_change_or_'mean'>_<metric_type>[_50|_25]

Where <metric_type> is one of:

| Suffix | Source evaluator |
|---|---|
| `AP` | `LegacyAPEvaluator`: mAP with overlap pooled over timesteps, over IoU thresholds 0.5 to 0.9 |
| `t-AP` | `TemporalEvaluator`: temporal mAP (a match requires IoU > threshold in every timestep) |
| `stage1-AP`, `stage2-AP`, ... | `SelectTimestepEvaluator(timesteps=[k])`: per-timestep AP; timestep `k` is reported as `stage{k+1}` |
| `REC`, `t-REC`, `stage1-REC`, ... | any evaluator with `recall=True`: per-aux-label recall |
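The difference between the `AP` and `t-AP` matching criteria can be illustrated on a single prediction/GT pair. This is a sketch of the two overlap rules as described in the table, not stmetrics' implementation; skipping timesteps where both masks are empty is an assumption.

```python
# Illustrative sketch only, not stmetrics' implementation. Masks are lists of
# 0/1 (or bool) over the N points of a scene; timesteps gives each point's id.

def masked_iou(pred, gt, keep):
    """IoU over points with keep[i] True; None if both masks are empty there."""
    idx = [i for i, k in enumerate(keep) if k]
    inter = sum(1 for i in idx if pred[i] and gt[i])
    union = sum(1 for i in idx if pred[i] or gt[i])
    return inter / union if union else None


def pooled_iou(pred, gt):
    """AP-style overlap: all timesteps pooled into one IoU."""
    value = masked_iou(pred, gt, [True] * len(gt))
    return 0.0 if value is None else value


def temporal_iou(pred, gt, timesteps):
    """t-AP-style overlap: the minimum per-timestep IoU, so temporal_iou > thr
    means IoU > thr in every timestep considered."""
    per_step = [masked_iou(pred, gt, [t == s for t in timesteps])
                for s in sorted(set(timesteps))]
    # Assumption: timesteps where both masks are empty are skipped.
    per_step = [v for v in per_step if v is not None]
    return min(per_step) if per_step else 0.0


# One GT object present in both timesteps; the prediction only covers
# a third of it at timestep 1.
timesteps = [0, 0, 0, 1, 1, 1]
gt = [1, 1, 1, 1, 1, 1]
pred = [1, 1, 1, 1, 0, 0]
print(round(pooled_iou(pred, gt), 3), round(temporal_iou(pred, gt, timesteps), 3))
# 0.667 0.333
```

At a 0.5 threshold this prediction matches under the pooled criterion but not the temporal one. With a single timestep the two values coincide, which is the collapse to standard 3D evaluation.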

The _50 / _25 variants are computed at a single IoU threshold of 0.5 / 0.25.

Per-category mean keys (<prefix>_mean_<cat>_<metric>) appear if and only if the dataset spec declares a categories: block.
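As a quick reference for looking up results, the sketch below assembles keys following the documented pattern. `metric_key` and `stage_metric` are hypothetical helpers, not stmetrics functions; the `stage{k+1}` mapping follows the quickstart comments (`timesteps=[0]` gives `stage1-AP`).

```python
# Hypothetical helpers (not part of stmetrics) that assemble keys following
# <log_prefix>_<class_or_change_or_'mean'>_<metric_type>[_50|_25].

def metric_key(prefix, name, metric, iou=None):
    key = f"{prefix}_{name}_{metric}"
    if iou is not None:
        if iou not in (50, 25):
            raise ValueError("only _50 and _25 variants are documented")
        key += f"_{iou}"
    return key


def stage_metric(k, kind="AP"):
    # Follows the quickstart comments: timesteps=[0] -> stage1-AP.
    return f"stage{k + 1}-{kind}"


print(metric_key("val", "mean", "t-AP"))           # val_mean_t-AP
print(metric_key("val", "chair", "AP", iou=50))    # val_chair_AP_50
print(metric_key("val", "rigid", "REC"))           # val_rigid_REC
print(metric_key("val", "mean_head", "AP"))        # val_mean_head_AP
print(metric_key("val", "mean", stage_metric(1)))  # val_mean_stage2-AP
```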

TSimHead emits:

<log_prefix>_tsim_mean
<log_prefix>_tsim_median
<log_prefix>_tsim_std
<log_prefix>_tsim_n_pairs

Hydra integration

If you already drive your project with Hydra, InstanceMetrics accepts a list of {"_target_": ...} head configs directly as heads:

# my_metric.yaml
_target_: stmetrics.instances.InstanceMetrics
dataset: /path/to/my_dataset.yaml
heads:
  - _target_: stmetrics.instances.TemporalEvaluator
    recall: true
    aux: changes
  - _target_: stmetrics.instances.LegacyAPEvaluator
  - _target_: stmetrics.instances.SelectTimestepEvaluator
    timesteps: [0]
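In a Hydra project, a config like the one above would typically be passed to `hydra.utils.instantiate`, which by default also instantiates nested `_target_` nodes. The sketch below is a simplified stand-in, not Hydra's implementation, showing how each `_target_` entry in `heads` turns into an object. Standard-library classes replace the stmetrics targets so it runs anywhere; whether the nested heads are built by Hydra or by InstanceMetrics itself is not specified in this README.

```python
import importlib


def instantiate(node):
    """Simplified stand-in for hydra.utils.instantiate (not Hydra's code).

    Lists and dicts are walked recursively; a dict carrying "_target_" is
    replaced by calling the dotted-path callable with its other keys.
    """
    if isinstance(node, list):
        return [instantiate(x) for x in node]
    if isinstance(node, dict):
        kwargs = {k: instantiate(v) for k, v in node.items() if k != "_target_"}
        if "_target_" not in node:
            return kwargs
        module_name, _, attr = node["_target_"].rpartition(".")
        return getattr(importlib.import_module(module_name), attr)(**kwargs)
    return node


# Standard-library targets stand in for the stmetrics classes here.
cfg = {
    "_target_": "types.SimpleNamespace",
    "heads": [
        {"_target_": "fractions.Fraction", "numerator": 1, "denominator": 2},
        {"_target_": "datetime.timedelta", "seconds": 30},
    ],
}
obj = instantiate(cfg)
print(obj.heads)  # [Fraction(1, 2), datetime.timedelta(seconds=30)]
```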

Citation

If you find our code and paper useful, please cite ReScene4D, the work that first introduced t-mAP:

@inproceedings{steiner2026rescene4d,
      author = {Steiner, Emily and Zheng, Jianhao and Howard-Jenkins, Henry and Xie, Chris and Armeni, Iro},
      title = {ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year = {2026},
}
