Code for Overcoming Dependent Censoring in the Evaluation of Survival Models (2026)
Preprint: https://arxiv.org/abs/2502.19460
Accepted to UAI 2026
This repository accompanies our introduction of the dependent Brier score and integrated dependent Brier score (IBS-Dep), a survival-model evaluation metric designed for settings with dependent censoring. The method models the joint distribution of event and censoring times and uses the Copula-Graphic estimator to impute marginal event times for censored instances. The repository contains the implementation and experiments supporting our theoretical analysis of the CG-based margin-time estimator and our semi-synthetic evaluation of how well IBS-Dep approximates the oracle IBS based on uncensored event times.
In survival analysis, we are interested in the time until an event occurs. The event might be cancer relapse, disease progression, equipment failure, or something else entirely. The problem is that we do not always get to observe that event. A patient may leave the study, be lost to follow-up, or reach the end of the observation period before the event occurs. This is called censoring.
Let
- $E$ denote the event time,
- $C$ denote the censoring time, and
- $X$ denote the observed covariates.
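For each individual, we observe only

$$T = \min(E, C), \qquad \delta = \mathbb{1}[E \le C].$$

In other words, we observe whichever happens first: the event or censoring. The indicator $\delta$ records which of the two occurred: $\delta = 1$ if the event was observed and $\delta = 0$ if the individual was censored.

Standard evaluation methods typically assume conditionally independent censoring, $E \perp C \mid X$. Informally, this says that after accounting for the observed covariates $X$, the censoring time carries no further information about the event time. Dependent censoring occurs when this assumption fails.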
This is not an unusual situation. For example, suppose that tumor grade affects both cancer relapse and the probability of dropping out of a study. If tumor grade is not observed, then relapse and dropout may remain dependent even after conditioning on all available covariates.
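The tumor-grade scenario can be mimicked with a small simulation. The sketch below is illustrative only and is not part of the repository: it assumes a hypothetical unobserved frailty `z` (standing in for tumor grade) that multiplies the hazard of both the event and the censoring time, and it measures the resulting dependence with Kendall's $\tau$. All function names and parameter values are assumptions chosen for the example.

```python
import random


def simulate_latent_times(n, shape=0.5, seed=0):
    """Draw (event, censoring) time pairs that share an unobserved frailty.

    Hypothetical setup: z ~ Gamma(shape, scale=1/shape) has mean 1 and scales
    the hazard of both exponential times, so a large z shortens both.
    """
    rng = random.Random(seed)
    event_times, censoring_times = [], []
    for _ in range(n):
        # Guard against a zero rate, which expovariate cannot handle.
        z = max(rng.gammavariate(shape, 1.0 / shape), 1e-12)
        event_times.append(rng.expovariate(z))
        censoring_times.append(rng.expovariate(z))
    return event_times, censoring_times


def kendall_tau(x, y):
    """Kendall's tau from concordant and discordant pairs (O(n^2), no tie correction)."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


event_times, censoring_times = simulate_latent_times(n=500)
tau = kendall_tau(event_times, censoring_times)
print(f"Kendall's tau between latent event and censoring times: {tau:.2f}")
```

Under this gamma-frailty construction the population value is $\tau = 1 / (1 + 2 \cdot \text{shape})$, that is, 0.5 for `shape=0.5`. The calculation is only possible here because the simulation exposes both latent times; in real data only the earlier of the two is observed.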
Why is this a problem for evaluation? Standard estimators generally treat censored individuals as being comparable to those who remain under observation, after accounting for the available covariates. Under dependent censoring, that may be wrong.
A patient who drops out because their health is deteriorating is not necessarily comparable to a patient who remains in the study. Kaplan–Meier and inverse-probability-of-censoring weighted estimators may therefore assign inappropriate survival probabilities or censoring weights.
The result is that an evaluation metric can report a distorted prediction error rather than the error we would have obtained if every event time had been observed. A survival model may consequently appear better or worse simply because the evaluation method does not account for the dependence between event and censoring times.
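The distortion can be made concrete with a simulation in which the true marginal event-time distribution is known. The sketch below is an illustration under stated assumptions, not the repository's experiment code: latent event and censoring times share a hypothetical gamma frailty, and a hand-written Kaplan–Meier estimate (assuming no tied times) is compared with the true marginal survival probability at $t = 1$.

```python
import random


def simulate_observed(n, shape=0.5, seed=1):
    """Observed (time, event) pairs whose latent times share a gamma frailty."""
    rng = random.Random(seed)
    times, events = [], []
    for _ in range(n):
        z = max(rng.gammavariate(shape, 1.0 / shape), 1e-12)  # unobserved risk factor
        event_time, censoring_time = rng.expovariate(z), rng.expovariate(z)
        times.append(min(event_time, censoring_time))
        events.append(event_time <= censoring_time)
    return times, events


def kaplan_meier_at(times, events, t):
    """Kaplan-Meier estimate of S(t), assuming no tied times."""
    n = len(times)
    survival = 1.0
    for rank, (time, event) in enumerate(sorted(zip(times, events))):
        if time > t:
            break
        if event:
            at_risk = n - rank  # individuals still under observation
            survival *= 1.0 - 1.0 / at_risk
    return survival


def true_event_survival(t, shape=0.5):
    """Marginal S_E(t) = (1 + t / shape)^(-shape) implied by the frailty model."""
    return (1.0 + t / shape) ** (-shape)


times, events = simulate_observed(n=5000)
km = kaplan_meier_at(times, events, t=1.0)
truth = true_event_survival(1.0)
print(f"Kaplan-Meier S(1) = {km:.3f}, true S_E(1) = {truth:.3f}")
```

Because high-risk individuals tend to be censored early in this setup, Kaplan–Meier is expected to overestimate event-free survival. Any metric that relies on such an estimate inherits the same distortion.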
Can dependent censoring be detected from the observed data alone? Unfortunately, not without additional assumptions.
For each individual, we observe either the event time or the censoring time, but not both. If a patient is censored, we do not know when their event would eventually have occurred. Likewise, if the event occurs first, we do not observe their latent censoring time.
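This means that the joint distribution of $(E, C)$ is not identifiable from the observed data alone: different dependence structures between the event and censoring times can produce exactly the same distribution of observed data.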
In practice, dependent censoring can instead be investigated using:
- Domain knowledge: Could dropout, loss to follow-up, or study termination be related to disease severity or event risk?
- Observed predictors of censoring: Are some measured variables associated with both censoring and the outcome?
- Sensitivity analysis: Do the conclusions change across plausible copula families or dependence strengths?
- Joint modeling assumptions: Does a specified copula model provide a reasonable description of the latent event and censoring processes?
The goal is therefore not to "discover" the true dependence structure from the observed data alone. Instead, we make the dependence assumptions explicit, estimate what can be estimated under those assumptions, and examine whether the conclusions are robust to other plausible choices.
Any fitted copula parameter should consequently be interpreted as model-dependent rather than as assumption-free evidence that censoring is dependent.
The repository includes an experiment for fitting and comparing candidate copula models:

    python src/experiments/find_copula.py --help

See src/experiments/find_copula.py.
The script can be used to:
- fit the supported copula families,
- estimate their dependence parameters,
- compare candidate dependence models,
- report the corresponding Kendall's $\tau$, and
- perform sensitivity analyses under alternative dependence assumptions.
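Kendall's $\tau$ and a copula's dependence parameter are two views of the same quantity for one-parameter families. The sketch below shows the standard closed-form conversions for the Clayton and Gumbel copulas. It is not taken from `find_copula.py`, and whether the script uses these parameterizations or supports these families is an assumption. The function names are hypothetical.

```python
def clayton_theta_from_tau(tau):
    """Clayton: tau = theta / (theta + 2), so theta = 2 * tau / (1 - tau)."""
    if not 0.0 < tau < 1.0:
        raise ValueError("This sketch covers positive dependence only: 0 < tau < 1.")
    return 2.0 * tau / (1.0 - tau)


def clayton_tau_from_theta(theta):
    """Inverse of clayton_theta_from_tau for theta > 0."""
    return theta / (theta + 2.0)


def gumbel_theta_from_tau(tau):
    """Gumbel: tau = 1 - 1 / theta, so theta = 1 / (1 - tau), with theta >= 1."""
    if not 0.0 <= tau < 1.0:
        raise ValueError("The Gumbel copula requires 0 <= tau < 1.")
    return 1.0 / (1.0 - tau)


def gumbel_tau_from_theta(theta):
    """Inverse of gumbel_theta_from_tau for theta >= 1."""
    return 1.0 - 1.0 / theta


for tau in (0.25, 0.5, 0.75):
    print(
        f"tau={tau:.2f}  "
        f"Clayton theta={clayton_theta_from_tau(tau):.2f}  "
        f"Gumbel theta={gumbel_theta_from_tau(tau):.2f}"
    )
```

Reporting $\tau$ alongside the fitted parameter makes dependence strengths comparable across families, since the raw parameters live on different scales. The Frank copula has no closed-form conversion and would need a numerical solve.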
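The basic idea is to model the event and censoring times jointly rather than treating them as unrelated processes. Their joint survival distribution is written as

$$S_{E,C}(e, c) = \Pr(E > e, C > c) = \mathcal{C}_\theta\big(S_E(e), S_C(c)\big),$$

where $S_E$ and $S_C$ are the marginal survival functions of the event and censoring times and $\mathcal{C}_\theta$ is a copula with dependence parameter $\theta$.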
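As a minimal sketch of such a joint model, the code below applies a Clayton copula to two marginal survival functions. The exponential margins, the rates, and the function names are assumptions made for illustration and do not reflect the repository's API.

```python
import math


def clayton_copula(u, v, theta):
    """Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta), theta > 0."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def exponential_survival(t, rate):
    """Survival function of an exponential time (assumed margin for this example)."""
    return math.exp(-rate * t)


def joint_survival(e, c, theta, event_rate=1.0, censoring_rate=0.5):
    """Pr(E > e, C > c): the copula applied to the two marginal survival functions."""
    return clayton_copula(
        exponential_survival(e, event_rate),
        exponential_survival(c, censoring_rate),
        theta,
    )


e, c = 1.0, 1.0
independent = exponential_survival(e, 1.0) * exponential_survival(c, 0.5)
print(f"independence: {independent:.4f}")
for theta in (0.01, 2.0, 8.0):
    print(f"theta={theta:<5} joint survival: {joint_survival(e, c, theta):.4f}")
```

For small `theta` the joint survival probability approaches the product of the margins, which is the independent-censoring case. Larger values make it increasingly likely that both times are long, or both short, together.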
The main contribution of the paper is the dependent Brier score and its integrated version, IBS-Dep, for evaluating survival models under dependent censoring.
IBS-Dep models the joint distribution of event and censoring times and uses the Copula-Graphic estimator to estimate the marginal event-time distribution. For censored individuals, this distribution is used to construct a margin-time surrogate for the unobserved event time.
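To illustrate the estimation step, the sketch below implements a Copula-Graphic estimator for a Clayton copula in the form given by Rivest and Wells for Archimedean copulas, assuming no tied times. It is a simplified illustration rather than the repository's implementation: the function names are hypothetical, the data come from an assumed gamma-frailty simulation that induces a Clayton copula with `theta = 2`, and the dependence parameter is treated as known.

```python
import bisect
import random


def simulate_observed(n, theta=2.0, seed=2):
    """Observed (time, event) pairs with Clayton(theta)-dependent latent times."""
    rng = random.Random(seed)
    times, events = [], []
    for _ in range(n):
        # Gamma frailty with shape 1/theta and scale theta (mean 1).
        z = max(rng.gammavariate(1.0 / theta, theta), 1e-12)
        event_time, censoring_time = rng.expovariate(z), rng.expovariate(z)
        times.append(min(event_time, censoring_time))
        events.append(event_time <= censoring_time)
    return times, events


def copula_graphic_clayton(times, events, theta):
    """Copula-Graphic estimate of S_E at each observed event time (no ties)."""
    n = len(times)

    def phi(u):  # Clayton generator
        return (u ** -theta - 1.0) / theta

    def phi_inverse(s):
        return (1.0 + theta * s) ** (-1.0 / theta)

    event_times, survival, total = [], [], 0.0
    for rank, (time, event) in enumerate(sorted(zip(times, events))):
        if not event:
            continue
        at_risk = (n - rank) / n        # fraction under observation just before `time`
        remaining = (n - rank - 1) / n  # fraction under observation just after `time`
        if remaining == 0.0:
            estimate = 0.0              # the last observation is an event
        else:
            total += phi(remaining) - phi(at_risk)
            estimate = phi_inverse(total)
        event_times.append(time)
        survival.append(estimate)
    return event_times, survival


def step_value(event_times, survival, t):
    """Evaluate the right-continuous step function at time t."""
    index = bisect.bisect_right(event_times, t)
    return 1.0 if index == 0 else survival[index - 1]


times, events = simulate_observed(n=5000)
truth = (1.0 + 2.0 * 1.0) ** -0.5  # S_E(1) = (1 + theta * t)^(-1/theta) for theta = 2
cg_dependent = step_value(*copula_graphic_clayton(times, events, theta=2.0), t=1.0)
cg_near_independent = step_value(*copula_graphic_clayton(times, events, theta=1e-6), t=1.0)
print(f"true S_E(1)              = {truth:.3f}")
print(f"CG, theta = 2            = {cg_dependent:.3f}")
print(f"CG, theta -> 0 (KM-like) = {cg_near_independent:.3f}")
```

As `theta` approaches zero the Clayton generator tends to $-\log u$ and the estimator reduces to Kaplan–Meier, so the last line shows what an independence assumption would give on the same data. Rerunning the estimator over a grid of `theta` values is one simple way to carry out a sensitivity analysis.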
The repository contains implementations of:
- the dependent Brier score and integrated dependent Brier score,
- copula-based joint modeling of event and censoring times,
- the Copula-Graphic estimator,
- CG-based margin-time estimation,
- Kaplan–Meier and IPCW comparison methods,
- copula fitting and parameter estimation,
- synthetic experiments with controlled dependence and censoring, and
- semi-synthetic experiments using covariates from real survival datasets.
Reusable implementations are located in src/.
Experiment entry points are provided in scripts/ and src/experiments/.
The analysis and figure-generation code is available in notebooks/.
DependentEvaluator computes the dependent Brier score at a single time point and its integrated version.
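
    from evaluators import DependentEvaluator

    # survival_outputs, time_bins, and the train/test arrays are placeholders
    # for the outputs of your own fitted model and data split.
    dep_evaluator = DependentEvaluator(
        predicted_survival_curves=survival_outputs,
        time_coordinates=time_bins,
        test_event_times=test_times,
        test_event_indicators=test_events,
        train_event_times=train_times,
        train_event_indicators=train_events,
        copula_name="clayton",  # assumed copula family
        alpha=2.0,              # assumed dependence parameter for that family
    )

    # Dependent Brier score at a single time point.
    bs_dep = dep_evaluator.brier_score(method="BG_UW", target_time=365.0)

    # Integrated dependent Brier score (IBS-Dep).
    ibs_dep = dep_evaluator.integrated_brier_score(method="BG_UW", num_points=10)

Supported methods are BG, BG_UW, and CG_Q.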
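For reference, the oracle quantity that the dependent Brier score aims to approximate can be written down directly when every event time is known, as in a simulation. The sketch below is a plain-Python illustration with hypothetical function names, not the repository's code. It assumes the integrated score is a trapezoidal average over the time grid, normalized by the grid's span, which is one common convention.

```python
def oracle_brier_score(survival_at_t, event_times, t):
    """Mean squared gap between predicted S(t | x_i) and the true status 1[e_i > t]."""
    if len(survival_at_t) != len(event_times):
        raise ValueError("Need one prediction per instance.")
    errors = [(float(e > t) - s) ** 2 for s, e in zip(survival_at_t, event_times)]
    return sum(errors) / len(errors)


def oracle_integrated_brier_score(survival_curves, time_grid, event_times):
    """Trapezoidal average of the oracle Brier score over time_grid.

    survival_curves[i][k] is the predicted survival of instance i at time_grid[k].
    """
    scores = [
        oracle_brier_score([curve[k] for curve in survival_curves], event_times, t)
        for k, t in enumerate(time_grid)
    ]
    area = sum(
        0.5 * (scores[k] + scores[k + 1]) * (time_grid[k + 1] - time_grid[k])
        for k in range(len(time_grid) - 1)
    )
    return area / (time_grid[-1] - time_grid[0])


event_times = [1.0, 2.0, 3.0, 4.0]  # fully observed, as in a simulation
time_grid = [0.5, 1.5, 2.5, 3.5]

perfect = [[1.0 if e > t else 0.0 for t in time_grid] for e in event_times]
uninformative = [[0.5] * len(time_grid) for _ in event_times]

ibs_perfect = oracle_integrated_brier_score(perfect, time_grid, event_times)
ibs_uninformative = oracle_integrated_brier_score(uninformative, time_grid, event_times)
print(f"oracle IBS, perfect predictions:  {ibs_perfect:.3f}")
print(f"oracle IBS, constant 0.5 curves:  {ibs_uninformative:.3f}")
```

With censored data the status $\mathbb{1}[e_i > t]$ is unknown for censored instances after their censoring time, which is the gap that the evaluation methods in this repository are designed to fill.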
A small synthetic smoke test checks the pointwise dependent Brier score against the oracle score based on uncensored event times and verifies consistency between the single-point, multi-point, and integrated implementations:

    python test_dep_brier_score.py --n-samples 2000 --copula clayton --tau 0.5 --target-censoring 0.5

It can also be run with pytest:

    pytest -q test_dep_brier_score.py

If you find this paper useful in your work, please consider citing it:

    @article{lillelund_overcoming_2025,
      title={Overcoming Dependent Censoring in the Evaluation of Survival Models},
      author={Christian Marius Lillelund and Shi-ang Qi and Russell Greiner},
      journal={preprint, arXiv:2502.19460},
      year={2025},
    }


