DecisionRules.jl is a Julia package for training parametric decision rules through multi-stage optimization, following the Two-Stage General / Deep Decision Rules (TS-GDR / TS-DDR) framework. The main use-case is multi-stage stochastic control where the feasible (closed-loop) action at each stage is obtained by solving an optimization problem (e.g., OPF, MPC), and we want to train a policy end-to-end that maps observed states and uncertainties to target trajectories.
In TS-GDR, the policy does not directly output a control action. Instead, it outputs targets (typically a target state trajectory) that are enforced inside an optimization model through target constraints with slack. For a sampled uncertainty trajectory $w_{1:T}$, one training step is:
- sample $w_{1:T}$
- predict targets $\hat x_{1:T} = \pi(\cdot;\theta)$
- solve an optimization problem that projects the targets onto the feasible set (dynamics + constraints)
- differentiate to update $\theta$ using dual information and/or implicit sensitivities (via DiffOpt)
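The four steps above can be illustrated on a scalar toy problem whose inner optimization has a closed-form solution. This is only a sketch under stated assumptions: the quadratic cost, the quadratic slack penalty, and every name in it (`solve_inner`, `target_dual`, `train_toy`) are hypothetical and are not part of DecisionRules.jl, which uses JuMP models and a solver instead.

```julia
# Toy inner problem (assumption, not the package's model):
#   V(xhat; w) = min_x  0.5*(x - a*w)^2 + (ρ/2)*(x - xhat)^2
# Minimizer: x* = (a*w + ρ*xhat) / (1 + ρ).
# Envelope theorem: dV/dxhat = ρ*(xhat - x*), the sensitivity carried by the
# multiplier of the target constraint.

const a = 2.0      # cost parameter: the cheapest state is a*w
const ρ = 10.0     # target-slack penalty weight

solve_inner(xhat, w) = (a * w + ρ * xhat) / (1 + ρ)   # realized state x*
target_dual(xhat, x) = ρ * (xhat - x)                 # dV/dxhat

function train_toy(; θ = 0.0, η = 0.5, iters = 200)
    for k in 1:iters
        w = 0.5 + (k % 10) / 10          # 1) "sample" an uncertainty (deterministic here)
        xhat = θ * w                     # 2) linear policy predicts a target
        x = solve_inner(xhat, w)         # 3) inner problem returns the realized state
        g = target_dual(xhat, x) * w     # 4) chain rule: dV/dθ = dV/dxhat * dxhat/dθ
        θ -= η * g                       #    plain gradient step on the policy parameter
    end
    return θ
end

θ_trained = train_toy()   # approaches a, where targets coincide with the cheapest state
```

In this toy the slack vanishes exactly when the target equals `a*w`, so the dual-based gradient drives `θ` toward `a` without ever differentiating through the solution map itself.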
DecisionRules.jl implements this workflow in three flavors:
- Deterministic equivalent (direct transcription): one coupled optimization over the full horizon.
- Stage-wise decomposition (single shooting): one optimization per stage in a sequential rollout.
- Windowed decomposition (multiple shooting): one coupled optimization per window, chained by the realized end-state.
using Pkg
Pkg.add(url="https://github.com/LearningToOptimize/DecisionRules.jl.git")
DecisionRules.jl is intentionally “model-first”: you describe your problem in JuMP (DiffOpt-enabled for all but the deterministic equivalent approach), then the package handles simulation and training.
For any multi-stage model you will need:
- subproblems::Vector{JuMP.Model}: one JuMP model per stage (DiffOpt-enabled).
- state_params_in[t]: a vector of parameter variables for the incoming state at stage t.
- state_params_out[t]: a vector of (target_param, realized_state_var) tuples at stage t. The target_param is the parameter variable that the policy sets; the realized_state_var is the JuMP decision variable whose value becomes the realized state.
- an uncertainty_sampler that returns per-stage samples in the format used by DecisionRules.sample(...).
- Differentiable policies built with Flux.jl or similar compatible libraries. Input size is the number of uncertainty components per stage plus the size of the initial state; output size is the size of the target state at each stage.
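The input and output sizes of the policy can be checked with a shape-only sketch in plain Julia. The dimensions, the single linear layer, and the name `linear_policy` are illustrative assumptions; a real policy would be a Flux model, and the ordering of the concatenated input shown here is not confirmed by the package.

```julia
# Shape-only sketch of a one-layer linear policy (hypothetical, not the package API).
num_uncertainties = 3                         # uncertainty components per stage
state_dim = 2                                 # size of the state vector

input_dim = num_uncertainties + state_dim     # uncertainties plus state
output_dim = state_dim                        # one target per state component

W = zeros(output_dim, input_dim)              # weights of the linear layer
b = ones(output_dim)                          # bias
linear_policy(w_t, x_prev) = W * vcat(w_t, x_prev) .+ b

w_t = [0.1, 0.2, 0.3]                         # one stage's uncertainty realization
x_prev = [1.0, -1.0]                          # state fed to the policy
target = linear_policy(w_t, x_prev)           # target state for this stage
```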
Working patterns are provided in examples/.
The deterministic-equivalent approach solves one coupled optimization over all stages (the deterministic equivalent of a sampled trajectory). You build the deterministic-equivalent JuMP model with DecisionRules.deterministic_equivalent! and then train with the deterministic-equivalent overload of train_multistage.
using DecisionRules, JuMP, DiffOpt, Flux
using SCS
# 1) Build per-stage subproblems (DiffOpt-enabled) and collect:
# subproblems, state_params_in, state_params_out, uncertainty_sampler, uncertainties_structure
# 2) Build the deterministic equivalent over the full horizon
det = DiffOpt.diff_model(() -> DiffOpt.diff_optimizer(SCS.Optimizer))
det, uncertainties_structure_det = DecisionRules.deterministic_equivalent!(
    det,
    subproblems,
    state_params_in,
    state_params_out,
    Float64.(initial_state),
    uncertainties_structure,
)
# 3) Train a TS-DDR policy end-to-end
num_uncertainties = length(uncertainty_sampler()[1]) # number of uncertainty components per stage
policy = Chain(
    Dense(DecisionRules.policy_input_dim(num_uncertainties, length(initial_state)), 64, relu),
    Dense(64, length(initial_state)),
)
DecisionRules.train_multistage(
    policy,
    initial_state,
    det,
    state_in_det,
    state_out_det,
    uncertainty_sampler;
    num_batches=100,
    num_train_per_batch=32,
    optimizer=Flux.Adam(1e-3),
)
This mode typically gives the most faithful gradient signal (full coupling across the horizon), but it requires solving the largest inner problem per sample.
Single shooting solves one optimization per stage and rolls forward using the realized state returned by the solver. The policy can be closed-loop because it receives the realized state of the previous stage as input.
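A stage-wise rollout can be sketched without a solver by replacing each stage problem with a closed-form projection. The ramp-limited dynamics, `solve_stage`, `toy_policy`, and `rollout` are assumptions made for illustration only; in the package each stage is a DiffOpt-enabled JuMP model.

```julia
# Toy stage problem (assumption): the state may move at most `ramp` per stage,
# so the "solve" projects the target onto [x_prev - ramp, x_prev + ramp].
solve_stage(xhat, x_prev; ramp = 1.0) = clamp(xhat, x_prev - ramp, x_prev + ramp)

# Hypothetical policy: aim straight for the observed uncertainty.
toy_policy(w, x_prev) = w

function rollout(policy, x0, ws; ramp = 1.0)
    x_prev = x0
    states = Float64[]
    slack = 0.0
    for w in ws
        xhat = policy(w, x_prev)                       # target for this stage
        x = solve_stage(xhat, x_prev; ramp = ramp)     # realized state from the stage solve
        slack += abs(xhat - x)                         # target violation absorbed by slack
        push!(states, x)
        x_prev = x                                     # roll forward with the realized state
    end
    return states, slack
end

states, slack = rollout(toy_policy, 0.0, [0.5, 3.0, 3.0])
# states == [0.5, 1.5, 2.5], slack == 2.0
```

The second and third targets are out of reach of the ramp limit, so the realized trajectory lags behind them and the accumulated slack is nonzero.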
using DecisionRules, Flux
num_uncertainties = length(uncertainty_sampler()[1])
policy = Chain(
    Dense(DecisionRules.policy_input_dim(num_uncertainties, length(initial_state)), 64, relu),
    Dense(64, length(initial_state)),
)
DecisionRules.train_multistage(
    policy,
    initial_state,
    subproblems,
    state_params_in,
    state_params_out,
    uncertainty_sampler;
    num_batches=100,
    num_train_per_batch=32,
    optimizer=Flux.Adam(1e-3),
)
Internally, gradients are obtained by combining (i) dual information for target parameters and (ii) solution sensitivities computed through DiffOpt along the rollout.
Multiple shooting partitions the horizon into windows of length window_size. Each window solves a deterministic equivalent over its stages, then passes the realized end state to the next window. This can stabilize learning over long horizons compared to pure single shooting, while remaining cheaper than a full-horizon deterministic equivalent.
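The windowing itself is simple to picture in plain Julia. The sketch below partitions the stages and chains the windows through an end state; `stage_windows`, `solve_window`, and `chain_windows` are hypothetical stand-ins, and `solve_window` merely sums stage indices so that the chaining is visible.

```julia
# Partition stages 1:T into consecutive windows (plain Julia, no package calls).
stage_windows(T, window_size) = collect(Iterators.partition(1:T, window_size))

windows = stage_windows(10, 4)        # 1:4, 5:8, 9:10 (the last window is shorter)

# Hypothetical stand-in for the coupled per-window solve: it returns the
# "realized end state" of the window given the state it starts from.
solve_window(x_start, stages) = x_start + sum(stages)

function chain_windows(x0, windows)
    x = x0
    for stages in windows
        x = solve_window(x, stages)   # realized end state feeds the next window
    end
    return x
end

x_end = chain_windows(0, windows)     # 55, the sum of all stage indices
```

With `window_size = 1` this reduces to the single-shooting pattern, and with `window_size = T` to one full-horizon solve.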
using DecisionRules, JuMP, Flux, DiffOpt  # JuMP provides optimizer_with_attributes
using SCS
num_uncertainties = length(uncertainty_sampler()[1])
policy = Chain(
    Dense(DecisionRules.policy_input_dim(num_uncertainties, length(initial_state)), 64, relu),
    Dense(64, length(initial_state)),
)
windows = DecisionRules.setup_shooting_windows(
    subproblems,
    state_params_in,
    state_params_out,
    Float64.(initial_state),
    uncertainty_samples;
    window_size=24,
    model_factory=() -> DiffOpt.nonlinear_diff_model(optimizer_with_attributes(
        SCS.Optimizer,
        "verbose" => 0,
    )),
)
DecisionRules.train_multiple_shooting(
    policy,
    initial_state,
    windows,
    state_params_in,
    state_params_out,
    uncertainty_sampler;
    window_size=24, # e.g., 6, 24, ...
    num_batches=100,
    num_train_per_batch=32,
    optimizer=Flux.Adam(1e-3),
)
Evaluating a trained policy only through the deterministic equivalent can overstate its quality: the coupled solve re-optimizes all stages jointly and uses the slack penalty to absorb targets that cannot be followed stage by stage, which is exactly what deployment cannot do. The stage-wise rollout is the deployment semantics of a target-trajectory policy, so report it as the headline metric, together with a target-violation measure.
The training loops record metrics through a per-sample SampleLog cache and a per-batch record(sample_log, iter, model) callback. RolloutEvaluation is a ready-made helper that evaluates the policy stage-wise on a fixed held-out scenario set; call it from within record:
using DecisionRules, Random
# Materialize a FIXED held-out evaluation set once, before training
Random.seed!(1234)
eval_scenarios = [DecisionRules.sample(uncertainty_samples) for _ in 1:8]
rollout_eval = RolloutEvaluation(
    subproblems, state_params_in, state_params_out, initial_state, eval_scenarios;
    stride=25, # evaluate every 25 batches
    policy_state=:realized,
)
train_multistage(policy, initial_state, det, state_in_det, state_out_det, uncertainty_sampler;
    num_batches=100,
    record=(sample_log, iter, model) -> begin
        rollout_eval(iter, model)
        return false
    end)
policy_state selects which state is fed back to the policy between stage solves:
- :realized feeds the previous realized optimizer state to the policy. This is the closed-loop/deployment rollout and is the default.
- :target feeds the previous target/predicted state to the policy, matching the deterministic-equivalent target-generation path while still solving the stage subproblems sequentially.
Use policy_state=:target when comparing against a deterministic-equivalent
training loss, because both metrics then call the policy on the same target-state
history. Log :realized separately as the harder deployment diagnostic.
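The difference between the two feedback modes shows up as soon as a target cannot be followed. The toy below is a sketch only: the ramp-limited stage solve, `step_policy`, and `rollout_targets` are hypothetical, and only the symbols `:realized` and `:target` mirror the keyword described above.

```julia
# Toy stage solve (assumption): the state may move at most `ramp` per stage.
solve_stage(xhat, x_prev; ramp = 1.0) = clamp(xhat, x_prev - ramp, x_prev + ramp)

# Hypothetical policy: target = fed-back state + uncertainty.
step_policy(w, x_fed) = x_fed + w

function rollout_targets(x0, ws; policy_state = :realized)
    x_real, x_fed = x0, x0
    targets = Float64[]
    for w in ws
        xhat = step_policy(w, x_fed)          # policy sees the fed-back state
        x_real = solve_stage(xhat, x_real)    # stage solve always starts from the realized state
        push!(targets, xhat)
        x_fed = policy_state === :realized ? x_real : xhat
    end
    return targets
end

t_realized = rollout_targets(0.0, [2.0, 2.0]; policy_state = :realized)  # [2.0, 3.0]
t_target   = rollout_targets(0.0, [2.0, 2.0]; policy_state = :target)    # [2.0, 4.0]
```

Both runs solve the same stage problems, but in `:target` mode the policy keeps building on its own unfollowed target, so the second target drifts further from what the system can reach.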
Each evaluation reports (a) the rollout objective excluding the target-slack penalty term (the operational cost) and (b) the target-violation share — the realized slack penalty divided by the full objective. Policy comparisons are only trustworthy when the violation share is small (≤ ~0.05): a larger share means the policy's targets are not followable stage by stage and the reported cost is not what deployment would realize. When training drives the violation share to ~0, the deterministic-equivalent and rollout views are expected to coincide; the rollout metric is the guard that detects when they don't.
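The two reported quantities follow directly from the full rollout objective and the slack penalty. A minimal sketch, assuming a strictly positive full objective; the helper name `rollout_metrics` and its `threshold` keyword are hypothetical and not the package's API:

```julia
# Hypothetical helper: split a rollout objective into the two reported parts.
function rollout_metrics(full_objective, slack_penalty; threshold = 0.05)
    operational_cost = full_objective - slack_penalty    # objective without the slack term
    violation_share = slack_penalty / full_objective     # share of the objective paid as slack
    return (; operational_cost, violation_share,
            trustworthy = violation_share <= threshold)
end

m = rollout_metrics(200.0, 4.0)
# m.operational_cost == 196.0, m.violation_share ≈ 0.02, m.trustworthy == true

bad = rollout_metrics(100.0, 30.0)
# bad.violation_share ≈ 0.3, so bad.trustworthy == false
```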
Per-sample debugging hooks can be attached with SampleLog(on_sample=(s, models, log) -> ...); the training loop calls the hook after each sample's solve with the live JuMP model(s). The previous record_loss=(iter, model, loss, tag) -> ... keyword keeps working as a deprecated adapter.
Examples live in examples/. Run tests with:
julia --project -e 'using Pkg; Pkg.test()'
If you use this package in academic work, please cite:
@article{rosemberg2024efficiently,
title={Efficiently Training Deep-Learning Parametric Policies using Lagrangian Duality},
author={Rosemberg, Andrew and Street, Alexandre and Vallad{\~a}o, Davi M and Van Hentenryck, Pascal},
journal={arXiv preprint arXiv:2405.14973},
year={2024}
}
DiffOpt (for differentiating through optimization): https://github.com/jump-dev/DiffOpt.jl
MIT. See LICENSE.
