OR-benchmark
Empirical rigor in OR

Empirical Rigor Through Benchmarking in Operations Research

Jing Dong · Daksh Mittal · Hongseok Namkoong
Companion to the paper

Empirical Findings Enabled by Structural Benchmarks

Stochastic Operations Research has a rich tradition of deriving structural insights through theoretical analysis, with parsimonious mathematical models yielding foundational results in sequential decision-making under uncertainty. Systematic empirical evaluation, however, is often limited to small-scale numerical examples that are difficult to compare across studies. This stands in stark contrast to fields like machine learning, where standardized benchmarks have catalyzed rapid progress. Stochastic OR needs an analogous benchmarking culture that can discipline empirical claims, reveal the practical relevance of algorithmic ideas, and accelerate innovation.

This paper argues for one and shows what it would surface across two canonical queueing models and five case studies. The first two studies probe the brittleness of analytical baselines once classical assumptions are perturbed; the next three turn to learning-based methods and probe the sample complexity of competing algorithms, the hyperparameter sensitivity of any one of them across an instance family, and the unexpected effects of imposing structural restrictions on the policy class.

Five case studies, two themes

The first two studies highlight the brittleness of widely used researcher-driven policies: policies validated under clean theoretical assumptions that prove sensitive to modest deviations. The remaining three turn to learning-based methods, each illustrating a distinct way a benchmark contributes to their development.

Cite this work

If this work helped your research, please cite the paper.

@article{dong2026empirical,
  title   = {Empirical Rigor Through Benchmarking in Operations Research},
  author  = {Dong, Jing and Mittal, Daksh and Namkoong, Hongseok},
  journal = {Operations Research (under review)},
  year    = {2026},
  url     = {https://arxiv.org/abs/XXXX.XXXXX},
}

Two-class single-server queue

Setting

A scheduling problem with hidden state-dependence

A single server faces two streams of arriving customers and must repeatedly decide which class to serve. Class-i customers (i ∈ {1, 2}) arrive as independent Poisson processes with rates λi and require exponential service at class-dependent rates μi. Performance is the time-averaged holding cost h1q1(t)+h2q2(t).

Under classical assumptions the optimal policy is the cμ-rule. We then perturb the setting with state-dependent slowdown: when qi reaches a threshold ki, μi drops to 0.1·μi. The slowdown is invisible to the cμ-rule, which acts only on instantaneous service rates and holding costs.

The question

Does the analytical baseline still hold once service rates depend on the state of the queues?

1
Classical
Fixed service rates μi
Queue 1 Queue 2 Server λ1 λ2 μ1 μ2
2
State-dependent slowdown
μi → βi·μi when qi ≥ ki
Queue 1 Queue 2 Server λ1 λ2 q1 < k1 ⇒ μ1 q1 ≥ k1 ⇒ β1μ1 q2 < k2 ⇒ μ2 q2 ≥ k2 ⇒ β2μ2

Pick a regime: results

Cross-instance robustness

Train PPO on each of five slowdown configurations, evaluate every trained policy on every configuration. Hover or tap a cell. Values are PPO cost divided by the cost PPO achieved when trained and evaluated on that instance. 1.0× ≥10×

k₁=8
k₂=25
k₁=8
k₂=200
k₁=k₂
=10
k₁=k₂
=20
No
slowdown
Select a cell
Hover a cell on the matrix to see the train/evaluation pair and the relative cost.
Two complementary modes of evaluation

You can ask "can the algorithm produce a competent policy for a given instance?", or you can ask "is the produced policy robust across a specified set of environments?" A benchmark that spans an instance family supports both modes and makes the choice explicit.

N-Model

Setting

A routing problem under non-stationarity

Two queues are served by two servers in an "N" topology. Server 1 is dedicated to queue 1 at rate μ11. Server 2 is flexible and the agent's control: it can serve queue 1 at rate μ21 or queue 2 at rate μ22. Arrivals follow independent Poisson processes with rates λ1, λ2. Performance is the time-averaged holding cost h1q1(t)+h2q2(t), with h1=10, h2=1.

Under classical stationary conditions a static threshold rule routing server 2 to queue 1 whenever q1 exceeds a fixed level is asymptotically optimal under heavy-traffic scaling. We then perturb the setting with a service-rate surge: μ21 drops to 40% of nominal during a window in the middle of the horizon.

The question

Does a static threshold rule, calibrated for stationarity, still produce a competent policy when capacity drops mid-horizon?

1
Classical stationary
Fixed service rates μ11, μ21, μ22
Queue 1 Queue 2 Server 1 dedicated Server 2 flexible λ1 λ2 μ11 μ22 μ21
2
Service-rate surge
μ21 → 0.4·μ21 for t ∈ [0.35T, 0.70T]
Queue 1 Queue 2 Server 1 dedicated Server 2 flexible λ1 λ2 μ11 μ22 μ21 → 0.4·μ21

Pick a regime: results

Comparing algorithmic families

Comparison

How fast can different learners reach the optimum?

We benchmark four representative learning-based algorithms on the N-model in its stationary setting (the same regime where we know the dynamic-programming optimum exactly). All four are scored on a shared environment with hyperparameters grid-tuned per method, so differences reflect the algorithms themselves rather than tuning effort.

The methods differ in what information they ask of the simulator: black-box methods (PPO, REINFORCE) see only state, action, and a scalar reward; the pathwise estimator backpropagates through a differentiable discrete-event simulator. That structural information is what makes the comparison interesting.

The question

Which family converges fastest, and how much of the advantage comes from the information interface?

PPO
Proximal Policy Optimization
Black-box · clipped objective
Info interfaceBlack-box (s, a, r)
Gradient typeScore function, clipped
Sample efficiencyMedium
Estimator varianceLow (advantage normalised)
ReferenceSchulman et al. 2017
Pathwise
Pathwise gradient
Differentiable discrete-event simulator
Info interfaceGradients through dynamics
Gradient typePathwise (reparameterised)
Sample efficiencyHigh
Estimator varianceVery low
ReferenceChe et al. 2024
REINFORCE
REINFORCE, batch baseline
Score-function · batch-mean control variate
Info interfaceBlack-box (s, a, r)
Gradient typeScore function
Sample efficiencyLow
Estimator varianceModerate (batch-mean reduces it)
ReferenceWilliams 1992
REINFORCE
REINFORCE, no baseline
Score-function · raw returns
Info interfaceBlack-box (s, a, r)
Gradient typeScore function
Sample efficiencyLow
Estimator varianceHigh
ReferenceWilliams 1992
Hover or focus a card to see the details.

Result: sample complexity

Cost rate vs. PPO updates on the N-model stationary setting. Each line is the best-tuned configuration for that algorithm; the dashed line is the DP optimum (30.785).

Training curves for four algorithms on the N-model
The result

The pathwise gradient estimator reaches the DP optimum substantially faster than the score-function methods. REINFORCE variants approach the optimum more slowly and with higher variance, but eventually get there given enough simulator budget.

Takeaways

1
Information interface matters

The pathwise estimator's advantage stems from access to simulator gradients, not network architecture. Black-box methods cannot match it on this environment.

2
A shared environment enables methodological claims

Without a common testbed and protocol, sample-complexity improvements are hard to validate. Benchmarks make those claims falsifiable.

3
One instance is not enough

The relative ordering can shift with horizon and regime; we report the stationary N-model here, but other settings may rank algorithms differently.

4
Tuning cost is itself a comparison axis

The configurations shown were each grid-searched. Methods that need less tuning are easier to adopt, a dimension separate from sample efficiency.

Hyperparameter sensitivity

Tuning

Which choices drive performance, and which only stabilise it?

Modern learning-based methods come with many tunable hyperparameters: learning rate, batch size, gradient-clipping threshold, network architecture, activation function, entropy coefficient. Different defaults are reported across the literature with little a priori basis for choosing among them.

We run two sweeps on the two-class single-server queue with slowdown. First, a small sweep (Configs A–E) on a single instance to show how much the choice of configuration alone matters. Second, a larger sweep (Configs 1–12) across five slowdown instances, normalising each cell by the best configuration on that instance.

The question

Are hyperparameter rankings consistent across instances, and which choices merely stabilise training versus drive final performance?

BASE
PPO base configuration
Starting point all sweeps perturb from
lrpolicy1e-4
lrvalue3e-4
episodes / batch50
update epochs3
clip ratio0.2
entropy coef0.05 → 0.005
LR schedulecosine
activationTanh
hidden size64
max grad norm1.0
γ0.998
GAE λ0.99
Each sweep below perturbs a small subset of these.

Sweep A–E on a single instance

Symmetric tight slowdown, k₁=k₂=10. Final cost rates differ by more than 2× across configurations.

Evaluation curves for several PPO hyperparameter configurations
Hover a config to see what changed
A
relu_clip_decay
activation + LR + clip swept
activationReLU (vs Tanh)
lrpolicy5e-4 (vs 1e-4)
lrvalueconstant (vs decayed)
LR schedulelinear (vs cosine)
clip decayon (vs off)
obs scale1.0 (vs 20.0)
B
hidden_128
wider policy network
hidden size128 (vs 64)
noteseverything else at base
C
baseline
reference run
configmatches the base
D
epochs_1
fewer PPO epochs per update
update epochs1 (vs 3)
notesunder-trains within budget
E
small_batch
5× fewer episodes per batch
episodes / batch10 (vs 50)
notesnoisy gradients, fails to converge
The result

The gaps are large enough that a single-instance experiment with an unfortunate hyperparameter choice could plausibly be misread as evidence that PPO is uncompetitive on this instance, when in fact a different configuration of the same algorithm is.

Sweep 1–12 across the instance family

A 3 × 2 × 2 factorial over lrpolicy ∈ {1e-5, 1e-4, 5e-4}, clip ratio ∈ {0.10, 0.20}, and episodes / batch ∈ {50, 200}. All twelve share ReLU activation and a linear LR schedule. Each cell is normalised by the best configuration on that instance; hover any cell to see what's in that config.

Drilldown
Hover a cell to see the config (relative to the base) and the cost ratio.
The result

The winner on k₁=8, k₂=25 is not the winner on k₁=k₂=10. A practitioner tuning on a single instance over-fits to that instance's idiosyncrasies; reporting sweeps on a single instance may overstate the robustness of any individual claim.

Some hyperparameters stabilise, others swing

Top-3 ReLU vs. Tanh configurations on two slowdown instances. ReLU produces tighter clusters across the surrounding hyperparameter sweep.

Training curves under ReLU and Tanh activations on two slowdown instances
The result

ReLU acts as a stabilising element rather than merely an architectural detail. The activation choice tightens the cluster of nearby configurations, so other hyperparameters matter less when ReLU is fixed.

Takeaways

1
Single-instance tuning over-fits

The winning configuration on one instance can be 2.5× worse on another. Practitioners who tune on one instance bake in idiosyncrasies of that instance.

2
Some choices are stabilisers

ReLU vs. Tanh in this study isn't just an architectural detail. Stabilisers reduce sensitivity to the rest of the configuration and can be fixed without further tuning.

3
Report sweeps across instances, not within

An instance family is the natural setting for hyperparameter claims. Single-instance sweeps remain useful for ablations, not for robustness.

4
Tuning budget is itself a comparison axis

How much grid-search did a result need? Algorithms that need less tuning are more useful in practice, separate from how good they look at their peak.

Restricting the policy class

Structure

Does encoding domain knowledge into the policy help?

A natural intuition in applying learning-based methods to operational problems is that incorporating structural knowledge should improve performance. One concrete way to do this is to restrict the policy class: mask actions that violate work-conservation, so the agent never sees the option of idling a server when work is available.

Reducing the search space should, in principle, accelerate learning and improve the quality of the converged policy. We test this intuition on two environments we've already used as testbeds: the symmetric tight slowdown instance of the two-class queue and the stationary N-model.

The question

Does masking non-work-conserving actions speed up learning, or does the constraint distort the gradient signal?

WITH MASK
Work-conserving PPO
Domain knowledge baked in
Policy classWork-conserving only
Idle actionMasked out (forbidden)
Search spaceSmaller, structured
IntuitionShould speed up learning
NO MASK
Unrestricted PPO
Learn the constraint from rewards
Policy classFull action space
Idle actionAvailable, must be discovered as bad
Search spaceLarger, unstructured
IntuitionShould be slower / less stable
Hover or focus a card for the variant's details.

Result: training curves on both environments

Symmetric tight slowdown (left) and stationary N-model (right). Each pair compares PPO with and without the work-conservation mask.

PPO with and without an action mask enforcing work-conservation
The result

In both environments the non-masked variant converges at least as quickly and reaches at least as low a final cost. The masked variant is the one with noticeably higher run-to-run variability, the opposite of the conventional intuition.

Takeaways

1
"More structure helps" is not universal

The effect of structural restriction interacts with the algorithm and the environment. The intuition should be tested rather than assumed.

2
Constraints can distort gradients

A plausible explanation: the work-conservation constraint distorts the gradient signal in a way PPO's clipped objective handles less gracefully than alternative algorithms.

3
A benchmark makes this falsifiable

Whether the finding extends to other policy-gradient methods is exactly the kind of follow-up question a shared environment makes addressable, and reproducible.

4
Variance is a first-class measurement

The masked variant didn't lose on average cost; it lost on stability across runs. Reporting run-to-run variance is what surfaced this.