Empirical Rigor Through Benchmarking in Operations Research

Jing Dong · Daksh Mittal · Hongseok Namkoong

Companion to the paper

Empirical Findings Enabled by Structural Benchmarks

Stochastic Operations Research has a rich tradition of deriving structural insights through theoretical analysis, with parsimonious mathematical models yielding foundational results in sequential decision-making under uncertainty. Systematic empirical evaluation, however, is often limited to small-scale numerical examples that are difficult to compare across studies. This stands in stark contrast to fields like machine learning, where standardized benchmarks have catalyzed rapid progress. Stochastic OR needs an analogous benchmarking culture that can discipline empirical claims, reveal the practical relevance of algorithmic ideas, and accelerate innovation.

This paper argues for one and shows what it would surface across two canonical queueing models and five case studies. The first two studies probe the brittleness of analytical baselines once classical assumptions are perturbed; the next three turn to learning-based methods and probe the sample complexity of competing algorithms, the hyperparameter sensitivity of any one of them across an instance family, and the unexpected effects of imposing structural restrictions on the policy class.

Five case studies, two themes

The first two studies highlight the brittleness of widely used researcher-driven policies: policies validated under clean theoretical assumptions that prove sensitive to modest deviations. The remaining three turn to learning-based methods, each illustrating a distinct way a benchmark contributes to their development.

Brittleness of researcher-driven policies

Analytical baselines collapse once classical assumptions are perturbed.

Two-class single-server queue

One server, two Poisson streams. Decide which class to serve at each completion; the classical optimum is the cμ-rule. We then perturb service rates with state-dependent slowdown, an effect invisible to cμ, and ask whether the baseline still holds.

N-Model

Two queues, two servers in an "N" topology with one dedicated + one flexible server and asymmetric holding costs; the heavy-traffic optimum is a static threshold (Harrison 1998). We then drop the flexible server's rate to 40% during a window of the horizon and ask whether the static rule still survives the non-stationarity.

Learning-based methods

Each study shows a distinct way a benchmark contributes to methodological progress.

Algorithm families

Compare four learning-based methods (PPO, a pathwise gradient estimator, two REINFORCE variants) on a shared environment, with hyperparameters tuned per method. How does sample complexity depend on the information interface to the simulator?

Hyperparameter sensitivity

Sweep PPO configurations on one slowdown instance, then across five. Are hyperparameter rankings stable across instances, and which choices drive performance versus merely stabilise training?

Restricting the policy class

Compare PPO with and without a work-conservation action mask, on both canonical models. Does encoding the constraint into the policy class actually accelerate learning, or does it distort the gradient signal?

Cite this work

If this work helped your research, please cite the paper.

@article{dong2026empirical,
  title   = {Empirical Rigor Through Benchmarking in Operations Research},
  author  = {Dong, Jing and Mittal, Daksh and Namkoong, Hongseok},
  journal = {Operations Research (under review)},
  year    = {2026},
  url     = {https://arxiv.org/abs/XXXX.XXXXX},
}

Two-class single-server queue

Setting

A scheduling problem with hidden state-dependence

A single server faces two streams of arriving customers and must repeatedly decide which class to serve. Class-i customers (i ∈ {1, 2}) arrive as independent Poisson processes with rates λ_i and require exponential service at class-dependent rates μ_i. Performance is the time-averaged holding cost h₁q₁(t)+h₂q₂(t).

Under classical assumptions the optimal policy is the cμ-rule. We then perturb the setting with state-dependent slowdown: when q_i reaches a threshold k_i, μ_i drops to 0.1·μ_i. The slowdown is invisible to the cμ-rule, which acts only on instantaneous service rates and holding costs.

The question

Does the analytical baseline still hold once service rates depend on the state of the queues?

Classical

Fixed service rates μ_i

State-dependent slowdown

μ_i → β_i·μ_i when q_i ≥ k_i

Pick a regime: results

Regime 1 · Classical assumptions hold

A sanity check the algorithm passes

Fixed service rates, linear holding costs, work-conserving server. The cμ-rule is provably optimal. A learning algorithm trained from black-box interaction alone should recover that prescription, and it does, to within statistical noise.

Cost rate, 95% half-widths over 500 trajectories
Policy	Cost rate	Notes
cμ-rule baseline	2.007 ± 0.066	analytical optimum
Random baseline	2.036 ± 0.067	work-conserving
PPO	2.007 ± 0.066	recovers cμ-rule

The result

PPO matches the cμ-rule to within statistical noise; useful before perturbing the setting.

Learned PPO policy in the classical regime — The learned policy almost always serves class 2 (since h₂·μ₂ > h₁·μ₁), reproducing the cμ-rule from interaction alone.

Regime 2 · Asymmetric slowdown k₁=8, k₂=25

When the baseline favours the wrong class

Queue 1's threshold k₁=8 is tight; queue 2's k₂=25 is rarely binding. Protecting class 1 against the 10× rate drop becomes essential, the opposite of what the cμ-rule prescribes.

Policy	Cost rate	vs PPO
cμ-rule baseline	66.540 ± 2.590	9.2×
Random baseline	70.210 ± 5.190	9.7×
PPO	7.252 ± 1.504	1.0×

The result

PPO cuts cost by ≈89%. The cμ-rule is more than 9× worse than its own performance in the unperturbed setting.

Learned PPO policy in the asymmetric regime — Class-1 service concentrates in a tight band of small-to-moderate q₁, abandoning cμ whenever queue 1 approaches its slowdown threshold.

Regime 3 · Symmetric tight slowdown k₁=k₂=10

When there is no class to favour

Both thresholds are equally tight. The agent must continuously balance slowdown risk across both classes; there is no longer a single "right" class to prioritise. The cμ-rule is now worse than uniform random scheduling.

Policy	Cost rate	vs PPO
cμ-rule baseline	95.600 ± 6.780	3.1×
Random baseline	69.400 ± 7.040	2.3×
PPO	30.727 ± 5.576	1.0×

The result

PPO cuts cost by ≈68%, and uniform random scheduling beats the cμ-rule.

Learned PPO policy in the symmetric tight regime — The boundary now runs diagonally across the state space; the agent must protect both queues simultaneously.

Cross-instance robustness

Train PPO on each of five slowdown configurations, evaluate every trained policy on every configuration. Hover or tap a cell. Values are PPO cost divided by the cost PPO achieved when trained and evaluated on that instance. 1.0× 2× 3× 5× ≥10×

k₁=8
k₂=25

k₁=8
k₂=200

k₁=k₂
=10

k₁=k₂
=20

No
slowdown

Select a cell

Hover a cell on the matrix to see the train/evaluation pair and the relative cost.

Two complementary modes of evaluation

You can ask "can the algorithm produce a competent policy for a given instance?", or you can ask "is the produced policy robust across a specified set of environments?" A benchmark that spans an instance family supports both modes and makes the choice explicit.

N-Model

Setting

A routing problem under non-stationarity

Two queues are served by two servers in an "N" topology. Server 1 is dedicated to queue 1 at rate μ₁₁. Server 2 is flexible and the agent's control: it can serve queue 1 at rate μ₂₁ or queue 2 at rate μ₂₂. Arrivals follow independent Poisson processes with rates λ₁, λ₂. Performance is the time-averaged holding cost h₁q₁(t)+h₂q₂(t), with h₁=10, h₂=1.

Under classical stationary conditions a static threshold rule routing server 2 to queue 1 whenever q₁ exceeds a fixed level is asymptotically optimal under heavy-traffic scaling. We then perturb the setting with a service-rate surge: μ₂₁ drops to 40% of nominal during a window in the middle of the horizon.

The question

Does a static threshold rule, calibrated for stationarity, still produce a competent policy when capacity drops mid-horizon?

Classical stationary

Fixed service rates μ₁₁, μ₂₁, μ₂₂

Service-rate surge

μ₂₁ → 0.4·μ₂₁ for t ∈ [0.35T, 0.70T]

Pick a regime: results

Regime 1 · Classical stationary

Recovering the DP optimum

With state space bounded at q_i≤300 we compute the dynamic-programming optimum exactly, giving a ground-truth lower bound. A learning algorithm should match this; the static heavy-traffic threshold, while asymptotically optimal, leaves a small gap.

Policy	Cost rate	Type
DP optimal	30.785	exact
Static threshold q₁>5	32.632	exact
PPO	30.987 ± 0.303	learned
Random	79.610 ± 2.864	work-cons.

The result

PPO matches the DP optimum to within statistical noise; the learned policy tracks the optimal threshold structure with a slightly smoother boundary.

DP optimal policy on the N-model (stationary) — Dynamic-programming optimal policy. Red = route server 2 to queue 1; blue = queue 2.

Regime 2 · Service-rate surge μ₂₁ ↓ 40% mid-horizon

When the threshold rule cannot adapt

During t ∈ [0.35T, 0.70T], μ₂₁ drops to 40% of nominal. The static threshold rule conditions only on instantaneous queue lengths, so it cannot react. PPO, with access to the time component of the state, pivots its routing during the surge and recovers when capacity returns.

Policy	Cost rate	vs PPO
PPO (final)	39.259 ± 0.876	1.00×
Static threshold q₁>5	43.253 ± 1.214	1.10×
Random	75.549 ± 2.125	1.92×

The result

PPO cuts cost by ≈10% over the static threshold. More instructive than the gap: the shape of the PPO policy changes over time.

PPO policy through the service-rate surge — Before the surge (t/T=0.20) the agent tracks the stationary threshold. During (t/T=0.50) the policy pivots toward queue 2 over a wider region. Once μ₂₁ is restored (t/T=0.85) the structure returns.

Comparing algorithmic families

Comparison

How fast can different learners reach the optimum?

We benchmark four representative learning-based algorithms on the N-model in its stationary setting (the same regime where we know the dynamic-programming optimum exactly). All four are scored on a shared environment with hyperparameters grid-tuned per method, so differences reflect the algorithms themselves rather than tuning effort.

The methods differ in what information they ask of the simulator: black-box methods (PPO, REINFORCE) see only state, action, and a scalar reward; the pathwise estimator backpropagates through a differentiable discrete-event simulator. That structural information is what makes the comparison interesting.

The question

Which family converges fastest, and how much of the advantage comes from the information interface?

PPO

Proximal Policy Optimization

Black-box · clipped objective

Info interfaceBlack-box (s, a, r)

Gradient typeScore function, clipped

Sample efficiencyMedium

Estimator varianceLow (advantage normalised)

ReferenceSchulman et al. 2017

Pathwise

Pathwise gradient

Differentiable discrete-event simulator

Info interfaceGradients through dynamics

Gradient typePathwise (reparameterised)

Sample efficiencyHigh

Estimator varianceVery low

ReferenceChe et al. 2024

REINFORCE

REINFORCE, batch baseline

Score-function · batch-mean control variate

Info interfaceBlack-box (s, a, r)

Gradient typeScore function

Sample efficiencyLow

Estimator varianceModerate (batch-mean reduces it)

ReferenceWilliams 1992

REINFORCE

REINFORCE, no baseline

Score-function · raw returns

Info interfaceBlack-box (s, a, r)

Gradient typeScore function

Sample efficiencyLow

Estimator varianceHigh

ReferenceWilliams 1992

Hover or focus a card to see the details.

Result: sample complexity

Cost rate vs. PPO updates on the N-model stationary setting. Each line is the best-tuned configuration for that algorithm; the dashed line is the DP optimum (30.785).

Training curves for four algorithms on the N-model

The result

The pathwise gradient estimator reaches the DP optimum substantially faster than the score-function methods. REINFORCE variants approach the optimum more slowly and with higher variance, but eventually get there given enough simulator budget.

Takeaways

Information interface matters

The pathwise estimator's advantage stems from access to simulator gradients, not network architecture. Black-box methods cannot match it on this environment.

A shared environment enables methodological claims

Without a common testbed and protocol, sample-complexity improvements are hard to validate. Benchmarks make those claims falsifiable.

One instance is not enough

The relative ordering can shift with horizon and regime; we report the stationary N-model here, but other settings may rank algorithms differently.

Tuning cost is itself a comparison axis

The configurations shown were each grid-searched. Methods that need less tuning are easier to adopt, a dimension separate from sample efficiency.

Hyperparameter sensitivity

Tuning

Which choices drive performance, and which only stabilise it?

Modern learning-based methods come with many tunable hyperparameters: learning rate, batch size, gradient-clipping threshold, network architecture, activation function, entropy coefficient. Different defaults are reported across the literature with little a priori basis for choosing among them.

We run two sweeps on the two-class single-server queue with slowdown. First, a small sweep (Configs A–E) on a single instance to show how much the choice of configuration alone matters. Second, a larger sweep (Configs 1–12) across five slowdown instances, normalising each cell by the best configuration on that instance.

The question

Are hyperparameter rankings consistent across instances, and which choices merely stabilise training versus drive final performance?

BASE

PPO base configuration

Starting point all sweeps perturb from

lr_policy1e-4

lr_value3e-4

episodes / batch50

update epochs3

clip ratio0.2

entropy coef0.05 → 0.005

LR schedulecosine

activationTanh

hidden size64

max grad norm1.0

γ0.998

GAE λ0.99

Each sweep below perturbs a small subset of these.

Sweep A–E on a single instance

Symmetric tight slowdown, k₁=k₂=10. Final cost rates differ by more than 2× across configurations.

Evaluation curves for several PPO hyperparameter configurations

Hover a config to see what changed

relu_clip_decay

activation + LR + clip swept

activationReLU (vs Tanh)

lr_policy5e-4 (vs 1e-4)

lr_valueconstant (vs decayed)

LR schedulelinear (vs cosine)

clip decayon (vs off)

obs scale1.0 (vs 20.0)

hidden_128

wider policy network

hidden size128 (vs 64)

noteseverything else at base

baseline

reference run

configmatches the base

epochs_1

fewer PPO epochs per update

update epochs1 (vs 3)

notesunder-trains within budget

small_batch

5× fewer episodes per batch

episodes / batch10 (vs 50)

notesnoisy gradients, fails to converge

The result

The gaps are large enough that a single-instance experiment with an unfortunate hyperparameter choice could plausibly be misread as evidence that PPO is uncompetitive on this instance, when in fact a different configuration of the same algorithm is.

Sweep 1–12 across the instance family

A 3 × 2 × 2 factorial over lr_policy ∈ {1e-5, 1e-4, 5e-4}, clip ratio ∈ {0.10, 0.20}, and episodes / batch ∈ {50, 200}. All twelve share ReLU activation and a linear LR schedule. Each cell is normalised by the best configuration on that instance; hover any cell to see what's in that config.

Drilldown

Hover a cell to see the config (relative to the base) and the cost ratio.

The result

The winner on k₁=8, k₂=25 is not the winner on k₁=k₂=10. A practitioner tuning on a single instance over-fits to that instance's idiosyncrasies; reporting sweeps on a single instance may overstate the robustness of any individual claim.

Some hyperparameters stabilise, others swing

Top-3 ReLU vs. Tanh configurations on two slowdown instances. ReLU produces tighter clusters across the surrounding hyperparameter sweep.

Training curves under ReLU and Tanh activations on two slowdown instances

The result

ReLU acts as a stabilising element rather than merely an architectural detail. The activation choice tightens the cluster of nearby configurations, so other hyperparameters matter less when ReLU is fixed.

Takeaways

Single-instance tuning over-fits

The winning configuration on one instance can be 2.5× worse on another. Practitioners who tune on one instance bake in idiosyncrasies of that instance.

Some choices are stabilisers

ReLU vs. Tanh in this study isn't just an architectural detail. Stabilisers reduce sensitivity to the rest of the configuration and can be fixed without further tuning.

Report sweeps across instances, not within

An instance family is the natural setting for hyperparameter claims. Single-instance sweeps remain useful for ablations, not for robustness.

Tuning budget is itself a comparison axis

How much grid-search did a result need? Algorithms that need less tuning are more useful in practice, separate from how good they look at their peak.

Restricting the policy class

Structure

Does encoding domain knowledge into the policy help?

A natural intuition in applying learning-based methods to operational problems is that incorporating structural knowledge should improve performance. One concrete way to do this is to restrict the policy class: mask actions that violate work-conservation, so the agent never sees the option of idling a server when work is available.

Reducing the search space should, in principle, accelerate learning and improve the quality of the converged policy. We test this intuition on two environments we've already used as testbeds: the symmetric tight slowdown instance of the two-class queue and the stationary N-model.

The question

Does masking non-work-conserving actions speed up learning, or does the constraint distort the gradient signal?

WITH MASK

Work-conserving PPO

Domain knowledge baked in

Policy classWork-conserving only

Idle actionMasked out (forbidden)

Search spaceSmaller, structured

IntuitionShould speed up learning

NO MASK

Unrestricted PPO

Learn the constraint from rewards

Policy classFull action space

Idle actionAvailable, must be discovered as bad

Search spaceLarger, unstructured

IntuitionShould be slower / less stable

Hover or focus a card for the variant's details.

Result: training curves on both environments

Symmetric tight slowdown (left) and stationary N-model (right). Each pair compares PPO with and without the work-conservation mask.

PPO with and without an action mask enforcing work-conservation

The result

In both environments the non-masked variant converges at least as quickly and reaches at least as low a final cost. The masked variant is the one with noticeably higher run-to-run variability, the opposite of the conventional intuition.

Takeaways

"More structure helps" is not universal

The effect of structural restriction interacts with the algorithm and the environment. The intuition should be tested rather than assumed.

Constraints can distort gradients

A plausible explanation: the work-conservation constraint distorts the gradient signal in a way PPO's clipped objective handles less gracefully than alternative algorithms.

A benchmark makes this falsifiable

Whether the finding extends to other policy-gradient methods is exactly the kind of follow-up question a shared environment makes addressable, and reproducible.

Variance is a first-class measurement

The masked variant didn't lose on average cost; it lost on stability across runs. Reporting run-to-run variance is what surfaced this.