LatentGym

A Testbed for Cross-Task Experiential Learning

Daksh Mittal*· Tommaso Castellani*· Thomson Yen*· Naimeng Ye· Fangyu Wu· Minghui Chen· Tiffany Cai· Emmanouil Koukoumidis· William Zeng· Hongseok Namkoong

arXiv GitHub

LatentGym makes cross-task learning measurable by giving the evaluator control over the hidden structure shared across tasks. With a known latent, we can diagnose where agents succeed or fail, design metrics that separate exploration from exploitation, and build training recipes that target each.

Setting: Each task is to identify a hidden number in [1, 1000], but the hidden number is always either 137 or 793. Latent = {137, 793}

Prompt: Solve a sequence of number games

Task 1 (Target: 137)

Task: Identify the number in range [1, 1000]

Agent: I will do binary search, 500

It is less than 500

Agent: 250

Continues for 7 more turns

Feedback: You guessed 137 correctly in 9 Turns

Task 2 (Target: 793)

Task: Identify the number in range [1, 1000]

Agent: I will do binary search, 500

It is higher than 500

Agent: 750

Continues for 6 more turns

Feedback: You guessed 793 correctly in 8 Turns

Task 10 (Target: 137)

Task: Identify the number in range [1, 1000]

Agent: I am seeing 137 and 793 repeatedly, let me first guess one of them. 137

Correct!

Feedback: You guessed 137 correctly in 1 Turn

Learning Across Tasks (turns taken decrease)

Our Contributions

Environment Design: Task Dynamics · Latent Structure · Prompt Behaviour · Feedback Type · Horizon

Evaluation and Training: New Metrics (Exploration & Exploitation Efficiency) · RL post-training on task sequence

The agent plays a sequence of N=10 tasks; each asks for a hidden integer in [1, 1000], but targets are secretly drawn from z = {137, 793}. Early on (Task 1) the agent does binary search and takes 9 turns. By Task 10, having seen 137 and 793 recur, it solves the task in 1 turn. Each environment in LatentGym factorizes along five axes (task dynamics, latent structure, prompt behavior, feedback type, horizon); we evaluate via new exploration/exploitation metrics and train with Cross-Task RL.
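As a rough illustration of the walkthrough above, the sketch below counts turns for a plain binary searcher and for a guesser that first tries targets it has already seen. The function names and the midpoint rule are assumptions made for this example, not LatentGym code, so the exact turn counts need not match the recorded trajectory.

```python
def binary_search_turns(target: int, lo: int = 1, hi: int = 1000) -> int:
    """Count guesses a midpoint binary search needs to hit `target` in [lo, hi]."""
    turns = 0
    while lo <= hi:
        guess = (lo + hi) // 2
        turns += 1
        if guess == target:
            return turns
        if guess < target:
            lo = guess + 1
        else:
            hi = guess - 1
    raise ValueError("target outside the search range")


def latent_first_turns(target: int, seen: list[int], lo: int = 1, hi: int = 1000) -> int:
    """Try previously seen targets first, then fall back to binary search.

    The fallback ignores what the failed guesses revealed, which keeps the
    sketch short at the cost of a few extra turns.
    """
    for turns, guess in enumerate(seen, start=1):
        if guess == target:
            return turns
    return len(seen) + binary_search_turns(target, lo, hi)


seen_targets = [137, 793]  # targets observed in earlier tasks (hypothetical history)
print(binary_search_turns(137), latent_first_turns(137, seen_targets))
```

The gap between the two printed numbers is the kind of cross-task improvement the testbed is built to measure.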

What you can do with LatentGym

Define environments, measure how agents adapt, and train them to adapt better, each piece modular and reusable.

Environment Design

LatentGym Environment

Compose any environment from five swappable modules: core game, latent, prompt, feedback, horizon. Seven environments ship ready to use.

Evaluation

Explorer (gathers experience on early tasks) → Exploiter (acts on it on later tasks)

Separate exploration efficiency (gathering information about the latent) from exploitation efficiency (acting on it), via a mid-sequence agent hand-off.

Training

Post-train on full task sequences with Cross-Task RL. SkyRL-ready: add an advantage estimator; PPO, GRPO, and SFT out of the box.

Framework

Walk through the framework

Define environments, evaluate how agents adapt, and train them to adapt better, all from the same composable pieces.

Environment Design

Every environment is the product of five swappable components. Change one axis without touching the others.

Five design axes

Each component registers itself; the registry resolves any choice into a single runnable environment.

Core environment

The within-task game: states, actions, reward in isolation.

Latent

The ground truth shared across all tasks in a sequence: the object of adaptation.

Prompt

How much of the latent the agent is told up front; sets the prior.

Feedback

What the agent observes between tasks; controls how fast evidence accumulates.

Horizon N

How many tasks the sequence runs, setting how much support the agent has for adaptation.

FullyDefinedEnv = core-env × latent × prompt × feedback × N

Features

Each axis changes independently

Any combination resolves to one runnable env, named by its choices (e.g. number-guessing / set-of-3 / no-info / standard / ep10).
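To make the five-way product concrete, here is a minimal sketch of how one choice per axis could resolve to a single name. `EnvSpec` and its fields are hypothetical and only mirror the naming pattern in the example above; they are not the LatentGym registry API.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EnvSpec:
    """Hypothetical record of one choice per design axis (not the LatentGym API)."""

    core_env: str
    latent: str
    prompt: str
    feedback: str
    horizon: int

    @property
    def name(self) -> str:
        parts = [self.core_env, self.latent, self.prompt, self.feedback, f"ep{self.horizon}"]
        return " / ".join(parts)


# Vary the prompt and feedback axes while holding the other three fixed.
specs = [
    EnvSpec("number-guessing", "set-of-3", prompt, feedback, 10)
    for prompt, feedback in product(
        ["no-info", "some-info", "full-info"], ["standard", "information"]
    )
]
for spec in specs:
    print(spec.name)
```

Because each axis is a separate field, changing one leaves the others untouched, which is the property the section describes.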

Built for evaluation and training

One environment object serves both evaluation and training. A thin adapter exposes each LatentGym env to SkyRL, which handles rollouts, weight sync, and policy optimization; built-in advantage estimators plug into the trainer.

Adding a new environment

Write the single-task dynamics and register the parts (latents, prompts, feedbacks register on import). Sequencing, evaluation, and RL are inherited.

```python
# latentgym/envs/number_guessing/__init__.py
from latentgym.core.registry import register_env
from .core_env import NumberGuessingSingleEpisodeEnv

register_env(
    name="number_guessing",
    env_class=NumberGuessingSingleEpisodeEnv,
    default_num_episodes=7,
    min_range=1,
    max_range=1000,
    max_turns_per_episode=30,
)

from . import latents, prompts, feedbacks  # each registers itself on import
```

Full guide on GitHub ↗

Mix-and-match across a sequence

A single sequence can even draw on a different core environment for each task while sharing one latent.

Difficulty axes

Example settings shown for Number Guessing.

Difficulty axis	Set by	Example settings
Within-task	Visible range	[1, 100], [1, 1000], [1, 10000]
Latent-identification	Size of hidden set |z|	|z| = 2, |z| = 5, |z| = 10
Cross-task	Joint of the two	|z| = 2 in [1, 1000] vs |z| = 10 in [1, 100]
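One way to read the cross-task axis is as the gap between how hard a task is without the latent and how hard it is once the latent is known. The sketch below works that out for the two example settings, using worst-case binary-search turns as a stand-in for difficulty. That proxy and the helper name are assumptions for illustration, not a metric defined by LatentGym.

```python
def worst_case_turns(num_candidates: int) -> int:
    """Worst-case guesses for binary search with higher/lower/correct feedback."""
    if num_candidates < 1:
        raise ValueError("need at least one candidate")
    return num_candidates.bit_length()  # equals floor(log2(n)) + 1


# (visible range size, hidden set size |z|) for the two example settings
settings = {
    "|z|=2 in [1,1000]": (1000, 2),
    "|z|=10 in [1,100]": (100, 10),
}
for label, (range_size, latent_size) in settings.items():
    without_latent = worst_case_turns(range_size)   # searching the whole range
    with_latent = worst_case_turns(latent_size)     # searching only the hidden set
    print(label, without_latent, with_latent, without_latent - with_latent)
```

Under this proxy the first setting leaves far more room for cross-task improvement than the second, even though its latent is easier to identify.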

Prompt and feedback conditions

How much support the agent gets for inferring and exploiting the latent.

Prompt	What's revealed	What it tests
`no_info`	Nothing about cross-task structure.	Does the agent discover structure on its own?
`some_info`	A vague hint that recurring structure may exist.	Does it identify the regularity efficiently?
`full_info`	An explicit description of the latent.	Can it use latent information when given?

Feedback	What's revealed between tasks	What it tests
`standard`	Only a binary success/failure signal after each task.	Can the agent learn from sparse, realistic feedback?
`information`	Ground-truth outcome revealed regardless of success.	Does it improve once evidence accumulates faster?
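
The difference between the two feedback conditions can be sketched as a message builder. The function name and the message wording are hypothetical; only the rule that `standard` hides the ground truth and `information` reveals it comes from the table above.

```python
def render_feedback(kind: str, success: bool, target: int) -> str:
    """Build a between-task message; the wording is illustrative only."""
    outcome = "succeeded" if success else "failed"
    if kind == "standard":
        # Binary signal only: the target stays hidden.
        return f"You {outcome} on this task."
    if kind == "information":
        # Ground truth is revealed whether or not the agent succeeded.
        return f"You {outcome} on this task. The hidden number was {target}."
    raise ValueError(f"unknown feedback kind: {kind}")


print(render_feedback("standard", False, 137))
print(render_feedback("information", False, 137))
```

Under `standard` feedback a failed task teaches the agent nothing about the target, which is why evidence about the latent accumulates more slowly.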

Seven environments

Each instantiates the five axes with its own latent families.

Evaluation

Beyond whether an agent improves, LatentGym measures how, separating exploration (gathering information about the latent) from exploitation (acting on what it gathered).

What standard summaries tell us

Three views on the reward curve r₁, …, r_N. Each says whether the agent got better, not how.

Cumulative reward R = Σ_i r_i

Uniform competence across the whole sequence. The reward we optimize during Cross-Task RL.

Cross-task gain r_N − r₁

Directly isolates whether the agent gets better with experience.

Final-task reward r_N

Final capability after the agent has had the whole sequence to adapt.
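The three summaries are simple functions of the per-task reward list. A minimal sketch follows; the reward values are made up for illustration and `summarize` is a hypothetical helper.

```python
def summarize(rewards: list[float]) -> dict[str, float]:
    """Standard summaries of a reward curve r_1, ..., r_N."""
    if not rewards:
        raise ValueError("need at least one task reward")
    return {
        "cumulative": sum(rewards),        # R = sum_i r_i
        "gain": rewards[-1] - rewards[0],  # r_N - r_1
        "final": rewards[-1],              # r_N
    }


rewards = [0.25, 0.5, 0.5, 1.0]  # made-up per-task rewards
print(summarize(rewards))
```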

Separating exploration and exploitation

Tail reward mixes two skills: probing the latent (exploration) and acting on what's learned (exploitation). A two-agent hand-off separates them: agent A plays tasks 1…K and builds the history; agent B inherits it and plays K+1…N. Swapping who plays which half attributes any change to the explorer or the exploiter alone.

A sequence of 10 tasks · Hand-off K = 4 · Agents A, B, C

Exploration efficiency

A explores, C exploits: T₁ = r₅ + … + r₁₀
B explores, C exploits: T₂ = r₅ + … + r₁₀

ExplorationEff(B vs A) = (T₂ − T₁) / T₁

Exploitation efficiency

C explores, A exploits: T₁′ = r₅ + … + r₁₀
C explores, B exploits: T₂′ = r₅ + … + r₁₀

ExploitationEff(B vs A) = (T₂′ − T₁′) / T₁′
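
The two efficiency ratios can be computed directly from the tail rewards of four hand-off runs. In the sketch below the per-task rewards are made up and the helper names are hypothetical; only the formulas follow the definitions above.

```python
def tail_reward(rewards: list[float], k: int) -> float:
    """Sum of rewards on tasks K+1..N (rewards[0] is task 1)."""
    return sum(rewards[k:])


def relative_gain(t_new: float, t_ref: float) -> float:
    """(t_new - t_ref) / t_ref, the form shared by both efficiency metrics."""
    if t_ref == 0:
        raise ValueError("reference tail reward is zero; relative gain is undefined")
    return (t_new - t_ref) / t_ref


K = 4
# Made-up rewards for the four runs, keyed by (explorer, exploiter).
runs = {
    ("A", "C"): [0.25] * 4 + [0.5] * 6,
    ("B", "C"): [0.25] * 4 + [0.75] * 6,
    ("C", "A"): [0.25] * 4 + [0.5] * 6,
    ("C", "B"): [0.25] * 4 + [1.0] * 6,
}
tails = {pair: tail_reward(r, K) for pair, r in runs.items()}

exploration_eff = relative_gain(tails[("B", "C")], tails[("A", "C")])   # (T2 - T1) / T1
exploitation_eff = relative_gain(tails[("C", "B")], tails[("C", "A")])  # (T2' - T1') / T1'
print(exploration_eff, exploitation_eff)
```

Holding C fixed in one role means any difference in the tail reward can only come from the agent that was swapped.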

Run an evaluation

One CLI run scores a model over a task sequence and writes per-task rewards and metrics.

```shell
python -m latentgym.cli.run_eval single \
    --models openrouter/openai:gpt-4o \
    --env number_guessing --latent set_of_3 \
    --prompt no_info --feedback standard \
    --num-episodes 10 --n-trajectories 50 \
    --output latentgym/results/ng_gpt4o/
```

Full guide on GitHub ↗

Training

Cross-Task RL

The same environments serve training: fine-tune on full sequences of N tasks so the policy is rewarded for inferring the latent early and exploiting it later.

Plug into SkyRL for the rollouts and policy optimization (PPO, GRPO, SFT). Per-task rewards are exposed across the sequence, so advantages span the whole run, not a single task.

```shell
# 1. Generate the training task sequences (parquets + trajectory JSONs)
python -m latentgym.cli.generate_data train \
    --env number_guessing --latent set_of_3 \
    --n-trajectories 500 --num-episodes 10 \
    --output latentgym/data/train/

# 2. Cross-Task RL with GRPO (SkyRL under the hood)
bash latentgym/training/train_minimal.sh   # single GPU
bash latentgym/training/train_fsdp.sh      # multi-GPU / FSDP
```
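
To illustrate what it means for advantages to span the whole run, the sketch below applies GRPO-style group normalization to the cumulative reward R = Σ_i r_i of each rollout in a group. This is an assumption-laden illustration, not SkyRL's or LatentGym's estimator: the function name, the use of population standard deviation, and the example rewards are all hypothetical.

```python
from statistics import mean, pstdev


def sequence_level_advantages(
    group_rewards: list[list[float]], eps: float = 1e-8
) -> list[float]:
    """One advantage per rollout, from its cumulative reward over all N tasks.

    `group_rewards[j]` holds the per-task rewards r_1..r_N of rollout j; all
    rollouts in the group are assumed to share the same task sequence.
    """
    totals = [sum(per_task) for per_task in group_rewards]
    mu, sigma = mean(totals), pstdev(totals)
    # eps guards against division by zero when every rollout scores the same.
    return [(total - mu) / (sigma + eps) for total in totals]


group = [
    [0.0, 0.0, 1.0, 1.0],  # adapts halfway through
    [0.0, 0.0, 0.0, 0.0],  # never adapts
    [0.0, 1.0, 1.0, 1.0],  # adapts early
]
print(sequence_level_advantages(group))
```

Because the signal is the sum over the sequence, a rollout that spends an early task gathering evidence is still credited when later tasks pay off.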

Full guide on GitHub ↗

See what training this recipe achieves in the Blog.

Interactive

Trajectory explorer

Three modes for browsing recorded trajectories. Frontier models: pick a frontier agent and a task and watch it play. Failure modes: jump directly to trajectories that illustrate adaptation neglect, breakdown, or miscalibration. RL variants: place any two of the Qwen3-8B variants (Base / Single-task RL / Cross-task RL) side by side.

Controls: Model · Environment · Latent · Prompt · Feedback · Horizon · Seed

Writeup

Findings & background

What LatentGym reveals: how frontier models fail to adapt, why Cross-Task RL fixes it, and how training conditions shape transfer.

A Testbed for Cross-Task Experiential Learning

LLM agents are increasingly deployed in settings where they handle sequences of related tasks: personalization, customer support, ongoing research assistants. The hidden structure shared across these tasks could be reused to improve later performance, yet whether current LLMs can actually leverage earlier experience to do better on later tasks is poorly understood. We introduce LatentGym, a testbed where every environment is built around an explicit, ground-truth latent that defines the structure shared across a sequence of tasks, so we can pinpoint where an agent succeeds or fails, and study the findings below.

Three findings at a glance

The headline result from each finding below.

Finding 1 · Frontier models fail to adapt

Claude Sonnet 4.6, GPT-4o, and Gemini 2.5 Flash exhibit three failure modes (adaptation neglect, breakdown, miscalibration) even under simple latents.

Finding 2 · Cross-Task RL induces adaptation

GRPO over full sequences improves cumulative reward by +99% over the base model and +39% over single-task RL, generalizing to held-out latents (+55%) and environments (+14%).

Finding 3 · Sparser feedback transfers better

Counterintuitively, training under standard (binary) feedback transfers more robustly across deployment conditions than richer feedback, driven by robustness to the eval-feedback shift rather than faster learning.

Finding 1

Failure Modes of Frontier Models

A controlled testbed is only useful if existing models struggle on it. We evaluate three representative frontier models (GPT-4o, Claude Sonnet 4.6, and Gemini 2.5 Flash) chosen to span providers and capability profiles. Even under simple latents and across prompt conditions, these models fail at cross-task adaptation; inspecting individual trajectories surfaces three recurring failure modes.

Adaptation neglect (models restart from scratch on every task)

When models are not explicitly told that a hidden pattern exists, they solve each task largely independently, without attempting to infer reusable structure from previous experience. This appears across environments and even under very simple latents. The behavior may look reasonable in isolation (an instructed agent should not impose hidden assumptions), but real deployments contain recurring regularities that are not specified, and an agent that only acts on explicitly stated structure cannot improve with experience.

Example trajectories

Claude Sonnet 4.6 · Number Guessing (set_of_3)
GPT-4o · Number Guessing (set_of_3)
Gemini 2.5 Flash · Secretary (threshold_06)

Average reward across tasks under no_info.

Flat reward curves under no_info across Number Guessing, Secretary, and Bandits: no improvement with experience.

Adaptation breakdown (model sees the pattern but fails to act on it)

When the prompt hints that some pattern may exist, models show partial but ineffective adaptation. Sometimes they ignore the hint and act myopically; sometimes they search for a pattern but identify the wrong one; sometimes they correctly infer part of the structure but never act on it; sometimes they overfit a pattern and drift away from the original task. Recognizing latent structure and exploiting it turn out to be separable capabilities: models can partially succeed at one while failing at the other.

Example trajectories

Gemini 2.5 Flash · Number Guessing (dynamic_range)
Claude Sonnet 4.6 · Number Guessing (two_ranges)
GPT-4o · Mastermind (consecutive)

Gemini 2.5 Flash · Number Guessing (dynamic_range, some_info)

Task 10, Agent: I've reviewed the previous nine games.
The numbers, ranging from 1461 to 2278, are well below
the upper bound of 10365, although I'll continue using
binary search… [5365]   …12 more turns

The agent names the pattern, then ignores it.

Adaptation miscalibration (more information can make performance worse)

Even when the relevant pattern is described explicitly, performance can degrade relative to a vaguer prompt. The explicit description introduces a competing sub-goal: instead of treating the latent as a calibrated prior, the model optimizes for demonstrating the stated pattern, over-applying the rule or choosing actions that satisfy its interpretation while violating the game's operational constraints. Examples include proposing off-graph words in Word Ladder, choosing the wrong sub-range in Number Guessing, committing to a single arm in alternating Bandits, and accepting too early in Secretary. In-context adaptation requires not just the right information, but the right calibration on it.

Example trajectories

Claude Sonnet 4.6 · Bandits (ping_pong)
GPT-4o · Word Ladder (hub_word_3letter)
Gemini 2.5 Flash · Number Guessing (set_of_2)

Examples where full information harms performance.

Giving the latent explicitly (full_info) can hurt relative to a vaguer prompt, across Number Guessing, Word Ladder, and Bandits.

Finding 2

Cross-Task RL

Can RL fine-tuning instill general cross-task experiential learning? We compare two recipes on Qwen3-8B: single-task RL, whose reward depends only on a single task's reward r; and cross-task RL, whose reward depends on the full sequence r₁, …, r_N (we use R = Σ_i r_i), training the policy to use early tasks to infer the latent and act better on later ones.

Three questions we address

Is cross-task RL necessary, or does single-task RL on the same task family suffice?

Do the learned strategies generalize beyond the training distribution?

Which capability drives the lift, better exploration or better exploitation?

Q1 · Cross-task RL is necessary

Cross-task RL leads on every environment, beating the base by an average of +99% and single-task RL by +39% on cumulative reward. It is the only variant with consistently positive Gain across environments; single-task RL is negative on 5 of 6 environments, indistinguishable from the base. The largest single-environment lifts are on Wordle (+91% vs single-task) and Number Guessing (+72%).

Training across multiple environments helps further

A single cross-task model trained on a mixture of environments beats the per-environment cross-task models by an average of +19% on cumulative reward (largest gain on Secretary, +33%). This suggests cross-task RL learns a general adaptation strategy rather than environment-specific cues.

See it for yourself: how training changes behavior

Compare the same task solved by different Qwen3-8B variants. Cross-task RL needs fewer turns and uses the cross-task history; single-task RL improves the per-task ability but does not adapt across the sequence.

▶ Base vs Cross-task RL ▶ Single-task RL vs Cross-task RL

Q2 · Strategies generalize beyond the training distribution

A policy with a transferable adaptation strategy should generalize to held-out latents within trained environments (OOD-1) and to entirely held-out environments (OOD-2). Cross-task RL beats the base by an average of +55% on OOD-1 and +14% on OOD-2, and lifts r_N in almost every OOD case despite not being directly trained on it.

Latent shift (OOD-1)

Environment shift (OOD-2, leave-one-out)

Q3 · Where does the lift come from?

To attribute Cross-Task RL's gain cleanly to either exploration or exploitation, we use the two-agent hand-off experiment defined on the diagnostics page. The numbers reported there isolate exploration efficiency and exploitation efficiency for CT vs ST on Number Guessing.

▶ See exploration vs exploitation efficiency →

Finding 3

How training conditions shape generalization

To show the kind of question LatentGym makes answerable, we ask how the in-context feedback and prompt used at training time affect generalization. We train Qwen3-8B under all combinations of {full-info, some-info} training prompt × {standard, information} training feedback, holding the cumulative-reward objective fixed, and evaluate each across all four prompt × feedback eval combinations.
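
The sweep described above is a 4 × 4 grid: four training conditions, each evaluated under four eval conditions. A small sketch of enumerating it follows; the variable names are hypothetical, and the condition labels follow the identifiers used in the prompt and feedback tables.

```python
from itertools import product

prompts = ["full_info", "some_info"]
feedbacks = ["standard", "information"]

train_configs = list(product(prompts, feedbacks))  # 4 training conditions
eval_configs = list(product(prompts, feedbacks))   # 4 eval conditions

# Every trained model is scored under every eval condition.
grid = [(train, evaluation) for train in train_configs for evaluation in eval_configs]
for (train_prompt, train_fb), (eval_prompt, eval_fb) in grid:
    print(f"train={train_prompt}/{train_fb} -> eval={eval_prompt}/{eval_fb}")
```

The off-diagonal cells, where training and eval conditions differ, are the ones that measure transfer.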

Two findings

Training with `standard` feedback outperforms training with full (`information`) feedback under both prompt regimes, despite providing strictly less information between tasks. The advantage comes from robustness to the eval-feedback shift: models trained with full feedback drop by 0.509 points when the eval feedback switches to standard, whereas models trained with standard feedback hold up.
Prompt richness produces no substantive difference in aggregate, but with an asymmetry. Full-info-trained models score +0.36 higher on full-info eval than on some-info eval, whereas some-info-trained models are essentially indifferent to the eval prompt. Some-info training thus yields a more uniformly transferable model, at the cost of the in-distribution advantage full-info training enjoys on full-info eval.

Effect of training feedback

Two views of the same sweep. Left: per training feedback, lines = eval feedbacks. Right: per eval feedback, lines = training feedbacks. Standard-feedback training transfers more robustly.

[Figure: two panels. Left: eval feedbacks compared within each training feedback. Right: training feedbacks compared within each eval feedback.]

Effect of training prompt

Same two-view layout. Left: per training prompt, lines = eval prompts. Right: per eval prompt, lines = training prompts. Full-info-trained models lead on full-info eval; some-info-trained models are nearly flat across eval prompts.

[Figure: two panels. Left: eval prompts compared within each training prompt. Right: training prompts compared within each eval prompt.]

An Example: Cross-Task RL vs Single-Task RL (Number Guessing)

Each cell is the relative tail-reward advantage of the Cross-task RL agent (CT) over the Single-task RL agent (ST) at switch point K. ExploreGain_C holds the exploiter fixed at the reference agent C and swaps the explorer. ExploitGain_C holds the explorer fixed at C and swaps the exploiter. We report both choices C ∈ {CT, ST} so you can see the gap is robust to the held-fixed role. All values positive: CT is both a better explorer and a better exploiter than ST.

Switch point K	Reference C	ExploreGain_C	ExploitGain_C
2	C = CT	+0.7%	+25.3%
2	C = ST	+4.4%	+30.1%
4	C = CT	+2.5%	+16.4%
4	C = ST	+14.2%	+29.5%
6	C = CT	+2.9%	+13.0%
6	C = ST	+16.9%	+28.5%
8	C = CT	+11.6%	+16.3%
8	C = ST	+17.4%	+22.7%

Per-env agent-switching on Number Guessing: Single-task RL vs Cross-task RL across switch points.
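
The two claims drawn from the table can be checked mechanically. The sketch below copies the table's values into a list and tests that every gain is positive, and also that the exploitation gain exceeds the exploration gain in every row. The row layout is just a convenience for this example.

```python
# (switch point K, reference agent C, ExploreGain_C in %, ExploitGain_C in %)
rows = [
    (2, "CT", 0.7, 25.3),
    (2, "ST", 4.4, 30.1),
    (4, "CT", 2.5, 16.4),
    (4, "ST", 14.2, 29.5),
    (6, "CT", 2.9, 13.0),
    (6, "ST", 16.9, 28.5),
    (8, "CT", 11.6, 16.3),
    (8, "ST", 17.4, 22.7),
]

all_positive = all(explore > 0 and exploit > 0 for _, _, explore, exploit in rows)
exploit_larger = all(exploit > explore for _, _, explore, exploit in rows)
print(all_positive, exploit_larger)
```

Both checks pass on the reported numbers, so at every switch point CT's advantage over ST is larger on the exploitation side than on the exploration side.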
