LatentGym

A Testbed for Cross-Task Experiential Learning

Daksh Mittal*· Tommaso Castellani*· Thomson Yen*· Naimeng Ye· Fangyu Wu· Minghui Chen· Tiffany Cai· Emmanouil Koukoumidis· William Zeng· Hongseok Namkoong

LatentGym makes cross-task learning measurable by giving the evaluator control over the hidden structure shared across tasks. With a known latent, we can diagnose where agents succeed, design metrics that separate exploration from exploitation, and train recipes that target each.

Setting: Each task is to identify a hidden number in [1, 1000], but the hidden number is always either 137 or 793. Latent = {137, 793}.
Prompt: Solve a sequence of number games
Task 1 (Target: 137)
Task: Identify the number in range [1, 1000]
Agent: I will do binary search, 500
It is less than 500
Agent: 250
Continues for 7 more turns
Feedback: You guessed 137 correctly in 9 Turns
Task 2 (Target: 793)
Task: Identify the number in range [1, 1000]
Agent: I will do binary search, 500
It is higher than 500
Agent: 750
Continues for 6 more turns
Feedback: You guessed 793 correctly in 8 Turns
Task 10 (Target: 137)
Task: Identify the number in range [1, 1000]
Agent: I am seeing 137 and 793 repeatedly, let me first guess one of them. 137
Correct!
Feedback: You guessed 137 correctly in 1 Turn
Learning across tasks: the number of turns taken falls over the sequence.
Our Contributions
Environment Design: task dynamics, latent structure, prompt behaviour, feedback type, horizon.
Evaluation and Training: new metrics (exploration and exploitation efficiency) and RL post-training on task sequences.
The agent plays a sequence of N=10 tasks; each asks for a hidden integer in [1, 1000], but targets are secretly drawn from z = {137, 793}. Early on (Task 1) the agent does binary search and takes 9 turns. By Task 10, having seen 137 and 793 recur, it solves the task in 1 turn. Each environment in LatentGym factorizes along five axes (task dynamics, latent structure, prompt behavior, feedback type, horizon); we evaluate via new exploration/exploitation metrics and train with Cross-Task RL.
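The effect described above can be reproduced with a toy simulation. This is a minimal sketch, not LatentGym code: `play_task` and `play_sequence` are hypothetical helpers, and the scripted agent (try previously seen targets first, then binary search) is a stand-in for an LLM, so its turn counts will not match the recorded trajectory exactly.

```python
import random


def play_task(target, lo=1, hi=1000, seen=()):
    """Return the number of turns a scripted agent needs to find `target`.

    The agent first tries targets remembered from earlier tasks (in the
    order they were first seen), then falls back to binary search.
    """
    turns = 0
    # Exploit: try candidates remembered from earlier tasks.
    for guess in seen:
        if not lo <= guess <= hi:
            continue
        turns += 1
        if guess == target:
            return turns
        if guess < target:
            lo = guess + 1
        else:
            hi = guess - 1
    # Explore: plain binary search on what is left of the interval.
    while lo <= hi:
        guess = (lo + hi) // 2
        turns += 1
        if guess == target:
            return turns
        if guess < target:
            lo = guess + 1
        else:
            hi = guess - 1
    raise ValueError("target is outside the search range")


def play_sequence(targets, use_history):
    """Play tasks in order; optionally carry seen targets across tasks."""
    seen = []
    turns_per_task = []
    for target in targets:
        history = tuple(seen) if use_history else ()
        turns_per_task.append(play_task(target, seen=history))
        if target not in seen:
            seen.append(target)
    return turns_per_task


rng = random.Random(0)
targets = [rng.choice([137, 793]) for _ in range(10)]
print("no history:  ", play_sequence(targets, use_history=False))
print("with history:", play_sequence(targets, use_history=True))
```

Without history, every task costs the same number of turns for a given target. With history, once both latent values have been seen, each later task is solved in at most two turns.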
Framework

Walk through the framework

Define environments, evaluate how agents adapt, and train them to adapt better, all from the same composable pieces.

Environment Design

Every environment is the product of five swappable components. Change one axis without touching the others.

Five design axes

Each component registers itself; the registry resolves any choice into a single runnable environment.

Core environment

The within-task game: states, actions, reward in isolation.

Latent

The ground truth shared across all tasks in a sequence: the object of adaptation.

Prompt

How much of the latent the agent is told up front; sets the prior.

Feedback

What the agent observes between tasks; controls how fast evidence accumulates.

Horizon N

How many tasks the sequence runs, setting how much support the agent has for adaptation.

FullyDefinedEnv = core-env × latent × prompt × feedback × N

Features

Each axis changes independently

Any combination resolves to one runnable env, named by its choices (e.g. number-guessing / set-of-3 / no-info / standard / ep10).
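One way to picture the five-way product is as an immutable record with one field per axis. The sketch below is illustrative only: `EnvSpec` and its fields are hypothetical names, not LatentGym's API, and the name format simply mirrors the example above.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class EnvSpec:
    """Hypothetical record of one choice per design axis."""
    core_env: str
    latent: str
    prompt: str
    feedback: str
    num_tasks: int  # the horizon N

    @property
    def name(self) -> str:
        parts = [self.core_env, self.latent, self.prompt,
                 self.feedback, f"ep{self.num_tasks}"]
        return " / ".join(parts)


spec = EnvSpec("number-guessing", "set-of-3", "no-info", "standard", 10)
print(spec.name)

# Changing one axis leaves the other four untouched.
variant = replace(spec, feedback="information")
print(variant.name)

# The full prompt x feedback grid for a fixed core env, latent and horizon.
prompts = ("no-info", "some-info", "full-info")
feedbacks = ("standard", "information")
grid = [replace(spec, prompt=p, feedback=f) for p, f in product(prompts, feedbacks)]
print(len(grid))
```

Keeping the record frozen makes each configuration hashable, which is convenient if specs are used as dictionary keys for results.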

Built for evaluation and training

One environment object serves both evaluation and training. A thin adapter exposes each LatentGym env to SkyRL, which handles rollouts, weight sync, and policy optimization; built-in advantage estimators plug into the trainer.

Adding a new environment

Write the single-task dynamics and register the parts (latents, prompts, feedbacks register on import). Sequencing, evaluation, and RL are inherited.

# latentgym/envs/number_guessing/__init__.py
from latentgym.core.registry import register_env

from .core_env import NumberGuessingSingleEpisodeEnv

register_env(
    name="number_guessing",
    env_class=NumberGuessingSingleEpisodeEnv,
    default_num_episodes=7,
    min_range=1,
    max_range=1000,
    max_turns_per_episode=30,
)

from . import latents, prompts, feedbacks  # each registers itself on import
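For readers unfamiliar with the "registers itself on import" pattern, here is a generic sketch of how such a registry can work. It is not LatentGym's registry: `register_latent`, `resolve_latent` and the `set_of_2` latent are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical module-level registry: latent name -> factory function.
_LATENTS: Dict[str, Callable[[], List[int]]] = {}


def register_latent(name: str):
    """Decorator that records a latent factory under `name`."""
    def decorator(factory):
        if name in _LATENTS:
            raise ValueError(f"latent {name!r} is already registered")
        _LATENTS[name] = factory
        return factory
    return decorator


def resolve_latent(name: str) -> List[int]:
    """Look up a registered latent by name and build it."""
    try:
        factory = _LATENTS[name]
    except KeyError:
        known = ", ".join(sorted(_LATENTS)) or "none"
        raise KeyError(f"unknown latent {name!r}; registered: {known}") from None
    return factory()


# Importing the module that contains this definition is enough to register it,
# because the decorator runs at import time.
@register_latent("set_of_2")
def set_of_2() -> List[int]:
    return [137, 793]


print(resolve_latent("set_of_2"))
```

Because the decorator runs as a side effect of importing the module, a line such as `from . import latents` is all a package needs to make its components discoverable by name.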

Full guide on GitHub ↗

Mix-and-match across a sequence

A single sequence can even draw on a different core environment for each task while sharing one latent.

Difficulty axes

Example settings shown for Number Guessing.

Within-task

Set by · visible range
[1, 100] [1, 1000] [1, 10000]

Latent-identification

Set by · size of hidden set |z|
|z| = 2 |z| = 5 |z| = 10

Cross-task

Set by · joint of the two
|z|=2 in [1,1000] vs |z|=10 in [1,100]
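A rough way to see why cross-task difficulty depends on both settings jointly is to compare worst-case guess counts. The sketch below is a proxy of our own, not a LatentGym metric: it assumes higher/lower/correct feedback, an ideal binary-search agent, and (for the "latent known" column) that the agent already knows the hidden set exactly.

```python
def worst_case_guesses(n_candidates: int) -> int:
    """Worst-case guesses for binary search over n candidates.

    With higher/lower/correct feedback, k guesses can separate at most
    2**k - 1 candidates, so the answer is ceil(log2(n + 1)), which equals
    the bit length of n.
    """
    if n_candidates < 1:
        raise ValueError("need at least one candidate")
    return n_candidates.bit_length()


# (label, size of visible range, size of hidden set |z|)
settings = [
    ("|z|=2 in [1,1000]", 1000, 2),
    ("|z|=10 in [1,100]", 100, 10),
]

for label, range_size, latent_size in settings:
    blind = worst_case_guesses(range_size)      # ignores the latent
    informed = worst_case_guesses(latent_size)  # searches only within z
    print(f"{label}: blind={blind}, latent known={informed}, saved={blind - informed}")
```

Under these assumptions the first setting drops from 10 guesses to 2, while the second drops only from 7 to 4: a wide range with a small hidden set leaves the most room for cross-task improvement.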

Prompt and feedback conditions

How much support the agent gets for inferring and exploiting the latent.

| Prompt | What's revealed | What it tests |
| --- | --- | --- |
| no_info | Nothing about cross-task structure. | Does the agent discover structure on its own? |
| some_info | A vague hint that recurring structure may exist. | Does it identify the regularity efficiently? |
| full_info | An explicit description of the latent. | Can it use latent information when given? |

| Feedback | What's revealed between tasks | What it tests |
| --- | --- | --- |
| standard | Only a binary success/failure signal after each task. | Can the agent learn from sparse, realistic feedback? |
| information | Ground-truth outcome revealed regardless of success. | Does it improve once evidence accumulates faster? |

Seven environments

Each instantiates the five axes with its own latent families.

Evaluation

Beyond whether an agent improves, LatentGym measures how, separating exploration (gathering information about the latent) from exploitation (acting on what it gathered).

What standard summaries tell us

Three views on the reward curve r1, …, rN. Each says whether the agent got better, not how.

Cumulative reward R = r1 + … + rN

Uniform competence across the whole sequence. The reward we optimize during Cross-Task RL.

Cross-task gain rN − r1

Directly isolates whether the agent gets better with experience.

Final-task reward rN

Final capability after the agent has had the whole sequence to adapt.
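The three summaries are simple functions of the per-task reward list. The sketch below uses a hypothetical `summarize` helper and made-up reward values, chosen only to show that two agents can tie on cumulative reward while differing on gain.

```python
def summarize(rewards):
    """Return the three standard summaries of a reward curve r1..rN."""
    if not rewards:
        raise ValueError("need at least one task reward")
    return {
        "cumulative": sum(rewards),        # R = r1 + ... + rN
        "gain": rewards[-1] - rewards[0],  # rN - r1
        "final": rewards[-1],              # rN
    }


# Illustrative values only, not measured results.
flat_agent = [0.5, 0.5, 0.5, 0.5]
improving_agent = [0.25, 0.5, 0.5, 0.75]

print(summarize(flat_agent))
print(summarize(improving_agent))
```

Both curves have the same cumulative reward, yet only the second shows positive cross-task gain. None of the three numbers says whether that gain came from better probing or better use of what was probed.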

Separating exploration and exploitation

Tail reward mixes two skills: probing the latent (exploration) and acting on what's learned (exploitation). A two-agent hand-off separates them: agent A plays tasks 1…K and builds the history; agent B inherits it and plays K+1…N. Swapping who plays which half attributes any change to the explorer or the exploiter alone.

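Example: a sequence of 10 tasks with hand-off at K = 4, so tasks 1 to 4 form the exploration half and tasks 5 to 10 the exploitation half. Agents A and B are the two agents being compared; agent C is the fixed partner.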

Exploration efficiency

A explores, C exploits: T1 = r5 + … + r10
B explores, C exploits: T2 = r5 + … + r10
ExplorationEff(B vs A) = (T2 − T1) / T1

Exploitation efficiency

C explores, A exploits: T1′ = r5 + … + r10
C explores, B exploits: T2′ = r5 + … + r10
ExploitationEff(B vs A) = (T2′ − T1′) / T1′
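Both efficiencies are the same computation, a relative change in tail reward, applied to different pairs of hand-off runs. The sketch below assumes K = 4 and N = 10 as in the example; the helper names and reward values are hypothetical and purely illustrative.

```python
def tail_reward(rewards, k):
    """Sum of rewards on tasks K+1..N (the part played after the hand-off)."""
    if not 0 < k < len(rewards):
        raise ValueError("hand-off point must fall strictly inside the sequence")
    return sum(rewards[k:])


def relative_gain(t_new, t_ref):
    """Relative change of the new tail reward with respect to the reference."""
    if t_ref == 0:
        raise ValueError("reference tail reward is zero; ratio is undefined")
    return (t_new - t_ref) / t_ref


K = 4

# Exploration efficiency: the exploiter is fixed (C), the explorer is swapped.
a_then_c = [0.25] * 4 + [0.5] * 6    # A explores, C exploits
b_then_c = [0.25] * 4 + [0.75] * 6   # B explores, C exploits
exploration_eff = relative_gain(tail_reward(b_then_c, K), tail_reward(a_then_c, K))

# Exploitation efficiency: the explorer is fixed (C), the exploiter is swapped.
c_then_a = [0.25] * 4 + [0.5] * 6    # C explores, A exploits
c_then_b = [0.25] * 4 + [0.625] * 6  # C explores, B exploits
exploitation_eff = relative_gain(tail_reward(c_then_b, K), tail_reward(c_then_a, K))

print(f"ExplorationEff(B vs A)  = {exploration_eff:+.1%}")
print(f"ExploitationEff(B vs A) = {exploitation_eff:+.1%}")
```

Because only one role changes between the two runs in each pair, a non-zero value can be attributed to that role alone.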

Run an evaluation

One CLI run scores a model over a task sequence and writes per-task rewards and metrics.

python -m latentgym.cli.run_eval single \
    --models openrouter/openai:gpt-4o \
    --env number_guessing --latent set_of_3 \
    --prompt no_info --feedback standard \
    --num-episodes 10 --n-trajectories 50 \
    --output latentgym/results/ng_gpt4o/

Full guide on GitHub ↗

Training

Cross-Task RL

The same environments serve training: fine-tune on full sequences of N tasks so the policy is rewarded for inferring the latent early and exploiting it later.

Plug into SkyRL for the rollouts and policy optimization (PPO, GRPO, SFT). Per-task rewards are exposed across the sequence, so advantages span the whole run, not a single task.

# 1. Generate the training task sequences (parquets + trajectory JSONs)
python -m latentgym.cli.generate_data train \
    --env number_guessing --latent set_of_3 \
    --n-trajectories 500 --num-episodes 10 \
    --output latentgym/data/train/

# 2. Cross-Task RL with GRPO (SkyRL under the hood)
bash latentgym/training/train_minimal.sh   # single GPU
bash latentgym/training/train_fsdp.sh      # multi-GPU / FSDP

Full guide on GitHub ↗

See what this training recipe achieves in the Blog.

Interactive

Trajectory explorer

Three modes for browsing recorded trajectories. Frontier models: pick a frontier agent and a task and watch it play. Failure modes: jump directly to trajectories that illustrate adaptation neglect, breakdown, or miscalibration. RL variants: place any two of the Qwen3-8B variants (Base / Single-task RL / Cross-task RL) side by side.

Writeup

Findings & background

What LatentGym reveals: how frontier models fail to adapt, why Cross-Task RL fixes it, and how training conditions shape transfer.

A Testbed for Cross-Task Experiential Learning

LLM agents are increasingly deployed in settings where they handle sequences of related tasks: personalization, customer support, ongoing research assistants. The hidden structure shared across these tasks could be reused to improve later performance, yet whether current LLMs can actually leverage earlier experience to do better on later tasks is poorly understood. We introduce LatentGym, a testbed where every environment is built around an explicit, ground-truth latent that defines the structure shared across a sequence of tasks, so we can pinpoint where an agent succeeds or fails, and study the findings below.

Finding 1

Failure Modes of Frontier Models

A controlled testbed is only useful if existing models struggle on it. We evaluate three representative frontier models (GPT-4o, Claude Sonnet 4.6, and Gemini 2.5 Flash) chosen to span providers and capability profiles. Even under simple latents and across prompt conditions, these models fail at cross-task adaptation; inspecting individual trajectories surfaces three recurring failure modes.

Adaptation neglect (models restart from scratch on every task)

When models are not explicitly told that a hidden pattern exists, they solve each task largely independently, without attempting to infer reusable structure from previous experience. This appears across environments and even under very simple latents. The behavior may look reasonable in isolation (an instructed agent should not impose hidden assumptions), but real deployments contain recurring regularities that are not specified, and an agent that only acts on explicitly stated structure cannot improve with experience.

[Figure: average reward across tasks under no_info. Reward curves are flat across Number Guessing, Secretary, and Bandits: no improvement with experience.]

Adaptation breakdown (model sees the pattern but fails to act on it)

When the prompt hints that some pattern may exist, models show partial but ineffective adaptation. Sometimes they ignore the hint and act myopically; sometimes they search for a pattern but identify the wrong one; sometimes they correctly infer part of the structure but never act on it; sometimes they overfit a pattern and drift away from the original task. Recognizing latent structure and exploiting it turn out to be separable capabilities: models can partially succeed at one while failing at the other.

Gemini 2.5 Flash · Number Guessing (dynamic_range, some_info)
Task 10, Agent: I've reviewed the previous nine games. The numbers, ranging from 1461 to 2278, are well below the upper bound of 10365, although I'll continue using binary search… [5365] (…12 more turns)
The agent names the pattern, then ignores it.

Adaptation miscalibration (more information can make performance worse)

Even when the relevant pattern is described explicitly, performance can degrade relative to a vaguer prompt. The explicit description introduces a competing sub-goal: instead of treating the latent as a calibrated prior, the model optimizes for demonstrating the stated pattern, over-applying the rule or choosing actions that satisfy its interpretation while violating the game's operational constraints. Examples include proposing off-graph words in Word Ladder, choosing the wrong sub-range in Number Guessing, committing to a single arm in alternating Bandits, and accepting too early in Secretary. In-context adaptation requires not just the right information, but the right calibration on it.

[Figure: examples where full information harms performance. Giving the latent explicitly (full_info) can hurt relative to a vaguer prompt, across Number Guessing, Word Ladder, and Bandits.]
Finding 2

Cross-Task RL

Can RL fine-tuning instill general cross-task experiential learning? We compare two recipes on Qwen3-8B: single-task RL, whose reward depends on a single task's reward r; and cross-task RL, whose reward depends on the full sequence r1, …, rN (we use the cumulative reward R = r1 + … + rN), training the policy to use early tasks to infer the latent and act better on later ones.
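The difference between the two reward signals can be illustrated with a small example. This sketch is not the SkyRL or LatentGym implementation: it uses made-up rewards and a generic group-relative normalization (in the style of GRPO) over sequence-level returns, and how credit is then assigned to individual tasks or tokens is left out.

```python
from statistics import mean, pstdev


def group_relative_advantages(returns, eps=1e-8):
    """Normalize returns within a group of rollouts: (R - mean) / (std + eps)."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


# Two rollouts on the same task sequence (illustrative values only).
myopic = [0.5, 0.5, 0.5, 0.5]        # solves each task independently
explorer = [0.25, 0.75, 0.75, 0.75]  # spends task 1 probing, then exploits

# Single-task view: judged on task 1 alone, the myopic rollout looks better.
single_task_adv = group_relative_advantages([myopic[0], explorer[0]])

# Cross-task view: judged on the cumulative reward R, the explorer wins.
cross_task_adv = group_relative_advantages([sum(myopic), sum(explorer)])

print("single-task (task 1):", [round(a, 3) for a in single_task_adv])
print("cross-task (sum):    ", [round(a, 3) for a in cross_task_adv])
```

The sign flip is the point: a per-task signal penalizes early probing, while a sequence-level signal can reward it when it pays off on later tasks.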

Q1 · Cross-task RL is necessary

Cross-task RL leads on every environment, beating the base by an average of +99% and single-task RL by +39% on cumulative reward. It is the only variant with consistently positive cross-task gain across environments; single-task RL has negative gain on 5 of 6 environments, indistinguishable from the base. The largest single-environment lifts are on Wordle (+91% vs single-task) and Number Guessing (+72%).

Training across multiple environments helps further

A single cross-task model trained on a mixture of environments beats the per-environment cross-task models by an average of +19% on cumulative reward (largest gain on Secretary, +33%). This suggests cross-task RL learns a general adaptation strategy rather than environment-specific cues.

See it for yourself: how training changes behavior

Compare the same task solved by different Qwen3-8B variants. Cross-task RL needs fewer turns and uses the cross-task history; single-task RL improves the per-task ability but does not adapt across the sequence.

Q2 · Strategies generalize beyond the training distribution

A policy with a transferable adaptation strategy should generalize to held-out latents within trained environments (OOD-1) and to entirely held-out environments (OOD-2). Cross-task RL beats the base by an average of +55% on OOD-1 and +14% on OOD-2, and lifts rN in almost every OOD case despite not being directly trained on it.

Latent shift (OOD-1)

Environment shift (OOD-2, leave-one-out)

Q3 · Where does the lift come from?

To attribute Cross-Task RL's gain cleanly to either exploration or exploitation, we use the two-agent hand-off experiment defined on the diagnostics page. The numbers reported there isolate exploration efficiency and exploitation efficiency for CT vs ST on Number Guessing.

Finding 3

How training conditions shape generalization

To show the kind of question LatentGym makes answerable, we ask how the in-context feedback and prompt used at training time affect generalization. We train Qwen3-8B under all combinations of {full-info, some-info} training prompt × {standard, information} training feedback, holding the cumulative-reward objective fixed, and evaluate each across all four prompt × feedback eval combinations.
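The size of this sweep follows directly from the description: four training conditions, each evaluated on four eval conditions. The sketch below only enumerates the grid; the identifiers are hypothetical and no scores are included.

```python
from itertools import product

PROMPTS = ("full_info", "some_info")
FEEDBACKS = ("standard", "information")

# Every (prompt, feedback) pair is both a training and an eval condition.
conditions = list(product(PROMPTS, FEEDBACKS))

# One cell per (training condition, eval condition) pair.
cells = [
    {"train": train, "eval": evaluation, "in_distribution": train == evaluation}
    for train, evaluation in product(conditions, conditions)
]

print(len(conditions), "training conditions")
print(len(cells), "train x eval cells")
print(sum(cell["in_distribution"] for cell in cells), "in-distribution cells")
```

Only the four diagonal cells are in-distribution; the remaining twelve measure transfer under a prompt shift, a feedback shift, or both.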

Two findings

  • Standard feedback at training time outperforms information (full) feedback under both prompt regimes, despite providing strictly less information between tasks. The advantage comes from robustness to the eval-feedback shift: models trained with information feedback drop by 0.509 points when the eval feedback switches to standard; standard-feedback-trained models hold up.
  • Prompt richness produces no substantive difference in aggregate, but with an asymmetry. Full-info-trained models score +0.36 higher on full-info eval than on some-info eval, whereas some-info-trained models are essentially indifferent to the eval prompt. Some-info training thus yields a more uniformly transferable model, at the cost of the in-distribution advantage full-info training enjoys on full-info eval.

Effect of training feedback

Two views of the same sweep. Left: per training feedback, lines = eval feedbacks. Right: per eval feedback, lines = training feedbacks. Standard-feedback training transfers more robustly.

[Figure: left panel compares eval feedbacks within each training feedback; right panel compares training feedbacks within each eval feedback.]

Effect of training prompt

Same two-view layout. Left: per training prompt, lines = eval prompts. Right: per eval prompt, lines = training prompts. Full-info-trained models lead on full-info eval; some-info-trained models are nearly flat across eval prompts.

[Figure: left panel compares eval prompts within each training prompt; right panel compares training prompts within each eval prompt.]

An Example: Cross-Task RL vs Single-Task RL (Number Guessing)

Each cell is the relative tail-reward advantage of the Cross-task RL agent (CT) over the Single-task RL agent (ST) at switch point K. ExploreGainC holds the exploiter fixed at the reference agent C and swaps the explorer. ExploitGainC holds the explorer fixed at C and swaps the exploiter. We report both choices C ∈ {CT, ST} so you can see the gap is robust to the held-fixed role. All values positive: CT is both a better explorer and a better exploiter than ST.

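| Switch point K | Reference C | ExploreGainC | ExploitGainC |
| --- | --- | --- | --- |
| 2 | CT | +0.7% | +25.3% |
| 2 | ST | +4.4% | +30.1% |
| 4 | CT | +2.5% | +16.4% |
| 4 | ST | +14.2% | +29.5% |
| 6 | CT | +2.9% | +13.0% |
| 6 | ST | +16.9% | +28.5% |
| 8 | CT | +11.6% | +16.3% |
| 8 | ST | +17.4% | +22.7% |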
Per-env agent-switching on Number Guessing: Single-task RL vs Cross-task RL across switch points.
