TABX

A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning

ICML 2026

¹Korea University, ²Gauss Labs Inc.
* Equal contribution

Abstract

The design of environments plays a critical role in shaping the development and evaluation of cooperative multi-agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high-throughput sandbox designed for reconfigurable multi-agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade-offs across a diverse spectrum of task complexities. Leveraging JAX for hardware-accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customizable framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research.

TABX Scenarios

Scenarios shown: clover, crossfire, ribbon, superking, grid, bypass

Representative designed scenarios. Each agent has a partial, fan-shaped observation field oriented along its facing direction, analogous to a first-person perspective. Within this field, agents observe the current status (e.g., remaining health points) and specifications of all visible units, as well as terrain zone information. As a result, agents must actively rotate and navigate the environment to detect and engage enemies effectively. Colored ellipses represent terrain zones with distinct functional effects, such as movement speed reduction or visibility occlusion. Allies are denoted by a red outline, while enemies are indicated by a green outline.
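The fan-shaped observation field can be illustrated with a small JAX sketch. This is not the TABX implementation: the function name `visible_mask`, the parameters `fov_angle` and `view_range`, and the choice to ignore occlusion zones are all assumptions made for illustration.

```python
import jax.numpy as jnp

def visible_mask(pos, heading, others, fov_angle, view_range):
    """Boolean mask of which `others` lie inside a fan-shaped field.

    pos: (2,) observer position; heading: facing angle in radians;
    others: (N, 2) positions; fov_angle: full opening angle in radians.
    All names are hypothetical, not part of the TABX API.
    """
    delta = others - pos
    dist = jnp.linalg.norm(delta, axis=-1)
    bearing = jnp.arctan2(delta[:, 1], delta[:, 0])
    # Wrap the angular offset into [-pi, pi) before comparing.
    offset = (bearing - heading + jnp.pi) % (2 * jnp.pi) - jnp.pi
    return (dist <= view_range) & (jnp.abs(offset) <= fov_angle / 2)

pos = jnp.array([0.0, 0.0])
others = jnp.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [5.0, 0.0]])
# Observer faces +x with a 90-degree field and a range of 3.
mask = visible_mask(pos, 0.0, others, jnp.pi / 2, 3.0)
```

Only the first unit is inside the field here: the second is off to the side, the third is behind the observer, and the fourth is beyond the range. That is why an agent has to rotate to discover the others.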

Distinguishing TABX Properties

1. Configurable Free Parameters

TABX offers a diverse set of environmental parameters that enable custom configurations to address specific research questions, such as evaluating the fundamental properties and generalization of various MARL algorithms. These parameters span four primary dimensions of the environment: unit specifications, environmental zones, heuristic policy parameters, and physical dynamics. In TABX, these parameters are dynamically reconfigurable, allowing environment conditions to be varied across episodes without code modification or recompilation.
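One common way to get reconfiguration without recompilation in JAX is to pass environment parameters as array arguments rather than baking them in as compile-time constants. The sketch below illustrates that general pattern under assumed names (`move`, `params`, `speed`, `dt`); it does not reproduce how TABX structures its parameters.

```python
import jax
import jax.numpy as jnp

trace_count = 0  # counts how many times JAX traces (and compiles) `move`

@jax.jit
def move(positions, velocities, params):
    global trace_count
    trace_count += 1  # Python side effect: runs only while tracing
    # Per-unit speed is ordinary array data, not a compile-time constant.
    return positions + params["dt"] * params["speed"][:, None] * velocities

positions = jnp.zeros((3, 2))
velocities = jnp.ones((3, 2))
# Two hypothetical unit configurations with identical shapes and dtypes.
slow = {"speed": jnp.array([1.0, 1.0, 1.0]), "dt": jnp.asarray(0.1)}
fast = {"speed": jnp.array([2.0, 3.0, 4.0]), "dt": jnp.asarray(0.1)}

out_slow = move(positions, velocities, slow)
out_fast = move(positions, velocities, fast)  # reuses the compiled function
```

After both calls `trace_count` is still 1, because JAX caches compiled functions by argument shape and dtype and not by value. Changing the number of units would change the array shapes and trigger a new trace unless the arrays are padded to a fixed size.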

Scenario GUI Editor

The interface enables visual authoring of scenarios by allowing users to place ally and enemy units, configure unit specifications, and define environmental zones with adjustable functional effects. The editor provides direct access to key environment parameters through an interactive, code-free workflow.

2. Non-targeting Mechanism

Interaction Dynamics

A distinguishing aspect of TABX is its unit interaction system, which incorporates non-targeted attack and healing mechanisms. Each unit is associated with a forward-facing rectangular hurtbox of length \(L\), corresponding to its attack range, and bounded laterally by the field of view (FoV). An attack is registered whenever a target unit's circular body collider intersects the hurtbox.
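The collision test described above can be sketched as a circle-versus-oriented-rectangle check. The version below is a simplification: it assumes a fixed lateral `half_width` instead of deriving the lateral bound from the FoV, and the function name `hits` and all of its parameters are hypothetical.

```python
import jax.numpy as jnp

def hits(attacker_pos, heading, length, half_width, target_pos, target_radius):
    """True where a target's circular body overlaps the forward rectangle.

    target_pos: (N, 2). `half_width` is an assumed stand-in for the
    FoV-derived lateral bound described in the text.
    """
    # Express target centres in the attacker's local frame:
    # x along the facing direction, y lateral to it.
    forward = jnp.array([jnp.cos(heading), jnp.sin(heading)])
    lateral = jnp.array([-jnp.sin(heading), jnp.cos(heading)])
    delta = target_pos - attacker_pos
    local = jnp.stack([delta @ forward, delta @ lateral], axis=-1)
    # Closest point of the rectangle [0, L] x [-w, w] to each circle centre.
    lo = jnp.array([0.0, -half_width])
    hi = jnp.array([length, half_width])
    closest = jnp.clip(local, lo, hi)
    return jnp.sum((local - closest) ** 2, axis=-1) <= target_radius ** 2

attacker = jnp.array([0.0, 0.0])
targets = jnp.array([[2.0, 0.0], [3.4, 0.0], [2.0, 1.2], [-1.0, 0.0]])
# Attacker faces +x; rectangle length L = 3, half-width 0.5, body radius 0.5.
hit_mask = hits(attacker, 0.0, 3.0, 0.5, targets, 0.5)
```

In this example the first two targets register a hit (the second only because its body reaches into the far edge of the box), while the third is too far to the side and the fourth is behind the attacker. Because no target index is chosen, every unit whose body overlaps the box is affected.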

3. Role-Appropriate Heuristic Policy

Operation of TABX heuristic policy

We propose a role-appropriate heuristic policy wherein each unit attribute contributes an independent behavioral bias. We define three orthogonal role classes: Ranger, Assassin, and Healer. These roles are mapped from intrinsic unit attributes, namely attack range, movement speed, and attack damage polarity. The classes are not mutually exclusive, so a single agent can embody multiple roles; for instance, a high-mobility unit with long-range capabilities is categorized as both a Ranger and an Assassin. Each role imparts a distinct behavioral tendency, and the agent's final action emerges from the composition of these role-specific primitives within a unified decision-making pipeline, allowing for the generation of diverse and sophisticated adversarial behaviors.
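As a rough illustration of how non-exclusive roles might compose, the sketch below maps unit attributes to role flags and sums a per-role bias vector. The thresholds, the two-column bias layout, the bias values, and the convention that negative damage means healing are all assumptions for illustration; the actual TABX pipeline is not specified at this level of detail here.

```python
import jax.numpy as jnp

def role_flags(attack_range, speed, damage, range_thresh=4.0, speed_thresh=2.0):
    """Map unit attributes to [Ranger, Assassin, Healer] flags (hypothetical rule)."""
    ranger = attack_range > range_thresh
    assassin = speed > speed_thresh
    healer = damage < 0.0  # assumed: negative damage polarity means healing
    return jnp.stack([ranger, assassin, healer], axis=-1)

def compose_bias(flags, role_biases):
    """Sum the bias vectors of every role a unit holds."""
    return flags.astype(jnp.float32) @ role_biases

attack_range = jnp.array([6.0, 1.0, 5.0])
speed = jnp.array([3.0, 3.0, 1.0])
damage = jnp.array([10.0, 10.0, -5.0])
flags = role_flags(attack_range, speed, damage)

# Arbitrary illustrative biases over two behaviour axes, one row per role.
role_biases = jnp.array([[1.0, 0.0],    # Ranger
                         [0.0, 1.0],    # Assassin
                         [-1.0, -1.0]]) # Healer
bias = compose_bias(flags, role_biases)
```

The first unit exceeds both the range and speed thresholds, so it is flagged as Ranger and Assassin at once and receives the sum of both biases. This additive composition is one simple reading of "independent behavioral bias", not a claim about the authors' method.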

Experiments

1. Benchmarks of MARL Algorithms

An evaluation of various MARL baselines shows that while MAPPO and IPPO generally perform well, IPPO outperforms MAPPO in specific scenarios such as crossfire and ambush, indicating that centralized value learning is not universally advantageous. Among value-based methods, QMIX consistently outperforms IQL, which suggests that its structured value decomposition facilitates coordination under shared rewards. Overall, scenarios within the TABX environment require extensive training due to complex spatial interactions and partial observability.

2. Zero-shot Generalization

To evaluate multi-agent zero-shot generalization, we integrated unsupervised environment design (UED) algorithms with the MAPPO framework, varying two parameter categories: environmental zone layouts and agent-level unit specifications (health, speed, and attack damage). While most baselines generalized successfully to unseen spatial terrain layouts, they struggled significantly with unseen unit configurations, with average win rates dropping below 50%.

3. Scalability of TABX

Scalability of TABX with increasing numbers of parallel environments

Speed comparison between TABX and SMAX, including JIT compilation overhead, with an increasing number of parallel environments

By leveraging JAX-based vectorization on a single GPU, TABX scales near-linearly with the number of parallel environments, sustaining high throughput even as the unit count grows from 11 to 57 units. Compared against SMAX under matched conditions, TABX achieves higher steps/sec across all tested configurations, with the advantage widening as the number of parallel environments increases. Crucially, TABX supports dynamic reconfiguration of scenarios across episodes without JIT recompilation overhead, making it practical for workloads that require frequent scenario changes, such as UED, curriculum learning, and large-scale ablations.
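The vectorization pattern behind this kind of scaling can be sketched with `jax.vmap` over a toy single-environment step. The dynamics, the reward, and the names `step`, `batched_step`, and `speed` are placeholders chosen for illustration and are unrelated to the real TABX step function.

```python
import jax
import jax.numpy as jnp

def step(state, action, params):
    """Toy single-environment step: state and action are (num_units, 2)."""
    new_state = state + params["speed"][:, None] * action
    reward = -jnp.sum(new_state ** 2)
    return new_state, reward

# vmap maps over the leading axis of every argument, including the params
# dict, so each parallel environment can carry its own unit specifications.
batched_step = jax.jit(jax.vmap(step))

num_envs, num_units = 1024, 11
k_state, k_speed = jax.random.split(jax.random.PRNGKey(0))
states = jax.random.normal(k_state, (num_envs, num_units, 2))
actions = jnp.zeros((num_envs, num_units, 2))
params = {"speed": jax.random.uniform(k_speed, (num_envs, num_units),
                                      minval=0.5, maxval=2.0)}

next_states, rewards = batched_step(states, actions, params)
```

Because the per-environment parameters are batched data, a UED or curriculum loop can resample `params` every episode and call the same compiled `batched_step` again, provided shapes and dtypes stay fixed.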