OpenRA.Bot is a Python-side RL/control package for OpenRA. It uses pythonnet to load the built game assemblies, calls the engine-side PythonAPI, and exposes a Gym-style environment for random agents, rule-based agents, and a baseline PPO training loop.
Status (2026-06-18): The PPO stack is technically working with entity observations, macro production actions, asset-value reward, headless mode, and spawned multi-environment rollout support. The current best development-only PPO runs can approach and sometimes slightly exceed the upgraded RuleBasedAgent on asset reward; repeated UE4/KL0.03 runs can reach ~0.106 peak reward, but do not reliably keep that policy through the final checkpoint. Recent experiments show that the main bottleneck is no longer a single bug in PPO, observation, or reward sparsity; it is the weak expert prior plus long-horizon RTS credit assignment under a mostly on-policy PPO setup.
- Python environment wrapper around the in-engine API
- Engine-side PythonAPI bridge for local game start, stepping, state extraction, and action dispatch
- Rule-based and random agents for smoke testing
- A custom ActorCritic + PPOAgent baseline with BC warm-start
- Local game, local hosted lobby, and remote lobby connection helpers
- Build-order distance reward (DI-star / AlphaStar inspired)
- Asset-value reward with production-start / active-production credit
- Macro production action space for development-only training
- Decision-step policy gradient masking
- Action masking with -inf penalty (prevents policy collapse)
- Entity-based observations (Phase 1, observation_type="entity")
- Headless local rollout and SubprocVecEnv multi-process training support
- envs/openra_env.py: main Gym environment (MCV type detection fix, BO / asset reward, macro actions, decision mask support)
- envs/vector_env.py: spawned SubprocVecEnv wrapper for multi-process headless rollout
- envs/wrappers.py: ShapedRewardWrapper (passthrough + diagnostics), AugmentedStateWrapper (frame stacking)
- utils/engine.py: loads OpenRA.Game.dll and PythonAPI through pythonnet
- utils/obs.py: converts PythonAPI.GetState() output into Python dictionaries
- utils/actions.py: encodes Python action dicts into RLAction
- utils/net.py: local host / remote join / lobby helpers
- utils/PythonAPI.cs: engine bridge source (C#, reflection-based; known fragility on ARM64)
- agent/agent.py: RandomMoveAgent, upgraded RuleBasedAgent, PPOAgent (BC warm-start, diagnostics, vectorized rollout support)
- models/actor.py: VectorEncoder, AugmentedVectorEncoder, MultiDiscretePolicy, ActorCritic
- models/buffer.py: rollout buffer for PPO (recurrent + GAE)
- scripts/train_rl.py: PPO training entry with entity / macro / headless / multi-env flags
- scripts/train_best.ps1: current recommended Windows launcher for development-only PPO runs
- scripts/warmstart.py: BC data collection + pre-training from RuleBasedAgent
- scripts/verify_asset_reward.py: compares asset-reward headroom between rule-based development agents
- scripts/rl_smoke_test.py: quick RL smoke test
- scripts/remote_rule_based.py: join a remote lobby and run RuleBasedAgent
- scripts/remote_ppo.py: join a remote lobby and inspect PPOAgent actions, masks, and queue state
- PLAN.md: long-term roadmap (AlphaStar-style architecture)
- REPORT.md: detailed experiment log and findings
The current execution path is:
- envs/openra_env.py calls utils/engine.py to load the OpenRA assemblies.
- PythonAPI.StartLocalGame(...) or the lobby helpers initialize a match.
- PythonAPI.GetState() returns a simplified RLState.
- utils/obs.py converts that state into Python dicts.
- OpenRAEnv converts the raw dict into feature, vector, image, or entity observations.
- An agent chooses either a legacy dict action list or a MultiDiscrete action.
- utils/actions.py and PythonAPI.SendActions(...) translate that into OpenRA orders.
- PythonAPI.Step() advances the simulation.
- A platform supported by your OpenRA build and pythonnet
- Python 3.8+
- A built OpenRA tree with OpenRA.Game.dll and OpenRA.runtimeconfig.json
- A mod and map that can be started from code, for example ra
The Python package expects the compiled PythonAPI type to be available from OpenRA.Game.dll.
Recommended workflow:
- Keep the bridge source in OpenRA.Bot/utils/PythonAPI.cs.
- Add or sync that file into the OpenRA.Game project in your OpenRA solution.
- Build OpenRA so that the Python side can load the resulting assemblies from bin_dir.
OpenRAApiBridge.cs is deprecated and should not be used by new code.
cd F:\Projects\OpenRA\OpenRA.Bot
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

Rule-based / manual smoke test:

python scripts/example_usage.py

Baseline PPO training:

python scripts/train_rl.py

Current recommended development-only PPO launcher on Windows:

.\scripts\train_best.ps1 -Updates 100 -RunName stronger_teacher_no_head_repeat

Useful overrides:

.\scripts\train_best.ps1 `
-Updates 150 `
-NumSteps 128 `
-MaxEpisodeTicks 1800 `
-LearningRate 7e-5 `
-TargetKl 0.03 `
-UpdateEpochs 4 `
-RunName ppo_asset_macro_repeat

The launcher runs the default environment with observation_type="entity", action_space_mode="macro", and reward_mode="asset", in headless mode, with BC warm-start from RuleBasedAgent. It does not load the BC action head unless -LoadBcActionHead is specified. Current experiments show that hard-loading the BC action head anchors PPO too strongly to the teacher's action mix and usually hurts training.
Parallel rollout can be enabled from train_rl.py:
python scripts/train_rl.py --observation-type entity --action-space-mode macro --headless --num-envs 4

Remote rule-based control:

python scripts/remote_rule_based.py --host 127.0.0.1 --port 1234 --slot Multi0

Remote PPO action debugging:

python scripts/remote_ppo.py --host 127.0.0.1 --port 1234 --slot Multi0

Remote PPO training:

python scripts/train_rl.py --remote-host 127.0.0.1 --remote-port 1234 --remote-slot Multi0

from envs.openra_env import make_env

# Start a local match on the ra mod with vector observations.
env = make_env(
    bin_dir="F:/Projects/OpenRA/bin",
    mod_id="ra",
    map_uid="b53e25e007666442dbf62b87eec7bfbe8160ef3f",
    ticks_per_step=10,
    observation_type="vector",
    enable_actions=["noop", "move", "attack", "produce", "build", "deploy"],
)

obs, info = env.reset()

# Drive the environment with random actions until the episode ends.
for _ in range(1000):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break

env.close()

OpenRAEnv currently supports four observation modes:
- feature: returns the raw Python dict built from PythonAPI.GetState()
- vector: flattened numeric observation for MLP policies
- image: 128 x 128 x 10 semantic map for CNN-style policies
- entity: fixed-cap entity tensor plus scalar features for the current lightweight entity encoder
The feature mode is the most complete and the best option for debugging. It includes:
- actors
- resources
- production
- producible_catalog
- placeable_areas
- cash
- resources_total
- resource_capacity
- power
- my_owner
Each actor now exposes two order-related fields:
- available_orders: a filtered list intended for bot/RL logic
- available_order_ids: the raw order ids exposed by engine traits
Important nuance: available_order_ids is closer to "what traits are present on this actor", while available_orders is the safer field to use for decision logic. For example, transformable buildings may expose a raw Move order id even when they should not be treated as currently mobile.
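The distinction can be illustrated with a small check that only trusts the filtered field. This is a minimal sketch: the two order field names come from the text above, while the helper name, the other dict keys, and the sample values are hypothetical and not taken from the real feature observation.

```python
def can_issue(actor: dict, order: str) -> bool:
    """Return True only if the filtered order list allows `order`."""
    return order in actor.get("available_orders", [])


# Hypothetical actor entry; real entries come from the feature observation.
construction_yard = {
    "type": "fact",
    "available_orders": ["DeployTransform"],
    "available_order_ids": ["Move", "DeployTransform"],
}

# The raw field lists Move, but the filtered field does not, so decision
# logic should treat this actor as not currently mobile.
print(can_issue(construction_yard, "Move"))  # False
```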
Current vector layout:
- Up to 100 friendly units, 6 features each
- Up to 100 enemy units, 5 features each
- 7 resource/power slots
- 2 map-size slots
Important caveat: the resource/power section is currently placeholder-filled in envs/openra_env.py, so the vector observation does not yet fully use the economy state already available from PythonAPI.GetState().
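For reference, the flat length implied by this layout can be worked out as below. The slot counts and feature widths come from the list above; the segment order and the segment names are assumptions for illustration and may not match how envs/openra_env.py actually concatenates them.

```python
# (slot count, features per slot), taken from the layout listed above.
# Segment names and ordering are assumptions for this sketch.
VECTOR_LAYOUT = {
    "friendly_units": (100, 6),
    "enemy_units": (100, 5),
    "resource_power": (7, 1),
    "map_size": (2, 1),
}


def segment_offsets(layout):
    """Map each segment name to its (start, end) slice in the flat vector."""
    offsets, start = {}, 0
    for name, (slots, width) in layout.items():
        end = start + slots * width
        offsets[name] = (start, end)
        start = end
    return offsets


offsets = segment_offsets(VECTOR_LAYOUT)
total_length = offsets["map_size"][1]
print(total_length)  # 1109
```

Under these assumptions the placeholder-filled resource/power section would occupy indices 1100 to 1106 of the vector.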
Current image layout:
- Shape: (128, 128, 10)
- Channels currently used reliably:
  - friendly infantry / non-infantry
  - enemy infantry / non-infantry
- Channels for resources, cash, and power are not fully populated yet
The entity mode is the current default for new PPO experiments. It uses utils/entity_obs.py to build:
- entities: up to MAX_ENTITIES actor rows with per-actor features
- entity_mask: valid-row mask
- scalar: compact economy / power / game-state features
The lightweight entity encoder is enough to clone the current RuleBasedAgent, but PPO still plateaus near the teacher without a stronger training signal.
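The fixed-cap layout with a valid-row mask can be sketched as follows. This is an illustration only: the cap and feature width used here are small hypothetical values, the helper name is made up, and the real builder in utils/entity_obs.py may order, truncate, or encode rows differently.

```python
MAX_ENTITIES = 4  # hypothetical cap for illustration; the real value is not stated here
FEATURE_DIM = 3   # hypothetical per-actor feature width


def pad_entities(rows):
    """Truncate or zero-pad actor rows to MAX_ENTITIES and build the valid-row mask."""
    kept = [list(r) for r in rows[:MAX_ENTITIES]]
    missing = MAX_ENTITIES - len(kept)
    entities = kept + [[0.0] * FEATURE_DIM for _ in range(missing)]
    entity_mask = [1] * len(kept) + [0] * missing
    return entities, entity_mask


entities, entity_mask = pad_entities([[1.0, 0.5, 0.2], [0.0, 0.9, 0.4]])
print(entity_mask)  # [1, 1, 0, 0]
```

The mask lets an encoder ignore padded rows, so the observation shape stays constant while the number of real actors changes.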
The default multidiscrete RL action space is:
[action_type, unit_idx, target_x, target_y, target_idx, unit_type_idx]
Supported action types depend on enable_actions, but the usual set is:
- noop
- move
- attack
- produce
- build
- deploy
Semantics:
- move: uses unit_idx, target_x, target_y
- attack: uses unit_idx, target_idx
- produce: uses queue actor index + unit_type_idx
- build: uses queue actor index + unit_type_idx + target cell
- deploy: uses unit_idx
The environment also accepts legacy Python dict actions, which is what the rule-based agent uses.
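To make the per-type semantics concrete, the sketch below names the six heads and keeps only the ones each action type reads. Several things here are assumptions: the action-type index order follows the usual set listed above, the queue actor index is assumed to travel in the unit_idx head, and the returned dict is a debugging aid, not the environment's legacy dict action format.

```python
ACTION_TYPES = ["noop", "move", "attack", "produce", "build", "deploy"]
HEADS = ["action_type", "unit_idx", "target_x", "target_y", "target_idx", "unit_type_idx"]

# Which heads each action type reads, following the semantics listed above.
# Assumption: produce/build carry the queue actor index in unit_idx.
USED_HEADS = {
    "noop": [],
    "move": ["unit_idx", "target_x", "target_y"],
    "attack": ["unit_idx", "target_idx"],
    "produce": ["unit_idx", "unit_type_idx"],
    "build": ["unit_idx", "unit_type_idx", "target_x", "target_y"],
    "deploy": ["unit_idx"],
}


def describe_action(action):
    """Turn a 6-element action into a dict holding only the heads its type reads."""
    named = dict(zip(HEADS, action))
    kind = ACTION_TYPES[named["action_type"]]
    result = {"type": kind}
    for head in USED_HEADS[kind]:
        result[head] = named[head]
    return result


print(describe_action([1, 3, 40, 22, 0, 0]))
# {'type': 'move', 'unit_idx': 3, 'target_x': 40, 'target_y': 22}
```

A helper like this is mainly useful when logging sampled actions, since it hides the heads that the chosen action type ignores.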
For development-only training, action_space_mode="macro" collapses the meaningful decision to the action_type head:
noop, produce:powr, produce:proc, produce:barr, produce:weap, produce:e1, ...
The remaining argument heads are masked to index 0. MCV deployment and finished-building placement are handled by environment automation. This was added because the original 6-head action space made successful production a low-probability joint event (produce + correct queue + correct unit_type), which was too hard to reinforce reliably in single-env PPO.
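The "low-probability joint event" argument can be checked with simple arithmetic for a uniform random policy. The head sizes below are illustrative assumptions (only the 100-slot actor head is mentioned elsewhere in this document), and masks are ignored, so the numbers show the order of magnitude rather than the real training setup.

```python
from fractions import Fraction


def uniform_hit_probability(head_sizes):
    """Chance that a uniform random policy picks one specific value on every listed head."""
    p = Fraction(1)
    for size in head_sizes:
        p *= Fraction(1, size)
    return p


# Hypothetical factored case: 6 action types, 100 actor slots, 20 unit types.
factored = uniform_hit_probability([6, 100, 20])
# Hypothetical macro case: noop plus 20 produce:<item> entries on one head.
macro = uniform_hit_probability([21])

print(factored, macro, macro / factored)  # 1/12000 1/21 4000/7
```

Under these assumptions a specific production order is roughly 570 times more likely to be sampled by chance in macro mode, which is the effect the macro action space relies on.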
info["action_mask"] currently includes some of the following fields:
action_typemove_maskattack_maskdeploy_maskproduce_queue_maskproduce_unit_type_maskbuild_maskbuild_unit_type_maskunit_idxtarget_idxtarget_xtarget_yunit_type
These masks are no longer only heuristic action-type hints. The current implementation mixes:
- engine-side feasibility checks for move, attack, and deploy
- queue-state- and placement-driven checks for produce and build
- per-head masks consumed by PPOAgent during both sampling and training
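How a per-head mask changes the sampling distribution can be shown with a masked softmax. This is a sketch under assumptions: it adds a penalty to the logits of masked entries, which is one common way to apply such masks, and it is not taken from PPOAgent. It also compares a -inf penalty with a finite log(1e-6) penalty.

```python
import math


def masked_softmax(logits, mask, penalty=-math.inf):
    """Softmax after adding `penalty` to every logit whose mask entry is 0."""
    shifted = [x if m else x + penalty for x, m in zip(logits, mask)]
    peak = max(shifted)  # assumes at least one entry is valid
    exps = [math.exp(x - peak) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]
mask = [1, 0, 1]

hard = masked_softmax(logits, mask)                          # -inf penalty
soft = masked_softmax(logits, mask, penalty=math.log(1e-6))  # finite penalty

print(hard[1] == 0.0, soft[1] > 0.0)  # True True
```

With the finite penalty the invalid action keeps a small nonzero probability, so it can still be sampled and still receives gradient; with -inf it is excluded exactly.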
Current behavior:
- move_mask: only set when the actor has a feasible move in a nearby neighborhood
- attack_mask: per-attacker / per-target feasibility matrix
- deploy_mask: checked through engine feasibility
- produce_queue_mask: only queues that are enabled, empty, and can actually produce something in the current catalog
- build_mask: only queues with a completed item and a currently available placement area
- target_x / target_y: conditioned on the selected actor or queue
  - move targets are restricted to a local neighborhood around the selected actor
  - build targets are restricted to coordinates present in placeable_areas
Remaining limitation: target_x and target_y are still masked independently rather than as a joint (x, y) cell distribution, so some invalid coordinate pairs can still be sampled.
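The limitation is easy to reproduce on a toy example. The cell coordinates below are hypothetical; the point is only that two independent 1D masks accept every combination of a valid x with a valid y, including cells that are not placeable.

```python
from itertools import product

# Hypothetical placeable cells for one finished building.
placeable = {(10, 4), (11, 4), (12, 7)}

# Factorized masks: any x or y that appears in at least one valid cell.
valid_x = {x for x, _ in placeable}
valid_y = {y for _, y in placeable}
factorized_pairs = set(product(valid_x, valid_y))

# Pairs the factorized masks allow even though no such cell is placeable.
leaked = factorized_pairs - placeable
print(len(factorized_pairs), len(leaked))  # 6 3
```

A joint mask would be indexed by the (x, y) pair itself, so only the three placeable cells would remain selectable.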
The default reward in envs/openra_env.py is development-oriented, not combat-oriented. The current default reward_mode="asset" combines:
- cost-weighted growth in owned actors (AssetValueTracker)
- credit for starting new production items
- credit for keeping production active
- a penalty for canceling in-progress production
- optional idle-cash and power-deficit penalties, currently disabled by default because per-step penalties can swamp one-time asset gains
reward_mode="legacy" keeps the older build-order / unit-count shaping path. The asset reward fixes the capped [powr, proc, barr] build-order reward and makes army production visible, but it is still not enough on its own for strong tactical play or decisive improvement over a scripted macro teacher.
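The shape of an asset-style reward can be sketched as a cost-weighted delta plus small production terms. Everything numeric here is a hypothetical placeholder (costs, scale, bonuses, penalty), and the function names are made up for illustration; this is not the AssetValueTracker implementation.

```python
# Hypothetical costs; real values would come from the game's rules.
COSTS = {"powr": 300, "proc": 1400, "e1": 100}


def asset_value(actor_types):
    """Sum the cost of every owned actor with a known cost."""
    return sum(COSTS.get(t, 0) for t in actor_types)


def asset_reward(prev_types, curr_types, started=0, active=0, canceled=0,
                 scale=1e-4, start_bonus=0.01, active_bonus=0.001,
                 cancel_penalty=0.02):
    """Cost-weighted growth plus production credit, minus a cancel penalty."""
    growth = asset_value(curr_types) - asset_value(prev_types)
    return (scale * growth
            + start_bonus * started
            + active_bonus * active
            - cancel_penalty * canceled)


# One refinery completed, one new item started, one queue still active.
r = asset_reward(["powr"], ["powr", "proc"], started=1, active=1)
print(round(r, 3))  # 0.151
```

Because the growth term is weighted by cost, an expensive structure contributes more than a cheap unit, which is what makes army and economy production visible to the policy.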
OpenRAEnv.reset() supports three startup modes:
- local single-player start through PythonAPI.StartLocalGame(...)
- host-local lobby flow through env.configure_host(...)
- remote server join through env.configure_remote(...)
See utils/net.py for the exact lobby helper flow.
For remote control, the current flow is:
- env.configure_remote(...)
- reset() joins the server
- the client claims a slot, acknowledges the selected map, and marks itself ready
- lobby/network state is pumped until the host starts the game
- once the world exists, normal observation / action stepping begins
Recent bridge changes were specifically made to keep network traffic progressing while still in the lobby, so remote clients can stay synchronized through the lobby-to-game transition.
| Issue | Status |
|---|---|
| Policy collapse (KL explosion to 15+) | Fixed with -inf mask penalty |
| MCV deploy reward=0 | Fixed with type-based _is_building detection (ARM64 reflection bug workaround) |
| Reward signal too sparse (0.7% non-zero) | Fixed with BO distance reward + producing_per_step (now 98%+ non-zero) |
| Action mask penalty insufficient | Fixed: log(clamp(1e-6)) → -inf |
| Short rollout with num_steps < seq_len produced zero PPO batches | Fixed in models/buffer.py |
| Macro action auto-placement could fail silently | Fixed: done buildings are placed by queue actor id |
| Single-process-only rollout | Improved with headless SubprocVecEnv support |

- PPO can match the upgraded
RuleBasedAgentand occasionally reaches higher asset reward, but does not yet decisively retain that improvement over long runs - The final checkpoint is often not the best checkpoint; best-model saving is now required for meaningful comparisons
- Loading the BC action head directly is too strong an anchor and caused high-KL early stopping in recent runs
- The current expert prior is still weak: it is a hand-written macro script, not a full-game expert with scouting, combat, tech transitions, or opponent adaptation
- There is no soft teacher-KL regularizer yet; PPO either drifts unstably or is over-anchored by hard-loaded BC heads
- There is no goal conditioning yet: build-order / strategy targets are not fed into the policy as an explicit z
- Rewards are still scalar and development-heavy; there are no separate value heads for economy, army, combat, and win/loss
- Combat and terminal win/loss training are still missing from the main training loop
- PythonAPI.GetState() is expensive (scans all world state every call, heavy reflection usage)
- target_x / target_y masking is still factorized rather than fully cell-joint
- PythonAPI.cs uses reflection to access OpenRA.Mods.Common types, which is fragile across OpenRA versions and platforms
- Multi-environment rollout works in headless mode, but Windows multiprocessing / CLR process management remains operationally fragile and needs more soak testing
- Deploy mask blocks building undeploy (openra_env.py): The deploy action mask excludes known building types (fact, afld, weap, etc.) from undeploying via DeployTransform. This prevents the agent from constantly undeploying the Construction Yard and canceling in-progress production. In a real game, undeploying to relocate the base is a valid strategy. TODO: Remove the building-type blocklist and let the agent learn the cost of interrupting production via reward penalties (e.g. production-cancel penalty, time-waste penalty).
- Building queue single-item guard (PythonAPI.cs): SendActions suppresses StartProduction for building-type items when the target queue already has an item (in-progress or Done). This prevents the agent from accidentally overwriting completed items before they can be placed. Unit-type queues (infantry, vehicle) are unaffected. TODO: Consider whether this should eventually be relaxed for advanced queue management strategies.
- Per-category produce mask (openra_env.py): The produce_unit_type_mask blocks unit types whose production queue category (e.g. "building") is already occupied. This is the soft counterpart of the C#-side guard above. TODO: Re-evaluate once the agent can reliably complete the produce→build cycle.
- Use feature observations first when debugging action execution.
- For remote debugging, start with scripts/remote_rule_based.py or scripts/remote_ppo.py before running long PPO training jobs.
- Treat the current PPO stack as a baseline to iterate on, not a final trainer.
- For current PPO experiments, prefer scripts/train_best.ps1 defaults: entity obs, macro actions, asset reward, headless, no BC action head.
- Track mean_reward, last20, KL, entropy, batches, decision_steps, atype_dist, and reward_comp; reward alone hides collapse.
- If policy quality is the goal, the next highest-leverage work is a stronger teacher / goal-conditioned strategy prior / soft teacher-KL, not a larger network.
- If sample efficiency is the goal, optimize state extraction and keep hardening multi-env headless rollout.
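As an example of why reward alone hides collapse, the sketch below turns an action-type histogram into a normalized entropy score. It assumes atype_dist can be read as a mapping from action-type name to sample count, which is an assumption about the log format; the helper name and the sample histograms are hypothetical.

```python
import math


def normalized_entropy(counts):
    """Entropy of an action-type histogram divided by its maximum, in [0, 1]."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    entropy = sum(p * math.log(1.0 / p) for p in probs)
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    return entropy / max_entropy


healthy = {"noop": 40, "produce:powr": 30, "produce:e1": 30}
collapsed = {"noop": 100, "produce:powr": 0, "produce:e1": 0}

print(normalized_entropy(collapsed))  # 0.0
```

A score near 0 means the policy has settled on a single action type, which can happen while mean reward still looks acceptable for a few updates.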
- Local start fails: check bin_dir, mod_id, and map_uid, and confirm the OpenRA build artifacts exist.
- Python cannot load the engine: verify OpenRA.runtimeconfig.json and the required assemblies are present in bin_dir.
- Remote join enters the lobby but does not stay synchronized: rebuild OpenRA after syncing utils/PythonAPI.cs, because remote-lobby behavior depends on the latest bridge code.
- Production/build actions appear invalid: inspect production, placeable_areas, available_orders, and available_order_ids from feature observations first.
- Actor indices behave strangely: remember that unit_idx and target_idx are mapped through the latest cached unit-id lists, not raw actor IDs.
- A transformable building appears to have Move: check available_order_ids vs available_orders. The raw field may still include transform-related move orders, while the filtered field is the one intended for control logic.
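The note on actor indices can be illustrated with a tiny index cache. This is a hypothetical sketch: the class name, the "id" key, and the sort-by-ID ordering are assumptions, and the environment's real caching may differ. It only shows why the same unit_idx can point at different actors after the list is refreshed.

```python
class UnitIndexCache:
    """Keeps the most recent ordered list of actor IDs behind unit_idx."""

    def __init__(self):
        self.unit_ids = []

    def refresh(self, actors):
        # Sorting by actor ID is an assumption made for this sketch.
        self.unit_ids = sorted(a["id"] for a in actors)

    def resolve(self, unit_idx):
        """Return the actor ID for a slot, or None if the slot is empty."""
        if 0 <= unit_idx < len(self.unit_ids):
            return self.unit_ids[unit_idx]
        return None


cache = UnitIndexCache()
cache.refresh([{"id": 57}, {"id": 12}])
first = cache.resolve(0)   # actor 12 occupies slot 0

cache.refresh([{"id": 57}])  # actor 12 no longer exists
second = cache.resolve(0)  # slot 0 now means actor 57
print(first, second)  # 12 57
```

When debugging, logging the resolved actor ID next to the sampled index makes this kind of shift visible.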
This section documents a deep analysis of the current architecture, focusing on environment interface issues and the gaps between the current baseline and an AlphaStar-style RTS RL agent.
- Complete closed loop: pythonnet bridge → state extraction → Gym env → PPO training all wired up end-to-end
- Rich action masking: engine-side feasibility checks for move/attack/deploy, queue-state-aware produce/build masks
- Multiple connection modes: local game, host-local lobby, remote server join
- Auto-placement: MCV auto-deploy and Done-item auto-place reduce action complexity
The current observation_type="vector" uses a fixed 100-slot bin for both friendly and enemy units (openra_env.py:1276-1304). This has several critical flaws:
- Fixed capacity truncation: RTS unit counts vary from 1 (start) to 50+ (late game). Fixed 100 slots waste space early and may overflow late.
- No relational information: Units are encoded independently — the network cannot learn which units are fighting which.
- No terrain awareness: The vector obs lacks terrain type, passability, and fog-of-war boundary information.
- Order-dependent encoding: Unit order in the fixed slots changes across ticks, making it hard for the network to track identities.
AlphaStar's approach: Entity-based attention (Transformer over variable-length entity list) + spatial grid encoder (ResNet over minimap), which naturally handles variable entity counts and preserves spatial/relational structure.
The current action space is MultiDiscrete([action_type, unit_idx, target_x, target_y, target_idx, unit_type]) — 6 independently-sampled categorical heads.
- target_x / target_y are independently factorized (openra_env.py:1167-1173): Valid x-coordinates can be paired with invalid y-coordinates, producing impossible targets. This should be a joint 2D spatial distribution.
- No hierarchical conditioning: All 6 heads are predicted from the same shared feature vector without autoregressive conditioning. The network cannot learn that target depends on unit_selection and action_type.
- unit_idx is a fixed categorical: 100-way classification over slots. Unit ordering changes across ticks, so index 5 means different units at different times.
- Missing action types: No harvest, repair, sell, guard, stop, or cancel_production actions.
AlphaStar's approach: Hierarchical autoregressive action space (action_type → unit_selection → target) with pointer networks for entity selection and 2D spatial logits maps for spatial targets.
PythonAPI.GetState() (PythonAPI.cs:459-651) rebuilds the entire RLState every call:
- Heavy reflection usage: Economy traits (PlayerResources, PowerManager), production queues, and building placement are all accessed via reflection. Each GetState() call repeats Assembly.GetType() scans and MethodInfo lookups.
- Full map scan for placeable areas: CollectPlaceableAreas() iterates every cell on the map (map.AllCells) and checks CanPlaceBuilding + IsCloseEnoughToBase for each Done item type. This is O(cells × building_types).
- No incremental updates: Even if only one tick passed and nothing changed, the entire state is rebuilt from scratch.
Target: Cache reflection calls in static fields, use incremental state updates, and optimize spatial queries.
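Before and after any of these optimizations, the per-call cost can be measured from the Python side with a small timing helper. This is a generic sketch: the helper name is hypothetical and the workload below is a stand-in, to be replaced with a real state-extraction call once the engine is loaded.

```python
import statistics
import time


def time_calls(fn, repeats=20):
    """Call fn() `repeats` times and return (median_ms, max_ms)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), max(samples)


# Stand-in workload; swap in the real state-extraction call when available.
median_ms, max_ms = time_calls(lambda: sum(range(10_000)))
print(f"median={median_ms:.3f} ms, max={max_ms:.3f} ms")
```

Reporting the median alongside the maximum helps separate steady-state cost from occasional spikes such as garbage collection.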
SubprocVecEnv now supports spawned headless workers, so the old single-environment hard limit is gone. This improves wall-clock sample throughput, but recent 4-env and 8-env experiments still did not produce a decisive improvement over the scripted teacher. The remaining issue is algorithmic and task-level: on-policy PPO still has high variance on long-horizon RTS development, especially when the reward is development-only and the teacher already covers the easy macro path.
Target: Keep hardening 8-64 parallel environments, but pair scale with stronger priors, teacher-KL, goal conditioning, and combat / win-loss rewards.
The current reward (openra_env.py:407-467) only incentivizes:
- Unit/building count increase
- Starting new production
- MCV deployment
Critically missing:
- Combat rewards: Damage dealt, enemy units killed
- Win/loss signal: The most important reward in any competitive game
- Resource efficiency: Ore collection rate, not just total
- Map control / exploration: Reconnaissance value
| Issue | Location | Impact |
|---|---|---|
| Reflection not cached | CollectProductionInfo(), CollectPlaceableAreas() | ~50-200ms per GetState() call |
| Full-cell-scan for build areas | CollectPlaceableAreas() map.AllCells loop | O(cells × types) per call |
| Building queue single-item guard | SendActions() IsBuildingQueueOccupied | Prevents advanced queue management |
| Deploy mask blocklist | openra_env.py _building_types hardcoded set | Prevents agent from learning base relocation |
| Dimension | Current OpenRA-Bot | AlphaStar |
|---|---|---|
| Entity encoding | Fixed 100-slot bin, order-dependent | Transformer over variable-length entity list |
| Spatial encoding | 128×128×10 semantic map (channels mostly empty) | ResNet over rich minimap + camera view |
| Action structure | Independent 6-way MultiDiscrete categoricals | Hierarchical autoregressive with pointer networks + 2D spatial heads |
| Unit selection | Fixed categorical over 100 slots | Attention-based pointer network over entity embeddings |
| Spatial target | Independent x, y categoricals (validity broken) | 2D logits map with spatial masking |
| Network core | MLP-based encoder + optional LSTM | Deep LSTM + Transformer + ResNet |
| Training parallelism | Headless SubprocVecEnv works at small scale; needs hardening and better sample efficiency | Thousands of parallel environments (TPU pods) |
| Opponents | Single built-in bot | Self-play league with PFSP, historical agents, exploiter agents |
| Reward | Development shaping only | Win/loss + game statistics |
| Pre-training | None (random init) | Supervised pre-training on human/expert replays |
| Curriculum | Fixed map | Progressive difficulty + map diversity |
See PLAN.md for the full implementation plan. Below is a high-level summary:
- Fix target_x / target_y independent factorization → joint 2D mask
- Complete unit_types mapping from CSV or game data
unit_typesmapping from CSV or game data - Add combat + win/loss rewards
- Establish performance baselines (random, rule-based, PPO, built-in bot)
- Entity-based observation builder (variable-length, feature-rich)
- Extended spatial observation with terrain, fog, threat channels
- Scalar observation (economy, power, game time)
- Dict-based gym.spaces.Dict observation space
- Backward compatible with observation_type="vector"
- Hierarchical autoregressive action structure
- Spatial action head (2D logits map, not factorized x/y)
- Pointer network for unit selection (attention-based)
- Joint 2D spatial action masks
- Expanded action types (harvest, repair, sell, guard, stop)
- Entity Transformer encoder (self-attention over variable-length entities)
- Spatial ResNet encoder (deeper, richer than current CNN)
- AlphaStarActorCritic: integrated encoder + LSTM + hierarchical head
- Maintain backward compatibility with old ActorCritic
- SubprocVecEnv for parallel environments (target: 8-64 envs)
- Self-play environment (two Python-controlled players)
- Elo rating evaluation framework
- 5x+ throughput improvement over single-env
- League training with PFSP (Prioritized Fictitious Self-Play)
- Modular reward system (win/loss, combat, economy, exploration)
- Curriculum learning (economy-only → static opponent → full combat → map generalization)
- Supervised pre-training via behavior cloning from built-in bots
- Cache reflection calls in static fields → ~10x speedup for GetState()
- Incremental state updates (GetStateDelta())
- Batch order feasibility checks
- Target: GetState() < 10ms (currently 50-200ms)
- Parallelism strategy: Python multiprocessing with spawn start method (each subprocess loads its own CLR) vs. single-process multi-instance (requires engine-side changes)
- Transformer scale: Start with 3 layers / 4 heads / 256-dim model. Scale up only after training stability is proven.
- Action space granularity: Start with 8-10 action types. Add more only when the agent masters the basics.
- Curriculum vs. end-to-end: Curriculum (Phase 5) is recommended for training stability, but the architecture should support end-to-end from day one.
- Self-play vs. fixed opponents: Start with built-in bots for baseline, then add self-play gradually. Full league training is the last piece.
MIT