The introduction of agentic AI planners into command-level exercises is no longer speculative. In 2025 several programs and experiments moved agentic workflows out of lab demos and into real exercise environments: the Defense Innovation Unit-backed Thunderforge effort (Scale AI, Anduril, Microsoft) that the Pentagon began fielding for theater planning; INDOPACOM’s early use of those tools in Pacific Sentry; and applied research work at Johns Hopkins APL that integrates large language models with physics-based simulation to automate parts of wargame setup and adjudication. These developments mean planners are now experimenting with systems that can decompose objectives, propose multi-step courses of action, and interact with simulation engines at speed.

What I mean by an agentic planner for exercises: a software stack that accepts a high-level goal, decomposes it into subtasks, generates or selects candidate actions, simulates outcomes (often via an external model or physics engine), evaluates those outcomes against objectives and constraints, and iterates without step-by-step human prompting until it outputs recommended courses of action or executes defined tool calls. Architecturally this is usually a modular loop of decomposition, actor/proposal, predictor/simulation, evaluator/scoring, and an orchestrator that transitions between planning horizons. The concept is widely discussed in recent literature on agentic workflows and in public summaries of agent architectures.
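To make that loop concrete, here is a minimal sketch in Python of the decompose, propose, simulate, evaluate, and orchestrate stages. Every name in it (CandidatePlan, orchestrate, the toy scoring rule) is an illustrative placeholder, not the interface of Thunderforge, AFSIM, or any other program mentioned here; in a real stack the propose step would call a language model and the simulate step an external engine.

```python
# Minimal sketch of the modular planning loop described above.
# All class and function names are illustrative placeholders, not the API
# of Thunderforge, AFSIM, or any specific program.
from dataclasses import dataclass

@dataclass
class CandidatePlan:
    actions: list[str]
    score: float = 0.0
    rationale: str = ""

def decompose(goal: str) -> list[str]:
    """Break a high-level goal into subtasks (here: trivially, by clause)."""
    return [part.strip() for part in goal.split(";") if part.strip()]

def propose(subtasks: list[str]) -> list[CandidatePlan]:
    """Actor/proposal step: generate candidate action sequences per subtask."""
    return [CandidatePlan(actions=[f"do: {t}"], rationale=f"addresses '{t}'")
            for t in subtasks]

def simulate(plan: CandidatePlan) -> dict:
    """Predictor step: stand-in for a call to an external simulation engine."""
    return {"objective_met": len(plan.actions) > 0, "risk": 0.1 * len(plan.actions)}

def evaluate(plan: CandidatePlan, outcome: dict, risk_cap: float) -> float:
    """Evaluator step: score outcomes against objectives and constraints."""
    return (1.0 if outcome["objective_met"] else 0.0) - max(0.0, outcome["risk"] - risk_cap)

def orchestrate(goal: str, risk_cap: float = 0.3, min_score: float = 0.9,
                max_iters: int = 5) -> list[CandidatePlan]:
    """Orchestrator: iterate propose -> simulate -> evaluate until plans clear the bar."""
    plans = propose(decompose(goal))
    for _ in range(max_iters):
        for plan in plans:
            plan.score = evaluate(plan, simulate(plan), risk_cap)
        if all(p.score >= min_score for p in plans):
            break  # hand surviving plans back to human reviewers
        # In a real stack, low-scoring plans would be revised or replaced here.
    return sorted(plans, key=lambda p: p.score, reverse=True)

if __name__ == "__main__":
    for p in orchestrate("secure port; restore comms; screen the strait"):
        print(f"{p.score:5.2f}  {p.actions}  ({p.rationale})")
```

The value of writing the loop down this way, even as a caricature, is that it exposes the seams (goal decomposition, the proposal source, the simulation call, the scoring rule) where the failure modes discussed below actually live.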

How exercises are using agentic planners today

  • Rapid scenario generation and staff augmentation. Johns Hopkins APL has demonstrated LLMs that help analysts define and run far more scenario variants than a human staff could prepare in the same time, and that translate plain-language intent into inputs for simulation engines such as AFSIM. That lowers the friction of repeated runs and parametric sweeps.
  • Decision-support in tabletop and campaign-level exercises. The Thunderforge initiative was explicitly chartered to help commanders synthesize sensor and intelligence feeds into draft operations orders, course-of-action assessments, and auditable recommendations. INDOPACOM has started combining reasoning models and agentic components inside Pacific Sentry to speed campaign analysis.
  • Government R&D and transition contracts. The Army and other research organizations have awarded task orders and indefinite-delivery/indefinite-quantity (IDIQ) contracts focused on “agentic” capabilities for planning, reasoning, and execution. For example, Sandtable received a multi-year IDIQ to advance operational agentic AI research for DEVCOM ARL, and the Pentagon’s Chief Digital and AI Office pursued commercial partnerships to develop agentic AI workflows. Those awards show acquisition tracks aimed at moving agentic planners into operational testbeds.

The upside in exercises is clear: speed, scale, and exploration. Agentic planners let staffs run hundreds of permutations quickly, surface nonobvious COAs, and free human time for higher-order judgment. They can augment red teams by acting as scripted or adaptive adversaries, or by playing staff roles so human players face richer interactive opponents. APL’s experiments show how an LLM-mediated loop can translate intent into simulation inputs and translate simulation outputs back into analyst-ready prose, improving iteration velocity for mission analysis.

But these advantages come with concrete, testable failure modes that exercises must be designed to reveal:

1) Hallucination and domain mismatch. LLM-driven decompositions and rationales can sound plausible while reflecting mistaken assumptions or stale knowledge. When those outputs feed a physics model or adjudicator, they can produce systematically biased COAs. The Thunderforge solicitation and public descriptions explicitly require audit trails for recommendations for this reason.
2) Simulation-adjudicator coupling errors. Translating high-level intent into deterministic simulation inputs is nontrivial. If the translator layer misunderstands force definitions, timing, or rules of engagement, the planner’s simulated outcomes will be garbage in, garbage out. APL’s work highlights the translation layer as a critical engineering boundary.
3) Adversarial and operational security vulnerabilities. Laboratory adversarial attacks on models do not always reflect battlefield realities. DARPA’s SABER program was stood up in 2025 precisely because the DoD lacks operational frameworks to red-team AI-enabled battlefield systems and to measure how adversaries could exploit them in the field. Exercises that do not include adversarial-in-the-loop testing will overestimate robustness.
4) Overfitting to exercise priors. Fast iteration can create a feedback loop in which planners optimize to the modeling assumptions of the agentic stack rather than to real-world uncertainty. That risk grows if exercises rely on a single modeling and LLM vendor stack.
5) Interoperability friction with legacy C2. Many headquarters run disparate planning tools, databases, and classified systems. Integrating an agentic planner requires robust data normalization, provenance tagging, and secure interfaces. Without that, the planner will either be starved for curated inputs or will ingest low-quality streams that erode trust.
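To make the translator boundary in failure mode 2 testable, here is a minimal sketch of pre-run sanity checks a staff might impose before any translated scenario reaches the simulation engine. The scenario dictionary format, field names, and tolerance bounds are hypothetical illustrations for this example, not AFSIM’s input schema or any program’s actual validation suite.

```python
# Minimal sketch of pre-run sanity checks on a translator layer's output
# (failure mode 2). The scenario dict format and bounds are hypothetical,
# not AFSIM's or any specific engine's input schema.
import unittest

def check_translation(scenario: dict) -> list[str]:
    """Return human-readable findings; an empty list means no obvious mismatch."""
    findings = []
    # Order-of-magnitude check on force counts against the stated order of battle.
    for unit in scenario.get("units", []):
        declared, translated = unit["declared_strength"], unit["translated_strength"]
        if translated <= 0 or not 0.5 <= translated / declared <= 2.0:
            findings.append(f"{unit['name']}: strength {translated} vs declared {declared}")
    # Time-step alignment: the sim step must divide the planning horizon evenly.
    if scenario["horizon_minutes"] % scenario["timestep_minutes"] != 0:
        findings.append("planning horizon is not a multiple of the simulation time step")
    # ROE check: no kinetic action may target a category the ROE excludes.
    excluded = set(scenario.get("roe_excluded_targets", []))
    for action in scenario.get("actions", []):
        if action["kinetic"] and action["target_category"] in excluded:
            findings.append(f"action '{action['id']}' targets an ROE-excluded category")
    return findings

class TranslatorChecks(unittest.TestCase):
    def test_flags_strength_and_roe_errors(self):
        scenario = {
            "units": [{"name": "TF-1", "declared_strength": 400, "translated_strength": 4000}],
            "horizon_minutes": 720, "timestep_minutes": 60,
            "roe_excluded_targets": ["civilian_infrastructure"],
            "actions": [{"id": "strike-01", "kinetic": True,
                         "target_category": "civilian_infrastructure"}],
        }
        self.assertEqual(len(check_translation(scenario)), 2)

if __name__ == "__main__":
    unittest.main()
```

The point is not these particular bounds but that checks like these run automatically and fail loudly before a campaign run consumes a mistranslated scenario.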

Design rules for safe and informative agentic-planner exercises

Below are practical prescriptions that emerged from cross-referencing public programs, procurement signals, and established exercise design principles. These are framed to be actionable for exercise directors, capability owners, and requirements professionals.

  • Define the hypothesis set up front. Treat each agentic planner run as an experiment with explicit, falsifiable hypotheses: e.g., “The planner improves mean time to produce 3 vetted COAs by 50 percent without increasing unacceptable risk to civilian infrastructure.” Capture metrics and success criteria before running scenarios.
  • Insist on auditable decision trails. Any agentic output used by staff must include provenance, model/version tags, confidence scores, key assumptions, and the simulation seeds that generated outcomes. The Thunderforge public descriptions explicitly highlight the need for auditable recommendations. Exercise injects should test whether audit artifacts are sufficient for meaningful human review; a minimal sketch of such a record follows this list.
  • Build adversarial testing into the exercise plan. Use red-team agents plus human red teams to probe data poisoning, spoofed sensor feeds, and adversarial prompts. DARPA’s SABER program is a template for how to structure operational counter-AI testing at scale. Ensure adversarial faults are realistic, not just canonical lab perturbations.
  • Hybridize roles: human-in-the-loop and human-on-the-loop. For high-consequence decisions the human must vet and authorize. Exercises should vary the human role to measure where automation provides cognitive lift and where it introduces risk. Track decision latency and error attribution across modes.
  • Validate translator/adjudicator stacks separately. Put the natural-language-to-simulation translator through a battery of unit tests that check for order-of-magnitude errors in force numbers, time-step misalignments, and ROE misapplication before letting it drive larger campaign runs (the sanity-check sketch after the failure-mode list above illustrates the idea). APL’s experiments expose this boundary as a common failure point.
  • Use multiple model families and data vintages in parallel. Running the same scenario across diverse agentic stacks and differing data snapshots reveals brittle recommendations. Avoid single-vendor single-model dependence when assessing doctrinal outcomes.
  • Measure the human learning curve. One rarely tracked metric is how quickly staff learn to interrogate agent outputs constructively. Faster iteration can produce illusory confidence. Include post-exercise cognitive interviews and red-team challenge sessions.
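As a concrete illustration of the audit-trail rule above, here is a minimal sketch of the kind of record that could travel with every agent-generated COA. The CoaAuditRecord structure and its field names are assumptions made for this example, not the schema of Thunderforge or any fielded system; the essential idea is that model identity, prompt digest, simulation seeds, key assumptions, and confidence are captured at generation time so a reviewer can replay and interrogate the run.

```python
# Minimal sketch of an audit record attached to every agent-generated COA,
# per the "auditable decision trails" rule above. Field names are illustrative,
# not a fielded system's schema.
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class CoaAuditRecord:
    coa_id: str
    model_family: str            # e.g., vendor/model name
    model_version: str           # exact build or checkpoint tag
    prompt_digest: str           # hash of the full prompt/context, not the raw text
    simulation_seeds: list[int]  # seeds needed to reproduce the simulated outcomes
    key_assumptions: list[str]
    confidence: float            # planner's self-reported score, 0..1
    generated_at: str

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

def build_record(coa_id: str, model_family: str, model_version: str,
                 prompt_text: str, seeds: list[int], assumptions: list[str],
                 confidence: float) -> CoaAuditRecord:
    """Assemble the record at generation time so reviewers can replay the run."""
    return CoaAuditRecord(
        coa_id=coa_id,
        model_family=model_family,
        model_version=model_version,
        prompt_digest=hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        simulation_seeds=seeds,
        key_assumptions=assumptions,
        confidence=confidence,
        generated_at=datetime.now(timezone.utc).isoformat(),
    )

if __name__ == "__main__":
    record = build_record("COA-2", "example-llm", "2025.06-rc1",
                          "full planning prompt ...", [42, 1337],
                          ["port remains usable", "48h warning time"], 0.72)
    print(record.to_json())
```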

Operational and policy implications

Exercises are the right place to stress-test agentic planners precisely because they let organizations iterate safely. But policymakers and program managers must avoid two pitfalls: treating successful exercise outputs as operational certification and rushing capability transition without commensurate validation. The Pentagon’s 2025 procurement signals show an appetite to field agentic workflows quickly via partnerships with large commercial model providers and defense-focused startups. That acquisition cadence demands stronger S&T and red-team pipelines so fielded systems are measurably robust.

Finally, there is a reputation and trust problem. If planners accept recommendations simply because they appear polished or save time, institutional learning can atrophy. Exercise designers must therefore require explicit traceability, independent validation, and adversarial proof points before doctrinal or resource decisions are influenced. DARPA’s SABER program and interagency tabletop work on AI security are early attempts to institutionalize that skepticism; exercises should be the crucible where skepticism is operationalized into test plans and acceptance gates.

Conclusion

Agentic planners are changing how exercises are run, enabling speed and scale that were previously impractical. That capability is valuable, but only if the community treats agentic planners like experimental instrumentation rather than infallible oracles. Design exercises to reveal the failure modes described above, require auditability and adversarial stress, and measure human trust and learning explicitly. When those engineering and organizational preconditions are met, agentic planners will be powerful tools to expand strategic imagination, not shortcuts that institutionalize blind spots.