The U.S. Department of Defense is now moving from concept to live experimentation with agentic AI systems — that is, autonomous or semi-autonomous AI agents that plan, decide, and act across multiple steps with minimal human micromanagement. These pilots are not a single program but a collection of exercises and initiatives across services and research agencies that test agentic workflows in aircraft, attritable unmanned swarms, and decision support for distributed systems. The experiments share common technical enablers and common unresolved governance questions.
What we have seen so far: high-performance airborne agents and system-level autonomy enablers. In air combat experimentation, AI agents have progressed from simulated dominance to physical-aircraft sorties and demonstrators. DARPA’s AlphaDogfight Trials produced an AI agent that beat an experienced F-16 pilot in simulation, establishing a capability baseline for reinforcement learning and closed-loop tactics. In follow-on work and service experiments, AFRL and industry moved those algorithms into hardware-in-the-loop testing and live sorties: AFRL reported an AI-trained agent flying the XQ-58A Valkyrie and demonstrating a multi-layer safety framework during a three-hour sortie, and the Air Force ran an AI-enabled engagement on the X-62A VISTA that was publicized as a dogfight demonstration. These events trace the operational test path from simulation to flight trials and show the heavy reliance on high-fidelity simulation and hardware-in-the-loop work before live demonstrations.
Parallel to aircraft experiments, the department’s Replicator effort reflects a push to scale attritable autonomous systems across air, sea, and surface domains. Replicator’s stated objective is to accelerate fielding of large numbers of inexpensive unmanned systems and the software enablers that let many platforms operate collaboratively and resiliently in contested electromagnetic and cyber environments. The program explicitly funds integrated autonomy enablers and seeks to combine commercial software with military platforms to produce rapid quantity and iterative improvement. Those integrated enablers are precisely the architectures where agentic systems — multi-agent planners, task allocators, and adaptive decision-making modules — will be evaluated.
Technical posture and test modalities. Across these pilots the common technical pattern is: train and validate agents at scale in simulation, then progress through hardware-in-the-loop, closed-range flight tests, and finally supervised live missions with layered human oversight. Programs are emphasizing traceability, logging, and run-time monitoring to detect and constrain unexpected agent behavior. AFRL’s reporting highlights deliberate investment in a layered safety architecture and long simulation training cycles, not one-off experiments, to move agentic algorithms toward operational realism. In short, the department is following a conservative ladder-of-assurance approach — but the ladder rungs are still immature relative to widespread operational use.
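To make the run-time monitoring piece concrete, the following is a minimal Python sketch of a monitor that checks each agent-proposed action against a bounded operational envelope before execution; the envelope limits, action fields, and function names are illustrative assumptions rather than any program's actual interface.

```python
# Minimal sketch of a run-time monitor that checks agent-proposed actions
# against a bounded operational envelope before execution. All field names
# and limits are illustrative assumptions, not drawn from any DoD program.
from dataclasses import dataclass

@dataclass
class Envelope:
    max_altitude_m: float = 15000.0      # ceiling for this test card
    max_range_km: float = 50.0           # geofence radius from range center
    allow_weapons_release: bool = False  # hard constraint in supervised tests

@dataclass
class ProposedAction:
    altitude_m: float
    range_km: float
    weapons_release: bool

def check_action(action: ProposedAction, env: Envelope) -> str:
    """Return 'execute', or 'escalate' to a human supervisor with reasons logged."""
    violations = []
    if action.altitude_m > env.max_altitude_m:
        violations.append(f"altitude {action.altitude_m} m exceeds ceiling {env.max_altitude_m} m")
    if action.range_km > env.max_range_km:
        violations.append(f"range {action.range_km} km outside geofence {env.max_range_km} km")
    if action.weapons_release and not env.allow_weapons_release:
        violations.append("weapons release not authorized in this envelope")
    if violations:
        # In a fielded system this would trigger a safe fallback behavior and
        # an operator alert; here we simply report the decision.
        print("ESCALATE:", "; ".join(violations))
        return "escalate"
    return "execute"

# Example: an in-envelope maneuver executes; an out-of-envelope one escalates.
env = Envelope()
print(check_action(ProposedAction(altitude_m=9000, range_km=20, weapons_release=False), env))
print(check_action(ProposedAction(altitude_m=9000, range_km=80, weapons_release=True), env))
```

The point of the pattern is that the constraint check sits outside the agent's own reasoning loop, so a planning error cannot silently take the system beyond its authorized envelope.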
Industry and ecosystem dynamics. The innovation curve is largely driven by nontraditional defense firms and nimble AI startups working in tandem with prime contractors. Market entrants that produced high-performance agents in simulation have been absorbed into larger development efforts or teamed with vehicle integrators to move algorithms into aircraft and testbeds. This blending of boutique AI research teams, autonomy middleware providers, and platform integrators is accelerating capability delivery but complicates lifecycle assurance, software supply chain visibility, and certification.
Policy and governance realities. The DoD’s existing AI governance posture provides relevant guardrails, including the five DoD AI ethical principles — responsible, equitable, traceable, reliable, and governable — which require that systems be auditable and have clear human responsibility and the ability to be disengaged if they behave unexpectedly. Those principles are now being stress-tested by agentic pilots because agents can plan, re-plan, and chain actions in ways that are less interpretable than earlier single-output ML models. The technical need to produce actionable traceability data and bounded operational envelopes maps directly onto the DoD principle set, but implementation is uneven and in many programs still experimental.
Key operational and technical risks.
- Opaque multi-step reasoning. Agentic pipelines often rely on intermediate planning states and emergent behaviors that are not easily reduced to simple logs. That reduces the value of traditional audit trails unless new telemetry and semantic-state representations are engineered into the agent architecture; a sketch of what such a decision record could contain follows this list.
- Composition risk in multi-vendor stacks. When agents, communications middleware, and vehicle controllers come from different vendors, emergent safety and security interactions can appear only in integration testing. Programs that attempt rapid on-ramp integration risk skipping integration test permutations essential to safety.
- Human-machine teaming and trust. Live demonstrations such as the X-62A engagement improve confidence but do not substitute for rigorous human factors research into how operators maintain situational awareness over agentic decisions or how responsibility is assigned in dynamic engagements.
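As a rough illustration of the first risk above, the sketch below shows what a semantic-state decision record emitted at each planning step might contain so that multi-step reasoning leaves an interpretable trail; the schema and field names are assumptions for illustration, not a fielded format.

```python
# Minimal sketch of a semantic-state decision record an agent could emit at
# each planning step, so that multi-step reasoning leaves an interpretable
# trail. The schema and field names are illustrative assumptions only.
import json
import time

def decision_record(intent: str, alternatives: list, selected: dict, reason: str) -> dict:
    """Capture what the agent intended, what it considered, what it chose, and why."""
    return {
        "timestamp": time.time(),
        "intent": intent,                          # mission-level goal being pursued
        "alternatives_considered": alternatives,   # compact summaries, not raw planner state
        "selected_action": selected,
        "reason": reason,                          # human-readable rationale for the choice
    }

record = decision_record(
    intent="maintain sensor coverage of assigned search area",
    alternatives=[
        {"action": "orbit_north", "score": 0.62},
        {"action": "orbit_south", "score": 0.81},
    ],
    selected={"action": "orbit_south", "score": 0.81},
    reason="southern orbit keeps both contacts in the sensor field of view",
)
print(json.dumps(record, indent=2))
```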
What good verification for agentic systems looks like. If the DoD wants agentic systems it can turn on for real missions at scale, verification and validation must evolve along three axes.
- Semantic-state telemetry and causal trace logs. Agents must expose human-interpretable summaries of intent, plan alternatives considered, and reasons for selected actions at every decision point. These summaries must be compact, signed, and stored in a tamper-evident audit trail; a minimal sketch of such a trail follows this list. This implements the DoD traceability and governability principles in a technical form.
- Integration-level scenario testing. Beyond unit testing, programs need combinatorial integration tests that exercise cross-vendor failure modes and contested comms, including adversarial inputs and degraded sensing; a sketch of combinatorial scenario generation also follows this list. Simulation must be expanded to include adversarial models of jamming, spoofing, and data-poisoning attacks.
- Operator-centered human factors evaluation. Formal metrics for operator trust, decision latency, and escalation fidelity are required. Live demos provide proof of concept; only structured human factors studies produce the operational rules for safe employment.
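On the first axis, the following sketch shows one way decision records could be made compact, signed, and tamper-evident by hash-chaining each entry to its predecessor; it uses an HMAC as a stand-in for a proper digital signature, and the key handling and record contents are simplified assumptions.

```python
# Minimal sketch of a tamper-evident, signed audit trail for decision records:
# each entry is hash-chained to its predecessor and authenticated with an HMAC.
# A real system would use asymmetric signatures and hardware-backed keys; the
# key, chaining scheme, and record contents here are illustrative assumptions.
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-for-real-use"

def append_entry(chain: list, record: dict) -> None:
    prev_hash = chain[-1]["entry_hash"] if chain else "0" * 64
    body = json.dumps({"prev_hash": prev_hash, "record": record}, sort_keys=True)
    entry_hash = hashlib.sha256(body.encode()).hexdigest()
    signature = hmac.new(SIGNING_KEY, entry_hash.encode(), hashlib.sha256).hexdigest()
    chain.append({"record": record, "prev_hash": prev_hash,
                  "entry_hash": entry_hash, "signature": signature})

def verify_chain(chain: list) -> bool:
    """Recompute hashes and signatures; any edited or reordered entry breaks the chain."""
    prev_hash = "0" * 64
    for entry in chain:
        body = json.dumps({"prev_hash": prev_hash, "record": entry["record"]}, sort_keys=True)
        if hashlib.sha256(body.encode()).hexdigest() != entry["entry_hash"]:
            return False
        expected_sig = hmac.new(SIGNING_KEY, entry["entry_hash"].encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected_sig, entry["signature"]):
            return False
        prev_hash = entry["entry_hash"]
    return True

chain = []
append_entry(chain, {"intent": "search area A", "selected_action": "orbit_south"})
append_entry(chain, {"intent": "track contact 1", "selected_action": "shadow_at_standoff"})
print(verify_chain(chain))                             # True
chain[0]["record"]["selected_action"] = "orbit_north"  # simulated tampering
print(verify_chain(chain))                             # False
```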
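On the second axis, combinatorial scenario generation can be sketched in a few lines; the factors and levels below are illustrative assumptions, not any program's test matrix, but they show how quickly cross-vendor and contested-environment permutations multiply.

```python
# Minimal sketch of combinatorial scenario generation for integration-level
# testing across a multi-vendor stack under contested conditions. The factor
# names and levels are illustrative assumptions only.
from itertools import product

factors = {
    "planner":        ["vendor_A_planner", "vendor_B_planner"],
    "comms":          ["nominal", "intermittent", "jammed"],
    "navigation":     ["gps_available", "gps_denied"],
    "sensor_quality": ["nominal", "degraded"],
    "adversarial":    ["none", "spoofed_track", "poisoned_map_data"],
}

scenarios = [dict(zip(factors, combo)) for combo in product(*factors.values())]
print(f"{len(scenarios)} integration scenarios")  # 2*3*2*2*3 = 72
print(scenarios[0])

# Even this toy matrix yields 72 runs; realistic multi-vendor stacks motivate
# pairwise or covering-array reduction plus targeted adversarial scenarios.
```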
Policy recommendations for program managers and policymakers. Based on public pilots to date, I recommend three near-term actions.
- Mandate semantic trace interfaces as a program requirement. Any contract for an agentic capability should include deliverables that make agent reasoning auditable at scale. This is not optional if traceability and governability are to be more than platitudes.
- Fund independent integration testbeds that focus on composition risk. DIU, services, and labs should accelerate dedicated integration ranges where multi-vendor stacks are intentionally combined and stress-tested under contested conditions before any fielding decision.
- Accelerate human factors studies in parallel with flight and ground tests. Commit a portion of pilot program budgets to prolonged human-in-the-loop experiments that measure trust, escalation behavior, and operator cognitive load in realistic operational timelines.
Conclusion. The DoD’s pilot work with agentic AI systems demonstrates both capability and immaturity. Aircraft and attritable swarm experiments show that agents can be trained and brought into live testing, and Replicator-style investments show the department is serious about scaling autonomy. The remaining gap is not technical curiosity; it is the durable assurance, integration architectures, and operator frameworks that make agentic systems auditable, safe, and controllable in warfighting conditions. If those gaps are not closed before scale fielding, we risk deploying systems that are tactically potent but operationally brittle. The pilots now underway are the instrument for closing those gaps, and the next 12 to 24 months of rigorous V&V, human factors study, and cross-vendor integration testing will determine whether agentic systems become reliable force multipliers or introduce new systemic risks.