Artificial intelligence is transforming the mechanics of targeting on the modern battlefield. That transformation brings measurable operational advantages and measurable ethical risks. At its core the risk I will focus on is bias: systematic, asymmetric error rates that lead an AI-enabled tool to misidentify or preferentially select certain people or objects as targets. In a kinetic context the consequence is not a misrouted ad or a bad loan decision. The consequence is bodily harm or death, and the attendant legal and political fallout.

The problem is not hypothetical. Work across civilian domains has repeatedly shown that AI can produce highly uneven error distributions across demographic groups. A well-known example found that commercial facial analysis systems misclassified darker-skinned women at far higher rates than lighter-skinned men, with error rates for the worst-performing subgroup reaching the mid-30 percent range. That intersectional skew is instructive because the underlying mechanisms that produced it in civilian face recognition are the very mechanisms that can produce biased targeting in military systems: unrepresentative training data, unchecked proxies, opaque model behavior, and inadequate subgroup testing.

Policy frameworks are trying to keep up. The U.S. Department of Defense adopted five AI ethical principles that explicitly include an equity principle ("Equitable"): the Department will take deliberate steps to minimize unintended bias in AI capabilities. The DoD also emphasizes traceability and governability as countermeasures against unpredictable or discriminatory outputs. Those principles are a solid starting point, but they are high-level. Translating them into engineering requirements and test procedures is the hard part.

Operational doctrine has evolved as well. The DoD updated its Directive on Autonomy in Weapon Systems (DoD Directive 3000.09) to reaffirm that autonomous and semi-autonomous systems must allow commanders and operators to exercise appropriate human judgment and to require rigorous testing and assurance before deployment. At the technical level the directive signals that developers and program managers will be held to lifecycle testing and review requirements before any system that influences use of force is fielded. Those governance steps reduce risk, but they do not eliminate it. Technical blind spots remain, especially where ML models rely on proxies such as thermal signatures, gait, posture, or other sensor-derived cues that may correlate with demographic or contextual features in biased ways.

At the international level the concern is shared. United Nations fora and the Convention on Certain Conventional Weapons processes have repeatedly highlighted both the humanitarian and discrimination risks posed by delegating targeting decisions to algorithms. The debate has coalesced around the notion of meaningful human control and around calls for strong testing and transparency measures for any system that could select or engage targets. Several multilateral instruments and state submissions have specifically flagged algorithmic bias and dataset skew as core risks to the principles of distinction and non-discrimination under international humanitarian law.

How does bias manifest in a targeting pipeline? Think in stages: sensor input, pre-processing, model inference, downstream decision logic, and human oversight. Bias can enter at any stage.

  • Sensor bias. Thermal imagers, electro-optical cameras, and synthetic aperture radar (SAR) sensors each have different sensitivities and failure modes. Environmental factors and sensor placement can cause systematic omissions or distortions of certain targets.
  • Training data bias. Models trained on data from one theatre or on curated imagery sets will often fail to generalize to others. If a dataset underrepresents women, children, or particular ethnic groups, the model will be less accurate on those groups.
  • Proxy bias. Designers often adopt proxies that simplify the classification problem: weight signatures, ammunition type, vehicle shape, certain clothing patterns. Those proxies may correlate with combatant status in historical data but not in new contexts, producing systematic over- or under-targeting of specific populations.
  • Feedback loops. Automated triage or prioritization that is left unchecked can amplify errors. If an algorithm flags a category of person as high risk and humans come to trust those flags without scrutiny, subsequent data will be biased toward that label and the model will self-reinforce incorrect correlations. A toy simulation of this dynamic follows the list.
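
The feedback-loop failure mode is worth making concrete. The simulation below is a sketch under invented assumptions: the group names, rates, and the crude "retraining" rule are all illustrative, not drawn from any real system. It shows how a small initial asymmetry between two groups widens once the model's own flags are accepted as ground truth and fed back as training labels.

    # Toy feedback-loop simulation: the model's flags are trusted, logged as
    # positives, and reused as training labels, so an initially small asymmetry
    # between two groups grows over retraining cycles. All numbers are
    # illustrative assumptions.
    import random

    random.seed(0)

    TRUE_RATE = 0.05                                  # identical true positive rate for both groups
    flag_rate = {"group_a": 0.06, "group_b": 0.09}    # model starts slightly biased against group_b

    for cycle in range(5):
        for group in flag_rate:
            # The model flags each of 10,000 encounters at its current flag rate.
            flags = [random.random() < flag_rate[group] for _ in range(10_000)]
            observed = sum(flags) / len(flags)
            # Operators trust the flags, so flagged cases become the next cycle's
            # labels; "retraining" pushes the flag rate slightly past what was observed.
            flag_rate[group] = observed * 1.10
        excess = {g: round(r - TRUE_RATE, 3) for g, r in flag_rate.items()}
        print(f"cycle {cycle}: flag-rate excess over true rate = {excess}")
    # group_b's excess grows faster than group_a's even though the true rate is
    # identical, because no independent ground truth ever re-enters the loop.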

From an engineering perspective these failure modes are tractable but not trivial. They require disciplined, evidence-driven controls that many civilian AI programs lack. Practical measures include disaggregated evaluation metrics, subgroup error reporting, Operational Design Domain (ODD) definitions that limit where a model may be applied, rigorous adversarial and edge-case testing, and continuous monitoring in the field with human-in-the-loop override thresholds.
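
As a sketch of what disaggregated evaluation and subgroup error reporting can look like, the snippet below computes per-subgroup error rates against hard deployment constraints. The column names, subgroups, and thresholds are my own illustrative assumptions, and the pandas-based layout is just one convenient way to organize the check.

    # Sketch of disaggregated error reporting: per-subgroup false positive and
    # false negative rates checked against hard deployment constraints.
    # Column names, subgroups, and thresholds are illustrative assumptions.
    import pandas as pd

    def subgroup_error_report(df, max_fnr=0.05, max_fpr=0.02):
        rows = []
        for subgroup, g in df.groupby("subgroup"):
            positives = g[g["label"] == 1]    # ground-truth valid military objectives
            negatives = g[g["label"] == 0]    # ground-truth protected persons or objects
            fnr = (positives["prediction"] == 0).mean() if len(positives) else float("nan")
            fpr = (negatives["prediction"] == 1).mean() if len(negatives) else float("nan")
            rows.append({
                "subgroup": subgroup,
                "n": len(g),
                "fnr": fnr,
                "fpr": fpr,
                "within_constraint": bool(fnr <= max_fnr and fpr <= max_fpr),
            })
        return pd.DataFrame(rows)

    # Toy evaluation results, one row per detection decision, disaggregated here
    # by operating context (demographic subgroups would be handled the same way).
    results = pd.DataFrame({
        "subgroup":   ["urban_night"] * 4 + ["open_terrain_day"] * 4,
        "label":      [1, 1, 0, 0, 1, 1, 0, 0],
        "prediction": [1, 0, 1, 0, 1, 1, 0, 0],
    })
    print(subgroup_error_report(results))
    # Any subgroup with within_constraint == False blocks deployment in that
    # context until the gap is understood and closed.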

Specifically I recommend the following operational controls for any targeting system that uses ML components:

  1. Disaggregated performance reporting. Test and publish model accuracy, false positive rates, and false negative rates across meaningful demographic and contextual subgroups. Use those metrics to set hard deployment constraints.
  2. Operational Design Domain gating. Define explicit environmental and contextual boundaries where the model is validated to operate. Outside those boundaries the system must fall back to human-only judgment.
  3. Confidence-calibrated workflows. Tie automation levels to calibrated model confidence and to mission criticality. Require human confirmation when the model’s calibrated confidence is below a verified threshold. Controls 2 and 3 are sketched in code after this list.
  4. Red-team and adversarial testing. Simulate deliberate attempts to confuse sensors and models. Evaluate for proxy-based misclassification and dataset drift.
  5. Audit logging and traceability. Maintain immutable logs from sensor ingest to final decision so that output can be reconstructed for legal review and post-incident analysis.
  6. Continuous dataset governance. Regularly refresh training and validation sets with geographically and temporally diverse data and require third-party audits of dataset provenance.
  7. Independent validation. Subject high-risk systems to independent testing by accredited agencies prior to deployment and periodically thereafter.
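
To make controls 2 and 3 concrete, here is a minimal sketch of ODD gating combined with a calibrated-confidence floor. The field names, thresholds, and automation-level labels are illustrative assumptions, not a description of any fielded system.

    # Sketch of Operational Design Domain (ODD) gating plus a calibrated
    # confidence floor. All fields, thresholds, and labels are illustrative.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class OperationalDesignDomain:
        """Contexts in which the model's subgroup error rates were validated."""
        allowed_sensors: frozenset
        allowed_environments: frozenset
        min_visibility_m: float

    @dataclass
    class Detection:
        sensor: str
        environment: str
        visibility_m: float
        calibrated_confidence: float   # assumed output of a held-out calibration step

    ODD = OperationalDesignDomain(
        allowed_sensors=frozenset({"eo_day", "thermal"}),
        allowed_environments=frozenset({"open_terrain", "desert"}),
        min_visibility_m=500.0,
    )
    CONFIDENCE_FLOOR = 0.90   # below this, the model may not act without confirmation

    def triage(d: Detection) -> str:
        """Return the maximum automation level permitted for this detection."""
        in_odd = (
            d.sensor in ODD.allowed_sensors
            and d.environment in ODD.allowed_environments
            and d.visibility_m >= ODD.min_visibility_m
        )
        if not in_odd:
            return "HUMAN_ONLY"            # outside validated domain: model output is advisory at most
        if d.calibrated_confidence < CONFIDENCE_FLOOR:
            return "HUMAN_CONFIRMATION_REQUIRED"
        return "HUMAN_ON_THE_LOOP"         # even in the best case, an operator can override

    print(triage(Detection("thermal", "urban", 800.0, 0.97)))        # HUMAN_ONLY (urban not validated)
    print(triage(Detection("eo_day", "desert", 800.0, 0.72)))        # HUMAN_CONFIRMATION_REQUIRED
    print(triage(Detection("eo_day", "open_terrain", 900.0, 0.95)))  # HUMAN_ON_THE_LOOP

The point is not the specific thresholds but the shape of the control: the automation level is a function of validated context and calibrated confidence, never of raw model output alone.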

These are not merely technical preferences. They map directly to the DoD principles of traceability, reliability, and governability. They also help meet international legal obligations by making it more feasible to demonstrate distinction and proportionality in attack decisions. But even with strong engineering controls the political and ethical questions remain. Who is accountable when an algorithmic recommendation leads to civilian harm? How should states weigh the operational benefit of faster, data-driven targeting against the elevated risk of disparate misidentification among vulnerable groups? Those questions will require both law and policy to evolve alongside technology.

The bottom line is straightforward. AI bias in targeting is not an abstract fairness problem. It is a failure mode with catastrophic human consequences. The existing policy scaffolding from the DoD and the growing international consensus on meaningful human control give us a governance foundation. To convert those principles into reliable, lawful outcomes requires engineers, commanders, ethicists, and legal reviewers to build repeatable, auditable processes that constrain automation rather than glorify it. If military organizations want the operational gains that AI promises they will need to invest in the less glamorous work of dataset governance, subgroup testing, and sustained independent auditability. Without those investments bias will remain the single largest ethical and operational vulnerability of AI-enabled targeting systems.