Reliable AI tools perform consistently and as expected. But that same consistency can be a danger if human consequences aren’t considered, writes Andrew Bloom, an AI ethicist and adviser. Safety has to be the foundational consideration when enterprise leaders are evaluating AI systems.
As AI systems become more deeply embedded in institutions, boardrooms and daily operations, the language we use to evaluate them matters enormously. Two terms appear constantly in governance discussions, vendor claims and regulatory guidance: reliable and safe. The failure to distinguish between these two characteristics is producing real harm.
Reliability is whether a system performs consistently. Safety is whether that performance stays within ethical and operational limits. An AI system can be highly reliable and profoundly unsafe at the same time.
For enterprise leaders, understanding this distinction is the difference between choosing a tool that performs well in a test environment and choosing a system that can be trusted when real people are affected by its decisions.
Reliability vs. safety
A reliable system performs predictably. It delivers accurate results, maintains stability and operates as expected across a range of conditions. Reliability is measured by accuracy rates, uptime, output consistency and reproducibility. When a system is reliable, it earns confidence because it appears dependable.
But reliability answers only one question: Does the system work? It does not answer the more consequential question: What happens when it works in ways that produce harm?
This is not a hypothetical concern. Reliable systems can and do operate precisely as designed while generating outcomes that are biased, discriminatory or dangerous. Their reliability actually makes the problem worse. A system that consistently produces harmful outputs could be performing exactly as intended.
When we praise reliability without asking what it is reliably doing, we risk confusing consistency with responsibility.
Safety asks not whether a system performs but whether its performance stays within acceptable limits. A safe system prevents harmful outcomes, protects privacy, reduces bias and keeps actions aligned with ethical and legal standards. It can limit or halt its own operation when risk becomes too great.
Safety is about defining what outputs should never occur, regardless of how efficiently or consistently the system produces them.
The reliability and safety gap is causing harm
The gap between reliability and safety has a documented history that is still being written.
The clearest recent evidence comes from hiring. In 2023, a class-action lawsuit was filed against Workday, alleging that its applicant screening platform engaged in a pattern of discrimination based on race, age and disability. The plaintiff, Derek Mobley, a Black man over 40 with a disability, reported being rejected by hundreds of employers using Workday’s system, often receiving automated rejection notices in the middle of the night with no human having ever reviewed his applications. In May 2025, a federal court conditionally certified the age-discrimination claims as a nationwide collective action, refusing to grant vendors a special exemption from anti-discrimination law simply because the deciding factor was an algorithm rather than a person. The court’s reasoning was pointed: Removing the human from the loop does not remove the legal or ethical obligation. As alleged, the system was reliable, screening candidates consistently and efficiently, but it could not distinguish between screening and discrimination.
A parallel case filed in March 2025 sharpens the point. The ACLU of Colorado filed a complaint against Intuit and its vendor HireVue after an Indigenous and deaf applicant was rejected in part because the video-analysis platform flagged deficiencies in her “active listening” skills, according to the complaint. The system had evaluated a deaf person’s attentiveness through audio-visual cues, a situation it was never designed to accommodate. Its reliable output was functionally absurd and potentially illegal. The lesson is the same one the field keeps learning and keeps forgetting: What the system measures and what the system should measure are not always the same thing. When they diverge, reliability ensures the harm occurs at scale.
What good implementation looks like
What does it actually look like to build systems where safety governs reliability?
The answer requires moving ethics from aspiration to infrastructure. The NIST AI Risk Management Framework identifies seven characteristics of trustworthy systems, and the ordering is deliberate. Valid and reliable come first, followed by safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair with harmful bias managed. That framework treats reliability as a necessary but insufficient condition.
In practice, building systems that are both safe and reliable requires at least four structural commitments that go beyond technical performance metrics and consider human consequences.
The first consideration is treating problem framing as a safety question. Both the Workday and HireVue purported failures likely originated not in the algorithm itself but in how the problem was framed before development began. Workday appears to have chosen historical hiring patterns as the training signal. According to the complaint’s description, HireVue chose audio-visual cues as proxies for professional competence. In each case, the framing seems to have embedded inequity before a single line of code was written. Safe system design requires asking three questions before training begins: What are we actually trying to measure? What does the training data reflect? And what populations will be affected if the data is skewed?
The second consideration is outcome monitoring across demographic groups. A system that performs well on aggregate metrics can conceal underperformance for specific populations. Responsible implementation requires disaggregated testing, meaning breaking performance data down by race, gender, income, geography and other relevant factors before deployment and continuously afterward. Bias in tools must be surfaced.
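As a rough illustration of disaggregated testing, the Python sketch below computes selection rates per group and flags any group whose rate falls well below the best-performing group's. The 0.8 threshold borrows the four-fifths rule of thumb from U.S. employment guidelines as one possible screen; it is an assumption made for the example, not a method described in the cases above. The group labels, function names and data are all hypothetical.

```python
from collections import defaultdict


def selection_rates(records):
    """Return the share of applicants advanced, per group.

    `records` is an iterable of (group_label, was_advanced) pairs.
    """
    totals = defaultdict(int)
    advanced = defaultdict(int)
    for group, was_advanced in records:
        totals[group] += 1
        if was_advanced:
            advanced[group] += 1
    return {group: advanced[group] / totals[group] for group in totals}


def impact_ratios(rates):
    """Divide each group's rate by the highest group rate."""
    best = max(rates.values())
    if best == 0:
        # Nobody was advanced in any group, so there is no disparity to report.
        return {group: 1.0 for group in rates}
    return {group: rate / best for group, rate in rates.items()}


def flagged_groups(records, threshold=0.8):
    """List groups whose impact ratio falls below the (assumed) threshold."""
    ratios = impact_ratios(selection_rates(records))
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


# Hypothetical screening outcomes: 45% advance overall, which looks
# unremarkable until the results are split by group.
records = (
    [("A", True)] * 60 + [("A", False)] * 40
    + [("B", True)] * 30 + [("B", False)] * 70
)

print(selection_rates(records))
print(flagged_groups(records))
```

In this made-up data set the aggregate pass rate hides the fact that group B advances at half the rate of group A, which is the pattern disaggregated monitoring is meant to surface. A flag of this kind is a prompt for investigation, not a legal finding.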
The third is human oversight at consequential decision points. Workday’s recruiting tool and HireVue’s interview platform seem not to have required human review before generating an outcome. Consequential decisions, such as who advances in a hiring process, require meaningful human judgment, not just human awareness. Oversight must not be mere ratification of a result.
The fourth is the willingness to stop. Amazon scrapped its recruiting tool years ago rather than deploy a system it could not trust. That decision cost resources and time. It also prevented systematic discrimination against an unknown number of job applicants. Organizational culture must support the ability to halt deployment when safety conditions are not met, even if business pressure pushes in the opposite direction.
Safety as foundation
For these systems to be trusted at scale, performance must be built on a prior foundation of safety.
This begins with foundational design questions that too few organizations ask at the outset: What actions are permissible? What outcomes are unacceptable, regardless of performance? Where must human oversight be required? Under what conditions must the system stop? When these constraints are embedded early, they can be operationalized through guardrails that prevent the system from bypassing them in real time. Only then does reliability become meaningful because the system is consistently performing within boundaries that have already been defined.
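One way to picture those four design questions as a guardrail is the minimal Python sketch below. Everything in it is hypothetical: the `Action` names, the thresholds and the `gate` function are illustrative assumptions, not any vendor's design. The "impact ratio" here means the lowest group selection rate divided by the highest, as reported by whatever monitoring the organization runs.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Permissible actions. There is deliberately no automated REJECT."""
    ADVANCE = "advance"
    HUMAN_REVIEW = "human_review"
    HALT = "halt"


@dataclass(frozen=True)
class Guardrails:
    # Both thresholds are illustrative assumptions.
    advance_threshold: float = 0.7   # model score needed to advance automatically
    min_impact_ratio: float = 0.8    # below this, the system must stop


def gate(score, lowest_impact_ratio, rails=Guardrails()):
    """Decide what the system may do with one model score.

    The stop condition is checked first, so a high score cannot bypass it.
    """
    if lowest_impact_ratio < rails.min_impact_ratio:
        return Action.HALT           # condition under which the system stops
    if score >= rails.advance_threshold:
        return Action.ADVANCE        # permissible automated action
    return Action.HUMAN_REVIEW       # adverse outcomes require a person


print(gate(0.9, 0.95))   # Action.ADVANCE
print(gate(0.2, 0.95))   # Action.HUMAN_REVIEW
print(gate(0.9, 0.50))   # Action.HALT
```

The design choice worth noting is that the unacceptable outcome, an automated rejection with no human involved, is not an option the code can return, and the stop condition is evaluated before any score is considered.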
As these systems take on more consequential roles, trust will have to be earned through the assurance that systems are constrained, accountable and aligned with human well-being.
The hiring tools now at the center of federal litigation seemed reliable. They don’t appear safe. In each case, the institutions that signed the contracts may discover they cannot trust these tools.
Leaders considering these tools must ask not just whether the system can produce an answer but whether the organization has built the moral architecture to determine which answers should never be produced at all.


Andrew Bloom is an AI ethicist, governance adviser, and author, and the founder of Bloom Ethical AI Consulting. He advises executives, boards and public sector leaders on responsible AI deployment and is the author of “The Ten Commandments of Ethical AI and Technology and Theology.” Bloom serves as rabbi of Congregation Ahavath Sholom in Fort Worth. 






