AI: Reliable or Reliably Unsafe?

Reliable AI tools perform consistently to expectations. But that same consistency can be a danger if human consequences aren’t considered, writes Andrew Bloom, an AI ethicist and adviser. Safety has to be the foundation when enterprise leaders evaluate AI systems.

As AI systems become more deeply embedded in institutions, boardrooms and daily operations, the language we use to evaluate them matters enormously. Two terms appear constantly in governance discussions, vendor claims and regulatory guidance — reliable and safe. The failure to distinguish between these two characteristics is producing real harm.

Reliability is whether a system performs consistently. Safety is whether that performance stays within ethical and operational limits. An AI system can be highly reliable and profoundly unsafe at the same time.

For enterprise leaders, understanding this distinction is the difference between choosing a tool that performs well in a test environment and a system that can be trusted when real people are affected by its decisions.

Reliability vs. safety

A reliable system performs predictably. It delivers accurate results, maintains stability and operates as expected across a range of conditions. Reliability is measured by accuracy rates, uptime, output consistency and reproducibility. When a system is reliable, it earns confidence because it appears dependable.

But reliability answers only one question: Does the system work? It does not answer the more consequential question: What happens when it works in ways that produce harm?

This is not a hypothetical concern. Reliable systems can and do operate precisely as designed while generating outcomes that are biased, discriminatory or dangerous. Their reliability actually makes the problem worse. A system that consistently produces harmful outputs could be performing exactly as intended.

When we praise reliability without asking what it is reliably doing, we risk confusing consistency with responsibility.

Safety focuses not on whether a system performs but on whether its performance stays within acceptable limits. A safe system prevents harmful outcomes, protects privacy, reduces bias and keeps actions aligned with ethical and legal standards. It can limit or halt its own operation when risk becomes too great.

Safety is about defining what outputs should never occur, regardless of how efficiently or consistently the system produces them.

The reliability and safety gap is causing harm

The gap between reliability and safety has a documented history that is still being written.

The clearest recent evidence comes from hiring. In 2023, a class-action lawsuit was filed against Workday, alleging that its applicant screening platform engaged in a pattern of discrimination based on race, age and disability. The plaintiff, Derek Mobley, a Black man over 40 with a disability, reported being rejected by hundreds of employers using Workday’s system, often receiving automated rejection notices in the middle of the night with no human having ever reviewed his applications. In May 2025, a federal court certified the case as a nationwide collective action, refusing to grant vendors a special exemption from anti-discrimination law simply because the deciding factor was an algorithm rather than a person. The court’s reasoning was pointed: Removing the human from the loop does not remove the legal or ethical obligation. The system was reliable, screening candidates consistently and efficiently, but it could not distinguish between screening and discrimination.

A parallel case filed in March 2025 sharpens the point. The ACLU of Colorado filed a complaint against Intuit and its vendor HireVue after an Indigenous and deaf applicant was rejected in part because the video-analysis platform flagged deficiencies in her “active listening” skills, according to the complaint. The system had evaluated a deaf person’s attentiveness through audio-visual cues that it was never designed to adapt for a deaf applicant. Its reliable output was functionally absurd and potentially illegal. The lesson is the same one the field keeps learning and keeps forgetting: What the system measures and what the system should measure are not always the same thing. When they diverge, reliability ensures the harm occurs at scale.

What good implementation looks like

What does it actually look like to build systems where safety governs reliability?

The answer requires moving ethics from aspiration to infrastructure. NIST’s AI Risk Management Framework identifies seven characteristics of trustworthy systems, and the ordering is deliberate. Valid and reliable come first, followed by safe, secure, accountable, explainable, privacy-enhanced and fair with harmful bias managed. The framework treats reliability as a necessary but insufficient condition for trustworthiness.

In practice, building systems that are both safe and reliable requires at least four structural commitments that go beyond technical performance metrics and consider human consequences.

The first is treating problem framing as a safety question. Both the Workday and HireVue failures, as alleged, likely originated not in the algorithm itself but in how the problem was framed before development began. Workday appears to have chosen historical hiring patterns as the training signal. HireVue, by the complaint’s description, chose audio-visual cues as proxies for professional competence. In each case, the framing seems to have embedded inequity before a single line of code was written. Safe system design requires asking three questions before training begins: What are we actually trying to measure? What does the training data reflect? And what populations will be affected if the data is skewed?

The second consideration is outcome monitoring across demographic groups. A system that performs well on aggregate metrics can conceal underperformance for specific populations. Responsible implementation requires disaggregated testing, meaning breaking performance data down by race, gender, income, geography and other relevant factors before deployment and continuously afterward. Bias in tools must be surfaced.
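A minimal sketch of disaggregated testing, in Python, may make the idea concrete. It assumes screening outcomes are available as pairs of group label and result. The group names, the sample counts and the 0.8 ratio threshold are illustrative assumptions, not figures from either lawsuit and not a statement of any legal standard.

```python
from collections import defaultdict


def selection_rates(records):
    """Return the share of candidates advanced, per group.

    `records` is an iterable of (group, selected) pairs, where `selected`
    is True when the system advanced the candidate.
    """
    totals = defaultdict(int)
    advanced = defaultdict(int)
    for group, selected in records:
        totals[group] += 1
        if selected:
            advanced[group] += 1
    return {group: advanced[group] / totals[group] for group in totals}


def flag_disparities(rates, min_ratio=0.8):
    """List groups whose rate falls below `min_ratio` of the highest rate.

    The 0.8 default is an illustrative assumption chosen for this sketch.
    """
    best = max(rates.values())
    if best == 0:
        return []  # nobody was advanced, so there is no ratio to compare
    return sorted(g for g, r in rates.items() if r / best < min_ratio)


# Hypothetical outcomes: the aggregate rate hides a gap between groups.
records = (
    [("group_a", True)] * 60 + [("group_a", False)] * 40
    + [("group_b", True)] * 15 + [("group_b", False)] * 35
)

overall = sum(1 for _, selected in records if selected) / len(records)
rates = selection_rates(records)
flagged = flag_disparities(rates)

print(f"overall rate: {overall:.2f}")
print({group: round(rate, 2) for group, rate in rates.items()})
print("flagged:", flagged)
```

In this made-up data, half of all candidates advance, which looks unremarkable in aggregate. Only the per-group breakdown shows that one group advances at half the rate of the other. Running the same check on a schedule after deployment, not just once before launch, is what makes the monitoring continuous.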

The third is human oversight at consequential decision points. Workday’s recruiting tool and HireVue’s interview platform seem not to have required human review before generating an outcome. Consequential decisions, such as who advances in a hiring process, require meaningful human judgment, not just human awareness. Oversight must not be mere ratification of a result.
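One way to picture the difference between judgment and ratification is a review gate in which the model can only recommend. The Python sketch below is hypothetical: the `Recommendation` and `Decision` types and the `finalize` function are invented for illustration and do not describe how Workday, HireVue or any other vendor works. The assumption is that no outcome becomes final until a named reviewer records a verdict and a reason of their own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    candidate_id: str
    advance: bool   # the model's suggestion, never the final word
    evidence: str   # what the reviewer is shown


@dataclass(frozen=True)
class Decision:
    candidate_id: str
    advance: bool
    reviewer: str
    reason: str
    overrode_model: bool


def finalize(rec: Recommendation, reviewer: str, advance: bool, reason: str) -> Decision:
    """Turn a recommendation into a decision only with a reviewer's own verdict."""
    if not reviewer.strip():
        raise ValueError("a named reviewer is required")
    if not reason.strip():
        # An empty reason would make the review a rubber stamp.
        raise ValueError("the reviewer must record a reason")
    return Decision(
        candidate_id=rec.candidate_id,
        advance=advance,
        reviewer=reviewer.strip(),
        reason=reason.strip(),
        overrode_model=(advance != rec.advance),
    )


# Hypothetical usage: the reviewer disagrees with the model and says why.
rec = Recommendation("c-102", advance=False, evidence="screening score below cutoff")
decision = finalize(
    rec,
    reviewer="j.doe",
    advance=True,
    reason="Score relied on a proxy that does not apply to this applicant.",
)
print(decision.overrode_model)
```

Recording whether the reviewer overrode the model gives the organization something to audit. If overrides never happen, that is a hint the review step has drifted into ratification.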

The fourth is the willingness to stop. Amazon scrapped its recruiting tool years ago rather than deploy a system it could not trust. That decision cost resources and time. It also prevented systematic discrimination against an unknown number of job applicants. Organizational culture must support the ability to halt deployment when safety conditions are not met, even if business pressure pushes in the opposite direction.

Safety as foundation

For these systems to be trusted at scale, performance must be built on a prior foundation of safety.

This begins with foundational design questions that too few organizations ask at the outset: What actions are permissible? What outcomes are unacceptable, regardless of performance? Where must human oversight be required? Under what conditions must the system stop? When these constraints are embedded early, they can be operationalized through guardrails that prevent the system from bypassing them in real time. Only then does reliability become meaningful because the system is consistently performing within boundaries that have already been defined.
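As a rough illustration of what embedding those constraints can mean, the Python sketch below writes the answers to the design questions down as a policy and checks every requested action against it. Everything here is an assumption made for the example: the `Policy` fields, the action names and the disparity limit are hypothetical, and a real system would need far richer rules.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    NEEDS_HUMAN = "needs_human"
    BLOCK = "block"
    HALT = "halt"


@dataclass(frozen=True)
class Policy:
    permitted_actions: frozenset     # What actions are permissible?
    human_review_actions: frozenset  # Where must human oversight be required?
    max_disparity: float             # Under what conditions must the system stop?


def evaluate(policy: Policy, action: str, observed_disparity: float) -> Verdict:
    """Check a requested action against constraints defined before deployment."""
    # The stop condition is checked first so nothing proceeds once it trips.
    if observed_disparity > policy.max_disparity:
        return Verdict.HALT
    # Anything not explicitly permitted is refused, however well it performs.
    if action not in policy.permitted_actions:
        return Verdict.BLOCK
    if action in policy.human_review_actions:
        return Verdict.NEEDS_HUMAN
    return Verdict.ALLOW


# Hypothetical policy for a screening tool.
policy = Policy(
    permitted_actions=frozenset({"rank", "schedule_interview", "reject"}),
    human_review_actions=frozenset({"reject"}),
    max_disparity=0.2,
)

print(evaluate(policy, "schedule_interview", 0.05))
print(evaluate(policy, "reject", 0.05))
print(evaluate(policy, "auto_hire", 0.05))
print(evaluate(policy, "rank", 0.35))
```

The design choice worth noticing is the default. An action that nobody thought to list is blocked rather than allowed, and a breached stop condition overrides everything else. The system's consistency then operates inside boundaries that were set in advance.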

As these systems take on more consequential roles, trust will have to be earned through the assurance that systems are constrained, accountable and aligned with human well-being.

The hiring tools now at the center of federal litigation seemed reliable. They don’t appear safe. In each case, the institutions that signed the contracts may discover they cannot trust these tools.

Leaders considering these tools must ask not just whether the system can produce an answer but whether the organization has built the moral architecture to determine which answers should never be produced at all.

Tags: Artificial Intelligence (AI), Corporate Culture

