As compliance teams experiment with AI for everything from risk assessments to policy interpretation, a practical question emerges: Which tasks can be automated reliably, and which still require human judgment? Steph Holmes, director of compliance and ethics strategy at EQS Group, dives into her organization’s research on six AI models, finding that the technology excels at rule-driven work but struggles in the gray zone where data meets intent and culture intersects with language. The findings suggest oversight should be strategic, not universal, but there’s no doubt the loop isn’t complete without humans.
When people talk about responsible AI, the concept of “human in the loop” tends to surface quickly, and for good reason. It sounds reassuring, almost self-evident. But as compliance teams start experimenting with AI, a practical question emerges: Where in the loop should humans actually be?
To move beyond slogans, we needed evidence. To that end, EQS Group tested six AI models across 120 real-world compliance tasks. The goal was to understand how these systems perform in the same scenarios compliance professionals face every day — from risk assessments and policy interpretation to drafting complex executive briefings.
The benchmark findings reveal the models have made remarkable progress but still have their limits. For example: On clear, rule-based work, such as classifying policies or ranking risks, AI models routinely achieved accuracy above 90%. But in more nuanced, judgment-based scenarios — interpreting complex data or assessing proportionality, for example — the results diverged dramatically. In one category, data analysis, the top-performing model scored 88%, while the weakest reached only 28%.
For compliance professionals, this isn’t just a statistic. It means two systems, given the same disclosure or risk scenario, could reach entirely different conclusions. The lesson to take away, then, is not that AI in general is unreliable but that reliability depends on the right context and tools.
The research provides a rare, data-driven map of where the technology already supports sound decisions — and where human insight remains the deciding factor.
The ‘messy middle’ of compliance — where AI still struggles (and excels)
Much of compliance work happens in what might be called the “messy middle” — the gray zone between clear rules and ethical nuance. It’s where data meets intent and where culture, language and judgment intersect. It’s the difference between a mistake in an expense report and a deliberate scheme, or between an unclear policy and a genuine compliance gap.
Here, the results draw a clear boundary. The latest AI models like Gemini 2.5 Pro and GPT-5 excel at structured, rule-driven tasks — matching data sets, ranking risks, extracting information — with remarkable precision. But when ambiguity rises, so does divergence. Across the 120 tasks tested, the more open-ended the problem, the wider the spread between best and worst AI model performance.
This pattern doesn’t diminish the technology’s general value; AI already performs reliably where the parameters are explicit. The challenge is that compliance rarely deals in perfect clarity. Policies evolve, regulatory interpretations vary, and ethical questions require more than binary answers. That is precisely where human interpretation (context, empathy, proportionality) remains irreplaceable. The report makes this visible: AI can lighten the workload, but only people can decide what the results mean.
Redefining what ‘good enough’ means in compliance
While our research highlights clear boundaries to what AI can currently do, it also shows just how far the technology has advanced. The strongest models — Gemini 2.5 Pro and GPT-5 — delivered overall accuracy above 85%, a level likely not possible just a year ago. Equally noteworthy is their reliability: Across structured output and open-ended tasks, only three clear hallucinations were identified, which translates to a rate of just 0.71%.
However, the data also highlights a key distinction often overlooked in the rush toward generative tools: Broad chatbots may feel accessible, but in compliance, accessibility without accuracy creates risk. Domain-specific AI — grounded in structure, governance and context — provides the reliability and accountability that true compliance work demands.
In compliance, reliability has a very specific meaning: not perfection but predictability within clearly defined risk thresholds. A 10-point margin of error for accuracy might be tolerable when reviewing employee sentiment analysis; it is unacceptable when classifying a whistleblower retaliation case. The difference lies not in the technology itself but in where it is applied and how results are reviewed.
This is also where human judgment remains indispensable. AI thrives when the question and the expected output are clearly defined but still struggles when meaning must be inferred. Oversight, therefore, should not be universal (checking every output) but strategic. And this is crucial: Oversight must sit at the decision points where errors carry the highest ethical or legal weight. In practice, that means establishing checkpoints where judgment and proportionality matter most: final case closures, sanctions-screening exceptions or high-value third-party approvals. AI can handle the repetition, while humans must handle the risk, including knowing when AI itself introduces new risk.
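To make "strategic, not universal" concrete, here is a minimal sketch of how a team might encode that kind of routing: only the highest-stakes decision points, plus anything the model is unsure about, go to a human reviewer. The checkpoint names, the AiOutput structure and the confidence threshold are illustrative assumptions, not values from the EQS benchmark.

```python
# Illustrative sketch only: send high-stakes decision points (and low-confidence
# outputs) to a human reviewer; let routine work flow through automatically.
# Checkpoint names and the threshold are assumptions, not EQS benchmark values.
from dataclasses import dataclass

HUMAN_REVIEW_CHECKPOINTS = {
    "final_case_closure",
    "sanctions_screening_exception",
    "high_value_third_party_approval",
}

@dataclass
class AiOutput:
    decision_point: str  # e.g. "policy_classification" or "final_case_closure"
    confidence: float    # model-reported confidence between 0.0 and 1.0

def requires_human_sign_off(output: AiOutput, confidence_floor: float = 0.85) -> bool:
    """Escalate when the decision point carries high ethical or legal weight,
    or when the model itself is uncertain."""
    return output.decision_point in HUMAN_REVIEW_CHECKPOINTS or output.confidence < confidence_floor

print(requires_human_sign_off(AiOutput("policy_classification", 0.93)))          # False
print(requires_human_sign_off(AiOutput("sanctions_screening_exception", 0.97)))  # True
```

The value here is the design choice, not the code: review effort is concentrated where errors are costliest rather than spread thinly across every output.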
From gatekeepers to collaborators
The idea of compliance teams working collaboratively with AI rather than supervising it from above may sound futuristic to some of the more skeptical or risk-averse professionals, but the data hints that this future is closer than we might think. Several of the tested workflows — such as a multi-step conflict-of-interest review process — already chain together tasks that models can complete reliably: categorizing a disclosure, routing it for review, even assessing the potential risk exposure.
Yet the same sequence also shows us the limits. When the task required identifying what follow-up information was needed to assess the risk more fully or recommending corrective action, performance dropped, and human validation remained essential. For now, that’s the right balance: AI can prepare a decision, not make it.
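A similar sketch shows the "AI prepares, human decides" split in a conflict-of-interest review. The case-file fields and the stubbed model calls are hypothetical; they simply mark which steps the benchmark found models handle reliably and which remain with the reviewer.

```python
# Illustrative sketch of "AI can prepare a decision, not make it."
# Model calls are stubbed; field names and values are hypothetical.
from dataclasses import dataclass

@dataclass
class CoiCase:
    disclosure: str
    category: str = ""              # AI: classify the disclosure
    reviewer_queue: str = ""        # AI: route it for review
    risk_exposure: str = ""         # AI: assess potential exposure
    draft_recommendation: str = ""  # AI: draft, never finalize
    final_decision: str = ""        # human reviewer only

def ai_prepare(case: CoiCase) -> CoiCase:
    """Chain the structured steps models completed reliably (stubbed here)."""
    case.category = "vendor relationship"           # would be a model call
    case.reviewer_queue = "procurement_compliance"  # would be a model call
    case.risk_exposure = "moderate"                 # would be a model call
    case.draft_recommendation = "Request updated disclosure; restrict approval authority."
    return case

def human_decide(case: CoiCase, decision: str) -> CoiCase:
    """Follow-up questions, corrective action and closure stay with the human."""
    case.final_decision = decision
    return case

case = ai_prepare(CoiCase("Employee's spouse owns shares in a bidding vendor."))
case = human_decide(case, "Recuse employee from vendor selection; document mitigation steps.")
```

The structured steps are automated, while the judgment call that closes the case is recorded as an explicit human action.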
This gives us a clear trajectory. As model reasoning improves, oversight will need to evolve from gatekeeping to collaboration: humans shaping, training and auditing systems rather than merely reviewing the outputs. That shift won’t happen overnight, especially since most compliance teams are still in the experimentation stage with AI. But if the past year’s pace of model improvement is any indication (gains of 17 percentage points from older models to the latest within just nine months), then the profession must prepare now for a role that looks less like supervision and more like partnership.
The new skillset of compliance: understanding the loop
To stay credible, compliance professionals must evolve alongside the technology they increasingly use, manage and are accountable for. If this research makes one thing clear, it is that success with AI depends less on technical mastery and more on the ability to frame context and questions well. The strongest models performed best when the tasks and prompts were clear, structured and contextual, that is, when human input gave them direction. In practice, that means compliance professionals must become translators between regulations, policies and the way AI is used within compliance programs.
The skills this requires are not exotic, but they are increasingly non-negotiable. Compliance professionals will need to understand where AI adds value, where it introduces bias or risk and how to test its reasoning against ethical and legal standards. Eventually, they’ll need to collaborate with data scientists as comfortably as with auditors and learn to read AI outputs with the same critical eye once reserved for human outputs.
Mindset matters just as much as method. Instead of viewing AI as a black box to supervise, compliance teams should see it as a system to design and refine. The most consistent performers — those achieving over 85% accuracy — succeeded when given tasks that mirrored structured workflows: defined, contextual and bounded by clear goals. In the same way, the human side of compliance should evolve more toward managing systems of judgment, not just enforcing rules.
Keeping humans in the loop, in other words, will soon mean more than approving an outcome. It will mean understanding the process that produced it as well as being able to explain it.
Taking the lead on ethical AI
As AI becomes more embedded in compliance operations — from triaging cases to analyzing disclosure patterns — accountability cannot be delegated to algorithms. It must be designed, documented and owned.
Regulatory frameworks are beginning to reflect this reality. The EU AI Act and guidance such as the DOJ’s 2024 update both place responsibility squarely on organizations to ensure that AI systems are explainable, auditable and aligned with governance standards. Yet compliance leaders should not wait to be directed. They must lead by defining guardrails, demanding transparency and embedding explainability into every workflow that relies on AI.
Our research provides both a warning and a roadmap. The technology is already capable of extraordinary precision when used correctly, but its variability in judgment-based tasks shows why human oversight remains irreplaceable. AI can automate the what; only humans can define the why and the how.
As compliance enters this next phase of digital transformation, the task is not to defend the human role, but to strengthen it and to ensure that integrity, accountability and ethical reasoning remain at the core of every system we build. The loop isn’t complete without us.


Steph Holmes is director of compliance & ethics strategy at EQS Group. With more than a decade of industry experience, she helps organizations achieve strategic business goals through cultivating ethics, risk and corporate compliance. Passionate about empowering organizations to foster trust, transparency and accountability, she draws on her background in psychology and her credentials as a Leadership Professional in Ethics & Compliance (LPEC) and Certified Compliance & Ethics Professional (CCEP). In her role at EQS, she provides insights and guidance to clients to enhance their ethical culture and performance. 