Theta Lake’s Marc Gilman explains that as firms increasingly leverage transcription-based technologies to gain insights into their data, they must understand the basic mechanics behind these applications to make informed choices.
Voice-to-text transcription has become part of everyday life, whether we realize it or not. Applications like Apple’s Visual Voicemail, Siri and Google Voice transcribe the spoken word to text and provide new, powerful tools for consumer use. Businesses have deployed transcription to gain better insights into their information, using it for training and dispute resolution in call center interactions, creating voice-enabled robotic process automation (RPA) tools and sifting through large volumes of audio data for litigation or research.
While many use cases have been popular in regulated industries like financial services and health care, the benefits of these emerging technologies have been limited by a lack of meaningful evolution in transcription techniques. For example, global regulatory mandates to record telephone conversations at financial institutions have proliferated since the 2008 financial crisis, but supervising those communications remains challenging because transcription technology has stagnated. Similar attempts to use voice-to-text solutions to pre-review advertising and marketing materials have proven difficult, given the rudimentary nature of available applications.
Transcription systems still cannot disambiguate homophones like “for” and “four” or parse the contextual meanings of words like “account” or “promise.” Attempting to review voice recordings of swap-related conversations, marketing videos or advisor-client discussions is significantly more complex if interpretation of that content rests solely on a basic transcript. The inability of rudimentary transcription mechanisms to extract the context and meaning of a conversation renders those technologies impractical at best.
Moreover, the metrics used to benchmark transcription vendors often lack transparency. A “word error rate” (WER) is commonly used to gauge the reliability and accuracy of a transcription tool; however, the mechanisms used to generate WERs are not standardized, and the statistic can mislead. Some WERs do not count misidentification of a singular noun as a plural (“account” vs. “accounts”), or vice versa, as an error. Additionally, the treatment of “stop words” such as “the,” “on,” “is” or “at” affects the WER. The importance of a particular stop word may vary by industry or conversation and, because there is no authoritative list of stop words, transcription technologies do not handle them uniformly. As a result, a vendor’s purported WER may not be a true representation of the accuracy of its technology. When assessing an application, it is critical to understand how measurements like the WER are generated; a simple comparison of the statistic across several competing vendors is not sufficient.
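To make the point concrete, below is a minimal sketch assuming the common word-level definition of WER (substitutions, deletions and insertions divided by the length of the reference transcript) and showing how normalization choices move the reported number. The stop-word list and the crude singular/plural folding are illustrative assumptions, not any vendor’s actual methodology.

```python
# Illustrative sketch only: WER computed as word-level edit distance divided by
# reference length, with optional normalization steps that change the reported score.

def word_errors(reference, hypothesis):
    """Minimum word-level edit distance (substitutions + deletions + insertions)."""
    ref, hyp = reference.split(), hypothesis.split()
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)], len(ref)

def wer(reference, hypothesis, drop_stop_words=False, ignore_plurals=False):
    stop_words = {"the", "on", "is", "at"}  # no authoritative list exists
    def normalize(text):
        words = text.lower().split()
        if drop_stop_words:
            words = [w for w in words if w not in stop_words]
        if ignore_plurals:
            words = [w.rstrip("s") for w in words]  # crude singular/plural folding
        return " ".join(words)
    errors, ref_len = word_errors(normalize(reference), normalize(hypothesis))
    return errors / ref_len

reference = "the client opened the accounts on Tuesday"
hypothesis = "the client opened the account on Tuesday"
print(wer(reference, hypothesis))                      # strict: roughly 0.14
print(wer(reference, hypothesis,
          drop_stop_words=True, ignore_plurals=True))  # normalized: 0.0
```

The same pair of transcripts scores roughly 14 percent WER under a strict comparison and 0 percent once stop words are dropped and plurals are folded, which is why a headline WER figure means little without knowing the normalization behind it.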
Given these limitations, searching through recordings for a specific word or phrase can be particularly frustrating and unproductive. A simple search for a person, counterparty, product or concept may require multiple passes through a dataset to uncover the relevant universe of results. These process and technology inefficiencies put compliance personnel at a disadvantage — each false positive or omitted result requires extensive manual work to re-review content and confirm suspected risks.
To address the shortcomings of basic transcription applications, vendors are developing systems that employ machine learning and artificial intelligence to analyze, enhance and clarify transcripts. Building technologies aligned to specific industries and tuned to specific terms and conversational cues produces more reliable results and more meaningful measurement criteria.
Advanced systems parse terms by examining each transcribed word to determine whether the original or a slightly different variant (“trade” or “trace”) is the more accurate reading. This comparison technique, which analyzes the “Levenshtein Distance” between two terms, is particularly powerful when incorporated into machine learning algorithms trained on financial services data. For example, correcting a transcript that reads “I’m calling about the cast swan” against a financial services vocabulary using the Levenshtein Distance yields the context-relevant transcription “I’m calling about the cash swap.” Deploying more sophisticated solutions to the existing transcription problem space provides greater transparency into the meaning of the content itself and drives efficiencies by allowing compliance and risk teams to focus on accurately identified content.
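The sketch below illustrates the basic mechanic, assuming a character-level Levenshtein implementation and a small, hypothetical financial services vocabulary; a production system would combine this distance measure with models trained on domain data rather than a fixed word list.

```python
# Minimal sketch of the correction step: map each transcribed token to the
# nearest term in a domain vocabulary by Levenshtein distance. The vocabulary
# and threshold below are illustrative assumptions, not any vendor's model.

def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical financial-services vocabulary a tuned model might prefer.
DOMAIN_TERMS = ["cash", "swap", "trade", "swaption", "account", "hedge"]

def correct_token(token, max_distance=1):
    """Replace a token with the closest domain term if it is within the threshold."""
    best = min(DOMAIN_TERMS, key=lambda term: levenshtein(token, term))
    return best if levenshtein(token, best) <= max_distance else token

transcript = "I'm calling about the cast swan"
print(" ".join(correct_token(w) for w in transcript.split()))
# -> I'm calling about the cash swap
```

Here “cast” sits one edit away from “cash” and “swan” one edit away from “swap,” so both are corrected, while words with no close domain match pass through unchanged.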
In addition to AI-assisted transcription, supervisory systems should be designed to facilitate intuitive, user-friendly reviews and produce meaningful audit trails that evidence internal compliance review processes. These audit records can be referenced in the event of an exam or audit to demonstrate the specifics of your risk control framework to regulators or other third parties.
It is important to understand that machines are not a panacea. The improvement of transcription applications will provide real efficiencies in surfacing relevant, potentially risky content to compliance officers, but these processes will ultimately require a human to make an informed decision about potentially problematic information. That said, smarter machines can amplify smart decision-making.
Given the ever-increasing use of audio and video platforms to communicate with clients and policyholders, distribute marketing content and collaborate with internal and external stakeholders, ensuring that these channels are properly supervised to meet regulatory expectations is critical. Employing smarter, AI-enabled transcription applications and unified supervisory platforms will reinforce defensible, risk-based supervision strategies and facilitate a more open and collaborative approach to interacting with regulators and others outside an organization.