Editor’s Note: Tarun Wadhwa is CEO of Day One Insights, a strategy and advisory firm focusing on corporate reinvention, and a visiting instructor at Carnegie Mellon University’s College of Engineering. The opinions expressed in this commentary are his own.
The problem of unsolicited robocalls has gotten so bad that many people now refuse to pick up calls from numbers they don’t know. It’s become a defense of last resort in an increasingly frustrating situation, one that has already turned nearly 25 million Americans into victims of fraud. If only it were that simple to solve.
By next year, it’s estimated that half of the calls we receive will be scams. Even more worrisome, 90% of those calls will be “spoofed,” falsely appearing to come from a familiar number in your contacts.
The government is finally waking up to the severity of the issue, funding and developing a suite of tools, apps and approaches intended to keep scammers from getting through.
Unfortunately, it’s too little, too late. By the time these “solutions” become widely available, scammers will have moved on to radically more sophisticated tactics. In the near future, it won’t just be the number you see on your screen that is in doubt. You will soon also question whether the voice you’re hearing is real.
That’s because a wave of powerful voice manipulation, impersonation and automation technologies is about to become widely available for anyone to use. Gone are the robotic-sounding voice changers of yesterday. With machine learning, software can now understand and mimic the intonations, speaking style and emotions we use in daily conversation.
And we may already be past the point where we are able to tell whether there’s a human being or a bot on the other end of the phone.
At this year’s Google I/O conference, the company demonstrated a new voice technology that produced such a convincing human-sounding voice it was able to speak to a receptionist and book a reservation without being detected. Then BuzzFeed reporter Charlie Warzel used a free program called Lyrebird to create an “avatar” of his voice by reading phrases into the software for an hour; the result was good enough to fool his own mother.
As these systems collect more data and evolve, they require fewer and shorter audio clips in order to make believable replicas.
Take Chinese tech giant Baidu’s progress in developing its text-to-speech technology named DeepVoice, for example. When the first version was released in early 2017, it was capable of assembling short sentences that sounded quite realistic, but it required hours of recordings and could only process a single voice. Two releases later, the software is now capable of processing thousands of different voices and requires only 30 minutes of training data.
These developments threaten to make our current frustrations with robocalls much worse. The reason robocalls are such a thorny issue has less to do with volume than with precision. A decade of data breaches of personal information has created a situation where scammers can easily learn your mother’s maiden name, and far more. Armed with this knowledge, they’re able to carry out large-scale but individually targeted campaigns that deceive people when and where they are most vulnerable. This means, for instance, that a scammer could call you from what looks to be a familiar number and talk to you in a voice that sounds exactly like your bank teller’s, claiming to have found suspicious activity on your account. You’re then tricked into “confirming” your address, mother’s maiden name, card number and PIN.
Scammers follow the money, so companies will be the worst hit. A lot of business is still done over the phone, and much of it is based on trust and existing relationships. Voice manipulation technologies threaten to undermine that — imagine the employees of a large corporation receiving a call from what sounds like the head of the accounting department asking to verify their payroll information.
There are few existing safeguards to protect against the confusion, doubt and chaos that falsified computerized speech can create. We may soon see it used to make bomb threats and calls to the police in someone else’s voice, and it will enable frightening new kinds of extortion. We’ve seen how disinformation campaigns have destroyed the credibility of our media networks; if we are not careful, the same thing could happen to voice-based communication in our personal lives, too.
We need to deal with the insecure nature of our telecom networks and their outdated systems of authentication and filtering now. Phone carriers, federal agencies and consumers need to work together to find ways of determining and communicating what is real. That might mean either developing a uniform way to mark videos, images and audio with a tag showing when they were made and by whom, or abandoning traditional phone calls altogether and moving toward data-based communications, using apps like FaceTime Audio or WhatsApp, which use strong encryption and can be tied to your identity.
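To make the idea of tagging concrete, here is a minimal sketch of what a signed provenance tag for an audio clip could look like, written in Python using the widely available cryptography library. The sender signs the clip together with metadata recording who made it and when; the recipient verifies the signature against the sender’s public key. The function names, metadata fields and key-distribution details are illustrative assumptions, not an existing standard.

```python
import json
import time

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def tag_audio(audio_bytes: bytes, author: str, private_key: Ed25519PrivateKey) -> dict:
    """Build a provenance tag: who produced the clip, when, and a signature over both."""
    metadata = {"author": author, "created_at": int(time.time())}
    # Sign the audio and its metadata together so neither can be swapped out later.
    payload = audio_bytes + json.dumps(metadata, sort_keys=True).encode()
    return {"metadata": metadata, "signature": private_key.sign(payload).hex()}


def verify_tag(audio_bytes: bytes, tag: dict, public_key: Ed25519PublicKey) -> bool:
    """Check that the clip really came from the holder of the matching private key."""
    payload = audio_bytes + json.dumps(tag["metadata"], sort_keys=True).encode()
    try:
        public_key.verify(bytes.fromhex(tag["signature"]), payload)
        return True
    except InvalidSignature:
        return False


# Example: a bank (or its phone carrier) signs a clip; the recipient's app verifies it.
key = Ed25519PrivateKey.generate()
clip = b"...raw audio bytes of the recorded message..."
tag = tag_audio(clip, author="First National Bank", private_key=key)

print(verify_tag(clip, tag, key.public_key()))               # True: authentic clip
print(verify_tag(b"tampered audio", tag, key.public_key()))  # False: altered or spoofed
```

The hard part, of course, is not the cryptography; it is getting carriers, agencies and device makers to agree on who issues the keys and how recipients learn to trust them.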
Credibility is hard to earn but easy to lose, and the problem is only going to get harder from here on out.