The internet is now flooded with images and videos generated by artificial intelligence (AI), many of them so realistic that they are hard to distinguish from reality, leaving people off guard and sometimes deceived. However, experts say there is a reliable way to identify AI videos: listening to the human voice.
One incident that gained attention involved a viral AI short film featuring scenery in Perak, Malaysia. An elderly couple from the country was so captivated after watching it that they traveled over 300 kilometers, only to realize upon arrival that the location depicted in the video was fictional. Even the many tourists shown in the video, and the female journalist interviewing them along the way, did not exist.
To help people avoid being deceived by AI videos as this elderly couple was, several experts explained why the voices and sound effects in AI videos often reveal obvious signs of being AI-generated.
According to the Huffington Post, natural human speech has a rhythm, with some words pronounced more slowly than others. AI-generated voices, by contrast, often sound hurried and very unnatural.
Jeremy Carrasco, an expert who specializes in debunking AI videos on social media, noted that videos released by Sora, an AI video application under OpenAI, are often “too active.” He said, “They say a lot without actually saying anything, just cramming text in.”
Even OpenAI has acknowledged this telltale sign. Comparing the habit to the use of dashes, Bill Peebles, the head of Sora, said in an interview on the live program TBPN that it was an odd speech pattern for Sora, which prefers to say many words quickly.
Linguists point out that the rhythm of human speech involves coarticulation, where airflow passes through the nasal and oral cavities as sound transitions naturally from one syllable to another.
However, many AI-generated voices still struggle with this, producing unclear sounds that flatten natural intonation. Melissa Baese-Berk, a linguistics professor at the University of Chicago, stated, “No one can produce as unclear speech as AI-generated voices because we simply can’t do it.”
Miguel Jetté from the speech-to-text service company Rev said that models trained for text-to-speech predict the most likely pronunciation of a series of words but often struggle to connect syllables smoothly between words.
Jetté gave an example where people naturally say “did you” as “didja,” but AI tends to overly emphasize each word’s pronunciation or rigidly connect them.
Obvious pronunciation errors in a video can also be a signal, as AI voices may struggle with uncommon or unusual words that are absent from their training data.
Carrasco observed that Google’s text-to-video model Veo “may not cram too many words, but they rearrange word order or involve certain errors.”
Camila Bruder from the Max Planck Institute for Empirical Aesthetics in Germany said that AI speech often expresses emotions too strongly, in ways that do not fit the scene. She noted that if AI speech stiffly conveys happiness with exclamations like “Wow!” or performs anger like a poor actor, these are signs that the video is AI-generated.
Carrasco added that viewers should also watch for strange emotional reactions. For instance, in a popular AI video where fish fall from the sky, a woman exclaims, “They are fish, they are really fish!” Such reactions are not typical in real life.
Jetté suggested checking whether speakers’ lip movements are synchronized with the audio. “If the speaker’s lips and voice are not perfectly synchronized… that’s a strong indicator.”
While none of these clues definitively identifies an AI-generated voice, taken together they strongly suggest that you are watching a machine-generated video. This is undoubtedly a helpful start. As AI continues to advance, people need as much assistance as possible in telling truth from falsehood.
Jetté emphasized, “If it doesn’t feel right, it probably isn’t. Maintaining a healthy sense of skepticism and having keen powers of observation and hearing are crucial for identifying details.”
