When Does Localization Become a Deepfake?

A few weeks ago I saw a video of an American CEO speaking Japanese. His lips formed the Japanese words and his voice sounded like his. The gestures matched too. The catch: he doesn’t speak Japanese.

The lip-sync translation was celebrated as a breakthrough, and from a technical perspective rightly so. It is fascinating when someone you know appears to speak a completely foreign language fluently; you genuinely get the impression that the last cultural barriers are falling and seamless communication is possible.

And, as always, there is the problem with the synthetic output that AI so readily produces: a person forms words they never said. There is a word for that: deepfake. But nobody uses that word in this context. Instead it's localization, personalization, scaling; I've heard a thousand terms for it. At some point, looked at soberly, the process wasn't that cool anymore: artificial intelligence alters the face of a real person so that they say something they never said, in a language they don't speak.

I'll accept the eye-rolling when I say, perhaps pedantically, that the difference between localization and forgery lies not in the technology but in the intent, and in the transparency about it.

If I know the video was translated, it's a tool. If I don't know, it's a lie. The question is how many viewers will know, because it isn't always disclosed. I've written several essays about ethics, and this is an ethical question for me; yet even where ethics isn't obviously connected to this question, hardly anyone addresses it.

It should be a shared understanding in any society that a person's words belong to that person. When someone says something in an interview, it's a quote and can be verified. It can be disputed or distorted in context, but it was said.

What happens when words are synthetically produced and put into someone’s mouth? What are the consequences when a politician says something in a video that is lip-synced and voice-matched but was never said by them, and their words trigger further political decisions?

The implicit argument is that intent makes the difference. A company that has its CEO speak in twenty languages doesn't want a forgery; it wants reach. Fair enough, but the technology does more than translate. It interprets, just as a human translator would interpret. The difference is that its output is more verifiable and correctable than simultaneous real-time interpretation.

A good human translator takes a thought and finds a way to express the same thought in another language. The result sounds different from the original, and that's how it should be, because the language is different too. The resulting gap is an honest one. It says: this is a translation. Someone made this accessible for you, but it comes from somewhere else.

Lip-sync translation, when not explicitly marked as such, removes this gap. It removes the indication that it is a translation. The video looks like an original, and the translation behind it is invisible. They call it localization, but strictly speaking it's a deepfake.
