When Does Localization Become a Deepfake?

A few weeks ago I saw a video of an American CEO speaking Japanese. His lips formed the Japanese words and his voice sounded like his own. The gestures matched too. The catch: he doesn’t speak Japanese.

The lip-sync translation was celebrated as a breakthrough, and from a technical standpoint rightly so. It really was fascinating. Someone you know is suddenly speaking a foreign language, and you think the last language barriers are falling.

Then comes the usual problem with this kind of AI output: a person forms words they never said. The word for that is deepfake. But nobody uses that word in this context. Instead it’s localization, personalization, scaling, I don’t know. I’ve heard a thousand terms for it. Looked at soberly, the process stopped being all that cool at some point: artificial intelligence alters the face of a real person so that they say something they never said, in a language they don’t speak.

I’ll accept the eye-rolling when I say, perhaps pedantically, that the difference between localization and forgery lies not in the technology but in the intent and in how transparent one is about it.

If I know the video was translated, it clearly works as a tool. If I don’t know, I’m being deceived. The question is how many viewers will know, because it’s not always stated. I’ve written several essays about ethics, and for me this is an ethical question too. In all the excitement about the technology, it barely comes up.

A society should agree that a person’s words belong to that person. When someone says something in an interview, it’s a quote and can be verified. It can be disputed or distorted in context, but it was said.

What happens when words are synthetically produced and put into someone’s mouth? What are the consequences when a politician says something in a video that is lip-synced and voice-matched but that they never actually said? And what if that then triggers political decisions?

Sure, intent makes the difference. A company that has its CEO speak in twenty languages doesn’t want a forgery; it wants reach. That’s fine, but the technology can do more than just translate. It interprets, just as a human translator would. The only difference is that its output can be checked and corrected more easily than simultaneous real-time translation.

A good human translator takes a thought and finds a way to express the same thought in another language. The result sounds different from the original and that’s how it should be, because the language is different too. The resulting gap is an honest one. It says: this is a translation. Someone made this accessible for you, but it comes from somewhere else.

Lip-sync translation, when not explicitly marked as such, removes this difference. It removes the indication that it’s a translation. The video looks like an original and the translation behind it is invisible. They call it localization, but strictly speaking it’s a deepfake.