When Does Localization Become Deepfake?

A few weeks ago, I saw a video of an American CEO speaking Japanese. His lips formed the Japanese words. His voice sounded like his voice. The gestures matched. Everything fit, except for one detail: He doesn’t speak Japanese.

The industry calls this “lip-sync translation” and celebrates it as a breakthrough in localization. One speaker, one video, a hundred languages. Every version looks as if the person spoke that language. It gets described as an efficiency gain and a way to break down cultural barriers.

I see something different. I see a person forming words they never said.

There’s a word for that: deepfake. But nobody uses that word in this context. Instead: localization, personalization, scaling. Technical terms for a process that, when you describe it instead of labeling it, sounds like this: An artificial intelligence alters the face of a real person so that they say something they never said, in a language they don’t speak.

The difference between localization and forgery isn’t in the technology. It’s in the intent. And in the transparency.

If I know the video was translated, it’s a tool. If I don’t know, it’s a lie. The question is: How many viewers will know?

Nobody addresses this. The benefit gets described and the question gets skipped. But the question is everything.

As a society, we have agreed that a person’s words belong to that person. When someone says something in an interview, it’s a quote. It can be verified. It can be disputed. It can be placed in context. But it was said.

What happens to this agreement when words are synthetic? When a politician says something in a video that is lip-synced and voice-matched but never actually happened? The technology is the same. The application is different. But the boundary between them isn’t a line. It’s fog.

The implicit argument is that intent makes the difference. A company that has its CEO speak in twenty languages doesn’t want a forgery. It wants reach. Agreed. But the technology that makes this possible can do both. And it will do both.

I think about translation the way I know it. A good translator takes a thought and finds a way to express the same thought in another language. The result sounds different from the original. It has to sound different, because the language is different. This difference is honest. It says: This is a translation. Someone made this accessible for you, but it comes from somewhere else.

Lip-sync translation removes this difference. It removes the signal that you’re looking at a translation. The video looks like an original. The translation makes itself invisible. And a translation that makes itself invisible is something other than a translation.

The technology is impressive. The use cases are real. But when you talk about localization and the word deepfake doesn’t appear once, you’ve made a choice. You’ve chosen not to talk about it.

The question of when a translation becomes a forgery has no simple answer. But not asking it has a very clear meaning: It’s inconvenient for what you’re trying to sell.

When exactly does it tip? When does a useful tool become a weapon? The technology can’t answer that. The industry won’t answer it. Who’s supposed to, then?