The new tokenizer has 200,000 tokens in total, and about 25% are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the number of tokens in different languages, and the top languages, besides English, are Russian, Arabic, and Vietnamese.
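The kind of count Das describes can be approximated with a rough sketch like the one below. It is not his actual method: it assumes the vocabulary is available as a plain list of strings, and it labels each token by the script of its letters using Python's standard `unicodedata` module. The names `dominant_script`, `script_shares`, and `toy_vocab` are hypothetical, and the eight-item vocabulary is a made-up stand-in for the real 200,000-token list.

```python
import unicodedata
from collections import Counter


def dominant_script(token: str) -> str:
    """Return a coarse script label (e.g. LATIN, CYRILLIC) for one token."""
    counts = Counter()
    for ch in token:
        if not ch.isalpha():
            continue  # ignore spaces, digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        # For most letters the first word of the Unicode name is the script,
        # e.g. "CYRILLIC SMALL LETTER A" or "CJK UNIFIED IDEOGRAPH-4E2D".
        counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return counts.most_common(1)[0][0] if counts else "OTHER"


def script_shares(vocab):
    """Return the fraction of tokens in the vocabulary per script label."""
    totals = Counter(dominant_script(tok) for tok in vocab)
    return {script: n / len(vocab) for script, n in totals.items()}


# Hypothetical toy vocabulary, NOT the real GPT-4o token list.
toy_vocab = [" the", "ing", " Россия", " привет", " العربية", " Việt", "123", "中国"]
shares = script_shares(toy_vocab)
for script, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{script}: {share:.1%}")
```

A script label is only a proxy for language: Vietnamese and English tokens both come out as LATIN here, so telling them apart would need a real language-identification filter.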
“So the tokenizer’s main impact, in my opinion, is you get the cost down in these languages, not that the quality in these languages goes dramatically up,” Das says. When an LLM has better and longer tokens in non-English languages, it can analyze the prompts faster and charge users less for the same answer. With the new tokenizer, “you’re looking at almost four times cost reduction,” he says.
Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. The tokens reflect discussions happening in those languages, so they include words like “Narendra” or “Pakistan,” but common English terms like “Prime Minister,” “university,” and “international” also come up frequently. They also don’t exhibit the issues surrounding the Chinese tokens.
That likely reflects the training data in those languages, Das says: “My working theory is the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There are not many spam bots and porn websites trying to happen in these languages. It’s mostly going to be in English.”
Polluted data and a lack of cleaning
However, things are drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scamming contexts. Even shorter tokens, like three-character-long Chinese words, reflect those topics to a significant degree.
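Inspecting the longest tokens in one script is a simple filter-and-sort over the vocabulary. The sketch below only illustrates that idea and is not the researchers' own code: it assumes the vocabulary is a plain list of strings, the helper names `is_han` and `longest_han_tokens` are hypothetical, and the toy list uses neutral words rather than entries from the real GPT-4o vocabulary.

```python
import unicodedata


def is_han(ch: str) -> bool:
    """True if the character is a CJK unified ideograph (a Chinese character)."""
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def longest_han_tokens(vocab, top_n=3):
    """Return the top_n longest tokens made up entirely of Chinese characters."""
    han_tokens = []
    for tok in vocab:
        text = tok.strip()  # tokens often carry a leading space
        if text and all(is_han(ch) for ch in text):
            han_tokens.append(text)
    return sorted(han_tokens, key=len, reverse=True)[:top_n]


# Hypothetical toy vocabulary with neutral words, NOT the real token list.
toy_vocab = ["的", "中国", "互联网", "人民共和国", " hello", "abc"]
print(longest_han_tokens(toy_vocab))
```

Run against a real vocabulary, the output of such a filter would still need a reader of Chinese to judge whether the long tokens are ordinary phrases or spam.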
“The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai from Princeton University. It’s not uncommon for a language model to crawl spam when collecting training data, but usually significant effort is taken to clean up the data before it’s used. “It’s possible that they didn’t do proper data clearing when it comes to Chinese,” he says.
The content of these Chinese tokens could suggest that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages.
These messages are often advertisements for pornography videos and gambling websites. They could be real businesses or merely scams. And the language is inserted into content farm websites or sometimes legitimate websites so they can be indexed by search engines, circumvent the spam filters, and come up in random searches. For example, Google indexed one search result page on a US National Institutes of Health website, which lists a porn site in Chinese. The same site name also appeared in at least five Chinese tokens in GPT-4o.
