Cerebras and Abu Dhabi build the world’s most powerful Arabic-language AI model


Jais-Chat, named after the highest mountain in the United Arab Emirates, can take a prompt in Arabic or English and complete it, just as ChatGPT does. Tiernan Ray + DreamStudio by Stability.ai

In a sign of times to come, startup Cerebras Systems has just announced a partnership with Inception, a subsidiary of United Arab Emirates investment firm G42, to create the world’s largest open language model for Arabic, a language spoken by around 400 million people.

The program, called Jais-Chat, is used the same way as typing into the ChatGPT prompt, except that Jais-Chat can accept and produce Arabic text as input and output. It can, for example, write a letter in Arabic when asked to do so in English:


Example of Jais-Chat writing a letter in Arabic from an English-language request. Image: Inception

Or it can take an Arabic language prompt and generate an Arabic response:


Example of Jais-Chat generating an Arabic response to an Arabic-language prompt. Image: Inception

Trained on a large, purpose-built corpus of Arabic text, the program forgoes the typical approach of building a general-purpose model that handles hundreds of languages, in many cases poorly, and focuses exclusively on Arabic and English.

Jais-Chat scores 10 points higher than Llama 2

In tests such as MMLU, a multiple-choice benchmark from the University of California, Berkeley, and HellaSwag, a test from the Allen Institute for AI, Jais-Chat scored 10 points higher than leading LLMs such as Meta’s Llama 2. It beat top open-source programs such as the BigScience workshop’s BLOOM, and it also beat specialized language models built exclusively for Arabic.


Jais-Chat performs better on several Arabic tests than much larger models such as Meta’s Llama 2. Inception

“A lot of companies are talking about democratizing AI,” says Andrew Feldman, co-founder and CEO of Cerebras, in an interview with ZDNET. “We are giving 400 million Arabic speakers a voice in AI. This is democratizing AI. It is the main language of 25 countries.”

Language disparity in the AI sector has been the subject of considerable attention for some time. The “No Language Left Behind” (NLLB) initiative, launched last year by Meta, works on the simultaneous processing of 200 languages, with a focus on so-called “low-resource” languages, that is, those that do not have a large corpus of online text that can be used to train models.

“While only 25.9% of Internet users speak English, 63.7% of all websites are in English”

As Meta’s authors note, studies in this area “indicate that while only 25.9% of Internet users speak English, 63.7% of all websites are in English.”

“The truth is that the largest data sets rely on scraping from the Internet, and the Internet is primarily in English, which is a really unfortunate situation,” Feldman said.

Attempts to bridge the language gap in AI have involved general-purpose AI programs such as Meta’s NLLB. However, these programs currently fall short in a number of languages, including low-resource languages such as Oromo (spoken in Ethiopia and Kenya), but also languages for which translated material is plentiful, such as Greek and Icelandic.

Turning away from multimodal models

So-called multimodal programs, such as NLLB’s successor, Meta’s SeamlessM4T, unveiled this month, attempt to perform many different tasks in dozens of languages using a single model, including speech-to-text transcription and text-to-speech generation. This can weigh down the whole process with additional objectives.

Instead of a generalist or multimodal approach, Inception and Cerebras built a program that trains only on Arabic and English. How?

  • First, they created a purpose-built dataset of Arabic-language text. They compiled 55 billion tokens of data from a myriad of sources, such as Abu El-Khair, a collection of over 5 million articles from major news sources spanning a period of 14 years; the Arabic edition of Wikipedia; and United Nations transcripts, among others. The authors then grew the Arabic training data from 55 billion tokens to 72 billion by machine-translating English texts into Arabic.
  • The authors then upsampled the Arabic text by a factor of 1.6, bringing the Arabic data to a total of 116 billion tokens.
  • The authors also took another, more innovative step: they combined the Arabic and English text with billions of tokens of computer code snippets collected from GitHub.

The final dataset includes 29% Arabic, 59% English, and 12% code.
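
As a rough illustration, the sketch below works backwards from the figures quoted above (55 billion native Arabic tokens, 72 billion after machine translation, the 1.6x upsampling, and the 29/59/12 split) to estimate the size of each slice of the final mix. The English and code token counts are inferred from the percentages, not reported figures.

```python
# Back-of-the-envelope reconstruction of the training mix from the figures in
# the article. Only the Arabic-side numbers are reported; the English and code
# token counts are derived from the 29% / 59% / 12% split, so treat them as
# rough estimates rather than published figures.

native_arabic = 55e9                       # tokens scraped from Arabic sources
translated_arabic = 72e9 - native_arabic   # added by machine-translating English text
arabic_upsampled = 72e9 * 1.6              # Arabic sampled 1.6x -> ~116B tokens

total = arabic_upsampled / 0.29            # Arabic is 29% of the final mix
english = 0.59 * total
code = 0.12 * total

print(f"Arabic  : {arabic_upsampled / 1e9:.0f}B tokens (29%)")
print(f"English : {english / 1e9:.0f}B tokens (59%)")
print(f"Code    : {code / 1e9:.0f}B tokens (12%)")
print(f"Total   : {total / 1e9:.0f}B tokens")
```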

A custom tokenizer

The researchers didn’t just use a special dataset. They also employed several techniques tailored to representing Arabic vocabulary.

To do this, the researchers built their own “tokenizer”. The usual tokenizer used by programs such as GPT-3 “is mostly trained on English corpora,” the researchers write, so that common Arabic words “are over-segmented into individual characters […] which decreases the performance of the model and increases the computational cost”.
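
To see the over-segmentation problem in practice, here is a small, purely illustrative comparison using the GPT-2 tokenizer from Hugging Face transformers as a stand-in for an English-centric BPE vocabulary. It is not the Jais tokenizer, which is a custom build.

```python
# Illustration of over-segmentation: an English-centric BPE tokenizer (GPT-2's,
# used here only as a stand-in; it is NOT the Jais tokenizer) splits an Arabic
# sentence into far more tokens than a comparable English sentence.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

english = "Artificial intelligence is changing the world."
arabic = "الذكاء الاصطناعي يغير العالم."  # rough Arabic equivalent of the sentence above

print(len(tok.tokenize(english)), "tokens for the English sentence")
print(len(tok.tokenize(arabic)), "tokens for the Arabic sentence")
# The Arabic string typically decomposes into many byte-level pieces, which is
# exactly the over-segmentation the custom Jais tokenizer is meant to avoid.
```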

The researchers also used a position-encoding method called ALiBi (“Attention with Linear Biases”), developed last year by researchers at the Allen Institute for AI and Meta. This technique copes much better with very long contexts, that is, longer inputs to the language model typed at the prompt or recalled from memory.
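
For readers unfamiliar with ALiBi, the following minimal sketch shows the core idea as described in the ALiBi paper: each attention head adds a penalty to its scores that grows linearly with the distance between query and key, using a head-specific slope, in place of positional embeddings. It is a simplification for illustration, not the Jais implementation.

```python
# Minimal sketch of the ALiBi idea; simplified, not the actual Jais code.
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    # Geometric sequence of slopes, one per head (the paper's recipe for head
    # counts that are powers of two).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # relative[i, j] = j - i: zero on the diagonal, increasingly negative for
    # keys further in the past (future positions are masked out separately).
    relative = (pos[None, :] - pos[:, None]).float()
    return slopes[:, None, None] * relative  # shape: (num_heads, seq_len, seq_len)

# The bias is simply added to the raw attention scores before the softmax:
#   scores = q @ k.transpose(-2, -1) / d_head**0.5 + alibi_bias(seq_len, num_heads)
bias = alibi_bias(seq_len=8, num_heads=4)
print(bias.shape)  # torch.Size([4, 8, 8])
```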

Jais’ code is released under the Apache 2.0 license and is available on Hugging Face

“We were looking to capture the linguistic nuances of Arabic and the cultural references,” says Feldman, who has traveled extensively in the Middle East. “And it’s not easy when most of the model is in English.”

Thanks to these modifications, the result is a language model called Jais, and its chat application, Jais-Chat, with 13 billion “parameters”, the neural weights that form the critical active elements of the neural network. Jais is based on the GPT-3 architecture designed by OpenAI, a so-called “decoder-only” variant of Google’s 2017 Transformer.

The Jais program code is released under the Apache 2.0 license and is available for download from Hugging Face. A demo of Jais can be accessed by signing up for a waiting list. The authors plan to make the dataset public “in the near future,” according to Feldman.
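
For reference, loading the released checkpoint follows the standard Hugging Face transformers pattern. The repository ID shown below is an assumption based on the release announcement and may since have changed, and trust_remote_code=True is assumed to be needed because the architecture is non-standard; check the model card on Hugging Face for authoritative usage.

```python
# Sketch of loading the released Jais checkpoint with Hugging Face transformers.
# The repository ID is assumed from the release announcement; verify it on
# Hugging Face before use.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "inception-mbzuai/jais-13b-chat"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "اكتب رسالة قصيرة إلى صديق"  # "Write a short letter to a friend"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```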

The programs were run on what Cerebras calls “the world’s largest supercomputer for AI”, called Condor Galaxy 1, which was built for G42 and unveiled last month.

The machine is made up of 32 specialized Cerebras AI computers, the CS-2s, whose chips collectively contain a total of 27 million computing cores, 41 terabytes of memory, and 194 trillion bits per second of bandwidth. They are overseen by 36,352 AMD EPYC x86 processor cores. The researchers used part of this capacity, 16 machines, to train and fine-tune Jais.

At 13 billion parameters, the program is remarkably capable for its size. It is a relatively small neural network compared to the likes of GPT-3, which has 175 billion parameters.

“Its pre-trained capabilities surpass all known open-source Arabic models,” the researchers write, “and are comparable to open-source English models that have been trained on larger datasets.”

As the authors note, the Arabic dataset of 72 billion tokens would normally not be sufficient for a model of more than about 4 billion parameters, according to the AI rule of thumb known as the “Chinchilla” scaling law, formulated by DeepMind researchers.
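
To put that in numbers, the commonly cited reading of the Chinchilla result is roughly 20 training tokens per parameter; that ratio is not stated in the article and is used here only for illustration.

```python
# Rough Chinchilla-style arithmetic. The ~20-tokens-per-parameter ratio is the
# commonly cited reading of the DeepMind result, not a figure from the article.
arabic_tokens = 72e9
tokens_per_parameter = 20
compute_optimal_params = arabic_tokens / tokens_per_parameter
print(f"~{compute_optimal_params / 1e9:.1f}B parameters")  # ~3.6B, well under Jais's 13B
```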

In fact, not only does Jais-Chat in its 13-billion-parameter form outperform Llama 2; a smaller version of the program, with only 6.7 billion parameters, also achieves better results on the same tests, such as MMLU and HellaSwag.



“What was interesting was that Arabic also improved English,” Feldman said, referring to Jais’s performance. “We ended up getting a model as good as Llama in English, although we trained it on about a tenth of the data.”


Source: ZDNet.com


