LLM Developers Building for Language Diversity in 2025

From Switzerland's 1,000-language Apertus to India's 350+ AI models, 2025 marked a watershed in multilingual LLM development. Here’s a comprehensive overview of developers building language-diverse AI across six continents.


For billions of people worldwide, AI has spoken only one language: English.

In 2025, that changed. Nations across every continent are no longer waiting for Silicon Valley to translate AI into their languages—they’re building it themselves. Switzerland’s Alps supercomputer now trains models in over 1,000 languages. India’s Bhashini platform supports 350+ AI-powered language models. Twelve Latin American countries are collaborating to preserve Indigenous languages like Rapa Nui. Africa’s InkubaLM serves five languages whilst using 75% fewer parameters than comparable models.

This isn’t just about translation. It’s about linguistic sovereignty, cultural preservation, and ensuring AI reflects the world’s diversity rather than flattening it.

What follows is the most comprehensive overview of multilingual LLM development in 2025—the year language diversity stopped being an afterthought and became a strategic imperative.

BY THE NUMBERS:

  • 1,000+ languages (Switzerland’s Apertus)
  • 350+ AI models (India’s Bhashini)
  • 24 EU languages (multiple initiatives)
  • 12 countries (Latin America collaboration)
  • 101 languages (Cohere’s Aya Expanse)

Regional and National AI Initiatives

Switzerland

Switzerland released Apertus in September 2025, the country’s first fully open-source multilingual LLM developed by EPFL, ETH Zurich, and the Swiss National Supercomputing Centre. Trained on the Alps supercomputer, Apertus comes in 8B and 70B parameter versions and supports over 1,000 languages, with 40% of training data in non-English languages. This includes underrepresented languages like Swiss German and Romansh.​

United Arab Emirates

The UAE launched Falcon Arabic in May 2025, developed by the Technology Innovation Institute in Abu Dhabi. Built on Falcon 3-7B, it is trained on high-quality native Arabic data spanning Modern Standard Arabic and regional dialects, capturing the full linguistic diversity of the Arab world. According to benchmarks, Falcon Arabic outperforms models up to 10 times its size and ranks as the best-performing Arabic model in its class.​

Southeast Asia

AI Singapore developed SEA-LION (South East Asian Languages In One Network), supporting 11 Southeast Asian languages: English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. The project represents a collaborative effort to ensure regional linguistic representation in AI development.​

Alibaba DAMO Academy released SeaLLM, targeting similar regional languages shortly after SEA-LION’s debut.​

SEA AI Lab and Singapore University of Technology and Design introduced Sailor, covering English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao.​

Indonesia is developing a 70 billion parameter LLM through collaboration between telecom operator Indosat and tech startup Goto, operating in Bahasa Indonesia and five local languages including Javanese, Balinese, and Bataknese.​

Vietnam unveiled GreenMind-Medium-14B-R1, the country’s first open-source Vietnamese LLM optimised for NVIDIA NIM. Developed by GreenNode, it’s designed for practical enterprise applications including AI assistants, chatbots, and Vietnamese reasoning tasks.​

Thailand, Malaysia, and other Southeast Asian nations have developed monolingual models like OpenThaiGPT and MaLLaM to serve their specific linguistic and cultural contexts.​


India

India has launched multiple initiatives to address its 22+ official languages and numerous regional dialects:

Bhashini, launched by the Ministry of Electronics & Information Technology in 2022, serves as India’s multilingual AI infrastructure. The platform has contributed to creating over 350 AI-powered language models and supports speech-to-text, text-to-speech, and translation services across Indian languages.​

BharatGen, launched in June 2025, is India’s first government-funded multimodal LLM for Indian languages. Developed under the National Mission on Interdisciplinary Cyber-Physical Systems and implemented through IIT Bombay, it integrates text, speech, and image modalities across 22 Indian languages.​

Adi Vaani, launched in 2025 under the Adi Karmayogi Abhiyan, is the world’s first AI-powered tribal language bridge. It currently covers Santali, Mundari, Gondi, and Bhili (22 million speakers) with plans to expand to 30+ languages by 2027 and 50+ endangered languages by 2030.​

Sarvam AI is building India’s first indigenous AI model designed to reason and be fluent in Indian languages, targeting population-scale deployment.​

Hanooman represents another indigenous alternative to ChatGPT, developed with support from Indian research institutions.​

Private companies like Karya are building datasets for firms like Microsoft and Google to improve AI models for Indian languages.​

Latin America

Twelve Latin American countries are collaborating to launch Latam-GPT in September 2025, the region’s first large AI language model. Led by Chile’s National Center for Artificial Intelligence (CENIA) with over 30 regional institutions, the model is based on Llama 3 and trained on regional datasets including court decisions, library records, and school textbooks. The initiative prioritises Indigenous language preservation, with an initial translation tool created for Rapa Nui, the native language of Easter Island.​

Africa

Lelapa AI launched InkubaLM, Africa’s first multilingual small language model supporting Swahili, Yoruba, IsiXhosa, Hausa, and IsiZulu. Named after the dung beetle for its efficient design, InkubaLM was compressed by 75% without losing performance, making it highly suitable for low-resource environments.​
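Why compression matters in low-resource settings comes down to simple arithmetic: a model’s weight footprint is roughly its parameter count times the bytes used per parameter. A minimal back-of-the-envelope sketch (the model sizes and precisions below are illustrative assumptions, not InkubaLM’s published figures):

```python
def weight_footprint_gb(params: float, bytes_per_param: float) -> float:
    """Approximate memory (GB) needed just to hold the model weights."""
    return params * bytes_per_param / 1e9

# Illustrative sizes: a small 0.5B-parameter model vs a 7B-parameter model.
for label, params in [("small (0.5B)", 0.5e9), ("large (7B)", 7e9)]:
    fp16 = weight_footprint_gb(params, 2)  # 16-bit weights
    int8 = weight_footprint_gb(params, 1)  # 8-bit quantised weights
    print(f"{label}: ~{fp16:.1f} GB at fp16, ~{int8:.1f} GB at int8")
```

On these assumed sizes, the small model fits comfortably on a phone-class device while the large one needs server-grade memory, which is the point of parameter-efficient designs for low-resource environments.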

Nigeria launched its first multilingual LLM in 2024, trained on five low-resource languages and accented English to ensure stronger language representation.​

The African Next Voices initiative brought together linguists and computer scientists to develop AI-compatible datasets for 18 African languages, gathering 9,000 hours of spoken language from Kenya, Nigeria, and South Africa.​

Orange partnered with OpenAI and Meta to fine-tune AI models for regional African languages including Wolof and Pulaar, spoken by 16 million and 6 million people respectively in West Africa. The initiative launched in the first half of 2025 with plans to expand across Orange’s 18-country African footprint.​

Jacaranda Health developed UlizaLlama (AskLlama), extending Meta’s Llama models to support Swahili, Hausa, Yoruba, Xhosa, and Zulu for maternal health support.​

South Korea

South Korea has launched an ambitious sovereign AI initiative with major tech companies developing Korean-language models:

Naver’s HyperCLOVA X was trained using 6,500 times more Korean data than GPT-4 and outperforms it on Korean-specific benchmarks. The lineup includes HyperCLOVA X Think (reasoning-specialised), HyperCLOVA X Dash (lightweight), and HyperCLOVA X Seed (open-source).​

SK Telecom’s A.X models, built on China’s Qwen 2.5 and expanded in-house, have shown stronger performance than GPT-4o on local benchmarks. A.X 4.0 processes Korean inputs about 33% more efficiently than GPT-4o.​
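Efficiency claims like this typically come from tokenizer fertility: how many tokens a model needs to represent the same text, which directly drives inference cost. A minimal sketch of the comparison, using two toy tokenizers as stand-ins (neither is SK Telecom’s or OpenAI’s actual tokenizer):

```python
def char_tokenizer(text: str) -> list[str]:
    """Worst case: one token per character."""
    return list(text)

def word_tokenizer(text: str) -> list[str]:
    """A coarser tokenizer: one token per whitespace-separated word."""
    return text.split()

def relative_efficiency(text: str, tok_a, tok_b) -> float:
    """Fraction of tokens saved by tok_a relative to tok_b."""
    return 1 - len(tok_a(text)) / len(tok_b(text))

sample = "안녕하세요 오늘 날씨가 좋네요"  # "Hello, the weather is nice today"
gain = relative_efficiency(sample, word_tokenizer, char_tokenizer)
print(f"word tokenizer uses {gain:.0%} fewer tokens than char tokenizer")
```

Real comparisons run a model’s actual subword tokenizer over a large Korean corpus, but the ratio computed is the same.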

LG, Upstage, Kakao, and NC AI are also developing Korean-language models as part of the government’s ₩530 billion ($390 million) sovereign AI initiative.​


Japan

Japan saw fewer headline LLM releases in 2025, but Fujitsu launched Takane in 2024, a Japanese-language LLM based on Cohere’s Command R+ with additional training for enhanced Japanese capabilities. The model achieved industry-leading performance on the JGLUE benchmark for Japanese language understanding.​

Research continues on Japanese multimodal understanding with projects like JMMMU (Japanese MMMU), a benchmark designed to evaluate LMMs on expert-level tasks based on Japanese cultural context.​

Middle East and Arabic-Speaking Regions

Beyond the UAE’s Falcon Arabic, several other Arabic-focused initiatives emerged:

Saudi Arabia’s ALLaM, developed by SDAIA, leverages reinforcement learning from AI feedback to enhance instruction-following in Arabic.​

Qatar’s Fanar, launched by Qatar Computing Research Institute in December 2024, specifically targets Gulf dialect understanding with a dual-model approach and multimodal capabilities.​

UAE’s AIN-7B, launched in early 2025 by MBZUAI, represents the first Arabic-inclusive Large Multimodal Model processing both text and images with Arabic proficiency.​

Cohere’s Command R7B Arabic, released in February 2025, is optimised for Arabic language understanding and generation.​

The Jais family of models, developed through collaboration between Inception, MBZUAI, and Cerebras Systems, stands as a landmark achievement in Arabic AI with strong bilingual Arabic-English capabilities.​

Turkey

Turkey is developing locally produced Turkish LLMs as part of its National Artificial Intelligence Strategy through 2025. The T3AI model, developed by the Turkish Technology Team (T3) Foundation and Baykar, launched its beta version in July 2025 on the TEKNOFEST platform. The government launched the “Turkish Large Language Model Sectoral Adaptation Project Call” offering grants up to TL 50 million ($1.3 million) per project.​

Russia

Russia developed the GigaChat family of Russian LLMs, created from scratch specifically for the Russian language. The models employ Mixture-of-Experts (MoE) architecture and a specialised tokeniser to address Russian linguistic and cultural nuances. GigaChat 2 MAX achieved near-state-of-the-art results on Russian benchmarks, though it still lags behind American models in some areas.​

Brazil and Portuguese-Speaking Countries

Beyond Google’s GAIA model, Brazil committed R$ 23 billion ($4.24 billion) through 2028 to build a national AI ecosystem centred on Portuguese. The plan includes R$ 5.7 billion for a “sovereign cloud” hosting a supercomputer for training Portuguese-language models.​

The Brazilian Linguistic Diversity Platform aims to train AI models with structured data from Portuguese and 250+ indigenous and regional languages spoken in Brazil.​

Portugal and Brazil held their first Portuguese-Brazilian Dialogue on AI in November 2025 to strengthen cooperation in AI and digital services.​


Canada

While Canada has no approved AI regulation framework as of May 2025, the government allocated CAD 2.4 billion to promote AI businesses and is working on the Artificial Intelligence and Data Act (AIDA). The Pan-Canadian Artificial Intelligence Strategy continues to guide Canada’s approach to becoming a world leader in AI.​

Europe

Mistral AI (France) developed the Mistral Large 2 model with support for dozens of languages including French, German, Spanish, and Italian, plus over 80 coding languages. The 123 billion parameter model operates with a 128k context window.​

Aleph Alpha (Germany) develops sovereign LLMs focused on multilingualism, explainability, and EU AI Act compliance. Their Pharia-1-LLM-7B-Control model supports German, French, and Spanish under an Open Aleph License.​

Italy developed Minerva, the country’s first LLM family built on Italian language data through collaboration by Sapienza NLP, FAIR, and CINECA. The 7.4B parameter model trained on 2.5T tokens maintains a 50/50 Italian-English data balance.​

Velvet AI (Italy/Almawave) emphasises sustainability and multilingual reach, trained on the Leonardo supercomputer for healthcare, finance, and public administration applications.​

EuroLLM-9B represents a pan-European initiative supporting all 24 official EU languages plus 11 additional languages, making it one of the most linguistically inclusive European models.​

Unbabel (Portugal) released EuroVLM models in June 2025, supporting 35 languages including all 24 official EU languages.​

Tilde won the European AI Grand Challenge and released TildeOpen LLM, a 30-billion parameter model optimised for all 24 EU official languages plus Ukrainian, Norwegian, and several Balkan languages.​

The LLMs4Europe project, launched in 2025 with over 70 partners, aims to build open, trustworthy multilingual LLMs for five strategic sectors: Energy, Telecommunications, Tourism, Public Services, and Science.​

Startups and Research Organisations

Hugging Face continues to serve as the primary platform for open-source multilingual models, hosting over 373,000 models as of September 2025. The platform released analysis showing that 88% of models with language tags support English, while multilingual models represent 76% of all tagged models.​
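Figures like these are derived from the language tags on model cards. A minimal sketch of how such shares are computed, using a small synthetic metadata sample rather than the real Hub index:

```python
# Synthetic model metadata; real statistics come from Hub model-card tags.
models = [
    {"id": "m1", "languages": ["en"]},
    {"id": "m2", "languages": ["en", "fr", "de"]},
    {"id": "m3", "languages": ["sw", "yo", "ha"]},
    {"id": "m4", "languages": ["en", "hi"]},
    {"id": "m5", "languages": []},  # untagged: excluded from tag statistics
]

tagged = [m for m in models if m["languages"]]
english_share = sum("en" in m["languages"] for m in tagged) / len(tagged)
multilingual_share = sum(len(m["languages"]) > 1 for m in tagged) / len(tagged)

print(f"English support among tagged models: {english_share:.0%}")
print(f"Multilingual (2+ tags) among tagged models: {multilingual_share:.0%}")
```

Note that both shares are computed over tagged models only, which is why untagged models can make headline percentages look higher than true coverage.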

DeepSeek (China) released DeepSeek R1 and DeepSeek-V3 in late 2024/early 2025, with 671B total parameters (37B active) and strong multilingual capabilities.​

AI21 Labs developed models for Hebrew language processing, working with partners to create RAG-based chatbots for Hebrew legal and regulatory content.​

Ivrit.ai provides high-quality Hebrew datasets under permissive licenses, enabling first-class support for Hebrew in AI models.​

Sartify developed Swahili-LLM to provide general and domain-specific AI capabilities for Swahili-speaking regions, supporting multilingual interactions including Kiswa-English mix common in East Africa.​

01.AI developed the Yi series of open-source LLMs offering strong performance in both English and Chinese.​

Stability AI released StableLM 2 supporting seven languages: English, Spanish, German, Italian, French, Portuguese, and Dutch.​

Upstage (South Korea) developed Solar Pro 2 with 31 billion parameters, optimised for Korean language performance and outperforming global models on Korean benchmarks.​

NVIDIA has become a key infrastructure partner, with models like Vietnam’s GreenMind optimised for NVIDIA NIM and multiple initiatives leveraging NVIDIA’s hardware for training multilingual models.

Major Tech Companies

Meta AI continues to expand multilingual capabilities across its Llama model family. Llama 4 Scout, released in April 2025, added support for eight additional languages beyond previous versions, demonstrating Meta’s commitment to broader linguistic coverage. The Llama 3.1 series supports multilingual applications with strong performance across diverse language tasks.​

Google DeepMind has made significant strides with its Gemini series. Gemini 2.5 Pro and subsequent releases support extensive multilingual functionality. The company partnered with Brazilian institutions to develop GAIA, a Portuguese-language model based on Gemma 3, specifically optimised for Brazilian use cases. Google’s multilingual infrastructure now supports over 140 languages with high accuracy rates.​


OpenAI released GPT-5 in August 2025, which the company describes as its “smartest, fastest, most useful model yet”. However, multilingual performance benchmarks revealed only marginal improvements over previous models across more than a dozen tested languages, including Brazilian Portuguese, Arabic, French, German, Italian, Spanish, Bengali, Chinese, Hindi, Indonesian, Japanese, Korean, Swahili, and Yoruba. The company states that GPT-5 “significantly improves multilingual understanding across over 12 Indian languages”.​

Alibaba continues developing its Qwen series with strong multilingual capabilities. Qwen 2.5 and Qwen 3 models support over 100 languages and dialects, with particularly strong performance in Asian languages including Mandarin, Japanese, and Korean. The models have gained significant adoption in Southeast Asian markets for their speed and multilingual efficiency.​

Anthropic’s Claude models support multiple languages, though the company’s primary focus remains on English, French, Spanish, Japanese, and other major languages. Claude 4 and its variants demonstrate strong performance across these supported languages while maintaining the company’s Constitutional AI approach to safety.​

Microsoft developed the Phi-3 model family with multilingual support, though specific language coverage details remain limited. The company has also partnered with various organisations globally to enable multilingual AI applications.​

Cohere has emerged as a significant player in enterprise multilingual AI. The company released Command A Translate in August 2025, specifically designed for AI translation across 23 business languages. Cohere’s Aya initiative represents one of the most ambitious multilingual projects globally, involving over 3,000 researchers across 119 countries to develop models supporting 101 languages. The Aya Expanse model family includes 8B and 32B parameter versions optimised for 23 languages including Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian, and Vietnamese.

Key Trends

Sovereign AI Movement: Nations across Asia, Latin America, Africa, and Europe are prioritising development of locally controlled AI models that reflect their linguistic and cultural contexts rather than depending on US or Chinese technology.​

Low-Resource Language Focus: Significant efforts are underway to address the “digital divide” affecting languages with limited training data, including Indigenous languages, tribal languages, and minority languages.​

Multilingual-by-Design: New models like Switzerland’s Apertus and Cohere’s Aya Expanse are designed from the ground up to be multilingual rather than English-centric with multilingual capabilities added later.​

Open-Source Trend: Many regional initiatives emphasise open-source models to enable community contribution and reduce dependence on proprietary systems.​

Parameter Efficiency: Models like Africa’s InkubaLM demonstrate that smaller, efficiently designed models can deliver strong performance for specific linguistic contexts without requiring massive scale.​

Multimodal Expansion: Language models are increasingly incorporating vision and speech capabilities, as seen with India’s BharatGen, UAE’s AIN-7B, and Europe’s EuroVLM.​

Community Collaboration: Projects like Cohere’s Aya (3,000+ researchers across 119 countries) and African Next Voices demonstrate the power of global collaboration for multilingual AI development.​

The landscape of multilingual LLM development in 2025 represents a fundamental democratisation of AI technology, with developers across every region working to ensure their languages and cultures are represented in the AI-powered future. This shift from English-dominated systems toward truly global, inclusive language technology marks one of the most significant developments in AI’s evolution.

Anushka Pandit
Anushka is a Principal Correspondent at AI and Data Insider, with a knack for studying what's impacting the world and presenting it in the most compelling packaging to the audience. She merges her background in Computer Science with her expertise in media communications to shape tech journalism of contemporary times.
