
An ambitious new AI project has begun to take shape in Europe, with the aim of developing open-source AI models that support the region’s 24 official languages and more—while also complying as much as possible with its thicket of digital legislation.
The OpenEuroLLM project, which commenced work at the start of the month, has a budget of just €37.4 million ($38.6 million): a pittance compared with the sums being invested in other AI-related projects like the $100 billion first tranche of the U.S.’s Stargate AI infrastructure project. Although participating companies such as Germany’s Aleph Alpha and Finland’s Silo AI are also contributing their researchers’ time to an equivalent value, the bulk of the funding comes from the European Commission.
EU-funded projects don’t tend to move fast, and this one has a three-year road map in a sector that changes significantly from month to month. But organizers and participants tell Fortune that it could be possible to deliver an intermediate model within a year, and that the effort will be worth it.
Speaking in tongues
“Most model development efforts that have worldwide visibility focus on the English language,” said Yasser Jadidi, chief research officer at Aleph Alpha. “It’s a consequence of most of the internet text data that is available and accessible being in English, and it puts other languages at a disadvantage.”
For people in places like Sweden or Turkey, the lack of AI models that understand the intricacies of their languages can be a serious problem. (Beyond the EU’s 24 official languages, the OpenEuroLLM project is also targeting the tongues of eight countries that have applied for EU membership, for a total of 32 languages.) For a start, that lack makes it harder for local companies and public authorities to adopt the technology and start providing new services.
“It’s first and foremost a commercial question,” said Peter Sarlin, the CEO of Silo AI, Europe’s largest private AI lab, which was acquired by AMD last year and is participating in OpenEuroLLM. “Are there models that are performant in that specific low-resource language, be it Albanian or Finnish or Swedish or some other, that allows companies within that region to eventually build services on top?”
The issue also has consequences for evaluating the accuracy and safety of AI models in the local context, Jadidi said. Indeed, Aleph Alpha’s role in the project is chiefly to provide AI-model evaluation benchmarks that aren’t simply machine-translated from English, as most are.
The OpenEuroLLM project may have relatively meager funding, but it isn’t starting from scratch.
Most of its participants have already been involved in a separate scheme called High Performance Language Technologies (HPLT), which started two years ago with a budget of just €6 million. The original proposal was for HPLT to deliver AI models, but then OpenAI’s ChatGPT changed the AI landscape and the organizers pivoted to creating a high-quality dataset that can be used to train multilingual models. The HPLT dataset is currently being “cleaned” of errors, and it will form the basis of OpenEuroLLM’s work.
OpenEuroLLM will create a base model trained on a dataset of all the European languages. Once that’s done, yet another EU-funded project, called LLMs4EU, will fine-tune it for various applications. Apart from cash, the EU is also providing computational resources to all these schemes.
Sticking to the rules
Europe is not the easiest place for AI companies to do business. Quite apart from the AI Act that is gradually coming into force, placing all sorts of reporting responsibilities on model providers and their customers, there’s also copyright and competition law to consider—and the General Data Protection Regulation (GDPR), which places strict limits on the personal data that AI companies can use.
These laws have had real effects on AI’s European progress, with Meta delaying the rollout of Meta AI because of GDPR limits, and Apple also delaying the deployment of Apple Intelligence because of unspecified antitrust issues. (Apple Intelligence will come to EU iPhones in limited form in April, while Meta has started offering some Meta AI features to European wearers of its smart glasses.)
As far as OpenEuroLLM’s organizers are concerned, these laws are manageable. “We believe we can live with all of them,” said Jan Hajič of Charles University in Czechia, who is co-leading the project with Sarlin.
Hajič said the participants had already dealt with the copyright and most privacy issues when developing the HPLT dataset. “The GDPR could be a problem, but that’s something we are trying to get around with pseudonymizing the data, meaning that if we encounter people’s names it gets deleted,” he said, while acknowledging that the necessary automation in this process may not have a 100% success rate.
“Our goal is to do things in such a way that they will not clash with the European regulation in any way,” Hajič said, adding that this could be a draw for companies wanting to target EU markets. For high-risk use cases that will require a lot of reporting to the EU authorities under the AI Act, the open-source approach will be essential for the transparency it allows, he argued.
The OpenEuroLLM project has 20 participants, including companies, research institutions, and high-performance computing clusters like Finland’s Lumi. A consortium of that size could be seen as a liability, given the potential for diverging priorities, but Aleph Alpha’s Jadidi argued that open-source projects often include a wide array of participants without being dragged down.
“We have all the opportunity to ensure that a high amount of contributors is not a hindrance but an opportunity,” he said.