Sir Nigel Shadbolt

Why we need to fear the risk of AI model collapse

There is little doubt that the potential of generative AI is enormous. It has, rightly, been presented as a capability that could herald a new, tech-led era full of benefits for humanity. It can speed up mundane tasks at work, aid medical breakthroughs and analyse patterns in ways that Alan Turing and the Bletchley Park codebreakers could only have dreamt of. 

At the recent AI Safety Summit, much was made of the dangers posed by ‘Frontier AI’. There was some criticism that the conversation dwelled too much on future threats and not enough on clear and present dangers. Among those cited are large-scale job redundancies – as AIs take over tasks previously performed by humans – and biases in the data on which AIs are trained that might entrench prejudices in human decision-making. These are significant concerns, but there is another potential pitfall: the risk of model collapse. And the key to that is data.

Model collapse happens when generative AI becomes unstable, wholly unreliable or simply ceases to function. This occurs when generative models are trained on AI-generated content – or “synthetic data” – instead of human-generated data. As time goes on, “models begin to lose information about the less common but still important aspects of the data, producing less diverse outputs.”
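
To make that mechanism concrete, the short Python sketch below (purely illustrative, and not drawn from the article or the research it quotes) treats each generation of a "model" as nothing more than a table of word frequencies, retrains it each time on text sampled from the previous generation, and counts how many words survive.

```python
# A minimal, illustrative sketch (not from the article): each "generation" of a
# toy language model is just a table of word frequencies. Generation n+1 is
# trained only on text sampled from generation n. Rare words vanish quickly
# and, once gone, can never come back.
import numpy as np

rng = np.random.default_rng(42)

vocab = [f"word_{i}" for i in range(1000)]
# Generation 0: human text with a long tail (Zipf-like word frequencies).
probs = 1.0 / np.arange(1, len(vocab) + 1)
probs /= probs.sum()

for generation in range(6):
    surviving = np.count_nonzero(probs)
    print(f"generation {generation}: {surviving} of {len(vocab)} words still appear")
    # "Publish" a finite synthetic corpus, then retrain the next model on it alone.
    corpus = rng.choice(len(vocab), size=20_000, p=probs)
    counts = np.bincount(corpus, minlength=len(vocab))
    probs = counts / counts.sum()
```

Run it and the vocabulary shrinks generation by generation: exactly the loss of the "less common but still important" information described above.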

There are several scenarios where AI model collapse could occur, but almost all relate to the data on which these AI models, including well-known tools like ChatGPT, are trained.

While much is made of the vast scale of data used for this purpose, we don’t know enough about that data’s provenance and lineage. We do know, however, that most data is not AI-assured and, hence, is not trustworthy. So, the risk of model collapse is significant.

Lack of knowledge about whether training data can be trusted is problematic, but this is multiplied when you consider how AIs work and how they ‘learn’. LLMs use various sources, including news media, academic papers, books and Wikipedia. They work by training on vast amounts of text data to learn patterns and associations between words, allowing them to understand and generate coherent and contextually relevant language based on the input they receive.
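
To give a flavour of what "learning associations between words" means, here is a deliberately tiny, hypothetical sketch: a word-pair counter that continues a prompt by picking whichever word most often followed the previous one. A real LLM is incomparably larger and learns far richer patterns, but the principle of generating plausible continuations from observed data is the same.

```python
# Purely illustrative and vastly simplified compared with a real LLM: the
# "model" learns associations between words by counting which word tends to
# follow which, then uses those counts to continue a prompt.
from collections import Counter, defaultdict
import random

corpus = (
    "the model learns patterns in text . "
    "the model learns associations between words . "
    "the model generates text from those patterns ."
).split()

# Count, for every word, which words follow it and how often.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

random.seed(0)
word, output = "the", ["the"]
for _ in range(8):
    candidates = following[word]
    if not candidates:
        break
    # Pick the next word in proportion to how often it followed this one.
    word = random.choices(list(candidates), weights=list(candidates.values()))[0]
    output.append(word)
print(" ".join(output))
```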

They can answer questions on anything from how to build a website to how to treat a kidney infection. The assumption is that such advice or answers will become better and more nuanced over time as the AI learns, technology advances and more data is used for training. However, if the data feeding the generative AI exaggerates certain features and minimises others, existing prejudices and biases will be increasingly amplified.

Additionally, if the data lacks specific domains or diverse perspectives, the model may exhibit a limited understanding of certain topics, further contributing to its collapse. For example, consider future news reports that may be partially or wholly written by AI, and Wikipedia articles edited by – or with input from – AI, and you can see the beginning of a cycle that could lead to model collapse. When an AI is subsisting on a diet of AI-flavoured content, the quality and diversity of that content are likely to decrease over time.

There have already been discussions and research on perceived problems with ChatGPT, particularly how its ability to write code may be getting worse rather than better. This could be down to the fact that the AI is trained on data from sources such as Stack Overflow, and users have been contributing to the programming forum using answers sourced from ChatGPT. Stack Overflow has now banned the use of generative AIs in questions and answers on its site.

Model collapse could have serious consequences if we have become reliant on the AIs that are affected, including everything from job or financial losses to increased bias and data breaches. That said, there are solutions and mitigations, which also lie in the data. The first is a strong AI data infrastructure, ensuring that new data comes from reliable sources that do not use recursive or recycled data likely to pollute the model. This is an opportunity to make better, stronger, more stable AIs that will benefit society in the long term. Secondly, we need openness about the fine-grained details of the data sources used in training – details that expert users can assess.

This would help with the robustness of models and encourage collaboration alongside increased trust. Thirdly, continued research is needed on the effects of ablating or removing particular data from the models and seeing what the impact on output quality is. We are currently seeing increasing varieties of models of different compositions and sizes - for example, smaller models for specialist applications.

Such models could have highly specific applications or areas of work; they could use data sets that have been assessed against data ethics standards and developed with collaborative oversight and human feedback, whether that be from, for example, medical professionals, statisticians or software engineers. This would encourage greater data literacy, which, as I have recently argued, is essential for a world where AI is not going away. This means new data leaders who, on a daily basis, can successfully plot a path through the early stages of our interaction with generative AI.

Finally, we need a strong commitment to displaying and encoding the provenance of data. If content has been machine-generated, it should carry that imprint - I have long argued that this is a crucial part of navigating our AI-augmented future: “a thing should say what it is and be what it says”. Not least because, as the authors of the original model collapse paper observe, “the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.”
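
What might "carrying that imprint" look like in practice? The sketch below is a hypothetical illustration, not an existing standard: a small, machine-readable provenance record that says whether a piece of content was human- or machine-generated, names its source, and includes a hash tying the record to the exact text it describes.

```python
# Illustrative only - not a real provenance standard. The idea: published
# content carries a small, machine-readable record saying what it is (human-
# or machine-generated) and where it came from, plus a hash that binds the
# record to the exact text it describes.
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(text: str, generated_by: str, source: str) -> dict:
    return {
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "generated_by": generated_by,   # e.g. "human" or "machine"
        "source": source,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

article = "Example paragraph that may later end up in a training set."
record = provenance_record(article, generated_by="machine", source="example-llm")
print(json.dumps(record, indent=2))
```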

AI is far more likely to be a useful friend than a destructive enemy, but we need to question and interrogate its development as our reliance on it grows. That way, we can work with it thoughtfully and sensibly rather than simply being let down when it doesn’t all work out quite as we expected. 

Sir Nigel Shadbolt is Executive Chair of the Open Data Institute, which he co-founded with Sir Tim Berners-Lee. He is Principal of Jesus College, Oxford, a Professor of Computer Science at the University of Oxford and a Visiting Professor of Artificial Intelligence at the University of Southampton.
