AI ‘gold rush’ for chatbot training data could run out of human-written text

According to a new study released by Epoch AI, artificial intelligence systems such as ChatGPT could soon run short of the resource that fuels their intelligence: the vast trove of human-written text available online. The study projects that tech companies will exhaust the supply of publicly available training data for AI language models sometime between 2026 and 2032, and its authors compare the situation to a “literal gold rush” that depletes finite natural resources. Tamay Besiroglu, one of the study’s authors, says the AI field may struggle to maintain its current pace of progress once it has drained the reserves of human-generated writing.

In response to the looming shortage, companies like OpenAI and Google are racing to secure, and sometimes pay for, high-quality data sources to train their AI language models, signing deals to tap the steady flow of sentences coming out of Reddit forums and news outlets.
In the longer term, though, there won’t be enough new blogs, news articles and social media commentary to sustain the current trajectory of AI development. That could pressure companies to tap into sensitive data now considered private, such as emails or text messages, or to rely on less-reliable “synthetic data” generated by the chatbots themselves.

Besiroglu highlighted the seriousness of this bottleneck, stating that once data constraints are reached, it becomes difficult to efficiently scale up AI models. Scaling up models has been crucial for expanding capabilities and enhancing the quality of AI output.
The researchers first made their projections two years ago, shortly before ChatGPT’s debut, forecasting that high-quality text data would run short by 2026. Much has changed since then, including new techniques that let researchers make better use of the data they already have and, in some cases, “overtrain” on the same sources multiple times.

Those approaches have their limits, however, and after further investigation, Epoch’s researchers now expect public text data to run out sometime within the next two to eight years.

The team’s latest study is peer-reviewed and due to be presented at this summer’s International Conference on Machine Learning in Vienna, Austria. Epoch is a nonprofit institute hosted by San Francisco-based Rethink Priorities and funded by proponents of effective altruism, a philanthropic movement known for its focus on mitigating AI’s worst-case risks.
According to Besiroglu, AI researchers realized more than a decade ago that aggressively expanding two key ingredients, computing power and the amount of internet data used for training, could greatly improve the performance of AI systems.

Epoch’s study finds that the amount of text data fed into AI language models has been growing about 2.5 times per year, while computing power has grown about 4 times per year. For instance, Meta Platforms, the parent company of Facebook, recently claimed that its upcoming Llama 3 model, which has not yet been released, was trained on up to 15 trillion tokens, each of them a fragment of a word.
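As a rough illustration of how those growth rates translate into a depletion date, here is a minimal back-of-envelope sketch that compounds an assumed starting data appetite against an assumed fixed stock of public text. The 15-trillion-token starting point echoes the Llama 3 figure above, but the 2.5x growth rate is applied naively and the stock of usable public text is a purely hypothetical placeholder, not Epoch’s estimate.

```python
# Back-of-envelope sketch: how quickly data use growing 2.5x per year
# would consume a fixed stock of public text. The starting figure echoes
# the Llama 3 number cited above; the assumed stock is a hypothetical
# placeholder, not Epoch's estimate.

TOKENS_PER_RUN_2024 = 15e12   # ~15 trillion tokens used by a frontier model
ANNUAL_GROWTH = 2.5           # observed yearly growth in training data use
ASSUMED_PUBLIC_STOCK = 5e14   # hypothetical stock of usable public text, in tokens

year, tokens = 2024, TOKENS_PER_RUN_2024
while tokens < ASSUMED_PUBLIC_STOCK:
    year += 1
    tokens *= ANNUAL_GROWTH
    print(f"{year}: a frontier training run would want ~{tokens:.2e} tokens")

print(f"Under these assumptions, demand overtakes the assumed stock around {year}.")
```

With these illustrative numbers the crossover lands in the late 2020s, inside the study’s 2026-to-2032 window; the point of the exercise is only that exponential growth against a fixed stock runs out quickly, not the exact year.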

However, the extent to which we should be concerned about the data bottleneck remains a subject of debate.
Nicolas Papernot, an assistant professor of computer engineering at the University of Toronto and a researcher at the nonprofit Vector Institute for Artificial Intelligence, argues that it is not necessary to keep training ever-larger models; more capable AI systems can also come from training models that are specialized for specific tasks.

Papernot does, however, worry about training generative AI systems on the outputs they themselves produce, which can lead to the degraded performance known as “model collapse.” He compares training on AI-generated data to photocopying a photocopy: some information is lost each time. His research has also found that the process can further entrench the mistakes, bias and unfairness already present in the information ecosystem.
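Papernot’s photocopy analogy can be made concrete with a toy simulation: fit a very simple “model” to data, then train each subsequent generation only on samples drawn from its predecessor. The sketch below does this with a small categorical distribution; it is a deliberately minimal illustration of one facet of model collapse, the loss of rare outcomes, and not a reproduction of Papernot’s experiments.

```python
import random
from collections import Counter

# Toy illustration of "model collapse": each generation of a simple
# categorical "language model" is fit only to samples drawn from the
# previous generation. Rare outcomes that happen not to appear in a
# sample get probability zero and can never return, so the tail of the
# original distribution gradually erodes. A minimal sketch, not a
# reproduction of any published experiment.

random.seed(0)
VOCAB = list("abcdefghij")
# "Human" data distribution: a few common items and a tail of rare ones.
weights = [0.30, 0.25, 0.15, 0.10, 0.07, 0.05, 0.03, 0.02, 0.02, 0.01]
SAMPLES_PER_GENERATION = 60

for generation in range(15):
    # Draw a finite corpus from the current model (or the real data at gen 0).
    corpus = random.choices(VOCAB, weights=weights, k=SAMPLES_PER_GENERATION)
    counts = Counter(corpus)
    # Refit the next generation's model to its predecessor's output.
    weights = [counts[v] / SAMPLES_PER_GENERATION for v in VOCAB]
    surviving = sum(1 for w in weights if w > 0)
    print(f"generation {generation:2d}: {surviving}/10 vocabulary items still possible")
```

Run for enough generations and the fitted distribution keeps only the most common items, an information loss analogous to the fading detail in repeated photocopies.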
With real, human-made sentences still a vital ingredient, the stewards of the most sought-after troves of such data, including Reddit and Wikipedia as well as news and book publishers, have been forced to think hard about how it is being used.

Selena Deckelmann, the chief product and technology officer at the Wikimedia Foundation, which operates Wikipedia, jests, “Perhaps we shouldn’t be chopping off the peaks of every mountain.” She adds, “We are currently grappling with the interesting issue of treating human-created data as a natural resource. I shouldn’t find it amusing, but I can’t help but be amazed by it.”
While some have chosen to wall off their data from AI training, often after it has already been taken without compensation, Wikipedia has placed few restrictions on how AI companies use its volunteer-written entries. Still, Deckelmann said she hopes incentives remain for people to keep contributing, especially as a flood of cheap and automatically generated “garbage content” starts polluting the internet.

Deckelmann said AI companies need to be concerned with how human-generated content continues to exist and remain accessible.

From the perspective of AI developers, Epoch’s study says, paying millions of humans to generate the text that AI models will need is unlikely to be an economical way to drive better technical performance.
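A rough calculation suggests why. The sketch below prices out a single model’s worth of text at a deliberately low per-word rate; the rate and the token-to-word conversion are hypothetical round numbers chosen for illustration, not figures from Epoch’s study.

```python
# Back-of-envelope sketch of why paying people to write training text does
# not scale. All rates and conversions are hypothetical round numbers for
# illustration; none come from Epoch's study.

TOKENS_NEEDED = 15e12        # ~15 trillion tokens, the Llama 3 figure cited earlier
WORDS_PER_TOKEN = 0.75       # common rough conversion for English text
PAY_PER_WORD_USD = 0.01      # a deliberately low hypothetical writing rate

words_needed = TOKENS_NEEDED * WORDS_PER_TOKEN
cost = words_needed * PAY_PER_WORD_USD

print(f"Words needed: {words_needed:.2e}")
print(f"Cost at 1 cent per word: ${cost:,.0f}")
# Even at 1 cent per word, commissioning that much text would run to
# roughly a hundred billion dollars for one model's training set.
```

The exact figures are beside the point; the order of magnitude is what makes commissioned human writing an unlikely replacement for scraped text.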
Speaking at a recent United Nations event, OpenAI CEO Sam Altman discussed the company’s plans for training the next generation of its GPT large language models and said OpenAI has already experimented with generating large amounts of synthetic data for training.

Altman emphasized the importance of high-quality data in training AI models, noting that there can be both low-quality synthetic data and low-quality human data. However, he expressed concerns about relying too heavily on synthetic data as a primary method for improving AI models.

Altman questioned the efficiency of simply generating an enormous amount of synthetic data, a quadrillion tokens, say, and feeding it back in for training, suggesting that there are likely more effective ways to improve AI models than relying on synthetic data alone.