Where does the data used by generative AI come from?


01 October 2024 4 min read

Training the algorithms that make AI ‘think’ requires a large amount of data and computational power. There is some doubt as to whether the exponential growth of recent years can be sustained.

Marc Cortés

For the past two years, many of us have embraced generative AI tools (ChatGPT, Copilot, Midjourney...). Without fully understanding how they work, viewing them almost like magic tricks, we ask them questions and make requests to create a text, a speech, a translation, to organize information into a table, or to represent text in image format. Moreover, we've witnessed how, day by day, the accuracy, quality, and performance of these tools have improved exponentially.

In the race to use these tools, in the satisfaction of seeing that they allow us to accomplish in minutes what used to take hours, perhaps we've forgotten to ask ourselves a few questions.

Where does the data these tools use come from?
Is there a limit to their use?
How do they improve so quickly? And will they continue to improve at this pace?

Let’s take a moment to reflect on these questions.

Where does the data come from?

For tools like ChatGPT or Copilot to work as they do, they need to be trained on vast amounts of data. We often hear, “They use everything on the internet.” But where exactly do they get it?

In 2007, the Common Crawl foundation was established in California with the goal of giving anyone access to all the data on the internet. The foundation maintains an open and free repository of web-crawled data. Every three months, it "downloads the internet": it crawls the entire web (about five trillion tokens), organizes the results (removing duplicates, non-entry pages, etc.), and makes them freely available to anyone who wishes to use them.
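To make the "removing duplicates" step concrete, here is a minimal sketch in Python. It is an illustration only: the article does not say how duplicates are actually detected, so the exact-match hashing below, along with the names `normalize` and `deduplicate`, are assumptions rather than a description of Common Crawl's real pipeline.

```python
import hashlib


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies match.
    return " ".join(text.lower().split())


def deduplicate(pages: list[str]) -> list[str]:
    # Keep the first occurrence of each page, comparing by a hash of the
    # normalized text (hypothetical approach, assumed for illustration).
    seen: set[str] = set()
    unique: list[str] = []
    for page in pages:
        digest = hashlib.sha256(normalize(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


if __name__ == "__main__":
    pages = ["Hello  world", "hello world", "Something else"]
    print(deduplicate(pages))
```

In this toy example the first two pages differ only in spacing and capitalization, so only one of them is kept. Real web-scale pipelines generally also need to handle near-duplicates, which exact hashing does not catch.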

This is where the organizations that need data to train their algorithms turn. In addition to this data, which represents between 70% and 80% of the total data used, many organizations also reach agreements with repositories (media outlets, news agencies...) to obtain additional “closed” data and complete the training of their algorithms.

Isn't it fascinating that these tools, which seem magical, rely on data downloaded by a foundation and made available openly and freely?

Is there a limit to generative AI growth?

AI is not new. In fact, its origins are thought to date back to the 1950s. One reason for its relatively slow development up until the last decade was that, for AI systems to function at their full potential, they need enormous computational power. It's important to remember that generative AI is a process of correlating data, which means it needs to simultaneously compute correlations between millions of data points in order to provide an answer.


A simple question posed to ChatGPT requires millions of calculations to correlate the words used in its response and to display and arrange them in a coherent way.

After a period of nearly linear growth in computing power, between 2012 and 2014 there was an exponential increase in computational capacity and the development of supercomputers, allowing total capacity to nearly double each year.
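To get a feel for what "nearly doubling each year" implies, the compounding can be worked out in a few lines of Python. This is a rough sketch: the function name `capacity_after` and the starting value of 1.0 are assumptions made for illustration, not figures from the article.

```python
def capacity_after(years: int, annual_factor: float = 2.0, start: float = 1.0) -> float:
    # Compound growth: capacity is multiplied by annual_factor once per year.
    return start * annual_factor ** years


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(years, capacity_after(years))
```

Under this assumption, ten years of doubling multiplies capacity by 1,024, roughly a thousandfold increase. That scale is what makes it hard for chip manufacturing to keep pace.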

But now, as Pep Martorell, director of the Barcelona Supercomputing Center (home to MareNostrum 5, the supercomputer with the greatest computational power in Europe), explains, it is logical to think that it is physically impossible for this evolution to continue. With current technology, it is not possible to build chips (the foundation that enables computing) that can keep pace with the current growth in computational capacity (currently around 10^6 parameters).

The conclusion is that we cannot expect massive evolution in the next two to four years at the same speed we’ve seen in the last three. As Dr. Martorell points out, we may be entering a period where we see specialization. This powerful generative AI will stop racing ahead like a runaway horse and begin to specialize, offering specific solutions for various sectors, areas of research, or even for the development of products and services.

Marc Cortés

Director of the Executive Master Digital Business


All written content is licensed under a Creative Commons Attribution 4.0 International license.
