Where does the data used by generative AI come from?
Training the algorithms that make AI ‘think’ requires a large amount of data and computational power. There is some doubt as to whether the exponential growth of recent years can be sustained.
For the past two years, many of us have embraced generative AI tools (ChatGPT, Copilot, Midjourney...). Without fully understanding how they work, and viewing them almost like magic tricks, we ask them questions and have them draft a text, a speech, or a translation, organize information into a table, or turn text into an image. Moreover, we've witnessed how, day by day, the accuracy, quality, and performance of these tools have improved exponentially.
In the race to use these tools, in the satisfaction of seeing that they allow us to accomplish in minutes what used to take hours, perhaps we've forgotten to ask ourselves a few questions.
- Where does the data these tools use come from?
- Is there a limit to their use?
- How do they improve so quickly? And will they continue to improve at this pace?
Let’s take a moment to reflect on these questions.
Where does the data come from?
For tools like ChatGPT or Copilot to work as they do, they need to be trained on vast amounts of data. We often hear, “They use everything on the internet.” But where exactly do they get it?
In 2007, the Common Crawl foundation was established in California with the goal of giving anyone access to all the data on the internet. The foundation maintains an open and free repository of web-crawled data that anyone can use. Every three months, it “downloads the internet”: it crawls the entire web (which, measured in tokens, amounts to about five trillion tokens), organizes the data (removing duplicates, non-entry pages, etc.), and makes it freely available to anyone who wishes to use it.
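As a rough illustration of the “organize” step, here is a minimal Python sketch that removes duplicate pages and counts tokens. It is a toy under stated assumptions, not Common Crawl's actual pipeline: the function names (`normalize`, `deduplicate_pages`, `count_tokens`) are hypothetical, duplicates are detected only when pages match after lowercasing and collapsing whitespace, and a “token” is approximated as a whitespace-separated word, whereas real tokenizers split text into subword units.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def deduplicate_pages(pages: list[str]) -> list[str]:
    """Keep the first occurrence of each page, comparing by content hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for page in pages:
        digest = hashlib.sha256(normalize(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


def count_tokens(pages: list[str]) -> int:
    """Crude token count: whitespace-separated words (an assumption)."""
    return sum(len(page.split()) for page in pages)


if __name__ == "__main__":
    crawled = [
        "Generative AI needs data.",
        "generative   AI needs data.",  # same page, different spacing and case
        "Supercomputers need chips.",
    ]
    unique = deduplicate_pages(crawled)
    print(len(unique), "unique pages,", count_tokens(unique), "tokens")
```

Hashing the normalized text keeps memory use low, since only a fixed-size digest is stored per page rather than the page itself. Near-duplicates (pages that differ by a few words) would need a fuzzier technique than this sketch uses.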
This is where organizations in need of data to train their algorithms turn. In addition to this data, which represents between 70% and 80% of the total used, many organizations also reach agreements with repositories (media outlets, news agencies...) to obtain additional “closed” data and complete the training of their algorithms.
Isn't it fascinating that these tools, which seem magical, rely on data downloaded by a foundation and made available openly and freely?
Is there a limit to generative AI growth?
AI is not new. In fact, its origins are thought to date back to the 1950s. One reason for its relatively slow development up until the last decade was that, for AI systems to function at their full potential, they need enormous computational power. It's important to remember that generative AI is a process of correlating data, which means it needs to simultaneously compute correlations between millions of data points in order to provide an answer.
A simple question posed to ChatGPT requires millions of calculations to correlate the words used in its response and to display and arrange them in a coherent way.
After a period of nearly linear growth in computing power, between 2012 and 2014 there was an exponential increase in computational capacity and the development of supercomputers, allowing total capacity to nearly double each year.
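To get a feel for what “nearly double each year” means compared with linear growth, the arithmetic can be sketched in a few lines of Python. The function names and the linear baseline (adding one unit of the starting capacity per year) are assumptions chosen for illustration, not figures from the sources cited here.

```python
def doubling_factor(years: int) -> int:
    """Capacity multiple after `years` of doubling once per year."""
    return 2 ** years


def linear_factor(years: int) -> int:
    """Capacity multiple if one unit of the starting capacity
    is added each year (hypothetical linear baseline)."""
    return 1 + years


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(
            f"after {years:>2} years: "
            f"linear x{linear_factor(years)}, "
            f"doubling x{doubling_factor(years)}"
        )
```

Under these assumptions, ten years of annual doubling multiplies capacity by 1,024, while the linear baseline multiplies it by only 11.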
But now, as Pep Martorell, director of the Barcelona Supercomputing Center (home to MareNostrum 5, the supercomputer with the greatest computational power in Europe), explains, it is logical to think that this evolution physically cannot continue. With current technology, it is not possible to build chips (the foundation that enables computing) that can keep pace with the current growth in computational capacity (currently around 10^6 parameters).
The conclusion is that we cannot expect massive evolution in the next two to four years at the same speed we’ve seen in the last three. As Dr. Martorell points out, we may be entering a period where we see specialization. This powerful generative AI will stop racing ahead like a runaway horse and begin to specialize, offering specific solutions for various sectors, areas of research, or even for the development of products and services.