Data: An important resource in the AI revolution
Data might have become the new oil of the 21st century.
Hi!
The other day I was thinking that many people still don’t know the value of data, which is surprising given that data is what makes AI possible. Companies like OpenAI have been collecting data for many years to train their models, creating the tools we all know today.
In the coming articles, I’ll show you the techniques these companies use to collect data and what you can do with it. But first, let’s see why data is so important nowadays.
The concept of data as a strategic asset has been gaining momentum in recent years. However, most people still don’t see the real value in data.
We know big tech companies have been collecting data for a long time. We know that year after year new regulations about the use of data are created. That said, most of us still don’t understand the impact data has on our society.
A few years ago, The Economist published an article called “The world’s most valuable resource is no longer oil, but data.” However, for regular folks, it’s still hard to understand how data can be the new oil.
Data and oil have some similarities, but also some differences. Here are some of them.
1. Data and oil need to be refined
Data and oil are rarely used in their raw state.
Unrefined oil is of little use. For oil to be useful, it has to be extracted, refined, and distributed. The same happens with data. We don’t use data as soon as it’s extracted; we have to process it before it’s ready for analysis.
Here’s how Clive Humby, the data science entrepreneur who coined the phrase “data is the new oil,” compares oil and data.
“Data is the new oil. Like oil, data is valuable, but if unrefined, it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity. So must data be broken down, analysed for it to have value.”
This is true. Once data is collected, it needs to be cleaned and transformed into the desired format. Why? Well, real-world data is messy, so there might be inaccurate or missing values that we need to deal with.
To put it simply, imagine you have collected data from a survey. You can be confident that the results obtained from the multiple-choice questions don’t need much preprocessing, but things change with the open-ended questions because people can answer whatever they want (sometimes without following a common pattern) and even leave an answer blank.
Real-world data is sometimes as messy as those open-ended questions.
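To make the survey example concrete, here’s a minimal sketch in Python of what “refining” open-ended answers could look like. The answers, the question, and the clean_answer helper are all made up for illustration, and the cleaning rules (trimming spaces, lowercasing, dropping trailing punctuation, removing an “I use” lead-in) are just assumptions about what this particular dataset might need.

```python
from collections import Counter

# Hypothetical raw answers to an open-ended survey question
# ("Which programming language do you use most?").
raw_answers = ["Python", " python ", "PYTHON!", "", "I use R", None, "r", "  "]


def clean_answer(answer):
    """Return a normalized answer, or None if the answer is missing."""
    if answer is None:
        return None
    # Trim surrounding whitespace, lowercase, and drop trailing punctuation.
    text = answer.strip().lower().rstrip("!.?")
    if not text:
        # Blank or whitespace-only answers count as missing.
        return None
    # Remove a common lead-in phrase so "I use R" and "r" end up the same.
    prefix = "i use "
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text


cleaned = [clean_answer(a) for a in raw_answers]
missing = sum(1 for a in cleaned if a is None)
counts = Counter(a for a in cleaned if a is not None)

print(counts.most_common())  # [('python', 3), ('r', 2)]
print(missing)               # 3
```

Eight messy answers turn into two clear categories plus a count of missing responses. A real survey would likely need more rules than these, but the idea is the same.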
This is why raw data isn’t enough. Only after the data is “refined” can we make the most of it by building reports, running analyses, and creating something valuable.