An introduction to Data Mining
By Davide Pagin
Data Mining is one of the most popular concepts of this era. Almost everyone who has studied a scientific subject has met it, and even people who don’t study or work with data have probably heard of it. So the questions are: what is Data Mining, and why is it so popular nowadays?
Obviously, it is a very broad theme, and for this reason there are many ways to define it. Before trying to describe it, it is better to inspect the two elements that compose the expression, starting with the meaning of data. Some mathematical or statistical knowledge helps here, because it keeps us from making mistakes when we think about what data are. First, numbers are not data, but data could be numbers.
This last sentence makes two points: a number is a datum only if it is used to describe something, and there are different types of data. In fact, data could also be words or expressions, so the main types of data are quantitative and qualitative. More specifically, quantitative data are continuous when they refer to some type of measurement (e.g., the height of a person is a continuous datum) and discrete when they indicate a count of something (e.g., the number of students with black hair in a classroom is a discrete datum).
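As a minimal sketch of this distinction, the snippet below describes a hypothetical classroom. The `Student` record, its field names and all the values are invented for illustration; they only show how qualitative, continuous and discrete data can sit side by side.

```python
from dataclasses import dataclass


@dataclass
class Student:
    name: str          # qualitative: a label
    hair_colour: str   # qualitative: a category
    height_cm: float   # quantitative and continuous: a measurement


# Hypothetical records, invented for this example.
classroom = [
    Student("Anna", "black", 162.5),
    Student("Luca", "brown", 175.0),
    Student("Sara", "black", 168.2),
]

# A count of something is a quantitative, discrete datum.
black_haired = sum(1 for s in classroom if s.hair_colour == "black")

# A summary of measurements stays on a continuous scale.
mean_height = sum(s.height_cm for s in classroom) / len(classroom)

print(black_haired, round(mean_height, 1))
```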
The second term, mining, represents what we want to do with our data in order to extract some kind of information. In fact, there is a substantial difference between data and information. The former is a collection of facts or numbers, which we could call “raw information”, whereas the latter requires that we interpret and summarize the data and understand their meaning in some context.
At this point, a doubt may arise. The term mining suggests something difficult to reach, but why would extracting information from data be so hard? The answer is that people who work with data usually have to face huge amounts of it, what is typically called Big Data.
“Big Data” is another popular expression, like Data Mining; however, the concept is more recent. In fact, Big Data is closely connected to explosive technological advancement, in particular to the growing ability to collect and store a wide range of data. In the graph below, created with Google Books Ngram Viewer, we can see how prominent the expressions Data Mining and Big Data were from 1990 to 2021 within a collection of documents. For each year, the Ngram Viewer plots the relative frequency of an expression, that is, its occurrences as a percentage of all expressions of the same length found in the books of that year. A related measure of the importance of a term (or an expression) in a corpus of texts is the weighting function called “tf-idf”. Its value grows proportionally with the number of times the term appears within a document, but it decreases as the number of documents in the collection that contain the term rises.
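To make the behaviour of tf-idf concrete, here is a minimal sketch of one common variant, which is an assumption of this example: term frequency is the share of a document’s words equal to the term, and inverse document frequency is the logarithm of the number of documents divided by the number of documents that contain the term. The function name `tf_idf` and the three-sentence corpus are hypothetical.

```python
import math


def tf_idf(term, document, corpus):
    """tf-idf of `term` in `document`, relative to `corpus` (a list of token lists)."""
    # Term frequency: share of the document's tokens equal to the term.
    tf = document.count(term) / len(document)
    # Document frequency: number of documents that contain the term.
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # the term never occurs in the corpus
    # Inverse document frequency: terms found in fewer documents weigh more.
    idf = math.log(len(corpus) / df)
    return tf * idf


# Hypothetical toy corpus, invented for this example.
corpus = [
    "data mining extracts patterns from data".split(),
    "big data needs scalable storage".split(),
    "mining companies extract minerals".split(),
]

score_data = tf_idf("data", corpus[0], corpus)
score_patterns = tf_idf("patterns", corpus[0], corpus)
print(score_data, score_patterns)
```

In this toy corpus, “data” occurs twice in the first document but also appears in a second one, while “patterns” occurs once and only there, so “patterns” receives the higher weight.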
So, Big Data means data in large quantity and/or high dimensionality. That happens when we collect data from many different statistical units (for example, different people) and record many characteristics (which we call variables in statistics) for each unit. To give an illustrative example, there is a considerable difference between examining data from one thousand people with five characteristics and examining data from one million people with one thousand characteristics. In particular, in statistics we usually want to examine the relation between a dependent variable and some known features (independent variables). When there are few features, it is relatively easy to detect their relation with the dependent variable: for every feature, we can draw some kind of graph (e.g., a scatterplot or a violin plot) to obtain useful information, and then, to understand the complete process, we can fit a simple model to identify the way in which the independent variables influence the dependent variable.
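As an illustration of such a simple model, the sketch below fits a straight line by ordinary least squares with one independent variable. The choice of a linear model, the function name `fit_line` and the data (hours studied against exam score) are assumptions made for this example only.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariation of x and y, and variation of x, around their means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: one independent variable and one dependent variable.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [52.0, 55.0, 61.0, 64.0, 68.0]

intercept, slope = fit_line(hours, score)
print(intercept, slope)
```

The slope estimates how much the dependent variable changes, on average, when the feature increases by one unit. With only a handful of features, a scatterplot and a fit like this can be inspected one variable at a time.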
The more the number of features increases, the more difficult it becomes to choose which variable to plot against the dependent variable. Furthermore, choosing the correct model becomes harder, and often, to describe the phenomenon (or make a statistical forecast) properly, it is necessary to adopt more complicated models. In such cases, machine learning methods can be essential.
There are two more aspects that we haven’t mentioned yet. Firstly, one of the most common problems with many independent variables is that many of them often carry little information, so we have to apply some statistical method to remove them. Secondly, in data mining problems we may not have just a fixed collection of data to analyse; rather, we may have to deal with a continuous stream of data. As a consequence, the model chosen for the current data may not be adequate when new data arrive. These are the reasons why scalability is essential, where by scalability we mean the capacity to adapt to new data.
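One simple way to picture adapting to a stream is a summary that is updated one observation at a time, without storing or re-reading the old data. The sketch below uses Welford’s online algorithm for the mean and the sample variance; it stands in for a full model only as an illustration, and the class name `RunningStats` and the stream values are hypothetical.

```python
class RunningStats:
    """Mean and sample variance updated incrementally (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two observations have arrived."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


# Hypothetical stream of values arriving one at a time.
stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

print(stats.n, stats.mean, stats.variance)
```

Each update costs the same small amount of work however much data has already been seen, which is the property a method needs in order to keep up with a stream.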
We are now in a position to define the expression “Data Mining”: it is the work of analyzing large volumes of raw data and trying to identify patterns in them, with the intent of extracting useful information to help companies that work with data.
However, we have to be careful in the data mining process. In fact, one of the most typical errors that data scientists make is to extract distorted information, such as patterns that appear in the data only by chance, or information that is not helpful for the aims of the company.
“If you mine the data hard enough, you can also find messages from God”
Dogbert, Dilbert Comic Strip (Jan 3, 2000)
Data Mining and its neighbours
Data Mining is frequently confused with other processes that deal with data, and for this reason other expressions are used as synonyms of Data Mining. In particular, Knowledge Discovery in Databases (KDD) is often considered the same concept as Data Mining. However, KDD is a larger concept that also covers activities outside Data Mining itself. In fact, KDD includes other essential procedures to identify the data of interest:
- The creation of a Data Warehouse, that is, a strategic database obtained by merging different types of databases, each one of them with different data
- The application of a series of software tools called On-Line Analytical Processing (OLAP) to carry out a first form of descriptive analysis: querying the Data Warehouse and extracting multiway tables in which every cell shows the frequency of a combination of categories of the variables included in the table
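The multiway tables mentioned in the second point can be sketched in a few lines. The example below counts how often each combination of categories occurs in a handful of hypothetical warehouse rows; it is not an OLAP tool, and the rows, the function name `multiway_table` and the choice of dimensions are invented for illustration.

```python
from collections import Counter

# Hypothetical warehouse rows: (region, product, channel).
rows = [
    ("North", "Laptop", "Online"),
    ("North", "Laptop", "Store"),
    ("South", "Phone", "Online"),
    ("North", "Laptop", "Online"),
    ("South", "Laptop", "Online"),
]


def multiway_table(records, dims):
    """Count how often each combination of values occurs for the chosen columns."""
    return Counter(tuple(record[d] for d in dims) for record in records)


two_way = multiway_table(rows, dims=(0, 1))       # region x product
three_way = multiway_table(rows, dims=(0, 1, 2))  # region x product x channel

for cell, frequency in sorted(two_way.items()):
    print(cell, frequency)
```

Each key is one cell of the table and each value is its frequency; adding a column index to `dims` adds one more way to the table.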
In conclusion, Data Mining is a crucial concept that everyone should be familiar with because, as stated in an article in “The Economist”, the world’s most valuable resource is no longer oil but data. Extracting information from data has become a fundamental task, and for this reason the popularity of the expression “data mining” has increased over the last few years.
However, Data Mining, being a broad concept, has no clear boundaries. So our aim has been to help people who are not very familiar with the world of data to understand what data is and, furthermore, what it means to work with it nowadays.