Disclaimer: This post does not involve actual Fuzzy Logic. The term was originally intended as just a pun, but I later realised that it also demonstrates how the improper use of buzzwords can be quite misleading. Apologies for any confusion caused.
With the growing popularity of Data Science, many buzzwords have been loosely thrown around without a proper understanding of what they truly mean. Some of these buzzwords include terms like Data Analytics, Big Data, Artificial Intelligence and Machine Learning. But unlike the terms mentioned in the posts on ANCOVA, Moderation and The Confusion Matrix, many of these terms in Data Science are not actually interchangeable. This post attempts to explain the subtle differences between these buzzwords, so that we can all speak a common language that is less confusing.
Introducing The Data Science Fuzzy Buzzy
Despite its recent surge in popularity, Data Science is actually not a new field. If we rephrase Data Science as the “Science of Data”, it becomes apparent that Data Science is just the formalisation of all things related to data. But performing the science of data does not only involve analysing data; collecting, cleaning and preparing the data are all important sub-disciplines of Data Science, and each one can be a whole area of specialisation on its own. In other words, the broad definition of a Data Scientist should include anyone who deals with data, even those who only use spreadsheets to manage data and make simple calculations (although such an association would probably be frowned upon by more sophisticated Data Scientists with larger egos :P).
Let’s now imagine the hierarchy of terms in Data Science as a Fuzzy Buzzy (which is in fact just a cute Venn diagram), from the broadest in the outermost circle to the most specific in the innermost circle. It is also a nice analogy to think of Data Science as a Fuzzy Buzzy, using different techniques to collect pollen and nectar from data flowers, and subsequently converting them to honey insights! To simplify matters, we will only focus on the area of Data Analysis in Data Science.
Some might be wondering, what exactly is the difference between Data Analysis and Data Analytics? Data Analysis is a general term to describe the process of inspecting data with the intention to obtain some insight. Because data can be either qualitative or quantitative, Data Analysis can also be conducted in different ways. Qualitative analysis aims to understand intangibles such as underlying reasons and motivations through interpretive techniques; while quantitative analysis aims to make quantifiable measurements using computing techniques such as Data Analytics. In other words, Data Analytics is the collection of techniques that allows Data Analysis to be conducted quantitatively.
That puts Data Mining as one of the many techniques used in Data Analytics, among other well-known ones such as Machine Learning and Statistical Analysis. While the three techniques seem similar and do overlap, their end-goals are quite different. Data Mining simply focuses on looking for patterns in the data; Machine Learning cares more about making accurate predictions using information from the data; Statistical Analysis is most interested in drawing inferences from the data. It is inappropriate to use these techniques for purposes other than their own, but such a mistake is still often committed by researchers and analysts alike, resulting in fallacious conclusions. More on this will be discussed in a later section.
You might ask, where does Artificial Intelligence come in then? Artificial Intelligence (AI) is a general description of the processes that mimic human intelligence in the form of “learning” and “problem solving” (not that human intelligence is the epitome of intelligence, but we’ll just have to work with that). AI has existed for many years, most commonly known in the form of non-player characters in computer games. But most of that AI has been hard-coded using if-else conditions, which is not an accurate depiction of human cognition. In recent years, however, advances in machine learning, especially in reinforcement learning (e.g. AlphaGo beating professional Go players), have allowed AI to come closer to the way humans process information. Hence, when we talk about AI in Data Science, we are actually referring to the use of seemingly-intelligent algorithms to process and analyse data, and machine learning is the most common method for achieving that.
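To make the contrast between hard-coded and learned behaviour a little more concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the function names, the "attack"/"idle" actions and the cut-off of 5.0 are all assumptions of mine, and the "learning" is just a one-rule decision stump, nowhere near what AlphaGo does. The first function has its rule typed in by a programmer; the second has to work the rule out from labelled examples.

```python
import random

def hard_coded_npc(distance):
    """Rule written by a programmer: the cut-off 5.0 is fixed in the source code."""
    return "attack" if distance < 5.0 else "idle"

def learn_threshold(distances, actions):
    """Learn a cut-off from examples by picking the candidate with the fewest mistakes.

    Candidates are the midpoints between neighbouring observed distances
    (a one-rule "decision stump"; a hypothetical helper for this sketch).
    """
    points = sorted(set(distances))
    candidates = [(lo + hi) / 2 for lo, hi in zip(points, points[1:])]

    def errors(threshold):
        return sum(
            ("attack" if d < threshold else "idle") != a
            for d, a in zip(distances, actions)
        )

    return min(candidates, key=errors)

def learned_npc(distance, threshold):
    """Same decision as the hard-coded rule, but the cut-off comes from data."""
    return "attack" if distance < threshold else "idle"

# Simulated observations: the hard-coded rule stands in for "observed behaviour".
rng = random.Random(1)
distances = [rng.uniform(0, 10) for _ in range(100)]
actions = [hard_coded_npc(d) for d in distances]

threshold = learn_threshold(distances, actions)
print(f"learned cut-off: {threshold:.2f}")
```

The learned cut-off should land close to 5.0 without that number ever appearing in the learning code, which is the whole point: the rule comes from the data rather than from the programmer.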
That pretty much sums up the structure of the Data Science Fuzzy Buzzy, and clarifies that AI is actually achieved through machine learning and is not really a technique by itself. But wait! What about Big Data? Big Data are merely datasets that are so large that specialised techniques are required to process and analyse them. Other than that, they are just like any other data flower, waiting for the Data Science Fuzzy Buzzy to collect pollen and nectar from them.
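As a small taste of what a “specialised technique” might look like, here is a minimal sketch of a one-pass (streaming) mean and variance, using the update rule usually credited to Welford. This is my own illustration rather than any particular Big Data tool, and the function name is hypothetical. The idea is that the data are consumed one value at a time, so the full dataset never needs to fit in memory.

```python
def running_mean_variance(stream):
    """Return (count, mean, population variance) after a single pass over `stream`.

    Only three numbers are kept in memory, no matter how long the stream is.
    """
    count = 0
    mean = 0.0
    sum_sq_dev = 0.0  # running sum of squared deviations from the current mean
    for value in stream:
        count += 1
        delta = value - mean
        mean += delta / count
        sum_sq_dev += delta * (value - mean)
    if count == 0:
        raise ValueError("stream is empty")
    return count, mean, sum_sq_dev / count

# A generator stands in for a file or feed that is too large to load at once.
readings = (x for x in [2, 4, 4, 4, 5, 5, 7, 9])
count, mean, variance = running_mean_variance(readings)
print(count, mean, variance)
```

Real Big Data pipelines add a lot more on top (distribution across machines, fault tolerance and so on), but the principle of never holding everything in memory at once is the same.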
Do We Really Need to Split Hairs on This?
Some people might be wondering if it is really necessary to get so technical on the terms. Beyond avoiding the confusion caused by people who use these buzzwords loosely without understanding what they mean, it is important to clarify the terms because the techniques they describe are often designed for very different purposes. The various techniques in Data Analytics are not equivalent and can have different levels of interpretability and accuracy. In general, techniques that are more complex tend to have better accuracy for the dataset that they have been trained on, but often end up overfitting and become poor at generalising to new scenarios. More complex models are also harder to explain, so the interpretability of models tends to suffer as prediction accuracy improves (see trade-off chart below).
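The overfitting point can be seen in a few lines of Python. The sketch below is a toy simulation of my own, with a made-up linear relationship (intercept 2.0, slope 0.5) plus noise. It compares a simple straight-line fit with a deliberately "complex" model that just memorises the training data (a 1-nearest-neighbour lookup). All names and numbers are assumptions for illustration only.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def nearest_neighbour_predict(train_x, train_y, x):
    """Predict with the outcome of the closest training point (pure memorisation)."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def mse(predictions, ys):
    """Mean squared error between predictions and observed values."""
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)

def make_data(rng, n):
    """Simulated data: y = 2.0 + 0.5*x + noise (an assumed ground truth)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(rng, 200)
test_x, test_y = make_data(rng, 200)

a, b = fit_line(train_x, train_y)
line_train = mse([a + b * x for x in train_x], train_y)
line_test = mse([a + b * x for x in test_x], test_y)

nn_train = mse([nearest_neighbour_predict(train_x, train_y, x) for x in train_x], train_y)
nn_test = mse([nearest_neighbour_predict(train_x, train_y, x) for x in test_x], test_y)

print(f"line:      train MSE {line_train:.2f}, test MSE {line_test:.2f}, slope {b:.2f}")
print(f"memoriser: train MSE {nn_train:.2f}, test MSE {nn_test:.2f}")
```

The memoriser is perfect on the data it has seen but does worse than the straight line on fresh data, because it has memorised the noise as well. Notice also that the straight line hands us a slope we can read and explain, while the memoriser offers nothing comparable to interpret.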
Depending on the type of data and the objective of the analysis, the choice of technique to use should be different. For example, if the focus is on making predictions using a large dataset with many variables, Machine Learning is usually the way to go. However, if the intention is to make inferences about a small dataset with just a few variables, conducting a Statistical Analysis with a proper hypothesis is a necessary procedure. The act of Data Mining without a clear hypothesis is no different from p-hacking, and it is part of the reason for the Replication Crisis in a number of disciplines. Some researchers also commit the mistake of using Machine Learning to make inferences. Machine Learning techniques like Deep Learning are notorious for being black boxes that are difficult to interpret, and trying to explain the data based only on their inputs and outputs is akin to treating correlation as causation.
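To see why hunting for patterns without a hypothesis is so risky, consider the toy simulation below. It is my own sketch: the sample size, the number of variables and the function names are all assumptions, and the p-values use the Fisher z approximation rather than an exact test. It generates an outcome and 200 variables of pure noise, then tests every one of them against the outcome at the usual 0.05 level.

```python
import math
import random
from statistics import NormalDist

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value for 'no correlation', via the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def family_wise_error(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

rng = random.Random(42)
n_obs, n_vars, alpha = 50, 200, 0.05

# The outcome and every candidate variable are pure noise: no real effects exist.
outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
p_values = []
for _ in range(n_vars):
    noise = [rng.gauss(0, 1) for _ in range(n_obs)]
    p_values.append(approx_p_value(pearson_r(noise, outcome), n_obs))

naive_hits = sum(p < alpha for p in p_values)
bonferroni_hits = sum(p < alpha / n_vars for p in p_values)  # Bonferroni correction

print(f"'significant' noise variables at {alpha}: {naive_hits} of {n_vars}")
print(f"after Bonferroni correction: {bonferroni_hits}")
print(f"chance of a false positive in 20 tests: {family_wise_error(20):.2f}")
```

Roughly one in twenty of the noise variables should come out "significant" purely by chance, and with just 20 independent tests the chance of at least one false positive is already about 64%. Stating the hypothesis up front, or at least correcting for the number of tests as the Bonferroni line does, is what separates an inference from a fishing trip.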
In essence, we should first determine the question we are trying to answer before we can decide on the type of analytics to use. Based on the Gartner Analytic Model, if we simply want to know “what has happened”, Descriptive Analytics such as Data Mining is sufficient; but if we want to ask the question of “why did it happen”, Diagnostic Analytics such as Statistical Analysis is necessary for the explanation; yet if we want to make predictions about “what will happen”, Predictive Analytics such as Machine Learning is a more suitable approach; and finally, if there is a need to know “what should be done”, Prescriptive Analytics such as Modelling and Simulation in Operations Research will be required for suggesting the optimal course of action.
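For those who like to see things laid out as a lookup table, the four questions above can be written down as a small Python mapping. This is only a restatement of the paragraph: the enum, the dictionary and the function name are hypothetical names of my own, and real projects rarely fall this neatly into one box.

```python
from enum import Enum

class Analytics(Enum):
    DESCRIPTIVE = "Descriptive Analytics"
    DIAGNOSTIC = "Diagnostic Analytics"
    PREDICTIVE = "Predictive Analytics"
    PRESCRIPTIVE = "Prescriptive Analytics"

# Question -> (type of analytics, example technique named in the paragraph above)
QUESTION_TO_ANALYTICS = {
    "what has happened": (Analytics.DESCRIPTIVE, "Data Mining"),
    "why did it happen": (Analytics.DIAGNOSTIC, "Statistical Analysis"),
    "what will happen": (Analytics.PREDICTIVE, "Machine Learning"),
    "what should be done": (Analytics.PRESCRIPTIVE, "Modelling and Simulation"),
}

def choose_analytics(question):
    """Look up the analytics type and example technique for one of the four questions."""
    key = question.strip().rstrip("?").strip().lower()
    if key not in QUESTION_TO_ANALYTICS:
        raise KeyError(f"no mapping for question: {question!r}")
    return QUESTION_TO_ANALYTICS[key]

kind, technique = choose_analytics("Why did it happen?")
print(f"{kind.value}: e.g. {technique}")
```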
* * * * * * * * * *
Admittedly, the boundaries between these various buzzwords and technical terms are probably fuzzier than I have portrayed. But instead of treating the terms as interchangeable, I hope the simple illustration of a Fuzzy Buzzy will be useful in differentiating them. This should then allow their purposes to be better understood, so that their usage can become more precise.
References & Further Reading: