Disclaimer: I have just completed a Specialist Diploma in Data Science with Specialization in Big Data & Streaming Analytics, but I am by no means an expert in this area. What I hope to achieve with this post is to simplify what I have learnt from the programme, and to explain it to laypeople who wish to know more about the field. I will certainly not be able to cover everything about Big Data in this post, but I would like to at least draw parallels between Big Data analysis and traditional data analysis, so that more people will feel less afraid to approach it. Please feel free to correct any misconceptions that I may have.
For a long time, the term “Big Data” has been used to mystify people both inside and outside the field of data science. Say to anyone at a networking session that you deal with “AI, Machine Learning and Big Data”, and you’d instantly be looked upon with admiration, as though you were some god-like figure. The holy trinity that seems to mean everything, and yet nothing. Because what exactly are we talking about when we say “AI, Machine Learning and Big Data”? In my previous post on the fuzzy buzzwords used in data science, I explained how these terms actually mean very different things that are conceptually unrelated.
“Big Data” simply means huge quantities of data. Machine Learning or AI does not necessarily have to be involved with Big Data, because the term merely describes the type of data. You could ingest big data, store big data, process big data or analyse big data, all without having to use Machine Learning. Similarly, “Artificial Intelligence” simply describes a computational process that behaves in a human-like manner, which may or may not involve Machine Learning. AI can be achieved in many different ways, and Machine Learning is just one of the many techniques. In fact, it makes sense to say “I used Machine Learning to solve this problem”, but if someone says “I used AI to solve this problem”, what they probably mean is that they used Machine Learning.
Difference between Machine Learning and AI:
If it’s written in Python, it’s probably Machine Learning;
if it’s written in PowerPoint, it’s probably AI.
– Mat Velloso (Technical Advisor to CTO at Microsoft)
What is Big Data?
Now that we have gotten those confusing terms out of the way, let’s focus on explaining what Big Data actually is. I know it feels good to be telling people that you’re working with Big Data, but as a rule of thumb, if you can load and work on your data using Microsoft Excel or a SQL database manager, then I’m sorry to inform you that you’re probably not dealing with Big Data. As the name implies, the size of Big Data can enter the range of terabytes, such that trying to load it in Excel or SQL will probably cause your entire machine to hang, due to insufficient memory and processing power.
But instead of using this vague reference, let’s take a look at the formal definition of what Big Data is. In general, there are 3 V’s that define Big Data: Volume, Velocity and Variety.
Volume refers to the amount of data collected, measured in at least gigabytes, and can go up to terabytes or even petabytes.
Velocity refers to the rate at which data streams in and accumulates, which would eventually result in a very large volume.
Variety refers to the many different types of data being collected, such as text, audio and video, resulting in a very complex and unstructured data lake.
Fulfilling all 3 V’s is not necessary for a dataset to qualify as Big Data. As long as your data satisfies one of the V’s, it can be considered Big Data. Recently, 2 more V’s have been added to the mix: Veracity and Value. Veracity refers to how accurate the data is, and Value refers to how much the data is worth. However, these two qualities are desirable in any dataset and do not specifically differentiate Big Data, so we’ll stick with the original 3 V’s.
What is Hadoop?
So I’ve mentioned that it’s not normally possible for traditional machines to store and process Big Data. That is where specialised tools such as Hadoop and Cloud Computing come in. Apache Hadoop is an open-source software framework that manages the storage and processing of Big Data across a cluster of machines. The technical details of the entire Hadoop architecture are quite complex, but for the purpose of this post, I will briefly explain one of its major components: the Hadoop Distributed File System.
Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. The distributed file system works by splitting the big data into many blocks, replicating each block (3 copies by default), and distributing the copies to different servers. The idea behind such a design is to achieve high availability and fault-tolerance, making data processing resilient against failures. And because the blocks sit on different machines, they can be worked on simultaneously; this is known as parallel processing.
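To make the splitting-and-replication idea concrete, here is a toy simulation in plain Python. This is only an illustration of the concept, not how HDFS is actually implemented: the tiny block size, the server names and the round-robin placement are all made up for the example (real HDFS uses 128 MB blocks by default and places replicas with rack awareness).

```python
# Toy simulation of HDFS-style block splitting and 3x replication.
# Block size, server names and placement strategy are illustrative only.
BLOCK_SIZE = 4          # pretend each block holds 4 characters
REPLICATION_FACTOR = 3
SERVERS = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split the data into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, servers, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct servers, round-robin."""
    placement = {}
    for b in range(len(blocks)):
        placement[b] = [servers[(b + r) % len(servers)] for r in range(replication)]
    return placement

blocks = split_into_blocks("a big data file!")
placement = place_replicas(blocks, SERVERS)
# Every block now lives on 3 different servers, so losing any single
# server still leaves 2 copies of each block available.
```

The point of the sketch is simply that no single machine holds the whole file, and no block exists in only one place, which is what buys both fault-tolerance and the ability to process blocks in parallel.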
In order to gain access to Hadoop, one of the ways is through Cloudera. Cloudera is a software company that helps to distribute Hadoop via its software and services, and it is available on-premises and across a number of cloud platforms. To familiarise yourself with Cloudera and Hadoop, it is possible to download Cloudera’s QuickStart Virtual Machine (VM) to try it out in a sandbox environment.
Within Cloudera, there is a browser interface called Hue UI. Hue is the open source web interface for Hadoop that lets you access HDFS through a File Browser, and lets you process and query Big Data by connecting to Hive.
The File Browser works just like any other file explorer, and it is especially useful if you are not familiar with using the Linux command line to execute commands in HDFS. Once you have transferred your data file from the Cloudera platform to HDFS, you will see it appear in the File Browser, and you can use the HDFS file path to load your data.
Apache Hive is a data warehouse system built on top of Hadoop for processing and querying Big Data. While Hive uses a SQL dialect called HiveQL that closely resembles the query language in traditional databases, there are still some differences due to having to comply with the restrictions of Hadoop. Nonetheless, as you can see from the image above, the Hive Editor in Hue UI looks very similar to traditional SQL editors, and people who are already familiar with SQL will definitely find themselves at home.
Now that you have a better idea of how Big Data can be stored and processed, let’s move on to how Big Data is analysed.
What is Spark?
Apache Spark is a unified analytics engine for Big Data and Machine Learning that is able to interface with HDFS. During processing, Spark stores the data being used in memory as a Resilient Distributed Dataset (RDD): an immutable, partitioned collection of data items that is distributed over a cluster of machines, can be operated on in parallel, and can be rebuilt if a partition is lost, making it fault-tolerant.
Spark was originally written in the Scala programming language, but because of the popularity of other programming languages in data science, Spark now also supports Python, R, SQL and Java. This is really neat, because Python, R and SQL are heavily used in the data science industry, so the transition to Spark will actually not be that foreign. In fact, the version of Spark that uses the Python programming language is known as PySpark, a Python API that lets you interface with Spark RDDs.
The next logical question to ask then, is how do we gain access to Spark and use it? As can be seen from the image above, one way of accessing Spark is through the Cloudera terminal. By typing “pyspark” in the terminal, a Python Spark shell will be invoked, and PySpark can be used at the command line. But since the command line isn’t the most user-friendly way of analysing data, another option for accessing Spark is through a web-based user interface called Databricks.
Databricks was founded by the creators of Spark, and it provides a just-in-time cloud-based platform for processing Big Data, in the form of notebooks that are similar to Jupyter notebooks. This again helps in the transition to Spark, as many people in the data science industry are accustomed to using Jupyter. Having a programming language and user interface that most people are familiar with, what more can we ask for?
As it turns out, the ease of use doesn’t just stop there. Users of R and Python Pandas will be familiar with the term “DataFrames”. Spark uses DataFrames as well. A Spark DataFrame is essentially a dataset that has been organised into columns. It is conceptually the same as dataframes in R and Python, but with richer optimisations under the hood. The source for Spark DataFrames can come from many places – Spark RDDs that we mentioned earlier, Hive tables from Hadoop, or even external databases such as MySQL.
But when it comes to using the functions in Spark, some differences exist compared with plain Python. To keep things simple, we will compare the functions for Spark RDDs against their equivalents for Python arrays, and the functions for Spark DataFrames against their equivalents for Pandas DataFrames. To load files in Databricks, the files must first be uploaded to the Databricks File System (DBFS). Hence, you will notice that the file path in the subsequent comparisons to Spark begins with “dbfs:/FileStore/tables/”.
As you can see in the image above, a good number of functions for Python arrays can be achieved by Spark RDD operations as well. RDD operations such as “union(otherData)” and “distinct()” are called transformation methods, as they create new RDDs, while operations such as “collect()” and “count()” are called action methods, as they run a computation on the RDD and return a value. Notice that an RDD cannot be created just by defining a variable; the method sc.parallelize() needs to be used even when creating an empty RDD. Also, slicing an RDD with a colon (:) as you would a Python list is not possible, hence the method “take(n)” is required for retrieving the first n rows.
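As a small sketch of the operations just described, the snippet below assumes a PySpark shell where the SparkContext is already available as sc (as it is in the Cloudera terminal or a Databricks notebook); the sample values are made up, and the code will not run outside a Spark environment.

```python
# Assumes a PySpark session where the SparkContext is available as `sc`.

# An RDD must be created with sc.parallelize(), even from an empty list.
numbers = sc.parallelize([3, 1, 2, 3])
others = sc.parallelize([2, 4])

# Transformations return new RDDs and are lazily evaluated:
combined = numbers.union(others)   # like concatenating two lists
unique = combined.distinct()       # like de-duplicating with a set

# Actions trigger the computation and return values to the driver:
unique.count()     # number of distinct elements
unique.collect()   # all elements, brought back as a Python list
unique.take(2)     # first 2 elements, since slicing with [:2] is not possible
```

Note that nothing is actually computed until one of the action methods is called; the transformations only build up a recipe for Spark to execute.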
Interestingly, Spark DataFrames look very similar to Pandas DataFrames, with some slight differences in the words being used to call out the same functions. Nonetheless, the logic is pretty much the same, so adapting to Spark DataFrames shouldn’t be too difficult. In fact, many of the functions such as “show”, “select” and “union” are named after clauses used in SQL, so users of Python and SQL should find this very comfortable.
One more nifty function in Spark DataFrames is “toPandas()”. This function allows you to convert your Spark DataFrame into a Pandas DataFrame in Databricks, if there is ever a need to use a Python plotting library to visualise the data. However, this conversion should only be done after you have filtered your Big Data into a much smaller dataframe using Spark DataFrames, as Pandas DataFrames are not meant for handling Big Data.
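Putting the DataFrame ideas together, here is a hedged sketch of what that workflow might look like in a Databricks notebook, assuming the SparkSession is available as spark; the file name and column names are hypothetical, and the code needs a live Spark environment to run.

```python
# Assumes a Databricks notebook where the SparkSession is available as
# `spark`. The CSV file and its columns are hypothetical examples.
df = spark.read.csv("dbfs:/FileStore/tables/sales.csv",
                    header=True, inferSchema=True)

df.show(5)  # roughly the counterpart of Pandas' head()

# Filter down while the data is still distributed across the cluster:
small = df.select("region", "amount").filter(df.amount > 1000)

# Only the reduced result is converted; toPandas() pulls everything into
# the driver's memory, so never call it on the full Big Data set.
pdf = small.toPandas()
```

The key design point is the order of operations: select and filter run in parallel on the cluster, and toPandas() is left until the data is small enough to fit comfortably on one machine.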
Bonus: Machine Learning
To be upfront, I do not have much experience yet in using Machine Learning with Python or Spark. But from what I have read and understood, scikit-learn is the most popular machine learning library for standard Python, while MLlib is the machine learning library developed for Spark to run on Big Data. If that is true, the concepts should be the same as what we have already gone through above.
If your data is too large and you need to run machine learning algorithms on it, it would probably be better to use MLlib as it is a distributed framework designed to do parallel processing. But if you are able to scale down your data to a more manageable size for conversion into a Pandas DataFrame, scikit-learn would then become a viable option.
For more information about the differences between scikit-learn and Spark MLlib, check out this Quora question and answers: https://www.quora.com/How-is-scikit-learn-compared-with-Apache-Sparks-MLlib. And if you would like to learn more about the different types of Machine Learning, please feel free to read my previous post.
Is Big Necessarily Better?
We have covered quite a bit on what Big Data is, and how we can store, process and analyse it using a few specialised tools. But if we go back to the crux of the matter, we should probably also ask ourselves why we are dealing with Big Data in the first place.
There is no doubt that in today’s highly digitalised world, we have tonnes of data waiting for us to harness, and the advancement in technology for distributed file systems and parallel processing has given us the ability to handle Big Data. But does having the ability to handle Big Data mean that we should always strive to gather Big Data? The problem with Big Data is that with so much information, it is often difficult to focus on answering very specific questions. To analyse such large amounts of information, it is inevitable that automatic processes need to be used for model selection. But the drawback of using automatic processes is that the resulting model usually becomes uninterpretable.
In February this year, a statistician from Rice University, Dr Genevera Allen, issued a warning that the way machine learning is being used today may be contributing to the reproducibility crisis in science. A large part of this is due to an over-reliance on the results produced by machine learning models. When the datasets are huge, it is not uncommon for models to pick out patterns that don’t actually mean anything. Hence, it is very important for the person running the model to have control over what goes into it, and to have a good grasp on the context of the problem that is being solved.
The reason Small Data may have more value than Big Data is that it is often more accessible, understandable and actionable than Big Data. To answer specific questions, statistical inference has to be used, and Small Data is more suitable for such analysis. By testing hypotheses that are based on a solid theoretical foundation, statistical analyses can help to interpret the results in a more nuanced fashion, and the conclusions will be a lot more realistic and conservative.
In any case, what is perhaps most important is that we understand that different tools are meant for dealing with Big Data and Small Data separately. As long as we don’t abuse the methods for what they were not intended for, insights can probably still be found in both Big Data and Small Data.
If you would like to know more about the reproducibility crisis, check out my post on Bayesian Analysis and The Replication Crisis:
If you would like to know more about accuracy scores for machine learning, check out my post on the Confusion Matrix: