Over the last few months, Apache Spark has become the buzzword everyone is talking about. This buzz, however, is nowhere close to the IoT buzz we have seen in the past few years. While researching trends on Google, I was surprised to find that the term IoT now sits right next to the term Big Data (in some cases outgrowing that trend).
For this article, we will keep our focus limited to a fact check on Apache Spark.
What is Apache Spark?
Apache Spark is an open-source cluster computing framework and a highly scalable data processing engine. Originally developed at UC Berkeley’s AMPLab in 2009, it was open sourced in 2010 under a BSD license and ultimately donated to the Apache Software Foundation (ASF) in 2013. It is now distributed under the Apache License 2.0.
Apache Spark provides a unified and comprehensive framework that can take care of the various requirements for processing large datasets. Spark provides you with high-level APIs in Java, Scala, Python, and R. It also provides a rich set of higher-level tools referred to as libraries. Some of the libraries included in the Spark ecosystem are mentioned below.
- Spark Streaming – Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Unlike the MapReduce framework, Apache Spark looks beyond batch processing. Using Spark Streaming, you can ingest data from various sources such as Kafka, Flume, Twitter, Kinesis, or TCP sockets. It essentially follows a micro-batch method of computing: from the input data you create a discretized stream, or DStream, which is represented as a sequence of RDDs.
- Spark SQL & DataFrames
- SQL – Spark SQL is the Spark module for processing structured data. It allows you to query datasets using traditional SQL-like queries and BI tools that can connect over the JDBC API. Spark SQL can also be used to read data from an existing Hive installation.
- DataFrames – When you run Spark SQL from another programming language, the results are returned as a DataFrame. A DataFrame is a distributed collection of data organized into named columns. You can relate this to a table in the traditional SQL world or a data frame in Python or R, but with richer features under the hood.
- Spark MLlib – MLlib is Spark’s machine learning library. It is scalable and consists of common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. It is further divided into spark.mllib, an API built on top of RDDs, and spark.ml, a higher-level API built on top of DataFrames. Using spark.ml is now recommended, since the DataFrames API provides more flexibility and versatility.
- Spark GraphX – GraphX is the Spark component for graphs and graph-parallel computation. It extends the Spark RDD by introducing a Graph abstraction with properties attached to each vertex and edge.
There are a few more libraries, including SparkR and Bagel, plus a number of applications that leverage Spark.
Apache Spark – RDD
A Resilient Distributed Dataset, or RDD, is the basic abstraction in Spark and one of its core components. An RDD is a collection of elements that is immutable, partitioned, and can be operated on in parallel. This makes RDDs efficient and resilient (fault-tolerant).
The efficiency is achieved by processing the RDD in parallel across the cluster. Resilience, or fault tolerance, is achieved mostly by tracking data lineage: lost partitions can be recomputed from the transformations that produced them.
Apache Spark – The Buzz
Everyone working in the Big Data vertical has heard of Apache Spark more than once. As I mentioned at the beginning of this article, it is very popular these days, considerably more so than Apache Hadoop itself.
I did some research comparing the two trends from 2013, when Spark was donated to the ASF, until April 2016. The results look pretty interesting.
Apache Spark – Fact Check
With the buzz around Apache Spark come a lot of misconceptions about the framework in general. I think a good way to bust those myths is to present some facts about Apache Spark.
Apache Spark is NOT an In-memory Computation Framework
Let us get this straight first and foremost: Apache Spark is NOT an in-memory computation framework. The biggest misconception today is that Spark is just an in-memory “thing”, but that is not true; there is more to Spark than in-memory processing. Yes, in-memory caching is one of Spark’s features, but it is not the only thing about Spark. Spark uses the available memory to cache data, and that cache is not kept around forever. For a computation engine to be called in-memory, it should keep data resident for a longer period of time. Spark manages its cache with a Least Recently Used (LRU) policy: cached blocks are immutable and can be evicted when they have not been used recently.
If the data you are processing fits into the memory available to Spark, you can think of it as an in-memory processing framework, but that is just a special case. Plenty of other technologies already use LRU caching without being called in-memory technologies.
Apache Spark Provides a Unified Platform
Apache Spark brings multiple options in a single package. Spark does not limit itself to batch processing the way MapReduce does; it scales beyond batch processing. It can process real-time streaming data as well as structured and unstructured data. The possibilities also include graph processing and machine learning using commonly used algorithms.
You no longer have to combine different frameworks or processing engines, such as MapReduce for batch processing, Storm/SpringXD for real-time data processing, or Giraph for graph processing. Spark aims to be a single solution for most of today’s Big Data requirements.
Spark is NOT a replacement for MapReduce
Another big misconception surrounding Apache Spark is that it is a replacement for MapReduce or Hadoop. The points mentioned in the previous “fact” do not mean that Spark will replace MapReduce. Spark was designed to address the challenges and limitations that MapReduce faces, yet there are still use cases that are well suited to MapReduce. Not all data processing fits the Map and Reduce pattern, and that is where Spark can help. Spark can co-exist with, grow alongside, and even outshine MapReduce, but not replace it.
There is more to Hadoop than just MapReduce. How can we forget HDFS, the highly distributed file system that provides cheap and reliable storage? Spark will not replace Hadoop either; it can run on top of Hadoop and read data from HDFS for processing.
Spark is NOT Limited to Hadoop
Spark is not limited to running on top of a Hadoop cluster. It is designed to run on a variety of platforms and distributed systems. Outside of Apache Hadoop, Spark can run as a standalone cluster by itself, or you can use a different cluster manager such as Apache Mesos.
Even within Hadoop, Spark can run on Hadoop v1 (Spark Inside MapReduce) and Hadoop v2 (over YARN).
Spark Provides a Single Programming Abstraction
Spark provides a single programming abstraction called the Resilient Distributed Dataset (RDD). This abstraction is understood by the core APIs as well as Spark’s libraries. The advantage of a single abstraction is that data does not need to be re-formatted between workloads: a data stream is a sequence of RDDs that can be repurposed for batch processing, queried with SQL in real time, or fed to a machine learning algorithm. Not having to convert data from one abstraction to another saves a lot of time.
Spark is Fast
Apache Spark is fast, but the speed certainly depends on the kind of operation you are performing. The numbers quoted around the internet vary from 10 to 100 times faster than MapReduce. I believe, however, that the speedup will vary with the kind of computation performed.
Take, for instance, a job that iterates over the same set of data. This kind of job will be considerably faster in Spark and will outshine MapReduce, the reason being Spark’s ability to cache the data in memory. That data stays “hot” because it is iterated over repeatedly, so the chance of the cached data being evicted (under the LRU policy) is slim.
The other area where Spark proves faster than MapReduce is the shuffle: Spark’s design allows much of the shuffle to be carried out in memory.
And lastly, each executor has an efficient way to start the tasks assigned to it. MapReduce starts a completely new JVM to execute each task; Spark’s executors, by contrast, fork a new thread per task. Spark’s executors and tasks could be a whole topic for another article. Put simply, data processing in Spark is carried out on worker nodes that run executors, and each executor is responsible for a certain part of the overall job, known as a task.
So Spark can be fast for certain kinds of processing operations and has some shuffle-related advantages. But it is mostly the way Spark forks threads to execute tasks that gives it a considerable advantage over MapReduce and makes it “faster”.
Spark can Access Data from Multiple Sources
Spark can access data not just from HDFS but also from a number of different storage solutions. The following list gives you an idea of its range.
– Apache HDFS
– Apache HBase
– Apache Hive
– Apache Cassandra
– Amazon S3
– Alluxio (formerly known as Tachyon)
The sole purpose of mentioning these facts about Apache Spark was to clear up some known misconceptions about the framework. It is “faster” in comparison to current frameworks, but the speed varies with a number of factors. MapReduce still has its use cases and cannot simply be replaced by Spark. And lastly, Spark is NOT an in-memory processing technology.