Earlier this month, Databricks provided an overview of Apache Spark's next major release, Spark 2.0. The release is expected around early June 2016. This new release brings major changes to the abstractions, APIs and libraries. We will take a look at some of these changes.
What is Apache Spark
Apache Spark is an open source cluster computing framework and a highly scalable data processing engine. Originally developed at UC Berkeley's AMPLab in 2009, it was open sourced in 2010 under the BSD license and donated to the Apache Software Foundation (ASF) in 2013. It is now distributed under the Apache License 2.0.
Apache Spark provides a unified and comprehensive framework that can handle the various requirements of processing large datasets. Spark offers high-level APIs in Java, Scala, Python and R, along with a rich set of higher-level tools referred to as libraries.
Spark 2.0 – What’s New
With the upcoming release of Spark 2.0 there have been some significant improvements in the APIs, libraries and abstraction layers. Spark is well known for its rich set of APIs and wide range of libraries. Spark, as we know, is a fast processing engine, and Spark 2.0 attempts to improve performance further: it is tipped that Spark 2.0 can be 10X faster than Spark 1.x.
Let’s take a look at some of the changes in Spark 2.0.
More SQL Friendly – SQL 2003 Compliant
SQL is one of the primary interfaces Spark applications use. Spark 2.0 introduces a new ANSI SQL parser with much better error reporting, and adds support for subqueries (both correlated and uncorrelated). Spark 2.0 can run all 99 TPC-DS queries.
This is a major improvement that can encourage migrating applications from traditional SQL engines to Spark.
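As an illustrative sketch of the new subquery support (the SparkSession `spark` and the temp view `sales` with columns `region` and `amount` are assumed, not from the original announcement), a correlated subquery in Spark 2.0 SQL might look like:

```scala
// Sketch: a correlated subquery, newly supported by Spark 2.0's ANSI SQL parser.
// Assumes a SparkSession `spark` and a registered temp view `sales(region, amount)`.
val aboveAverage = spark.sql("""
  SELECT region, amount
  FROM sales s
  WHERE amount > (SELECT AVG(amount) FROM sales WHERE region = s.region)
""")
aboveAverage.show()
```

The inner query references `s.region` from the outer query, which is exactly the correlated form that Spark 1.x could not execute.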
Unified API – DataFrames & Datasets
DataFrame is a higher-level structured data API introduced in Spark 1.3 in 2015. In a nutshell, a DataFrame is a collection of rows with a schema. It provides better performance, ease of use and flexibility compared with the RDD (Resilient Distributed Dataset) API.
For users who prefer type safety, a new API called Dataset was introduced in Spark 1.6. Dataset is an attempt to provide compile-time type safety on top of DataFrame.
In Spark 2.0 the two APIs are unified into a single API. Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row (Dataset[Row]). The new Dataset API includes both typed and untyped methods.
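A minimal sketch of how the unified API reads in practice (the case class `Person` and the file `people.json` are illustrative assumptions, and a SparkSession `spark` with its implicits imported is assumed):

```scala
// Sketch: DataFrame and Dataset in Spark 2.0.
case class Person(name: String, age: Long)

val df = spark.read.json("people.json")  // untyped: DataFrame = Dataset[Row]
val ds = df.as[Person]                   // typed: Dataset[Person]

df.filter($"age" > 21)                   // untyped, column-based operation
ds.filter(_.age > 21)                    // typed, checked at compile time
```

The same logical query can be expressed either way, and both forms go through the same optimizer.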
SparkSession – Single Entry Point
Prior to Spark 2.0, applications used the SparkContext API to connect to a Spark cluster, and several different contexts were provided for the different APIs. For instance, SQL required a SQLContext and streaming required a StreamingContext. While using the DataFrames API, a common point of confusion was deciding which "context" to use.
Spark 2.0 introduces SparkSession, a single entry point for the DataFrame and Dataset APIs. For now SparkSession covers SQLContext and HiveContext; it will be extended to cover StreamingContext as well.
Please note that SQLContext and HiveContext will still be present in Spark 2.0 for backward compatibility.
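A sketch of creating the new single entry point (the app name and `local[*]` master are illustrative; in a real deployment the master is usually supplied by spark-submit):

```scala
import org.apache.spark.sql.SparkSession

// Single entry point in Spark 2.0, replacing separate SQLContext/HiveContext.
val spark = SparkSession.builder()
  .appName("MyApp")
  .master("local[*]")      // illustrative; normally set by the launcher
  .enableHiveSupport()     // optional: the capabilities of the old HiveContext
  .getOrCreate()

val df = spark.read.json("data.json")  // reading and SQL through one object
spark.sql("SELECT 1").show()
```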
Spark as a Compiler – Faster Spark
Spark is known for its performance and speed, and Spark 2.0 attempts to take this a step further. Spark 1.x, like many other modern data engines, evaluates query plans through chains of virtual function calls, and a large share of CPU cycles ends up being spent on this overhead rather than on useful work.
Spark 2.0 includes the second-generation Tungsten engine. The new engine takes the query plan and collapses it into a single generated function, thereby eliminating the unwanted function calls, and it keeps intermediate data in CPU registers instead of writing it to memory as traditional engines do. This technique promises around a 10X improvement in performance, though the actual gain depends on the workload you are running.
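You can see whether this whole-stage code generation kicks in by inspecting a query's physical plan: operators that have been collapsed into a single generated function are prefixed with an asterisk in the `explain()` output. A small sketch (assuming a SparkSession `spark`):

```scala
// Sketch: checking whether whole-stage code generation applies to a query.
val q = spark.range(1000).selectExpr("sum(id)")
q.explain()
// Operators shown with a leading '*' in the physical plan, such as
// *HashAggregate and *Range, run inside a single generated function.
```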
Structured Streaming – Continuous Applications
The current Spark streaming API, called DStream, was introduced in Spark 0.7. It provides the ability to ingest and process real-time data streams. Spark 2.0 introduces Structured Streaming.
Spark Structured Streaming is a declarative API that extends DataFrames and Datasets. It is largely built on Spark SQL and the Dataset API, and also incorporates ideas from Spark Streaming.
Spark Streaming, which uses what’s been called a “micro-batch” architecture for streaming applications, is among the most popular Spark engines. The new Structured Streaming engine represents Spark’s second attempt at solving some of the tough problems that developers face when building real-time applications.
Essentially, Structured Streaming enables Spark developers to run the same type of DataFrame queries against data streams as they had previously been running against static data. Thanks to the Catalyst optimizer, the framework figures out the best way to make this all work efficiently, freeing the developer from worrying about the underlying plumbing.
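A sketch of the idea, reading text lines from a socket and counting them with an ordinary DataFrame aggregation (the host, port and output sink are illustrative; a SparkSession `spark` is assumed):

```scala
// Sketch: a Structured Streaming query. The input is an unbounded stream,
// but the aggregation is written exactly like a batch DataFrame query.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val counts = lines.groupBy("value").count()  // a normal DataFrame aggregation

val query = counts.writeStream
  .outputMode("complete")   // emit the full updated result table each trigger
  .format("console")
  .start()

query.awaitTermination()
```

The engine treats the stream as a continuously growing table and keeps the result of the query up to date as new data arrives.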
Upcoming releases of Spark 2.x will include more features and improvements in Spark Structured Streaming.
DataFrame based ML API
In Spark 2.0, the DataFrame-based Machine Learning “Pipeline” API becomes the primary Machine Learning API, while the older RDD-based API enters maintenance mode.
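A brief sketch of the Pipeline style (the column names and the training DataFrame `training(text, label)` are illustrative assumptions):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Sketch: a text-classification pipeline in the DataFrame-based ML API.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// Stages are chained into a single Pipeline and fit in one call.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)  // `training` is an assumed DataFrame
```

Each stage transforms a DataFrame column into a new one, which is what makes DataFrames a natural fit as the primary ML abstraction.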
Spark has already made its mark by providing an easy-to-use, unified and fast data processing framework. With Spark 2.0 we can expect further improvements in Spark’s overall performance. We can look forward to the GA release of Apache Spark 2.0 in the coming weeks.