
Spark Ecosystem

The previous segment was a comprehensive comparison of Spark and MapReduce. We concluded that Spark is far more efficient than MapReduce for both iterative and interactive queries on a data set, in terms of both speed and ease of use. In this segment, you will learn about the Spark ecosystem. You will also learn about the various components and features that Spark offers.

Apache Spark comes with a rich bundle of features in addition to high-speed processing. Different packages are built on top of Spark, each serving a different purpose. In the upcoming video, you will learn about these from our SME.

To summarise, Spark supports different APIs, which allow it to work as a unified platform for various aspects of data processing and analysis. These APIs can load data from a wide range of sources, work on it and produce the desired output.

Spark Core

  • Spark Core is the heart of Spark and is responsible for all kinds of processing.
  • Everything in Spark, including every other Spark API, is built on top of Spark Core, which provides their common execution platform.
  • The Spark Core engine executes all Spark jobs with RDDs (Resilient Distributed Datasets) as inputs.
  • Even if you use a high-level API to optimise your Spark tasks, Spark internally handles the data in the distributed environment using RDDs. You will learn about Spark RDDs in upcoming sessions.
  • Here is a glimpse of how an RDD stores and represents data:
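A minimal PySpark sketch of the same idea (the application name and sample values below are illustrative):

```python
from pyspark import SparkContext

# Entry point for the RDD API; 'local[*]' runs Spark on all local cores.
sc = SparkContext("local[*]", "RDDGlimpse")

# An RDD is an immutable collection partitioned across the cluster.
numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# Transformations (map) are lazy; actions (collect) trigger execution.
squares = numbers.map(lambda x: x * x)
print(squares.collect())  # [1, 4, 9, 16, 25]

sc.stop()
```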

DataFrame API

  • When dealing with structured data, you will be using the DataFrame API.
  • The DataFrame API stores data in a tabular format. You will learn more about the various data types in upcoming sessions.
  • Here is a glimpse of how a DataFrame stores and represents data:
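A minimal sketch of creating and inspecting a DataFrame (the column names and rows are illustrative):

```python
from pyspark.sql import SparkSession

# Entry point for the DataFrame API.
spark = SparkSession.builder.appName("DataFrameGlimpse").getOrCreate()

# A DataFrame holds rows under named, typed columns -- a tabular format.
df = spark.createDataFrame(
    [("Asha", 34), ("Ravi", 29)],
    ["name", "age"],
)

df.printSchema()  # column names and inferred types
df.show()         # the data, rendered as a table
```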

SparkSQL

  • Spark SQL is Apache Spark’s module for working with structured data.
  • It is a high-level API that lets you utilise Spark’s power using SQL queries.
  • You don’t have to modify your existing SQL queries; Spark SQL can run them as they are.
  • Spark SQL supports HiveQL syntax as well as Hive SerDes and UDFs, allowing you to access existing Hive warehouses. We will discuss Spark SQL in upcoming sessions as well.
  • Here is a glimpse of an SQL query in Spark:
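A minimal sketch of running an SQL query in Spark (the view name and data are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLGlimpse").getOrCreate()

df = spark.createDataFrame([("Asha", 34), ("Ravi", 29)], ["name", "age"])

# Register the DataFrame as a temporary view so SQL can reference it.
df.createOrReplaceTempView("people")

# An ordinary SQL query runs unchanged through Spark SQL.
spark.sql("SELECT name FROM people WHERE age > 30").show()
```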

Spark Streaming

  • Spark Streaming involves processing live data streams in Spark. This capability is another advantage that Spark has over MapReduce.
  • Spark Streaming provides APIs that resemble the RDD API used in Spark Core. Because of this, you can manipulate data stored on disk and data arriving from live streams in much the same way, as shown in the sketch after this list.
  • It provides the same level of fault tolerance as Spark Core.
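A minimal sketch of a streaming word count using the DStream API, which mirrors the RDD API; it assumes a text source on localhost:9999 (for example, one started with `nc -lk 9999`):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Streaming needs at least two local threads: one to receive, one to process.
sc = SparkContext("local[2]", "StreamingGlimpse")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

# A DStream of lines from a TCP socket; note the RDD-like transformations.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's counts to the console

ssc.start()
ssc.awaitTermination()
```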

MLlib

  • MLlib is Spark’s library for applying machine learning algorithms. It offers a wide range of features for analysing distributed data sets and building models over them, as shown in the sketch after this list.
  • It also supports utilities such as model evaluation and data import.
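A minimal sketch of fitting a model with MLlib’s DataFrame-based API (the tiny training set and parameters are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("MLlibGlimpse").getOrCreate()

# A tiny labelled data set: (label, feature vector).
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.1, 1.2])),
     (1.0, Vectors.dense([1.9, 0.9]))],
    ["label", "features"],
)

# Fit a logistic regression model over the distributed data set.
model = LogisticRegression(maxIter=10).fit(train)
model.transform(train).select("label", "prediction").show()
```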

GraphX

  • GraphX is Spark’s API for processing large volumes of data represented as graphs.
  • This package also provides different operators for building and manipulating graphs, as shown in the sketch after this list.
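GraphX itself is exposed through Spark’s Scala and Java APIs. From Python, a close equivalent is the separate GraphFrames package; the sketch below is illustrative and assumes graphframes is installed and configured with your Spark setup:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # separate package, not bundled with Spark

spark = SparkSession.builder.appName("GraphGlimpse").getOrCreate()

# A graph is two DataFrames: vertices (with 'id') and edges ('src', 'dst').
vertices = spark.createDataFrame(
    [("a", "Asha"), ("b", "Ravi"), ("c", "Meera")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()  # a simple graph operator: in-degree of each vertex
```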

Packages

  • Different Spark packages, such as PySpark and SparkR, help you run queries on big data through libraries supported by languages such as Python and R. This flexibility of writing code in different languages adds to Spark’s ease of use.
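For instance, every sketch in this segment uses PySpark, Spark’s Python package. A minimal sketch of getting started, assuming pyspark has been installed from PyPI:

```python
# pip install pyspark  -- installs the PySpark package from PyPI
from pyspark.sql import SparkSession

# The same Spark engine, driven entirely from Python.
spark = SparkSession.builder.appName("HelloPySpark").getOrCreate()
print(spark.version)
spark.stop()
```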

Pro Tip:

The Spark ecosystem reinforces Spark’s ‘Runs Everywhere’ promise through the many libraries and functionalities it supports. Together, these make Spark a highly powerful and easy-to-use framework.

Additional Reading:

An article on the Spark ecosystem
