Spark Overview

In the first segment of this session, you will get a brief overview of Apache Spark. In the upcoming video, our SME will first introduce you to Apache Spark, and then you will learn about its use cases in more detail.

According to its official documentation, “Apache Spark™ is a unified analytics engine for large-scale data processing.” It is an open-source, distributed computing engine that provides a productive environment for data analysis owing to its lightning speed and support for various libraries.

Being open source, Spark allows developers and companies to continually improve the rich set of libraries that it offers. Some typical use cases of Spark include the following:

  • To perform exploratory analyses on data sets that run into hundreds of GBs (or even TBs) within realistic time frames (a short PySpark sketch follows this list).
  • To run near-real-time reports from streaming data.
  • To develop machine learning models.
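
To make the first of these use cases concrete, here is a minimal PySpark sketch of an exploratory analysis. The file path "sales.csv" and the column names "region" and "amount" are illustrative assumptions, not part of any real data set:

    # A minimal PySpark sketch of an exploratory analysis. The file path
    # "sales.csv" and the columns "region" and "amount" are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("exploration").getOrCreate()

    # Read the data set; Spark distributes the work across the cluster.
    df = spark.read.csv("sales.csv", header=True, inferSchema=True)

    df.printSchema()       # inspect column names and types
    df.describe().show()   # summary statistics for numeric columns

    # Aggregate sales per region.
    df.groupBy("region").agg(F.sum("amount").alias("total_sales")).show()

    spark.stop()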

According to Spark’s documentation, the four features listed below make Spark a powerful unified engine for data processing at a massive scale:

  • Speed: This is one of the defining characteristics of Apache Spark, which can run certain workloads up to 100x faster than the MapReduce framework. Spark’s speed is a result of its in-memory computation and its DAG (directed acyclic graph) scheduler.
  • Ease of use: Spark offers APIs in Java, Scala, Python, R and SQL, and provides more than 80 high-level operators for running queries interactively.
  • Generality: Spark supports a range of libraries, including Spark SQL, Spark Streaming, MLlib and GraphX. This ecosystem strengthens Spark’s data analysis capabilities.
  • Runs everywhere: Spark can run using its standalone cluster manager, or on Apache Mesos, Hadoop YARN or Kubernetes, and it can access data in various storage systems, including HDFS and S3 (a configuration sketch follows this list).
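
As an illustration of the last point, here is a minimal sketch of how the same PySpark application can target different cluster managers simply by changing the master URL; the commented-out URLs are placeholders, not real endpoints:

    # A sketch of deployment flexibility: the same application can target
    # different cluster managers by changing only the master URL. The
    # commented-out URLs are placeholders, not real endpoints.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("deploy-demo")
        .master("local[*]")  # local mode for development
        # .master("spark://host:7077")       # Spark's standalone cluster manager
        # .master("yarn")                    # Hadoop YARN
        # .master("k8s://https://host:443")  # Kubernetes
        .getOrCreate()
    )

    print(spark.version)
    spark.stop()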

Do not compare Spark with HDFS: Spark is a data-processing layer, whereas HDFS is a data storage layer. You may come across many comparisons between Spark and Hadoop, but those comparisons consider Hadoop as a combination of HDFS and MapReduce. Spark was developed at the AMPLab at UC Berkeley as an alternative to the MapReduce paradigm, and it can use HDFS as its data storage layer.
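
The following hedged sketch illustrates this division of labour: HDFS only stores the bytes, while Spark performs the computation. The HDFS URI below is a placeholder, not a real endpoint:

    # HDFS stores the bytes; Spark performs the computation. The HDFS URI
    # below is a placeholder, not a real endpoint.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-read").getOrCreate()

    logs = spark.read.text("hdfs://namenode:9000/data/logs/")  # placeholder URI
    print(logs.count())  # the count runs in Spark, not in HDFS

    spark.stop()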

Spark and Its Use in Different Industries

In the upcoming video, our SME Vishwa Mohan will talk about the various reasons why different industrial sectors are using Spark.

As you learnt in the video, Spark is used extensively across multiple industry domains for day-to-day activities, for the following reasons:

  • Spark can process a large pool of data and return results very quickly, and thus it aids business decisions and optimises workflows. For example, a retail chain like Big Bazaar can quickly analyse data across multiple stores and manage its sales and inventory efficiently.
  • The banking and finance industry also benefits from these abilities. Banks process billions of transactions daily across their worldwide networks, and their databases collect real-time data, which is analysed using frameworks such as Spark to detect fraud as it occurs.
  • Several industries rely on Spark to maximise their revenue. For instance, ride-sharing applications such as Ola and Uber calculate surge prices based on bookings and the availability of drivers in a particular area. They run complex machine learning algorithms in real time, and the results are reflected directly in their mobile applications. Running such algorithms is possible because of the rich library support offered by Spark.
  • Another use case is personalised advertising. Companies track users’ browsing histories and show them personalised ads on the websites and applications that they access. Running such ads is possible owing to Spark’s rich library support and quick analysis capabilities.

The use cases listed above are broad industry use cases of Spark. Once you run some analyses on data sets in this module, you will also realise that Spark is an easy-to-use tool for data analysis. In this module, you will mostly use PySpark, which lets you write Python- and SQL-like queries on top of Spark’s computation engine, as the brief sketch below shows.
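
The sketch expresses the same aggregation twice, once with DataFrame operators and once with Spark SQL; the tiny in-memory data set and its column names are made up for the example:

    # The same aggregation expressed twice: once with DataFrame operators
    # and once with Spark SQL. The in-memory data set is made up for the
    # example.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()

    df = spark.createDataFrame(
        [("books", 120.0), ("toys", 80.0), ("books", 45.5)],
        ["category", "price"],
    )

    # Python-like query: DataFrame operators.
    df.groupBy("category").agg(F.avg("price").alias("avg_price")).show()

    # SQL-like query: the equivalent through Spark SQL.
    df.createOrReplaceTempView("products")
    spark.sql(
        "SELECT category, AVG(price) AS avg_price FROM products GROUP BY category"
    ).show()

    spark.stop()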

Highly Recommended

Refer to the Spark documentation; it covers all of the operations available in Spark.

Additional Reading

  • Spark’s Official Documentation: This contains information on each component of Spark.
  • A blog on companies that are using Apache Spark.
  • Rise and predominance of Apache Spark.
