In the previous segment, you took a quick look at the powerful features of Spark that have led to its rise and predominance as a tool for data analysis. Now, in this segment, you will take a closer look at the differences between the MapReduce framework and Spark. But before proceeding further, let's understand the primary storage systems.
Storage systems are primarily of the following two types:
- Memory
- Disk
Every time you want to analyse or use the stored data, it is first loaded into memory, and then operations are performed on it. MapReduce follows a disk-based processing model. In such systems, data is stored on hard disk drives, managed by large storage systems such as HDFS and S3. The final output is arrived at in two phases: Map and Reduce.
In the Map phase, the input data is first split into partitions, and each partition is then processed (mapped) independently. The output of the Map phase is written back to disk as intermediate output. This intermediate output serves as the input for the Reduce phase, so the data is read from disk into memory again. The output of the Reduce phase is then stored on disk.
Here, you can see that the two phases, Map and Reduce, are not performed together in memory; there is an intermediate output that is stored on disk. This back-and-forth movement of data between the disk and the memory creates an overhead, which slows down the entire cycle of data processing.
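The data flow described above can be sketched in plain Python. This is only a toy word-count illustration (the dataset, file format and function names are made up), but it shows the key point: the Map output is written to disk and then read back before the Reduce phase can begin.

```python
import json
import tempfile
from collections import defaultdict

def map_phase(lines, intermediate_path):
    """Emit (word, 1) pairs and write them to disk as intermediate output."""
    pairs = []
    for line in lines:
        for word in line.split():
            pairs.append((word, 1))
    with open(intermediate_path, "w") as f:
        json.dump(pairs, f)  # intermediate output goes to disk

def reduce_phase(intermediate_path):
    """Read the intermediate output back from disk and aggregate the counts."""
    with open(intermediate_path, "r") as f:
        pairs = json.load(f)  # extra disk read before reducing
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

with tempfile.NamedTemporaryFile(suffix=".json") as tmp:
    map_phase(["spark is fast", "mapreduce is disk based"], tmp.name)
    result = reduce_phase(tmp.name)

print(result["is"])  # "is" occurs in both input lines
```

In a real MapReduce job, the framework also shuffles and sorts the intermediate pairs by key across machines; the disk round trip shown here is the overhead the text refers to.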
In the upcoming video, our SME Vishwa Mohan will analyse the access time for retrieving a file from a disk.
Now, let’s discuss the types of delays that occur while accessing data from a disk:
- Seek time: This is the time taken by the read/write head to move from the current position to a new position.
- Rotational delay: This is the time taken by the disk to rotate so that the read/write head points to the beginning of the data chunk.
- Transfer time: This is the time taken to read/write the data from/to the data chunk in the hard disk.
Access time denotes the total time delay or the latency between when a request to read data is submitted and when the disk returns the requested data. The access time of a disk is calculated by adding all the delays, as shown below:
Access Time = Seek time + Rotational delay + Transfer time
Let’s watch the upcoming video to get a deeper understanding of disk-based processing systems and also to learn when we should and should not use them.
As explained in the video, in in-memory processing systems, once data is loaded into memory, it remains there until all the operations are performed. Since the intermediate output also stays in memory, the time required to write/read intermediate files to/from the disk is eliminated.
Why in-memory data processing systems?
- Real-time data processing – Since data can be accessed quickly, in-memory processing can be used in cases where immediate results are required.
- Accessing data randomly in memory – Since data is stored in the RAM, it can be accessed randomly without scanning the entire storage.
- Iterative and interactive operations – Intermediate results are stored in memory, not on disk, and so this output can be reused in subsequent computations.
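The iterative case can be sketched with a plain Python loop. This is a simplified stand-in for what Spark does with cached RDDs (the dataset and the operation are made up): each iteration consumes the previous in-memory result directly, with no intermediate write to disk between steps.

```python
# A toy iterative computation: each step reuses the previous step's
# result, which stays in RAM the whole time. In the MapReduce model,
# each step would instead write its result to disk and the next step
# would read it back.
data = list(range(10))  # stands in for a dataset loaded into memory

result = data
for _ in range(3):
    result = [v * 2 for v in result]  # intermediate result stays in RAM

print(result[1])  # 1 doubled three times
```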
Now, in the upcoming video, Vishwa will discuss the interactive and iterative queries in MapReduce and Spark.
In the iterative operations on MapReduce slide, it is specified that the “memory” between Map and Reduce jobs is disk-based memory and not in-memory.
So, as explained in the video, iterative queries are much faster in Spark than in MapReduce, and this is because of Spark’s in-memory processing. Interactive queries also work better on Spark, since the data is not fetched from disk again for every query.
Pro Tip:
Processing is of the following two types:
- Batch processing
- Real-time processing
In batch processing, data is collected over a period of time and is available for analysis in a batch. Output from such tasks is not expected instantly, and, hence, you can work without a high-speed processing framework.
In real-time processing, data is collected and processed continuously. Typically, for such tasks, the system needs to produce output in a very short span of time, sometimes almost instantly. As an example, payment systems such as PayPal, credit card companies like Visa, and many others use fraud detection algorithms to identify suspicious transactions, for instance, someone hacking a person’s account and making payments. Since fraud detection will be useful only if one can identify a fraudulent transaction right when it occurs, the data needs to be analysed as soon as the transaction is initiated (additional reading about this is provided below).
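The contrast between the two processing styles can be shown with a toy fraud check. The threshold and transaction amounts below are invented for illustration: the batch version collects all transactions and analyses them later in one pass, while the stream version checks each transaction the moment it arrives.

```python
# Assumed rule for illustration: flag any transaction above this amount.
FRAUD_THRESHOLD = 10_000

transactions = [250, 40, 12_500, 300]  # made-up transaction amounts

def batch_flag(batch):
    """Batch processing: analyse the whole collected batch at once."""
    return [t for t in batch if t > FRAUD_THRESHOLD]

def stream_flag(stream):
    """Real-time processing: inspect each transaction as it arrives."""
    for t in stream:
        if t > FRAUD_THRESHOLD:
            yield t  # flagged immediately, before the next transaction

flagged = batch_flag(transactions)
print(flagged)  # the single suspicious transaction
```

Both styles flag the same transaction here; the difference is *when* the result is available. In the batch version, the fraudulent payment is only discovered after the whole batch has been collected, which may be too late.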
So, now that we have extensively discussed the differences between Spark and MapReduce, in the upcoming video, let’s summarise these differences.
So, as discussed in the video, Spark is nearly 100x faster than MapReduce for in-memory processing.
So far, we have discussed Spark in comparison with MapReduce. In the next few segments, you will learn about various other features that are available in Spark and in the Spark ecosystem.
Highly Recommended:
Under In-Segment Questions, there is a question on the differences between Spark and MapReduce. Do answer that question to summarise your learning from this segment.
Additional Reading:
- Visualisation of seek time and rotational latency