
Late-Arriving Data and Watermarks

Now that you have seen how windows are applied in code, let’s discuss what late-arriving data is and how it can be handled using watermarks. Real-world pipelines have many dependencies, and these frequently introduce substantial latency. A streaming system therefore needs a mechanism to handle late-arriving data effectively in order to produce correct results. This is what watermarks are used for.

Let’s summarise the concepts that you learnt in the video given above.

Late-arriving data occurs when there is a delay between the event time (when an event actually happened) and the processing time (when the system receives and processes it). Stream processing systems need a way to handle such data. Because of resource constraints, the system cannot keep waiting forever for the data to arrive. This is where we use watermarks.

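A watermark defines how late a record may arrive before Spark is allowed to drop it. It is a scalable way of managing late-arriving data because it limits how much old state Spark has to keep. With every incoming batch, Spark tracks the maximum event time that it has seen so far and subtracts the watermark delay from it. Records with an event time older than the resulting threshold are treated as too late and may be dropped.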
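To make the threshold calculation concrete, here is a minimal pure-Python sketch of the idea. It is not Spark code: the function process_batches and the helper t are hypothetical names used only for illustration, and the per-record comparison is a simplification (Spark applies the watermark to the state it keeps, such as windows, and only guarantees that data within the delay is not dropped). In Spark itself, the delay is declared with DataFrame.withWatermark(eventTime, delayThreshold).

```python
from datetime import datetime, timedelta


def t(hhmm):
    """Hypothetical helper: build an event time on a fixed example date."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2024, 1, 1, hour, minute)


def process_batches(batches, delay):
    """Simulate watermark filtering over a sequence of micro-batches.

    batches: list of lists of event-time datetimes
    delay:   timedelta describing how late a record may arrive
    Returns (accepted, dropped) lists of event times.
    """
    max_event_time = None
    watermark = None  # no threshold exists until the first batch is seen
    accepted, dropped = [], []

    for batch in batches:
        # Records in this batch are checked against the watermark
        # computed from the batches that came before it.
        for event_time in batch:
            if watermark is not None and event_time < watermark:
                dropped.append(event_time)
            else:
                accepted.append(event_time)

        # After the batch, advance the maximum event time seen so far
        # and move the watermark to (max event time - delay).
        if batch:
            batch_max = max(batch)
            if max_event_time is None or batch_max > max_event_time:
                max_event_time = batch_max
            watermark = max_event_time - delay

    return accepted, dropped


# Assumed example data: a 10-minute delay and three micro-batches.
batches = [
    [t("12:00"), t("12:04")],  # watermark afterwards: 11:54
    [t("12:15"), t("11:58")],  # 11:58 is late but still within the delay
    [t("12:03"), t("12:07")],  # watermark is now 12:05, so 12:03 is too late
]
accepted, dropped = process_batches(batches, timedelta(minutes=10))
```

In this run, the record stamped 11:58 is accepted even though it arrives after newer data, while the record stamped 12:03 falls behind the threshold of 12:05 and is dropped.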

When choosing the watermark delay, we must bear in mind that it should allow enough time for failovers, network lags, etc. However, it should not be so large that it puts excess load on the system, because a longer delay means that state is retained for longer. The appropriate value will depend on the available memory and the incoming traffic per minute.
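The trade-off above can be illustrated with a rough back-of-the-envelope estimate. The sketch below assumes a tumbling-window aggregation, where a window has to be kept until the watermark passes its end, so roughly ceil(delay / window length) + 1 windows are open at any time. The function names, the number of keys per window and the bytes per state entry are all hypothetical figures chosen for illustration, not measurements from Spark.

```python
import math


def max_open_windows(delay_minutes, window_minutes):
    """Upper bound on tumbling windows kept in state at the same time."""
    return math.ceil(delay_minutes / window_minutes) + 1


def estimated_state_bytes(delay_minutes, window_minutes,
                          keys_per_window, bytes_per_entry):
    """Rough state size: open windows x keys per window x bytes per entry."""
    open_windows = max_open_windows(delay_minutes, window_minutes)
    return open_windows * keys_per_window * bytes_per_entry


# Assumed workload: 5-minute windows, 100,000 distinct keys per window,
# and about 200 bytes of state per key.
for delay in (10, 60, 1440):
    size = estimated_state_bytes(delay, 5, 100_000, 200)
    print(f"delay={delay} min -> about {size / 1e6:.0f} MB of state")
```

Under these assumptions, the estimated state grows roughly in proportion to the delay, which is why a very generous watermark can become expensive when traffic is high.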

With that, let’s move on to the next segment to summarise your learnings from this session.

Additional Readings

Handling Late Arriving Data With Structured Streaming – The article demonstrates how to handle late-arriving data with watermarks.
