
Joins With Streams

In the previous segment, you learnt about transformations and aggregations and applied them through a coding lab.

In a real industry scenario, a Spark Structured Streaming application may need to process multiple streams of data. In such a situation, one operation that we may have to apply is a join that combines the different streams into a single one. As you know, the primary data structure of a Spark Structured Streaming application is the streaming DataFrame, so let’s watch the upcoming video and learn how we can apply joins to two different streaming DataFrames.

So, in the video, you saw that joins on streams are quite similar to joins in SQL; you only need to treat each streaming DataFrame as a table. Streaming DataFrames can therefore be joined with other streaming DataFrames or with static DataFrames in much the same way.

Joins are of the following different types:

  • Inner Joins – Will return only the records that have matching values in both DataFrames.
  • Outer Joins
    • Left – Will return all records from the left DataFrame and the matching records from the right DataFrame.
    • Right – Will return all records from the right DataFrame and the matching records from the left DataFrame.
    • Full – Will return all records from both DataFrames, with nulls in the columns of whichever side has no match.
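
To make the four join types concrete, here is a minimal sketch in plain Python rather than Spark. The `join_tables` helper and the `orders` and `payments` tables are hypothetical names invented for this illustration, and the sketch assumes that keys are unique within each table.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, Optional[str], Optional[str]]  # (key, left value, right value)


def join_tables(left: Dict[int, str], right: Dict[int, str], how: str) -> List[Row]:
    """Join two keyed tables; None plays the role of SQL NULL."""
    if how == "inner":
        keys = left.keys() & right.keys()   # keys present on both sides
    elif how == "left":
        keys = left.keys()                  # every left key is kept
    elif how == "right":
        keys = right.keys()                 # every right key is kept
    elif how == "full":
        keys = left.keys() | right.keys()   # every key from either side
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(k, left.get(k), right.get(k)) for k in sorted(keys)]


# Hypothetical sample data: key 1 exists only on the left, key 4 only on the right.
orders = {1: "laptop", 2: "phone", 3: "tablet"}
payments = {2: "card", 3: "cash", 4: "upi"}

print(join_tables(orders, payments, "inner"))
# [(2, 'phone', 'card'), (3, 'tablet', 'cash')]
print(join_tables(orders, payments, "left"))
# [(1, 'laptop', None), (2, 'phone', 'card'), (3, 'tablet', 'cash')]
print(join_tables(orders, payments, "full"))
# [(1, 'laptop', None), (2, 'phone', 'card'), (3, 'tablet', 'cash'), (4, None, 'upi')]
```

The outer joins differ only in which side's unmatched keys are kept, and that is exactly what becomes difficult once one of the inputs is an unbounded stream.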

There are, however, a few restrictions on outer joins in streams. These include the following:

  • Stream–Stream: outer joins can be performed only using watermarks, together with an event-time constraint in the join condition.
  • Stream–Static: right outer and full outer joins are not permitted.
  • Static–Stream: left outer and full outer joins are not permitted.

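The reason for the Stream–Static and Static–Stream restrictions is that, in each of the disallowed joins, the static DataFrame is the side whose unmatched records must be returned. To know that a static record has no match, Spark would have to see the entire stream, which is unbounded. The result could therefore never be finalised, and tracking it would put an ever-growing load on the system.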
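
As an illustration of a join that is permitted, the following is a minimal PySpark sketch of a left outer join with the stream on the left and a static DataFrame on the right. The function name `enrich_orders` and the join column `customer_id` are assumptions made for this example; `DataFrame.join` and `DataFrame.isStreaming` are the standard PySpark members.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; Spark supplies the objects at run time
    from pyspark.sql import DataFrame


def enrich_orders(orders_stream: "DataFrame", customers_static: "DataFrame") -> "DataFrame":
    """Left outer join: streaming orders on the left, static customers on the right.

    `customer_id` is a hypothetical column assumed to exist in both DataFrames.
    """
    if not orders_stream.isStreaming or customers_static.isStreaming:
        raise ValueError("expected a streaming left side and a static right side")
    # The preserved (outer) side is the stream, so every incoming order is kept
    # and the customer columns are null when no static row matches.
    return orders_stream.join(customers_static, on="customer_id", how="left_outer")
```

Here the preserved side is the stream, so each micro-batch can be joined against the static data and emitted straight away. Changing `how` to `"right_outer"` would make the static side the preserved one, which is the case that Spark is expected to reject.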

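In Spark versions before 3.1, a Stream–Stream full outer join is not permitted at all; later versions allow it only with watermarks and event-time constraints. Streams are unbounded, so without such limits a full outer join would mean retaining everything that either stream has ever received. This would again be a huge load on the system.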

Only append output mode is supported for Stream–Stream joins, because of the computational constraints stated above.
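
To see why a watermark makes an outer join between two streams feasible, consider the following toy model in plain Python. It is not how Spark is implemented: the function `left_outer_join`, the single shared watermark and the parameters `delay` and `max_gap` are simplifying assumptions made for this sketch. A right record matches a left record with the same key when its event time falls within `max_gap` of the left record's event time.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, str, int]             # (side, key, event_time); side is "L" or "R"
Joined = Tuple[str, int, Optional[int]]  # (key, left_time, right_time or None)


def left_outer_join(events: List[Event], delay: int, max_gap: int) -> Tuple[List[Joined], int]:
    """Toy left outer join; returns the output rows and the final state size."""
    left_buf: List[list] = []              # [key, time, matched?]
    right_buf: List[Tuple[str, int]] = []  # (key, time)
    out: List[Joined] = []
    max_seen = 0
    for side, key, t in events:
        max_seen = max(max_seen, t)
        watermark = max_seen - delay       # events older than this are too late
        if t < watermark:
            continue                       # drop the late event
        if side == "L":
            matched = False
            for rkey, rt in right_buf:
                if rkey == key and t <= rt <= t + max_gap:
                    out.append((key, t, rt))
                    matched = True
            left_buf.append([key, t, matched])
        else:
            for row in left_buf:
                if row[0] == key and row[1] <= t <= row[1] + max_gap:
                    out.append((key, row[1], t))
                    row[2] = True
            right_buf.append((key, t))
        # A left row whose match window lies behind the watermark can never match
        # again, so an unmatched one is emitted with None and then evicted.
        for lkey, lt, was_matched in left_buf:
            if lt + max_gap < watermark and not was_matched:
                out.append((lkey, lt, None))
        left_buf = [r for r in left_buf if r[1] + max_gap >= watermark]
        right_buf = [r for r in right_buf if r[1] >= watermark]
    return out, len(left_buf) + len(right_buf)


events = [("L", "a", 0), ("L", "b", 2), ("R", "a", 4), ("R", "c", 20), ("L", "a", 30)]
rows, state_size = left_outer_join(events, delay=5, max_gap=10)
print(rows)        # [('a', 0, 4), ('b', 2, None)]
print(state_size)  # 1
```

The unmatched record for key `b` is emitted only once the watermark has moved past its match window, and old records are then removed from the state. Without the watermark, the model could never decide that `b` has no match and would have to keep every record forever, which is the load problem described above. Because a row is emitted only when it is final and is never revised afterwards, this behaviour fits naturally with append mode.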

Now, in the next segment, we will put these concepts to use via a coding lab.
