Kafka Integration

In the next video, you will learn how to integrate Kafka with Spark. Kafka is a powerful messaging and integration platform for Spark Streaming. It acts as the central hub for real-time streams of data, which are processed in Spark Streaming using complex algorithms. Once the data is processed, Spark Streaming either publishes the results to another Kafka topic or stores them in HDFS, databases or dashboards.

Kafka is a state-of-the-art messaging system. It is also highly scalable; hence, it is popular in the industry. It follows the publisher-subscriber, or pub-sub, model. In this model, a publisher writes messages to a topic, and multiple subscribers can read from that topic. A topic can be thought of as equivalent to a table in a database system. This model integrates well with Spark Streaming for further processing of the data streams.
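To make the pub-sub model concrete, here is a minimal sketch using the kafka-python client. The broker address, topic name and messages are placeholder assumptions for illustration, not values from the course.

    from kafka import KafkaProducer, KafkaConsumer

    # Publisher: send a message to a topic (broker address is an assumption)
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("test-topic", key=b"player", value=b"hello from the publisher")
    producer.flush()

    # Subscriber: any number of consumers can independently read the same topic
    consumer = KafkaConsumer("test-topic",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:  # blocks, polling for new messages
        print(message.key, message.value)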

Note:

You will find the steps to create a Kafka Cluster in the previous module (on Apache Kafka).

The lab flow for the first coding example is as follows (a command-line sketch of the Kafka-side steps appears after the list):

  • Set up Kafka
  • Create a Topic
  • Publish into the Kafka Topic
  • Set up a Spark job to read from the Kafka Topic
  • Execute
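As a rough sketch, the Kafka-side steps could look like this on the command line. The topic name and broker address are assumptions, and the exact flags vary slightly between Kafka versions (older releases use --zookeeper and --broker-list instead of --bootstrap-server).

    # Create a topic with one partition and a replication factor of 1
    bin/kafka-topics.sh --create --topic test-topic \
      --bootstrap-server localhost:9092 \
      --partitions 1 --replication-factor 1

    # Publish a few messages into the topic from the console producer
    bin/kafka-console-producer.sh --topic test-topic \
      --bootstrap-server localhost:9092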

Now, let’s move on to our first coding lab. In this lab, you will see how Spark Streaming can read data from a Kafka topic.

Let’s summarise what you have learned in the above video.

  • You first created a Kafka topic with a single partition and a replication factor of 1. You then subscribed to the Kafka topic and loaded the stream.
  • Then, you set the output mode to append and printed the stream on the console. Remember that before reading from the topic, you need to publish something to it.
  • After this, you were able to successfully read from the Kafka topic.
  • Note that both the Kafka topic and the address of the Kafka server were specified in the code (a sketch of such a job appears below).
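A minimal sketch of such a reading job in PySpark is given below; the broker address and topic name are assumptions, not the exact values used in the video.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("KafkaReadExample").getOrCreate()

    # Subscribe to the Kafka topic (broker address and topic name assumed)
    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "test-topic")
          .load())

    # Kafka delivers key and value as binary, so cast them to strings
    messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    # Append mode: print each new batch of messages to the console
    query = (messages.writeStream
             .outputMode("append")
             .format("console")
             .start())

    query.awaitTermination()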

In the next lab, you will see how Spark Streaming can be used to write data into a Kafka topic.

The lab flow for the next example is as follows:

  • Read from a streaming file source (HDFS) using Spark
  • Publish to a Kafka Topic
  • Verify from the Kafka consumer console (a sample command is sketched below)
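The verification step might look like this from the command line (topic name and broker address are assumptions):

    # Read everything published to the topic so far
    bin/kafka-console-consumer.sh --topic players \
      --bootstrap-server localhost:9092 --from-beginning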

Now, let’s move on to the next lab.

Let’s summarise what you have learned in the above video.

You created a data frame in the form of key-value pairs and specified the host and port of the Kafka server to which the data would be published. Then, you created a stream using the CSV file and were able to successfully write the ages (values) of the players (keys) to the Kafka topic.
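A minimal sketch of such a writing job in PySpark follows; the schema, HDFS path, broker address and topic name are assumptions for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("KafkaWriteExample").getOrCreate()

    # Assumed schema for the players CSV: name and age
    schema = StructType([
        StructField("name", StringType()),
        StructField("age", IntegerType()),
    ])

    # Stream CSV files from an HDFS directory (path is an assumption)
    csv_df = (spark.readStream
              .schema(schema)
              .csv("hdfs:///user/hadoop/players"))

    # The Kafka sink expects 'key' and 'value' columns (string or binary)
    kafka_df = csv_df.selectExpr("CAST(name AS STRING) AS key",
                                 "CAST(age AS STRING) AS value")

    # Publish to the topic; the Kafka sink also requires a checkpoint location
    query = (kafka_df.writeStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("topic", "players")
             .option("checkpointLocation", "/tmp/players-checkpoint")
             .start())

    query.awaitTermination()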

In order to run a Spark Streaming job that connects to a Kafka stream, you need to submit the job with the spark-sql-kafka integration package on the classpath, for example as follows:
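    spark-submit \
      --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
      your_streaming_job.py

Here, your_streaming_job.py is a placeholder for your script, and the package version (3.1.2, built for Scala 2.12) is an assumption that must match your Spark and Scala versions.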

Additional Readings

Apache Spark Streaming Use Cases: This document presents a few real-life use cases of Spark Streaming.
