IKH

Flink Ecosystem & Programming Model

In the previous segment, you looked at some key highlights of Apache Flink and learnt how Alibaba is using it. In this segment, you will learn about the Apache Flink ecosystem and its programming model.
In the following video, you will learn about the various components of the Apache Flink ecosystem.

Let’s summarise what you have learnt in this video.


Flink Component Stack

The components of the Apache Flink ecosystem are as follows:

Libraries

  • Multiple libraries and APIs are provided by Flink which interacts with DataSet or DataStream APIs (Core APIs).
  • Some common libraries and their functionalities are as follows:
    • Table API/SQL is used for running queries on logical tables. It can interact with both DataSet and DataStream APIs.
    • Apache Flink ML is used for machine learning purposes and can interact with DataSet API only.
    • Gelly is used for graph processing and can interact with DataSet API only.
    • CEP is used for complex event processing and can interact with DataStream API only.

Core APIs

  • The ecosystem of Apache Flink has two core APIs: DataSet API and DataStream API.
  • DataSet API is used for batch processing, while DataStream API is used for stream processing.
  • Both DataStream API and DataSet API generate jobGraphs.
  • DataSet API uses an optimiser the determine the optimal plan for a program, while DataStream API uses a stream builder.

Runtime Layer

  • The Apache Flink runtime layer receives a program in the form of a JobGraph from the APIs.
  • A JobGraph is a dataflow with tasks that consume and produce data streams.

Deployment Layer

  • Various deployment options are available in Apache Flink (for example, local, cluster, YARN, etc.) that execute a JobGraph.

The following image depicts all the parts of the Apache Flink component stack.

Apache Flink Runtime Environment

  • The Apache Flink runtime environment consists of two processes:
    • JobManagers: Also known as masters, Job Managers execute jobs such as the recovery from failure, task scheduling and checkpoint coordinating. There are always one or more than one JobManager, where one of these is the leader and the others are standby.
    • TaskManagers: Also known as workers, TaskManagers execute the subtasks of a dataflow. There should always be at least one TaskManager.

There are multiple ways to start a JobManager and a TaskManager. For instance, it can be started directly on the machines as a standalone cluster or managed by resource frameworks such as YARN and Mesos. TaskManagers are assigned work by JobManagers once they announce themselves as available, and they are connected to JobManagers.

  • The client is not a part of runtime and program execution. The client prepares and sends the dataflow to the JobManager.
  • Each TaskManager is a JVM process. It can execute one or more subtasks in separate threads. A worker has at least one task slot. A task slot represents a fixed subset of resources of a manager.

The following image depicts the Apache Flink runtime environment.

In the following video, our expert will explain the basic components of the Apache Flink programming model.

Let’s summarise what you have learnt in the video above.

Apache Flink Programming Model

The programming model of Apache Flink consists of three basic components:

  • The program reads input data from the Data Source. It can be a stream, database or file/object store.
  • In the Transform phase, the input data stream/data set transforms into a new data stream/data set. 
  • The transformed data set/data stream is consumed by Data Sinks and transferred to files, external systems, etc.

The following image depicts the Apache Flink programming model.

Additional Reading

  • Job & Scheduling – This is the official documentation link explaining how job scheduling is done in Flink.
  • Flink Architecture –  This is the official documentation link explaining the Flink architecture.
  • Dataflow Programming Model – This is the official documentation link explaining the Flink programming model.

Report an error