
Spark Structured Streaming Applications on Kubernetes: the FASTEN RUS Experience
In this article I will describe how we ported our Spark Structured Streaming applications to Kubernetes (K8s) and implemented CI for streaming.
How it all began
Streaming is a key component of the FASTEN RUS BI platform. Real-time data is used by the data analysis team to build operational reports.
Streaming applications are implemented with Spark Structured Streaming. This framework provides a convenient data transformation API that meets our needs in terms of development speed.
The streams themselves ran on an AWS EMR cluster. To bring up a new stream on the cluster, an ssh script that submitted the Spark job was deployed, after which the application was launched. At first this seemed to suit everyone. But as the number of streams grew, the need for CI for streaming became more and more obvious: it would give the data analysis team more autonomy in launching applications that deliver data on new entities.
Now let's look at how we solved this problem by porting streaming to Kubernetes.
Why Kubernetes?
Kubernetes as a resource manager suited our needs best: zero-downtime deployments and a wide range of tools for implementing CI on Kubernetes, including Helm. In addition, our team had sufficient expertise in building CI pipelines on K8s, so the choice was obvious.
How is the Kubernetes-based Spark application management model organized?

The client runs spark-submit against K8s. A driver pod for the application is created, and the Kubernetes Scheduler binds it to a cluster node. The driver then requests the creation of pods for the executors; these pods are created and bound to cluster nodes. After that the standard sequence follows: the application code is converted into a DAG, the DAG is decomposed into stages, the stages are broken down into tasks, and the tasks are launched on the executors.
This model works quite well when Spark applications are started manually. However, launching spark-submit from outside the cluster did not suit us for CI. We needed a solution that would run Spark (perform spark-submit) directly on the cluster nodes, and here the Kubernetes Operator model fully met our requirements.
Kubernetes Operator as a Spark Application Lifecycle Management Model
Kubernetes Operator is a concept for managing stateful applications in Kubernetes, proposed by CoreOS, which automates operational tasks such as deploying applications, restarting them on failure, and updating their configuration. One of the key Kubernetes Operator patterns is the CRD (CustomResourceDefinition), which adds custom resources to a K8s cluster and, in turn, lets you work with these resources as with native Kubernetes objects.
An operator is a daemon that lives in a pod in the cluster and reacts to the creation of a custom resource or to changes in its state.
Let us look at this concept applied to Spark application lifecycle management.

The user runs kubectl apply -f spark-application.yaml, where spark-application.yaml is the Spark application specification. The operator receives the Spark application object and executes spark-submit.
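For illustration, a minimal spark-application.yaml for such an operator-managed stream might look like the sketch below. The field names follow the spark-on-k8s-operator SparkApplication API; the image, main class, jar path and resource figures are hypothetical placeholders rather than our production settings.

# spark-application.yaml - a minimal, illustrative SparkApplication manifest
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: orders-stream                         # hypothetical stream name
  namespace: streaming
spec:
  type: Scala
  mode: cluster
  image: registry.example.com/spark:2.4.4     # placeholder image
  mainClass: com.example.streaming.OrdersStream
  mainApplicationFile: "local:///opt/app/streaming.jar"
  sparkVersion: "2.4.4"
  restartPolicy:
    type: Always                              # restart the stream regardless of why it stopped
  driver:
    cores: 1
    memory: "1g"
  executor:
    cores: 2
    instances: 2
    memory: "2g"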
As we can see, the Kubernetes Operator model manages the lifecycle of a Spark application directly inside the Kubernetes cluster, which was a strong argument in favor of this model for our purposes.
We decided to use spark-on-k8s-operator as the Kubernetes Operator for managing streaming applications. It offers a fairly convenient API as well as flexible configuration of the restart policy for Spark applications (which is quite important for supporting streaming applications).
CI implementation
CI for streaming was implemented with GitLab CI/CD. Spark applications were deployed to K8s using Helm.
The pipeline itself consists of two stages (a minimal sketch of the pipeline configuration follows the list):
- test - syntax checking and rendering of the Helm templates;
- deploy - deployment of the streaming applications to the test (dev) and production (prod) environments.
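A .gitlab-ci.yml implementing these two stages could look roughly like this; the job names, chart path and values files are illustrative assumptions, and only the stage layout mirrors our pipeline.

# .gitlab-ci.yml - a sketch of the two-stage pipeline (illustrative paths and names)
stages:
  - test
  - deploy

test:
  stage: test
  script:
    - helm lint ./charts/streaming                          # syntax check
    - helm template ./charts/streaming -f values/dev.yaml   # render templates with env-specific values

deploy-dev:
  stage: deploy
  environment: dev
  script:
    - helm upgrade --install streaming ./charts/streaming -f values/dev.yaml

deploy-prod:
  stage: deploy
  environment: prod
  when: manual                                              # production launch is triggered explicitly
  script:
    - helm upgrade --install streaming ./charts/streaming -f values/prod.yaml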
Let us consider these stages in more detail.
At the test stage, the Spark application Helm template (the SparkApplication CRD) is rendered with environment-specific values.
The key sections of the Helm template (illustrated by the values sketch after the list below) are:
- spark:
- version - Apache Spark version
- image - Docker image used
- nodeSelector - a key/value map that must match node labels and determines where the application's pods are scheduled.
- tolerations - the list of tolerations for the Spark application's pods.
- mainClass - Spark application class
- applicationFile - local path where the Spark application jar is located
- restartPolicy - Spark application restart policy
- Never - the completed Spark application does not restart
- Always - the completed Spark application restarts regardless of the reason for the stop.
- OnFailure - the Spark application is restarted only in case of failure
- maxSubmissionRetries - the maximum number of Spark application submission retries
- driver / executor:
- cores - the number of cores allocated to the driver / executor
- instances (used only in the executor configuration) - the number of executors
- memory - the amount of memory allocated to the driver / executor process
- memoryOverhead - the amount of off-heap memory allocated to the driver / executor
- streams:
- name - name of the streaming application
- arguments - arguments to the streaming application
- sink - the path to Data Lake datasets on S3
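To make the structure above concrete, an environment-specific values file feeding this template might look roughly like the sketch below. The keys mirror the sections just listed, while the stream names, image, paths and resource figures are hypothetical.

# values/dev.yaml - an illustrative sketch of environment-specific values
spark:
  version: "2.4.4"
  image: registry.example.com/spark:2.4.4
  mainClass: com.example.streaming.StreamingJob
  applicationFile: "local:///opt/app/streaming.jar"
  restartPolicy: Always
  maxSubmissionRetries: 3
  nodeSelector:
    role: streaming
  tolerations: []
driver:
  cores: 1
  memory: "1g"
  memoryOverhead: "512m"
executor:
  cores: 2
  instances: 2
  memory: "2g"
  memoryOverhead: "1g"
streams:
  - name: orders                              # one entry per delivered entity
    arguments: ["--topic", "orders"]
    sink: "s3a://data-lake/raw/orders"
  - name: payments
    arguments: ["--topic", "payments"]
    sink: "s3a://data-lake/raw/payments"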
After the template is rendered, the applications are deployed to the test (dev) environment using Helm.
The CI pipeline has run successfully.

Then we launch the deploy-prod job, which starts the applications in production.
We verify that the job completed successfully.

As we can see below, the applications are running and the pods are in the RUNNING status.

Conclusion
Porting our Spark Structured Streaming applications to K8s and the subsequent CI implementation allowed us to automate the launch of streams that deliver data on new entities. To launch the next stream, it is enough to prepare a Merge Request describing the Spark application configuration in the values yaml file, and when the deploy-prod job is started, data delivery to the Data Lake (S3) is initiated. This solution gave the data analysis team autonomy in tasks related to adding new entities to the repository. In addition, porting streaming to K8s and, in particular, managing Spark applications with the Kubernetes Operator spark-on-k8s-operator significantly increased the resiliency of streaming. But more on that in the next article.