Zeppelin on Kubernetes

Zeppelin can run on clusters managed by Kubernetes. When Zeppelin runs in a Pod, it creates a separate Pod for each interpreter. The Spark interpreter is also automatically configured to use Spark on Kubernetes in client mode.

Key benefits are:

  • Interpreter scale-out
  • Automatic configuration of the Spark interpreter for Spark on Kubernetes
  • Customizable Kubernetes YAML files
  • Spark UI access

Prerequisites

  • Zeppelin >= 0.9.0 docker image
  • Spark >= 2.4.0 docker image (in case of using Spark Interpreter)
  • A running Kubernetes cluster with access configured to it using kubectl
  • Kubernetes DNS configured in your cluster
  • Enough CPU and memory in your Kubernetes cluster. We recommend 4 CPUs and 6 GB of memory to be able to start the Spark interpreter with a few executors.

    • If you're using minikube, check your cluster capacity (kubectl describe node) and increase it if necessary:

      $ minikube delete    # otherwise configuration won't apply
      $ minikube config set cpus <number>
      $ minikube config set memory <number in MB>
      $ minikube start
      $ minikube config view
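
As a minimal sketch of filling in the placeholders above with the recommended 4 CPUs and 6 GB of memory, the snippet below derives the memory figure in MB and prints the resulting commands. The helper name gib_to_mb is hypothetical, and treating the recommendation as 6 × 1024 MB is an assumption; adjust the numbers to your machine.

```shell
# Assumption: the 6 GB recommendation is converted with 1 GB = 1024 MB.
gib_to_mb() {
  echo $(( $1 * 1024 ))
}

CPUS=4                       # recommended CPU count from the prerequisites
MEMORY_MB=$(gib_to_mb 6)     # recommended memory, expressed in MB

# Print the commands rather than running them, so they can be reviewed first.
echo "minikube config set cpus ${CPUS}"
echo "minikube config set memory ${MEMORY_MB}"
```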
      

Quickstart

Let's first clone the Zeppelin repository from GitHub:

git clone https://github.com/apache/zeppelin.git
cd zeppelin
# optionally check out the version or branch you want, e.g.
# git checkout tags/v0.10.1
# then note the version declared in "./pom.xml"; the image builds below need it

Now build the zeppelin-distribution image. This may take some time. The image serves as the base for the images built in the following steps:

docker build -t zeppelin-distribution:latest -f ./Dockerfile .

Next, we will build our zeppelin-server image:

cd scripts/docker/zeppelin-server
# The pom.xml in the repository root shows that the version is 0.11.0-SNAPSHOT.
# Set the same version in the Dockerfile:
# vi Dockerfile
# ARG version="0.11.0-SNAPSHOT"
# Once you have saved the Dockerfile with the correct version, build the image:
docker build -t zeppelin-server:0.11.0-SNAPSHOT -f ./Dockerfile .

The last image we build is zeppelin-interpreter:

# the previous step left us in scripts/docker/zeppelin-server
cd ../zeppelin-interpreter
docker build -t zeppelin-interpreter:0.11.0-SNAPSHOT -f ./Dockerfile .

So we should now have the following images:

# sudo if you are on Linux and Docker requires root
$ docker images

REPOSITORY                    TAG               IMAGE ID       CREATED          SIZE
zeppelin-interpreter          0.11.0-SNAPSHOT   4f77fe989eed   3 minutes ago    622MB
zeppelin-server               0.11.0-SNAPSHOT   4f77fe989eed   3 minutes ago    622MB
zeppelin-distribution         latest            bd2fb4b321d2   40 minutes ago   1.27GB

Reminder: adjust the image names and tags in zeppelin-server.yaml (k8s/zeppelin-server.yaml in the repository) so that they match the images you just built.
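
One way to find the entries that need adjusting is to list every line of the manifest that mentions an image. The following is only a sketch: the helper name list_image_refs is hypothetical, and it makes no assumption about the exact keys used in the file beyond their containing the word "image" in some letter case.

```shell
# Print every line of a manifest that mentions an image, prefixed with its line number.
list_image_refs() {
  grep -n -i 'image' "$1"
}
```

For example, running list_image_refs k8s/zeppelin-server.yaml from the repository root shows the candidate lines to edit before applying the manifest.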

Start Zeppelin on the Kubernetes cluster:

kubectl apply -f zeppelin-server.yaml

Forward the Zeppelin server port:

kubectl port-forward zeppelin-server 8080:80

Then browse to localhost:8080. Try running some paragraphs and check that each interpreter runs as a Pod (using kubectl get pods) instead of as a local process.

To shut down:

kubectl delete -f zeppelin-server.yaml

Spark Interpreter

To use the Spark interpreter, build a Spark docker image. Download a Spark binary distribution and run the following commands. Spark 2.4.0 or later is required.

# if you're using minikube, set docker-env
$ eval $(minikube docker-env)

# build docker image
$ <spark-distribution>/bin/docker-image-tool.sh -m -t 2.4.0 build

Run docker images and check that spark:2.4.0 was created. Then set sparkContainerImage in the zeppelin-server-conf ConfigMap in zeppelin-server.yaml.

Create a note and configure the number of executors (default 1):

%spark.conf
spark.executor.instances  5

Then start your Spark interpreter:

%spark
sc.parallelize(1 to 100).count
...

As long as the spark.master property of the Spark interpreter starts with k8s:// (the default is k8s://https://kubernetes.default.svc when Zeppelin is started using zeppelin-server.yaml), Spark executors are created automatically in your Kubernetes cluster. The Spark UI is accessible by clicking SPARK JOB on the paragraph.

See the Spark documentation on Running Spark on Kubernetes for more details.

Build Zeppelin image manually

To build your own Zeppelin image, first build the Zeppelin project with the -Pbuild-distr flag.

$ ./mvnw package -DskipTests -Pbuild-distr <your flags>

The binary package is created under the zeppelin-distribution/target directory. Move the created package file to the scripts/docker/zeppelin/bin/ directory.

$ mv zeppelin-distribution/target/zeppelin-*.tar.gz scripts/docker/zeppelin/bin/

scripts/docker/zeppelin/bin/Dockerfile downloads the package from the internet. Modify the file so that it adds the package from the local filesystem instead.

...

# Find the following section and comment it out
#RUN echo "$LOG_TAG Download Zeppelin binary" && \
#    wget -O /tmp/zeppelin-${Z_VERSION}-bin-all.tgz http://archive.apache.org/dist/zeppelin/zeppelin-${Z_VERSION}/zeppelin-${Z_VERSION}-bin-all.tgz && \
#    tar -zxvf /tmp/zeppelin-${Z_VERSION}-bin-all.tgz && \
#    rm -rf /tmp/zeppelin-${Z_VERSION}-bin-all.tgz && \
#    mv /zeppelin-${Z_VERSION}-bin-all ${ZEPPELIN_HOME}

# Add the following lines right after the commented-out section above
ADD zeppelin-${Z_VERSION}.tar.gz /
RUN ln -s /zeppelin-${Z_VERSION} /zeppelin
...

Then build the docker image.

# configure docker env, if you're using minikube
$ eval $(minikube docker-env)

# change directory
$ cd scripts/docker/zeppelin/bin/

# build image. Replace <tag>.
$ docker build -t <tag> .

Finally, set the custom image <tag> you just created as the value of both image and the ZEPPELIN_K8S_CONTAINER_IMAGE environment variable in the zeppelin-server container spec in zeppelin-server.yaml.

Currently, a single docker image is used for both the Zeppelin server and the interpreter Pods. Therefore:

| Pod                   | Number of instances | Image                 | Note                                      |
|-----------------------|---------------------|-----------------------|-------------------------------------------|
| Zeppelin Server       | 1                   | Zeppelin docker image | User creates/deletes with kubectl command |
| Zeppelin Interpreters | n                   | Zeppelin docker image | Zeppelin Server creates/deletes           |
| Spark executors       | m                   | Spark docker image    | Spark Interpreter creates/deletes         |

Currently, the Zeppelin docker image is quite large. The Zeppelin project plans to provide lightweight images for each individual interpreter in the future.

How it works

Zeppelin on Kubernetes

k8s/zeppelin-server.yaml is provided to run the Zeppelin server with a few sidecars and configurations. Once the Zeppelin server is started inside Kubernetes, it automatically configures itself to use K8sStandardInterpreterLauncher.

The launcher creates each interpreter in a Pod using the templates located under the k8s/interpreter/ directory. The templates in that directory are applied in alphabetical order. They are rendered by jinjava, and all interpreter properties are accessible inside the templates.
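
To preview the order in which the templates in a directory would be picked up, the sketch below lists the files sorted by name. The helper name list_templates is hypothetical, and bytewise sorting (LC_ALL=C) is used here as an approximation of "alphabetical order"; the exact comparison Zeppelin applies may differ for unusual file names.

```shell
# Print the regular files directly under the given directory, sorted bytewise by path.
list_templates() {
  find "$1" -maxdepth 1 -type f | LC_ALL=C sort
}
```

Calling list_templates k8s/interpreter from the repository root prints one path per line. Giving new template files a name that sorts before or after the existing ones is a simple way to control where they fall in that order.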

Spark on Kubernetes

When the interpreter group is spark, Zeppelin automatically sets the Spark configuration needed to use Spark on Kubernetes. It uses client mode, so the Spark interpreter Pod acts as the Spark driver and Spark executors are launched in separate Pods. This automatic configuration can be overridden by manually setting the spark.master property of the Spark interpreter.

Accessing Spark UI (or Service running in interpreter Pod)

The Zeppelin server Pod has a reverse proxy (nginx) as a sidecar, which splits traffic between the Zeppelin server and the Spark UI running in other Pods. It assumes that both <your service domain> and *.<your service domain> point to the address of the nginx proxy. Requests for <your service domain> are directed to the Zeppelin server, and requests for *.<your service domain> are directed to interpreter Pods.

<port>-<interpreter pod svc name>.<your service domain> is the convention for accessing any application running in an interpreter Pod.

For example, when your service domain name is local.zeppelin-project.org, the Spark interpreter Pod is running with the name spark-axefeg, and the Spark UI is running on port 4040,

4040-spark-axefeg.local.zeppelin-project.org

is the address to access Spark UI.
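
The naming convention can be expressed as a small helper. This is an illustrative sketch only; the function name service_host is hypothetical and not part of Zeppelin.

```shell
# Build the host name for an application running in an interpreter Pod.
# $1 = port, $2 = interpreter Pod service name, $3 = service domain
service_host() {
  printf '%s-%s.%s\n' "$1" "$2" "$3"
}

# Prints 4040-spark-axefeg.local.zeppelin-project.org
service_host 4040 spark-axefeg local.zeppelin-project.org
```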

The default service domain is local.zeppelin-project.org:8080. Both local.zeppelin-project.org and *.local.zeppelin-project.org are configured to resolve to 127.0.0.1, which allows access to Zeppelin and the Spark UI with kubectl port-forward zeppelin-server 8080:80.

To use a custom domain:

  1. Configure an Ingress in the Kubernetes cluster for the http port of the zeppelin-server service defined in k8s/zeppelin-server.yaml.
  2. Configure DNS records so that your service domain and its wildcard subdomain point to the IP addresses of your Ingress.
  3. Modify serviceDomain in the zeppelin-server-conf ConfigMap in k8s/zeppelin-server.yaml.
  4. Apply the changes (e.g. kubectl apply -f k8s/zeppelin-server.yaml).

Persist /notebook and /conf directory

Notebooks and configurations are not persisted by default. If necessary, configure a volume and update k8s/zeppelin-server.yaml to use it to persist the /notebook and /conf directories.

Customization

Zeppelin Server Pod

Edit k8s/zeppelin-server.yaml and apply.

Interpreter Pod

Since interpreter Pods are created and deleted by the Zeppelin server using the templates under the k8s/interpreter directory, customize them as follows:

  1. Prepare a k8s/interpreter directory with your customizations (edit the existing YAML files or create new ones) in a Kubernetes volume.
  2. Modify k8s/zeppelin-server.yaml to mount the prepared k8s/interpreter directory from the volume at /zeppelin/k8s/interpreter/.
  3. Apply the modified k8s/zeppelin-server.yaml.
  4. Run a paragraph. The interpreter is then created using the modified YAML files.

The interpreter pod can also be customized through the interpreter settings. Here are some of the properties:

| Property Name | Default Value | Description |
|---|---|---|
| zeppelin.k8s.interpreter.namespace | default | Specify the namespace of the current interpreter. Users can set different namespaces for different interpreters. To minimize permissions, the interpreter pod can only be created in the default namespace by default. To create an interpreter pod in another namespace, add the corresponding rolebinding in k8s/zeppelin-server.yaml. |
| zeppelin.k8s.interpreter.serviceAccount | default | The Kubernetes service account to use. |
| zeppelin.k8s.interpreter.container.image | apache/zeppelin:<ZEPPELIN_VERSION> | The interpreter image to use. |
| zeppelin.k8s.interpreter.cores | | (optional) The number of CPU cores to use. |
| zeppelin.k8s.interpreter.memory | | (optional) The memory to use, e.g., 1g. |
| zeppelin.k8s.interpreter.gpu.type | | (optional) The type of GPU to request when the interpreter pod needs GPU resources scheduled, e.g., nvidia.com/gpu. |
| zeppelin.k8s.interpreter.gpu.nums | | (optional) The number of GPUs to use. |
| zeppelin.k8s.interpreter.imagePullSecrets | | (optional) A comma-separated list of Kubernetes secrets to use when pulling images, e.g., mysecret1,mysecret2 |
| zeppelin.k8s.interpreter.container.imagePullPolicy | | (optional) The pull policy of the interpreter image, e.g., Always |
| zeppelin.k8s.spark.container.imagePullPolicy | | (optional) The pull policy of the Spark image, e.g., Always |
| zeppelin.spark.uiWebUrl | //{{PORT}}-{{SERVICE_NAME}}.{{SERVICE_DOMAIN}} | The URL for users to access the Spark UI. The default value is a jinjava template that contains three variables. |
| zeppelin.k8s.spark.useIngress | | (optional) If true, an Ingress is created when the Spark interpreter is created, so users can access the Spark UI through the Ingress. |
| zeppelin.k8s.spark.ingress.host | {{PORT}}-{{SERVICE_NAME}}.{{SERVICE_DOMAIN}} | If zeppelin.k8s.spark.useIngress is true, this configures the host value of the Ingress. The default value is a jinjava template that contains three variables. Users can access the Spark UI through a customized zeppelin.k8s.spark.ingress.host. |

Future work

  • Smaller interpreter docker image.
  • Blocking communication between interpreter Pods.
  • The Spark interpreter Pod has a Role with CRUD permissions for any pod/service in the same namespace. This should be restricted to Spark executor Pods only.
  • Per-note interpreter mode by default when Zeppelin is running on Kubernetes.

Development

Instead of building the Zeppelin distribution package and docker image every time during development, you can run Zeppelin locally (for example inside your IDE in debug mode) and still run interpreters using K8sStandardInterpreterLauncher by configuring the following environment variables.

| Environment variable               | Value             | Description                                                                               |
|------------------------------------|-------------------|-------------------------------------------------------------------------------------------|
| ZEPPELIN_RUN_MODE                  | k8s               | Make Zeppelin run interpreter on Kubernetes                                               |
| ZEPPELIN_K8S_PORTFORWARD           | true              | Enable port forwarding from local Zeppelin instance to Interpreters running on Kubernetes |
| ZEPPELIN_K8S_CONTAINER_IMAGE       | <image>:<version> | Zeppelin interpreter docker image to use                                                  |
| ZEPPELIN_K8S_SPARK_CONTAINER_IMAGE | <image>:<version> | Spark docker image to use                                                                 |
| ZEPPELIN_K8S_NAMESPACE             | <k8s namespace>   | Kubernetes namespace to use                                                               |
| KUBERNETES_AUTH_TOKEN              | <token>           | Kubernetes auth token to create resources                                                 |
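
As a rough sketch, these variables could be exported in the shell that launches the local Zeppelin instance (or set in the IDE run configuration). The image names, tags, and namespace below are assumptions taken from the examples earlier on this page; replace them with your own values. The token is deliberately not hard-coded.

```shell
# Environment for running Zeppelin locally while launching interpreters on Kubernetes.
export ZEPPELIN_RUN_MODE=k8s
export ZEPPELIN_K8S_PORTFORWARD=true

# Assumed values: substitute the images and namespace you actually use.
export ZEPPELIN_K8S_CONTAINER_IMAGE="zeppelin-interpreter:0.11.0-SNAPSHOT"
export ZEPPELIN_K8S_SPARK_CONTAINER_IMAGE="spark:2.4.0"
export ZEPPELIN_K8S_NAMESPACE="default"

# KUBERNETES_AUTH_TOKEN must come from the caller's environment; only warn if it is missing.
if [ -z "${KUBERNETES_AUTH_TOKEN:-}" ]; then
  echo "KUBERNETES_AUTH_TOKEN is not set; export it before starting Zeppelin" >&2
fi
```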