Zeppelin on Kubernetes

Zeppelin can run on clusters managed by Kubernetes. When Zeppelin runs in a Pod, it creates a separate Pod for each interpreter. The Spark interpreter is also automatically configured to use Spark on Kubernetes in client mode.

Key benefits are

Interpreter scale-out
Spark interpreter automatically configures Spark on Kubernetes
Ability to customize the Kubernetes yaml file
Spark UI access

Prerequisites

Zeppelin >= 0.9.0 docker image
Spark >= 2.4.0 docker image (when using the Spark interpreter)
A running Kubernetes cluster with access configured to it using kubectl
Kubernetes DNS configured in your cluster
Enough CPU and memory in your Kubernetes cluster. We recommend 4 CPUs and 6g of memory to be able to start the Spark interpreter with a few executors.
- If you're using minikube, check your cluster capacity (kubectl describe node) and increase it if necessary:
```
$ minikube delete    # otherwise configuration won't apply
$ minikube config set cpus <number>
$ minikube config set memory <number in MB>
$ minikube start
$ minikube config view
```
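
As a rough sanity check before picking values for minikube, the sketch below compares a CPU count and a memory size in MB against the recommendation of 4 CPUs and 6g. The function name `meets_recommendation` is hypothetical, and reading 6g as 6144 MB is an assumption.

```shell
# Hypothetical helper, not part of Zeppelin or minikube.
# Succeeds when the given CPU count and memory (in MB) meet the
# recommendation of 4 CPUs and 6g (assumed here to mean 6144 MB).
meets_recommendation() {
  cpus="$1"
  memory_mb="$2"
  [ "$cpus" -ge 4 ] && [ "$memory_mb" -ge 6144 ]
}

# Example: values you might pass to `minikube config set cpus` and
# `minikube config set memory`.
if meets_recommendation 4 8192; then
  echo "capacity looks sufficient"
else
  echo "consider increasing cpus or memory"
fi
```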

Quickstart

Get zeppelin-server.yaml from the GitHub repository or find it in the Zeppelin distribution package.

# Get it from Zeppelin distribution package.
$ ls <zeppelin-distribution>/k8s/zeppelin-server.yaml

# or download it from github
$ curl -s -O https://raw.githubusercontent.com/apache/zeppelin/master/k8s/zeppelin-server.yaml

Start Zeppelin on the Kubernetes cluster:

kubectl apply -f zeppelin-server.yaml

Forward the Zeppelin server port:

kubectl port-forward zeppelin-server 8080:80

Then browse to localhost:8080. Try running some paragraphs and check that each interpreter runs as a Pod (using kubectl get pods) instead of a local process.

To shut down:

kubectl delete -f zeppelin-server.yaml

Spark Interpreter

To use the Spark interpreter, build a Spark Docker image. Download a Spark binary distribution and run the following commands. Spark 2.4.0 or later is required.

# if you're using minikube, set docker-env
$ eval $(minikube docker-env)

# build docker image
$ <spark-distribution>/bin/docker-image-tool.sh -m -t 2.4.0 build

Run docker images and check that spark:2.4.0 was created. Then set sparkContainerImage in the zeppelin-server-conf ConfigMap in zeppelin-server.yaml.

Create a note and configure the number of executors (default 1):

%spark.conf
spark.executor.instances  5

Then start your Spark interpreter:

%spark
sc.parallelize(1 to 100).count
...

As long as the spark.master property of the Spark interpreter starts with k8s:// (the default is k8s://https://kubernetes.default.svc when Zeppelin is started using zeppelin-server.yaml), Spark executors are created automatically in your Kubernetes cluster. The Spark UI is accessible by clicking SPARK JOB on the paragraph.

See the Spark documentation on Running Spark on Kubernetes to learn more.

Build Zeppelin image manually

To build your own Zeppelin image, first build the Zeppelin project with the -Pbuild-distr flag.

$ mvn package -DskipTests -Pbuild-distr <your flags>

The binary package is created under the zeppelin-distribution/target directory. Move the created package file into the scripts/docker/zeppelin/bin/ directory.

$ mv zeppelin-distribution/target/zeppelin-*.tar.gz scripts/docker/zeppelin/bin/

scripts/docker/zeppelin/bin/Dockerfile downloads the package from the internet. Modify the file so that it adds the package from the local filesystem instead.

...

# Find following section and comment out
#RUN echo "$LOG_TAG Download Zeppelin binary" && \
#    wget -O /tmp/zeppelin-${Z_VERSION}-bin-all.tgz http://archive.apache.org/dist/zeppelin/zeppelin-${Z_VERSION}/zeppelin-${Z_VERSION}-bin-all.tgz && \
#    tar -zxvf /tmp/zeppelin-${Z_VERSION}-bin-all.tgz && \
#    rm -rf /tmp/zeppelin-${Z_VERSION}-bin-all.tgz && \
#    mv /zeppelin-${Z_VERSION}-bin-all ${ZEPPELIN_HOME}

# Add following lines right after the commented line above
ADD zeppelin-${Z_VERSION}.tar.gz /
RUN ln -s /zeppelin-${Z_VERSION} /zeppelin
...

Then build the Docker image.

# configure docker env, if you're using minikube
$ eval $(minikube docker-env) 

# change directory
$ cd scripts/docker/zeppelin/bin/

# build image. Replace <tag>.
$ docker build -t <tag> .

Finally, set the custom image <tag> you just built as both the image and the ZEPPELIN_K8S_CONTAINER_IMAGE environment variable of the zeppelin-server container spec in zeppelin-server.yaml.

Currently, a single Docker image is used for both the Zeppelin server and interpreter Pods. The Pods and their images are therefore:

| Pod | Number of instances | Image | Note |
| ----- | ----- | ----- | ----- |
| Zeppelin Server | 1 | Zeppelin docker image | User creates/deletes with kubectl command |
| Zeppelin Interpreters | n | Zeppelin docker image | Zeppelin Server creates/deletes |
| Spark executors | m | Spark docker image | Spark Interpreter creates/deletes |

Currently, the Zeppelin Docker image is quite large. The Zeppelin project plans to provide lightweight images for each individual interpreter in the future.

How it works

Zeppelin on Kubernetes

k8s/zeppelin-server.yaml is provided to run the Zeppelin server with a few sidecars and configurations. Once the Zeppelin server is started inside Kubernetes, it automatically configures itself to use K8sStandardInterpreterLauncher.

The launcher creates each interpreter in a Pod using templates located under the k8s/interpreter/ directory. Templates in the directory are applied in alphabetical order. Templates are rendered by Jinjava, and all interpreter properties are accessible inside the templates.
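
To preview where a custom template would fall in that order, a sketch like the following lists a template directory in sorted order. The function name `preview_template_order` is hypothetical, and treating "alphabetical" as plain byte-wise order (`LC_ALL=C`) is an assumption rather than documented launcher behavior.

```shell
# Hypothetical helper, not part of Zeppelin.
# Lists the entries of a template directory in byte-wise sorted order,
# assumed here to approximate the "alphabetical" order the launcher uses.
preview_template_order() {
  ls -1 "$1" | LC_ALL=C sort
}

# Only list the templates if the directory exists in the current checkout.
if [ -d k8s/interpreter ]; then
  preview_template_order k8s/interpreter
fi
```

Prefixing file names with numbers is one way to make the position of a custom template in this listing predictable.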

Spark on Kubernetes

When the interpreter group is spark, Zeppelin automatically sets the Spark configuration needed to use Spark on Kubernetes. It uses client mode, so the Spark interpreter Pod works as the Spark driver and Spark executors are launched in separate Pods. This automatic configuration can be overridden by manually setting the spark.master property of the Spark interpreter.

Accessing Spark UI (or Service running in interpreter Pod)

The Zeppelin server Pod has a reverse proxy as a sidecar, which splits traffic between the Zeppelin server and the Spark UI running in other Pods. It assumes that both <your service domain> and *.<your service domain> point to the nginx proxy address. <your service domain> is directed to the Zeppelin server, and *.<your service domain> is directed to interpreter Pods.

<port>-<interpreter pod svc name>.<your service domain> is the convention for accessing any application running in an interpreter Pod.

For example, when your service domain name is local.zeppelin-project.org, the Spark interpreter Pod is running with the name spark-axefeg, and the Spark UI is running on port 4040,

4040-spark-axefeg.local.zeppelin-project.org

is the address to access Spark UI.
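
The naming convention can be sketched as a small shell function that joins the three parts. The function name `interpreter_service_host` is hypothetical and is not a Zeppelin command; it only illustrates how the address is assembled.

```shell
# Hypothetical helper illustrating the naming convention; not a Zeppelin command.
# Usage: interpreter_service_host <port> <interpreter pod svc name> <service domain>
interpreter_service_host() {
  printf '%s-%s.%s\n' "$1" "$2" "$3"
}

# Spark UI on port 4040 of the interpreter Pod service spark-axefeg
interpreter_service_host 4040 spark-axefeg local.zeppelin-project.org
# prints 4040-spark-axefeg.local.zeppelin-project.org
```

If the service domain carries a port, such as local.zeppelin-project.org:8080, the port simply stays at the end of the result.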

The default service domain is local.zeppelin-project.org:8080. Both local.zeppelin-project.org and *.local.zeppelin-project.org are configured to resolve to 127.0.0.1, which allows access to Zeppelin and the Spark UI with kubectl port-forward zeppelin-server 8080:80.

If you would like to use your own custom domain:

Configure Ingress in Kubernetes cluster for http port of the service zeppelin-server defined in k8s/zeppelin-server.yaml.
Configure DNS records so that your service domain and its wildcard subdomain point to the IP addresses of your Ingress.
Modify serviceDomain of zeppelin-server-conf ConfigMap in k8s/zeppelin-server.yaml file.
Apply changes (e.g. kubectl apply -f k8s/zeppelin-server.yaml)

Persist /notebook and /conf directories

Notebooks and configurations are not persisted by default. If necessary, configure a volume and update k8s/zeppelin-server.yaml to use it to persist the /notebook and /conf directories.

Customization

Zeppelin Server Pod

Edit k8s/zeppelin-server.yaml and apply.

Interpreter Pod

Since interpreter Pods are created and deleted by the Zeppelin server using templates under the k8s/interpreter directory, customize them as follows:

Prepare a k8s/interpreter directory with your customizations (edit an existing yaml file or create a new one) in a Kubernetes volume.
Modify k8s/zeppelin-server.yaml to mount the prepared volume directory k8s/interpreter at /zeppelin/k8s/interpreter/.
Apply the modified k8s/zeppelin-server.yaml.
Running a paragraph will then create an interpreter using the modified yaml files.

The interpreter Pod can also be customized through the interpreter settings. Here are some of the properties:

| Property Name | Default Value | Description |
| ----- | ----- | ----- |
| zeppelin.k8s.namespace | default | The Kubernetes namespace to use. |
| zeppelin.k8s.interpreter.container.image | apache/zeppelin:<ZEPPELIN_VERSION> | The interpreter image to use. |
| zeppelin.k8s.interpreter.cores | (optional) | The number of CPU cores to use. |
| zeppelin.k8s.interpreter.memory | (optional) | The memory to use, e.g., 1g. |
| zeppelin.k8s.interpreter.gpu.type | (optional) | The type of GPU to request when the interpreter Pod needs GPU resources scheduled, e.g., nvidia.com/gpu. |
| zeppelin.k8s.interpreter.gpu.nums | (optional) | The number of GPUs to use. |
| zeppelin.k8s.interpreter.imagePullSecrets | (optional) | A comma-separated list of Kubernetes secrets to use when pulling images, e.g., mysecret1,mysecret2 |
| zeppelin.k8s.interpreter.container.imagePullPolicy | (optional) | The pull policy of the interpreter image, e.g., Always |
| zeppelin.k8s.spark.container.imagePullPolicy | (optional) | The pull policy of the Spark image, e.g., Always |

Future work

Smaller interpreter docker image.
Blocking communication between interpreter Pods.
The Spark interpreter Pod has a Role with CRUD permissions for any pod/service in the same namespace, which should be restricted to Spark executor Pods only.
Per-note interpreter mode by default when Zeppelin is running on Kubernetes.

Development

Instead of building the Zeppelin distribution package and Docker image every time during development, you can run Zeppelin locally (for example, inside your IDE in debug mode) and have it run interpreters using K8sStandardInterpreterLauncher by configuring the following environment variables.

| Environment variable | Value | Description |
| ----- | ----- | ----- |
| `ZEPPELIN_RUN_MODE` | `k8s` | Make Zeppelin run interpreters on Kubernetes |
| `ZEPPELIN_K8S_PORTFORWARD` | `true` | Enable port forwarding from the local Zeppelin instance to interpreters running on Kubernetes |
| `ZEPPELIN_K8S_CONTAINER_IMAGE` | `<image>:<version>` | Zeppelin interpreter docker image to use |
| `ZEPPELIN_K8S_SPARK_CONTAINER_IMAGE` | `<image>:<version>` | Spark docker image to use |
| `ZEPPELIN_K8S_NAMESPACE` | `<k8s namespace>` | Kubernetes namespace to use |
| `KUBERNETES_AUTH_TOKEN` | `<token>` | Kubernetes auth token to create resources |
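
For a local development session, these variables could be exported in the shell that launches Zeppelin. This is a sketch only: the image names, the namespace, and the token placeholder are hypothetical values chosen for illustration and must be replaced with your own.

```shell
# Run interpreters on Kubernetes and forward ports to the local Zeppelin instance.
export ZEPPELIN_RUN_MODE=k8s
export ZEPPELIN_K8S_PORTFORWARD=true

# Hypothetical image names and namespace; substitute your own.
export ZEPPELIN_K8S_CONTAINER_IMAGE="apache/zeppelin:0.9.0"
export ZEPPELIN_K8S_SPARK_CONTAINER_IMAGE="spark:2.4.0"
export ZEPPELIN_K8S_NAMESPACE="default"

# Keep a token that is already set; otherwise leave a visible placeholder to replace.
export KUBERNETES_AUTH_TOKEN="${KUBERNETES_AUTH_TOKEN:-<token>}"
```

Exporting the variables in the same shell (or in the IDE run configuration) that starts the Zeppelin server keeps the setup local to that session.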