Zeppelin on Kubernetes
Zeppelin can run on clusters managed by Kubernetes. When Zeppelin runs in a Pod, it creates a separate Pod for each interpreter. The Spark interpreter is also automatically configured to use Spark on Kubernetes in client mode.
Key benefits are:
- Interpreter scale-out
- Automatic configuration of the Spark interpreter for Spark on Kubernetes
- Customizable Kubernetes YAML files
- Spark UI access
Prerequisites
- Zeppelin >= 0.9.0 docker image
- Spark >= 2.4.0 docker image (in case of using Spark Interpreter)
- A running Kubernetes cluster with access configured to it using kubectl
- Kubernetes DNS configured in your cluster
- Enough CPU and memory in your Kubernetes cluster. We recommend 4 CPUs and 6g of memory to be able to start the Spark interpreter with a few executors.
If you're using minikube, check your cluster capacity (kubectl describe node) and increase it if necessary:
$ minikube delete # otherwise configuration won't apply
$ minikube config set cpus <number>
$ minikube config set memory <number in MB>
$ minikube start
$ minikube config view
Quickstart
Get zeppelin-server.yaml from the GitHub repository or find it in the Zeppelin distribution package.
# Get it from Zeppelin distribution package.
$ ls <zeppelin-distribution>/k8s/zeppelin-server.yaml
# or download it from github
$ curl -s -O https://raw.githubusercontent.com/apache/zeppelin/master/k8s/zeppelin-server.yaml
Start Zeppelin on the Kubernetes cluster:
kubectl apply -f zeppelin-server.yaml
Forward the Zeppelin server port:
kubectl port-forward zeppelin-server 8080:80
and browse to localhost:8080.
Try running some paragraphs and check that each interpreter runs as a Pod (using kubectl get pods) instead of as a local process.
To shut down:
kubectl delete -f zeppelin-server.yaml
Spark Interpreter
Build a Spark Docker image to use the Spark interpreter. Download a Spark binary distribution and run the following commands. Spark 2.4.0 or later is required.
# if you're using minikube, set docker-env
$ eval $(minikube docker-env)
# build docker image
$ <spark-distribution>/bin/docker-image-tool.sh -m -t 2.4.0 build
Run docker images and check that spark:2.4.0 has been created.
Configure sparkContainerImage of the zeppelin-server-conf ConfigMap in zeppelin-server.yaml.
Create a note and configure the number of executors (default 1):
%spark.conf
spark.executor.instances 5
Then start your Spark interpreter:
%spark
sc.parallelize(1 to 100).count
...
As long as the spark.master property of the Spark interpreter starts with k8s:// (the default is k8s://https://kubernetes.default.svc when Zeppelin is started using zeppelin-server.yaml), Spark executors are created automatically in your Kubernetes cluster.
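The condition on spark.master can be illustrated with a simple prefix check. This is only a sketch of the rule described above, not Zeppelin's actual implementation; the function name uses_spark_on_k8s is hypothetical.

```shell
# Hypothetical helper (not part of Zeppelin): report whether a spark.master
# value selects Spark on Kubernetes, i.e. whether it starts with k8s://.
uses_spark_on_k8s() {
  case "$1" in
    k8s://*) return 0 ;;  # executors would be created in the cluster
    *)       return 1 ;;  # any other master (for example local[*])
  esac
}

if uses_spark_on_k8s "k8s://https://kubernetes.default.svc"; then
  echo "Spark on Kubernetes"
fi
```

A value such as local[*] does not match the prefix, so the check fails for it.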
Spark UI is accessible by clicking SPARK JOB on the paragraph.
See the Spark documentation on Running Spark on Kubernetes to learn more.
Build Zeppelin image manually
To build your own Zeppelin image, first build the Zeppelin project with the -Pbuild-distr flag.
$ mvn package -DskipTests -Pbuild-distr <your flags>
The binary package will be created under the zeppelin-distribution/target directory. Move the created package file to the scripts/docker/zeppelin/bin/ directory.
$ mv zeppelin-distribution/target/zeppelin-*.tar.gz scripts/docker/zeppelin/bin/
scripts/docker/zeppelin/bin/Dockerfile downloads the package from the internet. Modify the file so that it adds the package from the filesystem instead.
...
# Find following section and comment out
#RUN echo "$LOG_TAG Download Zeppelin binary" && \
# wget -O /tmp/zeppelin-${Z_VERSION}-bin-all.tgz http://archive.apache.org/dist/zeppelin/zeppelin-${Z_VERSION}/zeppelin-${Z_VERSION}-bin-all.tgz && \
# tar -zxvf /tmp/zeppelin-${Z_VERSION}-bin-all.tgz && \
# rm -rf /tmp/zeppelin-${Z_VERSION}-bin-all.tgz && \
# mv /zeppelin-${Z_VERSION}-bin-all ${ZEPPELIN_HOME}
# Add following lines right after the commented line above
ADD zeppelin-${Z_VERSION}.tar.gz /
RUN ln -s /zeppelin-${Z_VERSION} /zeppelin
...
Then build the Docker image.
# configure docker env, if you're using minikube
$ eval $(minikube docker-env)
# change directory
$ cd scripts/docker/zeppelin/bin/
# build image. Replace <tag>.
$ docker build -t <tag> .
Finally, set the custom image <tag> you just created as the value of image and of the ZEPPELIN_K8S_CONTAINER_IMAGE env variable in the zeppelin-server container spec in the zeppelin-server.yaml file.
Currently, a single Docker image is used for both the Zeppelin server and the interpreter Pods. Therefore:
Pod | Number of instances | Image | Note |
---|---|---|---|
Zeppelin Server | 1 | Zeppelin docker image | User creates/deletes with kubectl command |
Zeppelin Interpreters | n | Zeppelin docker image | Zeppelin Server creates/deletes |
Spark executors | m | Spark docker image | Spark Interpreter creates/deletes |
Currently, the Zeppelin Docker image is quite large. The Zeppelin project plans to provide lightweight images for each individual interpreter in the future.
How it works
Zeppelin on Kubernetes
k8s/zeppelin-server.yaml is provided to run Zeppelin Server with a few sidecars and configurations.
Once Zeppelin Server is started inside Kubernetes, it automatically configures itself to use K8sStandardInterpreterLauncher.
The launcher creates each interpreter in a Pod using templates located under the k8s/interpreter/ directory.
Templates in the directory are applied in alphabetical order. Templates are rendered by jinjava, and all interpreter properties are accessible inside the templates.
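Because templates are applied in alphabetical order, a numeric file-name prefix is one way to control where a custom template falls in the sequence. The sketch below lists the files of a template directory in that order. It assumes plain byte-wise ordering of file names, which may differ from the launcher's exact comparison, and the function name list_templates_in_order is hypothetical.

```shell
# Hypothetical helper (not part of Zeppelin): print the file names in a
# template directory in byte-wise alphabetical order, one per line.
list_templates_in_order() {
  for f in "$1"/*; do
    # Skip anything that is not a regular file (including an unmatched glob).
    if [ -f "$f" ]; then
      basename "$f"
    fi
  done | LC_ALL=C sort
}

# Example: show the order for a local checkout of the templates.
list_templates_in_order k8s/interpreter
```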
Spark on Kubernetes
When the interpreter group is spark, Zeppelin automatically sets the Spark configuration needed to use Spark on Kubernetes.
It uses client mode, so the Spark interpreter Pod works as the Spark driver and Spark executors are launched in separate Pods.
This automatic configuration can be overridden by manually setting the spark.master property of the Spark interpreter.
Accessing Spark UI (or Service running in interpreter Pod)
The Zeppelin server Pod has a reverse proxy as a sidecar, which splits traffic between the Zeppelin server and the Spark UI running in other Pods.
It assumes that both <your service domain> and *.<your service domain> point to the nginx proxy address.
<your service domain> is directed to ZeppelinServer, and *.<your service domain> is directed to interpreter Pods.
<port>-<interpreter pod svc name>.<your service domain> is the convention for accessing any application running in an interpreter Pod.
For example, when your service domain name is local.zeppelin-project.org, the Spark interpreter Pod is running with the name spark-axefeg, and Spark UI is running on port 4040, then 4040-spark-axefeg.local.zeppelin-project.org is the address to access Spark UI.
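The naming convention can be sketched as a small shell function. This is an illustrative helper, not something shipped with Zeppelin; the function name interpreter_service_address is hypothetical.

```shell
# Hypothetical helper (not part of Zeppelin): build the address of an
# application running in an interpreter Pod, following the convention
# <port>-<interpreter pod svc name>.<your service domain>.
interpreter_service_address() {
  port="$1"      # port the application listens on, e.g. 4040 for Spark UI
  svc_name="$2"  # interpreter Pod service name, e.g. spark-axefeg
  domain="$3"    # service domain, e.g. local.zeppelin-project.org
  printf '%s-%s.%s\n' "$port" "$svc_name" "$domain"
}

# Prints 4040-spark-axefeg.local.zeppelin-project.org
interpreter_service_address 4040 spark-axefeg local.zeppelin-project.org
```

If the service domain includes a port, pass it as part of the third argument so that the port is carried through to the resulting address.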
The default service domain is local.zeppelin-project.org:8080. Both local.zeppelin-project.org and *.local.zeppelin-project.org are configured to resolve to 127.0.0.1.
This allows access to Zeppelin and Spark UI with kubectl port-forward zeppelin-server 8080:80.
If you would like to use your own custom domain:
- Configure Ingress in the Kubernetes cluster for the http port of the service zeppelin-server defined in k8s/zeppelin-server.yaml.
- Configure DNS records so that your service domain and its wildcard subdomain point to the IP addresses of your Ingress.
- Modify serviceDomain of the zeppelin-server-conf ConfigMap in the k8s/zeppelin-server.yaml file.
- Apply the changes (e.g. kubectl apply -f k8s/zeppelin-server.yaml).
Persist /notebook and /conf directory
Notebooks and configurations are not persisted by default. If necessary, configure a volume and update k8s/zeppelin-server.yaml to use it to persist the /notebook and /conf directories.
Customization
Zeppelin Server Pod
Edit k8s/zeppelin-server.yaml and apply it.
Interpreter Pod
Since the interpreter Pod is created and deleted by ZeppelinServer using templates under the k8s/interpreter directory, customize it as follows:
- Prepare a k8s/interpreter directory with your customizations (edit or create new YAML files) in a Kubernetes volume.
- Modify k8s/zeppelin-server.yaml and mount the prepared volume directory k8s/interpreter to /zeppelin/k8s/interpreter/.
- Apply the modified k8s/zeppelin-server.yaml.
- Run a paragraph. This creates an interpreter using the modified YAML files.
The interpreter pod can also be customized through the interpreter settings. Here are some of the properties:
| Property Name | Default Value | Description |
| ----- | ----- | ----- |
| zeppelin.k8s.namespace | default | The Kubernetes namespace to use. |
| zeppelin.k8s.interpreter.container.image | apache/zeppelin:<ZEPPELIN_VERSION> | The interpreter image to use. |
| zeppelin.k8s.interpreter.cores | (optional) | The number of CPU cores to use. |
| zeppelin.k8s.interpreter.memory | (optional) | The memory to use, e.g., 1g. |
| zeppelin.k8s.interpreter.gpu.type | (optional) | The type of GPU to request when the interpreter Pod needs GPU resources scheduled, e.g., nvidia.com/gpu. |
| zeppelin.k8s.interpreter.gpu.nums | (optional) | The number of GPUs to use. |
| zeppelin.k8s.interpreter.imagePullSecrets | (optional) | A comma-separated list of Kubernetes secrets to use when pulling images, e.g., mysecret1,mysecret2. |
| zeppelin.k8s.interpreter.container.imagePullPolicy | (optional) | The pull policy of the interpreter image, e.g., Always. |
| zeppelin.k8s.spark.container.imagePullPolicy | (optional) | The pull policy of the Spark image, e.g., Always. |
Future work
- Smaller interpreter Docker image.
- Block communication between interpreter Pods.
- The Spark interpreter Pod has a Role with CRUD access to any Pod/Service in the same namespace. This should be restricted to Spark executor Pods only.
- Per-note interpreter mode by default when Zeppelin is running on Kubernetes.
Development
Instead of building the Zeppelin distribution package and Docker image every time during development, you can run Zeppelin locally (for example inside your IDE in debug mode) and have it launch interpreters with K8sStandardInterpreterLauncher by configuring the following environment variables.
Environment variable | Value | Description |
---|---|---|
ZEPPELIN_RUN_MODE | k8s | Make Zeppelin run interpreters on Kubernetes |
ZEPPELIN_K8S_PORTFORWARD | true | Enable port forwarding from the local Zeppelin instance to interpreters running on Kubernetes |
ZEPPELIN_K8S_CONTAINER_IMAGE | <image>:<version> | Zeppelin interpreter Docker image to use |
ZEPPELIN_K8S_SPARK_CONTAINER_IMAGE | <image>:<version> | Spark Docker image to use |
ZEPPELIN_K8S_NAMESPACE | <k8s namespace> | Kubernetes namespace to use |
KUBERNETES_AUTH_TOKEN | <token> | Kubernetes auth token to create resources |
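As a minimal sketch, the variables can be exported in the shell that launches the local Zeppelin process (or set in your IDE's run configuration). The image names, namespace, and token below are placeholder assumptions, so replace them with values that match your cluster.

```shell
# Sketch of a local development environment for running interpreters on
# Kubernetes. All values on the last four lines are placeholders.
export ZEPPELIN_RUN_MODE=k8s
export ZEPPELIN_K8S_PORTFORWARD=true
export ZEPPELIN_K8S_CONTAINER_IMAGE="apache/zeppelin:0.9.0"  # assumed image tag
export ZEPPELIN_K8S_SPARK_CONTAINER_IMAGE="spark:2.4.0"      # assumed image tag
export ZEPPELIN_K8S_NAMESPACE="default"                      # assumed namespace
export KUBERNETES_AUTH_TOKEN="<token>"                       # replace with a real token
```

Start Zeppelin from the same shell so that the process inherits these variables.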