Zeppelin Interpreter on Docker

The Zeppelin service runs on a local server, but Zeppelin can run each interpreter in a Docker container, isolating the interpreter's operating environment from the host. This means Zeppelin can be used without installing Python, Spark, etc. on the local node.

Key benefits are

  • Interpreter environment isolation
  • No need to install Python, Spark, etc. on the local node
  • The Docker image does not need the Zeppelin binary package pre-installed; the local Zeppelin interpreter library files are uploaded to the container automatically
  • Local configuration files (such as spark-conf, hadoop-conf-dir, and keytab files) are uploaded to the container automatically, so the runtime environment inside the container is exactly the same as the local one
  • The Zeppelin server runs locally, making it easier to manage and maintain

Prerequisites

  • apache/zeppelin docker image
  • Spark >= 2.2.0 docker image (if using the Spark interpreter)
  • Docker 1.6+ (Install Docker)
  • The interpreter containers use Docker's host network, so no special network setup is needed

Docker Configuration

DockerInterpreterProcess communicates with the Docker daemon through Docker's TCP interface.

By default, the daemon only listens on a Unix socket file, so you need to modify its configuration to also expose a TCP interface remotely.

Edit /etc/docker/daemon.json (e.g. vi /etc/docker/daemon.json) and add tcp://0.0.0.0:2375 to the hosts configuration item.

{
    ...
    "hosts": ["tcp://0.0.0.0:2375","unix:///var/run/docker.sock"]
}

hosts property reference: https://docs.docker.com/engine/reference/commandline/dockerd/
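As a sanity check, the edited daemon.json can be validated before restarting the daemon. The sketch below stages the change in a scratch file; the scratch-file workflow and the systemd commands in the comments are assumptions about your setup:

```shell
# Sketch: stage the "hosts" change in a scratch file and validate it first;
# a malformed daemon.json prevents dockerd from starting at all.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
{
    "hosts": ["tcp://0.0.0.0:2375", "unix:///var/run/docker.sock"]
}
EOF

# Validate the JSON syntax and confirm the TCP endpoint is present.
python3 -m json.tool "$CONF" > /dev/null && echo "daemon.json: valid JSON"
grep -q 'tcp://0.0.0.0:2375' "$CONF" && echo "tcp endpoint present"

# Then copy it into place and restart the daemon (systemd example).
# Note: if docker.service already passes -H on the dockerd command line,
# that conflicts with the "hosts" key and must be removed from the unit file.
#   sudo cp "$CONF" /etc/docker/daemon.json
#   sudo systemctl restart docker
#   docker -H tcp://127.0.0.1:2375 version   # connectivity check
```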

Security warning

Making the Docker daemon available over TCP is potentially dangerous: as the Docker documentation explains, the Docker daemon typically has broad privileges, so only trusted users should have access to it. If you expose the daemon over TCP, you must use firewalling to make sure only trusted users can access the port. This also includes making sure that the interpreter Docker containers started by Zeppelin do not have access to this port.

Quickstart

  1. Modify these 2 configuration items in zeppelin-site.xml.

    <property>
      <name>zeppelin.run.mode</name>
      <value>docker</value>
      <description>'auto|local|k8s|docker'</description>
    </property>

    <property>
      <name>zeppelin.docker.container.image</name>
      <value>apache/zeppelin</value>
      <description>Docker image for interpreters</description>
    </property>
    
  2. Set the time zone in zeppelin-env.sh.

    Set it to the same time zone as the Zeppelin server, so that the time zone in the interpreter Docker container matches the server's, e.g. "America/New_York" or "Asia/Shanghai".

    export ZEPPELIN_DOCKER_TIME_ZONE="America/New_York"
    

Build Zeppelin image manually

To build a Zeppelin image that supports Kerberos authentication and includes a Spark binary distribution:

Use the /scripts/docker/interpreter/Dockerfile to build the image.

FROM apache/zeppelin:0.8.0
MAINTAINER Apache Software Foundation <dev@zeppelin.apache.org>

ENV SPARK_VERSION=2.3.3
ENV HADOOP_VERSION=2.7

# support Kerberos certification
RUN export DEBIAN_FRONTEND=noninteractive && apt-get update && apt-get install -yq krb5-user libpam-krb5 && apt-get clean

RUN apt-get update && apt-get install -y curl unzip wget grep sed vim tzdata && apt-get clean

# auto upload zeppelin interpreter lib
RUN rm -rf /zeppelin

RUN rm -rf /spark && \
    wget https://www-us.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && \
    tar zxvf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && \
    mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark && \
    rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz

Then build docker image.

# build image. Replace <tag>.
$ docker build -t <tag> .

How it works

Zeppelin interpreter on Docker

The Zeppelin service runs on the local server and automatically configures itself to use DockerInterpreterLauncher.

DockerInterpreterLauncher, via DockerInterpreterProcess, launches each interpreter in its own container from the configured Docker image.

DockerInterpreterProcess uploads the binaries and configuration files of the local zeppelin service to the container:

  • ${ZEPPELIN_HOME}/bin
  • ${ZEPPELIN_HOME}/lib
  • ${ZEPPELIN_HOME}/interpreter/${interpreterGroupName}
  • ${ZEPPELIN_HOME}/conf/zeppelin-site.xml
  • ${ZEPPELIN_HOME}/conf/log4j.properties
  • ${ZEPPELIN_HOME}/conf/log4j_yarn_cluster.properties
  • HADOOP_CONF_DIR
  • SPARK_CONF_DIR
  • /etc/krb5.conf
  • Keytab file configured in the interpreter properties
    • zeppelin.shell.keytab.location
    • spark.yarn.keytab
    • submarine.hadoop.keytab
    • zeppelin.jdbc.keytab.location
    • zeppelin.server.kerberos.keytab

Every file is uploaded to the same path in the container as it has locally, which ensures that all configuration files are picked up correctly.
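For example, if a keytab is configured in the interpreter settings (the path below is purely hypothetical), the file appears at the identical path inside the container, so the property value works unchanged:

```
# interpreter property on the Zeppelin server (hypothetical path)
zeppelin.jdbc.keytab.location = /etc/security/keytabs/zeppelin.keytab

# inside the interpreter container, the keytab is uploaded to the same path:
# /etc/security/keytabs/zeppelin.keytab
```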

Spark interpreter on Docker

When the interpreter group is spark, Zeppelin automatically sets the Spark configuration necessary to use Spark on Docker. All run modes of the Zeppelin Spark interpreter are supported: local[*], yarn-client, and yarn-cluster.

SPARK_CONF_DIR

  1. Configuring in zeppelin-env.sh

    The interpreter image contains only the Spark binaries; no Spark conf files are included. The configuration files in the spark-<version>/conf/ directory local to the Zeppelin service therefore need to be uploaded to the /spark/conf/ directory in the Spark interpreter container. To do so, set export SPARK_CONF_DIR=/spark-<version>-path/conf/ in the zeppelin-env.sh file.

  2. Configuring in the Spark interpreter properties

    You can also configure it in the Spark interpreter properties.

    Property name    Value                         Description
    SPARK_CONF_DIR   /spark-<version>-path/conf/   Path of the local spark-<version>/conf/ directory on the Zeppelin service node
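A minimal zeppelin-env.sh sketch for option 1 above; the Spark installation path is an assumption, so point it at the conf/ directory of the Spark distribution installed next to your Zeppelin server:

```shell
# zeppelin-env.sh sketch: the path is hypothetical -- use the conf/ directory
# of the local Spark install. Zeppelin uploads it to /spark/conf/ in the
# interpreter container.
export SPARK_CONF_DIR=/opt/spark-2.3.3/conf
echo "SPARK_CONF_DIR=$SPARK_CONF_DIR"
```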

HADOOP_CONF_DIR

  1. Configuring in zeppelin-env.sh

    The interpreter image contains only the Spark binaries; no Hadoop configuration files are included. The configuration files in the hadoop-<version>/etc/hadoop directory local to the Zeppelin service therefore need to be uploaded to the Spark interpreter container. To do so, set export HADOOP_CONF_DIR=hadoop-<version>-path/etc/hadoop in the zeppelin-env.sh file.

  2. Configuring in the Spark interpreter properties

    You can also configure it in the Spark interpreter properties.

    Property name     Value                              Description
    HADOOP_CONF_DIR   hadoop-<version>-path/etc/hadoop   Path of the local hadoop-<version>/etc/hadoop directory on the Zeppelin service node
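The corresponding zeppelin-env.sh sketch for option 1 above; the Hadoop installation path is an assumption, so substitute the etc/hadoop directory of your local Hadoop install:

```shell
# zeppelin-env.sh sketch: the path is hypothetical -- use the etc/hadoop
# directory of the local Hadoop install; Zeppelin uploads it into the
# Spark interpreter container at the same path.
export HADOOP_CONF_DIR=/opt/hadoop-2.7.7/etc/hadoop
echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR"
```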

Accessing Spark UI (or Service running in interpreter container)

Because the Zeppelin interpreter container uses the host network, the spark.ui.port port is allocated automatically; do not configure spark.ui.port=xxxx in spark-defaults.conf.

Future work

  • Make the container resources (e.g. CPU and memory limits) available to each interpreter configurable.

Development

Instead of building the Zeppelin distribution package and the Docker image every time during development, Zeppelin can run locally (for example inside your IDE in debug mode) and still launch interpreters through DockerInterpreterLauncher by setting the following configuration.

  1. zeppelin-site.xml (or the equivalent environment variables)

    Configuration variable            Value              Description
    ZEPPELIN_RUN_MODE                 docker             Make Zeppelin run interpreters on Docker
    ZEPPELIN_DOCKER_CONTAINER_IMAGE   <image>:<version>  Zeppelin interpreter Docker image to use
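The variables above can be exported in the shell that starts your locally built Zeppelin, as a minimal sketch (the image tag is an example; use whatever interpreter image you built):

```shell
# Sketch: environment for a locally running Zeppelin (e.g. from your IDE)
# that launches interpreters on Docker. The image tag is an example.
export ZEPPELIN_RUN_MODE=docker
export ZEPPELIN_DOCKER_CONTAINER_IMAGE=apache/zeppelin:0.8.0
echo "run mode: $ZEPPELIN_RUN_MODE, image: $ZEPPELIN_DOCKER_CONTAINER_IMAGE"
# Then start Zeppelin from the source checkout, e.g.:
#   bin/zeppelin-daemon.sh start
```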