Zeppelin on Yarn

Zeppelin on Yarn means running interpreter processes in yarn containers. The key benefit is scalability: the Zeppelin server host won't run out of memory when you run a large number of interpreter processes.

Prerequisites

The following are required for yarn interpreter mode.

  • A Hadoop client is installed (both 2.x and 3.x are supported).
  • $HADOOP_HOME/bin is on PATH, because Zeppelin internally runs the command hadoop classpath to collect all the Hadoop jars and add them to its own classpath.
  • USE_HADOOP is set to true in zeppelin-env.sh.
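The prerequisites above can be sketched as a few shell lines. This is only an illustration: the install location /opt/hadoop is an assumption, and the final check simply runs the same hadoop classpath command that Zeppelin relies on.

```shell
# Assumed Hadoop install location; adjust to your environment.
export HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"
export PATH="$HADOOP_HOME/bin:$PATH"

# This line belongs in conf/zeppelin-env.sh.
export USE_HADOOP=true

# Sanity check: print the classpath that Zeppelin would pick up.
if command -v hadoop >/dev/null 2>&1; then
  hadoop classpath
else
  echo "hadoop not found on PATH; check HADOOP_HOME" >&2
fi
```

If the check prints a list of jar paths, the Hadoop client is visible to any process started from this shell.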

Configuration

Yarn interpreter mode must be enabled separately for each interpreter. Set zeppelin.interpreter.launcher to yarn to run an interpreter in yarn mode. You can also set the other properties listed in the following table.

| Name | Default Value | Description |
|------|---------------|-------------|
| zeppelin.interpreter.yarn.resource.memory | 1024 | Memory for the interpreter process, in MB |
| zeppelin.interpreter.yarn.resource.memoryOverhead | 384 | Amount of non-heap memory to be allocated per interpreter process in yarn interpreter mode, in MiB unless otherwise specified. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. |
| zeppelin.interpreter.yarn.resource.cores | 1 | CPU cores for the interpreter process |
| zeppelin.interpreter.yarn.queue | default | Yarn queue name |
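
As an illustration, the settings for one interpreter might look like the following. The values are arbitrary examples rather than recommendations, the queue name analytics is hypothetical, and the properties are shown as plain name/value pairs rather than in any particular file format.

```text
zeppelin.interpreter.launcher                        yarn
zeppelin.interpreter.yarn.resource.memory            2048
zeppelin.interpreter.yarn.resource.memoryOverhead    512
zeppelin.interpreter.yarn.resource.cores             2
zeppelin.interpreter.yarn.queue                      analytics
```

Any property left unset falls back to the default shown in the table.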

Differences with non-yarn interpreter mode (local mode)

There are several differences between yarn interpreter mode and non-yarn interpreter mode (local mode):

  • A new yarn app is allocated for the interpreter process.
  • Local path settings won't work in a yarn interpreter process. For example, if you run the python interpreter in yarn interpreter mode, the python executable configured in zeppelin.python must exist on every node of the yarn cluster, because the interpreter may be launched on any node.
  • Don't use it for the spark interpreter. Instead, use spark's built-in yarn-client or yarn-cluster mode, which is more suitable for the spark interpreter.
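
For the local path point above, one way to verify a node is a small check like the sketch below, run on each node of the cluster by whatever means you normally use (for example over ssh). The path /usr/bin/python3 is only an assumed value of zeppelin.python, and check_executable is a helper defined here for illustration.

```shell
# Assumed value of the zeppelin.python property; replace with your own.
PYTHON_BIN="${PYTHON_BIN:-/usr/bin/python3}"

# Succeeds only if the given path is an executable file on this machine.
check_executable() {
  [ -x "$1" ] && [ ! -d "$1" ]
}

if check_executable "$PYTHON_BIN"; then
  echo "ok: $PYTHON_BIN is executable on $(uname -n)"
else
  echo "missing: $PYTHON_BIN is not executable on $(uname -n)"
fi
```

A node that reports missing would cause the interpreter to fail whenever yarn happens to schedule it there.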