Zeppelin on Yarn
Zeppelin on Yarn means running each interpreter process in a yarn container. The key benefit is scalability: the Zeppelin server host won't run out of memory when you run a large number of interpreter processes.
The following is required for yarn interpreter mode.
- Hadoop client (both 2.x and 3.x are supported) is installed.
- $HADOOP_HOME/bin is put in PATH, because internally Zeppelin runs the hadoop classpath command to get all the Hadoop jars and put them in the classpath of Zeppelin.
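As a quick way to check these prerequisites on the Zeppelin server host, the following is a minimal sketch. The helper names `find_hadoop` and `hadoop_classpath_entries` are hypothetical and not part of Zeppelin; the script only confirms that the `hadoop` launcher is reachable through `PATH` and that `hadoop classpath` returns something.

```python
import os
import shutil
import subprocess


def find_hadoop(path_env=None):
    """Return the full path of the `hadoop` launcher, or None if it is not found.

    When path_env is None, shutil.which searches the PATH of the current process.
    """
    return shutil.which("hadoop", path=path_env)


def hadoop_classpath_entries(hadoop_cmd):
    """Run `hadoop classpath` and split its output into individual entries."""
    result = subprocess.run(
        [hadoop_cmd, "classpath"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    return [entry for entry in result.stdout.strip().split(os.pathsep) if entry]


if __name__ == "__main__":
    hadoop = find_hadoop()
    if hadoop is None:
        print("hadoop not found on PATH; add $HADOOP_HOME/bin to PATH")
    else:
        entries = hadoop_classpath_entries(hadoop)
        print(f"{hadoop} reported {len(entries)} classpath entries")
```

Run it as the same user, and with the same environment, that starts the Zeppelin server; otherwise the `PATH` being checked may differ from the one Zeppelin sees.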
Yarn interpreter mode needs to be set for each interpreter. You can set zeppelin.interpreter.launcher to yarn to run it in yarn mode.
Besides that, you can also specify the other properties listed in the following table.
| Property | Default | Description |
|---|---|---|
| zeppelin.interpreter.yarn.resource.memory | 1024 | Memory for the interpreter process, in MB |
| zeppelin.interpreter.yarn.resource.memoryOverhead | | Amount of non-heap memory to be allocated per interpreter process in yarn interpreter mode, in MiB unless otherwise specified. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. |
| zeppelin.interpreter.yarn.resource.cores | 1 | CPU cores for the interpreter process |
| zeppelin.interpreter.yarn.queue | default | Yarn queue name |
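To make the combination of settings concrete, here is a minimal sketch that assembles the properties for one interpreter as a plain dictionary. The function `yarn_interpreter_properties` and the `analytics` queue name are hypothetical illustrations; Zeppelin itself reads these values from the interpreter settings, not from this code. Defaults are copied from the properties listed above, and memoryOverhead is left out because no default is given for it there.

```python
# Defaults taken from the property list above. memoryOverhead is omitted
# on purpose because the list does not state a default for it.
LISTED_DEFAULTS = {
    "zeppelin.interpreter.yarn.resource.memory": "1024",
    "zeppelin.interpreter.yarn.resource.cores": "1",
    "zeppelin.interpreter.yarn.queue": "default",
}

# Properties whose values must be positive integers.
POSITIVE_INT_KEYS = (
    "zeppelin.interpreter.yarn.resource.memory",
    "zeppelin.interpreter.yarn.resource.cores",
)


def yarn_interpreter_properties(overrides=None):
    """Return the properties for one interpreter running in yarn mode."""
    props = {"zeppelin.interpreter.launcher": "yarn"}
    props.update(LISTED_DEFAULTS)
    props.update(overrides or {})
    for key in POSITIVE_INT_KEYS:
        if int(props[key]) <= 0:
            raise ValueError(f"{key} must be a positive integer, got {props[key]!r}")
    return props


if __name__ == "__main__":
    # Hypothetical example: 2 GB of memory on a queue named "analytics".
    example = yarn_interpreter_properties({
        "zeppelin.interpreter.yarn.resource.memory": "2048",
        "zeppelin.interpreter.yarn.queue": "analytics",
    })
    for name, value in sorted(example.items()):
        print(f"{name} = {value}")
```

Because the launcher is set per interpreter, each interpreter that should run in yarn mode needs its own set of these properties.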
Differences with non-yarn interpreter mode (local mode)
There are several differences between yarn interpreter mode and non-yarn interpreter mode (local mode):
- A new yarn app is allocated for the interpreter process.
- Local path settings won't work in a yarn interpreter process. E.g. if you run the python interpreter in yarn interpreter mode, you need to make sure the python executable set in zeppelin.python exists on all the nodes of the yarn cluster, because the python interpreter may launch on any node.
- Don't use it for the spark interpreter. Instead, use Spark's built-in yarn-client or yarn-cluster mode, which is more suitable for the spark interpreter.
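The local-path caveat for the python interpreter can be checked node by node. The sketch below assumes you have some way to run a script on each node of the yarn cluster (that part is not shown), and the helper name `python_available` is hypothetical. It only tests whether the value configured for zeppelin.python resolves to an executable on the node where it runs.

```python
import shutil
import sys


def python_available(zeppelin_python):
    """Return True if the zeppelin.python value resolves to an executable here.

    A bare command name such as "python3" is looked up on PATH; a value that
    contains a directory is checked directly as a file path.
    """
    return shutil.which(zeppelin_python) is not None


if __name__ == "__main__":
    # Pass the configured zeppelin.python value as the first argument;
    # the interpreter running this script is used only as a fallback.
    value = sys.argv[1] if len(sys.argv) > 1 else sys.executable
    status = "found" if python_available(value) else "MISSING"
    print(f"{value}: {status}")
```

A node that reports MISSING is one where the python interpreter would fail to start if yarn happened to place its container there.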