This chart is modified from stable/hadoop.
Hadoop is a framework for running large scale distributed applications.
This chart is primarily intended for YARN and MapReduce job execution, where HDFS is used only to transport small artifacts within the framework, not as a full distributed filesystem. Data should be read from cloud-based datastores such as Google Cloud Storage, S3, or Swift.
To install the chart with the release name hadoop that utilizes 50% of the available node resources:
```shell
$ helm install --name hadoop $(stable/hadoop/tools/calc_resources.sh 50) stable/hadoop
```
Note that you need at least 2GB of free memory per NodeManager pod; if your cluster isn't large enough, not all pods will be scheduled.
The optional `calc_resources.sh` script is a convenience helper that sets `yarn.numNodes` and `yarn.nodeManager.resources` appropriately to utilize all nodes in the Kubernetes cluster and a given percentage of their resources. For example, with a 3-node n1-standard-4 GKE cluster and an argument of 50, this would create 3 NodeManager pods claiming 2 cores and 7.5Gi of memory each.
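For reference, the arithmetic the script performs can be sketched as below. The node count and per-node capacities are hard-coded assumptions here (the real script discovers them from the cluster via `kubectl`); the figures approximate an n1-standard-4 node.

```shell
# Simplified sketch of the calc_resources.sh arithmetic (assumed values;
# the real script queries the cluster for node count and allocatable resources).
NODES=3                 # number of schedulable nodes (assumption)
NODE_CPU_M=4000         # allocatable CPU per node, in millicores (assumption)
NODE_MEM_MI=15360       # allocatable memory per node, in Mi (~15Gi, assumption)
PERCENT=${1:-50}        # fraction of each node to claim, e.g. 50

# Claim the requested percentage of each node's CPU and memory.
CPU_M=$(( NODE_CPU_M * PERCENT / 100 ))
MEM_MI=$(( NODE_MEM_MI * PERCENT / 100 ))

echo "--set yarn.numNodes=${NODES}" \
     "--set yarn.nodeManager.resources.requests.cpu=${CPU_M}m" \
     "--set yarn.nodeManager.resources.requests.memory=${MEM_MI}Mi" \
     "--set yarn.nodeManager.resources.limits.cpu=${CPU_M}m" \
     "--set yarn.nodeManager.resources.limits.memory=${MEM_MI}Mi"
```

With an argument of 50 this yields 2000m of CPU and 7680Mi (7.5Gi) of memory per NodeManager pod, matching the GKE example above.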
To install the chart with persistent volumes:
```shell
$ helm install --name hadoop $(stable/hadoop/tools/calc_resources.sh 50) \
  --set persistence.nameNode.enabled=true \
  --set persistence.nameNode.storageClass=standard \
  --set persistence.dataNode.enabled=true \
  --set persistence.dataNode.storageClass=standard \
  stable/hadoop
```
Change the value of `storageClass` to match your volume driver. `standard` works for Google Container Engine clusters.
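The same settings can also be kept in a values file instead of repeating `--set` flags (a sketch using the persistence parameters from the configuration table below):

```yaml
# values.yaml -- equivalent to the --set flags above
persistence:
  nameNode:
    enabled: true
    storageClass: standard
  dataNode:
    enabled: true
    storageClass: standard
```

```shell
$ helm install --name hadoop -f values.yaml stable/hadoop
```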
The following table lists the configurable parameters of the Hadoop chart and their default values.
| Parameter | Description | Default | 
|---|---|---|
| image.repository | Hadoop image (source) | danisla/hadoop | 
| image.tag | Hadoop image tag | 2.9.0 | 
| image.pullPolicy | Pull policy for the images | IfNotPresent | 
| hadoopVersion | Version of hadoop libraries being used | 2.9.0 | 
| antiAffinity | Pod anti-affinity, `hard` or `soft` | hard | 
| hdfs.nameNode.pdbMinAvailable | PDB for HDFS NameNode | 1 | 
| hdfs.nameNode.resources | resources for the HDFS NameNode | requests:memory=256Mi,cpu=10m,limits:memory=2048Mi,cpu=1000m | 
| hdfs.dataNode.replicas | Number of HDFS DataNode replicas | 1 | 
| hdfs.dataNode.pdbMinAvailable | PDB for HDFS DataNode | 1 | 
| hdfs.dataNode.resources | resources for the HDFS DataNode | requests:memory=256Mi,cpu=10m,limits:memory=2048Mi,cpu=1000m | 
| hdfs.webhdfs.enabled | Enable WebHDFS REST API | false | 
| yarn.resourceManager.pdbMinAvailable | PDB for the YARN ResourceManager | 1 | 
| yarn.resourceManager.resources | resources for the YARN ResourceManager | requests:memory=256Mi,cpu=10m,limits:memory=2048Mi,cpu=1000m | 
| yarn.nodeManager.pdbMinAvailable | PDB for the YARN NodeManager | 1 | 
| yarn.nodeManager.replicas | Number of YARN NodeManager replicas | 2 | 
| yarn.nodeManager.parallelCreate | Create all nodeManager statefulset pods in parallel (K8S 1.7+) | false | 
| yarn.nodeManager.resources | Resource limits and requests for YARN NodeManager pods | requests:memory=2048Mi,cpu=1000m,limits:memory=2048Mi,cpu=1000m | 
| persistence.nameNode.enabled | Enable/disable persistent volume | false | 
| persistence.nameNode.storageClass | Name of the StorageClass to use per your volume provider | - | 
| persistence.nameNode.accessMode | Access mode for the volume | ReadWriteOnce | 
| persistence.nameNode.size | Size of the volume | 50Gi | 
| persistence.dataNode.enabled | Enable/disable persistent volume | false | 
| persistence.dataNode.storageClass | Name of the StorageClass to use per your volume provider | - | 
| persistence.dataNode.accessMode | Access mode for the volume | ReadWriteOnce | 
| persistence.dataNode.size | Size of the volume | 200Gi | 
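Any of these parameters can be overridden with `--set` at install time. For example, to run five NodeManager pods and enable the WebHDFS REST API (values chosen for illustration):

```shell
$ helm install --name hadoop \
  --set yarn.nodeManager.replicas=5 \
  --set hdfs.webhdfs.enabled=true \
  stable/hadoop
```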
The Zeppelin Notebook chart can use the Hadoop config from this chart and execute jobs on the YARN cluster:

```shell
$ helm install --set hadoop.useConfigMap=true stable/zeppelin
```
- Original K8S Hadoop adaptation this chart was derived from: https://github.com/Comcast/kube-yarn