
Apache Spark

Apache Spark is a high-performance engine for large-scale computing tasks, such as data processing, machine learning and real-time data streaming. It includes APIs for Java, Python, Scala and R.

TL;DR

helm install my-spark sconeapps/spark

Introduction

This chart bootstraps a Spark deployment on a Kubernetes cluster using the Helm package manager.

Prerequisites

  • Kubernetes 1.12+
  • Helm 2.12+ or Helm 3.0-beta3+

Before you begin

This chart is a modified version of bitnami/spark that uses SCONE and Intel SGX. Further information on the original chart can be found here.

Attestation

This chart does not submit any sessions to a CAS, so you have to do that beforehand, from a trusted computer. If you need to pass remote attestation information to your containers, such as SCONE_CONFIG_ID and SCONE_CAS_ADDR, use the master.extraEnvVars and worker.extraEnvVars parameters in values.yaml.
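
For example, a values.yaml fragment along the following lines forwards the attestation information to the master and worker containers. This is a minimal sketch: the CAS address and config IDs are placeholders for the session you submitted beforehand, and the exact shape of extraEnvVars should be checked against the chart's values.yaml.

master:
  extraEnvVars:
    - name: SCONE_CAS_ADDR
      value: "cas.example.com"          # placeholder: address of your CAS
    - name: SCONE_CONFIG_ID
      value: "my-session/spark-master"  # placeholder: config ID from your session
worker:
  extraEnvVars:
    - name: SCONE_CAS_ADDR
      value: "cas.example.com"
    - name: SCONE_CONFIG_ID
      value: "my-session/spark-worker"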

Installing the Chart

To install the chart with the release name my-spark:

export GH_TOKEN=...
helm repo add sconeapps https://${GH_TOKEN}@raw.githubusercontent.com/scontain/sconeapps/master/
helm install my-spark sconeapps/spark

These commands deploy Spark on the Kubernetes cluster in the default configuration. The Parameters section lists the parameters that can be configured during installation.

Tip: List all releases using helm list

Uninstalling the Chart

To uninstall/delete the my-spark release:

$ helm delete my-spark

The command removes all the Kubernetes components associated with the chart and deletes the release. With Helm 2, add the --purge option to remove the release record as well; persistent volume claims created by the chart, if any, are not removed automatically and must be deleted separately.

Parameters

The following table lists the configurable parameters of the spark chart and their default values.

| Parameter | Description | Default |
|-----------|-------------|---------|
| global.imageRegistry | Global Docker image registry | nil |
| global.imagePullSecrets | Global Docker registry secret names as an array | [] (does not add image pull secrets to deployed pods) |
| image.registry | Spark image registry | docker.io |
| image.repository | Spark image name | lucasmc/pyspark |
| image.tag | Spark image tag | {TAG_NAME} |
| image.pullPolicy | Spark image pull policy | IfNotPresent |
| image.pullSecrets | Specify docker-registry secret names as an array | [] (does not add image pull secrets to deployed pods) |
| nameOverride | String to partially override spark.fullname template with a string (will prepend the release name) | nil |
| fullnameOverride | String to fully override spark.fullname template with a string | nil |
| master.debug | Specify if debug values should be set on the master | false |
| master.webPort | Specify the port where the web interface will listen on the master | 8080 |
| master.clusterPort | Specify the port where the master listens to communicate with workers | 7077 |
| master.daemonMemoryLimit | Set the memory limit for the master daemon | No default |
| master.configOptions | Optional configuration in the form -Dx=y | No default |
| master.securityContext.enabled | Enable security context | true |
| master.securityContext.fsGroup | Group ID for the container | 0 |
| master.securityContext.runAsUser | User ID for the container | 0 |
| master.podAnnotations | Annotations for pods in StatefulSet | {} (The value is evaluated as a template) |
| master.nodeSelector | Node affinity policy | {} (The value is evaluated as a template) |
| master.tolerations | Tolerations for pod assignment | [] (The value is evaluated as a template) |
| master.affinity | Affinity for pod assignment | {} (The value is evaluated as a template) |
| master.resources | CPU/Memory resource requests/limits | {} |
| master.extraEnvVars | Extra environment variables to pass to the master container | {} |
| master.extraVolumes | Array of extra volumes to be added to the Spark master deployment (evaluated as template). Requires setting master.extraVolumeMounts | nil |
| master.extraVolumeMounts | Array of extra volume mounts to be added to the Spark master deployment (evaluated as template). Normally used with master.extraVolumes. | nil |
| master.useSGXDevPlugin | Use the SGX Device Plugin to access SGX resources | scone |
| master.sgxEpcMem | Required for the Azure SGX Device Plugin. Protected EPC memory in MiB | nil |
| master.livenessProbe.enabled | Turn on and off liveness probe | true |
| master.livenessProbe.initialDelaySeconds | Delay before liveness probe is initiated | 10 |
| master.livenessProbe.periodSeconds | How often to perform the probe | 10 |
| master.livenessProbe.timeoutSeconds | When the probe times out | 5 |
| master.livenessProbe.failureThreshold | Minimum consecutive failures for the probe to be considered failed after having succeeded | 2 |
| master.livenessProbe.successThreshold | Minimum consecutive successes for the probe to be considered successful after having failed | 1 |
| master.readinessProbe.enabled | Turn on and off readiness probe | true |
| master.readinessProbe.initialDelaySeconds | Delay before readiness probe is initiated | 5 |
| master.readinessProbe.periodSeconds | How often to perform the probe | 10 |
| master.readinessProbe.timeoutSeconds | When the probe times out | 5 |
| master.readinessProbe.failureThreshold | Minimum consecutive failures for the probe to be considered failed after having succeeded | 6 |
| master.readinessProbe.successThreshold | Minimum consecutive successes for the probe to be considered successful after having failed | 1 |
| worker.debug | Specify if debug values should be set on workers | false |
| worker.webPort | Specify the port where the web interface will listen on the worker | 8080 |
| worker.clusterPort | Specify the port where the worker listens to communicate with the master | 7077 |
| worker.daemonMemoryLimit | Set the memory limit for the worker daemon | No default |
| worker.memoryLimit | Set the maximum memory the worker is allowed to use | No default |
| worker.coreLimit | Set the maximum number of cores that the worker can use | No default |
| worker.dir | Set a custom working directory for the application | No default |
| worker.javaOptions | Set options for the JVM in the form -Dx=y | No default |
| worker.configOptions | Set extra options to configure the worker in the form -Dx=y | No default |
| worker.replicaCount | Set the number of workers | 2 |
| worker.autoscaling.enabled | Enable autoscaling depending on CPU | false |
| worker.autoscaling.CpuTargetPercentage | Kubernetes HPA CPU target percentage | 50 |
| worker.autoscaling.replicasMax | Maximum number of workers when using autoscaling | 5 |
| worker.securityContext.enabled | Enable security context | true |
| worker.securityContext.fsGroup | Group ID for the container | 1001 |
| worker.securityContext.runAsUser | User ID for the container | 1001 |
| worker.podAnnotations | Annotations for pods in StatefulSet | {} |
| worker.nodeSelector | Node labels for pod assignment. Used as a template from the values. | {} |
| worker.tolerations | Toleration labels for pod assignment | [] |
| worker.affinity | Affinity and AntiAffinity rules for pod assignment | {} |
| worker.resources | CPU/Memory resource requests/limits | Memory: 256Mi, CPU: 250m |
| worker.livenessProbe.enabled | Turn on and off liveness probe | true |
| worker.livenessProbe.initialDelaySeconds | Delay before liveness probe is initiated | 10 |
| worker.livenessProbe.periodSeconds | How often to perform the probe | 10 |
| worker.livenessProbe.timeoutSeconds | When the probe times out | 5 |
| worker.livenessProbe.failureThreshold | Minimum consecutive failures for the probe to be considered failed after having succeeded | 2 |
| worker.livenessProbe.successThreshold | Minimum consecutive successes for the probe to be considered successful after having failed | 1 |
| worker.readinessProbe.enabled | Turn on and off readiness probe | true |
| worker.readinessProbe.initialDelaySeconds | Delay before readiness probe is initiated | 5 |
| worker.readinessProbe.periodSeconds | How often to perform the probe | 10 |
| worker.readinessProbe.timeoutSeconds | When the probe times out | 5 |
| worker.readinessProbe.failureThreshold | Minimum consecutive failures for the probe to be considered failed after having succeeded | 6 |
| worker.readinessProbe.successThreshold | Minimum consecutive successes for the probe to be considered successful after having failed | 1 |
| worker.extraEnvVars | Extra environment variables to pass to the worker container | {} |
| worker.extraVolumes | Array of extra volumes to be added to the Spark worker deployment (evaluated as template). Requires setting worker.extraVolumeMounts | nil |
| worker.extraVolumeMounts | Array of extra volume mounts to be added to the Spark worker deployment (evaluated as template). Normally used with worker.extraVolumes. | nil |
| worker.useSGXDevPlugin | Use the SGX Device Plugin to access SGX resources | scone |
| worker.sgxEpcMem | Required for the Azure SGX Device Plugin. Protected EPC memory in MiB | nil |
| security.passwordsSecretName | Secret to use when using security configuration to set custom passwords | No default |
| security.rpc.authenticationEnabled | Enable RPC authentication | false |
| security.rpc.encryptionEnabled | Enable encryption for RPC | false |
| security.storageEncryptionEnabled | Enable encryption of the storage | false |
| security.ssl.enabled | Enable the SSL configuration | false |
| security.ssl.needClientAuth | Enable client authentication | false |
| security.ssl.protocol | Set the SSL protocol | TLSv1.2 |
| security.certificatesSecretName | Set the name of the secret that contains the certificates | No default |
| service.type | Kubernetes Service type | ClusterIP |
| service.webPort | Spark client port | 80 |
| service.clusterPort | Spark cluster port | 7077 |
| service.nodePort | Port to bind to for NodePort service type (client port) | nil |
| service.nodePorts.cluster | Kubernetes cluster node port | "" |
| service.nodePorts.web | Kubernetes web node port | "" |
| service.annotations | Annotations for the Spark service | {} |
| service.loadBalancerIP | loadBalancerIP if the Spark service type is LoadBalancer | nil |
| ingress.enabled | Enable the use of the ingress controller to access the web UI | false |
| ingress.certManager | Add annotations for cert-manager | false |
| ingress.annotations | Ingress annotations | {} |
| ingress.hosts[0].name | Hostname to your Spark installation | spark.local |
| ingress.hosts[0].path | Path within the URL structure | / |
| ingress.hosts[0].tls | Utilize TLS backend in ingress | false |
| ingress.hosts[0].tlsHosts | Array of TLS hosts for ingress record (defaults to ingress.hosts[0].name if nil) | nil |
| ingress.hosts[0].tlsSecret | TLS Secret (certificates) | spark.local-tls |

Specify each parameter using the --set key=value[,key=value] argument to helm install. For example,

helm install my-spark \
  --set master.webPort=8081 sconeapps/spark

The above command sets the spark master web port to 8081.
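
If your cluster exposes SGX through the Azure SGX Device Plugin, the SGX-related parameters can be set the same way. This is a sketch under the assumption that azure is the accepted value for useSGXDevPlugin (the default is scone) and that 16 MiB of protected EPC memory fits your workload; adjust both to your environment:

helm install my-spark sconeapps/spark \
  --set master.useSGXDevPlugin=azure \
  --set master.sgxEpcMem=16 \
  --set worker.useSGXDevPlugin=azure \
  --set worker.sgxEpcMem=16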

Alternatively, a YAML file that specifies the values for the parameters can be provided while installing the chart. For example,

$ helm install my-spark -f values.yaml sconeapps/spark

One can use the default values.yaml in the SconeApps repo as a starting point.
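
For instance, the --set example above is equivalent to the following fragment in a custom values file:

master:
  webPort: 8081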

Configuration and installation details

Rolling VS Immutable tags

It is strongly recommended to use immutable tags in a production environment. This ensures your deployment does not change automatically if the same tag is updated with a different image.
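
For example, pin the image to the specific tag you have validated instead of relying on a rolling tag (the tag below is a placeholder):

helm install my-spark sconeapps/spark \
  --set image.tag=<immutable-tag>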

Submit an application

To submit an application to the cluster, use the spark-submit script. You can obtain the script here. For example, to deploy one of the example applications:

$ ./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master spark://<master-IP>:<master-cluster-port> \
    --deploy-mode cluster \
    ./examples/jars/spark-examples_2.11-2.4.3.jar \
    1000

Replace the master IP and port with your Spark master's IP address and cluster port.
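
If you are unsure of these values, the service created by the chart exposes them. A quick check (service names vary with the release name):

kubectl get svc
# Use the master service's CLUSTER-IP (or external address) and the
# cluster port (7077 by default) in the spark://<IP>:<port> URL above.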

Be aware that it is currently not possible to submit an application to a standalone cluster if RPC authentication is configured. More information about the issue can be found here.