In Quick Start, we started a local[*] PySpark session for data exploration in JupyterLab, and a local[*] Spark Connect Server. This example combines the two on k8s: a PySpark Connect client running in JupyterLab on k8s connects to a Spark Connect Server on k8s.
kubectl create ns sparglim
You need to authorize the pod so that it can create executor pods.
For a simple test, you can grant administrator privileges to all service accounts using the following command (do NOT do this in a production environment):
kubectl create clusterrolebinding serviceaccounts-cluster-admin \
  --clusterrole=cluster-admin \
  --group=system:serviceaccounts
# In project root
kubectl apply -f examples/jupyter-sparglim-sc/k8s/jupyter-sparglim/
kubectl apply -f examples/jupyter-sparglim-sc/k8s/sparglim-server/
Check that the pod is running:
$: kubectl get pod -n sparglim
NAME READY STATUS RESTARTS AGE
sparglim-app-5499f54f6b-gk4xv 1/1 Running 0 33m
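If you prefer to script this check, the following is a minimal sketch that runs the same kubectl command and inspects the STATUS column. The helper names parse_pod_status and all_pods_running are hypothetical and not part of sparglim; the sketch assumes kubectl is on your PATH and that the default table output shown above is used.

```python
import subprocess
from typing import Dict


def parse_pod_status(kubectl_output: str) -> Dict[str, str]:
    """Map pod name -> STATUS from the table printed by `kubectl get pod`."""
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    if not lines:
        return {}  # e.g. no pods in the namespace yet
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    statuses = {}
    for line in lines[1:]:
        cols = line.split()
        statuses[cols[name_idx]] = cols[status_idx]
    return statuses


def all_pods_running(namespace: str = "sparglim") -> bool:
    """Return True if the namespace has at least one pod and all are Running."""
    result = subprocess.run(
        ["kubectl", "get", "pod", "-n", namespace],
        check=True,
        capture_output=True,
        text=True,
    )
    statuses = parse_pod_status(result.stdout)
    return bool(statuses) and all(s == "Running" for s in statuses.values())
```

Calling all_pods_running() in a retry loop is one way to wait before opening JupyterLab; it only reads the STATUS column and does not check the READY count.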
Access JupyterLab and try it out:
http://<master-ip>:30888
Access SparkUI:
http://<master-ip>:30040
Initialize the SparkSession in code:
from sparglim.config.builder import ConfigBuilder
spark = ConfigBuilder().config_connect_client().get_or_create()
This automatically configures the SparkSession in connect_client mode, based on the SPARGLIM_SQL_MODE environment variable:
%load_ext sparglim.sql
spark
Test it:
%sql SHOW TABLES
Any configuration set on the client side, such as spark.sql.repl.eagerEval.enabled=true, has no effect. As a result, %sql (%%sql) can't display the dataframe. You can use df.show() instead.
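As a rough illustration of the df.show() workaround, the sketch below wraps a query in a small helper. The name show_sql is hypothetical and not part of sparglim; it relies only on SparkSession.sql and DataFrame.show, and expects the spark session created by the initialization code above.

```python
def show_sql(spark, query, n=20):
    """Run a SQL statement and print the result with DataFrame.show().

    `spark` is an existing SparkSession (for example, the Spark Connect
    client session created earlier); `n` is the number of rows to print.
    """
    df = spark.sql(query)
    # Print explicitly, since eager evaluation in the REPL is not applied.
    df.show(n, truncate=False)
    return df
```

In a notebook cell this would be used as show_sql(spark, "SHOW TABLES"). The helper returns the dataframe, so the result can still be transformed further.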