We want to use the pyspark library in Jupyter Notebook, running under a Python 3 kernel.
On the host where Jupyter Notebook runs, configure the following environment variables and activate them (e.g. by sourcing the shell profile they are defined in):
# pyspark
export PYLIB="$SPARK_HOME/python/lib"
export PYTHONPATH="$SPARK_HOME/python:$PYLIB/py4j-0.10.4-src.zip:$PYLIB/pyspark.zip"
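Note that the py4j zip name is version-specific and should match the file that actually ships under $SPARK_HOME/python/lib. A minimal sanity check, run in a notebook cell, that the kernel inherited these variables:

import os

# Quick check that the Python 3 kernel sees the Spark-related variables;
# if either prints '<not set>', restart Jupyter from a shell that exported them
for var in ('SPARK_HOME', 'PYTHONPATH'):
    print(var, '=', os.environ.get(var, '<not set>'))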
Then run the following in a Jupyter Notebook cell:
from pyspark import SparkConf
conf = SparkConf()   # note: this is a SparkConf, not a SparkContext, so avoid the name sc
conf
<pyspark.conf.SparkConf object at 0x7fba77d76450>
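The SparkConf object is usable at this point; as an illustration (the app name and local master below are assumptions, not part of the original setup), settings can be applied with chained setters:

from pyspark import SparkConf

# Illustrative values only; setMaster('local[*]') is an assumption for a local test
conf = SparkConf().setAppName('notebook-test').setMaster('local[*]')
print(conf.toDebugString())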
Although the test above shows that a SparkConf object can indeed be created, attempting to run Spark SQL in this setup fails.
Configure a startup file
Configure ~/.ipython/profile_default/startup/00-default-setup.py; IPython executes every script in this startup directory when the kernel boots, so the Spark paths are added before any notebook code runs:
import os
import sys

# Locate the Spark installation and expose its Python bindings to the kernel
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.4-src.zip'))
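An alternative to maintaining this startup script by hand is the third-party findspark package (pip install findspark), which performs the same sys.path insertion from SPARK_HOME; a minimal sketch:

import findspark

# Locates SPARK_HOME and prepends the pyspark and py4j paths to sys.path,
# equivalent to the startup script above
findspark.init()

from pyspark import SparkConf   # now importable without a startup file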
With the path setup in place, running Spark SQL in Jupyter Notebook now succeeds as well:
from os.path import expanduser, join, abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", "hdfs://10.110.18.216/user/hive/warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

# spark.sql returns a DataFrame; .show() triggers execution and prints the rows
spark.sql("select * from sparktest.weather_patient").show()