Scala
Start spark-shell
./bin/spark-shell
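The shell comes up with a preconfigured SparkSession bound to the name spark (and a SparkContext bound to sc); every example below uses it. As a minimal sanity check, assuming those default bindings:
scala> spark.version   // returns the version string of your Spark installation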
Create a Dataset
scala> val textFile = spark.read.textFile("file:///usr/hdp/2.6.1.0-129/spark2/README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
Inspect the Dataset
scala> textFile.count() // Number of items in this Dataset
res0: Long = 126 // May differ from yours, since README.md changes over time; the same goes for the other outputs below
scala> textFile.first() // First item in this Dataset
res1: String = # Apache Spark
Transform the Dataset
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]
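filter is a transformation: it only declares a new Dataset, and nothing is computed until an action runs. Forcing evaluation explicitly gives the same count as the chained form below (a sketch assuming the same README.md):
scala> linesWithSpark.count() // the action triggers evaluation of the filter
res2: Long = 15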
Transformations and actions can be chained
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
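Longer chains work the same way. As a slightly richer sketch in the spirit of the standard quick-start examples (the exact number tracks your README.md), this maps every line to its word count and reduces with max to find the longest line; in spark-shell the encoder implicits that map needs are imported automatically:
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15
If a Dataset will be reused across several actions, it can also be pulled into memory with linesWithSpark.cache().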
Python
Start the shell
./bin/pyspark
If PySpark was installed into your environment via pip, you can launch it directly with
pyspark
Create a DataFrame (in Python the structured API works with DataFrames rather than typed Datasets)
>>> textFile = spark.read.text("README.md")
DataFrame actions
>>> textFile.count() # Number of rows in this DataFrame
126
>>> textFile.first() # First row in this DataFrame
Row(value=u'# Apache Spark')
DataFrame transformations
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
Chaining transformations and actions
>>> textFile.filter(textFile.value.contains("Spark")).count() # How many lines contain "Spark"?
15
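The Scala map/reduce chain above has a DataFrame counterpart built from column functions. A sketch assuming the same README.md (the resulting count will track your copy of the file): split each line into words, take the size of the resulting array, and aggregate the maximum.
>>> from pyspark.sql.functions import col, size, split
>>> from pyspark.sql.functions import max as max_  # avoid shadowing Python's builtin max
>>> textFile.select(size(split(textFile.value, r"\s+")).name("numWords")).agg(max_(col("numWords"))).collect()
[Row(max(numWords)=15)]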