Scala
Start spark-shell
./bin/spark-shell
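The shell comes up with a preconfigured SparkSession bound to the name spark (and a SparkContext bound to sc); every example below uses it. As a minimal sanity check, assuming those default bindings:
scala> spark.version   // returns the version string of your Spark installation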
Create a Dataset
scala> val textFile = spark.read.textFile("file:///usr/hdp/2.6.1.0-129/spark2/README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
Inspect the Dataset
scala> textFile.count() // Number of items in this Dataset
res0: Long = 126 // May differ from yours, since README.md changes over time; the same goes for the other outputs below
scala> textFile.first() // First item in this Dataset
res1: String = # Apache Spark
Transform the Dataset
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]
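filter is a transformation: it only declares a new Dataset, and nothing is computed until an action runs. Forcing evaluation explicitly gives the same count as the chained form below (a sketch assuming the same README.md):
scala> linesWithSpark.count() // the action triggers evaluation of the filter
res2: Long = 15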
Transformations and actions can be chained
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
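Longer chains work the same way. As a slightly richer sketch in the spirit of the standard quick-start examples (the exact number tracks your README.md), this maps every line to its word count and reduces with max to find the longest line; in spark-shell the encoder implicits that map needs are imported automatically:
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15
If a Dataset will be reused across several actions, it can also be pulled into memory with linesWithSpark.cache().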
Python
Start the shell
./bin/pyspark
If PySpark was installed into your environment via pip, you can launch it directly with
pyspark
Create a DataFrame (in Python the structured API works with DataFrames rather than typed Datasets)
>>> textFile = spark.read.text("README.md")
DataFrame actions
>>> textFile.count() # Number of rows in this DataFrame
126
>>> textFile.first() # First row in this DataFrame
Row(value=u'# Apache Spark')
DataFrame transformations
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
Chaining transformations and actions
>>> textFile.filter(textFile.value.contains("Spark")).count() # How many lines contain "Spark"?
15
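The Scala map/reduce chain above has a DataFrame counterpart built from column functions. A sketch assuming the same README.md (the resulting count will track your copy of the file): split each line into words, take the size of the resulting array, and aggregate the maximum.
>>> from pyspark.sql.functions import col, size, split
>>> from pyspark.sql.functions import max as max_  # avoid shadowing Python's builtin max
>>> textFile.select(size(split(textFile.value, r"\s+")).name("numWords")).agg(max_(col("numWords"))).collect()
[Row(max(numWords)=15)]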