TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a feature-vectorization method widely used in text mining to reflect how important a term is to a document in a corpus: TF counts how often the term appears in the document, while IDF down-weights terms that appear in many documents.
For the complete example, see $SPARK_HOME/examples/src/main/python/ml/tf_idf_example.py
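As a minimal sketch of the weighting itself (this mirrors the smoothed IDF formula documented for Spark MLlib; num_docs, doc_freq, and tf are illustrative values, not part of the example below):
import math
num_docs = 3                                     # |D|: number of documents in the corpus
doc_freq = 1                                     # DF(t, D): documents containing term t
idf = math.log((num_docs + 1) / (doc_freq + 1))  # smoothed IDF
tf = 2                                           # TF(t, d): occurrences of t in document d
print(tf * idf)                                  # TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)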
Import dependencies
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
Prepare the training data
sentenceData = spark.createDataFrame([
(0.0, "Hi I heard about Spark"),
(0.0, "I wish Java could use case classes"),
(1.0, "Logistic regression models are neat")
], ["label", "sentence"])
Use Tokenizer to split each sentence into words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
Use HashingTF to map the words to term-frequency feature vectors (numFeatures sets the number of hash buckets; in practice a power of two is usually recommended)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
Fit an IDF model and use it to rescale the term-frequency vectors
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
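The three stages above can also be chained with pyspark.ml.Pipeline so that a single fit/transform runs them in order; a minimal sketch reusing the tokenizer, hashingTF, and idf objects defined above:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf])  # Tokenizer -> HashingTF -> IDF
pipelineModel = pipeline.fit(sentenceData)               # fits IDF on the hashed term frequencies
pipelineModel.transform(sentenceData).select("label", "features").show()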
Word2Vec
Word2Vec learns a vector for each word and represents a document by the average of its word vectors. For the complete example, see $SPARK_HOME/examples/src/main/python/ml/word2vec_example.py
Import dependencies
from pyspark.ml.feature import Word2Vec
Prepare the data
documentDF = spark.createDataFrame([
("Hi I heard about Spark".split(" "), ),
("I wish Java could use case classes".split(" "), ),
("Logistic regression models are neat".split(" "), )
], ["text"])
Train the Word2Vec model
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)
Transform the data
result = model.transform(documentDF)
for row in result.collect():
text, vector = row
print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))
CountVectorizer
CountVectorizer converts a collection of token lists into vectors of token counts, learning the vocabulary from the corpus. For the complete example, see $SPARK_HOME/examples/src/main/python/ml/count_vectorizer_example.py
Import dependencies
from pyspark.ml.feature import CountVectorizer
Prepare the data
df = spark.createDataFrame([
(0, "a b c".split(" ")),
(1, "a b b c a".split(" "))
], ["id", "words"])
Fit the model (vocabSize caps the vocabulary at 3 terms; minDF=2.0 keeps only terms that appear in at least 2 documents)
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)
model = cv.fit(df)
转换数据
result = model.transform(df)
result.show(truncate=False)
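The fitted CountVectorizerModel exposes the learned vocabulary, and a model can also be built from a predefined vocabulary; a minimal sketch (the ["a", "b", "c"] vocabulary below is illustrative):
from pyspark.ml.feature import CountVectorizerModel
print(model.vocabulary)  # learned vocabulary (the top-vocabSize terms by corpus frequency)
cvModel = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words", outputCol="features")
cvModel.transform(df).show(truncate=False)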