We can pick a machine learning task template that suits our application scenario from the engine template gallery.
Classification Engine Template (Scala)
PredictionIO's Classification Engine Template integrates Spark MLlib's Naive Bayes algorithm by default.
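The template itself is written in Scala, but the underlying call is simply MLlib's Naive Bayes trainer. Purely as an illustration (this is not the template's code), the following standalone pyspark sketch shows that algorithm on three-attribute records like the ones used later in this section; the sample vectors and smoothing value are made up for the example.
from pyspark import SparkContext
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="naive-bayes-sketch")

# Each record: a label (the "plan") and three features (attr0, attr1, attr2).
# The values here are illustrative only.
training = sc.parallelize([
    LabeledPoint(1.0, [0.0, 1.0, 0.0]),
    LabeledPoint(0.0, [1.0, 0.0, 0.0]),
    LabeledPoint(1.0, [0.0, 1.0, 1.0]),
])

# 1.0 is the additive-smoothing (lambda) parameter.
model = NaiveBayes.train(training, 1.0)
print(model.predict([0.0, 1.0, 0.0]))  # predicted label for a new record

sc.stop()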
Creating an Engine from an Engine Template
Download the source code from GitHub
$ git clone https://github.com/apache/incubator-predictionio-template-attribute-based-classifier.git MyClassification
Create an App ID and Access Key
$ pio app new MyApp1
The output looks like this:
[INFO] [HBLEvents] The table pio_event:events_1 doesn't exist yet. Creating now...
[INFO] [App$] Initialized Event Store for this app ID: 1.
[INFO] [Pio$] Created a new app:
[INFO] [Pio$] Name: MyApp1
[INFO] [Pio$] ID: 1
[INFO] [Pio$] Access Key: ytxhxxg7rLgqS8ZabTASRs74B8ba2_o8XxWR_U0GGH7EJun30N8RMcx7Q8UkI-nt
List all apps
$ pio app list
Collecting training data
The engine template reads four properties of each user record: attr0, attr1, attr2, and plan. You can send events to the PredictionIO Event Server over HTTP or through one of the SDKs. Below we use curl; for convenience, first set the environment variable ACCESS_KEY to the access key shown above.
$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
  "event" : "$set",
  "entityType" : "user",
  "entityId" : "u0",
  "properties" : {
    "attr0" : 0,
    "attr1" : 1,
    "attr2" : 0,
    "plan" : 1
  },
  "eventTime" : "2014-11-02T09:39:45.618-08:00"
}'
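For reference, here is a sketch of sending the same $set event through the Python SDK (installed later in this section) instead of curl; it assumes the Event Server is at localhost:7070 and that ACCESS_KEY is set in the environment.
import os
import predictionio

# Connect to the Event Server with the app's access key.
client = predictionio.EventClient(
    access_key=os.environ["ACCESS_KEY"],
    url="http://localhost:7070")

# Same $set event as the curl example above.
client.create_event(
    event="$set",
    entity_type="user",
    entity_id="u0",
    properties={"attr0": 0, "attr1": 1, "attr2": 0, "plan": 1})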
Query the Event Server
$ curl -i -X GET "http://localhost:7070/events.json?accessKey=$ACCESS_KEY"
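The same check can be done from Python; a minimal sketch against the same endpoint, using the third-party requests package (an assumption, not part of the template):
import os
import requests

# List the events stored for this app.
resp = requests.get(
    "http://localhost:7070/events.json",
    params={"accessKey": os.environ["ACCESS_KEY"]})
print(resp.json())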
Importing more data
To import more training data, we use the Python script data/import_eventserver.py that ships with the template.
First, install the Python SDK:
$ sudo pip install predictionio
Then import the training data required by the project:
$ cd MyClassification
$ python data/import_eventserver.py --access_key $ACCESS_KEY
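The script itself ships with the template, so there is no need to write it, but a minimal sketch of what such an import script does may help: it reads labelled records from a file and sends each one to the Event Server as a $set event. The file name and record format below are assumptions for illustration.
import argparse
import predictionio

def import_events(client, path):
    count = 0
    with open(path) as f:
        for line in f:
            # Assumed format: "<plan>,<attr0> <attr1> <attr2>"
            label, attrs = line.strip().split(",")
            attr0, attr1, attr2 = attrs.split()
            client.create_event(
                event="$set",
                entity_type="user",
                entity_id=str(count),
                properties={
                    "attr0": int(attr0),
                    "attr1": int(attr1),
                    "attr2": int(attr2),
                    "plan": int(label)})
            count += 1
    print("Imported %d events." % count)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--access_key", required=True)
    parser.add_argument("--url", default="http://localhost:7070")
    parser.add_argument("--file", default="./data/data.txt")
    args = parser.parse_args()
    client = predictionio.EventClient(access_key=args.access_key, url=args.url)
    import_events(client, args.file)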
Deploying the Engine as a service
Edit engine.json so that appName under datasource matches the app created earlier (MyApp1):
...
"datasource": {
  "params" : {
    "appName": "MyApp1"
  }
},
...
Build the MyClassification engine
$ pio build --verbose
During the build, sbt downloads dependencies from the Maven Central repository by default. To speed up the sbt build, you can switch to Aliyun's Maven mirror by configuring ~/.sbt/repositories as follows:
[repositories]
local
aliyun-nexus: http://maven.aliyun.com/nexus/content/groups/public/
typesafe: http://repo.typesafe.com/typesafe/ivy-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext], bootOnly
sonatype-oss-releases
maven-central
sonatype-oss-snapshots
Train the model
$ pio train
Deploy the service
$ pio deploy
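After pio deploy, the engine listens on port 8000 by default and can be queried over HTTP. A small Python sketch using the SDK's EngineClient, with a query whose fields mirror the three training attributes (the attribute values are made up):
import predictionio

# Query the deployed engine (default deploy address).
engine_client = predictionio.EngineClient(url="http://localhost:8000")
result = engine_client.send_query({"attr0": 2, "attr1": 0, "attr2": 0})
print(result)  # a dict containing the predicted label (the plan)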