Requirements
Install the Hadoop environment on the three hosts 10.110.18.216, 10.110.18.217, and 10.110.18.218.
Prerequisites
Java environment: JDK 8
Hadoop version: 2.6.5
Passwordless SSH login
Passwordless login to the local host
Modify /etc/hosts on each of the three hosts 10.110.18.216, 10.110.18.217, and 10.110.18.218
10.110.18.216 indata-10-110-18-216.indata.com indata-10-110-18-216
10.110.18.217 indata-10-110-18-217.indata.com indata-10-110-18-217
10.110.18.218 indata-10-110-18-218.indata.com indata-10-110-18-218
Set the hostnames to indata-10-110-18-216, indata-10-110-18-217, and indata-10-110-18-218 respectively (one way to do this is sketched below).
Here, 10.110.18.216 is the master and 10.110.18.217/10.110.18.218 are the slaves.
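One way to set the hostname, assuming a systemd-based distribution (on other systems, edit /etc/hostname or /etc/sysconfig/network instead):
# on 10.110.18.216; repeat on the other two hosts with their own names
hostnamectl set-hostname indata-10-110-18-216
# verify
hostname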
Generate the public/private key pair
ssh-keygen -t rsa
Append it to authorized_keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Grant permissions
chmod 600 .ssh/authorized_keys
Verify passwordless login to the local host
ssh indata-10-110-18-216
Do the same on 10.110.18.217 and 10.110.18.218.
Passwordless login from master to slaves
Copy the master host's id_rsa.pub to the slaves and append it (run on each slave)
scp [email protected]:/root/.ssh/id_rsa.pub /root/
cat id_rsa.pub >> ~/.ssh/authorized_keys
rm -rf /root/id_rsa.pub
Test from the master host
ssh indata-10-110-18-217
Passwordless login from slaves to master
Copy each slave host's public key to the master host (run on the master for 10.110.18.217 and 10.110.18.218 in turn)
scp [email protected]:/root/.ssh/id_rsa.pub /root
cat /root/id_rsa.pub >> /root/.ssh/authorized_keys
Test from the slave hosts
ssh indata-10-110-18-216
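As an alternative to the manual scp/cat steps above, ssh-copy-id (if available) appends the local public key to a remote host's authorized_keys in one step; a hedged sketch, run on each host for every peer it should reach:
ssh-copy-id [email protected]
ssh-copy-id [email protected]
ssh-copy-id [email protected]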
Master host environment setup
Extract the Hadoop archive
tar -xzvf hadoop-2.6.5.tar.gz -C /data
Configure environment variables (append to /etc/profile)
# Hadoop
export HADOOP_HOME=/data/hadoop-2.6.5
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
Activate the environment variables
source /etc/profile
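A quick sanity check that the variables took effect (a sketch; exact output varies):
# should print the Hadoop 2.6.5 version banner and the resolved binary path
hadoop version
which hadoop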
Configure core-site.xml
Specify the NameNode IP and port, and the temporary directory used for Hadoop data storage
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/hadoop-2.6.5/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://10.110.18.216:8020</value>
</property>
</configuration>
If hadoop.tmp.dir is not configured, it defaults to /tmp/hadoop-${user.name}. That directory is deleted whenever the host reboots, so the filesystem would have to be re-formatted each time, otherwise errors occur.
Configure hdfs-site.xml
Set the HDFS replication factor to 3 and specify the storage directories for the NameNode and the DataNodes
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/data/hadoop-2.6.5/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/data/hadoop-2.6.5/hdfs/data</value>
</property>
</configuration>
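The local directories referenced in core-site.xml and hdfs-site.xml can be created ahead of time; a minimal sketch assuming the paths used in this guide:
mkdir -p /data/hadoop-2.6.5/tmp
mkdir -p /data/hadoop-2.6.5/hdfs/name   # NameNode metadata (master)
mkdir -p /data/hadoop-2.6.5/hdfs/data   # DataNode blocks (slaves)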
Configure mapred-site.xml
Copy mapred-site.xml.template and modify it
cp mapred-site.xml.template mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>http://10.110.18.216:9001</value>
</property>
</configuration>
Configure yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>indata-10-110-18-216</value>
</property>
</configuration>
Configure masters
Edit /data/hadoop-2.6.5/etc/hadoop/masters
indata-10-110-18-216
Configure slaves
Edit /data/hadoop-2.6.5/etc/hadoop/slaves
indata-10-110-18-217
indata-10-110-18-218
Slave host environment setup
Copy the installed environment from the master to the slave hosts (a sketch follows)
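A minimal sketch of the copy, assuming the same /data layout on the slaves (the Hadoop entries in /etc/profile must be added on the slaves as well):
# run on the master, once per slave
scp -r /data/hadoop-2.6.5 [email protected]:/data/
scp -r /data/hadoop-2.6.5 [email protected]:/data/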
Delete the slaves configuration file on each slave host
rm -rf /data/hadoop-2.6.5/etc/hadoop/slaves
Start the cluster
On the master host, format the HDFS filesystem
hdfs namenode -format
If DataNodes fail to start, delete the /data/hadoop-2.6.5/tmp and /data/hadoop-2.6.5/hdfs directories and re-format (a sketch of the recovery steps follows).
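A hedged sketch of that recovery, assuming the directories from this guide (this wipes all HDFS data; clear the corresponding directories on the slaves as well):
stop-all.sh
rm -rf /data/hadoop-2.6.5/tmp /data/hadoop-2.6.5/hdfs
hdfs namenode -format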
Start Hadoop
sbin/start-all.sh
Check the running processes with the jps command
# master
1040 SecondaryNameNode
4201 Jps
1211 ResourceManager
828 NameNode
# slave
6202 DataNode
6315 NodeManager
6622 Jps
Check the Hadoop cluster status
hadoop dfsadmin -report
Access HDFS through a browser
http://10.110.18.216:50070
Access YARN through a browser
http://10.110.18.216:8088
Setting up Hive
Set up Hive on the master host (10.110.18.216)
tar -xzvf apache-hive-2.3.3-bin.tar.gz
Configure environment variables (append to /etc/profile)
# Hive
export HIVE_HOME=/data/apache-hive-2.3.3-bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$HIVE_HOME/bin:$PATH
Activate the environment variables
source /etc/profile
Configure hive-site.xml
Enter $HIVE_CONF_DIR
cp hive-default.xml.template hive-site.xml
Create the Hive-related directories in HDFS
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir -p /tmp/hive
hdfs dfs -chmod -R 777 /user/hive/warehouse
hdfs dfs -chmod -R 777 /tmp/hive
hdfs dfs -chown -R hdfs:hdfs /user/hive/warehouse
hdfs dfs -chown -R hdfs:hdfs /tmp/hive
Create a temporary directory under $HIVE_HOME
cd $HIVE_HOME && mkdir tmp
chmod -R 777 tmp/
In hive-site.xml, replace every occurrence of ${system:java.io.tmpdir} with this temporary directory and every occurrence of ${system:user.name} with root (a sed sketch follows).
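A minimal sed sketch of the two replacements, assuming the temporary directory created above; keep a backup of hive-site.xml first:
cd $HIVE_CONF_DIR
cp hive-site.xml hive-site.xml.bak
sed -i 's#${system:java.io.tmpdir}#/data/apache-hive-2.3.3-bin/tmp#g' hive-site.xml
sed -i 's#${system:user.name}#root#g' hive-site.xml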
Configure hive-env.sh
export HADOOP_HOME=/data/hadoop-2.6.5
export HIVE_CONF_DIR=/data/apache-hive-2.3.3-bin/conf
export HIVE_AUX_JARS_PATH=/data/apache-hive-2.3.3-bin/lib
Configure MySQL
Hive uses Derby as its default metastore database; switch it to MySQL by setting the following properties in hive-site.xml
Property | Value |
---|---|
javax.jdo.option.ConnectionDriverName | com.mysql.jdbc.Driver |
javax.jdo.option.ConnectionURL | jdbc:mysql://10.110.13.208:3307/hive?createDatabaseIfNotExist=true |
javax.jdo.option.ConnectionUserName | root |
javax.jdo.option.ConnectionPassword | 123456 |
Place the MySQL JDBC driver jar in Hive's lib directory (sketched below).
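A hedged sketch, assuming the connector jar (name and version illustrative) has been downloaded to /root:
cp /root/mysql-connector-java-5.1.47.jar $HIVE_HOME/lib/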
Initialize the metastore database
cd $HIVE_HOME/bin
schematool -initSchema -dbType mysql
Modify the Hadoop configuration
Add the following to $HADOOP_HOME/etc/hadoop/core-site.xml
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
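These proxyuser settings only take effect after the updated core-site.xml is distributed to the other hosts and Hadoop is restarted; a minimal sketch assuming the layout above:
scp $HADOOP_HOME/etc/hadoop/core-site.xml [email protected]:/data/hadoop-2.6.5/etc/hadoop/
scp $HADOOP_HOME/etc/hadoop/core-site.xml [email protected]:/data/hadoop-2.6.5/etc/hadoop/
stop-all.sh && start-all.sh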
Start hiveserver2
$HIVE_HOME/bin/hiveserver2
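The command above runs in the foreground; a hedged alternative that keeps hiveserver2 running in the background and writes its output to a log file (log path illustrative):
nohup $HIVE_HOME/bin/hiveserver2 > /data/apache-hive-2.3.3-bin/tmp/hiveserver2.log 2>&1 &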
The default web UI address is
http://10.110.18.216:10002
Simple Hive operations
Hive can be accessed in several ways, for example directly with the hive CLI or with the beeline tool.
Basic hive operations
Accessing Hive via beeline is similar to accessing MySQL; the default port is 10000
beeline -u jdbc:hive2://10.110.18.216:10000
Show databases
show databases;
Create a database
create database if not exists spark_demo;
Show tables
show tables;
Create a regular table
create table if not exists heartdisease
(
age int,
sex string,
cp string,
trestbps int,
chol int,
fbs string,
restecg string,
thalach int,
exang string,
oldpeak float,
slop string,
ca int,
thal string,
status string
)
row format delimited fields terminated by '\t'
stored as textfile;
Load data
load data local inpath '/data/idsw/test-data/heartdisease.csv' overwrite into table heartdisease;
Java access
pom dependencies
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>2.3.3</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.5</version>
</dependency>
<dependency>
<groupId>jdk.tools</groupId>
<artifactId>jdk.tools</artifactId>
<version>1.6</version>
<scope>system</scope>
<systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
Load the driver
private static String driverName = "org.apache.hive.jdbc.HiveDriver";
try {
Class.forName(driverName);
} catch (ClassNotFoundException e) {
e.printStackTrace();
System.exit(1);
}
Create a database connection
Connection con = DriverManager.getConnection("jdbc:hive2://10.110.18.216:10000/spark_demo", "root", "root");
Create a statement and execute operations
Statement stmt = con.createStatement();
String tableName = "wyphao";
stmt.execute("drop table if exists " + tableName);
stmt.execute("create table " + tableName + " (key int, value string)");
System.out.println("Create table success!");
String sql = "show tables '" + tableName + "'";
System.out.println("Running: " + sql);
ResultSet res = stmt.executeQuery(sql);
if (res.next()) {
System.out.println(res.getString(1));
}
Setting up Spark
Extract spark-2.2.0-bin-hadoop2.6.tgz
tar -xzvf spark-2.2.0-bin-hadoop2.6.tgz
Configure environment variables
# Spark
export SPARK_HOME=/data/spark-2.2.0-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
Configure spark-env.sh
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
export HADOOP_CONF_DIR=/data/hadoop-2.6.5/etc/hadoop
Configure spark-defaults.conf
spark.master yarn
spark.submit.deployMode cluster
Verify YARN
Run the bundled example
./bin/run-example SparkPi 10
Run with spark-submit
On the master host (10.110.18.216), submit the job with spark-submit in yarn cluster mode
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
examples/jars/spark-examples*.jar \
10
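In yarn cluster mode the driver runs inside YARN, so the result is not printed on the submitting console; a hedged sketch for checking the application and retrieving its logs with the standard YARN CLI (application ID illustrative):
yarn application -list
yarn logs -applicationId application_1234567890123_0001 | grep "Pi is roughly"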