Requirements

Install a Hadoop environment on the three hosts 10.110.18.216~10.110.18.218.

Prerequisites

Java environment: JDK 8

Hadoop version: 2.6.5

Passwordless SSH login

Local passwordless login

On each of the three hosts 10.110.18.216, 10.110.18.217 and 10.110.18.218, edit /etc/hosts:

10.110.18.216   indata-10-110-18-216.indata.com   indata-10-110-18-216
10.110.18.217   indata-10-110-18-217.indata.com   indata-10-110-18-217
10.110.18.218   indata-10-110-18-218.indata.com   indata-10-110-18-218

Change the hostnames to indata-10-110-18-216, indata-10-110-18-217 and indata-10-110-18-218 respectively.

Here 10.110.18.216 is the master, and 10.110.18.217/10.110.18.218 are the slaves.
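
A minimal sketch for setting the hostname on 10.110.18.216, assuming a systemd-based distribution (run the matching command on the other two hosts):

hostnamectl set-hostname indata-10-110-18-216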

Generate the RSA key pair

ssh-keygen -t rsa

Append the public key to authorized_keys

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Set permissions

chmod 600 ~/.ssh/authorized_keys

Verify local passwordless login

ssh indata-10-110-18-216

Repeat the same steps on 10.110.18.217 and 10.110.18.218.

Passwordless login from master to slaves

Copy the master's id_rsa.pub to the slaves and append it to authorized_keys (run on each slave):

scp root@10.110.18.216:/root/.ssh/id_rsa.pub /root/
cat /root/id_rsa.pub >> ~/.ssh/authorized_keys
rm -rf /root/id_rsa.pub

Test from the master host

ssh indata-10-110-18-217

Passwordless login from slaves to master

Copy each slave's public key to the master host and append it (run on the master, once per slave):

scp root@10.110.18.217:/root/.ssh/id_rsa.pub /root
cat /root/id_rsa.pub >> /root/.ssh/authorized_keys

Test on the slave host

ssh indata-10-110-18-216

Master host setup

Extract the Hadoop archive

tar -xzvf hadoop-2.6.5.tar.gz -C /data

Configure environment variables (append to /etc/profile):

# Hadoop
export HADOOP_HOME=/data/hadoop-2.6.5
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Apply the environment variables

source /etc/profile
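
To confirm the Hadoop binaries are on the PATH, you can run, for example:

hadoop version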

Configure core-site.xml

Specify the NameNode IP and port, and the temporary directory for Hadoop data:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop-2.6.5/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://10.110.18.216:8020</value>
    </property>
</configuration>

If hadoop.tmp.dir is not configured, it defaults to a directory under /tmp (/tmp/hadoop-${user.name}), which is deleted every time the host reboots; you would then have to re-run the format step, otherwise HDFS reports errors.

Configure hdfs-site.xml

Set the HDFS replication factor to 3, and specify the NameNode and DataNode storage directories:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/data/hadoop-2.6.5/hdfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/data/hadoop-2.6.5/hdfs/data</value>
    </property>
</configuration>

Configure mapred-site.xml

Copy mapred-site.xml.template and edit it:

cp mapred-site.xml.template mapred-site.xml
<configuration>
  <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
  </property>
   <property>
      <name>mapred.job.tracker</name>
      <value>http://10.110.18.216:9001</value>
  </property>
</configuration>

Configure yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>indata-10-110-18-216</value>
    </property>
</configuration>

Configure masters

Edit /data/hadoop-2.6.5/etc/hadoop/masters:

indata-10-110-18-216

Configure slaves

Edit /data/hadoop-2.6.5/etc/hadoop/slaves:

indata-10-110-18-217
indata-10-110-18-218
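
Depending on how the JDK was installed, you may also need to set JAVA_HOME explicitly in /data/hadoop-2.6.5/etc/hadoop/hadoop-env.sh; a minimal sketch, where the JDK path is only an assumption:

# example only: point this at the actual JDK 8 installation
export JAVA_HOME=/usr/java/jdk1.8.0_171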

Slave host setup

Copy the master's installation environment to the slave hosts.
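
A minimal sketch using scp, assuming the root user and the same /data layout on the slaves (the /etc/profile entries need to be replicated as well):

scp -r /data/hadoop-2.6.5 root@indata-10-110-18-217:/data/
scp -r /data/hadoop-2.6.5 root@indata-10-110-18-218:/data/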

Delete the slaves configuration file (on the slave hosts):

rm -rf /data/hadoop-2.6.5/etc/hadoop/slaves

Start the cluster

On the master host, format the HDFS filesystem:

hdfs namenode -format

If the DataNodes fail to start, delete the /data/hadoop-2.6.5/tmp and /data/hadoop-2.6.5/hdfs directories and format again, as sketched below.
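
A sketch of that cleanup, using the paths configured above:

rm -rf /data/hadoop-2.6.5/tmp /data/hadoop-2.6.5/hdfs   # on every node
hdfs namenode -format                                   # on the master only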

Start Hadoop

sbin/start-all.sh

Check the running processes with jps:

# master
1040 SecondaryNameNode
4201 Jps
1211 ResourceManager
828 NameNode
# slave
6202 DataNode
6315 NodeManager
6622 Jps

Check the Hadoop cluster status

hdfs dfsadmin -report

Access HDFS (NameNode web UI) in a browser:

http://10.110.18.216:50070

Access YARN (ResourceManager web UI) in a browser:

http://10.110.18.216:8088

Set up Hive

Set up Hive on the master host (10.110.18.216):

tar -xzvf apache-hive-2.3.3-bin.tar.gz -C /data

Configure environment variables (append to /etc/profile):

# Hive
export HIVE_HOME=/data/apache-hive-2.3.3-bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$HIVE_HOME/bin:$PATH

Apply the environment variables

source /etc/profile

Configure hive-site.xml

Go to $HIVE_CONF_DIR:

cp hive-default.xml.template hive-site.xml

Create the Hive directories in HDFS:

hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir -p /tmp/hive
hdfs dfs -chmod -R 777 /user/hive/warehouse
hdfs dfs -chmod -R 777 /tmp/hive
hdfs dfs -chown -R hdfs:hdfs /user/hive/warehouse
hdfs dfs -chown -R hdfs:hdfs /tmp/hive

Create a temporary directory under $HIVE_HOME:

cd $HIVE_HOME && mkdir tmp
chmod -R 777 tmp/

In hive-site.xml, replace every ${system:java.io.tmpdir} with the temporary directory created above, and every ${system:user.name} with root.
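
A sketch of doing the replacement with GNU sed (the tmp path matches the directory created above):

cd $HIVE_CONF_DIR
sed -i 's#\${system:java.io.tmpdir}#/data/apache-hive-2.3.3-bin/tmp#g' hive-site.xml
sed -i 's#\${system:user.name}#root#g' hive-site.xml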

Configure hive-env.sh (created from hive-env.sh.template):

export HADOOP_HOME=/data/hadoop-2.6.5
export HIVE_CONF_DIR=/data/apache-hive-2.3.3-bin/conf
export HIVE_AUX_JARS_PATH=/data/apache-hive-2.3.3-bin/lib

Configure MySQL

Hive's default metastore database is Derby; to switch to MySQL, configure the following properties in hive-site.xml:

<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://10.110.13.208:3307/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
</property>

Copy the MySQL JDBC driver jar into Hive's lib directory.
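
For example (the connector version below is an assumption; use whichever MySQL JDBC jar you have):

cp mysql-connector-java-5.1.47.jar $HIVE_HOME/lib/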

Initialize the metastore database

cd $HIVE_HOME/bin
schematool -initSchema -dbType mysql
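
You can check the result afterwards with schematool's info option, for example:

schematool -dbType mysql -info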

Modify the Hadoop configuration

Edit $HADOOP_HOME/etc/hadoop/core-site.xml and add:

<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>

Start hiveserver2

$HIVE_HOME/bin/hiveserver2
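
The command above runs in the foreground; to keep HiveServer2 running after the shell exits, a common sketch (the log path is just an example) is:

nohup $HIVE_HOME/bin/hiveserver2 > /tmp/hiveserver2.log 2>&1 &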

The default web UI address is:

http://10.110.18.216:10002

Basic Hive usage

Hive can be accessed in several ways, for example directly with the hive CLI or through the beeline client.

Basic operations

Accessing Hive through beeline is similar to using the mysql client; the default port is 10000.

beeline -u jdbc:hive2://10.110.18.216:10000

Show databases

show databases;

Create a database

create database if not exists spark_demo;

Show tables

show tables;

Create a regular table

create table if not exists heartdisease
(
  age int,
  sex string,
  cp string,
  trestbps int,
  chol int,
  fbs string,
  restecg string,
  thalach int,
  exang string,
  oldpeak float,
  slop string,
  ca int,
  thal string,
  status string
)
row format delimited fields terminated by '\t'
stored as textfile;

Load data

load data local inpath '/data/idsw/test-data/heartdisease.csv' overwrite into table heartdisease;
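
To verify the load, a quick count through beeline (sketch):

beeline -u jdbc:hive2://10.110.18.216:10000 -e "select count(*) from heartdisease;"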

Java access

POM dependencies

        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>2.3.3</version>
        </dependency>


        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.6.5</version>
        </dependency>

        <dependency>
            <groupId>jdk.tools</groupId>
            <artifactId>jdk.tools</artifactId>
            <version>1.6</version>
            <scope>system</scope>
            <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
        </dependency>

Load the driver

private static String driverName = "org.apache.hive.jdbc.HiveDriver";

try {
    Class.forName(driverName);
} catch (ClassNotFoundException e) {
    e.printStackTrace();
    System.exit(1);
}

Create the database connection

Connection con = DriverManager.getConnection("jdbc:hive2://10.110.18.216:10000/spark_demo", "root", "root");

Create a Statement and execute operations

        Statement stmt = con.createStatement();
        String tableName = "wyphao";
        stmt.execute("drop table if exists " + tableName);
        stmt.execute("create table " + tableName +  " (key int, value string)");
        System.out.println("Create table success!");
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }

Set up Spark

Extract spark-2.2.0-bin-hadoop2.6.tgz

tar -xzvf spark-2.2.0-bin-hadoop2.6.tgz -C /data

Configure environment variables

# Spark
export SPARK_HOME=/data/spark-2.2.0-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH

Configure spark-env.sh

cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
# add to spark-env.sh:
export HADOOP_CONF_DIR=/data/hadoop-2.6.5/etc/hadoop

Configure spark-defaults.conf

spark.master yarn
spark.submit.deployMode cluster

Verify YARN

Run the example

./bin/run-example SparkPi 10

Run with spark-submit

On the master host (10.110.18.216), submit the program with spark-submit in yarn cluster mode:

$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    examples/jars/spark-examples*.jar \
    10
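
In yarn cluster mode the driver output (the computed value of Pi) goes to the YARN container logs rather than the local console; once the job finishes you can fetch them with the application ID printed by spark-submit (placeholder below):

yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX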
