Prerequisites

Alibaba Cloud ECS:

  • Zone: China East 2, Zone D

  • Instance type: ecs.gn5i-c2g1.large

  • Image: CentOS 7.3

  • CPU: 2 cores

  • Memory: 8 GB

  • GPU: Nvidia Tesla P4

  • Bandwidth: 100 Mbps (peak)

Verification

Check the GPU

$ lspci | grep -i nvidia

00:07.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)

Linux version

$ uname -m && cat /etc/*release

x86_64
CentOS Linux release 7.3.1611 (Core)

gcc version

$ gcc --version

gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC)

Kernel version

$ uname -r

3.10.0-514.26.2.el7.x86_64

Preparation

List the installed kernel packages

rpm -qa | grep kernel | sort

kernel-3.10.0-514.26.2.el7.x86_64
kernel-3.10.0-514.el7.x86_64
kernel-tools-3.10.0-514.26.2.el7.x86_64
kernel-tools-libs-3.10.0-514.26.2.el7.x86_64
kmod-kernel-mft-mlnx-4.7.0-1.rhel7u3.x86_64
kmod-mlnx-ofa_kernel-4.1-OFED.4.1.1.0.2.1.gc22af88.rhel7u3.x86_64
mlnx-ofa_kernel-4.1-OFED.4.1.1.0.2.1.gc22af88.rhel7u3.x86_64
mlnx-ofa_kernel-devel-4.1-OFED.4.1.1.0.2.1.gc22af88.rhel7u3.x86_64

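The kernel-devel and kernel-headers packages are missing. Install the downloaded RPMs, whose version must match the running kernel (3.10.0-514.26.2.el7.x86_64)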

rpm -ivh kernel-devel-3.10.0-514.26.2.el7.x86_64.rpm  
rpm -ivh kernel-headers-3.10.0-514.26.2.el7.x86_64.rpm

Reboot the host.
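The kernel-devel and kernel-headers RPMs are only useful to the NVIDIA driver build if their version matches the running kernel exactly. The following is a minimal Python sketch of that check, assuming the package list has already been captured as plain names; the helper name missing_kernel_packages is hypothetical, and the sample data is copied from the listing above.

```python
def missing_kernel_packages(kernel_release, installed_packages):
    """Return the kernel-devel/kernel-headers packages that are absent
    for the given running kernel release (the output of `uname -r`)."""
    required = [
        "kernel-devel-" + kernel_release,
        "kernel-headers-" + kernel_release,
    ]
    installed = set(installed_packages)
    return [name for name in required if name not in installed]


# Sample data taken from the listing above: the running kernel and
# two of the packages reported by `rpm -qa | grep kernel`.
release = "3.10.0-514.26.2.el7.x86_64"
installed = [
    "kernel-3.10.0-514.26.2.el7.x86_64",
    "kernel-tools-3.10.0-514.26.2.el7.x86_64",
]

# Before the two RPMs are installed, both names are reported as missing.
print(missing_kernel_packages(release, installed))
```

On the host itself, platform.release() returns the same string as uname -r, and the lines printed by rpm -qa can be passed in as installed_packages.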

Python 3

Download Python-3.6.3.tgz and extract it

tar -xzvf Python-3.6.3.tgz

Install the required dependencies

yum -y install readline-devel python-setuptools 'zlib*'

Configure Python

cd Python-3.6.3
./configure --enable-shared --prefix=/usr/local

Build and install

make && make install

Create a symbolic link

ln -sv /usr/local/bin/python3 /usr/local/bin/python

Create /etc/profile.d/horovod.sh with the following content

# horovod
export LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/usr/local/lib64

Load the configuration

source /etc/profile

Upgrade pip (this is the system pip, which belongs to Python 2.7)

pip install --upgrade pip

Install virtualenv

pip install virtualenv

Create a non-root user

Create the non-root user horovod

useradd horovod

Set a password for horovod

passwd horovod

To give horovod sudo privileges, first make /etc/sudoers writable

chmod u+w /etc/sudoers

Then add the following line to /etc/sudoers

horovod ALL=(ALL)       ALL

Create the virtualenv

Switch to the horovod user

su - horovod

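Create the hidden directory .pip (without sudo, so that it is owned by horovod)

mkdir -p ~/.pip

Point pip at a mirror inside China by creating ~/.pip/pip.conf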

[global]
index-url=http://mirrors.aliyun.com/pypi/simple/

[install]
trusted-host=mirrors.aliyun.com
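The mirror URL uses plain http, which is why the host also has to be listed under trusted-host. As a sketch of how the two settings relate, the file can be parsed with Python's configparser and the host names compared; the helper names read_pip_mirror and host_is_trusted are hypothetical, and the configuration text is the one shown above.

```python
import configparser
from urllib.parse import urlparse

PIP_CONF = """\
[global]
index-url=http://mirrors.aliyun.com/pypi/simple/

[install]
trusted-host=mirrors.aliyun.com
"""


def read_pip_mirror(text):
    """Return (index_url, trusted_host) from pip.conf-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return (
        parser.get("global", "index-url"),
        parser.get("install", "trusted-host"),
    )


def host_is_trusted(index_url, trusted_host):
    """True when the index URL's host name equals the trusted host."""
    return urlparse(index_url).hostname == trusted_host


index_url, trusted_host = read_pip_mirror(PIP_CONF)
print(index_url, trusted_host, host_is_trusted(index_url, trusted_host))
```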

Create the Python 3 virtual environment

virtualenv -p python3 py3

Activate the Python 3 environment

source ~/py3/bin/activate

Verify Python; it should now report a Python 3 version

python -V
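Beyond the version string, it can be useful to confirm that the interpreter really is the one inside py3. The sketch below is one common way to detect that from Python itself; the helper name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv():
    """Best-effort check that the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base != sys.prefix


print("Python", sys.version.split()[0])
print("major version is 3:", sys.version_info.major == 3)
print("inside a virtual environment:", in_virtualenv())
```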

Install CUDA

Download cuda_9.0.176_384.81_linux.run and make it executable

sudo chmod a+x cuda_9.0.176_384.81_linux.run

Install dependencies

sudo yum install gcc gcc-c++ zlib* -y

Run the CUDA installer

sudo ./cuda_9.0.176_384.81_linux.run
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81?
(y)es/(n)o/(q)uit: yes

Do you want to install the OpenGL libraries?
(y)es/(n)o/(q)uit [ default is yes ]: yes

Do you want to run nvidia-xconfig?
This will update the system X configuration file so that the NVIDIA X driver
is used. The pre-existing X configuration file will be backed up.
This option should not be used on systems that require a custom
X configuration, such as systems with multiple GPU vendors.
(y)es/(n)o/(q)uit [ default is no ]: 

Install the CUDA 9.0 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
 [ default is /usr/local/cuda-9.0 ]: 


 Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: 

Install the CUDA 9.0 Samples?
(y)es/(n)o/(q)uit: 

Enter CUDA Samples Location
 [ default is /root ]: 


 ===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-9.0
Samples:  Installed in /root, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-9.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-9.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.0/doc/pdf for detailed information on setting up CUDA.

Logfile is /tmp/cuda_install_1213.log

Append the following to /etc/profile.d/horovod.sh

export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64

Verify the CUDA installation

nvidia-smi
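nvidia-smi also has a machine-readable mode, which is handy for scripted checks. The sketch below parses the output of nvidia-smi --query-gpu=name,driver_version --format=csv,noheader; the helper name parse_gpu_query is hypothetical, and the sample line is an assumed output built from the GPU model and driver version named in this guide, not a captured log.

```python
import csv
import io


def parse_gpu_query(text):
    """Parse the CSV printed by
    `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        {"name": row[0].strip(), "driver_version": row[1].strip()}
        for row in rows
        if len(row) >= 2
    ]


# Assumed sample output (one line per GPU), not taken from a real run.
sample = "Tesla P4, 384.81\n"
gpus = parse_gpu_query(sample)
print(gpus)
```

On the host, the text can be obtained with subprocess.check_output and the same command-line arguments, passing universal_newlines=True to get a string.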

Install cuDNN

Download cudnn-9.0-linux-x64-v7.solitairetheme8 and rename it to a .tgz file

sudo mv cudnn-9.0-linux-x64-v7.solitairetheme8 cudnn-9.0-linux-x64-v7.tgz

Extract cudnn-9.0-linux-x64-v7.tgz

tar -xzvf cudnn-9.0-linux-x64-v7.tgz

Copy the cuDNN files into the CUDA toolkit directory

$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h
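After copying, the installed cuDNN version can be read back from the header, since cudnn.h in cuDNN 7 defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL. The following is a small sketch of that lookup; the helper name cudnn_version is hypothetical, and the minor and patch numbers in the sample text are made-up placeholders.

```python
import re


def cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn.h."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            r"^\s*#define\s+" + key + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            raise ValueError(key + " not found in header")
        parts.append(int(match.group(1)))
    return tuple(parts)


# Placeholder header fragment; only the major version 7 comes from this guide.
sample_header = """\
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
"""
print(cudnn_version(sample_header))
```

On the host, the header text would come from reading /usr/local/cuda/include/cudnn.h.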

Reload the configuration with source /etc/profile

Install tensorflow-gpu

Inside the py3 environment, install tensorflow-gpu

pip install tensorflow-gpu==1.6.0

Verify the tensorflow-gpu installation

import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

Install NCCL 2

Download nccl_2.0.5-3+cuda9.0_amd64.txz and extract it

sudo tar -xvf nccl_2.0.5-3+cuda9.0_amd64.txz

Copy the extracted directory to /usr/local

sudo cp -r nccl_2.0.5-3+cuda9.0_amd64 /usr/local

Append the following to /etc/profile.d/horovod.sh

export HOROVOD_NCCL_HOME=/usr/local/nccl_2.0.5-3+cuda9.0_amd64
export HOROVOD_GPU_ALLREDUCE=NCCL
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOROVOD_NCCL_HOME/lib

Reload the configuration with source /etc/profile
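By this point LD_LIBRARY_PATH has been extended three times (Python, CUDA, NCCL), and a typo in any entry only shows up later as a missing shared library. As a rough sanity check, the sketch below lists entries that do not exist as directories; the helper name missing_library_dirs is hypothetical.

```python
import os


def missing_library_dirs(ld_library_path):
    """Return the LD_LIBRARY_PATH entries that are not existing directories."""
    entries = [entry for entry in ld_library_path.split(":") if entry]
    return [entry for entry in entries if not os.path.isdir(entry)]


# Inspect the value visible to the current process; an empty list
# means every entry resolves to a directory.
current = os.environ.get("LD_LIBRARY_PATH", "")
print(missing_library_dirs(current))
```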

Install Open MPI

Download openmpi-3.0.1.tar.gz and extract it

sudo tar -xzvf  openmpi-3.0.1.tar.gz

Install the build dependencies

sudo yum install -y numactl-devel binutils binutils-devel

Configure Open MPI

cd openmpi-3.0.1
sudo ./configure --prefix=/usr/local

Build and install

sudo make all install

Install Horovod

pip install horovod
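Horovod reads HOROVOD_GPU_ALLREDUCE and HOROVOD_NCCL_HOME while it is being built, so both must be visible to the pip process. If the profile script has not been sourced in the current shell, one option is to pass them explicitly. The sketch below only builds that environment; the helper name horovod_build_env is hypothetical, and the values are the ones set in /etc/profile.d/horovod.sh above.

```python
import os


def horovod_build_env(nccl_home, base_env=None):
    """Return a copy of the environment with the Horovod build variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HOROVOD_GPU_ALLREDUCE"] = "NCCL"
    env["HOROVOD_NCCL_HOME"] = nccl_home
    return env


# Start from an empty base here so that only the added variables are printed.
env = horovod_build_env("/usr/local/nccl_2.0.5-3+cuda9.0_amd64", base_env={})
for key in sorted(env):
    print(key + "=" + env[key])
```

With base_env left at its default, the returned dictionary can be handed to subprocess.run([sys.executable, "-m", "pip", "install", "horovod"], env=..., check=True) from inside the py3 environment.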

To test Horovod, first prepare the MNIST dataset. The directory layout is as follows

.
- MNIST-data-0
- tensorflow_mnist.py

Run the script

mpirun -np 1 \
    -H localhost:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python tensorflow_mnist.py
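In the command above, -np is the total number of processes and -H lists each host with its slot count, so the two must agree when the job is scaled out. The sketch below assembles the same argument list from a host table; the helper name build_mpirun_command is hypothetical, and every flag is copied from the command above.

```python
def build_mpirun_command(hosts, script):
    """Build the mpirun argument list.

    hosts is a list of (host_name, process_count) pairs; -np is set to
    the sum of the process counts so that it always matches -H.
    """
    total = sum(count for _, count in hosts)
    host_list = ",".join("{}:{}".format(name, count) for name, count in hosts)
    return [
        "mpirun", "-np", str(total),
        "-H", host_list,
        "-bind-to", "none", "-map-by", "slot",
        "-x", "NCCL_DEBUG=INFO", "-x", "LD_LIBRARY_PATH", "-x", "PATH",
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        "python", script,
    ]


# Single process on the local host, as in the command above.
cmd = build_mpirun_command([("localhost", 1)], "tensorflow_mnist.py")
print(" ".join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues with the ^openib argument.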
