Prerequisites
Alibaba Cloud ECS:
Zone: China East 2, Zone D
Instance type: ecs.gn5i-c2g1.large
Image: CentOS 7.3 (the release check below reports CentOS Linux release 7.3.1611, not ubuntu_16_04)
CPU: 2 cores
Memory: 8 GB
GPU: NVIDIA Tesla P4
Bandwidth: 100 Mbps (peak)
Verification
Check the GPU
$ lspci | grep -i nvidia
00:07.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
Linux version
$ uname -m && cat /etc/*release
x86_64
CentOS Linux release 7.3.1611 (Core)
gcc version
$ gcc --version
gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC)
Kernel version
$ uname -r
3.10.0-514.26.2.el7.x86_64
Preparation
List the installed kernel packages
rpm -qa | grep kernel | sort
kernel-3.10.0-514.26.2.el7.x86_64
kernel-3.10.0-514.el7.x86_64
kernel-tools-3.10.0-514.26.2.el7.x86_64
kernel-tools-libs-3.10.0-514.26.2.el7.x86_64
kmod-kernel-mft-mlnx-4.7.0-1.rhel7u3.x86_64
kmod-mlnx-ofa_kernel-4.1-OFED.4.1.1.0.2.1.gc22af88.rhel7u3.x86_64
mlnx-ofa_kernel-4.1-OFED.4.1.1.0.2.1.gc22af88.rhel7u3.x86_64
mlnx-ofa_kernel-devel-4.1-OFED.4.1.1.0.2.1.gc22af88.rhel7u3.x86_64
kernel-devel and kernel-headers are missing, so install them from the downloaded rpm packages
rpm -ivh kernel-devel-3.10.0-514.26.2.el7.x86_64.rpm
rpm -ivh kernel-headers-3.10.0-514.26.2.el7.x86_64.rpm
Reboot the host.
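The NVIDIA driver installer builds a kernel module against the sources of the running kernel, so the kernel-devel package in particular should match the output of uname -r. A minimal sketch of that check follows. The helper name has_matching_pkg is made up for illustration, and it assumes rpm -qa prints names in the name-version-release.arch form seen in the listing above.

```shell
# has_matching_pkg PREFIX KERNEL_RELEASE  (hypothetical helper)
# Reads package names on stdin, one per line, and succeeds only if
# "PREFIX-KERNEL_RELEASE" is exactly one of them.
has_matching_pkg() {
    grep -qxF -- "$1-$2"
}

# Possible usage on the host:
#   rpm -qa | has_matching_pkg kernel-devel  "$(uname -r)" || echo "kernel-devel mismatch"
#   rpm -qa | has_matching_pkg kernel-headers "$(uname -r)" || echo "kernel-headers mismatch"
```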
Python 3
Download Python-3.6.3.tgz and extract it
tar -xzvf Python-3.6.3.tgz
Install the required dependencies
yum -y install readline-devel python-setuptools zlib*
Configure Python
cd Python-3.6.3
./configure --enable-shared --prefix=/usr/local
Build and install
make && make install
Create a symbolic link
ln -sv /usr/local/bin/python3 /usr/local/bin/python
Create /etc/profile.d/horovod.sh with the following content
# horovod
export LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/usr/local/lib64
Apply the configuration
source /etc/profile
Upgrade pip (this is the system pip, which belongs to Python 2.7)
pip install --upgrade pip
Install virtualenv
pip install virtualenv
Create a non-root user
Create the non-root user horovod
useradd horovod
Set the password for horovod
passwd horovod
To give horovod sudo privileges, first make /etc/sudoers writable
chmod u+w /etc/sudoers
Then add the following line to /etc/sudoers to grant horovod sudo privileges
horovod ALL=(ALL) ALL
Create the virtualenv
Switch to the horovod user
su - horovod
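Create the hidden directory .pip (without sudo, so that it stays owned by horovod)
mkdir ~/.pip
Point pip at a mirror in China by creating ~/.pip/pip.conf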
[global]
index-url=http://mirrors.aliyun.com/pypi/simple/
[install]
trusted-host=mirrors.aliyun.com
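The directory and the configuration file above can also be created in one step. The following is a sketch only; the function name write_pip_conf is hypothetical, and the file content is copied from the listing above without changes.

```shell
# write_pip_conf DIR  (hypothetical helper)
# Creates DIR if needed and writes a pip.conf that points at the Aliyun mirror.
write_pip_conf() {
    conf_dir="$1"
    mkdir -p "$conf_dir" || return 1
    # Quoted heredoc delimiter: the content is written literally, no expansion.
    cat > "$conf_dir/pip.conf" <<'EOF'
[global]
index-url=http://mirrors.aliyun.com/pypi/simple/
[install]
trusted-host=mirrors.aliyun.com
EOF
}

# Possible usage as the horovod user:
#   write_pip_conf "$HOME/.pip"
```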
Create a Python 3 virtual environment
virtualenv -p python3 py3
Activate the Python 3 environment
source ~/py3/bin/activate
Verify Python; this should now report a Python 3 version
python -V
Install CUDA
Download cuda_9.0.176_384.81_linux.run and make it executable
sudo chmod a+x cuda_9.0.176_384.81_linux.run
Install dependencies
sudo yum install gcc gcc-c++ zlib* -y
Run the CUDA installer
sudo ./cuda_9.0.176_384.81_linux.run
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81?
(y)es/(n)o/(q)uit: yes
Do you want to install the OpenGL libraries?
(y)es/(n)o/(q)uit [ default is yes ]: yes
Do you want to run nvidia-xconfig?
This will update the system X configuration file so that the NVIDIA X driver
is used. The pre-existing X configuration file will be backed up.
This option should not be used on systems that require a custom
X configuration, such as systems with multiple GPU vendors.
(y)es/(n)o/(q)uit [ default is no ]:
Install the CUDA 9.0 Toolkit?
(y)es/(n)o/(q)uit: y
Enter Toolkit Location
[ default is /usr/local/cuda-9.0 ]:
Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit:
Install the CUDA 9.0 Samples?
(y)es/(n)o/(q)uit:
Enter CUDA Samples Location
[ default is /root ]:
===========
= Summary =
===========
Driver: Installed
Toolkit: Installed in /usr/local/cuda-9.0
Samples: Installed in /root, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda-9.0/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-9.0/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.0/doc/pdf for detailed information on setting up CUDA.
Logfile is /tmp/cuda_install_1213.log
Add the following to /etc/profile.d/horovod.sh
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64
Verify the CUDA installation
nvidia-smi
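The installer summary asks that PATH include the CUDA bin directory and that LD_LIBRARY_PATH include lib64. A small sketch for checking a colon-separated variable follows; path_contains is a hypothetical helper name, and the usage lines assume CUDA_HOME is set as in /etc/profile.d/horovod.sh.

```shell
# path_contains LIST DIR  (hypothetical helper)
# Succeeds if DIR is an exact entry of the colon-separated LIST.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Possible usage after sourcing /etc/profile:
#   path_contains "$LD_LIBRARY_PATH" "$CUDA_HOME/lib64" || echo "lib64 not on LD_LIBRARY_PATH"
#   path_contains "$PATH" "$CUDA_HOME/bin" || export PATH="$PATH:$CUDA_HOME/bin"
```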
Install cuDNN
Download cudnn-9.0-linux-x64-v7.solitairetheme8 and rename it with a .tgz extension
sudo mv cudnn-9.0-linux-x64-v7.solitairetheme8 cudnn-9.0-linux-x64-v7.tgz
Extract cudnn-9.0-linux-x64-v7.tgz
tar -xzvf cudnn-9.0-linux-x64-v7.tgz
Copy the cuDNN files into the CUDA toolkit directories
$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h
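To confirm which cuDNN version was copied, the version macros in cudnn.h can be read back. This is a sketch under the assumption that the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL on separate #define lines, as the cuDNN 7 header does; the function name cudnn_version is made up for illustration.

```shell
# cudnn_version [HEADER]  (hypothetical helper)
# Prints MAJOR.MINOR.PATCH read from the #define lines of cudnn.h.
# Fails if CUDNN_MAJOR is not found.
cudnn_version() {
    header="${1:-/usr/local/cuda/include/cudnn.h}"
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END {
            if (major == "") exit 1
            print major "." minor "." patch
        }
    ' "$header"
}

# Possible usage:
#   cudnn_version /usr/local/cuda/include/cudnn.h
```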
Apply the configuration with source /etc/profile
Install tensorflow-gpu
In the py3 environment, install tensorflow-gpu
pip install tensorflow-gpu==1.6.0
Verify the tensorflow-gpu installation
import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
Install NCCL 2
Download nccl_2.0.5-3+cuda9.0_amd64.txz and extract it
sudo tar -xvf nccl_2.0.5-3+cuda9.0_amd64.txz
Copy the extracted directory to /usr/local
sudo cp -r nccl_2.0.5-3+cuda9.0_amd64 /usr/local
Add the following to /etc/profile.d/horovod.sh
export HOROVOD_NCCL_HOME=/usr/local/nccl_2.0.5-3+cuda9.0_amd64
export HOROVOD_GPU_ALLREDUCE=NCCL
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOROVOD_NCCL_HOME/lib
Apply the configuration with source /etc/profile
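Before building Horovod it can help to check that HOROVOD_NCCL_HOME really points at an NCCL tree. The sketch below assumes the extracted archive contains include/nccl.h and a lib directory holding libnccl.so (possibly with a version suffix); that layout and the helper name check_nccl_home are assumptions, so adjust them to what the archive actually contains.

```shell
# check_nccl_home DIR  (hypothetical helper)
# Succeeds if DIR looks like an NCCL install: include/nccl.h plus lib/libnccl.so*.
check_nccl_home() {
    nccl_home="$1"
    if [ ! -f "$nccl_home/include/nccl.h" ]; then
        echo "missing $nccl_home/include/nccl.h" >&2
        return 1
    fi
    # Accept libnccl.so or any versioned variant such as libnccl.so.2.
    for lib in "$nccl_home"/lib/libnccl.so*; do
        [ -e "$lib" ] && return 0
    done
    echo "no libnccl.so* under $nccl_home/lib" >&2
    return 1
}

# Possible usage after sourcing /etc/profile:
#   check_nccl_home "$HOROVOD_NCCL_HOME" && echo "NCCL tree looks complete"
```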
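Install Open MPI
Download openmpi-3.0.1.tar.gz and extract it
sudo tar -xzvf openmpi-3.0.1.tar.gz
Install dependencies
sudo yum install -y numactl-devel binutils binutils-devel
Configure Open MPI from inside the extracted directory
cd openmpi-3.0.1
sudo ./configure --prefix=/usr/local
Build and install
sudo make all install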
Install Horovod
pip install horovod
To test Horovod, prepare the MNIST dataset with the following directory layout
.
- MNIST-data-0
- tensorflow_mnist.py
Run the script
mpirun -np 1 \
-H localhost:1 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python tensorflow_mnist.py
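The command above hard-codes one process on localhost. As a sketch, the same flags can be wrapped in a helper that takes the process count as a parameter and prints the resulting command line instead of running it. The name horovod_cmd is hypothetical, and the instance used here has a single Tesla P4, so any count above 1 is only an illustration for larger machines.

```shell
# horovod_cmd NP COMMAND...  (hypothetical helper)
# Prints, without executing, the mpirun command line for NP local processes,
# using the same flags as the invocation above.
horovod_cmd() {
    np="$1"
    shift
    printf 'mpirun -np %s -H localhost:%s -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib %s\n' \
        "$np" "$np" "$*"
}

# Possible usage: inspect the line first, then run it.
#   horovod_cmd 1 python tensorflow_mnist.py
#   eval "$(horovod_cmd 1 python tensorflow_mnist.py)"
```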