Prerequisites
Alibaba Cloud ECS:
Zone: East China 2, Zone D
Instance type: ecs.gn5i-c2g1.large
Image: ubuntu_16_04
CPU: 2 cores
Memory: 8 GB
GPU: Nvidia Tesla P4
Bandwidth: 100 Mbps (peak)
Verification
Check that the GPU is visible
$ lspci | grep -i nvidia
00:07.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
Linux version
$ uname -m && cat /etc/*release
gcc version
$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609
Kernel version
$ uname -r
4.4.0-105-generic
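The GPU check above can also be scripted. The following is a minimal sketch, assuming the `lspci` output has the same shape as the sample line shown earlier; the function name `find_nvidia_devices` is a hypothetical helper, not part of any tool.

```python
# Sample line copied from the `lspci | grep -i nvidia` output above.
SAMPLE_LSPCI = "00:07.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)"


def find_nvidia_devices(lspci_text):
    """Return (pci_address, description) pairs for lines that mention NVIDIA."""
    devices = []
    for line in lspci_text.splitlines():
        if "nvidia" not in line.lower():
            continue
        # Each lspci line starts with the PCI address, then the description.
        address, _, description = line.strip().partition(" ")
        devices.append((address, description))
    return devices


if __name__ == "__main__":
    for address, description in find_nvidia_devices(SAMPLE_LSPCI):
        print(address, "->", description)
```

On the real machine, the text could come from `subprocess.check_output(["lspci"])` decoded to a string; an empty result would suggest that the instance has no NVIDIA device attached.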
Preparation
Update the apt package index
apt update
Upgrade pip
pip install --upgrade pip
Install virtualenv
pip install virtualenv
Create a non-root user
As root, create a user named horovod (any user name will do)
adduser horovod
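Grant horovod sudo privileges. First, as root, make /etc/sudoers writable
chmod u+w /etc/sudoers
In the user privilege specification section, add
horovod ALL=(ALL:ALL) ALL
Then make the file read-only again
chmod u-w /etc/sudoers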
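Python environment
Switch to the horovod user
su - horovod
Create the hidden directory .pip (without sudo, so that it is owned by horovod rather than root)
mkdir -p ~/.pip
Configure a pip mirror hosted in China in ~/.pip/pip.conf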
[global]
index-url=http://mirrors.aliyun.com/pypi/simple/
[install]
trusted-host=mirrors.aliyun.com
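Because pip.conf is an INI-style file, it can be generated and checked with Python's standard `configparser` module. This is a small sketch under that assumption; `build_pip_conf` is a hypothetical helper name, and the values are the ones from the file above.

```python
import configparser
import io


def build_pip_conf(index_url, trusted_host):
    """Return the text of a pip.conf with one mirror and one trusted host."""
    parser = configparser.ConfigParser()
    parser["global"] = {"index-url": index_url}
    parser["install"] = {"trusted-host": trusted_host}
    buffer = io.StringIO()
    # Write "key=value" without spaces, matching the layout shown above.
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_pip_conf("http://mirrors.aliyun.com/pypi/simple/",
                         "mirrors.aliyun.com"))
```

The returned text could then be written to `~/.pip/pip.conf`. The `trusted-host` entry is needed here because the mirror URL uses plain HTTP.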
Create a Python 3 virtual environment
virtualenv -p python3 py3
Activate the Python 3 environment
source ~/py3/bin/activate
Verify Python; this should now print a Python 3 version
python -V
python -V
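The same check can be done from inside the interpreter. The sketch below assumes the usual markers: older virtualenv releases set `sys.real_prefix`, while `venv` and newer virtualenv releases make `sys.base_prefix` differ from `sys.prefix`. The helper names are hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    if hasattr(sys, "real_prefix"):
        # Marker set by older virtualenv releases.
        return True
    # venv and newer virtualenv releases keep the original prefix here.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


def describe_interpreter():
    """Return a one-line summary of the interpreter version and environment."""
    major, minor, micro = sys.version_info[:3]
    return "Python {}.{}.{} (virtualenv: {})".format(
        major, minor, micro, in_virtualenv())


if __name__ == "__main__":
    print(describe_interpreter())
```

Run with the `py3` environment activated, the summary should report a 3.x version and `virtualenv: True`.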
Install CUDA
Download CUDA 9.0
cuda_9.0.176_384.81_linux.run
Make the installer executable
chmod +x cuda_9.0.176_384.81_linux.run
Run the installer
sudo ./cuda_9.0.176_384.81_linux.run
Do you accept the previously read EULA?
accept/decline/quit:
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81?
(y)es/(n)o/(q)uit: yes
Do you want to install the OpenGL libraries?
(y)es/(n)o/(q)uit [ default is yes ]: yes
Do you want to run nvidia-xconfig?
This will update the system X configuration file so that the NVIDIA X driver
is used. The pre-existing X configuration file will be backed up.
This option should not be used on systems that require a custom
X configuration, such as systems with multiple GPU vendors.
(y)es/(n)o/(q)uit [ default is no ]:
Install the CUDA 9.0 Toolkit?
(y)es/(n)o/(q)uit: y
Enter Toolkit Location
[ default is /usr/local/cuda-9.0 ]:
Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit:
Install the CUDA 9.0 Samples?
(y)es/(n)o/(q)uit:
Enter CUDA Samples Location
[ default is /root ]:
===========
= Summary =
===========
Driver: Installed
Toolkit: Installed in /usr/local/cuda-9.0
Samples: Installed in /root, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda-9.0/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-9.0/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.0/doc/pdf for detailed information on setting up CUDA.
Logfile is /tmp/cuda_install_1213.log
Configure the environment variables in /etc/profile.d/cuda.sh
# cuda
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib:$CUDA_HOME/lib64
Apply the configuration
source /etc/profile
Install cuDNN
Download cuDNN 7.0.5
| cuDNN package | Description |
|---|---|
| libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb | cuDNN runtime library |
| libcudnn7-dev_7.0.5.15-1+cuda9.0_amd64.deb | cuDNN developer library |
| libcudnn7-doc_7.0.5.15-1+cuda9.0_amd64.deb | cuDNN code samples |
Install the three packages, runtime library first
sudo dpkg -i libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb
sudo dpkg -i libcudnn7-dev_7.0.5.15-1+cuda9.0_amd64.deb
sudo dpkg -i libcudnn7-doc_7.0.5.15-1+cuda9.0_amd64.deb
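The runtime, developer, and sample packages are meant to come from the same cuDNN release, so it can help to compare the version fields in the file names before installing. The sketch below relies only on the Debian naming convention `<name>_<version>_<architecture>.deb`; the helper names are hypothetical.

```python
import re

# Debian package files are named <name>_<version>_<architecture>.deb.
DEB_NAME = re.compile(r"^(?P<name>[^_]+)_(?P<version>[^_]+)_(?P<arch>[^_]+)\.deb$")


def parse_deb_filename(filename):
    """Split a .deb file name into (package name, version, architecture)."""
    match = DEB_NAME.match(filename)
    if match is None:
        raise ValueError("not a Debian package file name: " + filename)
    return match.group("name"), match.group("version"), match.group("arch")


def all_same_version(filenames):
    """Return True when every file name carries the same version string."""
    versions = {parse_deb_filename(name)[1] for name in filenames}
    return len(versions) == 1


if __name__ == "__main__":
    packages = [
        "libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb",
        "libcudnn7-dev_7.0.5.15-1+cuda9.0_amd64.deb",
        "libcudnn7-doc_7.0.5.15-1+cuda9.0_amd64.deb",
    ]
    print(all_same_version(packages))
```

A `False` result would point to mixed downloads, for example a 7.1.x runtime next to 7.0.x developer files.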
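Copy the cuDNN samples to the home directory
$ cp -r /usr/src/cudnn_samples_v7/ $HOME
$ cd $HOME/cudnn_samples_v7/mnistCUDNN
Build the mnistCUDNN sample (sudo is not needed, because the copy is owned by the current user)
$ make clean && make
Run the mnistCUDNN sample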
$ ./mnistCUDNN
Test passed!
Install TensorFlow
Install TensorFlow 1.6
pip install tensorflow-gpu==1.6.0
Verify the tensorflow-gpu installation
import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
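To know what the script should print, the same 2x3 by 3x2 product can be computed in plain Python. This is only a reference calculation and does not use TensorFlow; `matmul` here is a local helper.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(x * y for x, y in zip(row, column)) for column in columns]
            for row in left]


# Same values and shapes as the constants a and b in the TensorFlow script.
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print(matmul(a, b))  # [[22.0, 28.0], [49.0, 64.0]]
```

The TensorFlow run should print the same matrix. Because `log_device_placement` is enabled, it should also log which device each op was placed on; a working installation is expected to list a GPU device there.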
Install NCCL
Download nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
Install the NCCL repository
sudo dpkg -i nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
Update the APT package index
sudo apt update
Install libnccl2 and libnccl-dev
sudo apt install libnccl2 libnccl-dev
Download nccl_2.0.5-3+cuda9.0_amd64.txz and extract it
sudo tar -xvf nccl_2.0.5-3+cuda9.0_amd64.txz
Copy the extracted directory to /usr/local
sudo cp -r nccl_2.0.5-3+cuda9.0_amd64 /usr/local
Add the following to /etc/profile.d/horovod.sh
export HOROVOD_NCCL_HOME=/usr/local/nccl_2.0.5-3+cuda9.0_amd64
export HOROVOD_GPU_ALLREDUCE=NCCL
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOROVOD_NCCL_HOME/lib
Apply the configuration
source /etc/profile
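Horovod reads these variables when it is built by pip, so it is worth confirming that they are visible in the current shell before installing it. The sketch below checks the three settings from horovod.sh; the function name `check_horovod_env` is hypothetical.

```python
import os


def check_horovod_env(env):
    """Return a list of problems found in the Horovod/NCCL environment variables."""
    problems = []
    nccl_home = env.get("HOROVOD_NCCL_HOME")
    if not nccl_home:
        problems.append("HOROVOD_NCCL_HOME is not set")
    if env.get("HOROVOD_GPU_ALLREDUCE") != "NCCL":
        problems.append("HOROVOD_GPU_ALLREDUCE is not NCCL")
    if nccl_home:
        # LD_LIBRARY_PATH is a colon-separated list of directories.
        library_dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(":") if d]
        nccl_lib = nccl_home.rstrip("/") + "/lib"
        if nccl_lib not in library_dirs:
            problems.append("LD_LIBRARY_PATH does not include " + nccl_lib)
    return problems


if __name__ == "__main__":
    for problem in check_horovod_env(os.environ):
        print("warning:", problem)
```

An empty list means that all three settings are in place for the current process.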
Install OpenMPI
Download openmpi-3.0.1.tar.gz and extract it
sudo tar -xzvf openmpi-3.0.1.tar.gz
cd openmpi-3.0.1
Configure the installation prefix
sudo ./configure --prefix=/usr/local
Build and install
sudo make all install
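After the build, `mpirun --version` can confirm that the freshly installed 3.0.1 is the one found on PATH. The sketch below parses that output; the sample string is an assumption about what Open MPI prints, not captured from this machine, and `parse_open_mpi_version` is a hypothetical helper.

```python
import re

# Assumed first line of `mpirun --version`; not captured from this machine.
SAMPLE_VERSION_OUTPUT = "mpirun (Open MPI) 3.0.1"


def parse_open_mpi_version(version_output):
    """Return the Open MPI version as a tuple of ints, or None if not found."""
    match = re.search(r"\(Open MPI\)\s+(\d+(?:\.\d+)*)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


if __name__ == "__main__":
    print(parse_open_mpi_version(SAMPLE_VERSION_OUTPUT))
```

On the server, the text could come from `subprocess.check_output(["mpirun", "--version"])`. A version other than 3.0.1 would suggest that another MPI installation comes first on PATH.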
Install Horovod
pip install horovod
To test Horovod, prepare the MNIST dataset first; the directory layout is as follows
.
- MNIST-data-0
- tensorflow_mnist.py
Run the script
mpirun -np 1 \
-H localhost:1 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python tensorflow_mnist.py
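When moving from one process to several hosts, only the `-np` and `-H` arguments of the command above change. The sketch below builds the argument list from (host, slots) pairs; `build_mpirun_command` is a hypothetical helper, and any host name other than localhost would be a placeholder.

```python
def build_mpirun_command(hosts, script="tensorflow_mnist.py"):
    """Build the mpirun argument list for a list of (host, slots) pairs."""
    total_processes = sum(slots for _, slots in hosts)
    host_list = ",".join("{}:{}".format(host, slots) for host, slots in hosts)
    return [
        "mpirun",
        "-np", str(total_processes),        # total number of processes
        "-H", host_list,                    # host:slots pairs
        "-bind-to", "none", "-map-by", "slot",
        "-x", "NCCL_DEBUG=INFO", "-x", "LD_LIBRARY_PATH", "-x", "PATH",
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        "python", script,
    ]


if __name__ == "__main__":
    # Single host, single process: the same command as above.
    print(" ".join(build_mpirun_command([("localhost", 1)])))
```

Passing the list to `subprocess.run` avoids shell quoting issues with arguments such as `^openib`.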