Prerequisites

Alibaba Cloud ECS:

  • Zone: China East 2, Zone D

  • Instance type: ecs.gn5i-c2g1.large

  • Image: ubuntu_16_04

  • CPU: 2 cores

  • Memory: 8 GB

  • GPU: Nvidia Tesla P4

  • Bandwidth: 100 Mbps (peak)

Verification

Check the GPU

$ lspci | grep -i nvidia

00:07.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
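The same check can be scripted when several machines need to be inspected. The following is a minimal sketch, not part of the original setup: the helper name find_nvidia_devices is hypothetical, and it assumes lspci prints one device per line in the form shown above.

```python
import re


def find_nvidia_devices(lspci_output):
    """Return (slot, description) pairs for lspci lines that mention NVIDIA."""
    devices = []
    for line in lspci_output.splitlines():
        if "nvidia" not in line.lower():
            continue
        # Each lspci line starts with the PCI slot, followed by the description.
        match = re.match(r"^(\S+)\s+(.*)$", line.strip())
        if match:
            devices.append((match.group(1), match.group(2)))
    return devices


# Sample line copied from the lspci output above.
sample = "00:07.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)"
print(find_nvidia_devices(sample))
```

On the instance itself, the input could come from subprocess.check_output(["lspci"], universal_newlines=True) instead of the sample string.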

Linux version

$ uname -m && cat /etc/*release

gcc version

$ gcc --version

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609

Kernel version

$ uname -r

4.4.0-105-generic

Preparation

Update the apt package index

apt update

Upgrade pip

pip install --upgrade pip

Install virtualenv

pip install virtualenv

Create a non-root user

As root, create the horovod user (any user name will do)

adduser horovod

Grant horovod superuser privileges. First, as root, make /etc/sudoers writable

chmod u+w /etc/sudoers

Then add the following line to the user privilege section of /etc/sudoers

horovod ALL=(ALL:ALL) ALL

Python environment

Switch to the horovod user

su - horovod

Create the hidden directory .pip in the horovod user's home directory

mkdir ~/.pip

Point pip at a mirror in mainland China by creating ~/.pip/pip.conf

[global]
index-url=http://mirrors.aliyun.com/pypi/simple/

[install]
trusted-host=mirrors.aliyun.com
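The trusted-host entry is needed because the index URL uses plain http, so its host should match the host in index-url. A minimal sketch of a sanity check for the file, using only the standard library; the helper name read_pip_mirror is hypothetical, and the text is embedded as a string here rather than read from ~/.pip/pip.conf.

```python
import configparser
from urllib.parse import urlparse

# Same content as the pip.conf shown above.
PIP_CONF = """\
[global]
index-url=http://mirrors.aliyun.com/pypi/simple/

[install]
trusted-host=mirrors.aliyun.com
"""


def read_pip_mirror(text):
    """Return (index_url, trusted_host) parsed from pip.conf text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.get("global", "index-url"), parser.get("install", "trusted-host")


index_url, trusted_host = read_pip_mirror(PIP_CONF)
# The trusted host should be the host name used in the index URL.
print(index_url, trusted_host, urlparse(index_url).hostname == trusted_host)
```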

Create a Python 3 virtual environment

virtualenv -p python3 py3

Activate the Python 3 environment

source ~/py3/bin/activate

Verify Python; this should now print a Python 3 version

python -V
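Beyond the version number, it can be useful to confirm that the interpreter really belongs to the virtual environment. A minimal sketch, assuming either an older virtualenv release (which sets sys.real_prefix) or a venv-style environment (where sys.prefix differs from sys.base_prefix); the function name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter looks like a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv-style environments
    # make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


print(sys.version_info[:2], in_virtualenv())
```

Run with the python from the activated py3 environment, the second value should be True.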

Install CUDA

Download CUDA 9.0

cuda_9.0.176_384.81_linux.run

Make the installer executable

chmod +x cuda_9.0.176_384.81_linux.run

Run the installer

sudo ./cuda_9.0.176_384.81_linux.run
Do you accept the previously read EULA?
accept/decline/quit: 


Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81?
(y)es/(n)o/(q)uit: yes

Do you want to install the OpenGL libraries?
(y)es/(n)o/(q)uit [ default is yes ]: yes

Do you want to run nvidia-xconfig?
This will update the system X configuration file so that the NVIDIA X driver
is used. The pre-existing X configuration file will be backed up.
This option should not be used on systems that require a custom
X configuration, such as systems with multiple GPU vendors.
(y)es/(n)o/(q)uit [ default is no ]: 

Install the CUDA 9.0 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
 [ default is /usr/local/cuda-9.0 ]: 


 Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: 

Install the CUDA 9.0 Samples?
(y)es/(n)o/(q)uit: 

Enter CUDA Samples Location
 [ default is /root ]: 


 ===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-9.0
Samples:  Installed in /root, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-9.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-9.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.0/doc/pdf for detailed information on setting up CUDA.

Logfile is /tmp/cuda_install_1213.log

Configure the environment variables in /etc/profile.d/cuda.sh

# cuda
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib:$CUDA_HOME/lib64

Apply the configuration with source /etc/profile
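After sourcing the profile, the variables can be checked from Python. A minimal sketch, assuming the /usr/local/cuda symlink was created by the installer; the function name missing_cuda_entries is hypothetical, and it checks only CUDA_HOME and LD_LIBRARY_PATH.

```python
import os


def missing_cuda_entries(env, cuda_home="/usr/local/cuda"):
    """Return the names of variables that lack the expected CUDA settings."""
    problems = []
    if env.get("CUDA_HOME") != cuda_home:
        problems.append("CUDA_HOME")
    # LD_LIBRARY_PATH is a colon-separated list of directories.
    lib_dirs = env.get("LD_LIBRARY_PATH", "").split(":")
    if cuda_home + "/lib64" not in lib_dirs:
        problems.append("LD_LIBRARY_PATH")
    return problems


# An empty list means both variables look as expected in the current shell.
print(missing_cuda_entries(os.environ))
```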

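Install cuDNN

Download cuDNN 7.0.5

cuDNN package                                  Description
libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb         cuDNN runtime library
libcudnn7-dev_7.0.5.15-1+cuda9.0_amd64.deb     cuDNN developer library
libcudnn7-doc_7.0.5.15-1+cuda9.0_amd64.deb     cuDNN code samples

Install the three packages in this order (the developer package depends on the runtime package)

sudo dpkg -i libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb
sudo dpkg -i libcudnn7-dev_7.0.5.15-1+cuda9.0_amd64.deb
sudo dpkg -i libcudnn7-doc_7.0.5.15-1+cuda9.0_amd64.deb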
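The runtime, developer, and sample packages are expected to carry the same version string, and a mismatch in the file names is an easy mistake to make. A minimal sketch of a consistency check; it assumes the usual Debian naming scheme package_version_architecture.deb, and the helper names are hypothetical.

```python
import re

# Debian package file names follow <package>_<version>_<architecture>.deb
DEB_PATTERN = re.compile(r"^(?P<package>[^_]+)_(?P<version>[^_]+)_(?P<arch>[^_.]+)\.deb$")


def package_versions(filenames):
    """Map each package name to the version embedded in its file name."""
    versions = {}
    for name in filenames:
        match = DEB_PATTERN.match(name)
        if match is None:
            raise ValueError("not a Debian package file name: " + name)
        versions[match.group("package")] = match.group("version")
    return versions


def versions_consistent(filenames):
    """Return True if all packages share one version string."""
    return len(set(package_versions(filenames).values())) == 1


files = [
    "libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb",
    "libcudnn7-dev_7.0.5.15-1+cuda9.0_amd64.deb",
    "libcudnn7-doc_7.0.5.15-1+cuda9.0_amd64.deb",
]
print(package_versions(files), versions_consistent(files))
```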

Copy the cuDNN samples to the HOME directory

$ cp -r /usr/src/cudnn_samples_v7/ $HOME
$ cd $HOME/cudnn_samples_v7/mnistCUDNN

Build the mnistCUDNN sample

$ make clean && make

Run the mnistCUDNN sample

$ ./mnistCUDNN

Test passed!

Install TensorFlow

Install TensorFlow 1.6

pip install tensorflow-gpu==1.6.0

Verify the tensorflow-gpu installation

import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
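To judge whether the script ran correctly, it helps to know the expected product of the two constant matrices. The following pure-Python sketch does not use TensorFlow; it only recomputes the 2x3 by 3x2 product so the printed result can be compared by eye. The function name matmul is a local helper, not a TensorFlow call.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Same values and shapes as the constants a and b in the TensorFlow script.
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(matmul(a, b))  # [[22.0, 28.0], [49.0, 64.0]]
```

The TensorFlow script should print the same four values, and with log_device_placement enabled the log should show the operations placed on a GPU device.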

Install NCCL

Download nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb

Install the NCCL repository

sudo dpkg -i nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb

Update the APT package index

sudo apt update

Install libnccl2 and libnccl-dev

sudo apt install libnccl2 libnccl-dev

Download nccl_2.0.5-3+cuda9.0_amd64.txz and extract it

sudo tar -xvf nccl_2.0.5-3+cuda9.0_amd64.txz

Copy the extracted directory to /usr/local

sudo cp -r nccl_2.0.5-3+cuda9.0_amd64 /usr/local

Add the following to /etc/profile.d/horovod.sh

export HOROVOD_NCCL_HOME=/usr/local/nccl_2.0.5-3+cuda9.0_amd64
export HOROVOD_GPU_ALLREDUCE=NCCL
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOROVOD_NCCL_HOME/lib

Apply the configuration with source /etc/profile
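Once LD_LIBRARY_PATH is set, a quick way to confirm that the dynamic loader can find the GPU libraries is to try loading them. A minimal sketch using ctypes; the shared-object names below are assumptions based on the versions installed in this guide (CUDA 9.0, cuDNN 7, NCCL 2), and the function name can_load is hypothetical.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


# Assumed sonames for CUDA 9.0, cuDNN 7 and NCCL 2.
for name in ("libcudart.so.9.0", "libcudnn.so.7", "libnccl.so.2"):
    print(name, can_load(name))
```

Start Python from a shell that has already sourced /etc/profile, since the loader reads LD_LIBRARY_PATH when the process starts.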

Install OpenMPI

Download openmpi-3.0.1.tar.gz and extract it

sudo tar -xzvf openmpi-3.0.1.tar.gz
cd openmpi-3.0.1

Configure the install prefix

sudo ./configure --prefix=/usr/local

Build and install

sudo make all install
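With the prefix set to /usr/local, mpirun should end up on the PATH. A minimal sketch that looks the tool up and prints the first line of its version output; the helper name tool_version is hypothetical, and it assumes the tool accepts a --version flag, as mpirun does.

```python
import shutil
import subprocess


def tool_version(name):
    """Return the first line of `<name> --version`, or None if the tool is not on PATH."""
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


# None here means the install step did not put mpirun on the PATH.
print(tool_version("mpirun"))
```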

Install Horovod

pip install horovod

To test Horovod, prepare the MNIST dataset first. The directory layout is as follows

.
- MNIST-data-0
- tensorflow_mnist.py

Run the script

mpirun -np 1 \
    -H localhost:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python tensorflow_mnist.py
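In the command above, -np is the total number of processes and -H lists each host with its slot count, so the two values have to agree. A minimal sketch that builds the same argument list from a host list; the function name build_mpirun_command is hypothetical, and the multi-host names server1 and server2 are placeholders, not machines from this setup.

```python
def build_mpirun_command(hosts, script, python="python"):
    """Build the mpirun argument list for a list of (hostname, slots) pairs."""
    total = sum(slots for _, slots in hosts)
    host_arg = ",".join("{}:{}".format(host, slots) for host, slots in hosts)
    return [
        "mpirun", "-np", str(total),
        "-H", host_arg,
        "-bind-to", "none", "-map-by", "slot",
        "-x", "NCCL_DEBUG=INFO", "-x", "LD_LIBRARY_PATH", "-x", "PATH",
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        python, script,
    ]


# Single GPU on the local machine, matching the command above.
print(" ".join(build_mpirun_command([("localhost", 1)], "tensorflow_mnist.py")))
# Hypothetical two-host layout with four GPUs each.
print(" ".join(build_mpirun_command([("server1", 4), ("server2", 4)], "tensorflow_mnist.py")))
```

Passing the list to subprocess.run avoids shell quoting issues with the ^openib argument.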
