[Tech Share] Caffe2 Distributed Platform Installation and Deployment Guide

Reader submission, 643 views, 2022-05-29

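This guide sets up a distributed Caffe2 environment that supports neural network training across multiple machines and multiple GPUs, and finishes with distributed training of ResNet50 as a worked example.

1 Environment preparation

1.1 System preparation

1.1.1 Physical server environment

Three servers are used, with the following hardware: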

Server | CPU | Memory | GPU | GPU memory
server1 | Xeon E5-2680 v4 | 440 GB | Tesla P100 | 16 GB
server2 | Xeon E5-2680 v4 | 440 GB | Tesla P100 | 16 GB
server3 | Xeon E5-2680 v4 | 440 GB | Tesla P4 | 8 GB

The system environment is as follows:

Server | OS | IP address | CUDA version
server1 | Ubuntu 16.04 | 192.168.133.10 | 8.0
server2 | Ubuntu 16.04 | 192.168.133.11 | 8.0
server3 | Ubuntu 16.04 | 192.168.133.12 | 8.0

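Install Docker on every server, then install the nvidia-docker plugin. The steps are not repeated here; see https://docs.docker.com/install/linux/docker-ce/ubuntu/ and https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#setting-up

1.1.2 Container environment preparation

Pick one physical host, create an image on it, and build Caffe2 inside that image. Then save the image and push it to a private registry, so that the other hosts only need to pull it to get a container with Caffe2. server1 is used here.

On server1, pull NVIDIA's official CUDA image from https://hub.docker.com/r/nvidia/cuda/. The official Caffe2 image is not used directly because building from source is more flexible, and the Caffe2 version in the official image is fairly old and may not meet requirements. The devel variant with CUDA 8.0 is chosen so that Caffe2 can be compiled: the base and runtime variants cannot build CUDA applications from source and only support running prebuilt ones.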

docker pull nvidia/cuda:8.0-devel-ubuntu16.04

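Start the container with nvidia-docker so that it has GPU access:

nvidia-docker run -it --name caffe2 --net=host --hostname caffe2 --dns 8.8.8.8 -v /mnt:/mnt nvidia/cuda:8.0-devel-ubuntu16.04 bash

Container operating system: Ubuntu 16.04, 64-bit; user: root.

1.2 Network configuration

The operating system inside the container needs a proxy to reach the apt repositories and install packages.

The proxy used in this guide is http://192.168.5.18:3128. Edit .bashrc in the root user's home directory, or /etc/profile, and append the following lines: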

export http_proxy="http://192.168.5.18:3128"

export https_proxy="http://192.168.5.18:3128"

export ip_range=$(echo 192.168.79.{1..255} | sed 's/ /,/g')

export no_proxy="localhost,127.0.0.1,$ip_range,.huawei.com"

Then run source .bashrc or source /etc/profile to apply the settings.

Verify with curl baidu.com; if content is returned, the network is reachable.
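The ip_range line in the proxy settings relies on bash brace expansion to list every address of the subnet individually. As a minimal sketch of the same construction in Python, the value of no_proxy could be generated as follows. The function names build_ip_range and build_no_proxy are hypothetical and exist only for this illustration.

```python
def build_ip_range(prefix, first=1, last=255):
    """Return 'prefix.first,...,prefix.last' as one comma-separated string."""
    return ",".join("{}.{}".format(prefix, host) for host in range(first, last + 1))


def build_no_proxy(prefix, extra_hosts=("localhost", "127.0.0.1"), domains=(".huawei.com",)):
    """Assemble a no_proxy value in the same order as the export line in the text."""
    parts = list(extra_hosts) + [build_ip_range(prefix)] + list(domains)
    return ",".join(parts)


if __name__ == "__main__":
    # 192.168.79 is the prefix used in the text; adjust it to your own subnet.
    print('export no_proxy="{}"'.format(build_no_proxy("192.168.79")))
```

The printed line can be pasted into .bashrc in place of the two ip_range and no_proxy exports.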

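2 Deploying Caffe2

First build and install Caffe2 in a container on a single node, then extend the setup to the other nodes.

2.1 Installing dependencies

2.1.1 Configuring the package sources

The image contains no editor, so install vim first:

apt-get update && apt-get install -y vim

Switch to a faster mirror. First back up the existing sources list:

cd /etc/apt && mv sources.list sources.list.bk

Then download a sources.list for the 163 or Aliyun mirror, or write one with vim: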

deb http://mirrors.163.com/ubuntu/ xenial main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ xenial-security main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ xenial-updates main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ xenial-proposed main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ xenial-backports main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ xenial main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ xenial-security main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ xenial-updates main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ xenial-proposed main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ xenial-backports main restricted universe multiverse
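The ten entries above follow one pattern: each suite (the release itself plus -security, -updates, -proposed and -backports) appears once as deb and once as deb-src. A minimal sketch that generates the entries for a given mirror and release codename is shown below. sources_list_lines is a hypothetical helper, not part of apt; note that the codename must match the release in the container, and Ubuntu 16.04 is "xenial".

```python
SUITES = ("", "-security", "-updates", "-proposed", "-backports")
COMPONENTS = "main restricted universe multiverse"


def sources_list_lines(mirror, codename):
    """Return the deb and deb-src lines for every suite of one release."""
    lines = []
    for kind in ("deb", "deb-src"):
        for suffix in SUITES:
            lines.append("{} {} {}{} {}".format(kind, mirror, codename, suffix, COMPONENTS))
    return lines


if __name__ == "__main__":
    # Ubuntu 16.04 has the codename "xenial".
    print("\n".join(sources_list_lines("http://mirrors.163.com/ubuntu/", "xenial")))
```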

Update the package index:

apt-get update

2.1.2 Installing dependency packages

apt-get install openssh-server

apt-get install -y --no-install-recommends build-essential cmake git libgoogle-glog-dev libgtest-dev libiomp-dev libleveldb-dev liblmdb-dev libopencv-dev libopenmpi-dev libsnappy-dev libprotobuf-dev openmpi-bin openmpi-doc protobuf-compiler protobuf-c-compiler libgflags-dev python-dev python-pip python-setuptools graphviz

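Compared with the officially recommended list, a few additional required packages are included.

2.1.3 Installing Python dependencies

pip install flask future hypothesis numpy protobuf pydot python-nvd3 pyyaml requests scikit-image scipy setuptools six tornado jupyter matplotlib

Compared with the officially recommended list, a few additional required libraries are included.

2.1.4 Installing cuDNN

The image already contains CUDA, so only the cuDNN acceleration library needs to be installed. There are two ways to do this.

cuDNN installation, method 1:

Add the repository: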

echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list

Update the package index and install:

apt-get update

apt-get install -y --no-install-recommends libcudnn7=7.1.2.21-1+cuda8.0 libcudnn7-dev=7.1.2.21-1+cuda8.0

rm -rf /var/lib/apt/lists/*

cuDNN installation, method 2:

Download the cuDNN archive for CUDA 8.0 from https://developer.nvidia.com/rdp/cudnn-download, unpack it, and copy the header and libraries into the CUDA directory:

tar -xzvf cudnn-8.0-linux-x64-v7.1.tgz

cp cuda/include/cudnn.h /usr/local/cuda/include/

cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/

rm cudnn-8.0-linux-x64-v7.1.tgz && ldconfig

Set the environment variables:

export PATH=/usr/local/cuda-8.0/bin:$PATH

export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH

2.2 Building Caffe2

2.2.1 Downloading the Caffe2 source

Download method 1:

cd /opt && git clone --recursive https://github.com/caffe2/caffe2.git

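Download method 2:

If internal company servers cannot download directly, install git on Windows and download there, remembering to configure the proxy (see http://3ms.huawei.com/km/blogs/details/5098499). Downloading the source zip from the browser leads to build errors, because the zip omits the third-party sources. These are git submodules and are only fetched when cloning with git and --recursive.

2.2.2 Building and installing Caffe2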

cd /opt/caffe2 && mkdir build && cd build

cmake ..

make install

The build may fail because the installed cmake is too old; see Troubleshooting for the fix.

Environment variables

Add the following to /etc/profile:

export PYTHONPATH=/usr/local:$PYTHONPATH

export PYTHONPATH=/opt/caffe2/build:$PYTHONPATH

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

2.2.3 Testing Caffe2

cd ~ && python -c 'from caffe2.python import core' 2>/dev/null && echo "Success" || echo "Failure"

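If Success is printed, the installation works.

Check that the GPU is usable (this only applies when the machine has a GPU and CUDA is installed correctly):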

python caffe2/python/operator_test/relu_op_test.py

python2 -c 'from caffe2.python import workspace; print(workspace.NumCudaDevices())'

A result greater than 0 means the GPU is being used correctly; otherwise an error is reported.
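The two one-line checks above can be combined into one small script that distinguishes "Caffe2 cannot be imported" from "Caffe2 works but sees no GPU". This is only a sketch: workspace.NumCudaDevices() is the call already used above, while count_cuda_devices and describe are hypothetical helper names.

```python
def count_cuda_devices():
    """Return the number of CUDA devices Caffe2 sees, or None if Caffe2 cannot be imported."""
    try:
        from caffe2.python import workspace
    except ImportError:
        return None
    return workspace.NumCudaDevices()


def describe(count):
    """Turn the device count into a short human-readable diagnosis."""
    if count is None:
        return "Caffe2 is not importable; check PYTHONPATH and LD_LIBRARY_PATH"
    if count == 0:
        return "Caffe2 imported, but no CUDA device is visible"
    return "Caffe2 sees {} CUDA device(s)".format(count)


if __name__ == "__main__":
    print(describe(count_cuda_devices()))
```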

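2.3 Deploying the NFS shared directory

Distributed training in Caffe2 needs a rendezvous point through which the nodes coordinate; this can be a shared directory (NFS) or Redis. NFS is chosen here as a quick way to get a working distributed environment. Because a container cannot mount the NFS share directly, the share is mounted on the physical host and mapped into the container.

2.3.1 Setting up the NFS server

Pick one physical host as the NFS server, here 192.168.133.10, and share /export: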

apt-get install nfs-kernel-server

mkdir /export

chmod 777 /export

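So that the share stays in effect after a reboot, add it to /etc/exports. The three hosts must be allowed to mount it with write access, because the rendezvous files are written there:

/export 192.168.133.0/24(rw,fsid=0,insecure,no_subtree_check,async)

Export the directory:

exportfs -a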

2.3.2 Mounting the share on the NFS clients

Install nfs-kernel-server on all three physical hosts:

apt-get install nfs-kernel-server

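Then mount the shared directory:

mount -t nfs 192.168.133.10:/export /mnt

For a container to access the NFS share, map /mnt into it when the container is started.

Commit the container to an image, tag it, and push it to the private registry so that it can be started on the other hosts:

docker commit caffe2 192.168.133.11:5000/caffe2:latest

docker push 192.168.133.11:5000/caffe2:latest

The image now contains the compiled Caffe2 and nfs-kernel-server.

Stop the previous container and start a new one from the new image: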

nvidia-docker run -it --name caffe2-dis --net=host --hostname caffe2 --dns 8.8.8.8 -v /mnt:/mnt 192.168.133.11:5000/caffe2:latest bash

On the other two nodes, pull the image that was just created and start a container from it:

docker pull 192.168.133.11:5000/caffe2:latest

nvidia-docker run -it --name caffe2-dis --net=host --hostname caffe2 --dns 8.8.8.8 -v /mnt:/mnt 192.168.133.11:5000/caffe2:latest bash

This completes a distributed environment that contains Caffe2 and has access to the NFS shared directory.
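Before starting distributed training it can be useful to confirm that every container really sees the same directory. The sketch below assumes /mnt is the mapped NFS directory from this section; the marker-file naming scheme and the function names are hypothetical and are not something Caffe2 itself uses. The node name is passed explicitly because all containers in this guide are started with the same hostname (caffe2).

```python
import os
import sys

PREFIX, SUFFIX = "nfs_check_", ".txt"


def write_marker(shared_dir, node_name):
    """Create a small marker file for this node in the shared directory."""
    path = os.path.join(shared_dir, PREFIX + node_name + SUFFIX)
    with open(path, "w") as handle:
        handle.write(node_name)
    return path


def list_markers(shared_dir):
    """Return the node names whose marker files are visible from this node."""
    names = []
    for entry in os.listdir(shared_dir):
        if entry.startswith(PREFIX) and entry.endswith(SUFFIX):
            names.append(entry[len(PREFIX):-len(SUFFIX)])
    return sorted(names)


if __name__ == "__main__":
    # Usage (hypothetical script name): python nfs_check.py /mnt server1
    if len(sys.argv) == 3:
        write_marker(sys.argv[1], sys.argv[2])
        print("markers visible here:", list_markers(sys.argv[1]))
```

Running it once per container with a different node name should end with every node listing all three names; a missing name points at a mount or mapping problem on that host.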

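2.4 Troubleshooting

(1) The build fails because the cmake version is too old.

Solution: install the latest cmake from source.

Download the latest source archive from http://www.cmake.org/download/, for example cmake-3.10.3.tar.gz, then unpack and install it:

tar -xzvf cmake-3.10.3.tar.gz

cd cmake-3.10.3

./bootstrap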

make

make install

(2) On Linux, cloning the Caffe2 source with git may fail with a certificate error.

Solution: add github to the trusted list.

export GIT_SSL_NO_VERIFY=1

sudo update-ca-certificates

echo -n | openssl s_client -showcerts -connect github.com:443 2>/dev/null | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p'

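(3) make install fails while building the Caffe2 source:

[third_party/onnx/onnx/onnx_onnx_c2.pb.cc] Error 1

Solution:

Download a newer Protobuf from https://github.com/google/protobuf/releases/, then unpack, build, and install it.

First remove the existing older packages:

apt-get remove -y libprotobuf-dev protobuf-compiler

Then unpack, build, and install version 3.5.1:

tar -xzvf protobuf-all-3.5.1.tar.gz

cd protobuf-3.5.1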

./autogen.sh

./configure

make

make check

make install

If protoc is not under /usr/bin/, add symbolic links:

ln -s /usr/local/bin/protoc /usr/bin/protoc

ln -s /usr/local/lib/libprotobuf.so /usr/lib64/libprotobuf.so

Otherwise the build still fails.

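(5) ImportError: cannot import name caffe2_pb2

This is an environment variable problem. Set PATH, PYTHONPATH, LD_LIBRARY_PATH and the other variables as described on the official site, and use env to check for problems such as stray extra colons.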

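(6) cmake reports that the cuDNN library cannot be found

This happens because the unpacked cuDNN files were not correctly added under /usr/local/cuda/. Run find / -name cudnn.h to see whether the cudnn.h header exists and so confirm whether cuDNN is installed properly. Do not use the method from the official site that extracts the archive straight into the directory, since it leaves the library files missing; copy the files over as described in this guide instead.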

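(7) ImportError: No module named _tkinter, please install the python-tk package

Solution: sudo apt-get install python-tk

(8) If anaconda2 is installed, Python dependencies may not be found, so that cmake cannot locate numpy and other Python packages.

Solution: add the corresponding Python path to PYTHONPATH, for example /root/anaconda2/lib/python2.7/site-packages. In this setup, also use conda to install and manage the required dependencies.

(9) WARNING:root:Debug message: /root/anaconda2/bin/../lib/libstdc++.so.6: version `CXXABI_1.3.8' not found

Solution:

The system copy of libstdc++.so.6 is located at

/usr/lib/x86_64-linux-gnu/libstdc++.so.6

The error occurs because another copy of this file exists elsewhere, for example in Anaconda, and the Anaconda copy is older than the system one (check with strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 and look at the CXXABI entries). The fix is to copy the system version into Anaconda: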

mv /root/anaconda2/lib/libstdc++.so.6 /root/anaconda2/lib/libstdc++.so.6.bk

cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21 /root/anaconda2/lib/
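The strings check can also be scripted, which makes it easy to compare the two copies side by side. The following is a minimal sketch that scans a library file for CXXABI_x.y.z markers, in the same spirit as strings piped through grep; cxxabi_versions and provides are hypothetical helper names.

```python
import re
import sys


def cxxabi_versions(path):
    """Return the CXXABI version strings embedded in a file, sorted numerically."""
    with open(path, "rb") as handle:
        data = handle.read()
    found = set(re.findall(br"CXXABI_(\d+(?:\.\d+)*)", data))
    versions = [item.decode("ascii") for item in found]
    return sorted(versions, key=lambda v: tuple(int(part) for part in v.split(".")))


def provides(path, wanted):
    """Report whether the library at path advertises the wanted CXXABI version."""
    return wanted in cxxabi_versions(path)


if __name__ == "__main__":
    # Pass one or more library paths, e.g. the system copy and the Anaconda copy.
    for library in sys.argv[1:]:
        print(library, "provides CXXABI_1.3.8:", provides(library, "1.3.8"))
```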

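3 ResNet50 training in practice

ResNet50 is the example used in the multi-GPU training guide on the Caffe2 website. The network is used for image recognition and commonly serves as a benchmark for neural network training performance. The full dataset is ImageNet 1K, but it is very large (about 300 GB) and training takes too long with few GPUs (about a week on two GPUs), so a subset is used here. The training set contains 640 car images and 640 boat images, 1280 images in total; the test set contains 48 car images and 48 boat images, 96 in total. The whole dataset is about 130 MB.

3.1 Single-machine training test

Test building and training ResNet50 on a single machine first; the data downloaded here is reused for distributed training later. (Source: https://github.com/caffe2/caffe2/blob/master/caffe2/python/tutorials/Multi-GPU_Training.ipynb) (On a single machine you can also work through the MNIST handwritten digit recognition tutorial on the official site: https://github.com/caffe2/tutorials/blob/master/MNIST.ipynb)

3.1.1 Importing the data

First download and unpack the data in code (this can also be done by hand):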

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

from caffe2.python import core, workspace, model_helper, net_drawer, memonger, brew
from caffe2.python import data_parallel_model as dpm
from caffe2.python.models import resnet
from caffe2.proto import caffe2_pb2
import numpy as np
import time
import os
from IPython import display

workspace.GlobalInit(['caffe2', '--caffe2_log_level=2'])

# This section checks if you have the training and testing databases
current_folder = os.path.join(os.path.expanduser('~'), 'caffe2_notebooks')
data_folder = os.path.join(current_folder, 'tutorial_data', 'resnet_trainer')

# Train/test data
train_data_db = os.path.join(data_folder, "imagenet_cars_boats_train")
train_data_db_type = "lmdb"
# actually 640 cars and 640 boats = 1280
train_data_count = 1280
test_data_db = os.path.join(data_folder, "imagenet_cars_boats_val")
test_data_db_type = "lmdb"
# actually 48 cars and 48 boats = 96
test_data_count = 96

# Get the dataset if it is missing
def DownloadDataset(url, path):
    # io.BytesIO works on both Python 2 and Python 3
    import io, requests, zipfile
    print("Downloading {} ... ".format(url))
    r = requests.get(url, stream=True)
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(path)
    print("Done downloading to {}!".format(path))

# Make the data folder if it doesn't exist
if not os.path.exists(data_folder):
    os.makedirs(data_folder)
else:
    print("Data folder found at {}".format(data_folder))

# See if you already have the db, and if not, download it
if not os.path.exists(train_data_db):
    DownloadDataset("http://download.caffe2.ai/databases/resnet_trainer.zip", data_folder)

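The code downloads the ImageNet subset from the official site and unpacks it into ~/caffe2_notebooks/tutorial_data/resnet_trainer

Afterwards this folder contains two subfolders, imagenet_cars_boats_train and imagenet_cars_boats_val, which hold the training set and the test set. The code also records the location, database type, and size of both sets.

3.1.2 Configuring the training parameters

Configure the parameters used for training the network:

# Configure how you want to train the model and with how many GPUs
# This is set to use a single GPU (device 0); if you have more GPUs, extend the list, e.g. [0, 1, 2, n]
gpus = [0]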

# Batch size of 32 sums up to roughly 5GB of memory per device

batch_per_device = 32

total_batch_size = batch_per_device * len(gpus)

# This model discriminates between two labels: car or boat

num_labels = 2

# Initial learning rate (scale with total batch size)

base_learning_rate = 0.0004 * total_batch_size

# only intends to influence the learning rate after 10 epochs

stepsize = int(10 * train_data_count / total_batch_size)

# Weight decay (L2 regularization)

weight_decay = 1e-4

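This sets the GPUs to use, the per-device batch size, the total batch size (equal to the per-device batch size with a single GPU), the number of labels (two classes: boats and cars), the learning rate, the step size, and the weight decay.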
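To see how these values move with the number of GPUs, the arithmetic from the configuration can be wrapped in a small function. This is only a sketch that reuses the constants from the text (32 images per device, 1280 training images, 0.0004 per sample, decay after 10 epochs); training_config is a hypothetical helper name.

```python
def training_config(num_gpus, batch_per_device=32, train_data_count=1280,
                    lr_per_sample=0.0004, decay_epochs=10):
    """Derive the batch size, learning rate and step size used in the tutorial code."""
    total_batch_size = batch_per_device * num_gpus
    return {
        "total_batch_size": total_batch_size,
        # the learning rate scales linearly with the total batch size
        "base_learning_rate": lr_per_sample * total_batch_size,
        # number of iterations that make up decay_epochs epochs
        "stepsize": int(decay_epochs * train_data_count / total_batch_size),
        "iterations_per_epoch": int(train_data_count / total_batch_size),
    }


if __name__ == "__main__":
    for count in (1, 2):
        print(count, "GPU(s):", training_config(count))
```

With one GPU this gives a total batch size of 32, 40 iterations per epoch and a step size of 400; with two GPUs the batch size doubles while the iteration count and step size halve.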

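3.1.3 Building the network

Create the model and reset the workspace, so that data left over from a previous training run does not interfere (this matters if you run the code a second time):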

train_model = model_helper.ModelHelper(name="resnet_test")

workspace.ResetWorkspace()

3.1.4 Reading the data

Create a database reader that reads the training data from the location specified earlier:

reader = train_model.CreateDB("train_reader", db=train_data_db, db_type=train_data_db_type)

3.1.5 Image input

Define how the raw input images are processed:

def add_image_input_ops(model):
    # utilize the ImageInput operator to prep the images
    data, label = brew.image_input(
        model,
        reader,
        ["data", "label"],
        batch_size=batch_per_device,
        # mean: to remove color values that are common
        mean=128.,
        # std is going to be modified randomly to influence the mean subtraction
        std=128.,
        # scale to rescale each image to a common size
        scale=256,
        # crop to square each image to exact dimensions
        crop=224,
        # not running in test mode
        is_test=False,
        # mirroring of the images will occur randomly
        mirror=1
    )
    # prevent back-propagation: optional performance improvement; may not be observable at small scale
    data = model.net.StopGradient(data, data)

3.1.6 Defining the ResNet50 model creation function

def create_resnet50_model_ops(model, loss_scale=1.0):
    # Creates a residual network
    [softmax, loss] = resnet.create_resnet50(
        model,
        "data",
        num_input_channels=3,
        num_labels=num_labels,
        label="label",
    )
    prefix = model.net.Proto().name
    loss = model.net.Scale(loss, prefix + "_loss", scale=loss_scale)
    brew.accuracy(model, [softmax, "label"], prefix + "_accuracy")
    return [loss]

This calls create_resnet50 from the resnet module to build the network. The function is the official Caffe2 implementation; read its source for a deeper understanding.

3.1.7 Defining the parameter update function

def add_parameter_update_ops(model):
    brew.add_weight_decay(model, weight_decay)
    iter = brew.iter(model, "iter")
    lr = model.net.LearningRate(
        [iter],
        "lr",
        base_lr=base_learning_rate,
        policy="step",
        stepsize=stepsize,
        gamma=0.1,
    )
    # Momentum SGD update
    for param in model.GetParams():
        param_grad = model.param_to_grad[param]
        param_momentum = model.param_init_net.ConstantFill(
            [param], param + '_momentum', value=0.0
        )
        # Update param_grad and param_momentum in place
        model.net.MomentumSGDUpdate(
            [param_grad, param_momentum, lr, param],
            [param_grad, param_momentum, param],
            momentum=0.9,
            # Nesterov Momentum works slightly better than standard momentum
            nesterov=1,
        )
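The "step" policy used by the LearningRate operator above multiplies the learning rate by gamma once every stepsize iterations. A plain-Python sketch of that schedule, independent of Caffe2, is given below; step_learning_rate is a hypothetical name, and the example values are the single-GPU settings from section 3.1.2.

```python
def step_learning_rate(iteration, base_lr, stepsize, gamma=0.1):
    """Learning rate under a step policy: base_lr * gamma ** floor(iteration / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)


if __name__ == "__main__":
    # Single-GPU settings from the text: base_lr = 0.0004 * 32, stepsize = 400
    for iteration in (0, 399, 400, 800):
        print(iteration, step_learning_rate(iteration, 0.0128, 400))
```

Since one epoch is 40 iterations with these settings, a run with num_epochs = 1 never reaches the first decay at iteration 400, which matches the comment that the step size is only meant to take effect after 10 epochs.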

3.1.8 Gradient memory optimization

def optimize_gradient_memory(model, loss):
    model.net._net = memonger.share_grad_blobs(
        model.net,
        loss,
        set(model.param_to_grad.values()),
        # Due to memonger internals, we need a namescope here. Let's make one up; we'll need it later!
        namescope="imonaboat",
        share_activations=False)

3.1.9 Creating the network and training

Force the network to train on one specific GPU (the first one on this machine), then call the functions defined above to build the network and start training.

# We need to give the network context and force it to run on the first GPU even if there are more.
device_opt = core.DeviceOption(caffe2_pb2.CUDA, gpus[0])

# Here's where that NameScope comes into play
with core.NameScope("imonaboat"):
    # Picking that one GPU
    with core.DeviceScope(device_opt):
        # Run our reader, and create the layers that transform the images
        add_image_input_ops(train_model)
        # Generate our residual network and return the losses
        losses = create_resnet50_model_ops(train_model)
        # Create gradients for each loss
        blobs_to_gradients = train_model.AddGradientOperators(losses)
        # Kick off the learning and managing of the weights
        add_parameter_update_ops(train_model)
    # Optimize memory usage by consolidating where we can
    optimize_gradient_memory(train_model, [blobs_to_gradients[losses[0]]])

# Startup the network
workspace.RunNetOnce(train_model.param_init_net)
# Load all of the initial weights; overwrite lets you run this multiple times
workspace.CreateNet(train_model.net, overwrite=True)

num_epochs = 1
for epoch in range(num_epochs):
    # Split up the images evenly: total images / batch size
    num_iters = int(train_data_count / total_batch_size)
    for iter in range(num_iters):
        # Stopwatch start!
        t1 = time.time()
        # Run this iteration!
        workspace.RunNet(train_model.net.Proto().name)
        t2 = time.time()
        dt = t2 - t1
        # Stopwatch stopped! How'd we do?
        print((
            "Finished iteration {:>" + str(len(str(num_iters))) + "}/{}" +
            " (epoch {:>" + str(len(str(num_epochs))) + "}/{})" +
            " ({:.2f} images/sec)").
            format(iter+1, num_iters, epoch+1, num_epochs, total_batch_size/dt))

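Combine the code above into one script, or enter it step by step in an interactive Python session, to start training.

Note: some method calls in this code differ from the code on the official site, probably because the Caffe2 API has been updated while the official documentation has not yet caught up. It is recommended to replace the model.xxx operator-creation calls with the brew.xxx helper functions, passing the model as the first argument.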
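The progress line printed by the training loop packs two details into one format string: the iteration and epoch counters are right-aligned to the width of their maximum values, and throughput is computed as total batch size divided by the time of one iteration. The sketch below isolates that logic so it can be tried without Caffe2; progress_line is a hypothetical helper name.

```python
def progress_line(iteration, num_iters, epoch, num_epochs, total_batch_size, seconds):
    """Build the progress message used in the training loop for zero-based counters."""
    template = (
        "Finished iteration {:>" + str(len(str(num_iters))) + "}/{}" +
        " (epoch {:>" + str(len(str(num_epochs))) + "}/{})" +
        " ({:.2f} images/sec)"
    )
    return template.format(iteration + 1, num_iters, epoch + 1, num_epochs,
                           total_batch_size / seconds)


if __name__ == "__main__":
    # Third iteration of 40, batch of 32 images processed in half a second
    print(progress_line(2, 40, 0, 1, 32, 0.5))
```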

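3.2 Distributed training test

3.2.1 Downloading the resnet50 code

Once the single-machine test has passed on every node, distributed testing can begin. The training data is the dataset downloaded above, and the code is the official example from GitHub, saved locally as resnet50_trainer.py.

(Source: https://github.com/caffe2/caffe2/blob/master/caffe2/python/examples/resnet50_trainer.py)

3.2.2 Important note

For distributed training inside containers, edit /etc/hosts so that the container's hostname resolves to the IP of its own physical host, for example: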

192.168.133.10 caffe2

Without this change, Gloo cannot discover the peer hosts and cannot establish socket connections (which suggests that Gloo communicates by IP address).
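Whether the /etc/hosts entry took effect can be checked from Python with the standard library. The sketch below resolves a hostname and flags loopback results, which other machines cannot connect to; is_loopback and check_hostname are hypothetical helper names, and treating any non-loopback address as usable is a simplifying assumption.

```python
import ipaddress
import socket


def is_loopback(address):
    """Return True for addresses in 127.0.0.0/8 (or ::1), which peers cannot reach."""
    return ipaddress.ip_address(address).is_loopback


def check_hostname(hostname):
    """Resolve hostname and return (address, usable_by_peers)."""
    address = socket.gethostbyname(hostname)
    return address, not is_loopback(address)


if __name__ == "__main__":
    name = socket.gethostname()
    try:
        address, usable = check_hostname(name)
    except socket.gaierror:
        print("{} does not resolve; add it to /etc/hosts".format(name))
    else:
        status = "ok" if usable else "loopback address, peers cannot connect"
        print("{} -> {} ({})".format(name, address, status))
```

Run inside each container, the expected output is the physical host's own IP, such as 192.168.133.10 on server1.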

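3.2.3 Distributed training

Training is started by entering the command on each node in turn (with more nodes this can be scripted). The concrete commands used here are:

First node: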

time python resnet50_trainer.py --train_data ~/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_train/ --test_data ~/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_val/ --gpus 0 --num_labels 2 --base_learning_rate 0.0384 --batch_size 32 --epoch_size 1280 --num_epochs 10 --num_shards 3 --shard_id 0 --run_id 1234 --file_store_path=/mnt/

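Second node:

time python resnet50_trainer.py --train_data ~/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_train/ --test_data ~/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_val/ --gpus 0 --num_labels 2 --base_learning_rate 0.0384 --batch_size 32 --epoch_size 1280 --num_epochs 10 --num_shards 3 --shard_id 1 --run_id 1234 --file_store_path=/mnt/

Third node:

time python resnet50_trainer.py --train_data ~/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_train/ --test_data ~/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_val/ --gpus 0 --num_labels 2 --base_learning_rate 0.0384 --batch_size 32 --epoch_size 1280 --num_epochs 10 --num_shards 3 --shard_id 2 --run_id 1234 --file_store_path=/mnt/

The commands on the three nodes are identical except for the shard_id argument, which uniquely identifies each node. The other arguments are explained in the next section.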
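Because the per-node commands differ only in shard_id, they are easy to generate. The sketch below builds one command per shard and derives base_learning_rate as 0.0004 times the total batch size, which for three nodes of 32 gives the 0.0384 used above. shard_commands is a hypothetical helper, and it assumes every node uses the same batch size and a single GPU; only flags that appear in the command above are emitted.

```python
def shard_commands(num_shards, data_root, store_path, run_id, gpus="0",
                   batch_size=32, epoch_size=1280, num_epochs=10,
                   num_labels=2, lr_per_sample=0.0004):
    """Return one resnet50_trainer.py command line per shard."""
    # learning rate scales with the batch size summed over all nodes
    base_lr = round(lr_per_sample * batch_size * num_shards, 6)
    template = (
        "python resnet50_trainer.py"
        " --train_data {root}/imagenet_cars_boats_train/"
        " --test_data {root}/imagenet_cars_boats_val/"
        " --gpus {gpus} --num_labels {labels}"
        " --base_learning_rate {lr:g} --batch_size {batch}"
        " --epoch_size {epoch_size} --num_epochs {epochs}"
        " --num_shards {shards} --shard_id {shard_id}"
        " --run_id {run_id} --file_store_path={store}"
    )
    return [
        template.format(root=data_root, gpus=gpus, labels=num_labels, lr=base_lr,
                        batch=batch_size, epoch_size=epoch_size, epochs=num_epochs,
                        shards=num_shards, shard_id=shard_id, run_id=run_id,
                        store=store_path)
        for shard_id in range(num_shards)
    ]


if __name__ == "__main__":
    root = "~/caffe2_notebooks/tutorial_data/resnet_trainer"
    for command in shard_commands(3, root, "/mnt/", 1234):
        print(command)
```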

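3.2.4 Parameter reference

The parameters are described on the official site; my own understanding is as follows:

Parameter | My understanding
train_data | Required. Location of the training dataset; a directory is enough.
test_data | Optional. Location of the test dataset; a directory is enough.
db_type | Optional. Database type; defaults to lmdb.
gpus | Optional. Comma-separated list of the GPU IDs to use on this node, starting from 0.
num_gpus | Optional. Number of GPUs on this node; can be used instead of gpus.
num_channels | Optional. Number of color channels in the images; defaults to 3.
image_size | Pixel size of the input images (height or width; images are assumed to be square). Defaults to 227; small images may not be handled.
num_labels | Number of labels in the data. Defaults to 1000 classes; change it to match the dataset (2 in the commands here).
batch_size | Batch size summed over all GPUs on this node, not over all nodes. The default is 32 per GPU; increase it with the number of GPUs on the node.
epoch_size | Number of inputs per epoch. Defaults to 1500000; set it to match the dataset, for example 1280 for the small dataset from the Caffe2 site.
num_epochs | Number of epochs.
base_learning_rate | Learning rate. The official recommendation is 0.0004 times the sum of batch_size over all nodes. The default of 0.1 corresponds to a total batch size of 256 across all nodes; adjust it to your own total batch size (not this node's batch size).
weight_decay | Weight decay.
num_shards | Number of machines taking part in distributed training. Defaults to 1 (single node).
shard_id | Shard ID of this node. Defaults to 0; set the first node to 0 and the following nodes to 1, 2, 3, and so on.
run_id | Run identifier by which the nodes of a distributed run recognize each other; it must be identical on every node taking part in the run.
redis_host | IP of the Redis server used for rendezvous.
redis_port | Port of the Redis server.
file_store_path | Shared directory used as a temporary location for synchronization between nodes, as an alternative to Redis; use one or the other. The NFS directory mounted earlier is used here.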

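3.3 A simple performance test

Using resnet50_trainer.py, distributed training performance was measured on the physical environment set up above. The results are as follows (for the single-machine runs, remove num_shards and the other distributed-only arguments and adjust the batch size):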

Servers | GPUs per server | Total GPUs | Batch size per node | Time (s)
1 (P100) | | | | 415.5
1 (P100) | | | | 585
1 (P4) | | | | 671
2 (P4 + P100) | | | | 516
2 (P100*2) | | | | 283
3 (P100*2 + P4) | | | | 410

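These are single-run results without repetition, so the numbers are not rigorous and only serve as a probe of distributed training capability. In addition, some GPU acceleration and network distribution settings were not optimized.

P100 and P4 differ considerably in performance, and on the P4 a batch size of 64 already produced an out of memory error. Because of the P4, multi-machine distributed training actually became slower. With synchronous stochastic gradient descent it is better to use homogeneous hardware; otherwise efficiency suffers badly.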
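To make the comparison concrete, each measured time can be expressed relative to the fastest run. The sketch below uses only the times reported in the table; relative_times is a hypothetical helper name, and because the table does not give the GPU count or batch size of each row, the ratios are a rough reading rather than a like-for-like benchmark.

```python
RESULTS = [
    ("1 server (P100)", 415.5),
    ("1 server (P100)", 585.0),
    ("1 server (P4)", 671.0),
    ("2 servers (P4 + P100)", 516.0),
    ("2 servers (P100*2)", 283.0),
    ("3 servers (P100*2 + P4)", 410.0),
]


def relative_times(results):
    """Return (label, time / fastest time) for every row, keeping the row order."""
    fastest = min(seconds for _, seconds in results)
    return [(label, seconds / fastest) for label, seconds in results]


if __name__ == "__main__":
    for label, ratio in relative_times(RESULTS):
        print("{:<26} {:.2f}x the fastest time".format(label, ratio))
```

Read this way, adding the P4 server to the two P100 servers raised the run time from 283 s to 410 s, which is the slowdown described above.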

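3.4 Troubleshooting

Most problems encountered during distributed training, such as connection error and Aborted (core dumped), are caused by broken network communication between nodes. First make sure that each node can resolve its own hostname (caffe2 in this guide), because the resolved IP is what gets used for communication; then make sure the nodes can reach each other. This resolves most training problems.

4 References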

https://caffe2.ai/docs/getting-started.html?platform=ubuntu&configuration=compile

https://caffe2.ai/docs/getting-started.html?platform=centos&configuration=cloud

https://github.com/caffe2/tutorials/blob/master/MNIST.ipynb

https://github.com/caffe2/caffe2/blob/master/caffe2/python/tutorials/Multi-GPU_Training.ipynb

https://github.com/caffe2/caffe2/blob/master/caffe2/python/examples/resnet50_trainer.py

https://blog.csdn.net/zziahgf/article/details/79022490

https://hub.docker.com/r/nvidia/cuda/
