67 changes: 67 additions & 0 deletions doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md
@@ -0,0 +1,67 @@
# A Detailed Explanation of the Model Parameter Splitting Strategy in Fluid Distributed Training
Contributor: => "Distributed Training Parameter Splitting Design"

Collaborator (Author): That is indeed more concise, thanks. I'll change it.

This article explains the design of the model parameter splitting scheme used when running Parameter Server based distributed training with PaddlePaddle Fluid, and gives a simple example of how to apply the scheme.
Contributor: This sentence does not really add anything; a sufficiently short and clear title is enough.

Collaborator (Author): I think this sentence tells readers what they will get from the article, for example that they can copy the example code directly, or that they can understand how the design decisions behind it were made. That saves readers time; a title alone cannot achieve this.


## Model Parameter Splitting Strategy Design
### Reasons for Splitting
Contributor: "Reasons for Splitting" should not sit under an "xxx Design" heading. The reason is background, i.e. why we do this, not how we do it. Add a second-level heading "Background" to explain why splitting is needed.

Collaborator (Author): OK, I'll change it.


When designing a model, we usually do not limit the size of the parameters used by each layer. Suppose we now have 3 parameter servers and want to train the following network:

![fluid_3_layer_network](src/fluid_3_layers_network.png)
Contributor: The image is broken.

Collaborator (Author): Done.

Contributor: This figure has several problems:

  1. There are no `fluid.input` or `fluid.output` functions, and `fluid.fc` should be `fluid.layers.fc`.
  2. w and b belong on the fc layers.
  3. The earlier assumption "suppose we have 3 parameter servers" is not reflected in the figure, nor explained afterwards.

Collaborator (Author) @velconia, Jun 13, 2018:

  1. Thanks, I'll fix the figure.
  2. My understanding of w and b is that `w * fluid.layers.data + b` is what produces the input of `fluid.layers.fc`, so I think they should sit on the connecting edges? Compare this figure: http://www.paddlepaddle.org/docs/develop/book/02.recognize_digits/image/mlp.png
  3. The relationship between the number of servers and the number of splits is covered later, so stating it clearly here is necessary.

Collaborator: It would be best to center all the images (a screenshot showing how was attached).

The fluid.input layer is very wide, so the w1 and b1 parameters have a very large dimension of 10 * 1000 elements, while the fluid.fc layer is very narrow, so the w2 and b2 parameters have a dimension of only 1 * 10.
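
For reference, the example network can be sketched in Fluid roughly as follows. This is an illustrative reconstruction from the figure, not code from this repository; the layer sizes are one reading of the figure, chosen so the weight shapes match the dimensions quoted above:

```python
import paddle.fluid as fluid

# Wide input: 1000-dimensional features.
x = fluid.layers.data(name='x', shape=[1000], dtype='float32')
# Narrow fc layer of size 10: w1 has shape [1000, 10], i.e. 10 * 1000 elements.
hidden = fluid.layers.fc(input=x, size=10)
# Output fc layer of size 1: w2 has shape [10, 1], i.e. 1 * 10 elements.
out = fluid.layers.fc(input=hidden, size=1)
```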

If we simply assigned these parameters to the parameter servers as they are, the amount of parameter data held by each server would be uneven, and the lightly loaded servers would wait for the heavily loaded ones.
To handle such unevenly sized parameters, the Distribute Transpiler splits the model's parameters and their corresponding gradients; after splitting, each parameter or gradient becomes one or more parameter blocks.

### How Parameters Are Split

When splitting parameters, splitting too finely lowers the computational efficiency of the parameter servers, while splitting too coarsely prevents an even distribution of the parameters.
To control the granularity, for every parameter or gradient we compute two values, the maximum split count and the expected split count (a sketch of this computation follows the list below):

* Maximum split count

To avoid overly fine granularity, we fix a minimum parameter block size of 8192 elements.
We divide the parameter size by the minimum block size and round up, which gives the parameter's maximum split count.
In the example above, the maximum split count is ceil(10 * 1000 / 8192) = 2.

* Expected split count

To distribute the parameters completely evenly across the parameter servers, we take the total number of parameter servers as the expected split count.
In the example above, the expected split count is 3.

After computing these two values, we take the smaller one as the final split count, which distributes the parameters as evenly as possible while still respecting the minimum granularity.
In the example above, the parameter is finally split into min(2, 3) = 2 blocks.
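
The rule can be summarized in a few lines of Python. This is a minimal sketch of the computation described above, not the transpiler's actual code; the names `MIN_BLOCK_SIZE` and `split_count` are illustrative:

```python
import math

MIN_BLOCK_SIZE = 8192  # minimum parameter block size, in elements

def split_count(param_numel, num_pservers):
    """Number of blocks a parameter of param_numel elements is split into."""
    # Maximum split count: never produce blocks smaller than MIN_BLOCK_SIZE.
    max_splits = int(math.ceil(param_numel / float(MIN_BLOCK_SIZE)))
    # Expected split count: ideally one block per parameter server.
    expected_splits = num_pservers
    # The smaller value balances granularity against even distribution.
    return min(max_splits, expected_splits)

print(split_count(10 * 1000, 3))  # w1 from the example: prints 2
```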

### How Blocks Are Placed

After splitting the parameters and gradients into multiple parameter blocks, we still need to place these blocks evenly onto the parameter servers.

We currently support two simple and effective block placement methods: [Round Robin](https://en.wikipedia.org/wiki/Round-robin_scheduling) and [Hash](https://en.wikipedia.org/wiki/Hash_function).

In Round Robin mode, parameter blocks are assigned to the servers one by one, cycling through the servers in turn.

In Hash mode, we hash the name of each parameter block and take the result modulo the total number of parameter servers to obtain the id of the server that will hold it.
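
Both placement rules are easy to sketch. The function names below are illustrative, not the transpiler's API; note that a real implementation needs a deterministic string hash so that every process agrees on the placement:

```python
def round_robin(block_names, num_pservers):
    """Assign blocks to servers one by one, cycling through the servers."""
    return {name: i % num_pservers for i, name in enumerate(block_names)}

def hash_name(block_names, num_pservers):
    """Assign each block to the server selected by hashing its name."""
    # Python's built-in hash() is salted per process; a deterministic
    # hash (e.g. from hashlib) would be required in practice.
    return {name: hash(name) % num_pservers for name in block_names}

blocks = ['w1.block0', 'w1.block1', 'b1.block0', 'w2.block0']
print(round_robin(blocks, 3))  # -> servers 0, 1, 2, 0
```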

### Overall Splitting Process

This completes the splitting strategy for parameters and gradients. For the example above, we obtain the splitting result shown in the following figure:

![fluid_parameter_slice_up](src/fluid_parameter_slice_up.png)


## Model Parameter Splitting Example
### Distributed Implementation

For the concrete implementation of PaddlePaddle Fluid distributed training, see [Fluid Cluster Train](../../howto/cluster/fluid_cluster_train_cn.md).

### Parameter Details
The main parameter splitting strategy is implemented in the [Distribute Transpiler](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/transpiler/distribute_transpiler.py). Passing `slice_var_up=True` to the `transpile` method enables model parameter splitting, and `split_method=RoundRobin` selects how the parameter blocks are placed. Sample code:

```python
# RoundRobin is assumed to be exported next to DistributeTranspiler;
# HashName would select hash-based placement instead.
from paddle.fluid.transpiler import DistributeTranspiler, RoundRobin

transpiler = DistributeTranspiler()
transpiler.transpile(
    trainer_id=trainer_id,    # index of this trainer, 0..trainers-1
    slice_var_up=True,        # split parameters into blocks
    split_method=RoundRobin,  # place blocks on pservers in turn
    pservers=pservers,        # comma-separated "ip:port" endpoints
    trainers=trainers)        # total number of trainers
```
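
If hash-based placement is preferred, `split_method=HashName` can be passed instead (assuming `HashName` is exported from the same transpiler module as `RoundRobin`).
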
67 changes: 67 additions & 0 deletions doc/fluid/design/dist_train/fluid_parameter_split_strategy_en.md
@@ -0,0 +1,67 @@
# Fluid Distributed Parameter Splitting Strategy
This article explains the design of parameter splitting for Parameter Server based distributed training with PaddlePaddle Fluid, and gives an example of how the splitting scheme is used in Python code.

## Model Parameter Splitting Strategy Design
### Reasons for Splitting

When designing a model, we usually do not limit the size of the parameters used by each layer. Suppose we have 3 parameter servers and want to train the following network:

![fluid_3_layer_network](src/fluid_3_layers_network.png)

The fluid.input layer is very wide, so the w1 and b1 parameters have a very large dimension of 10 * 1000 elements, while the fluid.fc layer is very narrow, so the w2 and b2 parameters have a dimension of only 1 * 10.
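
For reference, the example network can be sketched in Fluid roughly as follows. This is an illustrative reconstruction from the figure, not code from this repository; the layer sizes are one reading of the figure, chosen so the weight shapes match the dimensions quoted above:

```python
import paddle.fluid as fluid

# Wide input: 1000-dimensional features.
x = fluid.layers.data(name='x', shape=[1000], dtype='float32')
# Narrow fc layer of size 10: w1 has shape [1000, 10], i.e. 10 * 1000 elements.
hidden = fluid.layers.fc(input=x, size=10)
# Output fc layer of size 1: w2 has shape [10, 1], i.e. 1 * 10 elements.
out = fluid.layers.fc(input=hidden, size=1)
```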

If we simply assigned these parameters to the parameter servers as they are, the amount of parameter data held by each server would be uneven, and the lightly loaded servers would wait for the heavily loaded ones.
Therefore, to handle unevenly sized parameters, the Distribute Transpiler splits the model's parameters and their corresponding gradients into one or more parameter blocks.

### How Parameters Are Split

If the splitting is too fine-grained, the computational efficiency of the parameter servers drops; if it is too coarse-grained, the parameters cannot be distributed evenly.
To control the granularity, for every parameter or gradient we compute two values, the maximum split count and the expected split count (a sketch of this computation follows the list below):

* Maximum split count

In order to avoid overly fine granularity, we fix a minimum parameter block size of 8192 elements.
We divide the parameter size by the minimum block size and round up, which gives the parameter's maximum split count.
In the above example, the maximum split count is ceil(10 * 1000 / 8192) = 2.

* Expected split count

In order to distribute the parameters evenly across the parameter servers, we take the total number of parameter servers as the expected split count.
In the above example, the expected split count is 3.

After computing these two values, we take the smaller one as the final split count, which distributes the parameters as evenly as possible while still respecting the minimum granularity.
So in the above example, the parameter is finally split into min(2, 3) = 2 blocks.
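
The rule can be summarized in a few lines of Python. This is a minimal sketch of the computation described above, not the transpiler's actual code; the names `MIN_BLOCK_SIZE` and `split_count` are illustrative:

```python
import math

MIN_BLOCK_SIZE = 8192  # minimum parameter block size, in elements

def split_count(param_numel, num_pservers):
    """Number of blocks a parameter of param_numel elements is split into."""
    # Maximum split count: never produce blocks smaller than MIN_BLOCK_SIZE.
    max_splits = int(math.ceil(param_numel / float(MIN_BLOCK_SIZE)))
    # Expected split count: ideally one block per parameter server.
    expected_splits = num_pservers
    # The smaller value balances granularity against even distribution.
    return min(max_splits, expected_splits)

print(split_count(10 * 1000, 3))  # w1 from the example: prints 2
```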

### How Blocks Are Placed

After splitting the parameters and gradients into multiple parameter blocks, we still need to place these blocks evenly onto the parameter servers.

We currently support two simple and effective block placement methods: [Round Robin](https://en.wikipedia.org/wiki/Round-robin_scheduling) and [Hash](https://en.wikipedia.org/wiki/Hash_function).

In Round Robin mode, parameter blocks are assigned to the servers one by one, cycling through the servers in turn.

In Hash mode, we hash the name of each parameter block and take the result modulo the total number of parameter servers to obtain the id of the server that will hold it.
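
Both placement rules are easy to sketch. The function names below are illustrative, not the transpiler's API; note that a real implementation needs a deterministic string hash so that every process agrees on the placement:

```python
def round_robin(block_names, num_pservers):
    """Assign blocks to servers one by one, cycling through the servers."""
    return {name: i % num_pservers for i, name in enumerate(block_names)}

def hash_name(block_names, num_pservers):
    """Assign each block to the server selected by hashing its name."""
    # Python's built-in hash() is salted per process; a deterministic
    # hash (e.g. from hashlib) would be required in practice.
    return {name: hash(name) % num_pservers for name in block_names}

blocks = ['w1.block0', 'w1.block1', 'b1.block0', 'w2.block0']
print(round_robin(blocks, 3))  # -> servers 0, 1, 2, 0
```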

### Overall Splitting Process

This completes the splitting strategy for parameters and gradients. For the above example, we obtain the splitting result shown in the following figure:

![fluid_parameter_slice_up](src/fluid_parameter_slice_up.png)


## Model Parameter Splitting Use Case
### Distributed Implementation

For the concrete implementation of PaddlePaddle Fluid distributed training, see [Fluid Cluster Train](../../howto/cluster/fluid_cluster_train_en.md).

### Parameter Details
The main parameter splitting strategy is implemented in the [Distribute Transpiler](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/transpiler/distribute_transpiler.py). Passing `slice_var_up=True` to the `transpile` method enables model parameter splitting, and `split_method=RoundRobin` selects how the parameter blocks are placed. Sample code:

```python
# RoundRobin is assumed to be exported next to DistributeTranspiler;
# HashName would select hash-based placement instead.
from paddle.fluid.transpiler import DistributeTranspiler, RoundRobin

transpiler = DistributeTranspiler()
transpiler.transpile(
    trainer_id=trainer_id,    # index of this trainer, 0..trainers-1
    slice_var_up=True,        # split parameters into blocks
    split_method=RoundRobin,  # place blocks on pservers in turn
    pservers=pservers,        # comma-separated "ip:port" endpoints
    trainers=trainers)        # total number of trainers
```
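
If hash-based placement is preferred, `split_method=HashName` can be passed instead (assuming `HashName` is exported from the same transpiler module as `RoundRobin`).
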
1 change: 1 addition & 0 deletions doc/fluid/design/dist_train/index_cn.rst
@@ -7,3 +7,4 @@
distributed_architecture.md
distributed_lookup_table_design.md
parameter_server.md
fluid_parameter_split_strategy_cn.md
1 change: 1 addition & 0 deletions doc/fluid/design/dist_train/index_en.rst
@@ -7,3 +7,4 @@ Distributed Training
distributed_architecture.md
distributed_lookup_table_design.md
parameter_server.md
fluid_parameter_split_strategy_en.md
(The two binary image files added under src/, fluid_3_layers_network.png and fluid_parameter_slice_up.png, cannot be displayed in the diff view.)
4 changes: 2 additions & 2 deletions doc/v2/dev/write_docs_cn.rst
@@ -76,8 +76,8 @@ The PaddlePaddle.org tool can be used together with Docker; Docker must first be installed on the system
docker build -t paddle:dev .
docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" -e "WITH_DOC=ON" paddle:dev /bin/bash

# After entering the Docker container, use the build.sh script to build the PaddlePaddle documentation
bash -x /paddle/paddle/scripts/docker/build.sh
# After entering the Docker container, use the paddle_build.sh script to build the PaddlePaddle documentation
bash -x /paddle/paddle/scripts/paddle_build.sh build

Note: the commands above map the current directory (the source root directory) to the :code:`/paddle` directory inside the container.

6 changes: 3 additions & 3 deletions doc/v2/dev/write_docs_en.rst
@@ -68,7 +68,7 @@ Please `click here <https://github.com/PaddlePaddle/PaddlePaddle.org/blob/develo
Manually Building the Documentation
-------------------------------------

To build PaddlePaddle's documentation with Docker, you need to install Docker first. Please refer to `Docker's official website <https://docs.docker.com/>`_ on how to install Docker. This method is quite similar to `Build From Sources <http://paddlepaddle.org/docs/develop/documentation/en/build_and_install/build_from_source_en.html>`_ , by constructing, from source code, a docker image that can be used to build PaddlePaddle documentation. Enter the Docker container and use the script ``build.sh`` in the source directory to build the PaddlePaddle documentation. The specific steps are as follows:
To build PaddlePaddle's documentation with Docker, you need to install Docker first. Please refer to `Docker's official website <https://docs.docker.com/>`_ on how to install Docker. This method is quite similar to `Build From Sources <http://paddlepaddle.org/docs/develop/documentation/en/build_and_install/build_from_source_en.html>`_ , by constructing, from source code, a docker image that can be used to build PaddlePaddle documentation. Enter the Docker container and use the script ``paddle_build.sh`` in the source directory to build the PaddlePaddle documentation. The specific steps are as follows:

.. code-block:: bash

@@ -79,8 +79,8 @@ Build PaddlePaddle's documentation with Docker,you need to install Docker firs
docker build -t paddle:dev .
docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" -e "WITH_DOC=ON" paddle:dev /bin/bash

# Use build.sh to build PaddlePaddle documentation
bash -x /paddle/paddle/scripts/docker/build.sh
# Use paddle_build.sh to build PaddlePaddle documentation
bash -x /paddle/paddle/scripts/paddle_build.sh build

Note: The above commands map the current directory (source root directory) to the :code:`/paddle` directory in the container.
