
Running sh train.sh hangs in the resnet50 benchmark with 4 or 8 GPUs on a single machine  #152

@wuyujiji


Question

Hi, I recently built the OneFlow environment and ran the resnet50 benchmark from OneFlow-Benchmark. It runs successfully with 1 or 2 GPUs on a single machine, but hangs with 4 or 8 GPUs on a single machine.

Environment

gpu: Tesla V100 16GB
python:3.6
cuda: 10.0
cudnn: 7
oneflow: 0.2.0
OneFlow-benchmark: master@f09f31ea8c3da6a1cc193081eb544b92d8e504c2

Log info:
NUM_EPOCH=2
DATA_ROOT=/workdir/data/mini-imagenet/ofrecord

Running resnet50: num_gpu_per_node = 4, num_nodes = 1.

dtype = float32
gpu_num_per_node = 4
num_nodes = 1
node_ips = ['192.168.1.13', '192.168.1.14']
ctrl_port = 50051
model = resnet50
use_fp16 = None
use_xla = None
channel_last = None
pad_output = None
num_epochs = 2
model_load_dir = None
batch_size_per_device = 128
val_batch_size_per_device = 50
nccl_fusion_threshold_mb = 0
nccl_fusion_max_ops = 0
fuse_bn_relu = False
fuse_bn_add_relu = False
gpu_image_decoder = False
image_path = test_img/tiger.jpg
num_classes = 1000
num_examples = 1281167
num_val_examples = 50000
rgb_mean = [123.68, 116.779, 103.939]
rgb_std = [58.393, 57.12, 57.375]
image_shape = [3, 224, 224]
label_smoothing = 0.1
model_save_dir = ./output/snapshots/model_save-20201028202443
log_dir = ./output
loss_print_every_n_iter = 100
image_size = 224
resize_shorter = 256
train_data_dir = /workdir/data/mini-imagenet/ofrecord/train
train_data_part_num = 8
val_data_dir = /workdir/data/mini-imagenet/ofrecord/val
val_data_part_num = 8
optimizer = sgd
learning_rate = 1.024
wd = 3.0517578125e-05
momentum = 0.875
lr_decay = cosine
lr_decay_rate = 0.94
lr_decay_epochs = 2
warmup_epochs = 5
decay_rate = 0.9
epsilon = 1.0
gradient_clipping = 0.0

Time stamp: 2020-10-28-20:24:43
Loading data from /workdir/data/mini-imagenet/ofrecord/train
Optimizer: SGD
Loading data from /workdir/data/mini-imagenet/ofrecord/val

Then it hangs for a long time with no further output.
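Not part of the original report, but one common way to see where a hung multi-GPU run is stuck is to turn on NCCL's debug logging before relaunching train.sh (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables):

```shell
# Enable NCCL debug output; when a multi-GPU job hangs, the last
# INIT/COLL lines printed usually indicate which step or collective
# the ranks are stuck in.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_DEBUG_SUBSYS=$NCCL_DEBUG_SUBSYS"
# then relaunch the benchmark as before:
# sh train.sh
```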

To Reproduce

  1. build the OneFlow environment
    python3 -m pip install --find-links https://oneflow-inc.github.io/nightly oneflow_cu100
  2. clone the OneFlow-Benchmark source
    git clone https://github.com/Oneflow-Inc/OneFlow-Benchmark.git
  3. download the mini-imagenet dataset
    note: to run multi-GPU on one machine, I copied part-00000 into 8 pieces of data in the train and validation folders, respectively
  4. change the content of the shell script
    cd Classification/cnns/
    vim train.sh
    set --train_data_part_num=8
    set --val_data_part_num=8
    set gpu_num_per_node=4 # tried 1, 2, 4, and 8 GPUs in turn; 1 and 2 run normally, but 4 and 8 hang
  5. run the shell script
    sh train.sh
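The fan-out in step 3 can be sketched as follows. This is an illustrative reconstruction, not the exact commands from the report: the temp directory stands in for /workdir/data/mini-imagenet/ofrecord, and the file names follow the part-xxxxx convention that --train_data_part_num=8 / --val_data_part_num=8 expect (one part file per configured part):

```shell
# Fan a single ofrecord shard (part-00000) out into 8 identical parts
# for both the train and val splits.
DATA_ROOT=$(mktemp -d)             # stand-in for the real ofrecord root
mkdir -p "$DATA_ROOT/train" "$DATA_ROOT/val"
: > "$DATA_ROOT/train/part-00000"  # stand-in for the real shard file
: > "$DATA_ROOT/val/part-00000"

for split in train val; do
  for i in 1 2 3 4 5 6 7; do
    cp "$DATA_ROOT/$split/part-00000" "$DATA_ROOT/$split/part-0000$i"
  done
done
ls "$DATA_ROOT/train"   # part-00000 through part-00007
```

With real data, every part holds the same examples, which is fine for a hang/throughput repro but skews any accuracy numbers.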
