
Running sh train.sh hangs in the resnet50 benchmark with 4 or 8 GPUs on a single machine  #152

@wuyujiji


Question

Hi, I recently built the OneFlow environment and ran the resnet50 benchmark from OneFlow-Benchmark. It runs successfully with 1 or 2 GPUs on a single machine, but hangs with 4 or 8 GPUs on a single machine.

Environment

gpu: Tesla V100 16GB
python:3.6
cuda: 10.0
cudnn: 7
oneflow: 0.2.0
OneFlow-benchmark: master@f09f31ea8c3da6a1cc193081eb544b92d8e504c2

Log info:
NUM_EPOCH=2
DATA_ROOT=/workdir/data/mini-imagenet/ofrecord

Running resnet50: num_gpu_per_node = 4, num_nodes = 1.

dtype = float32
gpu_num_per_node = 4
num_nodes = 1
node_ips = ['192.168.1.13', '192.168.1.14']
ctrl_port = 50051
model = resnet50
use_fp16 = None
use_xla = None
channel_last = None
pad_output = None
num_epochs = 2
model_load_dir = None
batch_size_per_device = 128
val_batch_size_per_device = 50
nccl_fusion_threshold_mb = 0
nccl_fusion_max_ops = 0
fuse_bn_relu = False
fuse_bn_add_relu = False
gpu_image_decoder = False
image_path = test_img/tiger.jpg
num_classes = 1000
num_examples = 1281167
num_val_examples = 50000
rgb_mean = [123.68, 116.779, 103.939]
rgb_std = [58.393, 57.12, 57.375]
image_shape = [3, 224, 224]
label_smoothing = 0.1
model_save_dir = ./output/snapshots/model_save-20201028202443
log_dir = ./output
loss_print_every_n_iter = 100
image_size = 224
resize_shorter = 256
train_data_dir = /workdir/data/mini-imagenet/ofrecord/train
train_data_part_num = 8
val_data_dir = /workdir/data/mini-imagenet/ofrecord/val
val_data_part_num = 8
optimizer = sgd
learning_rate = 1.024
wd = 3.0517578125e-05
momentum = 0.875
lr_decay = cosine
lr_decay_rate = 0.94
lr_decay_epochs = 2
warmup_epochs = 5
decay_rate = 0.9
epsilon = 1.0
gradient_clipping = 0.0

Time stamp: 2020-10-28-20:24:43
Loading data from /workdir/data/mini-imagenet/ofrecord/train
Optimizer: SGD
Loading data from /workdir/data/mini-imagenet/ofrecord/val

Then it hangs for a long time with no further output.
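Not part of the original report, but one common way to see where a hung multi-GPU run is stuck is to turn on NCCL's debug logging before relaunching train.sh (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables):

```shell
# Enable NCCL debug output; when a multi-GPU job hangs, the last
# INIT/COLL lines printed usually indicate which step or collective
# the ranks are stuck in.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_DEBUG_SUBSYS=$NCCL_DEBUG_SUBSYS"
# then relaunch the benchmark as before:
# sh train.sh
```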

To Reproduce

  1. build the OneFlow environment
    python3 -m pip install --find-links https://oneflow-inc.github.io/nightly oneflow_cu100
  2. clone the OneFlow-Benchmark source
    git clone https://github.com/Oneflow-Inc/OneFlow-Benchmark.git
  3. download the mini-imagenet dataset
    note: to run multi-GPU on one machine, I copied part-00000 into 8 pieces of data in the train and validation folders, respectively
  4. change the content of the shell script
    cd Classification/cnns/
    vim train.sh
    set --train_data_part_num=8
    set --val_data_part_num=8
    set gpu_num_per_node=4 # tried 1, 2, 4, and 8 GPUs in turn; 1 and 2 run normally, but 4 and 8 hang
  5. run the shell script
    sh train.sh
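The fan-out in step 3 can be sketched as follows. This is an illustrative reconstruction, not the exact commands from the report: the temp directory stands in for /workdir/data/mini-imagenet/ofrecord, and the file names follow the part-xxxxx convention that --train_data_part_num=8 / --val_data_part_num=8 expect (one part file per configured part):

```shell
# Fan a single ofrecord shard (part-00000) out into 8 identical parts
# for both the train and val splits.
DATA_ROOT=$(mktemp -d)             # stand-in for the real ofrecord root
mkdir -p "$DATA_ROOT/train" "$DATA_ROOT/val"
: > "$DATA_ROOT/train/part-00000"  # stand-in for the real shard file
: > "$DATA_ROOT/val/part-00000"

for split in train val; do
  for i in 1 2 3 4 5 6 7; do
    cp "$DATA_ROOT/$split/part-00000" "$DATA_ROOT/$split/part-0000$i"
  done
done
ls "$DATA_ROOT/train"   # part-00000 through part-00007
```

With real data, every part holds the same examples, which is fine for a hang/throughput repro but skews any accuracy numbers.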
