Description
Question
Hi, I recently built the OneFlow environment and ran the ResNet-50 example from OneFlow-Benchmark. It runs successfully with 1 GPU and with 2 GPUs on a single machine, but hangs with 4 GPUs and with 8 GPUs on a single machine.
Environment
gpu: Tesla V100 16 GB
python:3.6
cuda: 10.0
cudnn: 7
oneflow: 0.2.0
OneFlow-benchmark: master@f09f31ea8c3da6a1cc193081eb544b92d8e504c2
Log info:
NUM_EPOCH=2
DATA_ROOT=/workdir/data/mini-imagenet/ofrecord
Running resnet50: num_gpu_per_node = 4, num_nodes = 1.
dtype = float32
gpu_num_per_node = 4
num_nodes = 1
node_ips = ['192.168.1.13', '192.168.1.14']
ctrl_port = 50051
model = resnet50
use_fp16 = None
use_xla = None
channel_last = None
pad_output = None
num_epochs = 2
model_load_dir = None
batch_size_per_device = 128
val_batch_size_per_device = 50
nccl_fusion_threshold_mb = 0
nccl_fusion_max_ops = 0
fuse_bn_relu = False
fuse_bn_add_relu = False
gpu_image_decoder = False
image_path = test_img/tiger.jpg
num_classes = 1000
num_examples = 1281167
num_val_examples = 50000
rgb_mean = [123.68, 116.779, 103.939]
rgb_std = [58.393, 57.12, 57.375]
image_shape = [3, 224, 224]
label_smoothing = 0.1
model_save_dir = ./output/snapshots/model_save-20201028202443
log_dir = ./output
loss_print_every_n_iter = 100
image_size = 224
resize_shorter = 256
train_data_dir = /workdir/data/mini-imagenet/ofrecord/train
train_data_part_num = 8
val_data_dir = /workdir/data/mini-imagenet/ofrecord/val
val_data_part_num = 8
optimizer = sgd
learning_rate = 1.024
wd = 3.0517578125e-05
momentum = 0.875
lr_decay = cosine
lr_decay_rate = 0.94
lr_decay_epochs = 2
warmup_epochs = 5
decay_rate = 0.9
epsilon = 1.0
gradient_clipping = 0.0
Time stamp: 2020-10-28-20:24:43
Loading data from /workdir/data/mini-imagenet/ofrecord/train
Optimizer: SGD
Loading data from /workdir/data/mini-imagenet/ofrecord/val
Then it hangs for a long time with no further output.
To Reproduce
- Install the OneFlow environment:
  python3 -m pip install --find-links https://oneflow-inc.github.io/nightly oneflow_cu100
- Clone the OneFlow-Benchmark source:
  git clone https://github.com/Oneflow-Inc/OneFlow-Benchmark.git
- Download the mini-imagenet dataset.
  Note: to run multi-GPU on one machine, I copied part-00000 into 8 pieces of data in the train and validation folders, respectively.
- Change the content of the shell script:
  cd Classification/cnns/
  vim train.sh
  set --train_data_part_num=8
  set --val_data_part_num=8
  set gpu_num_per_node=4  # tried 1, 2, 4, and 8 GPUs; 1 and 2 run normally, but 4 and 8 hang
- Run the shell script:
  sh train.sh
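The step above about copying part-00000 into 8 pieces in each split folder can be sketched as a small shell helper. The function name `make_parts` and the part-%05d naming pattern below are assumptions based on the ofrecord layout shown in the log, not part of the original report:

```shell
#!/bin/sh
# Hypothetical helper: duplicate part-00000 into n part files
# (part-00000 .. part-0000{n-1}) inside one split directory, so that
# --train_data_part_num / --val_data_part_num can be set to n.
make_parts() {
  dir="$1"; n="$2"
  i=1
  while [ "$i" -lt "$n" ]; do
    cp "$dir/part-00000" "$dir/$(printf 'part-%05d' "$i")"
    i=$((i + 1))
  done
}

# Example usage (paths taken from the report):
# make_parts /workdir/data/mini-imagenet/ofrecord/train 8
# make_parts /workdir/data/mini-imagenet/ofrecord/val 8
```

Note that all 8 parts are byte-identical copies, so the "8-part" dataset carries no more data than the original single part; it only satisfies the part-count the data loader expects.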