v2 API cluster training core dump #2710

@typhoonzero

v2 distributed training fails with the following error:

Mon Jul  3 20:01:59 2017[1,0]<stdout>:Pass 0, Batch 0, Cost 1.157251, {'__auc_evaluator_0__': 0.5791015625, 'classification_error_evaluator': 0.4375}
Mon Jul  3 20:01:59 2017[1,0]<stderr>:*** Aborted at 1499083319 (unix time) try "date -d @1499083319" if you are using GNU date ***
Mon Jul  3 20:01:59 2017[1,0]<stderr>:PC: @                0x0 (unknown)
Mon Jul  3 20:01:59 2017[1,0]<stderr>:*** SIGSEGV (@0x8) received by PID 1227 (TID 0x7f492b5fe700) from PID 8; stack trace: ***
Mon Jul  3 20:01:59 2017[1,0]<stderr>:    @     0x7f495e753160 (unknown)
Mon Jul  3 20:01:59 2017[1,0]<stderr>:    @     0x7f49586f9972 paddle::ProtoClient::recv()
Mon Jul  3 20:01:59 2017[1,0]<stderr>:    @     0x7f4958f16126 paddle::ParameterClient2::sendParallel()
Mon Jul  3 20:01:59 2017[1,0]<stderr>:    @     0x7f4958801a5c _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Mon Jul  3 20:01:59 2017[1,0]<stderr>:    @     0x7f4957c8a8a0 execute_native_thread_routine
Mon Jul  3 20:01:59 2017[1,0]<stderr>:    @     0x7f495e74b1c3 start_thread
Mon Jul  3 20:01:59 2017[1,0]<stderr>:    @     0x7f495dd7312d __clone
Mon Jul  3 20:01:59 2017[1,0]<stderr>:    @                0x0 (unknown)
100  406k    0  406k    0     0  3454k      0 --:--:-- --:--:-- --:--:-- 3476k
Mon Jul  3 20:01:59 2017[1,0]<stderr>:./train.sh: line 239:  1227 Segmentation fault      python27-gcc482/bin/python conf/trainer_config.conf
