The image recognition and detection model on Fluid.

This work comes from our cooperation with the Visual Technology Department on Fluid. We need to build three models, covering image classification, object detection, and optical character recognition (OCR). They are:

- [SE-ResNeXt 152](https://arxiv.org/abs/1709.01507) on the ImageNet 2012 dataset.
- MobileNet-SSD on the MSCOCO dataset.
- OCR Model
    - CNN + RNN(GRU) + CTC model
    - CNN + RNN(GRU) + Attention model.

###  SE-ResNeXt 152
**The top-1 error on the ImageNet 2012 dataset must be less than 18.2%**.
TODOs:
- 1.) Add data augmentation operations
  - 1.1) [Random Crop](https://github.com/facebook/fb.resnet.torch/blob/master/datasets/imagenet.lua#L83) (random-sized crop with a random aspect ratio).
  - 1.2) [Color Jitter](https://github.com/facebook/fb.resnet.torch/blob/master/datasets/imagenet.lua#L83).
  - 1.3) [Lighting](https://github.com/facebook/fb.resnet.torch/blob/master/datasets/imagenet.lua#L83).
  - 1.4) [Color Normalize](https://github.com/facebook/fb.resnet.torch/blob/master/datasets/imagenet.lua#L90).
  - 1.5) [HorizontalFlip](https://github.com/facebook/fb.resnet.torch/blob/master/datasets/imagenet.lua#L91).

      All of these follow the Torch implementation at https://github.com/facebook/fb.resnet.torch, which the released papers also follow.

      **However, I'm not sure whether we need to implement these as C++ operators, since in the future all of the code will run on C++ and will not rely on Python. Or do we only need to implement them in Python for the reader?**

- 2.) Write model configuration for SE-ResNeXt.
    Besides the residual block, the SE-ResNeXt architecture contains the [squeeze-and-excitation (SE) block](https://arxiv.org/abs/1709.01507) and [aggregated transformations](https://arxiv.org/abs/1611.05431).
    - 2.1) SE-Block: 
         - Global average pooling + FC (or 1x1 conv) + ReLU + FC (or 1x1 conv) + Sigmoid
         - Scale Op (the elementwise_mul operator in Fluid).
         - About the global average pooling:
             From the [author's point of view](https://github.com/hujie-frank/SENet#note), our global pooling operator may also be less efficient. We need to either optimize it or just try `reduce_mean` first.
    - 2.2) Aggregated Transformations
        - This is implemented as a grouped convolution.
- 3.) Experiment
    The single-crop top-1 validation error **must** be less than 18.2% on ImageNet 2012. If multi-GPU support is not finished before the tasks above are done, the result can be verified on the CIFAR dataset first.
- 4.) Submit demo and report.
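
As a reference for item 2.1, here is a minimal NumPy sketch of the SE block computation (global average pooling, FC, ReLU, FC, Sigmoid, then channel-wise scaling). It is not Fluid code: the function name `se_block`, the weight shapes, and the reduction ratio of 4 are assumptions chosen for illustration. The squeeze step is written as a mean over the spatial axes, which is what trying `reduce_mean` in place of a global pooling operator would compute.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-excitation on an NCHW tensor (illustrative sketch only).

    x:  (N, C, H, W) feature map
    w1: (C, C // r), b1: (C // r,)   first FC layer
    w2: (C // r, C), b2: (C,)        second FC layer
    """
    # Squeeze: global average pooling is a mean over the spatial axes.
    squeezed = x.mean(axis=(2, 3))                  # (N, C)
    # Excitation: FC + ReLU, then FC + Sigmoid.
    hidden = np.maximum(squeezed @ w1 + b1, 0.0)    # (N, C // r)
    gate = sigmoid(hidden @ w2 + b2)                # (N, C), values in (0, 1)
    # Scale: broadcast the per-channel gate over H and W (elementwise multiply).
    return x * gate[:, :, None, None]


# Hypothetical sizes for a quick check.
rng = np.random.default_rng(0)
n, c, h, w, reduction = 2, 16, 7, 7, 4
x = rng.standard_normal((n, c, h, w)).astype(np.float32)
w1 = (rng.standard_normal((c, c // reduction)) * 0.1).astype(np.float32)
b1 = np.zeros(c // reduction, dtype=np.float32)
w2 = (rng.standard_normal((c // reduction, c)) * 0.1).astype(np.float32)
b2 = np.zeros(c, dtype=np.float32)

y = se_block(x, w1, b1, w2, b2)
print(y.shape)
```

The output has the same shape as the input, and because the gate lies in (0, 1), every activation is only scaled down, never amplified.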


The following two parts will continue to be edited to list more detailed subtasks.

----

###  MobileNet-SSD
- 1.) MobileNet
    - 1.1) depthwise-conv operator.
    - 1.2) ARM based depthwise-conv. 
- 2.) SSD architecture
    - Although these layers have been implemented in the old Paddle, I think we should survey object detection in other frameworks such as TensorFlow and then split the work into subtasks. I'm doing this now. In addition, beyond training, our goal is to deploy this model.
    - Done in https://github.com/PaddlePaddle/Paddle/issues/7402
- 3.) Data augmentation
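
For item 1.1, a depthwise convolution can be viewed as a grouped convolution whose number of groups equals the number of input channels. The naive NumPy sketch below illustrates that relationship under simplifying assumptions (stride 1, no padding, cross-correlation as in most deep learning frameworks). The names `grouped_conv2d` and `depthwise_conv2d` are hypothetical and are not Fluid operators; an optimized operator would not use Python loops.

```python
import numpy as np


def grouped_conv2d(x, weight, groups):
    """Naive grouped convolution (stride 1, no padding), for illustration.

    x:      (N, C_in, H, W)
    weight: (C_out, C_in // groups, kH, kW)
    """
    n, c_in, h, w = x.shape
    c_out, c_in_per_group, kh, kw = weight.shape
    assert c_in % groups == 0 and c_out % groups == 0
    assert c_in_per_group == c_in // groups
    c_out_per_group = c_out // groups
    out_h, out_w = h - kh + 1, w - kw + 1
    out = np.zeros((n, c_out, out_h, out_w), dtype=x.dtype)
    for g in range(groups):
        # Each group only sees its own slice of the input channels.
        x_g = x[:, g * c_in_per_group:(g + 1) * c_in_per_group]
        for oc in range(g * c_out_per_group, (g + 1) * c_out_per_group):
            for i in range(out_h):
                for j in range(out_w):
                    patch = x_g[:, :, i:i + kh, j:j + kw]
                    out[:, oc, i, j] = (patch * weight[oc]).sum(axis=(1, 2, 3))
    return out


def depthwise_conv2d(x, weight):
    """Depthwise convolution: one group per input channel.

    weight: (C_in * multiplier, 1, kH, kW)
    """
    return grouped_conv2d(x, weight, groups=x.shape[1])


# Hypothetical sizes: 3 channels, 3x3 kernel, channel c uses a kernel of (c + 1).
x_demo = np.ones((1, 3, 4, 4), dtype=np.float32)
w_demo = np.ones((3, 1, 3, 3), dtype=np.float32)
w_demo *= np.arange(1, 4, dtype=np.float32).reshape(3, 1, 1, 1)
y_demo = depthwise_conv2d(x_demo, w_demo)
print(y_demo.shape)
```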
---
### OCR Model
- 1.) CNN + RNN(GRU) + CTC
    - 1.1) Merge WarpCTC Op: https://github.com/PaddlePaddle/Paddle/pull/5107
    - 1.2) Merge BlockExpand Op: https://github.com/PaddlePaddle/Paddle/pull/4866
    - 1.3) Merge edit distance Op: https://github.com/PaddlePaddle/Paddle/pull/5300
    - 1.4) Fix the bug of gradient operator of WarpCTC.
    - 1.5) Python API of WarpCTC Op.
    - 1.6) Python API of block_expand Op.
    - 1.7) Python API of edit distance op.
    - 1.8) Implement data reader and support variable-length images across mini-batch.
    - 1.9) Write demo and train model.
- 2.) CNN + RNN(GRU) + Spatial Attention
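
To make the evaluation side of the CTC model concrete, the pure-Python sketch below shows standard best-path (greedy) CTC decoding followed by a Levenshtein edit distance between the decoded sequence and the reference. It is only an illustration of what the WarpCTC and edit distance operators are meant to support, not their implementation. The function names and the choice of 0 as the blank label are assumptions.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse consecutive repeated labels, then drop blanks.

    frame_labels: per-frame argmax labels from the network (assumed input).
    blank: index of the CTC blank label (assumed to be 0 here).
    """
    decoded = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def edit_distance(hyp, ref):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# A blank between two identical labels keeps both of them.
frames = [0, 1, 1, 0, 1, 2, 2, 0]
hypothesis = ctc_greedy_decode(frames)
print(hypothesis, edit_distance(hypothesis, [1, 2]))
```

Dividing the edit distance by the reference length gives a normalized error rate, which is a common way to report OCR accuracy.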
  
  
  
