This review summarizes the progress made in deep-learning-based object detection from 2014 to January 2019, including:

More than 250 key contributions are included in this survey, covering many aspects of generic object detection research: leading detection frameworks and fundamental subproblems including object feature representation, object proposal generation, context information modeling and training strategies; evaluation issues, specifically benchmark datasets, evaluation metrics, and state-of-the-art performance.

The main purpose of this article is to extract some of the paper's important figures and conclusions as an index for systematic study, without going into detail.

The following two pictures are from GitHub: a paper list and a performance table. The papers marked in red are the ones the author considers must-reads.

Object detection tasks and challenges

The input of an object detection task is an image, and the output is the position and category of each object in the image, as shown in the figure below. The position can be described by a bounding box or by a set of pixels.
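The two ways of describing a location are closely related: a pixel set can always be reduced to its tightest enclosing box. The sketch below illustrates this with NumPy. The function name `mask_to_bbox`, the `(x_min, y_min, x_max, y_max)` convention with inclusive pixel indices, and the example label are assumptions made for illustration, not something defined in the survey.

```python
import numpy as np

def mask_to_bbox(mask):
    """Return the tightest (x_min, y_min, x_max, y_max) box around a binary mask.

    Coordinates are inclusive pixel indices. Returns None for an empty mask.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Hypothetical 6x8 image with one object covering rows 1..3 and columns 2..5.
mask = np.zeros((6, 8), dtype=bool)
mask[1:4, 2:6] = True

# One detection = a category plus a location (box and/or pixel set).
detection = {"label": "cat", "bbox": mask_to_bbox(mask), "mask": mask}
```

The reverse direction does not hold: a box alone says nothing about which pixels inside it belong to the object, which is why pixel-level descriptions carry more information.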

Determining the position and category of the objects in an image poses many challenges. A good detector must localize accurately, classify accurately, and run efficiently. It must be robust to variations in illumination, deformation, scale, viewpoint, size, pose, occlusion, blur, and noise. It must tolerate large intra-class variation while still distinguishing small inter-class differences, all while remaining efficient.

Summary of object detection methods

Before 2012, object detection methods mainly combined handcrafted features with classifiers; after 2012, DCNN-based methods became dominant, as shown in the following figure:

Object detection frameworks can be divided into 2 categories:

Two-stage detection framework: Includes a region proposal step. It first obtains ROIs, then classifies each ROI and regresses its bounding box. Represented by the RCNN series of methods.
One-stage detection framework: Does not include a region proposal step. The whole image is divided into a grid, and classification and regression are performed for each grid cell. Represented by the YOLO series of methods.
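To make the grid idea of one-stage detectors concrete, the sketch below maps a ground-truth box to the grid cell that contains its center, which is how the original YOLO assigns objects to cells. The function name `responsible_cell` and the 448x448 image with a 7x7 grid are illustrative assumptions, and the box format is assumed to be `(x_min, y_min, x_max, y_max)` in pixels.

```python
def responsible_cell(box, image_w, image_h, grid_size):
    """Return (row, col) of the grid cell that contains the center of the box."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Clamp so that a center lying exactly on the right/bottom edge stays in the grid.
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return row, col

# Hypothetical example: a 448x448 image divided into a 7x7 grid.
cell = responsible_cell((100, 200, 300, 400), 448, 448, 7)
```

Each cell then predicts class scores and box offsets directly, so no separate ROI generation step is needed. A two-stage detector would instead run a proposal step first and apply the classifier and regressor to each proposed ROI.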

A comparison of the pipelines and their evolution is shown below:

The backbone network, the design of the detection framework, and large-scale, high-quality datasets are the three most important factors determining detection performance. Together they determine how well features are learned and how well they are used.

Basic sub-problems:

Key points discussed in this section include DCNN-based feature representation, candidate region generation, contextual information, and training strategies.

DCNN-based feature representation

Network backbone

ILSVRC (ImageNet Large Scale Visual Recognition Competition) has greatly promoted the improvement of DCNN architectures. Across the various computer vision tasks, these classic networks are often used as backbones, with much follow-up work built on top of them. The DCNN architectures commonly used in object detection tasks are as follows:

Methods For Improving Object Representation

The size of an object in an image is not known in advance, and different objects in the same image may differ in size. The deeper the DCNN layer, the larger its receptive field, so it is difficult to achieve optimal predictions from a single layer alone. A natural idea is to make predictions using information extracted from different layers. This is called multiscale object detection, and it can be divided into 3 categories:

Detecting with combined features of multiple CNN layers;
Detecting at multiple CNN layers;
Combinations of the above two methods.
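The three categories can be sketched on toy feature maps with NumPy. This is only an illustration under simplifying assumptions: the arrays `c3` and `c4` stand in for the outputs of a shallower and a deeper layer, both are assumed to have the same channel count already (real networks typically need an extra projection to match channels), and fusion is done by nearest-neighbor upsampling plus element-wise addition. The helper names are hypothetical.

```python
import numpy as np

def upsample_nearest(feature, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map by an integer factor."""
    return feature.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_top_down(shallow, deep):
    """Add an upsampled deep map to a shallow map that has the same channel count."""
    factor = shallow.shape[1] // deep.shape[1]
    return shallow + upsample_nearest(deep, factor)

rng = np.random.default_rng(0)
c3 = rng.standard_normal((8, 16, 16))  # shallower layer: higher resolution
c4 = rng.standard_normal((8, 8, 8))    # deeper layer: lower resolution, larger receptive field

# (1) Combined features: fuse the layers, then predict from the single fused map.
combined = fuse_top_down(c3, c4)

# (2) Multiple layers: predict from each layer separately, without fusion.
per_layer = [c3, c4]

# (3) Combination of both: predict at several levels, with lower levels enriched top-down.
pyramid = [fuse_top_down(c3, c4), c4]
```

In all three cases a detection head would then run on each map in the resulting list. The difference lies only in how many maps there are and whether they have been fused.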

The figure makes this easier to see:

Modeling geometric deformation is another way to improve object representation. Methods include Deformable Part-based Models (DPMs) and Deformable Convolutional Networks (DCN).

Context Modeling

Context information can be divided into 3 categories:

Semantic context: The likelihood of an object to be found in some scenes but not in others;
Spatial context: The likelihood of finding an object in some position and not others with respect to other objects in the scene;
Scale context: Objects have a limited set of sizes relative to other objects in the scene.

DCNNs may already use contextual information implicitly, because they learn features at different levels of abstraction. For this reason, current state-of-the-art object detection methods do not use contextual information explicitly. Recently, however, some DCNN-based methods have used it explicitly, and they can be divided into 2 categories: global context and local context.

In a sense, this can be seen as a form of ensemble learning at the data level.

Detection Proposal Methods

A two-stage detection framework needs to generate ROIs.

Methods for generating ROIs can be divided into Bounding Box Proposal Methods and Object Segment Proposal Methods. The former describe each ROI with a regressed bounding box, and the latter describe each ROI as a segmented set of pixels.
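For intuition about what a box-shaped candidate region looks like, the sketch below enumerates dense sliding windows over an image. This is the crudest possible baseline and an assumption chosen for illustration; it is not one of the proposal methods covered by the survey, which produce far fewer and better-targeted candidates. The function name and parameters are hypothetical.

```python
def sliding_window_proposals(image_w, image_h, window_sizes, stride):
    """Enumerate (x_min, y_min, x_max, y_max) windows that fit fully inside the image."""
    proposals = []
    for w, h in window_sizes:
        for y in range(0, image_h - h + 1, stride):
            for x in range(0, image_w - w + 1, stride):
                proposals.append((x, y, x + w, y + h))
    return proposals

# Hypothetical 8x8 image, a single 4x4 window shape, stride 2.
boxes = sliding_window_proposals(8, 8, [(4, 4)], 2)
```

Even this tiny example yields 9 candidates for one window shape, and the count grows quickly with image size and the number of shapes, which is part of the motivation for learned or segmentation-based proposal methods.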

Other Special Issues

Data augmentation yields more robust feature representations and can be regarded as ensemble learning at the data level. Because objects may appear at large or small scales, rescaling is the most widely used data augmentation method.
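A minimal sketch of scale augmentation is given below, assuming boxes in `(x_min, y_min, x_max, y_max)` pixel coordinates and nearest-neighbor resampling for simplicity. The function name `rescale` and the set of scale factors are illustrative assumptions. The key point is that the box annotations must be scaled by the same factor as the image.

```python
import numpy as np

def rescale(image, boxes, factor):
    """Resize an (H, W) or (H, W, C) image by nearest-neighbor sampling and scale boxes to match."""
    h, w = image.shape[:2]
    new_h = max(1, round(h * factor))
    new_w = max(1, round(w * factor))
    # For each output pixel, pick the source pixel it maps back to.
    rows = np.minimum((np.arange(new_h) / factor).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / factor).astype(int), w - 1)
    resized = image[rows][:, cols]
    scaled_boxes = np.asarray(boxes, dtype=float) * factor
    return resized, scaled_boxes

# Hypothetical 4x4 image with one annotated box, augmented at three scales.
image = np.arange(16).reshape(4, 4)
boxes = [[1, 1, 3, 3]]
augmented = [rescale(image, boxes, f) for f in (0.5, 1.0, 2.0)]
```

During training, a scale factor would typically be drawn at random per image, so the detector sees the same objects at several sizes.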