Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
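To make the abstract's "top-down architecture with lateral connections" concrete, here is a minimal PyTorch sketch of how an FPN merges backbone feature maps. The ResNet-style channel widths (256/512/1024/2048), the common 256-d output width, and nearest-neighbor upsampling are my assumptions for illustration, not details stated in the abstract.

```python
# A minimal FPN sketch, assuming hypothetical backbone maps c2..c5
# (fine -> coarse, ResNet-like strides 4, 8, 16, 32).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNSketch(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs reduce every backbone stage to a common width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        # 3x3 convs after merging smooth the upsampled-and-added maps.
        self.output = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):  # feats: [c2, c3, c4, c5]
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample the coarser map, add the lateral connection.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [conv(x) for conv, x in zip(self.output, laterals)]  # [p2..p5]
```

Every output level has the same channel width, which is what lets a single detection head run on all pyramid levels.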
Recognizing objects at vastly different scales is a fundamental challenge in computer vision. Feature pyramids built upon image pyramids (for short we call these featurized image pyramids) form the basis of a standard solution [1] (Fig. 1(a)). These pyramids are scale-invariant in the sense that an object's scale change is offset by shifting its level in the pyramid. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and pyramid levels.

Featurized image pyramids were heavily used in the era of hand-engineered features [5, 25]. They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave). For recognition tasks, engineered features have largely been replaced with features computed by deep convolutional networks (ConvNets) [19, 20]. Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)). But even with this robustness, pyramids are still needed to get the most accurate results. All recent top entries in the ImageNet [33] and COCO [21] detection challenges use multi-scale testing on featurized image pyramids (e.g., [16, 35]). The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.

Nevertheless, featurizing each level of an image pyramid has obvious limitations. Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications. Moreover, training deep networks end-to-end on an image pyramid is infeasible in terms of memory, so image pyramids are used only at test time, which creates an inconsistency between train/test-time inference.
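A tiny sketch of what "featurizing each level of an image pyramid" means in code: the same extractor is run independently on each rescaled copy of the image, which is exactly why inference cost grows with the number of levels. The scale set and the toy extractor below are hypothetical.

```python
# A minimal featurized image pyramid (Fig. 1(a)), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

def featurized_image_pyramid(image, extractor, scales=(0.5, 1.0, 2.0)):
    """image: (1, 3, H, W) tensor; extractor: any ConvNet feature extractor."""
    features = []
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode="bilinear",
                                align_corners=False)
        features.append(extractor(resized))  # one full forward pass per level
    return features

# Hypothetical usage with a toy extractor:
toy = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
pyramid = featurized_image_pyramid(torch.randn(1, 3, 224, 224), toy)
```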
I learned a great deal from this material and reference it here. I have also restructured the video's content based on its timeline.
This is the representative figure that comes up when you search for FPN. The explanation below is based on it.
The following passage is from the Faster R-CNN paper. The anchors in YOLOv2 were adopted from this paper.
3.1 Region Proposal Networks

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. We model this process with a fully convolutional network [7], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers. In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers, and the Simonyan and Zisserman model [3] (VGG-16), which has 13 shareable convolutional layers.

To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU [33] following). This feature is fed into two sibling fully-connected layers: a box-regression layer (reg) and a box-classification layer (cls). We use n = 3 in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated at a single position in Figure 3 (left). Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n × n convolutional layer followed by two sibling 1 × 1 convolutional layers (for reg and cls, respectively).

3.1.1 Anchors

At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate probability of object or not object for each proposal. The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a convolutional feature map of a size W × H (typically ∼2,400), there are WHk anchors in total.
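The "n × n conv followed by two sibling 1 × 1 convs" formulation maps directly to a few lines of code. Below is a minimal PyTorch sketch assuming VGG-16-style 512-d features and k = 9; the class and parameter names are mine, not from the paper.

```python
# A minimal RPN head sketch: a 3x3 conv produces the shared feature,
# then two sibling 1x1 convs predict boxes and objectness at every location.
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, mid_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, 3, padding=1)
        self.reg = nn.Conv2d(mid_channels, 4 * k, 1)  # 4k box coordinates
        self.cls = nn.Conv2d(mid_channels, 2 * k, 1)  # 2k object / not-object scores

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))  # shared "sliding window" feature
        return self.reg(x), self.cls(x)         # predictions at all W x H positions
```

Because the 3 × 3 and 1 × 1 convolutions slide over the whole map, the same weights are applied at every spatial location, which is exactly the weight sharing the paper describes.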
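And a small sketch of how 3 scales × 3 aspect ratios yield k = 9 anchors per position and WHk anchors in total. I assume the common equal-area parameterization (w = s√r, h = s/√r), a stride-16 feature map, and the scale/ratio values typically used with Faster R-CNN; the function name and conventions are hypothetical.

```python
# Generate W*H*k anchors as (x1, y1, x2, y2) boxes in image coordinates.
import itertools
import numpy as np

def generate_anchors(W, H, stride=16, scales=(128, 256, 512),
                     ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y, x in itertools.product(range(H), range(W)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # window center on the image
        for s, r in itertools.product(scales, ratios):
            w, h = s * np.sqrt(r), s / np.sqrt(r)        # keep area ~ s**2
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

print(generate_anchors(4, 4).shape)  # (4 * 4 * 9, 4) = (144, 4)
```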