Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
Recognizing objects at vastly different scales is a fundamental challenge in computer vision. Feature pyramids built upon image pyramids (for short we call these featurized image pyramids) form the basis of a standard solution [1] (Fig. 1(a)). These pyramids are scale-invariant in the sense that an object's scale change is offset by shifting its level in the pyramid. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and pyramid levels.

Featurized image pyramids were heavily used in the era of hand-engineered features [5, 25]. They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave). For recognition tasks, engineered features have largely been replaced with features computed by deep convolutional networks (ConvNets) [19, 20]. Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)). But even with this robustness, pyramids are still needed to get the most accurate results. All recent top entries in the ImageNet [33] and COCO [21] detection challenges use multi-scale testing on featurized image pyramids (e.g., [16, 35]). The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.

Nevertheless, featurizing each level of an image pyramid has obvious limitations. Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications. Moreover, training deep networks end-to-end on an image pyramid is infeasible in terms of memory, and so, if exploited, image pyramids are used only at test time, which creates an inconsistency between train/test-time inference.
I watched and learned a lot from this, so I am referencing it here. I have also reorganized the video's content in my own way, following its timeline.
The early layers have high resolution, but each individual pixel may not mean much on its own. (A very small object might fit in a single pixel, but most objects span multiple pixels rather than sitting inside one, so that case is rare.) Excluding the FC layers, the deeper you go, the lower the resolution becomes, but the depth grows and the information carried by each pixel becomes very meaningful.
Here, a pixel means the value x[:, p_x, p_y] when viewing x.shape = (512, w, h). Also, let's leave the FC layers out of the discussion.
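To make this concrete, here is a minimal sketch of what indexing one such "pixel" looks like in PyTorch (the spatial size and indices are arbitrary examples):

```python
import torch

# A feature map with 512 channels: one spatial position (p_x, p_y)
# holds a 512-dimensional descriptor, which is the "pixel" meant above.
x = torch.randn(512, 13, 13)  # (C, w, h); 13x13 is an illustrative size
p_x, p_y = 4, 7
pixel = x[:, p_x, p_y]        # the full channel vector at that position
print(pixel.shape)            # torch.Size([512])
```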

This is the representative figure that comes up when you search for FPN. I will explain based on it.
As explained in the background above, there are two ways to cover large scale variation: one is to enlarge the window box, the other is to adjust the image size. The featurized image pyramid is the latter: it changes the image size and predicts at each size. However, this consumes a lot of resources and cannot be expected to be fast, so other approaches began to appear. This approach was used by OverFeat and by detectors built on hand-engineered features such as HOG and SIFT.
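As a rough sketch of the featurized-image-pyramid idea (not any specific model's code; `backbone` and `head` are hypothetical placeholders), the whole network is simply rerun on every resized copy of the image, which is exactly the cost problem described above:

```python
import torch.nn.functional as F

def detect_on_image_pyramid(image, backbone, head, scales=(0.5, 1.0, 2.0)):
    """Resize the image to each scale and run the full network every time."""
    detections = []
    for s in scales:
        resized = F.interpolate(image, scale_factor=s,
                                mode="bilinear", align_corners=False)
        feats = backbone(resized)       # features recomputed per scale: slow
        detections.append(head(feats))  # predictions at this scale
    return detections
```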

"그러면 Large Scale을 적용하는 방법으로 Image 크기를 줄이는 법이 있는데 VGG같은 경우 일정 비율로 pooling 되니까 이를 stage마다 적용하는 방법이 있지 않을까?" 에서 시작했다고 생각한다. 가장 대표적인 모델은 SSD가 있다.
But as mentioned earlier, high resolution means low-level features and low resolution means high-level features. Since prediction happens during the forward pass, this is good in terms of cost. However, good features cannot be obtained at high resolution, so the small objects that are only observable at high resolution will be hard to detect.
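A rough sketch of this SSD-style idea (the `stages` and `heads` modules are hypothetical placeholders, not SSD's actual code): predictions are taken from intermediate feature maps during a single forward pass.

```python
import torch.nn as nn

class MultiStageDetector(nn.Module):
    """Predict from several stages of one forward pass, SSD-style."""
    def __init__(self, stages, heads):
        super().__init__()
        self.stages = nn.ModuleList(stages)  # e.g. VGG blocks, each ends in pooling
        self.heads = nn.ModuleList(heads)    # one prediction head per stage

    def forward(self, x):
        outputs = []
        for stage, head in zip(self.stages, self.heads):
            x = stage(x)             # resolution drops, semantic level rises
            outputs.append(head(x))  # early (high-res) maps are semantically weak
        return outputs
```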
So one term to take away here is the semantic gap.

That is why FPN came out. There are three characteristics to look at. (I will explain with the YOLOv3 model; to be honest, it does not seem to be a perfect FPN.)
This part is the downsampling (bottom-up) process: the resolution is lowered while the feature level rises. The FPN paper explains it using ResNet as the backbone.
This part is the upsampling (top-down) process: the resolution is raised while the features are carried along. You can check the official nn.Upsample documentation, but briefly, nearest mode works as shown below.
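A quick check of this behavior: with scale_factor=2 and mode="nearest", each input value is simply copied into a 2x2 block.

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])  # shape (1, 1, 2, 2): batch, channel, h, w
up = nn.Upsample(scale_factor=2, mode="nearest")
print(up(x).squeeze())
# tensor([[1., 1., 2., 2.],
#         [1., 1., 2., 2.],
#         [3., 3., 4., 4.],
#         [3., 3., 4., 4.]])
```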

This part is where it differs from YOLOv3. FPN uses + (element-wise addition), but YOLOv3 performs a concat along the depth (channel) dimension.
I will look into what difference this choice actually makes.
In the paper, a 1x1 conv first matches the output channels so that the + operation below can be applied; the addition is therefore performed between tensors of the same shape.
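Putting the last two points together, here is a hedged sketch of one top-down merge step (channel counts are illustrative, and the layers around YOLOv3's concat differ somewhat in the real network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

top_down  = torch.randn(1, 256, 13, 13)  # coarser, semantically strong map
bottom_up = torch.randn(1, 512, 26, 26)  # finer backbone map

lateral = nn.Conv2d(512, 256, kernel_size=1)  # 1x1 conv matches channel counts
upsampled = F.interpolate(top_down, scale_factor=2, mode="nearest")

fpn_merge  = upsampled + lateral(bottom_up)                     # FPN: element-wise +
yolo_merge = torch.cat([upsampled, lateral(bottom_up)], dim=1)  # YOLOv3: depth concat

print(fpn_merge.shape)   # torch.Size([1, 256, 26, 26])
print(yolo_merge.shape)  # torch.Size([1, 512, 26, 26])
```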

This content comes from Faster R-CNN. YOLOv2 took its anchors from this paper.
3.1 Region Proposal Networks

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. We model this process with a fully convolutional network [7], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers. In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers, and the Simonyan and Zisserman model [3] (VGG-16), which has 13 shareable convolutional layers.

To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU [33] following). This feature is fed into two sibling fully-connected layers, a box-regression layer (reg) and a box-classification layer (cls). We use n = 3 in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated at a single position in Figure 3 (left). Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n × n convolutional layer followed by two sibling 1 × 1 convolutional layers (for reg and cls, respectively).

3.1.1 Anchors

At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate probability of object or not object for each proposal. The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a convolutional feature map of a size W × H (typically ∼2,400), there are WHk anchors in total.
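To sanity-check the anchor arithmetic quoted above: the scale and ratio values below follow the paper's default of 3 scales and 3 aspect ratios, but treat this as an illustrative sketch rather than the paper's exact code.

```python
import itertools

scales = [128, 256, 512]       # anchor side lengths in pixels (areas 128^2, ...)
ratios = [0.5, 1.0, 2.0]       # height/width aspect ratios
k = len(scales) * len(ratios)  # k = 9 anchors per sliding position

# (w, h) pairs that keep the area ~ scale^2 while varying the aspect ratio
anchors = [(s / r ** 0.5, s * r ** 0.5)
           for s, r in itertools.product(scales, ratios)]

W, H = 60, 40                  # feature map size, W*H ~ 2,400 as in the paper
print(k, W * H * k)            # 9 anchors per position, 21,600 anchors in total
```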