**(UDT): Unsupervised Deep Tracking**
**Authors**: Ning Wang, Yibing Song, Chao Ma, Wengang Zhou, Wei Liu, Houqiang Li
**arXiv Link**: https://arxiv.org/abs/1904.01828
**Project Link**: https://github.com/594422814/UDT
**Summary**: Trains a robust Siamese network on large-scale unlabeled videos in an unsupervised, forward-and-backward manner: the tracker forward-localizes the target object in successive frames and backtraces to its initial position in the first frame.
**Highlights**: Unsupervised learning

**(TADT): Target-Aware Deep Tracking**
**Authors**: Xin Li, Chao Ma, Baoyuan Wu, Zhenyu He, Ming-Hsuan Yang
**arXiv Link**: https://arxiv.org/abs/1904.01772
**Project Link**: https://xinli-zn.github.io/TADT-project-page/
**Summary**: Targets of interest can be of arbitrary object classes with arbitrary forms, and pre-trained deep features are less effective at modeling such targets for distinguishing them from the background. TADT learns target-aware features, and can thus recognize targets undergoing significant appearance variations better than pre-trained deep features.
**Highlights**: Target-aware features, better discrimination

**(SiamMask): Fast Online Object Tracking and Segmentation: A Unifying Approach**
**Authors**: Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, Philip H.S. Torr
**arXiv Link**: https://arxiv.org/abs/1812.05050
**Project Link**: https://github.com/foolwood/SiamMask
**Zhihu Link**: https://zhuanlan.zhihu.com/p/58154634
**Summary**: Performs both visual object tracking and semi-supervised video object segmentation, in real time, with a single simple approach.
**Highlights**: Mask prediction in tracking

**SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks**
**Authors**: Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, Junjie Yan
**arXiv Link**: https://arxiv.org/abs/1812.11703
**Project Link**: http://bo-li.info/SiamRPN++/
**Summary**: SiamRPN++ breaks the translation-invariance restriction through a simple yet effective spatial-aware sampling strategy. It performs depth-wise and layer-wise aggregations, improving accuracy while also reducing model size. Current state-of-the-art on OTB2015, VOT2018, UAV123, LaSOT, and TrackingNet.
**Highlights**: Deep backbones, state-of-the-art

**(CIR/SiamDW): Deeper and Wider Siamese Networks for Real-Time Visual Tracking**
**Authors**: Zhipeng Zhang, Houwen Peng
**arXiv Link**: https://arxiv.org/abs/1901.01660
**Project Link**: https://github.com/researchmm/SiamDW
**Summary**: SiamDW explores utilizing deeper and wider network backbones from another angle - careful designs of residual units, considering receptive field, stride, and output feature size - to eliminate the negative impact of padding in deep network backbones.
**Highlights**: Cropping-Inside-Residual, eliminating the negative impact of padding

**(SiamC-RPN): Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking**
**Authors**: Heng Fan, Haibin Ling
**arXiv Link**: https://arxiv.org/abs/1812.06148
**Project Link**: None
**Summary**: Previously proposed one-stage Siamese-RPN trackers degenerate in the presence of similar distractors and large scale variation. Advantages: 1) each RPN in Siamese C-RPN is trained using outputs of the previous RPN, thus simulating hard negative sampling; 2) feature transfer blocks (FTB) further improve discriminability; 3) the location and shape of the target are progressively refined by each RPN, resulting in better localization.
**Highlights**: Cascaded RPN, excellent accuracy

**SPM-Tracker: Series-Parallel Matching for Real-Time Visual Object Tracking**
**Authors**: Guangting Wang, Chong Luo, Zhiwei Xiong, Wenjun Zeng
**arXiv Link**: https://arxiv.org/abs/1904.04452
**Project Link**: None
**Summary**: To meet the simultaneous requirements on robustness and discrimination power, SPM-Tracker tackles the challenge by connecting a coarse matching stage and a fine matching stage, taking advantage of both, resulting in superior performance and exceeding other real-time trackers by a notable margin.
**Highlights**: Coarse matching & fine matching

**ATOM: Accurate Tracking by Overlap Maximization**
**Authors**: Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, Michael Felsberg
**arXiv Link**: https://arxiv.org/abs/1811.07628
**Project Link**: https://github.com/visionml/pytracking
**Summary**: Target estimation is a complex task, requiring high-level knowledge about the object, while most trackers only resort to a simple multi-scale search. In contrast, ATOM estimates target states by predicting the overlap between the target object and an estimated bounding box. In addition, a classification component is trained online to guarantee high discriminative power in the presence of distractors.
**Highlights**: Overlap IoU prediction

**(GCT): Graph Convolutional Tracking**
**Authors**: Junyu Gao, Tianzhu Zhang, Changsheng Xu
**arXiv Link**: None
**PDF Link**: http://openaccess.thecvf.com/content_CVPR_2019/papers/Gao_Graph_Convolutional_Tracking_CVPR_2019_paper.pdf
**Project Link**: http://nlpr-web.ia.ac.cn/mmc/homepage/jygao/gct_cvpr2019.html
**Summary**: Spatial-temporal information can provide diverse features to enhance the target representation. GCT incorporates 1) a spatial-temporal GCN to model the structured representation of historical target exemplars, and 2) a context GCN to exploit the context of the current frame to learn adaptive features for target localization.
**Highlights**: Graph convolutional networks, spatial-temporal information

**(ASRCF): Visual Tracking via Adaptive Spatially-Regularized Correlation Filters**
**Authors**: Kenan Dai, Dong Wang, Huchuan Lu, Chong Sun, Jianhua Li
**arXiv Link**: None
**Project Link**: https://github.com/Daikenan/ASRCF (To be updated)
**Summary**: ASRCF simultaneously optimizes the filter coefficients and the spatial regularization weight. It applies two correlation filters (CFs) to estimate location and scale respectively: 1) a location CF model, which exploits ensembles of shallow and deep features to determine the optimal position accurately, and 2) a scale CF model, which works on multi-scale shallow features to estimate the optimal scale efficiently.
**Highlights**: Estimate location and scale respectively

**(RPCF): RoI Pooled Correlation Filters for Visual Tracking**
**Authors**: Yuxuan Sun, Chong Sun, Dong Wang, You He, Huchuan Lu
**arXiv Link**: None
**Project Link**: None
**PDF Link**: http://openaccess.thecvf.com/content_CVPR_2019/papers/Sun_ROI_Pooled_Correlation_Filters_for_Visual_Tracking_CVPR_2019_paper.pdf
**Summary**: RoI-based pooling can be equivalently achieved by enforcing additional constraints on the learned filter weights, and thus becomes feasible on the virtual circular samples. By incorporating RoI pooling into the correlation filter formulation, RPCF performs favourably against other state-of-the-art trackers.
**Highlights**: RoI pooling in correlation filters

**(TBA): Tracking by Animation: Unsupervised Learning of Multi-Object Attentive Trackers**
**Authors**: Zhen He, Jian Li, Daxue Liu, Hangen He, David Barber
**arXiv Link**: https://arxiv.org/abs/1809.03137
**Project Link**: https://github.com/zhen-he/tracking-by-animation
**Summary**: The common Tracking-by-Detection (TBD) paradigm uses supervised learning and treats detection and tracking separately. Instead, TBA is a differentiable neural model that first tracks objects from input frames, animates these objects into reconstructed frames, and learns from the reconstruction error through backpropagation. In addition, Reprioritized Attentive Tracking is proposed to improve the robustness of data association.
**Highlights**: Label-free, end-to-end MOT learning

**Eliminating Exposure Bias and Metric Mismatch in Multiple Object Tracking**
**Authors**: Andrii Maksai, Pascal Fua
**arXiv Link**: https://arxiv.org/abs/1811.10984
**Project Link**: None
**Summary**: Many state-of-the-art MOT approaches now use sequence models to solve identity switches, but their training can be affected by biases. An iterative scheme of building a rich training set is proposed and used to learn a scoring function that is an explicit proxy for the target tracking metric.
**Highlights**: Eliminating loss-evaluation mismatch

**Multi-Person Articulated Tracking With Spatial and Temporal Embeddings**
**Authors**: Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian
**arXiv Link**: https://arxiv.org/abs/1903.09214
**Project Link**: None
**Summary**: The framework consists of a SpatialNet and a TemporalNet, predicting (body part detection heatmaps + Keypoint Embedding (KE) + Spatial Instance Embedding (SIE)) and (Human Embedding (HE) + Temporal Instance Embedding (TIE)), respectively. In addition, a differentiable Pose-Guided Grouping (PGG) module makes the whole part detection and grouping pipeline fully end-to-end trainable.
**Highlights**: Spatial & temporal embeddings, end-to-end learning of the "detection and grouping" pipeline

**(STAF): Efficient Online Multi-Person 2D Pose Tracking With Recurrent Spatio-Temporal Affinity Fields**
**Authors**: Yaadhav Raaj, Haroon Idrees, Gines Hidalgo, Yaser Sheikh
**arXiv Link**: https://arxiv.org/abs/1811.11975
**Project Link**: None
**Summary**: Building upon the Part Affinity Field (PAF) representation designed for static images, an architecture encoding and predicting Spatio-Temporal Affinity Fields (STAF) across a video sequence is proposed - a novel temporal topology cross-linked across limbs which can consistently handle body motions of a wide range of magnitudes. The network ingests STAF heatmaps from previous frames and estimates those for the current frame.
**Highlights**: Online, fastest and most accurate bottom-up approach

**(OTR): Object Tracking by Reconstruction With View-Specific Discriminative Correlation Filters**
**Authors**: Ugur Kart, Alan Lukezic, Matej Kristan, Joni-Kristian Kamarainen, Jiri Matas
**arXiv Link**: https://arxiv.org/abs/1811.10863
**Summary**: Performs online 3D target reconstruction to facilitate robust learning of a set of view-specific discriminative correlation filters (DCFs). State-of-the-art on the Princeton RGB-D tracking and STC benchmarks.

I'm not experienced with point clouds, so I couldn't write summaries for the following papers. Their abstracts are given below; check them out on arXiv if you're interested.

**VITAMIN-E: VIsual Tracking and MappINg With Extremely Dense Feature Points**
**Authors**: Masashi Yokozuka, Shuji Oishi, Simon Thompson, Atsuhiko Banno
**arXiv Link**: https://arxiv.org/abs/1904.10324
**Project Link**: None
**Abstract**: In this paper, we propose a novel indirect monocular SLAM algorithm called "VITAMIN-E," which is highly accurate and robust as a result of tracking extremely dense feature points. Typical indirect methods have difficulty in reconstructing dense geometry because of their careful feature point selection for accurate matching. Unlike conventional methods, the proposed method processes an enormous number of feature points by tracking the local extrema of curvature informed by dominant flow estimation. Because this may lead to high computational cost during bundle adjustment, we propose a novel optimization technique, the "subspace Gauss--Newton method", that significantly improves the computational efficiency of bundle adjustment by partially updating the variables. We concurrently generate meshes from the reconstructed points and merge them for an entire 3D model. The experimental results on the SLAM benchmark dataset EuRoC demonstrated that the proposed method outperformed state-of-the-art SLAM methods, such as DSO, ORB-SLAM, and LSD-SLAM, both in terms of accuracy and robustness in trajectory estimation. The proposed method simultaneously generated significantly detailed 3D geometry from the dense feature points in real time using only a CPU.

**Leveraging Shape Completion for 3D Siamese Tracking**
**Authors**: Silvio Giancola*, Jesus Zarzar*, and Bernard Ghanem
**arXiv Link**: https://arxiv.org/abs/1903.01784
**Project Link**: https://github.com/SilvioGiancola/ShapeCompletion3DTracking
**Abstract**: Point clouds are challenging to process due to their sparsity, therefore autonomous vehicles rely more on appearance attributes than pure geometric features. However, 3D LIDAR perception can provide crucial information for urban navigation in challenging light or weather conditions. In this paper, we investigate the versatility of Shape Completion for 3D Object Tracking in LIDAR point clouds. We design a Siamese tracker that encodes model and candidate shapes into a compact latent representation. We regularize the encoding by enforcing the latent representation to decode into an object model shape. We observe that 3D object tracking and 3D shape completion complement each other. Learning a more meaningful latent representation shows better discriminatory capabilities, leading to improved tracking performance. We test our method on the KITTI Tracking set using car 3D bounding boxes. Our model reaches a 76.94% Success rate and 81.38% Precision for 3D Object Tracking, with the shape completion regularization leading to an improvement of 3% in both metrics.

**LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking**
**Authors**: Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, Haibin Ling
**arXiv Link**: https://arxiv.org/abs/1809.07845
**Project Link**: https://cis.temple.edu/lasot/
**Summary**: A high-quality benchmark for **La**rge-scale **S**ingle **O**bject **T**racking, consisting of 1,400 sequences with more than 3.5M frames.

**CityFlow: A City-Scale Benchmark for Multi-Target Multi-Camera Vehicle Tracking and Re-Identification**
**Authors**: Zheng Tang, Milind Naphade, Ming-Yu Liu, Xiaodong Yang, Stan Birchfield, Shuo Wang, Ratnesh Kumar, David Anastasiu, Jenq-Neng Hwang
**arXiv Link**: https://arxiv.org/abs/1903.09254
**Project Link**: https://www.aicitychallenge.org/
**Summary**: The largest-scale dataset in terms of spatial coverage and the number of cameras/videos in an urban environment, consisting of more than 3 hours of synchronized HD videos from 40 cameras across 10 intersections, with the longest distance between two simultaneous cameras being 2.5 km.

**MOTS: Multi-Object Tracking and Segmentation**
**Authors**: Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, Bastian Leibe
**arXiv Link**: https://arxiv.org/abs/1902.03604
**Project Link**: https://www.vision.rwth-aachen.de/page/mots
**Summary**: Goes beyond 2D bounding boxes and extends the popular task of multi-object tracking to multi-object tracking and segmentation, in both tasks and metrics.
**Highlights**: Extend MOT with segmentation

**Argoverse: 3D Tracking and Forecasting With Rich Maps**
**Authors**: Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, James Hays
**arXiv Link**: None
**PDF Link**: http://openaccess.thecvf.com/content_CVPR_2019/papers/Chang_Argoverse_3D_Tracking_and_Forecasting_With_Rich_Maps_CVPR_2019_paper.pdf
**Project Link**: Argoverse.org (Not working?)
**Summary**: A dataset designed to support autonomous vehicle perception tasks including 3D tracking and motion forecasting.

After coming back from VALSE 2019, I feel like I've become a diehard fan of Prof. Wanli Ouyang ╰( ᐖ╰)! The FishNet he presented at the conference really caught my eye. Such a great idea, why didn't I think of it! After returning I read the paper and code carefully; here is a brief summary.

Early typical deep CNN architectures were mostly funnel-shaped: repeated convolution and downsampling extract and condense image features, and a few fully connected layers at the end compute the task-specific output. This design fits image classification naturally, because deeper networks learn higher-level semantic features; when the image is finally condensed into a single pixel, i.e. a vector, each channel value of that pixel represents how the whole image expresses the corresponding semantic feature.

Fig. 1 A funnel-shaped CNN, taking VGG-16 as an example

However, applying this structure unchanged to other tasks does not work as well. In segmentation, preserving fine detail features improves results (e.g. FCN-8s far outperforms FCN-32s). In anchor-based object detection, larger feature maps regress proposal boxes for small objects better (e.g. adding FPN to YOLOv3 significantly improves small-object detection). Hence hourglass-shaped or even stacked-hourglass architectures (U-Net, FPN, Stacked Hourglass, etc.) emerged to better handle these tasks.

Fig. 2 An hourglass-shaped CNN, taking U-Net as an example

Most works of this kind share one idea: low-level detail features matter, so let's fuse them into the high-level semantic features. A natural follow-up question: can semantic features also be fused back into detail features, enhancing the high-resolution feature maps? FishNet achieves exactly this kind of fusion, so that in the last part of the network the feature maps at every resolution mix low-, mid-, and high-level features (pixel-level, region-level and image-level in the authors' words), each containing a bit of the others.

In ResNet, the authors use a clever trick to give shallower layers effective gradient information: an identity mapping is added onto each layer's output. That is, the layer's input \(x_l\), the next layer's input \(x_{l+1}\) and the layer's computation \(\mathcal{F}(x_l, \mathcal{W}_l)\) are related by $$x_{l+1}=x_l+\mathcal{F}(x_l, \mathcal{W_l})$$

One layer further:

$$x_{l+2}=x_l + \mathcal{F}(x_l, \mathcal{W_l}) + \mathcal{F}(x_{l+1}, \mathcal{W_{l+1}})$$

Unrolling all the way to the last layer \(x_L\):

$$x_{L}=x_l+\sum_{i=l}^{L-1}\mathcal{F}(x_i, \mathcal{W_i})$$

During back-propagation, we then have:

$$\begin{split}
\frac{\partial{\mathcal{E}}}{\partial{x_l}} & = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\frac{\partial{x_L}}{\partial{x_l}}\\
& = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\Big(1+\frac{\partial{}}{\partial{x_l}}\sum_{i=l}^{L-1}\mathcal{F}(x_i,\mathcal{W}_i)\Big)
\end{split}$$

The reality, however, is that several downsampling steps change the feature-map size along the way. At those points the identity mapping \(x\) has to be replaced by some \(\mathcal{M}(x)\) (usually a \((1\times 1)\) convolution, which the authors call an I-conv, i.e. Isolated convolution) to adapt the size and channel count. So not every layer satisfies the simple \(x_{l+1}=x_l+\mathcal{F}(x_l, \mathcal{W_l})\), and the gradient formula above is only an idealization.

Fig. 3 In ResNet, the ideal Bottleneck module vs. some real Bottleneck modules

Within ResNet itself this is tolerable. But in FPN, let alone Stacked Hourglass, such an I-conv is used at every feature-map fusion, which rather defeats ResNet's original goal of keeping gradients flowing directly. In these situations, FishNet instead adopts a "smoother" scheme that minimizes the disturbance to gradient back-propagation.

Fig. 4 FishNet's Bottleneck modules involving resampling (except in the tail part)

Brilliant (👏)! Now let's look at FishNet as a whole:

Fig. 5 FishNet

~~The whole fish~~ The whole FishNet consists of three parts: the tail, the body and the head. Before the tail, the image first passes through three convolutional layers, extracting a \((56\times 56 \times 64)\) feature map from the \((224\times 224 \times 3)\) input image. The authors group feature maps of the same resolution across the different parts into the same stage: \((56\times 56)\) is stage 1, \((28\times 28)\) is stage 2, \((14\times 14)\) is stage 3, and \((7\times 7)\) is stage 4. Since the resolutions match, feature maps from the three parts can be concatenated directly along the channel dimension without any up/downsampling.

The tail is a funnel-shaped network involving three max-pooling steps; before each pooling, the output feature map of the last convolutional layer is kept for later use by the body. The result of this part is a classic funnel-shaped network; the authors use a three-stage ResNet. At the end of the tail, a Squeeze-and-Excitation block [2] is applied: the \((7\times 7 \times 512)\) feature map is mapped to a \((1\times 1\times 512)\) vector via global average pooling plus a few convolutional layers (essentially no different from fully connected layers), and each value of this vector is then used as a weight, multiplied onto the corresponding channel of the original \((7\times 7\times 512)\) feature map.
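The squeeze-and-excitation step can be sketched in a few lines of NumPy. This is my own simplified illustration, not the authors' code: the "few convolutional layers" are collapsed into two plain weight matrices `w1`/`w2` (hypothetical names) with a ReLU and a sigmoid in between:

```python
import numpy as np

def se_block(x, w1, w2):
    """Simplified Squeeze-and-Excitation on a (C, H, W) feature map.
    w1: (C//r, C) reduction weights, w2: (C, C//r) expansion weights."""
    squeeze = x.mean(axis=(1, 2))                 # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0)          # reduction + ReLU
    weights = 1 / (1 + np.exp(-(w2 @ hidden)))    # expansion + sigmoid -> (C,)
    return x * weights[:, None, None]             # reweight each channel

y = se_block(np.ones((4, 7, 7)), np.eye(2, 4), np.eye(4, 2))
assert y.shape == (4, 7, 7)   # same shape, channels rescaled
```

With toy partial-identity weights, channels whose excitation logit is 0 are simply halved (sigmoid(0) = 0.5), which shows the per-channel reweighting at work.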

Like FPN, the body repeatedly upsamples to enlarge the feature maps, while fusing in the same-resolution features kept from the tail.

The head is FishNet's original contribution and works like the reverse of the body. Previous hourglass-shaped networks used high-level semantic features to refine low-level detail features; the head goes the other way, using the refined low-level detail features to refine the high-level features in turn. The quality of the re-downsampled high-level features is thus effectively improved.

The parameters of each part of FishNet-99 are listed in the table below.

| Part-Stage | Input shape | Output shape | Bottlenecks | I-convs | Convs in total |
|---|---|---|---|---|---|
| Input | \(3\times 224 \times 224\) | \(64\times 56 \times 56\) | \(0\) | \(0\) | \(3\) |
| Tail-1 | \(64\times 56 \times 56\) | \(128\times 28 \times 28\) | \(2\) | \(1\) | \(7\) |
| Tail-2 | \(128\times 28 \times 28\) | \(256\times 14 \times 14\) | \(2\) | \(1\) | \(7\) |
| Tail-3 | \(256\times 14 \times 14\) | \(512\times 7 \times 7\) | \(6\) | \(1\) | \(19\) |
| SE-block | \(512\times 7 \times 7\) | \(512\times 7 \times 7\) | \(2\) | \(1\) | \(11\) |
| Body-3 | \(512\times 7 \times 7\) | \(256\times 14 \times 14\) | \(1 + 1\) | \(0\) | \(6\) |
| Body-2 | \((512+256)\times 14 \times 14\) | \(384\times 28 \times 28\) | \(1 + 1\) | \(0\) | \(6\) |
| Body-1 | \((384+128)\times 28 \times 28\) | \(256\times 56 \times 56\) | \(1 + 1\) | \(0\) | \(6\) |
| Head-1 | \((256+64)\times 56 \times 56\) | \(320\times 28 \times 28\) | \(1 + 1\) | \(0\) | \(6\) |
| Head-2 | \((320+512)\times 28 \times 28\) | \(832\times 14 \times 14\) | \(2 + 1\) | \(0\) | \(9\) |
| Head-3 | \((832+768)\times 14 \times 14\) | \(1600\times 7 \times 7\) | \(2 + 4\) | \(0\) | \(18\) |
| Score-Conv | \((1600+512)\times 7 \times 7\) | \(1056\times 7 \times 7\) | \(0\) | \(0\) | \(1\) |
| Score-FC | \(1056\times 7 \times 7\) | \(1000\times 1 \times 1\) | \(0\) | \(0\) | \(1\) |

Notes:

- Tail-1 in the first column denotes stage \(1\) of the tail part.
- For Body-3 through Head-3, the Bottleneck count includes two kinds of modules: those on the network trunk and those in the feature transfer blocks. A transfer block transforms the same-stage feature maps coming from the previous part.

The parameters of FishNet-150 are listed below; compared with FishNet-99, only the number of Bottleneck blocks in each part differs.

| Part-Stage | Input shape | Output shape | Bottlenecks | I-convs | Convs in total |
|---|---|---|---|---|---|
| Input | \(3\times 224 \times 224\) | \(64\times 56 \times 56\) | \(0\) | \(0\) | \(3\) |
| Tail-1 | \(64\times 56 \times 56\) | \(128\times 28 \times 28\) | \(2\) | \(1\) | \(7\) |
| Tail-2 | \(128\times 28 \times 28\) | \(256\times 14 \times 14\) | \(4\) | \(1\) | \(13\) |
| Tail-3 | \(256\times 14 \times 14\) | \(512\times 7 \times 7\) | \(8\) | \(1\) | \(25\) |
| SE-block | \(512\times 7 \times 7\) | \(512\times 7 \times 7\) | \(4\) | \(1\) | \(17\) |
| Body-3 | \(512\times 7 \times 7\) | \(256\times 14 \times 14\) | \(2 + 2\) | \(0\) | \(12\) |
| Body-2 | \((512+256)\times 14 \times 14\) | \(384\times 28 \times 28\) | \(2 + 2\) | \(0\) | \(12\) |
| Body-1 | \((384+128)\times 28 \times 28\) | \(256\times 56 \times 56\) | \(2 + 2\) | \(0\) | \(12\) |
| Head-1 | \((256+64)\times 56 \times 56\) | \(320\times 28 \times 28\) | \(2 + 2\) | \(0\) | \(12\) |
| Head-2 | \((320+512)\times 28 \times 28\) | \(832\times 14 \times 14\) | \(2 + 2\) | \(0\) | \(12\) |
| Head-3 | \((832+768)\times 14 \times 14\) | \(1600\times 7 \times 7\) | \(4 + 4\) | \(0\) | \(24\) |
| Score-Conv | \((1600+512)\times 7 \times 7\) | \(1056\times 7 \times 7\) | \(0\) | \(0\) | \(1\) |
| Score-FC | \(1056\times 7 \times 7\) | \(1000\times 1 \times 1\) | \(0\) | \(0\) | \(1\) |

The main building block of the tail, body and head is the Bottleneck module, i.e. the structure shown below:

| Layer | Type | Output channels | Kernel Size |
|---|---|---|---|
| (shortcut) | (take shortcut) | - | - |
| relu | ReLU | \(C\) | - |
| bn1 | Batch Normalization | \(C\) | - |
| conv1 | Convolution | \(C / 4\) | \(1\times 1\) |
| bn2 | Batch Normalization | \(C / 4\) | - |
| conv2 | Convolution | \(C / 4\) | \(3\times 3\) |
| bn3 | Batch Normalization | \(C / 4\) | - |
| conv3 | Convolution | \(C'\) | \(1\times 1\) |
| (addition) | (add shortcut) | \(C'\) | - |

In each stage of the tail, the first Bottleneck module changes the channel count (i.e. \(C'\neq C\)); its shortcut then needs a convolutional layer to transform the channel count of the identity mapping. Isolated convolutions therefore still cannot be avoided on these three shortcuts, and a similar situation exists in the SE-block. In the head, however, although the feature maps keep being downsampled, their channel counts are unchanged, so no Isolated convolution is needed to disturb direct back-propagation of the gradients.

(PS: But when I counted, FishNet-99 has 100 convolutions and FishNet-150 has 151 😂. My guess is that the Score-FC layer is not counted as part of the FishNet trunk. By the way, although it is called an FC layer, the authors' code still defines it as a convolutional layer; since the \(7\times 7\) feature map has been reduced to \(1\times 1\) by global average pooling, it is essentially a vector whose length equals the channel count.)

From stage 3 of the body through stage 3 of the head, each stage's feature maps are fused with those of the previous part (the red dashed lines and red boxes in the figure). To preserve direct gradient back-propagation, the authors design the UR-block (Upsampling & Refinement) and the DR-block (Downsampling & Refinement) to "preserve and refine" the features of each part.

As mentioned above, stage numbers in FishNet do not increase monotonically with depth; they correspond to feature-map scales. Let \(x^t_s\) and \(x^b_s\) be the **first-layer** output features of stage \(s\) in the tail and the body, respectively; then \(x^t_s\) and \(x^b_s\) have the same width and height (though possibly different channel counts). \(x^t_s\) passes through a transferring block \(\mathcal{T}(x)\) (again a Bottleneck module with a shortcut) and is concatenated with \(x^b_s\) to form the fused feature map \(\widetilde{x}^b_s\):

$$\widetilde{x}^b_s = concat(x^b_s, \mathcal{T}(x^t_s))$$

\(\widetilde{x}^b_s\) then serves as the input to the subsequent convolutional layers \(\mathcal{M}(x)\) of stage \(s\) in the body. Meanwhile, for direct gradient back-propagation, an identity-style mapping is added to \(\mathcal{M}(\widetilde{x}^b_s)\), following the same idea as \(\mathcal{H}(x)=x+\mathcal{F}(x)\) in ResNet:

$$\widetilde{x}'^b_s = r(\widetilde{x}^b_s) + \mathcal{M}(\widetilde{x}^b_s)$$

In stage 1 of the body, the output of \(\mathcal{M}(x)\) has the same channel count as \(x\), and \(r(x)\) is simply \(x\). In stages 3 and 2, however, \(\mathcal{M}(x)\) changes the channel count (halving it in the authors' code, i.e. \(k=2\)), so \(r(x)\) must perform a channel-wise reduction. **Again for the sake of direct gradient back-propagation**, not even a \((1\times 1)\) convolution is used to change the channel count here; instead, every \(k\) channels are summed element-wise into one channel. \(\widetilde{x}'^b_s\) is then upsampled to become the input of the next body stage (stage \(s-1\)):

$$x^b_{s-1}=up(\widetilde{x}'^b_s)$$

Fig. 6 The Upsampling & Refinement block

(PS: Why is a \((1\times 1)\) convolution not used here, while the tail does use one? My guess is that the tail has to expand the channel count and has no other option. Perhaps the reverse of \(r(x)\) could work in the tail too: duplicating each channel to double the channel count. Feel free to try it.)
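The UR-block above can be sketched in NumPy. This is a toy illustration assuming channel-first \((C, H, W)\) arrays; `refine` is a hypothetical stand-in for the real Bottleneck stack \(\mathcal{M}(\cdot)\), and the weight-free nearest-neighbour upsampling matches the design choice discussed later:

```python
import numpy as np

def channel_reduction(x, k=2):
    """r(x): sum every k consecutive channels element-wise -- no 1x1 conv,
    so the gradient passes straight through with weight 1."""
    c, h, w = x.shape
    return x.reshape(c // k, k, h, w).sum(axis=1)

def upsample_nn(x, factor=2):
    """Weight-free nearest-neighbour upsampling."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def ur_block(x_body, x_tail_transferred, refine, k=2):
    """Sketch of the UR-block: concat -> r(x) + M(x) -> upsample.
    `refine` stands in for M(.) and is assumed to reduce channels by k."""
    fused = np.concatenate([x_body, x_tail_transferred], axis=0)  # channel concat
    out = channel_reduction(fused, k) + refine(fused)
    return upsample_nn(out)
```

For example, `channel_reduction` applied with \(k=2\) to a 4-channel map returns a 2-channel map whose first channel is the element-wise sum of input channels 0 and 1.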

The Downsampling & Refinement block in the head is even simpler than the Upsampling & Refinement block: none of the \(\mathcal{M}(x)\) here changes the channel count, so the \(r(x)\) used in the UR block is no longer needed. The remaining formulas are essentially the same as for the UR block:

$$\widetilde{x}^b_s = concat(x^b_s, \mathcal{T}(x^t_s)) \\
\widetilde{x}'^b_s = \widetilde{x}^b_s + \mathcal{M}(\widetilde{x}^b_s) \\
x^b_{s+1}=down(\widetilde{x}'^b_s)$$

Fig. 7 The Downsampling & Refinement block

In a funnel-shaped CNN, features in shallower layers tend to be simple, pixel-level features, while deeper layers, having larger receptive fields, hold more abstract, generalized features. Because of the up- and downsampling in FishNet, distinguishing features of different resolutions simply as "shallow" vs. "deep" seems inappropriate. So here I use "low-level features" for the higher-resolution, more concrete ones, and "high-level features" for the lower-resolution, more abstract, or more "condensed", ones.

For classification, an image passed through a funnel-shaped CNN suffices to regress its class; for detection, strengthening low-level features with high-level ones effectively improves results; and if low-level features are in turn used to strengthen high-level ones, the network can serve image-level, region-level and pixel-level tasks at the same time.

Avoid I-convs on shortcuts whenever possible. Apart from the tail's residual modules that change channel counts, FishNet avoids I-convs in all the fusions of the body and head, preserving direct gradient back-propagation to the greatest possible extent.

For upsampling, **avoid weighted deconvolution** and prefer methods such as nearest-neighbour interpolation, again to preserve direct gradient back-propagation.

**Downsampling with max pooling of kernel size \((2\times 2)\) and stride \(2\)** works better than several other typical downsampling methods. The alternatives compared include:

- stride \(2\) in the last convolutional layer (disturbs direct gradient back-propagation)
- max pooling with kernel size \((3\times 3)\) and stride \(2\) (overlapping windows disturb the structural information)
- average pooling with kernel size \((3\times 3)\) and stride \(2\) (not discussed in the paper; I suspect it behaves similarly to a stride-\(2\) convolution in the last layer)
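The preferred downsampling choice is trivial to express. A NumPy sketch of non-overlapping \((2\times 2)\), stride-\(2\) max pooling on a channel-first map (assuming even height and width):

```python
import numpy as np

def maxpool_2x2(x):
    """2x2, stride-2 max pooling on a (C, H, W) map with even H and W.
    Windows do not overlap, and the gradient flows only through the max
    element of each window, untouched by any learned weights."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

p = maxpool_2x2(np.arange(16, dtype=float).reshape(1, 4, 4))
assert p.shape == (1, 2, 2)
```

Since the \(2\times 2\) windows tile the map exactly, every input pixel belongs to exactly one window, which is what keeps the structural information intact compared with the overlapping \(3\times 3\) variant.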

"Thirty years ago, before this old monk began practicing Chan, I saw mountains as mountains and waters as waters.

Later, when I met good teachers and gained some entry, I saw that mountains were not mountains and waters were not waters.

Now that I have found a place of rest, I once again see mountains simply as mountains and waters simply as waters.

Everyone: are these three understandings the same or different? Is there anyone who can tell them apart?"

— Chan Master Weixin of Qingyuan, Jizhou [3]

The idea of FishNet seems somehow related to these three stages of insight. Pooling, interpolation, fusion, pooling again, fusing again: the process resembles how a mind constructs, deconstructs and reconstructs knowledge.

When I first encounter something new, I only form a rough impression of it, without any real understanding. More than a decade ago, "a screen, a tower, a mouse and a keyboard" was what a computer looked like in my mind, and "computer science", as I saw it then, was no more than writing documents and drawing pictures with some software.

As my studies deepened and I turned from a user into a developer, my focus kept going deeper and finer: seeing an animation on a web page, I would press F12 to see how it was implemented in js, think about how the asynchronous request was served, what the TCP segments of the network request looked like, and how those segments traveled through a series of routers to reach the server. Yet as my understanding of computers deepened, they started to feel unfamiliar again: how many more secrets does this science hold, including some I cannot even imagine?

As for what understanding of computers further study will bring me, I am too inexperienced to know. Maybe one day it will suddenly dawn on me: oh, so that is what computer science is.

The R-CNNs are awesome works in object detection: they demonstrated the effectiveness of using region proposals with deep neural networks, and have become state-of-the-art baselines for the object detection task. In this blog post I'll briefly review the R-CNN family, from R-CNN to Mask R-CNN, along with several related works based on the idea of R-CNNs. Implementation and evaluation details are not covered here; for those, please refer to the original papers provided in the References section.

Before CNNs were widely adopted in object detection, SIFT or HOG features were commonly used for the detection task.

Unlike image classification, detection requires localizing objects within an image. Common approaches to localization are 1) bounding-box regression and 2) sliding-window detectors. The first approach, used in [1], proved not to work very well, while the second, used in [2], needs high spatial resolution, so deeper networks make precise localization a challenge.

R-CNN solves the CNN localization problem by adopting the "recognition using regions" paradigm.

From the input image, the method first generates around 2000 category-independent region proposals with the Selective Search algorithm, then extracts a fixed-length feature vector from each proposal using the same CNN (AlexNet). Finally, it classifies each region with category-specific linear SVMs.

Fig. 1 Overview of R-CNN.

However, the region proposal may not be that satisfactory as a final detection window. Therefore, a bounding-box regression stage is introduced to predict a new detection window given the feature map of a region proposal. As reported in [3], this simple approach fixes a large number of mislocalized detections. More details are available in the supplementary material[12] of the R-CNN paper.

Since AlexNet only takes images of size 227 × 227, the image clip in the bounding box should be resized.

In R-CNN, the image clip is directly warped to the required size.

Fig. 2 Cropping from the bounding box and warping.

- The *region proposal (RoI) - feature extraction - classification* approach
- Using *Selective Search* to generate region proposals
- Using *bounding-box regression* to refine region proposals
- Using *CNN features* for classification

- Running CNN feature extraction on each of the ~2000 regions consumes too much computation
- The warped content may suffer unwanted geometric distortion

SPP-Net introduces the spatial pyramid pooling layer that takes in feature maps of arbitrary size, while also considering multi-scale features in the input image. It also solved the way-too-slow issue of R-CNN.

While R-CNN extracts features from warped image clips in each proposed region, SPP-Net first extracts features of the whole image, producing one shared feature map. This feature map is then cropped according to the bounding boxes (boxes fixed by the regressor, same as in R-CNN). Each feature map clip is fed into the spatial pyramid pooling layer to get a feature vector of the same length. These feature vectors are then the inputs of the following fully connected layers, which are the same as in R-CNN.

Fig. 3 Overview of SPP-Net.

Fig. 4 Spatial pyramid pooling layer.

The spatial pyramid pooling layers consider the feature map clip in different scales - it divides the feature map clip into 4 × 4, 2 × 2 and 1 × 1 grids and computes 4 × 4, 2 × 2 and 1 × 1 feature maps (channel number doesn't change). The computed feature maps are flattened and concatenated into one vector, which is the input of the following fully connected layers.
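The fixed-length property is easy to verify in code. A minimal NumPy sketch of the pyramid, using max pooling per grid cell (my own simplification; the exact window/stride rounding in the paper differs slightly):

```python
import numpy as np

def spp(feature_clip, levels=(4, 2, 1)):
    """Spatial pyramid pooling on one RoI's feature clip of shape (C, H, W).
    For each pyramid level n, max-pool over an n x n grid, then flatten and
    concatenate, giving a fixed-length vector regardless of H and W."""
    c, h, w = feature_clip.shape
    parts = []
    for n in levels:
        hs = [int(round(i * h / n)) for i in range(n + 1)]
        ws = [int(round(j * w / n)) for j in range(n + 1)]
        for i in range(n):
            for j in range(n):
                cell = feature_clip[:, hs[i]:max(hs[i + 1], hs[i] + 1),
                                       ws[j]:max(ws[j + 1], ws[j] + 1)]
                parts.append(cell.max(axis=(1, 2)))   # one value per channel
    return np.concatenate(parts)                      # length C * (16 + 4 + 1)

# Two clips of different sizes yield vectors of the same length:
assert spp(np.random.rand(8, 13, 9)).shape == (8 * 21,)
assert spp(np.random.rand(8, 20, 17)).shape == (8 * 21,)
```

With the (4, 2, 1) pyramid, every clip maps to a vector of length 21 times the channel count, which is exactly what lets the fully connected layers accept RoIs of arbitrary size.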

- Extracting *feature maps first and only once* greatly improves the speed of R-CNN
- Using *spatial pyramid pooling layers* avoids geometric distortion

- Training the classifier and box regressor separately requires much work

As mentioned in the paper, R-CNN is slow because it performs a ConvNet forward pass for each object proposal without sharing computation. Fast R-CNN improves detection efficiency while using the deeper VGG16 network, testing 213 times (nice number :D) faster than R-CNN. It also introduces the RoI pooling layer, which is simply a special case of SPP-Net where only one scale (a single pyramid level) is considered. Fast R-CNN uses a multi-task loss and is trained in a single stage, updating all network layers. It yields higher detection quality (mAP) than R-CNN and SPP-Net while being comparatively fast to train and test.

Similar to SPP-Net, Fast R-CNN extracts image features before the RoI-based projection to share computation and speed up detection. Differently, though, Fast R-CNN uses a deeper network, VGG16, for more effective feature extraction. Rather than training the bounding-box regressor and classifier separately, Fast R-CNN uses a streamlined training process that jointly optimizes a softmax classifier and a bounding-box regressor; the RoI-fixing regressor is moved after the fully connected layers. The multi-task loss **for each RoI** is defined as:

$$L(p,u,t^u,v) = L_{cls}(p,u)+\lambda[u\geq 1]L_{loc}(t^u,v)$$

in which the definition of classification loss and localization loss are:

$$L_{cls}(p,u)=-\log(p_u)$$

$$L_{loc}(t^u,v)=\sum_{i\in \{x,y,w,h\}}{smooth_{L_1}(t_i^u-v_i)}$$

in which \(smooth_{L_1}\) loss is defined as:

$$smooth_{L_1}(x)=\begin{cases}
0.5x^2& \text{if } |x|<1\\
|x|-0.5& \text{otherwise}
\end{cases}$$
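These definitions translate directly into a few lines of NumPy (a sketch of the loss itself, not of the training pipeline):

```python
import numpy as np

def smooth_l1(x):
    """Piecewise loss: quadratic near zero, linear elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """Fast R-CNN loss for one RoI: log loss on the true class u plus,
    for non-background RoIs (u >= 1), smooth-L1 on the 4 box deltas."""
    l_cls = -np.log(p[u])
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum() if u >= 1 else 0.0
    return l_cls + lam * l_loc
```

For example, a perfect box prediction for a foreground RoI contributes only the classification term, while a background RoI (\(u=0\)) contributes no localization loss at all, matching the Iverson bracket \([u\geq 1]\).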

Symbol definitions:

| Symbol | Meaning |
|---|---|
| \(p=(p_0,\cdots,p_K)\) | Output of the classification layer, a vector of length \(K+1\) (\(K\) object classes plus background) |
| \(t^k=(t^k_x,t^k_y,t^k_w,t^k_h)\) | Output of the regression layer, a matrix of size \(K\times 4\) |
| \(u\in N, 1\le u \le K\) | True class |
| \(v=(v_x,v_y,v_w,v_h)\) | True bounding-box regression target |

Fig. 5 Overview of Fast R-CNN.

In this architecture, two of the three main procedures (all except region proposal) are trained in a single stage with the multi-task loss.

Here are two diagrams demonstrating common pooling layers (max or average) and the RoI pooling layer. On the left is the original 5x5 feature map, where each cell in the grid is a pixel value. During computation, the common pooling kernel covers one area per step and takes the maximum or average value within it. With a kernel size of 3x3 and a stride of 2, a 2x2 feature map is generated from the 5x5 feature map.

Fig. 6 Common pooling with kernel_size=3 and stride=2.

In RoI pooling, the RoI is cropped from the whole feature map and divided into pieces of roughly equal area according to the output feature map size. Grid cells on the borders between pieces, however, have to be assigned to one piece only, so there may be a little bit of "injustice" among the pieces. Within each piece, a global max/average pooling produces a single number per channel.

Fig. 7 RoI pooling with output size=(2, 2). The black dashed line denotes the original RoI, and the colored area is the actual cropped RoI.
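A minimal NumPy sketch of RoI max pooling as described (my own rounding for the piece boundaries; real implementations snap coordinates differently, which is exactly where the "injustice" comes from):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=(2, 2)):
    """RoI max pooling on a (C, H, W) feature map.
    roi = (x0, y0, x1, y1) in feature-map coordinates; the box is snapped
    to integer cells, split into out_size roughly-equal pieces, and each
    piece is reduced to one value per channel by max pooling."""
    x0, y0, x1, y1 = [int(round(c)) for c in roi]
    clip = feature_map[:, y0:y1, x0:x1]
    c, h, w = clip.shape
    oh, ow = out_size
    ys = [int(round(i * h / oh)) for i in range(oh + 1)]
    xs = [int(round(j * w / ow)) for j in range(ow + 1)]
    out = np.empty((c, oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[:, i, j] = clip[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                   xs[j]:max(xs[j + 1], xs[j] + 1)].max(axis=(1, 2))
    return out
```

On a 5x5 map with a 2x2 output, the odd dimension forces uneven 2-vs-3 pieces, which is the border "injustice" mentioned above.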

- Deeper CNN, *VGG16*, for feature extraction
- *Multi-task loss* & *single-stage training*

- For region proposal, conventional Selective Search algorithm doesn't make use of GPU computation power, thus consuming more time

In Fast R-CNN, two of the three main procedures are trained in a single stage; the exception is region proposal. Region proposal is the bottleneck of the total detection speed, since the GPU's high computation power isn't utilized there yet. Why not try training a CNN that generates region proposals?

Simply remove Selective Search from Fast R-CNN. In place of the SS algorithm, an RPN (Region Proposal Network) is introduced. Given the DCNN features, the RPN generates RoIs with much improved speed.

This is a question that had been confusing me for so long.

In a word, it's a simple CNN that takes an image of any size as input, slides a window, and outputs \(6k\) numbers each time the window moves. \(k\) is the number of pre-defined anchors - IT DOES NOT MEAN "THOUSAND". Wait, what is an anchor?

An anchor is a box size we define before generating data (for example, \((width=36, height=78)\) for pedestrians, and \((width=50, height=34)\) for dogs). Though the input image is of size \(n * n\), an anchor can be of any size and any width-height ratio. The prediction of the 6 numbers per anchor is based on the anchors we define. When the RPN works, it does NOT predict the probability that there is an object - BUT the probability that there is an object fitting the anchor.

Besides a classification layer predicting the probability of there being an object versus nothing but background, a regression layer predicts the relative box coordinates \((t_x, t_y, t_w, t_h)\). For each anchor, its size \((w_a, h_a)\) is given and its position \((x_a, y_a)\) is decided by the center position of the sliding window. The relation between relative coordinates \((t_x, t_y, t_w, t_h)\) and absolute coordinates \((x, y, w, h)\) is:

$$t_x=(x-x_a)/w_a\\
t_y=(y-y_a)/h_a\\
t_w=\log(w/w_a)\\
t_h=\log(h/h_a)$$

for both prediction and ground truth.

Fig. 8 The original graph demonstration of RPN. Keep in mind that "k" does not mean "thousand".
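These transforms and their inverses can be written down directly (a minimal sketch; the tuple layout is my choice for illustration):

```python
import math

def encode(box, anchor):
    # Absolute (x, y, w, h) -> relative (tx, ty, tw, th) w.r.t. an anchor.
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    # Inverse transform: recover the absolute box from the predicted offsets.
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya, wa * math.exp(tw), ha * math.exp(th))
```

A box identical to its anchor encodes to all zeros, and `decode(encode(box, anchor), anchor)` recovers the original box.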

But the RPN generates a great pile of boxes, so some basic methods have to be applied to select the "good" ones. Firstly, boxes with low object scores and high background scores are abandoned (the thresholds are usually set manually). Secondly, using non-maximum suppression, one box per object target is elected from all boxes that mark the same object.
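The second step can be sketched as a greedy NMS over IoU (a plain-Python illustration, not the original implementation):

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    # Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```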

*Region Proposal Network*- high-speed high-quality region proposals

Though Mask R-CNN is a great work, its idea is rather intuitive: since detection and classification are done, why not add a segmentation head? That would give us instance-first instance segmentation!

Add a small fully-convolutional mask head to Faster R-CNN, replace the VGG net with the more efficient ResNet/FPN (Residual Network / Feature Pyramid Network), and replace RoI pooling with RoI alignment.

Fig. 9 The Mask R-CNN framework for instance segmentation. The last convolutional layer is the newly added segmentation layer for each RoI.

In RoI pooling, quantization will be performed when the RoI coordinates are not integers. For example, when cutting the area \((x_1=11.02, y_1=53.9, x_2=16.2, y_2=58.74)\), actually the area \((x_1=11, y_1=54, x_2=16, y_2=59)\) is what we get (nearest-neighbor).

But in RoI alignment, the area is exactly \((x_1=11.02, y_1=53.9, x_2=16.2, y_2=58.74)\). Instead of cropping it out, the feature map area is sampled at some sample points. Divide the RoI into \(n*n\) (the output size) bins; using bi-linear interpolation, one value is calculated at each sample point. The image below is a simple example, in which there is only one sample point per pixel of the pooled RoI. The coordinate of the only sample point in the first area is \((12.315, 55.11)\). Calculate the weighted average of the 4 grid points near this sample point and we'll have the value for this pixel in the pooled feature map.

Fig. 10 RoI alignment with output size=(2, 2) and 1 sample point each bin.

It's obvious that one sample point per bin is far from enough in our example. So using more sample points is wiser.

Fig. 11 RoI alignment with output size=(2, 2) and 2×2 sample points per bin.
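The per-sample-point computation boils down to bilinear interpolation over the four surrounding grid points (a minimal sketch assuming in-bounds coordinates):

```python
def bilinear(fmap, y, x):
    # Value at a real-valued (y, x): weighted average of the 4 neighbouring cells.
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(fmap) - 1)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0][x0] * (1 - dy) * (1 - dx) + fmap[y0][x1] * (1 - dy) * dx +
            fmap[y1][x0] * dy * (1 - dx) + fmap[y1][x1] * dy * dx)
```

Averaging the interpolated values of all sample points in a bin then gives that bin's output.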

- *RoI Align*: improving mask accuracy greatly
- Add a segmentation head on Faster R-CNN and achieve accurate instance segmentation

There are several other R-CNNs by other researchers, which are basically variants of the R-CNN architecture.

arXiv: https://arxiv.org/abs/1711.07264

Code(Official, TensorFlow): https://github.com/zengarden/light_head_rcnn

arXiv: https://arxiv.org/abs/1712.00726

Code(Official, Caffe): https://github.com/zhaoweicai/cascade-rcnn

Code(PyTorch): https://github.com/guoruoqian/cascade-rcnn_Pytorch

arXiv: https://arxiv.org/abs/1811.12030

Code: Not yet

[10] Xin Lu, et al. "Grid R-CNN." arXiv preprint arXiv:1811.12030 (2018).

DensePose has been re-implemented with the brand-new object detection framework Detectron2, which is based on PyTorch and is much easier to install and use (you don't have to manually compile Caffe2).

I strongly recommend that you check out the new official DensePose code at https://github.com/facebookresearch/detectron2/tree/master/projects/DensePose.

DensePose is a great work in real-time human pose estimation, based on the Caffe2 and Detectron frameworks. It extracts a dense 3D surface of the human body from RGB images. The installation instructions are provided here.

During my installation, these are the problems that took me some time to tackle. I spent one week finally figuring out solutions to all the issues. So lucky of me not to give up too early...

By the way, **before you suffer too much**, I strongly recommend following the step-by-step Caffe2+DensePose installation guide by @Johnqczhang. If you think you're almost there, help yourself with the solutions below~

- System: Ubuntu 18.04
- Linux kernel: 4.15.0-29-generic
- Graphics card: NVIDIA GeForce 1080Ti
- Graphics driver: 410.48
- CUDA: 10.0.130
- cuDNN: 7.3.1
- Caffe2: Built from source
- Python: 2.7.15, based on Anaconda 4.5.11

Occurred when running `make`.

Main error message:

```
Could not find a package configuration file provided by "Caffe2" with any
```

Caffe2 build path isn't known by CMake.

Added one line in the beginning of CMakeLists.txt:

```cmake
set(Caffe2_DIR "/path/to/pytorch/torch/share/cmake/Caffe2/")
```

(Note: `set(Caffe2_DIR "/path/to/pytorch/build/")` can also fix this issue but may cause other issues.)

Occurred when running `python2 $DENSEPOSE/detectron/tests/test_spatial_narrow_as_op.py` after `make`.

Main error message:

```
Detectron ops lib not found; make sure that your Caffe2 version includes Detectron module.
```

Seems that the Python part of DensePose couldn't recognize Caffe2.

Add `/path/to/pytorch/build` to the `PYTHONPATH` environment variable. It can be added either by running `export PYTHONPATH=$PYTHONPATH:/path/to/pytorch/build` directly or by adding this line to `~/.bashrc`. Remember to run `source ~/.bashrc` after the modification.

Occurred when running `make ops`.

Main error message:

```
CMake Error at /path/to/pytorch/build/Caffe2Config.cmake:14 (include):
```

(Several `*.cmake` files; I only showed a few.)

These files are not in the `pytorch/build` directory. By searching, I found that they are in the `pytorch/torch/share/cmake/Caffe2` directory.

Added one line in the beginning of CMakeLists.txt:

```cmake
set(Caffe2_DIR "/path/to/pytorch/torch/share/cmake/Caffe2/")
```

Occurred when running `make ops`.

I forgot to record the error messages, but it should be obvious that some header files (not just `context_gpu.h`) are missing.

This time it's the include path that isn't recognized...

Added one line in the beginning of CMakeLists.txt:

```cmake
include_directories("/path/to/pytorch/torch/lib/include")
```

Occurred when running `make ops`.

Main error message:

```
/path/to/pytorch/torch/lib/include/caffe2/proto/caffe2.pb.h:12:2: error: #error This file was generated by a newer version of protoc which is
```

If you only have a protobuf newer than v3.6.1, this should not happen. Check whether you have multiple protobufs installed from different sources. (In my case, a protobuf v3.2.0 had been installed with `apt-get` earlier.)

I can't provide an exact solution. Please try `which protoc` to see where protobuf is installed. If it shows the protobuf you installed with Anaconda, remove it completely and try again. Since DensePose tells you that you have an older version of protobuf, you should be able to locate one. After finding it, remove it or upgrade it to v3.6.1 or higher. I would prefer installing protobuf from source here; it's not as painful as installing DensePose.

Occurred when running `make ops`.

I forgot to record the error messages, but it should be obvious too.

Intel Math Kernel Library was turned on but not found. (Why is it enabled when I didn't even install it???)

Install the Intel Math Kernel Library and add `/opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/include` to the `CPATH` environment variable:

```shell
export CPATH=$CPATH:/opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/include
```

The exact path may vary according to the MKL version and your configuration.

Maybe try `find / -name mkl_cblas.h` to make sure of its location after the installation.

Adding the path to CMakeLists.txt should also be helpful, but I didn't test it:

```cmake
include_directories("/opt/intel/compilers_and_libraries_2019.1.144/Linux/mkl/include")
```

Occurred when running `make ops`.

Main error message:

```
/path/to/pytorch/caffe2/operators/accumulate_op.h: In constructor ‘caffe2::AccumulateOp<T, Context>::AccumulateOp(const caffe2::OperatorDef&, caffe2::Workspace*)’:
```

I'm not sure. Could it be that `GetSingleArgument()` is defined elsewhere?

Modify `/path/to/densepose/detectron/ops/pool_points_interp.h`: change `OperatorBase::GetSingleArgument<float>` to `this->template GetSingleArgument<float>`.

(Thanks to badpx@Github: https://github.com/facebookresearch/DensePose/pull/137/commits/51389c6a02173a25e9429825db452beb5e1cf3be)

Occurred when running `make ops`.

Main error message:

```
/path/to/pytorch/torch/lib/include/caffe2/core/workspace.h:19:48: fatal error: caffe2/utils/threadpool/ThreadPool.h: No such file or directory
```

This should only happen when your Caffe2 is installed with Anaconda.

If your Caffe2 is installed with Anaconda, these files may not be found anywhere in the Caffe2 directory, or in your hard disk at all.

In Anikily@Github's case, downloading the Caffe2 source code and adding its path to DensePose's include directories worked:

```shell
git clone git@github.com:pytorch/pytorch.git
```

and add one line in the beginning of DensePose/CMakeLists.txt:

```cmake
include_directories("/path/to/pytorch")
```

The directory you include here should contain `caffe2/utils/threadpool/ThreadPool.h` and all the others.

I don't think this issue should be solved this way, but I'm sure that these files couldn't be found anywhere else. If anyone finds a better solution, please comment here to help the others.

Occurred when running `python detectron/tests/test_zero_even_op.py`.

Main error message:

```
OSError: /path/to/densepose/build/libcaffe2_detectron_custom_ops_gpu.so: undefined symbol: _ZN6google8protobuf8internal9ArenaImpl28AllocateAlignedAndAddCleanupEmPFvPvE
```

WTF is this!???

As can be seen, this symbol has something to do with Google, and protobuf.

I guess this is caused by a different protobuf version. Good news is that a proper version of protobuf was also built with Caffe2, so why not tell this to DensePose?

In `/path/to/densepose/CMakeLists.txt`, add a few lines in the beginning:

```cmake
add_library(libprotobuf STATIC IMPORTED)
```

You can find two `target_link_libraries` lines in this file (they are not adjacent):

```cmake
target_link_libraries(caffe2_detectron_custom_ops caffe2_library)
```

Edit the two lines, adding a `libprotobuf` at the end of each:

```cmake
target_link_libraries(caffe2_detectron_custom_ops caffe2_library libprotobuf)
```

Then run `make ops` again, and `python detectron/tests/test_zero_even_op.py` again.

(Thanks to hyounsamk@Github: https://github.com/facebookresearch/DensePose/issues/119)

After fixing this issue, my DensePose passed tests and was running flawlessly. If any more issues remain, don't hesitate to comment here~

Occurred when running `python detectron/tests/test_zero_even_op.py`, with Caffe2 installed with Anaconda.

Main error message:

```
OSError: /path/to/densepose/build/libcaffe2_detectron_custom_ops_gpu.so: undefined symbol: _ZN6caffe219CPUOperatorRegistryB5cxx11Ev
```

As can be seen from the messy undefined symbol, this should have something to do with Caffe2 and probably CXX11(oh really???).

Run `ldd -r /path/to/densepose/build/libcaffe2_detectron_custom_ops.so` and the one or several undefined symbols with similar names will be shown; they should have been defined in `libcaffe2.so`. After running `strings -a /path/to/pytorch/torch/lib/libcaffe2.so | grep _ZN6caffe219CPUOperator`, a few similar symbols (two, in my case) come up, but they are different from the undefined one: the `"B5cxx11"` part is missing.

Why does DensePose want to find a symbol with `"B5cxx11"`? Who added this suffix?

It should be our GCC, which added it when compiling DensePose with the C++11 standard!

To find out which version of GCC built Caffe2, run `strings -a /path/to/pytorch/torch/lib/libcaffe2.so | grep GCC:`.

In my case, the output is:

```
GCC: (GNU) 4.9.2 20150212 (Red Hat 4.9.2-6)
```

Oh? It seems that Caffe2 developers are Red Hat lovers!

The Caffe2 installed with Anaconda was built by GCC 4.9.2, which had a slightly different standard on naming symbols.

The simplest way out is to turn to GCC 4.9.2 for building DensePose, too.

Otherwise, maybe also consider compiling Caffe2/PyTorch from source code?

(Many thanks to Johnqczhang@Github: https://github.com/linkinpark213/linkinpark213.github.io/issues/12)

Starting from this post, I've decided to keep a record (tag: MineSweeping) of the issues I meet while working with environments, together with their solutions.

Configuring an environment in order to run others' code can be a difficult and sometimes depressing task: various issues may arise, and it's impossible for the authors to keep providing solutions for every user in the community. What's worse, after fixing some problems with a lot of struggle, one may waste the same amount of time on the same issue the next time he/she runs the code. That's why I decided to keep this record: to avoid wasting time twice, while also helping others deal with problems if possible.

Here are some photos that I took during my trip to Higashi-Osaka, Nara and Kyoto.

Update in March 2019:

After the TensorFlow developers introduced the APIs of TensorFlow 2.0 at the TensorFlow Dev Summit 2019, I made my decision to turn to PyTorch.

TensorFlow is a powerful open-source deep learning framework, supporting various languages including Python. However, its APIs are far too complicated for a beginner in deep learning(especially those who are new to Python). In order to ease the pain of having to understand the mess of various elements in TensorFlow computation graphs, I made this tutorial to help beginners take the first bite of the cake.

ResNets are one of the greatest works in the deep learning field. Although they look scary with extreme depths, it's not a hard job to implement one. Now let's build one of the simplest ResNets - ResNet-56, and train it on the CIFAR-10 dataset.

2019.3 更新:

Tensorflow Dev Summit上开发者介绍TF 2.0 API后， 我彻底下定了换用PyTorch的决心。

TensorFlow是一个强大的开源深度学习软件库，它支持包括Python在内的多种语言。然而，由于API过于复杂（实际上还有点混乱），它往往使得一个深度学习的初学者（尤其是为此初学Python的那些）望而却步——老虎吃天，无从下口。为了减轻初学者不得不尝试理解TensorFlow中的大量概念的痛苦，我213今天带各位尝尝深度学习这片天的第一口。

ResNet是深度学习领域的一个重磅炸弹，尽管它们（ResNet有不同层数的多个模型）的深度看上去有点吓人，但实际上实现一个ResNet并不难。接下来，我们来实现一个较为简单的ResNet——ResNet-56，并在CIFAR-10数据集上训练一下，看看效果如何。

First let's take a look at ResNet-56. It's proposed by Kaiming He et al., and is designed to confirm the effect of residual networks. It has 56 weighted layers, deep but simple. The structure is shown in the figure below:

首先来看一下ResNet-56这个神经网络。它是何凯明等在ResNet论文中提出的、用于验证残差网络效果的一个相对简单的残差网络（尽管它很深，深度达到了56个权重层）。图示如下：

Fig. 1 The structure of ResNet-56

Seems a little bit long? Don't worry, let's do this step by step.

看起来有点长了是不是？别担心，我们一步一步来做。

Python 3.6

TensorFlow 1.4.0

Numpy 1.13.3

OpenCV 3.2.0

Also prepare some basic knowledge on Python programming, digital image processing and convolutional neural networks. If you are already capable of building, training and validating your own neural networks with TensorFlow, you don't have to read this post.

另外，请确保自己有一点点Python编程、数字图像处理和卷积神经网络的知识储备。如果你已经具备用TensorFlow自行搭建神经网络并进行训练、测试的能力，就不必阅读本文了。

Prepare(import) the tools for our project, including all that I mentioned above. Like this :P

准(i)备(m)所(p)需(o)工(r)具(t)，上一部分已提到过。如下：

```python
import tensorflow as tf
```

Wait... What's this? TensorChain? Another deep learning framework like TensorFlow?

Uh, nope. This is my own encapsulation of some TensorFlow APIs, for the sake of easing your pain. You'll only have to focus on "what's what" in the beginning. We'll look into my implementation of this encapsulation later, when you are clear how everything goes. Please download this file and put it where your code file is, and import it.

等等...最后这个是个什么鬼？ TensorChain？另一个深度学习框架吗？

呃...并不是。这个是我对一些TensorFlow API的封装，为了减轻你的痛苦才做的。作为初学者，你只需要关注用TensorFlow搭建网络模型的这个过程，分清东西南北。回头等你弄清了大体流程后，我们再来看这个的实现细节。请先下载这个文件并把它与你的代码放在同一文件夹下，然后就可以import了。

Every neural network requires an input; you always have to specify the details of a problem before asking the computer to solve it. All of the variables and constants in TensorFlow are objects of type `Tensor`.

每个神经网络都需要有输入——毕竟你想找电脑解决一些问题的话，你总得告诉它问题的一些细节吧？TensorFlow中所有的变量、常量都是

```python
input_tensor = tf.placeholder(dtype=tf.float32, shape=[None, 32, 32, 3])
```

In supervised learning, the correctly labeled data (the ground truth) also needs to be defined:

```python
ground_truth = tf.placeholder(dtype=tf.float32, shape=[None, 10])
```

We want the label data to be in the one-hot encoding format: an array of length 10, denoting 10 classes, in which exactly one position is a '1' and all the others are '0's.

我们需要标记的数据呈One-Hot编码格式（又称为一位有效编码），意思是如果有10个类别，那么数组长度就是10，每一位代表一个类别。只有一个位置上是1（代表图片被分为这个类），其他位上都是0。
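One-hot encoding itself is a one-liner (a plain-Python illustration):

```python
def one_hot(label, num_classes=10):
    # Length-num_classes vector: a single 1.0 at position `label`, 0.0 elsewhere.
    return [1.0 if i == label else 0.0 for i in range(num_classes)]
```

For example, `one_hot(3)` gives a length-10 vector whose fourth entry is 1.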

For now, let's use our TensorChain to build it fast. Under most circumstances, each computation is based on the input data or on the result of the previous computation, so our network (or most of it) looks more like a chain than a web. Every time we add a new operation (layer), we add it to our one and only TensorChain object.

The constructor of the TensorChain class takes a Tensor object as its parameter, which also serves as the input tensor of the chain. As we mentioned earlier, all we have to do is add operations. See my ResNet-56 code:

现在呢，我们先用TensorChain来快速盖楼。因为我们遇到的大多数情况下，所有的计算都是在输入数据或者这个计算的前一个计算结果基础上进行的，所以我们的网络（至少是它的绝大部分）会看起来像个链而不是所谓的网。每次我们添加一个新的运算（层），我们会把它加到这个独一无二的TensorChain对象。只要记得在使用原生TensorFlow API前把它的

TensorChain类的构造函数需要一个Tensor对象作为参数，这个对象也正是被拿来作为这个链的输入层。正如我们之前所说的，只要在这个对象上添加运算即可。写个ResNet-56，代码很简单：

```python
chain = TensorChain(input_tensor) \
```

This is it? Right, this is it! Isn't it cool? Didn't seem that high, huh? That's because I encapsulated that huge mess of weights and biases, leaving only a few parameters that decide the structure of the network. Later in this post we'll talk about the actual work that these functions do.

就这？没错呀，就这！稳不稳？似乎看起来也没56层那么高呀？毕竟这些函数被我封装得太严实了，只留出几个决定网络结构的几个参数供修改。这篇博客后边就会讲到这些函数究竟干了点什么事儿。

In supervised learning, you always have to tell the model its learning target. To tell the model how to optimize, you have to let it know how, by how much, and in which direction it should change its parameters. This is done with a loss function. Therefore, we need to define a loss function for our ResNet-56 model (which we designed for this classification problem) so that it can learn and optimize.

A commonly used loss function in classification problems is cross entropy. It's defined below:

搞监督学习，总是要让模型按照“参考答案”去改的。要改就得让它知道怎么改、改多少、往什么方向改，这也就是

分类问题上一个常用的损失函数是交叉熵。定义如下式：

$$C=-\frac{1}{n}\sum_x{\left[y\ln a+(1-y)\ln(1-a)\right]}$$

in which \(y\) is the expected(or say correct) output and \(a\) is the actual output.

This seems a little bit complicated, but it's not a hard job, since TensorFlow has implemented it already! You can also try implementing it yourself within one line if you want. For now we use the pre-defined cross-entropy loss function:

其中\(y\)为期望输出（或者说参考答案），\(a\)为实际输出。

略复杂呀...这个用程序怎么写？其实也不难。。。毕竟TensorFlow都帮我们实现好啦！（有兴趣的话也可以自己尝试着写一下，同样一行代码即可搞定）现在你只需要来这么一句：
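If you're curious what the op computes, here is the formula in plain Python (a per-sample sketch where n is the length of the output vector; not the TensorFlow implementation):

```python
import math

def cross_entropy(y, a, eps=1e-12):
    # C = -(1/n) * sum(y * ln(a) + (1 - y) * ln(1 - a)); eps guards against log(0).
    return -sum(yi * math.log(ai + eps) + (1 - yi) * math.log(1 - ai + eps)
                for yi, ai in zip(y, a)) / len(y)
```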

```python
loss = tf.reduce_mean(tf.losses.softmax_cross_entropy(ground_truth, prediction))
```

and it returns a tf.Tensor that denotes an average of cross entropies (don't forget that this is a batch). As for the 'softmax' before the 'cross_entropy', it's a function that projects the data in an array to the range 0~1, which allows us to compare our prediction with the ground truth (in one-hot encoding). The definition is simple too:

就可以创建一个表示交叉熵平均值（别忘了这可是一个batch）的Tensor了。至于cross_entropy前边的那个

$$S_i=\frac{e^{V_i}}{\sum_j{e^{V_j}}}$$
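In plain Python, the softmax formula looks like this (subtracting the maximum is a standard trick to keep exp() from overflowing; it doesn't change the result):

```python
import math

def softmax(v):
    # Map arbitrary scores to (0, 1) values that sum to 1.
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]
```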

Now we have the loss function. We'll have to tell its value to an optimizer:

现在误差函数已经有了，我们需要把它的值告诉一个优化器（

```python
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
```

Also, tell the optimizer which Tensor the loss is. The returned object is a train operation:

当然还要告诉它要减小的损失函数是哪个Tensor，这个函数返回的是一个训练操作（

```python
train = optimizer.minimize(loss)
```

The neural network is finished. It's time to grab some data and train it.

其实到这里为止，神经网络已经搭建好了。是时候搞点数据来训练它了。

Remember how we defined the placeholders? It's time to fetch some data that fits the placeholders and train it. See how CIFAR-10 dataset can be fetched on its website.

```python
def unpickle(file):
```

The returned value

返回值

```python
batch = unpickle(DATA_PATH + 'data_batch_{}'.format(i))  # 'i' is the loop variable
```

The details of data processing are not covered here. Try running it step by step to see the results.

The

处理的细节不再赘述。你可以尝试一步一步运行来看看每一步的结果。

这样我们拿到的

```python
with tf.Session() as session:
```

A

When running

每次运行一个TensorFlow模型（无论是训练还是测试）时，都需要通过tf.Session()创建一个

运行

I'm also interested in the loss value at each iteration (feeding a batch of data and executing one forward propagation and one back-propagation) of the training process. Therefore, what I'll pass in the first parameter is not just the train op, but also the loss tensor. The session.run() above should be modified to:

然而呢，我还想看看每次迭代（即把一个batch送进去，执行一次正向传播与反向传播这个过程）中损失函数变成了多大，来监控一下训练的效果。这样，需要session.run()的就不仅是那个train运算，还要加上loss运算。将上边的session.run()部分改为：

```python
[train_, loss_value] = session.run([train, loss],
```

This is when the return value of session.run() becomes useful. Its values correspond one-to-one to the entries of the first parameter of run(): they are the actual values of those tensors. In our example,

这时候，session.run()函数的返回值就有意义了。它与第一个参数的内容一一对应，分别是该参数中各个operation的实际输出值。像这个例子里边，

Actually, one epoch (training the model once over the whole dataset) is not enough for the model to fully optimize. I trained this model for 40 epochs and added some loop variables to display the result. You can see my code and my output below. It's highly recommended that you train this with a high-performance GPU, or it would take a century to train your model to a satisfactory degree.

实际上，一个epoch（把整个数据集都在模型里过一遍的周期）并不足以让模型充分学习。我把这个模型训练了40个epoch并且加了一些循环变量来输出结果。我的代码和结果如下。强烈建议用一个高性能GPU训练（如果手头没有，可以租一个GPU服务器），不然等别人把毕设论文逗写完的时候，你还在训练就很尴尬了。

```python
import tensorflow as tf
```

Fig. 2 Training result: cross entropy has dropped below 0.5
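The epoch/iteration bookkeeping described above can be sketched independently of any framework (the names are mine; `train_step` stands in for the `session.run([train, loss], ...)` call):

```python
def run_epochs(data, batch_size, num_epochs, train_step):
    # One iteration = one batch fed through train_step; one epoch = the whole dataset once.
    losses = []
    for epoch in range(num_epochs):
        for start in range(0, len(data), batch_size):
            losses.append(train_step(data[start:start + batch_size]))
    return losses
```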

In a word, building & training neural network models with TensorFlow involves the following steps:

1. Decide the input (define placeholders)

2. Add operations (layers) on the input or on existing tensors

3. Define the loss function

4. Select an optimizer and create the train operation

5. Process data and run the training in a session

总而言之，用TensorFlow建立、训练一个神经网络模型分以下几步：

1. 定义

2. 在已有的Tensor上添加运算（

3. 像之前添加的那些运算一样，定义

4. 选择一个

5. 把

Wait, it's too late to leave now!

TensorChain saved you from having to deal with a mess of TensorFlow classes and functions. Now it's time that we take a closer look at how TensorChain is implemented, thus understanding the native TensorFlow APIs.

别走呢喂！

TensorChain让你不至于面对TensorFlow中乱糟糟的类型和函数而不知所措被水淹没。现在是时候近距离观察一下TensorChain是如何实现的，以便理解TensorFlowAPI了。

Let's begin with TensorFlow variables. Variables in TensorFlow are similar to variables in C, Java or any other strongly typed programming language: they have a type, though it is not necessarily declared explicitly at definition. Usually they will change as the training process goes on, approaching an optimal value.

The most commonly used variables in TensorFlow are weights and biases. I guess that you have seen formulae like:

先说TensorFlow的变量。TensorFlow的变量和C，Java以及其他强类型语言类似——都有一个类型，尽管不一定在它的定义时就显式地声明。通常它们会随着训练的进行而不断变化，达到一个最佳的值附近。

TensorFlow中最常用的变量就是weights和biases（权重和偏置）。想必你应该见过这样的式子吧：

$$y=Wx+b$$

The \(W\) here is the weight, and the \(b\) is the bias. When implementing common network layers, these two are typically used as the layers' parameters. For instance, at the very beginning of our ResNet-56, we had a 3x3 convolution layer with 16 channels. Its implementation in TensorChain is:

这里\(W\)就是权重，\(b\)就是偏置。在定义一些常用的层时，我们往往也是用这两个变量作为这些层中的参数。比如说，在我们ResNet-56最开始，我们用到了一个3x3大小、16个通道的卷积层，TensorChain中，它的实现如下：

```python
def convolution_layer_2d(self, filter_size: int, num_channels: int, stride: int = 1, name: str = None,
```

See? On line 16, we used a

看见了吧？16行上，我们用了一个

```python
tf.Variable(tf.truncated_normal(shape, stddev=sigma), dtype=tf.float32, name=suffix)
```

To define weight or bias variables, create a `tf.Variable` object.

要定义权重或者偏置变量，请创建一个

Going on with the parameters of the

The 4th parameter

接着说

第四个参数

tf.nn.conv2d() is just an example of TensorFlow operations.

tf.nn.conv2d()只是TensorFlow运算（

All the functions in the TensorChain class are based on the most basic TensorFlow operations and variables. After learning about these basic concepts, you can actually abandon TensorChain and try implementing your own neural networks!

TensorChain类中的所有成员函数都是基于最基本的TensorFlow运算和变量的。实际上，了解了这些，你现在已经可以抛开TensorChain的束缚，去尝试实现你自己的神经网络了！

I'm not joking just now! But I know that there are a lot of things that you still don't understand about using TensorFlow - like "how do I visualize my computation graph", "how do I save/load my model to/from files", "how do I record some tensors' values while training" or "how do I view the loss curves" - after all TensorFlow APIs are far more complicated than just building those nets. Those are also important techniques in your research. If you'd rather ask me than spending some time experimenting, please go on with reading.

我，我真没开玩笑！但是我知道关于如何使用TensorFlow，你还有许许多多的问题，好比“如何可视化地查看我的计算图结构”、“如何存储/读取模型文件”、“如何记录训练过程中某些Tensor的真实值”、“如何查看损失函数的变化曲线”——毕竟TensorFlow的API太复杂了，远比搭建神经网络那点函数复杂得多。上边说的那些是你使用TensorFlow研究过程中的重要技巧。如果你愿意听我讲而不想花些时间尝试的话，请继续读下去。

The very first thing that you may want to do after training a network model with nice outcomes would be saving it. Saving a model is fairly easy: just use a `tf.train.Saver`:

训练出一个看起来输出还不错的神经网络模型后你想做的第一件事恐怕就是把它存下来了吧？保存模型其实非常简单：只要用一个

```python
with tf.Session() as session:
```

I saved my model and variable values to 'models/model.ckpt'. But actually, you'll find 3 files in the 'models' directory: `model.ckpt.meta`, `model.ckpt.index` and `model.ckpt.data-00000-of-00001`.

我把我的模型和变量值存到了'models/model.ckpt'文件里。但是！实际上在models目录里你会找到三个文件：

```python
with tf.Session() as session:
```

Remember that session.run(tf.global_variables_initializer()) shouldn't be executed, since the variables are already initialized with the contents of your saved checkpoint.

If you only need the graph to be loaded, only load the `.meta` file:

记住，这时候就不要再去执行session.run(tf.global_variables_initializer())了，因为变量已经用存储的checkpoint文件内容初始化过了。

如果只需要读取计算图结构，只要读取

```python
with tf.Session() as session:
```

Function

```python
with tf.Session() as session:
```

To retrieve normal tensors, you'll have to append a `:0` to the Tensor's name.

要取回一般的Tensor，需要在Tensor的name属性值后边加一个

Sometimes you may want to explain some algorithms or principles with beautiful formulae in your blog. How to do this? Edit them in Microsoft Word, take a screenshot, crop it and put it in the blog post? When you finish your article and find out that you missed a symbol in the pictures - oh man, gotta repeat that again? Stop using those images now! A beautiful math display engine - MathJax allows you to code math like a coder.

$$\mathcal{C}\phi \delta e \mathfrak{M}\alpha th \mathit{I}n \mathcal{H}ex\sigma \mathbb{N}o\omega!$$

First, install *hexo-math* in your Hexo blog directory.

```shell
$ npm install hexo-math --save
```

Then, add *math* configurations in your *_config.yml* file.

```yaml
math:
```

Finally, also add to your *_config.yml* file in the **theme directory** these configurations below.

```yaml
mathjax:
```

Maybe you don't have to use math in every blog post. If so, insert the following snippet in your Markdown file also works.

```html
<script src='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/MathJax.js?config=TeX-MML-AM_CHTML' async></script>
```

MathJax supports the same grammar that LaTeX does. To learn more about LaTeX, please refer to Chapter 3 of The Not So Short Introduction to LaTeX (CN version also available here).

Use a "\\(" and a "\\)" to insert an inline formula (they mark the boundaries of the formula), or a pair of "$$" to insert one that occupies its own line. I'll give a few examples below.

```latex
\\(\mathcal{F}(x)=\mathcal{H}(x)-x\\)
```

\(\mathcal{F}(x)=\mathcal{H}(x)-x\)

```latex
\\(E=mc^2\\)
```

\(E=mc^2\)

```latex
$$\lim_{n\rightarrow \infty}(1+2^n+3^n)^\frac{1}{x+\sin n}$$
```

$$\lim_{n\rightarrow \infty}(1+2^n+3^n)^\frac{1}{x+\sin n}$$

```latex
$$\mathcal{C}\phi \delta e \mathfrak{M}\alpha th \mathit{I}n \mathcal{H}ex\sigma \mathbb{N}o\omega!$$
```

$$\mathcal{C}\phi \delta e \mathfrak{M}\alpha th \mathit{I}n \mathcal{H}ex\sigma \mathbb{N}o\omega!$$

This list will be appended whenever I find any more.

This is a tough problem. Hexo renderer would first render the .md file into a .html file, and the MathJax script will only work on the .html file. Therefore, when there are multiple subscript symbols, they might be rendered as <em></em> tags.

For example: when you actually need a full-line formula \(x_{i+1}+y_j\), perhaps you'll get a "$$x*{i+1}+y*j$$" instead. Look into the HTML code and you'll understand why.

My solution for now is to give up this Markdown emphasis symbol: both "_" and "*" can be used as emphasis tags, so the alternative "*" still works if we remove "_". Escaping with "\_" also works, but "_" appears so frequently in math (while "*" doesn't) that escaping would turn our math code into a mess.

How do we do this? Bravely look into the *node_modules* directory and find the renderer of the Hexo engine. My renderer is *marked*, which is the default for Hexo. There is a file named *marked.js* inside the *node_modules/marked/lib/* directory. You can find two appearances of "em:", like this:

```js
var inline = {
```

and

```js
inline.pedantic = merge({}, inline.normal, {
```

Modify the regular expression after them - remove the one about "_"s and leave the one about "*"s. The new version would be:

```js
var inline = {
```

and

```js
inline.pedantic = merge({}, inline.normal, {
```

From now on, you can use "_" as the subscript symbol in MathJax freely, without worrying about it becoming <em></em> tags anymore.

For example, in my previous post about ResNet, I tried to use the following code to start a new line in an equation while aligning the lines to the equal sign:

```latex
$$\frac{\partial{\mathcal{E}}}{\partial{x_l}} & = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\frac{\partial{x_L}}{\partial{x_l}}\\\\
```

The "&" symbols were used to align the lines to a certain point. However, the result was a "Misplaced &" prompt.

By disabling MathJax, I found that the rendered equation was correct, which means **the problem isn't with the Hexo renderer**. This was when I realized that although `\begin{equation}` and `\end{equation}` are not necessary, `\begin{split}` and `\end{split}` shouldn't be removed. Surrounding the equation with them works. My code is here:

```
$$\begin{split}
\frac{\partial{\mathcal{E}}}{\partial{x_l}} & = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\frac{\partial{x_L}}{\partial{x_l}}\\\\
& = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\Big(1+\frac{\partial{}}{\partial{x_l}}\sum_{i=l}^{L-1}\mathcal{F}(x_i,\mathcal{W}_i)\Big)
\end{split}$$
```

And it runs like:

$$\begin{split}

\frac{\partial{\mathcal{E}}}{\partial{x_l}} & = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\frac{\partial{x_L}}{\partial{x_l}}\\

& = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\Big(1+\frac{\partial{}}{\partial{x_l}}\sum_{i=l}^{L-1}\mathcal{F}(x_i,\mathcal{W}_i)\Big)

\end{split}$$

If you encounter other issues while using MathJax with Hexo (with or without a solution), feel free to leave a comment below!

Deep learning researchers have been constructing skyscrapers in recent years. In particular, VGG nets and GoogLeNet have pushed the depths of convolutional networks to the extreme. But a question remains: if time and money aren't problems, do deeper networks always perform better? Not exactly.

When residual networks were proposed, researchers around the world were stunned by their depth. "Jesus Christ! Is this a neural network or the Burj Khalifa?" But **don't be afraid!** These networks are deep, but their structures are simple. Interestingly, they not only defeated all opponents in the classification, detection and localization challenges of ImageNet 2015, but were also the main innovation in the best paper of CVPR 2016.

VGG nets proved the benefit of the representation depth of convolutional neural networks - at least within a certain range, to be exact. However, when Kaiming He et al. tried to deepen some plain networks, the training and test errors stopped decreasing once the networks reached a certain depth (which is not surprising) and soon degraded. This is not an overfitting problem, because the training error also increased; nor is it a gradient vanishing problem, because some techniques (e.g. batch normalization [4]) ease that pain.

Fig. 1 The degradation problem

What seems to be the cause of this degradation? Obviously, deeper neural networks are more difficult to train, but that doesn't mean deeper networks must yield worse results. To explain this problem, Balduzzi et al. [3] identified the shattered gradients problem - as depth increases, gradients in standard feedforward networks increasingly resemble white noise. I will write about that later.

As the old Chinese saying goes, "a journey of a thousand miles begins with a single step". Although ResNets can be as deep as a thousand layers, they are built from these basic residual blocks (the right part of the figure).

Fig. 2 Parts of plain networks and a residual block (or residual unit)

In comparison, the basic units of plain network models look like the one on the left: a ReLU function after a weight layer (usually also with biases), repeated several times. Let's denote the desired underlying mapping (the ideal mapping) of the two layers as \(\mathcal{H}(x)\), and the mapping they actually learn as \(\mathcal{F}(x)\). Clearly, the closer \(\mathcal{F}(x)\) is to \(\mathcal{H}(x)\), the better the fit.

However, He et al. explicitly let these layers fit a residual mapping instead of the desired underlying mapping. This is implemented with "shortcut connections", which skip one or more layers, simply performing an identity mapping whose output is added to the output of the stacked weight layers. This way, \(\mathcal{F}(x)\) does not try to fit \(\mathcal{H}(x)\), but \(\mathcal{H}(x)-x\). The whole structure (from the identity mapping branch to the merging of the branches by addition) is named a "residual block" (or "residual unit").

What's the point of this? Let's do a simple analysis. The computation done by the original residual block is: $$y_l=h(x_l)+\mathcal{F}(x_l,\mathcal{W}_l),$$ $$x_{l+1}=f(y_l).$$

Here are the definitions of symbols:

\(x_l\): input features to the \(l\)-th residual block;

\(\mathcal{W}_l=\{W_{l,k}\}_{1\leq k\leq K}\): a set of weights (and biases) associated with the \(l\)-th residual unit. \(K\) is the number of layers in this block;

\(\mathcal{F}(x,\mathcal{W})\): the residual function, which we talked about earlier. It's a stack of 2 conv. layers here;

\(f(x)\): the activation function. We are using ReLU here;

\(h(x)\): identity mapping.

If \(f(x)\) is also an identity mapping (as if we weren't using any activation function), the two equations combine into:

$$x_{l+1}=x_l+\mathcal{F}(x_l,\mathcal{W}_l)$$

Therefore, for any deeper block \(L\), we can express \(x_L\) recursively in terms of the features \(x_l\) of any shallower block \(l\):

$$x_L=x_l+\sum_{i=l}^{L-1}\mathcal{F}(x_i,\mathcal{W}_i)$$
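This recursion can be checked numerically with a toy stack of residual blocks (the dense layers and 4-dimensional shapes below are illustrative assumptions of mine, not the paper's convolutional architecture):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Toy residual function F(x, W): a stack of two weight layers, as in the text.
def F(x, w):
    w1, w2 = w
    return w2 @ relu(w1 @ x)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
weights = [(0.1 * rng.standard_normal((4, 4)),
            0.1 * rng.standard_normal((4, 4))) for _ in range(5)]

# Forward through 5 identity-shortcut blocks: x_{l+1} = x_l + F(x_l, W_l)
x, residuals = x0, []
for w in weights:
    r = F(x, w)
    residuals.append(r)
    x = x + r

# The unrolled form holds exactly: x_L = x_0 + sum_i F(x_i, W_i)
print(np.allclose(x, x0 + sum(residuals)))  # True
```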

That's not the end yet! When it comes to the gradients, according to the chain rule of backpropagation, we get a beautiful expression:

$$\begin{split}

\frac{\partial{\mathcal{E}}}{\partial{x_l}} & = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\frac{\partial{x_L}}{\partial{x_l}}\\

& = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\Big(1+\frac{\partial{}}{\partial{x_l}}\sum_{i=l}^{L-1}\mathcal{F}(x_i,\mathcal{W}_i)\Big)

\end{split}$$

What does it mean? It means that the information is directly backpropagated to ANY shallower block. Thanks to the additive term 1, the gradient of a layer hardly vanishes even when the residual weights are small, and there is no long multiplicative chain of weights to make it explode.
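A minimal scalar sketch (my own toy example, not from the paper) shows the effect of that additive 1. With \(x_{l+1}=x_l+F(x_l)\) and \(F(x)=0.1\tanh(x)\), the derivative \(dx_L/dx_l\) is a product of factors \(1+F'(x_i)\), each at least 1:

```python
import math

# Forward through a chain of scalar identity-shortcut updates.
def forward(x, depth=10):
    for _ in range(depth):
        x = x + 0.1 * math.tanh(x)
    return x

# Estimate dx_L / dx_l by central finite differences.
x, eps = 0.5, 1e-6
grad = (forward(x + eps) - forward(x - eps)) / (2 * eps)
print(grad > 1.0)  # True: every factor 1 + 0.1/cosh^2(x_i) exceeds 1
```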

It's important that we use an identity mapping here! Just consider a simple modification, for example \(h(x)=\lambda_l x_l\) (where \(\lambda_l\) is a modulating scalar). The definitions of \(x_L\) and \(\frac{\partial{\mathcal{E}}}{\partial{x_l}}\) become:

$$x_L=(\prod_{i=l}^{L-1}\lambda_i)x_l+\sum_{i=l}^{L-1}(\prod_{j=i+1}^{L-1}\lambda_j)\mathcal{F}(x_i,\mathcal{W}_i)$$

$$\frac{\partial{\mathcal{E}}}{\partial{x_l}}=\frac{\partial{\mathcal{E}}}{\partial{x_L}}\Big((\prod_{i=l}^{L-1}\lambda_i)+\frac{\partial{}}{\partial{x_l}}\sum_{i=l}^{L-1}(\prod_{j=i+1}^{L-1}\lambda_j)\mathcal{F}(x_i,\mathcal{W}_i)\Big)$$

For extremely deep neural networks where \(L\) is very large, \(\prod_{i=l}^{L-1}\lambda_i\) can be either too small or too large, causing gradients to vanish or explode. For an \(h(x)\) with a complex definition, the gradient can be extremely complicated, losing the advantage of the skip connection. The skip connection works best when the grey channel in Fig. 3 covers no operations (except the addition) and stays clean.

Interestingly, this confirms the philosophy of "the greatest truths are the simplest" once again.

Wait a second... "\(f(x)\) is also an identity mapping" is just our assumption. The activation function is still there!

Right. There IS an activation function, but it's moved elsewhere. In fact, the original residual block is still a little problematic - the output of one residual block is not exactly the input of the next, since there is a ReLU activation after the addition (it did NOT really keep the identity mapping to the next block!). Therefore, in [2], He et al. fixed the residual blocks by changing the order of operations.

Fig. 3 New identity mapping proposed by He et al.

Besides using a simple identity mapping, He et al. also discussed the positions of the activation function and the batch normalization operation. Assume we have a special (asymmetric) activation function \(\hat f(x)\) that only affects the path to the next residual unit. Now our definition of \(x_{l+1}\) becomes:

$$x_{l+1}=x_l+\mathcal{F}(\hat f(x_l),\mathcal{W}_l)$$

With \(x_l\) still multiplied by 1, information is fully backpropagated to shallower residual blocks. And the good thing is that using this asymmetric activation function after the addition (partial post-activation) is equivalent to using it at the start of the next residual function (pre-activation)! This is why He et al. chose pre-activation - otherwise it would be necessary to actually implement that magical activation function \(\hat f(x)\).

Fig. 4 Using asymmetric after-addition activation is equivalent to constructing a pre-activation residual unit
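Here is a toy sketch of the two orderings, using dense layers instead of the paper's conv layers (the shapes and the zero-weight check are my own illustrative assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def post_act_block(x, w1, w2):
    # Original unit: ReLU applied AFTER the addition breaks the identity path.
    return relu(x + w2 @ relu(w1 @ x))

def pre_act_block(x, w1, w2):
    # Full pre-activation: the shortcut path stays clean.
    return x + w2 @ relu(w1 @ relu(x))

# With zero residual weights, a block should be a pure identity mapping.
x = np.array([-1.0, 2.0, -3.0, 4.0])
zero = np.zeros((4, 4))
print(pre_act_block(x, zero, zero))   # [-1.  2. -3.  4.] - identity preserved
print(post_act_block(x, zero, zero))  # [0. 2. 0. 4.] - negatives clipped
```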

Here are the ResNet architectures for ImageNet. Building blocks are shown in brackets, with the numbers of blocks stacked. Downsampling is performed by the first block of each stage (starting from conv3_x). Each column represents one of the residual networks, and the deepest one has 152 weight layers! Since ResNets were proposed, VGG nets - officially called "Very Deep Convolutional Networks" - are not relatively deep anymore. Maybe call them "A Little Bit Deep Convolutional Networks".

Table. 1 ResNet architectures for ImageNet.

He et al. trained ResNet-18 and ResNet-34 on the ImageNet dataset and compared them to plain convolutional networks. In Fig. 5, the thin curves denote training error, and the bold ones denote validation error. The figure on the left shows the results of the plain networks (where the 34-layer one has higher error rates than the 18-layer one), and the figure on the right shows that residual networks perform better than plain ones, while the deeper one outperforms the shallower one.

Fig. 5 Training ResNet on ImageNet

He et al. also tried various types of shortcut connections to replace the identity mapping, and various positions of activation functions / batch normalization. Experiments show that the original identity mapping and full pre-activation yield the best results.

Fig. 6 Various shortcuts in residual units

Table. 2 Classification error on CIFAR-10 test set with various shortcut connections in residual units

Fig. 7 Various usages of activation in residual units

Table. 3 Classification error on CIFAR-10 test set with various usages of activation in residual units

Residual learning can be crowned as "ONE OF THE GREATEST HITS IN DEEP LEARNING FIELDS". With a simple identity mapping, it solved the degradation problem of deep neural networks. Now that you have learned about the concept of ResNet, why not give it a try and implement your first residual learning model today?

Convolutional neural networks (CNNs) have enjoyed great success in computer vision research in the past few years. A number of attempts have been made to improve the accuracy and performance of the original CNN architecture. In 2014, Karen Simonyan et al. investigated the effect of depth on CNN accuracy in large-scale image recognition (thereby also proposing a series of very deep CNNs usually called VGG nets). The results confirmed the importance of CNN depth in visual representations.

Before introducing VGG net, let's take a glance at prior convolutional neural networks.

Basic neural network structures (for example, the multi-layer perceptron) learn patterns on 1D vectors, which cannot cope well with 2D features in images. In 1998, LeCun et al. proposed a convolutional network model called LeNet-5. Its structure is fairly simple: two convolution layers, two subsampling layers and a few fully connected layers. This network was used to solve a digit recognition problem. (If you need to learn more about the convolution operation, please refer to Google or *Digital Image Processing* by Rafael C. Gonzalez.)

Fig. 1 Architecture of LeNet
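As a quick sanity check of those layer sizes, here is a rough spatial shape walk (a sketch assuming the classic 32x32 input with 5x5 valid convolutions and 2x2 subsampling; channel counts are omitted):

```python
def conv(size, k=5):
    return size - k + 1   # valid convolution, stride 1

def pool(size):
    return size // 2      # 2x2 subsampling

s = 32
s = pool(conv(s))  # C1 -> S2: 32 -> 28 -> 14
s = pool(conv(s))  # C3 -> S4: 14 -> 10 -> 5
print(s)  # 5: the resulting 5x5 maps feed the fully connected layers
```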

In 2012, Alex Krizhevsky et al. won first place in ILSVRC-2012 (ImageNet Large-Scale Visual Recognition Challenge 2012), achieving a top-5 error rate of 15.3% with a convolutional network model, while the second-best entry only achieved 26.2%. The network, namely AlexNet, was trained on two GTX 580 3GB GPUs in parallel; since a single GTX 580 has only 3GB of memory, the maximum size of the network was limited. This model proved the effectiveness of CNNs under complicated circumstances and the power of GPUs. So what if the network can go deeper? Will the top-5 error rate get even lower?

Fig. 2 Architecture of AlexNet

Here comes our hero - VGG nets. By the way, VGG is not the name of the network, but of the authors' group - the *Visual Geometry Group* from the Department of Engineering Science, University of Oxford. The networks they proposed were therefore named after the group. The main contributions of VGG nets are: 1. more but smaller convolution filters; 2. great network depth.

Rather than using relatively large receptive fields in the first convolution layers, Simonyan et al. selected very small 3x3 receptive fields throughout the whole net, convolved with the input at every pixel with a stride of 1. As shown in the figures below, a stack of two 3x3 convolution layers has an effective receptive field of 5x5. We can similarly conclude that a stack of three 3x3 convolution layers has an effective receptive field of 7x7.

Fig. 3 A convolution layer with one 5x5 conv. filter has a receptive field of 5x5

Fig. 4 A stack of two 3x3 conv. filters also has an effective receptive field of 5x5

Now that we're clear that stacks of small-kernel convolution layers have equally sized receptive fields, why are they the better choice? The first advantage is incorporating more rectification layers instead of a single one, since every convolution layer includes an activation function (usually ReLU). More rectification brings more non-linearity, and more non-linearity makes the decision function more discriminative. Also, when the receptive field isn't too large, a stack of 3x3 convolution layers has fewer parameters to train. Assuming the numbers of input and output channels of a convolution layer stack are equal (call it C) and the receptive field is 5x5, we have \(2\times3\times3\times C\times C=18C^2\) parameters instead of \(5\times5\times C\times C=25C^2\). Similarly, when the receptive field is 7x7, we have \(3\times3\times3\times C\times C=27C^2\) instead of \(7\times7\times C\times C=49C^2\). And when the field gets even larger? The advantage of a function with \(O(n)\) complexity over one with \(O(n^2)\) only grows as \(n\) grows.
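The receptive-field and parameter arithmetic above can be sketched as a back-of-the-envelope helper (C = 64 is an arbitrary choice of mine):

```python
# Effective receptive field of n stacked k x k conv layers (stride 1):
# each extra layer adds (k - 1) pixels.
def effective_rf(n, k=3):
    return 1 + n * (k - 1)

# Parameters of a stack of n k x k conv layers with C in/out channels
# (biases ignored), following the counting in the text.
def stack_params(n, k, c):
    return n * k * k * c * c

c = 64
print(effective_rf(2), stack_params(2, 3, c), stack_params(1, 5, c))
# 5 73728 102400  -> two 3x3 layers beat one 5x5: 18C^2 < 25C^2
print(effective_rf(3), stack_params(3, 3, c), stack_params(1, 7, c))
# 7 110592 200704 -> three 3x3 layers beat one 7x7: 27C^2 < 49C^2
```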

Cliché time. Just like any blogger mentioning VGG nets, here are the network structures proposed by Simonyan et al.

Table. 1 VGG nets of various depths

Read the table column by column. Each column (A, A-LRN, B, C, D, E) corresponds to one network structure. As you can see, the networks grow from 11 layers (in net A) to 19 layers (in net E). Each time something is added relative to the previous net, it appears in bold. Clearly, LRN (Local Response Normalization) didn't work well in this case (actually, the A-LRN net performed worse than A while consuming much more memory and computation time), and was thus removed.

What's worth mentioning are the 1x1 convolution layers appearing in network C. They are a way to increase the non-linearity of the decision function (again by introducing activation functions) while keeping the size of the receptive fields unchanged.

Bad initialization could stall learning due to the instability of gradients in deep networks. Therefore, the authors first trained network A, which is shallow enough to be trained with random initialization. The deeper networks (B to E) were then initialized with the pre-trained models, and only the weights of the new layers were randomly initialized.

In spite of the larger number of parameters and the greater depth compared to AlexNet, the VGG nets required fewer epochs to converge due to the implicit regularization imposed by greater depth and smaller convolution filter sizes, and due to the pre-initialization of certain layers. They also generalize well to other datasets, achieving state-of-the-art performance. Results of VGG nets in comparison with other models in ILSVRC are shown in the table below.

Table. 2 VGG net performance, in comparison with the state of the art in ILSVRC classification

In conclusion, representation depth is beneficial for classification accuracy, and state-of-the-art performance on the ImageNet challenge dataset can be achieved with a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) of substantially increased depth.

You might ask: why not go even deeper? With more powerful GPUs (the authors used Titan Black GPUs), we can absolutely train deeper networks that perform better! Not exactly. Problems arise as networks get too deep, and this is where ResNet comes in.

Whenever you typed an ordinary straight apostrophe like

```
'
```

Hexo would convert it to a symbol like this:

```
’
```

You might say that this is also an apostrophe, but it really looks UNBEARABLE in the articles. It's been a problem bothering me for more than a month. (I'm not saying that this is the reason for not updating my blog, but I don't mind if you think so!)

Therefore, I Googled this problem and tried to find other victims. According to their posts, the problem is caused by *marked* - the default Markdown renderer of Hexo. The *smartypants* function of marked is turned on by default.

Now take a look at the introduction of *smartypants* on the *hexo-renderer-marked* page:

smartypants - Use "smart" typographic punctuation for things like quotes and dashes.

C'mon, seriously?

There are a few bloggers who solved this by adding the code below to the _config.yml file in the blog directory.

```yml
marked:
  smartypants: false
```

This worked for most victims (perhaps all of them), but not for me. I have no idea why this config wasn't working, so if anyone finds out the reason, please contact me by e-mail.

If you're sure *smartypants* is causing the problem, and the solution above didn't work for you either, maybe you can try my solution.

Since *hexo-renderer-marked* is installed in the blog's *node_modules* directory (it may also be in your Node.js directory if installed globally), isn't it possible to change its own configuration? I looked at the *index.js* file in the *node_modules/hexo-renderer-marked/* directory. There you are, smartypants!

```js
// defaults in hexo-renderer-marked's index.js (your version may differ)
hexo.config.marked = assign({
  gfm: true,
  pedantic: false,
  sanitize: false,
  tables: true,
  breaks: true,
  smartLists: true,
  smartypants: true, // <- change this to false
  modifyAnchors: '',
  autolink: true
}, hexo.config.marked);
```

Now you know what to do.

Aaaaaaaaaaaaaaaand many thanks to Xizi Wu, the artist of my new avatar! I love it!

```sh
for i in 'Harper' 'Sweet' 'Kobayashi' 'Kawasaki'
```

International linkinpark213 Day is a global anniversary set up by Harper Long in 2011 A.D., celebrated on February 13th every year. The anniversary is officially written as 'linkinpark213 Day', with a lower-case first letter. Its establishment dates back to the early 2010s.

To date, the population celebrating this anniversary has reached 1e-6 million, and its distribution has expanded from a small county to the whole middle-China area, including Hebei, Henan, Shanxi and Shaanxi provinces. Some Japanese residents also plan to celebrate this day in 2019.

According to the modern Chinese habit of writing, 'February 13th' is usually written as '2.13'. Also, '13' and 'B' look similar and are often regarded as equal. Therefore, '2.13' can be transformed into '2B', a common word in Chinese. Although the word is sometimes classified as "offensive", it reflects feelings of optimism, bravery and entertainment.

When the anniversary was first set up, there were no officially specified ways of celebration. People gathered, held parties and enjoyed spending time together.

On the 3rd linkinpark213 Day, a proposal by a high school Chinese teacher was adopted as the official way of celebration - on each linkinpark213 Day, a number of people participate in the Pigeon-Flying Competition funded by Harper Long. This way of celebration has prevailed until now, and every year all participants except Harper Long have won the competition.

Pigeon-Flying is a broadly-accepted traditional Chinese custom, the exact origin of which is too ancient to be traced. Modern scholars tend to believe that the custom was well-known in China no later than 206 A.D. In modern times, pigeon-flying is an activity involving "making a promise" and "not keeping it". According to some folk stories, the activity was first so named when a pigeon keeper forgot to keep a promise he had made with a friend.

```java
public static void main(String[] args) {
```

(Please notice that the pigeon-flying mentioned here is not the same activity as the one performed at every Olympics since 1896 A.D.)

Join our celebration today! You can easily participate in the Pigeon-Flying Competition by not participating.
