**(UDT): Unsupervised Deep Tracking**
**Authors**: Ning Wang, Yibing Song, Chao Ma, Wengang Zhou, Wei Liu, Houqiang Li
**arXiv Link**: https://arxiv.org/abs/1904.01828
**Project Link**: https://github.com/594422814/UDT
**Summary**: Trains a robust Siamese network on large-scale unlabeled videos in an unsupervised, forward-and-backward manner: the tracker forward-localizes the target object in successive frames and backtraces to its initial position in the first frame.
**Highlights**: Unsupervised learning

**(TADT): Target-Aware Deep Tracking**
**Authors**: Xin Li, Chao Ma, Baoyuan Wu, Zhenyu He, Ming-Hsuan Yang
**arXiv Link**: https://arxiv.org/abs/1904.01772
**Project Link**: https://xinli-zn.github.io/TADT-project-page/
**Summary**: Targets of interest can be of arbitrary object classes with arbitrary forms, and pre-trained deep features are less effective at modeling such targets for distinguishing them from the background. TADT learns target-aware features, and can thus recognize targets undergoing significant appearance variations better than pre-trained deep features.
**Highlights**: Target-aware features, better discrimination

**(SiamMask): Fast Online Object Tracking and Segmentation: A Unifying Approach**
**Authors**: Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, Philip H.S. Torr
**arXiv Link**: https://arxiv.org/abs/1812.05050
**Project Link**: https://github.com/foolwood/SiamMask
**Zhihu Link**: https://zhuanlan.zhihu.com/p/58154634
**Summary**: Performs both visual object tracking and semi-supervised video object segmentation, in real time, with a single simple approach.
**Highlights**: Mask prediction in tracking

**SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks**
**Authors**: Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, Junjie Yan
**arXiv Link**: https://arxiv.org/abs/1812.11703
**Project Link**: http://bo-li.info/SiamRPN++/
**Summary**: SiamRPN++ breaks the translation-invariance restriction through a simple yet effective spatial-aware sampling strategy. It performs depth-wise and layer-wise aggregations, improving accuracy while also reducing model size. Current state-of-the-art on OTB2015, VOT2018, UAV123, LaSOT, and TrackingNet.
**Highlights**: Deep backbones, state-of-the-art

**(CIR/SiamDW): Deeper and Wider Siamese Networks for Real-Time Visual Tracking**
**Authors**: Zhipeng Zhang, Houwen Peng
**arXiv Link**: https://arxiv.org/abs/1901.01660
**Project Link**: https://github.com/researchmm/SiamDW
**Summary**: SiamDW explores utilizing deeper and wider network backbones from another angle - careful designs of residual units, considering receptive field, stride, and output feature size - to eliminate the negative impact of padding in deep network backbones.
**Highlights**: Cropping-Inside-Residual, eliminating the negative impact of padding

**(SiamC-RPN): Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking**
**Authors**: Heng Fan, Haibin Ling
**arXiv Link**: https://arxiv.org/abs/1812.06148
**Project Link**: None
**Summary**: Previously proposed one-stage Siamese-RPN trackers degenerate in the presence of similar distractors and large scale variation. Advantages: 1) each RPN in Siamese C-RPN is trained using outputs of the previous RPN, thus simulating hard negative sampling; 2) feature transfer blocks (FTB) further improve discriminability; 3) the location and shape of the target are progressively refined by each RPN, resulting in better localization.
**Highlights**: Cascaded RPN, excellent accuracy

**SPM-Tracker: Series-Parallel Matching for Real-Time Visual Object Tracking**
**Authors**: Guangting Wang, Chong Luo, Zhiwei Xiong, Wenjun Zeng
**arXiv Link**: https://arxiv.org/abs/1904.04452
**Project Link**: None
**Summary**: To meet the simultaneous requirements on robustness and discrimination power, SPM-Tracker tackles the challenge by connecting a coarse matching stage and a fine matching stage, taking advantage of both, resulting in superior performance and exceeding other real-time trackers by a notable margin.
**Highlights**: Coarse matching & fine matching

**ATOM: Accurate Tracking by Overlap Maximization**
**Authors**: Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, Michael Felsberg
**arXiv Link**: https://arxiv.org/abs/1811.07628
**Project Link**: https://github.com/visionml/pytracking
**Summary**: Target estimation is a complex task, requiring high-level knowledge about the object, while most trackers only resort to a simple multi-scale search. In contrast, ATOM estimates target states by predicting the overlap between the target object and an estimated bounding box. In addition, a classification component is trained online to guarantee high discriminative power in the presence of distractors.
**Highlights**: Overlap IoU prediction

**(GCT): Graph Convolutional Tracking**
**Authors**: Junyu Gao, Tianzhu Zhang, Changsheng Xu
**arXiv Link**: None
**PDF Link**: http://openaccess.thecvf.com/content_CVPR_2019/papers/Gao_Graph_Convolutional_Tracking_CVPR_2019_paper.pdf
**Project Link**: http://nlpr-web.ia.ac.cn/mmc/homepage/jygao/gct_cvpr2019.html
**Summary**: Spatial-temporal information can provide diverse features to enhance the target representation. GCT incorporates 1) a spatial-temporal GCN to model the structured representation of historical target exemplars, and 2) a context GCN to exploit the context of the current frame to learn adaptive features for target localization.
**Highlights**: Graph convolutional networks, spatial-temporal information

**(ASRCF): Visual Tracking via Adaptive Spatially-Regularized Correlation Filters**
**Authors**: Kenan Dai, Dong Wang, Huchuan Lu, Chong Sun, Jianhua Li
**arXiv Link**: None
**Project Link**: https://github.com/Daikenan/ASRCF (To be updated)
**Summary**: ASRCF simultaneously optimizes the filter coefficients and the spatial regularization weight. It applies two correlation filters (CFs) to estimate location and scale respectively: 1) a location CF model, which exploits ensembles of shallow and deep features to determine the optimal position accurately, and 2) a scale CF model, which works on multi-scale shallow features to estimate the optimal scale efficiently.
**Highlights**: Estimate location and scale respectively

**(RPCF): RoI Pooled Correlation Filters for Visual Tracking**
**Authors**: Yuxuan Sun, Chong Sun, Dong Wang, You He, Huchuan Lu
**arXiv Link**: None
**Project Link**: None
**PDF Link**: http://openaccess.thecvf.com/content_CVPR_2019/papers/Sun_ROI_Pooled_Correlation_Filters_for_Visual_Tracking_CVPR_2019_paper.pdf
**Summary**: RoI-based pooling can be equivalently achieved by enforcing additional constraints on the learned filter weights, and thus becomes feasible on the virtual circular samples. By incorporating RoI pooling into the correlation filter formulation, RPCF performs favourably against other state-of-the-art trackers.
**Highlights**: RoI pooling in correlation filters

**(TBA): Tracking by Animation: Unsupervised Learning of Multi-Object Attentive Trackers**
**Authors**: Zhen He, Jian Li, Daxue Liu, Hangen He, David Barber
**arXiv Link**: https://arxiv.org/abs/1809.03137
**Project Link**: https://github.com/zhen-he/tracking-by-animation
**Summary**: The common Tracking-by-Detection (TBD) paradigm uses supervised learning and treats detection and tracking separately. Instead, TBA is a differentiable neural model that first tracks objects from input frames, animates these objects into reconstructed frames, and learns from the reconstruction error through backpropagation. In addition, Reprioritized Attentive Tracking is proposed to improve the robustness of data association.
**Highlights**: Label-free, end-to-end MOT learning

**Eliminating Exposure Bias and Metric Mismatch in Multiple Object Tracking**
**Authors**: Andrii Maksai, Pascal Fua
**arXiv Link**: https://arxiv.org/abs/1811.10984
**Project Link**: None
**Summary**: Many state-of-the-art MOT approaches now use sequence models to solve identity switches, but their training can be affected by biases. An iterative scheme of building a rich training set is proposed and used to learn a scoring function that is an explicit proxy for the target tracking metric.
**Highlights**: Eliminating loss-evaluation mismatch

**Multi-Person Articulated Tracking With Spatial and Temporal Embeddings**
**Authors**: Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian
**arXiv Link**: https://arxiv.org/abs/1903.09214
**Project Link**: None
**Summary**: The framework consists of a SpatialNet and a TemporalNet, predicting (body part detection heatmaps + Keypoint Embedding (KE) + Spatial Instance Embedding (SIE)) and (Human Embedding (HE) + Temporal Instance Embedding (TIE)), respectively. In addition, a differentiable Pose-Guided Grouping (PGG) module makes the whole part detection and grouping pipeline fully end-to-end trainable.
**Highlights**: Spatial & temporal embeddings, end-to-end learning of the "detection and grouping" pipeline

**(STAF): Efficient Online Multi-Person 2D Pose Tracking With Recurrent Spatio-Temporal Affinity Fields**
**Authors**: Yaadhav Raaj, Haroon Idrees, Gines Hidalgo, Yaser Sheikh
**arXiv Link**: https://arxiv.org/abs/1811.11975
**Project Link**: None
**Summary**: Building upon the Part Affinity Field (PAF) representation designed for static images, an architecture encoding and predicting Spatio-Temporal Affinity Fields (STAF) across a video sequence is proposed - a novel temporal topology cross-linked across limbs which can consistently handle body motions of a wide range of magnitudes. The network ingests STAF heatmaps from previous frames and estimates those for the current frame.
**Highlights**: Online, fastest and most accurate bottom-up approach

**(OTR): Object Tracking by Reconstruction With View-Specific Discriminative Correlation Filters**
**Authors**: Ugur Kart, Alan Lukezic, Matej Kristan, Joni-Kristian Kamarainen, Jiri Matas
**arXiv Link**: https://arxiv.org/abs/1811.10863
**Summary**: Performs online 3D target reconstruction to facilitate robust learning of a set of view-specific discriminative correlation filters (DCFs). State-of-the-art on the Princeton RGB-D tracking and STC benchmarks.

I'm not experienced with point clouds, so I couldn't write summaries for the following papers. Their abstracts are given below; check them out on arXiv if you're interested.

**VITAMIN-E: VIsual Tracking and MappINg With Extremely Dense Feature Points**
**Authors**: Masashi Yokozuka, Shuji Oishi, Simon Thompson, Atsuhiko Banno
**arXiv Link**: https://arxiv.org/abs/1904.10324
**Project Link**: None
**Abstract**: In this paper, we propose a novel indirect monocular SLAM algorithm called "VITAMIN-E," which is highly accurate and robust as a result of tracking extremely dense feature points. Typical indirect methods have difficulty in reconstructing dense geometry because of their careful feature point selection for accurate matching. Unlike conventional methods, the proposed method processes an enormous number of feature points by tracking the local extrema of curvature informed by dominant flow estimation. Because this may lead to high computational cost during bundle adjustment, we propose a novel optimization technique, the "subspace Gauss--Newton method", that significantly improves the computational efficiency of bundle adjustment by partially updating the variables. We concurrently generate meshes from the reconstructed points and merge them for an entire 3D model. The experimental results on the SLAM benchmark dataset EuRoC demonstrated that the proposed method outperformed state-of-the-art SLAM methods, such as DSO, ORB-SLAM, and LSD-SLAM, both in terms of accuracy and robustness in trajectory estimation. The proposed method simultaneously generated significantly detailed 3D geometry from the dense feature points in real time using only a CPU.

**Leveraging Shape Completion for 3D Siamese Tracking**
**Authors**: Silvio Giancola*, Jesus Zarzar*, and Bernard Ghanem
**arXiv Link**: https://arxiv.org/abs/1903.01784
**Project Link**: https://github.com/SilvioGiancola/ShapeCompletion3DTracking
**Abstract**: Point clouds are challenging to process due to their sparsity, therefore autonomous vehicles rely more on appearance attributes than pure geometric features. However, 3D LIDAR perception can provide crucial information for urban navigation in challenging light or weather conditions. In this paper, we investigate the versatility of Shape Completion for 3D Object Tracking in LIDAR point clouds. We design a Siamese tracker that encodes model and candidate shapes into a compact latent representation. We regularize the encoding by enforcing the latent representation to decode into an object model shape. We observe that 3D object tracking and 3D shape completion complement each other. Learning a more meaningful latent representation shows better discriminatory capabilities, leading to improved tracking performance. We test our method on the KITTI Tracking set using car 3D bounding boxes. Our model reaches a 76.94% Success rate and 81.38% Precision for 3D Object Tracking, with the shape completion regularization leading to an improvement of 3% in both metrics.

**LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking**
**Authors**: Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, Haibin Ling
**arXiv Link**: https://arxiv.org/abs/1809.07845
**Project Link**: https://cis.temple.edu/lasot/
**Summary**: A high-quality benchmark for **La**rge-scale **S**ingle **O**bject **T**racking, consisting of 1,400 sequences with more than 3.5M frames.

**CityFlow: A City-Scale Benchmark for Multi-Target Multi-Camera Vehicle Tracking and Re-Identification**
**Authors**: Zheng Tang, Milind Naphade, Ming-Yu Liu, Xiaodong Yang, Stan Birchfield, Shuo Wang, Ratnesh Kumar, David Anastasiu, Jenq-Neng Hwang
**arXiv Link**: https://arxiv.org/abs/1903.09254
**Project Link**: https://www.aicitychallenge.org/
**Summary**: The largest-scale dataset in terms of spatial coverage and the number of cameras/videos in an urban environment, consisting of more than 3 hours of synchronized HD videos from 40 cameras across 10 intersections, with the longest distance between two simultaneous cameras being 2.5 km.

**MOTS: Multi-Object Tracking and Segmentation**
**Authors**: Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, Bastian Leibe
**arXiv Link**: https://arxiv.org/abs/1902.03604
**Project Link**: https://www.vision.rwth-aachen.de/page/mots
**Summary**: Goes beyond 2D bounding boxes and extends the popular task of multi-object tracking to multi-object tracking and segmentation, in both tasks and metrics.
**Highlights**: Extend MOT with segmentation

**Argoverse: 3D Tracking and Forecasting With Rich Maps**
**Authors**: Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, James Hays
**arXiv Link**: None
**PDF Link**: http://openaccess.thecvf.com/content_CVPR_2019/papers/Chang_Argoverse_3D_Tracking_and_Forecasting_With_Rich_Maps_CVPR_2019_paper.pdf
**Project Link**: Argoverse.org (Not working?)
**Summary**: A dataset designed to support autonomous vehicle perception tasks including 3D tracking and motion forecasting.

After coming back from VALSE 2019, I feel like I've become a diehard fan of Prof. Wanli Ouyang ╰( ᐖ╰)! The FishNet he presented at the conference really caught my eye. Such a great idea, why didn't I think of it! After returning I read the paper and code carefully; here is a brief summary.

Early typical deep CNN architectures were mostly funnel-shaped: repeated convolution and downsampling extract and condense image features, and a few fully connected layers at the end compute the task-specific output. This design fits image classification naturally, because deeper networks learn higher-level semantic features; when the image is finally condensed into a single pixel, i.e. a vector, each channel value of that pixel represents how the whole image expresses the corresponding semantic feature.

Fig. 1 A funnel-shaped CNN, taking VGG-16 as an example

However, applying this structure unchanged to other tasks does not work as well. In segmentation, preserving fine detail features improves results (e.g. FCN-8s far outperforms FCN-32s). In anchor-based object detection, larger feature maps regress proposal boxes for small objects better (e.g. adding FPN to YOLOv3 significantly improves small-object detection). Hence hourglass-shaped or even stacked-hourglass architectures (U-Net, FPN, Stacked Hourglass, etc.) emerged to better handle these tasks.

Fig. 2 An hourglass-shaped CNN, taking U-Net as an example

Most works of this kind share one idea: low-level detail features matter, so let's fuse them into the high-level semantic features. A natural follow-up question: can semantic features also be fused back into detail features, enhancing the high-resolution feature maps? FishNet achieves exactly this kind of fusion, so that in the last part of the network the feature maps at every resolution mix low-, mid-, and high-level features (pixel-level, region-level and image-level in the authors' words), each containing a bit of the others.

In ResNet, the authors use a clever trick to give shallower layers effective gradient information: an identity mapping is added onto each layer's output. That is, the layer's input \(x_l\), the next layer's input \(x_{l+1}\) and the layer's computation \(\mathcal{F}(x_l, \mathcal{W}_l)\) are related by $$x_{l+1}=x_l+\mathcal{F}(x_l, \mathcal{W_l})$$

One layer further:

$$x_{l+2}=x_l + \mathcal{F}(x_l, \mathcal{W_l}) + \mathcal{F}(x_{l+1}, \mathcal{W_{l+1}})$$

Unrolling all the way to the last layer \(x_L\):

$$x_{L}=x_l+\sum_{i=l}^{L-1}\mathcal{F}(x_i, \mathcal{W_i})$$

During back-propagation, we then have:

$$\begin{split}
\frac{\partial{\mathcal{E}}}{\partial{x_l}} & = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\frac{\partial{x_L}}{\partial{x_l}}\\
& = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\Big(1+\frac{\partial{}}{\partial{x_l}}\sum_{i=l}^{L-1}\mathcal{F}(x_i,\mathcal{W}_i)\Big)
\end{split}$$

The reality, however, is that several downsampling steps change the feature-map size along the way. At those points the identity mapping \(x\) has to be replaced by some \(\mathcal{M}(x)\) (usually a \((1\times 1)\) convolution, which the authors call an I-conv, i.e. Isolated convolution) to adapt the size and channel count. So not every layer satisfies the simple \(x_{l+1}=x_l+\mathcal{F}(x_l, \mathcal{W_l})\), and the gradient formula above is only an idealization.

Fig. 3 In ResNet, the ideal Bottleneck module vs. some real Bottleneck modules

Within ResNet itself this is tolerable. But in FPN, let alone Stacked Hourglass, such an I-conv is used at every feature-map fusion, which rather defeats ResNet's original goal of keeping gradients flowing directly. In these situations, FishNet instead adopts a "smoother" scheme that minimizes the disturbance to gradient back-propagation.

Fig. 4 FishNet's Bottleneck modules involving resampling (except in the tail part)

Brilliant (👏)! Now let's look at FishNet as a whole:

Fig. 5 FishNet

~~The whole fish~~ The whole FishNet consists of three parts: the tail, the body and the head. Before the tail, the image first passes through three convolutional layers, extracting a \((56\times 56 \times 64)\) feature map from the \((224\times 224 \times 3)\) input image. The authors group feature maps of the same resolution across the different parts into the same stage: \((56\times 56)\) is stage 1, \((28\times 28)\) is stage 2, \((14\times 14)\) is stage 3, and \((7\times 7)\) is stage 4. Since the resolutions match, feature maps from the three parts can be concatenated directly along the channel dimension without any up/downsampling.

The tail is a funnel-shaped network involving three max-pooling steps; before each pooling, the output feature map of the last convolutional layer is kept for later use by the body. The result of this part is a classic funnel-shaped network; the authors use a three-stage ResNet. At the end of the tail, a Squeeze-and-Excitation block [2] is applied: the \((7\times 7 \times 512)\) feature map is mapped to a \((1\times 1\times 512)\) vector via global average pooling plus a few convolutional layers (essentially no different from fully connected layers), and each value of this vector is then used as a weight, multiplied onto the corresponding channel of the original \((7\times 7\times 512)\) feature map.
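The squeeze-and-excitation step can be sketched in a few lines of NumPy. This is my own simplified illustration, not the authors' code: the "few convolutional layers" are collapsed into two plain weight matrices `w1`/`w2` (hypothetical names) with a ReLU and a sigmoid in between:

```python
import numpy as np

def se_block(x, w1, w2):
    """Simplified Squeeze-and-Excitation on a (C, H, W) feature map.
    w1: (C//r, C) reduction weights, w2: (C, C//r) expansion weights."""
    squeeze = x.mean(axis=(1, 2))                 # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0)          # reduction + ReLU
    weights = 1 / (1 + np.exp(-(w2 @ hidden)))    # expansion + sigmoid -> (C,)
    return x * weights[:, None, None]             # reweight each channel

y = se_block(np.ones((4, 7, 7)), np.eye(2, 4), np.eye(4, 2))
assert y.shape == (4, 7, 7)   # same shape, channels rescaled
```

With toy partial-identity weights, channels whose excitation logit is 0 are simply halved (sigmoid(0) = 0.5), which shows the per-channel reweighting at work.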

Like FPN, the body repeatedly upsamples to enlarge the feature maps, while fusing in the same-resolution features kept from the tail.

The head is FishNet's original contribution and works like the reverse of the body. Previous hourglass-shaped networks used high-level semantic features to refine low-level detail features; the head goes the other way, using the refined low-level detail features to refine the high-level features in turn. The quality of the re-downsampled high-level features is thus effectively improved.

The parameters of each part of FishNet-99 are listed in the table below.

| Part-Stage | Input shape | Output shape | Bottlenecks | I-convs | Convs in total |
|---|---|---|---|---|---|
| Input | \(3\times 224 \times 224\) | \(64\times 56 \times 56\) | \(0\) | \(0\) | \(3\) |
| Tail-1 | \(64\times 56 \times 56\) | \(128\times 28 \times 28\) | \(2\) | \(1\) | \(7\) |
| Tail-2 | \(128\times 28 \times 28\) | \(256\times 14 \times 14\) | \(2\) | \(1\) | \(7\) |
| Tail-3 | \(256\times 14 \times 14\) | \(512\times 7 \times 7\) | \(6\) | \(1\) | \(19\) |
| SE-block | \(512\times 7 \times 7\) | \(512\times 7 \times 7\) | \(2\) | \(1\) | \(11\) |
| Body-3 | \(512\times 7 \times 7\) | \(256\times 14 \times 14\) | \(1 + 1\) | \(0\) | \(6\) |
| Body-2 | \((512+256)\times 14 \times 14\) | \(384\times 28 \times 28\) | \(1 + 1\) | \(0\) | \(6\) |
| Body-1 | \((384+128)\times 28 \times 28\) | \(256\times 56 \times 56\) | \(1 + 1\) | \(0\) | \(6\) |
| Head-1 | \((256+64)\times 56 \times 56\) | \(320\times 28 \times 28\) | \(1 + 1\) | \(0\) | \(6\) |
| Head-2 | \((320+512)\times 28 \times 28\) | \(832\times 14 \times 14\) | \(2 + 1\) | \(0\) | \(9\) |
| Head-3 | \((832+768)\times 14 \times 14\) | \(1600\times 7 \times 7\) | \(2 + 4\) | \(0\) | \(18\) |
| Score-Conv | \((1600+512)\times 7 \times 7\) | \(1056\times 7 \times 7\) | \(0\) | \(0\) | \(1\) |
| Score-FC | \(1056\times 7 \times 7\) | \(1000\times 1 \times 1\) | \(0\) | \(0\) | \(1\) |

Notes:

- Tail-1 in the first column denotes stage \(1\) of the tail part.
- For Body-3 through Head-3, the Bottleneck count includes two kinds of modules: those on the network trunk and those in the feature transfer blocks. A transfer block transforms the same-stage feature maps coming from the previous part.

The parameters of FishNet-150 are listed below; compared with FishNet-99, only the number of Bottleneck blocks in each part differs.

| Part-Stage | Input shape | Output shape | Bottlenecks | I-convs | Convs in total |
|---|---|---|---|---|---|
| Input | \(3\times 224 \times 224\) | \(64\times 56 \times 56\) | \(0\) | \(0\) | \(3\) |
| Tail-1 | \(64\times 56 \times 56\) | \(128\times 28 \times 28\) | \(2\) | \(1\) | \(7\) |
| Tail-2 | \(128\times 28 \times 28\) | \(256\times 14 \times 14\) | \(4\) | \(1\) | \(13\) |
| Tail-3 | \(256\times 14 \times 14\) | \(512\times 7 \times 7\) | \(8\) | \(1\) | \(25\) |
| SE-block | \(512\times 7 \times 7\) | \(512\times 7 \times 7\) | \(4\) | \(1\) | \(17\) |
| Body-3 | \(512\times 7 \times 7\) | \(256\times 14 \times 14\) | \(2 + 2\) | \(0\) | \(12\) |
| Body-2 | \((512+256)\times 14 \times 14\) | \(384\times 28 \times 28\) | \(2 + 2\) | \(0\) | \(12\) |
| Body-1 | \((384+128)\times 28 \times 28\) | \(256\times 56 \times 56\) | \(2 + 2\) | \(0\) | \(12\) |
| Head-1 | \((256+64)\times 56 \times 56\) | \(320\times 28 \times 28\) | \(2 + 2\) | \(0\) | \(12\) |
| Head-2 | \((320+512)\times 28 \times 28\) | \(832\times 14 \times 14\) | \(2 + 2\) | \(0\) | \(12\) |
| Head-3 | \((832+768)\times 14 \times 14\) | \(1600\times 7 \times 7\) | \(4 + 4\) | \(0\) | \(24\) |
| Score-Conv | \((1600+512)\times 7 \times 7\) | \(1056\times 7 \times 7\) | \(0\) | \(0\) | \(1\) |
| Score-FC | \(1056\times 7 \times 7\) | \(1000\times 1 \times 1\) | \(0\) | \(0\) | \(1\) |

The main building block of the tail, body and head is the Bottleneck module, i.e. the structure shown below:

| Layer | Type | Output channels | Kernel Size |
|---|---|---|---|
| (shortcut) | (take shortcut) | - | - |
| relu | ReLU | \(C\) | - |
| bn1 | Batch Normalization | \(C\) | - |
| conv1 | Convolution | \(C / 4\) | \(1\times 1\) |
| bn2 | Batch Normalization | \(C / 4\) | - |
| conv2 | Convolution | \(C / 4\) | \(3\times 3\) |
| bn3 | Batch Normalization | \(C / 4\) | - |
| conv3 | Convolution | \(C'\) | \(1\times 1\) |
| (addition) | (add shortcut) | \(C'\) | - |

In each stage of the tail, the first Bottleneck module changes the channel count (i.e. \(C'\neq C\)); its shortcut then needs a convolutional layer to transform the channel count of the identity mapping. Isolated convolutions therefore still cannot be avoided on these three shortcuts, and a similar situation exists in the SE-block. In the head, however, although the feature maps keep being downsampled, their channel counts are unchanged, so no Isolated convolution is needed to disturb direct back-propagation of the gradients.

(PS: But when I counted, FishNet-99 has 100 convolutions and FishNet-150 has 151 😂. My guess is that the Score-FC layer is not counted as part of the FishNet trunk. By the way, although it is called an FC layer, the authors' code still defines it as a convolutional layer; since the \(7\times 7\) feature map has been reduced to \(1\times 1\) by global average pooling, it is essentially a vector whose length equals the channel count.)

From stage 3 of the body through stage 3 of the head, each stage's feature maps are fused with those of the previous part (the red dashed lines and red boxes in the figure). To preserve direct gradient back-propagation, the authors design the UR-block (Upsampling & Refinement) and the DR-block (Downsampling & Refinement) to "preserve and refine" the features of each part.

As mentioned above, stage numbers in FishNet do not increase monotonically with depth; they correspond to feature-map scales. Let \(x^t_s\) and \(x^b_s\) be the **first-layer** output features of stage \(s\) in the tail and the body, respectively; then \(x^t_s\) and \(x^b_s\) have the same width and height (though possibly different channel counts). \(x^t_s\) passes through a transferring block \(\mathcal{T}(x)\) (again a Bottleneck module with a shortcut) and is concatenated with \(x^b_s\) to form the fused feature map \(\widetilde{x}^b_s\):

$$\widetilde{x}^b_s = concat(x^b_s, \mathcal{T}(x^t_s))$$

\(\widetilde{x}^b_s\) then serves as the input to the subsequent convolutional layers \(\mathcal{M}(x)\) of stage \(s\) in the body. Meanwhile, for direct gradient back-propagation, an identity-style mapping is added to \(\mathcal{M}(\widetilde{x}^b_s)\), following the same idea as \(\mathcal{H}(x)=x+\mathcal{F}(x)\) in ResNet:

$$\widetilde{x}'^b_s = r(\widetilde{x}^b_s) + \mathcal{M}(\widetilde{x}^b_s)$$

In stage 1 of the body, the output of \(\mathcal{M}(x)\) has the same channel count as \(x\), and \(r(x)\) is simply \(x\). In stages 3 and 2, however, \(\mathcal{M}(x)\) changes the channel count (halving it in the authors' code, i.e. \(k=2\)), so \(r(x)\) must perform a channel-wise reduction. **Again for the sake of direct gradient back-propagation**, not even a \((1\times 1)\) convolution is used to change the channel count here; instead, every \(k\) channels are summed element-wise into one channel. \(\widetilde{x}'^b_s\) is then upsampled to become the input of the next body stage (stage \(s-1\)):

$$x^b_{s-1}=up(\widetilde{x}'^b_s)$$

Fig. 6 The Upsampling & Refinement block

(PS: Why is a \((1\times 1)\) convolution not used here, while the tail does use one? My guess is that the tail has to expand the channel count and has no other option. Perhaps the reverse of \(r(x)\) could work in the tail too: duplicating each channel to double the channel count. Feel free to try it.)
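The UR-block above can be sketched in NumPy. This is a toy illustration assuming channel-first \((C, H, W)\) arrays; `refine` is a hypothetical stand-in for the real Bottleneck stack \(\mathcal{M}(\cdot)\), and the weight-free nearest-neighbour upsampling matches the design choice discussed later:

```python
import numpy as np

def channel_reduction(x, k=2):
    """r(x): sum every k consecutive channels element-wise -- no 1x1 conv,
    so the gradient passes straight through with weight 1."""
    c, h, w = x.shape
    return x.reshape(c // k, k, h, w).sum(axis=1)

def upsample_nn(x, factor=2):
    """Weight-free nearest-neighbour upsampling."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def ur_block(x_body, x_tail_transferred, refine, k=2):
    """Sketch of the UR-block: concat -> r(x) + M(x) -> upsample.
    `refine` stands in for M(.) and is assumed to reduce channels by k."""
    fused = np.concatenate([x_body, x_tail_transferred], axis=0)  # channel concat
    out = channel_reduction(fused, k) + refine(fused)
    return upsample_nn(out)
```

For example, `channel_reduction` applied with \(k=2\) to a 4-channel map returns a 2-channel map whose first channel is the element-wise sum of input channels 0 and 1.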

The Downsampling & Refinement block in the head is even simpler than the Upsampling & Refinement block: none of the \(\mathcal{M}(x)\) here changes the channel count, so the \(r(x)\) used in the UR block is no longer needed. The remaining formulas are essentially the same as for the UR block:

$$\widetilde{x}^b_s = concat(x^b_s, \mathcal{T}(x^t_s)) \\
\widetilde{x}'^b_s = \widetilde{x}^b_s + \mathcal{M}(\widetilde{x}^b_s) \\
x^b_{s+1}=down(\widetilde{x}'^b_s)$$

Fig. 7 The Downsampling & Refinement block

In a funnel-shaped CNN, features in shallower layers tend to be simple, pixel-level features, while deeper layers, having larger receptive fields, hold more abstract, generalized features. Because of the up- and downsampling in FishNet, distinguishing features of different resolutions simply as "shallow" vs. "deep" seems inappropriate. So here I use "low-level features" for the higher-resolution, more concrete ones, and "high-level features" for the lower-resolution, more abstract, or more "condensed", ones.

For classification, an image passed through a funnel-shaped CNN suffices to regress its class; for detection, strengthening low-level features with high-level ones effectively improves results; and if low-level features are in turn used to strengthen high-level ones, the network can serve image-level, region-level and pixel-level tasks at the same time.

Avoid I-convs on shortcuts whenever possible. Apart from the tail's residual modules that change channel counts, FishNet avoids I-convs in all the fusions of the body and head, preserving direct gradient back-propagation to the greatest possible extent.

For upsampling, **avoid weighted deconvolution** and prefer methods such as nearest-neighbour interpolation, again to preserve direct gradient back-propagation.

**Downsampling with max pooling of kernel size \((2\times 2)\) and stride \(2\)** works better than several other typical downsampling methods. The alternatives compared include:

- stride \(2\) in the last convolutional layer (disturbs direct gradient back-propagation)
- max pooling with kernel size \((3\times 3)\) and stride \(2\) (overlapping windows disturb the structural information)
- average pooling with kernel size \((3\times 3)\) and stride \(2\) (not discussed in the paper; I suspect it behaves similarly to a stride-\(2\) convolution in the last layer)
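The preferred downsampling choice is trivial to express. A NumPy sketch of non-overlapping \((2\times 2)\), stride-\(2\) max pooling on a channel-first map (assuming even height and width):

```python
import numpy as np

def maxpool_2x2(x):
    """2x2, stride-2 max pooling on a (C, H, W) map with even H and W.
    Windows do not overlap, and the gradient flows only through the max
    element of each window, untouched by any learned weights."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

p = maxpool_2x2(np.arange(16, dtype=float).reshape(1, 4, 4))
assert p.shape == (1, 2, 2)
```

Since the \(2\times 2\) windows tile the map exactly, every input pixel belongs to exactly one window, which is what keeps the structural information intact compared with the overlapping \(3\times 3\) variant.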

"Thirty years ago, before this old monk began practicing Chan, I saw mountains as mountains and waters as waters.

Later, when I met good teachers and gained some entry, I saw that mountains were not mountains and waters were not waters.

Now that I have found a place of rest, I once again see mountains simply as mountains and waters simply as waters.

Everyone: are these three understandings the same or different? Is there anyone who can tell them apart?"

— Chan Master Weixin of Qingyuan, Jizhou [3]

The idea of FishNet seems somehow related to these three stages of insight. Pooling, interpolation, fusion, pooling again, fusing again: the process resembles how a mind constructs, deconstructs and reconstructs knowledge.

When I first encounter something new, I only form a rough impression of it, without any real understanding. More than a decade ago, "a screen, a tower, a mouse and a keyboard" was what a computer looked like in my mind, and "computer science", as I saw it then, was no more than writing documents and drawing pictures with some software.

As my studies deepened and I turned from a user into a developer, my focus kept going deeper and finer: seeing an animation on a web page, I would press F12 to see how it was implemented in js, think about how the asynchronous request was served, what the TCP segments of the network request looked like, and how those segments traveled through a series of routers to reach the server. Yet as my understanding of computers deepened, they started to feel unfamiliar again: how many more secrets does this science hold, including some I cannot even imagine?

As for what understanding of computers further study will bring me, I am too inexperienced to know. Maybe one day it will suddenly dawn on me: oh, so that is what computer science is.

The R-CNNs are awesome works in object detection: they demonstrated the effectiveness of using region proposals with deep neural networks, and have become state-of-the-art baselines for the object detection task. In this blog post I'll briefly review the R-CNN family, from R-CNN to Mask R-CNN, along with several related works based on the idea of R-CNNs. Implementation and evaluation details are not covered here; for those, please refer to the original papers provided in the References section.

Before CNNs were widely adopted in object detection, SIFT or HOG features were commonly used for the detection task.

Unlike image classification, detection requires localizing objects within an image. Common approaches to localization are 1) bounding-box regression and 2) sliding-window detectors. The first approach, used in [1], proved not to work very well, while the second, used in [2], needs high spatial resolution, so deeper networks make precise localization a challenge.

R-CNN solves the CNN localization problem by adopting the "recognition using regions" paradigm.

From the input image, the method first generates around 2000 category-independent region proposals with the Selective Search algorithm, then extracts a fixed-length feature vector from each proposal using the same CNN (AlexNet). Finally, it classifies each region with category-specific linear SVMs.

Fig. 1 Overview of R-CNN.

However, the region proposal may not be that satisfactory as a final detection window. Therefore, a bounding-box regression stage is introduced to predict a new detection window given the feature map of a region proposal. As reported in [3], this simple approach fixes a large number of mislocalized detections. More details are available in the supplementary material[12] of the R-CNN paper.

Since AlexNet only takes images of size 227 × 227, the image clip in the bounding box should be resized.

In R-CNN, the image clip is directly warped to the required size.

Fig. 2 Cropping from the bounding box and warping.

- The *region proposal (RoI) - feature extraction - classification* approach
- Using *Selective Search* to generate region proposals
- Using *bounding-box regression* to refine region proposals
- Using *CNN features* for classification

- Running CNN feature extraction on each of the ~2000 regions consumes too much computation
- The warped content may suffer unwanted geometric distortion

SPP-Net introduces the spatial pyramid pooling layer that takes in feature maps of arbitrary size, while also considering multi-scale features in the input image. It also solved the way-too-slow issue of R-CNN.

While R-CNN extracts features from warped image clips in each proposed region, SPP-Net first extracts features of the whole image, producing one shared feature map. This feature map is then cropped according to the bounding boxes (boxes fixed by the regressor, same as in R-CNN). Each feature map clip is fed into the spatial pyramid pooling layer to get a feature vector of the same length. These feature vectors are then the inputs of the following fully connected layers, which are the same as in R-CNN.

Fig. 3 Overview of SPP-Net.

Fig. 4 Spatial pyramid pooling layer.

The spatial pyramid pooling layers consider the feature map clip in different scales - it divides the feature map clip into 4 × 4, 2 × 2 and 1 × 1 grids and computes 4 × 4, 2 × 2 and 1 × 1 feature maps (channel number doesn't change). The computed feature maps are flattened and concatenated into one vector, which is the input of the following fully connected layers.
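The fixed-length property is easy to verify in code. A minimal NumPy sketch of the pyramid, using max pooling per grid cell (my own simplification; the exact window/stride rounding in the paper differs slightly):

```python
import numpy as np

def spp(feature_clip, levels=(4, 2, 1)):
    """Spatial pyramid pooling on one RoI's feature clip of shape (C, H, W).
    For each pyramid level n, max-pool over an n x n grid, then flatten and
    concatenate, giving a fixed-length vector regardless of H and W."""
    c, h, w = feature_clip.shape
    parts = []
    for n in levels:
        hs = [int(round(i * h / n)) for i in range(n + 1)]
        ws = [int(round(j * w / n)) for j in range(n + 1)]
        for i in range(n):
            for j in range(n):
                cell = feature_clip[:, hs[i]:max(hs[i + 1], hs[i] + 1),
                                       ws[j]:max(ws[j + 1], ws[j] + 1)]
                parts.append(cell.max(axis=(1, 2)))   # one value per channel
    return np.concatenate(parts)                      # length C * (16 + 4 + 1)

# Two clips of different sizes yield vectors of the same length:
assert spp(np.random.rand(8, 13, 9)).shape == (8 * 21,)
assert spp(np.random.rand(8, 20, 17)).shape == (8 * 21,)
```

With the (4, 2, 1) pyramid, every clip maps to a vector of length 21 times the channel count, which is exactly what lets the fully connected layers accept RoIs of arbitrary size.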

- Extracting *feature maps first and only once* greatly improves the speed of R-CNN
- Using *spatial pyramid pooling layers* avoids geometric distortion

- Training the classifier and box regressor separately requires much work

As mentioned in the paper, R-CNN is slow because it performs a ConvNet forward pass for each object proposal without sharing computation. Fast R-CNN improves detection efficiency while using the deeper VGG16 network, testing 213 times (nice number :D) faster than R-CNN. It also introduces the RoI pooling layer, which is simply a special case of SPP-Net where only one scale (a single pyramid level) is considered. Fast R-CNN uses a multi-task loss and is trained in a single stage, updating all network layers. It yields higher detection quality (mAP) than R-CNN and SPP-Net while being comparatively fast to train and test.

Similar to SPP-Net, Fast R-CNN extracts image features before the RoI-based projection to share computation and speed up detection. Differently, though, Fast R-CNN uses a deeper network, VGG16, for more effective feature extraction. Rather than training the bounding-box regressor and classifier separately, Fast R-CNN uses a streamlined training process that jointly optimizes a softmax classifier and a bounding-box regressor; the RoI-fixing regressor is moved after the fully connected layers. The multi-task loss **for each RoI** is defined as:

$$L(p,u,t^u,v) = L_{cls}(p,u)+\lambda[u\geq 1]L_{loc}(t^u,v)$$

in which the definition of classification loss and localization loss are:

$$L_{cls}(p,u)=-\log(p_u)$$

$$L_{loc}(t^u,v)=\sum_{i\in \{x,y,w,h\}}{smooth_{L_1}(t_i^u-v_i)}$$

in which \(smooth_{L_1}\) loss is defined as:

$$smooth_{L_1}(x)=\begin{cases}
0.5x^2& \text{if } |x|<1\\
|x|-0.5& \text{otherwise}
\end{cases}$$
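These definitions translate directly into a few lines of NumPy (a sketch of the loss itself, not of the training pipeline):

```python
import numpy as np

def smooth_l1(x):
    """Piecewise loss: quadratic near zero, linear elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """Fast R-CNN loss for one RoI: log loss on the true class u plus,
    for non-background RoIs (u >= 1), smooth-L1 on the 4 box deltas."""
    l_cls = -np.log(p[u])
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum() if u >= 1 else 0.0
    return l_cls + lam * l_loc
```

For example, a perfect box prediction for a foreground RoI contributes only the classification term, while a background RoI (\(u=0\)) contributes no localization loss at all, matching the Iverson bracket \([u\geq 1]\).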

Symbol definitions:

| Symbol | Meaning |
|---|---|
| \(p=(p_0,\cdots,p_K)\) | Output of the classification layer, a vector of length \(K+1\) (\(K\) object classes plus background) |
| \(t^k=(t^k_x,t^k_y,t^k_w,t^k_h)\) | Output of the regression layer, a matrix of size \(K\times 4\) |
| \(u\in N, 1\le u \le K\) | True class |
| \(v=(v_x,v_y,v_w,v_h)\) | True bounding-box regression target |

Fig. 5 Overview of Fast R-CNN.

In this architecture, two of the three main procedures (all except region proposal) are trained in a single stage with the multi-task loss.

Here are two diagrams demonstrating common pooling layers (max or average) and the RoI pooling layer. On the left is the original 5x5 feature map, where each cell in the grid is a pixel value. During computation, the common pooling kernel covers one area per step and takes the maximum or average value within it. With a kernel size of 3x3 and a stride of 2, a 2x2 feature map is generated from the 5x5 feature map.

Fig. 6 Common pooling with kernel_size=3 and stride=2.

In RoI pooling, the RoI is cropped from the whole feature map and divided into pieces of roughly equal area according to the output feature map size. Grid cells on the borders between pieces, however, have to be assigned to one piece only, so there may be a little bit of "injustice" among the pieces. Within each piece, a global max/average pooling produces a single number per channel.

Fig. 7 RoI pooling with output size=(2, 2). The black dashed line denotes the original RoI, and the colored area is the actual cropped RoI.
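A minimal NumPy sketch of RoI max pooling as described (my own rounding for the piece boundaries; real implementations snap coordinates differently, which is exactly where the "injustice" comes from):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=(2, 2)):
    """RoI max pooling on a (C, H, W) feature map.
    roi = (x0, y0, x1, y1) in feature-map coordinates; the box is snapped
    to integer cells, split into out_size roughly-equal pieces, and each
    piece is reduced to one value per channel by max pooling."""
    x0, y0, x1, y1 = [int(round(c)) for c in roi]
    clip = feature_map[:, y0:y1, x0:x1]
    c, h, w = clip.shape
    oh, ow = out_size
    ys = [int(round(i * h / oh)) for i in range(oh + 1)]
    xs = [int(round(j * w / ow)) for j in range(ow + 1)]
    out = np.empty((c, oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[:, i, j] = clip[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                   xs[j]:max(xs[j + 1], xs[j] + 1)].max(axis=(1, 2))
    return out
```

On a 5x5 map with a 2x2 output, the odd dimension forces uneven 2-vs-3 pieces, which is the border "injustice" mentioned above.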

- Deeper CNN, *VGG16*, for feature extraction
- *Multi-task loss* & *single-stage training*

- For region proposal, conventional Selective Search algorithm doesn't make use of GPU computation power, thus consuming more time

In Fast R-CNN, two of the three main procedures are trained in a single stage; the exception is region proposal. Region proposal is the bottleneck of the total detection speed, since the GPU's high computation power isn't utilized there yet. Why not try training a CNN that generates region proposals?

Simply remove Selective Search from Fast R-CNN. In place of the SS algorithm, an RPN (Region Proposal Network) is introduced. Given the DCNN features, the RPN generates RoIs with much improved speed.

This is a question that had been confusing me for so long.

In a word, it's a simple CNN that takes an image of any size as input, slides a window, and outputs \(6k\) numbers each time the window moves. \(k\) is the number of pre-defined anchors - IT DOES NOT MEAN "THOUSAND". Wait, what is an anchor?

An anchor is a box size we define before generating data (for example, \((width=36, height=78)\) for pedestrians, and \((width=50, height=34)\) for dogs). Though the input image is of size \(n * n\), an anchor can be of any size and any width-height ratio. The prediction of the 6 numbers per anchor is based on the anchors we define. When the RPN works, it does NOT predict the probability that there is an object - BUT the probability that there is an object fitting the anchor.

Besides a classification layer predicting the probability of there being an object versus nothing but background, a regression layer predicts the relative box coordinates \((t_x, t_y, t_w, t_h)\). For each anchor, its size \((w_a, h_a)\) is given and its position \((x_a, y_a)\) is decided by the center position of the sliding window. The relation between relative coordinates \((t_x, t_y, t_w, t_h)\) and absolute coordinates \((x, y, w, h)\) is:

$$t_x=(x-x_a)/w_a\\
t_y=(y-y_a)/h_a\\
t_w=\log(w/w_a)\\
t_h=\log(h/h_a)$$

for both prediction and ground truth.

Fig. 8 The original graph demonstration of RPN. Keep in mind that "k" does not mean "thousand".
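These transforms and their inverses can be written down directly (a minimal sketch; the tuple layout is my choice for illustration):

```python
import math

def encode(box, anchor):
    # Absolute (x, y, w, h) -> relative (tx, ty, tw, th) w.r.t. an anchor.
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    # Inverse transform: recover the absolute box from the predicted offsets.
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya, wa * math.exp(tw), ha * math.exp(th))
```

A box identical to its anchor encodes to all zeros, and `decode(encode(box, anchor), anchor)` recovers the original box.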

But the RPN generates a great pile of boxes, so some basic methods have to be applied to select the "good" ones. Firstly, boxes with low object scores and high background scores are abandoned (the thresholds are usually set manually). Secondly, using non-maximum suppression, one box per object target is elected from all boxes that mark the same object.
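The second step can be sketched as a greedy NMS over IoU (a plain-Python illustration, not the original implementation):

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    # Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```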

*Region Proposal Network*- high-speed high-quality region proposals

Though Mask R-CNN is a great work, its idea is rather intuitive: since detection and classification are done, why not add a segmentation head? That would give us instance-first instance segmentation!

Add a small fully-convolutional mask head to Faster R-CNN, replace the VGG net with the more efficient ResNet/FPN (Residual Network / Feature Pyramid Network), and replace RoI pooling with RoI alignment.

Fig. 9 The Mask R-CNN framework for instance segmentation. The last convolutional layer is the newly added segmentation layer for each RoI.

In RoI pooling, quantization will be performed when the RoI coordinates are not integers. For example, when cutting the area \((x_1=11.02, y_1=53.9, x_2=16.2, y_2=58.74)\), actually the area \((x_1=11, y_1=54, x_2=16, y_2=59)\) is what we get (nearest-neighbor).

But in RoI alignment, the area is exactly \((x_1=11.02, y_1=53.9, x_2=16.2, y_2=58.74)\). Instead of cropping it out, the feature map area is sampled at some sample points. Divide the RoI into \(n*n\) (the output size) bins; using bi-linear interpolation, one value is calculated at each sample point. The image below is a simple example, in which there is only one sample point per pixel of the pooled RoI. The coordinate of the only sample point in the first area is \((12.315, 55.11)\). Calculate the weighted average of the 4 grid points near this sample point and we'll have the value for this pixel in the pooled feature map.

Fig. 10 RoI alignment with output size=(2, 2) and 1 sample point each bin.

It's obvious that one sample point per bin is far from enough in our example. So using more sample points is wiser.

Fig. 11 RoI alignment with output size=(2, 2) and 2×2 sample points per bin.
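The per-sample-point computation boils down to bilinear interpolation over the four surrounding grid points (a minimal sketch assuming in-bounds coordinates):

```python
def bilinear(fmap, y, x):
    # Value at a real-valued (y, x): weighted average of the 4 neighbouring cells.
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(fmap) - 1)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0][x0] * (1 - dy) * (1 - dx) + fmap[y0][x1] * (1 - dy) * dx +
            fmap[y1][x0] * dy * (1 - dx) + fmap[y1][x1] * dy * dx)
```

Averaging the interpolated values of all sample points in a bin then gives that bin's output.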

- *RoI Align*: improving mask accuracy greatly
- Add a segmentation head on Faster R-CNN and achieve accurate instance segmentation

There are several other R-CNNs by other researchers, which are basically variants of the R-CNN architecture.

arXiv: https://arxiv.org/abs/1711.07264

Code(Official, TensorFlow): https://github.com/zengarden/light_head_rcnn

arXiv: https://arxiv.org/abs/1712.00726

Code(Official, Caffe): https://github.com/zhaoweicai/cascade-rcnn

Code(PyTorch): https://github.com/guoruoqian/cascade-rcnn_Pytorch

arXiv: https://arxiv.org/abs/1811.12030

Code: Not yet

[10] Xin Lu, et al. "Grid R-CNN." arXiv preprint arXiv:1811.12030 (2018).

DensePose has been re-implemented with the brand-new object detection framework Detectron2, which is based on PyTorch and is much easier to install and use (you don't have to manually compile Caffe2).

I strongly recommend that you check out the new official DensePose code at https://github.com/facebookresearch/detectron2/tree/master/projects/DensePose.

DensePose is a great work in real-time human pose estimation, based on the Caffe2 and Detectron frameworks. It extracts a dense 3D surface of the human body from RGB images. The installation instructions are provided here.

During my installation, these are the problems that took me some time to tackle. I spent one week finally figuring out solutions to all the issues. So lucky of me not to give up too early...

By the way, **before you suffer too much**, I strongly recommend following the step-by-step Caffe2+DensePose installation guide by @Johnqczhang. If you think you're almost there, help yourself with the solutions below~

- System: Ubuntu 18.04
- Linux kernel: 4.15.0-29-generic
- Graphics card: NVIDIA GeForce 1080Ti
- Graphics driver: 410.48
- CUDA: 10.0.130
- cuDNN: 7.3.1
- Caffe2: Built from source
- Python: 2.7.15, based on Anaconda 4.5.11

Occurred when running `make`.

Main error message:

```
Could not find a package configuration file provided by "Caffe2" with any
```

Caffe2 build path isn't known by CMake.

Added one line in the beginning of CMakeLists.txt:

```cmake
set(Caffe2_DIR "/path/to/pytorch/torch/share/cmake/Caffe2/")
```

(Note: `set(Caffe2_DIR "/path/to/pytorch/build/")` can also fix this issue but may cause other issues.)

Occurred when running `python2 $DENSEPOSE/detectron/tests/test_spatial_narrow_as_op.py` after `make`.

Main error message:

```
Detectron ops lib not found; make sure that your Caffe2 version includes Detectron module.
```

Seems that the Python part of DensePose couldn't recognize Caffe2.

Add `/path/to/pytorch/build` to the `PYTHONPATH` environment variable. It can be added either by running `export PYTHONPATH=$PYTHONPATH:/path/to/pytorch/build` directly or by adding this line to `~/.bashrc`. Remember to run `source ~/.bashrc` after the modification.

Occurred when running `make ops`.

Main error message:

```
CMake Error at /path/to/pytorch/build/Caffe2Config.cmake:14 (include):
```

(Several `*.cmake` files; I only showed a few.)

These files are not in the `pytorch/build` directory. By searching, I found that they are in the `pytorch/torch/share/cmake/Caffe2` directory.

Added one line in the beginning of CMakeLists.txt:

```cmake
set(Caffe2_DIR "/path/to/pytorch/torch/share/cmake/Caffe2/")
```

Occurred when running `make ops`.

I forgot to record the error messages, but it should be obvious that some header files (not just `context_gpu.h`) are missing.

This time it's the include path that isn't recognized...

Added one line in the beginning of CMakeLists.txt:

```cmake
include_directories("/path/to/pytorch/torch/lib/include")
```

Occurred when running `make ops`.

Main error message:

```
/path/to/pytorch/torch/lib/include/caffe2/proto/caffe2.pb.h:12:2: error: #error This file was generated by a newer version of protoc which is
```

If you only have a protobuf newer than v3.6.1, this should not happen. Check whether you have multiple protobufs installed from different sources. (In my case, a protobuf v3.2.0 had been installed with `apt-get` earlier.)

I can't provide an exact solution. Please try `which protoc` to see where protobuf is installed. If it shows the protobuf you installed with Anaconda, remove it completely and try again. Since DensePose tells you that you have an older version of protobuf, you should be able to locate one. After finding it, remove it or upgrade it to v3.6.1 or higher. I would prefer installing protobuf from source here; it's not as painful as installing DensePose.

Occurred when running `make ops`.

I forgot to record the error messages, but it should be obvious too.

Intel Math Kernel Library was turned on but not found. (Why is it enabled when I didn't even install it???)

Install the Intel Math Kernel Library and add `/opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/include` to the `CPATH` environment variable:

```shell
export CPATH=$CPATH:/opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/include
```

The exact path may vary according to the MKL version and your configuration.

Maybe try `find / -name mkl_cblas.h` to make sure of its location after the installation.

Adding the path to CMakeLists.txt should also be helpful, but I didn't test it:

```cmake
include_directories("/opt/intel/compilers_and_libraries_2019.1.144/Linux/mkl/include")
```

Occurred when running `make ops`.

Main error message:

```
/path/to/pytorch/caffe2/operators/accumulate_op.h: In constructor ‘caffe2::AccumulateOp<T, Context>::AccumulateOp(const caffe2::OperatorDef&, caffe2::Workspace*)’:
```

I'm not sure. Could it be that `GetSingleArgument()` is defined elsewhere?

Modify `/path/to/densepose/detectron/ops/pool_points_interp.h`: change `OperatorBase::GetSingleArgument<float>` to `this->template GetSingleArgument<float>`.

(Thanks to badpx@Github: https://github.com/facebookresearch/DensePose/pull/137/commits/51389c6a02173a25e9429825db452beb5e1cf3be)

Occurred when running `make ops`.

Main error message:

```
/path/to/pytorch/torch/lib/include/caffe2/core/workspace.h:19:48: fatal error: caffe2/utils/threadpool/ThreadPool.h: No such file or directory
```

This should only happen when your Caffe2 is installed with Anaconda.

If your Caffe2 is installed with Anaconda, these files may not be found anywhere in the Caffe2 directory, or in your hard disk at all.

In Anikily@Github's case, downloading the Caffe2 source code and adding its path to DensePose's include directories worked:

```shell
git clone git@github.com:pytorch/pytorch.git
```

and add one line in the beginning of DensePose/CMakeLists.txt:

```cmake
include_directories("/path/to/pytorch")
```

The directory you include here should contain `caffe2/utils/threadpool/ThreadPool.h` and all the others.

I don't think this issue should be solved this way, but I'm sure that these files couldn't be found anywhere else. If anyone finds a better solution, please comment here to help the others.

Occurred when running `python detectron/tests/test_zero_even_op.py`.

Main error message:

```
OSError: /path/to/densepose/build/libcaffe2_detectron_custom_ops_gpu.so: undefined symbol: _ZN6google8protobuf8internal9ArenaImpl28AllocateAlignedAndAddCleanupEmPFvPvE
```

WTF is this!???

As can be seen, this symbol has something to do with Google, and protobuf.

I guess this is caused by a different protobuf version. Good news is that a proper version of protobuf was also built with Caffe2, so why not tell this to DensePose?

In `/path/to/densepose/CMakeLists.txt`, add a few lines in the beginning:

```cmake
add_library(libprotobuf STATIC IMPORTED)
```

You can find two `target_link_libraries` lines in this file (they are not adjacent):

```cmake
target_link_libraries(caffe2_detectron_custom_ops caffe2_library)
```

Edit the two lines, adding a `libprotobuf` at the end of each:

```cmake
target_link_libraries(caffe2_detectron_custom_ops caffe2_library libprotobuf)
```

Then run `make ops` again, and `python detectron/tests/test_zero_even_op.py` again.

(Thanks to hyounsamk@Github: https://github.com/facebookresearch/DensePose/issues/119)

After fixing this issue, my DensePose passed tests and was running flawlessly. If any more issues remain, don't hesitate to comment here~

Occurred when running `python detectron/tests/test_zero_even_op.py`, with Caffe2 installed with Anaconda.

Main error message:

```
OSError: /path/to/densepose/build/libcaffe2_detectron_custom_ops_gpu.so: undefined symbol: _ZN6caffe219CPUOperatorRegistryB5cxx11Ev
```

As can be seen from the messy undefined symbol, this should have something to do with Caffe2 and probably CXX11(oh really???).

Run `ldd -r /path/to/densepose/build/libcaffe2_detectron_custom_ops.so` and the one or several undefined symbols with similar names will be shown; they should have been defined in `libcaffe2.so`. After running `strings -a /path/to/pytorch/torch/lib/libcaffe2.so | grep _ZN6caffe219CPUOperator`, a few similar symbols (two, in my case) come up, but they are different from the undefined one: the `"B5cxx11"` part is missing.

Why does DensePose want to find a symbol with `"B5cxx11"`? Who added this suffix?

It should be our GCC, which added it when compiling DensePose with the C++11 standard!

To find out which version of GCC built Caffe2, run `strings -a /path/to/pytorch/torch/lib/libcaffe2.so | grep GCC:`.

In my case, the output is:

```
GCC: (GNU) 4.9.2 20150212 (Red Hat 4.9.2-6)
```

Oh? It seems that Caffe2 developers are Red Hat lovers!

The Caffe2 installed with Anaconda was built by GCC 4.9.2, which had a slightly different standard on naming symbols.

The simplest way out is to turn to GCC 4.9.2 for building DensePose, too.

Otherwise, maybe also consider compiling Caffe2/PyTorch from source code?

(Many thanks to Johnqczhang@Github: https://github.com/linkinpark213/linkinpark213.github.io/issues/12)

Starting from this post, I've decided to keep a record (tag: MineSweeping) of the issues I meet while working with environments, together with their solutions.

Configuring an environment in order to run others' code can be a difficult and sometimes depressing task: various issues may arise, and it's impossible for the authors to keep providing solutions for every user in the community. What's worse, after fixing some problems with a lot of struggle, one may waste the same amount of time on the same issue the next time he/she runs the code. That's why I decided to keep this record: to avoid wasting time twice, while also helping others deal with problems if possible.

Here are some photos that I took during my trip to Higashi-Osaka, Nara and Kyoto.

Update in March 2019:

After the TensorFlow developers introduced the APIs of TensorFlow 2.0 at the TensorFlow Dev Summit 2019, I made my decision to turn to PyTorch.

TensorFlow is a powerful open-source deep learning framework, supporting various languages including Python. However, its APIs are far too complicated for a beginner in deep learning(especially those who are new to Python). In order to ease the pain of having to understand the mess of various elements in TensorFlow computation graphs, I made this tutorial to help beginners take the first bite of the cake.

ResNets are one of the greatest works in the deep learning field. Although they look scary with extreme depths, it's not a hard job to implement one. Now let's build one of the simplest ResNets - ResNet-56, and train it on the CIFAR-10 dataset.

2019.3 更新:

Tensorflow Dev Summit上开发者介绍TF 2.0 API后， 我彻底下定了换用PyTorch的决心。

TensorFlow是一个强大的开源深度学习软件库，它支持包括Python在内的多种语言。然而，由于API过于复杂（实际上还有点混乱），它往往使得一个深度学习的初学者（尤其是为此初学Python的那些）望而却步——老虎吃天，无从下口。为了减轻初学者不得不尝试理解TensorFlow中的大量概念的痛苦，我213今天带各位尝尝深度学习这片天的第一口。

ResNet是深度学习领域的一个重磅炸弹，尽管它们（ResNet有不同层数的多个模型）的深度看上去有点吓人，但实际上实现一个ResNet并不难。接下来，我们来实现一个较为简单的ResNet——ResNet-56，并在CIFAR-10数据集上训练一下，看看效果如何。

First let's take a look at ResNet-56. It's proposed by Kaiming He et al., and is designed to confirm the effect of residual networks. It has 56 weighted layers, deep but simple. The structure is shown in the figure below:

首先来看一下ResNet-56这个神经网络。它是何凯明等在ResNet论文中提出的、用于验证残差网络效果的一个相对简单的残差网络（尽管它很深，深度达到了56个权重层）。图示如下：

Fig. 1 The structure of ResNet-56

Seems a little bit long? Don't worry, let's do this step by step.

看起来有点长了是不是？别担心，我们一步一步来做。

Python 3.6

TensorFlow 1.4.0

Numpy 1.13.3

OpenCV 3.2.0

Also prepare some basic knowledge on Python programming, digital image processing and convolutional neural networks. If you are already capable of building, training and validating your own neural networks with TensorFlow, you don't have to read this post.

另外，请确保自己有一点点Python编程、数字图像处理和卷积神经网络的知识储备。如果你已经具备用TensorFlow自行搭建神经网络并进行训练、测试的能力，就不必阅读本文了。

Prepare(import) the tools for our project, including all that I mentioned above. Like this :P

准(i)备(m)所(p)需(o)工(r)具(t)，上一部分已提到过。如下：

```python
import tensorflow as tf
```

Wait... What's this? TensorChain? Another deep learning framework like TensorFlow?

Uh, nope. This is my own encapsulation of some TensorFlow APIs, for the sake of easing your pain. You'll only have to focus on "what's what" in the beginning. We'll look into my implementation of this encapsulation later, when you are clear how everything goes. Please download this file and put it where your code file is, and import it.

等等...最后这个是个什么鬼？ TensorChain？另一个深度学习框架吗？

呃...并不是。这个是我对一些TensorFlow API的封装，为了减轻你的痛苦才做的。作为初学者，你只需要关注用TensorFlow搭建网络模型的这个过程，分清东西南北。回头等你弄清了大体流程后，我们再来看这个的实现细节。请先下载这个文件并把它与你的代码放在同一文件夹下，然后就可以import了。

Every neural network requires an input; you always have to specify the details of a problem before asking the computer to solve it. All of the variables and constants in TensorFlow are objects of type `Tensor`.

每个神经网络都需要有输入——毕竟你想找电脑解决一些问题的话，你总得告诉它问题的一些细节吧？TensorFlow中所有的变量、常量都是

```python
input_tensor = tf.placeholder(dtype=tf.float32, shape=[None, 32, 32, 3])
```

In supervised learning, the correctly labeled data (the ground truth) also needs to be defined:

```python
ground_truth = tf.placeholder(dtype=tf.float32, shape=[None, 10])
```

We want the label data to be in the one-hot encoding format: an array of length 10, denoting 10 classes, in which exactly one position is a '1' and all the others are '0's.

我们需要标记的数据呈One-Hot编码格式（又称为一位有效编码），意思是如果有10个类别，那么数组长度就是10，每一位代表一个类别。只有一个位置上是1（代表图片被分为这个类），其他位上都是0。
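One-hot encoding itself is a one-liner (a plain-Python illustration):

```python
def one_hot(label, num_classes=10):
    # Length-num_classes vector: a single 1.0 at position `label`, 0.0 elsewhere.
    return [1.0 if i == label else 0.0 for i in range(num_classes)]
```

For example, `one_hot(3)` gives a length-10 vector whose fourth entry is 1.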

For now, let's use our TensorChain to build it fast. Under most circumstances, each computation is based on the input data or on the result of the previous computation, so our network (or most of it) looks more like a chain than a web. Every time we add a new operation (layer), we add it to our one and only TensorChain object.

The constructor of the TensorChain class takes a Tensor object as its parameter, which also serves as the input tensor of the chain. As we mentioned earlier, all we have to do is add operations. See my ResNet-56 code:

现在呢，我们先用TensorChain来快速盖楼。因为我们遇到的大多数情况下，所有的计算都是在输入数据或者这个计算的前一个计算结果基础上进行的，所以我们的网络（至少是它的绝大部分）会看起来像个链而不是所谓的网。每次我们添加一个新的运算（层），我们会把它加到这个独一无二的TensorChain对象。只要记得在使用原生TensorFlow API前把它的

TensorChain类的构造函数需要一个Tensor对象作为参数，这个对象也正是被拿来作为这个链的输入层。正如我们之前所说的，只要在这个对象上添加运算即可。写个ResNet-56，代码很简单：

```python
chain = TensorChain(input_tensor) \
```

This is it? Right, this is it! Isn't it cool? Didn't seem that high, huh? That's because I encapsulated that huge mess of weights and biases, leaving only a few parameters that decide the structure of the network. Later in this post we'll talk about the actual work that these functions do.

就这？没错呀，就这！稳不稳？似乎看起来也没56层那么高呀？毕竟这些函数被我封装得太严实了，只留出几个决定网络结构的几个参数供修改。这篇博客后边就会讲到这些函数究竟干了点什么事儿。

In supervised learning, you always have to tell the model its learning target. To tell the model how to optimize, you have to let it know how, by how much, and in which direction it should change its parameters. This is done with a loss function. Therefore, we need to define a loss function for our ResNet-56 model (which we designed for this classification problem) so that it can learn and optimize.

A commonly used loss function in classification problems is cross entropy. It's defined below:

搞监督学习，总是要让模型按照“参考答案”去改的。要改就得让它知道怎么改、改多少、往什么方向改，这也就是

分类问题上一个常用的损失函数是交叉熵。定义如下式：

$$C=-\frac{1}{n}\sum_x{\left[y\ln a+(1-y)\ln(1-a)\right]}$$

in which \(y\) is the expected(or say correct) output and \(a\) is the actual output.

This seems a little bit complicated, but it's not a hard job, since TensorFlow has implemented it already! You can also try implementing it yourself within one line if you want. For now we use the pre-defined cross-entropy loss function:

其中\(y\)为期望输出（或者说参考答案），\(a\)为实际输出。

略复杂呀...这个用程序怎么写？其实也不难。。。毕竟TensorFlow都帮我们实现好啦！（有兴趣的话也可以自己尝试着写一下，同样一行代码即可搞定）现在你只需要来这么一句：
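If you're curious what the op computes, here is the formula in plain Python (a per-sample sketch where n is the length of the output vector; not the TensorFlow implementation):

```python
import math

def cross_entropy(y, a, eps=1e-12):
    # C = -(1/n) * sum(y * ln(a) + (1 - y) * ln(1 - a)); eps guards against log(0).
    return -sum(yi * math.log(ai + eps) + (1 - yi) * math.log(1 - ai + eps)
                for yi, ai in zip(y, a)) / len(y)
```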

```python
loss = tf.reduce_mean(tf.losses.softmax_cross_entropy(ground_truth, prediction))
```

and it returns a tf.Tensor that denotes an average of cross entropies (don't forget that this is a batch). As for the 'softmax' before the 'cross_entropy', it's a function that projects the data in an array to the range 0~1, which allows us to compare our prediction with the ground truth (in one-hot encoding). The definition is simple too:

就可以创建一个表示交叉熵平均值（别忘了这可是一个batch）的Tensor了。至于cross_entropy前边的那个

$$S_i=\frac{e^{V_i}}{\sum_j{e^{V_j}}}$$
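In plain Python, the softmax formula looks like this (subtracting the maximum is a standard trick to keep exp() from overflowing; it doesn't change the result):

```python
import math

def softmax(v):
    # Map arbitrary scores to (0, 1) values that sum to 1.
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]
```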

Now we have the loss function. We'll have to tell its value to an optimizer:

现在误差函数已经有了，我们需要把它的值告诉一个优化器（

```python
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
```

Also, tell the optimizer which Tensor the loss is. The returned object is a train operation:

当然还要告诉它要减小的损失函数是哪个Tensor，这个函数返回的是一个训练操作（

```python
train = optimizer.minimize(loss)
```

The neural network is finished. It's time to grab some data and train it.

其实到这里为止，神经网络已经搭建好了。是时候搞点数据来训练它了。

Remember how we defined the placeholders? It's time to fetch some data that fits the placeholders and train it. See how CIFAR-10 dataset can be fetched on its website.

```python
def unpickle(file):
```

The returned value

返回值

```python
batch = unpickle(DATA_PATH + 'data_batch_{}'.format(i))  # 'i' is the loop variable
```

The details of data processing are not covered here. Try running it step by step to see the results.

The

处理的细节不再赘述。你可以尝试一步一步运行来看看每一步的结果。

这样我们拿到的

```python
with tf.Session() as session:
```

A

When running

每次运行一个TensorFlow模型（无论是训练还是测试）时，都需要通过tf.Session()创建一个

运行

I'm also interested in the loss value at each iteration (feeding a batch of data and executing one forward propagation and one back-propagation) of the training process. Therefore, what I'll pass in the first parameter is not just the train op, but also the loss tensor. The session.run() above should be modified to:

然而呢，我还想看看每次迭代（即把一个batch送进去，执行一次正向传播与反向传播这个过程）中损失函数变成了多大，来监控一下训练的效果。这样，需要session.run()的就不仅是那个train运算，还要加上loss运算。将上边的session.run()部分改为：

```python
[train_, loss_value] = session.run([train, loss],
```

This is when the return value of session.run() becomes useful. Its values correspond one-to-one to the entries of the first parameter of run(): they are the actual values of those tensors. In our example,

这时候，session.run()函数的返回值就有意义了。它与第一个参数的内容一一对应，分别是该参数中各个operation的实际输出值。像这个例子里边，

Actually, one epoch (training the model once over the whole dataset) is not enough for the model to fully optimize. I trained this model for 40 epochs and added some loop variables to display the result. You can see my code and my output below. It's highly recommended that you train this with a high-performance GPU, or it would take a century to train your model to a satisfactory degree.

实际上，一个epoch（把整个数据集都在模型里过一遍的周期）并不足以让模型充分学习。我把这个模型训练了40个epoch并且加了一些循环变量来输出结果。我的代码和结果如下。强烈建议用一个高性能GPU训练（如果手头没有，可以租一个GPU服务器），不然等别人把毕设论文逗写完的时候，你还在训练就很尴尬了。

```python
import tensorflow as tf
```

Fig. 2 Training result: cross entropy has dropped below 0.5
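The epoch/iteration bookkeeping described above can be sketched independently of any framework (the names are mine; `train_step` stands in for the `session.run([train, loss], ...)` call):

```python
def run_epochs(data, batch_size, num_epochs, train_step):
    # One iteration = one batch fed through train_step; one epoch = the whole dataset once.
    losses = []
    for epoch in range(num_epochs):
        for start in range(0, len(data), batch_size):
            losses.append(train_step(data[start:start + batch_size]))
    return losses
```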

In a word, building & training neural network models with TensorFlow involves the following steps:

1. Decide the input (define placeholders)

2. Add operations (layers) on the input or on existing tensors

3. Define the loss function

4. Select an optimizer and create the train operation

5. Process data and run the training in a session

总而言之，用TensorFlow建立、训练一个神经网络模型分以下几步：

1. 定义

2. 在已有的Tensor上添加运算（

3. 像之前添加的那些运算一样，定义

4. 选择一个

5. 把

Wait, it's too late to leave now!

TensorChain saved you from having to deal with a mess of TensorFlow classes and functions. Now it's time that we take a closer look at how TensorChain is implemented, thus understanding the native TensorFlow APIs.

别走呢喂！

TensorChain让你不至于面对TensorFlow中乱糟糟的类型和函数而不知所措被水淹没。现在是时候近距离观察一下TensorChain是如何实现的，以便理解TensorFlowAPI了。

Let's begin with TensorFlow variables. Variables in TensorFlow are similar to variables in C, Java or any other strongly typed programming language: they have a type, though it is not necessarily declared explicitly at definition. Usually they will change as the training process goes on, approaching an optimal value.

The most commonly used variables in TensorFlow are weights and biases. I guess that you have seen formulae like:

先说TensorFlow的变量。TensorFlow的变量和C，Java以及其他强类型语言类似——都有一个类型，尽管不一定在它的定义时就显式地声明。通常它们会随着训练的进行而不断变化，达到一个最佳的值附近。

TensorFlow中最常用的变量就是weights和biases（权重和偏置）。想必你应该见过这样的式子吧：

$$y=Wx+b$$

The \(W\) here is the weight, and the \(b\) is the bias. When implementing common network layers, these two are typically used as the layers' parameters. For instance, at the very beginning of our ResNet-56, we had a 3x3 convolution layer with 16 channels. Its implementation in TensorChain is:

这里\(W\)就是权重，\(b\)就是偏置。在定义一些常用的层时，我们往往也是用这两个变量作为这些层中的参数。比如说，在我们ResNet-56最开始，我们用到了一个3x3大小、16个通道的卷积层，TensorChain中，它的实现如下：

```python
def convolution_layer_2d(self, filter_size: int, num_channels: int, stride: int = 1, name: str = None,
```

See? On line 16, we used a

看见了吧？16行上，我们用了一个

```python
tf.Variable(tf.truncated_normal(shape, stddev=sigma), dtype=tf.float32, name=suffix)
```

To define weight or bias variables, create a `tf.Variable` object.

要定义权重或者偏置变量，请创建一个

Going on with the parameters of the

The 4th parameter

接着说

第四个参数

tf.nn.conv2d() is just an example of TensorFlow operations.

tf.nn.conv2d()只是TensorFlow运算（

All the functions in the TensorChain class are based on the most basic TensorFlow operations and variables. After learning about these basic concepts, you can actually abandon TensorChain and try implementing your own neural networks!

TensorChain类中的所有成员函数都是基于最基本的TensorFlow运算和变量的。实际上，了解了这些，你现在已经可以抛开TensorChain的束缚，去尝试实现你自己的神经网络了！

I'm not joking just now! But I know that there are a lot of things that you still don't understand about using TensorFlow - like "how do I visualize my computation graph", "how do I save/load my model to/from files", "how do I record some tensors' values while training" or "how do I view the loss curves" - after all TensorFlow APIs are far more complicated than just building those nets. Those are also important techniques in your research. If you'd rather ask me than spending some time experimenting, please go on with reading.

我，我真没开玩笑！但是我知道关于如何使用TensorFlow，你还有许许多多的问题，好比“如何可视化地查看我的计算图结构”、“如何存储/读取模型文件”、“如何记录训练过程中某些Tensor的真实值”、“如何查看损失函数的变化曲线”——毕竟TensorFlow的API太复杂了，远比搭建神经网络那点函数复杂得多。上边说的那些是你使用TensorFlow研究过程中的重要技巧。如果你愿意听我讲而不想花些时间尝试的话，请继续读下去。

The very first thing that you may want to do after training a network model with nice outcomes would be saving it. Saving a model is fairly easy: just use a `tf.train.Saver`:

训练出一个看起来输出还不错的神经网络模型后你想做的第一件事恐怕就是把它存下来了吧？保存模型其实非常简单：只要用一个

```python
with tf.Session() as session:
```

I saved my model and variable values to 'models/model.ckpt'. But actually, you'll find 3 files in the 'models' directory: `model.ckpt.meta`, `model.ckpt.index` and `model.ckpt.data-00000-of-00001`.

我把我的模型和变量值存到了'models/model.ckpt'文件里。但是！实际上在models目录里你会找到三个文件：

```python
with tf.Session() as session:
```

Remember that session.run(tf.global_variables_initializer()) shouldn't be executed, since the variables are already initialized with the contents of your saved checkpoint.

If you only need the graph to be loaded, only load the `.meta` file:

记住，这时候就不要再去执行session.run(tf.global_variables_initializer())了，因为变量已经用存储的checkpoint文件内容初始化过了。

如果只需要读取计算图结构，只要读取

```python
with tf.Session() as session:
```

Function

```python
with tf.Session() as session:
```

To retrieve normal tensors, you'll have to append a `:0` to the Tensor's name.

要取回一般的Tensor，需要在Tensor的name属性值后边加一个

Sometimes you may want to explain some algorithms or principles with beautiful formulae in your blog. How to do this? Edit them in Microsoft Word, take a screenshot, crop it and put it in the blog post? When you finish your article and find out that you missed a symbol in the pictures - oh man, gotta repeat that again? Stop using those images now! A beautiful math display engine - MathJax allows you to code math like a coder.

$$\mathcal{C}\phi \delta e \mathfrak{M}\alpha th \mathit{I}n \mathcal{H}ex\sigma \mathbb{N}o\omega!$$

First, install *hexo-math* in your Hexo blog directory.

```shell
$ npm install hexo-math --save
```

Then, add *math* configurations in your *_config.yml* file.

```yaml
math:
```

Finally, also add to your *_config.yml* file in the **theme directory** these configurations below.

```yaml
mathjax:
```

Maybe you don't have to use math in every blog post. If so, insert the following snippet in your Markdown file also works.

```html
<script src='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/MathJax.js?config=TeX-MML-AM_CHTML' async></script>
```

MathJax supports the same grammar that LaTeX does. To learn more about LaTeX, please refer to Chapter 3 of The Not So Short Introduction to LaTeX (CN version also available here).

Use a "\\(" and a "\\)" to insert an inline formula (they mark the boundaries of the formula), or a pair of "$$" to insert one that occupies its own line. I'll give a few examples below.

```latex
\\(\mathcal{F}(x)=\mathcal{H}(x)-x\\)
```

\(\mathcal{F}(x)=\mathcal{H}(x)-x\)

```latex
\\(E=mc^2\\)
```

\(E=mc^2\)

```latex
$$\lim_{n\rightarrow \infty}(1+2^n+3^n)^\frac{1}{x+\sin n}$$
```

$$\lim_{n\rightarrow \infty}(1+2^n+3^n)^\frac{1}{x+\sin n}$$

```latex
$$\mathcal{C}\phi \delta e \mathfrak{M}\alpha th \mathit{I}n \mathcal{H}ex\sigma \mathbb{N}o\omega!$$
```

$$\mathcal{C}\phi \delta e \mathfrak{M}\alpha th \mathit{I}n \mathcal{H}ex\sigma \mathbb{N}o\omega!$$

This list will be appended whenever I find any more.

This is a tough problem. Hexo renderer would first render the .md file into a .html file, and the MathJax script will only work on the .html file. Therefore, when there are multiple subscript symbols, they might be rendered as <em></em> tags.

For example: when you actually need a full-line formula \(x_{i+1}+y_j\), perhaps you'll get a "$$x*{i+1}+y*j$$" instead. Look into the HTML code and you'll understand why.

My solution for now is to give up this Markdown emphasis symbol: both "_" and "*" can be used as emphasis tags, so the alternative "*" still works if we remove "_". Escaping with "\_" also works, but "_" appears so frequently in math (while "*" doesn't) that escaping would turn our math code into a mess.

How do we do this? Bravely look into the *node_modules* directory and find the renderer of the Hexo engine. My renderer is *marked*, which is the default for Hexo. There is a file named *marked.js* inside the *node_modules/marked/lib/* directory. You can find two appearances of "em:", like this:

```js
var inline = {
```

and

```js
inline.pedantic = merge({}, inline.normal, {
```

Modify the regular expression after them - remove the one about "_"s and leave the one about "*"s. The new version would be:

```js
var inline = {
```

and

```js
inline.pedantic = merge({}, inline.normal, {
```

From now on, you can use "_" as the subscript symbol in MathJax freely, without worrying about it becoming <em></em> tags anymore.

For example, in my previous post about ResNet, I tried to use the following code to start a new line in an equation while aligning the lines to the equal sign:

```latex
$$\frac{\partial{\mathcal{E}}}{\partial{x_l}} & = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\frac{\partial{x_L}}{\partial{x_l}}\\\\
```

The "&" symbols were used to align the lines to a certain point. However, the result was a "Misplaced &" prompt.

By disabling MathJax, I found that the rendered equation was correct, which means **the problem isn't with the Hexo renderer**. This was when I realized that although `\begin{equation}` and `\end{equation}` are not necessary, `\begin{split}` and `\end{split}` shouldn't be removed. Surrounding the equation with them works. My code is here:

```
$$\begin{split}
\frac{\partial{\mathcal{E}}}{\partial{x_l}} & = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\frac{\partial{x_L}}{\partial{x_l}}\\\\
& = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\Big(1+\frac{\partial{}}{\partial{x_l}}\sum_{i=l}^{L-1}\mathcal{F}(x_i,\mathcal{W}_i)\Big)
\end{split}$$
```

And it runs like:

$$\begin{split}

\frac{\partial{\mathcal{E}}}{\partial{x_l}} & = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\frac{\partial{x_L}}{\partial{x_l}}\\

& = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\Big(1+\frac{\partial{}}{\partial{x_l}}\sum_{i=l}^{L-1}\mathcal{F}(x_i,\mathcal{W}_i)\Big)

\end{split}$$

If you encounter other issues while using MathJax with Hexo (with or without a solution), feel free to leave a comment below!

Deep learning researchers have been constructing skyscrapers in recent years. In particular, VGG nets and GoogLeNet have pushed the depths of convolutional networks to the extreme. But a question remains: if time and money aren't problems, do deeper networks always perform better? Not exactly.

When residual networks were proposed, researchers around the world were stunned by their depth. "Jesus Christ! Is this a neural network or the Burj Khalifa?" But **don't be afraid!** These networks are deep, but their structures are simple. Interestingly, they not only defeated all opponents in the classification, detection and localization challenges of ImageNet 2015, but were also the main innovation in the best paper of CVPR 2016.

VGG nets proved the benefit of the representation depth of convolutional neural networks - at least within a certain range, to be exact. However, when Kaiming He et al. tried to deepen some plain networks, the training and test errors stopped decreasing once the networks reached a certain depth (which is not surprising) and soon degraded. This is not an overfitting problem, because the training error also increased; nor is it a gradient vanishing problem, because some techniques (e.g. batch normalization [4]) ease that pain.

Fig. 1 The degradation problem

What seems to be the cause of this degradation? Obviously, deeper neural networks are more difficult to train, but that doesn't mean deeper networks must yield worse results. To explain this problem, Balduzzi et al. [3] identified the shattered gradients problem - as depth increases, gradients in standard feedforward networks increasingly resemble white noise. I will write about that later.

As the old Chinese saying goes, "a journey of a thousand miles begins with a single step". Although ResNets can be as deep as a thousand layers, they are built from these basic residual blocks (the right part of the figure).

Fig. 2 Parts of plain networks and a residual block (or residual unit)

In comparison, the basic units of plain network models look like the one on the left: a ReLU function after a weight layer (usually also with biases), repeated several times. Let's denote the desired underlying mapping (the ideal mapping) of the two layers as \(\mathcal{H}(x)\), and the mapping they actually learn as \(\mathcal{F}(x)\). Clearly, the closer \(\mathcal{F}(x)\) is to \(\mathcal{H}(x)\), the better the fit.

However, He et al. explicitly let these layers fit a residual mapping instead of the desired underlying mapping. This is implemented with "shortcut connections", which skip one or more layers, simply performing an identity mapping whose output is added to the output of the stacked weight layers. This way, \(\mathcal{F}(x)\) does not try to fit \(\mathcal{H}(x)\), but \(\mathcal{H}(x)-x\). The whole structure (from the identity mapping branch to the merging of the branches by addition) is named a "residual block" (or "residual unit").

What's the point of this? Let's do a simple analysis. The computation done by the original residual block is: $$y_l=h(x_l)+\mathcal{F}(x_l,\mathcal{W}_l),$$ $$x_{l+1}=f(y_l).$$

Here are the definitions of symbols:

\(x_l\): input features to the \(l\)-th residual block;

\(\mathcal{W}_l=\{W_{l,k}\}_{1\leq k\leq K}\): a set of weights (and biases) associated with the \(l\)-th residual unit. \(K\) is the number of layers in this block;

\(\mathcal{F}(x,\mathcal{W})\): the residual function, which we talked about earlier. It's a stack of 2 conv. layers here;

\(f(x)\): the activation function. We are using ReLU here;

\(h(x)\): identity mapping.

If \(f(x)\) is also an identity mapping (as if we weren't using any activation function), the two equations combine into:

$$x_{l+1}=x_l+\mathcal{F}(x_l,\mathcal{W}_l)$$

Therefore, for any deeper block \(L\), we can express \(x_L\) recursively in terms of the features \(x_l\) of any shallower block \(l\):

$$x_L=x_l+\sum_{i=l}^{L-1}\mathcal{F}(x_i,\mathcal{W}_i)$$
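This recursion can be checked numerically with a toy stack of residual blocks (the dense layers and 4-dimensional shapes below are illustrative assumptions of mine, not the paper's convolutional architecture):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Toy residual function F(x, W): a stack of two weight layers, as in the text.
def F(x, w):
    w1, w2 = w
    return w2 @ relu(w1 @ x)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
weights = [(0.1 * rng.standard_normal((4, 4)),
            0.1 * rng.standard_normal((4, 4))) for _ in range(5)]

# Forward through 5 identity-shortcut blocks: x_{l+1} = x_l + F(x_l, W_l)
x, residuals = x0, []
for w in weights:
    r = F(x, w)
    residuals.append(r)
    x = x + r

# The unrolled form holds exactly: x_L = x_0 + sum_i F(x_i, W_i)
print(np.allclose(x, x0 + sum(residuals)))  # True
```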

That's not the end yet! When it comes to the gradients, according to the chain rule of backpropagation, we get a beautiful expression:

$$\begin{split}

\frac{\partial{\mathcal{E}}}{\partial{x_l}} & = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\frac{\partial{x_L}}{\partial{x_l}}\\

& = \frac{\partial{\mathcal{E}}}{\partial{x_L}}\Big(1+\frac{\partial{}}{\partial{x_l}}\sum_{i=l}^{L-1}\mathcal{F}(x_i,\mathcal{W}_i)\Big)

\end{split}$$

What does it mean? It means that the information is directly backpropagated to ANY shallower block. Thanks to the additive term 1, the gradient of a layer hardly vanishes even when the residual weights are small, and there is no long multiplicative chain of weights to make it explode.
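A minimal scalar sketch (my own toy example, not from the paper) shows the effect of that additive 1. With \(x_{l+1}=x_l+F(x_l)\) and \(F(x)=0.1\tanh(x)\), the derivative \(dx_L/dx_l\) is a product of factors \(1+F'(x_i)\), each at least 1:

```python
import math

# Forward through a chain of scalar identity-shortcut updates.
def forward(x, depth=10):
    for _ in range(depth):
        x = x + 0.1 * math.tanh(x)
    return x

# Estimate dx_L / dx_l by central finite differences.
x, eps = 0.5, 1e-6
grad = (forward(x + eps) - forward(x - eps)) / (2 * eps)
print(grad > 1.0)  # True: every factor 1 + 0.1/cosh^2(x_i) exceeds 1
```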

It's important that we use an identity mapping here! Just consider a simple modification, for example \(h(x)=\lambda_l x_l\) (where \(\lambda_l\) is a modulating scalar). The definitions of \(x_L\) and \(\frac{\partial{\mathcal{E}}}{\partial{x_l}}\) become:

$$x_L=(\prod_{i=l}^{L-1}\lambda_i)x_l+\sum_{i=l}^{L-1}(\prod_{j=i+1}^{L-1}\lambda_j)\mathcal{F}(x_i,\mathcal{W}_i)$$

$$\frac{\partial{\mathcal{E}}}{\partial{x_l}}=\frac{\partial{\mathcal{E}}}{\partial{x_L}}\Big((\prod_{i=l}^{L-1}\lambda_i)+\frac{\partial{}}{\partial{x_l}}\sum_{i=l}^{L-1}(\prod_{j=i+1}^{L-1}\lambda_j)\mathcal{F}(x_i,\mathcal{W}_i)\Big)$$

For extremely deep neural networks where \(L\) is very large, \(\prod_{i=l}^{L-1}\lambda_i\) can be either too small or too large, causing gradients to vanish or explode. For an \(h(x)\) with a complex definition, the gradient can be extremely complicated, losing the advantage of the skip connection. The skip connection works best when the grey channel in Fig. 3 covers no operations (except the addition) and stays clean.

Interestingly, this confirms the philosophy of "the greatest truths are the simplest" once again.

Wait a second... "\(f(x)\) is also an identity mapping" is just our assumption. The activation function is still there!

Right. There IS an activation function, but it's moved elsewhere. In fact, the original residual block is still a little problematic - the output of one residual block is not exactly the input of the next, since there is a ReLU activation after the addition (it did NOT really keep the identity mapping to the next block!). Therefore, in [2], He et al. fixed the residual blocks by changing the order of operations.

Fig. 3 New identity mapping proposed by He et al.

Besides using a simple identity mapping, He et al. also discussed the positions of the activation function and the batch normalization operation. Assume we have a special (asymmetric) activation function \(\hat f(x)\) that only affects the path to the next residual unit. Now our definition of \(x_{l+1}\) becomes:

$$x_{l+1}=x_l+\mathcal{F}(\hat f(x_l),\mathcal{W}_l)$$

With \(x_l\) still multiplied by 1, information is fully backpropagated to shallower residual blocks. And the good thing is that using this asymmetric activation function after the addition (partial post-activation) is equivalent to using it at the start of the next residual function (pre-activation)! This is why He et al. chose pre-activation - otherwise it would be necessary to actually implement that magical activation function \(\hat f(x)\).

Fig. 4 Using asymmetric after-addition activation is equivalent to constructing a pre-activation residual unit
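Here is a toy sketch of the two orderings, using dense layers instead of the paper's conv layers (the shapes and the zero-weight check are my own illustrative assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def post_act_block(x, w1, w2):
    # Original unit: ReLU applied AFTER the addition breaks the identity path.
    return relu(x + w2 @ relu(w1 @ x))

def pre_act_block(x, w1, w2):
    # Full pre-activation: the shortcut path stays clean.
    return x + w2 @ relu(w1 @ relu(x))

# With zero residual weights, a block should be a pure identity mapping.
x = np.array([-1.0, 2.0, -3.0, 4.0])
zero = np.zeros((4, 4))
print(pre_act_block(x, zero, zero))   # [-1.  2. -3.  4.] - identity preserved
print(post_act_block(x, zero, zero))  # [0. 2. 0. 4.] - negatives clipped
```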

Here are the ResNet architectures for ImageNet. Building blocks are shown in brackets, with the numbers of blocks stacked. Downsampling is performed by the first block of each stage (starting from conv3_x). Each column represents one of the residual networks, and the deepest one has 152 weight layers! Since ResNets were proposed, VGG nets - officially called "Very Deep Convolutional Networks" - are not relatively deep anymore. Maybe call them "A Little Bit Deep Convolutional Networks".

Table. 1 ResNet architectures for ImageNet.

He et al. trained ResNet-18 and ResNet-34 on the ImageNet dataset and compared them to plain convolutional networks. In Fig. 5, the thin curves denote training error, and the bold ones denote validation error. The figure on the left shows the results of the plain networks (where the 34-layer one has higher error rates than the 18-layer one), and the figure on the right shows that residual networks perform better than plain ones, while the deeper one outperforms the shallower one.

Fig. 5 Training ResNet on ImageNet

He et al. also tried various types of shortcut connections to replace the identity mapping, and various positions of activation functions / batch normalization. Experiments show that the original identity mapping and full pre-activation yield the best results.

Fig. 6 Various shortcuts in residual units

Table. 2 Classification error on CIFAR-10 test set with various shortcut connections in residual units

Fig. 7 Various usages of activation in residual units

Table. 3 Classification error on CIFAR-10 test set with various usages of activation in residual units

Residual learning can be crowned as "ONE OF THE GREATEST HITS IN DEEP LEARNING FIELDS". With a simple identity mapping, it solved the degradation problem of deep neural networks. Now that you have learned about the concept of ResNet, why not give it a try and implement your first residual learning model today?

Convolutional neural networks (CNNs) have enjoyed great success in computer vision research in the past few years. A number of attempts have been made to improve the accuracy and performance of the original CNN architecture. In 2014, Karen Simonyan et al. investigated the effect of depth on CNN accuracy in large-scale image recognition (thereby also proposing a series of very deep CNNs usually called VGG nets). The results confirmed the importance of CNN depth in visual representations.

Before introducing VGG net, let's take a glance at prior convolutional neural networks.

Basic neural network structures (for example, the multi-layer perceptron) learn patterns on 1D vectors, which cannot cope well with 2D features in images. In 1998, LeCun et al. proposed a convolutional network model called LeNet-5. Its structure is fairly simple: two convolution layers, two subsampling layers and a few fully connected layers. This network was used to solve a digit recognition problem. (If you need to learn more about the convolution operation, please refer to Google or *Digital Image Processing* by Rafael C. Gonzalez.)

Fig. 1 Architecture of LeNet
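As a quick sanity check of those layer sizes, here is a rough spatial shape walk (a sketch assuming the classic 32x32 input with 5x5 valid convolutions and 2x2 subsampling; channel counts are omitted):

```python
def conv(size, k=5):
    return size - k + 1   # valid convolution, stride 1

def pool(size):
    return size // 2      # 2x2 subsampling

s = 32
s = pool(conv(s))  # C1 -> S2: 32 -> 28 -> 14
s = pool(conv(s))  # C3 -> S4: 14 -> 10 -> 5
print(s)  # 5: the resulting 5x5 maps feed the fully connected layers
```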

In 2012, Alex Krizhevsky et al. won first place in ILSVRC-2012 (ImageNet Large-Scale Visual Recognition Challenge 2012), achieving a top-5 error rate of 15.3% with a convolutional network model, while the second-best entry only achieved 26.2%. The network, namely AlexNet, was trained on two GTX 580 3GB GPUs in parallel; since a single GTX 580 has only 3GB of memory, the maximum size of the network was limited. This model proved the effectiveness of CNNs under complicated circumstances and the power of GPUs. So what if the network can go deeper? Will the top-5 error rate get even lower?

Fig. 2 Architecture of AlexNet

Here comes our hero - VGG nets. By the way, VGG is not the name of the network, but of the authors' group - the *Visual Geometry Group* from the Department of Engineering Science, University of Oxford. The networks they proposed were therefore named after the group. The main contributions of VGG nets are: 1. more but smaller convolution filters; 2. great network depth.

Rather than using relatively large receptive fields in the first convolution layers, Simonyan et al. selected very small 3x3 receptive fields throughout the whole net, convolved with the input at every pixel with a stride of 1. As shown in the figures below, a stack of two 3x3 convolution layers has an effective receptive field of 5x5. We can similarly conclude that a stack of three 3x3 convolution layers has an effective receptive field of 7x7.

Fig. 3 A convolution layer with one 5x5 conv. filter has a receptive field of 5x5

Fig. 4 A stack of two 3x3 conv. filters also has an effective receptive field of 5x5

Now that we're clear that stacks of small-kernel convolution layers have equally sized receptive fields, why are they the better choice? The first advantage is incorporating more rectification layers instead of a single one, since every convolution layer includes an activation function (usually ReLU). More rectification brings more non-linearity, and more non-linearity makes the decision function more discriminative. Also, when the receptive field isn't too large, a stack of 3x3 convolution layers has fewer parameters to train. Assuming the numbers of input and output channels of a convolution layer stack are equal (call it C) and the receptive field is 5x5, we have \(2\times3\times3\times C\times C=18C^2\) parameters instead of \(5\times5\times C\times C=25C^2\). Similarly, when the receptive field is 7x7, we have \(3\times3\times3\times C\times C=27C^2\) instead of \(7\times7\times C\times C=49C^2\). And when the field gets even larger? The advantage of a function with \(O(n)\) complexity over one with \(O(n^2)\) only grows as \(n\) grows.
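The receptive-field and parameter arithmetic above can be sketched as a back-of-the-envelope helper (C = 64 is an arbitrary choice of mine):

```python
# Effective receptive field of n stacked k x k conv layers (stride 1):
# each extra layer adds (k - 1) pixels.
def effective_rf(n, k=3):
    return 1 + n * (k - 1)

# Parameters of a stack of n k x k conv layers with C in/out channels
# (biases ignored), following the counting in the text.
def stack_params(n, k, c):
    return n * k * k * c * c

c = 64
print(effective_rf(2), stack_params(2, 3, c), stack_params(1, 5, c))
# 5 73728 102400  -> two 3x3 layers beat one 5x5: 18C^2 < 25C^2
print(effective_rf(3), stack_params(3, 3, c), stack_params(1, 7, c))
# 7 110592 200704 -> three 3x3 layers beat one 7x7: 27C^2 < 49C^2
```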

Cliché time. Just like any blogger mentioning VGG nets, here are the network structures proposed by Simonyan et al.

Table. 1 VGG nets of various depths

Read the table column by column. Each column (A, A-LRN, B, C, D, E) corresponds to one network structure. As you can see, the networks grow from 11 layers (in net A) to 19 layers (in net E). Each time something is added relative to the previous net, it appears in bold. Clearly, LRN (Local Response Normalization) didn't work well in this case (actually, the A-LRN net performed worse than A while consuming much more memory and computation time), and was thus removed.

What's worth mentioning are the 1x1 convolution layers appearing in network C. They are a way to increase the non-linearity of the decision function (again by introducing activation functions) while keeping the size of the receptive fields unchanged.

Bad initialization could stall learning due to the instability of gradients in deep networks. Therefore, the authors first trained network A, which is shallow enough to be trained with random initialization. The deeper networks (B to E) were then initialized with the pre-trained models, and only the weights of the new layers were randomly initialized.

In spite of the larger number of parameters and the greater depth compared to AlexNet, the VGG nets required fewer epochs to converge due to the implicit regularization imposed by greater depth and smaller convolution filter sizes, and due to the pre-initialization of certain layers. They also generalize well to other datasets, achieving state-of-the-art performance. Results of VGG nets in comparison with other models in ILSVRC are shown in the table below.

Table. 2 VGG net performance, in comparison with the state of the art in ILSVRC classification

In conclusion, representation depth is beneficial for classification accuracy, and state-of-the-art performance on the ImageNet challenge dataset can be achieved with a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) of substantially increased depth.

You might ask: why not go even deeper? With more powerful GPUs (the authors used Titan Black GPUs), we can absolutely train deeper networks that perform better! Not exactly. Problems arise as networks get too deep, and this is where ResNet comes in.

Whenever you typed an ordinary straight apostrophe like

```
'
```

Hexo would convert it to a symbol like this:

```
’
```

You might say that this is also an apostrophe, but it really looks UNBEARABLE in the articles. It's been a problem bothering me for more than a month. (I'm not saying that this is the reason for not updating my blog, but I don't mind if you think so!)

Therefore, I Googled this problem and tried to find other victims. According to their posts, the problem is caused by *marked* - the default Markdown renderer of Hexo. The *smartypants* function of marked is turned on by default.

Now take a look at the introduction of *smartypants* on the *hexo-renderer-marked* page:

smartypants - Use "smart" typographic punctuation for things like quotes and dashes.

C'mon, seriously?

There are a few bloggers who solved this by adding the code below to the _config.yml file in the blog directory.

```yml
marked:
  smartypants: false
```

This worked for most victims (perhaps all of them), but not for me. I have no idea why this config wasn't working, so if anyone finds out the reason, please contact me by e-mail.

If you're sure *smartypants* is causing the problem, and the solution above didn't work for you either, maybe you can try my solution.

Since *hexo-renderer-marked* is installed in the blog's *node_modules* directory (it may also be in your Node.js directory if installed globally), isn't it possible to change its own configuration? I looked at the *index.js* file in the *node_modules/hexo-renderer-marked/* directory. There you are, smartypants!

```js
// defaults in hexo-renderer-marked's index.js (your version may differ)
hexo.config.marked = assign({
  gfm: true,
  pedantic: false,
  sanitize: false,
  tables: true,
  breaks: true,
  smartLists: true,
  smartypants: true, // <- change this to false
  modifyAnchors: '',
  autolink: true
}, hexo.config.marked);
```

Now you know what to do.

Aaaaaaaaaaaaaaaand many thanks to Xizi Wu, the artist of my new avatar! I love it!

```sh
for i in 'Harper' 'Sweet' 'Kobayashi' 'Kawasaki'
```

International linkinpark213 Day is a global anniversary set up by Harper Long in 2011 A.D., celebrated on February 13th every year. The anniversary is officially written as 'linkinpark213 Day', with a lower-case first letter. Its establishment dates back to the early 2010s.

To date, the population celebrating this anniversary has reached 1e-6 million, and its distribution has expanded from a small county to the whole middle-China area, including Hebei, Henan, Shanxi and Shaanxi provinces. Some Japanese residents also plan to celebrate this day in 2019.

According to the modern Chinese habit of writing, 'February 13th' is usually written as '2.13'. Also, '13' and 'B' look similar and are often regarded as equal. Therefore, '2.13' can be transformed into '2B', a common word in Chinese. Although the word is sometimes classified as "offensive", it reflects feelings of optimism, bravery and entertainment.

When the anniversary was first set up, there were no officially specified ways of celebration. People gathered, held parties and enjoyed spending time together.

On the 3rd linkinpark213 Day, a proposal by a high school Chinese teacher was adopted as the official way of celebration - on each linkinpark213 Day, a number of people participate in the Pigeon-Flying Competition funded by Harper Long. This way of celebration has prevailed until now, and every year all participants except Harper Long have won the competition.

Pigeon-Flying is a broadly-accepted traditional Chinese custom, the exact origin of which is too ancient to be traced. Modern scholars tend to believe that the custom was well-known in China no later than 206 A.D. In modern times, pigeon-flying is an activity involving "making a promise" and "not keeping it". According to some folk stories, the activity was first so named when a pigeon keeper forgot to keep a promise he had made with a friend.

```java
public static void main(String[] args) {
```

(Please notice that the pigeon-flying mentioned here is not the same activity as the one performed at every Olympics since 1896 A.D.)

Join our celebration today! You can easily participate in the Pigeon-Flying Competition by not participating.
