Here's my brief summary of all CVPR 2019 papers in the field of visual tracking. Abbreviations without parentheses are part of the paper title; those in parentheses are ones I added based on the paper. Code sketches under some entries are my own simplified illustrations of the core idea, not the authors' implementations.
RGB-based
Single-Object Tracking
(UDT): Unsupervised Deep Tracking
Authors: Ning Wang, Yibing Song, Chao Ma, Wengang Zhou, Wei Liu, Houqiang Li
arXiv Link: https://arxiv.org/abs/1904.01828
Project Link: https://github.com/594422814/UDT
Summary: Trains a robust Siamese network on large-scale unlabeled videos in an unsupervised, forward-and-backward manner: the tracker localizes the target forward through successive frames, then backtracks to its initial position in the first frame, and the consistency between the two passes serves as the training signal (a sketch follows below).
Highlights: Unsupervised learning
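Sketch: A minimal illustration of the forward-backward (cycle-consistency) idea - my own simplification, not the authors' code. `phi` stands for any trainable feature extractor and `crop_at` is a hypothetical helper; the real method builds on a DCFNet-style correlation filter.

```python
import torch.nn.functional as F

def cross_corr(template, search):
    """Response map of a template (1, C, h, w) over search features (1, C, H, W)."""
    return F.conv2d(search, template)

def crop_at(feat, response, size=8):
    """Hypothetical helper: crop a size x size patch around the response peak."""
    _, _, H, W = response.shape
    idx = int(response.flatten().argmax())
    cy = min(max(idx // W, size // 2), feat.shape[2] - size // 2)
    cx = min(max(idx % W, size // 2), feat.shape[3] - size // 2)
    return feat[:, :, cy - size // 2:cy + size // 2, cx - size // 2:cx + size // 2]

def forward_backward_loss(phi, frame1, frame2, label1):
    """label1: Gaussian pseudo-label response for the initial box in frame1."""
    f1, f2 = phi(frame1), phi(frame2)
    resp_fwd = cross_corr(crop_at(f1, label1), f2)    # forward: locate target in frame2
    resp_bwd = cross_corr(crop_at(f2, resp_fwd), f1)  # backward: trace back to frame1
    return F.mse_loss(resp_bwd, label1)               # cycle consistency trains phi
```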
(TADT): Target-Aware Deep Tracking
Authors: Xin Li, Chao Ma, Baoyuan Wu, Zhenyu He, Ming-Hsuan Yang
arXiv Link: https://arxiv.org/abs/1904.01772
Project Link: https://xinli-zn.github.io/TADT-project-page/
Summary: Targets of interest can belong to arbitrary classes and take arbitrary forms, while pre-trained deep features are less effective at modeling such targets and distinguishing them from the background. TADT learns target-aware features, and can therefore recognize targets undergoing significant appearance variations better than pre-trained deep features can (see the sketch below).
Highlights: Target-aware features, better discrimination
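Sketch: The core idea is to rank pre-trained feature channels by a gradient-based importance and keep the most target-aware ones. A rough simplification - the paper derives importance from regression and ranking losses, and the names here are my own:

```python
import torch

def select_target_aware_channels(features, loss, k=256):
    """features: (1, C, H, W) pre-trained deep features with requires_grad=True;
    loss: a scalar computed from them (e.g. ridge regression to a Gaussian label).
    Returns the k channels whose gradients indicate the most target relevance."""
    grads, = torch.autograd.grad(loss, features)
    importance = grads.mean(dim=(2, 3)).squeeze(0)  # global average pool -> (C,)
    topk = importance.topk(k).indices               # most target-aware channels
    return features[:, topk]
```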
(SiamMask): Fast Online Object Tracking and Segmentation: A Unifying Approach
Authors: Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, Philip H.S. Torr
arXiv Link: https://arxiv.org/abs/1812.05050
Project Link: https://github.com/foolwood/SiamMask
Zhihu Link: https://zhuanlan.zhihu.com/p/58154634
Summary: Performs both visual object tracking and semi-supervised video object segmentation in real time with a single, simple approach.
Highlights: Mask prediction in tracking
SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks
Authors: Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, Junjie Yan
arXiv Link: https://arxiv.org/abs/1812.11703
Project Link: http://bo-li.info/SiamRPN++/
Summary: SiamRPN++ breaks the translation-invariance restriction through a simple yet effective spatial-aware sampling strategy, and performs depth-wise and layer-wise aggregations that improve accuracy while also reducing model size (the depth-wise cross-correlation is sketched below). Current state-of-the-art on OTB2015, VOT2018, UAV123, LaSOT, and TrackingNet.
Highlights: Deep backbones, state-of-the-art
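Sketch: The depth-wise aggregation refers to a depthwise cross-correlation between template and search features, which keeps the channel dimension and uses far fewer parameters than the up-channel correlation in the original SiamRPN. A minimal version, with shapes assumed by me:

```python
import torch.nn.functional as F

def depthwise_xcorr(search, template):
    """Correlate each template channel with the matching search channel.
    search: (B, C, H, W), template: (B, C, h, w) -> (B, C, H-h+1, W-w+1)."""
    b, c, h, w = template.shape
    x = search.reshape(1, b * c, *search.shape[2:])
    kernel = template.reshape(b * c, 1, h, w)
    out = F.conv2d(x, kernel, groups=b * c)  # one group per (batch, channel) pair
    return out.reshape(b, c, *out.shape[2:])
```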
(CIR/SiamDW): Deeper and Wider Siamese Networks for Real-Time Visual Tracking
Authors: Zhipeng Zhang, Houwen Peng
arXiv Link: https://arxiv.org/abs/1901.01660
Project Link: https://github.com/researchmm/SiamDW
Summary: SiamDW explores deeper and wider network backbones from another angle: careful design of the residual units (accounting for receptive field, stride, and output feature size) to eliminate the negative impact of padding in deep backbones (a sketch of the cropping idea follows).
Highlights: Cropping-Inside-Residual, eliminating the negative impact of padding
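Sketch: The key design is the Cropping-Inside-Residual (CIR) unit: a residual block whose padding-affected border features are cropped away after the addition. A minimal sketch under my own simplifying assumptions (the paper has several CIR variants):

```python
import torch.nn as nn
import torch.nn.functional as F

class CIRUnit(nn.Module):
    """Residual unit that crops the padding-corrupted border after the addition."""
    def __init__(self, channels, crop=1):
        super().__init__()
        self.crop = crop
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        out = F.relu(self.body(x) + x)  # standard residual computation
        c = self.crop
        return out[:, :, c:-c, c:-c]    # remove padding-affected borders
```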
(SiamC-RPN): Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking
Authors: Heng Fan, Haibin Ling
arXiv Link: https://arxiv.org/abs/1812.06148
Project Link: None
Summary: Previously proposed one-stage Siamese-RPN trackers degenerate in the presence of similar distractors and large scale variation. Advantages: 1) each RPN in C-RPN is trained using the outputs of the previous RPN, thus simulating hard negative sampling; 2) feature transfer blocks (FTBs) further improve discriminability; 3) the location and shape of the target are progressively refined across stages, resulting in better localization (see the schematic below).
Highlights: Cascaded RPN, excellent accuracy
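Sketch: Schematically, the cascade filters and refines boxes stage by stage, so each RPN trains on the harder examples that survive the previous one. The interfaces below are hypothetical: each `rpn` returns scores and refined boxes, and each `ftb` is a feature transfer block.

```python
def cascaded_rpn(feats, boxes, rpns, ftbs, keep=0.5):
    """Schematic cascade: later stages see progressively harder, better-localized boxes."""
    scores = None
    for rpn, ftb in zip(rpns, ftbs):
        feats = ftb(feats)                 # transfer/fuse features between stages
        scores, boxes = rpn(feats, boxes)  # classify and regress (refine) boxes
        k = max(1, int(keep * len(boxes)))
        top = scores.topk(k).indices       # drop easy negatives -> hard mining
        boxes, scores = boxes[top], scores[top]
    return boxes, scores
```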
SPM-Tracker: Series-Parallel Matching for Real-Time Visual Object Tracking
Authors: Guangting Wang, Chong Luo, Zhiwei Xiong, Wenjun Zeng
arXiv Link: https://arxiv.org/abs/1904.04452
Project Link: None
Summary: To meet the simultaneous requirements of robustness and discrimination power, SPM-Tracker connects a coarse matching stage and a fine matching stage, taking advantage of both; the result is superior performance that exceeds other real-time trackers by a notable margin.
Highlights: Coarse matching & fine matching
ATOM: Accurate Tracking by Overlap Maximization
Authors: Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, Michael Felsberg
arXiv Link: https://arxiv.org/abs/1811.07628
Project Link: https://github.com/visionml/pytracking
Summary: Target estimation is a complex task requiring high-level knowledge about the object, yet most trackers resort to a simple multi-scale search. ATOM instead estimates the target state by predicting the overlap (IoU) between the target object and a candidate bounding box, refining the box to maximize the predicted IoU (sketched below). In addition, a classification component is trained online to guarantee high discriminative power in the presence of distractors.
Highlights: IoU (overlap) prediction for target estimation
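Sketch: Target estimation can be written as gradient ascent on a learned IoU predictor. A minimal sketch; `iou_net(feat_ref, feat_test, box) -> scalar predicted IoU` is an assumed interface (ATOM's actual IoU network is modulation-based), and boxes are (x, y, w, h) tensors:

```python
import torch

def refine_box(iou_net, feat_ref, feat_test, box, steps=5, lr=1.0):
    """Refine a candidate box by maximizing the predicted IoU w.r.t. its parameters."""
    box = box.clone().requires_grad_(True)
    for _ in range(steps):
        iou = iou_net(feat_ref, feat_test, box)   # predicted overlap for this box
        grad, = torch.autograd.grad(iou, box)
        with torch.no_grad():
            box += lr * grad * box[2:].repeat(2)  # scale steps by box size (w, h, w, h)
    return box.detach()
```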
(GCT): Graph Convolutional Tracking
Authors: Junyu Gao, Tianzhu Zhang, Changsheng Xu
arXiv Link: None
PDF Link: http://openaccess.thecvf.com/content_CVPR_2019/papers/Gao_Graph_Convolutional_Tracking_CVPR_2019_paper.pdf
Project Link: http://nlpr-web.ia.ac.cn/mmc/homepage/jygao/gct_cvpr2019.html
Summary: Spatial-temporal information can provide diverse features to enhance the target representation. GCT incorporates 1) a spatial-temporal GCN to model the structured representation of historical target exemplars, and 2) a context GCN to exploit the context of the current frame and learn adaptive features for target localization (the basic graph-convolution step is sketched below).
Highlights: Graph convolution networks, spatial-temporal information
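Sketch: For reference, the basic graph-convolution step that both GCNs build on - a generic, row-normalized form, not the paper's exact formulation:

```python
import torch

def gcn_layer(X, A, W):
    """X: (N, d) node features, A: (N, N) adjacency with self-loops,
    W: (d, d_out) learnable weights. Returns updated node features (N, d_out)."""
    deg = A.sum(dim=1, keepdim=True)  # node degrees
    A_norm = A / deg.clamp(min=1e-6)  # row-normalized propagation matrix
    return torch.relu(A_norm @ X @ W)
```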
(ASRCF): Visual Tracking via Adaptive Spatially-Regularized Correlation Filters
Authors: Kenan Dai, Dong Wang, Huchuan Lu, Chong Sun, Jianhua Li
arXiv Link: None
Project Link: https://github.com/Daikenan/ASRCF (To be updated)
Summary: ASRCF simultaneously optimizes the filter coefficients and the spatial regularization weight (its objective is sketched below). It applies two correlation filters (CFs) to estimate location and scale separately: 1) a location CF model, which exploits ensembles of shallow and deep features to determine the optimal position accurately, and 2) a scale CF model, which works on multi-scale shallow features to estimate the optimal scale efficiently.
Highlights: Estimate location and scale respectively
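Sketch: Up to details in the paper (e.g. a cropping operator on the filter), my transcription of the joint objective over the filters h_k and the spatial weight w:

```latex
% y: desired Gaussian response; x_k, h_k: k-th feature channel and filter;
% w: spatial regularization weight, optimized jointly with the filters;
% w^r: reference weight that keeps w close to a generic prior.
E(\mathbf{H}, \mathbf{w}) =
    \frac{1}{2}\Big\| \mathbf{y} - \sum_{k=1}^{K} \mathbf{x}_k * \mathbf{h}_k \Big\|_2^2
  + \frac{\lambda_1}{2} \sum_{k=1}^{K} \big\| \mathbf{w} \odot \mathbf{h}_k \big\|_2^2
  + \frac{\lambda_2}{2} \big\| \mathbf{w} - \mathbf{w}^{r} \big\|_2^2
```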
(RPCF): RoI Pooled Correlation Filters for Visual Tracking
Authors: Yuxuan Sun, Chong Sun, Dong Wang, You He, Huchuan Lu
arXiv Link: None
Project Link: None
PDF Link: http://openaccess.thecvf.com/content_CVPR_2019/papers/Sun_ROI_Pooled_Correlation_Filters_for_Visual_Tracking_CVPR_2019_paper.pdf
Summary: RoI-based pooling can be achieved equivalently by enforcing additional constraints on the learned filter weights, which makes it feasible on the virtual circular samples. By incorporating RoI pooling into the correlation filter formulation, RPCF performs favorably against other state-of-the-art trackers.
Highlights: RoI pooling in correlation filters
Multi-Object Tracking
(TBA): Tracking by Animation: Unsupervised Learning of Multi-Object Attentive Trackers
Authors: Zhen He, Jian Li, Daxue Liu, Hangen He, David Barber
arXiv Link: https://arxiv.org/abs/1809.03137
Project Link: https://github.com/zhen-he/tracking-by-animation
Summary: The common tracking-by-detection (TBD) paradigm uses supervised learning and treats detection and tracking separately. TBA is instead a differentiable neural model that first tracks objects from input frames, then animates these objects into reconstructed frames, and learns from the reconstruction error through backpropagation. A Reprioritized Attentive Tracking module is also proposed to improve the robustness of data association.
Highlights: Label-free, end-to-end MOT learning
Eliminating Exposure Bias and Metric Mismatch in Multiple Object Tracking
Authors: Andrii Maksai, Pascal Fua
arXiv Link: https://arxiv.org/abs/1811.10984
Project Link: None
Summary: Many state-of-the-art MOT approaches now use sequence models to reduce identity switches, but their training suffers from exposure bias and a mismatch between the training loss and the evaluation metric. An iterative scheme builds a rich training set, which is then used to learn a scoring function that is an explicit proxy for the target tracking metric.
Highlights: Eliminating loss-evaluation mismatch
Pose Tracking
Multi-Person Articulated Tracking With Spatial and Temporal Embeddings
Authors: Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian
arXiv Link: https://arxiv.org/abs/1903.09214
Project Link: None
Summary: The framework consists of a SpatialNet, which predicts body-part detection heatmaps, Keypoint Embeddings (KE), and Spatial Instance Embeddings (SIE), and a TemporalNet, which predicts Human Embeddings (HE) and Temporal Instance Embeddings (TIE). A differentiable Pose-Guided Grouping (PGG) module makes the whole part-detection-and-grouping pipeline fully end-to-end trainable.
Highlights: Spatial & temporal embeddings, end-to-end learning "detection and grouping" pipeline
(STAF): Efficient Online Multi-Person 2D Pose Tracking With Recurrent Spatio-Temporal Affinity Fields
Authors: Yaadhav Raaj, Haroon Idrees, Gines Hidalgo, Yaser Sheikh
arXiv Link: https://arxiv.org/abs/1811.11975
Project Link: None
Summary: Building on the Part Affinity Field (PAF) representation designed for static images, an architecture is proposed that encodes and predicts Spatio-Temporal Affinity Fields (STAFs) across a video sequence - a novel temporal topology cross-linked across limbs that can consistently handle body motions of a wide range of magnitudes. The network ingests STAF heatmaps from previous frames and estimates those for the current frame.
Highlights: Online, fastest and the most accurate bottom-up approach
RGBD-based
(OTR): Object Tracking by Reconstruction With View-Specific Discriminative Correlation Filters
Authors: Ugur Kart, Alan Lukezic, Matej Kristan, Joni-Kristian Kamarainen, Jiri Matas
arXiv Link: https://arxiv.org/abs/1811.10863
Summary: Performs online 3D target reconstruction to facilitate robust learning of a set of view-specific discriminative correlation filters (DCFs). State-of-the-art on the Princeton RGB-D tracking and STC benchmarks.
Pointcloud-based
I'm not experienced with point clouds, so I couldn't write summaries for the following papers. Their abstracts are given below; check them out on arXiv if you're interested.
VITAMIN-E: VIsual Tracking and MappINg With Extremely Dense Feature Points
Authors: Masashi Yokozuka, Shuji Oishi, Simon Thompson, Atsuhiko Banno
arXiv Link: https://arxiv.org/abs/1904.10324
Project Link: None
Abstract: In this paper, we propose a novel indirect monocular SLAM algorithm called "VITAMIN-E," which is highly accurate and robust as a result of tracking extremely dense feature points. Typical indirect methods have difficulty in reconstructing dense geometry because of their careful feature point selection for accurate matching. Unlike conventional methods, the proposed method processes an enormous number of feature points by tracking the local extrema of curvature informed by dominant flow estimation. Because this may lead to high computational cost during bundle adjustment, we propose a novel optimization technique, the "subspace Gauss--Newton method", that significantly improves the computational efficiency of bundle adjustment by partially updating the variables. We concurrently generate meshes from the reconstructed points and merge them for an entire 3D model. The experimental results on the SLAM benchmark dataset EuRoC demonstrated that the proposed method outperformed state-of-the-art SLAM methods, such as DSO, ORB-SLAM, and LSD-SLAM, both in terms of accuracy and robustness in trajectory estimation. The proposed method simultaneously generated significantly detailed 3D geometry from the dense feature points in real time using only a CPU.
Leveraging Shape Completion for 3D Siamese Tracking
Authors: Silvio Giancola, Jesus Zarzar, and Bernard Ghanem
arXiv Link: https://arxiv.org/abs/1903.01784
Project Link: https://github.com/SilvioGiancola/ShapeCompletion3DTracking
Abstract: Point clouds are challenging to process due to their sparsity, therefore autonomous vehicles rely more on appearance attributes than pure geometric features. However, 3D LIDAR perception can provide crucial information for urban navigation in challenging light or weather conditions. In this paper, we investigate the versatility of Shape Completion for 3D Object Tracking in LIDAR point clouds. We design a Siamese tracker that encodes model and candidate shapes into a compact latent representation. We regularize the encoding by enforcing the latent representation to decode into an object model shape. We observe that 3D object tracking and 3D shape completion complement each other. Learning a more meaningful latent representation shows better discriminatory capabilities, leading to improved tracking performance. We test our method on the KITTI Tracking set using car 3D bounding boxes. Our model reaches a 76.94% Success rate and 81.38% Precision for 3D Object Tracking, with the shape completion regularization leading to an improvement of 3% in both metrics.
Datasets
LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking
Authors: Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, Haibin Ling
arXiv Link: https://arxiv.org/abs/1809.07845
Project Link: https://cis.temple.edu/lasot/
Summary: A high-quality benchmark for Large-scale Single Object Tracking, consisting of 1,400 sequences with more than 3.5M frames.
CityFlow: A City-Scale Benchmark for Multi-Target Multi-Camera Vehicle Tracking and Re-Identification
Authors: Zheng Tang, Milind Naphade, Ming-Yu Liu, Xiaodong Yang, Stan Birchfield, Shuo Wang, Ratnesh Kumar, David Anastasiu, Jenq-Neng Hwang
arXiv Link: https://arxiv.org/abs/1903.09254
Project Link: https://www.aicitychallenge.org/
Summary: The largest-scale dataset in terms of spatial coverage and the number of cameras/videos in an urban environment, consisting of more than 3 hours of synchronized HD videos from 40 cameras across 10 intersections, with the longest distance between two simultaneous cameras being 2.5 km.
MOTS: Multi-Object Tracking and Segmentation
Authors: Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, Bastian Leibe
arXiv Link: https://arxiv.org/abs/1902.03604
Project Link: https://www.vision.rwth-aachen.de/page/mots
Summary: Goes beyond 2D bounding boxes, extending the popular task of multi-object tracking to multi-object tracking and segmentation, in both task definition and evaluation metrics.
Highlights: Extend MOT with segmentation
Argoverse: 3D Tracking and Forecasting With Rich Maps
Authors: Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, James Hays
arXiv Link: None
PDF Link: http://openaccess.thecvf.com/content_CVPR_2019/papers/Chang_Argoverse_3D_Tracking_and_Forecasting_With_Rich_Maps_CVPR_2019_paper.pdf
Project Link: Argoverse.org (Not working?)
Summary: A dataset designed to support autonomous vehicle perception tasks including 3D tracking and motion forecasting.