On-line visual tracking of a specified target in motion throughout frames of video clips faces challenges in robust identification of the target in the current frame based on the past frames. Three approaches for tracking the target image patch are described and compared. These approaches utilize particle filtering and principal component analysis (PCA) to identify the most likely location of the target in the current frame and a low dimensional subspace representation of the patches of images to be kept as the templates in the dictionary for the identification. By using a combination of methods and compare the result of each, a new model based is proposed. The goal is to achieve a more robust and accurate tracking of a target throughout the video and continue updating the identification templates to adapt the target changes, such as apparences in lighting, angle, scale and occlusions. The challenges in tracking are to introduction of the "right" templates into the identification templates in the dictionary and identify the most accurate particle image patch while tracking the target with the right tracking patch scaling.
The first approach considered and on which the structure of the visual tracker is based is the "Incremental Learning for Robust Visual Tracking" by D. Ross et al., which is a computationally fast tracker that utilizes a method of low dimensional subspace for the identification template dictionary and incremental PCA for its tracking. The tracker has a simple rule in accepting the patches of images to be in the identification template dictionary after the image patch has gone through a singular value decomposition (SVD), where it eliminates singular values are smaller than $10^{-6}$ of the sum of squared sinuglar values and the corresponding bases are also eliminated. This elimination scheme has very limited robustness in tracking, therefore, more selective processes in accepting identification templates in the dictionary are explored and introduced on top of the existing method in comparison and to address the challenges in on-line video tracking.
The second approach is the "Least Soft-Threshold Squares Tracking" proposed by D. Wang et al. solves the least soft-threshold squares distance problem to identify the distances of the particles to the templates in the dictionary, which greatly improves the tracking accuracy. This method is also computationally cheap in comparison to the first approach, and its accuracy is also better than the first approach, but it would sometimes fail to track in some applications. Finally, the third approach reviewed is the "Robust Visual Tracking and Vehicle Classification via Sparse Representation" by X. Mei et al. is to weight each particles when selecting the most likely target patch so the best patch has a highest weighted probability which ensures it being selected and introduced to the template dictionary. This approach performs well in comparison to the first and the second approaches in tracking accuracy and robustness, but this approach is extremely computationally expensive.
Three new components are proposed in an effort to mitigate some of the limitations that the three approaches exhibit. One such component is to simply reject the image patches that exhibit too great of difference to the current template dictionary, which resulted in improved tracking robustness. This method is computationally cheap and easy to implement. Another component introduced is a second set of dictionary that is composed of admitted image patches, which is used for tracking when the image patches appears to be too dissimilar to the dictionary with low dimensional representation. It is expected that with more well defined and stronger features, it forces the tracking to identify the target. Finally, the third component introduced is the to prevent shrinkage of the target boundary box by weighting the particles drawn with the ratio of area change so that more weight is placed on particles with less arial change. This increases the likelihood of recovering the target again if tracking loses the target, and instead of shrinking the boundary box, the tracking is biased to staying with the image patch of the same size. The resulting performance of the proposed tracking scheme has not been noticeably improved, part of the reason is because the metrics available to identify a noisy image patch from the good image patches are not always indicative of the noisy-good image patch divide.