Active Learning in Multi-Camera Networks, With Applications in Person Re-Identification
With the proliferation of cheap visual sensors, camera networks are everywhere. Their ubiquity opens the door to cutting-edge research on processing and analyzing the huge volume of video that such large-scale networks generate. Re-identifying persons as they move in and out of camera views is an important task in this setting. It has remained a challenge for the community for a variety of reasons, such as changes in scale, illumination, and resolution between cameras. All of these transform the features observed in one camera relative to another, which makes re-identification difficult. The first question addressed in this work is: can we model the way features are transformed between cameras and use that model to re-identify persons across cameras with non-overlapping views? The similarity between feature histograms and time series data motivated us to apply the principle of Dynamic Time Warping to study the transformation of features by warping the feature space. After capturing the feature warps, which describe the transformation of features, we modeled the variability of the warp functions as a function space of these feature warps. This function space allowed us not only to model feasible transformations between pairs of instances of the same target, but also to separate them from infeasible transformations between instances of different targets. A supervised training phase is employed to learn a discriminating surface between these two classes in the function space.
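As a rough illustration of the warping idea described above, the following sketch applies textbook Dynamic Time Warping to two one-dimensional feature histograms, treating the bin index as the "time" axis. It is a minimal sketch under stated assumptions, not the formulation used in this work: the function name `dtw_warp`, the absolute-difference bin cost, and the toy histograms are all hypothetical, and the function space of warps and the learned discriminating surface are not shown.

```python
import numpy as np


def dtw_warp(hist_a, hist_b):
    """Textbook DTW between two 1-D histograms (hypothetical helper).

    Returns the accumulated alignment cost and the warp path as a list of
    (bin_in_a, bin_in_b) index pairs.
    """
    a = np.asarray(hist_a, dtype=float)
    b = np.asarray(hist_b, dtype=float)
    n, m = len(a), len(b)

    # cost[i, j] = best accumulated cost aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # assumed local bin-to-bin cost
            cost[i, j] = d + min(cost[i - 1, j],      # advance in a only
                                 cost[i, j - 1],      # advance in b only
                                 cost[i - 1, j - 1])  # advance in both

    # Backtrack from the end to recover the warp path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1, j - 1], i - 1, j - 1),
                 (cost[i - 1, j], i - 1, j),
                 (cost[i, j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return float(cost[n, m]), path


# Toy example: the second histogram is the first shifted by one bin.
cost_shifted, path_shifted = dtw_warp([0, 1, 2, 1, 0, 0], [0, 0, 1, 2, 1, 0])
```

In this toy case the one-bin shift is absorbed entirely by the warp path, so the path itself, rather than the residual cost, carries the information about how the feature was transformed between the two views.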
However, it is unlikely that supervised methods alone will be enough to deal with the volume and variety of data in such scenarios. Their performance depends on tedious labeling of training data. Supervised person re-identification strategies are also static, in the sense that they cannot adapt to the changing dynamics of continuously streaming data. Active participation of a human expert is necessary in such scenarios. Human labor is reduced if the expert is consulted only for the most difficult cases and is not asked to do the same job repeatedly. The question we addressed is therefore the following: is it possible to identify a manageable set of informative but non-redundant samples for labeling by a human expert? Moreover, is it possible to select these examples progressively in an online setting where all the training data may not be available a priori? The second work explored a convex-optimization-based iterative framework that progressively and judiciously chooses a sparse but informative set of samples for labeling, with minimal overlap with previously labeled images. The third work addresses the same basic question from a different perspective, reducing human effort in two ways: by changing the questions asked of the human annotator from multiple choice to binary yes/no, and by incorporating domain knowledge of how a human expert discriminates between persons. These two objectives are met by a ‘value of information’ based active learning strategy and mid-level semantic ‘attributes’, respectively. Through extensive experimentation in different scenarios, we validate our approach and demonstrate that our framework achieves superior performance with significantly less manual labor.