Locating and labeling objects in images remains a central problem in computer vision, and the challenge of doing so efficiently has taken on increasing importance with the deployment of autonomous vehicles, edge recognition applications, and ever-growing foundation models. We consider a variation of object localization called localized image retrieval, which combines the problems of image retrieval and query localization by ranking the precise sub-regions of a set of scenes that match a user query.
In this work, we will examine representation learning for localized image retrieval, specifically through the lens of the person search problem.
We will present novel work on categorizing person search models and improving their efficiency during training and inference: introducing new modular approaches, comparing query-centric and object-centric methods, and developing methods for weakly-supervised and self-supervised pre-training.
Finally, we will discuss how identity search, prompt localization, object detection, and tracking can be unified under the single problem of localized image retrieval, and we will consider connections to generalist models.