When operating in unstructured and semi-structured environments such as warehouses, homes, and retail centers, robots are frequently required to interactively search for and retrieve specific objects from cluttered bins, shelves, or tables, where the target may be partially or fully hidden behind other objects. The goal of this task, which we define as mechanical search, is to retrieve a target object in as few actions as possible. Robustly perceiving and manipulating objects is challenging in these scenarios due to sensor noise, occlusions, and unknown object properties. Because of these perception and manipulation challenges, learning end-to-end mechanical search policies from data is difficult. Instead, we decompose mechanical search policies into three modules: a perception module that creates an intermediate representation from the input observation, a set of low-level manipulation primitives, and a high-level action selection policy that iteratively chooses which low-level primitives to execute based on the output of the perception module. We explore progress on manipulation primitives such as pushing and grasping, on segmentation of scenes containing unknown objects, and on occupancy distribution prediction for inferring likely locations of the target object. Additionally, we demonstrate that simulated depth images or point clouds enable rapid generation of large-scale training datasets for perception networks while still allowing those networks to generalize to real-world objects and scenes. We show that integrating these components yields an efficient mechanical search policy that improves the success rate by 15% and reduces the number of actions needed to extract the target object compared to baseline policies, in both bin and shelf environments, across simulated and physical trials.
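To make the three-module decomposition concrete, the sketch below outlines one possible structure for the perceive/select/execute loop. It is a minimal illustration only: the class names, the toy perception output, the `push`/`grasp` primitives, and the termination criterion are all hypothetical placeholders, not the implementation described in this work.

```python
"""Minimal sketch of a modular mechanical search loop (illustrative only).

All class names, primitives, and the toy perception/selection logic are
hypothetical placeholders, not the authors' actual system.
"""
import random
from dataclasses import dataclass


@dataclass
class SceneState:
    """Toy intermediate representation produced by the perception module."""
    object_masks: list        # hypothetical segmentation masks of unknown objects
    target_visible: bool      # whether the target object is currently exposed
    occupancy_scores: dict    # hypothetical per-region target-occupancy likelihoods


class PerceptionModule:
    """Stands in for segmentation and occupancy-distribution networks."""

    def perceive(self, depth_image) -> SceneState:
        # Placeholder: a real module would segment the depth image and
        # predict where the occluded target is most likely to be.
        return SceneState(
            object_masks=[f"mask_{i}" for i in range(random.randint(1, 5))],
            target_visible=random.random() < 0.3,
            occupancy_scores={"left": 0.5, "center": 0.3, "right": 0.2},
        )


# Hypothetical low-level manipulation primitives.
def push(region: str) -> str:
    return f"push({region})"


def grasp(mask: str) -> str:
    return f"grasp({mask})"


class ActionSelector:
    """High-level policy: choose a primitive given the perceived state."""

    def select(self, state: SceneState) -> str:
        if state.target_visible:
            # Grasp the target directly once it is exposed.
            return grasp("target")
        # Otherwise, push the region most likely to hide the target.
        region = max(state.occupancy_scores, key=state.occupancy_scores.get)
        return push(region)


def mechanical_search(depth_stream, max_actions: int = 10) -> list:
    """Iteratively perceive, select, and execute primitives until the
    target is grasped or the action budget is exhausted."""
    perception, policy = PerceptionModule(), ActionSelector()
    executed = []
    for depth_image in depth_stream:
        state = perception.perceive(depth_image)
        action = policy.select(state)
        executed.append(action)  # a real system would execute this on the robot
        if action.startswith("grasp") or len(executed) >= max_actions:
            break
    return executed


if __name__ == "__main__":
    # Fake depth "stream" standing in for camera observations.
    print(mechanical_search(depth_stream=(None for _ in range(10))))
```

The design point the sketch is meant to convey is that the perception module and the action selector communicate only through the intermediate scene representation, so either module can be retrained or replaced (e.g., new primitives or a different occupancy predictor) without learning the whole policy end to end.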