We present a neural dynamic model that perceptually grounds nested noun phrases, i.e., noun phrases that contain further (possibly also nested) noun phrases as parts. The model receives input from the visual array together with a representation of a noun phrase from language processing, and it organizes a search for the denoted object in the visual scene. It is a neural dynamic architecture of interacting neural populations that has clear interfaces with perceptual processes. The model addresses a set of theoretical challenges, including how to keep a nested structure in short-term memory in a way that solves both the problem of 2 and the massiveness of the binding problem emphasized by Jackendoff (2002). The model organizes a search for the objects that are referenced in that structure. We motivate the model, present simulation results, and discuss how it differs from related models.