Recent advances in multiple object tracking (MOT) rely primarily on visual appearance features to reconnect tracks lost due to occlusion. However, appearance features are unreliable for discriminating between objects that are visually similar or identical, such as animals, people in uniform, or mass-produced items. We propose a new model that relies on spatio-temporal motion features rather than appearance features for such videos. Furthermore, training an MOT method typically requires expensive hand labeling of bounding boxes or segmentation masks with ground-truth tracks. By contrast, our videos are labeled only with fixed bounding boxes (effectively only positional information). We train our model in a semi-supervised manner using iterative pseudo-labeling (IPL), a technique common in natural language processing but rarely applied to computer vision tasks. We show that appearance features are insufficient for reconnecting tracklets in videos of bee foraging, and that our motion-based IPL method outperforms appearance-based methods.
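The general iterative pseudo-labeling loop mentioned above can be sketched as follows. This is a minimal toy illustration, not the paper's method: the 1-D "model" (a decision boundary between class means), the confidence measure, the threshold, and the data are all illustrative assumptions. The core idea it demonstrates is the IPL cycle: train on labeled data, label the unlabeled pool, adopt only high-confidence pseudo-labels, and retrain.

```python
# Minimal sketch of iterative pseudo-labeling (IPL) on toy 1-D data.
# All names, the toy model, and the data are illustrative assumptions.

def train(labeled):
    # Toy "model": decision boundary halfway between the two class means.
    xs0 = [x for x, y in labeled if y == 0]
    xs1 = [x for x, y in labeled if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def predict(boundary, x):
    label = int(x > boundary)
    confidence = abs(x - boundary)  # distance from boundary as confidence
    return label, confidence

def iterative_pseudo_label(labeled, unlabeled, threshold=1.0, rounds=5):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        boundary = train(labeled)
        confident, remaining = [], []
        for x in unlabeled:
            label, conf = predict(boundary, x)
            (confident if conf >= threshold else remaining).append((x, label))
        if not confident:
            break  # no confident pseudo-labels left; stop iterating
        labeled += confident            # adopt high-confidence pseudo-labels
        unlabeled = [x for x, _ in remaining]
    return train(labeled)

# Small labeled seed set plus an unlabeled pool.
labeled = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
unlabeled = [0.5, 1.5, 8.5, 9.5, 5.2]
boundary = iterative_pseudo_label(labeled, unlabeled)
```

In an MOT setting the "model" would instead score candidate tracklet reconnections from motion features, and the confident pseudo-labels would be high-scoring track associations folded back into the training set.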