There has been a recent trend toward applying deep learning methods rather than shallow
methods for automatic identification of insects. Classification strategies built around
deep learning architectures such as YOLO require large amounts of data to learn
successfully and are often augmented with tens of thousands of images or more
to achieve excellent performance. Recent pre-trained
models of deep neural networks have significantly reduced the amount of data required
to create accurate classification algorithms by training on a huge data set
different from that of the target task and using the resulting encoding to transfer
information to the new task. This work shows that models pre-trained
on huge data sets are effective as image encoders for classifying the sex of
spotted wing drosophila (SWD). A data set of 676 SWD microscope images is created
to evaluate classification models for use in automating the sterile insect technique
(SIT), which requires large numbers of male SWD to be identified and separated.
Binary classification models trained on top of image encodings from new models based on
vision transformers [3] pre-trained on over 400 million images with CLIP [2] are able
to achieve accuracy as high as 96.7% when trained with logistic regression (LogReg) and
similar classifiers on augmented data from the SWD image set. Other models pre-trained
on the ImageNet data set of 14 million images also perform well, approaching 92% with
VGG models and 90% with the MobileNetV2 model. Image segmentation of the data set is then
investigated as a source of corroboration for the identification of the morphological features
responsible for classification, and an out-of-distribution data set is collected to evaluate
classification and segmentation results on more diverse and difficult examples. While
identification of features specific to SWD remains robust, classification accuracy is not
guaranteed on data which differs substantially from the factory or laboratory setting on
which the models are trained, and additional data may be needed for training on use cases
outside of SIT, such as applications on the farm or automated identification in insect
traps. This emphasizes a point which is not elaborated on for many insect detection
models in the literature: such models are unlikely to be robust when the data is
significantly out-of-distribution (OOD), or in situations which may not be adequately
covered without specialized augmentation methods or additional data. Nonetheless, results
indicate that pre-trained models have advanced to the point where they can play a central
role in protecting the food supply from the potentially billions of dollars of damage
caused every year by pests such as SWD.
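
As an illustration of the classification setup described above, the following minimal sketch fits a logistic regression classifier on fixed-length feature vectors, which play the role that encodings from a frozen pre-trained model would play. It is not the implementation used in this work: the synthetic Gaussian vectors, the function names, and the hyperparameters (learning rate, number of epochs, embedding dimension) are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logreg(features, labels, lr=0.1, epochs=200):
    """Fit binary logistic regression by batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Prediction error for this example (probability minus label).
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    # Class 1 when the linear score is non-negative (probability >= 0.5).
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else 0


def accuracy(weights, bias, features, labels):
    correct = sum(predict(weights, bias, x) == y for x, y in zip(features, labels))
    return correct / len(labels)


def make_synthetic_embeddings(n, dim, rng):
    """Hypothetical stand-in for encoder output: two Gaussian clusters.

    ASSUMPTION: real encodings would come from a pre-trained image encoder;
    these vectors are synthetic and only exercise the classifier.
    """
    features, labels = [], []
    for i in range(n):
        y = i % 2
        center = 1.0 if y == 1 else -1.0
        features.append([rng.gauss(center, 1.0) for _ in range(dim)])
        labels.append(y)
    return features, labels


rng = random.Random(0)  # fixed seed so the sketch is reproducible
X, y = make_synthetic_embeddings(400, 16, rng)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

weights, bias = train_logreg(X_train, y_train)
test_accuracy = accuracy(weights, bias, X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

In a real pipeline, `make_synthetic_embeddings` would be replaced by encodings computed once from the frozen encoder, so only the small linear classifier is trained on the labeled images. This is the reason a data set of a few hundred images can suffice.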