Consider an uncertain situation in which an artificial intelligence (AI) system is called upon to determine a human action or activity in an image or scene. The AI system has not been previously trained to recognize any human action or activity, and it has no prior information on the pose, parts, or spatial layout of the objects in an image. In such a situation, what is the AI system supposed to do? Its options are limited: it must determine the action or activity with the aid of the most probable inanimate object (other than the human actor) that it can detect in the image. To infer the action or activity in a zero-shot manner, the AI system needs to formulate two hypotheses: first, that the most probable inanimate object detected in the image is one that is involved in the action or activity, and second, that the most likely action or activity associated with this object in the real world is the one actually occurring in the image. To what extent are these hypotheses valid? We propose that correct detection of the most probable object, combined with natural language word embeddings obtained via training on a general text corpus such as Wikipedia, could enable the AI system to determine the underlying human action or activity in an image with reasonable classification accuracy. We conducted studies on the HICO dataset, a challenging dataset containing many rare human action/activity categories. Our experimental results show that if the AI system can reliably detect the most probable inanimate object in the image and then infer the corresponding verb in a zero-shot manner using language models trained on general text corpora, then it has a reasonable chance of correctly guessing the underlying action/activity in the image.
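
The two hypotheses can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: the three-dimensional vectors in `TOY_EMBEDDINGS` are invented purely for illustration (a real system would load embeddings trained on a general text corpus), the detector output is assumed to be a simple label-to-score mapping, and the function names `most_probable_object`, `infer_verb`, and `zero_shot_action` are hypothetical. Cosine similarity is used here as one plausible way to rank candidate verbs against the detected object.

```python
import math

# Toy 3-d "embeddings", made up for illustration only. A real system would
# load word vectors trained on a general text corpus such as Wikipedia.
TOY_EMBEDDINGS = {
    "bicycle": [0.9, 0.1, 0.0],
    "pizza":   [0.1, 0.95, 0.1],
    "frisbee": [0.2, 0.0, 0.9],
    "ride":    [0.8, 0.2, 0.1],
    "eat":     [0.0, 0.9, 0.2],
    "throw":   [0.1, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_probable_object(detections):
    """Hypothesis 1: the highest-scoring non-human detection is the object
    involved in the action. `detections` maps label -> confidence score."""
    candidates = {k: s for k, s in detections.items() if k != "person"}
    return max(candidates, key=candidates.get)


def infer_verb(obj, verbs, embeddings):
    """Hypothesis 2: the verb closest to the object in embedding space is
    the action taking place."""
    return max(verbs, key=lambda v: cosine(embeddings[obj], embeddings[v]))


def zero_shot_action(detections, verbs, embeddings=TOY_EMBEDDINGS):
    """Return (verb, object) guessed without any action-specific training."""
    obj = most_probable_object(detections)
    return infer_verb(obj, verbs, embeddings), obj


if __name__ == "__main__":
    detections = {"person": 0.99, "bicycle": 0.87, "pizza": 0.12}
    print(zero_shot_action(detections, ["ride", "eat", "throw"]))
```

With these toy vectors, the sketch selects the bicycle as the most probable inanimate object and ranks "ride" above the other candidate verbs. Both steps can fail independently, which is why the validity of each hypothesis has to be assessed separately.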