Vision-language modeling is a crucial subfield of AI that focuses on jointly learning and representing image and text data, often using one modality to enhance understanding of the other. In cognitive science, humans use their visual system to grasp deep aspects of a concept, such as shape and size, while language helps them understand its semantics. Similarly, a machine can gain a better understanding of the world by utilizing multiple modalities, providing deeper insights compared to learning from a single modality. VL modeling is widely explored in the general domain, thanks to the vast image-text data available online and extensive annotated VL datasets. There are several strong VL models in the general domain, such as CLIP, which perform well on various tasks. However, in low-resource domains with limited data or dense knowledge areas, like the medical field, the data shortage hinders the development of robust multimodal models with reliable performance, especially where model reliability is critical. My research goal is to study the underlying capability of Vision-Language and Language Models and to develop innovative approaches to enhance their usage for low-resource domains such as the medical domain.