Advances in deep neural networks have led to significant progress in computer vision and natural language processing. Trained on real-world stimuli, these networks develop high-level feature representations of their inputs. It is hypothesized that representations learned from different inputs should converge on similar conceptual systems, because they reflect different perspectives on the same underlying reality. This paper examines the degree to which different conceptual systems can be aligned in an unsupervised manner, using feature-based representations from deep neural networks. Our investigation centers on the alignment between image and word representations produced by diverse neural networks, with an emphasis on those trained with self-supervised learning methods. To probe whether comparable alignment patterns arise in human learning, we then extend this examination to models trained on developmental headcam data from children. We find that alignment is more pronounced in models trained with self-supervised learning than in those trained with supervised learning, with self-supervised models effectively uncovering higher-level structural connections among categories. However, this alignment was notably absent in models trained on limited developmental headcam data, suggesting that more data, more inductive biases, or more supervision are needed to establish alignment from realistic input.
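One way to make the notion of unsupervised alignment concrete is to compare the pairwise similarity structure of two systems under a candidate concept-to-concept mapping. The sketch below is illustrative only and is not necessarily the measure used in this work: it assumes cosine similarity and a Pearson correlation over off-diagonal entries, and it uses synthetic features in place of real image and word embeddings. All names (`cosine_similarity_matrix`, `alignment_score`, `image_features`, `word_features`) are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between rows (one row per concept)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / norms
    return unit @ unit.T


def alignment_score(sim_a, sim_b, mapping):
    """Pearson correlation of off-diagonal similarities under a mapping.

    mapping[i] is the index in system B assigned to concept i of system A.
    """
    permuted_b = sim_b[np.ix_(mapping, mapping)]
    upper = np.triu_indices(sim_a.shape[0], k=1)
    return np.corrcoef(sim_a[upper], permuted_b[upper])[0, 1]


def random_rotation(dim, rng):
    """Random orthogonal matrix obtained from a QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q


# Synthetic stand-ins (assumption): two "modalities" that are noisy
# rotations of the same latent concept structure.
rng = np.random.default_rng(0)
n_concepts, dim = 20, 16
latent = rng.normal(size=(n_concepts, dim))
image_features = latent @ random_rotation(dim, rng) + 0.01 * rng.normal(size=(n_concepts, dim))
word_features = latent @ random_rotation(dim, rng) + 0.01 * rng.normal(size=(n_concepts, dim))

sim_image = cosine_similarity_matrix(image_features)
sim_word = cosine_similarity_matrix(word_features)

# Score the correct mapping against randomly shuffled mappings.
identity = np.arange(n_concepts)
true_score = alignment_score(sim_image, sim_word, identity)
random_scores = [
    alignment_score(sim_image, sim_word, rng.permutation(n_concepts))
    for _ in range(200)
]

print(f"correct mapping: {true_score:.3f}")
print(f"best random mapping: {max(random_scores):.3f}")
```

Under these assumptions, two systems count as alignable without supervision when the correct mapping scores higher than shuffled mappings, so that the correspondence could in principle be recovered from similarity structure alone.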