Microsoft introduced Kosmos-1, a multimodal language model, at the beginning of March 2023, a fact that was allegedly underreported in the United States.
According to the German news website Heise.de, “…the researchers put the pre-trained model to a variety of tests, with good results in categorizing photos, answering questions about image content, automatically labeling images, optical text recognition, and voice production tasks….”
“Visual reasoning, i.e., deriving inferences from visuals without using words as an intermediary, appears to be crucial in this case.”