> Image inputs are still a research preview and not publicly available.
Will input images also be tokenized? Multi-modal input is still an area of research, but perhaps an image could be converted into a text description before being inserted into the input stream.
My understanding is that image embeddings are a rather abstract representation of the image. What if the image itself contains text, such as street signs?