You can probably fix this insufficient training by going for multimodal training...

You can probably fix this insufficient training by going for multimodal training. Just like it would take excessively long to teach a person the concept of a color that they can't see, an AI would need infeasible amount of text data to learn about, say music. But give it direct training with music data and I think the model will quickly grasp a context of it.