
Respectfully, if you don't know what a vector is, you probably don't need a vector DB.

I wasn't looking for one ;-) I was looking for a recommendation engine. More generally, I'm often looking for ways to use ML and AI to improve various features and workflows.

Which I guess is my point: I don't know who Pinecone's target market is, but from following this thread it seems like all the folks who know how to do what Pinecone does already have alternatives that suit them better. If they are targeting folks like me, they're not doing it well.

Pinecone's examples[1] (hat tip to Jurrasic in this thread - I've seen these) all show potential use cases that I might want to leverage, but when you dive into them (for example the Movie Recommender[2] - my use case) I end up with this:

The user_model and movie_model are trained using Tensorflow Keras. The user_model transforms a given user_id into a 32-dimensional embedding in the same vector space as the movies, representing the user’s movie preference. The movie recommendations are then fetched based on proximity to the user’s location in the multi-dimensional space.

It took me another 5 minutes of googling to parse that sentence. And while I could easily get the examples to run, I was still running back and forth to Google to figure out what they were doing - again, the documentation is poor here. I'm not a Python dev, but I could follow it; I still had to google tqdm just to figure out that it's a progress bar library.
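
For what it's worth, what that sentence boils down to (as far as I can tell - a rough sketch with made-up names, not Pinecone's actual code) is something like:

    import numpy as np

    # Pretend these came out of the trained Keras models:
    #   user_vec   = user_model(user_id)        -> shape (32,)
    #   movie_vecs = movie_model(all_movie_ids) -> shape (n_movies, 32)
    rng = np.random.default_rng(0)
    user_vec = rng.random(32)
    movie_vecs = rng.random((10_000, 32))

    # "Proximity in the multi-dimensional space": score every movie against
    # the user vector (a dot product here) and take the closest ones.
    scores = movie_vecs @ user_vec
    top_10 = np.argsort(-scores)[:10]  # indices of the 10 best-scoring movies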

Also, and this is not unique to Pinecone, I've found that while some things are fairly well documented along the lines of "here's how to build a Movie Recommender based on these datasets", there's generally very little in this space on how to build a model using your own datasets, i.e. how to take an example like this and do it with your own data.

[1] https://docs.pinecone.io/docs/examples

[2] https://docs.pinecone.io/docs/movie-recommender



Don't worry, you're just catching up in one hour on 10 years of NLP research; there has to be some conceptual gap to cross. Once you clarify the "vector" and "computing similarity" concepts, it's pretty nifty. Say you have a text:

    emb = model(text)  # "model" here is any text-embedding model
Now you've got the embedding. What can you do with it? You can calculate how similar it is to other texts:

    emb1 = model(text1)
    emb2 = model(text2)
    similarity = sum([a * b for a, b in zip(emb1, emb2)])  # a dot product
Just a multiply and add - it's trivial! So if you do that against a million texts, you've got a search engine. Vector DBs automate this for you, but there are free libraries that are just as good, and free models to embed text with (OpenAI also has some great embeddings). Up to around 100,000 vectors, plain np.dot is the fastest way to compute similarities, and it gives exact rather than approximate results.
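
A minimal end-to-end sketch of that (assuming the sentence-transformers library and the free "all-MiniLM-L6-v2" model; any embedding model works the same way):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # small, free, runs fine on CPU

    docs = ["how to bake bread", "intro to rust lifetimes", "sourdough starter tips"]
    doc_embs = model.encode(docs, normalize_embeddings=True)   # shape (n_docs, dim)

    query_emb = model.encode(["bread recipes"], normalize_embeddings=True)[0]

    # With normalized vectors, dot product == cosine similarity.
    scores = np.dot(doc_embs, query_emb)
    print([docs[i] for i in np.argsort(-scores)])  # most similar first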

The great thing about embedding text is the simplicity of the API and of the similarity operation; it's dead simple to use. You can do clustering, classification, nearest-neighbour search / ranking, recommendation, or any kind of semantic operation between two texts that can be described as a score. If you cache your vectors you can search very quickly with np.dot or other methods, in a few ms. Today you can also embed images into the same vector space and do image classification by taking the text label with the max dot product.
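
The image part is the same trick. Here's a sketch using a CLIP model through sentence-transformers (the model name, labels and image path are just examples):

    import numpy as np
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    clip = SentenceTransformer("clip-ViT-B-32")  # embeds images and text into one space

    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
    label_embs = clip.encode(labels, normalize_embeddings=True)

    img_emb = clip.encode(Image.open("pet.jpg"), normalize_embeddings=True)  # placeholder path

    # Classification = the label with the max dot product against the image.
    print(labels[int(np.argmax(np.dot(label_embs, img_emb)))])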

You can also train a very small model on top of the embeddings to classify the input into your desired classes, if you can collect a dataset; embeddings make excellent features for text classification. You can think of this embedding method as a way to slice and dice in semantic space the way you do with strings in character space. All fast and local, without GPUs.
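
For the classifier-on-top idea, a logistic regression over the embeddings is usually plenty (a sketch, again assuming sentence-transformers; the tiny dataset is made up):

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    model = SentenceTransformer("all-MiniLM-L6-v2")

    texts = ["refund my order", "love this product", "package never arrived", "works great"]
    labels = ["complaint", "praise", "complaint", "praise"]

    clf = LogisticRegression(max_iter=1000)
    clf.fit(model.encode(texts), labels)  # embeddings as features, no GPU needed

    print(clf.predict(model.encode(["my parcel is lost"])))  # likely ['complaint']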



