This is what we’re using. We already sync database content to a Typesense DB for regular search, so it wasn’t much more work to add embeddings, and now we can do semantic search.
I was using Pinecone before installing pgvector in Postgres. Pinecone works and all, but having the vectors in Postgres resulted in an explosion of use for us. Full relational queries with WHERE clauses, ORDER BY, etc. AND vector embeddings is wicked.
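For anyone curious what that looks like in practice, here's a rough sketch (assuming a hypothetical items table with an embedding vector column, psycopg2, and pgvector's cosine-distance operator <=>; embed() stands in for whatever model produces your vectors):

  import psycopg2

  # Hypothetical schema: items(id, title, category, created_at, embedding vector(1536))
  conn = psycopg2.connect("dbname=mydb")
  cur = conn.cursor()

  query_embedding = embed("running shoes")  # returns a list of floats

  cur.execute(
      """
      SELECT id, title
      FROM items
      WHERE category = %s
        AND created_at > now() - interval '30 days'
      ORDER BY embedding <=> %s::vector   -- cosine distance from pgvector
      LIMIT 10
      """,
      ("shoes", str(query_embedding)),
  )
  print(cur.fetchall())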
Why do you use pgvector instead of pgANN? My understanding is that pgANN is built on FAISS. When I compared pgvector with FAISS, pgvector was 3-5x slower.
There is certainly a wide variety of problems today for which pgvector is unsuitable due to performance limitations... but fear not! This is an area that is getting significant focus right now.
This hits home; it is a big ask to keep data in sync with yet another store. We already balance MS SQL and Algolia and all the plumbing required to catch updates, deletes, etc.; adding another store feels like a bridge too far. Hopefully MS will get on this train at some point and catch up to Postgres.
Speaking of the repo, they have a number of features they want to add if anyone is interested in contributing; there's lots of room for advancement. Many of these features already have active branches:
https://github.com/pgvector/pgvector/issues/27
I honestly hope they use it to improve their documentation. I consider myself a pretty adept developer, but one without much background in AI. I was looking for a solution for building out a recommendation engine and ended up at Pinecone.
Maybe I'm not the target audience, but after spending some time poking around I honestly couldn't even figure out how to use it.
Even a simple Google search for "What is a Vector Database" ends up with this page:
> Pinecone is a vector database that makes it easy for developers to add vector-search features to their applications
Um okay... what's "vector-search"? For that matter, what the eff is a "vector" to begin with? Finally, about a third of the way down the page, we start defining what a vector is...
Maybe I'm not their target audience but I ended up poking around for about an hour or two before just throwing up my hands and thinking it wasn't for me.
Ended up just sticking with Algolia, since we had them in place for Search anyway...
Respectfully if you don’t know what a vector is, you probably don’t need a vector DB.
When they say “vector-search” they mean semantic search. I.e. “which document is the most semantically similar to the query text”.
So how do we establish semantic similarity?
In a database like Elasticsearch, you store text and the DB indexes the text so you can search.
In a vector DB you don’t just store the raw text, you store a vectorized version of the text.
A vector can be thought of as an array of numbers. To get a vector representation we need some way to take a string and map it to an array while also capturing the notion of semantics.
This is hard, but machine learning models save the day! The first popular model used for this was called “word2vec” while a more modern model is BERT.
These take an input like “fish” and output a vector like [ 3.12 … 4.092 ] (with over a thousand more elements).
So let’s say we have a sentence that we vectorized and we want to compare some user input to see how similar it is to that sentence. How?
If we call our sentence's vector A and the input's vector B, we can compute a number between -1 and 1 that tells us how similar they are (higher means more similar).
This is called cosine similarity and is computed by taking the dot product of the two vectors and dividing by the product of their magnitudes.
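In numpy-ish Python, that's just the following (a toy sketch with made-up 3-element vectors; real embeddings have hundreds or thousands of elements):

  import numpy as np

  def cosine_similarity(a, b):
      # dot product divided by the product of the magnitudes
      return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

  A = np.array([3.12, 0.50, -1.20])  # vector for our sentence
  B = np.array([2.98, 0.40, -1.00])  # vector for the user input
  print(cosine_similarity(A, B))     # close to 1.0 => very similar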
When you load a bunch of vectors into a vector DB, the principal operation you will perform is "give me the top K documents that are similar to the input". The database's indexing process builds a (typically approximate) nearest-neighbor index over all vectors in the DB so that these k-nearest-neighbor queries are fast at query time.
Without the indexing process there is no real difference between a vector DB and a key-value store.
> Respectfully if you don’t know what a vector is, you probably don’t need a vector DB.
I wasn't looking for one ;-) I was looking for a recommendation engine; more generally, I'm usually looking for ways to use ML and AI to improve various features and workflows.
Which I guess is my point: I don't know who Pinecone's target market is, but from following this thread it seems like all the folks who know how to do what they do have alternatives that suit them better. If they are targeting folks like me, they're not doing it well.
Pinecone's examples[1] (hat tip to Jurrasic in this thread - I've seen these) all show potential use cases that I might want to leverage, but when you dive into them (for example the Movie Recommender[2] - my use case) I end up with this:
> The user_model and movie_model are trained using Tensorflow Keras. The user_model transforms a given user_id into a 32-dimensional embedding in the same vector space as the movies, representing the user’s movie preference. The movie recommendations are then fetched based on proximity to the user’s location in the multi-dimensional space.
It took me another 5 minutes of googling to parse that sentence. And while I could easily get the examples to run, I was still running back and forth to Google to figure out what they were doing - again, the documentation is poor here. I'm not a Python dev, but I could follow it; still, I had to google tqdm just to figure out it was a progress bar library.
Also (and this is not unique to Pinecone), I've found that while some things are fairly well documented along the lines of "Here's how to build a movie recommender based on these datasets", there's frequently very little in this space on how to build a model using your own datasets, i.e. how to take this example and do it with your own data.
Don't worry, you're just catching up in one hour on 10 years of NLP research. There has to be some conceptual gap to cross. After you clarify the "vector" and "computing similarity" concepts, it's pretty nifty. You have a text:

  emb = model(text)
Now you've got the embedding. What can you do with it? You can calculate how similar it is to other texts.

  emb1 = model(text1)
  emb2 = model(text2)
  # dot product; equals cosine similarity when the embeddings are normalized
  similarity = sum(a * b for a, b in zip(emb1, emb2))
Just multiplies and adds; this is trivial! So if you do that for a million texts, you've got a search engine. Vector DBs automate this for you. There are free libraries that are just as good, and free models to embed text with; OpenAI also has some great embeddings. You can use np.dot to compute similarities fast; up to around 100,000 vectors it's the best way, and you get exact rather than approximate results.
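For example, a brute-force exact top-K search over a cached matrix of embeddings is just the following (a sketch; it assumes the embeddings were L2-normalized when cached, so a dot product is the cosine similarity):

  import numpy as np

  # corpus: (N, d) matrix of cached, L2-normalized embeddings; query: (d,) vector
  def top_k(query, corpus, k=10):
      scores = corpus @ query          # N dot products in one shot
      best = np.argsort(-scores)[:k]   # indices of the k most similar texts
      return best, scores[best]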
The great thing about embedding text is the simplicity of the API and the similarity operation. It's dead simple to use. You can do clustering, classification, near neighbour search / ranking, recommendation, or any kind of semantic operations between two texts that can be described as a score. If you cache your vectors you can search very very quickly with np.dot or other methods, in a few ms. Today you can also embed images to the same vector space and do image classification by taking the text label with max dot product.
You can also train a very small model on top of embeddings to classify the input into your desired classes, if you can collect a dataset. Embeddings are the best features for text classification. You can think of this embedding method as a way to slice and dice in the semantic space like you do with strings in character space. All fast and local, without GPUs.
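For example, with scikit-learn (a sketch; embed() stands in for whatever embedding model you use, and texts/labels are the dataset you collected):

  from sklearn.linear_model import LogisticRegression

  X = [embed(t) for t in texts]   # embeddings as features
  clf = LogisticRegression(max_iter=1000)
  clf.fit(X, labels)

  print(clf.predict([embed("my package never arrived")]))  # e.g. -> ['complaint']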
Perfect example of AI gold rush nonsense. Pinecone has zero moat and quite a few free alternatives (Faiss, Weaviate, pgvector). Their biggest selling point is that AI hype train people don’t Google “alternatives to Pinecone” when cloning the newest trending repo (or, I guess, ask ChatGPT).
> Pinecone has zero moat and quite a few free alternatives (Faiss, Weaviate, pgvector)
Faiss is a collection of algorithms for in-memory exact and approximate high-dimensional (e.g., > ~30 dimensional) dense vector k-nearest-neighbor search; it doesn't add or really consider persistence (beyond full index serialization to an in-memory or on-disk binary blob), fault tolerance, replication, domain-specific autotuning, and the like. The "vector database" companies like Pinecone, Weaviate, Zilliz and so on add these other features to turn it into a complete service; they're not really the same thing. pgvector seems to be a DB-backed equivalent of IndexFlat and IndexIVFFlat (?) from the Faiss library at present, but is of course not a complete service.
However, which kind of approximate indexing you want to use very much depends on the data you're indexing, on where you want to be in the tradeoff space between latency, throughput, encoding accuracy, NN recall, and memory/disk consumption (these are the fundamental tradeoffs in the vector search domain), and on whether you are performing batched queries or not. To access the full range of tradeoffs you'd need to use all of the options available in Faiss or similar low-level libraries, which may be difficult to use or require knowledge of the underlying algorithms.
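To make that concrete, here's what choosing between exact and approximate indexing looks like in Faiss itself (a minimal sketch with random data; the nlist/nprobe values are arbitrary and just illustrate the recall/latency knobs):

  import faiss
  import numpy as np

  d = 768
  xb = np.random.rand(100_000, d).astype("float32")  # database vectors
  xq = np.random.rand(5, d).astype("float32")        # query vectors

  # Exact search (IndexFlat): best recall, slowest at scale
  flat = faiss.IndexFlatL2(d)
  flat.add(xb)
  D, I = flat.search(xq, 10)

  # Approximate search (IndexIVFFlat): partition into nlist cells,
  # probe only a few at query time; recall vs. latency is tuned via nprobe
  quantizer = faiss.IndexFlatL2(d)
  ivf = faiss.IndexIVFFlat(quantizer, d, 1024)
  ivf.train(xb)
  ivf.add(xb)
  ivf.nprobe = 16
  D, I = ivf.search(xq, 10)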
Spot on. There is zero moat, and the self-hosted alternatives are rapidly improving (if not already better than Pinecone). There are good open-source contributions coming from big corps beyond Meta too, e.g., DiskANN (https://github.com/microsoft/DiskANN).
Maybe I am fundamentally missing something, but a "cloud database company" seems like the most boring tech? No one is calling Planetscale or Yugabyte nonsense because there are free alternatives like Postgres.
Is it possible Andreessen are misunderstanding how pinecone/vector dbs are used? It seems like they are pitching it as "memory for large language models" or something. Are people using vector db's in some way I'm not aware of? To me it's a database to help you do a semantic search. A multi-token string is converted into a single embedding. Like maybe 1000 words into one embedding. This is helpful because you can quickly find the relevant parts of a document to answer a question and there are token limits into an LLM, but the idea that it's helping the LLM keep state or something seems off?
Is it possible they are confusing the use of embeddings across whole swaths of text to do a semantic search with the embeddings that happen on a per token basis as data runs through an LLM? Same word, same basic idea, but used so differently that they may as well be different words?
I might be mistaken, but my understanding from having played around with LangChain for a couple months is that because you’ve got to keep all your state in the context window, giving the model access to a vectorstore containing the entire chat history allows it to retrieve relevant messages against your query that can then be stuffed or mapreduced into the context window for the next response.
The alternative - and I believe the way the ChatGPT web app currently works - is just to stuff/mapreduce user input and machine response into the context window on each step, which quickly gets quite lossy.
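Concretely, the pattern I have in mind looks roughly like this (a sketch; embed() and similarity() stand in for whatever embedding model and metric you use):

  history = []  # list of (message_text, embedding) pairs

  def remember(message):
      history.append((message, embed(message)))

  def build_prompt(user_query, k=5):
      q = embed(user_query)
      # pull only the most relevant past messages into the context window
      ranked = sorted(history, key=lambda pair: similarity(q, pair[1]), reverse=True)
      relevant = [text for text, _ in ranked[:k]]
      return "\n".join(relevant) + "\n\nUser: " + user_query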
You aren't mistaken. Keeping state, or storing memories, is where it's at with prompts. The trick is knowing what to remember and what to forget.
I consider vector engines to be "hot" models, given they are storing the vector representations of text already run through the "frozen" model.
Having written something a while back that indexes documents and enters into discussion with them, I'm pretty sure ChatGPT is using some type of embedding lookup/match/distance on the history in the window. That means not all text is submitted at the next entry, but whatever mostly matches what is entered by the user (in vector space) is likely pulled in and sent over in the final prompt.
Sure - but a vector DB is helping you keep your prompts under size X. It isn't adding state, and there are various mechanisms to keep your prompt under size X - like summarization, providing a table of contents, etc. It seems to me that vector DBs and semantic search are one trick in a pile of tricks to keep prompt sizes down until we can get the input sizes up (although GPT-4 already takes 32,000 tokens).
Using semantic search to find relevant chunks seems misguided but practical in the short term. One of the key benefits of LLMs is they can take into account a lot of context.
Context constraint is a cheap way to keep the model on-topic. So rather than relying on an ever-growing context window to stuff/mapreduce more undifferentiated “context” (the entire chat history), interposing a vector search engine that only returns relevant context tends to get you better overall model performance, in addition to being scalable in a way that increasing context window size is not.
But summarization is better to keep the model on topic for most cases. And there are other tricks.
Vectors and semantic search are one way (a questionable one, given that LLMs can likely reason better over a table of contents or similar) to search a large corpus or a very large document. It's really only appropriate for a specific set of use cases. It's not some "general memory layer" for AI.
Summarization is much more expensive than vector DBs. Assume you have 1M tokens of context. You could run it all through GPT-4 and summarize the information, but it would cost $60 (based on current prices) and take tens of minutes of GPU time to do the inference.
Disclaimer: I work for a16z and on the infra team, so consider me biased.
If you look through the comments here, folks are mostly referring to keeping, for example, a chat history. No one is doing 1M words of chat. A common pattern is to summarize a chat history and pass that in the prompt.
As for a corpus of documents (which is what you are presumably talking about), there are a couple problems with what you are saying:
First, you are implying that the content is always new - that's not true for many cases folks are talking about solving (like technical support or customer support), so it's a one time fee to summarize the corpus. You might run it periodically for updates.
Second, there is an assumption that a basic semantic search is the best way to search documents to find the most relevant content. That was questionable even before LLMs existed, but with LLMs you are basically assuming your cosine-similarity search over vectors is better than what an LLM can do with a simple table of contents and the question "where should I search?" I haven't seen a detailed study, but the implicit assumption that semantic search is the best approach for text could easily be a bad one.
Third, it assumes the quantum of data to search through is astronomically large and/or growing faster than the almost-certain decreases in inference cost and increases in input tokens. This will be true for some subset of things, but unlikely for many, and in the cases where it is true they'll do something more sophisticated than embeddings and embedding search. They'll probably fine-tune the underlying model on an ongoing basis.
Regardless - the post you guys wrote seems... like a stretch as a definition of what this really is. And, at least on the surface, vector databases appear to be commodity infra. Pinecone might be growing fast now, but how do they ever make much money above their costs? But you guys seem smart, so maybe there is something there?
Chat history may work, it depends on how long it is and the business model.
I don't quite understand how general summarization would work. If you use an LLM simply to summarize in order to feed the result into a prompt, the summarization needs to be specific to the query, i.e. "summarize what this text says about topic X". You can't summarize long text in a generic way without losing information. Or do I misunderstand the comment?
If you have a perfect table of contents (or better, an index by topic) you may not need semantic search. But for the typical use case we are seeing, you have unstructured data without an index (e.g. tech support knowledge base entries, company reports, emails). For that, semantic search works quite well.
For the sizes, the observation is that the data people want to search over (e.g. your email, a wiki, JIRA, a knowledge base) is far larger than the context length. You are correct that we assume inference cost and speed won't improve sufficiently quickly in the near future. The "why" is a longer topic, but in a nutshell GPU speed increases are ~2.5x gen over gen, and other than overtraining vs. Chinchilla we don't see immediate model gains. But that is speculative; we don't know what's in store.
To some degree we are just reacting to user adoption in the market. We don't build these systems, but if we see enough of them eventually we recognize the pattern. And while I am optimistic, we could be wrong. AI is a major revolution and we are all students.
Yeah, everything here seems basically reasonable, I'd quibble with a couple things but it's debatable. And we might be talking past each other a little bit on use cases. Anyway, it's a fun space.
And if someone is building a chat interface which is effectively a search product then they are going to find these things useful. But it's not a generic LLM memory layer or something.
From my perspective, it’s not clear why you would want bulk summarisation of all context versus summarisation over “relevant” vectors, since it is both substantially more expensive and less effective: you are effectively polluting the context window with “irrelevant” context. And the problem compounds as you scale up, even trivially.
Admittedly I’m hand-waving a bit around “relevant” and “irrelevant” - clearly your vector search setup has to be fit for purpose. That’s a talent all on its own, so I wonder if we will see competing approaches at the vectorstore level or if it’s relatively settled. Anyway, I’m out of my depth at that point so I’ll leave it there.
I think it's unsettled and we'll see some clever things which combine approaches. On the surface it seems like preprocessing a corpus in clever ways will be useful.
If we read a document, that's preprocessing it. It's useful for being able to discuss later, or bring that understanding to bear on a different, yet related problem space.
I agree that a combined approach is likely useful.
It also has a more basic version that just keeps a log of past messages.
I don't know whether there's a way (or even a need) to combine these approaches. In a long conversation, it might be useful to trust more recent information more than earlier messages, but Langchain's vector memory doesn't care about sequence.
Same with OpenSearch and Elasticsearch, both of which have added vector search as well (with slight differences between their implementations). And since vector search is computationally expensive, there is a lot of value in narrowing down your result set with a regular query before calculating the best matches from the narrowed-down set.
From what I've seen, the big limitation currently is dimensionality. Most of the more advanced models have high dimensionality, and Elasticsearch and Lucene in particular limit it to 1024; several of the OpenAI models have a much higher dimensionality. OpenSearch works around this by supporting alternate implementations to Lucene for vectors.
Of course it's a sane limitation from a cost and computation point of view, having these huge embeddings doesn't scale that well. But it does limit the quality of the results unless you can train your own models and tailor them to your use case.
If you are curious about how to use this stuff, I invested some time a few weeks ago getting my kt-search Kotlin library to support it and wrote some documentation: https://jillesvangurp.github.io/kt-search/manual/KnnSearch.h.... The quality was underwhelming IMHO, but that might be my complete lack of experience with this stuff.
I have no experience with Pinecone and I'm sure it's great. But I do share the sentiment that they might not come out on top here. There are too many players and it's a fast-moving field. OpenAI just moved the whole field forward enormously in terms of what is possible and feasible.
I wasn't making a personal recommendation to you? I was answering more broadly why someone would use Pinecone in the future.
Every new software company like this gets the "why wouldn't everyone just use X existing open-source project; why even try to make it a real business with a hundred devs, actual support/marketing, and big ambitions to be more than a plugin to Postgres?" treatment.
Based on the videos and interviews with their lead dev I've seen, Pinecone has some quite large plans around integrating with a wider stack and with company databases, well beyond what they have done so far in releasing an early version of the DB.
Regardless, getting wider adoption via actual businesses investing in marketing/sales to seed ideas in the market can spur development and potentially progress the tooling across the wider market, which feeds back into open source.
I don't know their financials well enough to say whether it's a good investment or not, but I have a feeling they are overhyped. Vector DBs are becoming a commodity, with many hosting options too. Again, I don't know what they have planned for growth; "long-term memory" is too vague. We'll see how it goes.
They're so hot right now that you can't even sign up for a starter account. I'm guessing this money will help them fix that so as not to slow down their potential customer base. It's a really easy DB to use for people with no idea about vector DBs, etc.
Happy for them, has been a very smooth developer experience using Pinecone and I think there is more than meets the eye with the combined keyword+vector search and handling of filtering. It's not going to solve all problems, but it's a well scoped solution at a good time.
Can someone explain how vector databases can be used as long term memory for AI (and LLMs specifically)?
Isn't there still a token limit as to how much ChatGPT can hold in working memory?
Is the goal that ChatGPT can query the vector database directly to get information out, and if so, how is that different than using a regular database?
Vector databases store embeddings, which map semantically similar sentences to points that are close together in the vector space.
So the sentence "I started working as a programmer" will be very close to "I began my job as a software developer". This makes it very powerful for natural language search.
So when the user asks a bot "Find the text message John sent me 3 years ago about wanting to found a company, I think it was like a hang gliding company? or parasailing? idk"
Behind the scenes you can ask GPT "Output a list of 10 candidate sentences that are plausible text messages that John may have sent", and it spits out
"I'm thinking of starting a hang gliding company" and "I might found a parasailing company" etc.
Then you query the vector DB for those imaginary sentences, and as a result you get sentences that are semantically similar. You take the top N nearest neighbors and plug them in to the GPT context window and say:
"Here are 100 sentences that might match the original query. If any of them match, select the number that matches. Otherwise output null"
If you engineer this system well, you can get pretty decent results.
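In rough code, the flow is (a sketch with hypothetical llm(), embed() and vector_db helpers rather than any particular vendor's API):

  def find_message(user_query, k=10):
      # 1. Ask the LLM for plausible candidate sentences
      prompt = f"Output {k} plausible text messages matching: {user_query}"
      candidates = llm(prompt).splitlines()

      # 2. Query the vector DB with each imaginary sentence
      neighbors = []
      for c in candidates:
          neighbors.extend(vector_db.query(embed(c), top_k=k))

      # 3. Ask the LLM to pick the real match, or say none match
      numbered = "\n".join(f"{i}: {n}" for i, n in enumerate(neighbors))
      return llm(f"Which of these matches '{user_query}'? "
                 f"Answer with the number, or null.\n{numbered}")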
This is great for key:value type querying, but IMO there is a lot of ground to be explored by extending it with more graph-like links. I.e. use vector search to get some initial nodes that are better than random, but then start doing a little beam-search algorithm from those nodes to find nodes they are "linked" to in some way that may answer the query better.
So you have to feed in the items to ChatGPT manually (or via some script) it looks like? In the future I guess ChatGPT with plugins could query the database on its own?
Does it work for text data or can it work for other types of data as well?
It works for all types of data. You can say "Give me objects similar to water" and have it return words like "liquid" and "juice," pictures of water, the water drop emoji, and babbling brook and rain storm sound files.
It all depends on how you produce the vectors before storage. The vector database just stores them.
How does the production of vectors work? How do you know that the water emoji and pictures of water should end up close together in the hyperspace?
You call the ChatGPT API programmatically, entirely automated. You call the API, parse the response, make a decision in code, call the API again, etc. The GPT API just becomes a natural language reasoning module in an otherwise normal codebase.
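A minimal sketch of that loop, using the older openai Python library as it existed around the time of this thread (ticket_text and the route_* functions are made-up stand-ins for your own code):

  import openai

  def ask(prompt):
      resp = openai.ChatCompletion.create(
          model="gpt-3.5-turbo",
          messages=[{"role": "user", "content": prompt}],
      )
      return resp["choices"][0]["message"]["content"]

  answer = ask("Does this message mention a refund? Answer yes or no.\n" + ticket_text)
  if answer.strip().lower().startswith("yes"):
      route_to_billing(ticket_text)   # decision made in ordinary code
  else:
      route_to_general(ticket_text)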
I'm still surprised by their generous free tier, I have a database of 300k embeddings on Pinecone and it's only 10% full by their metrics. Now, I'm only averaging a request every other minute with 90ms per query, but it would take crazy amounts of traffic or a ton more data for me to convert from their free tier.
The homepage offers a few clues: Shopify, Gong, Zapier, HubSpot, Expel, and several thousand others. That includes huge enterprises who tend not to want their names shown publicly.
Basically there are many companies with tens of millions, hundreds of millions, and even billions of embeddings. If they care about performance and reliability, and don't want to tie up an entire team of engineers to manage a self-hosted solution, then Pinecone makes a lot of sense for them.
In a way this also answers the many questions about "Pinecone vs [whatever]" ... If you're dealing with <1M embeddings the differences between your options will hardly matter — just pick whatever's easiest for you. If you're already using a managed DB that introduced something that's good enough for you... just use that. Though we still work hard to make Pinecone the easiest choice and have features that many basic solutions don't have, such as hybrid search (sparse + dense vector embeddings) for better search results.
I’ve wanted to ask this question but I don’t know who to ask.
Can someone explain what the use case is for vector DBs like pinecone, milvus etc. vs a fully featured search engine like Vespa, ElasticSearch etc. which also support vector search features?
Is there something about running this type of index operationally that is particularly difficult?
I’ve been playing around here a bit. One feature is that documents have embeddings stored at index time, which means query performance is quite good. Another is that the model you’re using to embed and query can be changed and configured for specific use cases. A pretty cool feature of Marqo is multi-statement queries, which let you search with multiple positive queries and even include negative queries to filter out results.
What are Pinecone’s advantages compared to other vector databases? From what I understand one of the founders has experience building similar features at AWS previously.
> We raised $100 million in Series B funding, led by Andreessen Horowitz,
I closed the browser tab here. It's a usual techbro hustle, nothing to see here, move along. Hype, whatnot, couldn't care less. I am sure they will make money but it'll be detrimental to society. As always.
The key thing is that it's in-memory and allows you to combine attribute-based filtering with nearest-neighbor search.
We're also working on a way to automatically generate embeddings from within Typesense using any ML models of your choice.
So Algolia + Pinecone + Open Source + Self-Hostable with a cloud hosted option = Typesense