
You aren't mistaken. Keeping state, or storing memories, is where it's at with prompts. The trick is knowing what to remember and what to forget.

I consider vector engines to be "hot" models, given they are storing the vector representations of text already run through the "frozen" model.

Having written something a while back that indexes documents and lets you discuss them, I'm pretty sure ChatGPT is using some type of embedding lookup/match/distance on the history in the window. That means not all text is submitted at the next entry; whatever most closely matches what the user just entered (in vector space) is likely pulled in and sent along in the final prompt.
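A minimal sketch of that kind of history lookup, assuming an embed() stand-in for whatever embedding model you use (names are illustrative, not ChatGPT's actual implementation):

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Stand-in for whatever embedding model is used
        # (an embeddings API call, sentence-transformers, etc.).
        raise NotImplementedError

    def top_matches(history: list[str], user_msg: str, k: int = 5) -> list[str]:
        """Return the k past messages closest to the new message in vector space."""
        q = embed(user_msg)
        scored = []
        for msg in history:
            v = embed(msg)
            cos = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            scored.append((cos, msg))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [msg for _, msg in scored[:k]]

    # Only these matches (plus the newest message) would go into the next prompt.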



Sure - but a vector db is just helping you keep your prompt under some size X. It isn't adding state, and there are various other mechanisms for keeping your prompt under size X - like summarization, providing a table of contents, etc. It seems to me that vector dbs and semantic search are one trick in a pile of tricks to keep prompt sizes down until we can get input sizes up (although GPT-4 already takes 32,000 tokens).
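The crudest trick in that pile is just trimming the history to a token budget before doing anything fancier like summarization or retrieval. A sketch, assuming tiktoken for token counting (the budget number is made up):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def fit_to_budget(messages: list[str], budget: int = 3000) -> list[str]:
        """Keep the most recent messages that still fit under the token budget."""
        kept, used = [], 0
        for msg in reversed(messages):              # walk newest-first
            n = len(enc.encode(msg))
            if used + n > budget:
                break
            kept.append(msg)
            used += n
        return list(reversed(kept))                 # back to chronological order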

Using semantic search to find relevant chunks seems misguided but practical in the short term. One of the key benefits of LLMs is they can take into account a lot of context.


Context constraint is a cheap way to keep the model on-topic. So rather than relying on an ever-growing context window to stuff/mapreduce more undifferentiated “context” (the entire chat history), interposing a vector search engine that only returns relevant context tends to get you better overall model performance, in addition to being scalable in a way that increasing context window size is not.


Agreed.

But summarization is better at keeping the model on topic in most cases. And there are other tricks.

Vectors and semantic search are one way (a questionable one, given an LLM can likely reason better over a table of contents or similar) to search a large corpus or a very large document. It's really only appropriate for a specific set of use cases. It's not some "general memory layer" for AI.


Summarization is much more expensive than vector dbs. Assume you have 1M tokens of context. You could run it all through GPT-4 and summarize the information, but it would cost about $60 (based on current prices) and take tens of minutes of GPU time for the inference.
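For reference, the rough arithmetic behind that figure (assuming GPT-4's ~$0.06 per 1K prompt tokens at the 32K-context tier):

    1,000,000 tokens x ($0.06 / 1,000 tokens) = $60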

Disclaimer: I work for a16z and on the infra team, so consider me biased.


If you look through the comments here, folks are mostly referring to keeping, for example, a chat history. No one is doing 1M words of chat. A common pattern is to summarize the chat history and pass that summary in the prompt.
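That pattern is basically a prompt over the prior summary plus the new turns - a sketch, where the prompt wording and the chat_complete() helper are illustrative:

    def update_summary(chat_complete, prior_summary: str, new_turns: list[str]) -> str:
        """Fold the latest exchanges into a running summary of the conversation."""
        prompt = (
            "Summary of the conversation so far:\n"
            f"{prior_summary}\n\n"
            "Latest messages:\n" + "\n".join(new_turns) + "\n\n"
            "Update the summary, keeping facts, decisions and open questions."
        )
        return chat_complete(prompt)  # any LLM completion function

    # Each request then sends: system prompt + running summary + the last few raw turns.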

As for a corpus of documents (which is what you are presumably talking about), there are a couple problems with what you are saying:

First, you are implying that the content is always new. That's not true for many of the cases folks are talking about solving (like technical support or customer support), so it's a one-time fee to summarize the corpus. You might re-run it periodically for updates.

Second, there is an assumption that basic semantic search is the best way to search documents for the most relevant content. That was questionable even before LLMs existed, but with LLMs you are basically assuming your cosine similarity search over vectors beats what an LLM can do with a simple table of contents and the question "where should I search?" (see the sketch after this list). I haven't seen a detailed study, but the implicit assumption that semantic search is the best approach for text could easily be a bad one.

Third, it assumes the amount of data to search through is astronomically large and/or growing faster than the almost-certain decreases in inference cost and increases in input tokens. That will be true for some subset of things, but probably not many, and in the cases where it is true they'll do something more sophisticated than embeddings and embedding search - most likely fine-tune the underlying model on an ongoing basis.
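A sketch of the table-of-contents alternative mentioned in the second point, with the routing prompt and ask_llm() helper purely illustrative:

    def route_by_toc(ask_llm, toc: list[str], question: str) -> str:
        """Ask the model which section to read, instead of doing a
        cosine-similarity lookup over embeddings."""
        prompt = (
            "Table of contents:\n"
            + "\n".join(f"{i + 1}. {title}" for i, title in enumerate(toc))
            + f"\n\nQuestion: {question}\n"
            + "Which section should I read to answer this? Reply with the number only."
        )
        return ask_llm(prompt)  # then fetch that section and answer from it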

Regardless - the post you guys wrote seems... like a stretch as a definition of what this really is. And, at least on the surface, vector databases appear to be commodity infra. Pinecone might be growing fast now, but how do they ever make much money above their costs? But you guys seem smart, so maybe there is something there?


Chat history may work; it depends on how long it is and on the business model.

I don't quite understand how general summarization would work. If you use an LLM simply to summarize text in order to feed it into a prompt, the summarization needs to be specific to the query, i.e. "summarize what this text says about topic X". You can't summarize long text in a generic way without losing information. Or do I misunderstand the comment?

If you have a perfect table of contents (or better, an index by topic) you may not need semantic search. But in the typical use case we are seeing, you have unstructured data without an index (e.g. tech support knowledge base entries, company reports, emails). For that, semantic search works quite well.

For the sizes, the observation is that the data people want to search over (e.g. your email, a wiki, JIRA, a knowledge base) is far larger than the context length. You are correct that we assume inference cost and speed won't improve sufficiently quickly in the near future. Why is a longer topic, but in a nutshell GPU speed increases are ~2.5x per generation, and other than overtraining relative to Chinchilla we don't see immediate model gains. But that is speculative; we don't know what's in store.

To some degree we are just reacting to user adoption in the market. We don't build these systems, but if we see enough of them, eventually we recognize the pattern. And while I am optimistic, we could be wrong. AI is a major revolution and we are all students.

edit: disclaimer, I work for a16z.


Yeah, everything here seems basically reasonable, I'd quibble with a couple things but it's debatable. And we might be talking past each other a little bit on use cases. Anyway, it's a fun space.

edit: To me this is a better summary of what a vector db is useful for: https://cloud.google.com/blog/topics/developers-practitioner...

And if someone is building a chat interface which is effectively a search product then they are going to find these things useful. But it's not a generic LLM memory layer or something.


From my perspective, it’s not clear why you would want bulk summarisation of all context versus summarisation over “relevant” vectors: it is both substantially more expensive and less effective, since you are effectively polluting the context window with “irrelevant” context. And the problem compounds as you scale up - even trivially.

Admittedly I’m hand-waving a bit around “relevant” and “irrelevant” - clearly your vector search setup has to be fit for purpose. That’s a talent all on its own, so I wonder if we will see competing approaches at the vectorstore level or if it’s relatively settled. Anyway, I’m out of my depth at that point so I’ll leave it there.


I think it's unsettled and we'll see some clever things which combine approaches. On the surface it seems like preprocessing a corpus in clever ways will be useful.


If we read a document, that's preprocessing it. It's useful for being able to discuss it later, or to bring that understanding to bear on a different yet related problem space.

I agree that a combined approach is likely useful.



