Option to use an additional stream for RAG #322

Open
MrDowntempo opened this issue May 28, 2024 · 4 comments
Labels
question Further information is requested

Comments

@MrDowntempo

I had been using this with Ollama, but recently got LM Studio serving my model instead. It runs a lot quicker for me since I can offload to the GPU. Another feature it has (and other servers and services have) is RAG. LM Studio supports it with a separate stream. For example, if the chat stream is http://localhost:1234/v1/chat/completions then the RAG files can be sent to http://localhost:1234/v1/embeddings
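For concreteness, here's a minimal sketch of what requests to those two endpoints look like, assuming LM Studio's OpenAI-compatible request shapes (the model names are placeholders, not anything gptel defines):

```python
import requests

BASE = "http://localhost:1234/v1"

# Chat endpoint: an ordinary chat completion request.
chat = requests.post(f"{BASE}/chat/completions", json={
    "model": "local-model",
    "messages": [{"role": "user", "content": "Hello"}],
}).json()
print(chat["choices"][0]["message"]["content"])

# Embeddings endpoint: returns one vector per input string.
emb = requests.post(f"{BASE}/embeddings", json={
    "model": "local-embedding-model",
    "input": ["some text to embed"],
}).json()
print(len(emb["data"][0]["embedding"]))  # dimensionality of the returned vector
```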

I don't use PrivateGPT, but it looks like it works similarly, with v1/chat/completions for the chat, and v1/ingest or v1/ingest/text for the RAG embeddings.

Ideally, I'd like to be able to set the stream for the RAG alongside the other backend configuration, and have gptel allow me to send files, buffers, directories, etc. for RAG support.

@karthink
Owner

Could you explain how persistent RAG support is? Do you have to specify the sources for RAG in each chat session, or are the sources/embeddings of the sources persistently available to the model? By persistent I mean that you set it up once, and then the LLM runner (LM Studio etc.) makes the RAG documents available to the model whenever you start a chat session from Emacs (or equivalent).

@MrDowntempo
Author

To be frank, I'm very new to the local AI world, so I'm not the best resource; take anything I say with a grain of salt here.

What I do know is that v1/embeddings is part of the OpenAI API, and they have some docs here. My understanding is that it's just an endpoint target that could be used differently depending on the application/use case. With LM Studio, I don't think they store anything that gets sent to it; they just provide the target that another app can use for RAG. Which I presume means that if you want to talk about the same document or project in another session, you'd have to send that stuff to the embeddings target again. This means you could send all the files of a webdev project in one session so the agent can work on it more intelligently, but in the next session you could send it your Emacs calendar if you want to talk about scheduling issues, and it would have THAT context but wouldn't remember anything about your coding project.
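For what it's worth, here's a rough sketch of how a client app might use that endpoint per session; this isn't gptel or LM Studio code, the model name is a placeholder, and the chunking is deliberately naive:

```python
import requests

BASE = "http://localhost:1234/v1"

def embed(texts):
    """Get one embedding vector per input string from the local server."""
    resp = requests.post(f"{BASE}/embeddings",
                         json={"model": "local-embedding-model", "input": texts})
    return [item["embedding"] for item in resp.json()["data"]]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

# Per session: embed the project files again (nothing is stored server-side),
# then keep the chunk closest to the user's question as extra context.
chunks = ["...file 1 contents...", "...file 2 contents..."]  # naive chunking
chunk_vecs = embed(chunks)
query_vec = embed(["How is routing configured?"])[0]
best = max(range(len(chunks)), key=lambda i: cosine(chunk_vecs[i], query_vec))
context = chunks[best]  # prepend this to the chat prompt
```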

I think MemGPT is tackling the broader memory-persistence issue in interesting ways, and they use the embeddings target for part of that, but that's a whole other thing :)

And again, I'm not really educated enough about this to give good answers.

@MrDowntempo
Author

I'll also add that LM Studio, the tool I've been using most recently to test and discover models, and the tool that I use to get the /v1/embeddings target, has some additional info on their implementation here.

@karthink
Owner

karthink commented Jun 9, 2024

There's some disconnect in our understanding of RAG.

As I understand it, RAG and embeddings are two independent methods.

RAG works as follows:

  1. A corpus of (text) documents is provided to the model provider beforehand. These are added to a database.
  2. When sending a request, the user prompt is used to generate a search query. The model uses another model or some simple transformation to generate this query. This is the (R)etrieval part of RAG.
  3. This query is fed to a search engine or the database. The results are chunks from the most relevant documents, as determined by the search algorithm.
  4. These chunks are fed into the model as additional user prompts. This is the (A)ugmented part of RAG.
  5. The model now has more relevant text to work with, and (G)enerates better results.

At no point in this process did the user have to access an embeddings API endpoint. Tokenization and embedding happen as usual in the regular course of operation of the model.
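As a toy illustration of steps 2-5 (using a trivial keyword match as the retrieval step; all names here are made up, and a real system would use a proper search index or database):

```python
import requests

BASE = "http://localhost:1234/v1"

# Step 1: documents provided beforehand (contents elided here).
corpus = {
    "notes.org": "...",
    "design.md": "...",
}

def retrieve(prompt, k=2):
    """Steps 2-3: derive a query from the prompt (here just lowercased words)
    and rank documents by naive keyword overlap."""
    query = set(prompt.lower().split())
    scored = sorted(corpus.items(),
                    key=lambda kv: len(query & set(kv[1].lower().split())),
                    reverse=True)
    return [text for _, text in scored[:k]]

def rag_chat(prompt):
    # Step 4: feed the retrieved chunks back in as additional context.
    context = "\n\n".join(retrieve(prompt))
    messages = [
        {"role": "user", "content": f"Context:\n{context}"},
        {"role": "user", "content": prompt},
    ]
    # Step 5: the model generates with the extra material in view.
    resp = requests.post(f"{BASE}/chat/completions",
                         json={"model": "local-model", "messages": messages})
    return resp.json()["choices"][0]["message"]["content"]
```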

So I'm not sure how the embeddings API is relevant here. If you can point me to the technical documentation for RAG over HTTP, I can take a look.

I might also be wrong and have misunderstood the process. In that case please let me know what I got wrong.

@karthink karthink added the question Further information is requested label Jun 27, 2024