Option to use an additional stream for RAG #322

Open
MrDowntempo opened this issue May 28, 2024 · 4 comments
Labels
question Further information is requested

Comments

@MrDowntempo

I had been using this with Ollama, but recently got LM Studio serving my model instead. It runs a lot quicker for me since I can offload to the GPU. Another feature it has (and other servers and services have) is RAG. LM Studio supports it with a separate stream. For example, if the chat stream is http://localhost:1234/v1/chat/completions then the RAG files can be sent to http://localhost:1234/v1/embeddings
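For concreteness, here's a minimal sketch of what requests to those two endpoints look like, assuming LM Studio's OpenAI-compatible request shapes (the model names are placeholders, not anything gptel defines):

```python
import requests

BASE = "http://localhost:1234/v1"

# Chat endpoint: an ordinary chat completion request.
chat = requests.post(f"{BASE}/chat/completions", json={
    "model": "local-model",
    "messages": [{"role": "user", "content": "Hello"}],
}).json()
print(chat["choices"][0]["message"]["content"])

# Embeddings endpoint: returns one vector per input string.
emb = requests.post(f"{BASE}/embeddings", json={
    "model": "local-embedding-model",
    "input": ["some text to embed"],
}).json()
print(len(emb["data"][0]["embedding"]))  # dimensionality of the returned vector
```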

I don't use PrivateGPT, but it looks like it works similarly, with v1/chat/completions for the chat, and v1/ingest or v1/ingest/text for the RAG embeddings.

Ideally, I'd like to be able to set the stream for the RAG alongside the other backend configuration, and have gptel allow me to send files, buffers, directories, etc. for RAG support.

@karthink
Owner

Could you explain how persistent RAG support is? Do you have to specify the sources for RAG in each chat session, or are the sources/embeddings of the sources persistently available to the model? By persistent I mean that you set it up once, and then the LLM runner (LM Studio etc.) makes the RAG documents available to the model whenever you start a chat session from Emacs (or equivalent).

@MrDowntempo
Author

To be frank, I'm very new to the local AI world, so I'm not the best resource; take anything I say with a grain of salt here.

What I do know is that v1/embeddings is part of the OpenAI API, and they have some docs here. My understanding is that it's just an endpoint target that could be used differently depending on the application/use case. With LM Studio, I don't think they store anything that gets sent to it; they just provide the target that another app can use for RAG. Which I presume means that if you want to talk about the same document or project in another session, you'd have to send that stuff to the embeddings target again. This means you could send all the files of a webdev project in one session so the agent can work on it more intelligently, but in the next session you could send it your Emacs calendar if you want to talk about scheduling issues, and it would have THAT context but wouldn't remember anything about your coding project.
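For what it's worth, here's a rough sketch of how a client app might use that endpoint per session; this isn't gptel or LM Studio code, the model name is a placeholder, and the chunking is deliberately naive:

```python
import requests

BASE = "http://localhost:1234/v1"

def embed(texts):
    """Get one embedding vector per input string from the local server."""
    resp = requests.post(f"{BASE}/embeddings",
                         json={"model": "local-embedding-model", "input": texts})
    return [item["embedding"] for item in resp.json()["data"]]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

# Per session: embed the project files again (nothing is stored server-side),
# then keep the chunk closest to the user's question as extra context.
chunks = ["...file 1 contents...", "...file 2 contents..."]  # naive chunking
chunk_vecs = embed(chunks)
query_vec = embed(["How is routing configured?"])[0]
best = max(range(len(chunks)), key=lambda i: cosine(chunk_vecs[i], query_vec))
context = chunks[best]  # prepend this to the chat prompt
```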

I think MemGPT is tackling the broader memory-persistence issue in interesting ways, and they use the embeddings target for part of that, but that's a whole other thing :)

And again, I'm not really educated enough about this to give good answers.

@MrDowntempo
Author

I'll also add that LM Studio, the tool I've been using most recently to test and discover models, and the tool that I use to get the /v1/embeddings target, has some additional info on their implementation here.

@karthink
Owner

karthink commented Jun 9, 2024

There's some disconnect in our understanding of RAG.

As I understand it, RAG and embeddings are two independent methods.

RAG works as follows:

  1. A corpus of (text) documents is provided to the model provider beforehand. These are added to a database.
  2. When sending a request, the user prompt is used to generate a search query. The model uses another model or some simple transformation to generate this query. This is the (R)etrieval part of RAG.
  3. This query is fed to a search engine or the database. The results are chunks from the most relevant documents, as determined by the search algorithm.
  4. These chunks are fed into the model as additional user prompts. This is the (A)ugmented part of RAG.
  5. The model now has more relevant text to work with, and (G)enerates better results.

At no point in this process did the user have to access an embeddings API endpoint. Tokenization and embedding happen as usual in the regular course of operation of the model.
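As a toy illustration of steps 2-5 (using a trivial keyword match as the retrieval step; all names here are made up, and a real system would use a proper search index or database):

```python
import requests

BASE = "http://localhost:1234/v1"

# Step 1: documents provided beforehand (contents elided here).
corpus = {
    "notes.org": "...",
    "design.md": "...",
}

def retrieve(prompt, k=2):
    """Steps 2-3: derive a query from the prompt (here just lowercased words)
    and rank documents by naive keyword overlap."""
    query = set(prompt.lower().split())
    scored = sorted(corpus.items(),
                    key=lambda kv: len(query & set(kv[1].lower().split())),
                    reverse=True)
    return [text for _, text in scored[:k]]

def rag_chat(prompt):
    # Step 4: feed the retrieved chunks back in as additional context.
    context = "\n\n".join(retrieve(prompt))
    messages = [
        {"role": "user", "content": f"Context:\n{context}"},
        {"role": "user", "content": prompt},
    ]
    # Step 5: the model generates with the extra material in view.
    resp = requests.post(f"{BASE}/chat/completions",
                         json={"model": "local-model", "messages": messages})
    return resp.json()["choices"][0]["message"]["content"]
```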

So I'm not sure how the embeddings API is relevant here. If you can point me to the technical documentation for RAG over HTTP, I can take a look.

I might also be wrong and have misunderstood the process. In that case please let me know what I got wrong.

@karthink karthink added the question Further information is requested label Jun 27, 2024