Option to use an additional stream for RAG #322
Could you explain how persistent the RAG support is? Do you have to specify the sources for RAG in each chat session, or are the sources/embeddings of sources persistently available to the model? By persistent I mean that you set it up once and then the LLM runner (LM Studio etc.) makes the RAG documents available to the model whenever you start a chat session from Emacs (or equivalent).
To be frank, I'm very new to the local AI world, so I'm not the best resource; take anything I say with a grain of salt. What I do know is that v1/embeddings is part of the OpenAI API, and they have some docs here. My understanding is that it's just an endpoint target that can be used differently depending on the application/use case. With LM Studio, I don't think they store anything that gets sent to it; they just provide the target that another app can use for RAG. Which I presume means that if you want to talk about the same document or project in another session, you'd have to send that material to the embeddings target again. So you could send all the files of a webdev project in one session so the agent can work on it more intelligently, but in the next session you could send it your Emacs calendar if you want to talk about scheduling issues, and it would have THAT context but wouldn't remember anything about your coding project. I think MemGPT is tackling the memory-persistence issue in interesting ways, and they use the embeddings target for part of that, but that's a whole other thing :) And again, I'm not really educated enough about this to give good answers.
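The session-scoped workflow described above — send document chunks to an embeddings target, then retrieve the relevant ones for each query — can be sketched as follows. Everything here is illustrative: the toy `embed` function stands in for a real call to an embedding model, and the function names are made up for the example.

```python
import math

def embed(text):
    # Toy stand-in for a real embedding model: maps a string to a tiny
    # vector of character-class counts. A real client would POST the text
    # to an embeddings endpoint and use the returned vector instead.
    return [
        sum(c.isalpha() for c in text),
        sum(c.isdigit() for c in text),
        float(len(text)),
    ]

def cosine(a, b):
    # Cosine similarity between two vectors; 0.0 if either is zero-length.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Rank stored chunks by similarity to the query embedding, keep top k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks):
    # Prepend the retrieved context to the user's question.
    context = "\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Because nothing is persisted between sessions in this sketch, the chunks have to be re-sent and re-embedded each time, matching the behavior described above.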
I'll also add that LM Studio, the tool I've been using most recently to test and discover models, and the tool that I use to get the /v1/embeddings target, has some additional info on their implementation here.
There's some disconnect in our understanding of RAG. As I understand it, RAG and embeddings are two independent methods. RAG works as follows: the user's query is used to find relevant document chunks, those chunks are inserted into the prompt alongside the query, and the model then processes this augmented prompt as it would any other input.
At no point in this process did the user have to access an embeddings API endpoint. Tokenization and embedding happen as usual in the regular course of operation of the model. So the embeddings API is not relevant. If you can point me to the technical documentation for RAG over HTTP, I can take a look. I might also be wrong and have misunderstood the process. In that case please let me know what I got wrong.
I had been using this with Ollama, but recently got LM Studio serving my model instead. It runs a lot quicker for me since I can offload to the GPU. Another feature it has (and other servers and services have) is RAG. LM Studio supports it with a separate stream. For example, if the chat stream is http://localhost:1234/v1/chat/completions then the RAG files can be sent to http://localhost:1234/v1/embeddings
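Based on the OpenAI-compatible API shape these servers expose, the two requests differ only in path and payload. A minimal sketch (the base URL comes from the example above; the model name is a placeholder, and a real client would serialize the body with `json.dumps` and POST it):

```python
BASE = "http://localhost:1234"  # LM Studio's default local address, per the example above

def chat_request(messages, model="local-model"):
    # OpenAI-style chat completion payload: a list of role/content messages.
    return (f"{BASE}/v1/chat/completions",
            {"model": model, "messages": messages})

def embeddings_request(texts, model="local-model"):
    # OpenAI-style embeddings payload; `input` may be a string or a list.
    return (f"{BASE}/v1/embeddings",
            {"model": model, "input": texts})
```

The point is that the RAG material and the chat traffic travel over separate endpoints on the same server, which is what makes a second configurable stream useful.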
I don't use PrivateGPT, but it looks like it works similarly, with v1/chat/completions for the chat, and v1/ingest or v1/ingest/text for the RAG embeddings.
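I haven't verified PrivateGPT's exact request schema, so this is an assumption: its text-ingest endpoint plausibly takes a document name plus the raw text, analogous to the embeddings payload above. A hedged sketch (the port and field names are guesses and should be checked against PrivateGPT's actual API docs):

```python
def ingest_text_request(file_name, text, base="http://localhost:8001"):
    # ASSUMPTION: field names and the default port are illustrative, not
    # confirmed against PrivateGPT's documentation. Verify before use.
    return (f"{base}/v1/ingest/text",
            {"file_name": file_name, "text": text})
```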
Ideally, I'd like to be able to set the stream for the RAG alongside the other backend configuration, and have gptel allow me to send files, buffers, directories, etc. for RAG support.
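One way such a feature could feed files or buffers to an embeddings stream is to split the text into overlapping windows first, so sentences cut at a boundary still appear whole in at least one chunk. A hypothetical sketch (the chunk size and overlap are arbitrary choices for illustration):

```python
def chunk_text(text, size=400, overlap=50):
    # Split text into fixed-size windows that overlap by `overlap`
    # characters, so content straddling a boundary survives in one piece.
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text), 1), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```

Each chunk could then be sent to the configured embeddings endpoint, whatever its URL, with files and directories handled by reading and chunking their contents.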