
Managing Objects through Pipelines and DocumentStores #7830

Open
CarlosFerLo opened this issue Jun 7, 2024 · 5 comments

Comments

@CarlosFerLo
Contributor

Is your feature request related to a problem? Please describe.
When trying to store objects in document stores (for example, to implement some kind of agent memory), you have to add components to serialize and deserialize those documents. This quickly gets messy, so you end up writing custom components.

Describe the solution you'd like
I would like to implement the following components:

  • 'PydanticWriter': serializes the Pydantic models it receives, converting them to documents before adding them to the document store.
  • 'PydanticReader': reads documents from the document store and converts them back into Pydantic models. We could add BM25 and embedding readers and, for the latter, a corresponding writer.

Describe alternatives you've considered
Currently, you can create your own custom components to preprocess the document content before writing and after retrieval.

@CarlosFerLo
Contributor Author

I've been working on this implementation and I have two possible approaches regarding metadata management.

Approach 1: Metadata List

For indexing, we could easily add all metadata through a metadata list. However, this approach is only useful for indexing, as we would need to come up with a solution for carrying metadata down the pipeline after retrieval. While this is a solid starting point, it may not be the optimal long-term solution.
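A minimal sketch of the metadata-list idea, using plain dicts as stand-ins for Pydantic models and Haystack Documents (the to_documents helper and its field names are hypothetical):

```python
import json

def to_documents(objects, metadata_list):
    # Hypothetical helper for the metadata-list approach: pair each
    # object with its metadata entry and serialize the object's fields
    # into the document content.
    if len(objects) != len(metadata_list):
        raise ValueError("objects and metadata_list must be the same length")
    return [
        {"content": json.dumps(obj), "meta": meta}
        for obj, meta in zip(objects, metadata_list)
    ]

docs = to_documents(
    [{"name": "Barcelona", "description": "A city in Catalonia, Spain."}],
    [{"indexed_at": "2024-06-08"}],
)
```

This covers indexing, but as noted above, nothing here carries the metadata back alongside the object after retrieval.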

Approach 2: Integrating Objects into Documents

Instead, I suggest we integrate objects into documents, similar to how we handle DataFrames or Images. This would allow us to incorporate them into the pipeline flow seamlessly. We could add a new attribute to the Document class that expects a BaseModel instance, making serialization super easy and type-safe.
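A rough sketch of what the extended Document might look like. This is a simplified dataclass stand-in for Haystack's actual Document class, with the object attribute typed as Any to stay dependency-free (in the real proposal it would expect a pydantic BaseModel):

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Document:
    # Simplified stand-in for Haystack's Document dataclass.
    content: Optional[str] = None
    meta: dict = field(default_factory=dict)
    # Proposed new attribute: the original object instance, carried
    # through the pipeline alongside the document's content.
    obj: Optional[Any] = None

doc = Document(content="Madrid is the capital of Spain.", obj={"name": "Madrid"})
```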

For embeddings, we could add an object preprocessor that populates the content attribute based on a Jinja template or something similar. This addition to the document data class is very handy for retrieval because we would need to create a wrapper around the object anyway to link the metadata to the specific instance.
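The template-based preprocessing step might look like the sketch below. To keep it dependency-free, a small regex substitution stands in for a real Jinja template; an actual implementation would use jinja2:

```python
import re

def render_content(obj_fields: dict, template: str) -> str:
    # Minimal stand-in for Jinja rendering: substitutes {{ field }}
    # placeholders with values from the object's fields.
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(obj_fields[m.group(1)]),
        template,
    )

content = render_content(
    {"name": "Barcelona", "description": "A city in Catalonia, Spain."},
    "{{ name }}: {{ description }}",
)
```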

I've found that with Pydantic BaseModels, we can easily manage the deserialization of objects after retrieving them from the document store.
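With pydantic v2 this round-trip could be as simple as model_dump_json on the way in and Model.model_validate_json on the way out. A dependency-free sketch of the same idea with a dataclass:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Place:
    name: str
    description: str

# Indexing: serialize the object's fields into the document content.
content = json.dumps(asdict(Place("Madrid", "The capital of Spain.")))

# Retrieval: rebuild the object from the stored content. With pydantic
# this would be Place.model_validate_json(content) instead.
restored = Place(**json.loads(content))
```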

However, this Document-Object integration has one flaw: the pipelines won't be as clean. We would have to construct document content explicitly instead of just passing the objects around. This is a point in favor of the following alternative.

Alternative: Extending the BaseModel Class

As an alternative, we could extend the BaseModel class to integrate metadata directly. By expecting these kinds of objects during indexing and retrieval, we could make things work using the current Document class solely for storage. This approach would keep the pipelines cleaner, as we could pass objects around without needing to determine document content.

In summary, while both approaches allow for easy and type-safe indexing and retrieval, integrating objects directly into the Document class lets us reuse current components at the cost of readability in the pipelines. On the other hand, extending the BaseModel class provides cleaner pipelines but pollutes the object's own fields with extra metadata.

@CarlosFerLo CarlosFerLo changed the title 'PydanticWriter' and 'PydanticReader' components for type-safe object storage Managing Objects through Pipelines and DocumentStores Jun 8, 2024
@anakin87
Member

@CarlosFerLo could you better explain the use case where these components would be useful, perhaps with an example?

(@silvanocerza in case you want to take a look)

@CarlosFerLo
Contributor Author

I am theorizing an Agent that I plan to implement using Haystack. This Agent is designed to interact with the outside world via a chat interface (not in the traditional QA fashion) and through internet searches. To provide it with memory, I am developing a short-long memory system that stores Objects representing facts, memories, people, and more. These Objects should be easily passed through the pipeline and stored in document stores. I believe that creating a more flexible Object design for the pipelines is the best approach to adapt to new scenarios where structured data needs to be passed around to fulfill more complex requirements.

@mrm1001
Member

mrm1001 commented Jun 24, 2024

Hi @CarlosFerLo, thanks for this! Please correct me if I got something wrong :-D. You're trying to implement some type of memory that has a structured format, and you would like to be able to save it as documents in document stores, so you can later do retrieval/vector search on it. One of the key considerations is how to convert these data structures into documents when indexing and, later, how to go from documents back to objects at retrieval.

One question that comes to mind is: I understand doing search over facts, but how are you planning to do search/retrieval over entities like places or people? I don't see a document being a good fit for this type of data.

Another question: could you maybe give an example of how you would use a "metadata list" to convert a structured object into a document (with metadata)?

@CarlosFerLo
Contributor Author

Hi @mrm1001, thanks for your response.

I am trying to generalize the concept of an indexable object. This would allow us to index more complex objects than just plain text or bytes, utilizing the 'Document' infrastructure for document stores. The goal is to make objects compatible with Document serialization so we can leverage everything we have already built.

One question that comes to mind is: I understand doing search over facts, but how are you planning to do search/retrieval over entities like places or people? I don't see a document being a good fit for this type of data.

The point isn't that the data structure fits perfectly. By serializing it properly, we can use the existing functionality for indexing and retrieving these elements. For example, imagine you want your agent to identify a place based on a description. If we're implementing some kind of memory, we could use embedding retrieval for descriptions of objects along with metadata filtering to gather only 'Place' objects. While we could also explore implementing a Graph database and graph embedding, starting with vector document stores through the 'Document' class is sufficient.
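The filtering step might look like the sketch below. The filter dict mirrors the Haystack 2.x filter syntax, while the in-memory matching function is a simplified stand-in for what a document store would do:

```python
def matches(doc_meta: dict, flt: dict) -> bool:
    # Tiny stand-in for a document store's filter evaluation; only the
    # "==" operator on a meta field is handled here.
    field_name = flt["field"].removeprefix("meta.")
    if flt["operator"] == "==":
        return doc_meta.get(field_name) == flt["value"]
    raise NotImplementedError(flt["operator"])

docs = [
    {"content": "Barcelona is a city in Catalonia ...", "meta": {"type": "Place"}},
    {"content": "Alan Turing was a mathematician ...", "meta": {"type": "Person"}},
]
flt = {"field": "meta.type", "operator": "==", "value": "Place"}
places = [d for d in docs if matches(d["meta"], flt)]
```

In the real pipeline, this filter would be passed to the retriever's `filters` argument alongside the query embedding.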

Another question: could you maybe give an example of how you would use a "metadata list" to convert a structured object into a document (with metadata)?

Let's say we want to encode places. We would have the following class, inheriting from our extended pydantic 'BaseModel'. We'll call it 'Indexable'. This class manages the metadata during indexing and provides all the serialization functionality that 'BaseModel' offers.

from haystack import Indexable

class Place(Indexable):
    name: str
    description: str
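The Indexable base class referenced above might be sketched like this. It is a stand-in built on a plain class rather than pydantic's BaseModel, and the get_meta_attr/set_meta_attr helpers are hypothetical:

```python
from typing import Any

class Indexable:
    # Stand-in for the proposed pydantic-based base class: carries a
    # private metadata dict alongside the object's own fields.
    def __init__(self, **fields: Any) -> None:
        self._meta: dict = {}
        for key, value in fields.items():
            setattr(self, key, value)

    def set_meta_attr(self, key: str, value: Any) -> None:
        self._meta[key] = value

    def get_meta_attr(self, key: str) -> Any:
        return self._meta[key]

class Place(Indexable):
    name: str
    description: str

place = Place(name="Barcelona", description="A city in Catalonia, Spain.")
place.set_meta_attr("indexed_at", "2024-06-08")
```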

We can use this class for our internal logic, with additional functionality from Haystack like 'get_meta_attr' and 'set_meta_attr' to manage metadata. For instance, we might want to store when an object was originally indexed in the metadata, even though it's not relevant to the actual content. When indexing, we might also store the embeddings of those objects in the metadata. Here’s an example of what we might implement, assuming these components exist:

Indexing objects into a document store:

from haystack import Pipeline
from haystack.components.embedders import ObjectEmbedder
from haystack.components.writers import ObjectWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

embedder = ObjectEmbedder[Place](template="{{name}}: {{description}}")
# This notation is for type checking, and we could work without it
# This component will add the embeddings as a meta field using 'set_meta_attr'

writer = ObjectWriter(document_store)
# This component will serialize the object fields as document content and add all haystack meta fields as the document metadata, serialized. 
# It will also add the original type of object for deserialization purposes by default.

indexing_pipe = Pipeline()
indexing_pipe.add_component('embedder', embedder)
indexing_pipe.add_component('writer', writer)
indexing_pipe.connect('embedder', 'writer')

places = [
    Place(name="Barcelona", description="Barcelona is a city in Catalonia, Spain."),
    Place(name="Madrid", description="Madrid is the capital of Spain.")
]

# Embed and index all documents to the document store
indexing_pipe.run(data={
    "embedder": {
        "objects": places
    }
})
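As a sketch of what ObjectWriter's serialization (described in the comments above) might produce, with json and a dataclass standing in for the real mechanism, and the type-name registry shown as one hypothetical way a reader could deserialize:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Place:
    name: str
    description: str

def serialize_for_storage(obj, meta: dict) -> dict:
    # Hypothetical ObjectWriter behavior: the object's fields become the
    # document content, metadata is carried over, and the original type
    # name is recorded so a reader can deserialize the document later.
    return {
        "content": json.dumps(asdict(obj)),
        "meta": {**meta, "object_type": type(obj).__name__},
    }

doc = serialize_for_storage(Place("Madrid", "The capital of Spain."), {"source": "memory"})

# A reader could use the stored type name to rebuild the object:
registry = {"Place": Place}
restored = registry[doc["meta"]["object_type"]](**json.loads(doc["content"]))
```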

Retrieving objects as part of RAG QA:

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.retrievers import ObjectInMemoryEmbeddingRetriever
from haystack.components.generators import Generator
from haystack.components.embedders import TextEmbedder

retriever = ObjectInMemoryEmbeddingRetriever(document_store=document_store)
embedder = TextEmbedder()
generator = Generator()

template = """
Answer the following query related to different Spanish cities based on the cities descriptions provided.

Cities:
{% for place in places %}
 - {{ place.name }}: {{ place.description }}
{% endfor %}

Query: {{ query }}

Response:
"""

prompt_builder = PromptBuilder(template=template)

# Build the pipeline
query_pipe = Pipeline()
query_pipe.add_component("embedder", embedder)
query_pipe.add_component("retriever", retriever)
query_pipe.add_component("prompt_builder", prompt_builder)
query_pipe.add_component("llm", generator)

query_pipe.connect("embedder", "retriever")
query_pipe.connect("retriever", "prompt_builder.places")
query_pipe.connect("prompt_builder", "llm")

# Test the pipeline
query = "What is the capital of Spain?"

response = query_pipe.run(data={
    "embedder": {
        "text": query
    }, 
    "prompt_builder": {
        "query": query
    }
})

Additionally, note that the retrieved Place objects still retain all the metadata contained in the documents, such as embeddings. This is a simple example that might not seem very interesting, but imagine working with much more complex data structures and managing them the same way we handle text documents: we could build some cool things.
