
ChatMessage content being str-only doesn't allow user to pass image #7848

Open
tomarharsh opened this issue Jun 12, 2024 · 7 comments · May be fixed by #7913
Labels
2.x Related to Haystack v2.0 community-triage

Comments

@tomarharsh

Is your feature request related to a problem? Please describe.
While talking to our bot, the user is allowed to send an image. This image is sent to a vision-enabled LLM. Haystack's ChatMessage class only allows a string as content, but it needs to accept a list as well. Here's the OpenAI page that Haystack refers to for content: it allows an array, and an image_url can be sent that way.
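
For reference, the OpenAI chat format the issue points to accepts content as an array of parts rather than a plain string, roughly like this sketch (the URL is a placeholder):

{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
    ]
}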

Describe the solution you'd like
ChatMessage to be able to handle inbound image

Describe alternatives you've considered
Not using the generator component at all is the only other alternative I can explore.

Additional context
Haystack's ChatMessage content: Link
OpenAI's chat message parameter: Link
How ChatMessage content is getting populated from the generator: Link

@tomarharsh tomarharsh changed the title ChatMessage content only allows content as str but to pass image we need it to be list similar to how OpenAI allows it ChatMessage content only allows content as str-only doesn't allow user to pass image Jun 12, 2024
@tomarharsh tomarharsh changed the title ChatMessage content only allows content as str-only doesn't allow user to pass image ChatMessage content being str-only doesn't allow user to pass image Jun 12, 2024
@CarlosFerLo
Contributor

I will try to add this functionality :)

@CarlosFerLo
Contributor

I've reviewed the base code and propose that we enable the 'content' of a 'ChatMessage' to be set as a list containing 'str', 'Path', or any type used to encode an image. This will require us to rewrite the 'to_openai_format' method and incorporate image processing with 'base64' for calls involving images. We'll also need to address serialization issues, but we can handle those once #7849 is merged into the main branch to avoid merge conflicts.

The main challenge will be accurately distinguishing between images and text in the input list, especially when the input is a string. It would be helpful to know which data types you want to support for images. I'll begin working on this after the mentioned PR is merged.
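
A minimal sketch of what the extended conversion could look like, assuming content is a list of str and Path; the helper name _encode_image_as_data_url and the mapping to OpenAI image_url parts are assumptions, not a final design:

import base64
import mimetypes
from pathlib import Path
from typing import List, Union

def _encode_image_as_data_url(path: Path) -> str:
    # Guess the MIME type from the file name and base64-encode the raw bytes.
    mime_type = mimetypes.guess_type(path.name)[0] or "image/png"
    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

def to_openai_format(role: str, content: Union[str, List[Union[str, Path]]]) -> dict:
    # Plain strings keep the current single-string behaviour.
    if isinstance(content, str):
        return {"role": role, "content": content}
    # Lists become OpenAI-style content parts: strings stay text,
    # Paths are read and embedded as base64 data URLs in image_url parts.
    parts: List[dict] = []
    for item in content:
        if isinstance(item, Path):
            parts.append({"type": "image_url", "image_url": {"url": _encode_image_as_data_url(item)}})
        else:
            parts.append({"type": "text", "text": item})
    return {"role": role, "content": parts}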

@lbux
Contributor

lbux commented Jun 17, 2024

> The main challenge will be accurately distinguishing between images and text in the input list, especially when the input is a string. It would be helpful to know which data types you want to support for images.

I don't think we should try to extract this info ourselves. We should make the user specify. My idea is to make a ContentPart class with type, text, image_url, base_64, and detail. We can then have helper methods in this class that help with formatting.

Essentially, we would allow for something like this:

message = ChatMessage.from_user([
    ContentPart.from_text("What’s in this image?"),
    ContentPart.from_image_url("example.com/test.jpg"),
    ContentPart.from_base64_image(base64_image)
])

We should also look into deprecating Functions and supporting Tools within ChatMessage as that has also changed.
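
A rough sketch of what such a ContentPart could look like; the class and its helpers are the proposal above, not an existing Haystack API, and in this sketch base64 images are folded into the image_url field as data URLs rather than kept in a separate base_64 field:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ContentPart:
    type: str                       # "text" or "image_url"
    text: Optional[str] = None
    image_url: Optional[str] = None
    detail: str = "auto"            # OpenAI's optional image detail hint: "auto", "low", "high"

    @classmethod
    def from_text(cls, text: str) -> "ContentPart":
        return cls(type="text", text=text)

    @classmethod
    def from_image_url(cls, url: str, detail: str = "auto") -> "ContentPart":
        return cls(type="image_url", image_url=url, detail=detail)

    @classmethod
    def from_base64_image(cls, base64_image: str, mime_type: str = "image/jpeg", detail: str = "auto") -> "ContentPart":
        # Base64 images travel as data URLs in the same image_url field.
        return cls(type="image_url", image_url=f"data:{mime_type};base64,{base64_image}", detail=detail)

    def to_openai_format(self) -> dict:
        # Produce the OpenAI-style content-part dict for this piece of the message.
        if self.type == "text":
            return {"type": "text", "text": self.text}
        return {"type": "image_url", "image_url": {"url": self.image_url, "detail": self.detail}}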

@CarlosFerLo
Contributor

CarlosFerLo commented Jun 18, 2024

I will implement this functionality. Regarding the deprecation of Functions, we could open an issue to handle it separately.

@vblagoje
Member

> [quoting @lbux's ContentPart proposal above]

I agree with this direction. We need to look at all the multimodal message formats across all LLM providers and deduce common denominators. From a cursory look, I believe these multimodal/multipart messages are all JSON payloads with various formats (schemas). So let's come up with a nice abstraction (like the ContentPart idea above) that hides the implementation details, and see how it maps to data structures across the various LLM providers.
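
To illustrate the common-denominator idea, a hedged sketch of how a single abstract image part could map to two provider payloads; the mapping functions are hypothetical and the provider schemas should be verified against their current docs:

def image_part_to_openai(base64_image: str, mime_type: str) -> dict:
    # OpenAI expects image content as an image_url part; base64 data goes in as a data URL.
    return {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{base64_image}"}}

def image_part_to_anthropic(base64_image: str, mime_type: str) -> dict:
    # Anthropic expects an image block with an explicit base64 source object.
    return {"type": "image", "source": {"type": "base64", "media_type": mime_type, "data": base64_image}}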

@silvanocerza
Contributor

We can keep it much simpler.

As of now models can receive and generate the following:

  • text
  • image
  • audio
  • video
  • heterogeneous list of all the above

We have all the necessary abstractions to define the above.
str obviously for text.
haystack.dataclasses.ByteStream for image, audio and video.
The list is List[Union[str, ByteStream]] then.

Given that, we'd say the ChatMessage.content type should be Union[str, ByteStream, List[Union[str, ByteStream]]].

This abstracts, at a high level, all the supported types of data a model receives and generates. If model X needs its input or output in a certain format, its Generator will handle the conversion, but that's an implementation detail.

Introducing new classes or new abstractions is not the way to go in my opinion.
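
As a sketch of that implementation detail, an OpenAI-style Generator could convert such a list roughly like this, assuming the ByteStream's mime_type (or a meta flag) identifies the content type and only images need special handling:

import base64
from typing import List, Union

from haystack.dataclasses import ByteStream

def content_to_openai_parts(content: Union[str, ByteStream, List[Union[str, ByteStream]]]) -> Union[str, list]:
    # A plain string stays a plain string, preserving today's behaviour.
    if isinstance(content, str):
        return content
    items = content if isinstance(content, list) else [content]
    parts = []
    for item in items:
        if isinstance(item, ByteStream):
            # Use the ByteStream's mime_type to decide how to wrap the bytes;
            # only images are handled here, embedded as base64 data URLs.
            mime_type = item.mime_type or "image/png"
            encoded = base64.b64encode(item.data).decode("utf-8")
            parts.append({"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}})
        else:
            parts.append({"type": "text", "text": item})
    return parts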

@CarlosFerLo
Contributor

@silvanocerza I like the simplicity of your solution, but I've just read the code for 'ByteStream' and we should expect the metadata to be populated with some flag indicating the content type, or else we won't be able to distinguish images from other data. That's why I believe the 'ContentPart' approach is easier to handle and lets us support broader input types for the different formats.
I will proceed with this implementation as soon as #7849 is merged to main.
