feat: Add support for multimodal on ChatMessages with ContentPart #7913

Closed · wants to merge 8 commits

Conversation

@CarlosFerLo (Contributor) commented Jun 22, 2024

Related Issues

Proposed Changes:

I have added a new data class called 'ContentPart', as suggested in #7848, that holds any kind of content you might want to add to a 'ChatMessage'. For now only text, image URLs, and base64-encoded images are implemented, since the original goal of this PR is to support passing images to the OpenAI Chat API. The class supports serialization and OpenAI-specific formatting.
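For illustration, a rough sketch of the idea; the field names and method bodies below are simplified and may differ from the actual diff:

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

from haystack.dataclasses import ByteStream


@dataclass
class ContentPart:
    """Simplified sketch of the proposed content container; only one field is set per instance."""

    text: Optional[str] = None
    image_url: Optional[str] = None
    image_base64: Optional[ByteStream] = None

    def to_openai_format(self) -> Dict[str, Any]:
        # Convert to the content-part dict format the OpenAI Chat API expects.
        if self.text is not None:
            return {"type": "text", "text": self.text}
        if self.image_url is not None:
            return {"type": "image_url", "image_url": {"url": self.image_url}}
        if self.image_base64 is not None:
            # Assumption for this sketch: the ByteStream holds the base64 text itself.
            b64 = self.image_base64.data.decode("utf-8")
            return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
        raise ValueError("ContentPart holds no content")
```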

Added support for different content types on 'ChatMessage': 'str', 'ContentPart', and 'List[Union[str, ContentPart]]'. All serialization and OpenAI-specific formatting is handled.

Added support for these new 'ChatMessage' content types in 'ChatPromptBuilder'.
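Putting the pieces together, usage would look roughly like this. This assumes 'from_user' accepts the new content types and that 'ChatPromptBuilder' renders template variables inside list content, as proposed here; the URL and variable names are placeholders:

```python
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage

# ContentPart is the class sketched above (not an existing Haystack import).
# A multimodal user message mixing plain text and an image URL.
message = ChatMessage.from_user(
    [
        "Describe what you see in this picture: {{ hint }}",
        ContentPart(image_url="https://example.com/picture.jpg"),
    ]
)

builder = ChatPromptBuilder(template=[message])
result = builder.run(hint="focus on the foreground")
# result["prompt"] is a list of ChatMessage objects with the template variables rendered.
```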

How did you test it?

Added unit tests for all new functionality in both 'ChatMessage' and 'ChatPromptBuilder', and created a new file containing unit tests for 'ContentPart'.

Notes for the reviewer

  • I noticed that some of the 'from_user'-style class methods on 'ChatMessage' allow passing metadata while others don't. Should we add it to all of them?
  • I don't know whether you want me to support other content types such as audio or video.
  • 'ByteStream' does not ensure that its content is actually bytes: you can pass it a string, and it then breaks when calling the 'to_string' method. I patched it, but I believe we should think of a better solution.
  • When working with base64 images, the 'from_base64' method only accepts a 'ByteStream'; we could also make it accept 'str' and convert it to a 'ByteStream' internally.
  • The OpenAI Vision API documentation (here) says it supports formats other than JPEG, but it does not give any examples of how to encode the base64 image (see the 'to_openai_format' method of 'ContentPart' for context). I do not know how to check whether it works; one possible encoding is sketched below.
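On that last point, one common approach (untested here for non-JPEG formats, which is exactly the open question) is to build a data URL whose media type comes from the image's MIME type:

```python
import base64

from haystack.dataclasses import ByteStream


def to_openai_image_part(image: ByteStream) -> dict:
    """Encode an image ByteStream as an OpenAI vision content part.

    Falls back to image/jpeg when no mime_type is set; whether the API accepts
    every format this way is the question raised above.
    """
    mime_type = image.mime_type or "image/jpeg"
    b64_data = base64.b64encode(image.data).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{b64_data}"},
    }
```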

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes ✅
  • I added unit tests and updated the docstrings ✅
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:. ✅
  • I documented my code ✅
  • I ran pre-commit hooks and fixed any issue ✅

@github-actions github-actions bot added the topic:tests and type:documentation labels Jun 22, 2024
@CarlosFerLo CarlosFerLo marked this pull request as ready for review June 26, 2024 13:38
@CarlosFerLo CarlosFerLo requested review from a team as code owners June 26, 2024 13:38
@CarlosFerLo CarlosFerLo requested review from dfokina and Amnah199 and removed request for a team June 26, 2024 13:38
@silvanocerza (Contributor) left a comment


Please use the approach I suggested in this comment.

@coveralls (Collaborator) commented Jun 26, 2024

Pull Request Test Coverage Report for Build 9680828964

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 37 unchanged lines in 6 files lost coverage.
  • Overall coverage decreased (-0.2%) to 89.744%

Files with coverage reduction (new missed lines, resulting coverage):

  • components/evaluators/document_map.py: 1 (96.15%)
  • components/evaluators/document_mrr.py: 1 (95.45%)
  • tracing/datadog.py: 1 (94.59%)
  • components/generators/openai.py: 2 (96.34%)
  • dataclasses/chat_message.py: 4 (94.94%)
  • components/builders/chat_prompt_builder.py: 28 (77.05%)
Totals:
Change from base Build 9615573436: -0.2%
Covered Lines: 6843
Relevant Lines: 7625

💛 - Coveralls

@CarlosFerLo (Contributor, Author)

Hi @silvanocerza,

Your comment for reference:

We can keep it much simpler.

As of now models can receive and generate the following:

  • text
  • image
  • audio
  • video
  • heterogeneous list of all the above

We have all the necessary abstractions to define the above. str obviously for text. haystack.dataclasses.ByteStream for image, audio and video. The list is List[Union[str, ByteStream]] then.

Given that, we say that the ChatMessage.content type should be Union[str, ByteStream, List[Union[str, ByteStream]]].

This abstracts at a high level all the supported types of data a model receives and generates. If model X needs its input, or generates its output, in a certain format, its Generator will handle the conversion, but that's an implementation detail.

Introducing new classes or new abstractions is not the way to go in my opinion.

I replied to your comment a while ago raising my concern about how to tell which data type a given string or 'ByteStream' holds. How do you suggest we solve that?

I'm not a very experienced coder, so I do not know how to implement this with your approach. If you are willing to give me some guidance, I have no problem implementing your solution; as I said, it results in a simpler API, but I cannot think of a way to implement it.

I will tag other members of our initial conversation so they can give their opinion on the matter.
@lbux @vblagoje

@silvanocerza (Contributor)

@CarlosFerLo
ByteStream was designed exactly to solve the problem of handling multimodality in Haystack. It's generic enough to store any kind of binary data. The mime_type field can be used to know which kind of data we're dealing with, whether it's an image, audio, video, or anything else.

The idea is to keep the abstraction level higher than what is proposed in this PR.

If some Generator needs to send the ByteStream data encoded one way or another, it will handle that internally; there's no need to add another abstraction for that.
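To make that concrete, a rough sketch of what such a conversion inside a Generator could look like (illustrative only; this is not what any Generator currently does):

```python
import base64
from typing import List, Union

from haystack.dataclasses import ByteStream


def _to_openai_content(
    content: Union[str, ByteStream, List[Union[str, ByteStream]]]
) -> Union[str, list]:
    # Plain text messages stay as-is.
    if isinstance(content, str):
        return content
    parts = content if isinstance(content, list) else [content]
    openai_parts = []
    for part in parts:
        if isinstance(part, str):
            openai_parts.append({"type": "text", "text": part})
        elif (part.mime_type or "").startswith("image/"):
            # Dispatch on mime_type: images become base64 data URLs.
            b64 = base64.b64encode(part.data).decode("utf-8")
            openai_parts.append(
                {"type": "image_url", "image_url": {"url": f"data:{part.mime_type};base64,{b64}"}}
            )
        else:
            raise ValueError(f"Unsupported mime_type: {part.mime_type}")
    return openai_parts
```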

@CarlosFerLo (Contributor, Author)

Okay, I did not know what that field in ByteStream was intended for. Just one last question: how can we represent image URLs? If we go with a plain string, it is more natural, but we might run into problems differentiating URLs from raw text; if we go with a ByteStream, I believe we would be using a tool for something it wasn't meant for. I will close this PR and start implementing your solution right away.

@silvanocerza (Contributor)

ChatMessage has a meta field; that information could be stored there too if necessary.
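For example, something like this (the 'image_urls' meta key is made up for illustration; nothing prescribes that name):

```python
from haystack.dataclasses import ChatMessage, ChatRole

# Carry the image URL alongside the text via meta; a Generator that supports
# vision input could look for this (hypothetical) key and build the request accordingly.
message = ChatMessage(
    content="What is shown in this picture?",
    role=ChatRole.USER,
    name=None,
    meta={"image_urls": ["https://example.com/picture.png"]},
)
```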

@CarlosFerLo (Contributor, Author)

@silvanocerza What if we add a prefix or something to the URLs, to avoid duplicating information?

@silvanocerza (Contributor)

Like http? 😁

@CarlosFerLo (Contributor, Author)

Like an 'image_url' prefix. 'http' may appear in regular text. We could find a prefix that is unusual and, together with the URL structure, use it to differentiate between URLs and text.

Labels
topic:tests, type:documentation
Development

Successfully merging this pull request may close these issues.

ChatMessage content being str-only doesn't allow user to pass image
3 participants