
Conversation

AdamStrojek
Contributor

This pull request addresses a significant gap in the genai crate by adding initial image support for chat requests.

Key updates:

  1. I've updated the ChatMessage structure to support ContentPart, enabling more flexibility (a short sketch of the resulting message shape follows this list).
  2. API calls have been implemented for:
    • OpenAI
    • Anthropic
    • Ollama
    • Gemini
  3. Additionally, the Gemini 2.0 Flash Experimental model has been included.
  4. An image example has been added to demonstrate this new feature.
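
For a concrete picture, here is a minimal, self-contained sketch of the message shape this enables. The type and field names (ContentPart, ImageSource, content_type) mirror the ones discussed in this PR, but they are illustrative, not the crate's exact API:

// Illustrative only: field names are assumptions, not the final genai API.
enum ImageSource {
	Url,
	Base64,
}

enum ContentPart {
	Text(String),
	Image {
		content: String,      // URL or Base64 payload
		content_type: String, // MIME type, e.g. "image/jpeg" (required by e.g. Gemini)
		source: ImageSource,
	},
}

fn main() {
	// A user message mixing text and an image URL, as a list of parts.
	let parts = vec![
		ContentPart::Text("What is in this image?".to_string()),
		ContentPart::Image {
			content: "https://example.com/cat.jpg".to_string(),
			content_type: "image/jpeg".to_string(),
			source: ImageSource::Url,
		},
	];
	println!("{} parts", parts.len());
}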

Testing and results:

I've successfully tested the API calls with various chat requests:

  • OpenAI: both URL and Base64 images were processed correctly.
  • Ollama: Base64 images worked as expected.
  • Gemini: both URL and Base64 images were handled correctly.

In all cases, the Large Language Model responded accurately to the requests. However, I encountered difficulties adhering strictly to your proposed interface due to differences in how each API handles images. As a result, I extracted a common interface that requires providing MIME/Content-Type for each image.

Future possibilities:

This interface could be utilized later to enable PDF support for the Anthropic API.

Remaining work:

Two aspects require further attention:

  1. Providing clear feedback to users that the Anthropic API does not support URL images, only Base64 images.
  2. Handling images loaded from files, which I've left commented out for now. This would introduce a point of failure where a file might not be accessible when the chat message is built, which could result in unexpected behavior. But it should be possible to implement it as TryFrom<File/PathBuf> for ContentPart, so handling this error would be on the user's side (see the sketch below).
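
A rough sketch of what that could look like, reusing the illustrative ContentPart/ImageSource shapes from the sketch above and assuming the base64 crate (0.21+); the MIME guessing and error type are placeholders, not a proposal for the final API:

use std::fs;
use std::io;
use std::path::PathBuf;

use base64::engine::general_purpose::STANDARD as BASE64;
use base64::Engine as _;

impl TryFrom<PathBuf> for ContentPart {
	type Error = io::Error;

	fn try_from(path: PathBuf) -> Result<Self, Self::Error> {
		// I/O failures (missing file, permissions) surface here, at construction time,
		// so the caller decides how to handle them.
		let bytes = fs::read(&path)?;
		// Naive MIME guess from the extension; a real version might take it as a parameter.
		let content_type = match path.extension().and_then(|e| e.to_str()) {
			Some("png") => "image/png",
			Some("jpg") | Some("jpeg") => "image/jpeg",
			_ => "application/octet-stream",
		};
		Ok(ContentPart::Image {
			content: BASE64.encode(bytes),
			content_type: content_type.to_string(),
			source: ImageSource::Base64,
		})
	}
}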

@jeremychone
Owner

First, thanks for this work.

At a quick glance, I got a compile error in the c06-image example.


Also, this example does not seem to have a correct base64 image.

Anyway, let me play with it to get a better gist of the changes.

I wanted to add it but didn’t have the time, so this is excellent.

If it aligns with the API, I will squash and merge. I might make post-commits for API changes if realignments are needed. But this looks to be a great start. Thanks.

@jeremychone
Owner

@AdamStrojek

btw, thanks for the context for ImageSource:

pub enum ImageSource {
	Url,
	Base64,

	// No `Local` variant, as this would require handling errors like "file not found", etc.
	// Such a file can easily be provided by the user as Base64; we can also implement a
	// convenient TryFrom<File> to the Base64 version. All LLMs accept local images only as Base64.
}

This could be seen as an asymmetry between a remote URL and a local file, but I understand your point that LLM services accept remote URLs and obviously cannot handle local files. Therefore, the goal is to have the genai MessageContent reflect this distinction.

I see your point.

However, providing a true local file reference at the genai level might be useful, allowing the base64 encoding to be generated only when the message is sent or received, rather than keeping it in memory during the conversation. I need to think this through.

In that vein, I’m not sure if the content should be part of ImageSource or ContentPart::Image.
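
To make the two options concrete, a small sketch of both shapes (names are illustrative; no commitment to either):

// Option A: the payload lives on ContentPart::Image; ImageSource is only a tag.
enum ImageSourceA {
	Url,
	Base64,
}
enum ContentPartA {
	Text(String),
	Image {
		content: String,
		content_type: String,
		source: ImageSourceA,
	},
}

// Option B: the payload lives inside the ImageSource variants themselves.
enum ImageSourceB {
	Url { url: String, content_type: String },
	Base64 { data: String, content_type: String },
}
enum ContentPartB {
	Text(String),
	Image(ImageSourceB),
}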

Either way, this doesn’t take away from the great work done here—this is awesome. Just making sure we consider all angles.

@AdamStrojek
Contributor Author

Hi @jeremychone,

Thank you for checking this out. You were correct that the example wasn't compiling. I fixed it and renamed it to use the c07 prefix. Also, to allow faster verification of how this code works, I added it to the Cargo examples, so it can be started with:

cargo run --example images

As you noticed, I provided a fake Base64 string in the example. I didn't want to focus on how to properly read and encode an image, because that is out of scope for this example and would require installing some extra dev-dependencies; I also didn't want to include any picture in the project (with possible licensing problems, etc.). I switched this example to a URL and use an image provided by Wikipedia, so the example should run as is, without modification.

Looking at your other commits and the issue list, I got the same feeling: there are currently more important features to be handled. Honestly, I thought it would be a harder job than it was. I was unsure where to put ContentPart; right now you can find it next to MessageContent, where the original example was. This interface still requires some polishing. As you can see, creating user messages that contain images requires a different approach than before. Nevertheless, it does its job properly. After merging this PR I can work on improving it, maybe with some chaining, e.g. ChatMessage::user("What is in image?").add_image(url1).add_text("What is new in this picture?").add_image(url2) (a rough sketch follows below).
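
A rough sketch of what such chaining could look like, assuming a ChatMessage that owns a Vec<ContentPart> and reusing the illustrative types from the PR description; the method names are only a proposal:

struct ChatMessage {
	parts: Vec<ContentPart>,
}

impl ChatMessage {
	fn user(text: impl Into<String>) -> Self {
		ChatMessage { parts: vec![ContentPart::Text(text.into())] }
	}

	fn add_text(mut self, text: impl Into<String>) -> Self {
		self.parts.push(ContentPart::Text(text.into()));
		self
	}

	fn add_image(mut self, url: impl Into<String>) -> Self {
		self.parts.push(ContentPart::Image {
			content: url.into(),
			// A real version would likely also take the MIME type as a parameter,
			// since providers need it; hardcoded here to keep the sketch short.
			content_type: "image/jpeg".to_string(),
			source: ImageSource::Url,
		});
		self
	}
}

// Usage:
// let msg = ChatMessage::user("What is in image?")
//     .add_image(url1)
//     .add_text("What is new in this picture?")
//     .add_image(url2);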

I would be cautious about including file support inside the library. As I stated in the comments, this introduces another point of failure and, more importantly, a new dependency. Implementing file handling in ContentPart would require finding a suitable base64 implementation, which can be tricky; maybe some AI provider will have special requirements? Also, what if the user wants to use a constant-time variant for security reasons? It is a good option behind a feature flag, but not in the core library. Writing such a function is not a problem anyway, and it moves error handling to the user rather than the library developer. A few errors may occur, such as missing permissions, a file that doesn't exist, or a base64 encoding error. How should these be propagated? Should they be raised during message construction or when sending the chat request?

Personally, I work with images as files very rarely. Most often, when I send a request to an LLM, I provide a URL, because the images are already uploaded and I don't need to download them. In other cases I have the image as a &[u8] with a known type, for example retrieved from another API/library. In both cases (file and &[u8]) I need to encode it using base64, but many times the image is already provided as base64.

I think a more important matter is that not every API supports both image sources and that the content type is required. After merging, I was thinking of working on proper error handling for the most common AI providers. I saw your comments that wrong types don't raise any warning or error.

If you have ideas for improvements, let me know; I would like to keep helping with this feature in the future. Unless something critical needs to be fixed, I would like to keep changes in the current pull request to a minimum.

@AdamStrojek
Contributor Author

> In that vein, I’m not sure if the content should be part of ImageSource or ContentPart::Image.
>
> Either way, this doesn’t take away from the great work done here—this is awesome. Just making sure we consider all angles.

I also had my doubts. I started from an approach where each ImageSource variant contained only the fields it needed: ImageSource::Url(String) was directly the URL, and ImageSource::Base64 had the content and the MIME type. If you look into the code, I left a comment that, unfortunately, because of how Gemini works, we need to store the MIME type for local and URL images alike. This would mean that both variants contain the same fields but with different names, which would create inconsistencies and could cause typing errors. In both cases the content is a String, unless we take a dependency on the url crate. But it is not a problem to change it.
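
For reference, the abandoned first shape looked roughly like this (sketch from memory, not the exact code):

// Abandoned first attempt: asymmetric variants.
enum ImageSource {
	Url(String),                              // just the URL
	Base64 { content: String, mime: String }, // payload plus MIME type
}
// Because Gemini needs a MIME type for URL images too, both variants end up
// carrying the same information under different names, hence the current shape.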

@jeremychone
Owner

@AdamStrojek

Thanks for the feedback.

I hear you on the file part; I might differ, but I am okay with deferring the decision for now.

So, here is the plan.

  1. Let's freeze this PR and not add anything.
  2. I will create an nx-image-support branch where I will squash and merge this PR.
  3. I will experiment with it and probably change the API.
    • For example, moving the ContentPart::Image.content to ImageSource::.. variants
    • Renaming examples
    • Aligning the API with the rest. Perhaps introducing Parts(Vec<ContentPart>) as an ergonomic holder (a tiny sketch follows this list).
    • Writing some unit tests.
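
For the Parts idea, a tiny sketch of the holder (names follow the item above; not final):

enum MessageContent {
	Text(String),
	Parts(Vec<ContentPart>),
}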

Then, I will need to decide if that warrants a genai 0.2.0 jump or not, depending on how much API change it involves.

After that, I will probably squash all my commits after yours in the nx-image-support branch and then merge it into main.

Hope this makes sense. We might differ on the File issue, but that can come later. It should not be structural to the API anyway.

Again, thanks for the great work.

@jeremychone
Owner

This PR has been merged manually at 59f0b14 (GitHub did not pick up the merge, but it was).

@AdamStrojek Thank you for the great work. I made some subsequent commits to add tests and some minor API modifications.

@jeremychone jeremychone closed this Jan 2, 2025
@jeremychone jeremychone added the PR-MERGED Added for PR that somehow did not get the Merge label Jan 2, 2025
@AdamStrojek AdamStrojek deleted the image-support branch February 11, 2025 21:37