diff --git a/docs/content/docs/examples/image-input-agents.mdx b/docs/content/docs/examples/image-input-agents.mdx
new file mode 100644
index 00000000..d1f1a64c
--- /dev/null
+++ b/docs/content/docs/examples/image-input-agents.mdx
@@ -0,0 +1,124 @@
+# Creating Agents with Image Input
+
+## Overview
+
+This guide explains how image inputs differ from text inputs, gives context for why these differences exist, and shows a basic example of an agent with image input.
+
+## The Difference Between Text and Image Inputs
+
+When building agents that take an image as input, pass the image directly in the message content using the `image_url` type, not as an input variable. This is because images require special handling in the message format that differs from regular text or JSON inputs.
+
+## Why Images Are Handled Differently
+
+[TODO: explain this better]
+
+Traditional input variables in AnotherAI are designed for text, numbers, and structured data that can be templated into prompts using Jinja2 syntax. For example, you might have `{{ user_name }}` or `{{ email_content }}` in your prompt template.
+
+Images, however, cannot be templated this way because:
+[the following is generated by Claude; not sure if it's accurate]
+- They are binary data or URLs, not text that can be inserted into a string
+- AI models expect images to be provided in a specific format within the message structure
+- The models need to know explicitly that they're receiving image data, not text
+
+## Example: Image Description Agent
+
+Let's explore a complete example that shows how to correctly pass images to an agent:
+
+```python
+def image_description(image_url: str) -> str:
+    res = openai.chat.completions.create(
+        model="gpt-4o-mini",
+        messages=[
+            {
+                "role": "system",
+                "content": """You are an image description specialist who provides detailed and accurate descriptions of images. Your task is to analyze the provided image and generate a comprehensive description that captures the key elements, context, and details visible in the image.
+
+                Your description should be:
+                - Clear and concise
+                - Factual and objective
+                - Detailed enough to help someone visualize the image
+                - Well-structured and easy to understand""",
+            },
+            {
+                "role": "user",
+                "content": [
+                    {"type": "image_url", "image_url": {"url": image_url}},
+                ],
+            },
+        ],
+    )
+    if not res.choices[0].message.content:
+        raise ValueError("No image description found")
+    return res.choices[0].message.content
+```
+
+As you can see, images differ from text inputs:
+- For images, use `type: "image_url"`
+- The URL is nested: `image_url: {"url": image_url}`
+
+### Combining Image Input with Text
+
+#### Mixing Static Text and Images
+
+[TODO: confirm if this is correct]
+You can combine text and image content in the same message. For example:
+
+```python
+{
+    "role": "user",
+    "content": [
+        {"type": "text", "text": "How many cats are in the image?"},
+        {"type": "image_url", "image_url": {"url": image_url}}
+    ],
+}
+```
+
+#### Mixing Input Variables and Images
+
+[TODO: confirm if this is correct]
+When your input includes both images and an input variable, you can use Jinja2 templating in text content while keeping images in the structured format. For example:
+
+```python
+class ImageQuestionAnswer(BaseModel):
+    answer: str
+
+def answer_image_question(
+    image_url: str,
+    question: str,
+) -> ImageQuestionAnswer:
+    res = openai.beta.chat.completions.parse(
+        model="gpt-4o-mini",
+        messages=[
+            {
+                "role": "system",
+                "content": """You are an image analyst who provides detailed and accurate answers to questions about images. Your task is to analyze the provided image and question about the image and generate a comprehensive answer.
+
+                Your answer should be:
+                - Clear and concise
+                - Factual and objective
+                - Well-structured and easy to understand""",
+            },
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "text",
+                        "text": "{{question}}"
+                    },
+                    {"type": "image_url", "image_url": {"url": image_url}}
+                ]
+            }
+        ],
+        response_format=ImageQuestionAnswer,
+        extra_body={
+            "input": {
+                "variables": {
+                    "question": question,
+                }
+            }
+        },
+    )
+    if not res.choices[0].message.parsed:
+        raise ValueError("No image question answer found")
+    return res.choices[0].message.parsed
+```
\ No newline at end of file
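
A possible follow-up for the guide: every example in this patch passes a hosted image URL, and the user `content` list is written out by hand each time. The sketch below shows one way the same message shape could be built by a helper, including for a local file. It assumes the endpoint accepts base64 data URLs in `image_url` the way the OpenAI Chat Completions API does; whether AnotherAI does the same is not confirmed by this diff. The names `to_data_url` and `build_user_content` are hypothetical and do not appear in the patch.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Any, Optional


def to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png".
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_user_content(image_url: str, text: Optional[str] = None) -> list[dict[str, Any]]:
    """Build the user message `content` list used in the guide's examples.

    `text` may be static text or a template such as "{{question}}";
    this helper does not render templates itself.
    """
    content: list[dict[str, Any]] = []
    if text is not None:
        content.append({"type": "text", "text": text})
    # The URL is nested under "image_url", matching the examples in the patch.
    content.append({"type": "image_url", "image_url": {"url": image_url}})
    return content
```

With these helpers, the user message in `answer_image_question` could be written as `{"role": "user", "content": build_user_content(image_url, "{{question}}")}`, and a local file could be passed as `build_user_content(to_data_url("photo.png"))`, where `photo.png` is a placeholder path.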