124 changes: 124 additions & 0 deletions docs/content/docs/examples/image-input-agents.mdx
@@ -0,0 +1,124 @@
# Creating Agents with Image Input

## Overview

This guide explains how image inputs differ from text inputs and why, then walks through a basic example of an agent that takes an image as input.

## The Difference Between Text and Image Inputs

When you build an agent that takes an image as input, pass the image directly in the message content using the `image_url` type, not as an input variable. Images require special handling in the message format that regular text or JSON inputs do not.

## Why Images Are Handled Differently

[TODO: explain this better]

Traditional input variables in AnotherAI are designed for text, numbers, and structured data that can be templated into prompts using Jinja2 syntax. For example, you might have `{{ user_name }}` or `{{ email_content }}` in your prompt template.

Images, however, cannot be templated this way because:
[the following is generated by Claude; not sure if it's accurate]
- They are binary data or URLs, not text that can be inserted into a string
- AI models expect images to be provided in a specific format within the message structure
- The models need to know explicitly that they're receiving image data, not text
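
The points above can be illustrated with a small sketch. The `render` helper is a hypothetical stand-in for Jinja2-style substitution (it is not Jinja2 itself), and the URL is a placeholder. Templating a URL into the prompt only produces more text, whereas a dedicated content part carries an explicit type tag:

```python
import re


def render(template: str, variables: dict) -> str:
    """Tiny stand-in for Jinja2-style `{{ name }}` substitution (not Jinja2 itself)."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(variables[match.group(1)]),
        template,
    )


image_url = "https://example.com/cat.png"  # placeholder URL for illustration

# Templating the URL into the prompt yields a plain string: the model sees text.
as_text = {"type": "text", "text": render("Describe {{ image }}", {"image": image_url})}

# A separate content part states explicitly that the payload is an image.
as_image = {"type": "image_url", "image_url": {"url": image_url}}
```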

## Example: Image Description Agent

Let's explore a complete example that shows how to correctly pass images to an agent:

```python
import openai

# Assumes the `openai` client is already configured (API key and, if needed, base URL).


def image_description(image_url: str) -> str:
    """Return a text description of the image at `image_url`."""
    res = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """You are an image description specialist who provides detailed and accurate descriptions of images. Your task is to analyze the provided image and generate a comprehensive description that captures the key elements, context, and details visible in the image.

Your description should be:
- Clear and concise
- Factual and objective
- Detailed enough to help someone visualize the image
- Well-structured and easy to understand""",
            },
            {
                "role": "user",
                "content": [
                    # The image is passed as a content part, not as an input variable.
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    )
    # Fail loudly if the model returned an empty message.
    if not res.choices[0].message.content:
        raise ValueError("No image description found")
    return res.choices[0].message.content
```

As the example shows, an image input differs from a text input in two ways:
- The content part uses `"type": "image_url"` rather than `"type": "text"`
- The URL is nested one level down: `"image_url": {"url": image_url}`
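
The `url` field does not have to point at a hosted file. OpenAI's chat completions API also accepts base64 data URLs there. Whether AnotherAI forwards them unchanged is an assumption worth verifying before relying on it. A minimal sketch, with the hypothetical helper name `to_data_url`:

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part_from_bytes(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build an `image_url` content part from in-memory image bytes."""
    return {
        "type": "image_url",
        "image_url": {"url": to_data_url(image_bytes, mime_type)},
    }
```

The returned part can go into the `content` list in place of one built from a hosted URL.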

### Combining Image Input with Text

#### Mixing Static Text and Images

[TODO: confirm if this is correct]
You can combine text and image content in the same message. For example:

```python
{
    "role": "user",
    "content": [
        {"type": "text", "text": "How many cats are in the image?"},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
}
```
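
If several agents need the same message shape, the structure above could be wrapped in a small helper. This is only a sketch. `build_user_message` is a hypothetical name, not part of any SDK, and it places the text part before the image as in the snippet above:

```python
from typing import Optional


def build_user_message(image_url: str, text: Optional[str] = None) -> dict:
    """Build a user message with an image and, optionally, static text."""
    content = []
    if text is not None:
        # Static text goes in its own content part.
        content.append({"type": "text", "text": text})
    # The image always goes in a structured `image_url` part.
    content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"role": "user", "content": content}
```

Calling `build_user_message(url)` without text reproduces the image-only message from the description agent.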

#### Mixing Input Variables and Images

[TODO: confirm if this is correct]
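When your input includes both an image and an input variable, use Jinja2 templating in the text content and keep the image in the structured format. In the example below, `question` is supplied as a variable through `extra_body`, while `image_url` goes straight into the message: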

```python
import openai
from pydantic import BaseModel

# Assumes the `openai` client is already configured (API key and, if needed, base URL).


class ImageQuestionAnswer(BaseModel):
    answer: str


def answer_image_question(
    image_url: str,
    question: str,
) -> ImageQuestionAnswer:
    """Answer `question` about the image at `image_url` as structured output."""
    res = openai.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """You are an image analyst who provides detailed and accurate answers to questions about images. Your task is to analyze the provided image and question about the image and generate a comprehensive answer.

Your answer should be:
- Clear and concise
- Factual and objective
- Well-structured and easy to understand""",
            },
            {
                "role": "user",
                "content": [
                    # The text part keeps the template placeholder as-is.
                    {
                        "type": "text",
                        "text": "{{question}}",
                    },
                    # The image is passed directly, not as a variable.
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
        response_format=ImageQuestionAnswer,
        # Values for the template variables are sent alongside the request.
        extra_body={
            "input": {
                "variables": {
                    "question": question,
                }
            }
        },
    )
    # Fail loudly if the response could not be parsed into the model.
    if not res.choices[0].message.parsed:
        raise ValueError("No image question answer found")
    return res.choices[0].message.parsed
```
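
To check the shape of such a request without calling the API, the arguments can be assembled separately. The sketch below uses a hypothetical helper, `build_image_question_request`. It assumes, as the example above suggests, that the `{{question}}` placeholder is sent unrendered and filled in from `extra_body` on the server side:

```python
def build_image_question_request(image_url: str, question: str) -> dict:
    """Assemble keyword arguments for the completion call (no network access)."""
    return {
        "model": "gpt-4o-mini",
        "messages": [
            {
                "role": "user",
                "content": [
                    # The placeholder stays literal; it is not rendered client-side.
                    {"type": "text", "text": "{{question}}"},
                    # The image URL is embedded directly in the message.
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
        # Variable values travel next to the messages, not inside them.
        "extra_body": {"input": {"variables": {"question": question}}},
    }


kwargs = build_image_question_request(
    "https://example.com/cat.png",  # placeholder URL for illustration
    "How many cats are in the image?",
)
```

The resulting dictionary could then be unpacked into the call, for example `openai.beta.chat.completions.parse(**kwargs, response_format=ImageQuestionAnswer)`.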