516 changes: 516 additions & 0 deletions docs/en/Components/AgentSkills.md

Large diffs are not rendered by default.

18 changes: 18 additions & 0 deletions docs/en/Components/Config.md
@@ -106,6 +106,24 @@ tools:

For the complete list of supported tools and custom tools, please refer to [here](./Tools.md)

## Skills Configuration

> Optional, used when enabling Agent Skills

```yaml
skills:
  # Path to skills directory or ModelScope repo ID
  path: /path/to/skills
  # Whether to auto-execute skills (default: True)
  auto_execute: true
  # Working directory for outputs
  work_dir: /path/to/workspace
  # Whether to use Docker sandbox for execution (default: True)
  use_sandbox: false
```

For the complete skill module documentation (including architecture, directory structure, API reference, and security mechanisms), see [Agent Skills](./AgentSkills).
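
To illustrate how the documented defaults interact with a partial `skills` section, here is a minimal sketch. `SkillsSettings` and `parse_skills_section` are hypothetical names, not ms-agent APIs. The defaults for `auto_execute` and `use_sandbox` come from the comments above, while the `None` default for `work_dir` is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SkillsSettings:
    """Hypothetical container mirroring the `skills` keys shown above."""
    path: str
    auto_execute: bool = True       # documented default
    work_dir: Optional[str] = None  # assumption: no default is documented
    use_sandbox: bool = True        # documented default


def parse_skills_section(section: Dict[str, Any]) -> SkillsSettings:
    """Fill in defaults for any keys the user left out."""
    return SkillsSettings(**section)


settings = parse_skills_section({"path": "/path/to/skills", "use_sandbox": False})
print(settings)
```

Only `path` and `use_sandbox` are supplied here, so `auto_execute` falls back to its default of `True`.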

## Memory Compression Configuration

> Optional, for context management in long conversations
299 changes: 299 additions & 0 deletions docs/en/Components/MultimodalSupport.md
@@ -0,0 +1,299 @@
---
slug: MultimodalSupport
title: Multimodal Support
description: Ms-Agent multimodal conversation guide - image understanding and analysis configuration and usage.
---

# Multimodal Support

This document describes how to use ms-agent for multimodal conversations, including image understanding and analysis capabilities.

## Overview

ms-agent supports multimodal models such as Alibaba Cloud's `qwen3.5-plus`. Multimodal models can:
- Analyze image content
- Recognize objects, scenes, and text in images
- Engage in conversations based on image content

## Prerequisites

### 1. Install Dependencies

Ensure the required packages are installed:

```bash
pip install openai
```

### 2. Configure API Key

Using `qwen3.5-plus` as an example, obtain a DashScope API key and set it as an environment variable:

```bash
export DASHSCOPE_API_KEY='your-dashscope-api-key'
```

Or set `dashscope_api_key` directly in the configuration file.

## Configure Multimodal Models

Multimodal functionality depends on two factors:
1. **Choose a model that supports multimodal input** (e.g. `qwen3.5-plus`)
2. **Use the correct message format** (containing `image_url` blocks)
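
The second point can be made concrete with a small helper. This is a sketch only: `build_multimodal_content` is a hypothetical name, not part of ms-agent, and it builds the OpenAI-style content list from plain dictionaries.

```python
from typing import Dict, List, Union

ContentBlock = Dict[str, Union[str, Dict[str, str]]]


def build_multimodal_content(text: str, image_urls: List[str]) -> List[ContentBlock]:
    """Return one text block followed by one image_url block per URL.

    Hypothetical helper for illustration; not an ms-agent API.
    """
    content: List[ContentBlock] = [{"type": "text", "text": text}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content


content = build_multimodal_content(
    "Please describe this image.",
    ["https://example.com/image.jpg"],
)
print(content)
```

The resulting list can be passed as the `content` of a user message. The URLs may be `https://` links or `data:` URLs.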

You can dynamically modify the model configuration in code on top of an existing config:

```python
from ms_agent.config import Config
from ms_agent import LLMAgent
import os

# Use an existing configuration file (e.g. ms_agent/agent/agent.yaml)
config = Config.from_task('ms_agent/agent/agent.yaml')

# Override configuration for multimodal model
config.llm.model = 'qwen3.5-plus'
config.llm.service = 'dashscope'
config.llm.dashscope_api_key = os.environ.get('DASHSCOPE_API_KEY', '')
config.llm.modelscope_base_url = 'https://dashscope.aliyuncs.com/compatible-mode/v1'

# Create LLMAgent
agent = LLMAgent(config=config)
```

## Using LLMAgent for Multimodal Conversations

Using `LLMAgent` for multimodal conversations is recommended, as it provides more complete features including memory management, tool calling, and callback support.

### Basic Usage

```python
import asyncio
import os
from ms_agent import LLMAgent
from ms_agent.config import Config
from ms_agent.llm.utils import Message

async def multimodal_chat():
    # Create configuration
    config = Config.from_task('ms_agent/agent/agent.yaml')
    config.llm.model = 'qwen3.5-plus'
    config.llm.service = 'dashscope'
    config.llm.dashscope_api_key = os.environ.get('DASHSCOPE_API_KEY', '')
    config.llm.modelscope_base_url = 'https://dashscope.aliyuncs.com/compatible-mode/v1'

    # Create LLMAgent
    agent = LLMAgent(config=config)

    # Build multimodal message
    multimodal_content = [
        {"type": "text", "text": "Please describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
    ]

    # Call the agent
    response = await agent.run(messages=[Message(role="user", content=multimodal_content)])
    print(response[-1].content)

asyncio.run(multimodal_chat())
```

### Non-Stream Mode

```python
# Disable stream in configuration
config.generation_config.stream = False

agent = LLMAgent(config=config)

multimodal_content = [
{"type": "text", "text": "Please describe this image."},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

# Non-stream mode: returns complete response directly
response = await agent.run(messages=[Message(role="user", content=multimodal_content)])
print(f"[Response] {response[-1].content}")
print(f"[Token Usage] Input: {response[-1].prompt_tokens}, Output: {response[-1].completion_tokens}")
```

### Stream Mode

```python
# Enable stream in configuration
config.generation_config.stream = True

agent = LLMAgent(config=config)

multimodal_content = [
{"type": "text", "text": "Please describe this image."},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

# Stream mode: returns an async generator
generator = await agent.run(
messages=[Message(role="user", content=multimodal_content)],
stream=True
)

full_response = ""
async for response_chunk in generator:
    if response_chunk and len(response_chunk) > 0:
        last_msg = response_chunk[-1]
        if last_msg.content:
            # Stream output of new content
            print(last_msg.content[len(full_response):], end='', flush=True)
            full_response = last_msg.content

print(f"\n[Full Response] {full_response}")
```
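
The slicing in the loop above implies that each chunk carries the full text generated so far, not just the newly added piece. The sketch below isolates that delta logic so it runs without an API key. `fake_stream` is a hypothetical stand-in for the agent's stream and yields plain strings instead of message lists.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for the agent stream: yields cumulative text."""
    for snapshot in ["The", "The image", "The image shows a cat."]:
        yield snapshot


async def collect_deltas(stream: AsyncIterator[str]) -> List[str]:
    """Convert cumulative snapshots into the newly added pieces."""
    deltas: List[str] = []
    seen = ""
    async for snapshot in stream:
        deltas.append(snapshot[len(seen):])  # only the part not yet seen
        seen = snapshot
    return deltas


deltas = asyncio.run(collect_deltas(fake_stream()))
print(deltas)
```

Joining the deltas reproduces the final snapshot, which is why the loop can print incrementally and still end up with the complete response in `full_response`.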

### Multi-Turn Conversations

LLMAgent supports multi-turn conversations, allowing you to mix images and text:

```python
agent = LLMAgent(config=config, tag="multimodal_conversation")

# Turn 1: Send an image
multimodal_content = [
{"type": "text", "text": "How many people are in this image?"},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

messages = [Message(role="user", content=multimodal_content)]
response = await agent.run(messages=messages)
print(f"[Turn 1 Response] {response[-1].content}")

# Turn 2: Follow-up question (text only, preserving context)
messages = response # Use previous response as context
messages.append(Message(role="user", content="What are they doing?"))
response = await agent.run(messages=messages)
print(f"[Turn 2 Response] {response[-1].content}")
```

## Multimodal Message Format

ms-agent uses the OpenAI-compatible multimodal message format. Images can be provided in three ways:

### 1. Image URL

```python
from ms_agent.llm.utils import Message

multimodal_content = [
{"type": "text", "text": "Please describe this image."},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

messages = [
Message(role="user", content=multimodal_content)
]

# `llm` is assumed to be an already-initialized ms-agent LLM instance
response = llm.generate(messages=messages)
```

### 2. Base64 Encoding

```python
import base64

# Read and encode the image
with open('image.jpg', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

multimodal_content = [
    {"type": "text", "text": "What is this?"},
    {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{image_data}"
        }
    }
]

messages = [Message(role="user", content=multimodal_content)]
response = llm.generate(messages=messages)
```

### 3. Local File Path

```python
import base64
import os

image_path = 'path/to/image.png'

# Get MIME type
ext = os.path.splitext(image_path)[1].lower()
mime_type = {
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.jpeg': 'image/jpeg',
    '.gif': 'image/gif',
    '.webp': 'image/webp'
}.get(ext, 'image/png')

# Read and encode
with open(image_path, 'rb') as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

multimodal_content = [
    {"type": "text", "text": "Describe this image."},
    {
        "type": "image_url",
        "image_url": {
            "url": f"data:{mime_type};base64,{image_data}"
        }
    }
]

messages = [Message(role="user", content=multimodal_content)]
response = llm.generate(messages=messages)
```
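
The Base64 and local-file approaches share the same encoding step, so it can be factored into a reusable function. The following is a sketch under stated assumptions: `bytes_to_data_url` and `image_to_data_url` are hypothetical helpers, not ms-agent APIs. The sketch swaps the hand-written extension map for the standard library's `mimetypes` module, which may not recognize newer types such as WebP on older Python versions.

```python
import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(data: bytes, filename: str, default_mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URL, guessing the MIME type from the filename."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = default_mime  # same fallback as the example above
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def image_to_data_url(image_path: str) -> str:
    """Read a local image file and return it as a data URL."""
    return bytes_to_data_url(Path(image_path).read_bytes(), image_path)
```

The returned string goes into the `url` field of an `image_url` block, for example `{"type": "image_url", "image_url": {"url": image_to_data_url("path/to/image.png")}}`.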

## Running Examples

### Running the Agent Example

```bash
# Run the complete test suite (including stream and non-stream modes)
python examples/agent/test_llm_agent_multimodal.py
```

## FAQ

### Q: Are there image size limits?

A: Yes, and the limits vary by model:
- `qwen3.5-plus`: recommended image size under 4 MB
- Recommended resolution: no larger than 2048x2048

### Q: What image formats are supported?

A: Commonly supported formats:
- JPEG / JPG
- PNG
- GIF
- WebP
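
The size and format recommendations in the two answers above can be checked before an image is sent. This is a minimal sketch: `check_image` is a hypothetical helper, and the 4 MB threshold is taken from the `qwen3.5-plus` recommendation, so other models may need a different value. The resolution check is omitted because it would require an image library.

```python
import os
from typing import List

MAX_IMAGE_BYTES = 4 * 1024 * 1024  # "under 4MB" recommendation; adjust per model
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def check_image(filename: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the image passes both checks."""
    problems: List[str] = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes >= MAX_IMAGE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

For a file on disk, call it as `check_image(path, os.path.getsize(path))`.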

### Q: Can I send multiple images at once?

A: Yes, you can add multiple `image_url` blocks in a single message:

```python
multimodal_content = [
{"type": "text", "text": "Compare these two images."},
{"type": "image_url", "image_url": {"url": "https://example.com/img1.jpg"}},
{"type": "image_url", "image_url": {"url": "https://example.com/img2.jpg"}}
]
```

### Q: Is streaming output supported?

A: Yes, multimodal conversations support streaming output. Enable `stream` in the generation config and pass `stream=True` when calling the model:

```python
config.generation_config.stream = True
response = llm.generate(messages=messages, stream=True)
```
4 changes: 3 additions & 1 deletion docs/en/index.rst
@@ -20,14 +20,16 @@ MS-Agent DOCUMENTATION
Components/LLMAgent
Components/Workflow
Components/SupportedModels
Components/MultimodalSupport
Components/Tools
Components/AgentSkills
Components/ContributorGuide

.. toctree::
:maxdepth: 2
:caption: 📁 Projects

Projects/AgentSkills
Projects/CodeGenesis
Projects/DeepResearch
Projects/FinResearch
Projects/VideoGeneration