improvement: new image interface modeled after VLM interface#304

Draft
makiroll1125 wants to merge 1 commit into V1.3.3 from improvement/generate-image

@makiroll1125
Collaborator

Image Generation Feature — Code Overview

This document explains the primary function of each changed file in the improvement/generate-image branch. The feature adds image generation as a first-class capability alongside CraftBot's existing LLM and VLM support, following the same layered architecture.


Architecture at a Glance

The feature is structured in three layers:

agent_core (provider-agnostic library)
  └── InterfaceType.IMAGE_GEN
  └── MODEL_REGISTRY entries
  └── ModelFactory support
  └── ImageGenInterface (core engine)

app (CraftBot application wrappers)
  └── app/image_gen_interface.py  — hooks into CraftBot state
  └── app/config.py               — reads settings.json
  └── app/agent_base.py           — lifecycle management
  └── app/internal_action_interface.py — action entry point
  └── app/data/action/generate_image.py — the callable action

UI (frontend settings)
  └── modelSettingsSlice.ts / selectors / ModelSettings.tsx / model_settings.py / browser_adapter.py

The pattern is identical to how VLM works. Every file has a VLM counterpart; image gen simply adds a parallel track.


agent_core Layer

agent_core/core/models/types.py

Defines the InterfaceType enum. A single value was added:

IMAGE_GEN = "image_gen"

This enum is the key used everywhere (registry lookups, factory dispatch, validation) to refer to the image generation interface type.

agent_core/core/models/model_registry.py

A dictionary mapping provider → InterfaceType → default model name. Two entries were added for IMAGE_GEN:

  • openai → "gpt-image-2"
  • gemini → "gemini-3.1-flash-image-preview"
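
A minimal sketch of how the enum and registry described above might fit together. Only `IMAGE_GEN = "image_gen"` and the two default model names come from this PR; the `LLM` and `VLM` values, the `text_only_provider` entry, and the `default_model` helper are assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class InterfaceType(str, Enum):
    # Only IMAGE_GEN is confirmed by the PR; the LLM and VLM values are assumed.
    LLM = "llm"
    VLM = "vlm"
    IMAGE_GEN = "image_gen"


# provider -> InterfaceType -> default model name
MODEL_REGISTRY: dict[str, dict[InterfaceType, str]] = {
    "openai": {InterfaceType.IMAGE_GEN: "gpt-image-2"},
    "gemini": {InterfaceType.IMAGE_GEN: "gemini-3.1-flash-image-preview"},
    # Hypothetical provider with no image generation entry.
    "text_only_provider": {},
}


def default_model(provider: str, interface: InterfaceType) -> Optional[str]:
    """Return the registry default, or None when the combination is unsupported."""
    return MODEL_REGISTRY.get(provider, {}).get(interface)
```

A `None` result is what the factory guard and the action's provider check would key off.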

agent_core/core/models/factory.py

ModelFactory.create() gained a small guard: if the registry returns None for the requested interface and provider combination, and the caller has not opted into deferred initialization, it raises a clear ValueError listing the supported providers instead of failing later with a confusing error.

No special image-gen code path was needed, because the existing OpenAI and Gemini client construction already covers it.
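
The guard can be sketched roughly as follows. This is not the actual `ModelFactory.create()` body: the function name `resolve_default_model`, the `deferred` flag name, the string keys, and the error wording are all assumptions.

```python
from typing import Optional

# Simplified registry with string keys; "text_only_provider" is hypothetical.
REGISTRY = {
    "openai": {"image_gen": "gpt-image-2"},
    "gemini": {"image_gen": "gemini-3.1-flash-image-preview"},
    "text_only_provider": {},
}


def resolve_default_model(provider: str, interface: str, deferred: bool = False) -> Optional[str]:
    """Look up the default model, failing early unless init is deferred."""
    model = REGISTRY.get(provider, {}).get(interface)
    if model is None and not deferred:
        # List every provider that does have an entry for this interface.
        supported = sorted(p for p, entries in REGISTRY.items() if interface in entries)
        raise ValueError(
            f"Provider '{provider}' does not support '{interface}'. "
            f"Supported providers: {', '.join(supported)}"
        )
    return model
```

Raising at lookup time keeps the failure close to the misconfiguration rather than surfacing it at the first API call.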

agent_core/core/impl/image_gen/interface.py (new)

The core engine. This is the most substantial new file. It:

  • Accepts provider, model, api_key, base_url, and optional hooks for token counting and usage reporting (same constructor signature as VLMInterface)
  • Initializes via ModelFactory.create() (same as VLM)
  • Dispatches to _openai_generate() or _gemini_generate() based on provider
  • Handles resolution (1K/2K/4K), aspect ratio, negative prompts, reference images, and safety filters
  • Saves output images
  • Provides reinitialize(), which swaps providers at runtime without re-creating the object

VLMInterface has an identical structure: ModelFactory initialization, provider dispatch, sync and async public methods, and the same hooks. The main difference is the operation (describing images vs. generating them).
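
To illustrate the dispatch and `reinitialize()` shape only, here is a stand-in class that makes no API calls. The method names `_openai_generate`, `_gemini_generate`, `generate_image`, and `reinitialize` come from this PR; the class name, the bodies, and the returned file names are placeholders.

```python
from typing import Optional


class ImageGenSketch:
    """Stand-in showing provider dispatch; no real provider is contacted."""

    def __init__(self, provider: str, model: Optional[str] = None):
        self.reinitialize(provider, model)

    def reinitialize(self, provider: str, model: Optional[str] = None) -> None:
        # Swap provider in place so callers keep the same object.
        if provider not in ("openai", "gemini"):
            raise ValueError(f"Unsupported image generation provider: {provider}")
        self.provider = provider
        self.model = model

    def generate_image(self, prompt: str, **options) -> list:
        # Route to the provider-specific implementation.
        if self.provider == "openai":
            return self._openai_generate(prompt, **options)
        return self._gemini_generate(prompt, **options)

    def _openai_generate(self, prompt: str, **options) -> list:
        return ["openai_output.png"]  # placeholder path

    def _gemini_generate(self, prompt: str, **options) -> list:
        return ["gemini_output.png"]  # placeholder path
```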

agent_core/core/impl/image_gen/__init__.py and agent_core/core/image_gen_interface.py (new)

__init__.py exports ImageGenInterface from the impl module; image_gen_interface.py re-exports it at the agent_core.core level. Identical in structure to the VLM equivalents (vlm_interface.py, impl/vlm/__init__.py).

agent_core/__init__.py

Added ImageGenInterface to the package's public exports (__all__), making it importable as from agent_core import ImageGenInterface. Same treatment as VLMInterface.


app Layer

app/image_gen_interface.py (new)

CraftBot-specific subclass of the core ImageGenInterface. Its sole job is to inject the three CraftBot state hooks at construction time:

  • get_token_count / set_token_count — persist per-session token usage to the STATE singleton
  • report_usage — emit usage events to the billing/usage reporter

The core ImageGenInterface knows nothing about CraftBot's state system; this wrapper bridges that gap. Mirrors app/vlm_interface.py exactly.
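
A rough sketch of the hook injection, assuming the core class accepts the three hooks as optional callables. `CoreImageGen`, `AppImageGen`, `_State`, `USAGE_EVENTS`, and `_record_usage` are hypothetical stand-ins, not names from the codebase.

```python
class CoreImageGen:
    """Stand-in for the core interface: it only knows about callables."""

    def __init__(self, provider, get_token_count=None, set_token_count=None, report_usage=None):
        self.provider = provider
        self._get_token_count = get_token_count or (lambda: 0)
        self._set_token_count = set_token_count or (lambda count: None)
        self._report_usage = report_usage or (lambda event: None)

    def _record_usage(self, tokens: int) -> None:
        # Hypothetical helper: accumulate tokens and emit one usage event.
        self._set_token_count(self._get_token_count() + tokens)
        self._report_usage({"provider": self.provider, "tokens": tokens})


class _State:
    token_count = 0


STATE = _State()      # stand-in for the application's STATE singleton
USAGE_EVENTS = []     # stand-in for the usage reporter


class AppImageGen(CoreImageGen):
    """Application wrapper: its only job is to wire in the state hooks."""

    def __init__(self, provider: str):
        super().__init__(
            provider,
            get_token_count=lambda: STATE.token_count,
            set_token_count=lambda count: setattr(STATE, "token_count", count),
            report_usage=USAGE_EVENTS.append,
        )
```

Keeping the hooks as plain callables is what lets the core library stay free of any import from the application.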

app/config.py

Two new config accessors added:

  • get_image_gen_provider() — reads model.image_gen_provider from settings.json (default: "openai")
  • get_image_gen_model() — reads model.image_gen_model (optional override; None means use registry default)

The default settings dict also gets both keys. Pattern is identical to get_vlm_provider() / get_vlm_model().
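
The two accessors might look roughly like this. The real functions presumably load settings.json themselves; passing the parsed settings in as a dict is an assumption made here to keep the sketch self-contained.

```python
from typing import Any, Optional

# Assumed shape of the defaults, based on the key names in the description.
DEFAULT_SETTINGS = {"model": {"image_gen_provider": "openai", "image_gen_model": None}}


def get_image_gen_provider(settings: dict[str, Any]) -> str:
    """Read model.image_gen_provider, falling back to "openai"."""
    return settings.get("model", {}).get("image_gen_provider") or "openai"


def get_image_gen_model(settings: dict[str, Any]) -> Optional[str]:
    """Read model.image_gen_model; None means use the registry default."""
    return settings.get("model", {}).get("image_gen_model") or None
```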

app/agent_base.py

Manages the lifecycle of the ImageGenInterface instance for a running agent:

  • Constructor: reads image_gen_provider and image_gen_model from config, creates ImageGenInterface with deferred=True (doesn't hit the API until first use), passes it to InternalActionInterface.initialize()
  • reinitialize_image_gen(): creates a fresh ImageGenInterface instance and atomically replaces both self.image_gen and InternalActionInterface.image_gen_interface. The fresh-instance approach means that any in-flight actions holding a reference to the old instance complete cleanly. Mirrors the pattern used by the existing reinitialize_llm().
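
The fresh-instance swap can be sketched as below. `FakeImageGen`, `ActionRegistry`, and `AgentSketch` are hypothetical stand-ins; only the `reinitialize_image_gen(...) -> bool` shape is taken from the diff. Note that the two assignments in this sketch are sequential, so any stronger atomicity guarantee would need locking that is not shown here.

```python
class FakeImageGen:
    """Stand-in interface that fails construction for unknown providers."""

    def __init__(self, provider: str):
        if provider not in ("openai", "gemini"):
            raise ValueError(f"Unsupported provider: {provider}")
        self.provider = provider


class ActionRegistry:
    """Stand-in for the class-level registry that actions read from."""

    image_gen_interface = None


class AgentSketch:
    def __init__(self, provider: str):
        self.image_gen = FakeImageGen(provider)
        ActionRegistry.image_gen_interface = self.image_gen

    def reinitialize_image_gen(self, provider: str) -> bool:
        # Build the replacement first; only swap if construction succeeded.
        try:
            fresh = FakeImageGen(provider)
        except ValueError:
            return False
        self.image_gen = fresh
        ActionRegistry.image_gen_interface = fresh
        return True
```

Because the old object is never mutated, a caller that grabbed it before the swap keeps a consistent provider until it finishes.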

app/internal_action_interface.py

The shared class-level registry that actions use to reach CraftBot services without importing the full agent. Two changes:

  • Added image_gen_interface: Optional[ImageGenInterface] class variable
  • Added generate_image(**kwargs) classmethod that delegates to cls.image_gen_interface.generate_image(**kwargs)

Pattern is identical to how describe_image() is wired through the VLM interface.
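
A minimal sketch of the class-level delegation. The class variable and the `generate_image(**kwargs)` classmethod follow the description above; the `None` check and the `FakeImageGen` stand-in are assumptions.

```python
from typing import Any, Optional


class InternalActionInterfaceSketch:
    # Set once by the agent at startup and replaced on reinitialize.
    image_gen_interface: Optional[Any] = None

    @classmethod
    def generate_image(cls, **kwargs):
        if cls.image_gen_interface is None:
            # Assumed behavior: fail loudly if the agent never registered one.
            raise RuntimeError("Image generation interface is not initialized")
        return cls.image_gen_interface.generate_image(**kwargs)


class FakeImageGen:
    """Stand-in that echoes the prompt as a file name."""

    def generate_image(self, **kwargs):
        return [f"{kwargs['prompt']}.png"]
```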

app/data/action/generate_image.py

The @action-decorated function that the agent calls. It:

  1. Returns early in simulated_mode (for tests)
  2. Checks the registry to confirm the current provider supports image generation — returns a user-friendly error if not
  3. Validates that prompt is non-empty
  4. Delegates to InternalActionInterface.generate_image() with normalized parameters
  5. Returns a {"status": "success", "image_paths": [...]} dict

Follows the same pattern as app/data/action/describe_image.py (the VLM equivalent): thin action wrapper, provider guard, delegate to interface, return dict.
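
The five steps above can be sketched as a plain function. The success dict shape comes from the description; the error dict shape, the parameter names, and passing the provider, registry, and generator in as arguments (instead of using the @action decorator and InternalActionInterface) are assumptions.

```python
from typing import Callable


def generate_image_action(input_data: dict, *, provider: str, registry: dict,
                          generate: Callable[..., list]) -> dict:
    # 1. Short-circuit in simulated mode so tests never hit a provider.
    if input_data.get("simulated_mode"):
        return {"status": "success", "image_paths": []}

    # 2. Provider guard: the registry must list image_gen for this provider.
    if "image_gen" not in registry.get(provider, {}):
        return {"status": "error",
                "message": f"Provider '{provider}' does not support image generation."}

    # 3. Validate the prompt.
    prompt = (input_data.get("prompt") or "").strip()
    if not prompt:
        return {"status": "error", "message": "prompt must be non-empty"}

    # 4. Delegate with normalized parameters.
    paths = generate(prompt=prompt)

    # 5. Return the result dict.
    return {"status": "success", "image_paths": list(paths)}
```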

app/main.py

Passes the configured image_gen_provider and image_gen_model through to the AgentBase constructor at startup, so the agent initializes with the right provider from the moment it starts.


UI / Settings Layer

The UI layer lets users configure the image generation provider and API key separately from the LLM provider. Each piece mirrors what was already in place for VLM.

app/ui_layer/settings/model_settings.py

Backend settings API. Changes:

  • get_available_providers() now includes has_image_gen: bool and image_gen_model: str|None on each ProviderInfo, derived from the registry. The frontend uses has_image_gen to filter the provider dropdown.
  • get_model_settings() returns image_gen_provider and image_gen_model alongside existing fields.
  • update_model_settings() accepts and saves both, with validation: it rejects any image_gen_provider value not present in the registry before touching settings.json.
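
A sketch of the validate-before-write ordering in update_model_settings(). It works on an in-memory dict rather than settings.json, and the return shape beyond the `success` key (which the adapter code checks) is assumed.

```python
from typing import Optional


def update_model_settings(settings: dict, registry: dict,
                          image_gen_provider: Optional[str] = None,
                          image_gen_model: Optional[str] = None) -> dict:
    # Reject unknown providers before anything is modified.
    if image_gen_provider is not None and "image_gen" not in registry.get(image_gen_provider, {}):
        return {"success": False,
                "error": f"Provider '{image_gen_provider}' does not support image generation."}

    model_section = settings.setdefault("model", {})
    if image_gen_provider is not None:
        model_section["image_gen_provider"] = image_gen_provider
    if image_gen_model is not None:
        model_section["image_gen_model"] = image_gen_model
    return {"success": True}
```

Validating first means a rejected request leaves the stored settings exactly as they were.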

app/ui_layer/adapters/browser_adapter.py

Handles the WebSocket model_settings_update message from the frontend. Added extraction of imageGenProvider / imageGenModel from the message payload, saving them via update_model_settings(), then calling agent.reinitialize_image_gen() so the running agent immediately switches to the new provider without a restart.
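
The handler flow might look roughly like this. The payload keys `imageGenProvider` and `imageGenModel` and the `success` check come from this PR; the function name, the injected callables, and the `image_gen_reinitialized` result key are hypothetical.

```python
from typing import Callable


def handle_model_settings_update(payload: dict,
                                 update_settings: Callable[..., dict],
                                 reinitialize: Callable[[str], bool]) -> dict:
    # Pull the camelCase fields sent by the frontend.
    provider = payload.get("imageGenProvider")
    model = payload.get("imageGenModel")

    # Persist first, then swap the live interface only if saving succeeded.
    result = update_settings(image_gen_provider=provider, image_gen_model=model)
    if result.get("success") and provider:
        # Hypothetical key: surface whether the live swap actually happened.
        result["image_gen_reinitialized"] = bool(reinitialize(provider))
    return result
```

Reporting the reinitialize outcome back to the caller is one way to make a save-succeeded-but-swap-failed state visible.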

app/ui_layer/browser/frontend/src/store/slices/modelSettingsSlice.ts

Redux slice managing model settings state. Added:

  • imageGenProvider: string and currentImageGenModel: string state fields
  • setImageGenProvider and setCurrentImageGenModel actions
  • ProviderInfo interface extended with has_image_gen and image_gen_model
  • Both model_settings_get and model_settings_update socket message handlers updated to populate the new fields

app/ui_layer/browser/frontend/src/store/selectors/modelSettings.ts

Adds selectImageGenProvider and selectCurrentImageGenModel selectors for the two new state fields. Same pattern as the existing LLM/VLM selectors.

app/ui_layer/browser/frontend/src/pages/Settings/ModelSettings.tsx

Adds an "Image Generation" section to the Settings page (after VLM, before Slow Mode). Contains:

  • A provider dropdown filtered to only providers with has_image_gen: true
  • An API key field (shown only when the provider requires_api_key) with configured/required badge
  • A model override text input (auto-populated from the provider's default when the provider is switched)
  • A Save button that sends a model_settings_update socket message

The section is self-contained with its own local state (newImageGenProvider, newImageGenApiKey, newImageGenModel, imageGenHasChanges, isImageGenSaving) and save handler, matching the existing LLM provider section's structure.

@makiroll1125 makiroll1125 requested a review from ahmad-ajmal June 1, 2026 04:04
@makiroll1125 makiroll1125 self-assigned this Jun 1, 2026
target_model = None

try:
    ctx = ModelFactory.create(
Collaborator

Could we not use the same helper function for __init__ and reinitialize?

openai_key = get_api_key("openai")
gemini_key = get_api_key("gemini")
image_gen = iai.InternalActionInterface.image_gen_interface
current_provider = get_image_gen_provider()
Collaborator

We need a fallback: if the user is using a model that does not have image_gen but does have the token for a model that does have image_gen, then use that model, with priority (Google, OpenAI, then others). This logic would also be needed in the image_gen_interface.

self._init_api_key = api_key
self._init_base_url = base_url

self._get_token_count = get_token_count or (lambda: 0)
Collaborator

These aren't used anywhere so the image_gen won't count any usage

if result.get("success") and image_gen_provider:
    try:
        agent = self._controller.agent
        agent.reinitialize_image_gen(image_gen_provider)
Collaborator

update_model_settings saves the new provider to disk before reinitialize_image_gen runs, and the reinit only swaps the instance if ok. So if you switch to a provider whose key isn't configured, reinit fails → settings say the new provider, but the live image_gen_interface still points at the old one.


try:
    api_key = self._gemini_client._api_key
    client = genai.Client(api_key=api_key)
Collaborator

@ahmad-ajmal ahmad-ajmal Jun 4, 2026

This creates a new client on every call to this action. Why not just use self._gemini_client?

Comment thread app/agent_base.py
)
return llm_ok and vlm_ok

def reinitialize_image_gen(self, provider: str | None = None) -> bool:
Collaborator

There's another reinitialize inside the interface.py. Is this duplicated code? Could you use that as the helper and this as the wrapper?

Collaborator

@ahmad-ajmal ahmad-ajmal left a comment

Image

I selected the model during the hard onboarding phase, but this is still blank in my settings.
Also, I think having this separate is fine; then we don't need to worry about the fallback. Please ask @zfoong what he thinks about this.

Also, please update the provider versions in requirements.txt file since the current ones don't have the required values.

Please also make sure that you've run ruff lint checks
