Conversation

@noorbhatia
Contributor

  • Implements prewarm() for MLXLanguageModel to improve first-response time.
  • Prewarms the model with instructions, tools, and promptPrefix.

@noorbhatia force-pushed the noor/mlx-prewarm-model branch from ede4b54 to 6d7cbd5 on January 29, 2026 at 11:25

Copilot AI left a comment

Pull request overview

Implements prewarm(for:promptPrefix:) for MLXLanguageModel to reduce first-response latency by loading the model context and priming the MLX processor with session instructions, tools, and an optional prompt prefix.

Changes:

  • Add MLXLanguageModel.prewarm(for:promptPrefix:) implementation.
  • Prewarm loads/caches the ModelContext and calls context.processor.prepare(input:) with a minimal chat history and tool specs.
  • Include session instructions and optional prompt prefix in the prewarm input.
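
For context, a rough usage sketch (the initializer and session API below are assumptions about the surrounding library, not something this PR defines; the model ID is illustrative):

// Hypothetical call site: prewarm before the first user query so that
// respond() can begin generating immediately.
let model = MLXLanguageModel(modelId: "mlx-community/Qwen2.5-0.5B-Instruct-4bit")
let session = LanguageModelSession(model: model, instructions: "You are a helpful assistant.")
session.prewarm()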

Comment on lines +373 to +377
Task {

    let context = try await loadContext(modelId: modelId, hub: hub, directory: directory)

    // Build chat history similar to respond() to prime the cache effectively

Copilot AI Jan 29, 2026

Task { ... } inherits the caller’s actor context. If prewarm() is called from the main actor/UI, the model load + tokenization work inside this task can end up running on the main actor and cause UI hitches. Prefer running this as a detached/background task (e.g., Task.detached or explicitly hopping off the main actor) and consider setting an appropriate priority for prewarming work.
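
A rough sketch of that suggestion (loadContext, modelId, hub, directory, and userInput all come from the surrounding implementation shown below; exact capture and isolation details depend on how MLXLanguageModel is declared, and error handling is covered by the next comment):

// Run prewarming off the caller's actor at a lower priority so it never
// competes with UI work.
Task.detached(priority: .utility) {
    let context = try await loadContext(modelId: modelId, hub: hub, directory: directory)
    // ... build the chat, tool specs, and UserInput as in the block below ...
    _ = try await context.processor.prepare(input: userInput)
}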

Comment on lines +375 to +402
    let context = try await loadContext(modelId: modelId, hub: hub, directory: directory)

    // Build chat history similar to respond() to prime the cache effectively
    var chat: [MLXLMCommon.Chat.Message] = []

    // Add system instructions if present
    if let instructions, !instructions.isEmpty {
        chat.append(.init(role: .system, content: instructions))
    }

    // Add prompt prefix or minimal user message
    let promptText = promptPrefix?.description ?? "."
    chat.append(.init(role: .user, content: promptText))

    // Convert tools to MLX format
    let toolSpecs: [ToolSpec]? =
        tools.isEmpty
            ? nil
            : tools.map { convertToolToMLXSpec($0) }

    let userInput = MLXLMCommon.UserInput(
        chat: chat,
        processing: .init(resize: .init(width: 512, height: 512)),
        tools: toolSpecs
    )

    // Prepare input - triggers tokenization and processor initialization
    _ = try await context.processor.prepare(input: userInput)

Copilot AI Jan 29, 2026

The Task body can throw (loadContext / prepare), but the error is never handled or surfaced. Since prewarm is fire-and-forget, it should catch and intentionally ignore (or otherwise report) failures so prewarm errors don’t get silently lost in a failed task.

Suggested change (the block above wrapped in a do/catch):

do {
    let context = try await loadContext(modelId: modelId, hub: hub, directory: directory)

    // Build chat history similar to respond() to prime the cache effectively
    var chat: [MLXLMCommon.Chat.Message] = []

    // Add system instructions if present
    if let instructions, !instructions.isEmpty {
        chat.append(.init(role: .system, content: instructions))
    }

    // Add prompt prefix or minimal user message
    let promptText = promptPrefix?.description ?? "."
    chat.append(.init(role: .user, content: promptText))

    // Convert tools to MLX format
    let toolSpecs: [ToolSpec]? =
        tools.isEmpty
            ? nil
            : tools.map { convertToolToMLXSpec($0) }

    let userInput = MLXLMCommon.UserInput(
        chat: chat,
        processing: .init(resize: .init(width: 512, height: 512)),
        tools: toolSpecs
    )

    // Prepare input - triggers tokenization and processor initialization
    _ = try await context.processor.prepare(input: userInput)
} catch {
    // Intentionally ignore prewarm failures (model will be loaded on demand)
    // You may replace this with a more sophisticated logging mechanism if desired.
    print("MLXLanguageModel prewarm failed for modelId \(modelId): \(error)")
}

Comment on lines +385 to +387
// Add prompt prefix or minimal user message
let promptText = promptPrefix?.description ?? "."
chat.append(.init(role: .user, content: promptText))
Owner

Unless "." has special significance in MLX, this makes me think that promptPrefix should be non-optional (and maybe non-empty?)

What do you think?
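
Purely as an illustration of that direction (not an agreed API change), a required prefix would remove the "." sentinel entirely:

// With a non-optional promptPrefix there is no fallback user message:
chat.append(.init(role: .user, content: promptPrefix.description))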


let userInput = MLXLMCommon.UserInput(
    chat: chat,
    processing: .init(resize: .init(width: 512, height: 512)),
Owner

This seems like the kind of thing that we'd want to parameterize in the method, rather than hard-code.
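
For illustration only (the parameter name processing is a placeholder, not an agreed API), the hard-coded value could become a caller-supplied argument with the current value as the default:

// Sketch: use a caller-provided image-processing configuration, falling back
// to the current default only when none is given.
let userInput = MLXLMCommon.UserInput(
    chat: chat,
    processing: processing ?? .init(resize: .init(width: 512, height: 512)),
    tools: toolSpecs
)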

@mattt
Owner

mattt commented Jan 29, 2026

Thanks for opening this PR, @noorbhatia!

I think this kind of functionality gets into the realm of KV cache management, which so far this implementation hasn't attempted to support. At a high-level, I'd expect an API that has some concept of prewarming a common prefix of tokens, caching that, and then reusing for various suffixes. Most likely, cache selection and management would be automatic; I'm not sure yet what controls we'd want to expose.

Can you say more about how you understand the problem?

@noorbhatia
Contributor Author

Thanks for opening this PR, @noorbhatia!

I think this kind of functionality gets into the realm of KV cache management, which so far this implementation hasn't attempted to support. At a high-level, I'd expect an API that has some concept of prewarming a common prefix of tokens, caching that, and then reusing for various suffixes. Most likely, cache selection and management would be automatic; I'm not sure yet what controls we'd want to expose.

Can you say more about how you understand the problem?

My problem: The very first respond() call has significant latency because the model must be loaded from disk, transferred to GPU memory, and the processor initialized. This cold start happens every time the app launches or when a model is first used.

My understanding of prewarm is simply to have the model ready before the user sends their first query, so respond() can start generation immediately.

I'd like to understand your vision and would be happy to implement a KV cache based solution if you could point me to the right direction.

@mattt
Owner

mattt commented Jan 30, 2026

@noorbhatia Thanks for elaborating — that's really helpful.

Let's break the cold start into two parts:

  • First, load the model
  • Second, create + cache the context with a given prompt prefix.

I suspect that the first step—loading the model—is the bulk of the time spent waiting. So let's try solving that first.

Then, once we have that, we can implement all of the KV cache infrastructure needed to make the promptPrefix parameter do something.

Does that track with your mental model of the problem?

@noorbhatia
Contributor Author

Understood. And solving the first step, loading the model should be simply calling loadContext in prewarm?

Perhaps we can expose loadContext but that would require a change in LanguageModel's API contract. What do you suggest?

@mattt
Owner

mattt commented Jan 30, 2026

Understood. And solving the first step, loading the model should be simply calling loadContext in prewarm?

Yes, exactly.

Perhaps we can expose loadContext but that would require a change in LanguageModel's API contract. What do you suggest?

I'd be interested to see how far we can get within the constraints of the existing Foundation Models API abstraction before we expand the surface area. Let's revisit this when we dig into KV caching.
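
A minimal sketch of that first step inside prewarm, assuming loadContext already caches the loaded context (capture and isolation details depend on how MLXLanguageModel is declared):

// Prewarm just loads (and caches) the model context off the caller's actor.
// The promptPrefix parameter stays accepted but unused until KV caching is added.
Task.detached(priority: .utility) { [self] in
    // Best-effort: ignore failures; respond() will load the model on demand anyway.
    _ = try? await loadContext(modelId: modelId, hub: hub, directory: directory)
}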

@noorbhatia
Contributor Author

Great, I'll update the PR. Thanks, @mattt!
