
[kernelbench] Initial XeGPU support #129

Closed

tkarna wants to merge 6 commits into llvm:main from tkarna:kb-xegpu

Conversation

tkarna (Contributor) commented May 4, 2026

Adds initial support for xegpu in the kernel_bench tool.

  • The inspect_payload utility returns payload matmul shapes in its result dict. The utility is moved to lighthouse/utils/mlir.py.
  • The xegpu parameter_selector is moved to lighthouse/schedule/xegpu/xegpu_parameter_selector.py.
  • Refactored Runner.get_gpu_argument_access_callback: it now takes a host numpy buffer and a func arg index.
  • Adds xegpu support to kernel_bench (roughly as sketched below):
    • Adds a --target CLI option that defaults to "cpu".
    • Uses inspect_payload to infer func args and matmul shapes.
    • If no pipeline is set and the target is "xegpu", uses the "xegpu_mlp_pipeline", which is currently hard-coded.
    • Uses xegpu_parameter_selector to generate mlp schedule parameters.
    • Handles GPU data movement with GPUMemoryManager.
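For orientation, a rough sketch of the resulting xegpu flow. The names inspect_payload, parameter_selector, mlp_schedule, xegpu_to_binary, and add_module_stage appear in this PR, but the signatures and module paths below are illustrative assumptions, not verbatim from the code.

```python
# Hypothetical sketch of the xegpu lowering flow added by this PR.
# Signatures and module paths are assumptions; only the names come from the PR.
from lighthouse.utils.mlir import inspect_payload
from lighthouse.schedule.xegpu.xegpu_parameter_selector import parameter_selector

def lower_for_xegpu(driver, payload_module):
    # Inspect the payload to infer func args and matmul shapes ...
    info = inspect_payload(payload_module)
    # ... and derive mlp schedule parameters (tile sizes etc.) from them.
    pipeline_params = parameter_selector(info)
    # Populate the compiler driver stages (matches the diff at lines +94/+95).
    # mlp_schedule and xegpu_to_binary are assumed to be importable builders.
    driver.add_module_stage(mlp_schedule(pipeline_params))
    driver.add_module_stage(xegpu_to_binary())
```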

tkarna requested review from adam-smnk, fschlimb, and rengolin on May 4, 2026 at 13:49

tkarna (Contributor, Author) commented May 4, 2026

The main difference with the previous lowering flow is that now we need to access the payload IR module to inspect it:
https://github.com/tkarna/lighthouse/blob/5bf6ff5b7c98e2723d0cbffa06785a5023585980/tools/kernel_bench#L229-L231

and that we need to be able to pass a parameter dict to the lowering schedule:
https://github.com/tkarna/lighthouse/blob/5bf6ff5b7c98e2723d0cbffa06785a5023585980/tools/kernel_bench#L92-L95

Comment thread: tools/kernel_bench
Comment on lines +94 to +95
driver.add_module_stage(mlp_schedule(pipeline_params))
driver.add_module_stage(xegpu_to_binary())
Member commented:

Do you think mlp_schedule with its params could be represented by a pipeline yaml file?

It'd be cleaner to avoid special builders, but it's fine if this is simpler. Just asking to understand possible limitations.

rengolin (Member) commented May 5, 2026:

I'd second that. Hard-coding the API at this point will make things hard later when we have too many variations. This tool should not get hard-coded schedules via the API. If anything, we should add transform stages to the yaml file, or even improve the yaml support.

tkarna (Contributor, Author):

Thanks, I think this is a central question for the pipeline design.

I suspect that in the long run we are going to have some kind of oracle/cost model/autotuner that generates tile sizes and other parameters. In the generic case this is some black-box python routine. How we handle this in the yaml interface is an open question.

a) We could find a way to serialize the parameter dict in the yaml file, either directly as a yaml entry or via some placeholder, e.g. to indicate that the params should be read from a foo.json file. I think this is too restrictive though: for example, the cost model should first inspect the payload and the proposed pipeline and only then dump parameters to a yaml or json file. The user would then pass in the generated yaml/json files to use the optimal parameters.

b) We could try to incorporate payload analysis and cost models into the transform schedules. Analysis results could be encoded e.g. in payload module attributes that subsequent schedules read (see the sketch after this list). The cost model/oracle, I think, cannot in the general case be represented as transform ops, so it would be some magic python routine call in the transform schedule. The entire pipeline with analysis and cost models could then be represented as a list of schedules, e.g. in a yaml file. Developing this capability is nontrivial though; we cannot expect to have it in the short term, e.g. to run the kernel bench on GPU.

c) We could use "placeholder" schedule names in the yaml file. In this PR I'm proposing the "xegpu_mlp_pipeline" string. It's not a schedule per se but refers to a "standard" lowering flow that can include payload analysis, parameter selection stages, etc. The flow is defined in python: currently it is hard-coded in kernel_bench, but once established it should be moved to lighthouse. We could use this string in the pipeline yaml file (not done in this PR yet). This option allows defining the full lowering with a string representation while still allowing arbitrarily complex dynamic analysis/cost model routines under the hood, and it is faster to implement than (b).
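As a concrete illustration of the attribute idea in (b), here is a minimal sketch using the upstream MLIR python bindings. The attribute name "lh.matmul_shapes" and the string encoding are made up for illustration.

```python
# A minimal sketch of option (b)'s "analysis results as module attributes"
# idea, using the upstream MLIR python bindings. The attribute name
# "lh.matmul_shapes" and the string encoding are made up for illustration.
from mlir import ir

def annotate_matmul_shapes(module: ir.Module, shapes: list[int]) -> None:
    with module.context:
        # An analysis stage records its results on the payload module ...
        module.operation.attributes["lh.matmul_shapes"] = ir.StringAttr.get(
            ",".join(str(s) for s in shapes)
        )

def read_matmul_shapes(module: ir.Module) -> list[int]:
    # ... and a later schedule/stage reads them back instead of re-analyzing.
    attr = ir.StringAttr(module.operation.attributes["lh.matmul_shapes"])
    return [int(s) for s in attr.value.split(",")]
```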

Member commented:

> For example, the cost model should first inspect the payload and the proposed pipeline and only then dump parameters to a yaml or json file. The user would then pass in the generated yaml/json files to use the optimal parameters.

Yes, this is one of the uses I had in mind.

> We could try to incorporate payload analysis and cost models into the transform schedules. The entire pipeline with analysis and cost models could then be represented as a list of schedules, e.g. in a yaml file. Developing this capability is nontrivial though ...

Yup, this is why the one above should come first.

> We could use "placeholder" schedule names in the yaml file. In this PR I'm proposing the "xegpu_mlp_pipeline" string.

The question is: can that string just be a yaml file instead? For now, hard-coded, so neither of the solutions above needs to exist for this to merge.

adam-smnk (Member) commented May 5, 2026:

I'm leaning toward option (c), i.e., shifting complexity to yaml.

As you mentioned, ideally all these decisions could be reified into IR using schedules or annotation attributes. A good north star, but not feasible as-is today.
That's why I think adding custom logic and syntax to the yaml descriptor is a good placeholder. Then we can slowly start shifting what works into IR proper over time.

tkarna (Contributor, Author):

> The question is: can that string just be a yaml file instead? For now, hard-coded, so neither of the solutions above needs to exist for this to merge.

Well, we cannot express the "inspect payload -> call parameter_selector -> pass parameters to schedule" progression in yaml. We can use placeholder strings ("xegpu_mlp_pipeline") in yaml that are interpreted correctly in python, as in the sketch below.
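A minimal sketch of that interpretation. The expand function and the load_schedule fallback are made-up names; inspect_payload, parameter_selector, mlp_schedule, and xegpu_to_binary are as in the sketch in the PR description.

```python
# Hypothetical interpretation of a placeholder pipeline entry from yaml.
# expand_pipeline_entry and load_schedule are illustrative names only.
def expand_pipeline_entry(name, driver, payload_module):
    if name == "xegpu_mlp_pipeline":
        # Placeholder: run the inspect -> select -> lower progression.
        params = parameter_selector(inspect_payload(payload_module))
        driver.add_module_stage(mlp_schedule(params))
        driver.add_module_stage(xegpu_to_binary())
    else:
        # Ordinary entries are treated as plain schedule names.
        driver.add_module_stage(load_schedule(name))
```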

Member commented:

The progression is in the class, the pipeline in the yaml file. I would not try to encode payload/parameter logic into yaml, just the pipeline. The rest is in python, in a GPU class that holds the pipeline driver and just loads the referenced yaml file. The CPU class would be even simpler and just do what is done today.

There could be a base class that does the importing and other similar tasks, along the lines of the sketch below.
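A hedged sketch of that split; all class and method names below are illustrative assumptions, not existing lighthouse API.

```python
# Hypothetical sketch of the proposed split. Classes own the driver; the
# GPU subclass adds inspection/selection. Names are illustrative only;
# inspect_payload / parameter_selector are as in the earlier sketches.
class BaseLowering:
    def __init__(self, driver, pipeline_yaml):
        self.driver = driver                # the class owns the pipeline driver
        self.pipeline_yaml = pipeline_yaml

    def import_payload(self, mlir_file):
        # Shared import logic lives in the base class (assumed driver method).
        self.driver.load_payload(mlir_file)

class CpuLowering(BaseLowering):
    def build(self):
        # Just what is done today: load the pipeline from yaml.
        self.driver.load_pipeline(self.pipeline_yaml)

class XeGpuLowering(BaseLowering):
    def build(self):
        # Inspect the payload and select parameters before loading.
        params = parameter_selector(inspect_payload(self.driver.payload_module))
        self.driver.load_pipeline(self.pipeline_yaml, params=params)
```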

tkarna (Contributor, Author) commented May 5, 2026:

So the yaml pipeline in this case would contain the schedules ["mlp_schedule", "xegpu_to_binary"], and the high-level GPU lowering class would have the payload inspector and parameter selector calls hard-coded in the lowering flow. How do we know that the parameter dict should be passed to the "mlp_schedule" schedule and not to the other one? What if the user adds a new "foobar" schedule? To me it seems that in any case we need a special stage string (e.g. "xegpu_mlp_schedule" or "xegpu_mlp_pipeline") that the GPU lowering class recognizes and handles appropriately.

rengolin (Member) commented May 5, 2026:

No, the yaml file would contain what both of those schedules contain today.

The small parts that are actual schedules need to be factored out. This is what we planned for the CPU pipeline, too. We don't need passes inside schedules.

tkarna (Contributor, Author) commented May 5, 2026:

Splitting the mlp_schedule into passes and smaller parts does not change the big picture. For example, these parts are still going to be transform schedules that require parameters passed in as a python dict (unless we implement option (b)).

Comment thread: tools/kernel_bench

def define_compiler_pipeline(
    mlir_file: Path,
    driver: CompilerDriver,
rengolin (Member) commented May 5, 2026:

This is not defining the compiler pipeline, it's receiving it from outside. This is the same discussion as before: users control their data structures, and libraries get them passed down from the user. This here is a creator of the CompilerDriver, so it can't receive it from outside.

We need to compartmentalize top-down, otherwise it will be hard to know what's broken when things start to fall apart.

tkarna (Contributor, Author) commented May 5, 2026:

Yes, this change is needed because we must be able to inspect the payload module object that CompilerDriver owns in order to obtain the parameters we pass to the lowering schedule. Thus this must happen before we populate the CompilerDriver stages. We can refactor the define_compiler_pipeline function in any way we like, but the flow does not change.

Member commented:

Then this can be encapsulated in the GPU-specific code that wraps this function, keeping it out of the CPU-specific code that doesn't need any of that.

It's not clear to me if we need a class inside lighthouse at this point, or just here in this file for now. For now it seems we keep them here and see if we need to move them later.

To be clear, both classes must own the driver/pipeline as before, so that any logic is performed by them, not by external functions.

Comment thread: tools/kernel_bench
# Set target specific default pipeline if no pipeline is provided.
default_pipeline = None
pipeline_params = None
if args.target == "xegpu" and args.pipeline is None:
Member commented:

Can we have the GPU/CPU logic separated outside of main? Having all those if xegpu checks is not reasonable for such a high-level tool.

tkarna (Contributor, Author):

Hmm, we can move it outside main by refactoring the inspect-lower-and-execute logic into a helper function or object. The current version is, however, kernel bench specific. I'd suggest we refactor to a generic method once we have more similar flows/use cases.

Member commented:

Agreed. I'd still keep it kernel bench specific, for now. But we need to separate CPU/GPU in a way that doesn't leave if/else blocks in each function.
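One hedged way to do that, building on the hypothetical classes sketched earlier; the registry itself is also an assumption, not code from this PR.

```python
# Hypothetical target registry so main() branches exactly once.
LOWERINGS = {
    "cpu": CpuLowering,
    "xegpu": XeGpuLowering,
}

def make_lowering(args, driver):
    # Pick the target-specific class once; no per-function `if xegpu` checks.
    return LOWERINGS[args.target](driver, args.pipeline)
```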

rengolin (Member) commented May 5, 2026

> now we need to access the payload IR module to inspect it:
> https://github.com/tkarna/lighthouse/blob/5bf6ff5b7c98e2723d0cbffa06785a5023585980/tools/kernel_bench#L229-L231

This is (for now) a GPU-only issue, and it can go into the GPU class. But in time we'll need to do this for all targets in order to know about the inputs. For example, look into kernel bench's global variables to know what's configurable (for the init args), look at global constants, etc.

> and that we need to be able to pass a parameter dict to the lowering schedule:
> https://github.com/tkarna/lighthouse/blob/5bf6ff5b7c98e2723d0cbffa06785a5023585980/tools/kernel_bench#L92-L95

This can/should be done using the yaml descriptor.

tkarna (Contributor, Author) commented May 7, 2026

Closing this for now. The kernel_bench changes should be introduced/merged only after we have an xegpu pipeline that can run without any parameters (e.g. one with a built-in tile size selector).

tkarna closed this on May 7, 2026
tkarna added a commit that referenced this pull request on May 8, 2026:
…133)

Pulling in the commits from #129 that are not `kernel_bench` related.
Generic util changes that'll be useful in the future.

- The `inspect_payload` utility returns payload matmul shapes in its result dict. The utility is moved to `lighthouse/utils/mlir.py`.
- The xegpu `parameter_selector` is moved to `lighthouse/schedule/xegpu/xegpu_parameter_selector.py`.
- Refactored `Runner.get_gpu_argument_access_callback`: it now takes a host numpy buffer and a func arg index.
