Add Triton sampling backends alongside FlashInfer by FlamingoPg · Pull Request #280 · lightseekorg/tokenspeed

FlamingoPg · 2026-05-27T07:55:49Z

Summary

Add TokenSpeed-native triton and triton_full sampling backends alongside the existing flashinfer / flashinfer_full probability-route backends.
Keep NVIDIA default sampling backend as flashinfer; this PR does not remove FlashInfer sampling or change the default route.
Adapt vLLM MRV2 sampling principles at the kernel/runtime boundary: logits-to-Gumbel-Max sampling, stateless in-kernel RNG, TokenSpeed pool-state indirection, and CUDA-graph-safe sampler variants.
Keep runtime dependencies behind the tokenspeed-kernel boundary; attention/MoE/quantization FlashInfer paths are untouched.

Changes

Add pool-aware Triton sampling ops for no-filter Gumbel, finite top-k, finite top-k + top-p, top-p-only, min-p/full sampling helpers, selected-token logprob, and verify-chain support.
Add triton and triton_full runtime backends with separate MRV2-style Gumbel routes.
Add a neutral PoolSamplingBackend so FlashInfer and Triton share TokenSpeed request-pool state without Triton inheriting FlashInfer probability/coin state.
Add CUDA graph sampler variants so graph replay can select the captured route for no-filter, top-k, top-k+top-p, top-p-only, verify, and full/min-p paths.
Keep FlashInfer/CUDA probability-route code available as a parallel backend and baseline.
Reorganize sampling kernel tests so existing sampling/CUDA tests stay in test_sampling.py, while Triton-specific coverage lives in test_sampling_triton.py and test_sampling_triton_full.py.

Benchmark

These are the latest focused sampling-path results I could trace from the current branch. Source artifacts:

Normal sample focused benchmark: /tmp/tokenspeed_sampling_mr/focused_sampling_path_bench.csv
Normal sample focused benchmark, large vocab: /tmp/tokenspeed_sampling_mr/focused_sampling_path_bench_151936.csv
Full/min-p focused operator benchmark: /tmp/tokenspeed_sampling_mr/current_sampling_ops.csv

Environment from the benchmark logs: NVIDIA H100 80GB HBM3. Timing uses CUDA events. All numbers are milliseconds, shown as median / p95. Speedup is FlashInfer baseline / Triton, so higher is better.
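The reporting convention above can be sketched as follows. This is an illustration only, not the benchmark script behind the CSV artifacts: `summarize` and `speedup` are hypothetical helper names, and the nearest-rank p95 is an assumption, since the logs do not say which percentile estimator was used.

```python
import statistics

def summarize(samples_ms):
    """Return (median, p95) of per-iteration timings in milliseconds."""
    ordered = sorted(samples_ms)
    # Nearest-rank p95 (assumed estimator): ceil(0.95 * n) in integer arithmetic.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return statistics.median(ordered), ordered[rank - 1]

def speedup(baseline_ms, candidate_ms):
    """Baseline time divided by candidate time; above 1.0 means the candidate is faster."""
    return baseline_ms / candidate_ms

# Medians from the no_filter / vocab 32768 / bs 1 row of the op-level table.
ratio = speedup(0.087296, 0.042960)
print(f"{ratio:.2f}x")
```

With the medians from that row, the ratio rounds to the 2.03x reported in the table.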

Important scope notes:

Normal sample rows compare against flashinfer.sample() core behavior. That route does not call fused_topk_topp_renorm; it uses softmax -> top_k_top_p_sampling_from_probs.
Full/min-p rows compare against flashinfer_full-style probability behavior. That route does call fused_topk_topp_renorm on NVIDIA before min_p_sampling_from_probs.
current_triton_pool_op is the latest optimized pool-aware Triton op path.
current_runtime_sample is the full TokenSpeed runtime sampling backend call and includes route/backend overhead.
These are focused sampler-path measurements, not full serving throughput claims.

A. Normal Sample: Triton vs `flashinfer.sample()`

Baseline route:

softmax(logits / temperature)
-> top_k_top_p_sampling_from_probs(...)
-> token

Candidate route:

logits + pool scalars
-> TokenSpeed Triton Gumbel/candidate sampler
-> token
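For readers unfamiliar with the Gumbel-Max trick behind the candidate route, a minimal pure-Python sketch follows. It is not the Triton kernel from this PR: `gumbel_max_sample` is a hypothetical name, and Python's `random.Random` stands in for the stateless in-kernel RNG. The idea is that adding independent Gumbel noise to each scaled logit and taking the argmax yields a draw from softmax(logits / temperature) without materializing probabilities.

```python
import math
import random

def gumbel_max_sample(logits, temperature, rng):
    """Return one token id distributed as softmax(logits / temperature)."""
    best_id, best_score = -1, -math.inf
    for token_id, logit in enumerate(logits):
        # Clamp away from 0 so both logarithms stay finite.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        score = logit / temperature + gumbel_noise
        if score > best_score:
            best_id, best_score = token_id, score
    return best_id

rng = random.Random(0)
# Logits chosen so that softmax gives probabilities 0.7, 0.2, 0.1.
logits = [math.log(0.7), math.log(0.2), math.log(0.1)]
draws = [gumbel_max_sample(logits, 1.0, rng) for _ in range(20000)]
share_of_token_0 = draws.count(0) / len(draws)
print(round(share_of_token_0, 2))  # should land near 0.7
```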

Current Pool-Aware Triton Op vs Old FlashInfer Core

mode	vocab	bs	old FlashInfer core	current Triton pool op	speedup
no_filter	32768	1	0.087296 / 0.102688	0.042960 / 0.047200	2.03x
no_filter	32768	8	0.085504 / 0.102496	0.042240 / 0.046784	2.02x
no_filter	32768	32	0.086384 / 0.099680	0.042304 / 0.047104	2.04x
top_k_top_p	32768	1	0.103248 / 0.115648	0.060272 / 0.071072	1.71x
top_k_top_p	32768	8	0.103840 / 0.118304	0.058672 / 0.071616	1.77x
top_k_top_p	32768	32	0.106208 / 0.114304	0.058896 / 0.072384	1.80x
no_filter	151936	1	0.141440 / 0.173440	0.044880 / 0.056096	3.15x
no_filter	151936	8	0.147280 / 0.156704	0.045472 / 0.055040	3.24x
no_filter	151936	32	0.167904 / 0.178080	0.052016 / 0.058080	3.23x
top_k_top_p	151936	1	0.195264 / 0.215296	0.091920 / 0.100032	2.12x
top_k_top_p	151936	8	0.203504 / 0.216416	0.096208 / 0.100128	2.12x
top_k_top_p	151936	32	0.234256 / 0.243520	0.134336 / 0.139328	1.74x

Full Runtime Sample Call vs Old FlashInfer Core

mode	vocab	bs	old FlashInfer core	current runtime sample	speedup
no_filter	32768	1	0.087296 / 0.102688	0.083232 / 0.093696	1.05x
no_filter	32768	8	0.085504 / 0.102496	0.081232 / 0.089568	1.05x
no_filter	32768	32	0.086384 / 0.099680	0.081504 / 0.090112	1.06x
top_k_top_p	32768	1	0.103248 / 0.115648	0.077200 / 0.094624	1.34x
top_k_top_p	32768	8	0.103840 / 0.118304	0.077120 / 0.090816	1.35x
top_k_top_p	32768	32	0.106208 / 0.114304	0.077392 / 0.091232	1.37x
no_filter	151936	1	0.141440 / 0.173440	0.100048 / 0.130112	1.41x
no_filter	151936	8	0.147280 / 0.156704	0.100512 / 0.114112	1.47x
no_filter	151936	32	0.167904 / 0.178080	0.100496 / 0.115904	1.67x
top_k_top_p	151936	1	0.195264 / 0.215296	0.109248 / 0.122432	1.79x
top_k_top_p	151936	8	0.203504 / 0.216416	0.112416 / 0.119168	1.81x
top_k_top_p	151936	32	0.234256 / 0.243520	0.150912 / 0.156448	1.55x

B. Full/min-p Path: Triton Full vs `flashinfer_full`

Baseline route on NVIDIA:

apply penalties/bias if enabled
-> softmax(logits / temperature)
-> fused_topk_topp_renorm(...)
-> min_p_sampling_from_probs(...)
-> token

Candidate route:

logits + full sampling pools
-> TokenSpeed Triton full/min-p Gumbel/rejection sampler
-> token

This table is the focused full/min-p operator comparison from current_sampling_ops.csv.

mode	vocab	bs	FlashInfer full baseline	Triton full op	speedup
min_p	32768	1	0.097856 / 0.106944	0.095472 / 0.108576	1.02x
min_p	32768	8	0.182784 / 0.194080	0.095312 / 0.109440	1.92x
min_p	32768	32	0.202912 / 0.217952	0.095696 / 0.112288	2.12x
top_k_top_p_min_p	32768	1	0.108672 / 0.114976	0.096208 / 0.109472	1.13x
top_k_top_p_min_p	32768	8	0.114528 / 0.120640	0.095312 / 0.107328	1.20x
top_k_top_p_min_p	32768	32	0.126416 / 0.134528	0.097888 / 0.110592	1.29x
min_p	151936	1	0.176240 / 0.180416	0.098080 / 0.113792	1.80x
min_p	151936	8	0.230592 / 0.234496	0.096384 / 0.110080	2.39x
min_p	151936	32	0.609088 / 0.632736	0.162848 / 0.172320	3.74x
top_k_top_p_min_p	151936	1	0.206560 / 0.210592	0.097024 / 0.112128	2.13x
top_k_top_p_min_p	151936	8	0.255536 / 0.260544	0.096496 / 0.110560	2.65x
top_k_top_p_min_p	151936	32	0.346656 / 0.355264	0.162208 / 0.166272	2.14x

Benchmark Interpretation

Normal sample and full/min-p are different baselines. Normal flashinfer.sample() does not use fused_topk_topp_renorm; flashinfer_full does.
The Triton op-level route is the clear win on normal sampling: 2.02x-3.24x on no-filter and 1.71x-2.12x on finite top-k + top-p. At vocab 151936 the top-k + top-p gain drops from 2.12x at bs 1 and bs 8 to 1.74x at bs 32.
The full runtime normal-sample call still wins, but the margin is smaller because route selection and backend scaffolding are included. The smallest win is 32768 no-filter at about 1.05x.
Against the flashinfer_full route that includes fused_topk_topp_renorm, Triton full also wins on median in every focused full/min-p operator row. The tightest case is 32768/bs1 min_p, where the median improves slightly (1.02x) while p95 is marginally worse (0.108576 vs 0.106944 ms).
This PR does not claim full serving throughput speedup from these focused numbers; it keeps FlashInfer as the default backend while the Triton routes remain opt-in.

Validation

pre-commit run --all-files (passed)
python -m pytest test/runtime/test_sampling_backend_pool.py test/runtime/test_sampling_backend_registry.py test/runtime/test_cli_config_compat.py (90 passed, 18 warnings)
python -m pytest tokenspeed-kernel/test/ops/test_sampling.py tokenspeed-kernel/test/ops/test_sampling_triton.py tokenspeed-kernel/test/ops/test_sampling_triton_full.py (60 passed, 18 warnings)

Notes

PR is intentionally additive: flashinfer / flashinfer_full remain available, and NVIDIA default remains flashinfer.
Generated benchmark artifacts and migration scratch docs are not included.

lightseek-bot · 2026-05-27T07:57:45Z

@@ -18,14 +18,48 @@
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 # SOFTWARE.

-"""Triton sampling helper kernels."""
+"""TokenSpeed-native Triton sampling kernels.


Please add vLLM's copyright here

Added SPDX-FileCopyrightText: Copyright contributors to the vLLM project to the Triton sampling kernel header in commit c3c25815.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c3c2581533

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-27T08:12:07Z

+        top_k_ok = (top_k <= 0) | (before_count < top_k)
+        top_p_ok = (top_p >= 1.0 - 1.0e-6) | (before_prob < top_p * total_prob)
+        ok = top_k_ok & top_p_ok


Use top-k-renormalized mass for top-p rejection

When a request has both finite top_k and top_p < 1 and falls into the generic rejection path (for example top_k 65–127, mixed batches with such a row, or the full Triton backend), this predicate compares before_prob with top_p * total_prob where total_prob is the full-vocab mass. The direct sampler and build_top_k_top_p_probs_from_logits() used by speculative verify apply top-p after top-k renormalization, so normal sampling and verify can admit different nuclei and corrupt speculative decoding probabilities for those sampling params.
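The difference between the two reference masses can be shown on a toy distribution. The sketch below is plain Python, not the kernel code: `nucleus` is a hypothetical helper that applies top-k, then admits tokens with the same shape of predicate as the one under review (`before < top_p * total`), with `total` being either the full-vocab mass or the top-k mass.

```python
def nucleus(probs, top_k, top_p, renormalize_after_top_k):
    """Return the token ids admitted by top-k followed by top-p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:top_k] if top_k > 0 else ranked
    # Mass that top_p is measured against: top-k mass or full-vocab mass.
    total = sum(probs[i] for i in kept) if renormalize_after_top_k else sum(probs)
    admitted, before = [], 0.0
    for i in kept:
        # Admit while the mass ranked strictly before this token is below the cutoff.
        if before < top_p * total:
            admitted.append(i)
        before += probs[i]
    return admitted

probs = [0.4, 0.3, 0.2, 0.1]
full = nucleus(probs, top_k=2, top_p=0.5, renormalize_after_top_k=False)
renorm = nucleus(probs, top_k=2, top_p=0.5, renormalize_after_top_k=True)
print(full, renorm)
```

Here the full-vocab cutoff admits tokens 0 and 1, while the top-k-renormalized cutoff admits only token 0, which is the kind of nucleus mismatch the comment describes between sampling and verify.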

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 981f9400f2

chatgpt-codex-connector · 2026-05-27T09:08:15Z

+        block_id = tl.min(
+            tl.where((row_vals == selected_value) & mask, cols, 2147483647),


Resolve top-k samples by token id, not logit value

When the direct top-k/top-p path is used (top_k <= 64) and two nucleus candidates have identical logits, the Gumbel draw selects a rank but this resolution step only carries the sampled logit value and then returns the lowest vocab id with that value. For uniform rows or tied scores this collapses all probability mass for tied candidates onto the smallest token id instead of sampling among them, so finite-top-k requests can produce biased/incorrect tokens; the non-compact helper has the same value-only resolution pattern.
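A toy reproduction of the tie problem, in plain Python rather than Triton. `resolve_by_value` mimics the value-only lookup (lowest vocab id whose logit equals the sampled value) and `resolve_by_id` carries the token id of the sampled rank instead; both names and the uniform rank draw are assumptions made for illustration.

```python
import random

def resolve_by_value(logits, candidate_ids, rank):
    """Value-only resolution: lowest token id whose logit equals the sampled value."""
    selected_value = logits[candidate_ids[rank]]
    return min(i for i, v in enumerate(logits) if v == selected_value)

def resolve_by_id(candidate_ids, rank):
    """Id-based resolution: return the token id stored at the sampled rank."""
    return candidate_ids[rank]

logits = [2.0, 2.0, 2.0, -1.0]
candidate_ids = [0, 1, 2]  # top-3 candidates, all with identical logits
rng = random.Random(0)
ranks = [rng.randrange(len(candidate_ids)) for _ in range(3000)]

by_value = {resolve_by_value(logits, candidate_ids, r) for r in ranks}
by_id = {resolve_by_id(candidate_ids, r) for r in ranks}
print(by_value, by_id)
```

Value-only resolution only ever returns token 0, while id-based resolution spreads the draws across all three tied tokens.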

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 681396814e

chatgpt-codex-connector · 2026-05-27T09:29:46Z

+    finite_top_k = (top_k > 0) & (top_k <= TOP_K_PAD)
+    top_p_total_prob = tl.where(finite_top_k, top_k_total_prob, total_prob)
+
+    selected = tl.load(candidate_ids_ptr + row * candidate_row_stride + RETRIES)


Avoid falling back to argmax after only four top-p retries

For top-p-only requests (the default top_k=-1 with top_p<1) this generic rejection path draws candidates from the unfiltered distribution and only accepts candidates inside the nucleus; if all four retries miss, selected remains initialized to the argmax here. With low-but-valid top_p values such as 0.1 or 0.2, that miss probability is large ((1-top_p)^4), so sampling frequently collapses to greedy instead of the requested nucleus distribution, unlike the previous exact renormalized sampler.
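The bias can be reproduced with a small simulation. This is a sketch under stated assumptions, not the PR's kernel: `sample_with_retry_cap` is a hypothetical helper, the vocabulary is a uniform 10-token toy, and the nucleus for top_p = 0.2 is taken to be tokens 0 and 1.

```python
import random

def sample_with_retry_cap(probs, nucleus_ids, retries, rng):
    """Rejection-sample a nucleus token; fall back to argmax if every draw misses."""
    ids = list(range(len(probs)))
    for _ in range(retries):
        candidate = rng.choices(ids, weights=probs)[0]
        if candidate in nucleus_ids:
            return candidate, False
    return max(ids, key=lambda i: probs[i]), True

probs = [0.1] * 10    # toy distribution over a 10-token vocab
nucleus_ids = {0, 1}  # assumed nucleus for top_p = 0.2 on this distribution
rng = random.Random(0)
results = [sample_with_retry_cap(probs, nucleus_ids, 4, rng) for _ in range(20000)]

fallback_rate = sum(fell_back for _, fell_back in results) / len(results)
share_of_token_0 = sum(token == 0 for token, _ in results) / len(results)
print(round(fallback_rate, 2), round(share_of_token_0, 2))
```

An exact nucleus sampler would give token 0 a share of 0.5 here. With four capped retries the fallback fires in roughly 0.8^4 ≈ 41% of draws, which pushes token 0 to roughly 70%.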

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 28fa045441

chatgpt-codex-connector · 2026-05-27T09:52:03Z

+        if not self.config.enable_output_logprobs and num_tokens_per_req == 1:
+            variants = (
+                _CUDA_GRAPH_VARIANT_DEFAULT,
+                _CUDA_GRAPH_VARIANT_NO_FILTER,
+                _CUDA_GRAPH_VARIANT_TOP_K_TOP_P,
+            )
+            return variants
+        return (_CUDA_GRAPH_VARIANT_DEFAULT,)


Capture seeded sampler variants when logprobs are enabled

When --enable-output-logprobs is set, this returns only the default graph, so _graph_key_for_replay() falls back to the generic rejection graph for requests whose prepared mode is no_filter or top_k_top_p. Those same seeded requests use the direct Gumbel/top-k path in eager or uncaptured batches, while the rejection path derives different random streams (e.g. offset * RETRIES + retry instead of offset), so the same prompt and seed can produce different tokens depending only on whether the batch replays a captured graph. Either capture the same variants with logprobs or force replay/eager through the same sampler mode.
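Why the counter derivation matters can be sketched with a generic counter-based RNG. Everything here is hypothetical: `stateless_uniform` uses a SHA-256 hash purely as a stand-in for the kernel's stateless RNG, and the seed and offset values are made up. Only the two counter formulas (`offset` and `offset * RETRIES + retry`) come from the comment above.

```python
import hashlib
import struct

def stateless_uniform(seed, counter):
    """Hypothetical counter-based RNG: the same (seed, counter) always gives the same value."""
    digest = hashlib.sha256(struct.pack("<QQ", seed, counter)).digest()
    return int.from_bytes(digest[:8], "little") / 2**64

RETRIES = 4
seed, offset = 1234, 7

# Direct path: one draw keyed by the raw offset.
direct = stateless_uniform(seed, offset)
# Rejection path: draws keyed by offset * RETRIES + retry.
rejection = [stateless_uniform(seed, offset * RETRIES + r) for r in range(RETRIES)]

print(direct, rejection)
```

Because the two paths read different counters for the same seed and offset, their random values differ, so the same request can yield different tokens depending on which path a replayed graph takes.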

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 7179e82b15

chatgpt-codex-connector · 2026-05-27T19:12:24Z

+        if num_tokens_per_req != 1:
+            return _TRITON_GUMBEL_MODE_NONE


Handle mixed speculative prefill batches

When speculative decoding is enabled, ModelExecutor.prepare_step() passes self.config.output_length (the draft-token count, usually >1) for the whole batch, even if the batch also contains prefill rows; _run_sampling() then calls sample() for those prefill rows before verify() for the decode rows. For any non-greedy sampling params in such a mixed prefill+decode step, this branch leaves _step_triton_gumbel_mode as NONE, so the subsequent sample() path raises RuntimeError("did not select a sampling mode") instead of sampling the newly admitted request. The mode selection needs to keep a valid single-token sample mode for mixed speculative batches, or sample() must resolve its mode independently of the verify width.

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: d81cab179f

chatgpt-codex-connector · 2026-05-27T20:05:36Z

+        top_k_ok = (top_k <= 0) | (before_count < top_k)
+        top_p_ok = (top_p >= 1.0 - 1.0e-6) | (before_prob < top_p * top_p_total_prob)
+        min_p_ok = (min_p <= 0.0) | (tl.exp(candidate_logit - max_scaled) >= min_p)
+        ok = top_k_ok & top_p_ok & min_p_ok


Preserve top-k distribution in rejection sampling

When triton_full serves a finite-top_k request (and the plain triton backend falls into generic rejection for top_k 65–127), candidates are still drawn from the full vocabulary and this predicate only accepts them if they land in the top-k set. With top_p=1 and a typical large vocab, the probability that all four retries miss the top-k set is very high (e.g. uniform logits with top_k=100), so the sampler falls back to the initialized argmax instead of sampling from the requested top-k distribution.
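One way to keep the top-k distribution intact is to restrict the candidate set before drawing, rather than drawing from the full vocabulary and rejecting. The sketch below is an illustration in plain Python, not the fix adopted in the PR: `top_k_gumbel_sample` and `miss_all_probability` are hypothetical names, and `random.Random` stands in for the kernel RNG.

```python
import heapq
import math
import random

def top_k_gumbel_sample(logits, top_k, rng):
    """Gumbel-Max draw from softmax restricted to the top_k largest logits."""
    # Pick candidates first so the draw only ever sees top-k tokens.
    candidates = heapq.nlargest(top_k, range(len(logits)), key=lambda i: logits[i])
    best_id, best_score = candidates[0], -math.inf
    for token_id in candidates:
        u = max(rng.random(), 1e-12)  # keep both logarithms finite
        score = logits[token_id] - math.log(-math.log(u))
        if score > best_score:
            best_id, best_score = token_id, score
    return best_id

def miss_all_probability(top_k, vocab, retries):
    """Chance that every full-vocab draw misses the top-k set under uniform logits."""
    return (1.0 - top_k / vocab) ** retries

rng = random.Random(0)
uniform_logits = [0.0] * 1000
samples = [top_k_gumbel_sample(uniform_logits, 10, rng) for _ in range(2000)]
print(len(set(samples)), miss_all_probability(100, 151936, 4))
```

For uniform logits with top_k=100 and vocab 151936, four full-vocab draws all miss the top-k set with probability above 0.99, whereas the restricted draw never leaves the candidate set.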

yweng0828

Could you please compare the performance of fused_topk_topp before we remove this code? #184

Thanks.

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 9a76e165f6

chatgpt-codex-connector · 2026-05-30T17:39:27Z

+    min_p = tl.load(min_p_pool_ptr + pool_idx).to(tl.float32)
+    min_p_log_threshold = tl.log(tl.maximum(min_p, 1.0e-20))


Treat min_p=0 rows as unfiltered

When a triton_full batch contains any request with min_p > 0, _full_has_min_p sends the whole batch through the min-p kernels, including rows whose own min_p is 0. For those rows this clamp turns the disabled cutoff into row_max + log(1e-20), so tokens more than about 46 scaled-logit units below the max are silently removed; with low temperatures or large logit_bias/penalties this changes a request that explicitly disabled min-p. Use -inf for min_p <= 0 (the same issue appears in the parallel min-p path below).
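The effect of the clamp is easy to check numerically. The sketch below mirrors the two threshold choices in plain Python; the function names and the example logit are hypothetical, and the log-space comparison assumes the filter keeps a token when exp(scaled_logit - row_max) >= min_p.

```python
import math

def min_p_log_threshold_clamped(min_p):
    # Pattern under review: a disabled min_p still yields a finite cutoff.
    return math.log(max(min_p, 1.0e-20))

def min_p_log_threshold_fixed(min_p):
    # Suggested behaviour: min_p <= 0 disables the filter entirely.
    return math.log(min_p) if min_p > 0.0 else -math.inf

def survives(scaled_logit, row_max, log_threshold):
    # Keep the token when exp(scaled_logit - row_max) >= min_p, compared in log space.
    return scaled_logit - row_max >= log_threshold

row_max, far_token = 0.0, -60.0  # e.g. a low-temperature row with a wide logit spread
clamped = min_p_log_threshold_clamped(0.0)
fixed = min_p_log_threshold_fixed(0.0)
print(clamped, survives(far_token, row_max, clamped), survives(far_token, row_max, fixed))
```

With `min_p = 0` the clamped threshold is about -46.05, so a token 60 units below the row max is dropped, while the `-inf` threshold keeps it.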

Signed-off-by: FlamingoPg <1106310035@qq.com>

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: baab8fce85

chatgpt-codex-connector · 2026-06-01T09:50:49Z

+        accepted_found = tl.where(accepted, 1, accepted_found)
+        attempt += 1
+
+    token = tl.where(accepted_found != 0, accepted_token, row_argmax)


Avoid falling back to greedy after top-p repair misses

For top-p-only requests with low but valid top_p values, this repair path still gives up after _TOP_P_REPAIR_NUM_ATTEMPTS and returns row_argmax when every rejection draw lands outside the nucleus. That happens with probability (1 - top_p)^8 (about 43% for top_p=0.1), so the optimized Triton top-p route frequently emits greedy tokens instead of sampling from the requested nucleus distribution; use an exact nucleus sampler or keep drawing until acceptance rather than defaulting to argmax.
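The arithmetic behind that estimate can be checked directly. The sketch treats the nucleus mass as exactly top_p, which is an assumption: in practice the admitted mass is at least top_p, so these figures are the worst case. Both function names are hypothetical.

```python
def fallback_probability(nucleus_mass, attempts):
    """Chance that `attempts` independent full-distribution draws all miss the nucleus."""
    return (1.0 - nucleus_mass) ** attempts

def expected_draws_until_accept(nucleus_mass):
    """Mean number of draws if the sampler keeps drawing until acceptance (geometric)."""
    return 1.0 / nucleus_mass

for attempts in (4, 8):
    print(attempts, round(fallback_probability(0.1, attempts), 4))
print(expected_draws_until_accept(0.1))
```

For top_p=0.1 this gives about 0.66 with four attempts and about 0.43 with eight, matching the figure quoted above, while an unbounded draw-until-accept loop would need 10 draws on average.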

chatgpt-codex-connector · 2026-06-01T09:50:50Z

+            self._seed_pool,
+            offsets_pool,
+            out[:rows],
+            min_p_pool=self._min_p_pool,


Disable min-p in the generic full route when absent

When a triton_full batch has no min_p requests but falls into the generic mixed route (for example mixing top_k=-1 and finite top_k), _full_has_min_p is false but this still passes _min_p_pool into the generic sampler. Rows with the default min_p=0 then use log(max(0, 1e-20)) as a cutoff, silently filtering tokens more than ~46 scaled-logit units below the max even though min-p is disabled; pass None here unless _full_has_min_p is true.

FlamingoPg requested a review from a team as a code owner May 27, 2026 07:55

lightseek-bot reviewed May 27, 2026

FlamingoPg force-pushed the flamingo/sample branch from a76a3e9 to c3c2581 Compare May 27, 2026 08:02

chatgpt-codex-connector Bot reviewed May 27, 2026

FlamingoPg force-pushed the flamingo/sample branch from c3c2581 to 981f940 Compare May 27, 2026 09:01

chatgpt-codex-connector Bot reviewed May 27, 2026

FlamingoPg force-pushed the flamingo/sample branch from 981f940 to 6813968 Compare May 27, 2026 09:21

chatgpt-codex-connector Bot reviewed May 27, 2026

FlamingoPg force-pushed the flamingo/sample branch from 6813968 to 28fa045 Compare May 27, 2026 09:42

chatgpt-codex-connector Bot reviewed May 27, 2026

FlamingoPg force-pushed the flamingo/sample branch from 27b7316 to 288f13f Compare May 27, 2026 20:32

yweng0828 requested changes May 28, 2026

FlamingoPg marked this pull request as draft May 28, 2026 04:06

FlamingoPg force-pushed the flamingo/sample branch from dc66686 to c121645 Compare May 28, 2026 04:12

FlamingoPg changed the title from "Replace sampling FlashInfer backend with TokenSpeed Triton" to "Add Triton sampling backends alongside FlashInfer" May 30, 2026

FlamingoPg marked this pull request as ready for review May 30, 2026 17:33

FlamingoPg marked this pull request as draft May 30, 2026 17:34

FlamingoPg marked this pull request as ready for review May 30, 2026 17:34

chatgpt-codex-connector Bot reviewed May 30, 2026

FlamingoPg added 9 commits June 1, 2026 09:43

Replace sampling FlashInfer backend with Triton

19e47d4

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): capture greedy triton graph variant

38227a6

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): limit greedy graph variant to spec verify

57b1190

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): randomize greedy triton ties

5d75a54

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): keep triton greedy deterministic

0e6d697

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): keep single-token greedy graph stable

f62191b

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): guard verify logits before argmax

f3a2e15

Signed-off-by: FlamingoPg <1106310035@qq.com>

test(sampling): gate verify nan guard to nvidia

06a4e94

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): reject invalid greedy verify rows

52513b2

Signed-off-by: FlamingoPg <1106310035@qq.com>

FlamingoPg added 4 commits June 1, 2026 09:43

perf(sampling): specialize top-p rejection path

c85040c

Signed-off-by: FlamingoPg <1106310035@qq.com>

restore fused topk topp thirdparty sources

f4abf9b

Signed-off-by: FlamingoPg <1106310035@qq.com>

feat(sampling): add triton sampling backends

65dfa4f

Signed-off-by: FlamingoPg <1106310035@qq.com>

Refine Triton sampling implementation

baab8fc

Signed-off-by: FlamingoPg <1106310035@qq.com>

FlamingoPg force-pushed the flamingo/sample branch from 633932c to baab8fc Compare June 1, 2026 09:45

chatgpt-codex-connector Bot reviewed Jun 1, 2026
