
Conversation

ggerganov
Member

ref #14795 (comment)

After processing a batch, remember the indices that have to be swapped, and apply the data swap (of logits and embeddings) only upon access via llama_get_...
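For illustration, a minimal sketch of what such a deferred swap could look like; the struct and member names below (lazy_output_ctx, out_ids, reorder_pending, etc.) are hypothetical and do not reflect the actual llama_context fields:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical sketch of "defer the reorder until access", not the real implementation.
struct lazy_output_ctx {
    std::vector<int32_t> out_ids;          // batch position of each output row, in production order
    std::vector<float>   logits;           // n_outputs x n_vocab, rows still in production order
    int32_t              n_vocab         = 0;
    bool                 reorder_pending = false; // set after decode, cleared on first access

    // would be called from llama_get_logits / llama_get_embeddings before
    // handing a pointer back to the caller
    void apply_pending_reorder() {
        if (!reorder_pending) {
            return;
        }

        // permutation that sorts the output rows by their batch position
        std::vector<int32_t> order(out_ids.size());
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](int32_t a, int32_t b) { return out_ids[a] < out_ids[b]; });

        // apply the permutation to the logits buffer
        std::vector<float> sorted(logits.size());
        for (size_t dst = 0; dst < order.size(); ++dst) {
            std::copy_n(logits.data() + (size_t) order[dst] * n_vocab, n_vocab,
                        sorted.data() + dst * n_vocab);
        }
        logits.swap(sorted);

        reorder_pending = false;
    }
};

The point is only that the permutation is remembered at decode time and the copy happens at most once, on the first llama_get_... call after the sync.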

@ggerganov ggerganov requested review from slaren and compilade July 24, 2025 11:43
@slaren
Member

slaren commented Jul 24, 2025

Can we avoid the reorder entirely if we drop llama_get_logits in favor of llama_get_logits_ith?

@ggerganov ggerganov changed the title from "context : perform output reorder after lazily upon access after sync" to "context : perform output reorder lazily upon access after sync" Jul 24, 2025
@ggerganov
Member Author

Hm, I think llama_get_logits_ith also requires the reorder?

@ggerganov
Member Author

Ah, got it. Let me see.

@ggerganov
Member Author

Yes, we can "redirect" the index on-the-fly using the output_ids. Seems like a good idea.

Do it in a follow-up PR, so we can patch the current issue for now?
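For illustration, the redirection could look roughly like this; the names below are illustrative only, not the real implementation:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of "redirect the index through output_ids" instead of
// physically reordering the buffer.
struct redirect_ctx {
    std::vector<int32_t> output_ids; // batch position -> row in `logits`, -1 if no output there
    std::vector<float>   logits;     // rows stay in production order, never moved
    int32_t              n_vocab = 0;

    const float * get_logits_ith(int32_t i) const {
        const int32_t row = output_ids[i]; // on-the-fly redirection
        if (row < 0) {
            return nullptr;                // no logits were requested for this position
        }
        return logits.data() + (size_t) row * n_vocab;
    }
};

Since the lookup goes through output_ids, no data is moved at all, which is what makes dropping the reorder attractive.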

@ggerganov ggerganov merged commit e4868d1 into master Jul 24, 2025
47 checks passed
@compilade
Collaborator

compilade commented Jul 24, 2025

Hm, I think llama_get_logits_ith also requires the reorder?

Only because the out_ids are sorted; otherwise it wouldn't.

// build the inverse map: batch position -> output row index (-1 = no output)
std::fill(output_ids.begin(), output_ids.end(), -1);
for (uint32_t i = 0; i < n_outputs; ++i) {
    output_ids[out_ids[i]] = i;
}

If llama_get_logits and llama_get_embeddings are removed, then the sorting and output reordering can be removed altogether.

(Right, this comment is redundant with #14853 (comment) and #14853 (comment))

@compilade
Collaborator

compilade commented Jul 24, 2025

Actually, because of how llama_get_logits_ith is called in llama-perplexity, sorting is still required.

const float * all_logits = num_batches > 1 ? logits.data() : llama_get_logits_ith(ctx, seq*n_ctx + first);

The above assumes that the logits following the i-th one are laid out contiguously and in the correct order.

There might need to be an API for logits ranges.
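For illustration, one hypothetical shape such a fix (or a range-style accessor built on top of the existing API) could take, gathering each row through llama_get_logits_ith instead of assuming contiguity; gather_logits is not an existing function:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

#include "llama.h"

// Hypothetical adaptation of the llama-perplexity call site: resolve every
// index separately instead of taking a single pointer and reading past it.
static std::vector<float> gather_logits(llama_context * ctx, int32_t first, int32_t count, int32_t n_vocab) {
    std::vector<float> out((size_t) count * n_vocab);
    for (int32_t k = 0; k < count; ++k) {
        const float * row = llama_get_logits_ith(ctx, first + k); // each index resolved independently
        std::copy_n(row, n_vocab, out.data() + (size_t) k * n_vocab);
    }
    return out;
}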

@ggerganov
Member Author

This usage of llama_get_logits_ith() is technically wrong because it assumes the logits are sequential, but this is not stated in the API. So I think it is OK to break it.

@slaren
Member

slaren commented Jul 24, 2025

It is indirectly stated in the comment that says "Equivalent to: llama_get_logits(ctx) + ctx->output_ids[i]*n_vocab". The examples would need to be updated, but that should be an easy problem to solve; the bigger problem is that it may silently break 3rd-party code that we have no control over.
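For illustration, the kind of 3rd-party pattern that would break silently; sum_following_rows is a made-up example, not code from any real project:

#include <cstddef>
#include <cstdint>

#include "llama.h"

// Takes one pointer from llama_get_logits_ith and strides through later rows
// with pointer arithmetic. This only stays correct while the outputs are kept
// physically sorted; if that guarantee were dropped, the results would simply
// be wrong, with no error raised. n_rows and n_vocab come from the caller.
static float sum_following_rows(llama_context * ctx, int32_t first, int32_t n_rows, int32_t n_vocab) {
    const float * base = llama_get_logits_ith(ctx, first); // valid for row `first`
    float acc = 0.0f;
    for (int32_t r = 0; r < n_rows; ++r) {
        for (int32_t v = 0; v < n_vocab; ++v) {
            acc += base[(size_t) r * n_vocab + v]; // implicit ordering/contiguity assumption
        }
    }
    return acc;
}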

taronaeo pushed a commit to taronaeo/llama.cpp-s390x that referenced this pull request Jul 25, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jul 25, 2025