
Conversation

ggerganov
Member

ref #14795 (comment)

After processing a batch, remember the indices that have to be swapped, and apply the data swap (of logits and embeddings) only upon access via llama_get_...
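For illustration, a minimal sketch of what such a deferred swap could look like; the struct and member names below (lazy_output_ctx, out_ids, reorder_pending, etc.) are hypothetical and do not reflect the actual llama_context fields:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical sketch of "defer the reorder until access", not the real implementation.
struct lazy_output_ctx {
    std::vector<int32_t> out_ids;          // batch position of each output row, in production order
    std::vector<float>   logits;           // n_outputs x n_vocab, rows still in production order
    int32_t              n_vocab         = 0;
    bool                 reorder_pending = false; // set after decode, cleared on first access

    // would be called from llama_get_logits / llama_get_embeddings before
    // handing a pointer back to the caller
    void apply_pending_reorder() {
        if (!reorder_pending) {
            return;
        }

        // permutation that sorts the output rows by their batch position
        std::vector<int32_t> order(out_ids.size());
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](int32_t a, int32_t b) { return out_ids[a] < out_ids[b]; });

        // apply the permutation to the logits buffer
        std::vector<float> sorted(logits.size());
        for (size_t dst = 0; dst < order.size(); ++dst) {
            std::copy_n(logits.data() + (size_t) order[dst] * n_vocab, n_vocab,
                        sorted.data() + dst * n_vocab);
        }
        logits.swap(sorted);

        reorder_pending = false;
    }
};

The point is only that the permutation is remembered at decode time and the copy happens at most once, on the first llama_get_... call after the sync.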

@ggerganov ggerganov requested review from slaren and compilade July 24, 2025 11:43
@slaren
Member

slaren commented Jul 24, 2025

Can we avoid the reorder entirely if we drop llama_get_logits in favor of llama_get_logits_ith?

@ggerganov ggerganov changed the title from "context : perform output reorder after lazily upon access after sync" to "context : perform output reorder lazily upon access after sync" Jul 24, 2025
@ggerganov
Member Author

Hm, I think llama_get_logits_ith also requires the reorder?

@ggerganov
Member Author

Ah, got it. Let me see.

@ggerganov
Member Author

Yes, we can "redirect" the index on-the-fly using the output_ids. Seems like a good idea.

Do it in a follow-up PR, so we can patch the current issue for now?
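For illustration, the redirection could look roughly like this; the names below are illustrative only, not the real implementation:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of "redirect the index through output_ids" instead of
// physically reordering the buffer.
struct redirect_ctx {
    std::vector<int32_t> output_ids; // batch position -> row in `logits`, -1 if no output there
    std::vector<float>   logits;     // rows stay in production order, never moved
    int32_t              n_vocab = 0;

    const float * get_logits_ith(int32_t i) const {
        const int32_t row = output_ids[i]; // on-the-fly redirection
        if (row < 0) {
            return nullptr;                // no logits were requested for this position
        }
        return logits.data() + (size_t) row * n_vocab;
    }
};

Since the lookup goes through output_ids, no data is moved at all, which is what makes dropping the reorder attractive.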

@ggerganov ggerganov merged commit e4868d1 into master Jul 24, 2025
47 checks passed
@compilade
Collaborator

compilade commented Jul 24, 2025

Hm, I think llama_get_logits_ith also requires the reorder?

Only because the out_ids are sorted; otherwise it wouldn't.

// build the inverse map: batch position -> output row index (-1 = no output)
std::fill(output_ids.begin(), output_ids.end(), -1);
for (uint32_t i = 0; i < n_outputs; ++i) {
    output_ids[out_ids[i]] = i;
}

If llama_get_logits and llama_get_embeddings are removed, then the sorting and output reordering can be removed altogether.

(Right, this comment is redundant with #14853 (comment) and #14853 (comment))

@compilade
Collaborator

compilade commented Jul 24, 2025

Actually, because of how llama_get_logits_ith is called in llama-perplexity, sorting is still required.

const float * all_logits = num_batches > 1 ? logits.data() : llama_get_logits_ith(ctx, seq*n_ctx + first);

The above assumes that the logits following the i-th one are laid out contiguously and in the correct order.

There might need to be an API for logits ranges.
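For illustration, one hypothetical shape such a fix (or a range-style accessor built on top of the existing API) could take, gathering each row through llama_get_logits_ith instead of assuming contiguity; gather_logits is not an existing function:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

#include "llama.h"

// Hypothetical adaptation of the llama-perplexity call site: resolve every
// index separately instead of taking a single pointer and reading past it.
static std::vector<float> gather_logits(llama_context * ctx, int32_t first, int32_t count, int32_t n_vocab) {
    std::vector<float> out((size_t) count * n_vocab);
    for (int32_t k = 0; k < count; ++k) {
        const float * row = llama_get_logits_ith(ctx, first + k); // each index resolved independently
        std::copy_n(row, n_vocab, out.data() + (size_t) k * n_vocab);
    }
    return out;
}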

@ggerganov
Member Author

This usage of llama_get_logits_ith() is technically wrong because it assumes the logits are sequential, but this is not stated in the API. So I think it is OK to break it.

@slaren
Member

slaren commented Jul 24, 2025

It is indirectly stated in the comment that says "Equivalent to: llama_get_logits(ctx) + ctx->output_ids[i]*n_vocab". The examples would need to be updated, but that should be an easy problem to solve; the bigger problem is that it may silently break 3rd-party code that we have no control over.
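For illustration, the kind of 3rd-party pattern that would break silently; sum_following_rows is a made-up example, not code from any real project:

#include <cstddef>
#include <cstdint>

#include "llama.h"

// Takes one pointer from llama_get_logits_ith and strides through later rows
// with pointer arithmetic. This only stays correct while the outputs are kept
// physically sorted; if that guarantee were dropped, the results would simply
// be wrong, with no error raised. n_rows and n_vocab come from the caller.
static float sum_following_rows(llama_context * ctx, int32_t first, int32_t n_rows, int32_t n_vocab) {
    const float * base = llama_get_logits_ith(ctx, first); // valid for row `first`
    float acc = 0.0f;
    for (int32_t r = 0; r < n_rows; ++r) {
        for (int32_t v = 0; v < n_vocab; ++v) {
            acc += base[(size_t) r * n_vocab + v]; // implicit ordering/contiguity assumption
        }
    }
    return acc;
}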

taronaeo pushed a commit to taronaeo/llama.cpp-s390x that referenced this pull request Jul 25, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jul 25, 2025