context : perform output reorder lazily upon access after sync #14853
Conversation
Can we avoid the reorder entirely if we drop …
Hm, I think …
Ah, got it. Let me see.
Yes, we can "redirect" the index on-the-fly using the …
Do it in a follow-up PR, so we can patch the current issue for now?
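A rough sketch of what "redirecting the index on-the-fly" could mean, assuming a recorded mapping from output index to buffer row is kept around after decoding (the struct and field names below, such as output_ids, are illustrative, not the actual llama.cpp internals):

```cpp
#include <vector>

// Illustrative only: map the requested output index through a permutation
// instead of physically swapping the logits/embeddings rows.
struct output_index_map {
    std::vector<int> output_ids; // output_ids[i] = buffer row that holds output i

    int resolve(int i) const {
        return output_ids.empty() ? i : output_ids[i]; // identity if nothing was reordered
    }
};
```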
Only because the … (see llama.cpp/src/llama-context.cpp, lines 1221 to 1225 at e4868d1)
If …
(Right, this comment is redundant with #14853 (comment) and #14853 (comment).)
Actually, because of how … (see llama.cpp/tools/perplexity/perplexity.cpp, line 612 at e4868d1)
The above assumes the logits after the i-th one are in the correct order. There might need to be an API for logits ranges.
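For context, a hedged illustration of the access pattern being discussed: the tool takes a pointer from llama_get_logits_ith() and then reads the following rows directly from it, which only works if those rows are contiguous and already in output order. The helper name and parameters below are made up for illustration:

```cpp
#include <cstddef>

#include "llama.h"

// Hypothetical helper: rows `first .. first + num - 1` of the logits buffer
// are assumed to be contiguous and in the correct order.
static void process_logits_range(llama_context * ctx, int first, int num, int n_vocab) {
    const float * all_logits = llama_get_logits_ith(ctx, first);
    for (int j = 0; j < num; ++j) {
        const float * row = all_logits + (size_t) j * n_vocab;
        (void) row; // ... compute log-probabilities from `row` ...
    }
}
```

An index redirection alone would not cover this pattern, which is why an API for logits ranges is mentioned above.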
This usage of …
It is indirectly stated in the comment that says …
…org#14853)
* context : perform output reorder lazily upon access after sync
ggml-ci
* cont : add TODO
* origin/master:
  docs : update HOWTO-add-model.md for ModelBase and new model classes (ggml-org#14874)
  ggml : remove invalid portPos specifiers from dot files (ggml-org#14838)
  context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (ggml-org#14870)
  mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (ggml-org#14503)
  rpc : check for null buffers in get/set/copy tensor endpoints (ggml-org#14868)
  sched : fix multiple evaluations of the same graph with pipeline parallelism (ggml-org#14855)
  musa: upgrade musa sdk to rc4.2.0 (ggml-org#14498)
  sync : ggml
  cmake : fix usage issues (ggml/1257)
  ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
  context : perform output reorder lazily upon access after sync (ggml-org#14853)
  chat : fix kimi-k2 chat template (ggml-org#14852)
  sycl: fixed semantics of block offset calculation (ggml-org#14814)
  llama : fix MiniCPM inference after Granite Four changes (ggml-org#14850)
  docs: add libcurl-dev install hint for Linux distros (ggml-org#14801)
  metal : fix fusion across different encoders (ggml-org#14849)
  sycl: fix undefined variable in work group size check (ggml-org#14843)
  convert : text-only support for GLM-4.1V-9B-Thinking (ggml-org#14823)
  CUDA: fix overflow in FA, tune performance (ggml-org#14840)
  CUDA: fix compilation with GGML_CUDA_F16 (ggml-org#14837)
ref #14795 (comment)
After processing a batch, remember the indices that we have to swap, and apply the data swap (of logits and embeddings) upon access via llama_get_...
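A minimal sketch of that idea, assuming the context records pending row swaps after decoding and applies them only when the outputs are first accessed (the struct and member names here, such as output_swaps and output_reorder, are illustrative and not the actual llama.cpp implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative stand-in for the context's output state.
struct output_buffers {
    std::vector<float> logits;                      // [n_outputs, n_vocab], possibly out of order
    std::vector<float> embd;                        // [n_outputs, n_embd],  possibly out of order
    std::vector<std::pair<int, int>> output_swaps;  // row swaps recorded after the batch was processed
    int n_vocab = 0;
    int n_embd  = 0;

    // apply the pending swaps once; no-op on subsequent accesses
    void output_reorder() {
        for (const auto & s : output_swaps) {
            const size_t i = s.first;
            const size_t j = s.second;
            std::swap_ranges(logits.begin() + i*n_vocab, logits.begin() + (i + 1)*n_vocab,
                             logits.begin() + j*n_vocab);
            std::swap_ranges(embd.begin()   + i*n_embd,  embd.begin()   + (i + 1)*n_embd,
                             embd.begin()   + j*n_embd);
        }
        output_swaps.clear();
    }

    // accessors perform the reorder lazily, only when the data is actually requested
    float * get_logits() { output_reorder(); return logits.data(); }
    float * get_embd()   { output_reorder(); return embd.data();   }
};
```

This keeps the decode path free of the reorder cost when the caller never reads the outputs, while callers of the getters still observe the data in the expected order.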