May Binary Update #1179
Conversation
Upgraded Linux runners to Ubuntu 24.04
Upgraded Linux AVX512 to Ubuntu 24
…port' into update_apr_2025
update gitignore and add missing xml files
Tested on macOS. My personal opinion is that the Mobile project should be in a different solution; it depends so heavily on the specific SDK and your platform.
I've removed LLama.Mobile from the main solution for now. We can work out the best way to handle it in later PRs.
Martin, be careful with the llama.cpp submodule @ be7c303
I have tested the code and it works well, including with new models. The binaries compiled from the C++ side are much smaller now (my ggml-cuda.dll is only 45 MB). Thank you!
After further testing, the KernelMemory example does not work anymore. There is a strange "not enough memory" crash. Is there a default parameter change somewhere? For example, if the model's context size is used instead of the user-defined one, then this may happen...
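(For reference, a minimal sketch of pinning the context size explicitly rather than relying on defaults. The property names follow LLamaSharp's ModelParams; the model path and concrete values are placeholders, not settings from this PR.)

```csharp
using LLama;
using LLama.Common;

// Set the context size explicitly so the user-defined value is used,
// rather than whatever default the model or a downstream library picks.
var parameters = new ModelParams("model.gguf")   // placeholder path
{
    ContextSize = 4096,   // user-defined context size (placeholder value)
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
```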
Fixed the submodule, thanks for the reminder.
Which platforms did you test on? At the moment we've got an issue with the Linux CPU binaries failing for some people, which we're having trouble narrowing down. So if you had success there, that'd be an interesting data point.
I just tested […]. There aren't any config changes, just one extra field in the config struct, which is always null. If you collapse […]
Thank you! I have found the problem. My modifications are not in your PR yet; they will probably get activated when you merge. This is the bug I identified before: the SplitMode bug in llama.cpp. If you offload all layers to the GPU and use Native.GPUSplitMode.None, the code will crash. I have shown you the C++ code where this happens before.
If you offload all layers (GpuLayerCount = -1), it will crash. See my remark above.
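(A minimal sketch of the configuration being described, assuming LLamaSharp's ModelParams exposes GpuLayerCount and a GPUSplitMode-typed SplitMode property; the model path is a placeholder. This is the combination reported to crash, not a recommended setup.)

```csharp
using LLama;
using LLama.Common;
using LLama.Native;

// The combination described above: offload every layer to the GPU while
// disabling GPU splitting. This is what reportedly triggers the crash.
var parameters = new ModelParams("model.gguf")   // placeholder path
{
    GpuLayerCount = -1,                // -1 = offload all layers to the GPU
    SplitMode = GPUSplitMode.None,     // no GPU split
};

using var weights = LLamaWeights.LoadFromFile(parameters);
```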
I just tried it with […]. I just double-checked the context size, since it's not specified in the config. If you trace that through KM, it defaults to […]. As a side note, there is some pretty suspect stuff happening here though!
None of this has changed, so it's not the cause of your problem. But it does look like it could use a re-work! Would you be interested in working on that in the near future?
Yes, I think you're right. This is likely the reason for the sudden peak in GPU memory use. Many contexts are created and maintained everywhere; we either need to destroy them after use or ensure they are reusable. Since it is stateless, it can be leveraged in various situations. I'm particularly curious about the embedder's context, as it might require additional parameters. I'll try to make some time over the weekend to explore a more memory-efficient solution.
There is one way to do it: move […]
They both seem like things that shouldn't need a context; can they be moved to the […]?
I think we'd be better off associating a context with the executor instead of the weights. I will try to propose something and then we can discuss it there.
It was a good idea, Martin! I have quickly put together a modification that streamlines contexts everywhere, and it decreases GPU memory use by 30%! In my solution, both LLamaEmbedder and StatelessExecutor get their own CountTokens and GetTokens functions. When these are called, the context is created on the fly, so there is only one context in memory at any time.
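(A minimal sketch of the "context on the fly" idea, not the actual PR implementation: a helper that creates a short-lived context just for tokenization and disposes it immediately. The helper name is illustrative, and CreateContext/Tokenize are assumed to have these shapes in the LLamaSharp version in use.)

```csharp
using LLama;
using LLama.Common;

static class TokenCounting
{
    // Create a context only for the duration of the call, so at most one
    // context holds memory at any time, then dispose it immediately.
    public static int CountTokens(LLamaWeights weights, ModelParams parameters, string text)
    {
        using var context = weights.CreateContext(parameters);
        return context.Tokenize(text).Length;
    }
}
```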
- Commented out flaky test
Triggered a new build run, hopefully resolving the Linux CPU issues: https://github.com/martindevans/LLamaSharp/actions/runs/14956801082
Updated llama.cpp binaries to ceda28ef8e310a8dee60bf275077a3eedae8e36c, compiled with this run.
This PR includes work done by @nipeone in #1138 (adding Linux-ARM64 support) and by @AmSmart in #1130 (adding Android support).
Testing: