
Conversation

@martindevans
Member

@martindevans martindevans commented May 1, 2025

Updated llama.cpp binaries to ceda28ef8e310a8dee60bf275077a3eedae8e36c, compiled with this run.

This PR includes work done by @nipeone in #1138 (adding Linux-ARM64 support) and by @AmSmart in #1130 (adding Android support).

Testing:

  • Windows CPU
  • Windows CUDA
  • Windows Vulkan
  • Linux CPU (x64)
  • Linux CPU (ARM64)
  • Linux CPU (musl)
  • Linux CUDA
  • Linux Vulkan
  • MacOS

AmSmart and others added 30 commits March 19, 2025 03:13

  • Upgraded Linux runners to Ubuntu 24.04
  • Upgraded Linux AVX512 to Ubuntu 24
  • update gitignore and add missing xml files
@SignalRT
Collaborator

SignalRT commented May 3, 2025

Tested on MacOS.

My personal opinion is that the Mobile project should be in a different solution. It depends heavily on specific SDKs and on your platform.

@martindevans
Member Author

I've removed LLama.Mobile from the main solution for now. We can work out the best way to handle it in later PRs.

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

Martin, be careful with the llama.cpp @ be7c303 submodule reference.
You probably still need to update it; it is not at ceda28ef8e310a8dee60bf275077a3eedae8e36c yet.

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

I have tested the code and it works well, also with new models. The binaries compiled from llama.cpp are much smaller now (my ggml-cuda.dll is only 45 MB).

Thank you!

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

After further testing, the KernelMemory example does not work anymore. There is a strange crash at MemoryAnswer answer = await memory.AskAsync(question);. Could you please test it on your computer?

Strangely, it runs out of memory. Is there a default parameter change somewhere? For example, if the context size of the model is used instead of the user-defined one, this could happen...

[llama Error]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 560.00 MiB on device 0: cudaMalloc failed: out of memory
[llama Error]: ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 587206656
[llama Error]: llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
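
For reference, a minimal sketch of pinning the context size explicitly on the LLamaSharp side, assuming the standard ModelParams / LLamaWeights API (the model path and sizes here are placeholders, not values from this PR). If a larger context were being picked up from a default somewhere, the compute buffer llama.cpp reserves (the ~560 MiB allocation in the log above) would grow with it:

```csharp
using LLama;
using LLama.Common;

// Placeholder model path. ContextSize is set explicitly so the compute
// buffers llama.cpp reserves are sized for the requested context rather
// than whatever default the integration might pick up.
var parameters = new ModelParams("path/to/model.gguf")
{
    ContextSize = 2048,
    GpuLayerCount = 20,
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
```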

@martindevans
Member Author

martindevans commented May 9, 2025

Fixed the submodule, thanks for the reminder.

> I have tested the code and it works well

Which platforms did you test on? At the moment we've got an issue with the Linux CPU binaries failing for some people on Linux that we're having trouble narrowing down. So if you had success there that'd be an interesting datapoint.

@martindevans
Member Author

martindevans commented May 9, 2025

> After further testing the KernelMemory example does not work anymore

I just tested Kernel Memory: Document Q&A with Qwen3-30B-A3B Q4_K_M and it worked for me, with some GPU load so it seems like it offloaded properly.

There aren't any config changes, just one extra field in the config struct which is always null. If you collapse LLama.Mobile in the diff you can see there actually aren't that many changes in this PR - it just looks big because that added a lot of files!

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

> Fixed the submodule, thanks for the reminder.
>
> > I have tested the code and it works well
>
> Which platforms did you test on? At the moment we've got an issue with the Linux CPU binaries failing for some people on Linux that we're having trouble narrowing down. So if you had success there that'd be an interesting datapoint.

Thank you!

I have found the problem. My modifications are not in your PR yet; they will probably be picked up when you merge.

This is the bug I identified before: the SplitMode bug in llama.cpp. If you offload all layers to the GPU and use Native.GPUSplitMode.None, the code will crash. I showed you the C++ code where this happens before.
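
For anyone following along, a minimal sketch of the parameter combination being described, assuming ModelParams exposes SplitMode alongside GpuLayerCount (the model path is a placeholder; this is illustrative, not code from the PR):

```csharp
using LLama.Common;
using LLama.Native;

var parameters = new ModelParams("path/to/model.gguf")
{
    // Offload all layers to the GPU...
    GpuLayerCount = -1,
    // ...while forcing no split across devices. This is the combination
    // reported above to crash inside llama.cpp.
    SplitMode = GPUSplitMode.None,
};
```

Until the upstream bug is fixed, avoiding that exact combination (fewer offloaded layers, or a different split mode) is presumably the way to sidestep it.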

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

> > After further testing the KernelMemory example does not work anymore
>
> I just tested Kernel Memory: Document Q&A with Qwen3-30B-A3B Q4_K_M and it worked for me, with some GPU load so it seems like it offloaded properly.
>
> There aren't any config changes, just one extra field in the config struct which is always null. If you collapse LLama.Mobile in the diff you can see there actually aren't that many changes in this PR - it just looks big because that added a lot of files!

If you off-load all layers (GpuLayerCount = -1) it will crash. See my remark above.

@martindevans
Member Author

> If you off-load all layers (GpuLayerCount = -1) it will crash. See my remark above.

I just tried it with GpuLayerCount = -1 and it still worked for me. However there was about 20GB of system memory in use as well as all of my VRAM, so maybe it spilled from VRAM into system RAM? I think that's something that can be turned off in the drivers, if you've got it off that'd explain the difference.

I just double checked the context size, since it's not specified in the config. If you trace that through KM it defaults to 2048 if none is specified so that's fine.

As a side note, there is some pretty suspect stuff happening here though!

  • context = weights.CreateContext(parameters); creates a context
  • var executor = new StatelessExecutor(weights, parameters); internally creates a context
  • new LLamaSharpTextEmbeddingGenerator(config, weights) creates a context for embedding
  • new LlamaSharpTextGenerator(weights, context, executor, config.DefaultInferenceParams) accepts both an executor (with a context) and a context (but not necessarily the same context).

None of this has changed, so it's not the cause of your problem. But it does look like it could use a re-work! Would you be interested in working on that in the near future?
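
To make the duplication easier to see, here is a rough, illustrative reconstruction of that wiring, pieced together from the calls listed above (the LLamaSharpConfig constructor, the namespaces and the model path are assumptions, not code copied from the repository). Each comment marks a point where a context comes into existence:

```csharp
using LLama;
using LLama.Common;
using LLamaSharp.KernelMemory;

var modelPath = "path/to/model.gguf";          // placeholder
var config = new LLamaSharpConfig(modelPath);  // assumed constructor shape
var parameters = new ModelParams(modelPath);

using var weights = LLamaWeights.LoadFromFile(parameters);

// 1. An explicit context created up front.
var context = weights.CreateContext(parameters);

// 2. StatelessExecutor manages its own context internally.
var executor = new StatelessExecutor(weights, parameters);

// 3. The embedding generator creates yet another context for embeddings.
var embedder = new LLamaSharpTextEmbeddingGenerator(config, weights);

// 4. The text generator receives both an executor (with its own context)
//    and the context from step 1, which need not be the same one.
var generator = new LlamaSharpTextGenerator(weights, context, executor, config.DefaultInferenceParams);
```

That is at least three live contexts for a single pipeline, which lines up with the memory behaviour discussed in the following comments.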

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

> I just tried it with GpuLayerCount = -1 and it still worked for me. However there was about 20GB of system memory in use as well as all of my VRAM, so maybe it spilled from VRAM into system RAM? I think that's something that can be turned off in the drivers, if you've got it off that'd explain the difference.
>
> I just double checked the context size, since it's not specified in the config. If you trace that through KM it defaults to 2048 if none is specified so that's fine.
>
> As a side note, there is some pretty suspect stuff happening here though!
>
>   • context = weights.CreateContext(parameters); creates a context
>   • var executor = new StatelessExecutor(weights, parameters); internally creates a context
>   • new LLamaSharpTextEmbeddingGenerator(config, weights) creates a context for embedding
>   • new LlamaSharpTextGenerator(weights, context, executor, config.DefaultInferenceParams) accepts both an executor (with a context) and a context (but not necessarily the same context).
>
> None of this has changed, so it's not the cause of your problem. But it does look like it could use a re-work! Would you be interested in working on that in the near future?

Yes, I think you're right. This will be the reason for the sudden peak in GPU memory use. Many contexts are created and maintained everywhere. We either need to destroy them after use or ensure they are reusable. Since it is stateless, it can be leveraged in various situations. I'm particularly curious about the embedder's context, as it might require additional parameters. I'll try to make some time over the weekend to explore a more memory-efficient solution.

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

There is one way to do it: move public int CountTokens(string text) and public IReadOnlyList<string> GetTokens(string text) to the StatelessExecutor and access them from there. If that is OK with you, I will try to implement it over the weekend.
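
Not the actual proposal, just a minimal sketch of the idea under discussion, written as standalone helpers rather than instance methods (the class name and the explicit weights/parameters arguments are made up for illustration, and it assumes LLamaContext.Tokenize and StreamingTokenDecoder behave as in current LLamaSharp). The point is that a context only exists for the duration of each call:

```csharp
using System.Collections.Generic;
using System.Linq;
using LLama;
using LLama.Abstractions;

public static class TokenizationSketch
{
    // Count tokens by creating a throwaway context, tokenizing, and disposing it.
    public static int CountTokens(LLamaWeights weights, IContextParams parameters, string text)
    {
        using var context = weights.CreateContext(parameters);
        return context.Tokenize(text).Length;
    }

    // Return the token pieces as strings, again with a short-lived context.
    public static IReadOnlyList<string> GetTokens(LLamaWeights weights, IContextParams parameters, string text)
    {
        using var context = weights.CreateContext(parameters);
        return context.Tokenize(text)
                      .Select(token =>
                      {
                          // One decoder per token so each token maps to its own string piece.
                          var decoder = new StreamingTokenDecoder(context);
                          decoder.Add(token);
                          return decoder.Read();
                      })
                      .ToList();
    }
}
```

Whether these belong on the StatelessExecutor, on the LLamaWeights, or somewhere else is exactly the question discussed below.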

@martindevans
Member Author

martindevans commented May 9, 2025

They both seem like things that shouldn't need a context; can they be moved to the LLamaWeights? (By the way, let's move this discussion to a separate issue, since it's not related to this PR.)

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

> They both seem like things that shouldn't need a context; can they be moved to the LLamaWeights? (By the way, let's move this discussion to a separate issue, since it's not related to this PR.)

I think it's better to associate a context with the executor rather than the weights. I will try to propose something, and then we can continue the discussion there.

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

It was a good idea, Martin! I have quickly put together a modification that streamlines contexts everywhere, and it decreases GPU memory use by 30%! In my solution, both LLamaEmbedder and StatelessExecutor get their own CountTokens and GetTokens functions. When these are called, we create the context on the fly, so there is only one context in memory at any time.
I will clean it up and present the PR over the weekend.

@martindevans
Member Author

Triggered a new build run, hopefully resolving the Linux CPU issues: https://github.com/martindevans/LLamaSharp/actions/runs/14956801082

@martindevans martindevans merged commit 1668e76 into SciSharp:master May 11, 2025
6 checks passed
@martindevans martindevans deleted the update_apr_2025 branch May 11, 2025 18:57