
Conversation

@martindevans
Member

@martindevans martindevans commented May 1, 2025

Updated llama.cpp binaries to ceda28ef8e310a8dee60bf275077a3eedae8e36c, compiled with this run.

This PR includes work done by @nipeone in #1138 (adding Linux-ARM64 support) and by @AmSmart in #1130 (adding Android support).

Testing:

  • Windows CPU
  • Windows CUDA
  • Windows Vulkan
  • Linux CPU (x64)
  • Linux CPU (ARM64)
  • Linux CPU (musl)
  • Linux CUDA
  • Linux Vulkan
  • MacOS

AmSmart and others added 30 commits March 19, 2025 03:13

  • Upgraded Linux runners to Ubuntu 24.04
  • Upgraded Linux AVX512 to Ubuntu 24
  • update gitignore and add missing xml files
@SignalRT
Collaborator

SignalRT commented May 3, 2025

Tested on MacOS.

My personal opinion is that the Mobile project should be in a different solution. It depends heavily on specific SDKs and on your platform.

@martindevans
Member Author

I've removed LLama.Mobile from the main solution for now. We can work out the best way to handle it in later PRs.

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

Martin, be careful with the llama.cpp @ be7c303 submodule reference.
You probably still need to update it; it is not at ceda28ef8e310a8dee60bf275077a3eedae8e36c yet.

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

I have tested the code and it works well, also with new models. The binaries compiled from llama.cpp are much smaller now (my ggml-cuda.dll is only 45 MB).

Thank you!

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

After further testing, the KernelMemory example does not work anymore. There is a strange crash at MemoryAnswer answer = await memory.AskAsync(question);. Could you please test it on your computer?

Strangely, it runs out of memory. Is there a default parameter change somewhere? For example, if the context size of the model is used instead of the user-defined one, this could happen...

[llama Error]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 560.00 MiB on device 0: cudaMalloc failed: out of memory
[llama Error]: ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 587206656
[llama Error]: llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
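
For reference, a minimal sketch of pinning the context size explicitly on the LLamaSharp side, assuming the standard ModelParams / LLamaWeights API (the model path and sizes here are placeholders, not values from this PR). If a larger context were being picked up from a default somewhere, the compute buffer llama.cpp reserves (the ~560 MiB allocation in the log above) would grow with it:

```csharp
using LLama;
using LLama.Common;

// Placeholder model path. ContextSize is set explicitly so the compute
// buffers llama.cpp reserves are sized for the requested context rather
// than whatever default the integration might pick up.
var parameters = new ModelParams("path/to/model.gguf")
{
    ContextSize = 2048,
    GpuLayerCount = 20,
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
```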

@martindevans
Member Author

martindevans commented May 9, 2025

Fixed the submodule, thanks for the reminder.

> I have tested the code and it works well

Which platforms did you test on? At the moment we've got an issue with the Linux CPU binaries failing for some people on Linux that we're having trouble narrowing down. So if you had success there that'd be an interesting datapoint.

@martindevans
Member Author

martindevans commented May 9, 2025

> After further testing the KernelMemory example does not work anymore

I just tested Kernel Memory: Document Q&A with Qwen3-30B-A3B Q4_K_M and it worked for me, with some GPU load so it seems like it offloaded properly.

There aren't any config changes, just one extra field in the config struct which is always null. If you collapse LLama.Mobile in the diff you can see there actually aren't that many changes in this PR - it just looks big because that added a lot of files!

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

> Fixed the submodule, thanks for the reminder.
>
> > I have tested the code and it works well
>
> Which platforms did you test on? At the moment we've got an issue with the Linux CPU binaries failing for some people on Linux that we're having trouble narrowing down. So if you had success there that'd be an interesting datapoint.

Thank you!

I have found the problem. My modifications are not in your PR yet; they will probably be picked up when you merge.

This is the bug I identified before: the SplitMode bug in llama.cpp. If you offload all layers to the GPU and use Native.GPUSplitMode.None, the code will crash. I showed you the C++ code where this happens before.
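
For anyone following along, a minimal sketch of the parameter combination being described, assuming ModelParams exposes SplitMode alongside GpuLayerCount (the model path is a placeholder; this is illustrative, not code from the PR):

```csharp
using LLama.Common;
using LLama.Native;

var parameters = new ModelParams("path/to/model.gguf")
{
    // Offload all layers to the GPU...
    GpuLayerCount = -1,
    // ...while forcing no split across devices. This is the combination
    // reported above to crash inside llama.cpp.
    SplitMode = GPUSplitMode.None,
};
```

Until the upstream bug is fixed, avoiding that exact combination (fewer offloaded layers, or a different split mode) is presumably the way to sidestep it.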

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

> > After further testing the KernelMemory example does not work anymore
>
> I just tested Kernel Memory: Document Q&A with Qwen3-30B-A3B Q4_K_M and it worked for me, with some GPU load so it seems like it offloaded properly.
>
> There aren't any config changes, just one extra field in the config struct which is always null. If you collapse LLama.Mobile in the diff you can see there actually aren't that many changes in this PR - it just looks big because that added a lot of files!

If you off-load all layers (GpuLayerCount = -1) it will crash. See my remark above.

@martindevans
Member Author

> If you off-load all layers (GpuLayerCount = -1) it will crash. See my remark above.

I just tried it with GpuLayerCount = -1 and it still worked for me. However there was about 20GB of system memory in use as well as all of my VRAM, so maybe it spilled from VRAM into system RAM? I think that's something that can be turned off in the drivers, if you've got it off that'd explain the difference.

I just double checked the context size, since it's not specified in the config. If you trace that through KM it defaults to 2048 if none is specified so that's fine.

As a side note, there is some pretty suspect stuff happening here though!

  • context = weights.CreateContext(parameters); creates a context
  • var executor = new StatelessExecutor(weights, parameters); internally creates a context
  • new LLamaSharpTextEmbeddingGenerator(config, weights) creates a context for embedding
  • new LlamaSharpTextGenerator(weights, context, executor, config.DefaultInferenceParams) accepts both an executor (with a context) and a context (but not necessarily the same context).

None of this has changed, so it's not the cause of your problem. But it does look like it could use a re-work! Would you be interested in working on that in the near future?
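
To make the duplication easier to see, here is a rough, illustrative reconstruction of that wiring, pieced together from the calls listed above (the LLamaSharpConfig constructor, the namespaces and the model path are assumptions, not code copied from the repository). Each comment marks a point where a context comes into existence:

```csharp
using LLama;
using LLama.Common;
using LLamaSharp.KernelMemory;

var modelPath = "path/to/model.gguf";          // placeholder
var config = new LLamaSharpConfig(modelPath);  // assumed constructor shape
var parameters = new ModelParams(modelPath);

using var weights = LLamaWeights.LoadFromFile(parameters);

// 1. An explicit context created up front.
var context = weights.CreateContext(parameters);

// 2. StatelessExecutor manages its own context internally.
var executor = new StatelessExecutor(weights, parameters);

// 3. The embedding generator creates yet another context for embeddings.
var embedder = new LLamaSharpTextEmbeddingGenerator(config, weights);

// 4. The text generator receives both an executor (with its own context)
//    and the context from step 1, which need not be the same one.
var generator = new LlamaSharpTextGenerator(weights, context, executor, config.DefaultInferenceParams);
```

That is at least three live contexts for a single pipeline, which lines up with the memory behaviour discussed in the following comments.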

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

> I just tried it with GpuLayerCount = -1 and it still worked for me. However there was about 20GB of system memory in use as well as all of my VRAM, so maybe it spilled from VRAM into system RAM? I think that's something that can be turned off in the drivers, if you've got it off that'd explain the difference.
>
> I just double checked the context size, since it's not specified in the config. If you trace that through KM it defaults to 2048 if none is specified so that's fine.
>
> As a side note, there is some pretty suspect stuff happening here though!
>
>   • context = weights.CreateContext(parameters); creates a context
>   • var executor = new StatelessExecutor(weights, parameters); internally creates a context
>   • new LLamaSharpTextEmbeddingGenerator(config, weights) creates a context for embedding
>   • new LlamaSharpTextGenerator(weights, context, executor, config.DefaultInferenceParams) accepts both an executor (with a context) and a context (but not necessarily the same context).
>
> None of this has changed, so it's not the cause of your problem. But it does look like it could use a re-work! Would you be interested in working on that in the near future?

Yes, I think you're right. This will be the reason for the sudden peak in GPU memory use. Many contexts are created and maintained everywhere. We either need to destroy them after use or ensure they are reusable. Since it is stateless, it can be leveraged in various situations. I'm particularly curious about the embedder's context, as it might require additional parameters. I'll try to make some time over the weekend to explore a more memory-efficient solution.

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

There is one way to do it: move public int CountTokens(string text) and public IReadOnlyList<string> GetTokens(string text) to the StatelessExecutor and access them from there. If that is OK with you, I will try to implement it over the weekend.
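
Not the actual proposal, just a minimal sketch of the idea under discussion, written as standalone helpers rather than instance methods (the class name and the explicit weights/parameters arguments are made up for illustration, and it assumes LLamaContext.Tokenize and StreamingTokenDecoder behave as in current LLamaSharp). The point is that a context only exists for the duration of each call:

```csharp
using System.Collections.Generic;
using System.Linq;
using LLama;
using LLama.Abstractions;

public static class TokenizationSketch
{
    // Count tokens by creating a throwaway context, tokenizing, and disposing it.
    public static int CountTokens(LLamaWeights weights, IContextParams parameters, string text)
    {
        using var context = weights.CreateContext(parameters);
        return context.Tokenize(text).Length;
    }

    // Return the token pieces as strings, again with a short-lived context.
    public static IReadOnlyList<string> GetTokens(LLamaWeights weights, IContextParams parameters, string text)
    {
        using var context = weights.CreateContext(parameters);
        return context.Tokenize(text)
                      .Select(token =>
                      {
                          // One decoder per token so each token maps to its own string piece.
                          var decoder = new StreamingTokenDecoder(context);
                          decoder.Add(token);
                          return decoder.Read();
                      })
                      .ToList();
    }
}
```

Whether these belong on the StatelessExecutor, on the LLamaWeights, or somewhere else is exactly the question discussed below.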

@martindevans
Member Author

martindevans commented May 9, 2025

They both seem like things that shouldn't need a context; can they be moved to the LLamaWeights? (By the way, let's move this discussion to a separate issue, since it's not related to this PR.)

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

> They both seem like things that shouldn't need a context; can they be moved to the LLamaWeights? (By the way, let's move this discussion to a separate issue, since it's not related to this PR.)

I think it's better to associate a context with the executor rather than the weights. I will try to propose something, and then we can continue the discussion there.

@zsogitbe
Contributor

zsogitbe commented May 9, 2025

It was a good idea, Martin! I have quickly put together a modification that streamlines contexts everywhere, and it decreases GPU memory use by 30%! In my solution, both LLamaEmbedder and StatelessExecutor get their own CountTokens and GetTokens functions. When these are called, we create the context on the fly, so there is only one context in memory at any time.
I will clean it up and present the PR over the weekend.

@martindevans
Member Author

Triggered a new build run, hopefully resolving the Linux CPU issues: https://github.com/martindevans/LLamaSharp/actions/runs/14956801082

@martindevans martindevans merged commit 1668e76 into SciSharp:master May 11, 2025
6 checks passed
@martindevans martindevans deleted the update_apr_2025 branch May 11, 2025 18:57