Conversation

@hero78119 (Collaborator) commented Jun 4, 2025

Change

This PR syncs with ceno master and rolls back part of the changes to ensure the ceno mainflow benchmark is not affected.

Benchmark against master:

| Benchmark | Median Time (s) | Median Change (%) |
|-----------|-----------------|-------------------|
| fibonacci_max_steps_1048576 | 2.1283 | +2.0905% (change within noise) |
| fibonacci_max_steps_2097152 | 3.6231 | +0.9229% (no change in performance) |
| fibonacci_max_steps_4194304 | 6.4747 | -0.1104% (no change in performance) |

hero78119 and others added 30 commits May 19, 2025 14:43
Benchmarks show a significant amount of time is spent in glibc free (drop)
when objects go out of scope.

Following openvm, use
[jemalloc](https://github.com/openvm-org/openvm/blob/c771a213f5e7f0732e0ddbafb273e15d99c5049d/crates/vm/Cargo.toml#L56)
as the global allocator,
and set the jemalloc parameters following
https://github.com/openvm-org/openvm/blob/c771a213f5e7f0732e0ddbafb273e15d99c5049d/.github/workflows/benchmark-call.yml#L218
> I did not enable jemalloc's "background_thread: true", since a background
thread might take scheduling time away from other work, which could affect a
CPU-intensive program.

### change scope
- enable jemalloc by default when compiling ceno_cli
- support `cargo make cli` to install ceno_cli
- introduce a "jemalloc" feature
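Wiring an optional jemalloc global allocator behind a feature flag typically looks like the following minimal sketch. This assumes the `tikv-jemallocator` crate (the one openvm uses); the feature name matches this PR's "jemalloc" feature, but the exact crate version and wiring here are illustrative, not taken from this repository:

```rust
// Sketch: opt-in jemalloc as the global allocator behind a "jemalloc" feature.
// Assumes Cargo.toml declares (versions illustrative):
//   [dependencies]
//   tikv-jemallocator = { version = "0.6", optional = true }
//   [features]
//   jemalloc = ["dep:tikv-jemallocator"]

#[cfg(feature = "jemalloc")]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
```

With this in place, `cargo build --features jemalloc` swaps the allocator, while default builds keep the system allocator.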

### benchmark

Benchmarked on a 32-core AMD EPYC with the command
`JEMALLOC_SYS_WITH_MALLOC_CONF="retain:true,metadata_thp:always,thp:always,dirty_decay_ms:-1,muzzy_decay_ms:-1,abort_conf:true"
cargo bench --bench fibonacci --features jemalloc --package ceno_zkvm --
--baseline opt-baseline`
 

| Benchmark | Average Time | Improvement | Throughput (instructions/sec) |
|-----------------|--------------|-------------|-------------------------------|
| fibonacci 2^20 | 2.0020 s | -14.74% | 523.76k |
| fibonacci 2^21 | 3.5903 s | -18.89% | 584.34k |
| fibonacci 2^22 | 6.6531 s | -24.69% | 630.28k |

---------

Co-authored-by: Zhang Zhuo <[email protected]>
## Motivation

We want to unify the prover's workflow for opcode circuits and table
circuits, as they follow the same kind of workflow, i.e.

1. infer the tower witness;
2. run the tower prover;
3. run the main sumcheck (optional for table circuits).

Before this PR, an **opcode** circuit packed multiple read/write/logup
records into a **single** tower, while a **table** circuit packs
read/write/logup records into one dedicated tower per read/write/logup
expression. We found that the table circuit's way of building the tower
tree is better than the opcode circuit's.
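As a toy illustration of the shared structure, a product tower is built by repeatedly multiplying adjacent pairs until a single root remains. This is a minimal sketch in plain Python over integers (real towers work over field elements, and the function name is hypothetical):

```python
def tower_layers(leaves):
    """Build a product tower: each layer halves the previous one by
    multiplying adjacent pairs, until a single root product remains."""
    layers = [leaves]
    while len(layers[-1]) > 1:
        prev = layers[-1]
        layers.append([prev[i] * prev[i + 1] for i in range(0, len(prev), 2)])
    return layers

# Toy read/write record values; in the "one tower per expression" style,
# each record expression gets its own such tower.
records = [3, 5, 2, 7]
print(tower_layers(records))  # [[3, 5, 2, 7], [15, 14], [210]]
```

The difference discussed above is only in how records are grouped into leaves: one tower per record expression (table style) versus several record sets packed into a single tower (old opcode style).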

## Performance

| benchmark | proof size (MB) | proving time |
|------------|-----------------|-------------|
| fibonacci 2^20 | 1.14 -> 1.2 (5%) | -0.8% |
| fibonacci 2^21 | 1.22 -> 1.28 (5%) |  -5% |
| fibonacci 2^22 | 1.3 -> 1.37 (5%) | -10%|
    
**New issue**: The proof size increase is due to having more `ProdSpec`
and `LogupSpec`, which implies more points and evaluations in `struct
TowerProof`. Note that after we abandon the old "interleaving" method,
the number of rounds per product spec and logup spec is the same, so we
can remove this new overhead in a follow-up PR.

## Impact 
Blocker for scroll-tech#923.

---------

Co-authored-by: sm.wu <[email protected]>
To serve various purposes, e.g. benchmarking
…oll-tech#954)

### Change Scope
- [x] example run failed in e2e
https://github.com/scroll-tech/ceno/blob/ef93198c83e3b4fcd7f9949ebbc07bc9c93e4de9/examples/examples/hashing.rs#L16
In e2e we only supported hints as u32 items written one at a time, but
some examples require the hint as a whole vector, so those guest
programs always failed because the hint could not be served properly.
- [x] move most verbose messages from `info` to `trace/debug` so the
default e2e output is cleaner
- [x] more comments and a polished readme

---------

Co-authored-by: Akase Haruka <[email protected]>
…-tech#956)

Extracted from scroll-tech#952.

We observed a bottleneck in the previous interpolation, which accounted
for most of the time due to `vector.extend` operations and many
allocations. This PR rewrites univariate extrapolation:
1. since the points to be interpolated are a fixed set, we can
pre-compute everything that requires a field inverse;
2. values are updated in place to avoid allocation.
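The pre-computation idea can be sketched with barycentric interpolation. In this illustration (plain Python with `Fraction` standing in for field elements; not the actual Ceno code), the barycentric weights are the only part that needs inversions, depend only on the fixed node set, and are computed once; evaluation is then a single accumulation pass with no intermediate vectors:

```python
from fractions import Fraction

def barycentric_weights(xs):
    # Precompute w_i = 1 / prod_{j != i} (x_i - x_j).
    # The inversions happen here, once, since the node set is fixed.
    ws = []
    for i, xi in enumerate(xs):
        prod = Fraction(1)
        for j, xj in enumerate(xs):
            if j != i:
                prod *= xi - xj
        ws.append(1 / prod)
    return ws

def extrapolate(xs, ws, ys, x):
    # Evaluate the interpolating polynomial at x in one pass:
    # p(x) = sum(w_i * y_i / (x - x_i)) / sum(w_i / (x - x_i))
    num = Fraction(0)
    den = Fraction(0)
    for xi, wi, yi in zip(xs, ws, ys):
        t = wi / (x - xi)
        num += t * yi
        den += t
    return num / den

xs = [Fraction(i) for i in range(4)]     # fixed node set 0..3
ws = barycentric_weights(xs)             # precomputed once, reused forever
ys = [Fraction(i ** 3) for i in range(4)]  # samples of p(x) = x^3
print(extrapolate(xs, ws, ys, Fraction(5)))  # -> 125
```

Since `ws` is reused across all sumcheck rounds, the per-evaluation cost is just multiplications and additions, which matches the motivation above.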

### benchmark
In Ceno's opcode main sumcheck we batch the different degree-> 1 polynomials
into one batch, so this function is used there.
It shows a slight improvement (~3%) on Fibonacci 2^24 e2e.

| Benchmark | Median Time (s) | Median Change (%) |
|----------------------------------|------------------|--------------------|
| fibonacci_max_steps_1048576 | 2.3978 | +0.9805% (no significant change) |
| fibonacci_max_steps_2097152 | 4.2579 | +1.7587% (change within noise) |
| fibonacci_max_steps_4194304 | 7.7561 | -3.5338% |
Built on top of scroll-tech#956 to address review comments:
clean up the point from the sumcheck proof, as the verifier should derive it itself;
refactor univariate interpolation into barycentric and unrolled versions.

Cross-reference: issue
scroll-tech/ceno-recursion-verifier#6
@hero78119 hero78119 requested a review from spherel June 4, 2025 12:36
@spherel (Member) left a comment:

LGTM!

@hero78119 hero78119 merged commit 7066fa8 into scroll-tech:tianyi/refactor-prover Jun 6, 2025
4 checks passed