Consider using the standard benchmark harness instead of criterion

We currently use `criterion` for benchmarking. The [main limitation](https://bheisler.github.io/criterion.rs/book/user_guide/known_limitations.html) of `criterion` is that only a crate's public API can be benchmarked.

In contrast, the standard benchmark harness (provided by the unstable [`test` crate](https://doc.rust-lang.org/test/index.html)) lets you mark any function with `#[bench]`, just as you would with `#[test]`, so internal functions can be benchmarked too. I believe this is valuable enough to outweigh the cons of using `test`. While working on #247, we wanted to know the timing of specific internal functions (e.g. [here](https://github.com/facebook/winterfell/pull/247#discussion_r1531852432)). I ended up manually inserting timing statements in the code, running the relevant test, and inspecting the output. IMO, this shows how limiting `criterion` really is.
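
For illustration, the manual timing workaround looks roughly like the sketch below. This is not the actual winterfell code: `internal_function` and `time_it` are hypothetical names made up for the example, and a single `Instant`-based measurement like this has no warm-up or statistics, which is part of why it is a poor substitute for a real harness.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for a private function we want to time.
fn internal_function(n: u64) -> u64 {
    (0..n).fold(0u64, |acc, x| acc.wrapping_add(x.wrapping_mul(x)))
}

// Hypothetical helper: runs `f` once and returns its result together
// with the elapsed wall-clock time.
fn time_it<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

fn main() {
    // `black_box` keeps the compiler from constant-folding the input away.
    let (result, elapsed) = time_it(|| internal_function(std::hint::black_box(1_000_000)));
    println!("internal_function -> {result} in {elapsed:?}");
}
```

Every function of interest needs this kind of temporary instrumentation, and the numbers only show up in the test output, so it has to be added and removed by hand each time.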

Note that `criterion` has a feature that allows benchmarking internal functions, similar to `#[bench]`, which [depends on the `custom_test_frameworks` nightly feature](https://bheisler.github.io/criterion.rs/book/user_guide/custom_test_framework.html). However, the [tracking issue for `custom_test_frameworks` has been closed](https://github.com/rust-lang/rust/issues/50297) due to inactivity, so I personally would stay away from it unless that changes.

### Pros of using `test` over `criterion`

- Ability to benchmark internal functions

### Cons of using `test` over `criterion`

- Less sophisticated reporting
  - e.g. `criterion` gives you the performance increase over the last run of the benchmark
- No generated HTML reports
- Probably less reliable
  - e.g. `criterion` [pre-populates the cache](https://bheisler.github.io/criterion.rs/book/analysis.html) before running a benchmark (not sure whether or not `test` does that too, but at least it's not advertised in the benchmark output)
- `test` is still unstable

### Summary

Although there are more cons than pros, I believe the ability to benchmark internal functions far outweighs any of the cons (as explained earlier). We can deal with the dependency on nightly by adding a `"benchmarking"` feature that gates the nightly-only code:

```rust
#![cfg_attr(feature = "benchmarking", feature(test))]

#[cfg(feature = "benchmarking")]
extern crate test;

fn internal_function() { /* ... */ }

#[cfg(test)]
#[cfg(feature = "benchmarking")]
mod bench {
    use super::*;
    use test::Bencher;

    #[bench]
    fn bench_internal_function(b: &mut Bencher) {
        b.iter(|| internal_function());
    }
}
```

```sh
$ cargo +nightly bench --features=benchmarking bench_internal_function
```
