Proposed structure for integration of benchmarks with pytest-benchmark #128
base: dev
Conversation
I think the basic structure of this looks good to me! I like being able to integrate this into pytest.
Force-pushed from 5dbcf66 to f3003e8.
I also added an abstract base class for benchmark problems, and implemented this for our very favourite Rosenbrock function.
Maybe this would work if it was defined in a benchmarks/conftest.py (outside of tests)?
CONTRIBUTING.md (outdated)

benchmark tests with
```bash
pytest --extensive
```
Maybe `--benchmark`?
I think there is too much confusion with the built-in flags then? We already have `--benchmark-save`, which is built-in and specifies where to save the results in benchmark format, and we also use the built-in benchmarking fixture with `@pytest.mark.benchmark`.
I think that means that our own decorator and flags should be different! Maybe `optx_benchmarks`?
Or `@cutest` and `--cutest`? This will work fine until we integrate other benchmarking suites.
You're anticipating a need to separate out solver benchmarks (cutest) vs compile-time benchmarks?
Anyway, given that this is now in a separate directory, I'm realising that I think these will be ignored by pytest by default, and only run if you're in the directory or if you run `pytest benchmarks` from the top level, so perhaps we don't need a flag here at all?
No, they're not ignored by default, even if they are in a separate directory. They would be ignored if the names did not start with `test_`, but then we'd have the reverse problem.
I think runtime and compile-time benchmarks can be run with the same problem/solver setup, but the `benchmark` fixture stores all the results and I think things might get muddled there? But happy to try.
I took a quick look at the structure of the `.json` output, and it seems wiser to have two different test functions for compile and runtime. Keeps things neater.
Okay! So in that case maybe `--compile-benchmarks` and `--runtime-benchmarks`? Or compose with existing pytest functionality, `pytest --benchmark -k compile` vs `pytest --benchmark -k runtime`?
Mostly I'd just like to have the flag make it clear that this isn't about running slow tests or something.
We now use these built-in flags and skip by default. Regular `-k` flags also work.
I moved the benchmarking script into the regular tests folder again, since it is now quite small. This makes the setup less complex; in particular, we are no longer dealing with two different conftest.py files. I had noticed that running with `pytest benchmarks/` was actually needed to pull in the custom flags (e.g. `--max-dimension`). Now there is a single flag to run the benchmarks, and custom ones for subsets.
I'm now going with:
> And we might want to tweak the configuration - to check how well we do in 32-bit, for instance - without affecting the main test suite.
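For concreteness, a minimal sketch of what a conftest.py along these lines could look like - not the PR's actual code; the flag name comes from this thread, while the default value and the skip logic are assumptions:

```python
import pytest


def pytest_addoption(parser):
    # Custom flag for running a subset of the benchmark problems.
    parser.addoption(
        "--max-dimension",
        action="store",
        type=int,
        default=None,
        help="Only run benchmark problems up to this dimension.",
    )


@pytest.fixture
def max_dimension(request):
    return request.config.getoption("--max-dimension")


def pytest_collection_modifyitems(config, items):
    # Skip benchmarks by default; run them only when pytest-benchmark's
    # built-in --benchmark-only flag is passed.
    if config.getoption("--benchmark-only", default=False):
        return
    skip_benchmark = pytest.mark.skip(reason="pass --benchmark-only to run benchmarks")
    for item in items:
        if "benchmark" in item.keywords:
            item.add_marker(skip_benchmark)
```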
benchmarks/test_benchmarks.py (outdated)
```python
def block_tree_until_ready(x):
    dynamic, static = eqx.partition(x, eqx.is_inexact_array)
    dynamic = jtu.tree_map(lambda x: x.block_until_ready(), dynamic)
    return eqx.combine(dynamic, static)
```
FWIW the partition/combine operations will add a small but measurable amount of overhead, as this happens outside JIT.
I'd be tempted to suggest instead that outputs should be required to always only be JAX arrays, and to then call `jax.block_until_ready`.
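A rough sketch of that suggestion (`solve` and the chosen output fields are placeholders, not the code in this PR):

```python
import jax


def wrapped(y0):
    solution = solve(y0)  # placeholder for the jitted solver call
    # With only JAX arrays in the output, a single pytree-aware call suffices
    # and avoids the partition/combine round-trip outside of JIT.
    return jax.block_until_ready((solution.value, solution.state.f_info.f))
```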
We could return the value of the objective function at the solution and use that to determine if the problem has been solved to some degree of accuracy.
This is all that is available for 90+ % of CUTEST problems anyway.
More detailed metrics such as the number of iterations could be analysed separately, but we'd lose access to them here.
Hmm I think these are separate concerns? I'm not considering convergence here, just a wallclock overhead.
Ah, you mean only returning the dynamic part of the solution? I understood this as doing something like `return solution.state.f_info.f` or something!
So we need at least `solution.result`, which does not support `block_until_ready`. To accommodate this, I wrote a minimal wrapper that calls the compiled function, calls `block_until_ready` on the objective value, and returns some extra stuff as well.
Since the function does not actually do anything else, the overhead should be minimal. I don't know how it compares to a tree-mapped `block_until_ready`, though. If we really want to trim the microseconds then we could use `success = solution.result == optx.RESULTS.successful`, which is an array.
I suggest we do that when benchmarking for publication, but otherwise it seems overkill - the purpose of the benchmark set is to catch regressions and test the performance of new features, and in this case I think it is nice to have a .json file from which we can read informative results, not just minimal timing results.
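Reconstructed from the discussion, the wrapper could look roughly like this (the `num_steps` lookup is an assumption):

```python
def wrapped(y0):
    solution = solve(y0)  # the pre-compiled solve
    # Block only on the objective value; the other fields are returned as-is
    # so they end up in the saved .json alongside the timings.
    objective_value = solution.state.f_info.f.block_until_ready()
    num_steps = solution.stats["num_steps"]  # assumption: taken from the stats dict
    return objective_value, solution.result, num_steps
```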
Hi @johannahaffner,
They are! Thank you :)
I was completely unaware of this! But that makes sense. Depending on the size this benchmark set grows to (fingers crossed), we might not want to integrate it into the CI, but rather run it locally and add some scripts to help analyse the results. So this is another point in favour of maybe not doing this as part of our CI.
I'm introducing a benchmark-time dependency on sif2jax here, which is a current WIP of mine aiming to (have LLMs) port the complete CUTEst collection to clean, human-readable and functionally pure JAX implementations. I'll mark this as ready for review again when we have done further verification of the benchmark problems; currently about 247 benchmark problems pass tests against the Fortran implementation.
This does seem to have improved? I've left this in as a commented-out suggestion, but I'm not quite sure what specific behaviour it is meant to prevent if enabled, @pfackeldey.
So I tested this a bit more rigorously and it seems like most things are working on our end! What we have right now:
I updated the README to reflect this. Things that require a bit of thought still:
I'll mark this as ready to review for now :)
Typically this is handled with a work-precision diagram.
Interesting idea! This strikes me as the kind of thing we should log somewhere, just not in version control.
I think that sounds like a reasonable ask! Realistically if someone is going to the effort of contributing a solver then they're already in pretty deep, and probably have opinions about the existing options available :D
sg!
I really like the look of this.
benchmarks/README.md (outdated)

For reproducibility, make sure that all your changes have been committed and your working tree is clean before running the benchmarks. `pytest-benchmark` will otherwise mark your benchmarking results as `dirty`. Saved results include the commit, branch, version, and an exact timestamp by default.

Note that benchmarks are run with `throw=False` enabled, since otherwise no result is written to the json file, but we do want to know if we failed to solve a problem.
If you wanted, we could/should set `EQX_ON_ERROR=nan` here too. This is essentially equivalent to the above and is also a performance improvement.
(This has to happen before Equinox is imported.)
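In a dedicated benchmarks conftest, a minimal sketch of this would be (the `noqa` is needed precisely because the assignment must precede the import):

```python
import os

# Must be set before Equinox is imported anywhere in the session.
os.environ["EQX_ON_ERROR"] = "nan"

import equinox as eqx  # noqa: E402
```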
Note that this is essential for vjp, as `throw=False` does nothing there, I believe.
Argh, I see that I might have boxed myself a little into a corner here - I could simplify the workflow and put the benchmarks back in the tests folder. This way we only get one conftest and one setup to deal with, and there are no stupid footguns such as custom flags only being available if we're also specifying the folder name: e.g. `pytest benchmarks/ --benchmark-only --max-dimension=10` would work, but `pytest --benchmark-only --max-dimension=10` would not, which I had not previously realised.
But this also means that the environment variable applies to the whole test session, right? And while we want it for benchmarking, we don't want it for testing.
Can I force a fresh environment for a specific testing module? Alternatively, it might make sense to modify `--benchmark-only` so that if it is present, we do set the environment variable, and not otherwise.
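A sketch of that alternative, assuming Equinox is only imported by the test modules themselves (which are imported after `pytest_configure` runs):

```python
import os


def pytest_configure(config):
    # Only set the variable when benchmarks are actually requested,
    # leaving the regular test suite untouched.
    if config.getoption("--benchmark-only", default=False):
        os.environ["EQX_ON_ERROR"] = "nan"
```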
Yes, I would use a new environment for benchmarks (I find this helpful in my workflow, as tests run faster than benchmarks and I can fail fast if something is wrong). Then there is no need to fiddle around with environment variables in `tests/` or `benchmarks/`, and they are just set in the workflow actions (or by the user). This is also helpful as it allows the user to run with different settings to compare impact.
In GitHub workflows? That makes sense! Locally we default to no benchmark runs, and I just discovered that I can monkeypatch the environment variables for specific tests only - would that work, @patrick-kidger?
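A sketch of that monkeypatching (with the caveat that it only takes effect if Equinox reads the variable lazily rather than once at import time - which is exactly the open question):

```python
import pytest


@pytest.fixture
def eqx_nan_on_error(monkeypatch):
    # Scoped to the tests that request this fixture only.
    monkeypatch.setenv("EQX_ON_ERROR", "nan")
```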
```python
def wrapped(y0):
    solution = solve(y0)
    objective_value = solution.state.f_info.f.block_until_ready()
```
Maybe just `jax.block_until_ready(solution)`?
This would add the overhead of a `tree_map`, and for that reason we had decided not to do that. Background here :)
Agreed
benchmarks/test_benchmarks.py (outdated)
```python
    return objective_value, solution.result, num_steps

# Benchmark the runtime of the compiled function
values = benchmark.pedantic(wrapped, args=(problem.y0(),), rounds=5, iterations=1)
```
How does the benchmarking work here, btw? As compared to e.g. `min(timeit.repeat(..., number=..., repeat=...))` or the IPython `%%timeit`? (E.g. the former uses minimisation to handle one-sided noise, whilst the latter automatically figures out how many runs to perform.)
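For reference, the stdlib idiom referred to here, with `solve` and `y0` as placeholders:

```python
import timeit

# Take the minimum over several repeats to suppress one-sided timing noise.
best = min(timeit.repeat(lambda: solve(y0), number=1, repeat=5))
```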
So `rounds` and `iterations` correspond to `repeat` and `number`. I switched to @jpbrodrick89's suggestion of not using the pedantic option.
When saving the result, more than just the minimum gets saved - a bunch of statistics get written to an output `.json`, and these can be selected among during analysis. There is also an option to save full results if desired.
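The non-pedantic call then looks roughly like this (placeholder names; pytest-benchmark chooses rounds and iterations itself):

```python
def test_runtime(benchmark, problem):
    wrapped = make_wrapped_solve(problem)  # hypothetical: jitted solve + block_until_ready
    y0 = problem.y0()
    wrapped(y0)  # warm-up call keeps compilation out of the timed rounds
    benchmark(wrapped, y0)
```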
Typically 5 rounds is very small compared to the defaults and will be very susceptible to noise, but it depends on how expensive the calculation is. I typically trust the standard non-pedantic option.
Hi guys, I will chime in with a tiny bit more detail this afternoon, as I've been trying a lot of this on one of my repos. Unfortunately, CodSpeed's usual instrumentation provides no value for JAX functions, as everything counts as a system call, which is not trackable by Valgrind. Therefore, to use their tooling you have to use their wall-time runner, where costs can mount up pretty quickly. I had 360 micro-benchmarks and used up the free 2 hours in just four commits. Also, even on their bare-metal wall-time runner there is still some noise, probably due to the time spent actually retrieving the function from the JAX cache rather than running the actual flops. Trying to work out if I can make this more usable next week. At least they've updated their docs in the last couple of weeks quite substantially.
A call would be great! I was just getting started experimenting with it (apologies for spamming your inbox with workflow failures, @patrick-kidger). In theory it seems like a nice setup!
Addressed most comments, rest tonight :) Thank you both for your input!
benchmarks/conftest.py (outdated)
```python
jax.config.update("jax_enable_x64", True)
jax.config.update("jax_numpy_rank_promotion", "raise")
jax.config.update("jax_numpy_dtype_promotion", "standard")
# Remark Peter: limit JAX to single-thread with taskset -c 0 pytest --cutest
```
I'm not sure what this is meant to prevent/accomplish, if we added it.
.github/workflows/codspeed.yml (outdated)
```yaml
# - name: Run benchmarks
#   uses: CodSpeedHQ/action@v3
#   with:
```
This is where you can add the env variables:
```yaml
# with:
# env:
#   EQX_ON_ERROR: "nan"
```
Summary of changes
- Streamlined the workflow
- Addressed various other points
- No codspeed

So for now we have benchmarks that can be used during development to get feedback on how our solvers are doing, and whose output can inform discussions. This is a step up from the microbenchmarks we otherwise default to, and we can still move to a workflow-based option in the future, if the codspeed issues can be resolved or if we find a better alternative.
Oh and I have a dirty workaround for now:
FYI, I assume this only makes sense if you are interested in having a history of benchmarks over multiple CI runs, and given the premise that GitHub Actions runners may have different numbers of cores. (I'm not sure how homogeneous GitHub Actions runners are?) (This only works on Linux, and if the cores differ in clock frequency between different workers, that also doesn't help, of course.)
Thank you for the explanation! Does this mean that you can get around some of the variability in CI runs? Apparently how fast things are in there depends on a lot of factors. That is actually what CodSpeed advertises - they claim that they get the noise down from around 30% to 1%, and they do that through some clever isolation of the bit to be benchmarked, which is then run just once. But this does, apparently, not really mesh well with JAX. FWIW, I've come to the conclusion that the pragmatic way forward is to just use the benchmarks locally during development.
Yes, good chat @johannahaffner! To summarise, basically CodSpeed "instrumentation" is a dead end and CodSpeed "wall time" can easily eat up minutes if you're not careful. I'm playing with it in our repo and will see if I can share any further feedback somehow when I've tried a bit more customisation. I'm not sure whether setting the environment variable using monkeypatch will work, it all depends on whether
With regards to threads and taskset, it might be interesting to do separate runs trying to disable threading or not, but in practice users will let JAX use the cores it can, so letting it use more than one (maybe with an upper limit) might be more realistic. Standard runners always have 4 cores, so this would seem a reasonable upper bound in case you somehow get thrown onto a better one. Clock speed and traffic variation is much more of a problem. Nice work on all this!
Force-pushed from 983d24d to c88ab03.
…luation of Optimistix' solvers.
Force-pushed from ebc5896 to 640b26a.
I think this is ready! Thanks to all of you for your help, @patrick-kidger @jpbrodrick89 and @pfackeldey. Small updates made since last synopsis:
Exciting times @johannahaffner! 🚀 It looks like you set `EQX_ON_ERROR` as a global Python variable rather than an OS environment variable; see here for different options for setting env variables in pytest: https://stackoverflow.com/questions/36141024/how-to-pass-environment-variables-to-pytest The way I do it is just to set it on the command line with
Thank you! I'm quite unfamiliar with this one. I switched to using
Awesome! Not at the desk right now to check, but I think you should be able to get rid of the `noqa: E402`s now, as `os` is meant to be ignored (https://docs.astral.sh/ruff/rules/module-import-not-at-top-of-file/). I'm not sure about the I001 - it would be nice if this had a similar exception, but it's not documented.
Indeed! And
I tried some benchmarking on our brand new L-BFGS and the current setup, using
This PR is a proposal for what integration of benchmarks could look like.
I suggest:
I have not included any compile-time benchmarking yet, but I think this thing here can happily live next to differently flavoured pytest-codspeed benchmarks, such as discussed in patrick-kidger/equinox#1001.
I'm parsing a collection of CUTEST problems for now, and once you get rid of all the FORTRAN stuff and weird formatting, these don't take up that much space and can either live here, or perhaps elsewhere if that is more practical.