Skip to content

Conversation

sim642
Copy link
Contributor

@sim642 sim642 commented Aug 13, 2021

After switching a large project from manual equal and compare implementations to ones derived by ppx_deriving (goblint/analyzer#227), we noticed significant performance decreases in in those functions (goblint/analyzer#265).

Benchmarking the generated equal functions directly (https://github.com/goblint/analyzer/blob/a21f33511183074c693c15269990210b283ae045/bench/deriving/benchEq.ml, executable independently from our project) showed up to 3 times slowdown. For example, when the old manual implementation

let equal (x1, y1) (x2, y2) = Int.equal x1 x2 && String.equal y1 y2

was replaced with one derived for Int.t * String.t.
The modules Int and String are just chosen as example, we often use functors where the modules come abstractly from arguments, so using primitive types isn't an option for deriving.

Inefficient derived code

Currently, the following implementation is derived:

let equal =
  let __1 () = String.equal
  and __0 () = Int.equal in
  ((let open! ((Ppx_deriving_runtime)[@ocaml.warning "-A"]) in
    fun (lhs0, lhs1) ->
      fun (rhs0, rhs1) ->
        ((fun x -> (__0 ()) x) lhs0 rhs0) &&
          ((fun x -> (__1 ()) x) lhs1 rhs1))
  [@ocaml.warning "-A"])

The benchmarks confirm that this literal code performs the same as the derived one.

Optimizations

This PR suggests two optimizations for forwarding the equal calls.

Avoid unnecessary eta-expansion

PR #55 introduced eta-expansion to forwarded calls. This was necessary since derived implementations for mutual recursion in the generated let rec must be statically constructive (https://ocaml.org/manual/letrecvalues.html), so expanding a forwarding call into a fun works around the restriction.

The problem in this case is that the eta-expansion is added to all forwarding calls, including those deeper in the resulting equal implementation, where the restriction doesn't apply any more because the outermost expression is a fun already. AFAIK, eta-expansion is only needed at the top level if it's not a fun already (most cases of expr_of_typ are anyway). Therefore this PR implements exactly that. The linked benchmarks show a 50%…300% speedup from this change.

Avoid unnecessary closures in quoting

Issue #57 and commit c3bee7b introduced quoting to forwarded calls. The quoting mechanism introduces functions with () argument. There is no explanation why, but I'm guessing it's to be lazy: in case the quoted expression involves some computation, then don't perform it immediately at the beginning (it might not be necessary due to short-circuiting, etc) but perform it each time it's actually used, just like if the quoted expression where independently at each use.

The problem in this case is that forward calls being quoted are just identifier expressions, which AFAIK cannot be performing any computation on their own. Therefore this PR implements a special case of quoting just identifiers to be without the () argument. The linked benchmarks show a 200%…400% speedup from this change on the pair examples. Just forwarding to a single function doesn't seem to gain any speedup, nor slowdown.

Efficient derived code

As a result of these changes, the new derived code for the example above is:

let equal =
  let __1 = String.equal
  and __0 = Int.equal in
  ((let open! ((Ppx_deriving_runtime)[@ocaml.warning "-A"]) in
      fun (lhs0, lhs1) ->
        fun (rhs0, rhs1) -> (__0 lhs0 rhs0) && (__1 lhs1 rhs1))
    [@ocaml.warning "-A"])

And overall the derived implementations seem to be at least 3 times faster, bringing them mostly on par with the manual implementations.

@gasche
Copy link
Contributor

gasche commented Aug 13, 2021

Thanks for the very clear PR that looks high-quality (I haven't looked at the code yet).

A quick comment:

The quoting mechanism introduces functions with () argument. There is no explanation why, but I'm guessing it's to be lazy: in case the quoted expression involves some computation, then don't perform it immediately at the beginning (it might not be necessary due to short-circuiting, etc) but perform it each time it's actually used, just like if the quoted expression where independently at each use.

In general, suppose we are transforming an expression C[e] (a subexpression e inside a wider context C[...]), where e is some source expression that we want to transform into a desugared/transformed piece of code , where e' should be evaluated in the current scope (for example, it refers to some identifier x that we know scopes over C[e]. Transforming this into C[e'] would be incorrect (un-hygienic), because the context C[...] written by the user could contain binders that shadow this x with a different definition; for example C[...] could be fun x -> ... or let open Foo in ..., where Foo defines an x. So instead one may want to translate this term into let __private_identifier = e' in C[__private_identifier], but this is incorrect because it changes the order of evaluation (or the number of types e' is evaluated, etc.). On the other hand, the translation let __private_identifier () = e' in C[__private_identifier ()] is always correct, scope-wise and evaluation-wise. Of course, as you point out it introduces constant-factor inefficiencies in general, and is not necessary when is known to be a constant or an identifier (its evaluation order does not matter, and evaluating it when it is unnecessary does not degrade performances).

(Other translations are possible, such as let c x = C[x] in c e', or let x = e' and c x = C[x] in c x, which do not require generating "private" identifiers; but we digress, and those introduce a function call that is harder to simplify away with local reasoning.)

@gasche
Copy link
Contributor

gasche commented Aug 13, 2021

I have now read the code and I believe it is correct, thanks again!

One minor remark: you left some comments (* eta-expanded if outermost *) in two places in eq and ord each. I'm not sure those leftover comments (in one case a trace of a previous workaround, in the other a bug opportunity that you noticed) really help code understanding. Putting myself in the shoes of a newcomer, it's not clear what those comments mean, and they seem redundant with the comment that exists at the place where eta-expansion does actually take place. I would remove those extraneous comments, and enrich the comment on the actual eta-expansion site with a reference to the present PR that contains a detailed description for more context.

@sim642
Copy link
Contributor Author

sim642 commented Aug 13, 2021

Thanks for the feedback! I updated both the eta-expansion and the quoter comments to provide some explanation for the special optimized case.

in the other a bug opportunity that you noticed

Indeed, when trying to understand what this eta-expansion and quoting is about, I realized that eta-expansion on the outermost level was missing if a non-fun expression was explicitly provided via [@equal …] or [@compare …]. It took me a while to engineer the added tests because it only matters in the extremely obscure case when such expression refers to another derived one in the same mutually recursive definition (otherwise it's statically constructive by the simplest case) and that requires explicitly referencing the other to-be derived function.

@sim642
Copy link
Contributor Author

sim642 commented Aug 13, 2021

A couple of additional remarks:

  1. Pexp_ident probably isn't the only case where the quote optimization can be done. For example, I think Pexp_constant and maybe some other combinations could also allow it. But I didn't want to attempt to implement a function to determine the purity of an expression by syntax and Pexp_ident covers the most likely use case of quoting.
  2. This repository doesn't contain any benchmark code but if there's interest, I could also include that to continue living in this repository, since as it turns out small optimizations to derived code can make a big impact.

@gasche
Copy link
Contributor

gasche commented Aug 13, 2021

This repository doesn't contain any benchmark code but if there's interest, I could also include that to continue living in this repository, since as it turns out small optimizations to derived code can make a big impact.

Yes, but how would you robustly check for performance regression, in a way that does not depend on the system on which the tests are run? One way to do this would be to count bytecode instructions, but I don't know of an easy, practical way to track this. (You can run a bytecode binary with the instrumented runtime, ocamlrund -t foo.bc | wc -l, but it's a bit cumbersome to setup.)

@gasche
Copy link
Contributor

gasche commented Aug 13, 2021

Pexp_ident probably isn't the only case where the quote optimization can be done. For example, I think Pexp_constant and maybe some other combinations could also allow it. But I didn't want to attempt to implement a function to determine the purity of an expression by syntax and Pexp_ident covers the most likely use case of quoting.

I wondered the same and I agree. Your optimization makes sense because it corresponds to a well-understood use case (we are not optimized, rather de-pessimizing). I would wait until a clear use-case comes for constants (or functions, etc.) before going further.

@gasche gasche merged commit 0a89b61 into ocaml-ppx:master Aug 13, 2021
@sim642
Copy link
Contributor Author

sim642 commented Aug 13, 2021

Yes, but how would you robustly check for performance regression, in a way that does not depend on the system on which the tests are run? One way to do this would be to count bytecode instructions, but I don't know of an easy, practical way to track this. (You can run a bytecode binary with the instrumented runtime, ocamlrund -t foo.bc | wc -l, but it's a bit cumbersome to setup.)

Automating it is nontrivial indeed.
I suppose one way would be to have an explicit copy of the last known good (i.e. with expected performance) derived code in the benchmark and have another one be derived via the current deriver implementation. Then check if the results significantly differ (beyond some margin of error) to automatically catch pessimization or optimization. And when intentional optimizations are made, just update the explicit copy from there on.

Anyway, this goes beyond what currently would be needed, just a thought about how it might be done.

@gasche
Copy link
Contributor

gasche commented Aug 13, 2021

Yes, having to update the "baseline" on a semi-regular basis is okay, but we want to avoid flaky tests that bring the CI down irregularly. If you want to try to work something out and submit a PR, that would be warmly welcome, but I am not sure how to get a robust test -- besides the ocamlrund trick I mentioned.

(Maybe a simpler approach would be to just store the -dparsetree output of the transformation, so that the test fails and has to be promoted whenever the generated code changes. When people change this test output we can think about the performance difference, possibly by running some benchmarks.)

sim642 added a commit to sim642/ppxlib that referenced this pull request Mar 20, 2022
sim642 added a commit to sim642/ppxlib that referenced this pull request Mar 20, 2022
sim642 added a commit to sim642/ppxlib that referenced this pull request Mar 20, 2022
sim642 added a commit to sim642/ppxlib that referenced this pull request Mar 26, 2022
sim642 added a commit to sim642/ppxlib that referenced this pull request Mar 26, 2022
sim642 added a commit to sim642/ppxlib that referenced this pull request Apr 8, 2022
NathanReb added a commit to NathanReb/opam-repository that referenced this pull request Mar 12, 2024
CHANGES:

* Port standard plugins to ppxlib registration and attributes
  ocaml-ppx/ppx_deriving#263
  (Simmo Saan)

* Optimize forwarding in eq and ord plugins
  ocaml-ppx/ppx_deriving#252
  (Simmo Saan)

* Delegate quoter to ppxlib
  ocaml-ppx/ppx_deriving#263
  (Simmo Saan)

* Introduce `Ppx_deriving_runtime.Stdlib` with OCaml >= 4.07.
  This module already exists in OCaml < 4.07 but was missing otherwise.
  ocaml-ppx/ppx_deriving#258
  (Kate Deplaix)
NathanReb added a commit to NathanReb/opam-repository that referenced this pull request Apr 15, 2024
CHANGES:

* Fix a bug in `[@@deriving make]` that caused errors when it was used on a set
  of type declarations containing at least one non record type.
  ocaml-ppx/ppx_deriving#281
  (@NathanReb)

* Embed errors instead of raising exceptions when generating code
  with `ppx_deriving.make`
  ocaml-ppx/ppx_deriving#281
  (@NathanReb)

* Remove `[%derive.iter ...]`, `[%derive.map ...]` and `[%derive.fold ...]`
  extensions
  ocaml-ppx/ppx_deriving#278
  (Simmo Saan)

* Port standard plugins to ppxlib registration and attributes
  ocaml-ppx/ppx_deriving#263
  (Simmo Saan)

* Optimize forwarding in eq and ord plugins
  ocaml-ppx/ppx_deriving#252
  (Simmo Saan)

* Delegate quoter to ppxlib
  ocaml-ppx/ppx_deriving#263
  (Simmo Saan)

* Introduce `Ppx_deriving_runtime.Stdlib` with OCaml >= 4.07.
  This module already exists in OCaml < 4.07 but was missing otherwise.
  ocaml-ppx/ppx_deriving#258
  (Kate Deplaix)
semgrep-ci bot pushed a commit to semgrep/semgrep that referenced this pull request Sep 18, 2024
…rep-proprietary#2266)

## Goal
The goal of this PR is to reduce our total allocated memory by not
infering `compare : t -> t -> int` for our `Gensym` library.

## Why not ppx derive `compare` for `Gensym`?
TL;DR: the derived ppx compare allocates a closure

In one of my traces, I found the following memory behavior:
<img width="482" alt="Screenshot 2024-09-18 at 11 17 01 AM"
src="https://github.com/user-attachments/assets/1bc79b21-2925-4a2f-8d68-94204ad5bfbf">

[About 40 Gigabytes of 600 of this semgrep run is allocated in
`IL.compare_name`; which is weird, as this function is just a string +
Int
comparison](https://github.com/semgrep/semgrep-proprietary/blob/80957a396f497e9b5906b6db018bdaef03a5f4d3/OSS/src/il/IL.ml#L111-L116C15),
with no place to really allocate memory. The source of all of these
allocations was in the ppx derived `G.SId.compare`, which compares two
ints as so:
```ocaml
let rec compare : t -> t -> Ppx_deriving_runtime.int =
  ((let __0 () (a : int) b = Ppx_deriving_runtime.compare a b in
    ((let open! ((Ppx_deriving_runtime)[@ocaml.warning "-A"]) in
      fun x -> (__0 ()) x)
      [@ocaml.warning "-A"]))
        [@ocaml.warning "-39"])[@@ocaml.warning "-39"]
```
Which unnecessarily allocates a closure. 

## Post ppx-unrolling
Manually implementing a compare however we get the following trace:
<img width="443" alt="Screenshot 2024-09-18 at 12 02 01 PM"
src="https://github.com/user-attachments/assets/60aeb2aa-c9a7-44f0-b10b-05ff85f991d1">

## Notes
- There was a bug fix on ppx trying to resolve this exact issue
(ocaml-ppx/ppx_deriving#252)
- All of these allocations end up in the minor heap, so it isn't much of
a performance win, but we should get rid of it anyways
## Test plan
### Correctness
- `make test`
### performance
- run the command generated by `semgrep --pro --config
tests/perf/rules/deepsemgrep-sqli-rules.yaml blaze-persistence -d
--timeout=0` ([where blaze-persistance is this repo
](https://github.com/Blazebit/blaze-persistence)), before and after
introducing the fix w/Memtrace.

synced from Pro 69685213f1c376f8103846da4c36c9cc63e443ce
semgrep-ci bot pushed a commit to semgrep/semgrep that referenced this pull request Sep 19, 2024
…rep-proprietary#2266)

## Goal
The goal of this PR is to reduce our total allocated memory by not
infering `compare : t -> t -> int` for our `Gensym` library.

## Why not ppx derive `compare` for `Gensym`?
TL;DR: the derived ppx compare allocates a closure

In one of my traces, I found the following memory behavior:
<img width="482" alt="Screenshot 2024-09-18 at 11 17 01 AM"
src="https://github.com/user-attachments/assets/1bc79b21-2925-4a2f-8d68-94204ad5bfbf">

[About 40 Gigabytes of 600 of this semgrep run is allocated in
`IL.compare_name`; which is weird, as this function is just a string +
Int
comparison](https://github.com/semgrep/semgrep-proprietary/blob/80957a396f497e9b5906b6db018bdaef03a5f4d3/OSS/src/il/IL.ml#L111-L116C15),
with no place to really allocate memory. The source of all of these
allocations was in the ppx derived `G.SId.compare`, which compares two
ints as so:
```ocaml
let rec compare : t -> t -> Ppx_deriving_runtime.int =
  ((let __0 () (a : int) b = Ppx_deriving_runtime.compare a b in
    ((let open! ((Ppx_deriving_runtime)[@ocaml.warning "-A"]) in
      fun x -> (__0 ()) x)
      [@ocaml.warning "-A"]))
        [@ocaml.warning "-39"])[@@ocaml.warning "-39"]
```
Which unnecessarily allocates a closure. 

## Post ppx-unrolling
Manually implementing a compare however we get the following trace:
<img width="443" alt="Screenshot 2024-09-18 at 12 02 01 PM"
src="https://github.com/user-attachments/assets/60aeb2aa-c9a7-44f0-b10b-05ff85f991d1">

## Notes
- There was a bug fix on ppx trying to resolve this exact issue
(ocaml-ppx/ppx_deriving#252)
- All of these allocations end up in the minor heap, so it isn't much of
a performance win, but we should get rid of it anyways
## Test plan
### Correctness
- `make test`
### performance
- run the command generated by `semgrep --pro --config
tests/perf/rules/deepsemgrep-sqli-rules.yaml blaze-persistence -d
--timeout=0` ([where blaze-persistance is this repo
](https://github.com/Blazebit/blaze-persistence)), before and after
introducing the fix w/Memtrace.

synced from Pro 69685213f1c376f8103846da4c36c9cc63e443ce
nmote pushed a commit to semgrep/semgrep that referenced this pull request Sep 19, 2024
…rep-proprietary#2266)

## Goal
The goal of this PR is to reduce our total allocated memory by not
infering `compare : t -> t -> int` for our `Gensym` library.

## Why not ppx derive `compare` for `Gensym`?
TL;DR: the derived ppx compare allocates a closure

In one of my traces, I found the following memory behavior:
<img width="482" alt="Screenshot 2024-09-18 at 11 17 01 AM"
src="https://github.com/user-attachments/assets/1bc79b21-2925-4a2f-8d68-94204ad5bfbf">

[About 40 Gigabytes of 600 of this semgrep run is allocated in
`IL.compare_name`; which is weird, as this function is just a string +
Int
comparison](https://github.com/semgrep/semgrep-proprietary/blob/80957a396f497e9b5906b6db018bdaef03a5f4d3/OSS/src/il/IL.ml#L111-L116C15),
with no place to really allocate memory. The source of all of these
allocations was in the ppx derived `G.SId.compare`, which compares two
ints as so:
```ocaml
let rec compare : t -> t -> Ppx_deriving_runtime.int =
  ((let __0 () (a : int) b = Ppx_deriving_runtime.compare a b in
    ((let open! ((Ppx_deriving_runtime)[@ocaml.warning "-A"]) in
      fun x -> (__0 ()) x)
      [@ocaml.warning "-A"]))
        [@ocaml.warning "-39"])[@@ocaml.warning "-39"]
```
Which unnecessarily allocates a closure. 

## Post ppx-unrolling
Manually implementing a compare however we get the following trace:
<img width="443" alt="Screenshot 2024-09-18 at 12 02 01 PM"
src="https://github.com/user-attachments/assets/60aeb2aa-c9a7-44f0-b10b-05ff85f991d1">

## Notes
- There was a bug fix on ppx trying to resolve this exact issue
(ocaml-ppx/ppx_deriving#252)
- All of these allocations end up in the minor heap, so it isn't much of
a performance win, but we should get rid of it anyways
## Test plan
### Correctness
- `make test`
### performance
- run the command generated by `semgrep --pro --config
tests/perf/rules/deepsemgrep-sqli-rules.yaml blaze-persistence -d
--timeout=0` ([where blaze-persistance is this repo
](https://github.com/Blazebit/blaze-persistence)), before and after
introducing the fix w/Memtrace.

synced from Pro 69685213f1c376f8103846da4c36c9cc63e443ce
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants