Conversation

@j-berman (Collaborator)

Idea by @kayabaNerve to speed up field element inversions when multiple inversions are needed at once. This is a significant perf optimization used in FCMP++ tree building.

The perf test shows that batch inverting 1000 elements is ~98% faster than inverting each one individually.

@jeffro256 (Contributor) left a comment:

Here's a change to avoid allocations:

diff --git a/src/crypto/crypto-ops.c b/src/crypto/crypto-ops.c
index d5db2481e..6205d19e8 100644
--- a/src/crypto/crypto-ops.c
+++ b/src/crypto/crypto-ops.c
@@ -30,7 +30,6 @@
 
 #include <assert.h>
 #include <stdint.h>
-#include <stdlib.h>
 
 #include "warnings.h"
 #include "crypto-ops.h"
@@ -322,28 +321,22 @@ int fe_batch_invert(fe *out, const fe *in, const int n) {
   }
 
   // Step 1: collect initial muls
-  fe *init_muls = (fe *) malloc(n * sizeof(fe));
-  if (!init_muls) {
-    return 1;
-  }
-  fe_copy(init_muls[0], in[0]);
+  fe_copy(out[0], in[0]);
   for (int i = 1; i < n; ++i) {
-    fe_mul(init_muls[i], init_muls[i-1], in[i]);
+    fe_mul(out[i], out[i-1], in[i]);
   }
 
   // Step 2: get the inverse of all elems multiplied together
   fe a;
-  fe_invert(a, init_muls[n-1]);
+  fe_invert(a, out[n-1]);
 
   // Step 3: get each inverse
   for (int i = n; i > 1; --i) {
-    fe_mul(out[i-1], a, init_muls[i-2]);
+    fe_mul(out[i-1], a, out[i-2]);
     fe_mul(a, a, in[i-1]);
   }
   fe_copy(out[0], a);
 
-  free(init_muls);
-
   return 0;
 }
 

@j-berman j-berman force-pushed the fcmp++-fe-batch-invert branch from b75df59 to be4c802 Compare September 27, 2025 02:04
@j-berman j-berman force-pushed the fcmp++-fe-batch-invert branch from be4c802 to 4e9adb7 Compare September 27, 2025 02:06
// Montgomery's trick
// https://iacr.org/archive/pkc2004/29470042/29470042.pdf 2.2
int fe_batch_invert(fe *out, const fe *in, const int n) {
if (n == 0) {
Contributor:

Should the function be labeled vartime due to this branch?

Contributor:

Being variable-time in the width, which generally isn't a secret, shouldn't be an issue. You'd have to argue that the number of inputs/amounts is a secret and should be handled in constant time accordingly (or whatever defines the width here).


bool init()
{
m_fes = (fe *) malloc(n_elems * sizeof(fe));
Contributor:

Check for nullptr? Could easily use std::vector ...?

Collaborator Author:

Used a vector instead of a pointer in the latest. I think I did that originally and then changed it, I don't remember why. Good spot

Collaborator Author:

CI failure reminded me why I changed it from using a vector...

/Applications/Xcode_16.4.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/v1/__memory/allocator.h:168:84: error: object expression of non-scalar type 'int[10]' cannot be used in a pseudo-destructor expression
  168 |   _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_HIDE_FROM_ABI void destroy(pointer __p) { __p->~_Tp(); }

I figured using a raw pointer was the simplest way to deal with this. Do you see a better way?

Collaborator Author:

I use this pattern in FCMP++ integration code, not just in tests here, so it would definitely be good to improve if you see a better way

Collaborator Author:

Reverted to using a raw pointer in the latest. Open to a better approach

Contributor:

Oops.

I think the only way is to wrap in another object.

Contributor:

std::unique_ptr mostly works here, but you have to match it with the correct allocation call (a new[] array expression).

Contributor:

> I use this pattern in FCMP++ integration code, not just in tests here, so it would definitely be good to improve if you see a better way.

Hopefully the code doesn't leak on exceptions (like these tests kind of do).

Collaborator Author:

How does the latest look? I'd update the integration code with that pattern too (it does currently leak on exceptions)


bool test()
{
fe *inv_fes = (fe *) malloc(n_elems * sizeof(fe));
Contributor:

Same here.

{
fe *ptr = (fe *) malloc(n * sizeof(fe));
if (!ptr)
throw std::runtime_error("failed to malloc fe *");
Contributor:

throw std::bad_alloc is probably more appropriate, as this has to allocate memory for the std::string after just failing to malloc.

Again, not sure why std::vector isn't used instead.

{
fe *batch_inverted = alloc(n_elems);
ASSERT_EQ(fe_batch_invert(batch_inverted, init_elems, n_elems), 0);
ASSERT_EQ(memcmp(batch_inverted, norm_inverted, n_elems * sizeof(fe)), 0);
Contributor:

It's worth noting that in-memory representations of fe might not be equal even when the fe_tobytes() canonical representations are the same. They happen to be equal in this case, so it might be worth keeping this test as-is. But it's conceivable that some correct change to e.g. fe_mul() down the line could cause this line to fail. I would recommend keeping this line, but documenting it in case it fails later.

Contributor:

ref10 in general doesn't guarantee they're reduced. That caused annoyances with the FCMP++ code...

(Adding on to what you said, as you identified and raised)

Collaborator Author:

Added a warning in the latest, also linking to an fe_reduce implementation that could be used to update the test

Collaborator Author:

I implemented an fe_equals over here: j-berman@2a56788

Will make a follow-up PR to this one

{
std::vector<fe> inv_fes(n_elems);

if (batched)
@jeffro256 (Contributor) Sep 27, 2025:

Suggested change:
-if (batched)
+if constexpr (batched)

For good measure: it helps the compiler optimize the branch out, and it signals to the reader that that is what's happening.

@j-berman (Collaborator Author)

Test failure is unrelated

5 participants