
Conversation

BalzaniEdoardo

@BalzaniEdoardo BalzaniEdoardo commented May 6, 2025

L-BFGS Implementation for quasi_newton.py

Hello, and first of all, thank you for the great package!

In this PR, I’m working on implementing the L-BFGS algorithm within the quasi_newton.py module, targeting part of #116. Before consolidating the code with tests and full integration, I’d appreciate guidance on some design decisions.


Implementation Overview

  • The descent direction is computed via _lim_mem_hess_inv_operator_fn, which implements the two-loop recursion using the history of parameter and gradient residuals.
  • _lim_mem_hess_inv_operator acts as an operator factory: it partially evaluates _lim_mem_hess_inv_operator_fn with the current residuals and returns a lineax.FunctionLinearOperator.
  • Currently, the operator is materialized before returning, which likely defeats the purpose of using an implicit representation.
  • The buffers with the residuals are stored in a pytree of arrays with the same structure as the parameters, but with an additional dimension (of length "buffer size").
  • Residuals are stored in a dictionary within _QuasiNewtonState.
  • The Hessian update returns an additional dictionary carrying the updated residual state.
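For reference, the standard two-loop recursion that the operator implements can be sketched as follows. This is a toy flat-array version (names are illustrative); the PR's actual implementation works over pytrees and a circular buffer:

```python
import jax.numpy as jnp

def two_loop_recursion(grad, s_history, y_history, rho):
    """L-BFGS two-loop recursion: apply the approximate inverse
    Hessian to `grad`, given the m stored (s_k, y_k) pairs.

    s_history, y_history: arrays of shape (m, n), oldest first.
    rho: array of shape (m,), rho_k = 1 / <s_k, y_k>.
    """
    m = s_history.shape[0]
    q = grad
    alphas = []
    # First loop: newest to oldest.
    for i in reversed(range(m)):
        alpha = rho[i] * jnp.dot(s_history[i], q)
        q = q - alpha * y_history[i]
        alphas.append(alpha)
    # Initial scaling gamma = <s, y> / <y, y> from the most recent pair.
    gamma = jnp.dot(s_history[-1], y_history[-1]) / jnp.dot(
        y_history[-1], y_history[-1]
    )
    r = gamma * q
    # Second loop: oldest to newest, reusing the stored alphas.
    for i in range(m):
        beta = rho[i] * jnp.dot(y_history[i], r)
        r = r + (alphas[m - 1 - i] - beta) * s_history[i]
    return r
```

A quick sanity check: with a single stored pair from an exact quadratic with Hessian c·I (so y = c·s), the recursion returns grad / c exactly.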

Questions

  1. Materialization vs. Tree Equality

    To satisfy this assertion in _iterate.py:

    assert eqx.tree_equal(static_state, new_static_state) is True

    I had to materialize the FunctionLinearOperator. Is there a way to avoid this and retain the implicit operator while still passing this check?

  2. The memory buffer size is currently fixed at 10 iterations. What’s the best way to expose this as a user-settable parameter without altering the broader solver API in quasi_newton.py?

  3. I have not tested this systematically yet, but for low-dimensional parameters, JIT-compiling the function _lim_mem_hess_inv_operator_fn and calling it directly seems significantly faster than constructing a FunctionLinearOperator. Below is line_profiler output for a minimise call:

Timer unit: 1e-06 s

Total time: 0.052245 s
File: /Users/ebalzani/Code/optimistix/optimistix/_solver/quasi_newton.py
Function: _lim_mem_hess_inv_operator at line 120

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  120                                           @line_profiler.profile
  121                                           def _lim_mem_hess_inv_operator(
  122                                                   residual_par: PyTree[Array],
  123                                                   residual_grad: PyTree[Array],
  124                                                   rho: Array,
  125                                                   index_start: Array,
  126                                                   input_shape: Optional[PyTree[jax.ShapeDtypeStruct]] = None,
  127                                           ):
  128                                               """Define a `lineax` linear operator implementing the L-BFGS inverse Hessian approximation.
  129                                           
  130                                               This operator computes the action of the approximate inverse Hessian on a vector `pytree`
  131                                               using the limited-memory BFGS (L-BFGS) two-loop recursion. It does not materialize the matrix
  132                                               explicitly but returns a `lineax.FunctionLinearOperator`.
  133                                           
  134                                               - `residual_par`: History of parameter updates `s_k = x_{k+1} - x_k`
  135                                               - `residual_grad`: History of gradient updates `y_k = g_{k+1} - g_k`
  136                                               - `rho`: Reciprocal dot products `rho_k = 1 / ⟨s_k, y_k⟩`
  137                                               - `index_start`: Index of the most recent update in the circular buffer
  138                                           
  139                                               Returns a `lineax.FunctionLinearOperator` with input and output shape matching a single element
  140                                               of `residual_par`.
  141                                           
  142                                               """
  143         4          2.0      0.5      0.0      operator_func = partial(
  144         2          0.0      0.0      0.0          _lim_mem_hess_inv_operator_fn,
  145         2          0.0      0.0      0.0          residual_par=residual_par,
  146         2          0.0      0.0      0.0          residual_grad=residual_grad,
  147         2          0.0      0.0      0.0          rho=rho,
  148         2          0.0      0.0      0.0          index_start=index_start
  149                                               )
  150         2          0.0      0.0      0.0      input_shape = (
  151         2        942.0    471.0      1.8          jax.eval_shape(lambda: jtu.tree_map(lambda x: x[0], residual_par))
  152         2          0.0      0.0      0.0          if input_shape is None
  153                                                   else input_shape
  154                                               )
  155         4      28219.0   7054.8     54.0      op = lx.FunctionLinearOperator(
  156         2          0.0      0.0      0.0          operator_func,
  157         2          0.0      0.0      0.0          input_shape,
  158         2          1.0      0.5      0.0          tags=lx.positive_semidefinite_tag,
  159                                               )
  160         2      23081.0  11540.5     44.2      return lx.materialise(op)

Let me know how to move on from here and thanks in advance for the insights!

PS tagging my collaborator here too: @bagibence

@johannahaffner
Collaborator

Awesome! And that was impressively quick.
I can go through this by Thursday :)

Collaborator

@johannahaffner johannahaffner left a comment

This looks very reasonable. I've left a first round of comments. Let me know what you think!

)
input_shape = (
jax.eval_shape(lambda: jtu.tree_map(lambda x: x[0], residual_par))
if input_shape is None
Collaborator

jax.eval_shape(lambda: y) should always be the input shape, so I think we can use that directly here and avoid making it an optional argument. I can't think of a case where we would want to use any other shape, since the Hessian is always with respect to y and always symmetric.
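For context, jax.eval_shape traces abstractly and returns only shape/dtype information, so calling it on a closure over y costs no computation:

```python
import jax
import jax.numpy as jnp

# No arrays are allocated here: jax.eval_shape propagates only
# shape and dtype information through the traced function.
y = {"w": jnp.zeros((3, 2)), "b": jnp.zeros(2)}
struct = jax.eval_shape(lambda: y)

# The result mirrors the pytree structure of y, with
# jax.ShapeDtypeStruct leaves in place of concrete arrays.
assert struct["w"].shape == (3, 2)
```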

Author

Yes, there is no case in which that's not true! The idea was to provide it directly as a static arg, but once the function is jit-compiled that won't matter, I guess.

Collaborator

Exactly, this is inside the jitted region and the shape of y is static.

input_shape,
tags=lx.positive_semidefinite_tag,
)
return lx.materialise(op)
Collaborator

I think the shape complaint you got when just returning the function linear operator is due to the jaxpr in it, we solve this in a few places by just returning the dynamic portion of the equinox module, e.g. here. (And here for some background.)

Owner

Incidentally I've been meaning to track down what's going on here - this indicates that we are retracing jaxprs, which is not ideal from a compile-time perspective. In a more perfect world we can just directly use the one we already have!

Collaborator

Where do you think that originates?

Author

@BalzaniEdoardo BalzaniEdoardo May 12, 2025

Using eqx.filter makes sense here. I believe the issue comes from the fact that the jaxprs create a unique id for each traced variable:

(Pdb) elem.invars
[Var(id=4544898944):float64[1]]
(Pdb) elem_.invars
[Var(id=4775038400):float64[1]]

here elem and elem_ are the jaxpr vars in the scope of eqx.tree_equal

Author

As far as the jaxpr is concerned, I followed the suggestion of filtering the tree.
I am implementing the filtering in the __call__ method of LBFGSUpdate:

  • Get the static part of the tree before calling update/no_update
  • In update/no_update, get the operator and filter for the dynamic sub-tree
  • After the update call, combine the static part with the dynamic one

Does this make sense?

Collaborator

Filtering as you've done here makes perfect sense!

@johannahaffner
Collaborator

To address this question (others in the comments):

  1. JIT-compiling the function _lim_mem_hess_inv_operator_fn and calling it directly seems significantly faster than the construction of a FunctionLinearOperator.

Maybe I'm not seeing it, but did you include the timing results for JIT compilation of just the function that creates the operator?

@BalzaniEdoardo
Author

Thanks for the feedback!

I gave it another pass, let me know if this is good enough for me to start writing tests.

@BalzaniEdoardo
Author

To address this question (others in the comments):

  1. JIT-compiling the function _lim_mem_hess_inv_operator_fn and calling it directly seems significantly faster than the construction of a FunctionLinearOperator.

Maybe I'm not seeing it, but did you include the timing results for JIT compilation of just the function that creates the operator?

I am not sure when this could be a bottleneck; likely only for small problems, for which evaluating the loss and its gradient is very fast. I am attaching a quick benchmarking script:

from time import perf_counter

import jax
import jax.numpy as jnp
from jax import random
from optimistix._solver.quasi_newton import _make_lbfgs_operator, _lbfgs_operator_fn

jax.config.update("jax_enable_x64", True)

key = random.PRNGKey(123)
par_shape = (1,)
memory_size = 10

key, subkey1, subkey2, subkey3 = random.split(key, 4)

X = random.normal(subkey1, (30, *par_shape))
true_pars = random.normal(subkey2, par_shape)
noise = 0.8 * random.normal(subkey3, (30,))
y = jnp.dot(X, true_pars) + noise

init_par = jnp.zeros(par_shape)

grad_diff = jnp.zeros((memory_size, *par_shape))
param_diff = jnp.zeros((memory_size, *par_shape))
inner_products = jnp.zeros((memory_size, *par_shape))
index_start = jnp.array(0)

# ---------------------------
# Without JIT
# ---------------------------
print("=== Without JIT ===")

with jax.disable_jit(True):
    t0 = perf_counter()
    _lbfgs_operator_fn(init_par, param_diff, grad_diff, inner_products, index_start)
    print("Function call (tracing JAXPR):", perf_counter() - t0)

    t0 = perf_counter()
    _lbfgs_operator_fn(init_par, param_diff, grad_diff, inner_products, index_start)
    print("Function call (interpreted, no trace):", perf_counter() - t0)

    t0 = perf_counter()
    op = _make_lbfgs_operator(
        y_diff_history=param_diff,
        grad_diff_history=grad_diff,
        inner_history=inner_products,
        index_start=index_start,
    )
    print("LinearOperator construction (traces internal matvec):", perf_counter() - t0)

    t0 = perf_counter()
    op = _make_lbfgs_operator(
        y_diff_history=param_diff,
        grad_diff_history=grad_diff,
        inner_history=inner_products,
        index_start=index_start,
    )
    op.mv(init_par)
    print("LinearOperator construction + mv (new trace):", perf_counter() - t0)


# ---------------------------
# With JIT
# ---------------------------
print("\n=== With JIT ===")

t0 = perf_counter()
_lbfgs_operator_fn(init_par, param_diff, grad_diff, inner_products, index_start)
print("Function call (JIT trace + compile + run):", perf_counter() - t0)

t0 = perf_counter()
_lbfgs_operator_fn(init_par, param_diff, grad_diff, inner_products, index_start)
print("Function call (JIT execution only):", perf_counter() - t0)

t0 = perf_counter()
op_jit = _make_lbfgs_operator(
    y_diff_history=param_diff,
    grad_diff_history=grad_diff,
    inner_history=inner_products,
    index_start=index_start,
)
op_jit.mv(init_par)
print("LinearOperator JIT construction + mv (compile + run):", perf_counter() - t0)

t0 = perf_counter()
op_jit = _make_lbfgs_operator(
    y_diff_history=param_diff,
    grad_diff_history=grad_diff,
    inner_history=inner_products,
    index_start=index_start,
)
op_jit.mv(init_par)
print("LinearOperator JIT reconstruction + mv (new trace + run):", perf_counter() - t0)

t0 = perf_counter()
op_jit.mv(init_par)
print("LinearOperator mv call (JIT execution only):", perf_counter() - t0)

which outputs:

=== Without JIT ===
Function call (tracing JAXPR): 1.0509237500373274
Function call (interpreted, no trace): 0.030687999911606312
LinearOperator construction (traces internal matvec): 0.1505412079859525
LinearOperator construction + mv (new trace): 0.9351932499557734
=== With JIT ===
Function call (JIT trace + compile + run): 0.05465291708242148
Function call (JIT execution only): 2.062495332211256e-05
LinearOperator JIT construction + mv (compile + run): 0.07817629189230502
LinearOperator JIT reconstruction + mv (new trace + run): 0.002992084017023444
LinearOperator mv call (JIT execution only): 0.0002868750598281622

@johannahaffner
Collaborator

Thank you for the update! I can review tomorrow night.

Collaborator

@johannahaffner johannahaffner left a comment

Nice work! I've left a few small comments and I can follow along much more easily now, thanks for the descriptive variable names!

For tests you can just add it to the list of minimisers defined in tests/helpers.py.



class LBFGSUpdate(AbstractQuasiNewtonUpdate, strict=True):
"""Private intermediate class for LBFGS updates."""
Collaborator

This should be a public class, I think. We want to expose the update classes to users who may wish to build custom solvers, in which case they are required. If you have a reference for this specific implementation, it would be great to add it here!

Small thing: can we move the definition of the update class up, so that it is grouped with the other update classes? We then define the abstract solver and have the concrete ones at the bottom of the file 🙈

Author

Absolutely 😀 I totally understand the need for linting the code!

recursion. It does not materialize the matrix explicitly but returns a
`lineax.FunctionLinearOperator`.

- `y_diff_history`: History of parameter updates `s_k = x_{k+1} - x_k`
Collaborator

Can we add a small comment that s_k and y_k are the typical variable names used in the literature? Since our y is something else.

self.descent = NewtonDescent()
self.search = search
self.hessian_update = LBFGSUpdate(
use_inverse=True,
Collaborator

This only supports use_inverse = True right now, correct? Do you know what this would look like for the approximation of the Hessian itself, not its inverse?

Author

I looked this up and I believe that an approximation of the Hessian itself may be possible, starting from a representation like the one in chapter 7.2 of this:
https://www.math.uci.edu/~qnie/Publications/NumericalOptimization.pdf

where the author talks about the compact representation of the update. Let me know if my intuition is correct and I'll dig into it more!

Author

PS: is the approximation of the Hessian something you would want?

Owner

@patrick-kidger patrick-kidger left a comment

Okay, apologies for taking so long to get around to this!

I've not checked the details of the algorithms precisely, but other than that I think this basically all looks reasonable to me. I've left lots of nitty comments below on edge cases and code tidiness and such.

Collaborator

@johannahaffner johannahaffner left a comment

Addressed some small fixes and moved the function to create an identity operator back to quasi Newton, without the tag shenanigans :)

@johannahaffner
Collaborator

Alright, tweaked some more things!

Of note:

  • I suggest we do the handling of self.use_inverse in a separate PR that in turn moves the Hessian update machinery into solver methods @patrick-kidger, and leave this as-is here (see comments above).
  • Likewise, I'm happy to refine the jaxpr handling in the course of improving our static handling of these (to make everything compatible with jax.vmap).
  • @BalzaniEdoardo can you do the jnp.where safeguard for the no-jit + debug + vmap edge case? It seems like you know what would need to change :)
  • Finally, I'm questioning whether failing to unpack the shape of the y_diff history isn't actually an informative way to fail in the case in which y is an empty pytree.

@BalzaniEdoardo
Author

  • @BalzaniEdoardo can you do the jnp.where safeguard for the no-jit + debug + vmap edge case? It seems like you know what would need to change :)

Sounds good, I'll add that tomorrow.
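For reference, the jnp.where safeguard being discussed is presumably the usual "double-where" pattern for guarding a conditional division (a generic sketch, not the PR's actual code):

```python
import jax
import jax.numpy as jnp

def safe_reciprocal(inner):
    # Inner where: substitute a harmless denominator so the division
    # itself never sees zero; this keeps vmap/debug_nans and the
    # backward pass NaN-free.
    safe = jnp.where(inner > 0, inner, 1.0)
    # Outer where: select the fallback value on the guarded branch.
    return jnp.where(inner > 0, 1.0 / safe, 0.0)

# The gradient at inner = 0 is well-defined (zero), not NaN: the
# unselected 1/x branch never contributes an inf * 0 term.
grad_at_zero = jax.grad(safe_reciprocal)(0.0)
```

Without the inner where, jax.grad at zero would propagate a NaN through the discarded branch even though the forward value is fine.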

@johannahaffner
Collaborator

I think you're missing a jnp.asarray around the inner in the set :)

@johannahaffner
Collaborator

History length is now a state attribute, and will become a solver-level attribute when the update methods become solver methods :)

Owner

@patrick-kidger patrick-kidger left a comment

Okay I think this LGTM! I have only nitty remarks/questions, see below.

@johannahaffner happy to merge this into dev :)

history_length: int
y_diff_history: PyTree[Y]
grad_diff_history: PyTree[Y]
y_diff_grad_diff_cross_inner: Float[Array, " history_length history_length"]
Owner

Just to confirm, this is intentionally only used when use_inverse=False?

Collaborator

@johannahaffner johannahaffner Jul 13, 2025

Yes, we have different update states for the approximation of the Hessian and its inverse, and they do use different inner / outer products.

Comment on lines +384 to +389
y_diff_grad_diff_cross_inner = state.y_diff_grad_diff_cross_inner.at[
state.index_start % self.history_length
].set(v_tree_dot(state.grad_diff_history, y_diff))
y_diff_grad_diff_cross_inner = y_diff_grad_diff_cross_inner.at[
:, state.index_start % self.history_length
].set(0)
Owner

IIUC this is a trick to gradually fill in the lower triangular part of matrix (across multiple calls to _update), with zero diagonal, and the upper triangular part will be filled with either zeros or nonsense data depending where we are? And the upper triangular part is fine because it's not read by the triangular solve calls we use, so we're happy with it containing meaningless data?

This is subtle enough that I think it deserves a comment 😄

Collaborator

Good catch, added an explanation!
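The indexing pattern quoted above can be illustrated with a standalone toy sketch, where scalar row values stand in for the actual inner products:

```python
import jax.numpy as jnp

m = 3  # history length
cross = jnp.zeros((m, m))

# Simulate m successive updates into the circular buffer, mimicking
# the row-write + column-zero pattern from the PR.
for k, row_value in enumerate([1.0, 2.0, 3.0]):
    i = k % m
    # Write the new slot's row of inner products...
    cross = cross.at[i].set(row_value)
    # ...then zero its column: stale products involving the slot
    # that was just overwritten must never be read again.
    cross = cross.at[:, i].set(0)

# After m updates the matrix is strictly lower triangular with a
# zero diagonal; anything left above the diagonal is never read by
# the triangular solves.
```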

# We know that gamma > 0 because we only update the Hessian approximation if the
# inner product is positive, to maintain positive definiteness of the Hessian
# approximation, and thus this operator is only ever called in that case.
latest_y_diff, latest_grad_diff = jtu.tree_map(
Collaborator

@johannahaffner johannahaffner Jul 13, 2025

Is this the oldest pair, or the most current one @BalzaniEdoardo? Wikipedia indexes this with k-m.

Collaborator

(I think that means that it is the oldest?)

Author

It is the most recent in the code, because the index is incremented by one before _update is called, so subtracting one selects the most recent update. This matches Byrd et al. (1994), Eq. (3.12), and the optax implementation, where the most recent (s_{k-1}, y_{k-1}) pair is used (see attached figure).

I also noticed this discrepancy with the Wikipedia page, which suggests using (s_{k−m}, y_{k−m}) instead. I'm not aware of an alternative that is provably better, so I went with what seems to be the most common practice.

[Screenshot 2025-07-14: attached figure]
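The convention described here can be sketched as follows (illustrative index arithmetic, assuming index_start points at the next slot to be written, as described above):

```python
import jax.numpy as jnp

def gamma_from_history(s_history, y_history, index_start):
    """Initial inverse-Hessian scaling gamma = <s, y> / <y, y>,
    taken from the most recent pair in a circular buffer of size m.

    Since the index is incremented before the update, the most
    recent pair lives at index_start - 1 (mod m).
    """
    m = s_history.shape[0]
    latest = (index_start - 1) % m
    s, y = s_history[latest], y_history[latest]
    return jnp.dot(s, y) / jnp.dot(y, y)
```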

Collaborator

Perfect, thank you!

@johannahaffner johannahaffner merged commit 8dbc8e5 into patrick-kidger:dev Jul 14, 2025
2 checks passed