I found that `_safe_log` is unfortunately not a safe operation in all cases. Only architectures with poor floating point precision are affected.
The following logs are available:

- i386 (aka x86 32-bit): fails during `_unbinned_dll` (probably trying to sum `-inf` from `x = 0`)
- s390x: fails during `_unbinned_dll` in `test_UnbinnedNLL` (not quite clear to me where exactly this goes wrong)
- mips64el: fails during `poisson_chi2`
- mipsel: fails during `poisson_chi2`
What connects all of these failures is the fact that they call `_safe_log`. Notably, the implementation:

```python
def _safe_log(x):
    # guard against x = 0
    return np.log(x + 1e-323)
```
`1e-323` was probably picked since it is close to `np.finfo(float).smallest_subnormal` (which would be the exact value, about `4.94e-324`). However, this only works when working with 64-bit floats: `np.finfo(np.float32).smallest_subnormal` evaluates to `1e-45` (i.e. about `1.4e-45`). I think the issue is that during the addition, when `x` is a 32-bit float, `1e-323` is cast to a 32-bit float (presumably because NumPy's promotion rules keep the result in the array's dtype). Because `np.float32(1e-323)` evaluates to `0.0`, the log returns `-inf`.
One way this could be tested more precisely is by checking that the result of `_safe_log` in the tests is actually not `-inf`, since `np.log(0.0)` does not throw any error (this would need to be adjusted in `test_without_numba`).
I'm not entirely sure what the best solution is here; I see four options:

1. Instead of adding something to `x`, don't execute the log for values where `x` is zero: `np.log(x, where=(x > 0.), out=np.full_like(x, np.log(np.finfo(float).smallest_subnormal)))`. Disadvantage: always returns an array, even for plain floats, and is potentially slow.
2. Use the 32-bit smallest subnormal instead of the 64-bit one. Disadvantage: potential loss of precision.
3. Provide a `_safe_log` that depends on the datatype and uses the appropriate smallest subnormal. However, I'm not sure how this can be implemented.
4. Clip the `-inf`: `np.clip(np.log(x), -1e3, None)`. Disadvantage: probably slightly slower than the current implementation.
I did a quick test of option 2 using the following code:

```python
import numpy as np

_SSB32 = np.finfo(np.float32).smallest_subnormal
_SSB64 = np.finfo(np.float64).smallest_subnormal
_reasonably_small = 1e-40
_normal_small = 1e-30

def pretty(x):
    return np.format_float_scientific(x, exp_digits=3, precision=4, min_digits=4)

def logdiff(x):
    return np.log(x + _SSB32) - np.log(x)

print(f"""
For {pretty(_SSB64)}: {pretty(logdiff(_SSB64))}
For {pretty(_SSB32)}: {pretty(logdiff(_SSB32))}
For {pretty(_reasonably_small)}: {pretty(logdiff(_reasonably_small))}
For {pretty(_normal_small)}: {pretty(logdiff(_normal_small))}
""")
```
Output:

```
For 4.9407e-324: 6.4116e+002
For 1.4013e-045: 6.9315e-001
For 1.0000e-040: 1.4013e-005
For 1.0000e-030: 0.0000e+000
```
So I don't think it's entirely unreasonable: for more or less all numbers people should actually use (i.e. `x > 1e-30`), the difference is precisely zero. For comparison, with the current approach this limit is roughly `1e-310`.
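As a quick extra check (mine, not from the logs above) of where that float64 limit sits:

```python
import numpy as np

# where does adding 1e-323 stop making a difference for float64 inputs?
for x in (1e-323, 1e-315, 1e-305):
    diff = np.log(x + 1e-323) - np.log(x)
    print(f"x = {x:.0e}: diff = {diff:.4e}")
# for x = 1e-323 the guard roughly doubles the value (diff near log 2);
# at 1e-315 it still shifts the result slightly;
# by 1e-305 the addition rounds away entirely and the difference is exactly zero
```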
I think overall option 4 provides the best balance between precision, performance, ease of implementation and platform compatibility.