Speed Up Worst-Case Overhead #376
Conversation
Note: Unfortunately we can't use last_map in additional areas near the end of the callback due to segmentation faults and stale data, so the speedup from avoiding re-resolution is negated by broken functionality.
My tests so far:
EDIT: I made a typo above, one of the comparisons was backwards. Should be irrelevant after the update though.
Ok, when I try building on 3.13.3 on Linux, I need the following code to get it to build:

```c
#if PY_VERSION_HEX < 0x030e00a1  // 3.14.0a1
#  define Py_HashBuffer(b, s) _Py_HashBytes(b, s)
#  ifndef Py_BUILD_CORE
#    define Py_BUILD_CORE 1
#  endif
#  include "internal/pycore_pyhash.h"  // _Py_HashBytes()
#  undef Py_BUILD_CORE
#endif
```

I'm working on getting it to import and run tests now.
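(Aside, for readers less used to `PY_VERSION_HEX`: the `0x030e00a1` threshold in the shim above simply encodes 3.14.0a1, the version the macro treats as gaining `Py_HashBuffer`. The snippet below is an illustrative check of mine, not part of the PR; `sys.hexversion` uses the same encoding.)

```python
# Illustrative only (not part of the PR): mirror the C preprocessor gate from Python.
import sys

# 0x03 0x0e 0x00 0xa 0x1  ->  major=3, minor=14, micro=0, level=alpha, serial=1
THRESHOLD = 0x030E00A1  # 3.14.0a1, the version the macro assumes introduces Py_HashBuffer

if sys.hexversion < THRESHOLD:
    print("older than 3.14.0a1: the shim maps Py_HashBuffer onto _Py_HashBytes")
else:
    print("3.14.0a1 or newer: Py_HashBuffer is assumed to be available")
```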
Oh btw, I got it to build on all the Python versions by replacing Py_HashBytes. Now though there seem to be test failures on this CI thing that I can't reproduce locally on Ubuntu. Any ideas?
Can confirm that it does build now. I've been running tests on my end too, and the failures are baffling... In principle you haven't even ended up changing that much of the code, and the only essential changes are the new lookups. These lines (https://github.com/pyutils/line_profiler/actions/runs/16812079270/job/47619650075?pr=376#step:5:1679) in particular caught my interest:
So in some cases garbage/uninitialized values are being written.
If you remove the following, does it still error?

```python
else:
    # this should never matter because time isn't used when this is hit
    # but gcc doesn't know that, so we include it to avoid a compiler warning
    time = 0
```

And if the above doesn't work, can you replace the references to last_map one-by-one with the original `prof._c_last_time[ident]` lookups?
Yes, I'm also thinking of gradually reapplying your changes and checking at which point it breaks. Will get back to this in a moment.
Huh, that's odd. Well in that case I'll just remove the block hash cache and time = 0, and I can try to do more tests to figure out why it's a problem later.
Ok, well it seems the tests pass fine on Ubuntu and Mac but not Windows... does anyone have a Windows computer on which they can debug why this is occurring?
Here are my latest tests as to the overhead, based on the tight-looped function that you proposed:

Test script:

```python
#!/usr/bin/env python3
from argparse import ArgumentParser
from contextlib import nullcontext
from time import perf_counter
from typing import Collection, Sequence, Optional

from line_profiler import LineProfiler


def func(n: int = 2000000, profiled: bool = False) -> float:
    with (PROF if profiled else nullcontext()):
        return inner_func(n)


def inner_func(n: int) -> float:
    start = perf_counter()
    for x in range(n):
        y = x
    return (perf_counter() - start) * 1000


def mean(x: Collection[float]) -> float:
    return sum(x) / len(x)


def sd(x: Collection[float]) -> float:
    mx = mean(x)
    msqx = mean([y * y for y in x])
    return ((msqx - mx * mx) / (1 - 1 / len(x))) ** .5


def main(args: Optional[Sequence[str]] = None) -> None:
    parser = ArgumentParser()
    parser.add_argument('n', nargs='?', default=2000000, type=int)
    parser.add_argument('-v', '--verbose', action='store_true')
    parser.add_argument('-p', '--prec', default=2, type=int)
    parser.add_argument('-s', '--samples', default=5, type=int)
    vanilla_times = []
    profiled_times = []
    options = parser.parse_args(args)
    for i in range(1, options.samples + 1):
        for name, times, profiled in [
            ('without profiling', vanilla_times, False),
            ('with profiling', profiled_times, True),
        ]:
            result = func(options.n, profiled=profiled)
            if options.verbose:
                print(
                    f'Time elapsed {name} (#{i}): '
                    f'{result:.{options.prec}f} ms'
                )
            times.append(result)
    if options.verbose:
        PROF.print_stats()
    mean_v = mean(vanilla_times)
    mean_p = mean(profiled_times)
    overhead = mean_p / mean_v - 1
    err_v = sd(vanilla_times) * options.samples ** -.5
    err_p = sd(profiled_times) * options.samples ** -.5
    err_o = overhead * ((err_v / mean_v) ** 2 + (err_p / mean_p) ** 2) ** .5
    print(
        f'From {options.samples} run(s) @ {options.n} loops:',
        'Overhead = {0:.{4}f} ± {1:.{4}f}x ({2:.{4}f} ms -> {3:.{4}f} ms)'
        .format(overhead, err_o, mean_v, mean_p, options.prec),
    )


PROF = LineProfiler(inner_func)


if __name__ == '__main__':
    main()
```
Results are taken over 10 samples each. So while there is indeed a marked improvement, it unfortunately isn't as significant as initially hoped – at least on my machine. Another peculiarity is why the overhead is so big on 3.14. Will try to play around with tweaks and see if there's anything good. As for the Windows thing... IDK, let's see if @Erotemic has any idea...
Awesome, thanks for the test script and measurements! I currently have a few pending commits locally to extend the iPython/Jupyter notebook support with an lprun_all magic function (to automatically profile the entire cell being run instead of just a specific function/module), but it's based off this branch so the sooner these test issues can be resolved and this PR merged, the better. Once this is merged (or closed), I can submit the new PR (which is intended for students to use this Fall semester). Once that is done, I think I'm going to work on integrating a benchmarking harness and other representative use cases to run on new commits/PRs here so we can measure and detect regressions before they happen.
line_profiler/_line_profiler.pyx
- Added macro `hash_bytecode()` for hashing bytestrings (doesn't seem to have performance benefits so it's currently unused; TODO: more tests)
- Replaced repeated retrieval of `prof._c_last_time[ident]` with a stored reference thereto
Just figured that I should maybe also ping you here – apparently people don't necessarily get notifications when their forks get PRs, been there. In Theelx#4 I replaced those repeated item accesses in `_line_profiler.pyx` with a stored reference.
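(For context, a sketch of mine rather than the actual diff: the real change is in Cython against a C++ `unordered_map`, but the idea is just to resolve the outer mapping once and reuse it, roughly like this pure-Python analogue with placeholder names.)

```python
# Hypothetical pure-Python analogue of caching prof._c_last_time[ident];
# the names and values below are placeholders, not the real Cython code.
c_last_time = {}   # stands in for prof._c_last_time
ident = 12345      # stands in for the current thread ident
block_hash = 678   # stands in for the hashed code block

# Before: every statement re-resolves the outer mapping.
c_last_time.setdefault(ident, {})
c_last_time[ident][block_hash] = (10, 0.0)
if block_hash in c_last_time[ident]:
    old = c_last_time[ident][block_hash]

# After: resolve once, keep a reference, and reuse it.
last_map = c_last_time.setdefault(ident, {})
last_map[block_hash] = (10, 0.0)
if block_hash in last_map:
    old = last_map[block_hash]
```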
Further optimizations
Dammit, still snafu on Windows (well, not that we have any reason to believe whatever is wrong has been fixed). Guess that we still need someone with a Windows machine to help...
Thanks for pinging me, I indeed had not gotten any notifications. When I run the test script you included on the latest version of the code on 3.13.3 (saved as test5.py) and compare it to the original test script I had (saved as test.py), I get very different numbers for slowdowns. Am I calling your script wrong? The latest version of the code (with PyObject_Hash instead of hash_bytecode) is being used in this case; I'm planning to check hash_bytecode next. I'm not sure if this is related, but I think the two different invocations (python3 vs kernprof) might have something to do with this discrepancy (especially because of the different output units). I'm going to try changing your script to run with the cmdline kernprof.

Edit: When I do a dirty hack to change to the cmdline kernprof, I get that the runtimes with profiling are ~3000ms instead of ~4000ms with the API version. I'll probably need to do some digging later to see why the API version is slower!

kernprof -lvz test.py
Time Elapsed without profiling: 91.4ms
Time Elapsed profiling: 2968.16ms
Slowdown from profiling: 2876.7599999999998ms, 32.47x
Wrote profile results to 'test.py.lprof'
Timer unit: 1e-06 s
Total time: 2.60081 s
File: test.py
Function: main at line 3
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3 def main(profiled=False):
4 1 2.6 2.6 0.0 start = time.perf_counter()
5 2000001 1310737.8 0.7 50.4 for x in range(2000000): #2000000
6 2000000 1289972.1 0.6 49.6 y = x
7 1 25.2 25.2 0.0 elapsed_ms = round((time.perf_counter()-start)*1000, 2)
8 1 1.4 1.4 0.0 if profiled:
9 1 68.9 68.9 0.0 print(f"Time Elapsed profiling: {elapsed_ms}ms")
10 else:
11 print(f"Time Elapsed without profiling: {elapsed_ms}ms")
12 1 3.9 3.9 0.0 return elapsed_ms
theel@theel-hp:~/Coding/line_profiler/line_profiler$ python3 test5.py
From 5 run(s) @ 2000000 loops: Overhead = 43.82 ± 0.20x (87.40 ms -> 3916.86 ms)
theel@theel-hp:~/Coding/line_profiler/line_profiler$ python3 test5.py -v
Time elapsed without profiling (#1): 85.48 ms
Time elapsed with profiling (#1): 3908.07 ms
Time elapsed without profiling (#2): 85.74 ms
Time elapsed with profiling (#2): 3958.72 ms
Time elapsed without profiling (#3): 85.97 ms
Time elapsed with profiling (#3): 3902.86 ms
Time elapsed without profiling (#4): 85.95 ms
Time elapsed with profiling (#4): 3949.99 ms
Time elapsed without profiling (#5): 85.74 ms
Time elapsed with profiling (#5): 3977.27 ms
Timer unit: 1e-09 s
Total time: 15.4224 s
File: /home/theel/Coding/line_profiler/line_profiler/test5.py
Function: inner_func at line 15
Line # Hits Time Per Hit % Time Line Contents
==============================================================
15 def inner_func(n: int) -> float:
16 5 9857.0 1971.4 0.0 start = perf_counter()
17 10000005 7856910303.0 785.7 50.9 for x in range(n):
18 10000000 7565454885.0 756.5 49.1 y = x
19 5 69276.0 13855.2 0.0 return (perf_counter() - start) * 1000
From 5 run(s) @ 2000000 loops: Overhead = 44.93 ± 0.17x (85.78 ms -> 3939.38 ms)
I *really* hope this is not why the Windows CI is failing lol
Hmm... baffling. All that I can say is that the different timer units are not what is wrong. Let me also try something closer to your invocation to see if I can replicate the discrepancy that you saw.
This reverts commit c5f95e2.
I tried to replicate your setup by running the following script.

Content of ../test_simple.py:

```python
#!/usr/bin/env python3
import time

from line_profiler import profile


def main(profiled=False):
    start = time.perf_counter()
    for x in range(2000000):  # 2000000
        y = x
    elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
    if profiled:
        print(f"Time Elapsed profiling: {elapsed_ms}ms")
    else:
        print(f"Time Elapsed without profiling: {elapsed_ms}ms")
    return elapsed_ms


if __name__ == '__main__':
    vanilla = main()
    profiled = profile(main)(profiled=True)
    print(
        'Slowdown from profiling: {}ms, {:.2f}x'
        .format(profiled - vanilla, profiled / vanilla - 1),
    )
```

And the slowdown (also on 3.13.3) is around 25–30x. That seems consistent with the numbers reported in Theelx#4... so yeah, unfortunately, no idea why the slowdowns look different in the above example. Oh and apropos, I did also try replacing all usage of threading.get_ident() with the C-API version, and the slowdown stayed essentially the same.
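(For what it's worth, here is a quick illustrative micro-benchmark of mine, not part of the PR, for gauging how expensive a single `threading.get_ident()` call is on a given platform; the numbers will of course vary by machine and OS.)

```python
# Illustrative only: time threading.get_ident() in isolation to see roughly how
# much head-room a C-API replacement could possibly buy on this platform.
import threading
import timeit

n = 1_000_000
per_call_ns = timeit.timeit(threading.get_ident, number=n) / n * 1e9
print(f"threading.get_ident(): ~{per_call_ns:.0f} ns per call")
```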
Huh, well that's odd that the slowdown is still the same. Are you on a Mac, or Linux? If you're on a Mac, I could see a reality where replacing threading.get_ident() with the C-API version makes no difference in performance, whereas on Linux it would. Mac handles just about everything in an... interesting manner (imo).
Well.......... it seems the Windows errors have been fixed! The issue was replacing hash(). Why we can't use PyObject_Hash or any other C-API hashing on Windows baffles me, but the tests don't lie. I think this is ready for final review by @Erotemic now that the tests pass. I'm going to bed now, goodnight and thanks Terence for helping debug!
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #376      +/-   ##
==========================================
+ Coverage   87.56%   87.92%   +0.35%     
==========================================
  Files          18       20       +2     
  Lines        1641     1871     +230     
  Branches      348      401      +53     
==========================================
+ Hits         1437     1645     +208     
- Misses        149      166      +17     
- Partials       55       60       +5     
```

see 1 file with indirect coverage changes

Continue to review full report in Codecov by Sentry.
I took a glance, but I need to carve out some time to do a more thorough manual review. In the meantime, I put the diff into GPT5, and it's making a few suggestions. I'm not sure if they are valid or not: https://chatgpt.com/share/e/6898aae9-714c-800d-8635-6df54c4eb9de I'm not sure I shared the conversation correctly. I'm uploading a markdown copy of the transcript.

One thing I would like is to get proper regression tests in. I have an idea for it, so let me see if I can code that up quickly.
I think that the GPT review looks somewhat useful. I'm trying to keep this PR changing as little previous behavior as reasonably possible, and staying focused on the performance boosts, so I want to leave additional documentation of invariants/assumptions to a separate PR (preferably submitted before the next release). Its point #4 is something I had already considered, and I was actually planning to push a commit that reverts the type to unsigned long; however, unsigned long long won't cause errors for now since it's a bigger type. I just didn't want to add a 15th commit to this PR if it wasn't necessary. However, just to be careful I'll push a 15th commit to fix it. Here's a point-by-point commentary:
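(Aside on the unsigned long vs. unsigned long long point above; this is an illustrative check, not part of the point-by-point commentary or the PR: on LLP64 Windows `unsigned long` is only 4 bytes while `unsigned long long` is 8, which is why the wider type can hold every value the narrower one can.)

```python
# Illustrative only: print the platform widths of the two C integer types discussed.
# On Windows (LLP64) c_ulong is 4 bytes; on typical 64-bit Linux/macOS it is 8.
import ctypes

print("unsigned long:      ", ctypes.sizeof(ctypes.c_ulong), "bytes")
print("unsigned long long: ", ctypes.sizeof(ctypes.c_ulonglong), "bytes")
```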
@Erotemic Have you had a chance to take a look at the new state of this PR? I was hoping to get this over the finish line soon, but if that isn't possible then we can wait!
I hope to get to it this week. Like always, I want to feel confident I understand what is happening and why before I merge. I generally see that you are trying to use Python-C-API calls directly, replacing the indirect Python calls (e.g. the hashing and thread-ident calls).
I still need to think about the new logic and convince myself it is the same. I think it would also help if there was a comment about why we are using the builtin hash rather than PyObject_Hash.
Completely fine that the code should be generally understood before merging! Yep, that's correct that the goal is to replace indirect Python calls with explicit C-API calls where possible. I suspect the issue on Windows is a bug introduced in relatively recent Python versions, because the tests passed on all platforms when I tried a similar replacement of hash() before. I will add a comment about why we're using hash there.
Ok, I'm pretty sure I understand what's going on now. The main issue is that the code has become a lot less readable. Anything we can do to improve that would be good.
Perhaps a few inline methods could help. Maybe something like:
```cython
# -------- LastTimeMap helpers (ensure + mutate) --------

# Type aliases
ctypedef unordered_map[int64, LastTime] LastTimeMap
ctypedef unordered_map[int64, LineTime] LineTimeMap


# ---- LastTimeMap helpers (pointer-based) ----
cdef inline bint last_contains(LastTimeMap* m, int64 key) nogil:
    return deref(m).count(key) != 0

cdef inline LastTime last_get_value(LastTimeMap* m, int64 key):
    # Precondition: last_contains(m, key)
    return deref(m)[key]

cdef inline LastTime* last_find_ptr(LastTimeMap* m, int64 key) nogil:
    cdef LastTimeMap.iterator it = deref(m).find(key)
    if it == deref(m).end():
        return NULL
    return &deref(it).second

cdef inline void last_set_now(LastTimeMap* m, int64 key, int lineno):
    # Writes a fresh LastTime with the current timestamp
    deref(m)[key] = LastTime(lineno, hpTimer())

cdef inline void last_erase_if_present(LastTimeMap* m, int64 key) nogil:
    cdef LastTimeMap.iterator it = deref(m).find(key)
    if it != deref(m).end():
        deref(m).erase(it)


# ---- LineTimeMap helpers (pointer-based) ----
cdef inline LineTime* line_ensure_entry(LineTimeMap* m, int lineno, int64 code_hash):
    """
    Ensure a LineTime entry for `lineno`; return a pointer to it.
    Pointer remains valid until the next insertion/rehash into *this same* map.
    """
    if not deref(m).count(lineno):
        deref(m)[lineno] = LineTime(code_hash, lineno, 0, 0)
    return &(deref(m)[lineno])


cdef inline inner_trace_callback(
        int is_line_event, set instances, object code, int lineno):
    ...
    cdef LastTimeMap* last_map
    cdef LineTimeMap* line_entries
    ...
    for prof in instances:
        ...
        last_map = &(prof._c_last_time[ident])
        if last_contains(last_map, block_hash):
            old = last_get_value(last_map, block_hash)
            line_entries = &(prof._c_code_map[code_hash])
            entry = line_ensure_entry(line_entries, old.f_lineno, code_hash)
            entry.nhits += 1
            entry.total_time += time - old.time
        if is_line_event:
            last_set_now(last_map, block_hash, lineno)
        elif last_contains(last_map, block_hash):
            last_erase_if_present(last_map, block_hash)
```
EDIT: After posting this, I kinda wish it was easy to template these map functions over `unordered_map[KT, VT]`, but we can do something similar with just a bit of boilerplate overhead.
```cython
# Value type can be LastTime or LineTime
ctypedef fused _ValT:
    LastTime
    LineTime

# A map from int64 to either LastTime or LineTime
ctypedef unordered_map[int64, _ValT] _MapT
```
And then call the functions as `funcname(<_MapT*>last_map, key)`. It might also be worth having the "get" function return a pointer rather than a copy of the value.
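(To make the intent of `line_ensure_entry` concrete, here is a rough pure-Python analogue of mine with placeholder field names; the real helper returns a pointer into a C++ `unordered_map` of `LineTime` structs rather than a dict.)

```python
# Hypothetical dict-based analogue: ensure an entry exists for `lineno`,
# then mutate it in place, mirroring the entry.nhits / entry.total_time updates above.
def line_ensure_entry(line_entries, lineno, code_hash):
    return line_entries.setdefault(
        lineno, {'code': code_hash, 'lineno': lineno, 'nhits': 0, 'total_time': 0}
    )

line_entries = {}
entry = line_ensure_entry(line_entries, 17, code_hash=0xABC)
entry['nhits'] += 1
entry['total_time'] += 42
assert line_entries[17] == {'code': 0xABC, 'lineno': 17, 'nhits': 1, 'total_time': 42}
```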
Sure, I'll add some helper types and functions to make it more readable! One thing to consider, though, is that single-line functions that are only used once tend to hurt readability, because we could just use a comment to explain/name the operation without requiring the reader to scroll elsewhere to see the actual function code. If it were used multiple times, then I would think a single-line function could make sense, but with only one use it doesn't seem right. Also, at least in earlier versions of Cython, fused types would cause compilation of a version of any function using them for every possible permutation of input types. This could easily cause invisible slowdowns when the CPU fetches critical lines, especially if the code begins to overflow the L1i cache. If fused types are included here, we'd better have very good benchmarking and regression tests.
I'm currently playing with how it looks. I'm trying to separate them out into a .pxi. It does add more code, but each individual unit of code is simpler (although that requires we pick very good descriptive helper names for the functions, so people can intuit what they are doing). But if done right, for repos that are already medium/large, short functions in longer files are easier to read than long functions in shorter files.
That should work!
That was actually a prompt to GPT. The PR I have compiles, but it looks like the module is unimportable outside cibuildwheel? IIRC you hit something like that earlier.
Yep, that issue can happen. I don't recall how I fixed it permanently, but I did have a temporary workaround for my version of the issue.
So I thought I was still getting an import error when installing the wheels, but I think it's actually an issue with the abi3 wheels. They build, but they don't seem to import correctly; on the CI I haven't updated the tests to force a specific environment where they are exclusively installed and tested. I thought the testing inside cibuildwheel would be good enough. The problem exists on main too, so this branch seems good to merge.
Thank you for testing the ABI wheels and figuring out the problem! I also appreciate the good reviews, comments, and responses over this PR, even though I may have come off as impatient or rude. If I did, I apologize because that was not my intent.
This PR transforms the callback as shown below (more yellow is worse).

Before:
![image](https://private-user-images.githubusercontent.com/...1.png)

After:
![image](https://private-user-images.githubusercontent.com/...2.png)
The final result lowers the worst-case overhead from ~60x to ~40x, which gets closer to the ~25–30x measured in 4.1.3. The average overhead remains at ~2x. Worst-case overhead output below: