Speed Up Worst-Case Overhead #376
Conversation
Note: Unfortunately we can't use last_map in additional areas near the end of the callback due to segmentation faults and stale data, so the speedup from avoiding re-resolution is negated by broken functionality.
My tests so far:
EDIT: I made a typo above, one of the comparisons was backwards. Should be irrelevant after the update though.
Ok, when I try building on 3.13.3 on Linux, I need the following code to get it to build:

```c
#if PY_VERSION_HEX < 0x030e00a1  // 3.14.0a1
#  define Py_HashBuffer(b, s) _Py_HashBytes(b, s)
#  ifndef Py_BUILD_CORE
#    define Py_BUILD_CORE 1
#  endif
#  include "internal/pycore_pyhash.h"  // _Py_HashBytes()
#  undef Py_BUILD_CORE
#endif
```

I'm working on getting it to import and run tests now.
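(Aside, for readers less used to `PY_VERSION_HEX`: the `0x030e00a1` threshold in the shim above simply encodes 3.14.0a1, the version the macro treats as gaining `Py_HashBuffer`. The snippet below is an illustrative check of mine, not part of the PR; `sys.hexversion` uses the same encoding.)

```python
# Illustrative only (not part of the PR): mirror the C preprocessor gate from Python.
import sys

# 0x03 0x0e 0x00 0xa 0x1  ->  major=3, minor=14, micro=0, level=alpha, serial=1
THRESHOLD = 0x030E00A1  # 3.14.0a1, the version the macro assumes introduces Py_HashBuffer

if sys.hexversion < THRESHOLD:
    print("older than 3.14.0a1: the shim maps Py_HashBuffer onto _Py_HashBytes")
else:
    print("3.14.0a1 or newer: Py_HashBuffer is assumed to be available")
```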
Oh btw, I got it to build on all the Python versions by replacing Py_HashBytes. Now though there seem to be test failures on this CI thing that I can't reproduce locally on Ubuntu. Any ideas?
Can confirm that it does build now. I've been running tests on my end too, and the failures are baffling... In principle you haven't even ended up changing that much of the code, and the only essential changes are the new lookups. These lines (https://github.com/pyutils/line_profiler/actions/runs/16812079270/job/47619650075?pr=376#step:5:1679) in particular caught my interest:
So in some cases garbage/uninitialized values are being written.
If you remove the following, does it still error?

```python
else:
    # this should never matter because time isn't used when this is hit
    # but gcc doesn't know that, so we include it to avoid a compiler warning
    time = 0
```

And if the above doesn't work, can you replace the references to last_map one-by-one with the original `prof._c_last_time[ident]` lookups?
Yes, I'm also thinking of gradually reapplying your changes and checking at which point it breaks. Will get back to this in a moment.
Huh, that's odd. Well in that case I'll just remove the block hash cache and time = 0, and I can try to do more tests to figure out why it's a problem later.
Ok, well it seems the tests pass fine on Ubuntu and Mac but not Windows... does anyone have a Windows computer on which they can debug why this is occurring?
Here are my latest tests as to the overhead, based on the tight-looped function that you proposed:

Test script:

```python
#!/usr/bin/env python3
from argparse import ArgumentParser
from contextlib import nullcontext
from time import perf_counter
from typing import Collection, Sequence, Optional

from line_profiler import LineProfiler


def func(n: int = 2000000, profiled: bool = False) -> float:
    with (PROF if profiled else nullcontext()):
        return inner_func(n)


def inner_func(n: int) -> float:
    start = perf_counter()
    for x in range(n):
        y = x
    return (perf_counter() - start) * 1000


def mean(x: Collection[float]) -> float:
    return sum(x) / len(x)


def sd(x: Collection[float]) -> float:
    mx = mean(x)
    msqx = mean([y * y for y in x])
    return ((msqx - mx * mx) / (1 - 1 / len(x))) ** .5


def main(args: Optional[Sequence[str]] = None) -> None:
    parser = ArgumentParser()
    parser.add_argument('n', nargs='?', default=2000000, type=int)
    parser.add_argument('-v', '--verbose', action='store_true')
    parser.add_argument('-p', '--prec', default=2, type=int)
    parser.add_argument('-s', '--samples', default=5, type=int)
    vanilla_times = []
    profiled_times = []
    options = parser.parse_args(args)
    for i in range(1, options.samples + 1):
        for name, times, profiled in [
            ('without profiling', vanilla_times, False),
            ('with profiling', profiled_times, True),
        ]:
            result = func(options.n, profiled=profiled)
            if options.verbose:
                print(
                    f'Time elapsed {name} (#{i}): '
                    f'{result:.{options.prec}f} ms'
                )
            times.append(result)
    if options.verbose:
        PROF.print_stats()
    mean_v = mean(vanilla_times)
    mean_p = mean(profiled_times)
    overhead = mean_p / mean_v - 1
    err_v = sd(vanilla_times) * options.samples ** -.5
    err_p = sd(profiled_times) * options.samples ** -.5
    err_o = overhead * ((err_v / mean_v) ** 2 + (err_p / mean_p) ** 2) ** .5
    print(
        f'From {options.samples} run(s) @ {options.n} loops:',
        'Overhead = {0:.{4}f} ± {1:.{4}f}x ({2:.{4}f} ms -> {3:.{4}f} ms)'
        .format(overhead, err_o, mean_v, mean_p, options.prec),
    )


PROF = LineProfiler(inner_func)


if __name__ == '__main__':
    main()
```
Results are taken over 10 samples each. So while there is indeed a marked improvement, it unfortunately isn't as significant as initially hoped – at least on my machine. Another peculiarity is why the overhead is so big on 3.14. Will try to play around with tweaks and see if there's anything good. As for the Windows thing... IDK, let's see if @Erotemic has any idea...
Awesome, thanks for the test script and measurements! I currently have a few pending commits locally to extend the iPython/Jupyter notebook support with an lprun_all magic function (to automatically profile the entire cell being run instead of just a specific function/module), but it's based off this branch so the sooner these test issues can be resolved and this PR merged, the better. Once this is merged (or closed), I can submit the new PR (which is intended for students to use this Fall semester). Once that is done, I think I'm going to work on integrating a benchmarking harness and other representative use cases to run on new commits/PRs here so we can measure and detect regressions before they happen.
line_profiler/_line_profiler.pyx
- Added macro `hash_bytecode()` for hashing bytestrings (doesn't seem to have performance benefits so it's currently unused; TODO: more tests)
- Replaced repeated retrieval of `prof._c_last_time[ident]` with a stored reference thereto
Just figured that I should maybe also ping you here – apparently people don't necessarily get notifications when their forks get PRs, been there. In Theelx#4 I replaced those repeated item accesses in `_line_profiler.pyx` with a stored reference.
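(For context, a sketch of mine rather than the actual diff: the real change is in Cython against a C++ `unordered_map`, but the idea is just to resolve the outer mapping once and reuse it, roughly like this pure-Python analogue with placeholder names.)

```python
# Hypothetical pure-Python analogue of caching prof._c_last_time[ident];
# the names and values below are placeholders, not the real Cython code.
c_last_time = {}   # stands in for prof._c_last_time
ident = 12345      # stands in for the current thread ident
block_hash = 678   # stands in for the hashed code block

# Before: every statement re-resolves the outer mapping.
c_last_time.setdefault(ident, {})
c_last_time[ident][block_hash] = (10, 0.0)
if block_hash in c_last_time[ident]:
    old = c_last_time[ident][block_hash]

# After: resolve once, keep a reference, and reuse it.
last_map = c_last_time.setdefault(ident, {})
last_map[block_hash] = (10, 0.0)
if block_hash in last_map:
    old = last_map[block_hash]
```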
Further optimizations
Dammit, still snafu on Windows (well, not that we have any reason to believe whatever is wrong has been fixed). Guess that we still need someone with a Windows machine to help...
Thanks for pinging me, I indeed had not gotten any notifications. When I run the test script you included on the latest version of the code on 3.13.3 (saved as test5.py) and compare it to the original test script I had (saved as test.py), I get very different numbers for slowdowns. Am I calling your script wrong? The latest version of the code (with PyObject_Hash instead of hash_bytecode) is being used in this case; I'm planning to check hash_bytecode next. I'm not sure if this is related, but I think the two different invocations (python3 vs kernprof) might have something to do with this discrepancy (especially because of the different output units). I'm going to try changing your script to run with the cmdline kernprof.

Edit: When I do a dirty hack to change to the cmdline kernprof, I get that the runtimes with profiling are ~3000ms instead of ~4000ms with the API version. I'll probably need to do some digging later to see why the API version is slower!

kernprof -lvz test.py
Time Elapsed without profiling: 91.4ms
Time Elapsed profiling: 2968.16ms
Slowdown from profiling: 2876.7599999999998ms, 32.47x
Wrote profile results to 'test.py.lprof'
Timer unit: 1e-06 s
Total time: 2.60081 s
File: test.py
Function: main at line 3
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3 def main(profiled=False):
4 1 2.6 2.6 0.0 start = time.perf_counter()
5 2000001 1310737.8 0.7 50.4 for x in range(2000000): #2000000
6 2000000 1289972.1 0.6 49.6 y = x
7 1 25.2 25.2 0.0 elapsed_ms = round((time.perf_counter()-start)*1000, 2)
8 1 1.4 1.4 0.0 if profiled:
9 1 68.9 68.9 0.0 print(f"Time Elapsed profiling: {elapsed_ms}ms")
10 else:
11 print(f"Time Elapsed without profiling: {elapsed_ms}ms")
12 1 3.9 3.9 0.0 return elapsed_ms
theel@theel-hp:~/Coding/line_profiler/line_profiler$ python3 test5.py
From 5 run(s) @ 2000000 loops: Overhead = 43.82 ± 0.20x (87.40 ms -> 3916.86 ms)
theel@theel-hp:~/Coding/line_profiler/line_profiler$ python3 test5.py -v
Time elapsed without profiling (#1): 85.48 ms
Time elapsed with profiling (#1): 3908.07 ms
Time elapsed without profiling (#2): 85.74 ms
Time elapsed with profiling (#2): 3958.72 ms
Time elapsed without profiling (#3): 85.97 ms
Time elapsed with profiling (#3): 3902.86 ms
Time elapsed without profiling (#4): 85.95 ms
Time elapsed with profiling (#4): 3949.99 ms
Time elapsed without profiling (#5): 85.74 ms
Time elapsed with profiling (#5): 3977.27 ms
Timer unit: 1e-09 s
Total time: 15.4224 s
File: /home/theel/Coding/line_profiler/line_profiler/test5.py
Function: inner_func at line 15
Line # Hits Time Per Hit % Time Line Contents
==============================================================
15 def inner_func(n: int) -> float:
16 5 9857.0 1971.4 0.0 start = perf_counter()
17 10000005 7856910303.0 785.7 50.9 for x in range(n):
18 10000000 7565454885.0 756.5 49.1 y = x
19 5 69276.0 13855.2 0.0 return (perf_counter() - start) * 1000
From 5 run(s) @ 2000000 loops: Overhead = 44.93 ± 0.17x (85.78 ms -> 3939.38 ms)
I *really* hope this is not why the Windows CI is failing lol
Hmm... baffling. All that I can say is that the different timer units are not what is wrong. Let me also try something closer to your invocation to see if I can replicate the discrepancy that you saw.
This reverts commit c5f95e2.
I tried to replicate your setup by running the following script.

Content of ../test_simple.py:

```python
#!/usr/bin/env python3
import time

from line_profiler import profile


def main(profiled=False):
    start = time.perf_counter()
    for x in range(2000000):  # 2000000
        y = x
    elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
    if profiled:
        print(f"Time Elapsed profiling: {elapsed_ms}ms")
    else:
        print(f"Time Elapsed without profiling: {elapsed_ms}ms")
    return elapsed_ms


if __name__ == '__main__':
    vanilla = main()
    profiled = profile(main)(profiled=True)
    print(
        'Slowdown from profiling: {}ms, {:.2f}x'
        .format(profiled - vanilla, profiled / vanilla - 1),
    )
```

And the slowdown (also on 3.13.3) is around 25–30x. That seems consistent with the numbers reported in Theelx#4... so yeah, unfortunately, no idea why the slowdowns look different in the above example. Oh and apropos, I did also try replacing all usage of threading.get_ident() with the C-API version, and the slowdown stayed essentially the same.
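(For what it's worth, here is a quick illustrative micro-benchmark of mine, not part of the PR, for gauging how expensive a single `threading.get_ident()` call is on a given platform; the numbers will of course vary by machine and OS.)

```python
# Illustrative only: time threading.get_ident() in isolation to see roughly how
# much head-room a C-API replacement could possibly buy on this platform.
import threading
import timeit

n = 1_000_000
per_call_ns = timeit.timeit(threading.get_ident, number=n) / n * 1e9
print(f"threading.get_ident(): ~{per_call_ns:.0f} ns per call")
```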
Huh, well that's odd that the slowdown is still the same. Are you on a Mac, or Linux? If you're on a Mac, I could see a reality where replacing threading.get_ident() with the C-API version makes no difference in performance, whereas on Linux it would. Mac handles just about everything in an... interesting manner (imo).
Well.......... it seems the Windows errors have been fixed! The issue was replacing hash(). Why we can't use PyObject_Hash or any other C-API hashing on Windows baffles me, but the tests don't lie. I think this is ready for final review by @Erotemic now that the tests pass. I'm going to bed now, goodnight and thanks Terence for helping debug!
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #376      +/-   ##
==========================================
+ Coverage   87.56%   87.92%   +0.35%     
==========================================
  Files          18       20       +2     
  Lines        1641     1871     +230     
  Branches      348      401      +53     
==========================================
+ Hits         1437     1645     +208     
- Misses        149      166      +17     
- Partials       55       60       +5     
```

see 1 file with indirect coverage changes

Continue to review full report in Codecov by Sentry.
I took a glance, but I need to carve out some time to do a more thorough manual review. In the meantime, I put the diff into GPT5, and it's making a few suggestions. I'm not sure if they are valid or not: https://chatgpt.com/share/e/6898aae9-714c-800d-8635-6df54c4eb9de I'm not sure I shared the conversation correctly. I'm uploading a markdown copy of the transcript.

One thing I would like is to get proper regression tests in. I have an idea for it, so let me see if I can code that up quickly.
I think that the GPT review looks somewhat useful. I'm trying to keep this PR changing as little previous behavior as reasonably possible, and staying focused on the performance boosts, so I want to leave additional documentation of invariants/assumptions to a separate PR (preferably submitted before the next release). Its point #4 is something I had already considered, and I was actually planning to push a commit that reverts the type to unsigned long; however, unsigned long long won't cause errors for now since it's a bigger type. I just didn't want to add a 15th commit to this PR if it wasn't necessary. However, just to be careful I'll push a 15th commit to fix it. Here's a point-by-point commentary:
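(Aside on the unsigned long vs. unsigned long long point above; this is an illustrative check, not part of the point-by-point commentary or the PR: on LLP64 Windows `unsigned long` is only 4 bytes while `unsigned long long` is 8, which is why the wider type can hold every value the narrower one can.)

```python
# Illustrative only: print the platform widths of the two C integer types discussed.
# On Windows (LLP64) c_ulong is 4 bytes; on typical 64-bit Linux/macOS it is 8.
import ctypes

print("unsigned long:      ", ctypes.sizeof(ctypes.c_ulong), "bytes")
print("unsigned long long: ", ctypes.sizeof(ctypes.c_ulonglong), "bytes")
```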
@Erotemic Have you had a chance to take a look at the new state of this PR? I was hoping to get this over the finish line soon, but if that isn't possible then we can wait!
I hope to get to it this week. Like always, I want to feel confident I understand what is happening and why before I merge. I generally see that you are trying to use Python-C-API calls directly, replacing the indirect Python calls (e.g. the hashing and thread-ident calls).
I still need to think about the new logic and convince myself it is the same. I think it would also help if there was a comment about why we are using the builtin hash rather than PyObject_Hash.
Completely fine that the code should be generally understood before merging! Yep, that's correct that the goal is to replace indirect Python calls with explicit C-API calls where possible. I suspect the issue on Windows is a bug introduced in relatively recent Python versions, because the tests passed on all platforms when I tried a similar replacement of hash() before. I will add a comment about why we're using hash there.
Ok, I'm pretty sure I understand what's going on now. The main issue is that the code has become a lot less readable. Anything we can do to improve that would be good.
Perhaps a few inline methods could help. Maybe something like:
```cython
# -------- LastTimeMap helpers (ensure + mutate) --------

# Type aliases
ctypedef unordered_map[int64, LastTime] LastTimeMap
ctypedef unordered_map[int64, LineTime] LineTimeMap


# ---- LastTimeMap helpers (pointer-based) ----
cdef inline bint last_contains(LastTimeMap* m, int64 key) nogil:
    return deref(m).count(key) != 0

cdef inline LastTime last_get_value(LastTimeMap* m, int64 key):
    # Precondition: last_contains(m, key)
    return deref(m)[key]

cdef inline LastTime* last_find_ptr(LastTimeMap* m, int64 key) nogil:
    cdef LastTimeMap.iterator it = deref(m).find(key)
    if it == deref(m).end():
        return NULL
    return &deref(it).second

cdef inline void last_set_now(LastTimeMap* m, int64 key, int lineno):
    # Writes a fresh LastTime with the current timestamp
    deref(m)[key] = LastTime(lineno, hpTimer())

cdef inline void last_erase_if_present(LastTimeMap* m, int64 key) nogil:
    cdef LastTimeMap.iterator it = deref(m).find(key)
    if it != deref(m).end():
        deref(m).erase(it)


# ---- LineTimeMap helpers (pointer-based) ----
cdef inline LineTime* line_ensure_entry(LineTimeMap* m, int lineno, int64 code_hash):
    """
    Ensure a LineTime entry for `lineno`; return a pointer to it.
    Pointer remains valid until the next insertion/rehash into *this same* map.
    """
    if not deref(m).count(lineno):
        deref(m)[lineno] = LineTime(code_hash, lineno, 0, 0)
    return &(deref(m)[lineno])


cdef inline inner_trace_callback(
        int is_line_event, set instances, object code, int lineno):
    ...
    cdef LastTimeMap* last_map
    cdef LineTimeMap* line_entries
    ...
    for prof in instances:
        ...
        last_map = &(prof._c_last_time[ident])
        if last_contains(last_map, block_hash):
            old = last_get_value(last_map, block_hash)
            line_entries = &(prof._c_code_map[code_hash])
            entry = line_ensure_entry(line_entries, old.f_lineno, code_hash)
            entry.nhits += 1
            entry.total_time += time - old.time
        if is_line_event:
            last_set_now(last_map, block_hash, lineno)
        elif last_contains(last_map, block_hash):
            last_erase_if_present(last_map, block_hash)
```
EDIT: After posting this, I kinda wish it was easy to template these map functions over `unordered_map[KT, VT]`, but we can do something similar with just a bit of boilerplate overhead.
```cython
# Value type can be LastTime or LineTime
ctypedef fused _ValT:
    LastTime
    LineTime

# A map from int64 to either LastTime or LineTime
ctypedef unordered_map[int64, _ValT] _MapT
```
And then call the functions as `funcname(<_MapT*>last_map, key)`. It might also be worth having the "get" function return a pointer rather than a copy of the value.
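(To make the intent of `line_ensure_entry` concrete, here is a rough pure-Python analogue of mine with placeholder field names; the real helper returns a pointer into a C++ `unordered_map` of `LineTime` structs rather than a dict.)

```python
# Hypothetical dict-based analogue: ensure an entry exists for `lineno`,
# then mutate it in place, mirroring the entry.nhits / entry.total_time updates above.
def line_ensure_entry(line_entries, lineno, code_hash):
    return line_entries.setdefault(
        lineno, {'code': code_hash, 'lineno': lineno, 'nhits': 0, 'total_time': 0}
    )

line_entries = {}
entry = line_ensure_entry(line_entries, 17, code_hash=0xABC)
entry['nhits'] += 1
entry['total_time'] += 42
assert line_entries[17] == {'code': 0xABC, 'lineno': 17, 'nhits': 1, 'total_time': 42}
```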
Sure, I'll add some helper types and functions to make it more readable! One thing to consider, though, is that single-line functions that are only used once tend to hurt readability, because we could just use a comment to explain/name the operation without requiring the reader to scroll elsewhere to see the actual function code. If it were used multiple times, then I would think a single-line function could make sense, but with only one use it doesn't seem right. Also, at least in earlier versions of Cython, fused types would cause compilation of a version of any function using them for every possible permutation of input types. This could easily cause invisible slowdowns when the CPU fetches critical lines, especially if the code begins to overflow the L1i cache. If fused types are included here, we'd better have very good benchmarking and regression tests.
I'm currently playing with how it looks. I'm trying to separate them out into a .pxi. It does add more code, but each individual unit of code is simpler (although that requires we pick very good descriptive helper names for the functions, so people can intuit what they are doing). But if done right, for repos that are already medium/large, short functions in longer files are easier to read than long functions in shorter files.
That should work!
That was actually a prompt to GPT. The PR I have compiles, but it looks like the module is unimportable outside cibuildwheel? IIRC you hit something like that earlier.
Yep, that issue can happen. I don't recall how I fixed it permanently, but I did have a temporary workaround for my version of the issue.
So I thought I was still getting an import error when installing the wheels, but I think it's actually an issue with the abi3 wheels. They build, but they don't seem to import correctly; on the CI I haven't updated the tests to force a specific environment where they are exclusively installed and tested. I thought the testing inside cibuildwheel would be good enough. The problem exists on main too, so this branch seems good to merge.
Thank you for testing the ABI wheels and figuring out the problem! I also appreciate the good reviews, comments, and responses over this PR, even though I may have come off as impatient or rude. If I did, I apologize because that was not my intent.
This PR transforms the callback as shown below (more yellow is worse).

Before:
![image](https://private-user-images.githubusercontent.com/...1.png)

After:
![image](https://private-user-images.githubusercontent.com/...2.png)
The final result lowers the worst-case overhead from ~60x to ~40x, which gets closer to the ~25–30x measured in 4.1.3. The average overhead remains at ~2x. Worst-case overhead output below: