@radoslawcybulski

@radoslawcybulski radoslawcybulski commented Mar 31, 2025

Adds the source location of the next resume point to task objects. When a tasktrace object is constructed, the source locations of the resume points are added to it. When it is printed, the source location of the next resume point is printed as well, hopefully improving the debugging experience.

Example:

0x2b3899 /home/y/work/scylladb-test/build/Debug/seastar/libseastar.so+0x454c364 ...
   --------
   test_app/main.cpp:15:15
   seastar::continuation<seastar::internal::promise_base_with_type<void>, ...

Partially fixes #2381

@radoslawcybulski radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from 02aa70a to c0149e3 Compare March 31, 2025 14:28
scheduling_group _sg;
private:
std::source_location resume_point = {};


Member

This is making a very heavily used object larger.

Author

After adding std::source_location

coroutine_traits_base<>::promise_type (which inherits from task) is 64 bytes,
continuation<Promise, Func, Wrapper, T> is around 104-120, depending on exact version.

No difference except for that single unlucky value, which is "promoted" to the next 64-byte cache line in the case of continuation.

Member

I think it's okay.

@avikivity
Member

Aren't those addresses derivable from the stack trace? Or are those points lost?

How large is std::source_location?

@radoslawcybulski
Author

radoslawcybulski commented Mar 31, 2025

Aren't those addresses derivable from the stack trace? Or are those points lost?

Probably with a little bit of magic. You'd still have to store a pointer somewhere, because stacktrace is valid for an actually running task, while my patch adds "next instruction" to tasks, which are suspended.

How large is std::source_location?

8 bytes.

@radoslawcybulski radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from c0149e3 to 08d3c8a Compare March 31, 2025 14:58
@avikivity
Member

Aren't those addresses derivable from the stack trace? Or are those points lost?

Probably with a little bit of magic. You'd still have to store a pointer somewhere, because stacktrace is valid for an actually running task, while my patch adds "next instruction" to tasks, which are suspended.

How large is std::source_location?

8 bytes.

Maybe we can tolerate it. @tgrabiec what's your opinion?

Please run scylladb's perf-simple-query and report insn-per-op before and after (although we won't learn of the impact of the size change).

@radoslawcybulski
Author

my patch

instructions_per_op: mean=48465.87 standard-deviation=33.03 median=48459.42 median-absolute-deviation=22.63 maximum=48520.81 minimum=48437.81
instructions_per_op: mean=48482.90 standard-deviation=21.16 median=48487.16 median-absolute-deviation=6.97 maximum=48504.13 minimum=48447.34
instructions_per_op: mean=48469.42 standard-deviation=18.34 median=48472.45 median-absolute-deviation=17.26 maximum=48488.53 minimum=48448.81

master

instructions_per_op: mean=48337.41 standard-deviation=21.29 median=48334.08 median-absolute-deviation=19.67 maximum=48364.48 minimum=48317.02
instructions_per_op: mean=48352.36 standard-deviation=27.87 median=48347.35 median-absolute-deviation=16.28 maximum=48395.26 minimum=48322.29
instructions_per_op: mean=48339.83 standard-deviation=26.48 median=48323.12 median-absolute-deviation=20.80 maximum=48372.64 minimum=48319.03

My patch will run more instructions, as there are additional stores to the task object (std::source_location needs to be written).

Why do we measure instructions per operation? We are not billed by instruction count, rather by time, we should measure total running time instead? Does it work as a sanity check or am i missing something?

@avikivity
Member

Why do we measure instructions per operation? We are not billed by instruction count, rather by time, we should measure total running time instead? Does it work as a sanity check or am i missing something?

Time is very unstable (just try it) since it depends on temperature and cooling.

Instructions is a more stable proxy for time (but inaccurate since instructions-per-clock can change).

@avikivity
Member

So we add 0.25% overhead. Hard to judge if it's worthwhile.
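
For reference, the figure can be reproduced from the numbers posted above. This is only a sketch of the arithmetic: it averages the three reported `instructions_per_op` means for each build and takes the relative difference. The helper names are made up for illustration.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of the per-run means.
inline double mean_of(const std::vector<double>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0.0) / static_cast<double>(xs.size());
}

// Relative increase of `patched` over `baseline`, in percent.
inline double overhead_percent(double baseline, double patched) {
    return (patched - baseline) / baseline * 100.0;
}

// instructions_per_op means copied from the runs reported above.
inline const std::vector<double> patch_runs  = {48465.87, 48482.90, 48469.42};
inline const std::vector<double> master_runs = {48337.41, 48352.36, 48339.83};
```

`overhead_percent(mean_of(master_runs), mean_of(patch_runs))` comes out at about 0.27%, that is roughly 130 extra instructions per operation, in line with the rounded 0.25% above.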

@radoslawcybulski
Author

Time is very unstable (just try it) since it depends on temperature and cooling.

Agree.

Instructions is a more stable proxy for time (but inaccurate since instructions-per-clock can change).

It's absolutely stable, but we're counting steps, which - in my opinion - sort-of blinds us. It's like two different walks, the same step count, but step sizes differ, so is distance. Here for example if we're running at max IPC, then we've 0.25% overhead. If otherwise we're waiting for memory, then the write is pretty much free (we've already waited for this cache line, because other stuff is read / written there) and proc is waiting for other memory pieces, so it can do our write.

@avikivity
Member

Time is very unstable (just try it) since it depends on temperature and cooling.

Agree.

Instructions is a more stable proxy for time (but inaccurate since instructions-per-clock can change).

It's absolutely stable, but we're counting steps, which - in my opinion - sort-of blinds us. It's like two different walks, the same step count, but step sizes differ, so is distance. Here for example if we're running at max IPC, then we've 0.25% overhead. If otherwise we're waiting for memory, then the write is pretty much free (we've already waited for this cache line, because other stuff is read / written there) and proc is waiting for other memory pieces, so it can do our write.

Writes are more-or-less free, except for the increase in instruction footprint (which are extra reads).

We aren't close to running at max IPC. Typical for mini-benchmarks like perf-simple-query is 2, and for production ~1. The main killer is waiting for instruction fetch (although maybe in production mispredicts would also contribute since the data is less regular). I'm sure an M4 would give much higher IPC.

Contributor

@tchaikov tchaikov left a comment

if the performance impact could be a concern, can we make it an optional feature which is enabled only in the debug builds?


#ifndef SEASTAR_MODULE
#include <coroutine>
#include <source_location>

Contributor

It took me a while to check whether it's safe to use <source_location> instead of util/std-compat.hh. The answer is yes, as we support the latest two major versions of the GCC and Clang compilers. In general, we use libstdc++, which added support for source_location in gcc-mirror/gcc@57d76ee, which is included in GCC 12 and up, and the latest stable release of GCC is GCC 14.2.

Contributor

Opened #2961 on this.

@radoslawcybulski
Author

radoslawcybulski commented Apr 14, 2025

if the performance impact could be a concern, can we make it an optional feature which is enabled only in the debug builds?

This has been discussed here (#2381) and consensus is that the release is where it's useful the most.

@radoslawcybulski radoslawcybulski self-assigned this Apr 14, 2025
@tgrabiec
Contributor

Aren't those addresses derivable from the stack trace? Or are those points lost?

Probably with a little bit of magic. You'd still have to store a pointer somewhere, because stacktrace is valid for an actually running task, while my patch adds "next instruction" to tasks, which are suspended.

How large is std::source_location?

8 bytes.

Maybe we can tolerate it. @tgrabiec what's your opinion?

I think it's probably ok.

Contributor

@tchaikov tchaikov left a comment

lgtm

protected:
scheduling_group _sg;
private:
std::source_location resume_point = {};

Contributor

I'd put the resume_point first since it's larger than scheduling_group. that way the task object size is 4 bytes less.

Author

Done.

Contributor

Not really - if the class is not packed then the size includes the alignment, so order doesn't matter.

Contributor

Yes, that's true. The object will be aligned to 8 bytes boundary. However in case we add something after these variables, less than 5 bytes, then there will be a difference, so having it ordered by size might be beneficial in the future.
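
The padding argument above can be checked with a small sketch. The types here are hypothetical stand-ins, not the real `task` members: `loc8` models an 8-byte, 8-aligned location (the size reported earlier in this thread) and `sg4` models a 4-byte scheduling group. The real `task` has other members, including a vtable pointer, so actual sizes will differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the two members being discussed.
struct loc8 { const void* impl; };   // 8 bytes, 8-byte aligned on LP64
struct sg4  { std::uint32_t id; };   // 4 bytes, 4-byte aligned

// With only these two members the order does not change the size.
struct small_first { sg4 sg; loc8 loc; };   // 4 + 4 padding + 8
struct large_first { loc8 loc; sg4 sg; };   // 8 + 4 + 4 tail padding

// A later 4-byte member is where the order starts to matter.
struct small_first_plus { sg4 sg; loc8 loc; std::uint32_t extra; };  // extra opens a new 8-byte slot
struct large_first_plus { loc8 loc; sg4 sg; std::uint32_t extra; };  // extra fills the tail padding
```

On a typical 64-bit target, both two-member layouts are 16 bytes, while the three-member variants are 24 and 16 bytes respectively, which matches both review comments: no difference today, a possible difference once a small member is appended.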

@radoslawcybulski radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch 2 times, most recently from 6fe697d to 80ce73a Compare April 22, 2025 09:46

SEASTAR_MODULE_EXPORT
class task {
std::source_location resume_point = {};

Contributor

(nit) Shouldn't we prefix class member variables with an underscore: resume_point -> _resume_point?

Author

Done.

@radoslawcybulski radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from 80ce73a to 58aa04e Compare April 22, 2025 12:38
@radoslawcybulski
Author

@scylladb/seastar-maint please merge.

@denesb
Contributor

denesb commented Jun 27, 2025

So we add 0.25% overhead. Hard to judge if it's worthwhile.

@avikivity I think the gain is much higher than 0.25% here. We need to reduce the bar of entry into debugging seastar applications.

@travisdowns
Contributor

travisdowns commented Sep 8, 2025

wrt __builtin_return_address(0) it seems like the problematic case will be when the thing that's calling this (in this change) was inlined into the caller, in which case your stack won't include the creation of the task at all (only some high level frames).

If you captured that plus the current IP you'd have enough info to reconstruct always at least the caller.

@radoslawcybulski
Author

radoslawcybulski commented Sep 8, 2025

@avikivity godbolt says it works. Given it's pretty much undocumented builtin - probably.

As i mentioned before - it does something else. The point of std::source_location is to give everyone compile time constant that survives inlining optimization. Runtime return address is pretty much useless - you will get return pointer into template inside god knows which header.

I will also argue __builtin_return_address(0) provides LESS information, not more - otherwise we would all be using it in our -O3 mass inlined builds everywhere. Heck, even full stacktrace will probably contain less information, due to optimization mess.

I can fork our own source_location implementation using __builtin_FILE and __builtin_LINE, those should limit memory usage.

EDIT: those are clang / gcc specific.
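
A rough sketch of what such a builtin-based replacement could look like follows. `mini_location` is a hypothetical name and this is not the `slim_source_location` from the patch. It relies on `__builtin_FILE()` and `__builtin_LINE()`, which GCC and Clang provide, and it stores no function name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type for illustration; GCC/Clang only.
struct mini_location {
    const char* file;
    std::uint32_t line;

    // As with std::source_location::current(), the builtins used as default
    // arguments are evaluated at the call site of current().
    static constexpr mini_location current(
            const char* f = __builtin_FILE(),
            std::uint32_t l = static_cast<std::uint32_t>(__builtin_LINE())) noexcept {
        return {f, l};
    }
};
```

Note the tradeoff this makes visible: the object itself is a pointer plus an integer, so it is not smaller per instance than the 8 bytes reported for `std::source_location`; any saving would have to come from the static data the compiler emits, which is what the measurements later in this thread look at.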

@avikivity
Member

@avikivity godbolt says it works. Given it's pretty much undocumented builtin - probably.

It's not undocumented.

As i mentioned before - it does something else. The point of std::source_location is to give everyone compile time constant that survives inlining optimization. Runtime return address is pretty much useless - you will get return pointer into template inside god knows which header.

I will also argue __builtin_return_address(0) provides LESS information, not more - otherwise we would all be using it in our -O3 mass inlined builds everywhere.

I don't understand the logic. If X does not use Y, does it follow that Y is not useful? Perhaps X hasn't thought of using Y yet.

If it applied, it would be impossible to do anything new.

Heck, even full stacktrace will probably contain less information, due to optimization mess.

Stack traces (when explored with addr2line -i) show much more information. Each return address on the stack encodes all the inlined call-sites that led to it up to the next uninlined caller. The more inlining happens, the more information it carries.

See https://godbolt.org/z/G678Eo7rh. Although there is just one location that calls g(), it knows exactly which path called it through a.

I can fork our own source_location implementation using __builtin_FILE and __builtin_LINE, those should limit memory usage.

I suggest you try out __builtin_return_address(0) and compare the results.
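
For anyone wanting to try this, a minimal sketch of capturing the return address is below. `caller_address` is a hypothetical helper; `__builtin_return_address` is a GCC/Clang builtin. The function is marked noinline on purpose, since the discussion above hinges on whether the capturing function has its own frame.

```cpp
#include <cassert>
#include <cstdint>

// noinline keeps a real call here, so level 0 is the address of the
// instruction following the call in whoever called caller_address().
[[gnu::noinline]] inline std::uintptr_t caller_address() noexcept {
    return reinterpret_cast<std::uintptr_t>(__builtin_return_address(0));
}
```

The resulting address can be printed and passed to `addr2line -i -e <binary>` to list the inlined call sites that lead to it. For a position-independent executable the load base has to be subtracted first.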

@radoslawcybulski
Author

radoslawcybulski commented Sep 10, 2025

I stand corrected, stack traces are actually sort of useful now, debug info has improved massively.

Stack traces (when explored with addr2line -i) show much more information. Each return address on the stack encodes all the inlined call-sites that led to it up to the next uninlined caller. The more inlining happens, the more information it carries.

This assumes every instruction has unambiguous inlined sequence of calls leading to it. Or am i missing something?

See https://godbolt.org/z/G678Eo7rh. Although there is just one location that calls g(), it knows exactly which path called it through a.

Good example. I took the liberty of removing [[gnu::noinline]] and [[gnu::always_inline]] and adding h implementation, that prints the ptr value. The code repeats 4 different values at -O2 and -O3 (which makes sense as clang just inlined to some depth). Adding flags used for release build makes the code print single value.
I also added a small if test, that calls foo from both branches - prints the same value in both cases.

Am i missing something? It doesn't take much to generate code, that drops ambiguous return pointers after optimization. Is there a way to resolve those i don't know (which would make me a X)? Otherwise that's pretty much the point i'm so poorly making. You can probably make it sort-of work with clever noinline flags (adding noinline to functions that also call __builtin_return_address would probably do it). Why bother? This will also hurt performance.

@avikivity
Member

I stand corrected, stack traces are actually sort of useful now, debug info has improved massively.

Stack traces (when explored with addr2line -i) show much more information. Each return address on the stack encodes all the inlined call-sites that led to it up to the next uninlined caller. The more inlining happens, the more information it carries.

This assumes every instruction has unambiguous inlined sequence of calls leading to it. Or am i missing something?

It doesn't assume anything.

If inlining happens to duplicate a call to (let's say) future::then, then you will get more information.

If it doesn't, then you don't.

In the worst case __builtin_return_address will return the same amount of information as std::source_location.

In the common case, it will return more, since you will see the direct caller, and its callers up to the first uninlined function.

See https://godbolt.org/z/G678Eo7rh. Although there is just one location that calls g(), it knows exactly which path called it through a.

Good example. I took the liberty of removing [[gnu::noinline]] and [[gnu::always_inline]] and adding h implementation, that prints the ptr value. The code repeats 4 different values at -O2 and -O3 (which makes sense as clang just inlined to some depth). Adding flags used for release build makes the code print single value. I also added a small if test, that calls foo from both branches - prints the same value in both cases.

Am i missing something? It doesn't take much to generate code, that drops ambiguous return pointers after optimization. Is there a way to resolve those i don't know (which would make me a X)? Otherwise that's pretty much the point i'm so poorly making. You can probably make it sort-of work with clever noinline flags (adding noinline to functions that also call __builtin_return_address would probably do it). Why bother? This will also hurt performance.

If you compare to std::source_location, then in all combinations of inlining options it returns one call site. __builtin_return_address will return from 1 to 32 call sites, depending on the options.

Here's the example updated to report both std::source_location and __builtin_return_address. If you change h() to print both, you will get 32 different values for __builtin_return_address but only one value for std::source_location.

https://godbolt.org/z/MqaocP1x5

@radoslawcybulski
Author

radoslawcybulski commented Sep 10, 2025

In the worst case __builtin_return_address will return the same amount of information as std::source_location.
In the common case, it will return more, since you will see the direct caller, and its callers up to the first uninlined function.

Consider this stack trace (top is current)

__builtin_return_address(0) -> returns next instruction in `caz`
foo [inlined]
bar [inlined]
baz [not inlined]
gez [inlined]
caz [not inlined]

__builtin_return_address will return the return address of baz, which means you will get gez -> caz, which is what you said - you get inlined stacks up to first non-inlined. You lose top level inlined stack tho (foo, bar). Your example works, because in your case you've artificially made foo not inlined (which means it has no inline calls between itself and a call to __builtin_return_address(0), so nothing to lose).
This is what i also said - we can sort-of make it work at the cost of performance (more instructions executed, as those stack frames don't generate themselves).

This might be improvable (although i doubt i'm the first to think about it), if - instead of calling __builtin_return_address (which gets return address of current function, which is not that useful or we need to stop inlining) we could get next execution address pointer. This sort-of reverses the issue - now we have to force-inline all then implementations. Then next instruction pointer will be as if return pointer of then call if it were forced to noinline, we should get missing top level inlined frames and it should fly.
My instinct tells me i'm not the first one to think about it and there are major issues with this approach. While std::source_location will work 100% of times on all systems.

EDIT: i will give it a try, but it will take some time.

@michoecho
Contributor

michoecho commented Sep 10, 2025

In the worst case __builtin_return_address will return the same amount of information as std::source_location.

@avikivity No, why?

Imagine I want to debug a deadlocked fiber. It's stuck in this coroutine:

future<> my_function() {
    co_await a();
    co_await deadlock();
    co_await b();
}

std::source_location will point to the co_await deadlock(); line. That's what I want. __builtin_return_address(0) will point to reactor::do_run(). (Or some other internal function that wakes the coroutine handle). That's useless.

The replacement for std::source_location is not the return address, but rip.

@michoecho
Contributor

michoecho commented Sep 10, 2025

This might be improvable (although i doubt i'm the first to think about it), if - instead of calling __builtin_return_address (which gets return address of current function, which is not that useful or we need to stop inlining) we could get next execution address pointer. This sort-of reverses the issue - now we have to force-inline all then implementations. Then next instruction pointer will be as if return pointer of then call if it were forced to noinline, we should get missing top level inlined frames and it should fly.
My instinct tells me i'm not the first one to think about it and there are major issues with this approach. While std::source_location will work 100% of times on all systems.

@radoslawcybulski You mean, use rip instead of std::source_location like I did in #2381 (comment)?

I wanted to use the instruction pointer instead of source location from the start, because it theoretically gives more information, but I didn't try to push that because in practice (when I was privately using that patch for debugging) the compiler was very often generating bad debug info for the addresses I obtained this way. (E.g. pointing to the right file, but to line 0 instead of the actual line). std::source_location prevents problems like that. So it's a tradeoff — less info, but more reliable due to explicit compiler support, or more info with a chance that the info will be less useful (without manual investigation to find the actual line) because the compiler didn't care enough to generate good debug info.

@radoslawcybulski
Author

@radoslawcybulski You mean, use rip instead of std::source_location like I did in #2381 (comment)?

Yep, the same. Also your patch highlights my other point - it's rare to figure out something new, most of the time if people are not using something there's a reason for it.

@avikivity
Member

In the worst case __builtin_return_address will return the same amount of information as std::source_location.
In the common case, it will return more, since you will see the direct caller, and its callers up to the first uninlined function.

Consider this stack trace (top is current)

__builtin_return_address(0) -> returns next instruction in `caz`
foo [inlined]
bar [inlined]
baz [not inlined]
gez [inlined]
caz [not inlined]

__builtin_return_address will return the return address of baz, which means you will get gez -> caz, which is what you said - you get inlined stacks up to first non-inlined. You lose top level inlined stack tho (foo, bar). Your example works, because in your case you've artificially made foo not inlined (which means it has no inline calls between itself and a call to __builtin_return_address(0), so nothing to lose). This is what i also said - we can sort-of make it work at the cost of performance (more instructions executed, as those stack frames don't generate themselves).

This might be improvable (although i doubt i'm the first to think about it), if - instead of calling __builtin_return_address (which gets return address of current function, which is not that useful or we need to stop inlining) we could get next execution address pointer. This sort-of reverses the issue - now we have to force-inline all then implementations. Then next instruction pointer will be as if return pointer of then call if it were forced to noinline, we should get missing top level inlined frames and it should fly. My instinct tells me i'm not the first one to think about it and there are major issues with this approach. While std::source_location will work 100% of times on all systems.

Right, my idea assumes that the point where you capture __builtin_return_address(0) is itself not inlined, but that's hardly a given.

What we want is the current instruction pointer (if inlined) or __builtin_return_address(0) if not, but that's not expressible.

std::source_location works, but is not always useful. It can point to, say, some function in loop.hh and give you not much idea about what's going on.

@avikivity
Member

In the worst case __builtin_return_address will return the same amount of information as std::source_location.

@avikivity No, why?

Imagine I want to debug a deadlocked fiber. It's stuck in this coroutine:

future<> my_function() {
    co_await a();
    co_await deadlock();
    co_await b();
}

std::source_location will point to the co_await deadlock(); line. That's what I want. __builtin_return_address(0) will point to reactor::do_run(). (Or some other internal function that wakes the coroutine handle). That's useless.

Well, it depends (which is bad) on whether the function capturing %rip was inlined.

Perhaps we could always_inline a wrapper, and let the compiler choose whether to inline the wrappee or not.

The replacement for std::source_location is not the return address, but rip.

yes+no
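
The always_inline wrapper idea could look roughly like the sketch below. All names (`current_ip`, `toy_task`, `set_resume_ip`, `mark_resume_point`) are hypothetical. The instruction-pointer capture is x86-64 with GCC/Clang only, and on other targets the sketch simply yields 0.

```cpp
#include <cassert>
#include <cstdint>

// Always inlined, so the captured address lies inside whichever function
// this is expanded into. Returns 0 where the capture is not supported.
[[gnu::always_inline]] inline std::uintptr_t current_ip() noexcept {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    std::uintptr_t ip;
    // RIP-relative lea with zero displacement: address of the next instruction.
    asm volatile("leaq 0(%%rip), %0" : "=r"(ip));
    return ip;
#else
    return 0;
#endif
}

// Hypothetical task stand-in that stores an address instead of a location.
struct toy_task {
    std::uintptr_t resume_ip = 0;
};

// The "wrappee": the compiler is free to inline this or not.
inline void set_resume_ip(toy_task& t, std::uintptr_t ip) noexcept {
    t.resume_ip = ip;
}

// The wrapper: forced inline, so the address is taken at the call site.
[[gnu::always_inline]] inline void mark_resume_point(toy_task& t) noexcept {
    set_resume_ip(t, current_ip());
}
```

Because the capture happens in the forced-inline wrapper, the stored address should resolve to the caller of `mark_resume_point` regardless of what the optimizer decides about `set_resume_ip`. How good the debug info is for that address is a separate question, as noted earlier in the thread.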

@radoslawcybulski radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch 2 times, most recently from 6e78428 to cabf015 Compare September 18, 2025 13:12
@radoslawcybulski radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from cabf015 to 5f0d144 Compare October 13, 2025 17:32
@radoslawcybulski
Author

radoslawcybulski commented Oct 15, 2025

I've run three tests: without the patch, with std::source_location, and with slim_source_location (RIP + const char * filename + uint64_t line). Results (size of scylla + output of a perf-simple-query run) follow:

without patch:
-rwxr-xr-x. 1 y y 117237760 Oct 15 14:04 build/dev/scylla
109785.54 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   56731 insns/op,   34064 cycles/op,        0 errors)
129664.30 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56783 insns/op,   34771 cycles/op,        0 errors)
132627.36 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56763 insns/op,   35072 cycles/op,        0 errors)
133472.86 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56768 insns/op,   34881 cycles/op,        0 errors)
137312.72 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56748 insns/op,   33911 cycles/op,        0 errors)
throughput:
        mean=   128572.56 standard-deviation=10851.14
        median= 132627.36 median-absolute-deviation=4900.30
        maximum=137312.72 minimum=109785.54
instructions_per_op:
        mean=   56758.37 standard-deviation=19.65
        median= 56762.56 median-absolute-deviation=10.69
        maximum=56782.65 minimum=56731.24
cpu_cycles_per_op:
        mean=   34539.85 standard-deviation=518.31
        median= 34771.03 median-absolute-deviation=475.91
        maximum=35071.78 minimum=33911.26
with std::source_location:
-rwxr-xr-x. 1 y y 119546200 Oct 15 14:40 build/dev/scylla
132906.63 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56899 insns/op,   34873 cycles/op,        0 errors)
118051.07 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   56861 insns/op,   39176 cycles/op,        0 errors)
128228.93 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56843 insns/op,   36077 cycles/op,        0 errors)
135714.37 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56858 insns/op,   33995 cycles/op,        0 errors)
134982.33 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56878 insns/op,   34199 cycles/op,        0 errors)
throughput:
        mean=   129976.67 standard-deviation=7277.31
        median= 132906.63 median-absolute-deviation=5005.67
        maximum=135714.37 minimum=118051.07
instructions_per_op:
        mean=   56867.64 standard-deviation=21.30
        median= 56860.67 median-absolute-deviation=10.37
        maximum=56898.66 minimum=56843.22
cpu_cycles_per_op:
        mean=   35663.98 standard-deviation=2125.23
        median= 34872.53 median-absolute-deviation=1464.79
        maximum=39176.33 minimum=33994.54
diff size is 2308440 (~2.2 MB)
with `slim_source_location`:
-rwxr-xr-x. 1 y y 119431736 Oct 15 13:25 build/dev/scylla
124916.79 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   57187 insns/op,   37190 cycles/op,        0 errors)
118595.81 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   57126 insns/op,   39020 cycles/op,        0 errors)
129827.40 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   57165 insns/op,   35711 cycles/op,        0 errors)
133294.11 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   57172 insns/op,   34779 cycles/op,        0 errors)
133896.32 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   57152 insns/op,   34626 cycles/op,        0 errors)
throughput:
        mean=   128106.09 standard-deviation=6403.56
        median= 129827.40 median-absolute-deviation=5188.02
        maximum=133896.32 minimum=118595.81
instructions_per_op:
        mean=   57160.30 standard-deviation=23.14
        median= 57164.96 median-absolute-deviation=11.48
        maximum=57187.23 minimum=57125.81
cpu_cycles_per_op:
        mean=   36265.20 standard-deviation=1847.07
        median= 35710.56 median-absolute-deviation=1486.16
        maximum=39019.93 minimum=34626.24
diff size is 2193976 (~2 MB)

@radoslawcybulski
Author

We get additional instructions (well, 3 writes instead of one), roughly 200 instructions per op, (0.3%).
scylla shrunk by approximately 200kb.

Not sure if it's worth it. Most of our tools probably deal with std::source_location, but that's adaptable.
@avikivity please make a call or tell me, whom should i pull for more info here.
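
To put exact numbers next to the summary, the sketch below recomputes the deltas from the reported means and file sizes. The struct and function names are made up for illustration; the values are copied from the three runs above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Means and binary sizes copied from the three runs reported above.
struct run_summary {
    const char* name;
    double insns_per_op;
    double cycles_per_op;
    std::int64_t binary_bytes;
};

inline constexpr run_summary baseline {"without patch",        56758.37, 34539.85, 117237760};
inline constexpr run_summary with_std {"std::source_location", 56867.64, 35663.98, 119546200};
inline constexpr run_summary with_slim{"slim_source_location", 57160.30, 36265.20, 119431736};

// Instructions retired per cycle for one run.
constexpr double ipc(const run_summary& r) {
    return r.insns_per_op / r.cycles_per_op;
}

// Extra instructions per op of `r` relative to `base`, absolute and in percent.
constexpr double extra_insns(const run_summary& r, const run_summary& base) {
    return r.insns_per_op - base.insns_per_op;
}
constexpr double extra_insns_percent(const run_summary& r, const run_summary& base) {
    return extra_insns(r, base) / base.insns_per_op * 100.0;
}

// Binary size growth of `r` relative to `base`, in bytes.
constexpr std::int64_t extra_bytes(const run_summary& r, const run_summary& base) {
    return r.binary_bytes - base.binary_bytes;
}
```

With these inputs, std::source_location costs about 109 instructions per op over the baseline (about 0.19%) and slim_source_location about 402 (about 0.71%), so the slim variant runs about 293 more instructions per op than the std one. The slim binary is 114464 bytes smaller than the std::source_location binary. IPC works out to roughly 1.64, 1.59 and 1.58, but the cycle counts have a large standard deviation, so those differences are likely within noise.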

@avikivity
Member

We get additional instructions (well, 3 writes instead of one), roughly 200 instructions per op, (0.3%). scylla shrunk by approximately 200kb.

Not sure if it's worth it. Most of our tools probably deal with std::source_location, but that's adaptable. @avikivity please make a call or tell me, whom should i pull for more info here.

Our tools deal well with addresses too.

Looks like the slim source_location isn't so slim. I can understand it - source_location is fat out-of-line but slim inline.

I need to think more about it. My feeling is that source_location won't work well as more infrastructure is converted to coroutines, we'll just see some internal coroutine there.

@avikivity
Member

compiler was very often generating bad debug info for the addresses I obtained this way. (E.g. pointing to the right file, but to line 0 instead of the actual line).

Was this with addr2line or llvm-addr2line? We recently saw that llvm-addr2line is better.


[[gnu::always_inline]] slim_source_location(const char* file = __builtin_FILE(), std::int32_t line = __builtin_LINE(), std::int32_t column = __builtin_COLUMN())
: _file(file), _line(line), _column(column) {
#ifdef SSL_HAS_RIP_X86

Member

SSL?

Author

slim source line.
My brain keeps telling me i've seen this abbreviation, but i've not found anything better. I can just expand into SLIM_SOURCE_LINE_***, those macros are contained in this file anyway.

Member

Prefix macros with SEASTAR_, and don't abbreviate, it just confuses people.

: _file(file), _line(line), _column(column) {
#ifdef SSL_HAS_RIP_X86
std::uintptr_t rip;
asm("leaq 0(%%rip), %0":"=r"(rip));

Member

We could slim it down with something like

struct slim_data* sd;
asm ("1: lea 2f, %0 \n"
          ".pushsection .rodata.something \n"
          "2: \n"
          ".quad 1b \n"
          ".quad %c1 \n"
          ".long %c2, %c3 \n"
          ".popsection"
          : "=r"(sd) : "i"(__builtin_FILE()), "i"(__builtin_LINE()), "i"(__builtin_COLUMN()));

This pushes the data to the data section and we end up with one instruction.

Author

Well, I do understand a few identifiers from the example, but that's all. :)

What are 1: and 2:? I assume they are labels, but they don't seem to be used. Are those global identifiers, or local to the function / module? Why do we do .quad 1b (quad is probably 64-bit, so is 1b just 1)? long is 32-bit, I assume? What is the =r... syntax?

Member

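Yes, 1: and 2: are labels. 2f means "label 2, searching forward" (the next definition of 2:), and 1b means "label 1, searching backward" (the most recent definition of 1:). Using numbers makes them local, so they can be redefined freely.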

"=r" -> output register

Member

It's preparing a static struct pointing back at the lea instruction.
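
For readers who share the confusion about the syntax, here is a minimal sketch (not part of the patch) that isolates the two pieces explained above: a numeric local label referenced backwards with 1b, and an "=r" output operand. The helper name current_code_address is made up for illustration. The RIP-relative lea assumes x86-64 with GCC or Clang; on other targets the sketch simply falls back to the function's own address.

```cpp
#include <cstdint>

// Hypothetical helper, for illustration only.
// On x86-64 it returns the address of the `lea` instruction that computes it.
inline std::uintptr_t current_code_address() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    std::uintptr_t addr;
    // "1:" defines a numeric local label. "1b" refers to the nearest
    // definition of label 1 looking backwards, so the same label can be
    // reused each time this function is inlined without name clashes.
    // "=r" lets the compiler pick any general-purpose register for the
    // output operand %0 and copies that register into `addr` afterwards.
    asm volatile("1: lea 1b(%%rip), %0" : "=r"(addr));
    return addr;
#else
    // Fallback assumption for other targets: use the function's own address.
    return reinterpret_cast<std::uintptr_t>(&current_code_address);
#endif
}
```

The returned value is only an address inside the code; turning it back into a file and line still needs debug info and a tool such as addr2line.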

Author

What is "2 forward"?
Will fix it later on; I'm on sick leave until tomorrow.

Contributor

By the way, in cases where macros are on the table (in this case they aren't), you can get this static struct without stooping to inline asm.

E.g. I have this tracepoint macro (which I'm actually using in practice right now, while writing a debug util for scylladb/scylladb#25679). entry is the static struct in this example, and I use its address as the metadata header for each event in the trace.

struct tracepoint_entry {
    const char* name;
    const char* file;
    int line;
    const char* function;
    const char* signature;
    void *rip;
};

#define TRACEPOINT(eventlevel, tp_name, ...) { \
__label__ rip; \
    using namespace seastar; \
    static constexpr auto sig __attribute__((section("tracepoint_signatures"), used)) = COMPUTE_SIGNATURE(SIG(__VA_ARGS__)); \
    static constexpr char namearr[] __attribute__((section("tracepoint_names"), used)) = tp_name; \
    static constexpr char filearr[] __attribute__((section("tracepoint_files"), used)) = __FILE__; \
    static constexpr tracepoint_entry entry __attribute__((section("tracepoints"), used)) = { \
        .name = namearr, \
        .file = filearr, \
        .line = __LINE__, \
        .function = __PRETTY_FUNCTION__, \
        .signature = sig.data(), \
        .rip = &&rip, \
    }; \
rip: \
    size_t sz = compute_size(EXTRACT_ARGS(__VA_ARGS__)); \
    auto out = local_tracer->write(eventlevel, sz + 16); \
    seastar::write_le<uintptr_t>(reinterpret_cast<char*>(out), reinterpret_cast<uintptr_t>(&entry)); \
    out += sizeof(uintptr_t); \
    seastar::write_le<uint64_t>(reinterpret_cast<char*>(out), __rdtsc()); \
    out += sizeof(uint64_t); \
    serialize_tracepoint(out __VA_OPT__(,) EXTRACT_ARGS(__VA_ARGS__)); \
}

Author

@michoecho why do we need this black magic incantation with __attribute__((section("..."), used))? Wouldn't a simple static suffice?

Author

@michoecho why wouldn't this work?

struct slim_data* sd;

void *foo(const char *file, int line, int column) {
asm ("1: lea 2f, %0 \n"
          ".pushsection .rodata.something \n"
          "2: \n"
          ".quad 1b \n"
          ".quad %1 \n"
          ".long %2, %3 \n"
          ".popsection"
          : "=r"(sd) : "i"(file), "i"(line), "i"(column));
...
}

?

Contributor

@michoecho michoecho Oct 16, 2025

Why wouldn't this work?

Because Avi's wish would be to combine the instruction pointer and the source location (and maybe __PRETTY_FUNCTION__ too, why not) into a static, constant-initialized struct, so that all statically-known info can be represented inside a task by a single pointer at runtime. And that's what the piece of asm is trying to accomplish.

But if you want that struct to be constant-initialized, you must initialize it with constants. And in your example, you are trying to initialize it with not-constants. (If the function is always inlined, then the values of line and column and file are known at compile time, but something like this doesn't fit into the type system. A function must stand on its own, there's no "post-inlining constexpr"). It will be rejected by the compiler. (The "i" in "i"(line) stands for "immediate". This in x86 terms means a constant woven directly into the encoding of the instruction. And you are trying to pass a variable there).
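
The same restriction can be reproduced in plain C++ without any asm. The sketch below is only an illustration with made-up names (location_record, record_for_line): a value that arrives as a template argument is a constant and can feed a constant-initialized static, while the same value arriving as a function parameter cannot, even if the function is always inlined.

```cpp
#include <cstdint>

// Hypothetical record type, standing in for the static struct discussed above.
struct location_record {
    const char* file;
    std::uint32_t line;
};

// Works: `Line` is a template argument, so it is a compile-time constant and
// every instantiation owns exactly one constant-initialized record.
// Note that __FILE__ here names the file containing this template, not the
// caller's file; a real solution would have to pass the file name as well.
template <std::uint32_t Line>
const location_record* record_for_line() {
    static constexpr location_record rec{__FILE__, Line};
    return &rec;
}

// Does not compile if uncommented: `line` is an ordinary function parameter,
// so it is not a constant expression, no matter how the function is inlined.
//
// const location_record* record_for(std::uint32_t line) {
//     static constexpr location_record rec{__FILE__, line};  // error
//     return &rec;
// }
```

A call such as record_for_line<__LINE__>() yields a single pointer to static data, which is the shape Avi is after, but it only works because the caller spells the constant out at the call site. That is why the macro-free variant runs into trouble.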

why do we need this black magic incantation with __attribute__((section("..."), used))?

I want all tracepoint_entry structs to be collected by the linker into their own section (tracepoints here) so that my trace decoder can later, while preparing for decoding, look at the Scylla binary and iterate over all tracepoint_entry structs scattered inside it, to get the signatures for all event types in the trace.

If they aren't packed into a section, they will be in random places in .rodata, and they can't be found without knowing their names. If they are packed into a section, my decoder can open the Scylla executable, find its tracepoints section, cast it to a tracepoint_entry[] array, iterate over it, and generate a corresponding decoder for each TRACEPOINT in the program.

For example, I can put

TRACEPOINT(event_level::debug, "io_begin", "class", io_request.class, "id", io_request.id, "size", io_request.size);

anywhere in the program, and it will produce a

tracepoint_entry{.name = "io_begin", .signature = "class:u8,id:u64,size:u64", ...}

struct instance in the tracepoints section at some address X, and the decoder will iterate over X (among others), see the signature, and use it to generate a decoder function which reads a

struct io_begin {
    uint8_t class;
    uint64_t id;
    uint64_t size;
};

from the trace whenever the header is X.
And then the functions generated from signatures can be compiled into the actual decoder program.
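
To make the signature step concrete, here is a small sketch of how a decoder might turn a signature string into a field layout. It is not the actual decoder: the names field_desc and parse_signature are hypothetical, only the u8 and u64 type tags appear in the example above (u16 and u32 are assumed by analogy), and the payload is assumed to be packed with no padding between fields.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// One decoded field: its name, its size in bytes, and its byte offset
// within the (assumed packed) event payload.
struct field_desc {
    std::string name;
    std::size_t size;
    std::size_t offset;
};

// Maps a type tag to a size. u16 and u32 are assumptions, see the lead-in.
inline std::size_t size_of_type(std::string_view type) {
    if (type == "u8") return 1;
    if (type == "u16") return 2;
    if (type == "u32") return 4;
    if (type == "u64") return 8;
    throw std::invalid_argument("unknown field type");
}

// Parses "name:type,name:type,..." into a list of fields with running offsets.
inline std::vector<field_desc> parse_signature(std::string_view sig) {
    std::vector<field_desc> fields;
    std::size_t offset = 0;
    while (!sig.empty()) {
        const std::size_t comma = sig.find(',');
        const std::string_view item = sig.substr(0, comma);
        const std::size_t colon = item.find(':');
        if (colon == std::string_view::npos) {
            throw std::invalid_argument("expected name:type");
        }
        const std::size_t size = size_of_type(item.substr(colon + 1));
        fields.push_back({std::string(item.substr(0, colon)), size, offset});
        offset += size;
        // Advance past the comma, or finish if this was the last item.
        sig = (comma == std::string_view::npos) ? std::string_view{}
                                                : sig.substr(comma + 1);
    }
    return fields;
}
```

With the signature from the example, parse_signature("class:u8,id:u64,size:u64") describes three fields; a generated decoder function would read them from the trace in that order.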

Other sections are not needed for anything, they are just there to keep the metadata strings neatly ordered in the ELF.

The used is probably not needed either, I just wanted to make sure the linker (and/or LTO) won't play garbage collection tricks on me.
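
The section trick can also be tried inside a single process. The sketch below is a simplified illustration, not the macro above: all names (demo_entry, DEMO_TRACEPOINT, demo_tracepoints) are made up, it assumes an ELF target linked with GNU ld or LLD, which define __start_<name> and __stop_<name> symbols for sections whose name is a valid C identifier, and it stores pointers to the entries in the section rather than the entries themselves, which sidesteps any alignment padding a compiler might insert between larger objects.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical metadata record, a cut-down tracepoint_entry.
struct demo_entry {
    const char* name;
    const char* file;
    int line;
};

// Assumption: provided by the linker for the "demo_tracepoints" section.
extern "C" const demo_entry* const __start_demo_tracepoints[];
extern "C" const demo_entry* const __stop_demo_tracepoints[];

// Registers one static entry and places a pointer to it in the section.
// `used` keeps the otherwise unreferenced pointer from being discarded.
#define DEMO_TRACEPOINT(tp_name)                                          \
    do {                                                                  \
        static const demo_entry entry = {tp_name, __FILE__, __LINE__};    \
        static const demo_entry* const slot                               \
            __attribute__((section("demo_tracepoints"), used)) = &entry;  \
        (void)slot;                                                       \
    } while (0)

void demo_io_begin() { DEMO_TRACEPOINT("io_begin"); }
void demo_io_end() { DEMO_TRACEPOINT("io_end"); }

// Number of tracepoints the linker collected into the section.
std::size_t count_demo_tracepoints() {
    return static_cast<std::size_t>(__stop_demo_tracepoints -
                                    __start_demo_tracepoints);
}

// Looks an entry up by name by walking the whole section.
const demo_entry* find_demo_tracepoint(const char* name) {
    for (const demo_entry* const* p = __start_demo_tracepoints;
         p != __stop_demo_tracepoints; ++p) {
        if (std::strcmp((*p)->name, name) == 0) {
            return *p;
        }
    }
    return nullptr;
}
```

The entries are present in the section whether or not demo_io_begin or demo_io_end ever run, which is what lets an offline decoder enumerate every tracepoint just by reading the binary.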


Anyway, all that's not very applicable in this thread, because macros are off the table.

Author

@michoecho amazing, thank you for this explanation!

@michoecho
Contributor

michoecho commented Oct 16, 2025

compiler was very often generating bad debug info for the addresses I obtained this way. (E.g. pointing to the right file, but to line 0 instead of the actual line).

Was this with addr2line or llvm-addr2line? We recently saw that llvm-addr2line is better.

@avikivity I'm 80% sure that the problem was at compile time, not decode time. But I don't remember my problem well, I only have the vague memory that I was getting the RIPs I wanted but the file and line info for those RIPs in the DWARF was wrong(/useless) sometimes.

@radoslawcybulski radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from 5f0d144 to 37a4a68 Compare November 19, 2025 09:48
@radoslawcybulski
Author

Reverted the "slim" source_location commit; we're back to std::source_location only. Are we good to go, @avikivity?

Radosław Cybulski added 3 commits November 20, 2025 20:08
Add an empty, default-constructed std::source_location object
to the task object, plus a getter / setter.

Add calls to update the `resume_point` variable with the location of
the next resume point to all `await_suspend` functions and `then`
functions.

Add resume point locations to the `tasktrace` object.
Update `formatter::format` to print the source location of the next resume
point along with the task type.
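
As a rough picture of what these commits describe, the sketch below models a task that remembers its next resume point. It is not the Seastar code: demo_location, demo_task, demo_then and describe are hypothetical names, and demo_location is a stand-in built on __builtin_FILE() and __builtin_LINE() (the builtins used by the slim_source_location snippet earlier in this thread) so that it builds before C++20, whereas the patch itself uses std::source_location.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for std::source_location. The default arguments are
// evaluated where current() is invoked, so the caller's position is captured.
struct demo_location {
    const char* file = "";
    std::uint32_t line = 0;

    static demo_location current(const char* f = __builtin_FILE(),
                                 std::uint32_t l = __builtin_LINE()) noexcept {
        return {f, l};
    }
};

// Simplified model of a task: the resume point starts out empty and is
// exposed through a getter and a setter.
class demo_task {
    demo_location _resume_point;
public:
    const demo_location& get_resume_point() const noexcept { return _resume_point; }
    void update_resume_point(demo_location loc) noexcept { _resume_point = loc; }
};

// Models a `then`-style entry point: the defaulted parameter records the
// location of the code that attached the continuation.
inline void demo_then(demo_task& t,
                      demo_location loc = demo_location::current()) noexcept {
    t.update_resume_point(loc);
}

// Models the formatter change: print the location next to the task.
// Columns are omitted because this stand-in does not capture them.
inline std::string describe(const demo_task& t) {
    const demo_location& loc = t.get_resume_point();
    return std::string(loc.file) + ":" + std::to_string(loc.line);
}
```

Calling demo_then(t) from some function and then describe(t) yields a file:line string for that call site, similar in spirit to the test_app/main.cpp:15:15 line in the example output at the top of this PR.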
@radoslawcybulski radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from 37a4a68 to 9762ed0 Compare November 20, 2025 19:48
@radoslawcybulski
Author

Patch rebased, "upgraded" two await_suspend(std::coroutine_handle<> consumer) functions to await_suspend(std::coroutine_handle<Promise> consumer) with template parameter Promise.

Development

Successfully merging this pull request may close these issues.

Looking for stuck/deadlocked fibers could be easier

10 participants