Add source location to task and tasktrace object #2707

radoslawcybulski · 2025-03-31T14:23:38Z

Adds the source location of the next resume point to task objects. When a tasktrace object is constructed, the source locations of the resume points are added to it. When a tasktrace is printed, the source location of the next resume point is printed as well, which should improve the debugging experience.

Example:

0x2b3899 /home/y/work/scylladb-test/build/Debug/seastar/libseastar.so+0x454c364 ...
   --------
   test_app/main.cpp:15:15
   seastar::continuation<seastar::internal::promise_base_with_type<void>, ...

Partially fixes #2381

avikivity · 2025-03-31T14:48:28Z

include/seastar/core/task.hh

    scheduling_group _sg;
 private:
+    std::source_location resume_point = {};
+


This is making a very heavily used object larger.

After adding std::source_location

coroutine_traits_base<>::promise_type (which inherits from task) is 64 bytes,
continuation<Promise, Func, Wrapper, T> is around 104-120, depending on exact version.

No difference except for that single unlucky value, which is "promoted" to the next 64-byte cache line in the case of continuation.

I think it's okay.

avikivity · 2025-03-31T14:51:12Z

Aren't those addresses derivable from the stack trace? Or are those points lost?

How large is std::source_location?

radoslawcybulski · 2025-03-31T14:55:09Z

Aren't those addresses derivable from the stack trace? Or are those points lost?

Probably with a little bit of magic. You'd still have to store a pointer somewhere, because stacktrace is valid for an actually running task, while my patch adds "next instruction" to tasks, which are suspended.

How large is std::source_location?

8 bytes.

avikivity · 2025-03-31T15:09:23Z

Aren't those addresses derivable from the stack trace? Or are those points lost?

Probably with a little bit of magic. You'd still have to store a pointer somewhere, because stacktrace is valid for an actually running task, while my patch adds "next instruction" to tasks, which are suspended.

How large is std::source_location?

8 bytes.

Maybe we can tolerate it. @tgrabiec what's your opinion?

Please run scylladb's perf-simple-query and report insn-per-op before and after (although we won't learn of the impact of the size change).

radoslawcybulski · 2025-03-31T16:21:39Z

my patch

instructions_per_op: mean=48465.87 standard-deviation=33.03 median=48459.42 median-absolute-deviation=22.63 maximum=48520.81 minimum=48437.81
instructions_per_op: mean=48482.90 standard-deviation=21.16 median=48487.16 median-absolute-deviation=6.97 maximum=48504.13 minimum=48447.34
instructions_per_op: mean=48469.42 standard-deviation=18.34 median=48472.45 median-absolute-deviation=17.26 maximum=48488.53 minimum=48448.81

master

instructions_per_op: mean=48337.41 standard-deviation=21.29 median=48334.08 median-absolute-deviation=19.67 maximum=48364.48 minimum=48317.02
instructions_per_op: mean=48352.36 standard-deviation=27.87 median=48347.35 median-absolute-deviation=16.28 maximum=48395.26 minimum=48322.29
instructions_per_op: mean=48339.83 standard-deviation=26.48 median=48323.12 median-absolute-deviation=20.80 maximum=48372.64 minimum=48319.03

My patch will run more instructions, as there are additional stores to the task object (std::source_location needs to be written).

Why do we measure instructions per operation? We are not billed by instruction count, rather by time, we should measure total running time instead? Does it work as a sanity check or am i missing something?

avikivity · 2025-03-31T17:37:35Z

Why do we measure instructions per operation? We are not billed by instruction count, rather by time, we should measure total running time instead? Does it work as a sanity check or am i missing something?

Time is very unstable (just try it) since it depends on temperature and cooling.

Instructions is a more stable proxy for time (but inaccurate since instructions-per-clock can change).

avikivity · 2025-03-31T17:38:40Z

So we add 0.25% overhead. Hard to judge if it's worthwhile.

radoslawcybulski · 2025-03-31T18:01:15Z

Time is very unstable (just try it) since it depends on temperature and cooling.

Agree.

Instructions is a more stable proxy for time (but inaccurate since instructions-per-clock can change).

It's absolutely stable, but we're counting steps, which - in my opinion - sort-of blinds us. It's like two different walks, the same step count, but step sizes differ, so is distance. Here for example if we're running at max IPC, then we've 0.25% overhead. If otherwise we're waiting for memory, then the write is pretty much free (we've already waited for this cache line, because other stuff is read / written there) and proc is waiting for other memory pieces, so it can do our write.

avikivity · 2025-03-31T18:08:08Z

Time is very unstable (just try it) since it depends on temperature and cooling.

Agree.

Instructions is a more stable proxy for time (but inaccurate since instructions-per-clock can change).

It's absolutely stable, but we're counting steps, which - in my opinion - sort-of blinds us. It's like two different walks, the same step count, but step sizes differ, so is distance. Here for example if we're running at max IPC, then we've 0.25% overhead. If otherwise we're waiting for memory, then the write is pretty much free (we've already waited for this cache line, because other stuff is read / written there) and proc is waiting for other memory pieces, so it can do our write.

Writes are more-or-less free, except for the increase in instruction footprint (which are extra reads).

We aren't close to running at max IPC. Typical for mini-benchmarks like perf-simple-query is 2, and for production ~1. The main killer is waiting for instruction fetch (although maybe in production mispredicts would also contribute since the data is less regular). I'm sure an M4 would give much higher IPC.

tchaikov

if the performance impact could be a concern, can we make it an optional feature which is enabled only in the debug builds?

tchaikov · 2025-04-07T07:50:40Z

include/seastar/core/coroutine.hh


 #ifndef SEASTAR_MODULE
 #include <coroutine>
+#include <source_location>


It took me a while to check whether it's safe to use <source_location> instead of util/std-compat.hh. The answer is yes, as we support the latest two major versions of the GCC and Clang compilers. In general, we use libstdc++, which added support for source_location in gcc-mirror/gcc@57d76ee, which is included in GCC 12 and up, and the latest stable release of GCC is GCC 14.2.

Opened #2961 on this.

radoslawcybulski · 2025-04-14T06:27:30Z

if the performance impact could be a concern, can we make it an optional feature which is enabled only in the debug builds?

This has been discussed here (#2381) and consensus is that the release is where it's useful the most.

tgrabiec · 2025-04-17T11:53:38Z

Aren't those addresses derivable from the stack trace? Or are those points lost?

Probably with a little bit of magic. You'd still have to store a pointer somewhere, because stacktrace is valid for an actually running task, while my patch adds "next instruction" to tasks, which are suspended.

How large is std::source_location?

8 bytes.

Maybe we can tolerate it. @tgrabiec what's your opinion?

I think it's probably ok.

tchaikov

lgtm

bitpathfinder · 2025-04-22T07:22:48Z

include/seastar/core/task.hh

 protected:
    scheduling_group _sg;
 private:
+    std::source_location resume_point = {};


I'd put the resume_point first since it's larger than scheduling_group. that way the task object size is 4 bytes less.

Not really - if the class is not packed then the size includes the alignment, so order doesn't matter.

Yes, that's true. The object will be aligned to 8 bytes boundary. However in case we add something after these variables, less than 5 bytes, then there will be a difference, so having it ordered by size might be beneficial in the future.
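The layout argument above can be checked with a small sketch. The types here are hypothetical stand-ins, not Seastar's real ones: a 4-byte id in place of scheduling_group and an 8-byte, 8-byte-aligned handle in place of std::source_location (the thread reports 8 bytes for it). The real task class has other members as well, so this only illustrates the padding rule.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the two members discussed in the review.
using sg_id = std::uint32_t;                                    // 4 bytes
struct alignas(8) location_handle { unsigned char bytes[8]; };  // 8 bytes, 8-byte aligned

// With only these two members, order does not change the size:
// padding is either between the members or at the tail.
struct small_first { sg_id sg; location_handle loc; };
struct large_first { location_handle loc; sg_id sg; };

// Once a third small member is appended, the order starts to matter,
// because the new member can reuse the tail padding of large_first.
struct small_first_plus { sg_id sg; location_handle loc; std::uint32_t extra; };
struct large_first_plus { location_handle loc; sg_id sg; std::uint32_t extra; };

// Bytes saved by declaring the larger member first once a third member exists.
inline std::size_t bytes_saved_by_ordering() {
    assert(sizeof(small_first_plus) >= sizeof(large_first_plus));
    return sizeof(small_first_plus) - sizeof(large_first_plus);
}
```

Under these assumptions both two-member structs occupy 16 bytes, while the three-member variants occupy 24 and 16 bytes respectively, which is the future-proofing point made in the last reply.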

bitpathfinder · 2025-04-22T12:34:43Z

include/seastar/core/task.hh


 SEASTAR_MODULE_EXPORT
 class task {
+    std::source_location resume_point = {};


(nit) Shouldn't we prefix class member variables with an underscore: resume_point -> _resume_point?

radoslawcybulski · 2025-04-22T13:13:22Z

@scylladb/seastar-maint please merge.

denesb · 2025-06-27T14:11:51Z

So we add 0.25% overhead. Hard to judge if it's worthwhile.

@avikivity I think the gain is much higher than 0.25% here. We need to lower the barrier to entry for debugging seastar applications.

travisdowns · 2025-09-08T18:18:40Z

wrt __builtin_return_address(0) it seems like the problematic case will be when the thing that's calling this (in this change) was inlined into the caller, in which case your stack won't include the creation of the task at all (only some high level frames).

If you captured that plus the current IP you'd have enough info to reconstruct always at least the caller.

radoslawcybulski · 2025-09-08T18:20:49Z

@avikivity godbolt says it works. Given it's pretty much undocumented builtin - probably.

As i mentioned before - it does something else. The point of std::source_location is to give everyone compile time constant that survives inlining optimization. Runtime return address is pretty much useless - you will get return pointer into template inside god knows which header.

I will also argue __builtin_return_address(0) provides LESS information, not more - otherwise we would all be using it in our -O3 mass inlined builds everywhere. Heck, even full stacktrace will probably contain less information, due to optimization mess.

I can fork our own source_location implementation using __builtin_FILE and __builtin_LINE, those should limit memory usage.

EDIT: those are clang / gcc specific.
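A minimal sketch of what a location capture built on those builtins might look like. The names basic_location and capture_location are hypothetical and are not taken from the patch. It relies on the GCC/Clang behaviour that __builtin_FILE() and __builtin_LINE(), when used as default arguments, are evaluated at the call site.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical name: a location record holding only a file pointer and a line.
struct basic_location {
    const char* file;
    std::uint32_t line;
};

// The default arguments are evaluated at each call site, so the builtins
// report the caller's file and line rather than this function's own.
inline basic_location capture_location(const char* file = __builtin_FILE(),
                                       int line = __builtin_LINE()) {
    assert(file != nullptr);
    return basic_location{file, static_cast<std::uint32_t>(line)};
}

// Two distinct call sites, on two distinct source lines.
basic_location first_call_site() { return capture_location(); }
basic_location second_call_site() { return capture_location(); }
```

Each call site yields its own line number, which is the same mechanism std::source_location::current() uses as a default argument. This sketch records no function name or column.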

avikivity · 2025-09-09T08:54:44Z

@avikivity godbolt says it works. Given it's pretty much undocumented builtin - probably.

It's not undocumented.

As i mentioned before - it does something else. The point of std::source_location is to give everyone compile time constant that survives inlining optimization. Runtime return address is pretty much useless - you will get return pointer into template inside god knows which header.

I will also argue __builtin_return_address(0) provides LESS information, not more - otherwise we would all be using it in our -O3 mass inlined builds everywhere.

I don't understand the logic. If X does not use Y, does it follow that Y is not useful? Perhaps X hasn't thought of using Y yet.

If it applied, it would be impossible to do anything new.

Heck, even full stacktrace will probably contain less information, due to optimization mess.

Stack traces (when explored with addr2line -i) show much more information. Each return address on the stack encodes all the inlined call-sites that led to it up to the next uninlined caller. The more inlining happens, the more information it carries.

See https://godbolt.org/z/G678Eo7rh. Although there is just one location that calls g(), it knows exactly which path called it through a.

I can fork our own source_location implementation using __builtin_FILE and __builtin_LINE, those should limit memory usage.

I suggest you try out __builtin_return_address(0) and compare the results.
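For anyone trying this, a minimal sketch of the capture side, assuming GCC or Clang. The function names are hypothetical. The capturing function is deliberately marked noinline, which is the assumption the rest of the discussion turns on: if it were inlined, the value would describe the caller's caller instead.

```cpp
#include <cassert>
#include <utility>

// Returns the address of the instruction following the call to this function.
// noinline keeps a real call (and therefore a real return address) in place.
[[gnu::noinline]] void* capture_return_address() {
    // Empty asm with a memory clobber: stops the compiler from treating
    // two calls as interchangeable and merging them.
    asm volatile("" ::: "memory");
    return __builtin_return_address(0);
}

// Two separate call instructions, hence two different return addresses.
[[gnu::noinline]] std::pair<void*, void*> two_call_sites() {
    void* first = capture_return_address();
    void* second = capture_return_address();
    assert(first != nullptr);
    return {first, second};
}
```

The two addresses can then be fed to addr2line -i (or llvm-addr2line) to see which inlined frames each one resolves to.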

radoslawcybulski · 2025-09-10T08:51:39Z

I stand corrected, stack traces are actually sort of useful now, debug info has improved massively.

Stack traces (when explored with addr2line -i) show much more information. Each return address on the stack encodes all the inlined call-sites that led to it up to the next uninlined caller. The more inlining happens, the more information it carries.

This assumes every instruction has unambiguous inlined sequence of calls leading to it. Or am i missing something?

See https://godbolt.org/z/G678Eo7rh. Although there is just one location that calls g(), it knows exactly which path called it through a.

Good example. I took the liberty of removing [[gnu::noinline]] and [[gnu::always_inline]] and adding h implementation, that prints the ptr value. The code repeats 4 different values at -O2 and -O3 (which makes sense as clang just inlined to some depth). Adding flags used for release build makes the code print single value.
I also added a small if test, that calls foo from both branches - prints the same value in both cases.

Am i missing something? It doesn't take much to generate code, that drops ambiguous return pointers after optimization. Is there a way to resolve those i don't know (which would make me a X)? Otherwise that's pretty much the point i'm so poorly making. You can probably make it sort-of work with clever noinline flags (adding noinline to functions that also call __builtin_return_address would probably do it). Why bother? This will also hurt performance.

avikivity · 2025-09-10T09:33:56Z

I stand corrected, stack traces are actually sort of useful now, debug info has improved massively.

Stack traces (when explored with addr2line -i) show much more information. Each return address on the stack encodes all the inlined call-sites that led to it up to the next uninlined caller. The more inlining happens, the more information it carries.

This assumes every instruction has unambiguous inlined sequence of calls leading to it. Or am i missing something?

It doesn't assume anything.

If inlining happens to duplicate a call to (let's say) future::then, then you will get more information.

If it doesn't, then you don't.

In the worst case __builtin_return_address will return the same amount of information as std::source_location.

In the common case, it will return more, since you will see the direct caller, and its callers up to the first uninlined function.

See https://godbolt.org/z/G678Eo7rh. Although there is just one location that calls g(), it knows exactly which path called it through a.

Good example. I took the liberty of removing [[gnu::noinline]] and [[gnu::always_inline]] and adding h implementation, that prints the ptr value. The code repeats 4 different values at -O2 and -O3 (which makes sense as clang just inlined to some depth). Adding flags used for release build makes the code print single value. I also added a small if test, that calls foo from both branches - prints the same value in both cases.

Am i missing something? It doesn't take much to generate code, that drops ambiguous return pointers after optimization. Is there a way to resolve those i don't know (which would make me a X)? Otherwise that's pretty much the point i'm so poorly making. You can probably make it sort-of work with clever noinline flags (adding noinline to functions that also call __builtin_return_address would probably do it). Why bother? This will also hurt performance.

If you compare to std::source_location, then in all combinations of inlining options it returns one call site. __builtin_return_address will return from 1 to 32 call sites, depending on the options.

Here's the example updated to report both std::source_location and __builtin_return_address. If you change h() to print both, you will get 32 different values for __builtin_return_address but only one value for std::source_location.

https://godbolt.org/z/MqaocP1x5

radoslawcybulski · 2025-09-10T10:39:48Z

In the worst case __builtin_return_address will return the same amount of information as std::source_location.
In the common case, it will return more, since you will see the direct caller, and its callers up to the first uninlined function.

Consider this stack trace (top is current)

__builtin_return_address(0) -> returns next instruction in `caz`
foo [inlined]
bar [inlined]
baz [not inlined]
gez [inlined]
caz [not inlined]

__builtin_return_address will return the return address of baz, which means you will get gez -> caz, which is what you said - you get inlined stacks up to the first non-inlined function. You lose the top-level inlined stack though (foo, bar). Your example works because in your case you've artificially made foo not inlined (which means it has no inlined calls between itself and the call to __builtin_return_address(0), so nothing to lose).
This is what i also said - we can sort-of make it work at the cost of performance (more instructions executed, as those stack frames don't generate themselves).

This might be improvable (although i doubt i'm the first to think about it), if - instead of calling __builtin_return_address (which gets the return address of the current function, which is not that useful unless we stop inlining) - we could get the next execution address pointer. This sort-of reverses the issue - now we have to force-inline all then implementations. The next instruction pointer will then look like the return pointer of the then call had it been forced to noinline, so we should get the missing top-level inlined frames and it should fly.
My instinct tells me i'm not the first one to think about it and there are major issues with this approach. Meanwhile std::source_location will work 100% of the time on all systems.

EDIT: i will give it a try, but it will take some time.

michoecho · 2025-09-10T12:03:24Z

In the worst case __builtin_return_address will return the same amount of information as std::source_location.

@avikivity No, why?

Imagine I want to debug a deadlocked fiber. It's stuck in this coroutine:

future<> my_function() {
    co_await a();
    co_await deadlock();
    co_await b();
}

std::source_location will point to the co_await deadlock(); line. That's what I want. __builtin_return_address(0) will point to reactor::do_run(). (Or some other internal function that wakes the coroutine handle). That's useless.

The replacement for std::source_location is not the return address, but rip.

michoecho · 2025-09-10T12:10:55Z

This might be improvable (although i doubt i'm the first to think about it), if - instead of calling __builtin_return_address (which gets the return address of the current function, which is not that useful unless we stop inlining) - we could get the next execution address pointer. This sort-of reverses the issue - now we have to force-inline all then implementations. The next instruction pointer will then look like the return pointer of the then call had it been forced to noinline, so we should get the missing top-level inlined frames and it should fly.
My instinct tells me i'm not the first one to think about it and there are major issues with this approach. Meanwhile std::source_location will work 100% of the time on all systems.

@radoslawcybulski You mean, use rip instead of std::source_location like I did in #2381 (comment)?

I wanted to use the instruction pointer instead of source location from the start, because it theoretically gives more information, but I didn't try to push that because in practice (when I was privately using that patch for debugging) the compiler was very often generating bad debug info for the addresses I obtained this way. (E.g. pointing to the right file, but to line 0 instead of the actual line). std::source_location prevents problems like that. So it's a tradeoff — less info, but more reliable due to explicit compiler support, or more info with a chance that the info will be less useful (without manual investigation to find the actual line) because the compiler didn't care enough to generate good debug info.

radoslawcybulski · 2025-09-10T20:17:31Z

@radoslawcybulski You mean, use rip instead of std::source_location like I did in #2381 (comment)?

Yep, the same. Also your patch highlights my other point - it's rare to figure out something new, most of the time if people are not using something there's a reason for it.

avikivity · 2025-09-11T16:58:51Z

In the worst case __builtin_return_address will return the same amount of information as std::source_location.
In the common case, it will return more, since you will see the direct caller, and its callers up to the first uninlined function.

Consider this stack trace (top is current)
__builtin_return_address(0) -> returns next instruction in `caz`
foo [inlined]
bar [inlined]
baz [not inlined]
gez [inlined]
caz [not inlined]
__builtin_return_address will return the return address of baz, which means you will get gez -> caz, which is what you said - you get inlined stacks up to the first non-inlined function. You lose the top-level inlined stack though (foo, bar). Your example works because in your case you've artificially made foo not inlined (which means it has no inlined calls between itself and the call to __builtin_return_address(0), so nothing to lose). This is what i also said - we can sort-of make it work at the cost of performance (more instructions executed, as those stack frames don't generate themselves).

This might be improvable (although i doubt i'm the first to think about it), if - instead of calling __builtin_return_address (which gets the return address of the current function, which is not that useful unless we stop inlining) - we could get the next execution address pointer. This sort-of reverses the issue - now we have to force-inline all then implementations. The next instruction pointer will then look like the return pointer of the then call had it been forced to noinline, so we should get the missing top-level inlined frames and it should fly. My instinct tells me i'm not the first one to think about it and there are major issues with this approach. Meanwhile std::source_location will work 100% of the time on all systems.

Right, my idea assumes that the point where you capture __builtin_return_address(0) is itself not inlined, but that's hardly a given.

What we want is the current instruction pointer (if inlined) or __builtin_return_address(0) if not, but that's not expressible.

std::source_location works, but is not always useful. It can point to, say, some function in loop.hh and give you not much idea about what's going on.

avikivity · 2025-09-11T17:01:55Z

In the worst case __builtin_return_address will return the same amount of information as std::source_location.

@avikivity No, why?

Imagine I want to debug a deadlocked fiber. It's stuck in this coroutine:
future<> my_function() {
    co_await a();
    co_await deadlock();
    co_await b();
}
std::source_location will point to the co_await deadlock(); line. That's what I want. __builtin_return_address(0) will point to reactor::do_run(). (Or some other internal function that wakes the coroutine handle). That's useless.

Well, it depends (which is bad) on whether the function capturing %rip was inlined.

Perhaps we could always_inline a wrapper, and let the compiler choose whether to inline the wrappee or not.

The replacement for std::source_location is not the return address, but rip.

yes+no

radoslawcybulski · 2025-10-15T12:49:01Z

I've run three tests: without the patch, with std::source_location, and with slim_source_location (RIP + const char * filename + uint64_t line). Results (size of scylla + output of a perf-simple-query run) follow:

without patch:
-rwxr-xr-x. 1 y y 117237760 Oct 15 14:04 build/dev/scylla
109785.54 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   56731 insns/op,   34064 cycles/op,        0 errors)
129664.30 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56783 insns/op,   34771 cycles/op,        0 errors)
132627.36 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56763 insns/op,   35072 cycles/op,        0 errors)
133472.86 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56768 insns/op,   34881 cycles/op,        0 errors)
137312.72 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56748 insns/op,   33911 cycles/op,        0 errors)
throughput:
        mean=   128572.56 standard-deviation=10851.14
        median= 132627.36 median-absolute-deviation=4900.30
        maximum=137312.72 minimum=109785.54
instructions_per_op:
        mean=   56758.37 standard-deviation=19.65
        median= 56762.56 median-absolute-deviation=10.69
        maximum=56782.65 minimum=56731.24
cpu_cycles_per_op:
        mean=   34539.85 standard-deviation=518.31
        median= 34771.03 median-absolute-deviation=475.91
        maximum=35071.78 minimum=33911.26

with std::source_location:
-rwxr-xr-x. 1 y y 119546200 Oct 15 14:40 build/dev/scylla
132906.63 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56899 insns/op,   34873 cycles/op,        0 errors)
118051.07 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   56861 insns/op,   39176 cycles/op,        0 errors)
128228.93 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56843 insns/op,   36077 cycles/op,        0 errors)
135714.37 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56858 insns/op,   33995 cycles/op,        0 errors)
134982.33 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   56878 insns/op,   34199 cycles/op,        0 errors)
throughput:
        mean=   129976.67 standard-deviation=7277.31
        median= 132906.63 median-absolute-deviation=5005.67
        maximum=135714.37 minimum=118051.07
instructions_per_op:
        mean=   56867.64 standard-deviation=21.30
        median= 56860.67 median-absolute-deviation=10.37
        maximum=56898.66 minimum=56843.22
cpu_cycles_per_op:
        mean=   35663.98 standard-deviation=2125.23
        median= 34872.53 median-absolute-deviation=1464.79
        maximum=39176.33 minimum=33994.54
diff size is 2308440 (~2.2 MB)

with `slim_source_location`:
-rwxr-xr-x. 1 y y 119431736 Oct 15 13:25 build/dev/scylla
124916.79 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   57187 insns/op,   37190 cycles/op,        0 errors)
118595.81 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   57126 insns/op,   39020 cycles/op,        0 errors)
129827.40 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   57165 insns/op,   35711 cycles/op,        0 errors)
133294.11 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   57172 insns/op,   34779 cycles/op,        0 errors)
133896.32 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   57152 insns/op,   34626 cycles/op,        0 errors)
throughput:
        mean=   128106.09 standard-deviation=6403.56
        median= 129827.40 median-absolute-deviation=5188.02
        maximum=133896.32 minimum=118595.81
instructions_per_op:
        mean=   57160.30 standard-deviation=23.14
        median= 57164.96 median-absolute-deviation=11.48
        maximum=57187.23 minimum=57125.81
cpu_cycles_per_op:
        mean=   36265.20 standard-deviation=1847.07
        median= 35710.56 median-absolute-deviation=1486.16
        maximum=39019.93 minimum=34626.24
diff size is 2193976 (~2 MB)
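The deltas between the three runs can be recomputed from the means and file sizes posted above. The sketch below only does that arithmetic. The struct and function names are hypothetical, and the numbers are copied verbatim from the three result blocks.

```cpp
#include <cassert>
#include <cstdint>

// One row per build: mean instructions_per_op, mean cpu_cycles_per_op,
// and the size of build/dev/scylla in bytes, as reported above.
struct run_summary {
    double insns_per_op;
    double cycles_per_op;
    std::int64_t binary_bytes;
};

const run_summary unpatched{56758.37, 34539.85, 117237760};
const run_summary with_std_location{56867.64, 35663.98, 119546200};
const run_summary with_slim_location{57160.30, 36265.20, 119431736};

// Extra instructions per operation relative to a baseline run.
double extra_insns_per_op(const run_summary& run, const run_summary& base) {
    return run.insns_per_op - base.insns_per_op;
}

// The same difference, expressed as a percentage of the baseline.
double extra_insns_percent(const run_summary& run, const run_summary& base) {
    assert(base.insns_per_op > 0);
    return 100.0 * extra_insns_per_op(run, base) / base.insns_per_op;
}

// Instructions per cycle, the IPC figure discussed earlier in the thread.
double instructions_per_cycle(const run_summary& run) {
    assert(run.cycles_per_op > 0);
    return run.insns_per_op / run.cycles_per_op;
}

// Growth of the binary relative to a baseline run, in bytes.
std::int64_t extra_binary_bytes(const run_summary& run, const run_summary& base) {
    return run.binary_bytes - base.binary_bytes;
}
```

With these inputs, std::source_location costs about 109 extra instructions per op (roughly 0.19%) and slim_source_location about 402 (roughly 0.71%), both relative to the unpatched build. The slim binary is 114464 bytes smaller than the std::source_location one. IPC for the unpatched run comes out at about 1.6, though the cycle counts have a much larger standard deviation than the instruction counts.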

radoslawcybulski · 2025-10-15T12:51:17Z

We get additional instructions (well, 3 writes instead of one), roughly 200 instructions per op (0.3%).
scylla shrunk by approximately 200kb.

Not sure if it's worth it. Most of our tools probably deal with std::source_location, but that's adaptable.
@avikivity please make a call or tell me whom i should pull in for more info here.

avikivity · 2025-10-16T09:05:16Z

We get additional instructions (well, 3 writes instead of one), roughly 200 instructions per op (0.3%). scylla shrunk by approximately 200kb.

Not sure if it's worth it. Most of our tools probably deal with std::source_location, but that's adaptable. @avikivity please make a call or tell me whom i should pull in for more info here.

Our tools deal well with addresses too.

Looks like the slim source_location isn't so slim. I can understand it - source_location is fat out-of-line but slim inline.

I need to think more about it. My feeling is that source_location won't work well as more infrastructure is converted to coroutines, we'll just see some internal coroutine there.

avikivity · 2025-10-16T09:06:29Z

compiler was very often generating bad debug info for the addresses I obtained this way. (E.g. pointing to the right file, but to line 0 instead of the actual line).

Was this with addr2line or llvm-addr2line? We recently saw that llvm-addr2line is better.

avikivity · 2025-10-16T10:42:32Z

include/seastar/core/slim_source_location.hh

+
+        [[gnu::always_inline]] slim_source_location(const char* file = __builtin_FILE(), std::int32_t line = __builtin_LINE(), std::int32_t column = __builtin_COLUMN())
+                : _file(file), _line(line), _column(column) {
+#ifdef SSL_HAS_RIP_X86


slim source line.
My brain keeps telling me i've seen this abbreviation, but i've not found anything better. I can just expand into SLIM_SOURCE_LINE_***, those macros are contained in this file anyway.

Prefix macros with SEASTAR_, and don't abbreviate, it just confuses people.

avikivity · 2025-10-16T10:53:02Z

include/seastar/core/slim_source_location.hh

+                : _file(file), _line(line), _column(column) {
+#ifdef SSL_HAS_RIP_X86
+            std::uintptr_t rip;
+            asm("leaq 0(%%rip), %0":"=r"(rip));


We could slim it down with something like

struct slim_data* sd;
asm ("1: lea 2f, %0 \n"
     ".pushsection .rodata.something \n"
     "2: \n"
     ".quad 1b \n"
     ".quad %1 \n"
     ".long %2, %3 \n"
     ".popsection"
     : "=r"(sd)
     : "i"(__builtin_FILE()), "i"(__builtin_LINE()), "i"(__builtin_COLUMN()));

This pushes the data to the data section and we end up with one instruction.

Well, i do understand few identifiers from the example, but that's all. :)

What is 1: and 2: (i assume labels? They don't seem used? Are those global identifiers or local to function / module?). Why we do .quad 1b (quad is 64bit probably, so 1b is just 1)? long is 32bit i assume? What is =r... syntax?

Yes, 1: and 2: are labels, 2f = 2 forward, 1b = 1 back. Using numbers makes them local.

"=r" -> output register

It's preparing a static struct pointing back at the lea instruction.

What is 2 forward?
Will fix it later on, i'm on sick leave until tomorrow.

By the way, in cases where macros are on the table (in this case they aren't), you can get this static struct without stooping to inline asm.

Eg. I have this tracepoint macro (which I'm actually using in practice right now, when writing a debug util for scylladb/scylladb#25679). (entry) is the static struct in this example. And I use its address as the metadata header for each event in the trace.

struct tracepoint_entry {
    const char* name;
    const char* file;
    int line;
    const char* function;
    const char* signature;
    void *rip;
};

#define TRACEPOINT(eventlevel, tp_name, ...) { \
    __label__ rip; \
    using namespace seastar; \
    static constexpr auto sig __attribute__((section("tracepoint_signatures"), used)) = COMPUTE_SIGNATURE(SIG(__VA_ARGS__)); \
    static constexpr char namearr[] __attribute__((section("tracepoint_names"), used)) = tp_name; \
    static constexpr char filearr[] __attribute__((section("tracepoint_files"), used)) = __FILE__; \
    static constexpr tracepoint_entry entry __attribute__((section("tracepoints"), used)) = { \
        .name = namearr, \
        .file = filearr, \
        .line = __LINE__, \
        .function = __PRETTY_FUNCTION__, \
        .signature = sig.data(), \
        .rip = &&rip, \
    }; \
    rip: \
    size_t sz = compute_size(EXTRACT_ARGS(__VA_ARGS__)); \
    auto out = local_tracer->write(eventlevel, sz + 16); \
    seastar::write_le<uintptr_t>(reinterpret_cast<char*>(out), reinterpret_cast<uintptr_t>(&entry)); \
    out += sizeof(uintptr_t); \
    seastar::write_le<uint64_t>(reinterpret_cast<char*>(out), __rdtsc()); \
    out += sizeof(uint64_t); \
    serialize_tracepoint(out __VA_OPT__(,) EXTRACT_ARGS(__VA_ARGS__)); \
}

@michoecho why do we need this black magic incantation with __attribute__((section("..."), used))? Wouldn't a simple static suffice?

@michoecho why wouldn't this work?

struct slim_data* sd;
void *foo(const char *file, int line, int column) {
    asm ("1: lea 2f, %0 \n"
         ".pushsection .rodata.something \n"
         "2: \n"
         ".quad 1b \n"
         ".quad %1 \n"
         ".long %2, %3 \n"
         ".popsection"
         : "=r"(sd)
         : "i"(file), "i"(line), "i"(column));
    ...
}

?

Why wouldn't this work?

Because Avi's wish would be to combine the instruction pointer and the source location (and maybe __PRETTY_FUNCTION__ too, why not) into a static, constant-initialized struct, so that all statically-known info can be represented inside a task by a single pointer at runtime. And that's what the piece of asm is trying to accomplish.

But if you want that struct to be constant-initialized, you must initialize it with constants. And in your example, you are trying to initialize it with not-constants. (If the function is always inlined, then the values of line and column and file are known at compile time, but something like this doesn't fit into the type system. A function must stand on its own, there's no "post-inlining constexpr"). It will be rejected by the compiler. (The "i" in "i"(line) stands for "immediate". This in x86 terms means a constant woven directly into the encoding of the instruction. And you are trying to pass a variable there).
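The constant-initialization requirement can be illustrated without inline asm when a macro is allowed. In the sketch below, __FILE__ and __LINE__ are constants at the expansion point, so each call site gets its own constant-initialized static record and only a pointer to it travels at runtime. site_record and SITE_RECORD are hypothetical names, and no instruction pointer is captured here.

```cpp
#include <cassert>

// Hypothetical per-call-site record. Everything in it is known at compile time.
struct site_record {
    const char* file;
    int line;
};

// Hypothetical macro: expands to an immediately invoked lambda whose static
// record is constant-initialized from the expansion point's __FILE__/__LINE__.
#define SITE_RECORD() \
    ([]() -> const site_record* { \
        static constexpr site_record record{__FILE__, __LINE__}; \
        return &record; \
    }())

// Two call sites on different source lines, each with its own static record.
const site_record* first_site() { return SITE_RECORD(); }
const site_record* second_site() { return SITE_RECORD(); }

// Reads the line back through the single pointer a task would store.
int line_of(const site_record* record) {
    assert(record != nullptr);
    return record->line;
}
```

Passing file and line through function parameters instead, as in the foo() variant above, loses exactly this property: inside the function they are ordinary runtime values, so they cannot initialize a constexpr object or satisfy an "i" constraint.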

why do we need this black magic incantation with __attribute__((section("..."), used))?

I want all tracepoint_entry structs to be collected by the linker into their own section (tracepoints here) so that my trace decoder can later, while preparing for decoding, look at the Scylla binary and iterate over all tracepoint_entry structs scattered inside it, to get the signatures for all event types in the trace.

If they aren't packed into a section, they will be in random places in .rodata, and they can't be found without knowing their names. If they are packed into a section, my decoder can open the Scylla executable, find its tracepoints section, cast it to a tracepoint_entry[] array, iterate over it, and generate a corresponding decoder for each TRACEPOINT in the program.

For example, I can put

TRACEPOINT(event_level::debug, "io_begin", "class", io_request.class, "id", io_request.id, "size", io_request.size);

anywhere in the program, and it will produce a

tracepoint_entry{.name = "io_begin", .signature = "class:u8,id:u64,size:u64", ...}

struct instance in the tracepoints section at some address X, and the decoder will iterate over X (among others), see the signature and use it to generate a decoder function which reads a

struct io_begin { uint8_t class; uint64_t id; uint64_t size; };

from the trace whenever the header is X.
And then the functions generated from signatures can be compiled into the actual decoder program.
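As a rough illustration of the signature-driven decoding described above, here is a sketch that interprets a signature at runtime instead of generating code from it. The names `field`, `parse_signature` and `decode_payload` are hypothetical. The signature syntax (`name:uN` items separated by commas) is taken from the example above. The payload layout (fields tightly packed, little-endian, in signature order) is an assumption, since `serialize_tracepoint` is not shown in this thread.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// One field of a tracepoint signature, e.g. {"id", 8} for "id:u64".
struct field {
    std::string name;
    unsigned bytes;
};

// Parse a signature such as "class:u8,id:u64,size:u64".
std::vector<field> parse_signature(std::string_view sig) {
    std::vector<field> out;
    while (!sig.empty()) {
        const auto comma = sig.find(',');
        const std::string_view item = sig.substr(0, comma);
        const auto colon = item.find(':');
        if (colon == std::string_view::npos || colon + 2 > item.size() || item[colon + 1] != 'u') {
            throw std::invalid_argument("malformed signature item");
        }
        // Skip the leading 'u' and read the bit width that follows it.
        const unsigned bits = std::stoul(std::string(item.substr(colon + 2)));
        out.push_back({std::string(item.substr(0, colon)), bits / 8});
        sig = (comma == std::string_view::npos) ? std::string_view{} : sig.substr(comma + 1);
    }
    return out;
}

// Decode a payload laid out as assumed above into (name, value) pairs.
std::vector<std::pair<std::string, std::uint64_t>>
decode_payload(const std::vector<field>& fields, const unsigned char* p) {
    std::vector<std::pair<std::string, std::uint64_t>> out;
    for (const auto& f : fields) {
        std::uint64_t value = 0;
        for (unsigned i = 0; i < f.bytes; ++i) {
            value |= static_cast<std::uint64_t>(p[i]) << (8 * i);
        }
        p += f.bytes;
        out.emplace_back(f.name, value);
    }
    return out;
}
```

Generating a dedicated function per signature, as described above, avoids re-parsing the signature for every event. The interpreter form is used here only because it fits in a few lines.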

Other sections are not needed for anything, they are just there to keep the metadata strings neatly ordered in the ELF.

The used is probably not needed either, I just wanted to make sure the linker (and/or LTO) won't play garbage collection tricks on me.
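The section-collection idea can be tried in a single program. This is only a sketch under stated assumptions: `demo_entry`, `DEMO_ENTRY`, `count_entries` and `find_entry` are hypothetical names, and the iteration happens in-process rather than in a separate decoder that opens the executable. It relies on GNU ld and lld defining `__start_<name>` and `__stop_<name>` symbols for any section whose name is a valid C identifier, so it applies to ELF targets with GCC or Clang only.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical record type. The explicit alignment keeps sizeof equal to the
// alignment, so consecutive records in the section form a gap-free array.
struct alignas(16) demo_entry {
    const char* name;
    int line;
};

// Section bounds, defined by the linker (GNU ld / lld on ELF).
extern "C" const demo_entry __start_demo_entries[];
extern "C" const demo_entry __stop_demo_entries[];

// Place one record into the "demo_entries" section. `used` asks the compiler
// to keep the object even though nothing refers to it by name.
#define DEMO_ENTRY(ident, text) \
    static const demo_entry ident __attribute__((section("demo_entries"), used)) = {text, __LINE__}

DEMO_ENTRY(entry_a, "io_begin");
DEMO_ENTRY(entry_b, "io_end");

// Number of records the linker collected into the section.
std::size_t count_entries() {
    return static_cast<std::size_t>(__stop_demo_entries - __start_demo_entries);
}

// Walk the section as an array and look a record up by name.
const demo_entry* find_entry(const char* name) {
    for (const demo_entry* e = __start_demo_entries; e != __stop_demo_entries; ++e) {
        if (std::strcmp(e->name, name) == 0) {
            return e;
        }
    }
    return nullptr;
}
```

The lookup never mentions `entry_a` or `entry_b`, which is the point made above: once the records sit in their own section, they can be enumerated without knowing their names.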

Anyway, all that's not very applicable in this thread, because macros are off the table.

@michoecho amazing, thank you for this explanation!

michoecho · 2025-10-16T16:16:48Z

compiler was very often generating bad debug info for the addresses I obtained this way. (E.g. pointing to the right file, but to line 0 instead of the actual line).

Was this with addr2line or llvm-addr2line? We recently saw that llvm-addr2line is better.

@avikivity I'm 80% sure that the problem was at compile time, not decode time. But I don't remember my problem well, I only have the vague memory that I was getting the RIPs I wanted but the file and line info for those RIPs in the DWARF was wrong(/useless) sometimes.

radoslawcybulski · 2025-11-19T09:49:19Z

Reverted "slim" source_location commit, we're back to std::source_location only. Are we good to go, @avikivity ?

Add an empty, default constructed std::source_location object to the task object and getter / setter.

Add calls to update `resume_point` variable with location of next resume location to all `await_suspend` functions and `then` functions.

Add resume point locations to `tasktrace` object. Update `formatter::format` to print source location of next resume along with task type.

radoslawcybulski · 2025-11-20T19:49:20Z

Patch rebased, "upgraded" two await_suspend(std::coroutine_handle<> consumer) functions to await_suspend(std::coroutine_handle<Promise> consumer) with template parameter Promise.

radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from 02aa70a to c0149e3 Compare March 31, 2025 14:28

radoslawcybulski requested a review from tchaikov March 31, 2025 14:30

avikivity reviewed Mar 31, 2025

radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from c0149e3 to 08d3c8a Compare March 31, 2025 14:58

radoslawcybulski requested a review from avikivity March 31, 2025 16:22

tchaikov reviewed Apr 7, 2025

radoslawcybulski requested a review from tchaikov April 14, 2025 10:35

radoslawcybulski self-assigned this Apr 14, 2025

tgrabiec approved these changes Apr 17, 2025

tchaikov approved these changes Apr 22, 2025

bitpathfinder reviewed Apr 22, 2025

radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch 2 times, most recently from 6fe697d to 80ce73a Compare April 22, 2025 09:46

radoslawcybulski requested a review from bitpathfinder April 22, 2025 12:31

bitpathfinder reviewed Apr 22, 2025

radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from 80ce73a to 58aa04e Compare April 22, 2025 12:38

radoslawcybulski requested a review from bitpathfinder April 22, 2025 12:40

bitpathfinder approved these changes Apr 22, 2025

radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch 2 times, most recently from 6e78428 to cabf015 Compare September 18, 2025 13:12

xemul force-pushed the master branch from 5b52717 to 8549271 Compare October 10, 2025 08:26

radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from cabf015 to 5f0d144 Compare October 13, 2025 17:32

avikivity reviewed Oct 16, 2025

radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from 5f0d144 to 37a4a68 Compare November 19, 2025 09:48

Radosław Cybulski added 3 commits November 20, 2025 20:08

Add a std::source_location (resume_point) to task object.

83a9f0b

Add an empty, default constructed std::source_location object to the task object and getter / setter.

Add calls to update resume_point

93d60a0

Add calls to update `resume_point` variable with location of next resume location to all `await_suspend` functions and `then` functions.

Update backtrace with source locations of resume points

9762ed0

Add resume point locations to `tasktrace` object. Update `formatter::format` to print source location of next resume along with task type.

radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from 37a4a68 to 9762ed0 Compare November 20, 2025 19:48

empty

41dc739


