-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Add source location to task and tasktrace object #2707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add source location to task and tasktrace object #2707
Conversation
02aa70a to
c0149e3
Compare
| scheduling_group _sg; | ||
| private: | ||
| std::source_location resume_point = {}; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is making a very heavily used object larger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After adding std::source_location
coroutine_traits_base<>::promise_type (which inherits from task) is 64 bytes,
continuation<Promise, Func, Wrapper, T> is around 104-120, depending on exact version.
No difference except for that single unlucky value, who is "promoted" to the next 64 byte cache line in case of continuation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's okay.
|
Aren't those addresses derivable from the stack trace? Or are those points lost? How large is std::source_location? |
Probably with a little bit of magic. You'd still have to store a pointer somewhere, because stacktrace is valid for an actually running task, while my patch adds "next instruction" to tasks, which are suspended.
8 bytes. |
c0149e3 to
08d3c8a
Compare
Maybe we can tolerate it. @tgrabiec what's your opinion? Please run scylladb's perf-simple-query and report insn-per-op before and after (although we won't learn of the impact of the size change). |
|
my patch master My patch will run more instructions, as there are additional stores to the task object (std::source_location needs to be written). Why do we measure instructions per operation? We are not billed by instruction count, rather by time, we should measure total running time instead? Does it work as a sanity check or am i missing something? |
Time is very unstable (just try it) since it depends on temperature and cooling. Instructions is a more stable proxy for time (but inaccurate since instructions-per-clock can change). |
|
So we add 0.25% overhead. Hard to judge if it's worthwhile. |
Agree.
It's absolutely stable, but we're counting steps, which - in my opinion - sort-of blinds us. It's like two different walks, the same step count, but step sizes differ, so is distance. Here for example if we're running at max IPC, then we've 0.25% overhead. If otherwise we're waiting for memory, then the write is pretty much free (we've already waited for this cache line, because other stuff is read / written there) and proc is waiting for other memory pieces, so it can do our write. |
Writes are more-or-less free, except for the increase in instruction footprint (which are extra reads). We aren't close to running at max IPC. Typical for mini-benchmarks like perf-simple-query is 2, and for production ~1. The main killer is waiting for instruction fetch (although maybe in production mispredicts would also contribute since the the data is less regular). I'm sure an M4 would give much higher IPC. |
tchaikov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the performance impact could be a concern, can we make it an optional feature which is enabled only in the debug builds?
|
|
||
| #ifndef SEASTAR_MODULE | ||
| #include <coroutine> | ||
| #include <source_location> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it took me a while to check if it's safe to use <source_location> instead of util/std-compat.hh . the answer is yes. as we support the latest two major versions of the GCC and Clang compilers. in general, we use libstdc++, which added the support to source_location in gcc-mirror/gcc@57d76ee, which is included by gcc 12 and up. and the latest stable release of GCC is GCC 14.2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #2961 on this.
This has been discussed here (#2381) and consensus is that the release is where it's useful the most. |
I think it's probably ok. |
tchaikov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
include/seastar/core/task.hh
Outdated
| protected: | ||
| scheduling_group _sg; | ||
| private: | ||
| std::source_location resume_point = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd put the resume_point first since it's larger than scheduling_group. that way the task object size is 4 bytes less.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really - if the class is not packed then the size includes the alignment, so order doesn't matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's true. The object will be aligned to 8 bytes boundary. However in case we add something after these variables, less than 5 bytes, then there will be a difference, so having it ordered by size might be beneficial in the future.
6fe697d to
80ce73a
Compare
include/seastar/core/task.hh
Outdated
|
|
||
| SEASTAR_MODULE_EXPORT | ||
| class task { | ||
| std::source_location resume_point = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) Should not we prepend class member variables with an underscore: resume_point->_resume_point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
80ce73a to
58aa04e
Compare
|
@scylladb/seastar-maint please merge. |
@avikivity I think the gain is much higher than 0.25% here. We need to reduce the bar of entry into debugging seastar applications. |
|
wrt If you captured that plus the current IP you'd have enough info to reconstruct always at least the caller. |
|
@avikivity godbolt says it works. Given it's pretty much undocumented builtin - probably. As i mentioned before - it does something else. The point of I will also argue I can fork our own source_location implementation using EDIT: those are clang / gcc specific. |
It's not undocumented.
I don't understand the logic. If X does not use Y, does it follow that Y is not useful? Perhaps X hasn't thought of using Y yet. If it applied, it would be impossible to do anything new.
Stack traces (when explored with addr2line -i) show much more information. Each return address on the stack encodes all the inlined call-sites that led to it up to the next uninlined caller. The more inlining happens, the more information it carries. See https://godbolt.org/z/G678Eo7rh. Although there is just one location that calls g(), it knows exactly which path called it through a.
I suggest you try out __builtin_return_address(0) and compare the results. |
|
I stand corrected, stack traces are actually sort of useful now, debug info has improved massively.
This assumes every instruction has unambiguous inlined sequence of calls leading to it. Or am i missing something?
Good example. I took the liberty of removing Am i missing something? It doesn't take much to generate code, that drops ambiguous return pointers after optimization. Is there a way to resolve those i don't know (which would make me a X)? Otherwise that's pretty much the point i'm so poorly making. You can probably make it sort-of work with clever |
It doesn't assume anything. If inlining happens to duplicate a call to (let's say) future::then, then you will get more information. If it doesn't, then you don't. In the worst case __builtin_return_address will return the same amount of information as std::source_location. In the common case, it will return more, since you will see the direct caller, and its callers up to the first uninlined function.
If you compare to std::source_location, then in all combinations of inlining options it returns one call site. __builtin_return_address will return from 1 to 32 call sites, depending on the options. Here's the example updated to report both std::source_location and __builtin_return_address. If you change h() to print both, you will get 32 different values for __builtin_return_address but only one value for std::source_location. |
Consider this stack trace (top is current)
This might be improvable (although i doubt i'm the first to think about it), if - instead of calling EDIT: i will give it a try, but it will take some time. |
@avikivity No, why? Imagine I want to debug a deadlocked fiber. It's stuck in this coroutine:
The replacement for |
@radoslawcybulski You mean, use I wanted to use the instruction pointer instead of source location from the start, because it theoretically gives more information, but I didn't try to push that because in practice (when I was privately using that patch for debugging) the compiler was very often generating bad debug info for the addresses I obtained this way. (E.g. pointing to the right file, but to line 0 instead of the actual line). |
Yep, the same. Also your patch highlights my other point - it's rare to figure out something new, most of the time if people are not using something there's a reason for it. |
Right, my idea the point where you capture __builtin_return_address(0) is itself not inlined, but that's hardly a given. What we want is the current instruction pointer (if inlined) or __builtin_return_address(0) if not, but that's not expressible. std::source_location works, but is not always useful. It can point to, say, some function in loop.hh and give you not much idea about what's going on. |
Well, it depends (which is bad) on whether the function capturing %rip was inlined. Perhaps we could always_inline a wrapper, and let the compiler choose whether to inline the wrappee or or not.
yes+no |
6e78428 to
cabf015
Compare
cabf015 to
5f0d144
Compare
|
I've run three tests (without patch, with |
|
We get additional instructions (well, 3 writes instead of one), roughly 200 instructions per op, (0.3%). Note sure if it's worth. Most of our tools probably deals with |
Our tools deal well with addresses too. Looks like the slim source_location isn't so slim. I can understand it - source_location is fat out-of-line but slim inline. I need to think more about it. My feeling is that source_location won't work well as more infrastructure is converted to coroutines, we'll just see some internal coroutine there. |
Was this with addr2line or llvm-addr2line? We recently saw that llvm-addr2line is better. |
|
|
||
| [[gnu::always_inline]] slim_source_location(const char* file = __builtin_FILE(), std::int32_t line = __builtin_LINE(), std::int32_t column = __builtin_COLUMN()) | ||
| : _file(file), _line(line), _column(column) { | ||
| #ifdef SSL_HAS_RIP_X86 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SSL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slim source line.
My brain keeps telling me i've seen this abbreviation, but i've not found anything better. I can just expand into SLIM_SOURCE_LINE_***, those macros are contained in this file anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefix macros with SEASTAR_, and don't abbreviate, it just confuses people.
| : _file(file), _line(line), _column(column) { | ||
| #ifdef SSL_HAS_RIP_X86 | ||
| std::uintptr_t rip; | ||
| asm("leaq 0(%%rip), %0":"=r"(rip)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could slim it down with something like
struct slim_data* sd;
asm ("1: lea 2f, %0 \n"
".pushsection .rodata.something \n"
"2: \n"
".quad 1b \n"
".quad %1 \n"
".long %2, %3 \n"
".popsection"
: "=r"(sd) : "i"(__builtin_FILE()) : "i"(__builtin_LINE()), "i"(__builtin_COLUMN()));This pushes the data to the data section and we end up with one instruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, i do understand few identifiers from the example, but that's all. :)
What is 1: and 2: (i assume labels? They don't seem used? Are those global identifiers or local to function / module?). Why we do .quad 1b (quad is 64bit probably, so 1b is just 1)? long is 32bit i assume? What is =r... syntax?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, 1: and 2: are labels, 2f = 2 forward, 1b = 1 back. Using numbers makes them local.
"=r" -> output register
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's preparing a static struct pointing back at the lea instruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is 2 forward?
Will fix it later on, i'm on sick leave until tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, in cases where macros are on the table (in this case they aren't), you can get this static struct without stooping to inline asm.
Eg. I have this tracepoint macro (which I'm actually using in practice right now, when writing a debug util for scylladb/scylladb#25679). (entry) is the static struct in this example. And I use its address as the metadata header for each event in the trace.
struct tracepoint_entry {
const char* name;
const char* file;
int line;
const char* function;
const char* signature;
void *rip;
};
#define TRACEPOINT(eventlevel, tp_name, ...) { \
__label__ rip; \
using namespace seastar; \
static constexpr auto sig __attribute__((section("tracepoint_signatures"), used)) = COMPUTE_SIGNATURE(SIG(__VA_ARGS__)); \
static constexpr char namearr[] __attribute__((section("tracepoint_names"), used)) = tp_name; \
static constexpr char filearr[] __attribute__((section("tracepoint_files"), used)) = __FILE__; \
static constexpr tracepoint_entry entry __attribute__((section("tracepoints"), used)) = { \
.name = namearr, \
.file = filearr, \
.line = __LINE__, \
.function = __PRETTY_FUNCTION__, \
.signature = sig.data(), \
.rip = &&rip, \
}; \
rip: \
size_t sz = compute_size(EXTRACT_ARGS(__VA_ARGS__)); \
auto out = local_tracer->write(eventlevel, sz + 16); \
seastar::write_le<uintptr_t>(reinterpret_cast<char*>(out), reinterpret_cast<uintptr_t>(&entry)); \
out += sizeof(uintptr_t); \
seastar::write_le<uint64_t>(reinterpret_cast<char*>(out), __rdtsc()); \
out += sizeof(uint64_t); \
serialize_tracepoint(out __VA_OPT__(,) EXTRACT_ARGS(__VA_ARGS__)); \
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michoecho why do we need this black magic incantation with __attribute__((section("..."), ised))? Woudln't simple static suffice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michoecho why this woulnd't work?
struct slim_data* sd;
void *foo(const char *file, int line, int column) {
asm ("1: lea 2f, %0 \n"
".pushsection .rodata.something \n"
"2: \n"
".quad 1b \n"
".quad %1 \n"
".long %2, %3 \n"
".popsection"
: "=r"(sd) : "i"(file) : "i"(line), "i"(column));
...
}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this woulnd't work?
Because Avi's wish would be to combine the instruction pointer and the source location (and maybe __PRETTY_FUNCTION__ too, why not) into a static, constant-initialized struct, so that all statically-known info can be represented inside a task by a single pointer at runtime. And that's what the piece of asm is trying to accomplish.
But if you want that struct to be constant-initialized, you must initialize it with constants. And in your example, you are trying to initialize it with not-constants. (If the function is always inlined, then the values of line and column and file are known at compile time, but something like this doesn't fit into the type system. A function must stand on its own, there's no "post-inlining constexpr"). It will be rejected by the compiler. (The "i" in "i"(line) stands for "immediate". This in x86 terms means a constant woven directly into the encoding of the instruction. And you are trying to pass a variable there).
why do we need this black magic incantation with
__attribute__((section("..."), used))?
I want all tracepoint_entry structs to be collected by the linker into their own section (tracepoints here) so that my trace decoder can later, while preparing for decoding, look at the Scylla binary and iterate over all tracepoint_entry structs scattered inside it, to get the signatures for all event types in the trace.
If they aren't packed into a section, they will be in random places in .rodata, and they can't be found without knowing their names. If they are packed into a section, my decoder can open the Scylla executable, find its tracepoints section, cast it to a tracepoint_entry[] array, iterate over it, and generate a corresponding decoder for each TRACEPOINT in the program.
For example, I can put
TRACEPOINT(event_level::debug, "io_begin", "class", io_request.class, "id", io_request.id, "size", io_request.size);anywhere in the program, and it will produce a
tracepoint_entry{.name = "io_begin", .signature = "class:u8,id:u64,size:u64", ...}
struct instance to the tracepoints section at some address X, and the decoder will iterate over X (among others), see the signature and use it generate a decoder function which reads a
struct io_begin {
uint8_t class;
uint64_t id;
uint64_t size;
};
from the trace whenever the header is X.
And then the functions generated from signatures can be compiled into the actual decoder program.
Other sections are not needed for anything, they are just there to keep the metadata strings neatly ordered in the ELF.
The used is probably not needed either, I just wanted to make sure the linker (and/or LTO) won't play garbage collection tricks on me.
Anyway, all that's not very applicable in this thread, because macros are off the table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michoecho amazing, thank you for this explanation!
@avikivity I'm 80% sure that the problem was at compile time, not decode time. But I don't remember my problem well, I only have the vague memory that I was getting the RIPs I wanted but the file and line info for those RIPs in the DWARF was wrong(/useless) sometimes. |
5f0d144 to
37a4a68
Compare
|
Reverted "slim" source_location commit, we're back to |
Add an empty, default constructed std::source_location object to the task object and getter / setter.
Add calls to update `resume_point` variable with location of next resume location to all `await_suspend` functions and `then` functions.
Add resume point locations to `tasktrace` object. Update `formatter::format` to print source location of next resume alone with task type.
37a4a68 to
9762ed0
Compare
|
Patch rebased, "upgraded" two |
Adds source location of next resume point to
taskobjects. When constructingtasktraceobject, source location of resume points will be added to it. When printing, the source location of next resume point will be printed as well, hopefully improving debug experience.Example:
Partially fixes #2381