@andrewcrawley
Contributor

Extends the debug adapter protocol to support reading arbitrary memory, disassembling code, and accessing registers.

CC: @weinand @gregg-miskelly

Member

@gregg-miskelly gregg-miskelly left a comment

Otherwise LGTM

@haneefdm

Pardon my intrusion. My understanding is that there can be a registers scope per frame on the stack. Yes, only one registers scope (which can have sub-scopes) per frame but in the context of that frame. The global registers values are really in the context of the leaf frame for the thread that got interrupted. Debuggers know which registers are saved on the stack according to the ABI and how to retrieve those values when queried. Am I making sense?

Happy to research it and point to docs if you want.

},
"instructionCount": {
"type": "integer",
"description": "Number of instructions to disassemble starting at the specified location and offset. An adapter must return exactly this number of instructions - any unavailable instructions should be replaced with an implementation-defined 'invalid instruction' value."

Does instructionCount translate well to variable instruction lengths? I mean it is hard to ask for or deliver a count. Both Intel (up to 15 bytes) and ARM have variable-length instructions. I see issues with developing both a DA and a DA's client. Should this be more like a memory read request, where you specify the number of bytes to decode? Like @andrewcrawley found out, it is a lot trickier than one would imagine :-) There are alignment issues as well. Once we prototype, we will know more.

@haneefdm haneefdm May 26, 2019

Usually, a disassembly request with a symbol/section reference always succeeds. Normally, we make that request with what we find on the stack trace.

Sometimes there can be data embedded along with instructions, usually at the beginning or the end of a function.

I've uploaded a file to show how data can be interspersed with instructions...I just took a random file I had, disassembled it and pasted a few functions.

https://github.com/haneefdm/cortex-debug-samples/blob/master/demo/tmp.s

Contributor Author

The "lazy scrolling" UI knows that it needs to request N lines of disassembly to fill in after scrolling, so it issues a request saying "starting at <memoryReference + offset>, provide the next N instructions". It's up to the DA to handle variable-length instructions by continuing to disassemble until it has the requested number of instructions.

The only string field on a DisassembledInstruction that is interpreted in any way by the UI is the address field. This means that you can return anything you want in the other fields and the disassembly UI will just display them as-is, so it's fine to use ".word 0x00c1b280" or whatever as the instruction value if a memory location is actually known to contain data.

Here's an example. I compiled the following code:

#include <iostream>

int main()
{
	start:
	std::cout << "Hello World!\n";
	goto start;
}

In the VS disassembler, the DisassembledInstructions are mapped as follows:

[Image omitted]

@andrewcrawley Yes, nice, so the instruction can be data as well -- a .word type thing. I was strictly thinking of 'instructions'. And I think getting 'N' instructions as part of a request would also work fine. Does the instruction count include the .word type stuff? To me, the term instruction is not precise, but I can't offer an alternative. The proposal says it must return exactly this number of instructions.

My assumption (that you confirmed) is that if the memoryReference + offset is bogus, so will the output be. It is the DA client's responsibility to make a proper request. The DA will make no attempt to figure out whether the request is proper, or to adjust it.

I do think negative offsets are super useful for implementing lazy query/rendering. But how does the DA client make such a request -- a valid one? Can I just request 1000 instructions before the <memoryReference + offset>, and then from there the instructionOffset applies? But why have two types of offsets? What if the DA client asks for 1000 instructions before, but there aren't that many out there? I think this puts a lot of responsibility on the DA client to form proper requests.

I am thinking offset is not needed, assuming your heuristic works, especially for negative instruction offsets, and the client can easily do the <memoryReference + offset> calculation anyway.

If available, I would love to look at your vsdbg implementation. Then everything might make sense.

Finally, isn't this a holiday for you?

Contributor Author

@andrewcrawley andrewcrawley May 28, 2019

A DisassembledInstruction is basically equivalent to a line of text in the disassembly UI - it represents data decoded at a specific address, whether that is an instruction, data, or even inaccessible (such as an unmapped page). If the host asks for 1000 instructions (in either direction), it's up to the DA to return 1000 DisassembledInstruction objects. If there are only 100 instructions after the current location, and then an unmapped page of memory, the DA should return 100 valid instructions, then 900 "invalid" instructions, which must contain a valid decoded address (generally incremented [or decremented, if using negative offsets] from the last instruction address by whatever the minimum instruction size for the architecture is, taking any required alignment into account), but otherwise can contain whatever text you want to use to represent an invalid location. Most engines in VS just use "??" for all fields.
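
A minimal sketch of the padding rule described above, assuming a hypothetical `decode_one` callback that returns `(text, size)` for a decodable address and `None` otherwise. The class, function, and field names here are illustrative and not part of the protocol; only the idea of always returning exactly the requested number of entries, each with a valid address and "??" placeholders where decoding fails, comes from the discussion.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class DisassembledInstruction:
    address: str                  # the only field the UI is said to interpret
    instruction: str              # free-form text, displayed as-is
    instruction_bytes: str = ""   # free-form text, displayed as-is


# Hypothetical decoder signature: address -> (text, size in bytes), or None.
Decoder = Callable[[int], Optional[Tuple[str, int]]]


def disassemble_exact(start: int, count: int, decode_one: Decoder,
                      min_size: int = 1) -> List[DisassembledInstruction]:
    """Return exactly `count` entries, padding with "??" when decoding fails."""
    out: List[DisassembledInstruction] = []
    addr = start
    while len(out) < count:
        decoded = decode_one(addr)
        if decoded is None:
            # Invalid location: keep a valid address, step by the minimum size.
            out.append(DisassembledInstruction(hex(addr), "??", "??"))
            addr += min_size
        else:
            text, size = decoded
            out.append(DisassembledInstruction(hex(addr), text))
            addr += size
    return out


def toy_decode(addr: int) -> Optional[Tuple[str, int]]:
    # Toy program: three instructions, then nothing decodable.
    program = {0x1000: ("push", 1), 0x1001: ("mov", 3), 0x1004: ("ret", 1)}
    return program.get(addr)


result = disassemble_exact(0x1000, 5, toy_decode)
for entry in result:
    print(entry.address, entry.instruction)
```

The same loop would cover the negative direction if the address were decremented instead, which is left out here to keep the sketch short.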

The reason for having both a byte and instruction offset is because the VS disassembly API allows consumers to say "go to this address (represented by memoryReference + offset), skip instructionOffset instructions, then decode the next instructionCount instructions", and the disassembly window makes use of this by remembering the address of the last instruction decoded. A UI could also work entirely by instructionOffset, in which case the byte offset wouldn't be necessary, but that's not how VS works.
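
One way a client could form the follow-up requests for lazy scrolling, sketched under the assumption that it remembers the address of the first and last lines it currently displays. The helper name `next_page_arguments` is hypothetical; the argument keys mirror the fields discussed in this PR (`memoryReference`, `offset`, `instructionOffset`, `instructionCount`).

```python
from typing import Dict, Union


def next_page_arguments(anchor_address: str, lines_needed: int,
                        scrolling_up: bool = False) -> Dict[str, Union[str, int]]:
    """Build disassemble arguments relative to a remembered instruction address.

    Scrolling down: anchor is the last displayed instruction, so skip it (+1).
    Scrolling up: anchor is the first displayed instruction, so step back N.
    """
    instruction_offset = -lines_needed if scrolling_up else 1
    return {
        "memoryReference": anchor_address,
        "offset": 0,
        "instructionOffset": instruction_offset,
        "instructionCount": lines_needed,
    }


down = next_page_arguments("0x10080f0c", 20)
up = next_page_arguments("0x10080edc", 20, scrolling_up=True)
print(down)
print(up)
```

A client that works purely by instruction offsets would keep a single anchor and vary only `instructionOffset`, which matches the alternative mentioned above.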

vsdbg is not open source, unfortunately, but I believe MIEngine uses a similar heuristic - take a look at SeekBack in https://github.com/microsoft/MIEngine/blob/master/src/MIDebugEngine/Engine.Impl/Disassembly.cs

Yes, today is a holiday, but I'm happy to have a distraction from cleaning out my garage : )

@andrewcrawley Thank you, sir, for your patience. I was not aware of what vsdbg was currently doing.

},
"instructionOffset": {
"type": "integer",
"description": "Optional offset (in instructions) to be applied after the byte offset (if any) before disassembling. Can be negative."

Is a negative instructionOffset implementable? same goes for negative offset

Contributor Author

Once the adapter turns the memoryReference value back into an address, the offset value is simply added to it, so negative values are easy to handle there. Handling negative instructionOffset values may require heuristics on architectures with variable-length instructions, which is up to the DA to implement, as we did in vsdbg.
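
To illustrate the order of operations, here is a small sketch that assumes the adapter encodes a memoryReference as a numeric string such as "0x1000". That encoding is an assumption for this example; the protocol treats the value as opaque to the client. The helper names and the `size_at` callback are hypothetical.

```python
from typing import Callable


def resolve_address(memory_reference: str, offset: int = 0) -> int:
    """Turn the reference back into an address, then add the byte offset."""
    # int(..., 0) accepts "0x..." as well as plain decimal strings.
    return int(memory_reference, 0) + offset


def skip_forward(address: int, instruction_offset: int,
                 size_at: Callable[[int], int]) -> int:
    """Apply a non-negative instruction offset by walking instruction sizes."""
    if instruction_offset < 0:
        # Walking backwards needs a heuristic on variable-length architectures.
        raise NotImplementedError("negative instructionOffset needs a seek-back heuristic")
    for _ in range(instruction_offset):
        address += size_at(address)
    return address


# Toy instruction sizes keyed by address.
sizes = {0x1000: 1, 0x1001: 3, 0x1004: 2}

start = resolve_address("0x0ff0", 0x10)          # byte offset is plain addition
target = skip_forward(start, 2, sizes.__getitem__)
print(hex(start), hex(target))
```

Negative byte offsets go through the same addition, for example `resolve_address("0x1000", -16)`.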

Wow, I should check out how you did that. You said 'heuristics' as opposed to 'algorithm'. I can see going back to a previous symbol and then going forward, but I bet you did something way smarter.

Member

@gregg-miskelly gregg-miskelly May 28, 2019

There are two techniques that I am aware of for reverse disassembly: As you suggested, look for symbols -or- heuristically by repeating the following.

You can see the MIEngine's basic implementation of this stuff by looking at SeekBack.

Heuristic for backwards disassembly without symbols:

  1. start_address = last_known_address - numInstructionsToSeekBack*maxInstructionSize
  2. If start_address is unreadable memory - move forward to the next readable memory address.
  3. Disassemble forward from start_address to last_known_address
  4. If the disassembly stream ends with an instruction boundary at 'last_known_address' and all the instructions along the way are valid - decide you are done
  5. Else - increment start_address by 1 byte and repeat from step 3
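
The steps above can be sketched as follows. This is not the MIEngine or vsdbg code; it uses a made-up toy encoding (an opcode byte N in 1..3 starts an N-byte instruction, anything else is invalid) so the example stays self-contained. All names are hypothetical, and "readable memory" is simplified to offsets inside the buffer.

```python
from typing import List, Optional

MAX_INSN_SIZE = 3  # toy architecture: instructions are 1 to 3 bytes long


def insn_length(memory: bytes, pos: int) -> Optional[int]:
    """Toy decoder: opcode byte N (1..3) starts an N-byte instruction."""
    if pos < 0 or pos >= len(memory):
        return None
    length = memory[pos]
    if 1 <= length <= MAX_INSN_SIZE and pos + length <= len(memory):
        return length
    return None


def decode_run(memory: bytes, start: int, target: int) -> Optional[List[int]]:
    """Steps 3-4: decode forward; succeed only if we land exactly on target."""
    offsets: List[int] = []
    pos = start
    while pos < target:
        length = insn_length(memory, pos)
        if length is None:
            return None
        offsets.append(pos)
        pos += length
    return offsets if pos == target else None


def seek_back(memory: bytes, last_known: int, count: int) -> List[int]:
    """Return start offsets of up to `count` instructions before last_known."""
    # Step 1: furthest possible start; step 2: clamp to readable memory.
    start = max(last_known - count * MAX_INSN_SIZE, 0)
    while start < last_known:
        run = decode_run(memory, start, last_known)
        if run is not None:
            return run[-count:]
        start += 1  # step 5: slide forward one byte and retry
    return []


# Instructions start at offsets 0, 1, 3, 6 and 7; zero bytes are operands.
code = bytes([1, 2, 0, 3, 0, 0, 1, 1])
print(seek_back(code, last_known=7, count=2))
```

In this toy encoding operand bytes never decode as instructions, so a wrong start is always rejected. On a real instruction set a misaligned start can decode cleanly by accident, which is why this remains a heuristic.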

Yup @gregg-miskelly, that is what I was thinking we would have to do :-)

I pretty much typed up your heuristic and said to myself, oh you guys must have something much much better. I thought I would look stupider typing it out. Some processors have instruction alignment requirements (like 2-byte) as well.

Member

Yup. This problem is certainly a lot easier on architectures that have such restrictions.

"type": "string",
"description": "Text representing the instruction and its operands, in an implementation-defined format."
},
"symbol": {

More of a question. Does it strictly have to be an actual symbol or can it be something like <main+0x52>

Member

I believe the intent here is to display things like goto labels. So you probably wouldn't want to return something like <main+0x52> as that would presumably stick a label on every instruction. That said, I would say that a proper client UI shouldn't try and interpret symbol at all. So if a debug adapter thought that users would like to see <main+0x52> then that would be totally acceptable.

Sorry, I was wrong about symbols. There is no such thing as <main+52> as a symbol. I was more worried about what appears after the instruction. Some examples are below: the first group is an actual symbol; the others are just (helpful) annotations, which sometimes appear as a comment and sometimes not. @andrewcrawley confirmed that, other than address, the DA can do anything with the other fields.

0x10080edc <main>:
0x10080edc:	b480      	push	{r7}

0x10080ef0:	4b07      	ldr	r3, [pc, #28]	; (10080f10 <main+0x34>)
0x10080f0c:	e7ef      	b.n	10080eee <main+0x12>
0x10002590: 03 4b           	ldr	r3, [pc, #12]	; (0x100025a0 <main+24>)
0x1000259a: 01 f0 fd f8     	bl	0x10003798 <Cy_SysLib_Delay>
0x1000259e: f7 e7           	b.n	0x10002590 <main+8>

It is up to the client to decide how to display all of this, but is it the DA's job to consolidate the first two lines above into one DisassembledInstruction, because there is no null instruction?

Member

@gregg-miskelly gregg-miskelly May 28, 2019

> It is up to the client to decide how to display all of this, but is it the DA's job to consolidate the first two lines above into one DisassembledInstruction, because there is no null instruction?

Correct

"type": "string",
"description": "Raw bytes representing the instruction and its operands, in an implementation-defined format."
},
"instruction": {

@haneefdm haneefdm May 27, 2019

Clarification: so instruction can really be any (raw) string, including a comment or description? Examples:

 80242f2:	4b08      	ldr	r3, [pc, #32]	; (8024314 <Cy_Flash_RAMDelay+0x60>)
 8024308:	f8d3 351c 	ldr.w	r3, [r3, #1308]	; 0x51c
 80242cc:	daf5      	bge.n	80242ba <Cy_Flash_RAMDelay+0x6>

I am not just focused on ARM. Just an example I have at the moment. I can find similar things with MS/Intel stuff.

Member

Yup, if an adapter wanted to return something like ldr r3, [pc, #32] ; (8024314 <Cy_Flash_RAMDelay+0x60>) that would be entirely fine.

"properties": {
"memoryReference": {
"type": "string",
"description": "Memory reference to the base location containing the instructions to disassemble."

@haneefdm haneefdm May 27, 2019

I think an additional requirement is for the address to be properly aligned to start of actual instruction. Or else, you get garbage as output?

Again, I am thinking of variable instruction lengths. Anything relative to the PC, or to symbols in the object file, or to other things on the stack is generally okay. Should it be the DA client's responsibility to make the proper address request, or does the DA figure it out?

Contributor Author

Correct. In the VS case, disassembly will generally be started from a location known to correspond with an instruction - the user can right-click a stack frame and select "go to disassembly", in which case we'll use the StackFrame.instructionPointerReference value, or the user can right-click in the source editor and select "go to disassembly", in which case we'll issue a gotoTargets request for the file / line / column, then use the GotoTarget.instructionPointerReference value.

Nothing stops the user from bringing up the disassembly UI and entering a random address, of course, in which case the disassembly will likely be nonsense.

@haneefdm

@andrewcrawley You said having an offset from a memoryReference was useful/helped, based on some recent work of yours. I am 110% positive you found that to be the case. May I ask how it was useful? I ask because the DA client can easily do that calculation and the DA implementation becomes more complicated.

@haneefdm

About Disassembly: OMG, sorry to be so annoying. cringe. I am sure this was already thought out and I hope I am not wasting your time.

As of today, the protocol does not expose querying symbols and their current (relocated) addresses. How is a DA client supposed to figure out what an address of anything is to make a query? With Windows/PE, they can/will be rebased so reading the dll/exe is useless. ELF may not have absolute addresses -- virtualized or not.

But generally, the debugger knows. Or could memoryReference also support a symbol reference? My head hurts, but if you give me some time, I can figure it out.

@andrewcrawley
Contributor Author

@haneefdm:

> Pardon my intrusion. My understanding is that there can be a registers scope per frame on the stack. Yes, only one registers scope (which can have sub-scopes) per frame but in the context of that frame. The global registers values are really in the context of the leaf frame for the thread that got interrupted. Debuggers know which registers are saved on the stack according to the ABI and how to retrieve those values when queried. Am I making sense?
>
> Happy to research it and point to docs if you want.

Correct. You'd get the registers for whichever frame's frameId you pass when making the scopes request.

> @andrewcrawley You said having an offset from a memoryReference was useful/helped, based on some recent work of yours. I am 110% positive you found that to be the case. May I ask how it was useful? I ask because the DA client can easily do that calculation and the DA implementation becomes more complicated.

Certain use cases for the VS disassembly window required the ability to say "go to this address and disassemble N instructions", and the way to represent an arbitrary address in the protocol is via a memoryReference + offset value.

> About Disassembly: OMG, sorry to be so annoying. cringe. I am sure this was already thought out and I hope I am not wasting your time.
>
> As of today, the protocol does not expose querying symbols and their current (relocated) addresses. How is a DA client supposed to figure out what an address of anything is to make a query? With Windows/PE, they can/will be rebased so reading the dll/exe is useless. ELF may not have absolute addresses -- virtualized or not.
>
> But generally, the debugger knows. Or could memoryReference also support a symbol reference? My head hurts, but if you give me some time, I can figure it out.

If I understand correctly, you're asking how to get the address of "my_func" to use as (for example) the start point for disassembly? VS handles this by issuing an evaluate request for "my_func", then uses the memoryReference provided on the response as the start point for disassembly.

@haneefdm

Okay, thanks.

@gregg-miskelly
Member

@haneefdm in case you didn't see, this PR was superseded by:

microsoft/debug-adapter-protocol#49
microsoft/debug-adapter-protocol#50

@haneefdm

@gregg-miskelly Yup, I saw that, somehow I got notified. Cracked me up when @weinand said "where the truth lives" microsoft/debug-adapter-protocol#49

@gregg-miskelly
Member

gregg-miskelly commented May 28, 2019

> As of today, the protocol does not expose querying symbols and their current (relocated) addresses. How is a DA client supposed to figure out what an address of anything is to make a query? With Windows/PE, they can/will be rebased so reading the dll/exe is useless. ELF may not have absolute addresses -- virtualized or not.

Are you talking about a scenario where the user wants to navigate to the disassembly of a function by inputting the function's name? If so, the way that would work would be to issue an evaluate request, using the text of the function name. The response from that should contain a memory reference which could then be used to issue a disassembly request.
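
As a rough illustration of that flow, the sketch below builds the two request messages as plain dictionaries. The evaluate response is simulated, and the `frameId`, `result`, and address values are invented for the example; `make_request` and `disassemble_request_for` are hypothetical helper names, not part of any client library.

```python
import json
from itertools import count
from typing import Any, Dict

_seq = count(1)  # each outgoing message gets the next sequence number


def make_request(command: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    return {"seq": next(_seq), "type": "request",
            "command": command, "arguments": arguments}


def disassemble_request_for(evaluate_response: Dict[str, Any],
                            instruction_count: int) -> Dict[str, Any]:
    """Use the memoryReference from an evaluate response as the start point."""
    reference = evaluate_response["body"].get("memoryReference")
    if reference is None:
        raise ValueError("evaluate result has no memoryReference")
    return make_request("disassemble", {
        "memoryReference": reference,
        "offset": 0,
        "instructionOffset": 0,
        "instructionCount": instruction_count,
    })


evaluate_request = make_request("evaluate",
                                {"expression": "my_func", "frameId": 1000})

# Simulated adapter reply; a real one arrives over the DAP connection.
simulated_response = {
    "type": "response", "request_seq": evaluate_request["seq"],
    "success": True, "command": "evaluate",
    "body": {"result": "0x00401000 {my_func}", "variablesReference": 0,
             "memoryReference": "0x00401000"},
}

disassemble_request = disassemble_request_for(simulated_response, 50)
print(json.dumps(disassemble_request, indent=2))
```

The client never parses the reference here; it passes the string back unchanged, which keeps the address format entirely in the adapter's hands.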

If you are asking more about how a native debugger actually implements this feature - a native debugger needs to be aware of what modules are loaded into the target process and what base address the module is loaded at (or addresses on platforms where the loader is allowed to relocate different sections of the image). Then it can use the base address of the module combined with the relative virtual address (RVA) that it obtained from the dll/pdb/elf/dwarf info to decide what address(es) the function is at.
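
The arithmetic in the simple case (one base address per module) is just an addition, sketched here with invented values. `LoadedModule` and `runtime_address` are hypothetical names, and per-section relocation is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadedModule:
    name: str
    base_address: int  # where the loader actually placed the image


def runtime_address(module: LoadedModule, rva: int) -> int:
    """Runtime address = module base address + relative virtual address."""
    return module.base_address + rva


# Invented example values: the RVA would come from the symbol information.
module = LoadedModule("module.dll", 0x7FF600000000)
address = runtime_address(module, 0x1234)
print(hex(address))
```

Because the base address is only known once the module is loaded, reading the file on disk alone cannot give the client this value.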

@haneefdm

@gregg-miskelly Less worried about how the debugger figures it out.

From a DA client's perspective, it needs to be a fully qualified name, right? That is, which module/dll/so/elf-section/etc., because there can be duplicates without fully qualified names. I was trying to figure out how a DA client gets a proper memory reference for an arbitrary function.

What helps is that all evaluate requests happen in the context of a frame/scope. Perhaps this is all that is needed.

@gregg-miskelly
Member

gregg-miskelly commented May 28, 2019

> What helps is that all evaluate requests happen in the context of a frame/scope. Perhaps this is all that is needed.

Correct. The DA could also require qualification in cases where things are ambiguous. For example, the native VS debugger supports 'module.dll!Function', and even stranger syntaxes for static functions.
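
A toy sketch of how an adapter might require qualification only when a name is ambiguous. The symbol table, module names, and addresses are invented, and only the 'module!function' form is handled; this is not how the VS debugger is implemented.

```python
from typing import Dict

# Hypothetical symbol table: module name -> {function name -> runtime address}
SYMBOLS: Dict[str, Dict[str, int]] = {
    "app.exe": {"main": 0x401000, "helper": 0x401200},
    "module.dll": {"Function": 0x10001000, "helper": 0x10002000},
}


def resolve(expression: str) -> int:
    """Resolve 'module!function' directly, or a bare name if it is unique."""
    if "!" in expression:
        module, name = expression.split("!", 1)
        return SYMBOLS[module][name]
    matches = [(module, functions[expression])
               for module, functions in SYMBOLS.items()
               if expression in functions]
    if not matches:
        raise KeyError(f"unknown function: {expression}")
    if len(matches) > 1:
        modules = ", ".join(module for module, _ in matches)
        raise ValueError(f"'{expression}' is ambiguous; qualify it ({modules})")
    return matches[0][1]


print(hex(resolve("module.dll!Function")))
print(hex(resolve("main")))
```

Calling `resolve("helper")` raises `ValueError` in this sketch because both modules define it, while `resolve("app.exe!helper")` succeeds.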

@weinand
Contributor

weinand commented May 28, 2019

I've released support for registers and "experimental" support for disassembly and memory access to the DAP in both the "debug-adapter-protocol" and in the "vscode-debugadapter-node" repositories.

@weinand weinand closed this May 28, 2019
@haneefdm

You people are awesome. I did not add anything to what you already had thought through and your constraints. Wasted your time.

As punishment and for posterity, do you want me to document the clarifications and use cases as a summary? Once it is approved.

@gregg-miskelly
Member

@haneefdm if you have some clarifications which you think would help - I would suggest opening a PR against microsoft/debug-adapter-protocol. If you are talking about a longer document for the disassembly API - I am not sure where to put it. So it's up to you if you think it is worth your effort to find a place.

@haneefdm

@gregg-miskelly may I ask what the next steps are? and where I can help? my first wish is 'registers'

@andrewcrawley andrewcrawley deleted the registers-memory-disassembly branch May 29, 2019 04:46
@andrewcrawley
Contributor Author

@haneefdm: We should probably stop using this defunct PR as a discussion forum : ) I've replied to you on the MIEngine issue here: microsoft/MIEngine#816

@weinand
Contributor

weinand commented May 29, 2019

Yes please, this repo is only for the node.js based client library for DAP.
If you are not using these npm modules (or if you do not have issues with it :-), please don't use this repo.
