Merge pull request #32 from riscv-non-isa/Issue-23

IainCRobertson · web-flow · commit bdb1bbce7c31 · 2025-11-06T01:55:40.000-08:00
Reworked multi-memory-access data trace requirements - Issue-23
diff --git a/src/hti.adoc b/src/hti.adoc
@@ -586,66 +586,6 @@ been returned to the hart from the memory system.
 
 The two parts of a split load are associated by use of a transaction ID.
 
-The Zc (code-size reduction) extension introduced push and pop
-instructions (_cm.push_, _cm.pop_, _cm.popret_ and _cm.popretz_) that
-each result in multiple loads or stores. To allow the resulting loads or
-stores to be associated with the correct instruction, these
-multi-memory-access instructions (and any other future instructions with
-similar characteristics) must be reported on the instruction trace
-interface multiple times (once for each individual load or store) using
-*itype* 0 except for the final load or store, which must retire using
-the natural *itype* for the instruction (for example, a _cm.popret_
-instruction must use *itype* 13 for the final load to signal the
-return). The instruction address reported will be the same for each
-occurrence.
-
-The following illustrations show the retirement sequences when a single
-_cm.push_ or _cm.popret_ is used to push or pop 4 registers from the
-stack. They assume a RISC-V to encoder interface that can report a block
-of 1 or more retired instructions and one load or store per cycle. Each
-comprises 4 elements, and shows the instruction information reported for
-each load and store. As detailed in section
-<<sec:InstructionTraceInterface>>, this takes the form of the address
-of an instruction, the length of the block (1 for a single instruction)
-and the type of the last instruction in the block. In each element,
-’Block’ indicates a block of 1 or more instructions (i.e. could also be
-a single instruction), whereas ’Single’ indicates a single instruction
-(i.e. a block with a length of 1).
-
-A _cm.push_ is equivalent to 4 store instructions:
-
-. Block - last instruction is _cm.push_, *itype* 0 (data trace interface
-reports 1st store);
-. Single - _cm.push_, *itype* 0 (data trace interface reports 2nd
-store);
-. Single - _cm.push_, *itype* 0 (data trace interface reports 3rd
-store);
-. Block - 1st instruction is _cm.push_, *itype* dependent on last
-instruction in block (data trace interface reports 4th store);
-
-A _cm.popret_ is equivalent to 4 loads and a return:
-
-. Block - last instruction is _cm.popret_, *itype* 0 (data trace
-interface reports 1st load);
-. Single - _cm.popret_, *itype* 0 (data trace interface reports 2nd
-load);
-. Single - _cm.popret_, *itype* 0 (data trace interface reports 3rd
-load);
-. Single - _cm.popret_, *itype* 13 (data trace interface reports 4th
-load);
-
-If an exception occurs part way through the sequence of loads or stores
-initiated by such an instruction, and the instruction is re-executed
-after the exception handler has been serviced, the load or store
-sequence must recommence from the beginning.
-
-[NOTE]
-====
-This is required for data trace only. If data trace is not
-implemented, the push or pop may instead be reported just once in the
-normal way when all associated loads or stores complete successfully.
-====
-
 [[sec:DataTraceInterface]]
 === Data Trace Interface
 
@@ -695,7 +635,7 @@ is high
 |*lid*[_lrid_width_p_-1:0] | S | Split Load ID. Valid when *lresp* is
 non-zero
 |*sdata*[_sdata_width_p_-1:0] | S | Store data. Valid when *dretire* is
-high and access is a store (*dtype* is 1) or atomic (*dtype* is 8 - 14).
+high and access is a store (*dtype* is 1 or 3) or atomic (*dtype* is 8 - 14).
 |*ldata*[_ldata_width_p_-1:0] | S | Load data. Valid when *lresp* is
 non-zero
 |===
@@ -709,8 +649,8 @@ non-zero
 | *Value* | *Description*
 |0| Load
 |1 | Store
-|2 | reserved
-|3 | reserved
+|2 | Multi-memory-access Load
+|3 | Multi-memory-access Store
 |4 | CSR read-write
 |5 | CSR read-set
 |6 | CSR read-clear
@@ -726,8 +666,7 @@ non-zero
 |===
 
 The maximum value of _dtype_width_p_ is 4. However, if only loads and
-stores are supported, _dtype_width_p_ can be 1. If CSRs are supported
-but atomics are not, _dtype_width_p_ can be 3.
+stores are supported, _dtype_width_p_ can be 1. If multi-memory-access instructions are also supported, _dtype_width_p_ can be 2. If CSRs are supported but atomics are not, _dtype_width_p_ can be 3.
 
 Atomic and CSR accesses have either both load and store data, or store
 data and an operand. For CSRs and unified atomics, both values are
@@ -769,8 +708,8 @@ However, that is not the case for data trace. Consider two scenarios:
 * Case 2: 1st block contains instructions 1, 2; second block contains 3,
 4, 5
 
-Given that *iretire* is non-zero in the same cycle that the data access
-retires, the encoder knows the address of the 1st and last instructions
+The signals from the <<sec:InstructionTraceInterface>> that provide details about the instruction block containing a data access instruction must be valid when *dretire* is active.
+Given this, the encoder knows the address of the 1st and last instructions
 in a block, but does not know precisely where in the block the data
 access is. In both cases, the first block matches the filtering criteria
 (it contains the address of instruction 1), and the second block does
@@ -796,3 +735,27 @@ access is associated with, so the encoder knows which block address to
 combine with the LSBs in order to construct the actual data access
 instruction address. 1 bit for 2 blocks per cycle, 2 bits for 4, and so
 on.
+
+==== Multi-memory-access Instructions
+
+[NOTE]
+====
+This section introduces a normative change from the behavior described in the E-Trace specification.  This is necessary because the original behavior would result in incorrect N-Trace instruction trace.  It could also result in misleading E-Trace instruction trace when one of these instructions traps part way through.
+
+The previous requirement to report multi-access instructions as retired multiple times (once for each load or store) is withdrawn.  They should be reported as retired only once in the same way as any other instruction.
+====
+The Zcmp and Vector extensions include instructions that result in multiple loads or stores, and future extensions may add other instructions with similar characteristics.  It is not practical for the hart to provide information about all the loads or stores initiated by such an instruction simultaneously when the instruction retires; it must be provided for each load or store in turn as it completes.  This means that load/store information for these instructions is provided to the decoder before the instruction itself retires.  This in turn means that *dretire* will be active when *iretire* is not, which requires that the signals from the <<sec:InstructionTraceInterface>> that provide details about the instruction, and that are normally defined to be valid when *iretire* is non-zero, must also be valid when *dretire* is 1.
+
+These multi-memory-access instructions may be interrupted part-way through their access sequence, or a trap may occur on one of the accesses.  This can lead to difficulties correlating the data and instruction trace streams.  Consider the following situation:
+
+* Instruction "X" retires;
+* A vector store that performs 4 stores starts executing;
+* 2 stores are completed and then a trap occurs on the 3rd store;
+* The trap handler includes at least one store;
+* On returning from the trap handler, the vector store instruction resumes, performs the remaining stores and then retires.
+
+The program flow decoded from the instruction trace stream will show that a trap occurred following instruction "X", with the vector store following the trap handler return.  On the other hand, the data trace stream will have the 2 vector stores that occurred before the trap, then the stores and loads from the trap handler, then the remaining vector stores.  It can be seen here that the order of the stores in the data trace stream does not match the order of store retirements in the instruction trace stream.
+
+In order to allow the decoder to resolve ordering issues such as this, and correlate the loads and stores with the correct instruction, loads or stores from multi-memory-access instructions are identified using dedicated *dtype* encodings.  On encountering these loads or stores in the data trace stream, the decoder must set these aside in a stack-like structure until a multi-access load or store instruction is encountered in the instruction trace stream.  It can then correlate the appropriate number of the most recent multi-memory-access loads or stores with the instruction.
+
+Some multi-access instructions restart from the beginning following a trap, and this complicates the correlation process, but such repeat loads or stores can be identified from their addresses.  For example, a _cm.push_ instruction that performs 4 stores and is interrupted after the 2nd store will result in a sequence of 6 stores at addresses A, A+1, A, A+1, A+2, A+3.  This will only happen when the multi-access load/store retires immediately after a trap-return instruction such as _mret_ or _sret_, and the number of repeated accesses can be determined from the addressing pattern.
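Reviewer note (not part of the patch): the stack-like correlation described in the new section could be sketched roughly as follows. This is a minimal illustration only, not the decoder algorithm mandated by the specification. The class and method names (`DataAccess`, `Correlator`, `on_data`, `on_multi_access_retire`) and the example addresses are hypothetical; the only values taken from the diff are the *dtype* encodings 2 and 3. It assumes the decoder already knows how many accesses the retiring instruction performs, and it does not attempt to detect the repeated accesses that occur when an instruction restarts after a trap.

```python
from dataclasses import dataclass, field

# dtype encodings from the updated dtype table in this diff.
DTYPE_MMA_LOAD = 2   # multi-memory-access load
DTYPE_MMA_STORE = 3  # multi-memory-access store


@dataclass
class DataAccess:
    """One record from the data trace stream (hypothetical shape)."""
    dtype: int
    addr: int


@dataclass
class Correlator:
    """Sets multi-memory-access records aside until the instruction retires."""
    pending: list = field(default_factory=list)   # stack-like holding area
    ordinary: list = field(default_factory=list)  # all other accesses

    def on_data(self, access: DataAccess) -> None:
        # Multi-memory-access loads/stores are held back; other accesses
        # are passed through in the order they were seen.
        if access.dtype in (DTYPE_MMA_LOAD, DTYPE_MMA_STORE):
            self.pending.append(access)
        else:
            self.ordinary.append(access)

    def on_multi_access_retire(self, count: int) -> list:
        # Associate the `count` most recent held-back accesses with the
        # multi-memory-access instruction seen in the instruction trace.
        if count < 0 or count > len(self.pending):
            raise ValueError("not enough pending multi-memory-access records")
        split = len(self.pending) - count
        taken = self.pending[split:]
        del self.pending[split:]
        return taken
```

Applied to the vector store scenario above (two stores, a trap handler store, then the remaining two stores), the handler's ordinary store passes straight through while the four *dtype* 3 records are held and then attached to the vector store when it appears in the instruction trace.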