Generic QSPI with PYNQ-Z1 PL overlay
See the overview about this project in a PDF:
Create a generic QSPI interface on the PYNQ-Z1 that can generate any QSPI transaction. It does not depend on the QSPI memory devices supported by the AMD/Xilinx axi_quad_spi IP, which is not generic.
It provides flexibility in terms of:
- any 8-bit command to send (on lane 0 only or on all lanes, as 1-4-4 or 4-4-4)
- a 32-bit ADDR, on all lanes
- a 24-bit ALT field, on all lanes
- a turnaround of just 2 CLK cycles
- writing N 32-bit words or reading N 32-bit words
- a QCLKfb signal: a QCLK signal coming back in, e.g. delayed by level shifters; alternatively, the read sampling on the falling edge of QCLK can be delayed by one SYS_CLK period (10 ns later)
- sampling a read response with the external QCLKfb instead of the internal QSPI clock, in order to compensate for the "round trip delay", or sampling 10 ns later
This is the entire PL overlay project.
ATTENTION:
It also needs the "PYNQ-Z1_GenericQSPI_QSPI" project, which provides the QSPI block used as an IP component in the block diagram.
This is the QSPI FPGA RTL sub-project, which also comes with a testbench (to simulate and test the QSPI block).
This block is used in the entire project "PYNQ-Z1_GenericQSPI_Blockdiagram".
The maximum QSPI clock speed is 50 MHz (CLK_DIV = 0), half of the PL clock speed. You can increase the PL PLL clock (3x faster), so a QSPI clock of up to 150 MHz is potentially possible.
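The clock relation above is simple arithmetic. A minimal sketch, assuming only what is stated here (at CLK_DIV = 0 the QSPI clock is half of the PL clock); the function name is made up for illustration and the divider formula for other CLK_DIV values is not covered:

```python
def max_qspi_clock_hz(pl_clock_hz: float) -> float:
    """Highest QSPI clock (CLK_DIV = 0): half of the PL clock."""
    return pl_clock_hz / 2


default_mhz = max_qspi_clock_hz(100e6) / 1e6  # default 100 MHz PL clock
faster_mhz = max_qspi_clock_hz(300e6) / 1e6   # PL clock raised 3x
print(default_mhz, faster_mhz)
```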
The Generic QSPI is an independent block (with a testbench) that implements QSPI primitives, e.g.: send an 8-bit CMD word, send a 32-bit ADDR, send a 24-bit ALT, generate a 2-clock-cycle turnaround, and do a read with a delayed QCLKfb signal (an input, delayed like all the data lanes, e.g. due to external level shifters).
You can modify it for other parameters, e.g. the number of turnaround cycles.
Only QSPI mode 3 is currently implemented, but you can change the VHDL code to support other modes.
The Generic QSPI block uses four registers:
- a WR register (the word to send)
- a RD register (the word sampled on data lanes)
- a CTL register (specifies what to do on the QSPI pins, e.g. send a 32-bit address, generate a turnaround, or generate the nCS signal as 1-of-4; it also selects whether to swap the byte endianness of the data WR and RD parts; the CMD, ADDR and ALT are always big endian)
- a STS register (has the shift-out or sample-in finished?)
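The byte endian option for the data WR and RD words amounts to reversing the four bytes of a 32-bit word. A minimal sketch of that swap in Python; the helper name is hypothetical, and which CTL setting selects which byte order is not specified here:

```python
def swap_bytes32(word: int) -> int:
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes((word & 0xFFFFFFFF).to_bytes(4, "little"), "big")


swapped = swap_bytes32(0x11223344)
print(hex(swapped))
```

Applying the swap twice gives back the original word, so the same helper works for both the WR and the RD direction.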
The overlay contains two AXI_GPIO blocks that connect the Generic_QSPI IP block.
There is also an SPI block connected in the usual way.
A read transaction looks like this:
In order to connect the custom Generic QSPI we need four 32-bit registers. This can be done via an AXI_IPIF block. Here, to simplify, we use two axi_gpio IP blocks. Each provides one 32-bit register out and one 32-bit register in. We use two in order to connect all four of our Generic QSPI registers to the AXI bus (and the entire PYNQ overlay block diagram).
The drawback is: we had to generate a strobe signal via GPIO, so that the Generic QSPI block would realize: "there is a new command" (in the CTL register). This is done via a two-bit counter in bits [31:30] of the CTL port: any value different from the previous one triggers a new transaction.
An improved version could use an AXI_IPIF block and embed the same Generic QSPI RTL code, so that an AXI write would directly trigger what to do on the QSPI signals.
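The strobe mechanism can be sketched in Python as a small helper that stamps a changing two-bit counter into bits [31:30] of every CTL word. This is only an illustration: the class name is hypothetical, and the layout of the remaining CTL bits (apart from bit 27, the reset bit, which must stay 1) is not modeled.

```python
STROBE_SHIFT = 30        # counter lives in bits [31:30] of the CTL word
FIELD_MASK = 0x3FFFFFFF  # everything below the strobe counter


class CtlStrobe:
    """Stamp a changing 2-bit counter into each CTL word."""

    def __init__(self) -> None:
        self._count = 0

    def next_word(self, ctl_fields: int) -> int:
        # Advance the counter so this word differs from the previous one,
        # which is what the QSPI block detects as "new command".
        self._count = (self._count + 1) & 0x3
        return (ctl_fields & FIELD_MASK) | (self._count << STROBE_SHIFT)


strobe = CtlStrobe()
# 0x08000000 has only bit 27 set (reset released); other fields left at 0.
words = [strobe.next_word(0x08000000) for _ in range(4)]
print([hex(w) for w in words])
```

Even when the same command is sent twice in a row, consecutive CTL words differ in the top two bits, so each write starts a new transaction.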
The QSPI is used in a Python script (Jupyter Notebook). The Python code that handles the registers on the AXI_GPIO blocks is way too slow: it generates huge gaps between the transactions.
To speed this up, the Python code is augmented with C-code. The C-code does the entire transaction and is way faster.
Remark:
Bear in mind that even the C-code needs MMIO access, which is done by opening the virtual memory mapping ("mmap").
This is done in the Init() function. It keeps the memory-mapped device open until you call the Deinit() function.
So, all following writes to the QSPI CTL_REG are as fast as possible (using just Python scripts could open and close the memory-mapped device all the time, which slows things down a lot).
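The "map once, keep it open" pattern described in this remark can be illustrated with Python's mmap module. This is a sketch of the idea, not the code in Generic_QSPI.c: the class and its method names are hypothetical, and the demo maps an ordinary temporary file so that it runs without hardware.

```python
import mmap
import os
import struct
import tempfile


class MappedRegisters:
    """Keep one memory mapping open across many 32-bit register accesses."""

    def __init__(self, path: str, offset: int, length: int) -> None:
        self._fd = os.open(path, os.O_RDWR | getattr(os, "O_SYNC", 0))
        self._map = mmap.mmap(self._fd, length, offset=offset)

    def write32(self, reg_offset: int, value: int) -> None:
        struct.pack_into("<I", self._map, reg_offset, value & 0xFFFFFFFF)

    def read32(self, reg_offset: int) -> int:
        return struct.unpack_from("<I", self._map, reg_offset)[0]

    def close(self) -> None:
        self._map.close()
        os.close(self._fd)


# Demo on a temporary file standing in for the register window.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(4096))
    demo_path = tmp.name

regs = MappedRegisters(demo_path, offset=0, length=4096)  # "Init"
regs.write32(0x8, 0xDEADBEEF)   # many accesses reuse the same mapping
readback = regs.read32(0x8)
regs.close()                    # "Deinit"
os.remove(demo_path)
print(hex(readback))
```

On the board, the path would be "/dev/mem" and the offset the page-aligned base address assigned to the AXI_GPIO block in the block design; that access normally requires root rights.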
The C-code is located in file "Generic_QSPI.c". It has to be compiled as a shared library. In order to do so:
- place the C-code in a directory "/home/xilinx/c_code"
- compile it (step 1):
  gcc -fpic -c -O3 -Wall Generic_QSPI.c
- build the shared library (step 2):
  gcc -shared -o libGeneric_QSPI.so Generic_QSPI.o
This C-code shared library is used by the Python script via CFFI. It is way faster now.
ATTENTION:
The C-code can be used with or without flow control. If CLK_DIV is small (0..2), the Python script is slower than the QSPI transactions, so flow control is not needed.
Without flow control, the QSPI transactions become more seamless and faster (no gaps in between).
With a slower QSPI clock, either increase the PL PLL setting (up to 333.3 MHz, 3x faster) or use the C-code with flow control.
Whether flow control is used is selected via a parameter of the C_QSPI_Init() function.
With a faster PL PLL, the QSPI transactions are faster (it is possible to generate QSPI transactions at 75 MHz, up to 150 MHz).
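The difference between the two modes can be sketched as follows: with flow control, the software polls the STS register after every CTL write and only continues once the QSPI block reports that it has finished. The function, the fake device and the DONE_MASK bit position are all assumptions made for illustration; the real STS bit layout and the C implementation may differ.

```python
from typing import Callable, Iterable

DONE_MASK = 0x1  # hypothetical position of the "finished" flag in STS


def send_words(write_ctl: Callable[[int], None],
               read_sts: Callable[[], int],
               ctl_words: Iterable[int],
               flow_control: bool,
               max_polls: int = 100_000) -> int:
    """Write CTL words; optionally wait for STS 'finished' after each one."""
    sent = 0
    for word in ctl_words:
        write_ctl(word)
        if flow_control:
            for _ in range(max_polls):
                if read_sts() & DONE_MASK:
                    break
            else:
                raise TimeoutError("QSPI block did not report completion")
        sent += 1
    return sent


class FakeQspi:
    """Stand-in for the hardware: busy for a fixed number of STS polls."""

    def __init__(self, busy_polls: int) -> None:
        self.busy_polls = busy_polls
        self.remaining = 0
        self.written = []
        self.polls = 0

    def write_ctl(self, word: int) -> None:
        self.written.append(word)
        self.remaining = self.busy_polls

    def read_sts(self) -> int:
        self.polls += 1
        if self.remaining > 0:
            self.remaining -= 1
            return 0
        return DONE_MASK


dev = FakeQspi(busy_polls=2)
count = send_words(dev.write_ctl, dev.read_sts, [1, 2, 3], flow_control=True)
print(count, dev.polls)
```

Without flow control the loop never touches STS, which is why the transactions follow each other without gaps, but it is then up to the caller to make sure the QSPI block is fast enough to keep up.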
Remark:
How fast the Python and C-code can generate a transaction depends on the AXI clock. This clock is 100 MHz by default.
The PS (processor) runs at 550 MHz, so the AXI bus limits how fast the QSPI can be.
Increasing the PL PLL clock speeds things up remarkably.
The files needed are in the folder "overlay":
- design_1.bit and design_1.hwh: copy them into the PYNQ folder "/home/xilinx/pynq/overlays/design_1"
- generic_QSPI.ipynb is the notebook file: copy it to "/home/xilinx/jupyter_notebooks"
- libGeneric_QSPI.so is located in "/home/xilinx/c_code" on my system, but it can also be in any other folder (just modify the path in the notebook)
There are two versions. One assumes that Python sends every part of a transaction as a 32-bit word (including e.g. the CMD, which is an 8-bit transaction on QSPI). The other version, whose names are extended by "packed", assumes that Python generates a byte stream (with CMD as 8 bits, ALT as 24 bits etc.). Use the matching C-code shared library and Python script together.
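To make the difference between the two versions concrete, here is a sketch of both representations for the same transaction. The function names, the field order (CMD, ADDR, ALT, data) and the handling of the data byte order are assumptions for illustration; the actual layout expected by the shared library is defined in the C-code and the notebook.

```python
def as_words32(cmd: int, addr: int, alt: int, words=()) -> list:
    """32-bit version: every field travels in its own 32-bit word."""
    return [cmd & 0xFF, addr & 0xFFFFFFFF, alt & 0xFFFFFF,
            *[w & 0xFFFFFFFF for w in words]]


def pack_transaction(cmd: int, addr: int, alt: int, words=(),
                     data_byteorder: str = "big") -> bytes:
    """Packed version: CMD 8 bit, ADDR 32 bit, ALT 24 bit, then data."""
    stream = bytearray()
    stream += (cmd & 0xFF).to_bytes(1, "big")         # CMD: always big endian
    stream += (addr & 0xFFFFFFFF).to_bytes(4, "big")  # ADDR: always big endian
    stream += (alt & 0xFFFFFF).to_bytes(3, "big")     # ALT: always big endian
    for word in words:
        stream += (word & 0xFFFFFFFF).to_bytes(4, data_byteorder)
    return bytes(stream)


unpacked = as_words32(0xEB, 0x12345678, 0xA0B0C0, [0x11223344])
packed = pack_transaction(0xEB, 0x12345678, 0xA0B0C0, [0x11223344])
print(len(unpacked) * 4, len(packed))
print(packed.hex())
```

For this example the 32-bit version occupies 16 bytes and the packed byte stream 12 bytes.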
The project is designed to work with "my" Level Shifter board. The QSPI signals and GPIOs (e.g. an output and INT as an input) go via a level shifter board. The FPGA (and Python code) could control the direction of the signals (IN or OUT), but this is not really used: the directions are fixed.
Everything is based on the SYS_CLK. The READ phase is delayed by one SYS_CLK cycle, in order to compensate for the delay in the FPGA and in the Level Shifters. There is an option to use a QCLKfb clock: an external QSPI clock, fed back on the outside into the FPGA, is used as the READ clock. This can compensate for all FPGA internal delays (e.g. I/O delay) and, by tweaking the RC circuit, also for the external level shifters.
The QSPI block does not have a HW reset signal. Instead, bit 27 in the CTL_REG is used as a RESET signal.
If bit 27 is 0, it holds the QSPI block in reset. So, you have to control the RESET signal via SW:
Keep bit 27 always 1 when you write a 32-bit word to the CTL_REG!
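A small guard in the Python code can make sure bit 27 is never cleared by accident. A minimal sketch; the helper names are hypothetical and only the role of bit 27 is taken from the description above:

```python
RESET_N_BIT = 1 << 27  # 0 = QSPI block held in reset, 1 = normal operation


def release_reset(ctl_word: int) -> int:
    """Return the CTL word with bit 27 set (block not in reset)."""
    return (ctl_word | RESET_N_BIT) & 0xFFFFFFFF


def hold_in_reset(ctl_word: int) -> int:
    """Return the CTL word with bit 27 cleared (block held in reset)."""
    return ctl_word & ~RESET_N_BIT & 0xFFFFFFFF


running = release_reset(0x00000000)
stopped = hold_in_reset(0xFFFFFFFF)
print(hex(running), hex(stopped))
```

Passing every CTL word through release_reset() before writing it keeps the block out of reset during normal operation; hold_in_reset() is only needed when a reset is intended.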