Accompanying repository for "Generation of Fast Finite Field Arithmetic for Cortex-M4 with ECDH and SQIsign Applications"
This repository contains the sources for the paper "Generation of Fast Finite Field Arithmetic for Cortex-M4 with ECDH and SQIsign Applications", published in TCHES vol. 2025 no. 4. The sources are integrated in a fork of the pqm4 project. The original pqm4 README is reproduced further below.
Clone the repository with git's --recurse-submodules option.
See the pqm4 section below for all the prerequisites to run the benchmarks.
The relevant source code cited in the paper is under the following directories:
- crypto_sign/sqisign1d_lvl1
- crypto_sign/sqisign1d_lvl1_p248_smart
- crypto_sign/sqisign1d_lvl1_p248_uncompressed
- crypto_sign/sqisign1d_lvl3
- crypto_sign/sqisign1d_lvl5
- crypto_sign/sqisign2d_lvl1
- crypto_sign/sqisign2d_lvl3
- crypto_sign/sqisign2d_lvl5
To reproduce the benchmarks in the paper, run the following command, replacing /dev/ttyACM0 with the device name of the USB-to-RS232 converter of the NUCLEO-L4R5ZI board:
./benchmarks.py --nohashing -p nucleo-l4r5zi -u /dev/ttyACM0 \
sqisign1d_lvl{1,1_p248_{smart,uncompressed},3,5} sqisign2d_lvl{1,3,5}
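The brace expressions in this command are expanded by the shell (for example bash) before benchmarks.py sees its arguments. As a cross-check, the following Python sketch spells the expansion out by hand. The helper name expand_scheme_arguments is hypothetical and not part of this repository.

```python
def expand_scheme_arguments():
    """Return the scheme names the two brace expressions expand to, in bash order."""
    # sqisign1d_lvl{1,1_p248_{smart,uncompressed},3,5}
    one_d = ["1"] + [f"1_p248_{variant}" for variant in ("smart", "uncompressed")] + ["3", "5"]
    # sqisign2d_lvl{1,3,5}
    two_d = ["1", "3", "5"]
    return [f"sqisign1d_lvl{s}" for s in one_d] + [f"sqisign2d_lvl{s}" for s in two_d]


if __name__ == "__main__":
    for name in expand_scheme_arguments():
        print(name)
```

The eight resulting names correspond to the eight directories under crypto_sign/ listed above.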
Collection of post-quantum cryptographic algorithms for the ARM Cortex-M4
The pqm4 library, benchmarking and testing framework started as a result of the PQCRYPTO project funded by the European Commission in the H2020 program. It currently contains implementations of post-quantum key-encapsulation mechanisms and post-quantum signature schemes targeting the ARM Cortex-M4 family of microcontrollers. The design goals of the library are to offer
- automated functional testing on a widely available development board;
- automated generation of test vectors and comparison against output of a reference implementation running host-side (i.e., on the computer the development board is connected to);
- automated benchmarking for speed, stack usage, and code-size;
- automated profiling of cycles spent in symmetric primitives (SHA-2, SHA-3, AES);
- integration of clean implementations from PQClean; and
- easy integration of new schemes and implementations into the framework.
The master branch of pqm4 contains schemes that are either
- standardized by NIST in FIPS203, FIPS204, or FIPS205,
- selected for standardization by NIST,
- part of the 4th round of the NIST PQC standardization process,
- part of the first round of additional signatures of the NIST PQC standardization process, or
- part of the second round of the KpqC competition.
Implementations for previous NIST PQC rounds are available here:
- Signature Round 1: https://github.com/mupq/pqm4/releases/tag/SignatureRound1
- Round 3: https://github.com/mupq/pqm4/releases/tag/Round3
- Round 2: https://github.com/mupq/pqm4/releases/tag/Round2
- Round 1: https://github.com/mupq/pqm4/releases/tag/Round1
For the second round of the NIST PQC process, pqm4 was extended (see #78) with the following features:
- common code was moved to mupq for reuse in pqriscv,
- much simpler build process,
- automated profiling of cycles spent in symmetric primitives (SHA-2, SHA-3, AES),
- reporting of code-size,
- integration of clean implementations from PQClean.
For the third round of the NIST PQC process, pqm4 was extended with the following features:
- overhaul of the build process to support multiple target boards, and
- use of the QEMU simulator to measure stack usage of larger schemes.
For the fourth round of the NIST PQC process, pqm4 was extended with the following features:
- switch to the Nucleo-L4R5ZI board as the default board for measurements, and
- an overhaul of the console output.
For most of the schemes there are multiple implementations. The naming scheme for these implementations is as follows:
- clean: clean reference implementation from PQClean,
- ref: the reference implementation submitted to NIST (will be replaced by clean in the long term),
- opt: an optimized implementation in plain C (e.g., the optimized implementation submitted to NIST),
- m4: an implementation with Cortex-M4 specific optimizations (typically in assembly),
- m4f: an implementation with Cortex-M4F specific optimizations (typically assembly using floating-point registers).
The testing and benchmarking framework of pqm4 targets several development boards, all featuring an ARM Cortex-M4 chip:
- nucleo-l4r5zi (default): The NUCLEO-L4R5ZI board featuring 2MB of Flash and 640KB of RAM. This board does not require a separate USB serial interface converter.
- stm32f4discovery: The STM32F4 Discovery board featuring 1MB of Flash and 192KB of RAM. Connecting the development board to the host computer requires a mini-USB cable and a USB-TTL converter together with a 2-pin dupont / jumper cable.
- nucleo-l476rg: The NUCLEO-L476RG board featuring 1MB of Flash and 128KB of RAM. This board does not require a separate USB serial interface converter.
- cw308t-stm32f3: The ChipWhisperer CW308-STM32F3 target board (in the F3 configuration) featuring 256KB of Flash and 40KB of RAM.
- mps2-an386: The ARM MPS2(+) FPGA prototyping board when used with the ARM Cortex-M4 bitstream (see ARM AN386) featuring two 4MB RAM blocks, one used in lieu of Flash and one as RAM. This board can also be simulated with the QEMU 5.2 simulator (the cycle counts are, however, meaningless in this case).
The pqm4 build system assumes that you have the arm-none-eabi toolchain installed. All benchmarks are performed using this toolchain.
On most Linux systems, the correct toolchain gets installed when you install the arm-none-eabi-gcc (or gcc-arm-none-eabi) package.
On some Linux distributions, you will also have to explicitly install libnewlib-arm-none-eabi.
To flash binaries onto most development boards, pqm4 uses stlink. Depending on your operating system, stlink may be available in your package manager -- if not, please refer to the stlink GitHub page for instructions on how to compile it from source (in that case, be careful to use libusb-1.0.0-dev, not libusb-0.1).
For the nucleo-l4r5zi board, OpenOCD (tested with version 0.12) is used for flashing binaries.
Depending on your operating system, OpenOCD may be available in your package manager -- if not, please refer to the OpenOCD README for instructions on how to compile it from source.
The benchmarking scripts used in pqm4 require Python >= 3.8.
The host-side Python code for most platforms requires the pyserial module.
Your package repository might offer python3-serial (Debian, Ubuntu), python-pyserial (Arch), python3-pyserial (Fedora, openSUSE), pyserial (Slack, CentOS, Gentoo), or py3-pyserial (Alpine) directly.
Alternatively, it can be installed from PyPI by calling pip3 install -r requirements.txt.
If you do not have pip3 installed yet, you can typically find it as python3-pip (Debian, Ubuntu) or python-pip (Arch) using your package manager.
The host-side Python code for the cw308t-stm32f3 board requires the chipwhisperer module.
If you don't target this board, you can skip the installation.
The mps2-an386 platform is simulated with the QEMU ARM system emulator. You'll need at least version 5.2, which is fairly recent at the time of writing and may not be available on your favourite Linux distro. If you don't target this platform, you can skip the installation.
Connect the board to your host machine using the mini-USB port.
This provides it with power, and allows you to flash binaries onto the board.
It should show up in lsusb as STMicroelectronics ST-LINK/V2.
If you are using a UART-USB connector that has a PL2303 chip on board (which appears to be the most common), the driver should be loaded in your kernel by default. If it is not, it is typically called pl2303.
On macOS, you will still need to install it (and reboot).
When you plug in the device, it should show up as Prolific Technology, Inc. PL2303 Serial Port when you type lsusb.
Using dupont / jumper cables, connect the TX/TXD pin of the USB connector to the PA3 pin on the board, and connect RX/RXD to PA2.
Depending on your setup, you may also want to connect the GND pins.
Finally, obtain the pqm4 library and the submodules:
git clone --recursive https://github.com/mupq/pqm4.git
Now you may pick your platform and compile the code (adapt the PLATFORM variable to your chosen platform and the number of threads in -j4 to your PC accordingly):
make -j4 PLATFORM=stm32f4discovery
The pqm4 library uses the NIST/SUPERCOP/PQClean API. It is mandated for all included schemes.
KEMs need to define CRYPTO_SECRETKEYBYTES, CRYPTO_PUBLICKEYBYTES, CRYPTO_BYTES, and CRYPTO_CIPHERTEXTBYTES and implement
int crypto_kem_keypair(unsigned char *pk, unsigned char *sk);
int crypto_kem_enc(unsigned char *ct, unsigned char *ss, const unsigned char *pk);
int crypto_kem_dec(unsigned char *ss, const unsigned char *ct, const unsigned char *sk);
Signature schemes need to define CRYPTO_SECRETKEYBYTES, CRYPTO_PUBLICKEYBYTES, and CRYPTO_BYTES and implement
int crypto_sign_keypair(unsigned char *pk, unsigned char *sk);
int crypto_sign(unsigned char *sm, size_t *smlen,
                const unsigned char *msg, size_t len,
                const unsigned char *sk);
int crypto_sign_open(unsigned char *m, size_t *mlen,
                     const unsigned char *sm, size_t smlen,
                     const unsigned char *pk);
The build system compiles six binaries for each implementation which can be used to test and benchmark the schemes. For example, for the m4 implementation of ML-KEM-768 the following binaries are assembled:
- bin/crypto_kem_ml-kem-768_m4_test.bin tests if the scheme works as expected. For KEMs this tests if Alice and Bob derive the same shared key and for signature schemes it tests if a generated signature can be verified correctly. Several failure cases are also checked, see mupq/crypto_kem/test.c and mupq/crypto_sign/test.c for details.
- bin/crypto_kem_ml-kem-768_m4_speed.bin measures the runtime of crypto_kem_keypair, crypto_kem_enc, and crypto_kem_dec for KEMs and crypto_sign_keypair, crypto_sign, and crypto_sign_open for signatures. See mupq/crypto_kem/speed.c and mupq/crypto_sign/speed.c.
- bin/crypto_kem_ml-kem-768_m4_hashing.bin measures the cycles spent in SHA-2, SHA-3, and AES of crypto_kem_keypair, crypto_kem_enc, and crypto_kem_dec for KEMs and crypto_sign_keypair, crypto_sign, and crypto_sign_open for signatures. See mupq/crypto_kem/hashing.c and mupq/crypto_sign/hashing.c.
- bin/crypto_kem_ml-kem-768_m4_stack.bin measures the stack consumption of each of the procedures involved. The memory allocated outside of the procedures (e.g., public keys, private keys, ciphertexts, signatures) is not included. See mupq/crypto_kem/stack.c and mupq/crypto_sign/stack.c.
- bin/crypto_kem_ml-kem-768_m4_testvectors.bin uses a deterministic random number generator to generate testvectors for the implementation. These can be used to cross-check different implementations of the same scheme. See mupq/crypto_kem/testvectors.c and mupq/crypto_sign/testvectors.c.
- bin-host/crypto_kem_ml-kem-768_m4_testvectors uses the same deterministic random number generator to create the testvectors on your host. See mupq/crypto_kem/testvectors-host.c and mupq/crypto_sign/testvectors-host.c.
- An elf file for each binary is generated in the elf/ folder if desired.
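The binary names above follow a regular pattern of API kind, scheme, implementation, and binary type. A minimal sketch of that pattern, inferred from the example names in this README rather than read from the build system; the function name binary_paths is hypothetical:

```python
def binary_paths(kind, scheme, implementation):
    """Return the six binary paths for one implementation.

    kind is "crypto_kem" or "crypto_sign". The naming pattern is an
    assumption inferred from the ML-KEM-768 example names in this README.
    """
    stem = f"{kind}_{scheme}_{implementation}"
    # Five binaries run on the board ...
    on_target = [f"bin/{stem}_{binary_type}.bin"
                 for binary_type in ("test", "speed", "hashing", "stack", "testvectors")]
    # ... and one testvector generator runs on the host (no file extension).
    on_host = [f"bin-host/{stem}_testvectors"]
    return on_target + on_host


if __name__ == "__main__":
    for path in binary_paths("crypto_kem", "ml-kem-768", "m4"):
        print(path)
```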
The elf files or binaries can be flashed to your board using an appropriate tool. For example, the stm32f4discovery platform uses st-flash, e.g., st-flash write bin/crypto_kem_ml-kem-768_m4_test.bin 0x8000000. To receive the output, run python3 hostside/host_unidirectional.py.
If you target the mps2-an386 platform, you can also run the elf file using the QEMU ARM emulator:
qemu-system-arm -M mps2-an386 -nographic -semihosting -kernel elf/crypto_kem_ml-kem-512_m4_test.elf
The emulator should exit automatically when the test / benchmark completes. If you run into an error, you can exit QEMU by pressing CTRL+A and then X.
The pqm4 framework automates testing and benchmarking for all schemes using Python 3 scripts:
- python3 test.py: flashes all test binaries to the boards and checks that no errors occur.
- python3 testvectors.py: flashes all testvector binaries to the boards and writes the testvectors to testvectors/. Additionally, it executes the reference implementations on your host machine. Afterwards, it checks the testvectors of different implementations of the same scheme for consistency.
- python3 benchmarks.py: flashes the stack and speed binaries and writes the results to benchmarks/stack/ and benchmarks/speed/. You may want to execute this several times for certain schemes for which the execution time varies significantly.
The scripts take a number of command line arguments, which you'll need to adapt:
- --platform <platformname> or -p <platformname>: Sets the target platform (default stm32f4discovery).
- --opt {speed,size,debug} or -o {speed,size,debug}: Sets optimization flags for compilation (default speed).
- --lto or -l: Use link-time optimization during compilation.
- --no-aio: Disables all-in-one compilation.
If you change any of these values, you'll need to run make clean (the build system will remind you).
In case you don't want to include all schemes, pass a list of schemes you want to include to any of the scripts, e.g., python3 test.py ml-kem-768 sphincs-shake256-128f-simple.
In case you want to exclude certain schemes, pass --exclude, e.g., python3 test.py --exclude saber.
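How these options and scheme lists combine can be illustrated with a small argparse sketch. It is a hypothetical mirror of the command-line behaviour described above, not the parser used by the pqm4 scripts; the function names build_parser and select_schemes, the selection logic, and the subset of options covered are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring some of the options described above."""
    parser = argparse.ArgumentParser(description="illustrative pqm4-style options")
    parser.add_argument("-p", "--platform", default="stm32f4discovery")
    parser.add_argument("-o", "--opt", choices=["speed", "size", "debug"], default="speed")
    parser.add_argument("-l", "--lto", action="store_true")
    parser.add_argument("--exclude", nargs="+", default=[])
    parser.add_argument("schemes", nargs="*")
    return parser


def select_schemes(available, args):
    """Keep the requested schemes (all if none were given) and drop the excluded ones."""
    wanted = args.schemes or available
    return [scheme for scheme in available if scheme in wanted and scheme not in args.exclude]


if __name__ == "__main__":
    # The list of available schemes is a placeholder for illustration only.
    available = ["ml-kem-768", "saber", "sphincs-shake256-128f-simple"]
    args = build_parser().parse_args(["--exclude", "saber"])
    print(args.platform, args.opt, select_schemes(available, args))
```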
The benchmark results (in benchmarks/) created by python3 benchmarks.py can be automatically converted to a markdown table using python3 convert_benchmarks.py md or to csv using python3 convert_benchmarks.py csv.
The current benchmark results can be found in benchmarks.csv or benchmarks.md.
All cycle counts were obtained at 24MHz to avoid wait cycles due to the speed of the memory controller. For most schemes we report minimum, maximum, and average cycle counts of 100 executions. For some particularly slow schemes we reduce the number of executions; the number of executions is reported in parentheses.
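To relate the reported numbers to wall-clock time, the following sketch computes the minimum, maximum, and average of a list of cycle counts and converts a cycle count to milliseconds at the stated 24 MHz clock. The function names are hypothetical and the sketch is independent of the pqm4 scripts.

```python
CLOCK_HZ = 24_000_000  # 24 MHz, the clock frequency stated above


def summarize(cycle_counts):
    """Return (minimum, maximum, average) of a non-empty list of cycle counts."""
    return min(cycle_counts), max(cycle_counts), sum(cycle_counts) / len(cycle_counts)


def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to milliseconds at the given clock frequency."""
    return cycles * 1000 / clock_hz
```

For instance, 1,200,000 cycles correspond to 50 ms at 24 MHz.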
The numbers were obtained with arm-none-eabi-gcc (Arm GNU Toolchain 11.3.Rel1) 11.3.1 20220712 from Arm.
The code-size measurements only include the code that is provided by the scheme implementation, i.e., exclude common code like hashing or C standard library functions.
The measurements are performed with arm-none-eabi-size.
The size contributions to the .text, .data, and .bss sections are also listed separately.
The pqm4 build system is designed to make it very easy to add new schemes and implementations, if these implementations follow the NIST/SUPERCOP/PQClean API.
In case you want to contribute a reference implementation, please open a pull request to PQClean. In case you want to contribute an optimized C implementation, please open a pull request to mupq. In case you want to add an implementation optimized for the Cortex-M4, please open a pull request here.
In the following we consider the example of adding an M4-optimized implementation of NewHope-512-CPA-KEM to pqm4:
- Create a subdirectory for the new scheme under crypto_kem/; in the following we assume that this subdirectory is called newhope512cpa.
- Create a subdirectory m4 under crypto_kem/newhope512cpa/.
- Copy all files of the implementation into this new subdirectory crypto_kem/newhope512cpa/m4/, except for the file implementing the randombytes function (typically PQCgenKAT_kem.c).
The procedure for adding a signature scheme is the same, except that it starts with creating a new subdirectory under crypto_sign/.
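The steps above can be sketched as a small host-side helper. This is only an illustration under stated assumptions: the function name add_implementation and its parameters are hypothetical, and the source directory is assumed to be a flat folder of files.

```python
import shutil
from pathlib import Path


def add_implementation(repo_root, source_dir, kind="crypto_kem",
                       scheme="newhope512cpa", implementation="m4",
                       skip=("PQCgenKAT_kem.c",)):
    """Copy an implementation into <repo_root>/<kind>/<scheme>/<implementation>/.

    Files named in skip (by default the one providing randombytes) are left out.
    Returns the target directory.
    """
    target = Path(repo_root) / kind / scheme / implementation
    # Create the scheme subdirectory and the implementation subdirectory.
    target.mkdir(parents=True, exist_ok=True)
    # Copy every regular file except the skipped ones.
    for path in sorted(Path(source_dir).iterdir()):
        if path.is_file() and path.name not in skip:
            shutil.copy2(path, target / path.name)
    return target
```

For a signature scheme, pass kind="crypto_sign" and adjust skip to the file that implements randombytes in that submission.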
Many schemes submitted to NIST use SHA-3, SHAKE or cSHAKE for hashing.
This is why pqm4 comes with highly optimized Keccak code that is accessible
from all KEM and signature implementations.
Functions from the FIPS202 standard are defined in mupq/common/fips202.h as follows:
void shake128_absorb(shake128ctx *state, const uint8_t *input, size_t inlen);
void shake128_squeezeblocks(uint8_t *output, size_t nblocks, shake128ctx *state);
void shake128(uint8_t *output, size_t outlen, const uint8_t *input, size_t inlen);
void shake128_inc_init(shake128incctx *state);
void shake128_inc_absorb(shake128incctx *state, const uint8_t *input, size_t inlen);
void shake128_inc_finalize(shake128incctx *state);
void shake128_inc_squeeze(uint8_t *output, size_t outlen, shake128incctx *state);
void shake256_absorb(shake256ctx *state, const uint8_t *input, size_t inlen);
void shake256_squeezeblocks(uint8_t *output, size_t nblocks, shake256ctx *state);
void shake256(uint8_t *output, size_t outlen, const uint8_t *input, size_t inlen);
void shake256_inc_init(shake256incctx *state);
void shake256_inc_absorb(shake256incctx *state, const uint8_t *input, size_t inlen);
void shake256_inc_finalize(shake256incctx *state);
void shake256_inc_squeeze(uint8_t *output, size_t outlen, shake256incctx *state);
void sha3_256_inc_init(sha3_256incctx *state);
void sha3_256_inc_absorb(sha3_256incctx *state, const uint8_t *input, size_t inlen);
void sha3_256_inc_finalize(uint8_t *output, sha3_256incctx *state);
void sha3_256(uint8_t *output, const uint8_t *input, size_t inlen);
void sha3_512_inc_init(sha3_512incctx *state);
void sha3_512_inc_absorb(sha3_512incctx *state, const uint8_t *input, size_t inlen);
void sha3_512_inc_finalize(uint8_t *output, sha3_512incctx *state);
void sha3_512(uint8_t *output, const uint8_t *input, size_t inlen);
Functions from the related publication SP 800-185 (cSHAKE) are defined in mupq/common/sp800-185.h:
void cshake128_inc_init(shake128incctx *state, const uint8_t *name, size_t namelen, const uint8_t *cstm, size_t cstmlen);
void cshake128_inc_absorb(shake128incctx *state, const uint8_t *input, size_t inlen);
void cshake128_inc_finalize(shake128incctx *state);
void cshake128_inc_squeeze(uint8_t *output, size_t outlen, shake128incctx *state);
void cshake128(uint8_t *output, size_t outlen, const uint8_t *name, size_t namelen, const uint8_t *cstm, size_t cstmlen, const uint8_t *input, size_t inlen);
void cshake256_inc_init(shake256incctx *state, const uint8_t *name, size_t namelen, const uint8_t *cstm, size_t cstmlen);
void cshake256_inc_absorb(shake256incctx *state, const uint8_t *input, size_t inlen);
void cshake256_inc_finalize(shake256incctx *state);
void cshake256_inc_squeeze(uint8_t *output, size_t outlen, shake256incctx *state);
void cshake256(uint8_t *output, size_t outlen, const uint8_t *name, size_t namelen, const uint8_t *cstm, size_t cstmlen, const uint8_t *input, size_t inlen);
Implementations that want to make use of these optimized routines simply include fips202.h (or sp800-185.h). The API for sha3_256 and sha3_512 follows the SUPERCOP hash API.
The API for shake128 and shake256 is very similar, except that it supports variable-length output.
The SHAKE functions are also accessible via the absorb-squeezeblocks functions, which offer incremental output generation (but not incremental input handling).
The variants with _inc_ offer both incremental input handling and output generation.
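The difference between the one-shot and the incremental interface can be illustrated on the host with Python's hashlib, used here purely as a stand-in for fips202.h. The function names are hypothetical. Note that hashlib returns the whole output in a single digest call, so the sketch shows incremental input and variable-length output, but not incremental squeezing.

```python
import hashlib


def shake128_one_shot(data, outlen):
    """Hash data in one call and return outlen bytes (compare shake128)."""
    return hashlib.shake_128(data).digest(outlen)


def shake128_incremental(chunks, outlen):
    """Absorb the input piece by piece, then return outlen bytes."""
    state = hashlib.shake_128()   # plays the role of shake128_inc_init
    for chunk in chunks:
        state.update(chunk)       # plays the role of shake128_inc_absorb
    return state.digest(outlen)   # finalize and squeeze in one step


if __name__ == "__main__":
    whole = shake128_one_shot(b"post-quantum", 32)
    pieces = shake128_incremental([b"post-", b"quantum"], 32)
    print(whole == pieces)
```

Both calls yield the same output because SHAKE depends only on the concatenated input, and a shorter output is always a prefix of a longer one.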
Some schemes submitted to NIST use SHA-224, SHA-256, SHA-384, or SHA-512 for hashing.
We've experimented with assembly-optimized SHA-512, but found that the speed-up achievable with this compared to the C implementation from SUPERCOP is negligible when compiled using arm-none-eabi-gcc-8.3.0.
For older compiler versions (e.g. 5.4.1) hand-optimized assembly implementations were significantly faster.
We've therefore decided to only include a C version of the SHA-2 variants.
The available functions are:
void sha224_inc_init(sha224ctx *state);
void sha224_inc_blocks(sha224ctx *state, const uint8_t *in, size_t inblocks);
void sha224_inc_finalize(uint8_t *out, sha224ctx *state, const uint8_t *in, size_t inlen);
void sha224(uint8_t *out, const uint8_t *in, size_t inlen);
void sha256_inc_init(sha256ctx *state);
void sha256_inc_blocks(sha256ctx *state, const uint8_t *in, size_t inblocks);
void sha256_inc_finalize(uint8_t *out, sha256ctx *state, const uint8_t *in, size_t inlen);
void sha256(uint8_t *out, const uint8_t *in, size_t inlen);
void sha384_inc_init(sha384ctx *state);
void sha384_inc_blocks(sha384ctx *state, const uint8_t *in, size_t inblocks);
void sha384_inc_finalize(uint8_t *out, sha384ctx *state, const uint8_t *in, size_t inlen);
void sha384(uint8_t *out, const uint8_t *in, size_t inlen);
void sha512_inc_init(sha512ctx *state);
void sha512_inc_blocks(sha512ctx *state, const uint8_t *in, size_t inblocks);
void sha512_inc_finalize(uint8_t *out, sha512ctx *state, const uint8_t *in, size_t inlen);
void sha512(uint8_t *out, const uint8_t *in, size_t inlen);
Implementations can use these by including sha2.h.
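Judging by the parameter names, the _inc_blocks functions take a number of whole blocks, while _inc_finalize takes the remaining input by byte length. The following sketch shows how a message could be split accordingly, using Python's hashlib as a host-side stand-in for sha2.h. The function names are hypothetical.

```python
import hashlib

SHA256_BLOCK_BYTES = 64  # SHA-256 processes its input in 64-byte blocks


def split_for_inc_api(message, block_bytes=SHA256_BLOCK_BYTES):
    """Split message into (whole blocks, number of blocks, trailing bytes)."""
    nblocks = len(message) // block_bytes
    cut = nblocks * block_bytes
    return message[:cut], nblocks, message[cut:]


def sha256_via_blocks(message):
    """Hash message by feeding the whole blocks first and the remainder last."""
    blocks, _nblocks, tail = split_for_inc_api(message)
    state = hashlib.sha256()   # stands in for sha256_inc_init
    state.update(blocks)       # stands in for sha256_inc_blocks(state, blocks, nblocks)
    state.update(tail)         # stands in for the input passed to sha256_inc_finalize
    return state.digest()
```

For SHA-384 and SHA-512 the block size is 128 bytes instead of 64.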
Some schemes submitted to NIST make use of AES as a subroutine. We included assembly-optimized implementations of AES-128 and AES-256 in ECB mode and in CTR mode.
Up until January 2021, pqm4 relied on the t-table implementation by Schwabe and Stoffelen published at SAC2016. On Cortex-M4 platforms with a data cache, this implementation may be vulnerable to cache attacks. Hence, pqm4 is now using the bitsliced implementation by Adomnicai and Peyrin published in TCHES2021/1.
The functions that can be used are declared in common/aes.h as follows:
void aes128_ecb_keyexp(aes128ctx *r, const unsigned char *key);
void aes128_ctr_keyexp(aes128ctx *r, const unsigned char *key);
void aes128_ecb(unsigned char *out, const unsigned char *in, size_t nblocks, const aes128ctx *ctx);
void aes128_ctr(unsigned char *out, size_t outlen, const unsigned char *iv, const aes128ctx *ctx);
void aes256_ecb_keyexp(aes256ctx *r, const unsigned char *key);
void aes256_ctr_keyexp(aes256ctx *r, const unsigned char *key);
void aes256_ecb(unsigned char *out, const unsigned char *in, size_t nblocks, const aes256ctx *ctx);
void aes256_ctr(unsigned char *out, size_t outlen, const unsigned char *iv, const aes256ctx *ctx);
Implementations can use these by including aes.h.
Some post-quantum schemes use AES with only public inputs (e.g., Kyber and FrodoKEM) and, consequently, do not need a constant-time AES implementation.
As those schemes would be unfairly penalized by switching to a slower constant-time implementation, we additionally provide the t-table implementation.
The functions that can be used are declared in common/aes-publicinputs.h as follows:
void aes128_ecb_keyexp_publicinputs(aes128ctx_publicinputs *r, const unsigned char *key);
void aes128_ctr_keyexp_publicinputs(aes128ctx_publicinputs *r, const unsigned char *key);
void aes128_ecb_publicinputs(unsigned char *out, const unsigned char *in, size_t nblocks, const aes128ctx_publicinputs *ctx);
void aes128_ctr_publicinputs(unsigned char *out, size_t outlen, const unsigned char *iv, const aes128ctx_publicinputs *ctx);
void aes192_ecb_keyexp_publicinputs(aes192ctx_publicinputs *r, const unsigned char *key);
void aes192_ctr_keyexp_publicinputs(aes192ctx_publicinputs *r, const unsigned char *key);
void aes192_ecb_publicinputs(unsigned char *out, const unsigned char *in, size_t nblocks, const aes192ctx_publicinputs *ctx);
void aes192_ctr_publicinputs(unsigned char *out, size_t outlen, const unsigned char *iv, const aes192ctx_publicinputs *ctx);
void aes256_ecb_keyexp_publicinputs(aes256ctx_publicinputs *r, const unsigned char *key);
void aes256_ctr_keyexp_publicinputs(aes256ctx_publicinputs *r, const unsigned char *key);
void aes256_ecb_publicinputs(unsigned char *out, const unsigned char *in, size_t nblocks, const aes256ctx_publicinputs *ctx);
void aes256_ctr_publicinputs(unsigned char *out, size_t outlen, const unsigned char *iv, const aes256ctx_publicinputs *ctx);
When referring to this framework in academic literature, please consider using the following BibTeX excerpt:
@misc{PQM4,
title = {{PQM4}: Post-quantum crypto library for the {ARM} {Cortex-M4}},
author = {Matthias J. Kannwischer and Richard Petri and Joost Rijneveld and Peter Schwabe and Ko Stoffelen},
note = {\url{https://github.com/mupq/pqm4}}
}
Please note, however, that pqm4 does not author the implementations that are included in pqm4. Most of the implementations that are included in the collection originate from original research projects. Moreover, many implementations have been swapped out over the years. When comparing or improving implementations, please consider not only pqm4, but also cite the publication corresponding to the implementation.
Sometimes it might not be entirely clear which paper to cite. Feel free to open an issue so that we can help you find it.
Different parts of pqm4 have different licenses. Each subdirectory containing implementations contains a LICENSE or COPYING file stating under what license that specific implementation is released. The files in common contain licensing information at the top of the file (and are currently either public domain or MIT).
All other code in this repository is dual-licensed under Apache-2.0 and under the conditions of CC0.