Replies: 7 comments 6 replies
-
Thanks for reporting the performance numbers. A quick question: how many terms do you have in the Hamiltonian for your 16-qubit simulation case?
-
Here are a few of the data points:
-
I see, that is indeed a lot of terms in the Hamiltonian. We have one optimization in mind that could speed up the full process and will look into it.
-
Any new information on this issue? :)) Best regards,
-
Sounds great. Any guidance on when to expect this? Best regards,
-
Is this a quantum chemistry Jordan-Wigner Hamiltonian, meaning the Pauli strings are not k-local?
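For context, under the Jordan-Wigner transform a hopping term a_p† a_q acquires a chain of Z operators on every qubit between p and q, so the resulting Pauli strings act on O(N) qubits rather than a fixed k. A tiny, convention-dependent illustration of that pattern (the helper below is purely hypothetical and ignores coefficients):

```python
# Illustration only: the qualitative Pauli-string pattern produced by the
# Jordan-Wigner transform for a single hopping term a_p^dagger a_q (p < q).
# Real chemistry tooling (e.g. OpenFermion) tracks coefficients and the
# X/Y endpoint combinations; this just shows the Z-chain.
def jw_hopping_pattern(p, q, n_qubits):
    ops = ["I"] * n_qubits
    ops[p] = "X"                    # endpoint operators (X or Y in practice)
    ops[q] = "X"
    for k in range(p + 1, q):       # parity chain of Z's between p and q
        ops[k] = "Z"
    return "".join(ops)

# Support grows with |p - q|, so the strings are not k-local for any fixed k.
print(jw_hopping_pattern(1, 6, 8))  # -> IXZZZZXI
```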
-
Also, it looks like the largest simulation case has only 24 qubits, which is well within reach of a state-vector simulator (cuStateVec). Have you tried running the state-vector simulator instead? I would expect it to be more efficient for this case.
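For a rough sense of scale: at 24 qubits the full state vector is 2^24 complex128 amplitudes, roughly 256 MiB, which fits easily in an A10G's 24 GB. A minimal plain-NumPy sketch (not the cuStateVec API) of a dense state-vector expectation value:

```python
# Back-of-the-envelope check with plain NumPy (not cuStateVec): expectation
# value of a single weighted Pauli-Z term on a dense state vector.
import numpy as np

n_qubits = 10   # kept small for the demo; 24 qubits -> 2**24 * 16 B ~ 256 MiB
dim = 2 ** n_qubits

rng = np.random.default_rng(0)
psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
psi /= np.linalg.norm(psi)          # normalized random test state

Z = np.array([[1, 0], [0, -1]], dtype=np.complex128)

def apply_single_qubit(op, state, target, n):
    """Apply a 2x2 operator to qubit `target` of an n-qubit state vector."""
    t = state.reshape((2,) * n)                      # one axis per qubit
    t = np.tensordot(op, t, axes=([1], [target]))    # contracted axis comes first
    t = np.moveaxis(t, 0, target)                    # restore qubit ordering
    return t.reshape(2 ** n)

# <psi| 0.5 * Z_3 |psi>
expval = 0.5 * np.vdot(psi, apply_single_qubit(Z, psi, target=3, n=n_qubits))
print(f"<0.5 * Z_3> = {expval.real:+.6f}")
print(f"24-qubit state-vector footprint: {2**24 * 16 / 2**20:.0f} MiB (complex128)")
```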
-
We are currently working on a variational quantum algorithm (ADAPT-VQE-like) workflow that we would like to scale to a multi-GPU setup. For each iteration, the last gate of the circuit may be changed or reparameterized, and an additional gate may be appended. Following each such modification, one or more expectation values are computed with respect to a fixed Hamiltonian.
At present, we are using the `circuit = cuquantum.tensornet.experimental.NetworkState(...)` interface to represent the quantum circuit, adding gates via the `tensor_id = circuit.apply_tensor_operator(...)` method. The Hamiltonian is represented using the `hamiltonian = cuquantum.tensornet.experimental.NetworkOperator(...)` interface, and the terms (which are all weighted Pauli strings) are added using the `hamiltonian.append_product(...)` method.

To avoid unnecessary operations and enable caching (as outlined in cuQuantum's documentation), we construct the `hamiltonian` only once at the beginning (as it does not change in this algorithm). Similarly, we initialize the `circuit` only once, storing the corresponding `tensor_id`s. Any changes to the gates in the circuit are applied via the `circuit.update_tensor_operator(...)` method.

Expectation values are calculated using the `circuit.compute_expectation(hamiltonian, ...)` method. In our case, the number of terms in the Hamiltonian scales as O(poly(N)) (where N is the number of qubits), and the number of gates in the circuit scales as O(N). Our expectation was that increasing N would increase the relative computational workload on the GPU. However, after profiling the code with `cProfile`, we observed that for N > 16, approximately 88% of the time is spent in `<built-in method cuquantum.bindings.cutensornet.expectation_prepare>`, whereas only around 10% is spent in `<built-in method cuquantum.bindings.cutensornet.expectation_compute>`. This preparation overhead appears to be a bottleneck for scaling to multi-GPU execution.
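For reference, here is a stripped-down sketch of this setup. The Hamiltonian terms, gates, and qubit count are placeholders rather than our actual problem, and the exact argument layout of `append_product` / `apply_tensor_operator` may differ between cuQuantum releases, so please treat it as illustrative rather than a verified reproducer:

```python
# Stripped-down sketch of the setup described above (experimental tensornet API).
# Placeholder Hamiltonian terms and gates; argument layouts should be checked
# against the cuQuantum documentation for the installed version.
import cProfile
import pstats

import numpy as np
from cuquantum.tensornet.experimental import NetworkState, NetworkOperator

n_qubits = 16
extents = (2,) * n_qubits

X = np.array([[0, 1], [1, 0]], dtype=np.complex128)
Z = np.array([[1, 0], [0, -1]], dtype=np.complex128)

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=np.complex128)

# Hamiltonian: constructed once; Z_i Z_{i+1} and X_i terms stand in for the
# actual weighted Pauli strings.
hamiltonian = NetworkOperator(extents, dtype='complex128')
for i in range(n_qubits - 1):
    hamiltonian.append_product(1.0, [Z, Z], [(i,), (i + 1,)])
for i in range(n_qubits):
    hamiltonian.append_product(0.5, [X], [(i,)])

# Circuit: constructed once; tensor_ids are stored so gates can be updated in place.
circuit = NetworkState(extents, dtype='complex128')
tensor_ids = [circuit.apply_tensor_operator((i,), ry(0.1), unitary=True)
              for i in range(n_qubits)]

profiler = cProfile.Profile()
profiler.enable()

# ADAPT-VQE-like loop: reparameterize the last gate, append a new gate, and
# recompute the expectation value against the fixed Hamiltonian.
for step, theta in enumerate(np.linspace(0.1, 1.0, 5)):
    circuit.update_tensor_operator(tensor_ids[-1], ry(theta), unitary=True)
    tensor_ids.append(circuit.apply_tensor_operator((step % n_qubits,), ry(0.1), unitary=True))
    energy = circuit.compute_expectation(hamiltonian)
    print(f"step {step}: <H> = {energy.real:.6f}")

profiler.disable()
# Where does the time go? Filter for the cutensornet bindings, e.g.
# expectation_prepare vs expectation_compute.
pstats.Stats(profiler).sort_stats("cumulative").print_stats("cutensornet")
```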
The experiments were performed on an A10G GPU (AWS EC2 instance), and we are using the latest available version of cuQuantum.
Do you have any suggestions for how we might shift the performance bottleneck from the preparation phase toward the actual contraction phase? We suspect the preparation includes contraction path optimization and related overheads.
Best regards,
Sebastian Yde Madsen