Replies: 7 comments 6 replies
-
Thanks for reporting the performance numbers. A quick question: how many terms do you have in the Hamiltonian for your 16-qubit simulation case?
-
Here are a few of the data points:
-
I see, that is indeed a lot of terms in the Hamiltonian. We have one optimization in mind that could speed up the full process and will look into it.
-
Any new information on this issue? :)) Best regards,
-
Sounds great. Any guidance on when to expect this? Best regards,
-
Is this a quantum chemistry Jordan-Wigner Hamiltonian, meaning the Pauli strings are not k-local?
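For context, under the Jordan-Wigner transform a hopping term a_p† a_q acquires a chain of Z operators on every qubit between p and q, so the resulting Pauli strings act on O(N) qubits rather than a fixed k. A tiny, convention-dependent illustration of that pattern (the helper below is purely hypothetical and ignores coefficients):

```python
# Illustration only: the qualitative Pauli-string pattern produced by the
# Jordan-Wigner transform for a single hopping term a_p^dagger a_q (p < q).
# Real chemistry tooling (e.g. OpenFermion) tracks coefficients and the
# X/Y endpoint combinations; this just shows the Z-chain.
def jw_hopping_pattern(p, q, n_qubits):
    ops = ["I"] * n_qubits
    ops[p] = "X"                    # endpoint operators (X or Y in practice)
    ops[q] = "X"
    for k in range(p + 1, q):       # parity chain of Z's between p and q
        ops[k] = "Z"
    return "".join(ops)

# Support grows with |p - q|, so the strings are not k-local for any fixed k.
print(jw_hopping_pattern(1, 6, 8))  # -> IXZZZZXI
```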
-
Also, it looks like the largest simulation case has only 24 qubits, which is well within reach of a state-vector simulator (cuStateVec). Have you tried running the state-vector simulator instead? I would expect it to be more efficient for this case.
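For a rough sense of scale: at 24 qubits the full state vector is 2^24 complex128 amplitudes, roughly 256 MiB, which fits easily in an A10G's 24 GB. A minimal plain-NumPy sketch (not the cuStateVec API) of a dense state-vector expectation value:

```python
# Back-of-the-envelope check with plain NumPy (not cuStateVec): expectation
# value of a single weighted Pauli-Z term on a dense state vector.
import numpy as np

n_qubits = 10   # kept small for the demo; 24 qubits -> 2**24 * 16 B ~ 256 MiB
dim = 2 ** n_qubits

rng = np.random.default_rng(0)
psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
psi /= np.linalg.norm(psi)          # normalized random test state

Z = np.array([[1, 0], [0, -1]], dtype=np.complex128)

def apply_single_qubit(op, state, target, n):
    """Apply a 2x2 operator to qubit `target` of an n-qubit state vector."""
    t = state.reshape((2,) * n)                      # one axis per qubit
    t = np.tensordot(op, t, axes=([1], [target]))    # contracted axis comes first
    t = np.moveaxis(t, 0, target)                    # restore qubit ordering
    return t.reshape(2 ** n)

# <psi| 0.5 * Z_3 |psi>
expval = 0.5 * np.vdot(psi, apply_single_qubit(Z, psi, target=3, n=n_qubits))
print(f"<0.5 * Z_3> = {expval.real:+.6f}")
print(f"24-qubit state-vector footprint: {2**24 * 16 / 2**20:.0f} MiB (complex128)")
```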
-
We are currently working on a variational quantum algorithm (ADAPT-VQE-like) workflow that we would like to scale to a multi-GPU setup. For each iteration, the last gate of the circuit may be changed or reparameterized, and an additional gate may be appended. Following each such modification, one or more expectation values are computed with respect to a fixed Hamiltonian.
At present, we are using the `circuit = cuquantum.tensornet.experimental.NetworkState(...)` interface to represent the quantum circuit, adding gates via the `tensor_id = circuit.apply_tensor_operator(...)` method. The Hamiltonian is represented using the `hamiltonian = cuquantum.tensornet.experimental.NetworkOperator(...)` interface, and the terms (which are all weighted Pauli strings) are added using the `hamiltonian.append_product(...)` method.

To avoid unnecessary operations and enable caching (as outlined in cuQuantum's documentation), we construct the `hamiltonian` only once at the beginning (as it does not change in this algorithm). Similarly, we initialize the `circuit` only once, storing the corresponding `tensor_id`s. Any changes to the gates in the circuit are applied via the `circuit.update_tensor_operator(...)` method.

Expectation values are calculated using the `circuit.compute_expectation(hamiltonian, ...)` method. In our case, the number of terms in the Hamiltonian scales as O(poly(N)) (where N is the number of qubits), and the number of gates in the circuit scales as O(N). Our expectation was that increasing N would increase the relative computational workload on the GPU. However, after profiling the code with `cProfile`, we observed that for N > 16, approximately 88% of the time is spent in `<built-in method cuquantum.bindings.cutensornet.expectation_prepare>`, whereas only around 10% is spent in `<built-in method cuquantum.bindings.cutensornet.expectation_compute>`. This preparation overhead appears to be a bottleneck for scaling to multi-GPU execution.
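For reference, here is a stripped-down sketch of this setup. The Hamiltonian terms, gates, and qubit count are placeholders rather than our actual problem, and the exact argument layout of `append_product` / `apply_tensor_operator` may differ between cuQuantum releases, so please treat it as illustrative rather than a verified reproducer:

```python
# Stripped-down sketch of the setup described above (experimental tensornet API).
# Placeholder Hamiltonian terms and gates; argument layouts should be checked
# against the cuQuantum documentation for the installed version.
import cProfile
import pstats

import numpy as np
from cuquantum.tensornet.experimental import NetworkState, NetworkOperator

n_qubits = 16
extents = (2,) * n_qubits

X = np.array([[0, 1], [1, 0]], dtype=np.complex128)
Z = np.array([[1, 0], [0, -1]], dtype=np.complex128)

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=np.complex128)

# Hamiltonian: constructed once; Z_i Z_{i+1} and X_i terms stand in for the
# actual weighted Pauli strings.
hamiltonian = NetworkOperator(extents, dtype='complex128')
for i in range(n_qubits - 1):
    hamiltonian.append_product(1.0, [Z, Z], [(i,), (i + 1,)])
for i in range(n_qubits):
    hamiltonian.append_product(0.5, [X], [(i,)])

# Circuit: constructed once; tensor_ids are stored so gates can be updated in place.
circuit = NetworkState(extents, dtype='complex128')
tensor_ids = [circuit.apply_tensor_operator((i,), ry(0.1), unitary=True)
              for i in range(n_qubits)]

profiler = cProfile.Profile()
profiler.enable()

# ADAPT-VQE-like loop: reparameterize the last gate, append a new gate, and
# recompute the expectation value against the fixed Hamiltonian.
for step, theta in enumerate(np.linspace(0.1, 1.0, 5)):
    circuit.update_tensor_operator(tensor_ids[-1], ry(theta), unitary=True)
    tensor_ids.append(circuit.apply_tensor_operator((step % n_qubits,), ry(0.1), unitary=True))
    energy = circuit.compute_expectation(hamiltonian)
    print(f"step {step}: <H> = {energy.real:.6f}")

profiler.disable()
# Where does the time go? Filter for the cutensornet bindings, e.g.
# expectation_prepare vs expectation_compute.
pstats.Stats(profiler).sort_stats("cumulative").print_stats("cutensornet")
```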
The experiments were performed on an A10G GPU (AWS EC2 instance), and we are using the latest available version of cuQuantum.
Do you have any suggestions for how we might shift the performance bottleneck from the preparation phase toward the actual contraction phase? We suspect the preparation includes contraction path optimization and related overheads.
Best regards,
Sebastian Yde Madsen