Re-allow restart in driven solver by parsing csv table from file #393

phdum-a · 2025-06-27T02:52:07Z

This resolves #376.

Palace had the "Restart" flag in the driven solver. Previously this was simple because printing always appended to files row by row. A restarted run did not need to know what had happened or what had been written before the restart.

With the new_postpro changes (#302) and multi-excitation support (#309), this is more complicated, since all data for the full run is stored by PostOperatorCSV. This is because multi-excitation runs can't just append rows below but need to add columns to the side. To re-enable restart, we have to parse the csv file back from disk, which we do using scnlib. When we load from the csv, we have to compare against the expected table in order to (a) validate that the tables read in are in a good state and (b) recover information about the Table state (like internal column names) that is not in the csv.
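As a rough illustration of the reload step described above, here is a minimal sketch that parses csv text back into named columns of doubles. It uses only the standard library as a stand-in for scnlib, and `ParsedTable`, `Trim`, `SplitLine` and `ParseCsv` are hypothetical names, not Palace's Table API. Treating an empty cell as "not written yet" is also an assumption of this sketch.

```cpp
#include <cstddef>
#include <exception>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a reloaded table: one header and one data vector per column.
struct ParsedTable
{
  std::vector<std::string> headers;
  std::vector<std::vector<double>> columns;
};

// Strip leading and trailing whitespace from a cell.
inline std::string Trim(const std::string &s)
{
  const auto b = s.find_first_not_of(" \t\r");
  if (b == std::string::npos)
  {
    return "";
  }
  const auto e = s.find_last_not_of(" \t\r");
  return s.substr(b, e - b + 1);
}

// Split one line at commas and trim every cell.
inline std::vector<std::string> SplitLine(const std::string &line)
{
  std::vector<std::string> cells;
  std::stringstream ss(line);
  std::string cell;
  while (std::getline(ss, cell, ','))
  {
    cells.push_back(Trim(cell));
  }
  return cells;
}

// Parse csv text whose first line is a header and whose remaining cells are doubles.
// Empty cells are skipped (assumed "not written yet"); bad input gives std::nullopt.
inline std::optional<ParsedTable> ParseCsv(const std::string &text)
{
  std::istringstream in(text);
  std::string line;
  if (!std::getline(in, line))
  {
    return std::nullopt;
  }
  ParsedTable t;
  t.headers = SplitLine(line);
  t.columns.resize(t.headers.size());
  while (std::getline(in, line))
  {
    const auto cells = SplitLine(line);
    if (cells.size() > t.headers.size())
    {
      return std::nullopt;  // More cells than header columns.
    }
    for (std::size_t j = 0; j < cells.size(); j++)
    {
      if (cells[j].empty())
      {
        continue;
      }
      try
      {
        std::size_t pos = 0;
        const double v = std::stod(cells[j], &pos);
        if (pos != cells[j].size())
        {
          return std::nullopt;  // Trailing characters after the number.
        }
        t.columns[j].push_back(v);
      }
      catch (const std::exception &)
      {
        return std::nullopt;  // Cell is not a number.
      }
    }
  }
  return t;
}
```

Note that the parsed result only carries what is printed in the file (headers and values), which is why the comparison against an expected table is still needed afterwards.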

Additionally, we clean up various parts of the code for simplicity and to make it more testable. We add several unit tests for the table reload.

The commits have been squashed and cleaned to be understandable one-by-one.

hughcars

One big bug right now, which is that if you run a non-restarting config file and there was already a postpro folder in the same location, which previously would have overwritten, it will now error saying the data to restart from is inconsistent.

There's also some minor things with the printing to terminal of the total iteration. I tested restarting the uniform cpw case a few times on either excitation and it seems to work on that.

Aside from that looks ok. I have some simplifications on hughcars/fix_restart_printing and have extended the julia testing to allow looking at an already populated postpro folder which makes this simpler. Once this is in we should seriously look at refactoring the csv stuff away, because at this point we're reading and writing so could drastically simplify things by using an existing library. That's beyond the scope of this patch pr however.

When ready, don't forget to add an entry in the CHANGELOG mentioning the bug fix.

palace/models/postoperatorcsv.cpp

palace/drivers/drivensolver.cpp

hughcars · 2025-07-02T22:02:00Z

palace/drivers/drivensolver.cpp

+  std::size_t total_step_i = indexing.restart - 1;
+  for (const auto &[exc_step_i, exc_kv] :
+       EnumerateView(port_excitations.excitations, indexing.excitation_n0))


There's no need for this abstraction, nor the indexing helper, this is pretty straightforward integer arithmetic. Can then delete the EnumerateView and its file.

    const int restart_excitation_idx =
        (iodata.solver.driven.restart - 1) / omega_sample.Size() + 1;
    const int freq_restart_idx =
        (iodata.solver.driven.restart - 1) % int(omega_sample.size());
    for (const auto &[excitation_idx, excitation_spec] : port_excitations)
    {
      if (++excitation_counter < restart_excitation_idx)
      {
        continue;
      }
      ....
      // Frequency loop.
      for (std::size_t omega_i =
               ((excitation_counter == restart_excitation_idx) ? freq_restart_idx : 0);
           omega_i < omega_sample.size(); omega_i++)
      {
        auto omega = omega_sample[omega_i];

Given restart is only applied to the online loop of the adaptive solve too, the same integer loops can be used there. See hughcars/fix_restarting_printing for details and also some changes to the julia testing calls to allow checking it.

EnumerateView and DrivenSolverIndexing came about because I had bugs with indexing while coding these loops, so I decided I needed some structures that help and are easy to test.

Originally, DrivenSolverIndexing was a more complicated structure that generated nested excitation/frequency iterators, until I thought that this was too specific to the driven solver. Then I coded up EnumerateView because we can use it elsewhere in the code too. Also, it is just a feature of the ranges library that we will get in future C++ standards, so this is just a bridge until then. Importantly, this class is easily testable in isolation. DrivenSolverIndexing is now just a collection of POD that I am happy to remove.

There's no need for this abstraction, nor the indexing helper, this is pretty straightforward integer arithmetic. Can then delete the EnumerateView and its file.

Sure it is simple integer arithmetic. And I am not particularly attached to EnumerateView. But there is the unfortunate fact that on your hughcars/fix_restarting_printing branch you introduced two separate indexing errors:

In the uniform sweep loop (line 125) you wrote: const int restart_excitation_idx = (iodata.solver.driven.restart - 1) / (omega_sample.size() + 1) + 1;

In the adaptive sweep loop (line 373) you wrote: const int restart_excitation_idx = (iodata.solver.driven.restart - 1) / port_excitations.Size() + 1;

These should be the same and equal to what you wrote in your GitHub comment: const int restart_excitation_idx = (iodata.solver.driven.restart - 1) / omega_sample.Size() + 1; (except that Size() should be size()).
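To make the difference between the three divisors concrete, here is a small sketch with the expressions pulled out into free functions. The function names are hypothetical and only exist for illustration. `restart` is the 1-based sample number from the config file.

```cpp
// 1-based excitation counter at which to resume, dividing by the number of
// frequency samples (the expression from the review comment).
inline int ExcitationRestartCounter(int restart, int n_freq)
{
  return (restart - 1) / n_freq + 1;
}

// Uniform-sweep variant quoted above: divides by (n_freq + 1).
inline int ExcitationRestartCounterPlusOne(int restart, int n_freq)
{
  return (restart - 1) / (n_freq + 1) + 1;
}

// Adaptive-sweep variant quoted above: divides by the number of excitations
// instead of the number of frequency samples.
inline int ExcitationRestartCounterByExcitations(int restart, int n_exc)
{
  return (restart - 1) / n_exc + 1;
}
```

Taking 7 frequency samples and 4 excitations as an example: restart = 8 is the first sample of the second excitation, so the first function returns 2, while the `(n_freq + 1)` variant returns 1. For restart = 5, which is still inside the first excitation, the first function returns 1, while dividing by the number of excitations returns 2.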

The rest looks correct (I think?), but I still had to draw a little 2d table on a piece of paper and check various combinations. Specifically:

The above assumes that restart_excitation_idx is 1-indexed, which is different from freq_restart_idx, which is 0-indexed. This is super confusing, doubly so since you initialize int excitation_counter = 0; and immediately increment it.

restart_excitation_idx and freq_restart_idx should probably be called _counter, not _idx, since excitation_idx refers to something else (the value of the excitation index being simulated).

++excitation_counter < restart_excitation_idx is correct but, I would argue, hard to read.

Not sure how to test this in a quick unit test, without running a bunch of small driven solves.
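One possible way to test the indexing without running any solves is to move the arithmetic into a free function and check it against a plain nested loop. This is only a sketch: `RestartToCounters` and `VisitedSamples` are hypothetical helpers, and both returned indices are 0-based here, which differs from the 1-based excitation counter in the snippet under review.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: map the 1-based "Restart" sample number to a 0-based
// (excitation counter, frequency index) pair, with frequency as the fast index.
inline std::pair<std::size_t, std::size_t> RestartToCounters(std::size_t restart,
                                                             std::size_t n_freq)
{
  const std::size_t linear = restart - 1;  // 0-based combined sample number.
  return {linear / n_freq, linear % n_freq};
}

// List the (excitation counter, frequency index) pairs that a nested loop visits
// when it resumes from the given restart value.
inline std::vector<std::pair<std::size_t, std::size_t>>
VisitedSamples(std::size_t restart, std::size_t n_exc, std::size_t n_freq)
{
  const auto [exc0, freq0] = RestartToCounters(restart, n_freq);
  std::vector<std::pair<std::size_t, std::size_t>> visited;
  for (std::size_t e = exc0; e < n_exc; e++)
  {
    // Only the first resumed excitation starts at an offset in frequency.
    for (std::size_t f = (e == exc0) ? freq0 : 0; f < n_freq; f++)
    {
      visited.emplace_back(e, f);
    }
  }
  return visited;
}
```

A unit test can then loop over every restart value and check that exactly `n_exc * n_freq - restart + 1` samples are visited and that the last one is always the final (excitation, frequency) pair.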

Like with our discussion on strong-typed indices, I would continue to suggest clean, readable and testable code, even when that requires a few hundred lines of tested boilerplate helpers. But I understand that what counts as "small helpers" vs "unnecessary overhead" is subjective.

Edit: Please decide exactly which solution you want here.

Aside: For port_excitations we use std::map. We actually use std::map in all sorts of places for the map API, but these things are typically tiny amounts of data, so std::map is pretty inefficient. Also, we often want both a "counter index" and a "semantic index" (excitation, port, etc.). We could do this with map + enumerate, but we could also consider making a vector-backed class with the two different indices, like the boost vector property map does. That would also give us random access.
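A minimal sketch of what such a vector-backed container could look like is below. `IndexedVector` is a hypothetical name and this is not modelled on any specific boost interface. One difference from std::map to keep in mind is that entries stay in insertion order rather than being sorted by key.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical vector-backed container for small collections. Each entry has a
// position ("counter index") and a user-facing label ("semantic index").
template <typename T>
class IndexedVector
{
  std::vector<std::pair<int, T>> data;

public:
  // Append a value with its semantic index; returns false if the index exists.
  bool Insert(int semantic_idx, T value)
  {
    if (CounterOf(semantic_idx))
    {
      return false;
    }
    data.emplace_back(semantic_idx, std::move(value));
    return true;
  }

  std::size_t Size() const { return data.size(); }

  // Random access by counter index; .first is the semantic index.
  const std::pair<int, T> &operator[](std::size_t counter) const
  {
    return data[counter];
  }

  // Linear search by semantic index, which is fine for a handful of entries.
  std::optional<std::size_t> CounterOf(int semantic_idx) const
  {
    for (std::size_t i = 0; i < data.size(); i++)
    {
      if (data[i].first == semantic_idx)
      {
        return i;
      }
    }
    return std::nullopt;
  }
};
```

With this, a loop over `0 .. Size()` has the counter index directly and reads the semantic index from each entry, without needing a separate enumerate helper.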

With 7 frequency samples, and four excitations, testing restarting in each of the excitation sweeps for cpw, then

    const int excitation_restart_counter =
        ((iodata.solver.driven.restart - 1) / omega_sample.size()) + 1;
    const int freq_restart_idx = (iodata.solver.driven.restart - 1) % omega_sample.size();

works. excitation_restart_counter is a better name as it's supposed to be reflecting the excitation_counter anyway. It can be changed to 0-based instead, but that means needing if (++excitation_counter < excitation_restart_counter) to become if (excitation_counter++ < excitation_restart_counter) which requires a "mid - left - right" parse, rather than left to right parse.

EnumerateView requires opening another file, internalizing it, and then maintaining and testing that file, all for one call site. It's not worth it. I don't want to introduce extra things for developers to learn just to loop over a two-dimensional array.

hughcars · 2025-07-02T22:03:12Z

palace/drivers/drivensolver.hpp

+// Mini helper class that stored indexing information for printing and restart.
+struct DrivenSolverIndexing
+{
+  std::size_t nr_total_samples;
+
+  // Restart (1-Based indexing from config file)
+  std::size_t restart = 1;
+
+  // Offsets from restart (0-Based)
+  std::size_t excitation_n0 = 0;
+  std::size_t omega_n0 = 0;
+
+  DrivenSolverIndexing(std::size_t nr_port_excitations, std::size_t nr_freq_samples,
+                       std::size_t restart_)
+    : nr_total_samples(nr_port_excitations * nr_freq_samples), restart(restart_),
+      excitation_n0(std::size_t(restart - 1) / nr_freq_samples),
+      omega_n0(std::size_t(restart - 1) % nr_freq_samples)
+  {
+  }
+};


This isn't needed, see the loops doing away with EnumerateView.

hughcars · 2025-07-02T22:04:36Z

palace/drivers/drivensolver.cpp

+
+      total_step_i++;  // Increment combined counter.
    }
+    indexing.omega_n0 = 0;  // No offset in omega from "Restart" in later excitations.


Not needed, can compute this on the fly in the print message from the excitation and omega indices when using integer loops.
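For reference, the on-the-fly computation could look like the sketch below. `TotalStep` is a hypothetical helper, and it assumes a 0-based excitation counter and a 0-based frequency index with frequency as the fast index.

```cpp
#include <cstddef>

// Hypothetical helper: 1-based combined step number for the print message, from
// the 0-based excitation counter and the 0-based frequency index.
inline std::size_t TotalStep(std::size_t exc_counter, std::size_t omega_i,
                             std::size_t n_freq)
{
  return exc_counter * n_freq + omega_i + 1;
}
```

If the loop keeps a 1-based excitation counter instead, `excitation_counter - 1` would be passed as the first argument.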

hughcars · 2025-07-02T23:14:20Z

test/unit/test-view_helper.cpp

@@ -0,0 +1,214 @@
+#include <iterator>


Delete with EnumerateView.

palace/utils/tablecsv.cpp

hughcars · 2025-07-02T23:29:23Z

palace/models/postoperatorcsv.cpp

+
+template <ProblemType solver_t>
+void PostOperatorCSV<solver_t>::MoveTableValidateReload(TableWithCSVFile &t_csv_base,
+                                                        Table &&t_ref)


This appears pretty complicated for loading a csv, then checking the number of rows and columns match the existing dimensions. Additionally it suggests you couldn't restart a uniform run with existing data, by adding an extra excitation index after the fact, as the two sizes wouldn't match exactly, but the loaded table would be a subset of the to-be-filled size.

Let me respond to this in reverse:

it suggests you couldn't restart a uniform run with existing data, by adding an extra excitation index after the fact

This is correct — one absolutely cannot do that. Changing (including appending) which frequencies or excitations are being run was not a feature we envisaged and would require bigger changes.

Why? Fundamentally, because in a restart we don't parse the "old-config.json" file. We would need this in order to fully understand what the old data corresponds to and to figure out what data the user asked for in the "new-config.json" is still missing. And even if we had the old-config.json, a single restart integer has issues if the shape of frequencies / excitations changes (see below).

Importantly, we cannot currently reconstruct everything we need of "old-config.json" from the "data.csv" files we read in. Instead, we have to assume that the "data.csv" we parse has the same format required by the new-config.json. What MoveTableValidateReload does is (a) validate that everything is consistent for the driven solver loop to continue error-free and (b) get the information we cannot recover from parsing by looking at the default t_ref table.
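To illustrate the two roles (a) and (b) in isolation, here is a heavily simplified sketch. `Col`, `SimpleTable` and `ValidateAndMerge` are hypothetical and much smaller than the real Table and MoveTableValidateReload. In particular, the sketch assumes a single excitation, so every column must hold exactly restart - 1 rows.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified column. The header and data can be parsed from the csv
// file; the internal name and print precision cannot and come from the reference.
struct Col
{
  std::string internal_name;
  std::string header;
  int precision = 9;
  std::vector<double> data;
};

using SimpleTable = std::vector<Col>;

// Returns true and fills `out` if `loaded` is consistent with the expected layout
// `ref` and the 1-based restart position; otherwise returns false and leaves `out`
// untouched.
inline bool ValidateAndMerge(const SimpleTable &loaded, const SimpleTable &ref,
                             std::size_t restart, SimpleTable &out)
{
  if (restart < 1 || loaded.size() != ref.size())
  {
    return false;
  }
  SimpleTable merged = ref;  // (b) Start from the reference: names, precision.
  for (std::size_t j = 0; j < ref.size(); j++)
  {
    // (a) Headers must match and the write cursor must sit at the restart row.
    if (loaded[j].header != ref[j].header || loaded[j].data.size() != restart - 1)
    {
      return false;
    }
    merged[j].data = loaded[j].data;  // Take only the measured values from disk.
  }
  out = std::move(merged);
  return true;
}
```

The point of the sketch is that the loaded file alone is never trusted to define the table: it only contributes values, and only after its shape agrees with what the current config expects.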

Consider things that go wrong with your idea:

Let's assume that we have a uniform run with two excitations (2, 4) already done. Now we want to add an extra excitation index 1 after the start. Currently, we store and iterate over excitations in a map, so excitation indices are sorted. Excitation 1 has to prepend columns to the table, so the "restart" indexing with a single integer no longer works. It gets worse if we want to add both 1 and 3, since then we have to interleave column blocks. In principle, we could print column blocks as 2, 4, 1, 3, but we would need extra structure in config.json that encodes the run order of excitations.

It gets even worse if we start off with a uniform run with a single excitation (e.g. 2). For backward compatibility, we decided not to print the excitation index if there is only a single excitation (except in the S matrix). So when we load the table, we don't know what the original excitation was. Now I pass a new-config.json that asks for excitations 1, 2, 3 and a restart index. Even setting aside what the restart index means in this case, how does it know to move the existing data to column block 2 (and add the [2] print label to it)?
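The column-block shift can be seen with a few lines around std::map, which always iterates in sorted key order. `ColumnBlockOf` is a hypothetical helper, and the string values in the map are placeholders for whatever per-excitation data is stored.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <optional>
#include <string>

// Hypothetical: 0-based position of an excitation's column block when blocks are
// laid out in the sorted key order of a std::map.
inline std::optional<std::size_t>
ColumnBlockOf(const std::map<int, std::string> &excitations, int excitation_idx)
{
  const auto it = excitations.find(excitation_idx);
  if (it == excitations.end())
  {
    return std::nullopt;
  }
  return static_cast<std::size_t>(std::distance(excitations.begin(), it));
}
```

With excitations {2, 4}, excitation 2 sits in block 0 and excitation 4 in block 1. After also inserting 1 and 3, excitation 2 moves to block 1 and excitation 4 to block 3, so data already written on disk would no longer line up with the expected layout.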

With more effort there are things I could have done with parsing information, but they don't solve the indexing problem above:

There are column options (like float print precision and empty space padding) that I could parse from "data.csv" right now but don't. I just inherit them from t_ref for convenience. These are annoying to parse and would make the parsing in Table more involved (see your complaints above). Also, if we decide to change the padding default, I think the new table defaults should overwrite the old ones anyway.

I don't validate the parsed frequency values printed in the data.csv against the ones in "new-config.json" (only the overall expected length). I could do this for extra security.

This appears pretty complicated for loading a csv, then checking the number of rows and columns match the existing dimensions.

Could you please point to what you think is unnecessary in the checks / reload?

I think given the current state the settings and checks are necessary. My point is that we have arrived at a very complicated situation for something that is deep down pretty simple: writing a string of numbers to a file, separated by commas and whitespace, and being able to pick up where we left off. That we can't do something like deciding to add another excitation onto an existing set of runs indicates we've gone astray. This is tangential to this PR, but reviewing it has really made me appreciate that.

phdum-a · 2025-07-18T17:09:54Z

There's also some minor things with the printing to terminal of the total iteration.

Yes, there was a typo in the print message. The frequency index was printed against the port excitation size instead of the number of frequency samples, which is why you were seeing something wrong:

    Mpi::Print("\nNon-trivial restart: starting at{} frequency step {:d}/{:d}.\n",
               (port_excitations.Size() > 1)
                   ? fmt::format(" excitation step {:d}/{:d} and",
                                 indexing.excitation_n0 + 1, port_excitations.Size())
                   : "",
               indexing.omega_n0 + 1, port_excitations.Size());

should have been

    Mpi::Print("\nRestarting from:{} frequency step {:d}/{:d}.\n",
               (port_excitations.Size() > 1)
                   ? fmt::format(" excitation step {:d}/{:d} and",
                                 indexing.excitation_n0 + 1, port_excitations.Size())
                   : "",
               indexing.omega_n0 + 1, omega_sample.size());
  }

This is fixed in the "Address PR Commit", but see below about integrating with the changes on your branch.

phdum-a · 2025-07-18T17:19:14Z

One big bug right now, which is that if you run a non-restarting config file and there was already a postpro folder in the same location, which previously would have overwritten, it will now error saying the data to restart from is inconsistent.

I see. "No restart" still sets restart=1 and so if there is data in the tables already there, it will try and load existing tables and check that it can start writing at position 1 (which it can't since there is data there or the previous run had a different shape). Since we allow overwriting runs, we can just short circuit restart=1 to ignore existing tables.

Edit below in commit:
Ignore any existing data with default restart
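The short circuit can be summarised as a small decision function. This is a sketch only: `TableAction` and `DecideTableAction` are hypothetical names, and treating a missing file with restart > 1 as an error is an assumption of the sketch rather than something stated in the commit.

```cpp
#include <cstddef>

// Hypothetical outcome of looking at the postpro folder before a run.
enum class TableAction
{
  Overwrite,  // Fresh run: ignore whatever is on disk.
  Reload,     // Load the existing csv and validate it against the expected table.
  Error       // Restart requested but there is nothing to restart from (assumed).
};

// restart is the 1-based "Restart" value from the config; 1 is the default.
inline TableAction DecideTableAction(std::size_t restart, bool csv_exists)
{
  if (restart <= 1)
  {
    return TableAction::Overwrite;
  }
  return csv_exists ? TableAction::Reload : TableAction::Error;
}
```

The key property is that the default restart value never touches existing files for validation, which restores the old overwrite behaviour.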

phdum-a · 2025-07-18T21:27:17Z

Rebased this branch (phdum/fix_restart_printing) onto main and the feedback branch hughcars/fix_restart_printing on the corresponding new commit for testing changes/fixes.

phdum-a · 2025-07-21T18:14:47Z

One open issue is still testing of the indexing in the drivensolver.cpp (see also the discussion above). I've added a whole bunch of cases that test PostOperatorCSV with a mini-mesh.

But I've not actually tested anything that does a solve, since I'm not sure the unit-test system is quite set up for that yet.

I guess need to copy all the openmp, gpu, etc options that are being set in main.cpp for solves? This should probably be refactored and mirrored exactly in the main of the unit tests.
We probably want to suppress printing all the default logs to stdout during tests.

There are probably testing refactors one can do to make components more individually testable, but that is a longer term project.

Your commit about the julia option change looks ok, but do you have something specific in mind for hooking this into the CI?

hughcars · 2025-08-06T19:05:05Z

One open issue is still testing of the indexing in the drivensolver.cpp (see also the discussion above). I've added a whole bunch of cases that test PostOperatorCSV with a mini-mesh.

But I've not actually tested anything that does a solve, since I'm not sure the unit-test system is quite set up for that yet.

I guess need to copy all the openmp, gpu, etc options that are being set in main.cpp for solves? This should probably be refactored and mirrored exactly in the main of the unit tests.

We probably want to suppress printing all the default logs to stdout during tests.

There are probably testing refactors one can do to make components more individually testable, but that is a longer term project.

Your commit about the julia option change looks ok, but do you have something specific in mind for hooking this into the CI?

I wasn't intending to get this as an explicit CI run, as it's such a niche feature. It was just a capability I was using for testing, which showed things up. I've also opted to just disable Restart on the adaptive runs on the branch I was using, as it doesn't make sense there: the online loop is the "cheap" piece, and the expensive piece of assembling the PROM has to be done no matter what.

- Don't allow empty_cell_val per column for consistent re-parsing - Fix lshift bug in Column - Remove unused drive_source_index in PostOperator CSV - Remove unused AppendHeader, Append Row in TableWithCSVFile - Add comments - Add small tweaks for measurement table parsing Propagate table column block index - Needed for row alignment check during table reload

phdum-a requested a review from hughcars June 27, 2025 02:52

phdum-a force-pushed the phdum/fix_restart_printing branch 2 times, most recently from b20b718 to 0c9f142 Compare June 27, 2025 18:08

hughcars requested changes Jul 2, 2025

phdum-a force-pushed the phdum/fix_restart_printing branch from c18b36a to 5beff28 Compare July 18, 2025 21:23

phdum-a force-pushed the phdum/fix_restart_printing branch from 5beff28 to 52613df Compare July 18, 2025 21:58

phdum-a assigned hughcars Jul 21, 2025

hughcars force-pushed the phdum/fix_restart_printing branch from fe75043 to a3b7f45 Compare August 9, 2025 15:30

phdum-a and others added 17 commits August 9, 2025 11:30

Rename restart variable from rst

eacf47f

Clean up restart behaviour in driven solver

3cb52e4

- Add DrivenSolverIndex data container to group values
- Restart acts on total sample (freq and excitation combined)
- Add EnumerateView helper for C++17 compatibility
- Add tests

Add scnlib as dependency for text parsing

4e9ea7d

Add parsing of string table to Table

183fec0

- Assumes all entries are all double

Add file loading to TableWithCSVFile wrapper

deddcab

- Add function to return csv filepath for potential error messages

Change shared default options in Table for Column printing

05e81e7

- Replace pointer to common options with explicit passing, in order to not write custom move

Allow restart data to be loaded into PostOperatorCSV

0c368b1

- Must validate table structure against expectation
- Must copy some parameter from reference that can't be parsed
- Validate cursor location with restart
- Properly set nr_expected_measurement_rows for all solvers

Remove post_op* in PostOperatorCSV for easier testing

92df44f

- Thread through objects needed for init and print in call

Rebalance PostOperatorCSV ctor and InitializeCSVDataCollection

2743760

- Move fem_op_t so PostOperatorCSV ctor can use it

Add unit tests for PostOperatorCSV restart

1fbf04e

- Move expected filling pattern into free function for testing
- Add test case csv files to load

Add test for loading data from file for TableWithCSVFile

3e0cc4f

Only rename variables without other changes

9076c66

In PostOperatorCSV:
- m_idx_row -> row_i
- m_idx_value -> row_idx_v
- excitation_idx_all -> ex_idx_v_all
- SingleColBlock() -> HasSingleExIdx()
- may_reload_table() -> MayReloadTable

Also some test names changed.

Add missing scn dependency in ExternalPalace.cmake

06dc4f3

Fix pair reference type

497d5fe

Fix PostOperatorCSV tests MPI awareness

6148180

Remove EnumerateView in favour of modular arithmetic

926fc97

hughcars and others added 3 commits August 9, 2025 11:30

Add option for checking an existing postpro in examples folder. Usefu…

ff8a77f

…l for testing restarting

Ignore any existing data with default restart

c2f74c0

Drop support for Restart in adaptive frequency sweep.

fa7bcc2

hughcars force-pushed the phdum/fix_restart_printing branch from a3b7f45 to fa7bcc2 Compare August 9, 2025 15:30

hughcars approved these changes Aug 9, 2025

hughcars enabled auto-merge August 9, 2025 15:32

phdum-a added 4 commits August 11, 2025 12:47

Bug-fix space trim in Table parsing

51238a2

Explicit template instantiation for testing

f2d26e1

- Fixes LTO issue

Add scnlib to spack package

c798ae7

Update changelog

32faeba

hughcars merged commit c3731f3 into main Aug 11, 2025
27 checks passed

hughcars deleted the phdum/fix_restart_printing branch August 11, 2025 17:47
