Skip to content

Conversation

phdum-a
Copy link
Contributor

@phdum-a phdum-a commented Jun 27, 2025

This resolves #376.

Palace had the "Restart" flag in the driven solver. Previously this was simple because printing always appended to files row-by-row. A started run did not need to know what happened or what was written before the restart.

With then new_postpro changes (#302) and multi-excitation support (#309), this is more complicated since all data for the full run is stored by PostOperatorCSV. This is all due to the fact that multi-excitations can't just append to rows below but need to change columns to the side. To re-enable restart, we have to parse back the csv file from disk, which we do using scnlib. When we load from the csv, we have to compare to the expected table, in order to (a) validate that the read in tables are in a good state (b) recover information about the Table state (like internal column names) not in the csv.

Additionally, we clean up various parts of the code for simplicity and to make it more testable. We add several unit tests for the table reload.

The commits have been squashed and cleaned to be understandable one-by-one.

@phdum-a phdum-a requested a review from hughcars June 27, 2025 02:52
@phdum-a phdum-a force-pushed the phdum/fix_restart_printing branch 2 times, most recently from b20b718 to 0c9f142 Compare June 27, 2025 18:08
Copy link
Collaborator

@hughcars hughcars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One big bug right now, which is that if you run a non-restarting config file and there was already a postpro folder in the same location, which previously would have overwritten, it will now error saying the data to restart from is inconsistent.

There's also some minor things with the printing to terminal of the total iteration. I tested restarting the uniform cpw case a few times on either excitation and it seems to work on that.

Aside from that looks ok. I have some simplifications on hughcars/fix_restart_printing and have extended the julia testing to allow looking at an already populated postpro folder which makes this simpler. Once this is in we should seriously look at refactoring the csv stuff away, because at this point we're reading and writing so could drastically simplify things by using an existing library. That's beyond the scope of this patch pr however.

When ready, don't forget to add an entry in the CHANGELOG mentioning the bug fix.

Comment on lines 135 to 137
std::size_t total_step_i = indexing.restart - 1;
for (const auto &[exc_step_i, exc_kv] :
EnumerateView(port_excitations.excitations, indexing.excitation_n0))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need for this abstraction, nor the indexing helper, this is pretty straightforward integer arithmetic. Can then delete the EnumerateView and its file.

  const int restart_excitation_idx = (iodata.solver.driven.restart - 1) / omega_sample.Size() + 1;
  const int freq_restart_idx = (iodata.solver.driven.restart - 1) % int(omega_sample.size());
  for (const auto &[excitation_idx, excitation_spec] : port_excitations)
  {
    if (++excitation_counter < restart_excitation_idx)
    {
      continue;
    }
    ....
        // Frequency loop.
    for (std::size_t omega_i = ((excitation_counter == restart_excitation_idx) ? freq_restart_idx : 0); omega_i < omega_sample.size(); omega_i++)
    {
      auto omega = omega_sample[omega_i];

Given restart is only applied to the online loop of the adaptive solve too, the same integer loops can be used there. See hughcars/fix_restarting_printing for details and also some changes to the julia testing calls to allow checking it.

Copy link
Contributor Author

@phdum-a phdum-a Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EnumerateView and DrivenSolverIndexing came about is because I had bugs with indexing while coding these loops and so decided I need some structure that help and are easy to test.

Originally, DrivenSolverIndexing was a more complicated structure that generated nested excitation/frequency iterators — until I thought that this was too specific for the driver solver. Then I coded up EnumerateView because we can use it elsewhere in the code too. Also it is just a a feature of the ranges library that we will get in future C++ standards, so this is just bridge to then. Importantly, this class is easily testable in isolation. DrivenSolverIndexing is now just a collection of POD that I am happy to remove.

There's no need for this abstraction, nor the indexing helper, this is pretty straightforward integer arithmetic. Can then delete the EnumerateView and its file.

Sure it is simple integer arithmetic. And I am not particularly attached to EnumerateView. But there is the unfortunate fact that on your hughcars/fix_restarting_printing branch you introduced two separate indexing errors:

  • In the uniform sweep look (line 125) you wrote: const int restart_excitation_idx = (iodata.solver.driven.restart - 1) / (omega_sample.size() + 1) + 1; I
  • In the adaptive sweep loop (line 373) you wrote const int restart_excitation_idx =(iodata.solver.driven.restart - 1) / port_excitations.Size() + 1;

These should be the same and equal what you wrote in your github comment: const int restart_excitation_idx = (iodata.solver.driven.restart - 1) /omega_sample.Size() + 1; (except Size() should be size()).

The rest looks correct (I think?), but I still had to draw a little 2d table on a pieced of paper and check various combinations. Specifically:

  • The above assumes that restart_excitation_idx is 1-index, which is different from freq_restart_idx which is 0-index. Which is super-confusing. Double so since you initialize int excitation_counter = 0; and immediately increment.
  • restart_excitation_idx and freq_restart_idx should probably be called _counter not _idx since excitation_idx refers to something else (the value of excitation indexing being simulated).
  • ++excitation_counter < restart_excitation_idx is correct but I would argue hard to read.

Not sure how to test this in a quick unit test, without running a bunch of small driven solves.

Like with our discussion on strong-typed indices, I would continue to suggest clean, readable and testable code, even when that requires a few 100s lines of tested boilerplate helpers. But I understand that what counts as "small helpers" vs "unnecessary overhead" is subjective.

Edit: Please make a decision exactly what solution you want here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: For port_excitations we use std::map. We actually use std::map in all sorts of places for the map API but these things are typically tiny amounts of data, so std::map is pretty inefficient. Also we often want both a "counter index" & "semantic index" (excitation, port, etc). So we could do this with map + enumerate. But we could also consider making a class with a vector backed and the two different indices like boost vector property map does. That would also give us random-access.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With 7 frequency samples, and four excitations, testing restarting in each of the excitation sweeps for cpw, then

  const int excitation_restart_counter =
      ((iodata.solver.driven.restart - 1) / omega_sample.size()) + 1;
  const int freq_restart_idx = (iodata.solver.driven.restart - 1) % omega_sample.size();

works. excitation_restart_counter is a better name as it's supposed to be reflecting the excitation_counter anyway. It can be changed to 0-based instead, but that means needing if (++excitation_counter < excitation_restart_counter) to become if (excitation_counter++ < excitation_restart_counter) which requires a "mid - left - right" parse, rather than left to right parse.

EnumerateView requires opening another file and then internalizing that, and maintaining that file and testing for it, for one call site, It's not worth it. I don't want to be introducing extra things for developers to have to learn only to loop over a 2 dimensional array.

Comment on lines 21 to 40
// Mini helper class that stored indexing information for printing and restart.
struct DrivenSolverIndexing
{
std::size_t nr_total_samples;

// Restart (1-Based indexing from config file)
std::size_t restart = 1;

// Offsets from restart (0-Based)
std::size_t excitation_n0 = 0;
std::size_t omega_n0 = 0;

DrivenSolverIndexing(std::size_t nr_port_excitations, std::size_t nr_freq_samples,
std::size_t restart_)
: nr_total_samples(nr_port_excitations * nr_freq_samples), restart(restart_),
excitation_n0(std::size_t(restart - 1) / nr_freq_samples),
omega_n0(std::size_t(restart - 1) % nr_freq_samples)
{
}
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't needed, see the loops doing away with EnumerateView.

Comment on lines 198 to 201

total_step_i++; // Increment combined counter.
}
indexing.omega_n0 = 0; // No offset in omega from "Restart" in later excitations.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed, can compute this on the fly in the print message from the excitation and omega indices when using integer loops.

@@ -0,0 +1,214 @@
#include <iterator>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete with EnumerateView.


template <ProblemType solver_t>
void PostOperatorCSV<solver_t>::MoveTableValidateReload(TableWithCSVFile &t_csv_base,
Table &&t_ref)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears pretty complicated for loading a csv, then checking the number of rows and columns match the existing dimensions. Additionally it suggests you couldn't restart a uniform run with existing data, by adding an extra excitation index after the fact, as the two sizes wouldn't match exactly, but the loaded be a subset of the to be filled size.

Copy link
Contributor Author

@phdum-a phdum-a Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me respond to this in reverse:

it suggests you couldn't restart a uniform run with existing data, by adding an extra excitation index after the fact

This is correct — one absolutely cannot do that. Changing (including appending) which frequencies or excitations are being run was not a feature we envisaged and would require bigger changes.

Why? Fundamentally, because in a restart we don't parse the "old-config.json" file. We would need this in order to fully understand what the old data corresponds to and figure out what data is missing that the user asked for the "new-config.json". And even if we had the old-config.json a single restart integer has issues if the shape of frequency / excitations changes (see below).

Importantly, we cannot currently reconstruct everything we need of "old-config.json" from parsing in the "data.csv" files we read in. Instead, we have to assume that the "data.csv" we parse has the same format required by the new-config.json. What MoveTableValidateReload does is (a) checks to validate everything is consistent for the driver solver loop to continue error-free (b) get the information we cannot recover from parsing by looking at the default t_ref table.

Consider things that go wrong with your idea:

  • Let's assume that we have a uniform run with two excitations (2,4) already done. Now we want to add an extra excitation index 1 after the start. Currently, we store and iterator over excitations in a map, so excitation indices are sorted. 1 has to pre-pend columns to the the table. So the "restart" indexing with a single inter no longer works. It gets worse if we want to add both 1,3 since then we have to interleave column blocks. In principle, we could print column blocks as 2,4,1,3 but we would need extra structure in config.json that encodes the run order of excitations.

  • No it get's worse if we start off with a uniform run with a single excitation (e.g. 2). Because of backward compatibility we decided to not print the excitation index if there is only a single excitation (except in the S matrix). So when we load the table we don't know what the original excitation was. Now I pass a new-config.json that asks for excitations 1,2,3 and a restart index. Even setting aside what the restart index means in this case, how does it know to move the existing data to column block 2 (and add the [2] print label to it)?

With more effort there are things I could have done with parsing information, but don't solve the indexing problem above:

  • There are columns options (like float print precision and empty space padding) that I could parse from "data.csv" right now but don't. I just inherit them from t_ref for convenience. These are annoying to parse and would make the parsing in table Table using more involved (see your complaints above). Also, if we decide to change the padding default, I think the new table defaults should overwrite the old one anyway.
  • I don't validate the parse frequencies values printed in the data.csv against the one in "new-config.json" (only the overall expected length). I could do this for extra security.

This appears pretty complicated for loading a csv, then checking the number of rows and columns match the existing dimensions.

Could you please point to what you think is unnecessary in the checks / reload?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think given the current state the settings and checks are necessary, my point is that we have arrived at a very complicated situation, for something that is deep down pretty simple: writing a string of numbers to a file, separated by commas and whitespace, and the ability to pick up where we left off. That we can't do something like deciding to add another excitation onto an existing set of runs, is indicative we've gone astray. This is tangent to this PR, but reviewing this has really made me appreciate it.

@phdum-a
Copy link
Contributor Author

phdum-a commented Jul 18, 2025

There's also some minor things with the printing to terminal of the total iteration.

Yes, there was a typo the print message. The frequency index was divided by port excitation size which is why you were seeing something wrong:

    Mpi::Print("\nNon-trivial restart: starting at{} frequency step {:d}/{:d}.\n",
               (port_excitations.Size() > 1)
                   ? fmt::format(" excitation step {:d}/{:d} and",
                                 indexing.excitation_n0 + 1, port_excitations.Size())
                   : "",
               indexing.omega_n0 + 1, port_excitations.Size());

should have been

    Mpi::Print("\nRestarting from:{} frequency step {:d}/{:d}.\n",
               (port_excitations.Size() > 1)
                   ? fmt::format(" excitation step {:d}/{:d} and",
                                 indexing.excitation_n0 + 1, port_excitations.Size())
                   : "",
               indexing.omega_n0 + 1, omega_sample.size());
  }

This is fixed in the "Address PR Commit", but see below about integrating with the changes on your branch.

@phdum-a
Copy link
Contributor Author

phdum-a commented Jul 18, 2025

One big bug right now, which is that if you run a non-restarting config file and there was already a postpro folder in the same location, which previously would have overwritten, it will now error saying the data to restart from is inconsistent.

I see. "No restart" still sets restart=1 and so if there is data in the tables already there, it will try and load existing tables and check that it can start writing at position 1 (which it can't since there is data there or the previous run had a different shape). Since we allow overwriting runs, we can just short circuit restart=1 to ignore existing tables.


Edit below in commit:
Ignore any existing data with default restart

@phdum-a phdum-a force-pushed the phdum/fix_restart_printing branch from c18b36a to 5beff28 Compare July 18, 2025 21:23
@phdum-a
Copy link
Contributor Author

phdum-a commented Jul 18, 2025

Rebased this branch (phdum/fix_restart_printing) onto main and the feedback branch hughcars/fix_restart_printing on the corresponding new commit for testing changes/fixes.

@phdum-a phdum-a force-pushed the phdum/fix_restart_printing branch from 5beff28 to 52613df Compare July 18, 2025 21:58
@phdum-a
Copy link
Contributor Author

phdum-a commented Jul 21, 2025

One open issue is still testing of the indexing in the drivensolver.cpp (see also the discussion above). I've added a whole bunch of cases that test PostOperatorCSV with a mini-mesh.

But I've not actually tested anything that does a solve, since I'm not sure the unit-test system is quite set up for that yet.

  • I guess need to copy all the openmp, gpu, etc options that are being set in main.cpp for solves? This should probably be refactored and mirrored exactly in the main of the unit tests.
  • We probably want to suppress printing all the default logs to stdout during tests.

There are probably testing refactors one can do to make components more individually testable, but that is a longer term project.


Your commit about the julia option change looks ok, but I do you have something specific in mind for hooking this into the CI?

@hughcars
Copy link
Collaborator

hughcars commented Aug 6, 2025

One open issue is still testing of the indexing in the drivensolver.cpp (see also the discussion above). I've added a whole bunch of cases that test PostOperatorCSV with a mini-mesh.

But I've not actually tested anything that does a solve, since I'm not sure the unit-test system is quite set up for that yet.

  • I guess need to copy all the openmp, gpu, etc options that are being set in main.cpp for solves? This should probably be refactored and mirrored exactly in the main of the unit tests.
  • We probably want to suppress printing all the default logs to stdout during tests.

There are probably testing refactors one can do to make components more individually testable, but that is a longer term project.

Your commit about the julia option change looks ok, but I do you have something specific in mind for hooking this into the CI?

I wasn't intending to get this as an explicit CI run, as it's such a niche feature. It was just a capability I was using to perform testing with, which showed things up. I've also opted to just disable Restart on the adaptive runs on the branch i was using, as it doesn't make sense with them being the "cheap" piece, and the expensive piece of assembling the PROM having to be done no matter what.

@hughcars hughcars force-pushed the phdum/fix_restart_printing branch from fe75043 to a3b7f45 Compare August 9, 2025 15:30
phdum-a and others added 17 commits August 9, 2025 11:30
- Add DrivenSolverIndex data container to group values
- Restart acts on total sample (freq and excitation combined)
- Add EnumerateView helper for C++17 compatibility
- Add tests
- Don't allow empty_cell_val per column for consistent re-parsing
- Fix lshift bug in Column
- Remove unused drive_source_index in PostOperator CSV
- Remove unused AppendHeader, Append Row in TableWithCSVFile
- Add comments
- Add small tweaks for measurement table parsing

Propagate table column block index

- Needed for row alignment check during table reload
- Assumes all entries are all double
- Add function to return csv filepath for potential error messages
- Replace pointer to common options with explicit passing, in order to not write custom move
- Must validate table structure against expectation
- Must copy some parameter from reference that can't be parsed
- Validate cursor location with restart
- Properly set nr_expected_measurement_rows for all solvers
- Thread through objects needed for init and print in call
- Move fem_op_t so PostOperatorCSV ctor can use it
- Move expected filling pattern into free function for testing
- Add test case csv files to load
In PostOperatorCSV:
- m_idx_row -> row_i
- m_idx_value -> row_idx_v
- excitation_idx_all -> ex_idx_v_all
- SingleColBlock() -> HasSingleExIdx()
- may_reload_table() -> MayReloadTable

Also some test names changed.
@hughcars hughcars force-pushed the phdum/fix_restart_printing branch from a3b7f45 to fa7bcc2 Compare August 9, 2025 15:30
@hughcars hughcars enabled auto-merge August 9, 2025 15:32
@hughcars hughcars merged commit c3731f3 into main Aug 11, 2025
27 checks passed
@hughcars hughcars deleted the phdum/fix_restart_printing branch August 11, 2025 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Restart" not working in driven simulation
2 participants