added a QUASR-downloader #425
base: master
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##           master     #425      +/-   ##
==========================================
- Coverage   92.01%   91.78%   -0.23%
==========================================
  Files          82       82
  Lines       16013    16043      +30
==========================================
- Hits        14734    14725       -9
- Misses       1279     1318      +39

Flags with carried forward coverage won't be shown.
View full report in Codecov by Sentry.
src/simsopt/configs/zoo.py
Outdated
        return curves, currents, ma

    elif return_style == 'json':
        return coils, ma, surfaces
Add an else case to raise an exception if return_style is something else.
There is now an exception raised higher up if return_style is something else.
src/simsopt/configs/zoo.py
Outdated
    return_style: 'default' or 'json'. If 'default', the function will return the curves, currents and magnetic axis
        like the other configurations in the zoo.
        If 'json' the function will return the full set of coils, magnetic axis and a number of surfaces
By "a number of surfaces", do you specifically mean a list of SurfaceXYZTensorFourier objects? If so it would be preferable to give this specific information.
This is clarified in the latest commit.
src/simsopt/configs/zoo.py
Outdated
    ID: the ID of the configuration to download. The database is navigatable at https://quasr.flatironinstitute.org/
        Alternatively, you can download the latest full set of devices from https://zenodo.org/doi/10.5281/zenodo.10050655
    return_style: 'default' or 'json'. If 'default', the function will return the curves, currents and magnetic axis
The name "json" isn't an intuitive description of what the function returns. Maybe "surfaces"?
Agreed--it also looks like "default" is only returning one half-period? Am I reading that right?
I don't think this PR is ready yet. We still need to rerun code coverage to make sure no lines are missing, and address @landreman's comments.
Hi,
I'm the author/maintainer of the QUASR navigator website. @andrewgiuliani asked me to take a look at this PR. Really glad for your interest and to see that the site is useful!
I've made a couple suggestions here. (I'm not a part of this project so I don't consider anything I say to be a blocker for this PR--just suggestions.)
Regarding the test, how often would you expect this to be run? While I don't anticipate a big issue, I would want to confirm with our admins before encouraging anyone to run lots of automated downloads from the site as part of a test suite; I would think that it would be better to do a unit test run on every commit & an integration test run on PR merge or whatever. (But I'm not sure whether that works with the existing test setup.)
In your PR, you asked whether there is a list of which integers correspond to a database entry, as you noted they are not contiguous. Unfortunately there isn't a fixed list of good IDs--it's subject to change with each release of new backing data. We could give you a subset, but I don't think a random draw from the complete set of devices is practical at this time. (For context, I was originally enforcing constraints on the valid ID sets in the website routing, but I realized it wouldn't be worthwhile given the likelihood that they would change.)
You also asked about querying against the database for a range of figures of merit. Unfortunately this is also not possible at present--the site actually works by downloading the full set of metadata for all devices, so there isn't a persistent database server to query against. In practice all filtering of the downloaded devices is done client-side, in the browser.
That said, while I appreciate the rigor of selecting random devices, I think using a fixed set of IDs to download should suffice here--we're hopefully not that adversarial an environment 😄 What I might consider would be testing several example IDs of different lengths & a range of device structures, to make sure the full gamut of known-good IDs can be handled by your code & the correct data is downloaded; but honestly even that seems like it might be overkill...
Again, thanks for your interest and please let me know if you'd like to discuss further!
src/simsopt/configs/zoo.py
Outdated
    parametrized in Boozer coordinates.
    """

    assert return_style in ['default', 'json']
Assertions can be disabled by caller/system config [e.g. if PYTHONOPTIMIZE is set], so it might be better not to rely on assert for a true runtime (not-debugging) check.
I didn't know this! Thanks for letting me know; we've changed the assert to an if-statement.
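For reference, a minimal sketch of that change, assuming the 'default'/'json' values from the docstring quoted above; the exact error message is illustrative:

```python
# Minimal sketch of replacing the assert with an explicit runtime check; the
# accepted values follow the docstring quoted above, and the message is illustrative.
if return_style not in ('default', 'json'):
    raise ValueError(f"Unknown return_style {return_style!r}; expected 'default' or 'json'")
```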
src/simsopt/configs/zoo.py
Outdated
    id_str = f"{ID:07d}"
    # string to 7 digits
    url = f'https://quasr.flatironinstitute.org/simsopt_serials/{id_str[0:4]}/serial{id_str}.json'
This is fine, but note that we reserve the right to change the ID structure in new versions of the database, which would break this code. (Fair warning!)
I might be able to set up an endpoint on QUASR for converting an ID (arbitrary length) to the correct location, but I wouldn't have a chance to do that for at least a few weeks.
Noted, thanks!
src/simsopt/configs/zoo.py
Outdated
| print(f"Configuration with ID {ID:07} downloaded successfully") | ||
| surfaces, ma, coils = json.loads(r.content, cls=GSONDecoder) | ||
| else: | ||
| raise ValueError(f"Configuration ID {ID:07d} does not exist. Status code: {r.status_code}") |
This isn't strictly accurate--the download might've failed for some other reason.
Updated the string here, thanks!
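For illustration only (the exact wording adopted in the PR may differ), a message in the quoted else branch that does not assume the ID is the problem could read:

```python
# Illustrative replacement for the quoted else branch; ID and r are the variables
# from get_QUASR_data shown above.  Not necessarily the PR's final wording.
raise ValueError(
    f"Could not download configuration {ID:07d} (HTTP status {r.status_code}); "
    "the ID may not exist, or the request may have failed for another reason."
)
```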
tests/configs/test_zoo.py
Outdated
    curves, currents, ma = get_QUASR_data(952)
    coils, ma, surfaces = get_QUASR_data(952, return_style='json')
Might also be useful to have assertions verifying (from the returned data) that these are the same device record.
Done; we check that the first coil/surface of the downloaded configuration is as expected.
src/simsopt/configs/zoo.py
Outdated
    ID: the ID of the configuration to download. The database is navigatable at https://quasr.flatironinstitute.org/
        Alternatively, you can download the latest full set of devices from https://zenodo.org/doi/10.5281/zenodo.10050655
    return_style: 'default' or 'json'. If 'default', the function will return the curves, currents and magnetic axis
Agreed--it also looks like "default" is only returning one half-period? Am I reading that right?
Sorry to leave this hanging so long, but I think this would be a great addition. I want to use it myself, as I am looking at more QUASR devices with pyoculus, and it would be great to grab them directly from a script. @jsoules, has the content of the json changed with newer entries to the database? For example, https://quasr.flatironinstitute.org/model/1258083 only returns surfaces and coils. @andrewgiuliani, let's divide the work; it should be pretty minimal: hard-code a few numbers into the tests and adapt to the new format of QUASR (no magnetic axis anymore :'( ).
Hi, the data in the individual-record JSON is an extract of the relevant row from the whole-database JSON; this is what populates the "Device Metadata" block on the right-hand side in the web UI. The set of fields included in the JSON (i.e. the quantities of interest) has changed in different releases, but without knowing the time point you're comparing to, I'm hesitant to give a definitive answer. Any changes would, however, affect all devices, as the same format is used for the entire database.
When we wrote the scripts in June, the database would return a json file containing … This would allow a user to more easily swap out a QUASR configuration in an existing script. Is it also possible for the user to request the entire row from the database? These quantities could be useful in scripts that further deal with this data.
Hi Chris, you're right! I have those json files too; we will swap them out so that the axis is returned as well. As for the row of the data frame, that should be doable too. Let me discuss with Jeff.
Hi, I also think this is a great addition, and I would further suggest optionally caching the requests to avoid unnecessary server load. I'd suggest a minimally invasive implementation like #468. This is also related to @jsoules' comment in #425 (review), since the requests cache for the unit test could be stored as an artifact, avoiding any traffic generated by CI.
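A minimal sketch of the check-cache-then-download pattern being suggested, assuming the QUASR_cache/serial&lt;id&gt;.json path convention that appears later in this thread; the helper name is hypothetical and this is not the exact implementation of #468:

```python
# Sketch of a simple request cache; the cache path convention follows the
# QUASR_cache/serial<id>.json pattern shown later in this thread, and the helper
# name fetch_quasr_json is hypothetical.
from pathlib import Path

import requests


def fetch_quasr_json(ID: int, cache_dir: Path = Path("QUASR_cache")) -> bytes:
    id_str = f"{ID:07d}"
    cache_file = cache_dir / f"serial{id_str}.json"
    if cache_file.is_file():
        return cache_file.read_bytes()          # serve from cache, no server traffic
    url = f"https://quasr.flatironinstitute.org/simsopt_serials/{id_str[0:4]}/serial{id_str}.json"
    r = requests.get(url)
    r.raise_for_status()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(r.content)           # populate the cache for next time
    return r.content
```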
@smiet, is this PR still in progress or should we close it?
@andrew and I were planning to have a look-see this week and wrap it up. We're currently a bit stuck on how to mock a request to a server in the testing... We're also looking into caching previous results to minimize the clobbering of the database. Let's keep this PR open and add the code from #468 if we so choose (it creates a folder upon import, which I would rather avoid; we will discuss the solution).
Let me know if you need a hand with that--it can be a bit idiosyncratic until you get used to it (mocking a context manager, in particular, can be fussy). |
@mishapadidar @smiet @jsoules I've mocked the requests.get call so that this is a true unit test; please give it a look.
I have not written an integration test -- I don't know if we have the infrastructure set up to run a subset of tests only at merging. Can we do this, @mbkumar?
We can definitely do this. It's a bunch of if conditions in the Action file.
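Regarding the mocked requests.get call mentioned above, a minimal sketch of the pattern; the patched module path, the import of get_QUASR_data, and the test-file location are assumptions based on other snippets in this thread:

```python
# Sketch of mocking requests.get with a canned response so the test never hits
# the QUASR server.  The patched path ("simsopt.configs.zoo.requests.get"), the
# import of get_QUASR_data, and the test-file location are assumptions based on
# other snippets in this thread, not the final test code.
from unittest.mock import MagicMock, patch

from simsopt.configs.zoo import get_QUASR_data

with patch("simsopt.configs.zoo.requests.get") as mock_get:
    mock_response = MagicMock()
    mock_response.status_code = 200
    with open("tests/test_files/serial0000952.json", "rb") as f:
        mock_response.content = f.read()
    mock_get.return_value = mock_response

    # Inside the context, get_QUASR_data receives the canned response instead
    # of performing a real download (return values depend on return_style).
    result = get_QUASR_data(952)
```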
At @andrewgiuliani's request, I've taken a look at this.
In general I think it's in really good shape; I caught a couple minor things, and made a few broader-scope suggestions that are not germane to this pull request.
The one thing that I think could be improved before merging is how this test will interact with the caching system in the implementation in zoo.py (explained in my in-line comments below).
I think you were also wondering about the integration test. I'll leave it to the local experts to determine how to actually flag a test to be run rarely (& what conditions should trigger running it), but apart from that it would just be as simple as running lines 34-40 of test_QUASR_downloader without mocking request, so that you actually hit the QUASR Navigator web server.
src/simsopt/configs/zoo.py
Outdated
    else:
        raise ValueError  # should not be reached as we check before download to avoid clobbering the database.
Personally, I'd probably leave this in even if it's supposed to be unreachable. Better to have a noisy error than just silently not do anything.
    This unit test checks that the get_QUASR_data functionality works as expected.
    We download the device with ID=0000952 is downloaded correctly. We also check that
    exceptions are raised if an ID is requested, but the associated device
    does not exist, or if the improper return style is passed.
Personally I tend to write separate tests for these different conditions, since the failures then give you a more specific idea of what's wrong, but of course you should follow whatever the project's standard practice is.
Oh also there's a typo in "We download the device with ID=.. is downloaded correctly."
        mock_response = MagicMock()
        mock_response.status_code = 200

        with open(THIS_DIR / '../test_files/serial0000952.json', "rb") as f:
Given that you have this string (& the numbers '952' in it) reused a couple times, it might be a good idea to do something like:
serial_number = 952
test_json_filename = f"serial{serial_number:07}.json"
path_to_test_data_file = THIS_DIR / f"../test_files/{test_json_filename}"
with open(path_to_test_data_file, "rb") as f:
    ...
curves, currents = get_QUASR_data(serial_number, return_style='simsopt-style')
...to make it easier to switch which test file you're using & where it's located. Not a big deal but it ensures consistency across the different usage points.
    if return_style == 'simsopt-style':
        nfp = surfaces[0].nfp
Stray thought--this is code to convert the downloaded JSON to internal object types, right? Is this repeated in several places in your project? Just feels like the kind of thing that might be, in which case might consider making it a separate function. (I don't want to expand the scope of this PR, just wanted to flag it as a thing to think about.)
        surfaces, coils = get_QUASR_data(952, return_style='quasr-style')
        assert isinstance(coils[0], Coil)
        assert isinstance(surfaces[0], SurfaceXYZTensorFourier)
        np.testing.assert_allclose(surfaces[0].x, true_surfaces[0].x)
        np.testing.assert_allclose(coils[0].x, true_coils[0].x)
Something to think about is the way that running the test might impact the environment where the test is being run. In general, it's not good if the test makes changes that are visible outside the test, because that could impact subsequent tests. I mention it because in this case I think the test will interact with the caching code--the get_QUASR_data function is expected to create a cache file the first time it runs for a particular ID, and then read from the cache file on subsequent runs for the same ID.
For this test, that means the very first time you run it you'll be creating the cache file, but on subsequent runs (and every time, for the code I've highlighted here) you'll be reading from the cache file.
- I would recommend checking for, and cleaning up, the cache file at the end of the test.
- If you care about the caching behavior, you might want to check that it's happening between these two calls.
I think you could handle deleting the cache file by using the FILE_PATH = THIS_DIR / f'QUASR_cache/serial{id_str}.json' logic you have in the actual implementing function (i.e. get_QUASR_data()) and then doing FILE_PATH.is_file() to confirm it exists and doing FILE_PATH.unlink() if it does (you'd still need to clear the QUASR_cache directory as well, I think there's a rmdir() method in pathlib).
Asserting not FILE_PATH.is_file() before the first call to get_QUASR_data(), and FILE_PATH.is_file() after it, would check that the cache file is getting created. If you'd like to check that the cache is actually getting used, you could also do mock_get.assert_not_called() before the first get_QUASR_data(), then mock_get.assert_called_once() after the SECOND get_QUASR_data() (to make sure it actually used the cache). Equivalently, you can also assert against the mock_get.call_count variable.
I could also see mocking builtins.print and inspecting the values it was called with, if you want to check anything about the function's reporting, but I'm not sure whether I'd bother realistically since "does the thing print this" is unlikely to be mission-critical functionality 😄 The only reason I'd do it is if you wanted to inspect what it was printing to confirm that it says it's using the cache.
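A compact sketch of the cache checks and cleanup described above; the cache location and the use of the patched mock_get from the surrounding test are assumptions based on this thread:

```python
# Sketch of the cache assertions and cleanup suggested above.  cache_file must
# point at the same location get_QUASR_data() writes to (assumed here, per this
# thread, to be QUASR_cache/serial<id>.json next to zoo.py); mock_get is the
# patched requests.get from the surrounding test.
from pathlib import Path

import simsopt.configs.zoo as zoo

cache_file = Path(zoo.__file__).parent / "QUASR_cache" / f"serial{952:07d}.json"

mock_get.assert_not_called()
assert not cache_file.is_file()                      # no cache before the first call

zoo.get_QUASR_data(952, return_style='quasr-style')  # first call: mocked download, writes the cache
assert cache_file.is_file()

zoo.get_QUASR_data(952, return_style='quasr-style')  # second call: should be served from the cache
assert mock_get.call_count == 1

cache_file.unlink()                                  # leave no trace for subsequent tests
cache_file.parent.rmdir()
```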
You might be able to avoid having to do the cleanup by using a TemporaryDirectory context in this test; temporary directories are deleted when you exit the context. See https://adamj.eu/tech/2024/12/30/python-temporary-files-directories-unittest/ for an example (except you'd want to use https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory instead of NamedTemporaryFile).
However, using the temporary directory would require changing the definition of THIS_DIR in configs/zoo.py (which defines get_QUASR_data()), which might be complicated, so you're probably best off just deleting the cache file that you expect to have created in the test.
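If one did want to go the temporary-directory route, here is a hedged sketch that patches THIS_DIR rather than changing its definition; this assumes THIS_DIR is a plain module-level Path in simsopt.configs.zoo, which may not hold exactly:

```python
# Hedged sketch: redirect the cache into a throwaway directory by patching the
# module-level THIS_DIR used by get_QUASR_data().  Assumes THIS_DIR is a plain
# Path attribute of simsopt.configs.zoo; combine with the requests.get mock so
# no real download happens.
import tempfile
from pathlib import Path
from unittest.mock import patch

from simsopt.configs.zoo import get_QUASR_data

with tempfile.TemporaryDirectory() as tmpdir:
    with patch("simsopt.configs.zoo.THIS_DIR", Path(tmpdir)):
        get_QUASR_data(952, return_style='quasr-style')
# tmpdir, and any cache written inside it, is deleted on exiting the context.
```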
It might be convenient to let the user configure where the parent directory for the downloads should be located (for the functions in zoo.py)--so that you can use that instead of the THIS_DIR which I think (?) is relative to where the zoo.py file is located. But that's obviously well outside the scope of this PR.
        # mock os.makedir only here so it throws an error
        with patch("simsopt.configs.zoo.os.makedirs") as mock_makedirs:
            mock_makedirs.side_effect = Exception("Failed to create directory")
            _, _ = get_QUASR_data(953, return_style='quasr-style')
And here of course you've hard-coded 953, because if you used 952 again you'd hit the cache and makedir wouldn't be called. Subtle! 😄
If you go with the defined serial_number variable I suggested earlier, you could use serial_number + 1 or something.
Alternatively, if this were a separate test, you would theoretically have deleted the cached file at the end of whatever test created it.
src/simsopt/configs/zoo.py
Outdated
    id_str = f"{ID:07d}"  # string to 7 digits
    url = f'https://quasr.flatironinstitute.org/simsopt_serials/{id_str[0:4]}/serial{id_str}.json'

    FILE_PATH = THIS_DIR / f'QUASR_cache/serial{id_str}.json'
I'm unsure if this will still do the right thing on environments where / is not the path separator (e.g. windows). If you care, could change to FILE_PATH = THIS_DIR / 'QUASR_cache' / f"serial{id_str}.json" (and push it onto pathlib).
`get_quasr_data` always attempted to cache in the install directory of simsopt. Now it only does that if the directory is writable; otherwise it caches in the pwd. There is also an option to skip the caching altogether.
It is getting there! I changed the caching to test whether the directory (relative to zoo.py, in the install directory) is writable; if not, the cache is made in the cwd. (In some types of installation this might be the case, e.g. simsopt is installed in the base system Python and a user imports it.) @andrewgiuliani, can you address the last few issues raised by @jsoules? We should not let perfect be the enemy of good enough, and this is really nice functionality to have, so let's merge it soon!
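A minimal sketch of that fallback logic, assuming os.access is used for the writability test (the actual code in the PR may differ):

```python
# Minimal sketch of the cache-directory fallback described above; the os.access
# writability test is an assumption, not necessarily the PR's exact code.
import os
from pathlib import Path

THIS_DIR = Path(__file__).parent               # directory containing zoo.py
if os.access(THIS_DIR, os.W_OK):
    cache_dir = THIS_DIR / "QUASR_cache"       # install directory is writable
else:
    cache_dir = Path.cwd() / "QUASR_cache"     # fall back to the current working directory
```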
            success = True

    if use_cache:
        with open(FILE_PATH, 'wb') as f:
This will fail if the directory doesn't exist yet.
Suggested change:
-        with open(FILE_PATH, 'wb') as f:
+        os.makedirs(FILE_PATH.parent, exist_ok=True)
+        with open(FILE_PATH, 'wb') as f:
It would be amazing to have all of the QUASR configurations accessible with just one command.
I implemented a simple interface using the 'requests' library that downloads any QUASR field.
I tried to make it so that it picks a random configuration, but apparently not all integers between 0 and 300,000 correspond to an entry in the database.
@andrewgiuliani, is there a list of which entries exist, so that a robust random picker can be implemented?
Or even better, the user could directly query the database for a list of constraints (NFP, A, iota, coil length, etc.) and get a random config that fits their needs.
Something like this must already be implemented on the website; one would only have to adapt it to Python. I am happy to do this but would need more info on the database itself. Would this be of interest, @andrewgiuliani?
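For context, a minimal sketch of the requests-based download described here, using the URL pattern quoted earlier in the thread; the function name is hypothetical, and the PR's actual entry point is get_QUASR_data in src/simsopt/configs/zoo.py, which decodes the payload with simsopt's GSONDecoder:

```python
# Illustrative sketch only: fetch a QUASR configuration JSON by ID, using the
# URL pattern quoted earlier in this thread.  The function name is hypothetical;
# the PR's real entry point is get_QUASR_data in src/simsopt/configs/zoo.py.
import requests


def download_quasr_config(ID: int) -> bytes:
    id_str = f"{ID:07d}"
    url = f"https://quasr.flatironinstitute.org/simsopt_serials/{id_str[0:4]}/serial{id_str}.json"
    r = requests.get(url)
    if r.status_code != 200:
        raise ValueError(f"Could not download configuration {id_str} (status code {r.status_code})")
    return r.content
```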