Fix PSQL OperationalError on archive creation #6993
Conversation
Codecov Report
@@           Coverage Diff            @@
##             main    #6993       +/-   ##
===========================================
- Coverage   79.03%   24.88%   -54.14%
===========================================
  Files         566      566
  Lines       43675    43718      +43
===========================================
- Hits        34514    10875    -23639
- Misses       9161    32843    +23682
Force-pushed from 0fabfb0 to c1cdb81.
src/aiida/tools/archive/create.py (Outdated)
    filter_size: int = 10_000,
    batch_size: int = 1000,
One could also use the QueryParams dataclass here that combines filter_size and batch_size, which is used for archive imports:
aiida-core/src/aiida/tools/archive/imports.py
Lines 51 to 58 in 428a94e
@dataclass
class QueryParams:
    """Parameters for executing backend queries."""

    batch_size: int
    """Batch size for streaming database rows."""

    filter_size: int
    """Maximum number of parameters allowed in a single query filter."""
However, I don't see how that really adds any value here, as it just makes argument passing slightly more convoluted:
aiida-core/src/aiida/tools/archive/imports.py
Line 294 in 428a94e
Still, I now moved the definition to ./src/aiida/repository/common.py and construct the class for now in the top-level create_archive method, but leave the private helper functions _collect_[all|required]_entities unchanged, so that they still accept batch_size and the additional filter_size:
def _collect_required_entities(
    querybuilder: QbType,
    entity_ids: dict[EntityTypes, set[int]],
    traversal_rules: dict[str, bool],
    include_authinfos: bool,
    include_comments: bool,
    include_logs: bool,
    backend: StorageBackend,
    batch_size: int,
    filter_size: int,
) -> tuple[list[tuple[int, int]], set[LinkQuadruple]]:
The import_archive and create_archive function signatures:
def import_archive(
    path: Union[str, Path],
    *,
    archive_format: Optional[ArchiveFormatAbstract] = None,
    filter_size: int = 999,
    batch_size: int = 1000,
    import_new_extras: bool = True,
    merge_extras: MergeExtrasType = ('k', 'n', 'l'),
    merge_comments: MergeCommentsType = 'leave',
    include_authinfos: bool = False,
    create_group: bool = True,
    group: Optional[orm.Group] = None,
    test_run: bool = False,
    backend: Optional[StorageBackend] = None,
) -> Optional[int]:
def create_archive(
    entities: Optional[Iterable[Union[orm.Computer, orm.Node, orm.Group, orm.User]]],
    filename: Union[None, str, Path] = None,
    *,
    archive_format: Optional[ArchiveFormatAbstract] = None,
    overwrite: bool = False,
    include_comments: bool = True,
    include_logs: bool = True,
    include_authinfos: bool = False,
    allowed_licenses: Optional[Union[list, Callable]] = None,
    forbidden_licenses: Optional[Union[list, Callable]] = None,
    strip_checkpoints: bool = True,
    filter_size: int = 10_000,
    batch_size: int = 1000,
    compression: int = 6,
    test_run: bool = False,
    backend: Optional[StorageBackend] = None,
    **traversal_rules: bool,
) -> Path:
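For clarity, a condensed, self-contained sketch of the wiring described above; the stub functions are illustrative stand-ins, not the actual implementation (the real helpers take many more arguments):

from dataclasses import dataclass


@dataclass
class QueryParams:
    """Parameters for executing backend queries (mirrors the dataclass shown above)."""

    batch_size: int
    filter_size: int


def _collect_required_entities_stub(batch_size: int, filter_size: int) -> None:
    """Stand-in for the private helpers, which keep plain integer arguments."""
    print(f'batch_size={batch_size}, filter_size={filter_size}')


def create_archive_stub(*, filter_size: int, batch_size: int) -> None:
    """The top-level entry point constructs the dataclass once ..."""
    query_params = QueryParams(batch_size=batch_size, filter_size=filter_size)
    # ... and unpacks it again when calling the helpers.
    _collect_required_entities_stub(
        batch_size=query_params.batch_size,
        filter_size=query_params.filter_size,
    )


create_archive_stub(filter_size=10_000, batch_size=1_000)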
@click.option(
    '-f',
    '--filter-size',
    default=999,
I'd have expected that one could set this higher (I went for 10k in my first implementation), but am now using the same default as for archive imports (the original PR #6889 did use this argument to avoid parameter limits for the sqlite backend).
@@ -27,6 +28,16 @@
    EntityTypes.COMMENT: Comment,
}


@dataclass
class QueryParams:
Moved here for usage also in ./src/aiida/repository/create.py. Can be reverted, but I don't think it's crucial.
    # extract ids/uuid from initial entities
    type_check(entities, Iterable, allow_none=True)
    if entities is None:
        group_nodes, link_data = _collect_all_entities(
-           querybuilder, entity_ids, include_authinfos, include_comments, include_logs, batch_size
+           querybuilder, entity_ids, include_authinfos, include_comments, include_logs, query_params.batch_size
No filters are used in the QB calls here, hence this only requires batch_size.
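For context, this is the kind of call meant: streaming rows with a batch size but no id-based filter, so no query-parameter limit is involved. A generic example (not the PR's code), assuming a loaded profile:

from aiida import load_profile, orm

load_profile()

# Stream all node ids in batches; there is no `in` filter, so only batch_size matters.
qb = orm.QueryBuilder().append(orm.Node, project=['id'])
for (node_id,) in qb.iterall(batch_size=1000):
    pass  # process each id as it is streamed from the database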
    query_params.batch_size,
    query_params.filter_size,
Directly accessing the elements of query_params here, as I don't see any benefit in using query_params.[batch_size|filter_size] inside the helper functions, or in passing the whole query_params object, over just using the arguments directly.
)
qres = self._querybuilder.dict()
else:
# PRCOMMENT: Maybe move `batch_iter` elsewhere? Import feels weird here?
!
# Batch the query to avoid parameter limits
existing_pks = set()
operational_list = list(operational_set)
batch_size = 10000  # Stay well under 65535 parameter limit
Relevant call stack here:
File "/home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/archive/create.py", line 247, in create_archive
group_nodes, link_data = _collect_required_entities(
File "/home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/archive/create.py", line 544, in _collect_required_entities
traverse_output = get_nodes_export(
File "/home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/graph/graph_traversers.py", line 105, in get_nodes_export
traverse_output = traverse_graph(
File "/home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/graph/graph_traversers.py", line 245, in traverse_graph
traceback.print_stack()
> /home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/graph/graph_traversers.py(248)traverse_graph()
-> existing_pks = set()
- query_nodes.append(orm.Node, project=['id'], filters={'id': {'in': operational_set}})
- existing_pks = set(query_nodes.all(flat=True))
+ existing_pks = set()
+ filter_size = 10_000  # Stay well under 65535 parameter limit
Relevant call stack: the same as in the comment above.
Again, probably OK to have it hard-coded.
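For reference, a minimal, self-contained sketch of the batching pattern applied here; a plain slicing loop stands in for batch_iter, and the function name is illustrative, not the PR's code:

from aiida import load_profile, orm

load_profile()


def existing_pks_batched(operational_set: set[int], filter_size: int = 10_000) -> set[int]:
    """Return the subset of pks that exist, batching the `in` filter so that each
    statement stays under the backend's query-parameter limit (sketch only)."""
    operational_list = list(operational_set)
    existing_pks: set[int] = set()
    for start in range(0, len(operational_list), filter_size):
        chunk = operational_list[start:start + filter_size]
        query_nodes = orm.QueryBuilder()
        query_nodes.append(orm.Node, project=['id'], filters={'id': {'in': chunk}})
        existing_pks.update(query_nodes.all(flat=True))
    return existing_pks

Each chunk keeps the number of bound parameters per statement well below the 65535 limit enforced by psycopg.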
Creating and storing 50k / 100k nodes on-the-fly takes a considerable amount of time (I guess because for every node that is stored, a DB connection is opened, the data committed, and the connection closed again). Hence, I added the tests/data directory and two pre-created archives for testing large data import / export, as we were having operational errors with both backends (PR that fixed it for sqlite: #6889, and now with psql, so it's something we should regularly test).
Does anybody have a better approach, or is this fine?
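Roughly, such a test might look like the following sketch; the archive filename is a placeholder and this is not the actual test from the diff below:

import pytest

from aiida import orm
from aiida.tools.archive import create_archive, import_archive


@pytest.mark.usefixtures('aiida_profile_clean')
@pytest.mark.timeout(600)  # importing and re-exporting this many nodes is slow
def test_large_archive_import_export(tmp_path):
    """Import a pre-created large archive, then export the whole profile again."""
    # Placeholder path for a pre-created archive in the new tests/data directory.
    import_archive('tests/data/large_archive_100k_nodes.aiida')

    # Sanity check that the imported nodes are present.
    assert orm.QueryBuilder().append(orm.Node).count() > 0

    # Re-export everything (entities=None); this used to hit the 65535-parameter limit on psql.
    create_archive(None, filename=tmp_path / 'exported.aiida', filter_size=999)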
@pytest.mark.usefixtures('aiida_profile_clean')
@pytest.mark.timeout(600)  # 10 minutes for the full test
def test_large_archive_import_export(tmp_path):
Make nightly test?!
Oh yeah, we definitely don't want that during the normal test suite :D
EDIT: How long does it run normally? How long does it run on current main?
Note to self:
This reverts commit 46a7b83.
Force-pushed from 9f07976 to 10c17aa.
Problem

Archive creation started failing for large datasets with OperationalError, due to the enforcement of a parameter limit (65535 parameters per query) by a downstream upgrade of psycopg (the database driver used by SQLAlchemy for PostgreSQL) that surfaced when going from AiiDA v2.6.4 to v2.7.0 (further discussion in issue #6545).

Solution

This PR fixes the issue by batching the UUIDs passed to QueryBuilder query filters using the batch_iter function¹. This is the same approach that was taken in #6907, where a similar OperationalError occurred for large archive imports with the sqlite backend. The batch size is exposed via the filter_size argument (also made available to the verdi archive create CLI endpoint). The default value of this argument was set to 999, which seems a bit low to me, but follows the approach for archive imports (#6907).
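For illustration, a minimal usage sketch of the new filter_size argument; the group label and filename are placeholders, not taken from the PR:

from aiida import load_profile, orm
from aiida.tools.archive import create_archive

load_profile()

# Placeholder group; any iterable of computers, nodes, groups or users works.
group = orm.load_group('my_large_group')

create_archive(
    [group],
    filename='large_export.aiida',
    filter_size=999,   # maximum number of ids per query filter (the new default)
    batch_size=1000,   # rows streamed per query batch
)

On the command line, the corresponding knob is the --filter-size option added to verdi archive create in this PR.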
The helper functions of create_archive to which batching is applied are, among others, _collect_required_entities, _stream_repo_files, _check_unsealed_nodes, and _check_node_licenses. In addition, batching also had to be applied to some functions for graph traversal operations in src/aiida/tools/graph/age_rules.py and src/aiida/tools/graph/graph_traversers.py. I'm not sure if these incur any further (performance) implications. Maybe archive creation can be benchmarked with a large real-world archive (@mbercx?), using v2.6.4 vs this PR?

An integration test for the import of an archive with 100k Int nodes, and the export of these 100k nodes, is added. On-the-fly creation of 100k Int nodes takes a considerable amount of time², so I added an archive file with 100k nodes to the new tests/data folder (which will also be used in the related PR #6991). Import / export of such a large archive still takes some time, so I had to increase the pytest timeout for that test... happy to learn about any better approaches (create the DB table in the storage backend of the test profile using raw SQL 👀).

Finally, I don't think this PR is an ideal fix, as there is no guarantee that batching avoids reaching the parameter limit if queries are combined or nested. Instead, it's rather a band-aid on a larger underlying problem: the construction of excessively large SQL statements with many IN clauses. I also tried to solve it at that level, in the internal QB implementation, in PR #6998 (see the discussion on different approaches there). However, due to the way the SQL expressions are constructed in the QB implementation, this would require a larger refactor. Given how fundamental the QB is and the implications of changes to this part of the code, I think it is (hopefully) acceptable for now to apply the fix in this PR at the level of archive creation and get it out with a v2.7.2 patch release, since this feature being broken already for medium-sized archives on our performance backend is a critical issue.

Footnotes
1. To avoid large lists in these kinds of queries: filters={'id': {'in': <large-list>}}.
2. Maybe I'm doing something wrong here?