
Conversation

GeigerJ2
Contributor

@GeigerJ2 GeigerJ2 commented Aug 28, 2025

Problem

Archive creation started failing for large datasets with an OperationalError, because a limit of 65535 parameters per query is now enforced. The limit comes from a downstream upgrade of psycopg (the database driver used by SQLAlchemy for PostgreSQL) and surfaced when going from aiida v2.6.4 to v2.7.0 (further discussion in issue #6545).

Solution

This PR fixes the issue by batching the UUIDs passed to QueryBuilder query filters using the batch_iter function1. This is the same approach taken in #6907, where a similar OperationalError occurred for large archive imports with the sqlite backend. The batch size is exposed via the filter_size argument (also made available on the verdi archive create CLI endpoint). Its default value is set to 999, which seems a bit low to me, but follows the approach for archive imports (#6907).
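
To illustrate the pattern, here is a minimal sketch (not the code in this PR): the import location of batch_iter and the assumption that it yields (count, chunk) pairs follow its use in the archive import code, and the helper name is made up for illustration.

from aiida.orm import Node, QueryBuilder
from aiida.tools.archive.common import batch_iter  # assumed import location


def query_pks_for_uuids(uuids: list[str], filter_size: int = 999) -> set[int]:
    """Collect node PKs for an arbitrarily long list of UUIDs without placing
    more than `filter_size` parameters into a single query filter."""
    pks: set[int] = set()
    # batch_iter is assumed to yield (chunk_length, chunk) pairs
    for _, chunk in batch_iter(uuids, filter_size):
        qb = QueryBuilder()
        qb.append(Node, project=['id'], filters={'uuid': {'in': chunk}})
        pks.update(qb.all(flat=True))
    return pks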

The helper functions of create_archive to which batching is applied include _collect_required_entities, _stream_repo_files, _check_unsealed_nodes, and _check_node_licenses. In addition, batching also had to be applied to some graph-traversal functions in src/aiida/tools/graph/age_rules.py and src/aiida/tools/graph/graph_traversers.py. I'm not sure whether these changes have any further (performance) implications. Maybe archive creation could be benchmarked with a large real-world archive (@mbercx?), comparing v2.6.4 against this PR?

An integration test is added that imports an archive with 100k Int nodes and then exports those 100k nodes again. On-the-fly creation of 100k Int nodes takes a considerable amount of time2, so I added an archive file with 100k nodes to the new tests/data folder (which will also be used in the related PR #6991). Importing and exporting such a large archive still takes some time, so I had to increase the pytest timeout for that test... happy to learn about any better approaches (e.g. creating the DB table in the storage backend of the test profile using raw SQL 👀).

Finally, I don't think this PR is an ideal fix, as there is no guarantee that batching avoids the parameter limit if queries are combined or nested. It is rather a band-aid on a larger underlying problem: the construction of excessively large SQL statements with many IN clauses. I also tried to solve it at that level, in the internal QB implementation, in PR #6998 (see the discussion of the different approaches there). However, due to the way SQL expressions are constructed in the QB implementation, this would require a larger refactor. Given how fundamental the QB is and the implications of changes to that part of the code, I think it is (hopefully) acceptable for now to apply the fix at the level of archive creation and release it with a v2.7.2 patch, since having this feature already broken for medium-sized archives on our performance backend is critical.

Footnotes

  1. To avoid large lists in these kinds of queries: filters={'id': {'in': <large-list>}}.

  2. Maybe I'm doing something wrong here?

    _ = [orm.Int(i).store() for i in range(100_000)]
    


codecov bot commented Aug 28, 2025

Codecov Report

❌ Patch coverage is 1.06383% with 93 lines in your changes missing coverage. Please review.
✅ Project coverage is 24.88%. Comparing base (669f249) to head (10c17aa).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/aiida/tools/archive/create.py | 0.00% | 66 Missing ⚠️ |
| src/aiida/tools/graph/age_rules.py | 0.00% | 13 Missing ⚠️ |
| src/aiida/tools/graph/graph_traversers.py | 0.00% | 7 Missing ⚠️ |
| src/aiida/tools/archive/common.py | 0.00% | 6 Missing ⚠️ |
| src/aiida/tools/archive/imports.py | 0.00% | 1 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (669f249) and HEAD (10c17aa). Click for more details.

HEAD has 1 upload less than BASE

| Flag | BASE (669f249) | HEAD (10c17aa) |
|---|---|---|
|  | 2 | 1 |
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #6993       +/-   ##
===========================================
- Coverage   79.03%   24.88%   -54.14%     
===========================================
  Files         566      566               
  Lines       43675    43718       +43     
===========================================
- Hits        34514    10875    -23639     
- Misses       9161    32843    +23682     

☔ View full report in Codecov by Sentry.

Comment on lines 57 to 58
filter_size: int = 10_000,
batch_size: int = 1000,
Contributor Author


One could also use the QueryParams dataclass here that combines filter_size and batch_size, which is used for archive imports:

@dataclass
class QueryParams:
    """Parameters for executing backend queries."""

    batch_size: int
    """Batch size for streaming database rows."""

    filter_size: int
    """Maximum number of parameters allowed in a single query filter."""

However, I don't see how that really adds value here, as it just makes argument passing slightly more convoluted.

Still, I have now moved the definition to ./src/aiida/repository/common.py and, for now, construct the class in the top-level create_archive function, while leaving the private helper functions _collect_[all|required]_entities unchanged, so that they still accept batch_size and the additional filter_size:

def _collect_required_entities(
    querybuilder: QbType,
    entity_ids: dict[EntityTypes, set[int]],
    traversal_rules: dict[str, bool],
    include_authinfos: bool,
    include_comments: bool,
    include_logs: bool,
    backend: StorageBackend,
    batch_size: int,
    filter_size: int,
) -> tuple[list[tuple[int, int]], set[LinkQuadruple]]:

import_archive and create_archive function signatures:

def import_archive(
    path: Union[str, Path],
    *,
    archive_format: Optional[ArchiveFormatAbstract] = None,
    filter_size: int = 999,
    batch_size: int = 1000,
    import_new_extras: bool = True,
    merge_extras: MergeExtrasType = ('k', 'n', 'l'),
    merge_comments: MergeCommentsType = 'leave',
    include_authinfos: bool = False,
    create_group: bool = True,
    group: Optional[orm.Group] = None,
    test_run: bool = False,
    backend: Optional[StorageBackend] = None,
) -> Optional[int]:
def create_archive(
    entities: Optional[Iterable[Union[orm.Computer, orm.Node, orm.Group, orm.User]]],
    filename: Union[None, str, Path] = None,
    *,
    archive_format: Optional[ArchiveFormatAbstract] = None,
    overwrite: bool = False,
    include_comments: bool = True,
    include_logs: bool = True,
    include_authinfos: bool = False,
    allowed_licenses: Optional[Union[list, Callable]] = None,
    forbidden_licenses: Optional[Union[list, Callable]] = None,
    strip_checkpoints: bool = True,
    filter_size: int = 10_000,
    batch_size: int = 1000,
    compression: int = 6,
    test_run: bool = False,
    backend: Optional[StorageBackend] = None,
    **traversal_rules: bool,
) -> Path:
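
For context, a minimal self-contained sketch of the plumbing described above; all names other than QueryParams, batch_size, and filter_size are illustrative stand-ins, not the actual implementation.

from dataclasses import dataclass


@dataclass
class QueryParams:
    """Parameters for executing backend queries."""

    batch_size: int
    filter_size: int


def _collect_entities_sketch(batch_size: int, filter_size: int) -> None:
    """Stand-in for the private helpers, which keep receiving plain ints."""
    print(f'batch_size={batch_size}, filter_size={filter_size}')


def create_archive_sketch(*, filter_size: int = 10_000, batch_size: int = 1000) -> None:
    """Stand-in for the public entry point, which bundles the two values."""
    query_params = QueryParams(batch_size=batch_size, filter_size=filter_size)
    _collect_entities_sketch(query_params.batch_size, query_params.filter_size)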

@click.option(
'-f',
'--filter-size',
default=999,
Contributor Author


I'd have expected that one could set this higher (I went for 10k in my first implementation), but I am now using the same default as for archive imports (the original PR #6889 already used this argument to avoid parameter limits for the sqlite backend).
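
For reference, once this option is in, a user hitting the limit could raise or lower it from the CLI, e.g. `verdi archive create --all --filter-size 5000 my_export.aiida` (hypothetical invocation: --all and the positional output file are standard verdi archive create usage, and the value 5000 is arbitrary).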

@@ -27,6 +28,16 @@
EntityTypes.COMMENT: Comment,
}

@dataclass
class QueryParams:
Contributor Author


Moved here so it can also be used in ./src/aiida/repository/create.py. This can be reverted, but I don't think it's crucial.


# extract ids/uuid from initial entities
type_check(entities, Iterable, allow_none=True)
if entities is None:
group_nodes, link_data = _collect_all_entities(
querybuilder, entity_ids, include_authinfos, include_comments, include_logs, batch_size
querybuilder, entity_ids, include_authinfos, include_comments, include_logs, query_params.batch_size
Contributor Author


No filters are used in these QB calls, hence only batch_size is required.

Comment on lines +255 to +256
query_params.batch_size,
query_params.filter_size,
Contributor Author


I'm directly accessing the elements of query_params here, as I don't see any benefit in passing the whole query_params object into the helper functions over just passing batch_size and filter_size directly.

)
qres = self._querybuilder.dict()
else:
# PRCOMMENT: Maybe move `batch_iter` elsewhere? Import feels weird here?
Contributor Author


!

# Batch the query to avoid parameter limits
existing_pks = set()
operational_list = list(operational_set)
batch_size = 10000 # Stay well under 65535 parameter limit
Contributor Author


Relevant call stack here:

  File "/home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/archive/create.py", line 247, in create_archive
    group_nodes, link_data = _collect_required_entities(
  File "/home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/archive/create.py", line 544, in _collect_required_entities
    traverse_output = get_nodes_export(
  File "/home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/graph/graph_traversers.py", line 105, in get_nodes_export
    traverse_output = traverse_graph(
  File "/home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/graph/graph_traversers.py", line 245, in traverse_graph
    traceback.print_stack()
> /home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/graph/graph_traversers.py(248)traverse_graph()
-> existing_pks = set()

query_nodes.append(orm.Node, project=['id'], filters={'id': {'in': operational_set}})
existing_pks = set(query_nodes.all(flat=True))
existing_pks = set()
filter_size = 10_000 # Stay well under 65535 parameter limit
Contributor Author


Relevant call stack here:

  File "/home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/archive/create.py", line 247, in create_archive
    group_nodes, link_data = _collect_required_entities(
  File "/home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/archive/create.py", line 544, in _collect_required_entities
    traverse_output = get_nodes_export(
  File "/home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/graph/graph_traversers.py", line 105, in get_nodes_export
    traverse_output = traverse_graph(
  File "/home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/graph/graph_traversers.py", line 245, in traverse_graph
    traceback.print_stack()
> /home/geiger_j/aiida_projects/aiida-dev/git-repos/aiida-core/src/aiida/tools/graph/graph_traversers.py(248)traverse_graph()
-> existing_pks = set()

Again, probably OK to have it hard-coded.
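
For clarity, a self-contained sketch of the batched existence check that these hard-coded values feed into (variable names follow the diff above; the surrounding loop is my reconstruction, not necessarily the exact code in this PR):

from aiida import orm


def existing_node_pks(operational_set: set[int], filter_size: int = 10_000) -> set[int]:
    """Return the subset of the given PKs that exist in the database, never
    putting more than `filter_size` ids into a single 'in' filter."""
    operational_list = list(operational_set)
    existing_pks: set[int] = set()
    for start in range(0, len(operational_list), filter_size):
        chunk = operational_list[start : start + filter_size]
        query_nodes = orm.QueryBuilder()
        query_nodes.append(orm.Node, project=['id'], filters={'id': {'in': chunk}})
        existing_pks.update(query_nodes.all(flat=True))
    return existing_pks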

Contributor Author


Creating and storing 50k / 100k nodes on the fly takes a considerable amount of time (I guess because for every node that is stored, a DB connection is opened, the data committed, and the connection closed again). Hence, I added the tests/data directory with two pre-created archives for testing large data import / export, as we were running into operational errors with both backends (the PR that fixed it for sqlite was #6889, and now it happened with psql), so this is something we should test regularly.

Does anybody have a better approach, or is this fine?

@GeigerJ2 GeigerJ2 marked this pull request as ready for review September 3, 2025 10:46

@pytest.mark.usefixtures('aiida_profile_clean')
@pytest.mark.timeout(600) # 10 minutes for the full test
def test_large_archive_import_export(tmp_path):
Contributor Author


Make nightly test?!

Collaborator

@danielhollas danielhollas Sep 3, 2025


Oh yeah, we definitely don't want that during the normal test suite :D

EDIT: How long does it run normally? How long does it run on current main?

@GeigerJ2
Contributor Author

GeigerJ2 commented Sep 4, 2025

Note to self:

❯ pip install aiida-core==2.6.4 -q
❯ pip freeze | grep -e aiida-core -ie sqlalchemy -e psycopg
aiida-core==2.6.4
psycopg2-binary==2.9.10
sphinx_sqlalchemy==0.2.0
SQLAlchemy==2.0.43
❯ pip uninstall SQLAlchemy aiida-core psycopg2-binary psycopg-binary psycopg -y
...
❯ pip install aiida-core==2.7.1 -q
❯ pip freeze | grep -e aiida-core -ie sqlalchemy -e psycopg
aiida-core==2.7.1
psycopg==3.2.9
psycopg-binary==3.2.9
sphinx_sqlalchemy==0.2.0
SQLAlchemy==2.0.43
