Conversation

@dsblank (Member) commented Nov 22, 2024

This PR does four things:

  1. Unrolls the recursive call of filter.apply(), and splits out single checks into filter.apply_to_one() (see the sketch below)
  2. Uses data attributes (person.gender) rather than accessor functions (person.get_gender()) where possible
  3. Adds an optimizer based on rule.selected_handles sets (sketched after the benchmarks)
  4. Adds type hints to make sure the right objects are passed into methods
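
A minimal sketch of the apply()/apply_to_one() split in item 1 (not the actual Gramps implementation; the rule interface here is assumed for illustration):

```python
class GenericFilter:
    """Sketch only: no recursion between apply() and the rules."""

    def __init__(self, rules):
        self.rules = rules

    def apply_to_one(self, db, obj):
        # The single-object check that used to be buried inside a
        # recursive apply(): True when every rule matches this object.
        return all(rule.apply_to_one(db, obj) for rule in self.rules)

    def apply(self, db, objects):
        # Plain iteration over the candidates; no recursion.
        return [obj for obj in objects if self.apply_to_one(db, obj)]
```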

Final comparison, finding those related to the home person in a 40k-person Family Tree, between Gramps 5.2 and master + this PR (Gramps 6.0), in seconds (smaller is better):

Version | Prepare Time | Apply Time | Total Time
--------|-------------:|-----------:|----------:
Gramps 5.2 | 4.5 | 27.7 | 32.2
Gramps 6.0 | 8.0 | 0.5 | 8.5

The above uses the optimizer. Here is a test finding all people with a tag (5 people match):

Version | Prepare Time | Apply Time | Total Time
--------|-------------:|-----------:|----------:
Gramps 5.2 | 0.0 | 5.0 | 5.0
Gramps 6.0 | 0.0 | 1.6 | 1.6

Recall that converting from JSON to objects is a little slower than converting from array BLOBs to objects, so this is a large improvement.
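
For context, here is a hedged sketch of the optimizer idea in item 3; only rule.selected_handles comes from the PR, and the function name and set-intersection shape are assumptions for illustration:

```python
def candidate_handles(rules, all_handles):
    """Sketch: intersect the selected_handles sets of AND-ed rules."""
    selected = [set(rule.selected_handles) for rule in rules
                if getattr(rule, "selected_handles", None) is not None]
    if not selected:
        return all_handles        # no rule can narrow the search space
    candidates = selected[0]
    for handles in selected[1:]:
        candidates &= handles     # AND semantics: intersect the sets
    return candidates
```

With a narrow candidate set, apply() only needs to visit a handful of handles instead of scanning the whole tree, which would explain the drop in Apply Time above.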

@dsblank self-assigned this Nov 22, 2024
@dsblank (Member, Author) commented Nov 30, 2024

Thanks @stevenyoungs for looking this over and the feedback. I'm hoping that @Nick-Hall is open to these three ideas (fixes, using JSON data, and the optimizer) because all three make the filter system so much faster while keeping the API almost the same. Many places in Gramps use the raw data for speed, including tools, views, displayers, and importers. It would be a shame if filters couldn't do the same.

@stevenyoungs (Contributor) commented:

> Thanks @stevenyoungs for looking this over and the feedback. I'm hoping that @Nick-Hall is open to these three ideas (fixes, using JSON data, and the optimizer) because all three make the filter system so much faster while keeping the API almost the same. Many places in Gramps use the raw data for speed, including tools, views, displayers, and importers. It would be a shame if filters couldn't do the same.

No problem. I'm also keen to see the benefits of more standard storage and see what opportunities it unlocks. The recent discussion of GQL is an interesting option.
I've learnt a lot from following your changes and hopefully my comments have had a minor benefit to the overall quality of the PR.

@Nick-Hall (Member) commented:

> hopefully my comments have had a minor benefit to the overall quality of the PR.

Your feedback is very useful. It always helps to have another person review the code. I very much appreciate your contributions.

@Nick-Hall (Member) commented:

> I'm hoping that @Nick-Hall is open to these three ideas (fixes, using JSON data, and the optimizer) because all three make the filter system so much faster while keeping the API almost the same.

With the new JSON format, I think that we can regard the raw data as providing lightweight objects. The format mirrors the Gramps objects and is therefore much easier to understand and more resilient to updates.

I've been doing some research today to see if we can make the dict format more object-like. For example, data.gender would perhaps be neater than data["gender"]. The dataclasses introduced in Python 3.7 look interesting, but I don't think that they are of much help to us. Using a JSON parser written in C may be an option in the future, but the ones I looked at deserialize to dict structures.

I am open to using JSON data in more core code, including filters. We should probably discuss where the raw format is acceptable. Third-party code should be careful when using the raw format. It may change between feature releases, not just major releases, but this has always been the case.

An optimizer seems like a good idea. I haven't reviewed your code yet though.

@dsblank (Member, Author) commented Dec 2, 2024

> I've been doing some research today to see if we can make the dict format more object-like. For example, data.gender would perhaps be neater than data["gender"].

How about just wrapping our raw data with DataDict(data):

```python
class DataDict(dict):
    """A dict whose keys can be read as attributes, wrapping nested values."""

    def __getattr__(self, key):
        try:
            value = self[key]
        except KeyError:
            # __getattr__ must raise AttributeError, not KeyError,
            # so that hasattr() and getattr() defaults behave correctly.
            raise AttributeError(key) from None
        if isinstance(value, dict):
            return DataDict(value)
        if isinstance(value, list):
            return DataList(value)
        return value


class DataList(list):
    """A list that wraps nested dicts and lists on item access."""

    def __getitem__(self, position):
        value = super().__getitem__(position)
        if isinstance(value, dict):
            return DataDict(value)
        if isinstance(value, list):
            return DataList(value)
        return value
```

It allows exactly what you describe, is low-cost, and doesn't require any more code than that above.
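
For instance, with some made-up person data:

```python
data = DataDict({"gender": 1,
                 "parent_family_list": [{"ref": "F0001"}]})
print(data.gender)                     # 1, via attribute access
print(data.parent_family_list[0].ref)  # nested values wrapped on the fly
```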

> I am open to using JSON data in more core code, including filters. We should probably discuss where the raw format is acceptable. Third-party code should be careful when using the raw format. It may change between feature releases, not just major releases, but this has always been the case.

Agreed. The rule of thumb has been: use the raw format where it is needed for speed or easy representation.

@stevenyoungs (Contributor) commented Dec 2, 2024

The only downside is that we leak the internal representation of the objects into the higher layers of code. But since those layers, in part, already use the raw format, this PR does not make a material difference.

@dsblank marked this pull request as ready for review December 3, 2024
@dsblank (Member, Author) commented Dec 3, 2024

> The only downside is that we leak the internal representation of the objects into the higher layers of code.

For many attributes of an object, it isn't so much "internal" now that we are switching to JSON. For example, this is the Person object API to get the parent family list:

```python
person.parent_family_list              # object attribute
person.get_parent_family_handle_list() # function call
```

With the above DataDict (plus some additional code) and the new JSON format, it could look like this (where person is really JSON data with a wrapper):

```python
person.parent_family_list              # JSON access
person.get_parent_family_handle_list() # function call, instantiates object
```

The new version would hide the fact that the function call creates the object (but only once). And because the JSON data mirrors the actual attribute names, the calling code doesn't have to change at all.
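
A self-contained sketch of that "instantiates the object, but just once" idea, building on the DataDict above (Person here is a stand-in, not the real Gramps class):

```python
class Person:
    """Stand-in for the heavyweight Gramps object."""

    def __init__(self, data):
        self.parent_family_list = data["parent_family_list"]

    def get_parent_family_handle_list(self):
        return list(self.parent_family_list)


class PersonData(DataDict):
    """Raw JSON data with lazy object creation behind the accessor."""

    def get_parent_family_handle_list(self):
        if "_person" not in self.__dict__:
            # Build the full object on first use only, then cache it.
            self.__dict__["_person"] = Person(self)
        return self.__dict__["_person"].get_parent_family_handle_list()
```

Reading person.parent_family_list stays a plain JSON access; only the accessor pays the object-construction cost, and only once.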

@dsblank changed the base branch from dsb/depickle to master December 8, 2024
@dsblank (Member, Author) commented Dec 11, 2024

Created PR #1824 to explore attribute access on the JSON dict.

@dsblank (Member, Author) commented Dec 14, 2024

I'll update this PR to use the new DataDict attribute access, which will actually revert most of the data access to what it is in master.

@dsblank (Member, Author) commented Dec 19, 2024

@Nick-Hall, I want to refactor this PR based on the DataDict from #1824. Should I change the base branch to #1824, or wait until it is merged?

@Nick-Hall (Member) commented:

I have reviewed and merged PR #1824. Normally we would wait to give people longer to comment, but in this case the PR was related to existing work, so I made an exception.

@dsblank (Member, Author) commented Dec 21, 2024

Thank you!

@dsblank requested review from stevenyoungs and Nick-Hall and removed the request for stevenyoungs December 22, 2024
@dsblank (Member, Author) commented Feb 2, 2025

Thanks @cdhorn for the reviews! I'm not going to change the if/else constructs, as I believe they are clearer and more explicit.

@cdhorn (Contributor) commented Feb 2, 2025

> Thanks @cdhorn for the reviews! I'm not going to change the if/else constructs, as I believe they are clearer and more explicit.

You're welcome. I always tend to fix these out of habit, since pylint and a SonarQube scan will call them out for refactoring.

It's nice seeing all the great work you, Steven, and Nick have been doing.

@stevenyoungs (Contributor) commented:

> I always tend to fix these out of habit, since pylint and a SonarQube scan will call them out for refactoring.

Maybe this is something that should be incorporated into the pipeline.

@dsblank (Member, Author) commented Feb 2, 2025

> > I always tend to fix these out of habit, since pylint and a SonarQube scan will call them out for refactoring.
>
> Maybe this is something that should be incorporated into the pipeline.

Please no. There are lots of styles worth enforcing, but not this one.

@Nick-Hall self-assigned this Feb 2, 2025
@Nick-Hall (Member) commented:

@dsblank I tried to fix a couple of conflicts, but now the unit tests are failing. It looks like the problem is due to the new type checking. Feel free to check or revert my merge.

@dsblank (Member, Author) commented Feb 2, 2025

OK, I'll take a look.

@dsblank (Member, Author) commented Feb 3, 2025

@Nick-Hall, types now adjusted. And a bug found via the type checks (a method name was misspelled). I had to use a few # type: ignore comments as I couldn't figure out the correct way to handle the conflicts.
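
To illustrate the kind of bug the type checks catch (hypothetical names, not the actual misspelled method):

```python
class Rule:
    def apply_to_one(self, obj: dict) -> bool:
        return bool(obj)

def run(rule: Rule, obj: dict) -> bool:
    # With the hint on `rule`, a typo like rule.aply_to_one(obj) is
    # flagged by mypy ('"Rule" has no attribute "aply_to_one"')
    # instead of surfacing as an AttributeError at runtime.
    return rule.apply_to_one(obj)
```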

This is probably going to be difficult to merge in a few places.

@Nick-Hall (Member) commented:

@dsblank Is this ready for me to do a final review?

@dsblank (Member, Author) commented Feb 3, 2025

Yes, all ready!

@dsblank (Member, Author) commented Feb 3, 2025

Do you want me to resolve conflicts?

@Nick-Hall (Member) commented Feb 3, 2025

> Do you want me to resolve conflicts?

Yes, please. I will review after they have been fixed.

@dsblank (Member, Author) commented Feb 3, 2025

@Nick-Hall, I corrected the diffs in the web UI and now everything passes tests, but now it wants to do a rebase via the command line. But git rebase master seems like it wants to replay every commit. Any suggestions?

@Nick-Hall (Member) left a review:

This PR looks good. I clearly have to thank previous reviewers for that.

Just a couple of things I noticed:

  1. The po/POTFILES.skip file needs updating:
     • Add gramps/gen/filters/optimizer.py
     • Remove gramps/gen/filters/rules/family/_memberbase.py
  2. The typing modules in gramps/gen/filters/rules are imported using absolute imports when they should be using relative imports (see the example below).
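
For example (the imported module here is illustrative; gramps.gen.lib is used just to show the depth of the relative path from gramps/gen/filters/rules):

```python
# Absolute import, as in the PR:
from gramps.gen.lib import Person

# Relative import, as requested, from a module inside gramps/gen/filters/rules:
from ...lib import Person
```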

@Nick-Hall (Member) commented:

> I corrected the diffs in the web UI and now everything passes tests, but now it wants to do a rebase via the command line. But git rebase master seems like it wants to replay every commit. Any suggestions?

The web UI creates merge commits, which we don't want in our commit history. The workflow that I recommend is to rebase from the command line on a regular basis and force-push back to the branch.

However, your workflow doesn't actually cause a problem. To merge, I would do a git merge --squash followed by a git commit. I would tend to use the PR title and contents of the first comment for the commit message. I don't mind doing this for you.

If you wanted to have several commits in a PR, then you would need to avoid the branch merges. With multiple commits, I still rebase, but then merge with the --no-ff option to group the commits while preserving a linear history. A linear history helps if we need to bisect to find a commit that introduced a bug.
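
A sketch of the commands behind both workflows, using this PR's branch name:

```sh
# Contributor: rebase regularly and force-push back to the branch
git checkout dsb/refactor-filter-and-optimize
git rebase master
git push --force-with-lease origin dsb/refactor-filter-and-optimize

# Maintainer, single logical change: squash into one commit
git checkout master
git merge --squash dsb/refactor-filter-and-optimize
git commit    # message taken from the PR title and first comment

# Maintainer, multi-commit PR: group the commits, keep a linear history
git merge --no-ff dsb/refactor-filter-and-optimize
```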

We also need to remember that the commit messages are used to create the ChangeLog, and the first and last lines (and possibly more) become part of the NEWS file and release notes. I often have to remind developers about this.

Finally, I was going to mention that we normally work in our own forks now rather than creating branches in origin. Again, it doesn't actually cause a problem, so I didn't think it was worth mentioning earlier.

@dsblank (Member, Author) commented Feb 3, 2025

> I don't mind doing this for you.

Thanks, I would appreciate that!

> I clearly have to thank previous reviewers for that.

Absolutely. And they aren't too numerous to name: @cdhorn, @stevenyoungs, and @fxtmtrsine.

@dsblank (Member, Author) commented Feb 3, 2025

And thanks to @emyoulation and others for emoji encouragement and positive feedback :)

@dsblank (Member, Author) commented Feb 3, 2025

@Nick-Hall, can you take it from here? Let me know if there is still resolving to be done.

This change does four things:

1. Unrolls the recursive call of filter.apply(), and splits out single checks
   into filter.apply_to_one().
2. Uses data attributes (`person.gender`) rather than accessor functions
   (`person.get_gender()`) where possible.
3. Adds an optimizer based on `rule.selected_handles` sets.
4. Adds type hints to make sure the right objects are passed into methods.

Final comparison of finding those related to home person in 40k person Family
Tree, between Gramps 5.2 and this change (Gramps 6.0), in seconds (smaller
is better):

Version | Prepare Time | Apply Time | Total Time
--------| -------------:|-----------:|----------:
Gramps 5.2 | 4.5 | 27.7 | 32.2
Gramps 6.0 | 8.0 | 0.5 | 8.5

The above uses the optimizer. Here is a test finding all people with a tag
(5 people match):

Version | Prepare Time | Apply Time | Total Time
--------| -------------:|-----------:|----------:
Gramps 5.2 | 0.0 | 5.0 | 5.0
Gramps 6.0 | 0.0 | 1.6 | 1.6

Recall that converting from JSON to objects is a little slower than converting
from array BLOBs to objects, so this is a large improvement.

Co-authored-by: Christopher Horn <[email protected]>
Co-authored-by: stevenyoungs <[email protected]>
@Nick-Hall force-pushed the dsb/refactor-filter-and-optimize branch from 71b3a16 to 1280aa4 February 3, 2025
@Nick-Hall (Member) commented:

Squashed and rebased.

@Nick-Hall removed their assignment Feb 3, 2025
@Nick-Hall merged commit 1280aa4 into master Feb 3, 2025 (3 checks passed)
@Nick-Hall deleted the dsb/refactor-filter-and-optimize branch February 3, 2025
@QuLogic (Contributor) commented Feb 4, 2025

> A linear history helps if we need to bisect to find a commit that introduced a bug.

git bisect is quite happy to work without linear histories; that's not directly a reason to require linear history. What makes bisection difficult is non-atomic commits (i.e., ones that are broken and fixed up later), which squash merging can help with but has other trade-offs.
