Conversation

@dsblank (Member) commented Nov 22, 2024

This PR does four things:

  1. Unrolls the recursive call of filter.apply(), and splits out single checks into filter.apply_to_one() (see the sketch below)
  2. Uses data attributes (person.gender) rather than accessor functions (person.get_gender()) where possible
  3. Adds an optimizer based on rule.selected_handles sets (sketched after the benchmarks)
  4. Adds type hints to make sure the right objects are passed into methods
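
A minimal sketch of the apply()/apply_to_one() split in item 1 (not the actual Gramps implementation; the rule interface here is assumed for illustration):

```python
class GenericFilter:
    """Sketch only: no recursion between apply() and the rules."""

    def __init__(self, rules):
        self.rules = rules

    def apply_to_one(self, db, obj):
        # The single-object check that used to be buried inside a
        # recursive apply(): True when every rule matches this object.
        return all(rule.apply_to_one(db, obj) for rule in self.rules)

    def apply(self, db, objects):
        # Plain iteration over the candidates; no recursion.
        return [obj for obj in objects if self.apply_to_one(db, obj)]
```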

Final comparison, finding those related to the home person in a 40k-person Family Tree, between Gramps 5.2 and master + this PR (Gramps 6.0), in seconds (smaller is better):

Version | Prepare Time | Apply Time | Total Time
--------|-------------:|-----------:|----------:
Gramps 5.2 | 4.5 | 27.7 | 32.2
Gramps 6.0 | 8.0 | 0.5 | 8.5

The above uses the optimizer. Here is a test finding all people with a tag (5 people match):

Version | Prepare Time | Apply Time | Total Time
--------|-------------:|-----------:|----------:
Gramps 5.2 | 0.0 | 5.0 | 5.0
Gramps 6.0 | 0.0 | 1.6 | 1.6

Recall that converting from JSON to objects is a little slower than converting from array BLOBs to objects, so this is a large improvement.
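
For context, here is a hedged sketch of the optimizer idea in item 3; only rule.selected_handles comes from the PR, and the function name and set-intersection shape are assumptions for illustration:

```python
def candidate_handles(rules, all_handles):
    """Sketch: intersect the selected_handles sets of AND-ed rules."""
    selected = [set(rule.selected_handles) for rule in rules
                if getattr(rule, "selected_handles", None) is not None]
    if not selected:
        return all_handles        # no rule can narrow the search space
    candidates = selected[0]
    for handles in selected[1:]:
        candidates &= handles     # AND semantics: intersect the sets
    return candidates
```

With a narrow candidate set, apply() only needs to visit a handful of handles instead of scanning the whole tree, which would explain the drop in Apply Time above.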

@dsblank self-assigned this Nov 22, 2024
@dsblank (Member, Author) commented Nov 30, 2024

Thanks @stevenyoungs for looking this over and the feedback. I'm hoping that @Nick-Hall is open to these three ideas (fixes, using JSON data, and the optimizer) because all three make the filter system so much faster while keeping the API almost the same. Many places in Gramps use the raw data for speed, including tools, views, displayers, and importers. It would be a shame if filters couldn't do the same.

@stevenyoungs (Contributor) commented:

> Thanks @stevenyoungs for looking this over and the feedback. I'm hoping that @Nick-Hall is open to these three ideas (fixes, using JSON data, and the optimizer) because all three make the filter system so much faster while keeping the API almost the same. Many places in Gramps use the raw data for speed, including tools, views, displayers, and importers. It would be a shame if filters couldn't do the same.

No problem. I'm also keen to see the benefits of more standard storage and see what opportunities it unlocks. The recent discussion of GQL is an interesting option.
I've learnt a lot from following your changes and hopefully my comments have had a minor benefit to the overall quality of the PR.

@Nick-Hall (Member) commented:

> hopefully my comments have had a minor benefit to the overall quality of the PR.

Your feedback is very useful. It always helps to have another person review the code. I very much appreciate your contributions.

@Nick-Hall (Member) commented:

> I'm hoping that @Nick-Hall is open to these three ideas (fixes, using JSON data, and the optimizer) because all three make the filter system so much faster while keeping the API almost the same.

With the new JSON format, I think that we can regard the raw data as providing lightweight objects. The format mirrors the Gramps objects and is therefore much easier to understand and more resilient to updates.

I've been doing some research today to see if we can make the dict format more object-like. For example, data.gender would perhaps be neater than data["gender"]. The dataclasses introduced in Python 3.7 look interesting, but I don't think that they are of much help to us. Using a JSON parser written in C may be an option in the future, but the ones I looked at deserialize to dict structures.

I am open to using JSON data in more core code, including filters. We should probably discuss where the raw format is acceptable. Third-party code should be careful when using the raw format. It may change between feature releases, not just major releases, but this has always been the case.

An optimizer seems like a good idea. I haven't reviewed your code yet though.

@dsblank (Member, Author) commented Dec 2, 2024

> I've been doing some research today to see if we can make the dict format more object-like. For example, data.gender would perhaps be neater than data["gender"].

How about just wrapping our raw data with DataDict(data):

```python
class DataDict(dict):
    """A dict whose keys can be read as attributes, wrapping nested values."""

    def __getattr__(self, key):
        try:
            value = self[key]
        except KeyError:
            # __getattr__ must raise AttributeError, not KeyError,
            # so that hasattr() and getattr() defaults behave correctly.
            raise AttributeError(key) from None
        if isinstance(value, dict):
            return DataDict(value)
        if isinstance(value, list):
            return DataList(value)
        return value


class DataList(list):
    """A list that wraps nested dicts and lists on item access."""

    def __getitem__(self, position):
        value = super().__getitem__(position)
        if isinstance(value, dict):
            return DataDict(value)
        if isinstance(value, list):
            return DataList(value)
        return value
```

It allows exactly what you describe, is low-cost, and doesn't require any more code than that above.
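
For instance, with some made-up person data:

```python
data = DataDict({"gender": 1,
                 "parent_family_list": [{"ref": "F0001"}]})
print(data.gender)                     # 1, via attribute access
print(data.parent_family_list[0].ref)  # nested values wrapped on the fly
```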

> I am open to using JSON data in more core code, including filters. We should probably discuss where the raw format is acceptable. Third-party code should be careful when using the raw format. It may change between feature releases, not just major releases, but this has always been the case.

Agreed. The rule of thumb has been: use the raw format where it is needed for speed or easy representation.

@stevenyoungs (Contributor) commented Dec 2, 2024

The only downside is that we leak the internal representation of the objects into the higher layers of code. But since those layers, in part, already use the raw format, this PR does not make a material difference.

@dsblank marked this pull request as ready for review December 3, 2024
@dsblank (Member, Author) commented Dec 3, 2024

> The only downside is that we leak the internal representation of the objects into the higher layers of code.

For many attributes of an object, it isn't so much "internal" now that we are switching to JSON. For example, this is the Person object API to get the parent family list:

```python
person.parent_family_list              # object attribute
person.get_parent_family_handle_list() # function call
```

With the above DataDict (plus some additional code) and the new JSON format, it could look like this (where person is really JSON data with a wrapper):

```python
person.parent_family_list              # JSON access
person.get_parent_family_handle_list() # function call, instantiates object
```

The new version would hide the fact that the function call creates the object (but only once). And because the JSON data mirrors the actual attribute names, the calling code doesn't have to change at all.
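
A self-contained sketch of that "instantiates the object, but just once" idea, building on the DataDict above (Person here is a stand-in, not the real Gramps class):

```python
class Person:
    """Stand-in for the heavyweight Gramps object."""

    def __init__(self, data):
        self.parent_family_list = data["parent_family_list"]

    def get_parent_family_handle_list(self):
        return list(self.parent_family_list)


class PersonData(DataDict):
    """Raw JSON data with lazy object creation behind the accessor."""

    def get_parent_family_handle_list(self):
        if "_person" not in self.__dict__:
            # Build the full object on first use only, then cache it.
            self.__dict__["_person"] = Person(self)
        return self.__dict__["_person"].get_parent_family_handle_list()
```

Reading person.parent_family_list stays a plain JSON access; only the accessor pays the object-construction cost, and only once.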

@dsblank changed the base branch from dsb/depickle to master December 8, 2024
@dsblank (Member, Author) commented Dec 11, 2024

Created PR #1824 to explore attribute access on the JSON dict.

@dsblank (Member, Author) commented Dec 14, 2024

I'll update this PR to use the new DataDict attribute access, which will actually revert most of the data access to what it is in master.

@dsblank (Member, Author) commented Dec 19, 2024

@Nick-Hall, I want to refactor this PR based on the DataDict from #1824. Should I change the base branch to #1824, or wait until it is merged?

@Nick-Hall (Member) commented:

I have reviewed and merged PR #1824. Normally we would wait to give people longer to comment, but in this case the PR was related to existing work, so I made an exception.

@dsblank (Member, Author) commented Dec 21, 2024

Thank you!

@dsblank requested review from stevenyoungs and Nick-Hall and removed the request for stevenyoungs December 22, 2024
@dsblank (Member, Author) commented Feb 2, 2025

Thanks @cdhorn for the reviews! I'm not going to change the if/else constructs, as I believe they are clearer and more explicit.

@cdhorn (Contributor) commented Feb 2, 2025

> Thanks @cdhorn for the reviews! I'm not going to change the if/else constructs, as I believe they are clearer and more explicit.

You're welcome. I always tend to fix these out of habit, since pylint and a SonarQube scan will call them out for refactoring.

It's nice seeing all the great work you, Steven, and Nick have been doing.

@stevenyoungs (Contributor) commented:

> I always tend to fix these out of habit, since pylint and a SonarQube scan will call them out for refactoring.

Maybe this is something that should be incorporated into the pipeline.

@dsblank (Member, Author) commented Feb 2, 2025

> > I always tend to fix these out of habit, since pylint and a SonarQube scan will call them out for refactoring.
>
> Maybe this is something that should be incorporated into the pipeline.

Please no. There are lots of styles worth enforcing, but not this one.

@Nick-Hall self-assigned this Feb 2, 2025
@Nick-Hall (Member) commented:

@dsblank I tried to fix a couple of conflicts, but now the unit tests are failing. It looks like the problem is due to the new type checking. Feel free to check or revert my merge.

@dsblank (Member, Author) commented Feb 2, 2025

OK, I'll take a look.

@dsblank (Member, Author) commented Feb 3, 2025

@Nick-Hall, types now adjusted. And a bug found via the type checks (a method name was misspelled). I had to use a few # type: ignore comments as I couldn't figure out the correct way to handle the conflicts.
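
To illustrate the kind of bug the type checks catch (hypothetical names, not the actual misspelled method):

```python
class Rule:
    def apply_to_one(self, obj: dict) -> bool:
        return bool(obj)

def run(rule: Rule, obj: dict) -> bool:
    # With the hint on `rule`, a typo like rule.aply_to_one(obj) is
    # flagged by mypy ('"Rule" has no attribute "aply_to_one"')
    # instead of surfacing as an AttributeError at runtime.
    return rule.apply_to_one(obj)
```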

This is probably going to be difficult to merge in a few places.

@Nick-Hall (Member) commented:

@dsblank Is this ready for me to do a final review?

@dsblank (Member, Author) commented Feb 3, 2025

Yes, all ready!

@dsblank (Member, Author) commented Feb 3, 2025

Do you want me to resolve conflicts?

@Nick-Hall (Member) commented Feb 3, 2025

> Do you want me to resolve conflicts?

Yes, please. I will review after they have been fixed.

@dsblank (Member, Author) commented Feb 3, 2025

@Nick-Hall, I corrected the diffs in the web UI and now everything passes tests, but now it wants to do a rebase via the command line. But git rebase master seems like it wants to replay every commit. Any suggestions?

@Nick-Hall (Member) left a review:

This PR looks good. I clearly have to thank previous reviewers for that.

Just a couple of things I noticed:

  1. The po/POTFILES.skip file needs updating:
     • Add gramps/gen/filters/optimizer.py
     • Remove gramps/gen/filters/rules/family/_memberbase.py
  2. The typing modules in gramps/gen/filters/rules are imported using absolute imports when they should be using relative imports (see the example below).
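
For example (the imported module here is illustrative; gramps.gen.lib is used just to show the depth of the relative path from gramps/gen/filters/rules):

```python
# Absolute import, as in the PR:
from gramps.gen.lib import Person

# Relative import, as requested, from a module inside gramps/gen/filters/rules:
from ...lib import Person
```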

@Nick-Hall (Member) commented:

> I corrected the diffs in the web UI and now everything passes tests, but now it wants to do a rebase via the command line. But git rebase master seems like it wants to replay every commit. Any suggestions?

The web UI creates merge commits, which we don't want in our commit history. The workflow that I recommend is to rebase from the command line on a regular basis and force-push back to the branch.

However, your workflow doesn't actually cause a problem. To merge, I would do a git merge --squash followed by a git commit. I would tend to use the PR title and contents of the first comment for the commit message. I don't mind doing this for you.

If you wanted to have several commits in a PR, then you would need to avoid the branch merges. With multiple commits, I still rebase, but then merge with the --no-ff option to group the commits while preserving a linear history. A linear history helps if we need to bisect to find a commit that introduced a bug.
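
A sketch of the commands behind both workflows, using this PR's branch name:

```sh
# Contributor: rebase regularly and force-push back to the branch
git checkout dsb/refactor-filter-and-optimize
git rebase master
git push --force-with-lease origin dsb/refactor-filter-and-optimize

# Maintainer, single logical change: squash into one commit
git checkout master
git merge --squash dsb/refactor-filter-and-optimize
git commit    # message taken from the PR title and first comment

# Maintainer, multi-commit PR: group the commits, keep a linear history
git merge --no-ff dsb/refactor-filter-and-optimize
```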

We also need to remember that the commit messages are used to create the ChangeLog, and the first and last lines (and possibly more) become part of the NEWS file and release notes. I often have to remind developers about this.

Finally, I was going to mention that we normally work in our own forks now rather than creating branches in origin. Again, it doesn't actually cause a problem, so I didn't think it was worth mentioning earlier.

@dsblank (Member, Author) commented Feb 3, 2025

> I don't mind doing this for you.

Thanks, I would appreciate that!

> I clearly have to thank previous reviewers for that.

Absolutely. And they aren't too numerous to name: @cdhorn, @stevenyoungs, and @fxtmtrsine.

@dsblank (Member, Author) commented Feb 3, 2025

And thanks to @emyoulation and others for emoji encouragement and positive feedback :)

@dsblank (Member, Author) commented Feb 3, 2025

@Nick-Hall, can you take it from here? Let me know if there is still resolving to be done.

This change does four things:

1. Unrolls the recursive call of filter.apply(), and splits out single checks
   into filter.apply_to_one().
2. Uses data attributes (`person.gender`) rather than accessor functions
   (`person.get_gender()`) where possible.
3. Adds an optimizer based on `rule.selected_handles` sets.
4. Adds type hints to make sure the right objects are passed into methods.

Final comparison of finding those related to home person in 40k person Family
Tree, between Gramps 5.2 and this change (Gramps 6.0), in seconds (smaller
is better):

Version | Prepare Time | Apply Time | Total Time
--------| -------------:|-----------:|----------:
Gramps 5.2 | 4.5 | 27.7 | 32.2
Gramps 6.0 | 8.0 | 0.5 | 8.5

The above uses the optimizer. Here is a test finding all people with a tag
(5 people match):

Version | Prepare Time | Apply Time | Total Time
--------| -------------:|-----------:|----------:
Gramps 5.2 | 0.0 | 5.0 | 5.0
Gramps 6.0 | 0.0 | 1.6 | 1.6

Recall that converting from JSON to objects is a little slower than converting
from array BLOBs to objects, so this is a large improvement.

Co-authored-by: Christopher Horn <[email protected]>
Co-authored-by: stevenyoungs <[email protected]>
@Nick-Hall force-pushed the dsb/refactor-filter-and-optimize branch from 71b3a16 to 1280aa4 February 3, 2025
@Nick-Hall (Member) commented:

Squashed and rebased.

@Nick-Hall removed their assignment Feb 3, 2025
@Nick-Hall merged commit 1280aa4 into master Feb 3, 2025 (3 checks passed)
@Nick-Hall deleted the dsb/refactor-filter-and-optimize branch February 3, 2025
@QuLogic (Contributor) commented Feb 4, 2025

> A linear history helps if we need to bisect to find a commit that introduced a bug.

git bisect is quite happy to work without linear histories; that's not directly a reason to require linear history. What makes bisection difficult is non-atomic commits (i.e., ones that are broken and fixed up later), which squash merging can help with but has other trade-offs.
