dsblank
Member

@dsblank dsblank commented Oct 10, 2024

This PR converts the database interface to use JSON data rather than the pickled blobs used since the early days.

  1. Uses a new abstraction in the database: db.serializer
    a. abstracts data column name
    b. contains serialize/unserialize functions
  2. Updates database format to 21
  3. The conversion from 20 to 21 reads pickled blobs, and writes JSON data.
    a. It does this by switching between serializers
  4. New databases do not contain pickled blobs
  5. Converted databases contain both fields
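
The serializer abstraction and the 20-to-21 conversion described above might look roughly like the sketch below. This is only an illustration under assumptions: the class names `BlobSerializer` and `JSONSerializer`, the `data_field` attribute, and the `convert_row` helper are hypothetical and not taken from the PR; only the column names `blob_data` and `json_data` appear in this conversation.

```python
import json
import pickle


class BlobSerializer:
    """Legacy serializer (hypothetical name): pickled data in the blob column."""

    data_field = "blob_data"  # abstracts the data column name

    @staticmethod
    def serialize(data):
        return pickle.dumps(data)

    @staticmethod
    def unserialize(raw):
        return pickle.loads(raw)


class JSONSerializer:
    """New serializer (hypothetical name): JSON text in the JSON column."""

    data_field = "json_data"

    @staticmethod
    def serialize(data):
        return json.dumps(data)

    @staticmethod
    def unserialize(raw):
        return json.loads(raw)


def convert_row(row):
    """Upgrade one row (a dict of column name -> value) by switching serializers."""
    # Read with the old serializer, write with the new one.
    data = BlobSerializer.unserialize(row[BlobSerializer.data_field])
    row[JSONSerializer.data_field] = JSONSerializer.serialize(data)
    return row


# A made-up format-20 row; a real row would hold a full object.
old_row = {"handle": "abc123", "blob_data": pickle.dumps({"gramps_id": "I0001"})}
new_row = convert_row(old_row)
```

After conversion the row carries both columns, which matches point 5; a newly created database would only ever use the JSON serializer.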

@dsblank dsblank requested a review from Nick-Hall October 10, 2024 21:56
@Nick-Hall
Member

If we are moving from BLOBs to JSON then we should really use the new format. See PR #800.

The new format uses the to_json and from_json methods in the serialize module to build the JSON from the underlying classes. It comes with get_schema class methods that provide a JSON Schema, which allows the validation that we already use in our unit tests.

The main benefit of the new format is that it is easier to maintain and debug. Instead of lists we use dictionaries. So, for example, we refer to the field "parent_family_list" instead of field number 9.
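
To make the difference concrete, here is a small sketch. The record layout is made up for illustration; only the field name "parent_family_list" and the index 9 come from the comment above.

```python
# Positional form: the meaning of a value depends entirely on its index.
# (Placeholder layout; the other slots are filled with None for illustration.)
person_as_list = ["abc123", "I0001", None, None, None, None, None, None, None, ["F0001"]]
parent_families = person_as_list[9]

# Dictionary form: the same value is addressed by a self-describing key.
person_as_dict = {
    "handle": "abc123",
    "gramps_id": "I0001",
    "parent_family_list": ["F0001"],
}
parent_families_named = person_as_dict["parent_family_list"]
```

If a field is added or moved, code that uses the key keeps working, whereas code that uses the index would need to be updated everywhere.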

Upgrades are no problem. We just read and write the raw data.

When I have more time I'll update you on the discussions that took place while you were away.

@dsblank
Member Author

dsblank commented Oct 11, 2024

Oh, that sounds like a great idea! I'll take a look at the JSON format and switch to that. Should work even better with the SQL JSON_EXTRACT().
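
As a rough illustration of why dictionary-shaped JSON pairs well with SQL, the sketch below queries a named field with SQLite's `json_extract()` function through Python's `sqlite3` module. The table layout and sample data are assumptions for the example, not the actual Gramps schema, and it presumes an SQLite build with the JSON functions available.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one handle column plus the JSON data column.
conn.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, json_data TEXT)")
conn.execute(
    "INSERT INTO person VALUES (?, ?)",
    ("abc123", json.dumps({"gramps_id": "I0001", "parent_family_list": ["F0001"]})),
)

# Pull a single named field out of the stored JSON without loading the object.
row = conn.execute(
    "SELECT json_extract(json_data, '$.gramps_id') FROM person WHERE handle = ?",
    ("abc123",),
).fetchone()
gramps_id = row[0]
conn.close()
```

With positional lists the path would have to be an index such as `'$[9]'`, which is much harder to read and maintain than a field name.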

@Nick-Hall
Member

There are a few places where the new format is used, so we will get some bonus performance improvements.

Feel free to make changes to my existing code if you see a benefit.

You may also want to have a quick look at how we serialize GrampsType. Enough information is stored so that we can recreate the object, but I don't think that I chose to store all fields.

@dsblank
Member Author

dsblank commented Oct 12, 2024

Making some progress. Turns out, the serialized format had leaked into many other places, probably for speed. Probably good candidates for business logic.

@dsblank
Member Author

dsblank commented Oct 13, 2024

I added a to_dict() and from_dict() based on the to_json() and from_json(). I didn't know about the object hooks. Brilliant! That saves so much code.
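
For readers unfamiliar with the hooks mentioned here, the following is a minimal sketch of the general technique using the standard `json` module's `default` and `object_hook` parameters. The `Name` and `Person` classes, the `"_class"` marker key, and the hook function names are all hypothetical; the actual serialize module may work differently.

```python
import json


class Name:
    def __init__(self, first_name="", surname=""):
        self.first_name = first_name
        self.surname = surname


class Person:
    def __init__(self, gramps_id="", primary_name=None):
        self.gramps_id = gramps_id
        self.primary_name = primary_name


# Registry used to find the class again when decoding (an assumption).
CLASSES = {"Name": Name, "Person": Person}


def to_dict_hook(obj):
    """`default` hook: called for any object json cannot encode itself."""
    data = {"_class": type(obj).__name__}
    data.update(vars(obj))
    return data


def from_dict_hook(data):
    """`object_hook`: called for every decoded JSON object, innermost first."""
    cls = CLASSES.get(data.pop("_class", None))
    if cls is None:
        return data  # a plain dictionary, leave it alone
    obj = cls()
    vars(obj).update(data)
    return obj


person = Person("I0001", Name("Ada", "Lovelace"))
text = json.dumps(person, default=to_dict_hook)
restored = json.loads(text, object_hook=from_dict_hook)
```

Because the hooks are applied recursively by the encoder and decoder, nested objects need no per-class conversion code, which is where the savings come from.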

@dsblank
Member Author

dsblank commented Oct 13, 2024

@Nick-Hall , I will probably need your assistance regarding the complete save/load of the to_json and from_json functions. I looked at your PR but as it touches 590 files, there is a lot there.

In this PR, I can now upgrade a database, and load the people views (except for name functions which I have to figure out).


@Nick-Hall
Member

@dsblank I have rebased PR #800 on the gramps51 branch. Only 25 files were actually changed.

You can also see the changes suggested by @prculley resulting from his testing and performance benchmarks.

@dsblank
Member Author

dsblank commented Oct 13, 2024

Thanks @Nick-Hall, that was very useful. I think that I will cherry pick some of the changes (like attribute name changes, elimination of private attributes).

You'll see that I did many of the same changes you made. But, one thing I found is that if we want to allow upgrades from previous versions, then we need to be able to read in blob_data, and write out json_data. I think my version has that covered.

I'll continue to make progress.

@Nick-Hall
Member

@dsblank Why are you removing the properties? The validation in the setters will no longer be called.

@dsblank
Member Author

dsblank commented Oct 14, 2024

@Nick-Hall, I thought that was what @prculley did for optimization, and I thought it was needed. I can put those back :)

@Nick-Hall
Member

Perhaps we could consider a solution similar to that provided by the pickle __getstate__ and __setstate__ methods.

A get_state method in a base class could return a dictionary of public attributes by default. This could be overridden to add properties if required.

A set_state method could write the values back. In the case of properties we could just set the corresponding private variable rather than calling the setter. The list to tuple conversion could also be done in this method.

I expect that only a handful of classes would need to override the default methods.
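
A minimal sketch of this proposal follows, assuming hypothetical `BaseObject` and `Sample` classes; the attribute names are invented for the example and no existing Gramps class is implied.

```python
import json


class BaseObject:
    """Hypothetical base class providing the default behaviour."""

    def get_state(self):
        # Default: every public attribute (no leading underscore).
        return {k: v for k, v in vars(self).items() if not k.startswith("_")}

    def set_state(self, state):
        # Default: write the values straight back onto the instance.
        for key, value in state.items():
            setattr(self, key, value)


class Sample(BaseObject):
    """One of the few classes that would need to override the defaults."""

    def __init__(self):
        self.title = ""
        self.span = (0, 0)  # tuples come back from JSON as lists
        self._lat = ""      # private variable behind a validated property

    @property
    def lat(self):
        return self._lat

    @lat.setter
    def lat(self, value):
        if not isinstance(value, str):
            raise TypeError("lat must be a string")
        self._lat = value

    def get_state(self):
        state = super().get_state()
        state["lat"] = self.lat  # add the property to the public state
        return state

    def set_state(self, state):
        state = dict(state)  # do not mutate the caller's dictionary
        self._lat = state.pop("lat")          # bypass the setter
        self.span = tuple(state.pop("span"))  # list-to-tuple conversion
        super().set_state(state)


original = Sample()
original.title = "Home"
original.lat = "51.5"
original.span = (1, 2)

# Round-trip through JSON, as the database layer would.
state = json.loads(json.dumps(original.get_state()))
copy = Sample()
copy.set_state(state)
```

The validation in the `lat` setter still applies to normal assignments; only the load path skips it, on the assumption that stored data was already validated when it was written.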

@dsblank
Member Author

dsblank commented Nov 29, 2024

CC: @Nick-Hall

@dsblank
Member Author

dsblank commented Dec 4, 2024

@Nick-Hall, do you have an estimate of when you might be able to review this PR (and the #1794 filter fixes)?

I have some available time coming up, and would like to start work on checking the addons for this next version (6.0 or 5.3).

@Nick-Hall
Member

@dsblank I'll make time this weekend, but may be able to start sooner.

@dsblank
Member Author

dsblank commented Dec 7, 2024

@Nick-Hall, if you'd like to meet over Google Meet or Zoom so that I can walk you (and others) through proposed changes, I'd be glad to.

@Nick-Hall
Member

This PR is looking good now.

I agree with you that the remaining serialize/unserialize code in the upgrade path should be left to another PR. Changes to the upgrades always require extra testing.

Would it be useful to log database upgrades? Perhaps a version table that could store the dates of database creation and any upgrades. I'm not suggesting adding to this PR though.

I'll convert my changes to add a create timestamp field to the primary objects so that they work with the new raw JSON format. Upgrades to the schema should be easier after this PR is merged. I'm not sure if we'll want to include it in the next release. We can discuss this later.

@dsblank
Member Author

dsblank commented Dec 8, 2024

Sounds good!

@dsblank
Member Author

dsblank commented Dec 8, 2024

Shall we merge this PR then?

@Nick-Hall
Member

I'm just about to do some final checks, followed by a rebase and merge now. It was getting late last night.

Contributor

@stevenyoungs stevenyoungs left a comment

Good from my point of view

@dsblank
Member Author

dsblank commented Dec 8, 2024

Thank you all for the reviews and comments!

@dsblank dsblank merged commit 81d1e01 into master Dec 8, 2024
3 checks passed
@Nick-Hall
Member

@dsblank Please don't merge PRs until I have done a final review. I was about to merge this, but noticed that the new gen.db.conversion_tools package was not listed in the setup.py and the two files it contains are not in the POTFILES.skip file.

In your merge a "Co-authored-by: stevenyoungs [email protected]" credit seems to have been lost. Was this intentional?

Otherwise, I appreciate that you squashed the commits and rebased to maintain a linear history according to our committing policies.

@dsblank
Member Author

dsblank commented Dec 8, 2024

Oh, sorry... how do you want to fix it? I did not mean to lose Steve's credit.

Nick-Hall added a commit to Nick-Hall/gramps that referenced this pull request Dec 8, 2024
@Nick-Hall
Member

I've created PR #1823 with the changes I was going to include.

Unfortunately, we can't go back and add the credit to the commit now. We can add a copyright line in a file if it hasn't already been done. I'll make sure that I mention Steve in the release announcement.

dsblank pushed a commit that referenced this pull request Dec 22, 2024
SNoiraud pushed a commit to SNoiraud/gramps that referenced this pull request Jan 26, 2025
This PR made the following changes:

* Database format 21: add JSON, remove pickle
* Rename new column to json_data
* Added to_dict, from_dict
* Refactor for upgrade uses
* Refactor serializers to classes
* Updated libgedcom
* Apply suggestions from code review
* Fixed broken test: couldn't replicate, so went with new results
* Migrated metadata to JSON
* Refine BSDDB
* Regular bug fix: citation date error
* Added logging to serialize
* A manual test script for validating conversion
SNoiraud pushed a commit to SNoiraud/gramps that referenced this pull request Jan 26, 2025
DavidMStraub added a commit to DavidMStraub/addons-source that referenced this pull request Feb 5, 2025
@Nick-Hall Nick-Hall deleted the dsb/depickle branch February 13, 2025 15:32
DavidMStraub added a commit to DavidMStraub/addons-source that referenced this pull request Mar 16, 2025
GaryGriffin pushed a commit to gramps-project/addons-source that referenced this pull request Mar 19, 2025
ForeverFloating pushed a commit to ForeverFloating/gramps that referenced this pull request Mar 21, 2025
ForeverFloating pushed a commit to ForeverFloating/gramps that referenced this pull request Mar 21, 2025