Filter metadata to avoid storing excess text in the database table #612
Conversation
These functions aren't being used yet; they will be tested against my database before that happens.
The software doesn't need an extra space per key.
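This presumably refers to compact JSON separators; a minimal sketch of the idea (an assumption on my part, not the PR's code):

```python
import json

data = {'title': 'example', 'duration': 123}

# json.dumps() adds a space after ':' and ',' by default, which costs one
# extra byte per key and per item in a large metadata blob.
print(json.dumps(data))                          # {"title": "example", "duration": 123}

# Compact separators drop those spaces.
print(json.dumps(data, separators=(',', ':')))   # {"title":"example","duration":123}
```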
The `automatic_captions` key has a layer for language codes that I didn't account for. The type checking was copied and I didn't adjust it for the arguments in this function.
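For illustration, that extra layer looks roughly like `{language_code: [caption_format, ...]}`, so the filter has to descend one level further than it does for `formats`. A minimal sketch, with the helper name and key set assumed rather than taken from the PR:

```python
UNWANTED_CAPTION_KEYS = frozenset({'url'})

def filter_caption_urls(metadata: dict) -> dict:
    # 'automatic_captions' and 'subtitles' are keyed by language code first,
    # e.g. {'en': [{'ext': 'vtt', 'url': '...'}, ...]}, so the unwanted keys
    # live one level deeper than the entries in 'formats'.
    for caption_key in ('automatic_captions', 'subtitles'):
        for caption_formats in (metadata.get(caption_key) or {}).values():
            for caption_format in caption_formats:
                for key in UNWANTED_CAPTION_KEYS & set(caption_format):
                    del caption_format[key]
    return metadata
```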
Well, this has been successful so far.
My logs show each media item is storing around 10k in just URLs that we never touch.
Dropping the extra format keys has resulted in more savings.
I think this is all I want for this pull request. More savings could be obtained by compressing this text or switching to a storage format other than plain text. I'm marking this for review, even though it doesn't actually change the metadata at this point. Hooking this up should probably happen behind a setting so that users can opt into this change.
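As a rough sketch of the kind of filtering described above (the key names below are examples of bulky, short-lived fields commonly present in yt-dlp format entries, not the PR's actual list):

```python
UNWANTED_FORMAT_KEYS = frozenset({
    'url',
    'manifest_url',
    'fragment_base_url',
    'fragments',
    'http_headers',
})

def shrink_formats(metadata: dict) -> dict:
    # Drop the per-format fields we never read back; the signed URLs in
    # particular expire quickly, so storing them has no value.
    for fmt in metadata.get('formats') or []:
        for key in UNWANTED_FORMAT_KEYS & set(fmt):
            del fmt[key]
    return metadata
```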
Also, name the tuple values when using the results.
Fixing the removal of URLs from formats made a huge difference.
Also, I learned a cute trick to sort by metadata length.
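If the trick is a Django ORM one, it may be an annotation with `Length` ordered descending; this is a guess at what is meant, and the `Media` model and `metadata` field names are assumptions for the sketch:

```python
from django.db.models.functions import Length

from sync.models import Media  # assumed model and import path for this sketch

# Annotate each row with the length of its metadata text and list the
# largest items first, which makes the worst offenders easy to spot.
largest_first = (
    Media.objects
    .annotate(metadata_length=Length('metadata'))
    .order_by('-metadata_length')
)
for media in largest_first[:10]:
    print(media.pk, media.metadata_length)
```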
This was misleading because the data dict becomes a JSON string.
First, only check that changes did happen.
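A hedged sketch of that check, assuming a Django-style object with a `metadata` text field (names are illustrative): since the stored value is a JSON string, the comparison has to happen after re-serialising the filtered dict.

```python
import json

def save_if_changed(media, filtered: dict) -> bool:
    """Write the filtered metadata back only if it actually differs."""
    new_text = json.dumps(filtered, separators=(',', ':'))  # compact, no extra space per key
    if new_text == media.metadata:
        return False  # nothing changed, skip the database write
    media.metadata = new_text
    media.save(update_fields=['metadata'])
    return True
```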
This looks pretty nice, thanks. Provided this keeps the core metadata, the available streams, and the languages for subtitles and audio, it should be a nice addition and not affect anything else. The only thing I'd note is that, historically, I've seen the JSON be quite fluid and change a bit over time, I assume as YouTube adds, deletes or refactors things, so this may introduce code that requires more maintenance than anything else that currently calls yt-dlp.
I've been intentionally conservative about removing things. It should only be removing metadata that we don't or can't use. Even with this approach, I'm seeing huge reductions in the size of the database copies that I've tried this code on.
I actually don't think the maintenance will be bad at all; the URL is unlikely to change. If it did change, failing to remove the unwanted data wouldn't be harmful. We'd just end up in our current state.
I've glanced over the code but not tried this locally as I'm a bit short on time this week. Are you happy for this to be merged?
I added a commit to hide the extra logging, which would otherwise be misleading for people who aren't testing. So I'm happy with merging this now. It's not expected to change anything without the settings.
Sounds good, given it's disabled by default I'll just merge this for now. Thanks!
> These functions aren't being used yet; they will be tested against my database before that happens.

My testing went well.
I've added two new settings to let people easily try this out.
`SHRINK_NEW_MEDIA_METADATA` must be set to `True` for newly retrieved metadata to be filtered.
`SHRINK_OLD_MEDIA_METADATA` must be set to `True` for loaded metadata to be updated with filtered metadata.

Please do try out whichever way is more useful for you.
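As a purely illustrative sketch of opting in (the exact configuration mechanism depends on how you deploy, so treat this as an example settings override rather than documented usage):

```python
# Both settings are assumed to default to disabled.
SHRINK_NEW_MEDIA_METADATA = True   # filter metadata as it is retrieved
SHRINK_OLD_MEDIA_METADATA = True   # rewrite stored metadata with the filtered version when loaded
```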