Skip to content

Conversation

tcely
Copy link
Contributor

@tcely tcely commented Jan 7, 2025

These functions aren't being used yet, they will be tested against my database before that happens.

My testing went well.

I've added two new settings to let people easily try this out.

  • SHRINK_NEW_MEDIA_METADATA must be set to True for newly retrieved metadata to be filtered.

  • SHRINK_OLD_MEDIA_METADATA must be set to True for loaded metadata to be updated with filtered metadata.

Please do try out whichever way is more useful for you.

tcely added 7 commits January 7, 2025 00:05
These functions aren't being used yet, they will be tested against my database before that happens.
The software doesn't need an extra space per key.
The `automatic_captions` has a layer for language codes that I didn't account for.

The type checking was copied and I didn't adjust for the arguments in this function.
@tcely
Copy link
Contributor Author

tcely commented Jan 7, 2025

Well, this has been successful so far.

: metadata reduced by 10,291 characters (516,747 -> 506,456)

My logs show each media item is storing around ~10k in just URLs that we never touch.

@tcely
Copy link
Contributor Author

tcely commented Jan 7, 2025

Dropping the extra format keys has resulted in more savings.

: metadata reduced by 17,428 characters (540,141 -> 522,713)

I think this is all I want for this pull request. More savings could be obtained by compressing this text or changing to another format other than text for storage.

I'm marking this for review, even though it doesn't actually change the metadata at this point.

Hooking this up should probably happen with a setting to enable it to allow users to opt into this change.

@tcely tcely marked this pull request as ready for review January 7, 2025 11:08
@tcely
Copy link
Contributor Author

tcely commented Jan 7, 2025

Fixing the removal of URLs from formats made a huge difference.

: metadata reduced by 584,845 characters (652,362 -> 67,517)

Also, I learned a cute trick to sort by metadata length.

from sync.models import Source, Media
my_media = Media.objects.all()
from django.db.models.functions import Length
ordered = my_media.annotate(metadata_length=Length('metadata')).order_by('-metadata_length')
media = ordered[0]

tcely added 4 commits January 8, 2025 11:31
This was misleading because the data dict becomes a JSON string.
First, only check that changes did happen.
@meeb
Copy link
Owner

meeb commented Jan 9, 2025

This looks pretty nice, thanks. Providing this keeps the core metadata, streams available, languages for subtitles and audio it should be a nice addition and not affect anything else. The only thing I'd comment here is historically I've noticed the JSON can be quite fluid and change a bit over time, I assume as YouTube add, delete or refactor things so this may introduce code that requires more maintenance than anything else that currently calls yt-dlp.

@tcely
Copy link
Contributor Author

tcely commented Jan 9, 2025

This looks pretty nice, thanks. Providing this keeps the core metadata, streams available, languages for subtitles and audio it should be a nice addition and not affect anything else.

I've been intentionally conservative about removing things. It should only be removing metadata that we don't or can't use.

Even with this approach, I'm seeing huge reductions in the size of the database copies that I've tried this code on.

The only thing I'd comment here is historically I've noticed the JSON can be quite fluid and change a bit over time, I assume as YouTube add, delete or refactor things so this may introduce code that requires more maintenance than anything else that currently calls yt-dlp.

I actually don't think the maintenance will be bad at all, the URL is unlikely to change. In the case that it did, failing to remove the unwanted data won't be harmful. We'd just end up in our current state.

@meeb
Copy link
Owner

meeb commented Jan 14, 2025

I've glanced over the code but not tried this locally as I'm a bit short on time this week. You happy for this to be merged?

@tcely
Copy link
Contributor Author

tcely commented Jan 14, 2025

I added a commit to hide the extra logging, that would otherwise be misleading for people not testing.

So, I'm happy with merging this now. It's not expected to change anything without the settings.

@meeb
Copy link
Owner

meeb commented Jan 14, 2025

Sounds good, given it's disabled by default I'll just merge this for now. Thanks!

@meeb meeb merged commit 51153f0 into meeb:main Jan 14, 2025
@tcely tcely deleted the filter-metadata-response branch January 14, 2025 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants