Filter metadata to avoid storing excess text in the database table #612
Conversation
These functions aren't being used yet; they will be tested against my database before that happens.
The software doesn't need an extra space per key.
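This presumably refers to compact JSON separators; a minimal sketch of the idea (an assumption on my part, not the PR's code):

```python
import json

data = {'title': 'example', 'duration': 123}

# json.dumps() adds a space after ':' and ',' by default, which costs one
# extra byte per key and per item in a large metadata blob.
print(json.dumps(data))                          # {"title": "example", "duration": 123}

# Compact separators drop those spaces.
print(json.dumps(data, separators=(',', ':')))   # {"title":"example","duration":123}
```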
The `automatic_captions` key has a layer for language codes that I didn't account for. The type checking was copied and I didn't adjust it for the arguments in this function.
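For illustration, that extra layer looks roughly like `{language_code: [caption_format, ...]}`, so the filter has to descend one level further than it does for `formats`. A minimal sketch, with the helper name and key set assumed rather than taken from the PR:

```python
UNWANTED_CAPTION_KEYS = frozenset({'url'})

def filter_caption_urls(metadata: dict) -> dict:
    # 'automatic_captions' and 'subtitles' are keyed by language code first,
    # e.g. {'en': [{'ext': 'vtt', 'url': '...'}, ...]}, so the unwanted keys
    # live one level deeper than the entries in 'formats'.
    for caption_key in ('automatic_captions', 'subtitles'):
        for caption_formats in (metadata.get(caption_key) or {}).values():
            for caption_format in caption_formats:
                for key in UNWANTED_CAPTION_KEYS & set(caption_format):
                    del caption_format[key]
    return metadata
```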
Well, this has been successful so far.
My logs show each media item is storing around 10k in just URLs that we never touch.
Dropping the extra format keys has resulted in more savings.
I think this is all I want for this pull request. More savings could be obtained by compressing this text or switching to a storage format other than plain text. I'm marking this for review, even though it doesn't actually change the metadata at this point. Hooking this up should probably happen behind a setting so that users can opt into this change.
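As a rough sketch of the kind of filtering described above (the key names below are examples of bulky, short-lived fields commonly present in yt-dlp format entries, not the PR's actual list):

```python
UNWANTED_FORMAT_KEYS = frozenset({
    'url',
    'manifest_url',
    'fragment_base_url',
    'fragments',
    'http_headers',
})

def shrink_formats(metadata: dict) -> dict:
    # Drop the per-format fields we never read back; the signed URLs in
    # particular expire quickly, so storing them has no value.
    for fmt in metadata.get('formats') or []:
        for key in UNWANTED_FORMAT_KEYS & set(fmt):
            del fmt[key]
    return metadata
```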
Also, name the tuple values when using the results.
Fixing the removal of URLs from formats made a huge difference.
Also, I learned a cute trick to sort by metadata length.
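If the trick is a Django ORM one, it may be an annotation with `Length` ordered descending; this is a guess at what is meant, and the `Media` model and `metadata` field names are assumptions for the sketch:

```python
from django.db.models.functions import Length

from sync.models import Media  # assumed model and import path for this sketch

# Annotate each row with the length of its metadata text and list the
# largest items first, which makes the worst offenders easy to spot.
largest_first = (
    Media.objects
    .annotate(metadata_length=Length('metadata'))
    .order_by('-metadata_length')
)
for media in largest_first[:10]:
    print(media.pk, media.metadata_length)
```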
This was misleading because the data dict becomes a JSON string.
First, only check that changes did happen.
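A hedged sketch of that check, assuming a Django-style object with a `metadata` text field (names are illustrative): since the stored value is a JSON string, the comparison has to happen after re-serialising the filtered dict.

```python
import json

def save_if_changed(media, filtered: dict) -> bool:
    """Write the filtered metadata back only if it actually differs."""
    new_text = json.dumps(filtered, separators=(',', ':'))  # compact, no extra space per key
    if new_text == media.metadata:
        return False  # nothing changed, skip the database write
    media.metadata = new_text
    media.save(update_fields=['metadata'])
    return True
```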
This looks pretty nice, thanks. Provided this keeps the core metadata, the available streams, and the languages for subtitles and audio, it should be a nice addition and not affect anything else. The only thing I'd note is that, historically, I've seen the JSON be quite fluid and change a bit over time, I assume as YouTube adds, deletes or refactors things, so this may introduce code that requires more maintenance than anything else that currently calls yt-dlp.
I've been intentionally conservative about removing things. It should only be removing metadata that we don't or can't use. Even with this approach, I'm seeing huge reductions in the size of the database copies that I've tried this code on.
I actually don't think the maintenance will be bad at all; the URL is unlikely to change. If it did change, failing to remove the unwanted data wouldn't be harmful. We'd just end up in our current state.
I've glanced over the code but not tried this locally as I'm a bit short on time this week. Are you happy for this to be merged?
I added a commit to hide the extra logging, which would otherwise be misleading for people who aren't testing. So I'm happy with merging this now. It's not expected to change anything without the settings.
Sounds good, given it's disabled by default I'll just merge this for now. Thanks!
> These functions aren't being used yet; they will be tested against my database before that happens.

My testing went well.
I've added two new settings to let people easily try this out.
`SHRINK_NEW_MEDIA_METADATA` must be set to `True` for newly retrieved metadata to be filtered.
`SHRINK_OLD_MEDIA_METADATA` must be set to `True` for loaded metadata to be updated with filtered metadata.

Please do try out whichever way is more useful for you.
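As a purely illustrative sketch of opting in (the exact configuration mechanism depends on how you deploy, so treat this as an example settings override rather than documented usage):

```python
# Both settings are assumed to default to disabled.
SHRINK_NEW_MEDIA_METADATA = True   # filter metadata as it is retrieved
SHRINK_OLD_MEDIA_METADATA = True   # rewrite stored metadata with the filtered version when loaded
```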