Skip to content

Conversation

Sovietaced
Copy link
Member

@Sovietaced Sovietaced commented May 13, 2025

Tracking issue

Closes #3305

Why are the changes needed?

Flyte data may be stored in S3 buckets that have retention policies. If those retention policies delete the underlying data in S3 the entries in Flyte data catalog are stale and point to nonexistent buckets. This results in errors at task runtime in flytekit.

What changes were proposed in this pull request?

This pull request modifies Flyte data catalog such that artifacts in the catalog can expire and no longer be valid. The general approach is to allow for multiple rows with the same artifact key but to only retain one non-expired row ideally. I chose to allow for multiple rows to exist so that cleanup of expired artifacts could be done asynchronously. Also, there are multiple tables that hold references to the artifact table and ensuring these dependents were also expired seemed challenging.

Data Model

  1. A new column is added to the artifacts table, expires_at. This dictates when the artifact is no longer valid.
  2. The composite primary key for artifacts was broadened to include created_at such that there can be multiple rows with the same artifact key (ideally with one and only one non-expired entry if insertions are done carefully and atomically).
  3. The compose primary kettle for tag was breaded to include artifact_id such that multiple tags can exist with the same artifact key but can only point to one artifact.

Data Layer

  1. The data layer was updated to filter expired artifacts in get/list operations, this propagates to tags as well.
  2. The data layer was updated such that non-expired artifacts are searched before creating new ones.

Manager Layer

  1. The TTL is used with a pluggable clock to generated the expires_at value for records.

Propeller

Updated propeller to plumb the cache TTL from the task metadata to Flyte data catalog API calls.

Future Updates

At a high level I'd like to follow up on this work with a background job of some sort that queries the DB for expired cache entries and then thoroughly cleans them up by deleting the contents if possible and then later removing the rows from Postgres. Similar to what is done in #4655

I realize flytekit changes will need to follow up here as well but I think those will be relatively trivial and this change is entirely backwards compatible.

How was this patch tested?

Unit tests, for now.

Labels

Please add one or more of the following labels to categorize your PR:

  • added: For new features.
  • changed: For changes in existing functionality.
  • deprecated: For soon-to-be-removed features.
  • removed: For features being removed.
  • fixed: For any bug fixed.
  • security: In case of vulnerabilities

This is important to improve the readability of release notes.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

This pull request enhances the Flyte data catalog by introducing support for cache TTL, allowing artifacts to expire and preventing stale entries that could lead to runtime errors. Key modifications include the addition of an 'expires_at' column in the artifacts table and updates to the data layer for filtering expired artifacts. The implementation is designed to be backward compatible and includes unit tests to validate the new functionality.

@flyte-bot
Copy link
Collaborator

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Automatic Review Skipped - Draft PR

    Bito didn't auto-review because this pull request is in draft status.
    To trigger review, mark the PR as ready or type /review in the comment and save.
    You can change draft PR review settings here, or contact the agent instance creator at [email protected].

Copy link

codecov bot commented May 13, 2025

Codecov Report

Attention: Patch coverage is 87.64045% with 11 lines in your changes missing coverage. Please review.

Project coverage is 58.53%. Comparing base (a75cea0) to head (e44b246).
Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
datacatalog/pkg/manager/impl/artifact_manager.go 84.21% 2 Missing and 1 partial ⚠️
datacatalog/pkg/repositories/gormimpl/artifact.go 94.73% 2 Missing and 1 partial ⚠️
flyteplugins/go/tasks/plugins/array/catalog.go 33.33% 2 Missing ⚠️
datacatalog/pkg/repositories/postgres_repo.go 0.00% 1 Missing ⚠️
datacatalog/pkg/rpc/datacatalogservice/service.go 0.00% 1 Missing ⚠️
flytepropeller/pkg/controller/nodes/task/cache.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6451      +/-   ##
==========================================
+ Coverage   58.48%   58.53%   +0.04%     
==========================================
  Files         940      940              
  Lines       71584    71608      +24     
==========================================
+ Hits        41867    41915      +48     
+ Misses      26534    26516      -18     
+ Partials     3183     3177       -6     
Flag Coverage Δ
unittests-datacatalog 59.69% <90.36%> (+0.65%) ⬆️
unittests-flyteadmin 56.26% <ø> (ø)
unittests-flytecopilot 30.99% <ø> (ø)
unittests-flytectl 64.72% <ø> (ø)
unittests-flyteidl 76.12% <ø> (ø)
unittests-flyteplugins 61.08% <33.33%> (+0.13%) ⬆️
unittests-flytepropeller 54.78% <66.66%> (+<0.01%) ⬆️
unittests-flytestdlib 64.04% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Sovietaced Sovietaced force-pushed the issue-6450 branch 2 times, most recently from bae020b to 822ed4a Compare May 13, 2025 23:03
@flyte-bot
Copy link
Collaborator

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Automatic Review Skipped - Draft PR

    Bito didn't auto-review because this pull request is in draft status.
    To trigger review, mark the PR as ready or type /review in the comment and save.
    You can change draft PR review settings here, or contact the agent instance creator at [email protected].

3 similar comments
@flyte-bot
Copy link
Collaborator

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Automatic Review Skipped - Draft PR

    Bito didn't auto-review because this pull request is in draft status.
    To trigger review, mark the PR as ready or type /review in the comment and save.
    You can change draft PR review settings here, or contact the agent instance creator at [email protected].

@flyte-bot
Copy link
Collaborator

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Automatic Review Skipped - Draft PR

    Bito didn't auto-review because this pull request is in draft status.
    To trigger review, mark the PR as ready or type /review in the comment and save.
    You can change draft PR review settings here, or contact the agent instance creator at [email protected].

@flyte-bot
Copy link
Collaborator

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Automatic Review Skipped - Draft PR

    Bito didn't auto-review because this pull request is in draft status.
    To trigger review, mark the PR as ready or type /review in the comment and save.
    You can change draft PR review settings here, or contact the agent instance creator at [email protected].

@Sovietaced Sovietaced changed the title WIP: Add support for cache ttl Add support for cache ttl May 21, 2025
@Sovietaced Sovietaced changed the title Add support for cache ttl Add support for cache TTL May 21, 2025
@Sovietaced Sovietaced added the added Merged changes that add new functionality label May 21, 2025
@flyte-bot
Copy link
Collaborator

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Automatic Review Skipped - Draft PR

    Bito didn't auto-review because this pull request is in draft status.
    To trigger review, mark the PR as ready or type /review in the comment and save.
    You can change draft PR review settings here, or contact the agent instance creator at [email protected].

1 similar comment
@flyte-bot
Copy link
Collaborator

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Automatic Review Skipped - Draft PR

    Bito didn't auto-review because this pull request is in draft status.
    To trigger review, mark the PR as ready or type /review in the comment and save.
    You can change draft PR review settings here, or contact the agent instance creator at [email protected].

@Sovietaced Sovietaced marked this pull request as ready for review May 30, 2025 02:59
defer func() {
if r := recover(); r != nil {
tx.Rollback()
err := h.db.WithContext(ctx).Transaction(func(tx *gorm.DB) error {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly refactoring to use the nicer transaction syntax


func (p *postgresErrorTransformer) ToDataCatalogError(err error) error {
var dce catalogErrors.DataCatalogError
if errors.As(err, &dce) {
Copy link
Member Author

@Sovietaced Sovietaced May 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little improvement so we can liberally call ToDatacatalogError on things especially in nested situations.

Signed-off-by: Jason Parraga <[email protected]>
@kumare3
Copy link
Contributor

kumare3 commented May 30, 2025

From reading you description it sees expired at is a better primary key member

@Sovietaced
Copy link
Member Author

From reading you description it sees expired at is a better primary key member

Yeah that would work as well. Just need to widen that composite key with something.

artifactModel.ArtifactData = artifactDataModels

// Reset TTL on the artifact since all the data is fresh.
if request.GetTtl() != nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be problematic depending on how the object storage engine implements the lifecycle rule for the object. It isn't required that the TTL be reset if a /PUT request is issued for a given key. For example, S3 triggers off of the creationTimestamp.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
added Merged changes that add new functionality review-needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

is it possible to add cache expiration feature?
5 participants