
URL and path in DMS #26

@rufuspollock

How should url vs path work in a DMS? Should we only have path?

```yaml
resource:
  path:
  data:

  # conventions
  sample:
```

Observations:

  • What do I want to represent:
    • "local" path (with storage handled somewhere)
    • "remote" file (storage is not managed by our system)
    • "inline" data
  • path and data have to be separate keys because one can't disambiguate a string that is a path from a string that is data (see the sketch below)
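
A minimal sketch of why the two keys can't be merged (the helper below is hypothetical, not part of any spec): with separate path and data keys a consumer never has to guess whether a string is a filename or literal content.

```python
# Hypothetical helper: with separate keys there is no ambiguity
# about how to interpret the string.
def read_bytes(resource: dict) -> bytes:
    if "data" in resource:  # inline: the string IS the content
        return resource["data"].encode("utf-8")
    if "path" in resource:  # path: the string points at the content
        with open(resource["path"], "rb") as f:
            return f.read()
    raise ValueError("resource needs either 'path' or 'data'")

# A single shared field could not distinguish these two:
read_bytes({"path": "abc.csv"})         # a path
read_bytes({"data": "a,b,c\n1,2,3\n"})  # literal CSV content
```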

Concepts:

```mermaid
graph LR

treedb[Tree DB]
blobkv[Local Blob KV store]
wc[Working Copy]
```

Problem

Let's say I have a file on disk and want to represent it in Frictionless:

```yaml
path: 'abc.csv'
```

On disk

```
/abc.csv: bytes ...
```

Then we add the fact that we want to separate storage from the working copy ...

```
# working copy
{
  path: 'abc.csv',
  sha256: '...'
}

# in my local content-addressed blob KV store
sha256: bytes ...
```

Then I can build my working copy from storage.

Then I can even move that storage to the cloud ...

```
sha256: url into LFS in the cloud ...
```
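
A sketch of that storage/working-copy split, under stated assumptions (the directory layout, `REMOTE_LFS` endpoint, and function names are all hypothetical): blobs live in a content-addressed store keyed by sha256, and checkout materializes the working copy from it, falling back to a cloud (LFS-style) URL when the blob isn't cached locally.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen

BLOB_DIR = Path(".dms/blobs")           # hypothetical local content-addressed KV store
REMOTE_LFS = "https://lfs.example.com"  # hypothetical cloud blob endpoint

def put(content: bytes) -> str:
    """Store bytes under their sha256; return the key."""
    oid = hashlib.sha256(content).hexdigest()
    BLOB_DIR.mkdir(parents=True, exist_ok=True)
    (BLOB_DIR / oid).write_bytes(content)
    return oid

def checkout(manifest: list[dict], dest: Path = Path(".")) -> None:
    """Build the working copy from storage; manifest entries are {path, sha256}."""
    for entry in manifest:
        blob = BLOB_DIR / entry["sha256"]
        if blob.exists():
            content = blob.read_bytes()
        else:
            # storage moved to the cloud: resolve the hash to a URL instead
            content = urlopen(f"{REMOTE_LFS}/{entry['sha256']}").read()
        (dest / entry["path"]).write_bytes(content)

oid = put(Path("abc.csv").read_bytes())
checkout([{"path": "abc.csv", "sha256": oid}])
```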

If I were starting from scratch for a DMS:

  • Datasets have resources
  • Resources have
    • working copy path (unique within this dataset resources)
    • data streams:
      • Local: identified by sha256 hash of the stream
      • Remote: identified by url
      • (Inline)
```yaml
resource:
  path: ...
  data: 'https:// ...' | 'oid:sha256' | 'data://'
```
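
A sketch of how a consumer might dispatch on that data field; the scheme prefixes are the ones proposed above, while the resolver itself is hypothetical.

```python
def classify(data: str) -> str:
    """Classify a resource's data pointer by its scheme prefix (sketch only)."""
    if data.startswith(("http://", "https://")):
        return "remote"   # storage not managed by our system
    if data.startswith("oid:sha256"):
        return "local"    # look the hash up in the local blob KV store
    if data.startswith("data://"):
        return "inline"   # the content itself travels with the descriptor
    raise ValueError(f"unknown data scheme: {data!r}")

assert classify("https://example.com/abc.csv") == "remote"
assert classify("oid:sha256:deadbeef") == "local"
```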


```yaml
resource:
  path: local | 'url'
```

In a perfect world ... what should CKAN return on GET resource X?

```
{
  path:

}
```

What does a user want to do with that ...

I want to get the byte stream ... (and know where to put it on disk ...)
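
One possible shape, purely illustrative (the issue leaves this open, so every field here is an assumption rather than CKAN's actual API): the response pairs the working-copy path with a data pointer, so a client can fetch the byte stream and knows where to put it on disk.

```python
# Illustrative only -- NOT CKAN's actual response format.
response = {
    "path": "abc.csv",            # where the bytes belong in the working copy
    "data": "oid:sha256:<hash>",  # how to obtain the byte stream
}
```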

Imagine I'm a data scientist or data engineer:

  • Pull a data project, work on it, and push back, where the project includes data
  • I want to depend on an "external" dataset (package) and have it be part of my project
    • But because most people don't do this yet AND data is big/changing (and I don't want to have to store it), I want to "package" external data

Catalogers ...

  • I want to catalog data I do store in my system
  • I want to catalog (similar to bundle) data I don't store in my system

Job stories

I want to pull and checkout and have my local copy in correct state

I want to depend on external dependencies and have them locally in a structured way.

  • A way to collaborate on a data project
    • => a way to version a data project
    • => efficient (centralized/cloud) file storage for large files
  • A way to manage data dependencies [following the go / pnpm approach]
    • A central content-addressed cache symlinked into projects in a flat structure (sketched below)
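
A sketch of that pnpm-style layout under stated assumptions (the cache location, the `data_modules/` directory, and the function name are all hypothetical): each dependency is fetched once into a shared content-addressed cache, and projects get symlinks into it rather than their own copies.

```python
from pathlib import Path

CACHE = Path.home() / ".dms-cache"  # hypothetical shared content-addressed cache

def link_dependency(project: Path, name: str, oid: str) -> None:
    """Symlink a cached blob into the project's flat data_modules/ dir (pnpm-style)."""
    target = CACHE / oid            # one copy on disk, keyed by content hash
    link = project / "data_modules" / name
    link.parent.mkdir(parents=True, exist_ok=True)
    if not link.is_symlink():
        link.symlink_to(target)

link_dependency(Path("."), "abc.csv", "sha256-<hash>")
```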

Qu: why would I ever have external URLs in a Resource?

Ans: because I want to create datasets in my catalog with data hosted elsewhere ... That's fine, but:

  • Do we expect to download these files in the UI?
  • Do we expect to cache that data ...

Intuition: people SHOULD just package everything. In code you would not package an external codebase; you'd pull it locally and wrap it. Even when creating, say, "deb" packages you have the code within them. ...

However, data packaging is at an early stage (so lots of stuff is not packaged) AND data is very large, so it is painful to host oneself ... => the desire to package external data.
