How should `url` vs `path` work in a DMS (Data Management System)? Should we only have `path`?
```yaml
resource:
  path:
  data:
  # conventions
  sample:
```
Observations:
- What do I want to represent:
  - a "local" path (with storage done somewhere by our system)
  - a "remote" file (storage is not managed by our system)
  - "inline" data
`path` and `data` have to be separate because one can't disambiguate a string that is a path from a string that is data.
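A tiny illustration of the ambiguity, using hypothetical descriptor dicts (not an existing Frictionless API):

```python
# The same string reads plausibly either way, so a single field can't carry intent:
source = "data/abc.csv"        # is this a relative file path ...
source = "name,age\nalice,30"  # ... or inline CSV data? Both are just strings.

# Separate keys make the intent explicit:
resource_with_path = {"path": "data/abc.csv"}        # bytes live on disk
resource_with_data = {"data": "name,age\nalice,30"}  # bytes are the value itself
```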
Concepts:

```mermaid
graph LR
  treedb[Tree DB]
  blobkv[Local Blob KV store]
  wc[Working Copy]
```
Problem

Let's say I have a file on disk and want to represent it in Frictionless:

```yaml
path: 'abc.csv'
```

On disk:

```
/abc.csv: bytes ...
```
Then we add the fact that we want to separate storage from the working copy ...

```
# working copy
{
  path: 'abc.csv'
  sha256: '...'
}

# in my local content-addressed blob KV store
sha256: bytes ...
```
Then I can build my working copy from storage. Then I can even move that storage to the cloud ...

```
sha256: url into LFS in the cloud ...
```
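A minimal sketch of that flow, assuming a local directory acts as the content-addressed blob KV store (all names here are illustrative, not an existing tool's API):

```python
import hashlib
import shutil
from pathlib import Path

BLOB_DIR = Path(".blobs")  # hypothetical local content-addressed blob KV store

def store(path: Path) -> str:
    """Copy a file's bytes into the blob store, keyed by their sha256 digest."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    BLOB_DIR.mkdir(exist_ok=True)
    shutil.copyfile(path, BLOB_DIR / digest)
    return digest

def checkout(entry: dict, dest: Path = Path(".")) -> None:
    """Rebuild one working-copy file from the store, given {'path', 'sha256'}."""
    shutil.copyfile(BLOB_DIR / entry["sha256"], dest / entry["path"])

# Separate storage from the working copy, then rebuild the working copy:
digest = store(Path("abc.csv"))
checkout({"path": "abc.csv", "sha256": digest})
```

Moving storage to the cloud then just means mapping each sha256 to an LFS URL instead of a local blob directory.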
If I were starting from scratch for a DMS:

- Datasets have resources
- Resources have:
  - a working copy path (unique within the dataset's resources)
  - data streams:
    - Local: identified by the sha256 hash of the stream
    - Remote: identified by a URL
    - (Inline)

```yaml
resource:
  path: ...
  data: 'https:// ...' | 'oid:sha256' | 'data://'
```
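A sketch of how a client might dispatch on those three forms (the scheme names follow the snippet above; the `oid:sha256:` prefix with a trailing digest is my assumed spelling):

```python
def classify(data: str) -> str:
    """Classify a resource's `data` value by its scheme prefix."""
    if data.startswith(("http://", "https://")):
        return "remote"  # storage not managed by our system; fetch over HTTP
    if data.startswith("oid:sha256:"):
        return "local"   # look the digest up in the content-addressed blob store
    if data.startswith("data://"):
        return "inline"  # the bytes are embedded in the descriptor itself
    raise ValueError(f"unrecognized data scheme: {data!r}")
```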
Or, with only `path`:

```yaml
resource:
  path: local | 'url'
```
In a perfect world ... what should CKAN return on GET resource X?

```
{
  path:
}
```
What does a user want to do with that ...
I want to get the byte stream ... (and know where to put it on disk ...)
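For example (hypothetical endpoint and descriptor shape; this is not the actual CKAN API):

```python
import requests

def fetch_resource(api_base: str, resource_id: str, dest_dir: str = ".") -> None:
    """GET the resource descriptor, then stream its bytes to the declared path."""
    descriptor = requests.get(f"{api_base}/resource/{resource_id}").json()
    with requests.get(descriptor["data"], stream=True) as response:
        response.raise_for_status()
        with open(f"{dest_dir}/{descriptor['path']}", "wb") as f:
            for chunk in response.iter_content(chunk_size=1 << 16):
                f.write(chunk)
```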
Imagine I'm a data scientist or data engineer:
- I want to pull a data project, work on it, and push it back, and this project includes data
- I want to depend on an "external" dataset (package) and have it be part of my project
- But because most people don't do this yet AND data is big/changing (and I don't want to have to store it), I want to "package" external data
Catalogers ...
- I want to catalog data I do store in my system
- I want to catalog (similar to bundle) data I don't store in my system
Job stories

- I want to pull and checkout and have my local copy in the correct state.
- I want to depend on external dependencies and have them locally in a structured way.

That implies:
- A way to collaborate on a data project
  - => a way to version a data project
  - => efficient (centralized/cloud) file storage for large files
- A way to manage data dependencies [follow the go / pnpm approach]
  - A central content-addressed cache symlinked into projects in a flat structure (see the sketch below)
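A sketch of that pnpm-style layout, assuming a flat central cache keyed by digest (all paths and names are illustrative):

```python
import os
from pathlib import Path

CACHE = Path.home() / ".cache" / "data-cas"  # hypothetical flat central cache

def link_into_project(digest: str, target: Path) -> None:
    """Expose a cached blob inside a project via a symlink: the project sees
    an ordinary file tree while the bytes live exactly once in the cache."""
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        target.unlink()
    os.symlink(CACHE / digest, target)

# Many projects can share one stored copy of the same large file:
link_into_project("9f86d081884c7d65...", Path("myproject/data/abc.csv"))
```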
Q: Why would I ever have external URLs in a Resource?
A: Because I want to create datasets in my catalog with data hosted elsewhere ... That's fine, but:
- Do we expect to download these files in the UI?
- Do we expect to cache that data ...?

Intuition: people SHOULD just package everything. In code you would not package an external codebase; you'd pull it locally and wrap it. Even when creating, say, "deb" packages, you have the code within them. ...

However, data packaging is at an early stage (so lots of stuff is not packaged) AND data is very large, so it is painful to host oneself ... => the desire to package external data.