How should `url` vs `path` work in a DMS (Data Management System)? Should we only have `path`?
```yaml
resource:
  path:
  data:
  # conventions
  sample:
```
Observations:
- What do I want to represent:
  - a "local" path (with storage done somewhere by our system)
  - a "remote" file (storage is not managed by our system)
  - "inline" data
`path` and `data` have to be separate because one can't disambiguate a string that is a path from a string that is data.
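A tiny illustration of the ambiguity, using hypothetical descriptor dicts (not an existing Frictionless API):

```python
# The same string reads plausibly either way, so a single field can't carry intent:
source = "data/abc.csv"        # is this a relative file path ...
source = "name,age\nalice,30"  # ... or inline CSV data? Both are just strings.

# Separate keys make the intent explicit:
resource_with_path = {"path": "data/abc.csv"}        # bytes live on disk
resource_with_data = {"data": "name,age\nalice,30"}  # bytes are the value itself
```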
Concepts:

```mermaid
graph LR
  treedb[Tree DB]
  blobkv[Local Blob KV store]
  wc[Working Copy]
```
Problem

Let's say I have a file on disk and want to represent it in Frictionless:

```yaml
path: 'abc.csv'
```

On disk:

```
/abc.csv: bytes ...
```
Then we add the fact that we want to separate storage from the working copy ...

```
# working copy
{
  path: 'abc.csv'
  sha256: '...'
}

# in my local content-addressed blob KV store
sha256: bytes ...
```
Then I can build my working copy from storage. Then I can even move that storage to the cloud ...

```
sha256: url into LFS in the cloud ...
```
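A minimal sketch of that flow, assuming a local directory acts as the content-addressed blob KV store (all names here are illustrative, not an existing tool's API):

```python
import hashlib
import shutil
from pathlib import Path

BLOB_DIR = Path(".blobs")  # hypothetical local content-addressed blob KV store

def store(path: Path) -> str:
    """Copy a file's bytes into the blob store, keyed by their sha256 digest."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    BLOB_DIR.mkdir(exist_ok=True)
    shutil.copyfile(path, BLOB_DIR / digest)
    return digest

def checkout(entry: dict, dest: Path = Path(".")) -> None:
    """Rebuild one working-copy file from the store, given {'path', 'sha256'}."""
    shutil.copyfile(BLOB_DIR / entry["sha256"], dest / entry["path"])

# Separate storage from the working copy, then rebuild the working copy:
digest = store(Path("abc.csv"))
checkout({"path": "abc.csv", "sha256": digest})
```

Moving storage to the cloud then just means mapping each sha256 to an LFS URL instead of a local blob directory.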
If I were starting from scratch for a DMS:

- Datasets have resources
- Resources have:
  - a working copy path (unique within the dataset's resources)
  - data streams:
    - Local: identified by the sha256 hash of the stream
    - Remote: identified by a URL
    - (Inline)

```yaml
resource:
  path: ...
  data: 'https:// ...' | 'oid:sha256' | 'data://'
```
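A sketch of how a client might dispatch on those three forms (the scheme names follow the snippet above; the `oid:sha256:` prefix with a trailing digest is my assumed spelling):

```python
def classify(data: str) -> str:
    """Classify a resource's `data` value by its scheme prefix."""
    if data.startswith(("http://", "https://")):
        return "remote"  # storage not managed by our system; fetch over HTTP
    if data.startswith("oid:sha256:"):
        return "local"   # look the digest up in the content-addressed blob store
    if data.startswith("data://"):
        return "inline"  # the bytes are embedded in the descriptor itself
    raise ValueError(f"unrecognized data scheme: {data!r}")
```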
Or, with only `path`:

```yaml
resource:
  path: local | 'url'
```
In a perfect world ... what should CKAN return on GET resource X?

```
{
  path:
}
```
What does a user want to do with that ...
I want to get the byte stream ... (and know where to put it on disk ...)
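For example (hypothetical endpoint and descriptor shape; this is not the actual CKAN API):

```python
import requests

def fetch_resource(api_base: str, resource_id: str, dest_dir: str = ".") -> None:
    """GET the resource descriptor, then stream its bytes to the declared path."""
    descriptor = requests.get(f"{api_base}/resource/{resource_id}").json()
    with requests.get(descriptor["data"], stream=True) as response:
        response.raise_for_status()
        with open(f"{dest_dir}/{descriptor['path']}", "wb") as f:
            for chunk in response.iter_content(chunk_size=1 << 16):
                f.write(chunk)
```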
Imagine I'm a data scientist or data engineer:
- I want to pull a data project, work on it, and push it back, and this project includes data
- I want to depend on an "external" dataset (package) and have it be part of my project
- But because most people don't do this yet AND data is big/changing (and I don't want to have to store it), I want to "package" external data
Catalogers ...
- I want to catalog data I do store in my system
- I want to catalog (similar to bundle) data I don't store in my system
Job stories

- I want to pull and checkout and have my local copy in the correct state.
- I want to depend on external dependencies and have them locally in a structured way.

That implies:
- A way to collaborate on a data project
  - => a way to version a data project
  - => efficient (centralized/cloud) file storage for large files
- A way to manage data dependencies [follow the go / pnpm approach]
  - A central content-addressed cache symlinked into projects in a flat structure (see the sketch below)
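A sketch of that pnpm-style layout, assuming a flat central cache keyed by digest (all paths and names are illustrative):

```python
import os
from pathlib import Path

CACHE = Path.home() / ".cache" / "data-cas"  # hypothetical flat central cache

def link_into_project(digest: str, target: Path) -> None:
    """Expose a cached blob inside a project via a symlink: the project sees
    an ordinary file tree while the bytes live exactly once in the cache."""
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        target.unlink()
    os.symlink(CACHE / digest, target)

# Many projects can share one stored copy of the same large file:
link_into_project("9f86d081884c7d65...", Path("myproject/data/abc.csv"))
```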
Q: Why would I ever have external URLs in a Resource?
A: Because I want to create datasets in my catalog with data hosted elsewhere ... That's fine, but:
- Do we expect to download these files in the UI?
- Do we expect to cache that data ...?

Intuition: people SHOULD just package everything. In code you would not package an external codebase; you'd pull it locally and wrap it. Even when creating, say, "deb" packages, you have the code within them. ...

However, data packaging is at an early stage (so lots of stuff is not packaged) AND data is very large, so it is painful to host oneself ... => the desire to package external data.