RFE: output_format md5_bucket_files #448

@pbrownrobo

Right now there is "output_format files", and there is also the optional hash calculation feature.

What would be really nice to see is "output_format md5_bucket_files".

E.g., a file has md5 hash a37f352376.....
so it gets saved to a3/a37f35....jpg
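
To make the layout concrete, here is a minimal sketch of how such a path could be derived. The function name `md5_bucket_path` and its parameters are hypothetical and not part of img2dataset; it assumes the bucket is the first two hex characters of the MD5 of the file contents.

```python
import hashlib
from pathlib import PurePosixPath


def md5_bucket_path(data: bytes, ext: str = "jpg", prefix_len: int = 2) -> str:
    """Return '<bucket>/<md5 hex>.<ext>' for the given file contents.

    Hypothetical helper: the bucket is the first `prefix_len` hex
    characters of the MD5 digest, so two characters give 256 buckets.
    """
    digest = hashlib.md5(data).hexdigest()
    return str(PurePosixPath(digest[:prefix_len]) / f"{digest}.{ext}")


if __name__ == "__main__":
    # MD5 of b"hello" is 5d41402abc4b2a76b9719d911017c592
    print(md5_bucket_path(b"hello"))  # 5d/5d41402abc4b2a76b9719d911017c592.jpg
```

With a two-character prefix, a million images spread over 256 directories come out at roughly 4,000 files each.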

I personally have a post-download conversion routine to organize the downloaded files like this, but I just realized... what if img2dataset did it out of the box?
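
For reference, a post-download routine of this kind might look roughly like the sketch below. It is not my actual script and not img2dataset code; `rebucket` is a hypothetical name, and it assumes that captions and metadata sit next to each image as same-stem `.txt` / `.json` files.

```python
import hashlib
import shutil
from pathlib import Path


def rebucket(src_dir, dst_dir, image_exts=(".jpg", ".jpeg", ".png", ".webp")):
    """Move images (and same-stem sidecar files) into md5 buckets.

    Hypothetical sketch: each image ends up at
    <dst_dir>/<first 2 hex chars of md5>/<md5 hex><original extension>.
    Returns a dict mapping old image paths to new image paths.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    moved = {}
    # Materialize the listing first so files moved during the loop
    # are not visited again.
    for image in sorted(src.rglob("*")):
        if not image.is_file() or image.suffix.lower() not in image_exts:
            continue
        digest = hashlib.md5(image.read_bytes()).hexdigest()
        bucket = dst / digest[:2]
        target = bucket / f"{digest}{image.suffix.lower()}"
        if target.exists():
            # Same content was already moved; leave the duplicate in place.
            continue
        bucket.mkdir(parents=True, exist_ok=True)
        # Assumed sidecar layout: caption/metadata share the image's stem.
        for sidecar_ext in (".txt", ".json"):
            sidecar = image.with_suffix(sidecar_ext)
            if sidecar.exists():
                shutil.move(str(sidecar), str(bucket / f"{digest}{sidecar_ext}"))
        shutil.move(str(image), str(target))
        moved[str(image)] = str(target)
    return moved
```

Having the downloader write this layout directly would avoid hashing and moving every file a second time.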

There are multiple advantages to this new feature:

  1. the "100,000 files in a single directory" problem mostly goes away
  2. collaboration with other people on a dataset becomes a lot easier. For example, if you are both working from the same raw image dataset, one of you can run captioning on it, zip up the directory with just the .txt files, and send it over. The other person can then extract it, and the captions will automatically match up with the right images, because each path and filename is derived from the image's hash.
