RFC: Structured metadata
I have a proposal for image metadata, laying out goals and a formal spec. I'm happy to implement this if there's buy-in. Thoughts?
Currently, when generating images from the CLI (but not the web UI), metadata is stored as a string roughly corresponding to the prompt. That metadata is enough to reproduce the original image... sometimes.
I'd like to:
- be more precise about the metadata which gets stored
- allow reproducing any output just from the metadata and the necessary input files
  - "necessary input files" meaning the model weights, the image for img2img, and the embeddings if using embeddings
  - the metadata should allow you to confirm that you have the right inputs, by storing hashes of all of those files
  - "any" output includes outputs from seed fuzzing and interpolations (which I haven't written yet, in part because I wanted to work out the metadata format first)
- store it in a structured format, namely JSON
- expand the metadata so it works with grids
- expand the metadata so it works with stuff like variations and interpolations
To that end, I'd like to propose the following spec for metadata.
In this doc, "hash" means "the first 8 characters of the hex-encoded SHA-256".
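To make the hash definition concrete, it could be computed like this (a sketch; the helper name is mine):

```python
import hashlib

def short_hash(data: bytes) -> str:
    """First 8 characters of the hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()[:8]

# e.g. short_hash(open("model.ckpt", "rb").read()) for a weights file
```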
Data location
Metadata is a JSON string following the "top-level data" schema, stored in an uncompressed PNG tEXt or iTXt chunk named "sd-metadata". (This corresponds to what PIL does already when adding text data: it will choose tEXt or iTXt depending on whether the text contains non-Latin-1 characters. I just figure it's worth writing this down.)
Top-level data
The top-level metadata should have the following fields:
- `model`: "stable diffusion"
- `model_id`: a string identifying the model. Must be the `model_id` field of a Model card. Optional; there is no default value, but consuming applications may infer a value from `model_hash` if they recognize that value.
- `model_url`: a string giving a URL where the model can be downloaded (if public) or read about (if not). Optional, does not have a default.
- `model_hash`: hash of the weights [precise format TBD depending on implementation feasibility]; see the "model information" section below.
- `app_id`: a string identifying the application consuming the model. It is recommended, but not required, that applications hosted on GitHub use the username/repo_name of the repository in this field; for example, the fork we're on would use lstein/stable-diffusion.
- `app_version`: a string giving the version of the app from `app_id`. It is recommended, but not required, that projects with numbered versions use a string of the form `v1.0`, and that projects built from git repos use the short-form git hash of the commit. Optional, defaults to "unknown".
- `app_url`: a string giving the canonical location of the application on the web. Optional, does not have a default.
- `embeddings_hashes`: an array of the hashes of any textual-inversion embeddings in use. Optional, defaults to an empty array.
- `arch`: "cuda", "MPS", or another helpful value indicating the GPU architecture. Optional, defaults to "unknown".
- `grid`: a boolean, whether this was a grid. Optional, defaults to `false`.
- `metadata_version`: the string "1.0". Optional, defaults to "1.0". Breaking changes to this metadata format should update this field.

and then also one of the following two fields, depending on whether this is a `grid`:
- `image`: an object in one of the formats specified below
- `images`: an array of such objects
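Putting the fields above together, a minimal top-level object might look like this (all values are illustrative; the hash and image payload are placeholders, not real data):

```python
import json

# Illustrative only: the hash is made up, and the image object is abbreviated.
top_level = {
    "model": "stable diffusion",
    "model_hash": "1a2b3c4d",
    "app_id": "lstein/stable-diffusion",
    "app_version": "unknown",
    "embeddings_hashes": [],
    "arch": "cuda",
    "grid": False,
    "metadata_version": "1.0",
    "image": {"type": "txt2img"},  # see "Image data" below
}

print(json.dumps(top_level, indent=2))
```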
Image data
Every image has the following fields:
- `type`: either "txt2img" or "img2img"
- `postprocessing`: either `null`, indicating no postprocessing was done, or an arbitrary object representing the postprocessing performed. Spec for this will depend on individual postprocessors, but I'll write something up for the ones we support. Optional, defaults to `null`.
- `sampler`: one of these samplers
- `prompt`: a nonempty array of `{ prompt: string, weight: number }` pairs. The single-prompt case is `[{ prompt: prompt, weight: 1 }]`.
- `seed`: a seed
- `variations`: an array of `{ seed: number, weight: number }` pairs used to generate variations. Optional, defaults to an empty array.
- `steps`: the number of steps configured to be taken
- `cfg_scale`: the unconditional guidance scale
- `step_number`: the number of steps actually taken. Normally this will be the full number of steps, but for intermediate images it may be less. Optional, defaults to `steps` (or `strength_steps` in the case of img2img).
- `width`: the specified width (as a number of pixels). Optional only when this metadata is embedded in an image whose width is the same as this value would be, in which case it defaults to that image's width.
- `height`: the specified height (as a number of pixels). Optional only when this metadata is embedded in an image whose height is the same as this value would be, in which case it defaults to that image's height.
- `extra`: an object containing any necessary additional information to generate this image. Not to be used for other data, like contact information. Optional, defaults to the empty object.
Images of type `img2img` also have the following fields:
- `orig_hash`: hash of the input image
- `strength_steps`: the configured strength for running `img2img` (as an integer; as discussed here, that's what it actually is).
The input image's height/width are not stored, since you can infer those from the file.
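A consumer applying the defaults described above could be sketched like this (the helper name and structure are mine, not part of the spec):

```python
def apply_image_defaults(image: dict, embedded_width: int, embedded_height: int) -> dict:
    """Return a copy of an image-metadata object with optional fields filled in.

    embedded_width/embedded_height are the dimensions of the PNG the metadata
    is embedded in, used as defaults for the width/height fields.
    """
    out = dict(image)
    out.setdefault("postprocessing", None)
    out.setdefault("variations", [])
    out.setdefault("extra", {})
    # step_number defaults to steps (or strength_steps for img2img).
    if "step_number" not in out:
        key = "strength_steps" if out.get("type") == "img2img" else "steps"
        out["step_number"] = out.get(key)
    # width/height may be omitted when they match the embedding image.
    out.setdefault("width", embedded_width)
    out.setdefault("height", embedded_height)
    return out
```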
Thoughts on storing the model information
I am proposing to store a hash of the loaded model, which is a lot faster than reading the file from disk a second time, but the hash may not correspond exactly to the file on disk. Better than nothing, though.
Is it worth also storing a hash of the model config? I don't think so, since you're always going to need the original config for a given model weights file.