RFC: Structured metadata
I have a proposal for image metadata, laying out goals and a formal spec. I'm happy to implement this if there's buy-in. Thoughts?
Currently, when generating images from the CLI (but not the web UI), metadata is stored as a string roughly corresponding to the prompt. That metadata is enough to reproduce the original image... sometimes.
I'd like to:
- be more precise about the metadata which gets stored
- allow reproducing any output just from the metadata and the necessary input files
  - "necessary input files" meaning the model weights, the image for img2img, and the embeddings if using embeddings
  - the metadata should allow you to confirm that you have the right inputs, by storing hashes of all of those files
  - "any" output includes outputs from seed fuzzing and interpolations (which I haven't written yet, in part because I wanted to work out the metadata format first)
- store it in a structured format, namely JSON
- expand the metadata so it works with grids
- expand the metadata so it works with stuff like variations and interpolations
To that end, I'd like to propose the following spec for metadata.
In this doc, "hash" means "the first 8 characters of the hex-encoded SHA-256".
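To make the hash definition concrete, it could be computed like this (a sketch; the helper name is mine):

```python
import hashlib

def short_hash(data: bytes) -> str:
    """First 8 characters of the hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()[:8]

# e.g. short_hash(open("model.ckpt", "rb").read()) for a weights file
```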
Data location
Metadata is a JSON string following the "top-level data" schema, stored in an uncompressed PNG tEXt or iTXt chunk named "sd-metadata". (This corresponds to what PIL does already when adding text data: it will choose tEXt or iTXt depending on whether the text contains non-Latin-1 characters. I just figure it's worth writing this down.)
Top-level data
The top-level metadata should have the following fields:
- `model`: "stable diffusion"
- `model_id`: a string identifying the model. Must be the `model_id` field of a Model card. Optional; there is no default value, but consuming applications may infer a value from `model_hash` if they recognize that value.
- `model_url`: a string giving a URL where the model can be downloaded (if public) or read about (if not). Optional, does not have a default.
- `model_hash`: hash of the weights [precise format TBD depending on implementation feasibility]; see the "model information" section below.
- `app_id`: a string identifying the application consuming the model. It is recommended, but not required, that applications hosted on GitHub use the username/repo_name of the repository in this field; for example, the fork we're on would use lstein/stable-diffusion.
- `app_version`: a string giving the version of the app from `app_id`. It is recommended, but not required, that projects with numbered versions use a string of the form `v1.0`, and that projects built from git repos use the short-form git hash of the commit. Optional, defaults to "unknown".
- `app_url`: a string giving the canonical location of the application on the web. Optional, does not have a default.
- `embeddings_hashes`: an array of the hashes of any textual-inversion embeddings in use. Optional, defaults to an empty array.
- `arch`: "cuda", "MPS", or another helpful value indicating the GPU architecture. Optional, defaults to "unknown".
- `grid`: a boolean, whether this was a grid. Optional, defaults to `false`.
- `metadata_version`: the string "1.0". Optional, defaults to "1.0". Breaking changes to this metadata format should update this field.

and then also one of the following two fields, depending on whether this is a `grid`:
- `image`: an object in one of the formats specified below
- `images`: an array of such objects
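Putting the fields above together, a minimal top-level object might look like this (all values are illustrative; the hash and image payload are placeholders, not real data):

```python
import json

# Illustrative only: the hash is made up, and the image object is abbreviated.
top_level = {
    "model": "stable diffusion",
    "model_hash": "1a2b3c4d",
    "app_id": "lstein/stable-diffusion",
    "app_version": "unknown",
    "embeddings_hashes": [],
    "arch": "cuda",
    "grid": False,
    "metadata_version": "1.0",
    "image": {"type": "txt2img"},  # see "Image data" below
}

print(json.dumps(top_level, indent=2))
```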
Image data
Every image has the following fields:
- `type`: either "txt2img" or "img2img"
- `postprocessing`: either `null`, indicating no postprocessing was done, or an arbitrary object representing the postprocessing performed. Spec for this will depend on individual postprocessors, but I'll write something up for the ones we support. Optional, defaults to `null`.
- `sampler`: one of these samplers
- `prompt`: a nonempty array of `{ prompt: string, weight: number }` pairs. The single-prompt case is `[{ prompt: prompt, weight: 1 }]`.
- `seed`: a seed
- `variations`: an array of `{ seed: number, weight: number }` pairs used to generate variations. Optional, defaults to an empty array.
- `steps`: the number of steps configured to be taken
- `cfg_scale`: the unconditional guidance scale
- `step_number`: the number of steps actually taken. Normally this will be the full number of steps, but for intermediate images it may be less. Optional, defaults to `steps` (or `strength_steps` in the case of img2img).
- `width`: the specified width (as a number of pixels). Optional only when this metadata is embedded in an image whose width is the same as this value would be, in which case it defaults to that image's width.
- `height`: the specified height (as a number of pixels). Optional only when this metadata is embedded in an image whose height is the same as this value would be, in which case it defaults to that image's height.
- `extra`: an object containing any necessary additional information to generate this image. Not to be used for other data, like contact information. Optional, defaults to the empty object.
Images of type `img2img` also have the following fields:
- `orig_hash`: hash of the input image
- `strength_steps`: the configured strength for running `img2img` (as an integer; as discussed here, that's what it actually is).
The input image's height/width are not stored, since you can infer those from the file.
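A consumer applying the defaults described above could be sketched like this (the helper name and structure are mine, not part of the spec):

```python
def apply_image_defaults(image: dict, embedded_width: int, embedded_height: int) -> dict:
    """Return a copy of an image-metadata object with optional fields filled in.

    embedded_width/embedded_height are the dimensions of the PNG the metadata
    is embedded in, used as defaults for the width/height fields.
    """
    out = dict(image)
    out.setdefault("postprocessing", None)
    out.setdefault("variations", [])
    out.setdefault("extra", {})
    # step_number defaults to steps (or strength_steps for img2img).
    if "step_number" not in out:
        key = "strength_steps" if out.get("type") == "img2img" else "steps"
        out["step_number"] = out.get(key)
    # width/height may be omitted when they match the embedding image.
    out.setdefault("width", embedded_width)
    out.setdefault("height", embedded_height)
    return out
```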
Thoughts on storing the model information
I am proposing to store a hash of the loaded model, which is a lot faster than reading the file from disk a second time, but the hash may not correspond exactly to the file on disk. Better than nothing, though.
Is it worth also storing a hash of the model config? I don't think so, since you're always going to need the original config for a given model weights file.