Description
In TensorFlow, tf.summary is the primary way people export metrics that track the performance of their models during training. This data can be visualized using TensorBoard (see here).
Many of the key signals, e.g. accuracy metrics, are just time series.
Should we export summaries in Prometheus format so that they can be collected and visualized using tools in the Prometheus toolchain?
Potential Use Cases
- For hyperparameter tuning (Katib), we'd like a generic way for the tuning infrastructure to get model metrics; Prometheus could provide a standard interface that abstracts away the details of how metrics are reported in a particular framework (PyTorch vs. TF).
- We could use this data to support features like "Run until converged".
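To make the "Run until converged" idea concrete, here is a minimal sketch of a check over a metric time series. The function name `has_converged`, the window size, and the tolerance are all hypothetical; the issue does not define a convergence criterion, so this is just one plausible choice (the metric has moved by less than a tolerance over the last few points).

```python
def has_converged(values, window=5, tolerance=1e-3):
    """Return True if the metric varied by less than `tolerance`
    across the most recent `window` + 1 observations.

    `values` is a list of metric samples ordered by training step.
    The criterion is an illustrative assumption, not part of the proposal.
    """
    if len(values) < window + 1:
        # Not enough history yet to judge convergence.
        return False
    recent = values[-(window + 1):]
    return max(recent) - min(recent) < tolerance
```

A controller could poll the exported metric, append each sample to a list, and stop the job once `has_converged` returns True.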
Implementation
I think the implementation would be pretty straightforward: we would just need a Python server that reads TF event files and exports the metrics to Prometheus.
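As a rough sketch of what such a server might look like, the following uses only the Python standard library to serve the latest scalar values in the Prometheus text exposition format. Everything named here is an assumption for illustration: the `tf_summary_` prefix, the `_step` companion metric, the helper names, and the `read_latest_scalars` callback, which is expected to return a `{tag: (step, value)}` dict. Reading the event files themselves is left to that callback; in practice it could probably be built on TensorBoard's `EventAccumulator`, and the `prometheus_client` package could replace the hand-written formatting.

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer


def sanitize_name(tag):
    """Map a summary tag such as 'accuracy/train' to a valid Prometheus metric name."""
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", tag)
    if re.match(r"[0-9]", name):
        # Metric names may not start with a digit.
        name = "_" + name
    return name


def render_metrics(latest_scalars, prefix="tf_summary_"):
    """Render {tag: (step, value)} as Prometheus text exposition format.

    Each tag becomes one gauge for the value and one gauge (suffix `_step`)
    for the training step at which that value was recorded.
    """
    lines = []
    for tag in sorted(latest_scalars):
        step, value = latest_scalars[tag]
        name = sanitize_name(prefix + tag)
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
        lines.append(f"# TYPE {name}_step gauge")
        lines.append(f"{name}_step {int(step)}")
    return "\n".join(lines) + "\n"


def make_handler(read_latest_scalars):
    """Build a request handler that serves /metrics from the given callback."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render_metrics(read_latest_scalars()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            # Keep the sketch quiet; a real server would log scrapes.
            pass

    return MetricsHandler


def serve(read_latest_scalars, port=8000):
    """Block forever, serving metrics for Prometheus to scrape."""
    HTTPServer(("", port), make_handler(read_latest_scalars)).serve_forever()
```

Calling `serve(my_reader)` with a reader that re-scans the training job's log directory on each request would let Prometheus scrape `http://<host>:8000/metrics`. Exposing only the latest value per tag fits Prometheus's pull model, where the scraper builds the time series itself.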