log learning rate #937
Conversation
tianyu-l left a comment:
Sounds good to me.
Could you please include a screenshot of the results? Either TB or WandB is fine.
tianyu-l left a comment:
It seems similar things (with a bigger change) are being done in #938. How about we collaborate over there?
Here's a screenshot on WandB (before the log's name change).
Currently that PR does not include changes to the Exp Tracker, only to the logger, so I think it's orthogonal to this PR.

Sorry, what is "Exp Tracker"?

Ah yes, sorry, I was confused and missed that part. Those are the same thing. I will close this PR in favor of that one.
This PR adds learning rate logging. There was a previous attempt to implement this in an [earlier PR](#937), but that one was ultimately **closed**. This version ensures that LR logging works properly; I verified it using the WSD scheduler that was recently added in [another PR](#938).

<img width="1842" height="730" alt="image" src="https://github.com/user-attachments/assets/8f23674a-d689-4cc2-9d9b-30bff4e63f3b" />

One design consideration is that torchtitan supports multiple optimizers and learning rate schedulers, each potentially with its own LR. In practice, however, nearly all use cases will use a single LR. Given that, the logging works as follows (a sketch of this logic follows below):

- If there is only one learning rate, it is logged directly under the main charts as `lr`.
- If there are multiple learning rates, they are logged under a separate section, each with its corresponding label.

Alternatively, we could have ignored the multi-LR case and always logged a single LR, but I prefer this approach since it handles both scenarios robustly with minimal extra code. Happy to adjust if others have a strong preference for simplicity over robustness.
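For illustration, here is a minimal sketch of that logging scheme, assuming a list of standard PyTorch LR schedulers. The function name `collect_lr_metrics` and the metric labels are hypothetical (not the actual torchtitan code); the only library call relied on is the schedulers' standard `get_last_lr()`.

```python
from typing import Dict, List

import torch


def collect_lr_metrics(
    schedulers: List[torch.optim.lr_scheduler.LRScheduler],
) -> Dict[str, float]:
    """Hypothetical sketch of the LR-logging scheme described above."""
    # get_last_lr() yields one LR per parameter group of the wrapped optimizer.
    lrs = [lr for sched in schedulers for lr in sched.get_last_lr()]
    if len(lrs) == 1:
        # Single LR: log it directly under the main charts as "lr".
        return {"lr": lrs[0]}
    # Multiple LRs: log each under a separate section with its own label.
    return {f"lr/group_{i}": lr for i, lr in enumerate(lrs)}
```

The returned dict would then be passed to whichever metric logger (TensorBoard or WandB) is configured, alongside the other training metrics.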

Simply log all the learning rates for all parameter groups of all schedulers.