Model Training Domain Knowledge

ML Knowledge

  • Understand the meaning and implications of common training configurations: batch size, sequence length, learning rate, weight decay, gradient clipping by global norm, loss scale, etc. (see the first sketch after this list)
  • Become familiar with the common shapes of a decreasing loss curve, and learn to spot abnormal patterns
  • Understand the differences between optimizers: SGD, Adam, and LAMB (a toy comparison also follows this list)
  • Advanced: Understanding Backpropagation https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
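
A minimal PyTorch sketch of where each of these knobs appears in a single training step. The shapes and values are illustrative only, not a recommended recipe, and the fp16 loss-scaling part assumes a CUDA device:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Hypothetical transformer-ish shapes: batch size 32, sequence length 128,
# hidden size 768.
model = torch.nn.Linear(768, 768).cuda()
batch = torch.randn(32, 128, 768).cuda()
target = torch.randn(32, 128, 768).cuda()

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5,            # learning rate
    weight_decay=0.01,  # decoupled weight decay (AdamW)
)
scaler = GradScaler()   # dynamic loss scale for mixed-precision training

optimizer.zero_grad()
with autocast():        # run the forward pass in fp16 where safe
    loss = torch.nn.functional.mse_loss(model(batch), target)
scaler.scale(loss).backward()  # gradients are computed on the scaled loss
scaler.unscale_(optimizer)     # undo the scale before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # global-norm clipping
scaler.step(optimizer)
scaler.update()         # adjust the loss scale if overflow was detected
```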
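
For intuition about the optimizer differences, here is a toy numpy sketch of the update rules for a single parameter tensor. LAMB is, roughly, Adam plus a layer-wise trust ratio (the norm of the weights over the norm of the Adam update) that rescales each layer's step, which is what makes very large batch sizes trainable; it is omitted below for brevity:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # SGD: the step is directly proportional to the raw gradient.
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: running averages of the gradient (m) and squared gradient (v)
    # give each parameter its own effective step size.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)  # bias correction for early steps (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```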

Know-hows

  • Become familiar with running and monitoring AML (Azure Machine Learning) experiments
  • Become familiar with setting up TensorBoard (see the sketch after this list)
  • Action: submit a distributed training job to an AML cluster and get familiar with its user interface, logging, and available metrics (a minimal submission sketch also follows)
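
A minimal TensorBoard setup using PyTorch's built-in writer; the log directory and tag names below are arbitrary placeholders:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="./runs/exp1")  # hypothetical log directory
for step in range(100):
    loss = 1.0 / (step + 1)                    # placeholder for the real training loss
    writer.add_scalar("train/loss", loss, step)
writer.close()
# View the curves with:  tensorboard --logdir ./runs
```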
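
And a sketch of submitting a distributed job, assuming the classic azureml-core (SDK v1) Python API; the script name, compute target, and process counts are placeholders to adapt to your workspace:

```python
from azureml.core import Workspace, Experiment, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

ws = Workspace.from_config()  # reads a config.json downloaded from the AML portal

# Hypothetical layout: 2 nodes, 4 GPUs (processes) per node.
distributed = MpiConfiguration(process_count_per_node=4, node_count=2)

src = ScriptRunConfig(
    source_directory=".",
    script="train.py",             # hypothetical entry script
    compute_target="gpu-cluster",  # hypothetical AML compute name
    distributed_job_config=distributed,
)
run = Experiment(ws, "distributed-training").submit(src)
run.wait_for_completion(show_output=True)
```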

Convergence Investigation

  • Remove all randomness from the program (a combined sketch of these steps appears after this list)

    • Set Seeds
    • Set Dropout Ratio to 0
    • Set use_deterministic_compute=True
    • Disable dataloader shuffling
  • Shrink the repro to the smallest configuration that still reproduces the issue

    • Use a 1-layer model
    • Use a smaller hidden_size
    • Use a single GPU
    • ...
  • Common Tricks

    • Set the learning rate to 0 so the weights never change; any remaining run-to-run difference must then come from the pipeline (data, kernels) rather than from model updates
  • Advanced: how do you tune hyper-parameters to make the model converge better?
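
A combined sketch of the determinism and minimal-repro steps above, in PyTorch. Note that the flag named in the list, use_deterministic_compute, belongs to the training framework in use (it appears, for example, in ONNX Runtime Training's options); the PyTorch equivalents are shown here:

```python
import random
import numpy as np
import torch

def make_deterministic(seed=42):
    # Pin every RNG the training stack touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True  # deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False

make_deterministic()

# Minimal repro model: 1 layer, small hidden size, dropout disabled (p=0).
# Run on a single GPU (e.g. CUDA_VISIBLE_DEVICES=0) to rule out distributed effects.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.Dropout(p=0.0),
)

# lr=0 freezes the weights, isolating pipeline randomness from model updates.
optimizer = torch.optim.SGD(model.parameters(), lr=0.0)

# shuffle=False removes dataloader ordering randomness.
dataset = torch.utils.data.TensorDataset(torch.randn(128, 64))
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=False)
```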

Action: Train a model end-to-end (E2E) to get hands-on experience; a tiny self-contained loop follows.
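
A tiny, self-contained example of one full loop (hypothetical data and sizes), just to watch a loss curve fall:

```python
import torch

# data -> model -> loss -> backward -> step, with the loss printed each epoch.
torch.manual_seed(0)
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)  # noisy linear target

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for epoch in range(20):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), y)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch:2d}  loss {loss.item():.4f}")
```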