Understanding Knowledge Distillation in Neural Sequence Generation

Published on 17 Jan 2020, 18:47
Sequence-level knowledge distillation (KD), in which a student model is trained on targets decoded from a pre-trained teacher model, has been widely used in sequence generation applications such as model compression, non-autoregressive translation (NAT), and low-resource translation. However, the reasons behind its success have so far remained unclear. In this talk, we examine KD in two scenarios: (1) learning a weak student from a strong teacher model, using the same parallel data that was used to train the teacher; (2) learning a student from a teacher model of equal size, with targets generated from additional monolingual data.
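
As a rough illustration of how the training data differs between the two scenarios, here is a minimal Python sketch. It is not taken from the talk: the names `build_distilled_corpus`, `scenario_one`, `scenario_two`, and the dictionary-based `toy_teacher_decode` are hypothetical stand-ins, and a real teacher would be a trained model decoding with something like beam search.

```python
from typing import Callable, List, Tuple

Sentence = List[str]
Decoder = Callable[[Sentence], Sentence]
Corpus = List[Tuple[Sentence, Sentence]]

# Hypothetical toy lexicon standing in for a trained teacher model.
TOY_LEXICON = {"guten": "good", "morgen": "morning", "danke": "thanks"}


def toy_teacher_decode(src: Sentence) -> Sentence:
    """Assumed stand-in for teacher decoding: word-by-word lookup."""
    return [TOY_LEXICON.get(tok, "<unk>") for tok in src]


def build_distilled_corpus(sources: List[Sentence], teacher_decode: Decoder) -> Corpus:
    """Pair each source sentence with the teacher's decoded output.

    In sequence-level KD the student is trained on these pairs,
    not on human-written references.
    """
    return [(src, teacher_decode(src)) for src in sources]


def scenario_one(parallel_data: Corpus, teacher_decode: Decoder) -> Corpus:
    """Scenario (1): reuse the teacher's own training sources.

    The original references are dropped and replaced by teacher outputs.
    """
    sources = [src for src, _reference in parallel_data]
    return build_distilled_corpus(sources, teacher_decode)


def scenario_two(monolingual_sources: List[Sentence], teacher_decode: Decoder) -> Corpus:
    """Scenario (2): label additional source-side monolingual data."""
    return build_distilled_corpus(monolingual_sources, teacher_decode)


if __name__ == "__main__":
    parallel = [(["guten", "morgen"], ["good", "morning", "!"])]
    extra_monolingual = [["danke"]]
    print(scenario_one(parallel, toy_teacher_decode))
    print(scenario_two(extra_monolingual, toy_teacher_decode))
```

In both cases the student never sees a human reference for the distilled pairs; the only difference in this sketch is where the source sentences come from.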

Talk slides: microsoft.com/en-us/research/uploads/pro...

See more on this and other talks at Microsoft Research: microsoft.com/en-us/research/video/under...