2025.findings-emnlp.801@ACL

Total: 1

#1 Evaluating distillation methods for data-efficient syntax learning

Authors: Takateru Yamakoshi, Thomas L. Griffiths, R. Thomas McCoy, Robert D. Hawkins

Data-efficient training requires strong inductive biases. To the extent that transformer attention matrices encode syntactic relationships, we would predict that knowledge distillation (KD) targeting attention should selectively accelerate syntax acquisition relative to conventional logit-based KD. To test this hypothesis, we train GPT-2 student models on datasets ranging from 10K to 5M sentences using both distillation methods, evaluating them on both syntactic benchmarks and perplexity. Surprisingly, while logit-based KD dramatically improves data efficiency, attention-based KD provides minimal benefit even for syntactic tasks. This suggests that output distributions provide a sufficient supervisory signal for syntax acquisition, indicating that syntactic knowledge may be distributed throughout the network rather than localized in attention patterns.
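
For readers unfamiliar with the two objectives being compared, the sketch below illustrates logit-based KD (matching the teacher's softened output distribution via KL divergence) versus attention-based KD (matching teacher attention matrices, here with a layer-wise MSE). It uses PyTorch and Hugging Face Transformers; the temperature, loss formulation, and layer-matching scheme are illustrative assumptions, not the authors' exact setup.

```python
# Hypothetical sketch of the two KD objectives; not the paper's exact configuration.
import torch
import torch.nn.functional as F
from transformers import GPT2Config, GPT2LMHeadModel

teacher = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # pretrained teacher
student = GPT2LMHeadModel(GPT2Config())                   # randomly initialized student


def logit_kd_loss(input_ids, temperature=2.0):
    """Logit-based KD: KL divergence between softened student and teacher
    next-token distributions (temperature is an assumed hyperparameter)."""
    with torch.no_grad():
        t_logits = teacher(input_ids).logits
    s_logits = student(input_ids).logits
    return F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2


def attention_kd_loss(input_ids):
    """Attention-based KD: mean-squared error between student and teacher
    attention matrices, averaged over layers (a simple matching scheme;
    other work uses KL over attention rows or matches selected layers)."""
    with torch.no_grad():
        t_attn = teacher(input_ids, output_attentions=True).attentions
    s_attn = student(input_ids, output_attentions=True).attentions
    return sum(F.mse_loss(s, t) for s, t in zip(s_attn, t_attn)) / len(s_attn)
```

In a training loop, one of the two losses (or a weighted mix with the standard language-modeling loss) would be backpropagated through the student only; the teacher stays frozen.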

Subject: EMNLP.2025 - Findings