Total: 1
Large language models (LLMs) have achievedremarkable success across various natural lan-guage processing tasks. However, most LLMmodels use traditional tokenizers like BPE andSentencePiece, which fail to capture the finernuances of a morphologically rich languagelike Bangla (Bengali). In this work, we introduce BanglaByT5, the first byte-level encoder-decoder model explicitly tailored for Bangla.Built upon a small variant of Google’s ByT5architecture, BanglaByT5 is pre-trained on a14GB curated corpus combining high-qualityliterary and newspaper articles. Through zero-shot and supervised evaluations across gen-erative and classification tasks, BanglaByT5demonstrates competitive performance, surpassing several multilingual and larger models.Our findings highlight BanglaByT5’s potentialas a lightweight yet powerful tool for BanglaNLP, particularly in resource-constrained orscalable environments. BanglaByT5 is pub-licly available for download from https://huggingface.co/Vacaspati/BanglaByT5.