Large language models (LLMs) are widely used in various domains for their ability to perform tasks that require human-like skills. However, LLM inference is expensive today. Furthermore, optimizing LLM inference is challenging, as its performance depends on many configuration options, such as the model parallelization strategy, the batching algorithm, the scheduling policy, and the maximum batch size allowed. Identifying the optimal configuration for a large-scale cluster by experimentally running hundreds of configuration combinations is impractical due to the exorbitant time and monetary cost involved. To tackle this challenge, we present VIDUR and VIDUR-BENCH, the first large-scale, high-fidelity, collaborative, and easily extensible simulation framework for LLM inference alongside a benchmark suite. VIDUR carefully models the performance of various operators involved in LLM inference using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end model inference performance for different workloads by estimating several key performance metrics such as latency, throughput, and time-to-first-byte. We experimentally validate our simulator on several LLMs and show that it can estimate metrics such as inference latency and throughput with less than 5% error. VIDUR also helps answer large-scale, deployment-related what-if questions such as: what is the best tensor-parallel dimension to maximize the serving throughput of the LlaMa-7B model across 32 A100 GPUs? We will open-source the simulator code, along with the workload benchmark suite, so that researchers and practitioners can collaboratively explore model and systems optimizations for efficient deployment of LLMs.