LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings

Authors: Yi Zhao, Zhen Yang, Mengpan Chen, Mingde Xu, Shanghui Gong, Xijun Liu, Jibing Gong, Jie Tang

Recent vision-language models (VLMs) have shown promising progress in generating webpages from visual inputs, yet existing evaluations mainly focus on short, single-screen, and largely static webpages. We introduce LongWebBench, a benchmark for evaluating long-horizon webpage generation from both structural and functional perspectives. LongWebBench contains 490 real-world long webpages for structural fidelity evaluation and 507 goal-oriented interaction tasks over 129 webpages for functional evaluation. It employs two complementary protocols: a multi-dimensional VLM-based metric for assessing long-range structural coherence, and a DOM-augmented agent-based pipeline for end-to-end functional verification. We further examine the automatic evaluation protocols through human agreement analysis. Experiments with state-of-the-art open-source and proprietary VLMs under single-image and multi-image settings reveal that structural fidelity degrades as webpage length increases, while visually plausible generations often fail to support executable multi-step interactions. These results highlight the need to evaluate long webpage generation beyond visual similarity, with executable interaction as a core criterion. Our code and data are available at https://github.com/zheny2751-dotcom/LongWebBench.

Subject: Artificial Intelligence

Published: 2026-06-16 09:43:12 UTC

2606.17727
