An Index-based Approach for Efficient and Effective Web Content Extraction


Authors: Yihan Chen, Benfeng Xu, Xiaorui Wang, Zhendong Mao

As web agents (e.g., Deep Research) routinely consume massive volumes of web pages to gather and analyze information, LLM context management, under large token budgets and low signal density, emerges as a foundational, high-importance, and technically challenging problem for agentic and RAG pipelines. Existing solutions for extracting relevant content are inadequate: generative extraction models suffer from high latency, rule-based heuristics lack adaptability, and chunk-and-rerank methods are blind to webpage structure. To overcome these issues, we introduce Index-based Web Content Extraction, which reframes extraction from slow, token-by-token generation into a highly efficient, discriminative task of index prediction, achieving both effectiveness and efficiency. We partition HTML into structure-aware, addressable segments and extract only the positional indices of content relevant to a given query. This method decouples extraction latency from content length, enabling rapid, query-relevant extraction. We first evaluate our method as a post-retrieval processing component within a RAG QA system and find that it improves QA accuracy. We then directly measure its match rate with the target content in two scenarios: main content extraction (ME) and query-relevant extraction (QE). Experimental results show that our method outperforms existing work in both accuracy and speed, effectively bridging the gap between LLMs and the vast number of web pages.
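The index-based idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `partition_html`, `predict_indices`, and `extract` are hypothetical, the choice of block-level tags is an assumption, and the keyword-overlap scorer is only a placeholder for the trained index-prediction model. It shows the overall shape of the pipeline: split HTML into addressable segments, predict which indices are relevant, and copy those segments out verbatim instead of generating their text.

```python
import re
from html.parser import HTMLParser

# Assumption: these block-level tags define segment boundaries.
BLOCK_TAGS = {"p", "li", "h1", "h2", "h3", "td", "pre"}


class SegmentParser(HTMLParser):
    """Collect the text of each block-level element as one segment."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buf = []
        self._open = 0  # number of currently open block-level tags

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
            self._open += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
            self._open = max(0, self._open - 1)

    def handle_data(self, data):
        if self._open:
            self._buf.append(data)

    def flush(self):
        # Normalize whitespace and store non-empty text as a new segment.
        text = " ".join("".join(self._buf).split())
        if text:
            self.segments.append(text)
        self._buf = []


def partition_html(html):
    """Return the list of segments; a segment's address is its list index."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.segments


def predict_indices(segments, query):
    """Placeholder predictor: keep segments sharing a word with the query.

    In the paper this step is a learned model; keyword overlap is used here
    only so the sketch runs without any model.
    """
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    return [
        i for i, seg in enumerate(segments)
        if terms & set(re.findall(r"[a-z0-9]+", seg.lower()))
    ]


def extract(html, query):
    """Look up the predicted indices; no segment text is ever generated."""
    segments = partition_html(html)
    return [segments[i] for i in predict_indices(segments, query)]
```

Because the predictor emits only a short list of integers, its output size depends on how many segments are selected rather than on how long those segments are, which is the sense in which extraction latency is decoupled from content length.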

Subjects: Information Retrieval, Computation and Language

Publish: 2025-12-07 03:18:19 UTC

arXiv: 2512.06641
