Databases

#1 Dynamic Data Layout Optimization with Worst-case Guarantees

Authors: Kexin Rong ; Paul Liu ; Sarah Ashok Sonje ; Moses Charikar

Many data analytics systems store and process large datasets in partitions containing millions of rows. By mapping rows to partitions in an optimized way, it is possible to improve query performance by skipping over large numbers of irrelevant partitions during query processing. This mapping is referred to as a data layout. Recent works have shown that customizing the data layout to the anticipated query workload greatly improves query performance, but the performance benefits may disappear if the workload changes. Reorganizing data layouts to accommodate workload drift can resolve this issue, but reorganization costs could exceed query savings if not done carefully. In this paper, we present an algorithmic framework OReO that makes online reorganization decisions to balance the benefits of improved query performance with the costs of reorganization. Our framework extends results from Metrical Task Systems to provide a tight bound on the worst-case performance guarantee for online reorganization, without prior knowledge of the query workload. Through evaluation on real-world datasets and query workloads, our experiments demonstrate that online reorganization with OReO can improve combined query and reorganization time by up to 32% compared to using a single, optimized data layout for the entire workload.
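To make the notion of partition skipping concrete, here is a minimal Python sketch. It is not OReO or anything taken from the paper: it assumes a single hypothetical numeric column, fixed-size partitions, and per-partition min/max metadata, and it only shows why the row-to-partition mapping changes how many partitions a range query must scan. The names `Partition`, `build_layout`, and `partitions_scanned` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """A partition holding values of one hypothetical numeric column."""
    rows: list

    @property
    def lo(self):
        return min(self.rows)

    @property
    def hi(self):
        return max(self.rows)


def build_layout(values, partition_size, sort=False):
    """Map rows to partitions; sorting first is one simple 'optimized' layout."""
    data = sorted(values) if sort else list(values)
    return [
        Partition(data[i:i + partition_size])
        for i in range(0, len(data), partition_size)
    ]


def partitions_scanned(layout, q_lo, q_hi):
    """Count partitions whose [lo, hi] range overlaps the query range.

    Partitions that do not overlap can be skipped using metadata alone.
    """
    return sum(1 for p in layout if p.hi >= q_lo and p.lo <= q_hi)


# 100 distinct values in a scrambled arrival order (illustrative data only).
values = [(i * 37) % 100 for i in range(100)]
arrival_layout = build_layout(values, partition_size=10)
sorted_layout = build_layout(values, partition_size=10, sort=True)

scanned_arrival = partitions_scanned(arrival_layout, 20, 29)
scanned_sorted = partitions_scanned(sorted_layout, 20, 29)
```

In this toy setup the sorted layout lets the query touch a single partition, while the arrival-order layout forces a scan of all of them. A layout tuned to one query range can stop helping once the workload shifts to a different column or predicate, which is the drift problem the abstract describes.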

#2 Towards Accurate and Efficient Document Analytics with Large Language Models

Authors: Yiming Lin ; Madelon Hulsebos ; Ruiying Ma ; Shreya Shankar ; Sepanta Zeigham ; Aditya G. Parameswaran ; Eugene Wu

Unstructured data formats account for over 80% of the data currently stored, and extracting value from such formats remains a considerable challenge. In particular, current approaches for managing unstructured documents do not support ad-hoc analytical queries on document collections. Moreover, Large Language Models (LLMs) directly applied to the documents themselves, or on portions of documents through a process of Retrieval-Augmented Generation (RAG), fail to provide high accuracy query results, and in the LLM-only case, additionally incur high costs. Since many unstructured documents in a collection often follow similar templates that impart a common semantic structure, we introduce ZenDB, a document analytics system that leverages this semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document collections. ZenDB efficiently extracts semantic hierarchical structures from such templatized documents, and introduces a novel query engine that leverages these structures for accurate and cost-effective query execution. Users can impose a schema on their documents, and query it, all via SQL. Extensive experiments on three real-world document collections demonstrate ZenDB's benefits, achieving up to 30% cost savings compared to LLM-based baselines, while maintaining or improving accuracy, and surpassing RAG-based baselines by up to 61% in precision and 80% in recall, at a marginally higher cost.

#3 A Novel Technique for Query Plan Representation Based on Graph Neural Networks

Authors: Baoming Chang ; Amin Kamali ; Verena Kantere

Learning representations for query plans plays a pivotal role in machine learning-based query optimizers of database management systems. To this end, particular model architectures are proposed in the literature to convert the tree-structured query plans into representations with formats learnable by downstream machine learning models. However, existing research rarely compares and analyzes the query plan representation capabilities of these tree models and their direct impact on the performance of the overall optimizer. To address this problem, we perform a comparative study to explore the effect of using different state-of-the-art tree models on the optimizer's cost estimation and plan selection performance in relatively complex workloads. Additionally, we explore the possibility of using graph neural networks (GNNs) in the query plan representation task. We propose a novel tree model combining directed GNN with Gated Recurrent Units (GRU) and demonstrate experimentally that the new tree model provides significant improvements to cost estimation tasks and relatively excellent plan selection performance compared to the state-of-the-art tree models.

#4 SPSW: Database Watermarking Based on Fake Tuples and Sparse Priority Strategy

Authors: Zhiwen Ren ; Zehua Ma ; Weiming Zhang ; Nenghai Yu

Databases play a crucial role in storing and managing vast amounts of data in various organizations and industries. Yet the risk of database leakage poses a significant threat to data privacy and security. To trace the source of database leakage, researchers have proposed many database watermarking schemes. Among them, fake-tuple-based database watermarking shows great potential as it does not modify the original data of the database, ensuring the seamless usability of the watermarked database. However, the existing fake-tuple-based database watermarking schemes need to insert a large number of fake tuples for the embedding of each watermark bit, resulting in low watermark transparency. Therefore, we propose a novel database watermarking scheme based on fake tuples and sparse priority strategy, named SPSW, which achieves the same watermark capacity with a lower number of inserted fake tuples compared to the existing embedding strategy. Specifically, for a database about to be watermarked, we prioritize embedding the sparsest watermark sequence, i.e., the sequence containing the most '0' bits among the currently available watermark sequences. For each bit in the sparse watermark sequence, when it is set to '1', SPSW will embed the corresponding set of fake tuples into the database. Otherwise, no modifications will be made to the database. Through theoretical analysis, the proposed sparse priority strategy not only improves transparency but also enhances the robustness of the watermark. The comparative experimental results with other database watermarking schemes further validate the superior performance of the proposed SPSW, aligning with the theoretical analysis.
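The embedding rule described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPSW implementation: watermark sequences are plain bit strings, the per-bit fake tuple sets are supplied by the caller, and the function names `pick_sparsest` and `embed_watermark` are hypothetical. How the real scheme generates fake tuples or extracts the watermark is not covered by the abstract and is not modeled here.

```python
def pick_sparsest(available_sequences):
    """Return the available sequence with the most '0' bits.

    Ties go to the earliest candidate (an arbitrary choice for this sketch).
    """
    return max(available_sequences, key=lambda seq: seq.count("0"))


def embed_watermark(database, watermark, fake_tuple_sets):
    """Insert the fake tuple set for every '1' bit; do nothing for '0' bits.

    `fake_tuple_sets[i]` is the hypothetical set of fake tuples tied to bit i.
    The original rows are never modified, only extended.
    """
    watermarked = list(database)
    for bit, fake_tuples in zip(watermark, fake_tuple_sets):
        if bit == "1":
            watermarked.extend(fake_tuples)
    return watermarked


# Illustrative data only.
available = ["1101", "0100", "0110"]
chosen = pick_sparsest(available)

database = [("alice", 30)]
fake_sets = [[(f"fake{i}a", 0), (f"fake{i}b", 0)] for i in range(4)]
watermarked = embed_watermark(database, chosen, fake_sets)
inserted = len(watermarked) - len(database)
```

Because only '1' bits trigger insertions, choosing the sequence with the most '0' bits keeps the number of fake tuples low: here the chosen sequence adds one fake tuple set, whereas `"1101"` would have added three.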
