NAACL.2024 - Industry Track

| Total: 43

#1 HPipe: Large Language Model Pipeline Parallelism for Long Context on Heterogeneous Cost-effective Devices [PDF] [Copy] [Kimi1] [REL]

Authors: Ruilong Ma ; Xiang Yang ; Jingyu Wang ; Qi Qi ; Haifeng Sun ; Jing Wang ; Zirui Zhuang ; Jianxin Liao

Micro-enterprises and individual developers emerge analysis demands for long sequence with powerful Large Language Models (LLMs). They try to deploy the LLMs at local, but only possess various commodity devices and the unreliable interconnection between devices. Existing parallel techniques do not lead to the same effectiveness in limited environment. The heterogeneity of devices, coupled with their limited capacity and expensive communication, brings challenges to private deployment for maximized utilization of available devices while masking latency. Hence, we introduce HPipe, a pipeline inference framework that successfully mitigates LLMs from high-performance clusters to heterogeneous commodity devices. By ensuring a balanced distribution of workloads, HPipe facilitates the parallel execution of LLMs through pipelining the sequences on the token dimension. The evaluation conducted on LLaMA-7B and GPT3-2B demonstrates that HPipe holds the potential for context analysis on LLM with heterogeneity devices, achieving an impressive speedup in latency and throughput up to 2.28 times.

#2 Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding [PDF] [Copy] [Kimi] [REL]

Authors: Jie Ou ; Yueming Chen ; Prof. Tian

While Large Language Models (LLMs) have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel Decoding (ANPD), an innovative and lossless approach that accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD incorporates a two-stage approach: it begins with a rapid drafting phase that employs an N-gram module, which adapts based on the current interactive context, followed by a verification phase, during which the original LLM assesses and confirms the proposed tokens. Consequently, ANPD preserves the integrity of the LLM’s original output while enhancing processing speed. We further leverage a multi-level architecture for the N-gram module to enhance the precision of the initial draft, consequently reducing inference latency. ANPD eliminates the need for retraining or extra GPU memory, making it an efficient and plug-and-play enhancement. In our experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements up to 3.67x, validating the effectiveness of our proposed ANPD.

#3 SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling [PDF] [Copy] [Kimi] [REL]

Authors: Sanghoon Kim ; Dahyun Kim ; Chanjun Park ; Wonsung Lee ; Wonho Song ; Yunsu Kim ; Hyeonwoo Kim ; Yungi Kim ; Hyeonju Lee ; Jihoo Kim ; Changbae Ahn ; Seonghoon Yang ; Sukyung Lee ; Hyunbyung Park ; Gyoungjin Gim ; Mikyoung Cha ; Hwalsuk Lee ; Sunghun Kim

We introduce SOLAR 10.7B, a large language model (LLM) with 10.7 billion parameters, demonstrating superior performance in various natural language processing (NLP) tasks. Inspired by recent efforts to efficiently up-scale LLMs, we present a method for scaling LLMs called depth up-scaling (DUS), which encompasses depthwise scaling and continued pretraining. In contrast to other LLM up-scaling methods that use mixture-of-experts, DUS does not require complex changes to train and inference efficiently. We show experimentally that DUS is simple yet effective in scaling up high-performance LLMs from small ones. Building on the DUS model, we additionally present SOLAR 10.7B-Instruct, a variant fine-tuned for instruction-following capabilities, surpassing Mixtral-8x7B-Instruct. SOLAR 10.7B is publicly available under the Apache 2.0 license, promoting broad access and application in the LLM field.

#4 UINav: A Practical Approach to Train On-Device Automation Agents [PDF] [Copy] [Kimi] [REL]

Authors: Wei Li ; Fu-Lin Hsu ; William Bishop ; Folawiyo Campbell-Ajala ; Max Lin ; Oriana Riva

Automation systems that can autonomously drive application user interfaces to complete user tasks are of great benefit, especially when users are situationally or permanently impaired. Prior automation systems do not produce generalizable models while AI-based automation agents work reliably only in simple, hand-crafted applications or incur high computation costs. We propose UINav, a demonstration-based approach to train automation agents that fit mobile devices, yet achieving high success rates with modest numbers of demonstrations. To reduce the demonstration overhead, UINav uses a referee model that provides users with immediate feedback on tasks where the agent fails, and automatically augments human demonstrations to increase diversity in training data. Our evaluation shows that with only 10 demonstrations can achieve 70% accuracy, and that with enough demonstrations it can surpass 90% accuracy.

#5 Efficiently Distilling LLMs for Edge Applications [PDF] [Copy] [Kimi1] [REL]

Authors: Achintya Kundu ; Yu Chin Fabian Lim ; Aaron Chew ; Laura Wynter ; Penny Chong ; Rhui Lee

Supernet training of LLMs is of great interest in industrial applications as it confers the ability to produce a palette of smaller models at constant cost, regardless of the number of models (of different size / latency) produced. We propose a new method called Multistage Low-rank Fine-tuning of Super-transformers (MLFS) for parameter-efficient supernet training. We show that it is possible to obtain high-quality encoder models that are suitable for commercial edge applications, and that while decoder-only models are resistant to a comparable degree of compression, decoders can be effectively sliced for a significant reduction in training time.

#6 Modeling and Detecting Company Risks from News [PDF] [Copy] [Kimi1] [REL]

Authors: Jiaxin Pei ; Soumya Vadlamannati ; Liang-Kang Huang ; Daniel Preotiuc-Pietro ; Xinyu Hua

Identifying risks associated with a company is important to investors and the wellbeing of the overall financial markets. In this study, we build a computational framework to automatically extract company risk factors from news articles. Our newly proposed schema comprises seven distinct aspects, such as supply chain, regulations, and competition. We annotate 666 news articles and benchmark various machine learning models. While large language mod- els have achieved remarkable progress in various types of NLP tasks, our experiment shows that zero-shot and few-shot prompting state-of- the-art LLMs (e.g., Llama-2) can only achieve moderate to low performances in identifying risk factors. In contrast, fine-tuning pre-trained language models yields better results on most risk factors. Using this model, we analyze over 277K Bloomberg News articles and demonstrate that identifying risk factors from news could provide extensive insights into the operations of companies and industries.

#7 Multiple-Question Multiple-Answer Text-VQA [PDF] [Copy] [Kimi] [REL]

Authors: Peng Tang ; Srikar Appalaraju ; R. Manmatha ; Yusheng Xie ; Vijay Mahadevan

We present Multiple-Question Multiple-Answer (MQMA), a novel approach to do text-VQA in encoder-decoder transformer models. To the best of our knowledge, almost all previous approaches for text-VQA process a single question and its associated content to predict a single answer. However, in industry applications, users may come up with multiple questions about a single image. In order to answer multiple questions from the same image, each question and content are fed into the model multiple times. In contrast, our proposed MQMA approach takes multiple questions and content as input at the encoder and predicts multiple answers at the decoder in an auto-regressive manner at the same time. We make several novel architectural modifications to standard encoder-decoder transformers to support MQMA. We also propose a novel MQMA denoising pre-training task which is designed to teach the model to align and delineate multiple questions and content with associated answers. MQMA pre-trained model achieves state-of-the-art results on multiple text-VQA datasets, each with strong baselines. Specifically, on OCR-VQA (+2.5%), TextVQA (+1.4%), ST-VQA (+0.6%), DocVQA (+1.1%) absolute improvements over the previous state-of-the-art approaches.

#8 An NLP-Focused Pilot Training Agent for Safe and Efficient Aviation Communication [PDF] [Copy] [Kimi] [REL]

Authors: Xiaochen Liu ; Bowei Zou ; AiTi Aw

Aviation communication significantly influences the success of flight operations, ensuring safety of lives and efficient air transportation. In day-to-day flight operations, air traffic controllers (ATCos) would timely communicate instructions to pilots using specific phraseology for aircraft manipulation . However, pilots, originating from diverse backgrounds and understanding of English language, have struggled with conforming to strict phraseology for readback and communication in the live operation, this problem had not been effectively addressed over the past decades. Traditionally, aviation communication training involved expensive setups and resources, often relying on human-in-the-loop (HIL) air traffic simulations that demand allocating a specific environment, domain experts for participation, and substantial amount of annotated data for simulation. Therefore, we would like to propose an NLP-oriented training agent and address these challenges. Our approach involves leveraging only natural language capabilities and fine-tuning on communication data to generate instructions based on input scenarios (keywords). Given the absence of prior references for this business problem, we investigated the feasibility of our proposed solution by 1) generating all instructions at once and 2) generating one instruction while incorporating conversational history in each input. Our findings affirm the feasibility of this approach, highlighting the effectiveness of fine-tuning pre-trained models and large language models in advancing aviation communication training.

#9 Visual Grounding for User Interfaces [PDF] [Copy] [Kimi] [REL]

Authors: Yijun Qian ; Yujie Lu ; Alexander Hauptmann ; Oriana Riva

Enabling autonomous language agents to drive application user interfaces (UIs) as humans do can significantly expand the capability of today’s API-based agents. Essential to this vision is the ability of agents to ground natural language commands to on-screen UI elements. Prior UI grounding approaches work by relaying on developer-provided UI metadata (UI trees, such as web DOM, and accessibility labels) to detect on-screen elements. However, such metadata is often unavailable or incomplete. Object detection techniques applied to UI screens remove this dependency, by inferring location and types of UI elements directly from the UI’s visual appearance. The extracted semantics, however, are too limited to directly enable grounding. We overcome the limitations of both approaches by introducing the task of visual UI grounding, which unifies detection and grounding. A model takes as input a UI screenshot and a free-form language expression, and must identify the referenced UI element. We propose a solution to this problem, LVG, which learns UI element detection and grounding using a new technique called layout-guided contrastive learning, where the semantics of individual UI objects are learned also from their visual organization. Due to the scarcity of UI datasets, LVG integrates synthetic data in its training using multi-context learning. LVG outperforms baselines pre-trained on much larger datasets by over 4.9 points in top-1 accuracy, thus demonstrating its effectiveness.

#10 Prompt Tuned Embedding Classification for Industry Sector Allocation [PDF1] [Copy] [Kimi1] [REL]

Authors: Valentin Buchner ; Lele Cao ; Jan-Christoph Kalo ; Vilhelm Von Ehrenheim

We introduce Prompt Tuned Embedding Classification (PTEC) for classifying companies within an investment firm’s proprietary industry taxonomy, supporting their thematic investment strategy. PTEC assigns companies to the sectors they primarily operate in, conceptualizing this process as a multi-label text classification task. Prompt Tuning, usually deployed as a text-to-text (T2T) classification approach, ensures low computational cost while maintaining high task performance. However, T2T classification has limitations on multi-label tasks due to the generation of non-existing labels, permutation invariance of the label sequence, and a lack of confidence scores. PTEC addresses these limitations by utilizing a classification head in place of the Large Language Models (LLMs) language head. PTEC surpasses both baselines and human performance while lowering computational demands. This indicates the continuing need to adapt state-of-the-art methods to domain-specific tasks, even in the era of LLMs with strong generalization abilities.

#11 REXEL: An End-to-end Model for Document-Level Relation Extraction and Entity Linking [PDF] [Copy] [Kimi] [REL]

Authors: Nacime Bouziani ; Shubhi Tyagi ; Joseph Fisher ; Jens Lehmann ; Andrea Pierleoni

Extracting structured information from unstructured text is critical for many downstream NLP applications and is traditionally achieved by closed information extraction (cIE). However, existing approaches for cIE suffer from two limitations: (i) they are often pipelines which makes them prone to error propagation, and/or (ii) they are restricted to sentence level which prevents them from capturing long-range dependencies and results in expensive inference time. We address these limitations by proposing REXEL, a highly efficient and accurate model for the joint task of document level cIE (DocIE). REXEL performs mention detection, entity typing, entity disambiguation, coreference resolution and document-level relation classification in a single forward pass to yield facts fully linked to a reference knowledge graph. It is on average 11 times faster than competitive existing approaches in a similar setting and performs competitively both when optimised for any of the individual sub-task and a variety of combinations of different joint tasks, surpassing the baselines by an average of more than 6 F1 points. The combination of speed and accuracy makes REXEL an accurate cost-efficient system for extracting structured information at web-scale. We also release an extension of the DocRED dataset to enable benchmarking of future work on DocIE, which will be available at https://github.com/amazon-science/e2e-docie.

#12 Conformer-Based Speech Recognition On Extreme Edge-Computing Devices [PDF] [Copy] [Kimi] [REL]

Authors: Mingbin Xu ; Alex Jin ; Sicheng Wang ; Mu Su ; Tim Ng ; Henry Mason ; Shiyi Han ; Zhihong Lei ; Yaqiao Deng ; Zhen Huang ; Mahesh Krishnamoorthy

With increasingly more powerful compute capabilities and resources in today’s devices, traditionally compute-intensive automatic speech recognition (ASR) has been moving from the cloud to devices to better protect user privacy. However, it is still challenging to implement on-device ASR on resource-constrained devices, such as smartphones, smart wearables, and other small home automation devices. In this paper, we propose a series of model architecture adaptions, neural network graph transformations, and numerical optimizations to fit an advanced Conformer based end-to-end streaming ASR system on resource-constrained devices without accuracy degradation. We achieve over 5.26 times faster than realtime (0.19 RTF) speech recognition on small wearables while minimizing energy consumption and achieving state-of-the-art accuracy. The proposed methods are widely applicable to other transformer-based server-free AI applications. In addition, we provide a complete theory on optimal pre-normalizers that numerically stabilize layer normalization in any Lp-norm using any floating point precision.

#13 Generating Signed Language Instructions in Large-Scale Dialogue Systems [PDF] [Copy] [Kimi] [REL]

Authors: Mert Inan ; Katherine Atwell ; Anthony Sicilia ; Lorna Quandt ; Malihe Alikhani

We introduce a goal-oriented conversational AI system enhanced with American Sign Language (ASL) instructions, presenting the first implementation of such a system on a worldwide multimodal conversational AI platform. Accessible through a touch-based interface, our system receives input from users and seamlessly generates ASL instructions by leveraging retrieval methods and cognitively based gloss translations. Central to our design is a sign translation module powered by Large Language Models, alongside a token-based video retrieval system for delivering instructional content from recipes and wikiHow guides. Our development process is deeply rooted in a commitment to community engagement, incorporating insights from the Deaf and Hard-of-Hearing community, as well as experts in cognitive and ASL learning sciences. The effectiveness of our signing instructions is validated by user feedback, achieving ratings on par with those of the system in its non-signing variant. Additionally, our system demonstrates exceptional performance in retrieval accuracy and text-generation quality, measured by metrics such as BERTScore. We have made our codebase and datasets publicly accessible at https://github.com/Merterm/signed-dialogue, and a demo of our signed instruction video retrieval system is available at https://huggingface.co/spaces/merterm/signed-instructions.

#14 Leveraging Natural Language Processing and Large Language Models for Assisting Due Diligence in the Legal Domain [PDF] [Copy] [Kimi] [REL]

Authors: Myeongjun Jang ; Gábor Stikkel

Due diligence is a crucial legal process that mitigates potential risks of mergers and acquisitions (M&A). However, despite its prominent importance, there has been a lack of research regarding leveraging NLP techniques for due diligence. In this study, our aim is to explore the most efficient deep-learning model architecture for due diligence in terms of performance and latency, and evaluate the potential of large language models (LLMs) as an efficient due diligence assistant. To our knowledge, this is the first study that employs pre-trained language models (PLMs) and LLMs for the due diligence problem. Our experimental results suggest that methodologies that have demonstrated promising performance in the general domain encounter challenges when applied in due diligence due to the inherent lengthy nature of legal documents. We also ascertain that LLMs can be a useful tool for helping lawyers who perform due diligence.

#15 AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [PDF] [Copy] [Kimi] [REL]

Authors: Xingwei He ; Zhenghao Lin ; Yeyun Gong ; A-Long Jin ; Hang Zhang ; Chen Lin ; Jian Jiao ; Siu Ming Yiu ; Nan Duan ; Weizhu Chen

Many natural language processing (NLP) tasks rely on labeled data to train machine learning models with high performance. However, data annotation is time-consuming and expensive, especially when the task involves a large amount of data or requires specialized domains. Recently, GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks. In this paper, we first claim that large language models (LLMs), such as GPT-3.5, can serve as an excellent crowdsourced annotator when provided with sufficient guidance and demonstrated examples. Accordingly, we propose AnnoLLM, an annotation system powered by LLMs, which adopts a two-step approach, explain-then-annotate. Concretely, we first prompt LLMs to provide explanations for why the specific ground truth answer/label was assigned for a given example. Then, we construct the few-shot chain-of-thought prompt with the self-generated explanation and employ it to annotate the unlabeled data with LLMs. Our experiment results on three tasks, including user input and keyword relevance assessment, BoolQ, and WiC, demonstrate that AnnoLLM surpasses or performs on par with crowdsourced annotators. Furthermore, we build the first conversation-based information retrieval dataset employing AnnoLLM. This dataset is designed to facilitate the development of retrieval models capable of retrieving pertinent documents for conversational text. Human evaluation has validated the dataset’s high quality.

#16 An Automatic Prompt Generation System for Tabular Data Tasks [PDF] [Copy] [Kimi] [REL]

Authors: Ashlesha Akella ; Abhijit Manatkar ; Brijkumar Chavda ; Hima Patel

Efficient processing of tabular data is important in various industries, especially when working with datasets containing a large number of columns. Large language models (LLMs) have demonstrated their ability on several tasks through carefully crafted prompts. However, creating effective prompts for tabular datasets is challenging due to the structured nature of the data and the need to manage numerous columns. This paper presents an innovative auto-prompt generation system suitable for multiple LLMs, with minimal training. It proposes two novel methods; 1) A Reinforcement Learning-based algorithm for identifying and sequencing task-relevant columns 2) cell-level similarity-based approach for enhancing few-shot example selection. Our approach has been extensively tested across 66 datasets, demonstrating improved performance in three downstream tasks: data imputation, error detection, and entity matching using two distinct LLMs; Google/flant-t5xxl and Mixtral 8x7B.

#17 Fighting crime with Transformers: Empirical analysis of address parsing methods in payment data [PDF] [Copy] [Kimi] [REL]

Authors: Haitham Hammami ; Louis Baligand ; Bojan Petrovski

In the financial industry, identifying the location of parties involved in payments is a major challenge in the context of Anti-Money Laundering transaction monitoring. For this purpose address parsing entails extracting fields such as street, postal code, or country from free text message attributes. While payment processing platforms are updating their standards with more structured formats such as SWIFT with ISO 20022, address parsing remains essential for a considerable volume of messages. With the emergence of Transformers and Generative Large Language Models (LLM), we explore the performance of state-of-the-art solutions given the constraint of processing a vast amount of daily data. This paper also aims to show the need for training robust models capable of dealing with real-world noisy transactional data. Our results suggest that a well fine-tuned Transformer model using early-stopping significantly outperforms other approaches. Nevertheless, generative LLMs demonstrate strong zero_shot performance and warrant further investigations.

#18 Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain [PDF1] [Copy] [Kimi] [REL]

Authors: Brian Hu ; Bill Ray ; Alice Leung ; Amy Summerville ; David Joy ; Christopher Funk ; Arslan Basharat

In difficult decision-making scenarios, it is common to have conflicting opinions among expert human decision-makers as there may not be a single right answer. Such decisions may be guided by different attributes that can be used to characterize an individual’s decision. We introduce a novel dataset for medical triage decision-making, labeled with a set of decision-maker attributes (DMAs). This dataset consists of 62 scenarios, covering six different DMAs, including ethical principles such as fairness and moral desert. We present a novel software framework for human-aligned decision-making by utilizing these DMAs, paving the way for trustworthy AI with better guardrails. Specifically, we demonstrate how large language models (LLMs) can serve as ethical decision-makers, and how their decisions can be aligned to different DMAs using zero-shot prompting. Our experiments focus on different open-source models with varying sizes and training techniques, such as Falcon, Mistral, and Llama 2. Finally, we also introduce a new form of weighted self-consistency that improves the overall quantified performance. Our results provide new research directions in the use of LLMs as alignable decision-makers. The dataset and open-source software are publicly available at: https://github.com/ITM-Kitware/llm-alignable-dm.

#19 Reducing hallucination in structured outputs via Retrieval-Augmented Generation [PDF] [Copy] [Kimi1] [REL]

Authors: Orlando Ayala ; Patrice Bechard

A current limitation of Generative AI (GenAI) is its propensity to hallucinate. While Large Language Models (LLM) have taken the world by storm, without eliminating or at least reducing hallucination, real-world GenAI systems will likely continue to face challenges in user adoption. In the process of deploying an enterprise application that produces workflows from natural language requirements, we devised a system leveraging Retrieval-Augmented Generation (RAG) to improve the quality of the structured output that represents such workflows. Thanks to our implementation of RAG, our proposed system significantly reduces hallucination and allows the generalization of our LLM to out-of-domain settings. In addition, we show that using a small, well-trained retriever can reduce the size of the accompanying LLM at no loss in performance, thereby making deployments of LLM-based systems less resource-intensive.

#20 Towards Translating Objective Product Attributes Into Customer Language [PDF] [Copy] [Kimi] [REL]

Authors: Ram Yazdi ; Oren Kalinsky ; Alexander Libov ; Dafna Shahaf

When customers search online for a product they are not familiar with, their needs are often expressed through subjective product attributes, such as ”picture quality” for a TV or ”easy to clean” for a sofa. In contrast, the product catalog in online stores includes objective attributes such as ”screen resolution” or ”material”. In this work, we aim to find a link between the objective product catalog and the subjective needs of the customers, to help customers better understand the product space using their own words. We apply correlation-based methods to the store’s product catalog and product reviews in order to find the best potential links between objective and subjective attributes; next, Large Language Models (LLMs) reduce spurious correlations by incorporating common sense and world knowledge (e.g., picture quality is indeed affected by screen resolution, and 8k is the best one). We curate a dataset for this task and show that our combined approach outperforms correlation-only and causation-only approaches.

#21 Automating the Generation of a Functional Semantic Types Ontology with Foundational Models [PDF] [Copy] [Kimi] [REL]

Authors: Sachin Konan ; Larry Rudolph ; Scott Affens

The rise of data science, the inherent dirtiness of data, and the proliferation of vast data providers have increased the value proposition of Semantic Types. Semantic Types are a way of encoding contextual information onto a data schema that informs the user about the definitional meaning of data, its broader context, and relationships to other types. We increasingly see a world where providing structure to this information, attached directly to data, will enable both people and systems to better understand the content of a dataset and the ability to efficiently automate data tasks such as validation, mapping/joins, and eventually machine learning. While ontological systems exist, they have not had widespread adoption due to challenges in mapping to operational datasets and lack of specificity of entity-types. Additionally, the validation checks associated with data are stored in code bases separate from the datasets that are distributed. In this paper, we address both challenges holistically by proposing a system that efficiently maps and encodes functional meaning on Semantic Types.

#22 Leveraging Customer Feedback for Multi-modal Insight Extraction [PDF] [Copy] [Kimi] [REL]

Authors: Sandeep Mukku ; Abinesh Kanagarajan ; Pushpendu Ghosh ; Chetan Aggarwal

Businesses can benefit from customer feedback in different modalities, such as text and images, to enhance their products and services. However, it is difficult to extract actionable and relevant pairs of text segments and images from customer feedback in a single pass. In this paper, we propose a novel multi-modal method that fuses image and text information in a latent space and decodes it to extract the relevant feedback segments using an image-text grounded text decoder. We also introduce a weakly-supervised data generation technique that produces training data for this task. We evaluate our model on unseen data and demonstrate that it can effectively mine actionable insights from multi-modal customer feedback, outperforming the existing baselines by 14 points in F1 score.

#23 Optimizing LLM Based Retrieval Augmented Generation Pipelines in the Financial Domain [PDF] [Copy] [Kimi] [REL]

Authors: Yiyun Zhao ; Prateek Singh ; Hanoz Bhathena ; Bernardo Ramos ; Aviral Joshi ; Swaroop Gadiyaram ; Saket Sharma

Retrieval Augmented Generation (RAG) is a prominent approach in real-word applications for grounding large language model (LLM) generations in up to date and domain-specific knowledge. However, there is a lack of systematic investigations of the impact of each component (retrieval quality, prompts, generation models) on the generation quality of a RAG pipeline in real world scenarios. In this study, we benchmark 6 LLMs in 15 retrieval scenarios exploring 9 prompts over 2 real world financial domain datasets. We thoroughly discuss the impact of each component in RAG pipeline on answer generation quality and formulate specific recommendations for the design of RAG systems.

#24 Scaling Up Authorship Attribution [PDF] [Copy] [Kimi] [REL]

Authors: Jacob Striebel ; Abishek Edikala ; Ethan Irby ; Alex Rosenfeld ; J. Gage ; Daniel Dakota ; Sandra Kübler

We describe our system for authorship attribution in the IARPA HIATUS program. We describe the model and compute infrastructure developed to satisfy the set of technical constraints imposed by IARPA, including runtime limits as well as other constraints related to the ultimate use case. One use-case constraint concerns the explainability of the features used in the system. For this reason, we integrate features from frame semantic parsing, as they are both interpretable and difficult for adversaries to evade. One trade-off with using such features, however, is that more sophisticated feature representations require more complicated architectures, which limit usefulness in time-sensitive and constrained compute environments. We propose an approach to increase the efficiency of frame semantic parsing through an analysis of parallelization and beam search sizes. Our approach results in a system that is approximately 8.37x faster than the base system with a minimal effect on accuracy.

#25 Multimodal Contextual Dialogue Breakdown Detection for Conversational AI Models [PDF] [Copy] [Kimi] [REL]

Authors: Md Messal Monem Miah ; Ulie Schnaithmann ; Arushi Raghuvanshi ; Youngseo Son

Detecting dialogue breakdown in real time is critical for conversational AI systems, because it enables taking corrective action to successfully complete a task. In spoken dialog systems, this breakdown can be caused by a variety of unexpected situations including high levels of background noise, causing STT mistranscriptions, or unexpected user flows.In particular, industry settings like healthcare, require high precision and high flexibility to navigate differently based on the conversation history and dialogue states. This makes it both more challenging and more critical to accurately detect dialog breakdown. To accurately detect breakdown, we found it requires processing audio inputs along with downstream NLP model inferences on transcribed text in real time. In this paper, we introduce a Multimodal Contextual Dialogue Breakdown (MultConDB) model. This model significantly outperforms other known best models by achieving an F1 of 69.27.