| Total: 121
This paper tackles the challenges presented by Automatic Speech Recognition (ASR) errors in voice-based dialog systems, specifically, their adverse impact on Entity Resolution (ER) as a downstream task. Navigating the equilibrium between accuracy and online retrieval’s speed requirement proves challenging, particularly when limited data links the failed mentions to resolved entities. In this paper, we propose a entity reference expansion system, injecting pairs of failed mentions and resolved entity names into the knowledge graph, enhancing its awareness of unresolved mentions. To address data scarcity, we introduce a synthetic data generation approach aligned with noise patterns. This, combined with an ASR-Error-Aware Loss function, facilitates the training of a RoBERTa model, which filters failed mentions and extracts entity pairs for knowledge graph expansion. These designs confront obstacles related to ASR noise, data limitations, and online entity retrieval.
Detecting hallucinations, where Large Language Models (LLMs) are not factually consistent with a Knowledge Base (KB), is a challenge for Retrieval-Augmented Generation (RAG) systems. Current solutions rely on public datasets to develop prompts or fine-tune a Natural Language Inference (NLI) model. However, these approaches are not focused on developing an enterprise RAG system; they do not consider latency, train or evaluate on production data, nor do they handle non-verifiable statements such as small talk or questions. To address this, we leverage the customer service conversation data of four large brands to evaluate existing solutions and propose a set of small encoder models trained on a new dataset. We find the proposed models to outperform existing methods and highlight the value of combining a small amount of in-domain data with public datasets.
Recent development of large language models (LLMs) for code like CodeX and CodeT5+ shows promise in achieving code intelligence. Their ability of synthesizing program targeting a pre-defined algorithmic coding task has been intensively tested and verified on datasets including HumanEval and MBPP. Yet, evaluation of these LLMs from more perspectives (than just program synthesis) is also anticipated, considering their broad scope of applications. In this paper, we explore their ability of automatic test cases generation. We show intriguing observations and reveal how the quality of their generated test cases can be improved. Following recent work which uses generated test cases to enhance program synthesis, we further leverage our findings in improving the quality of the synthesized programs and show +11.77% and +4.22% higher code pass rates on HumanEval+ comparing with the GPT-3.5-turbo baseline and the recent state-of-the-art, respectively. Our code is publicly available at https://github.com/asdasxzxcq/TestCaseGen.
Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the effort required for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on summary faithfulness is not universally positive across LLMs. To enable this approach, we introduce Keyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE improvements across datasets and LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
A frequent challenge in applications that use entities extracted from text documents is selecting the most salient entities when only a small number can be used by the application (e.g., displayed to a user). Solving this challenge is particularly difficult in the setting of extremely short documents, such as the response from a digital assistant, where traditional signals of salience such as position and frequency are less likely to be useful. In this paper, we propose a lightweight and data-efficient approach for entity salience detection on short text documents. Our experiments show that our approach achieves competitive performance with respect to complex state-of-the-art models, such as GPT-4, at a significant advantage in latency and cost. In limited data settings, we show that a semi-supervised fine-tuning process can improve performance further. Furthermore, we introduce a novel human-labeled dataset for evaluating entity salience on short question-answer pair documents.
Utilizing Large Language Models (LLM) as chatbots in diverse business scenarios often presents the challenge of maintaining topic continuity. Abrupt shifts in topics can lead to poor user experiences and inefficient utilization of computational resources. In this paper, we present a topic continuity model aimed at assessing whether a response aligns with the initial conversation topic. Our model is built upon the expansion of the corresponding natural language understanding (NLU) model into quantifiable terms using a Naive Bayes approach. Subsequently, we have introduced an attention mechanism and logarithmic nonlinearity to enhance its capability to capture topic continuity. This approach allows us to convert the NLU model into an interpretable analytical formula. In contrast to many NLU models constrained by token limits, our proposed model can seamlessly handle conversations of any length with linear time complexity. Furthermore, the attention mechanism significantly improves the model’s ability to identify topic continuity in complex conversations. According to our experiments, our model consistently outperforms traditional methods, particularly in handling lengthy and intricate conversations. This unique capability offers us an opportunity to ensure the responsible and interpretable use of LLMs.
The rapid introduction of new brand names into everyday language poses a unique challenge for e-commerce spelling correction services, which must distinguish genuine misspellings from novel brand names that use unconventional spelling. We seek to address this challenge via Retrieval Augmented Generation (RAG). On this approach, product names are retrieved from a catalog and incorporated into the context used by a large language model (LLM) that has been fine-tuned to do contextual spelling correction. Through quantitative evaluation and qualitative error analyses, we find improvements in spelling correction utilizing the RAG framework beyond a stand-alone LLM. We also demonstrate the value of additional finetuning of the LLM to incorporate retrieved context.
Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters, providing compute-optimal estimates but often neglecting the impact of data quality on model generalization.In this paper, we extend the conventional understanding of scaling law by offering a microscopic view of data quality within the original formulation – effective training tokens – which we posit to be a critical determinant of performance for parameter-constrained language models.Specifically, we formulate the proposed term of effective training tokens to be a combination of two readily-computed indicators of text:(i) text diversity and (ii) syntheticity as measured by a teacher model.We pretrained over 200 models of 25M to 1.5B parameters on a diverse set of sampled, synthetic data, and estimated the constants that relate text quality, model size, training tokens, and eight reasoning task accuracy scores.We demonstrated the estimated constants yield +0.83 Pearson correlation with true accuracies, and analyze it in scenarios involving widely-used data techniques such as data sampling and synthesis which aim to improve data quality.
Large language models (LLMs) trained on general domain corpora showed remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored for the closely-related domains of Earth science, biology, physics, heliophysics, planetary sciences and astrophysics, and trained using curated scientific corpora drawn from diverse data sources. The suite of models include: (1) an encoder model trained using domain-specific vocabulary and corpora to address NLP tasks, (2) a contrastive-learning based text embedding model trained using a diverse set of datasets to address information retrieval tasks and (3) smaller versions of these models created using knowledge distillation for applications which have latency or resource constraints. We also created three new scientific benchmark datasets, Climate-Change NER (entity-recognition), NASA-QA (extractive QA) and NASA-IR (IR) to accelerate research in these multi-disciplinary fields. We show that our models outperform both general-purpose (RoBERTa) and domain- specific (SciBERT) encoders on these new tasks as well as existing tasks in the domains of interest. Furthermore, we demonstrate the use of these models in two industrial settings- as a retrieval model for large-scale vector search applications and in automatic content tagging systems.
Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream tasks. Quantization-aware Training (QAT) can alleviate this problem, but it requires significantly more computational resources. To tackle this, we introduced Weight-Decomposed Low-Rank Quantization-Aware Training (DL-QAT), which merges the advantages of QAT while training only less than 1% of the total parameters. Specifically, we introduce a group-specific quantization magnitude to adjust the overall scale of each quantization group. Within each quantization group, we use LoRA matrices to update the weight size and direction in the quantization space. We validated the effectiveness of our method on the LLaMA and LLaMA2 model families. The results show significant improvements over our baseline method across different quantization granularities. For instance, for LLaMA-7B, our approach outperforms the previous state-of-the-art method by 4.2% in MMLU on 3-bit LLaMA-7B. Additionally, our quantization results on pre-trained models also surpass previous QAT methods, demonstrating the superior performance and efficiency of our approach.
Large language models (LLMs) enhanced with retrieval augmentation has shown great performance in many applications. However, the computational demands for these models pose a challenge when applying them to real-time tasks, such as composition assistance. To address this, we propose Hybrid Retrieval-Augmented Composition Assistance (Hybrid-RACA), a novel system for real-time text prediction that efficiently combines a cloud-based LLM with a smaller client-side model through retrieval augmented memory. This integration enables the client model to generate better responses, benefiting from the LLM’s capabilities and cloud-based data. Meanwhile, via a novel asynchronous memory update mechanism, the client model can deliver real-time completions to user inputs without the need to wait for responses from the cloud. Our experiments on five datasets demonstrate that Hybrid-RACA offers strong performance while maintaining low latency.
Recent advancements in large language models (LLMs) are propelling us toward artificial general intelligence with their remarkable emergent abilities and reasoning capabilities. However, the substantial computational and memory requirements limit the widespread adoption. Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating LLMs, albeit with potential risks to accuracy. Numerous studies have aimed to minimize the accuracy loss associated with quantization. However, their quantization configurations vary from each other and cannot be fairly compared. In this paper, we present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization. LLMC integrates dozens of algorithms, models, and hardware, offering high extensibility from integer to floating-point quantization, from LLM to vision-language (VLM) model, from fixed-bit to mixed precision, and from quantization to sparsification. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats, providing novel insights and detailed analyses for further research and practical guidance for users. Our toolkit is available at https://github.com/ModelTC/llmc.
Large Language Models (LLMs) have issues with document question answering (QA) in situations where the document is unable to fit in the small context length of an LLM. To overcome this issue, most existing works focus on retrieving the relevant context from the document, representing them as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user’s mental model of these documents with rich structure. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve the context based on either structure or content. Our experiments demonstrate the effectiveness of the proposed PDFTriage-augmented models across several classes of questions where existing retrieval-augmented LLMs fail. To facilitate further research on this fundamental problem, we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents from 10 different categories of question types for document QA. Our code and datasets will be released soon on Github.
Machine learning applications for text classification are increasingly used in domains such as toxicity and misinformation detection in online settings. However, obtaining precisely labeled data for training remains challenging, particularly because not all problematic instances are reported. Positive-Unlabeled (PU) learning, which uses only labeled positive and unlabeled samples, offers a solution for these scenarios. A significant concern in PU learning, especially in online settings, is fairness: specific groups may be disproportionately classified as problematic. Despite its importance, this issue has not been explicitly addressed in research. This paper aims to bridge this gap by investigating the fairness of PU learning in both offline and online settings. We propose a novel approach to achieve more equitable results by extending PU learning methods to online learning for both linear and non-linear classifiers and analyzing the impact of the online setting on fairness. Our approach incorporates a convex fairness constraint during training, applicable to both offline and online PU learning. Our solution is theoretically robust, and experimental results demonstrate its efficacy in improving fairness in PU learning in text classification.
This study presents a novel learning approach designed to enhance both mathematical reasoning and problem-solving abilities of Large Language Models (LLMs). We focus on integrating the Chain-of-Thought (CoT) and the Program-of-Thought (PoT) learning, hypothesizing that prioritizing the learning of mathematical reasoning ability is helpful for the amplification of problem-solving ability. Thus, the initial learning with CoT is essential for solving challenging mathematical problems. To this end, we propose a sequential learning approach, named SAAS (Solving Ability Amplification Strategy), which strategically transitions from CoT learning to PoT learning. Our empirical study, involving an extensive performance comparison using several benchmarks, demonstrates that our SAAS achieves state-of-the-art (SOTA) performance. The results underscore the effectiveness of our sequential learning approach, marking a significant advancement in the field of mathematical reasoning in LLMs.
Increasing use of large language models (LLMs) demand performant guardrails to ensure the safety of inputs and outputs of LLMs. When these safeguards are trained on imbalanced data, they can learn the societal biases. We present a light-weight, post-processing method for mitigating counterfactual fairness in closed-source text safety classifiers. Our approach involves building an ensemble that not only outperforms the input classifiers and policy-aligns them, but also acts as a debiasing regularizer. We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases. We create an expanded Open AI dataset, and a new templated LLM-generated dataset based on user-prompts, both of which are counterfactually balanced across identity groups and cover four key areas of safety; we will work towards publicly releasing these datasets. Our results show that our approach improves counterfactual fairness with minimal impact on model performance.
This paper addresses the challenge of improving user experience on e-commerce platforms by enhancing product ranking relevant to user’s search queries. Ambiguity and complexity of user queries often lead to a mismatch between user’s intent and retrieved product titles or documents. Recent approaches have proposed the use of Transformer-based models which need millions of annotated query-title pairs during the pre-training stage, and this data often does not take user intent into account. To tackle this, we curate samples from existing datasets at eBay, manually annotated with buyer-centric relevance scores, and centrality scores which reflect how well the product title matches the user’s intent. We introduce a User-intent Centrality Optimization (UCO) approach for existing models, which optimizes for the user intent in semantic product search. To that end, we propose a dual-loss based optimization to handle hard negatives, i.e., product titles that are semantically relevant but do not reflect the user’s intent. Our contributions include curating challenging evaluation sets and implementing UCO, resulting in significant improvements in product ranking efficiency, observed for different evaluation metrics. Our work aims to ensure that the most buyer-centric titles for a query are ranked higher, thereby, enhancing the user experience on e-commerce platforms.
Evaluating natural language generation (NLG) systems automatically poses significant challenges.Recent studies have employed large language models (LLMs) as reference-free metrics for NLG evaluation, enhancing adaptability to new tasks tasks. However, these methods still show lower correspondence with human judgments compared to specialized neural evaluators.In this paper, we introduce “Fusion-Eval”, an innovative approach that leverages LLMs to integrate insights from various assistant evaluators. The LLM is given the example to evaluate along with scores from the assistant evaluators. Each of these evaluators specializes in assessing distinct aspects of responses.Fusion-Eval achieves a 0.962 system-level Kendall-Tau correlation with humans on SummEval and a 0.744 turn-level Spearman correlation on TopicalChat, which is significantly higher than baseline methods. These results highlight Fusion-Eval’s significant potential in the realm of natural language system evaluation.
This study explores the consistency of personality traits in quantized large language models (LLMs) for edge device role-playing scenarios. Using the Big Five personality traits model, we evaluate how stable assigned personalities are for Quantized Role-Playing Dialog Agents (QRPDA) during multi-turn interactions. We evaluate multiple LLMs with various quantization levels, combining binary indexing of personality traits, explicit self-assessments, and linguistic analysis of narratives. To address personality inconsistency, we propose a non-parametric method called Think2. Our multi-faceted evaluation framework demonstrates Think2’s effectiveness in maintaining consistent personality traits for QRPDA. Moreover, we offer insights to help select the optimal model for QRPDA, improving its stability and reliability in real-world applications.
Error correction (EC) based on large language models is an emerging technology to enhance the performance of automatic speech recognition (ASR) systems.Generally, training data for EC are collected by automatically pairing a large set of ASR hypotheses (as sources) and their gold references (as targets).However, the quality of such pairs is not guaranteed, and we observed various types of noise which can make the EC models brittle, e.g. inducing overcorrection in out-of-domain (OOD) settings.In this work, we propose two fundamental criteria that EC training data should satisfy: namely, EC targets should (1) improve linguistic acceptability over sources and (2) be inferable from the available context (e.g. source phonemes).Through these criteria, we identify low-quality EC pairs and train the models not to make any correction in such cases, the process we refer to as conservative data filtering.In our experiments, we focus on Japanese ASR using a strong Conformer-CTC as the baseline and finetune Japanese LLMs for EC.Through our evaluation on a suite of 21 internal benchmarks, we demonstrate that our approach can significantly reduce overcorrection and improve both the accuracy and quality of ASR results in the challenging OOD settings.
Language models for natural language processing have been grafted onto programming language modeling for advancing code intelligence. Although it can be represented in the text format, code is syntactically more rigorous, as it is designed to be properly compiled or interpreted to perform a set of behaviors given any inputs. In this case, existing works benefit from syntactic representations to learn from code less ambiguously in forms of abstract syntax tree, control-flow graph, etc. However, programs with the same purpose can be implemented in various ways showing different syntactic representations, while the ones with similar implementations can have distinct behaviors. Though trivially demonstrated during executions, such semantics about functionality are challenging to be learned directly from code, especially in an unsupervised manner. Hence, in this paper, we propose FuzzPretrain to explore the dynamic information of programs revealed by their test cases and embed it into the feature representations of code as complements. The test cases are obtained with the assistance of a customized fuzzer and are only required during pre-training. FuzzPretrain yielded more than 6%/19% mAP improvements on code search over its masked language modeling counterparts trained with only source code and source code coupled with abstract syntax trees (ASTs), respectively. Our experiments show the benefits of learning discriminative code representations from FuzzPretrain.
Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual sub-procedures, e.g. local inference and communication, however, there is no comprehensive framework that provides a holistic system view for optimizing LLM serving in an end-to-end manner. In this work, we conduct a detailed analysis to identify major bottlenecks that impact end-to-end latency in LLM serving systems. Our analysis reveals that a comprehensive LLM serving endpoint must address a series of efficiency bottlenecks that extend beyond LLM inference. We then propose ScaleLLM, an optimized system for resource-efficient LLM serving. Our extensive experiments reveal that reveal that with 64 concurrent requests on Mixtral 8x7B, ScaleLLM achieves a 4.3× speed up over vLLM and outperforms state-of-the-arts with 1.5× higher throughput.
This paper introduces a novel framework that combines graph-driven context retrieval in conjunction to knowledge graphs based enhancement, honing the proficiency of LLMs, especially in domain specific community question answering platforms like AskUbuntu, Unix, and ServerFault. We conduct experiments on various LLMs with different parameter sizes to evaluate their ability to ground knowledge and determine factual accuracy in answers to open-ended questions. Our methodology GraphContextGen consistently outperforms dominant text-based retrieval systems, demonstrating its robustness and adaptability to a larger number of use cases. This advancement highlights the importance of pairing context rich data retrieval with LLMs, offering a renewed approach to knowledge sourcing and generation in AI systems. We also show that, due to rich contextual data retrieval, the crucial entities, along with the generated answer, remain factually coherent with the gold answer. We shall release the source code and datasets upon acceptance.
The electric vehicle (EV) battery supply chain’s vulnerability to disruptions necessitates advanced predictive analytics. We present SHIELD (Schema-based Hierarchical Induction for EV supply chain Disruption), a system integrating Large Language Models (LLMs) with domain expertise for EV battery supply chain risk assessment. SHIELD combines: (1) LLM-driven schema learning to construct a comprehensive knowledge library, (2) a disruption analysis system utilizing fine-tuned language models for event extraction, multi-dimensional similarity matching for schema matching, and Graph Convolutional Networks (GCNs) with logical constraints for prediction, and (3) an interactive interface for visualizing results and incorporating expert feedback to enhance decision-making. Evaluated on 12,070 paragraphs from 365 sources (2022-2023), SHIELD outperforms baseline GCNs and LLM+prompt methods (e.g. GPT-4o) in disruption prediction. These results demonstrate SHIELD’s effectiveness in combining LLM capabilities with domain expertise for enhanced supply chain risk assessment.
Evaluating the quality and consistency of text generated by Large Language Models (LLMs) poses a significant, yet unresolved challenge for industry research. We propose , an automated framework for evaluating and improving the consistency of LLM-generated texts using a divide-conquer-reasoning approach. Unlike existing LLM-based evaluators operating at the paragraph level, our method employs a divide-and-conquer evaluator () that breaks down the paragraph-to-paragraph comparison into sentence-to-paragraph comparisons. To facilitate this approach, we also introduce an automatic metric converter () that translates the output from into an interpretable numeric score. Beyond the consistency evaluation, we further present a reason-assisted improver () that mitigates inconsistencies by leveraging the analytical reasons identified by . Through comprehensive and systematic empirical analysis, we show that our approach outperforms state-of-the-art methods by a large margin (e.g., +16.8% and +32.5% on the SummEval dataset) in consistency evaluation across multiple benchmarks. Our approach also substantially reduces nearly 90% output inconsistencies in one iteration, showing promise for effective hallucination mitigation in real-world industrial applications.