Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback. Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.
To develop trustworthy AI systems, attribution methods aim to identify the input regions that most influence a model's decisions. The primary task of existing attribution methods lies in efficiently and accurately identifying the relationships among input-prediction interactions. Particularly when the input data is discrete, such as images, analyzing the relationship between inputs and outputs poses a significant challenge due to the combinatorial explosion. In this paper, we propose a novel and efficient black-box attribution mechanism, LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. First, to accurately assess interactions, we design a submodular function that quantifies subset importance and effectively captures their impact on decision outcomes. Then, to efficiently rank input sub-regions by their importance for attribution, we improve optimization efficiency through a novel bidirectional greedy search algorithm. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Extensive experiments on eight foundation models demonstrate that our method provides faithful interpretations with fewer regions and exhibits strong generalization, with an average improvement of 36.3% in Insertion and 39.6% in Deletion. Our method also outperforms the naive greedy search in attribution efficiency, being 1.6 times faster. Furthermore, when explaining the reasons behind model prediction errors, the highest confidence achieved by our method is, on average, 86.1% higher than that of state-of-the-art attribution algorithms. The code is available at https://github.com/RuoyuChen10/LIMA.
In this article, we describe the architecture of the LIMA (Libre Multilingual Analyzer) framework and its recent evolution with the addition of new text analysis modules based on deep neural networks. We extended the functionality of LIMA in terms of the number of supported languages while preserving the existing configurable architecture and the availability of previously developed rule-based and statistical analysis components. Models were trained for more than 60 languages on the Universal Dependencies 2.5 corpora, WikiNer corpora, and CoNLL-03 dataset. Universal Dependencies allowed us to increase the number of supported languages and to generate models that could be integrated into other platforms. This integration of ubiquitous Deep Learning Natural Language Processing models and the use of standard annotated collections using Universal Dependencies can be viewed as a new path of interoperability, through the normalization of models and data, that is complementary to a more standard technical interoperability, implemented in LIMA through services available in Docker containers on Docker Hub.
This paper tries to determine the origin of springs on the Costa Verde beach, located in the districts of Barranco, Miraflores and Magdalena, province of Lima, Peru. These springs emerge near the shoreline, from the lower layers of an 80-meter-high cliff. They have survived the urbanization of agricultural land, which started in the early 1970s, lowered the water table of the Lima aquifer, and eliminated the water seepage from the cliffs. To identify the source of the springs, isotopic, physical, chemical and bacteriological analyses were carried out on samples from five springs. The isotopic concentrations in waters from the Costa Verde springs are depleted compared to those obtained for waters of the Lima aquifer, which is recharged by infiltration of the Rimac River. The measured values of those concentrations suggest that water from the Costa Verde springs comes from direct recharge in the upper and middle basin, due to infiltration of rainfall or river water at an altitude of about 3600 m. Conductivity and temperature, measured in situ, are similar to those obtained for the Lima aquifer. The laboratory analysis showed no significant levels of total or fecal coliforms, ruling out possible leakage from the Lima sewerage system.
In this paper, we present a novel and general network structure for accelerating the inference process of convolutional neural networks, which is more complicated in network structure yet has lower inference complexity. The core idea is to equip each original convolutional layer with another low-cost collaborative layer (LCCL), and the element-wise multiplication of the ReLU outputs of these two parallel layers produces the layer-wise output. The combined layer is potentially more discriminative than the original convolutional layer, and its inference is faster for two reasons: 1) the zero cells of the LCCL feature maps will remain zero after element-wise multiplication, and thus it is safe to skip the calculation of the corresponding high-cost convolution in the original convolutional layer, and 2) the LCCL is very fast if it is implemented as a 1×1 convolution or only a single filter shared by all channels. Extensive experiments on the CIFAR-10, CIFAR-100 and ILSVRC-2012 benchmarks show that our proposed network structure can accelerate the inference process by 32% on average with a negligible performance drop.
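The skipping argument in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `lccl_layer` and `dense_reference`, the single-filter setup, the "valid" padding, and the centre-cropping used to align the 1×1 branch with the k×k branch are all assumptions made for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def lccl_layer(x, w_full, w_cheap):
    """One output channel of a combined layer (illustrative sketch).

    x:       input of shape (C, H, W)
    w_full:  high-cost k x k filter of shape (C, k, k), k odd, 'valid' padding
    w_cheap: low-cost 1x1 filter of shape (C,)  (hypothetical LCCL branch)
    Returns the output map and the number of skipped high-cost evaluations.
    """
    k = w_full.shape[-1]
    off = k // 2
    out_h, out_w = x.shape[1] - k + 1, x.shape[2] - k + 1
    # Cheap branch: 1x1 convolution + ReLU, cropped to align with the k x k output.
    cheap = relu(np.tensordot(w_cheap, x, axes=(0, 0)))
    cheap = cheap[off:off + out_h, off:off + out_w]
    out = np.zeros((out_h, out_w))
    skipped = 0
    for i in range(out_h):
        for j in range(out_w):
            if cheap[i, j] == 0.0:
                # The product is zero regardless of the expensive response: skip it.
                skipped += 1
                continue
            full = np.sum(x[:, i:i + k, j:j + k] * w_full)
            out[i, j] = cheap[i, j] * max(full, 0.0)
    return out, skipped

def dense_reference(x, w_full, w_cheap):
    """Same layer without skipping, used only to check equivalence."""
    k = w_full.shape[-1]
    off = k // 2
    out_h, out_w = x.shape[1] - k + 1, x.shape[2] - k + 1
    full = np.array([[np.sum(x[:, i:i + k, j:j + k] * w_full)
                      for j in range(out_w)] for i in range(out_h)])
    cheap = relu(np.tensordot(w_cheap, x, axes=(0, 0)))
    cheap = cheap[off:off + out_h, off:off + out_w]
    return relu(full) * cheap
```

Calling `lccl_layer` on random inputs and comparing against `dense_reference` shows that the skipped positions do not change the output; the actual speed-up depends on how sparse the cheap branch is after ReLU.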
The 8 million inhabitants of the coastal city of Lima are supplied with water from the Rimac and Chillón rivers and from water wells in the Lima aquifer. Historical records of Rimac River flow and of static water levels in wells are correlated in order to calculate the residence time of water, from the moment the aquifer is recharged by the Rimac River until the water reaches a well located 12 km farther, in the Miraflores district near the sea. Relative abundances of 2H and 18O are used to identify the origins of waters from those wells. 3H and 14C contents are used to estimate the ages of the waters.
We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (>100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from previous SFT-based models' 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches. LIMO demonstrates exceptional out-of-distribution generalization, achieving 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model's encoded knowledge foundation during pre-training, and (2) the effectiveness of post-training examples as "cognitive templates" that show the model how to utilize its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at https://github.com/GAIR-NLP/LIMO.
A remarkable feat of active matter physics is that systems as diverse as collections of self-propelled particles, nematics mixed with molecular motors, and interacting robots can all be described by symmetry-based continuum theories. These descriptions rely on reducing complex effects of individual motors to a few key active parameters, which increase with activity. Here we discover a striking anomaly in the continuum description of non-reciprocal active solids, a ubiquitous class of active materials. We find that as microscopic activity increases, macroscale active response can vanish: more is less. In this highly active regime, non-affine and localized modes prevail and destroy the large-scale signature of microscopic activity. These modes exist in any dilute periodic structure and emerge in random lattices below a percolation transition. Our results unveil a counterintuitive facet of active matter, offering new principles for engineering materials far from equilibrium.
We study the breaking of integrability by a finite density of dilute impurities, specifically the emerging diffusive transport. Provided the distance between impurities (localized perturbations) is large, one would expect that the scattering rates are additive, and therefore, the resistivity is proportional to the number of impurities (the so-called Matthiessen's rule). We show that this is, in general, not the case. If transport is anomalous in the original integrable system without impurities, the diffusion constant in the non-integrable system at low impurity density gets a nontrivial power-law dependence on the impurity density, with the power being determined by the dynamical scaling exponent of anomalous transport. We also find a regime at high impurity density in which, counterintuitively, adding more impurities to an already diffusive system increases transport rather than decreases it.
Radar offers the advantage of providing additional physical properties related to observed objects. In this study, we design a physical-enhanced radar-inertial odometry system that capitalizes on the Doppler velocities and radar cross-section information. The filter for static radar points, correspondence estimation, and residual functions are all strengthened by integrating the physical properties. We conduct experiments on both public datasets and our self-collected data, with different mobile platforms and sensor types. Our quantitative results demonstrate that the proposed radar-inertial odometry system outperforms alternative methods using the physical-enhanced components. Our findings also reveal that using the physical properties results in fewer radar points for odometry estimation, but the performance is still guaranteed and even improved, thus aligning with the "less is more" principle.
Ensemble techniques for classification and clustering have long proven effective, yet anomaly ensembles have been barely studied. In this work, we tap into this gap and propose a new ensemble approach for anomaly mining, with application to event detection in temporal graphs. Our method aims to combine results from heterogeneous detectors with varying outputs, and leverage the evidence from multiple sources to yield better performance. However, trusting all the results may deteriorate the overall ensemble accuracy, as some detectors may fall short and provide inaccurate results depending on the nature of the data at hand. This suggests that being selective in which results to combine is vital in building effective ensembles; hence "less is more". In this paper we propose SELECT, an ensemble approach for anomaly mining that employs novel techniques to automatically and systematically select the results to assemble in a fully unsupervised fashion. We apply our method to event detection in temporal graphs, where SELECT successfully utilizes five base detectors and seven consensus methods under a unified ensemble framework. We provide extensive quantitative evaluation of our approach on five real-world datasets (four with ground truth), including Enron email communications, New York Times news corpus, and World Cup 2014 Twitter news feed. Thanks to its selection mechanism, SELECT yields superior performance compared to individual detectors alone, the full ensemble (naively combining all results), and an existing diversity-based ensemble.
Synthetic training data generation with Large Language Models (LLMs) like Google's Gemma and OpenAI's GPT offers a promising solution to the challenge of obtaining large, labeled datasets for training classifiers. When rapid model deployment is critical, such as in classifying emerging social media trends or combating new forms of online abuse tied to current events, the ability to generate training data is invaluable. While prior research has examined the comparability of synthetic data to human-labeled data, this study introduces a novel sampling algorithm, based on the maximum coverage problem, to select a representative subset from a synthetically generated dataset. Our results demonstrate that training a classifier on this contextually sampled subset achieves superior performance compared to training on the entire dataset. This "less is more" approach not only improves accuracy but also reduces the volume of data required, leading to potentially more efficient model fine-tuning.
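For readers unfamiliar with the maximum coverage problem mentioned above, the following is a minimal sketch of the standard greedy heuristic. The abstract does not specify the paper's sampling algorithm, so this is only the textbook technique; the function name `greedy_max_coverage` and the idea of describing each sample by a set of covered "features" are assumptions for illustration.

```python
def greedy_max_coverage(candidates, k):
    """Pick up to k candidates that greedily maximize the number of covered elements.

    candidates: dict mapping a sample id to the set of elements it covers
                (hypothetical representation, e.g. tokens or topic ids).
    Returns the chosen ids in selection order and the set of covered elements.
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    for _ in range(k):
        # Candidate with the largest marginal gain; ties go to the earliest inserted.
        best = max(remaining, key=lambda name: len(remaining[name] - covered),
                   default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left, or no candidate adds new coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


samples = {
    "s1": {"a", "b", "c"},
    "s2": {"c", "d"},
    "s3": {"a", "b"},
    "s4": {"e"},
}
subset, covered = greedy_max_coverage(samples, k=2)
```

In this toy run, `s3` is never selected because everything it covers is already covered by `s1`, which is the sense in which a smaller subset can remain representative of the full set.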
Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works can be prohibitively expensive due to the quadratic complexity of self-attention over a long sequence of representations, especially for high-resolution dense prediction tasks. To this end, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that the early self-attention layers in Transformers still focus on local patterns and bring minor benefits in recent hierarchical vision Transformers. Specifically, we propose a hierarchical Transformer where we use pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages while applying self-attention modules to capture longer dependencies in deeper layers. Moreover, we further propose a learned deformable token merging module to adaptively fuse informative patches in a non-uniform manner. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation, serving as a strong backbone for many vision tasks. Code is available at: https://github.com/zhuang-group/LIT
We exhibit the intriguing phenomenon of "Less is More" using a set of multipartite entangled states. We consider the quantum communication protocols for exact teleportation, superdense coding, and quantum key distribution. We find that sometimes less entanglement is more useful. To understand this phenomenon, we obtain a condition that a resource state must satisfy to communicate an n-qubit pure state with m terms. We find that an appropriate partition of the resource state should have a von Neumann entropy of $\log_2 m$. Furthermore, it is shown that some states may be suitable for exact superdense coding, but not for exact teleportation.
Surgical phase recognition is a fundamental task in computer-assisted surgery systems. Most existing works rely on the supervision of expensive and time-consuming full annotations, which require surgeons to watch videos repeatedly to find the precise start and end time of a surgical phase. In this paper, we introduce timestamp supervision for surgical phase recognition to train the models with timestamp annotations, where the surgeons are asked to identify only a single timestamp within the temporal boundary of a phase. This annotation can significantly reduce the manual annotation cost compared to full annotations. To make full use of such timestamp supervision, we propose a novel method called uncertainty-aware temporal diffusion (UATD) to generate trustworthy pseudo labels for training. Our proposed UATD is motivated by a property of surgical videos, i.e., the phases are long events consisting of consecutive frames. To be specific, UATD diffuses the single labelled timestamp to its corresponding high-confidence (i.e., low-uncertainty) neighbour frames in an iterative way. Our study uncovers unique insights into surgical phase recognition with timestamp supervision: 1) timestamp annotation can reduce annotation time by 74% compared with full annotation, and surgeons tend to annotate timestamps near the middle of phases; 2) extensive experiments demonstrate that our method can achieve competitive results compared with full supervision methods, while reducing manual annotation cost; 3) less is more in surgical phase recognition, i.e., fewer but discriminative pseudo labels outperform full labels that contain ambiguous frames; 4) the proposed UATD can be used as a plug-and-play method to clean ambiguous labels near boundaries between phases, and improve the performance of current surgical phase recognition methods.
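The diffusion step described above can be sketched roughly as follows. This is not the UATD implementation: the abstract gives no details about how uncertainty is estimated or updated, so the sketch assumes a fixed per-frame uncertainty array and a single hypothetical `threshold`, and the name `diffuse_timestamp_labels` is invented for illustration.

```python
import numpy as np

def diffuse_timestamp_labels(timestamps, uncertainty, threshold):
    """Spread each annotated timestamp label to adjacent low-uncertainty frames.

    timestamps:  dict mapping a frame index to its phase label (one per phase)
    uncertainty: 1-D array with one (assumed, fixed) uncertainty value per frame
    threshold:   frames with uncertainty >= threshold are never pseudo-labelled
    Returns an int array of pseudo labels, with -1 for frames left unlabelled.
    """
    n = len(uncertainty)
    labels = np.full(n, -1, dtype=int)
    for frame, phase in timestamps.items():
        labels[frame] = phase
    while True:
        new = labels.copy()
        for i in range(n):
            if labels[i] != -1 or uncertainty[i] >= threshold:
                continue
            left = labels[i - 1] if i > 0 else -1
            right = labels[i + 1] if i < n - 1 else -1
            if left != -1 and right != -1 and left != right:
                continue  # ambiguous boundary frame: leave it unlabelled
            if left != -1:
                new[i] = left
            elif right != -1:
                new[i] = right
        if np.array_equal(new, labels):
            return labels  # no frame changed in this pass: diffusion has converged
        labels = new


uncertainty = np.array([0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1])
pseudo = diffuse_timestamp_labels({2: 0, 7: 1}, uncertainty, threshold=0.5)
```

In this toy example the two uncertain frames in the middle stay unlabelled, which mirrors the abstract's point that fewer but cleaner pseudo labels can be preferable to labelling every frame near a phase boundary.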
With the remarkable success achieved by Multimodal Large Language Models (MLLMs), numerous benchmarks have been designed to assess MLLMs' ability to guide their development in image perception tasks (e.g., image captioning and visual question answering). However, the existence of numerous benchmarks results in a substantial computational burden when evaluating model performance across all of them. Moreover, these benchmarks contain many overly simple problems or challenging samples, which do not effectively differentiate the capabilities among various MLLMs. To address these challenges, we propose a pipeline to process the existing benchmarks, which consists of two modules: (1) Semi-Automated Screening Process and (2) Eliminating Answer Leakage. The Semi-Automated Screening Process filters out samples that cannot distinguish the model's capabilities by synthesizing various MLLMs and manually evaluating them. The Eliminating Answer Leakage module filters samples whose answers can be inferred without images. Finally, we curate LIME-M: Less Is More for Evaluation of Multimodal LLMs, a lightweight multimodal benchmark that can more effectively evaluate the performance of different models. Our experiments demonstrate that: LIME-M can better distinguish the performance of different MLLMs with fewer samples (24% of the original) and reduced time (23% of the original); LIME-M eliminates answer leakage, focusing mainly on the information within images; and the current automatic metric (i.e., CIDEr) is insufficient for evaluating MLLMs' capabilities in captioning. Moreover, removing the caption task score when calculating the overall score provides a more accurate reflection of model performance differences. All our code and data are released at https://github.com/kangreen0210/LIME-M.
The XLLM@ACL2025 Shared Task-III formulates a low-resource structural reasoning task that challenges LLMs to generate interpretable, step-by-step rationales with minimal labeled data. We present Less is More, the third-place winning approach in the XLLM@ACL2025 Shared Task-III, which focuses on structured reasoning from only 24 labeled examples. Our approach leverages a multi-agent framework with reverse-prompt induction, retrieval-augmented reasoning synthesis via GPT-4o, and dual-stage reward-guided filtering to distill high-quality supervision across three subtasks: question parsing, CoT parsing, and step-level verification. All modules are fine-tuned from Meta-Llama-3-8B-Instruct under a unified LoRA+ setup. By combining structure validation with reward filtering across few-shot and zero-shot prompts, our pipeline consistently improves structure reasoning quality. These results underscore the value of controllable data distillation in enhancing structured inference under low-resource constraints. Our code is available at https://github.com/Jiahao-Yuan/Less-is-More.
Multiparty session types (MPST) provide a type discipline where a programmer or architect specifies a whole view of communications as a global protocol, and each distributed program is locally type-checked against its end-point projection. Ten years after the birth of MPST, Scalas and Yoshida discovered that the proofs of type safety in the literature which use the end-point projection with mergeability are flawed. Following their paper, researchers wrongly believed that the end-point projection (with mergeability) was unsound. We correct this misunderstanding, proposing a new general proof technique for type soundness of multiparty session π-calculus, which uses an association relation between a global type and its end-point projection.
The rapid growth of encryption has significantly enhanced privacy and security while posing challenges for network traffic classification. Recent approaches address these challenges by transforming network traffic into text or image formats to leverage deep-learning models originally designed for natural language processing and computer vision. However, these transformations often contradict network protocol specifications, introduce noisy features, and result in resource-intensive processes. To overcome these limitations, we propose NetMatrix, a minimalistic tabular representation of network traffic that eliminates noisy attributes and focuses on meaningful features leveraging RFC (Request for Comments) definitions. By combining NetMatrix with a vanilla XGBoost classifier, we implement a lightweight approach, LiM ("Less is More"), that achieves classification performance on par with state-of-the-art methods such as ET-BERT and YaTC. Compared to selected baselines, experimental evaluations demonstrate that LiM improves resource consumption by orders of magnitude. Overall, this study underscores the effectiveness of simplicity in traffic representation and machine learning model selection, paving the way towards resource-efficient network traffic classification.
Preferences play an important role in our everyday lives. CP-networks, or CP-nets for short, are graphical models for representing conditional qualitative preferences under ceteris paribus ("all else being equal") assumptions. Despite their intuitive nature and rich representation, dominance testing with CP-nets is computationally complex, even when the CP-nets are restricted to binary-valued preferences. Tractable algorithms exist for binary CP-nets, but these algorithms are incomplete for multi-valued CP-nets. In this paper, we identify a class of multi-valued CP-nets, which we call more-or-less CP-nets, that have the same computational complexity as binary CP-nets. More-or-less CP-nets exploit the monotonicity of the attribute values and use intervals to aggregate values that induce similar preferences. We then present a search control rule for dominance testing that effectively prunes the search space while preserving completeness.
Driven by advances in self-supervised learning for speech, state-of-the-art synthetic speech detectors have achieved low error rates on popular benchmarks such as ASVspoof. However, prior benchmarks do not address the wide range of real-world variability in speech. Are reported error rates realistic in real-world conditions? To assess detector failure modes and robustness under controlled distribution shifts, we introduce ShiftySpeech, a benchmark with more than 3000 hours of synthetic speech from 7 domains, 6 TTS systems, 12 vocoders, and 3 languages. We found that all distribution shifts degraded model performance, and contrary to prior findings, training on more vocoders, speakers, or with data augmentation did not guarantee better generalization. In fact, we found that training on less diverse data resulted in better generalization, and that a detector fit using samples from a single carefully selected vocoder and a single speaker achieved state-of-the-art results on the challenging In-the-Wild benchmark.
The spread of disinformation poses a significant threat to societal well-being. We analyze this phenomenon using an evolutionary game theory model of the sender-receiver game, where senders aim to mislead receivers and receivers aim to discern the truth. Using a combination of replicator equations, finite-size scaling analysis, and extensive Monte Carlo simulations, we investigate the long-term evolutionary dynamics of this game. Our central finding is a counterintuitive threshold phenomenon: the role (sender or receiver) with the larger difference in payoffs between successful and unsuccessful interactions is surprisingly more likely to lose in the long run. We show that this effect is robust across different parameter values and arises from the interplay between the relative speeds of evolution of the two roles and the ability of the slower evolving role to exploit the fixed strategy of the faster evolving role. Moreover, for finite populations we find that the initially less frequent strategy of the slower role is more likely to fixate in the population. The initially rarer strategy in the less-rewarded role is, paradoxically, more likely to prevail.
Compressed mass spectra are generally more difficult to identify than spectra with large splittings. In particular, gluino pair production with four high energy top or bottom quarks leaves a striking signature in a detector. However, if any of the mass splittings are compressed, the power of traditional techniques may deteriorate. Searches for direct stop/sbottom pair production can fill in the gaps. As a demonstration, we show that for $\tilde{g} \to t \tilde{t}_1$ and $m_{\tilde{t}_1} \sim m_{\tilde{\chi}^0_1}$, limits on the stop mass at 8 TeV can be extended by at least 300 GeV for a 1.1 TeV gluino using a $pp \to \tilde{t}_1 \tilde{t}_1$ search. At 13 TeV, the effective cross section for the gluino mediated process is twice the direct stop/sbottom pair production cross section, suggesting that direct stop/sbottom searches could be sensitive to discover new physics earlier than expected.
Thyroid nodule classification aims at determining whether a nodule is benign or malignant based on a given ultrasound image. However, the label obtained by cytological biopsy, which is the gold standard in clinical medicine, is not always consistent with the ultrasound imaging TI-RADS criteria. The information difference between the two causes existing deep learning-based classification methods to be indecisive. To solve the Inconsistent Label problem, we propose an Adaptive Curriculum Learning (ACL) framework, which adaptively discovers and discards the samples with inconsistent labels. Specifically, ACL takes both hard samples and model certainty into account, and can accurately determine the threshold to distinguish the samples with Inconsistent Label. Moreover, we contribute TNCD: a Thyroid Nodule Classification Dataset to facilitate future related research on thyroid nodules. Extensive experimental results on TNCD based on three different backbone networks not only demonstrate the superiority of our method but also show that the less-is-more principle, which strategically discards the samples with Inconsistent Label, can yield performance gains. Source code and data are available at https://github.com/chenghui-666/ACL/.
Assessing the factual consistency of automatically generated texts in relation to source context is crucial for developing reliable natural language generation applications. Recent literature proposes AlignScore, which uses a unified alignment model to evaluate factual consistency and substantially outperforms previous methods across many benchmark tasks. In this paper, we take a closer look at the datasets used in AlignScore and uncover an unexpected finding: utilizing a smaller number of data points can actually improve performance. We process the original AlignScore training dataset to remove noise, augment it with robustness-enhanced samples, and utilize a subset comprising 10% of the data to train an improved factual consistency evaluation model, which we call LIM-RA (Less Is More for Robust AlignScore). LIM-RA demonstrates superior performance, consistently outperforming AlignScore and other strong baselines like ChatGPT across four benchmarks (two utilizing traditional natural language generation datasets and two focused on large language model outputs). Our experiments show that LIM-RA achieves the highest score on 24 of the 33 test datasets, while staying competitive on the rest, establishing a new state of the art.