2024-10-29 | | Total: 21
Large language models (LLMs) have had a significant impact on both intelligence and productivity. In recent years, there has been a surge in the introduction of both commercial and open-source LLMs. Many businesses have adopted LLMs into their applications to solve their own domain-specific tasks. However, integrating LLMs into specific business scenarios requires more than just utilizing the models themselves. Instead, it is a systematic process that involves a substantial number of components, which are collectively referred to as the LLM supply chain. The LLM supply chain inherently carries risks. Therefore, it is essential to understand the types of components that may be introduced into the supply chain and the associated risks, enabling different stakeholders to implement effective mitigation measures. While some literature touches on risks associated with the LLM supply chain, there is currently no paper that explicitly defines its scope, identifies inherent risks, and examines potential mitigation strategies. As LLMs have become essential infrastructure in the new era, we believe that a thorough review of the LLM supply chain, along with its inherent risks and mitigation strategies, would be valuable for industry practitioners to avoid potential damages and losses, and enlightening for academic researchers to rethink existing approaches and explore new avenues of research. Our paper provides a comprehensive overview of the LLM supply chain, detailing the stakeholders, composing artifacts, and the supplying types. We developed taxonomies of risk types, risky actions, and mitigations related to various supply chain stakeholders and components. In summary, our work explores the technical and operational aspects of the LLM supply chain, offering valuable insights for researchers and engineers in the evolving LLM landscape.
Software testing is an essential part of the software development cycle to improve code quality. Typically, a unit test consists of a test prefix and a test oracle which captures the developer's intended behaviour. A known limitation of traditional test generation techniques (e.g. Randoop and Evosuite) is that they produce test oracles that capture the actual program behaviour rather than the expected one. Recent approaches leverage Large Language Models (LLMs), trained on an enormous amount of data, to generate developer-like code and test cases. We investigate whether the LLM-generated test oracles capture the actual or expected software behaviour. We thus conduct a controlled experiment to answer this question, by studying LLMs' performance on two tasks, namely, test oracle classification and generation. The study includes developer-written and automatically generated test cases and oracles for 24 open-source Java repositories, and different well-tested prompts. Our findings show that LLM-based test generation approaches are also prone to generating oracles that capture the actual program behaviour rather than the expected one. Moreover, LLMs are better at generating test oracles than at classifying the correct ones, and can generate better test oracles when the code contains meaningful test or variable names. Finally, LLM-generated test oracles have higher fault detection potential than the Evosuite ones.
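The difference between an oracle that captures actual behaviour and one that captures expected behaviour can be illustrated with a minimal sketch. The function `percent_of`, its bug, and both oracle helpers are hypothetical and are not taken from the study, which targets Java repositories; Python is used here only for brevity.

```python
def percent_of(part: int, whole: int) -> int:
    """Intended: return `part` as a whole-number percentage of `whole`.

    Hypothetical bug: integer division happens before scaling,
    so any proper fraction collapses to 0.
    """
    return (part // whole) * 100


def actual_behaviour_oracle_holds() -> bool:
    # Oracle derived from the observed output: it encodes the buggy result.
    return percent_of(1, 4) == 0


def expected_behaviour_oracle_holds() -> bool:
    # Oracle derived from the intended behaviour: 1 of 4 is 25 percent.
    return percent_of(1, 4) == 25
```

In this sketch the first oracle passes on the buggy code and therefore hides the fault, while the second one fails and exposes it.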
LLMs can be used in a variety of code-related tasks such as translating from one programming language to another, implementing natural language requirements, and code summarization. Artifacts generated by state-of-the-art LLM technology are expected to be useful in the sense that a user will be able to use the LLM-generated artifact after a small number of easy modifications. Quantifying this vague notion is challenging, and it is thus hard to determine the quality of code-related LLM solutions. We refer to evaluation of LLM solutions using LLM judgment as "LLM as a Judge", or LaaJ for short. In this work we introduce a methodology to generate and evaluate LaaJ implementations, utilizing an automatically generated benchmark. The purpose of the benchmark is twofold: it is used both to develop and validate the LaaJs and to validate and test the LLM code-related solution using the LaaJs. To that end, we developed an automated benchmark generation engine, which generates code in multiple programming languages for multiple code-related tasks and which serves as the input for LaaJ evaluation. We utilize a graph representation, G, of the potential code-related generations. The graph vertices are generated artifacts and edges represent possible generations, e.g., the generation of a Java program from its natural language requirements. Utilizing a chain of LLM agents and G, we generate code-related artifacts. Using cycles in G, we formulate expectations on the generated artifacts. Taking advantage of these formulated expectations enables the development and testing of reliable LLM judgement for usefulness of the artifacts generated by the solution. Our approach enables the creation of high-quality code task solutions.
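The role of cycles in G can be sketched as follows. This is an illustration under assumptions, not the authors' engine: the vertex names, the edges, and the helper `cycles_from` are all hypothetical, and the idea that a cycle lets the final artifact be compared with the starting one is one plausible reading of "expectations".

```python
from typing import Dict, List

# Hypothetical graph: vertices are artifact kinds, edges are generation tasks.
G: Dict[str, List[str]] = {
    "requirements": ["java"],
    "java": ["python", "summary"],
    "python": ["java"],
    "summary": [],
}


def cycles_from(graph: Dict[str, List[str]], start: str) -> List[List[str]]:
    """Return every simple cycle that starts and ends at `start`."""
    found: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        for nxt in graph.get(node, []):
            if nxt == start:
                # Closed the loop: record the full round trip.
                found.append(path + [start])
            elif nxt not in path:
                walk(nxt, path + [nxt])

    walk(start, [start])
    return found
```

Here `cycles_from(G, "java")` yields the single round trip java, python, java, so a Java program translated to Python and back could be checked against the original.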
The landscape of computing technologies is changing rapidly, straining existing software engineering practices and tools. The growing need to produce and maintain increasingly complex multi-architecture applications makes it crucial to effectively accelerate and automate software engineering processes. At the same time, artificial intelligence (AI) tools are expected to work hand-in-hand with human developers. Therefore, it becomes critical to model the software accurately, so that the AI and humans can share a common understanding of the problem. In this contribution, firstly, an in-depth overview of these interconnected challenges faced by modern software engineering is presented. Secondly, to tackle them, a novel architecture based on the emerging WebAssembly technology and the latest advancements in neuro-symbolic AI, autonomy, and knowledge graphs is proposed. The presented system architecture is based on the concept of dynamic, knowledge graph-based WebAssembly Twins, which model the software throughout all stages of its lifecycle. The resulting systems are to possess advanced autonomous capabilities, with full transparency and controllability by the end user. The concept takes a leap beyond the current software engineering approaches, addressing some of the most urgent issues in the field. Finally, the efforts towards realizing the proposed approach as well as future research directions are summarized.
The rise of spatiotemporal data and the need for efficient geospatial modeling have spurred interest in automating these tasks with large language models (LLMs). However, general LLMs often generate errors in geospatial code due to a lack of domain-specific knowledge on functions and operators. To address this, a retrieval-augmented generation (RAG) approach, utilizing an external knowledge base of geospatial functions and operators, is proposed. This study introduces a framework to construct such a knowledge base, leveraging geospatial script semantics. The framework includes: Function Semantic Framework Construction (Geo-FuSE), Frequent Operator Combination Statistics (Geo-FuST), and Semantic Mapping (Geo-FuM). Techniques like Chain-of-Thought, TF-IDF, and the APRIORI algorithm are utilized to derive and align geospatial functions. An example knowledge base, Geo-FuB, built from 154,075 Google Earth Engine scripts, is available on GitHub. Evaluation metrics show a high accuracy, reaching 88.89% overall, with structural and semantic accuracies of 92.03% and 86.79% respectively. Geo-FuB's potential to optimize geospatial code generation through the RAG and fine-tuning paradigms is highlighted.
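The frequent operator combination step can be pictured with a simplified frequent-pair count in the spirit of Apriori. This is a sketch under assumptions, not the Geo-FuST implementation: it only handles pairs, the function `frequent_pairs` is hypothetical, and `toy_scripts` is made-up data in which each script is reduced to the set of operator names it calls.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

# Made-up example data: one set of operator names per script.
toy_scripts: List[Set[str]] = [
    {"filterDate", "median", "clip"},
    {"filterDate", "median"},
    {"filterDate", "normalizedDifference"},
    {"median", "clip"},
]


def frequent_pairs(
    scripts: List[Set[str]], min_support: float
) -> Dict[FrozenSet[str], float]:
    """Return operator pairs whose support is at least `min_support`."""
    n = len(scripts)
    if n == 0:
        return {}
    # Apriori pruning: a pair can only be frequent if both members are.
    single = Counter(op for ops in scripts for op in ops)
    keep = {op for op, count in single.items() if count / n >= min_support}
    pairs: Counter = Counter()
    for ops in scripts:
        for a, b in combinations(sorted(ops & keep), 2):
            pairs[frozenset((a, b))] += 1
    return {p: c / n for p, c in pairs.items() if c / n >= min_support}
```

With `min_support=0.5`, the toy data keeps the pairs (`clip`, `median`) and (`filterDate`, `median`), each appearing in half of the scripts.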
Logic programs are a powerful approach for solving NP-Hard problems. However, due to their declarative nature, debugging logic programs poses significant challenges. Unlike procedural paradigms, which allow for step-by-step inspection of program state, logic programs require reasoning about logical statements for fault localization. This complexity is amplified in learning environments due to students' inexperience. We introduce FormHe, a novel tool that combines logic-based techniques and Large Language Models to identify and correct issues in Answer Set Programming submissions. FormHe consists of two components: a fault localization module and a program repair module. First, the fault localizer identifies a set of faulty program statements requiring modification. Subsequently, FormHe employs program mutation techniques and Large Language Models to repair the flawed ASP program. These repairs can then serve as guidance for students to correct their programs. Our experiments with real buggy programs submitted by students show that FormHe accurately detects faults in 94% of cases and successfully repairs 58% of incorrect submissions.
Understanding collaboration patterns in introductory programming courses is essential, as teamwork is a critical skill in computer science. In professional environments, software development relies on effective teamwork, navigating diverse perspectives, and contributing to shared goals. This paper offers a comprehensive analysis of the factors influencing team efficiency and project success, providing actionable insights to enhance the effectiveness of collaborative programming education. By analyzing version control data, survey responses, and performance metrics, the study highlights the collaboration trends that emerge as first-semester students develop a 2D game project. Results indicate that students often slightly overestimate their contributions, with more engaged individuals more likely to acknowledge mistakes. Team performance shows no significant variation based on nationality or gender composition, though teams that disbanded frequently consisted of lone wolves, highlighting collaboration challenges and the need for strengthened teamwork skills. Presentations closely reflected individual project contributions, with active students excelling in evaluative questioning and performing better on the final exam. Additionally, the complete absence of plagiarism underscores the effectiveness of proactive academic integrity measures, reinforcing honest collaboration in educational settings.
Producing code of good quality is an essential skill in software development. Code quality is an aspect of software quality that concerns the directly observable properties of code, such as decomposition, modularization, and code flow. Code quality can often be improved by means of code refactoring -- an internal change made to code that does not alter its observable behavior. According to the ACM/IEEE-CS/AAAI Computer Science Curricula 2023, code refactoring and code quality are core topics in software engineering education. However, studies show that students often produce code with persistent quality issues. Therefore, it is important to understand what problems students experience when trying to identify and fix code quality issues. In a prior study, we identified a number of student misconceptions in method-level code refactoring. In this paper, we present the findings from a think-aloud study conducted to investigate what students think when working on method-level refactoring exercises. We use grounded theory to identify and classify student reasoning. As a result of the analysis, we identify a set of eight reasons given by students to refactor code, which concern either the presence of code quality issues, the improvement of software quality attributes, or code semantics. We also analyze which quality issues are identified by students, and to which reasons these quality issues are related. We found that experienced students more often reason about code quality attributes than point at a problem they see in the code. Students were able to remove code quality issues in most cases. However, they often overlooked particular issues, such as the presence of a method with multiple responsibilities or the use of a less suitable loop structure.
In this paper, we present a remote verification environment for Mizar and its integration with a web platform. Although a VSCode extension for Mizar is already available, it requires installing the Mizar verification tools locally. Our newly developed system implements these verification environments on a server, eliminating this requirement. First, we explain the implementation of the remote verification environment for Mizar and the VSCode for the Web extension. Second, we discuss the integration with the web platform emwiki, which allows browsing the existing Mizar Mathematical Library (MML).
The rapid expansion of foundation models (FMs), such as large language models (LLMs), has given rise to FMware--software systems that integrate FMs as core components. While building demonstration-level FMware is relatively straightforward, transitioning to production-ready systems presents numerous challenges, including reliability, high implementation costs, scalability, and compliance with privacy regulations. This paper provides a thematic analysis of the key obstacles in productionizing FMware, synthesized from industry experience and diverse data sources, including hands-on involvement in the Open Platform for Enterprise AI (OPEA) and FMware lifecycle engineering. We identify critical issues in FM selection, data and model alignment, prompt engineering, agent orchestration, system testing, and deployment, alongside cross-cutting concerns such as memory management, observability, and feedback integration. We discuss needed technologies and strategies to address these challenges and offer guidance on how to enable the transition from demonstration systems to scalable, production-ready FMware solutions. Our findings underscore the importance of continued research and multi-industry collaboration to advance the development of production-ready FMware.
To identify security vulnerabilities in Android applications, numerous static application security testing (SAST) tools have been proposed. However, assessing their overall performance on diverse vulnerability types is non-trivial and poses considerable challenges. Firstly, the absence of a unified evaluation platform for defining and describing tools' supported vulnerability types, coupled with the lack of normalization for the intricate and varied reports generated by different tools, significantly adds to the complexity. Secondly, there is a scarcity of adequate benchmarks, particularly those derived from real-world scenarios. To address these problems, we are the first to propose a unified platform named VulsTotal, supporting various vulnerability types and enabling comprehensive and versatile analysis across diverse SAST tools. Specifically, we begin by meticulously selecting 11 free and open-source SAST tools from a pool of 97 existing options, adhering to clearly defined criteria. After that, we invest significant effort in comprehending the detection rules of each tool, subsequently unifying 67 general/common vulnerability types for Android SAST tools. We also redefine and implement a standardized reporting format, ensuring uniformity in presenting results across all tools. Additionally, to mitigate the problem of benchmarks, we conducted a manual analysis of a large number of CVEs to construct a new CVE-based benchmark based on our comprehension of Android app vulnerabilities. Leveraging the evaluation platform, which integrates both existing synthetic benchmarks and newly constructed CVE-based benchmarks from this study, we conducted a comprehensive analysis to evaluate and compare these selected tools from various perspectives, such as general vulnerability type coverage, type consistency, tool effectiveness, and time performance.
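Report normalization of the kind described above might look like the following minimal sketch. Everything in it is an assumption for illustration: the tool names `tool_a` and `tool_b`, their record fields, the rule names, and the unified label `weak-cryptography` are hypothetical and do not come from VulsTotal.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Finding:
    tool: str
    vuln_type: str  # unified vulnerability type label
    location: str


# Hypothetical mapping from tool-specific rule names to unified type labels.
TYPE_MAP: Dict[str, Dict[str, str]] = {
    "tool_a": {"WEAK_HASH": "weak-cryptography"},
    "tool_b": {"crypto.md5": "weak-cryptography"},
}


def normalize(tool: str, raw: List[Dict[str, str]]) -> List[Finding]:
    """Convert one tool's raw records into unified findings."""
    out: List[Finding] = []
    for rec in raw:
        # The two hypothetical tools name their fields differently.
        rule = rec.get("rule") or rec.get("id", "")
        where = rec.get("file") or rec.get("path", "")
        unified = TYPE_MAP.get(tool, {}).get(rule, "unmapped")
        out.append(Finding(tool, unified, where))
    return out
```

Once both tools emit `Finding` records with the same type label, their results can be compared directly, for example by grouping on `vuln_type` and `location`.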
Case studies have shown that software disasters snowball from technical issues to catastrophes through humans covering up problems rather than addressing them, and empirical research has found the psychological safety of software engineers to discuss and address problems to be foundational to improving project success. However, the failure to do so can be attributed to psychological factors like loss aversion. We conduct a large-scale study of the project success experiences of 600 software engineers in the UK and USA. Empirical evaluation finds that approaches like ensuring clear requirements before the start of development, when loss aversion is at its lowest, correlate with 97% higher project success. The freedom of software engineers to discuss and address problems correlates with 87% higher success rates. The findings support the development of software development methodologies with a greater focus on human factors in preventing failure.
Application Programming Interfaces (APIs) are essential tools for social work researchers aiming to harness advanced technologies like Large Language Models (LLMs) and other AI services. This paper demystifies APIs and illustrates how they can enhance research methodologies. It provides an overview of API functionality and integration into research workflows, addressing common barriers for those without programming experience. The paper offers a technical breakdown of code and procedures for using APIs, focusing on connecting to LLMs and leveraging them to facilitate API connections. Practical code examples demonstrate how LLMs can generate API code for accessing specialized services, such as extracting data from unstructured text. Emphasizing data security, privacy considerations, and ethical concerns, the paper highlights the importance of careful data handling when using APIs. By equipping researchers with these tools and knowledge, the paper aims to expand the impact of social work research through the effective incorporation of AI technologies.
Large Language Models (LLMs) can interact with the real world by connecting with versatile external APIs, resulting in better problem-solving and task automation capabilities. Previous research primarily focuses on APIs with limited arguments from a single source or overlooks the complex dependency relationship between different APIs. However, it is essential to utilize multiple APIs collaboratively from various sources (e.g., different apps on an iPhone), especially for complex user instructions. In this paper, we introduce AppBench, the first benchmark to evaluate LLMs' ability to plan and execute multiple APIs from various sources in order to complete the user's task. Specifically, we consider two significant challenges in multiple APIs: 1) graph structures: some APIs can be executed independently while others need to be executed one by one, resulting in graph-like execution order; and 2) permission constraints: which source is authorized to execute the API call. We report experimental results on 9 distinct LLMs; e.g., GPT-4o achieves only a 2.0% success rate on the most complex instruction, revealing that the existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning and finetuning. Our code and data are publicly available at https://github.com/ruleGreen/AppBench.
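The graph-like execution order can be sketched with the standard-library `graphlib.TopologicalSorter` (Python 3.9+). This is an illustration, not AppBench code: the API names in `toy_deps` and the helper `execution_batches` are hypothetical, and permission constraints are left out.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependencies: each API maps to the APIs it must wait for.
toy_deps: Dict[str, Set[str]] = {
    "find_hotel": set(),
    "get_weather": set(),
    "book_hotel": {"find_hotel"},
    "send_summary": {"book_hotel", "get_weather"},
}


def execution_batches(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group API calls into batches; calls within a batch are independent."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)  # unlocks calls that depended on this batch
    return batches
```

For the toy data, `find_hotel` and `get_weather` form the first batch and could run independently, while `book_hotel` and then `send_summary` must follow one by one.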
In the past few years, Large Language Models (LLMs) have exploded in usefulness and popularity for code generation tasks. However, LLMs still struggle with accuracy and are unsuitable for high-risk applications without additional oversight and verification. In particular, they perform poorly at generating code for highly complex systems, especially with unusual or out-of-sample logic. For such systems, verifying the code generated by the LLM may take longer than writing it by hand. We introduce a solution that divides the code generation into two parts: one to be handled by an LLM and one to be handled by formal methods-based program synthesis. We develop a benchmark to test our solution and show that our method allows the pipeline to solve problems previously intractable for LLM code generation.
Repository-level code completion has drawn great attention in software engineering, and several benchmark datasets have been introduced. However, existing repository-level code completion benchmarks usually focus on a limited number of languages (fewer than 5), and therefore cannot evaluate the general code intelligence abilities of existing code Large Language Models (LLMs) across different languages. Besides, the existing benchmarks usually report overall average scores across languages, ignoring the fine-grained abilities in different completion scenarios. Therefore, to facilitate the research of code LLMs in multilingual scenarios, we propose a massively multilingual repository-level code completion benchmark covering 18 programming languages (called M2RC-EVAL), with two types of fine-grained annotations (i.e., bucket-level and semantic-level) on different completion scenarios, which we obtain from the parsed abstract syntax tree. Moreover, we also curate a massively multilingual instruction corpus, M2RC-INSTRUCT, to improve the repository-level code completion abilities of existing code LLMs. Comprehensive experimental results demonstrate the effectiveness of our M2RC-EVAL and M2RC-INSTRUCT.
Cloud computing is essential for modern enterprises, requiring robust tools to monitor and manage Large-Scale Cloud Systems (LCS). Traditional monitoring tools often miss critical insights due to the complexity and volume of LCS telemetry data. This paper presents CloudHeatMap, a novel heatmap-based visualization tool for near-real-time monitoring of LCS health. It offers intuitive visualizations of key metrics such as call volumes, response times, and HTTP response codes, enabling operators to quickly identify performance issues. A case study on the IBM Cloud Console demonstrates the tool's effectiveness in enhancing operational monitoring and decision-making. A demonstration is available at https://www.youtube.com/watch?v=3u5K1qp51EA.
The JavaScript programming language, which began as a simple scripting language for the Web, has become ubiquitous, spanning desktop, mobile, and server applications. This increase in usage has made JavaScript an attractive target for nefarious actors, resulting in the proliferation of malicious browser extensions that steal user information and supply chain attacks that target the official Node.js package registry. To combat these threats, researchers have developed specialized tools and frameworks for analyzing the behavior of JavaScript programs to detect malicious patterns. Static analysis tools typically struggle with the highly dynamic nature of the language and fail to process obfuscated sources, while dynamic analysis pipelines take several minutes to run and require more resources per program, making them infeasible for large-scale analyses. In this paper, we present Fakeium, a novel, open source, and lightweight execution environment designed for efficient, large-scale dynamic analysis of JavaScript programs. Built on top of the popular V8 engine, Fakeium complements traditional static analysis by providing additional API calls and string literals that would otherwise go unnoticed, without the need for resource-intensive instrumented browsers or synthetic user input. Besides its negligible execution overhead, our tool is highly customizable and supports hooks for advanced analysis scenarios such as network traffic emulation. Fakeium's flexibility and ability to detect hidden API calls, especially in obfuscated sources, highlight its potential as a valuable tool for security analysts to detect malicious behavior.
Recent advancements in Large Language Models (LLMs) have renewed interest in automatic programming language translation. Encoder-decoder transformer models, in particular, have shown promise in translating between different programming languages. However, translating between a language and its high-performance computing (HPC) extensions remains underexplored due to challenges such as complex parallel semantics. In this paper, we introduce CodeRosetta, an encoder-decoder transformer model designed specifically for translating between programming languages and their HPC extensions. CodeRosetta is evaluated on C++ to CUDA and Fortran to C++ translation tasks. It uses a customized learning framework with tailored pretraining and training objectives to effectively capture both code semantics and parallel structural nuances, enabling bidirectional translation. Our results show that CodeRosetta outperforms state-of-the-art baselines in C++ to CUDA translation by 2.9 BLEU and 1.72 CodeBLEU points while improving compilation accuracy by 6.05%. Compared to general closed-source LLMs, our method improves C++ to CUDA translation by 22.08 BLEU and 14.39 CodeBLEU, with 2.75% higher compilation accuracy. Finally, CodeRosetta exhibits proficiency in Fortran to parallel C++ translation, marking it, to our knowledge, as the first encoder-decoder model for this complex task, improving CodeBLEU by at least 4.63 points compared to closed-source and open-code LLMs.
Deep Neural Networks (DNNs) are increasingly deployed across applications. However, ensuring their reliability remains a challenge, and in many situations, alternative models with similar functionality and accuracy are available. Traditional accuracy-based evaluations often fail to capture behavioral differences between models, especially with limited test datasets, making it difficult to select or combine models effectively. Differential testing addresses this by generating test inputs that expose discrepancies in DNN model behavior. However, existing approaches face significant limitations: many rely on model internals or are constrained by available seed inputs. To address these challenges, we propose DiffGAN, a black-box test image generation approach for differential testing of DNN models. DiffGAN leverages a Generative Adversarial Network (GAN) and the Non-dominated Sorting Genetic Algorithm II to generate diverse and valid triggering inputs that reveal behavioral discrepancies between models. DiffGAN employs two custom fitness functions, focusing on diversity and divergence, to guide the exploration of the GAN input space and identify discrepancies between models' outputs. By strategically searching this space, DiffGAN generates inputs with specific features that trigger differences in model behavior. DiffGAN is black-box, making it applicable in more situations. We evaluate DiffGAN on eight DNN model pairs trained on widely used image datasets. Our results show DiffGAN significantly outperforms a SOTA baseline, generating four times more triggering inputs, with greater diversity and validity, within the same budget. Additionally, the generated inputs improve the accuracy of a machine learning-based model selection mechanism, which selects the best-performing model based on input characteristics and can serve as a smart output voting mechanism when using alternative models.
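The two-objective selection that underlies NSGA-II can be sketched by computing the first non-dominated front. This is not DiffGAN's implementation or the full NSGA-II: the names `dominates`, `first_front`, and `toy_scores` are hypothetical, the scores are made up, and treating both diversity and divergence as values to maximize is an assumption.

```python
from typing import List, Tuple

# Assumed convention: (diversity, divergence), both to be maximized.
Score = Tuple[float, float]

# Made-up fitness values for four candidate inputs.
toy_scores: List[Score] = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.1, 0.9)]


def dominates(a: Score, b: Score) -> bool:
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def first_front(scores: List[Score]) -> List[int]:
    """Indices of candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]
```

In the toy data the candidate `(0.4, 0.4)` is dominated by `(0.5, 0.5)`, so the first front consists of the other three candidates, each representing a different trade-off between the two objectives.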
Quantum Software (QSW) uses the principles of quantum mechanics, specifically programming quantum bits (qubits) that manipulate quantum gates, to implement quantum computing systems. QSW has become a specialized field of software development, requiring specific notations, languages, patterns, and tools for mapping the behavior of qubits and the structure of quantum gates to components and connectors of QSW architectures. To support declarative modeling of QSW, we aim to enable architecture-driven development, where software engineers can design, program, and evaluate quantum software systems by abstracting complex details through high-level components and connectors. We introduce QADL (Quantum Architecture Description Language), which provides a specification language, design space, and execution environment for architecting QSW. Inspired by classical ADLs, QADL offers (1) a graphical interface to specify and design QSW components, (2) a parser for syntactical correctness, and (3) an execution environment by integrating QADL with IBM Qiskit. The initial evaluation of QADL is based on usability assessments by a team of quantum physicists and software engineers, using quantum algorithms such as Quantum Teleportation and Grover's Search. QADL offers a pioneering specification language and environment for QSW architecture. A demo is available at https://youtu.be/xaplHH_3NtQ.