2025-12-10 | | Total: 19
With this paper, we introduce RESTifAI, an LLM-driven approach for generating reusable, CI/CD ready REST API tests, following the happy-path approach. Unlike existing tools that often focus primarily on internal server errors, RESTifAI systematically constructs valid test scenarios (happy paths) and derives negative cases to verify both intended functionality (2xx responses) and robustness against invalid inputs or business-rule violations (4xx responses). The results indicate that RESTifAI performs on par with the latest LLM tools, i.e., AutoRestTest and LogiAgent, while addressing limitations related to reusability, oracle complexity, and integration. To support this, we provide common comparative results and demonstrate the tool's applicability in industrial services. For tool demonstration, please refer to https://www.youtube.com/watch?v=2vtQo0T0Lo4. RESTifAI is publicly available at https://github.com/casablancahotelsoftware/RESTifAI.
ML-Enabled Systems (MLES) are inherently complex since they require multiple components to achieve their business goal. This experience report showcases the software architecture reusability techniques applied while building Ocean Guard, an MLES for anomaly detection in the maritime domain. In particular, it highlights the challenges and lessons learned to reuse the Ports and Adapters pattern to support building multiple microservices from a single codebase. This experience report hopes to inspire software engineers, machine learning engineers, and data scientists to apply the Hexagonal Architecture pattern to build their MLES.
This study investigates learners' preferences for game design elements (GDEs) in educational contexts to inform the development of purpose-driven gamification strategies. It emphasizes a learner-centered approach that aligns gamification design with pedagogical goals, while mitigating risks such as the erosion of intrinsic motivation. A systematic literature review was conducted to identify ten widely discussed GDEs. Visual prototypes representing each element were developed, and a best-worst scaling (BWS) survey with 125 participants was administered to elicit preference rankings. Qualitative feedback was also collected to uncover motivational drivers. Learners consistently preferred GDEs that support learning processes directly-most notably progress bars, concept maps, immediate feedback, and achievements. Qualitative analysis revealed six recurring motivational themes, including visible progress, content relevance, and constructive feedback. The findings suggest that learners value gamification elements that are meaningfully integrated with educational content and support intrinsic motivation. Purpose-aligned gamification should prioritize tools that visualize learning progress and provide actionable feedback, rather than relying solely on extrinsic incentives.
This study offers new insights into students' interest in computer science (CS) education by disentangling the distinct effects of age and gender across a diverse adolescent sample. Grounded in the person-object theory of interest (POI), we conceptualize enthusiasm as a short-term, activating expression of interest that combines positive affect, perceived relevance, and intention to re-engage. Experiencing such enthusiasm can temporarily shift CS attitudes and strengthen future engagement intentions, making it a valuable lens for evaluating brief outreach activities. To capture these dynamics, we developed a theoretically grounded questionnaire for pre-post assessment of the enthusiasm potential of CS interventions. Using data from more than 400 students participating in online CS courses, we examined age- and gender-related patterns in enthusiasm. The findings challenge the prevailing belief that early exposure is the primary pathway to sustained interest in CS. Instead, we identify a marked decline in enthusiasm during early adolescence, particularly among girls, alongside substantial variability in interest trajectories across age groups. Crucially, our analyses reveal that age is a more decisive factor than gender in shaping interest development and uncover key developmental breakpoints. Despite starting with lower baseline attitudes, older students showed the largest positive changes following the intervention, suggesting that well-designed short activities can effectively re-activate interest even at later ages. Overall, the study highlights the need for a dynamic, age-sensitive framework for CS education in which instructional strategies are aligned with developmental trajectories.
While the importance of human factors in agile software development is widely acknowledged, the measurement of an individual's "agile agreement" remains an ill-defined and challenging area. A key limitation in existing research is the failure to distinguish between agreement with the abstract, high-level values of the Agile Manifesto and agreement with the concrete, day-to-day practices derived from the 12 Principles. This paper addresses this methodological gap by presenting the design and validation of two distinct instruments: the novel Manifesto Agreement Scale (MAS), and the Principle Agreement Scale (PAS), which is a systematic adaptation and refinement of a prior instrument. We detail the systematic process of item creation and selection, survey design, and validation. The results demonstrate that both scales possess important internal consistency and construct validity. A convergence and divergence analysis, including Proportional Odds Logistic Regression, a Bland-Altman plot, and an Intraclass Correlation Coefficient (ICC), reveals that while the two scales are moderately correlated, they are not interchangeable and capture distinct dimensions of agile agreement. The primary contribution of this work is a pair of publicly available instruments, validated within a specific demographic of Belgian IT professionals. These scales represent a critical initial step toward facilitating a more nuanced measurement of agile agreement, distinguishing agile agreement across various levels of perception and aiding in a more refined interpretation of person-agile fit.
The integration of Large Language Models (LLMs) into mobile and software development workflows faces a persistent tension among three demands: semantic awareness, developer productivity, and data privacy. Traditional cloud-based tools offer strong reasoning but risk data exposure and latency, while on-device solutions lack full-context understanding across codebase and developer tooling. We introduce SolidGPT, an open-source, edge-cloud hybrid developer assistant built on GitHub, designed to enhance code and workspace semantic search. SolidGPT enables developers to: talk to your codebase: interactively query code and project structure, discovering the right methods and modules without manual searching. Automate software project workflows: generate PRDs, task breakdowns, Kanban boards, and even scaffold web app beginnings, with deep integration via VSCode and Notion. Configure private, extensible agents: onboard private code folders (up to approximately 500 files), connect Notion, customize AI agent personas via embedding and in-context training, and deploy via Docker, CLI, or VSCode extension. In practice, SolidGPT empowers developer productivity through: Semantic-rich code navigation: no more hunting through files or wondering where a feature lives. Integrated documentation and task management: seamlessly sync generated PRD content and task boards into developer workflows. Privacy-first design: running locally via Docker or VSCode, with full control over code and data, while optionally reaching out to LLM APIs as needed. By combining interactive code querying, automated project scaffolding, and human-AI collaboration, SolidGPT provides a practical, privacy-respecting edge assistant that accelerates real-world development workflows, ideal for intelligent mobile and software engineering contexts.
Log-based anomaly detection (LAD) is critical for ensuring the reliability of large-scale distributed systems. However, most existing LAD approaches assume centralized training, which is often impractical due to privacy constraints and the decentralized nature of system logs. While federated learning (FL) offers a promising alternative, there is a lack of dedicated testbeds tailored to the needs of LAD in federated settings. To address this, we present FedLAD, a unified platform for training and evaluating LAD models under FL constraints. FedLAD supports plug-and-play integration of diverse LAD models, benchmark datasets, and aggregation strategies, while offering runtime support for validation logging (self-monitoring), parameter tuning (self-configuration), and adaptive strategy control (self-adaptation). By enabling reproducible and scalable experimentation, FedLAD bridges the gap between FL frameworks and LAD requirements, providing a solid foundation for future research. Project code is publicly available at: https://github.com/AA-cityu/FedLAD.
Large language models (LLMs) have shown exceptional performance in code generation and understanding tasks, yet their high computational costs hinder broader adoption. One important factor is the inherent verbosity of programming languages, such as unnecessary formatting elements and lengthy boilerplate code. This leads to inflated token counts in both input and generated outputs, which increases inference costs and slows down the generation process. Prior work improves this through simplifying programming language grammar, reducing token usage across both code understanding and generation tasks. However, it is confined to syntactic transformations, leaving significant opportunities for token reduction unrealized at the semantic level. In this work, we propose Token Sugar, a concept that replaces frequent and verbose code patterns with reversible, token-efficient shorthand in the source code. To realize this concept in practice, we designed a systematic solution that mines high-frequency, token-heavy patterns from a code corpus, maps each to a unique shorthand, and integrates them into LLM pretraining via code transformation. With this solution, we obtain 799 (code pattern, shorthand) pairs, which can reduce up to 15.1% token count in the source code and is complementary to existing syntax-focused methods. We further trained three widely used LLMs on Token Sugar-augmented data. Experimental results show that these models not only achieve significant token savings (up to 11.2% reduction) during generation but also maintain near-identical Pass@1 scores compared to baselines trained on unprocessed code.
Migrating quantum algorithms across evolving frameworks introduces subtle behavioral changes that affect accuracy and reproducibility. This paper reports our experience converting the Quantum Approximate Optimization Algorithm (QAOA) from Qiskit Algorithms with Qiskit 1.x (v1 primitives) to a custom implementation using Qiskit 2.x (v2 primitives). Despite identical circuits, optimizers, and Hamiltonians, the new version produced drastically different results. A systematic analysis revealed the root cause: the sampling budget -- the number of circuit executions (shots) per iteration. The library's implicit use of unlimited shots yielded dense probability distributions, whereas the v2 default of 10 000 shots captured only 23% of the state space. Increasing shots to 250 000 restored library-level accuracy. This study highlights how hidden parameters at the quantum--classical interaction level can dominate hybrid algorithm performance and provides actionable recommendations for developers and framework designers to ensure reproducible results in quantum software migration.
Large Language Models for code (LLMs4Code) are increasingly used to generate software artifacts, including library and package recommendations in languages such as Go. However, recent evidence shows that LLMs frequently hallucinate package names or generate dependencies containing known security vulnerabilities, posing significant risks to developers and downstream software supply chains. At the same time, quantization has become a widely adopted technique to reduce inference cost and enable deployment of LLMs on resource-constrained environments. Despite its popularity, little is known about how quantization affects the correctness and security of LLM-generated software dependencies while generating shell commands for package installation. In this work, we conduct the first systematic empirical study of the impact of quantization on package hallucination and vulnerability risks in LLM-generated Go packages. We evaluate five Qwen model sizes under full-precision, 8-bit, and 4-bit quantization across three datasets (SO, MBPP, and paraphrase). Our results show that quantization substantially increases the package hallucination rate (PHR), with 4-bit models exhibiting the most severe degradation. We further find that even among the correctly generated packages, the vulnerability presence rate (VPR) rises as precision decreases, indicating elevated security risk in lower-precision models. Finally, our analysis of hallucinated outputs reveals that most fabricated packages resemble realistic URL-based Go module paths, such as most commonly malformed or non-existent GitHub and golang.org repositories, highlighting a systematic pattern in how LLMs hallucinate dependencies. Overall, our findings provide actionable insights into the reliability and security implications of deploying quantized LLMs for code generation and dependency recommendation.
The usability of open-source software (OSS) is important but frequently overlooked in favor of technical and functional complexity. Argumentation can be a pivotal device for diverse stakeholders in OSS usability discussions to express opinions and persuade others. However, the characteristics of argument discourse in those discussions remain unknown, resulting in difficulties in providing effective support for discussion participants. We address this through a comprehensive analysis of argument discourse and quality in five OSS projects. Our results indicated that usability discussions are predominantly argument-driven, although their qualities vary. Issue comments exhibit lower-quality arguments than the issue posts, suggesting a shortage of collective intelligence about usability in OSS communities. Moreover, argument discourse and quality have various impacts on the subsequent behavior of participants. Overall, this research offers insights to help OSS stakeholders build more effective arguments and eventually improve OSS usability. These insights can also inform studies about other distributed collaborative communities.
Today, with the growing obsession with applying Artificial Intelligence (AI), particularly Machine Learning (ML), to software across various contexts, much of the focus has been on the effectiveness of AI models, often measured through common metrics such as F1- score, while fairness receives relatively little attention. This paper presents a review of existing gray literature, examining fairness requirements in AI context, with a focus on how they are defined across various application domains, managed throughout the Software Development Life Cycle (SDLC), and the causes, as well as the corresponding consequences of their violation by AI models. Our gray literature investigation shows various definitions of fairness requirements in AI systems, commonly emphasizing non-discrimination and equal treatment across different demographic and social attributes. Fairness requirement management practices vary across the SDLC, particularly in model training and bias mitigation, fairness monitoring and evaluation, and data handling practices. Fairness requirement violations are frequently linked, but not limited, to data representation bias, algorithmic and model design bias, human judgment, and evaluation and transparency gaps. The corresponding consequences include harm in a broad sense, encompassing specific professional and societal impacts as key examples, stereotype reinforcement, data and privacy risks, and loss of trust and legitimacy in AI-supported decisions. These findings emphasize the need for consistent frameworks and practices to integrate fairness into AI software, paying as much attention to fairness as to effectiveness.
As machine learning (ML) becomes an integral part of high-autonomy systems, it is critical to ensure the trustworthiness of learning-enabled software systems (LESS). Yet, the nondeterministic and run-time-defined semantics of ML complicate traditional software refactoring. We define semantic preservation in LESS as the property that optimizations of intelligent components do not alter the system's overall functional behavior. This paper introduces an empirical framework to evaluate semantic preservation in LESS by mining model evolution data from HuggingFace. We extract commit histories, $\textit{Model Cards}$, and performance metrics from a large number of models. To establish baselines, we conducted case studies in three domains, tracing performance changes across versions. Our analysis demonstrates how $\textit{semantic drift}$ can be detected via evaluation metrics across commits and reveals common refactoring patterns based on commit message analysis. Although API constraints limited the possibility of estimating a full-scale threshold, our pipeline offers a foundation for defining community-accepted boundaries for semantic preservation. Our contributions include: (1) a large-scale dataset of ML model evolution, curated from 1.7 million Hugging Face entries via a reproducible pipeline using the native HF hub API, (2) a practical pipeline for the evaluation of semantic preservation for a subset of 536 models and 4000+ metrics and (3) empirical case studies illustrating semantic drift in practice. Together, these contributions advance the foundations for more maintainable and trustworthy ML systems.
Recent advances in large language models (LLMs) have given rise to powerful coding agents, making it possible for code assistants to evolve into code engineers. However, existing methods still face significant challenges in achieving high-fidelity document-to-codebase synthesis--such as scientific papers to code--primarily due to a fundamental conflict between information overload and the context bottlenecks of LLMs. In this work, we introduce DeepCode, a fully autonomous framework that fundamentally addresses this challenge through principled information-flow management. By treating repository synthesis as a channel optimization problem, DeepCode seamlessly orchestrates four information operations to maximize task-relevant signals under finite context budgets: source compression via blueprint distillation, structured indexing using stateful code memory, conditional knowledge injection via retrieval-augmented generation, and closed-loop error correction. Extensive evaluations on the PaperBench benchmark demonstrate that DeepCode achieves state-of-the-art performance, decisively outperforming leading commercial agents such as Cursor and Claude Code, and crucially, surpassing PhD-level human experts from top institutes on key reproduction metrics. By systematically transforming paper specifications into production-grade implementations comparable to human expert quality, this work establishes new foundations for autonomous scientific reproduction that can accelerate research evaluation and discovery.
Configuring computational fluid dynamics (CFD) simulations requires significant expertise in physics modeling and numerical methods, posing a barrier to non-specialists. Although automating scientific tasks with large language models (LLMs) has attracted attention, applying them to the complete, end-to-end CFD workflow remains a challenge due to its stringent domain-specific requirements. We introduce CFD-copilot, a domain-specialized LLM framework designed to facilitate natural language-driven CFD simulation from setup to post-processing. The framework employs a fine-tuned LLM to directly translate user descriptions into executable CFD setups. A multi-agent system integrates the LLM with simulation execution, automatic error correction, and result analysis. For post-processing, the framework utilizes the model context protocol (MCP), an open standard that decouples LLM reasoning from external tool execution. This modular design allows the LLM to interact with numerous specialized post-processing functions through a unified and scalable interface, improving the automation of data extraction and analysis. The framework was evaluated on benchmarks including the NACA~0012 airfoil and the three-element 30P-30N airfoil. The results indicate that domain-specific adaptation and the incorporation of the MCP jointly enhance the reliability and efficiency of LLM-driven engineering workflows.
Efficient edge caching reduces latency and alleviates backhaul congestion in modern networks. Traditional caching policies, such as Least Recently Used (LRU) and Least Frequently Used (LFU), perform well under specific request patterns. LRU excels in workloads with strong temporal locality, while LFU is effective when content popularity remains static. However, real-world client requests often exhibit correlations due to shared contexts and coordinated activities. This is particularly evident in Virtual Reality (VR) environments, where groups of clients navigate shared virtual spaces, leading to correlated content requests. In this paper, we introduce the \textit{grouped client request model}, a generalization of the Independent Reference Model that explicitly captures different types of request correlations. Our theoretical analysis of LRU under this model reveals that the optimal causal caching policy depends on cache size: LFU is optimal for small to moderate caches, while LRU outperforms it for larger caches. To address the limitations of existing policies, we propose Least Following and Recently Used (LFRU), a novel online caching policy that dynamically infers and adapts to causal relationships in client requests to optimize evictions. LFRU prioritizes objects likely to be requested based on inferred dependencies, achieving near-optimal performance compared to the offline optimal Belady policy in structured correlation settings. We develop VR based datasets to evaluate caching policies under realistic correlated requests. Our results show that LFRU consistently performs at least as well as LRU and LFU, outperforming LRU by up to 2.9x and LFU by up to1.9x in certain settings.
Neural models for vulnerability prediction (VP) have achieved impressive performance by learning from large-scale code repositories. However, their susceptibility to Membership Inference Attacks (MIAs), where adversaries aim to infer whether a particular code sample was used during training, poses serious privacy concerns. While MIA has been widely investigated in NLP and vision domains, its effects on security-critical code analysis tasks remain underexplored. In this work, we conduct the first comprehensive analysis of MIA on VP models, evaluating the attack success across various architectures (LSTM, BiGRU, and CodeBERT) and feature combinations, including embeddings, logits, loss, and confidence. Our threat model aligns with black-box and gray-box settings where prediction outputs are observable, allowing adversaries to infer membership by analyzing output discrepancies between training and non-training samples. The empirical findings reveal that logits and loss are the most informative and vulnerable outputs for membership leakage. Motivated by these observations, we propose a Noise-based Membership Inference Defense (NMID), which is a lightweight defense module that applies output masking and Gaussian noise injection to disrupt adversarial inference. Extensive experiments demonstrate that NMID significantly reduces MIA effectiveness, lowering the attack AUC from nearly 1.0 to below 0.65, while preserving the predictive utility of VP models. Our study highlights critical privacy risks in code analysis and offers actionable defense strategies for securing AI-powered software systems.
The rising global prevalence of diabetes necessitates early detection to prevent severe complications. While AI-powered prediction applications offer a promising solution, they require a responsive and scalable back-end architecture to serve a large user base effectively. This paper details the development and evaluation of a scalable back-end system designed for a mobile diabetes prediction application. The primary objective was to maintain a failure rate below 5% and an average latency of under 1000 ms. The architecture leverages horizontal scaling, database sharding, and asynchronous communication via a message queue. Performance evaluation showed that 83% of the system's features (20 out of 24) met the specified performance targets. Key functionalities such as user profile management, activity tracking, and read-intensive prediction operations successfully achieved the desired performance. The system demonstrated the ability to handle up to 10,000 concurrent users without issues, validating its scalability. The implementation of asynchronous communication using RabbitMQ proved crucial in minimizing the error rate for computationally intensive prediction requests, ensuring system reliability by queuing requests and preventing data loss under heavy load.
Time series forecasting has applications across domains and industries, especially in healthcare, but the technical expertise required to analyze data, build models, and interpret results can be a barrier to using these techniques. This article presents a web platform that makes the process of analyzing and plotting data, training forecasting models, and interpreting and viewing results accessible to researchers and clinicians. Users can upload data and generate plots to showcase their variables and the relationships between them. The platform supports multiple forecasting models and training techniques which are highly customizable according to the user's needs. Additionally, recommendations and explanations can be generated from a large language model that can help the user choose appropriate parameters for their data and understand the results for each model. The goal is to integrate this platform into learning health systems for continuous data collection and inference from clinical pipelines.