Delivering judicial decisions requires interpreting complex legal texts, analyzing evidence, and reasoning over jurisprudence and legal principles. Recent advances in Generative Artificial Intelligence, particularly Large Language Models (LLMs), have shown potential to automate parts of this process; however, practical and measurable benefits in real-world judicial settings remain limited. This paper introduces SARA, an LLM-powered legal reasoning platform deployed in a regional Brazilian court, which demonstrates significant efficiency and quality gains through the integration of LLM agents with a Jurisprudential Knowledge Graph (Jur-KG). SARA automatically extracts and structures key elements from legal documents, including claims, requests, and evidence, and generates legal reasoning grounded in retrieved jurisprudential precedents. The Jur-KG is modeled through an ontology encompassing core legal concepts such as parties, facts, and legal claims, enabling semantic matching and retrieval of relevant case law. By representing cases according to the Legal Case Ontology for the Brazilian Judicial System, SARA supports traceable reasoning and answers competency questions to assess the coverage, coherence, and justification of AI-generated outputs. Deployment results indicate measurable improvements in processing time, consistency, and explainability, while ensuring compliance with ethical and legal guidelines established by Brazil’s National Council of Justice. This work demonstrates that combining LLM-based agents with domain-specific knowledge graphs can deliver both innovative capabilities and proven impact in judicial decision-making.
In retail lending, offering preferential interest rates is a core marketing instrument for balancing customer acquisition with portfolio profitability. Accurately predicting the effect of interest-rate discounts for each customer is pivotal for optimizing the discount strategy: offering overly generous discounts erodes margins, while insufficient discounts drive price-sensitive customers to defect. Off-the-shelf machine learning uplift models rarely respect the complex operational constraints of financial business, such as tiered rate grids, regulatory guardrails, and marketing budget ceilings. We propose an integrated system that fuses causal inference and domain adaptation to produce constraint-aware, customer-specific discount recommendations. To further enhance practitioner adoption, a large language model layer translates model outputs into actionable narratives. Developed at Hyundai Capital Services, the system boosted transaction volume by 13%, demonstrating both technical soundness and material business impact.
Evaluating the quality of search systems traditionally requires a significant number of human relevance annotations. Recently, several systems have explored using Large Language Models (LLMs) as automated judges for this task, although their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM judgments to produce reliable estimates of metrics that require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, significantly reducing annotation requirements compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift in an LLM-based query reformulation application, extending PPI to sub-instance annotations at the query-document level. By reformulating the metric-integration space, we reduce the computational complexity from O(2^|C|) to O(2^K), where |C| is the corpus size (on the order of millions). Detailed experiments across prominent retrieval datasets demonstrate that our method reduces the variance of estimates for the business-critical Precision@K metric, while effectively correcting for LLM bias in low-resource settings.
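The abstract does not spell out the estimator, but the classical PPI mean estimator it extends is well documented: average the LLM's judgments over the large unlabeled pool, then add a rectifier computed from human-vs-LLM disagreement on the small labeled set. A minimal sketch, assuming binary relevance judgments; all data below is synthetic:

```python
import numpy as np

def ppi_estimate(llm_scores_unlabeled, llm_scores_labeled, human_scores_labeled):
    """Prediction-Powered Inference point estimate: combine cheap LLM
    judgments on a large unlabeled set with a small human-labeled set
    that corrects the LLM's systematic bias."""
    # Mean of LLM judgments over the large unlabeled pool (low variance, biased).
    llm_mean = np.mean(llm_scores_unlabeled)
    # Rectifier: average human-vs-LLM disagreement on the labeled subset.
    rectifier = np.mean(np.asarray(human_scores_labeled) - np.asarray(llm_scores_labeled))
    return llm_mean + rectifier

# Synthetic example: 10,000 unlabeled LLM relevance judgments, 100 human-annotated queries.
rng = np.random.default_rng(0)
llm_unlabeled = rng.binomial(1, 0.62, size=10_000)   # LLM judges 62% relevant
llm_labeled = rng.binomial(1, 0.62, size=100)
human_labeled = rng.binomial(1, 0.55, size=100)      # humans judge 55% relevant
print(ppi_estimate(llm_unlabeled, llm_labeled, human_labeled))
```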
Configuring the parameters of additive manufacturing processes for metal alloys is a challenging problem due to complex relationships between input parameters (e.g., laser power, scan speed) and quality of printed outputs. The standard trial-and-error approach to find feasible parameter configurations is highly inefficient because validating each configuration is expensive in terms of resources (physical and human labor) and the configuration space is very large. This paper combines the general principles of AI-driven adaptive experimental design with domain knowledge to address the challenging problem of discovering feasible configurations. The key idea is to build a surrogate model from past experiments to intelligently select a small batch of input configurations for validation in each iteration. To demonstrate the effectiveness of this methodology, we deploy it for the Directed Energy Deposition (DED) process to print GRCop-42, a high-performance copper–chromium–niobium alloy developed by NASA for aerospace applications. Within three months, our approach yielded multiple defect-free outputs across a range of laser powers—dramatically reducing time-to-result and resource expenditure compared to several months of unsuccessful manual experimentation by domain scientists. By enabling high-quality GRCop-42 fabrication on readily available infrared laser platforms for the first time, we democratize access to this critical alloy, paving the way for cost-effective, decentralized production for aerospace applications.
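As a rough illustration of the surrogate-driven loop described above (the abstract does not give the surrogate type or acquisition rule), here is a sketch using a Gaussian-process feasibility classifier and an uncertainty-based batch pick; the configuration values and the uncertainty heuristic are assumptions, not the paper's method:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import Matern

def select_next_batch(tried_configs, outcomes, candidate_configs, batch_size=5):
    """Fit a feasibility surrogate on past prints, then pick the batch of
    untried (laser power, scan speed) configurations whose predicted
    probability of a defect-free part is closest to 0.5, i.e., where the
    surrogate is most uncertain -- a common batch-design heuristic."""
    surrogate = GaussianProcessClassifier(kernel=Matern(nu=2.5))
    surrogate.fit(tried_configs, outcomes)          # outcomes: 1 = defect-free
    p_feasible = surrogate.predict_proba(candidate_configs)[:, 1]
    uncertainty = -np.abs(p_feasible - 0.5)         # highest near p = 0.5
    picks = np.argsort(uncertainty)[-batch_size:]
    return candidate_configs[picks]

# Hypothetical past experiments: [laser power (W), scan speed (mm/s)].
tried = np.array([[300, 800], [350, 900], [400, 700], [450, 1000]])
labels = np.array([0, 0, 1, 0])                     # one defect-free print so far
grid = np.array([[p, s] for p in range(300, 501, 25) for s in range(600, 1101, 50)])
print(select_next_batch(tried, labels, grid))
```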
Modern ETL (Extract, Transform, Load) tools offer graphical, no-code interfaces for workflow creation but still require users to manually identify transformation functions and configure their properties, which is time-consuming and demands prior expertise. We present the research and engineering foundations of the IBM DataStage Assistant, a deployed capability that generates complete multi-stage ETL flows directly from natural language (NL) descriptions. Our framework infers transformation functions, their properties, and transformer expressions, enabling novices to discover relevant functions and allowing experts to bypass manual configuration. The proposed framework achieves a prediction accuracy of 96.4% for flows, 87.0% for properties, and 83.6% for transformer expressions. We also present a document exploration module that uses retrieval-augmented generation (RAG) over product documentation to answer tool-specific questions in NL. Implemented in IBM DataStage, this approach supports iterative, in-environment workflow design and reduces context switching. In initial studies, it achieves up to 90% time savings for novices and 50% for experts.
IT environments typically have logging mechanisms to monitor system health and detect issues. However, the huge volume of generated logs makes manual inspection impractical, highlighting the importance of automated log analysis in IT Software Support. In this paper, we propose a log analytics tool that leverages Large Language Models (LLMs) for log data processing and issue diagnosis, enabling the generation of automated insights and summaries. We further present a novel approach for efficiently running LLMs on CPUs to process massive log volumes in minimal time without compromising output quality. We share the insights and lessons learned from the deployment of the tool, in production since March 2024 and scaled across 70 software products, where it has processed over 2,000 tickets for issue diagnosis, saving more than 300 person-hours and an estimated $15,444 per month in manpower costs compared to traditional practices.
We present MAFA (Multi-Agent Framework for Annotation), a production-deployed system that transforms enterprise-scale annotation workflows through configurable multi-agent collaboration. Addressing the critical challenge of annotation backlogs in financial services, where millions of customer utterances require accurate categorization, MAFA combines specialized agents with structured reasoning and a judge-based consensus mechanism. Our framework uniquely supports dynamic task adaptation, allowing organizations to define custom annotation types (FAQs, intents, entities, or domain-specific categories) through configuration rather than code changes. Deployed at JP Morgan Chase, MAFA has eliminated a 1 million utterance backlog while achieving, on average, 86% agreement with human annotators, annually saving over 5,000 hours of manual annotation work. The system processes utterances with annotation confidence classifications, which are typically 85% high, 10% medium, and 5% low across all datasets we tested. This enables human annotators to focus exclusively on ambiguous and low-coverage cases. We demonstrate MAFA's effectiveness across multiple datasets and languages, showing consistent improvements over traditional and single-agent annotation baselines: 13.8% higher Top-1 accuracy, 15.1% improvement in Top-5 accuracy, and 16.9% better F1 in our internal intent classification dataset and similar gains on public benchmarks. This work bridges the gap between theoretical multi-agent systems and practical enterprise deployment, providing a blueprint for organizations facing similar annotation challenges.
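A minimal sketch of the judge-based consensus step described above; the agent and judge callables are hypothetical stand-ins for MAFA's LLM agents, and the confidence rules are illustrative rather than the system's actual thresholds:

```python
from collections import Counter

def annotate_with_consensus(utterance, agents, judge):
    """Judge-based consensus sketch: each specialized agent proposes a
    (label, rationale) pair; unanimity yields high confidence, otherwise
    a judge agent reviews all proposals and breaks the disagreement."""
    proposals = [agent(utterance) for agent in agents]      # [(label, rationale), ...]
    labels = [label for label, _ in proposals]
    top_label, top_votes = Counter(labels).most_common(1)[0]
    if top_votes == len(agents):                # unanimous -> high confidence
        return top_label, "high"
    verdict = judge(utterance, proposals)       # judge sees labels + rationales
    confidence = "medium" if top_votes > len(agents) // 2 else "low"
    return verdict, confidence

# Toy agents/judge standing in for LLM calls.
agents = [lambda u: ("card_dispute", "mentions an unrecognized charge"),
          lambda u: ("card_dispute", "fraud-style phrasing"),
          lambda u: ("balance_inquiry", "asks about the account")]
judge = lambda u, props: Counter(l for l, _ in props).most_common(1)[0][0]
print(annotate_with_consensus("I don't recognize this charge", agents, judge))
```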
Accurate monitoring of eider duck populations in Arctic Canada is essential for understanding ecosystem health and supporting conservation efforts in a rapidly changing climate. Traditional manual counting from aerial imagery is time-consuming, labor-intensive, and prone to observer bias. In this work, we present a human-in-the-loop wildlife counting system that integrates an open-vocabulary multi-species object detector to streamline and enhance the accuracy of eider duck surveys. The system leverages a pre-trained open-vocabulary model, enabling the identification of both target and incidental species without retraining, and employs human validation to correct and refine automated detections. This collaborative workflow combines the scalability of machine learning with expert ecological knowledge, reducing annotation effort while maintaining high accuracy. Field validation using aerial imagery from Arctic Canada demonstrates that our approach can significantly accelerate population assessments, improve consistency across surveys, and facilitate adaptive monitoring in remote environments.
Over recent decades, the tourism industry has demonstrated progressive expansion, driven by advancements in aviation technologies and shifting consumer interests. In this context, online flight itinerary ranking has become a pivotal business for Online Travel Platforms (OTPs), which aim to rank flight itineraries by synthesizing real-time flight data provided by airlines with users' individual travel preferences. Currently, most OTPs rely on rule-based methodologies or rudimentary user preference-driven models to address this task. However, these methods are inherently limited by their insufficient consideration of delayed booking behaviors and their neglect of dynamic contextual attributes associated with flight itineraries, thereby undermining their ability to effectively handle the intricacies of flight ranking. To address these shortcomings, this paper introduces the Delayed Conversion Modeling based Personalized Flight Itinerary Ranking Network (DCRNet), designed to improve ranking accuracy by integrating delayed booking patterns and contextual dependencies into the modeling framework. Specifically, DCRNet explores the dynamic associations between users' current contextual information and their historical travel records, and models users' delayed booking behaviors via a masked attention mechanism. Moreover, an enhanced multi-task learning framework is employed to effectively integrate traditional behavioral modeling with delay-aware modeling, thereby improving the overall prediction accuracy and enhancing the system's personalized recommendation capabilities. Extensive offline experiments conducted on real-world datasets from Amadeus and Fliggy demonstrate the superior performance of DCRNet. Furthermore, its successful deployment on Fliggy's online itinerary search system has yielded significant improvements, underscoring its practical effectiveness and scalability.
User retention is a critical objective for online platforms like Pinterest, as it strengthens user loyalty and drives growth through repeated engagement. A key indicator of retention is revisitation, i.e., when users return to view previously saved content, a behavior often sparked by personalized recommendations and user satisfaction. However, modeling and optimizing revisitation poses significant challenges. One core difficulty is accurate attribution: it is often unclear which specific user actions or content exposures trigger a revisit, since many confounding factors (e.g., content quality, user interface, notifications, or even changing user intent) can influence return behavior. Additionally, the scale and timing of revisitations introduce further complexity; users may revisit content days or even weeks after their initial interaction, requiring the system to maintain and associate extensive historical records across millions of users and sessions. These complexities render existing methods insufficient for robustly capturing and optimizing long-term revisitation. To address these gaps, we introduce a novel, lightweight, and interpretable framework for modeling revisitation behavior and optimizing long-term user retention in Pinterest’s search-based recommendation context. By defining a surrogate attribution process that links saves to subsequent revisitations, we reduce noise in the causal relationship between user actions and return visits. Our scalable event aggregation pipeline enables large-scale analysis of user revisitation patterns and enhances the ranking system’s ability to surface items with high retention value. Deployed on Pinterest’s Related Pins surface to serve 500+ million users, the framework led to a significant lift of 0.1% in active users without additional computational costs. Our data analysis reveals novel insights, such as the impact of content topics on revisitation rates; for example, users are more likely to revisit aesthetically pleasing topics.
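The surrogate attribution the abstract describes can be pictured as a windowed join between save events and later views of the same item by the same user. A sketch in pandas, where the column names and the 28-day attribution window are assumptions, not Pinterest's actual schema:

```python
import pandas as pd

def attribute_revisits(saves, views, window_days=28):
    """Surrogate attribution sketch: credit a view to a prior save of the
    same (user, item) pair if it occurs within `window_days` of the save."""
    merged = views.merge(saves, on=["user_id", "item_id"])
    lag = merged["ts_view"] - merged["ts_save"]
    mask = (lag > pd.Timedelta(0)) & (lag <= pd.Timedelta(days=window_days))
    return merged[mask]

saves = pd.DataFrame({"user_id": [1, 1], "item_id": ["a", "b"],
                      "ts_save": pd.to_datetime(["2024-01-01", "2024-01-02"])})
views = pd.DataFrame({"user_id": [1, 1], "item_id": ["a", "a"],
                      "ts_view": pd.to_datetime(["2024-01-05", "2024-03-01"])})
print(attribute_revisits(saves, views))  # only the Jan 5 view falls in the window
```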
Public health experts need scalable methods to monitor large volumes of health data (e.g., human-reported cases, hospitalizations, deaths). These methods must identify individual data points that may indicate significant events, such as outbreaks, or reveal data quality issues. Identifying, triaging, and analyzing these data points in real-time is critical for preventing downstream errors in forecasting or policy. Traditional alert-based data monitoring systems, used for decades in practice, fail to identify relevant data events for several reasons. For example, these systems may not output real-time results from large data volumes, or they may return tens of thousands of unhelpful alerts. We introduce a human-in-the-loop AI system for public health data monitoring that uses a ranking-based AI anomaly detection method. This system was developed through a multi-year interdisciplinary collaboration with participatory design from researchers, engineers, and public health data experts. From this process, we identified system goals, such as user control and efficiency, and designed a system that balances these goals. This system has since been deployed at a national public health organization and analyzes up to 5 million data points daily. A three-month longitudinal deployment evaluation revealed a significant improvement in system goals, including a 54x increase in data reviewer efficiency and increased engagement compared to traditional alert-based methods.
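To make the contrast with threshold alerts concrete, here is a minimal sketch of ranking-based monitoring: score every data point and hand reviewers a short ranked list instead of firing an alert whenever a fixed threshold is crossed. The detector here (IsolationForest) is a stand-in; the deployed method's actual scorer is not specified in the abstract:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def rank_for_review(series_values, top_k=50):
    """Ranking-based monitoring sketch: score all points with an anomaly
    detector and return only the indices of the top_k most unusual ones,
    bounding reviewer workload regardless of data volume."""
    X = np.asarray(series_values, dtype=float).reshape(-1, 1)
    scores = IsolationForest(random_state=0).fit(X).score_samples(X)
    order = np.argsort(scores)            # lowest score = most anomalous
    return order[:top_k]

# Synthetic daily case counts with one spike and one near-zero reporting day.
daily_counts = np.concatenate([np.random.default_rng(1).poisson(100, 998), [900, 2]])
print(rank_for_review(daily_counts, top_k=5))   # indices of the most unusual days
```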
Online human trafficking investigations generate vast amounts of noisy, heterogeneous, and deliberately obfuscated data, making traditional search and analytics tools ineffective for supporting law enforcement. This paper discusses the deployment of the Domain-Specific Insight Graphs (DIG) system, an AI-powered investigative search engine that was operationally used by over 200 U.S. law enforcement agencies for more than five years in the pre-COVID period. The system integrates advanced research conducted over the years in information extraction, knowledge graph construction, and entity-centric search to enable investigators to formulate queries without technical background, aggregate evidence, and uncover latent relationships among entities such as phone numbers, emails, and locations. Beyond technical innovation, the deployment required sustained attention to usability, explainability, and policy compliance, ensuring trust in high-stakes legal contexts. We report measurable benefits in investigative efficiency, case initiation, and prosecutorial support, as well as lessons learned from long-term maintenance and adaptation to evolving online platforms. Since 2020, work conducted in this domain has also had significant policy and advocacy ramifications. The system's generalized design has also allowed it to be prototyped for adjacent illicit domains, including securities fraud and illegal firearm sales, demonstrating the broader applicability of AI-driven investigative tools. We contribute a rare case study of an AI system that has transitioned from research to sustained real-world impact in a socially critical domain.
The quality of experience (QoE) delivered by video conferencing systems is significantly influenced by accurately estimating the time-varying available bandwidth between the sender and receiver. Bandwidth estimation for real-time communications remains an open challenge due to rapidly evolving network architectures, increasingly complex protocol stacks, and the difficulty of defining QoE metrics that reliably improve user experience. In this work, we propose a deployed, human-in-the-loop, data-driven framework for bandwidth estimation to address these challenges. Our approach begins with training objective QoE reward models derived from subjective user evaluations to measure audio and video quality in real-time video conferencing systems. Subsequently, we collect roughly 1M network traces with objective QoE rewards from real-world Microsoft Teams calls to curate a bandwidth estimation training dataset. We then introduce a novel distributional offline reinforcement learning (RL) algorithm to train a neural-network-based bandwidth estimator aimed at improving QoE for users. Our real-world A/B test demonstrates that the proposed approach reduces the subjective poor call ratio by 11.41% compared to the baseline bandwidth estimator. Furthermore, the proposed offline RL algorithm is benchmarked on D4RL tasks to demonstrate its generalization beyond bandwidth estimation.
Many organizations increasingly rely on unstructured documents such as PDFs and scanned forms to support downstream large language model (LLM) services, including search, summarization, and recommendation. However, traditional OCR systems struggle with diverse document layouts, leading to frequent errors and high labor costs. This study therefore developed DATALUX, a robust document layout system that transforms unstructured documents into structured, machine-readable data suitable for automation. Built on a transformer-based detector, DATALUX incorporates several modules for layout refinement, text-visual fusion, and layer-wise optimization to improve coherence and generalization across diverse layouts. Around January 2025, we deployed DATALUX at Nurimedia, one of the largest academic content service firms in South Korea. The firm faced the challenge of extracting metadata and references from thousands of academic papers submitted in various formats, and existing LLM-based tools provided unreliable results, so papers had to be processed manually, creating bottlenecks in both labor and time. DATALUX enabled the automatic structuring of over 100,000 research papers a year, improving extraction accuracy to over 97%, reducing costs by more than USD 185K annually, and accelerating processing speed by 8.7 times. These deployment results suggest that DATALUX enables scalable and efficient document automation in complex, high-volume environments. We believe DATALUX can have a significant impact on both academic and industry practice.
The rapid proliferation of smart-city ecosystems has significantly amplified the demand for Li-ion batteries, which now serve as the primary energy source for sustainable transportation systems such as e-bikes. Ensuring battery safety and optimal performance is crucial, yet challenging due to complex intrinsic dynamics and extrinsic operating conditions. This paper presents LiBrain, an innovative LLM-powered, time-series-aware retrieval-augmented framework designed to simultaneously address both safety and performance challenges through three synergistic components: (1) a distributed IoT-enabled edge network for continuous real-time battery monitoring and data acquisition, (2) a pretrained deep multi-task diagnostic engine capable of comprehensive battery performance forecasting, and (3) a knowledge-base augmentation module that transforms technical diagnostics into clear, actionable guidance tailored for e-bike users. Functioning as an intelligent battery management assistant, LiBrain effectively bridges the gap between expert-level real-time analytics and practical, user-friendly instructions. Extensive validation across a real-world operational e-bike battery-swap network demonstrates LiBrain's exceptional capabilities, achieving a 95% adoption rate in hazardous alarm detection and 92% in battery-status prediction. In real-world operation, LiBrain has processed over 500 million battery events, managed almost 10 million inquiries and 1 million alarms annually, and identified 10% of on-site batteries daily for proactive replacement, thereby maintaining operational safety and reliability.
Electric bicycles (e-bikes) have become the dominant mode of transportation in China’s urban instant delivery industry. However, many riders lack the experience to navigate complex traffic networks and diverse road conditions, leading to reduced delivery efficiency. To address this issue, we present Talking Trails, an e-bike delivery route planning system built upon an LLM-enhanced spatiotemporal trajectory model. Trained on millions of real-world delivery trajectories fused with spatiotemporal and semantic information, the model achieves a top-5 rider displacement prediction accuracy of 95% and a route optimization rate of 82.1%. In practice, we augment the core planner with an LLM-driven semantic layer that translates high-level user intent into executable tasks, then pair it with a battery-swap module that continuously validates route feasibility so the vehicle never runs out of charge mid-mission. Currently serving tens of thousands of riders, the system is projected to reduce average delivery mileage by 17% and lower annual carbon emissions by 3978 tons. Overall, Talking Trails significantly improves delivery efficiency, offering a scalable and sustainable solution for instant delivery operations.
LLM-based autonomous agents have recently shown strong capabilities in solving complex industrial design tasks. However, in domains aiming for carbon neutrality and high-performance renewable energy systems, current AI-assisted design automation methods face critical challenges in explainability, scalability, and practical usability. To address these limitations, we introduce PHIA (Physics-Informed Autonomous Agent), an LLM-driven system that automates modulation design for power converters in Power Electronics Systems with minimal human intervention. In contrast to traditional pipeline-based methods, PHIA incorporates an LLM-based planning module that interactively acquires and verifies design requirements via a user-friendly chat interface. This planner collaborates with physics-informed simulation and optimization components to autonomously generate and iteratively refine modulation designs. The interactive interface also supports interpretability by providing textual explanations and visual outputs throughout the design process. Experimental results show that PHIA reduces standard mean absolute error by 63.2% compared to the second-best benchmark and accelerates the overall design process by over 33 times. A user study involving 20 domain experts further confirms PHIA’s superior design efficiency and usability, highlighting its potential to transform industrial design workflows in power electronics.
Accurate multi-turn intent classification is critical for advancing conversational AI systems but remains challenging due to limited datasets and complex contextual dependencies across dialogue turns. This paper presents two novel approaches leveraging Large Language Models (LLMs) to enhance scalability and reduce latency in production dialogue systems. First, we introduce Symbol Tuning, which simplifies intent labels to reduce task complexity and improve performance in multi-turn dialogues. Second, we propose Consistency-aware, Linguistics Adaptive Retrieval Augmentation (CLARA), a framework that employs LLMs for data augmentation and pseudo-labeling to generate synthetic multi-turn dialogues. These enriched datasets are used to fine-tune a small, efficient model suitable for deployment. Experiments on multilingual dialogue datasets show that our methods result in notable gains in both accuracy and resource efficiency, with improvements of 5.09% in classification accuracy, a 40% reduction in annotation costs, and effective deployment in low-resource multilingual industrial settings.
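Symbol tuning, as named above, replaces semantically loaded intent labels with short arbitrary symbols so the model must rely on dialogue context rather than prior associations with the label text. A minimal illustration; the A/B/C label scheme and example intents are assumptions, not the paper's exact setup:

```python
def symbolize_labels(examples, intents):
    """Symbol-tuning sketch: map verbose intent labels to arbitrary symbols
    (A, B, C, ...) before fine-tuning or prompting, simplifying the label
    space the model has to produce."""
    symbol_map = {intent: chr(ord("A") + i) for i, intent in enumerate(intents)}
    remapped = [(text, symbol_map[label]) for text, label in examples]
    return remapped, symbol_map

examples = [("Where is my order?", "track_order"),
            ("I want my money back", "request_refund")]
remapped, mapping = symbolize_labels(examples, ["track_order", "request_refund"])
print(remapped)   # [('Where is my order?', 'A'), ('I want my money back', 'B')]
print(mapping)    # {'track_order': 'A', 'request_refund': 'B'}
```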
Large-scale genomic workflows used in precision medicine can process datasets spanning tens to hundreds of gigabytes per sample, leading to high memory spikes, intensive disk I/O, and task failures due to out-of-memory errors. Simple static resource allocation methods struggle to handle the variability in per-chromosome RAM demands, resulting in poor resource utilization and long runtimes. In this work, we propose multiple mechanisms for adaptive, RAM-efficient parallelization of chromosome-level bioinformatics workflows. First, we develop a symbolic regression model that estimates per-chromosome memory consumption for a given task and introduces an interpolating bias to conservatively minimize over-allocation. Second, we present a dynamic scheduler that adaptively predicts RAM usage with a polynomial regression model, treating task packing as a Knapsack problem to optimally batch jobs based on predicted memory requirements. Additionally, we present a static scheduler that optimizes chromosome processing order to minimize peak memory while preserving throughput. Our proposed methods, evaluated on simulations and real-world genomic pipelines, provide new mechanisms to reduce memory overruns and balance load across threads. We thereby achieve faster end-to-end execution, showcasing the potential to optimize large-scale genomic workflows.
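The Knapsack-style batching described above can be sketched with a first-fit-decreasing heuristic over predicted per-chromosome RAM: sort tasks by predicted memory and place each into the first concurrent batch with room under the node's budget. The RAM predictions and the 64 GB budget below are hypothetical, and this heuristic is a sketch rather than the paper's exact scheduler:

```python
def pack_tasks(predicted_ram_gb, ram_budget_gb):
    """First-fit-decreasing sketch of memory-aware task packing: group
    chromosome tasks into concurrent batches whose predicted RAM sums
    stay under the node's budget."""
    tasks = sorted(predicted_ram_gb.items(), key=lambda kv: kv[1], reverse=True)
    batches = []                      # each batch: [remaining_gb, [task names]]
    for name, ram in tasks:
        for batch in batches:
            if batch[0] >= ram:       # fits in an existing batch
                batch[0] -= ram
                batch[1].append(name)
                break
        else:                         # no batch has room: open a new one
            batches.append([ram_budget_gb - ram, [name]])
    return [names for _, names in batches]

# Hypothetical per-chromosome RAM predictions (GB) on a 64 GB node.
preds = {"chr1": 38, "chr2": 35, "chr3": 30, "chr19": 9, "chr21": 6, "chrX": 22}
print(pack_tasks(preds, ram_budget_gb=64))
# [['chr1', 'chrX'], ['chr2', 'chr19', 'chr21'], ['chr3']]
```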
Food rescue organizations simultaneously tackle food insecurity and waste by working with volunteers to redistribute food from donors who have excess to recipients who need it. Volunteer feedback allows food rescue organizations to identify issues early and ensure volunteer satisfaction. However, food rescue organizations monitor feedback manually, which can be cumbersome and labor-intensive, making it difficult to prioritize which issues are most important. In this work, we investigate how large language models (LLMs) assist food rescue organizers in understanding and taking action based on volunteer experiences. We work with 412 Food Rescue, a large food rescue organization based in Pittsburgh, Pennsylvania, to design RescueLens, an LLM-powered tool that automatically categorizes volunteer feedback, suggests donors and recipients to follow up with, and updates volunteer directions based on feedback. We evaluate the performance of RescueLens on an annotated dataset, and show that it can recover 96% of volunteer issues at 71% precision. Moreover, by ranking donors and recipients according to their rates of volunteer issues, RescueLens allows organizers to focus on the 0.5% of donors responsible for more than 30% of volunteer issues. RescueLens is now deployed at 412 Food Rescue, and through semi-structured interviews with organizers, we find that RescueLens streamlines the feedback process so organizers can better allocate their time.
In large-scale recommendation systems like LinkedIn’s, the retrieval stage is critical for narrowing billions of potential candidates to a manageable subset for ranking. LinkedIn's feed now serves suggested content based on the topical interests of members, where 2000 candidates are retrieved from several million candidates within a latency budget of a few milliseconds at an inbound QPS of several thousand per second. This paper presents a novel retrieval approach that fine-tunes a large causal language model (Meta’s LLaMA 3) as a dual encoder to generate high-quality embeddings for both users (members) and content (items), using only textual input. We describe the end-to-end pipeline, including prompt design for embedding generation, techniques for fine-tuning at LinkedIn scale, and infrastructure for low-latency, cost-effective online serving. We share our findings on how quantizing numerical features in the prompt enables that information to be encoded in the embedding, facilitating greater alignment between the retrieval and ranking layers. The system was evaluated using offline metrics and an online A/B test, which showed substantial improvements in member engagement. We observed significant gains among newer members, who often lack strong network connections, indicating that high-quality suggested content aids retention. This work demonstrates how generative language models can be effectively adapted for real-time, high-throughput retrieval in industrial applications.
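Quantizing a numerical feature for the prompt amounts to bucketing it into a discrete textual level, so the language model sees a stable token rather than a raw float. A sketch with illustrative bin edges, labels, and prompt fields that are assumptions, not LinkedIn's actual scheme:

```python
def quantize_for_prompt(value, bins, labels):
    """Bucket a continuous feature into a discrete textual level for
    inclusion in the embedding prompt."""
    for edge, label in zip(bins, labels):
        if value < edge:
            return label
    return labels[-1]

# Hypothetical bucketing of a follower-count feature.
follower_bins = [100, 1_000, 10_000, 100_000]
follower_labels = ["very_low", "low", "medium", "high", "very_high"]
level = quantize_for_prompt(5_400, follower_bins, follower_labels)
prompt = f"Member profile: interests=machine_learning; follower_level={level}"
print(prompt)   # ... follower_level=medium
```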
Applying large, proprietary API-based language models to text-to-SQL tasks poses a significant industry challenge: reliance on massive, schema-heavy prompts results in prohibitive per-token API costs and high latency, hindering scalable production deployment. We present a specialized, self-hosted 8B-parameter model designed for a conversational bot in CriQ, a sister app to Dream11—India’s largest fantasy sports platform with over 250 million users—that answers user queries about cricket statistics. Our novel two-phase supervised fine-tuning approach enables the model to internalize the entire database schema, eliminating the need for long-context prompts. This reduces input tokens by over 99%, from a 17k-token baseline to fewer than 100, and replaces costly external API calls with efficient local inference. The resulting system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google’s Gemini Flash 2.0 (95.6% execution, 89.4% semantic accuracy). These results demonstrate a practical path toward high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted language models in large-scale production environments.
In this paper, we describe and benchmark a competitor-discovery component used within an agentic AI system for fast drug asset due diligence. A competitor-discovery AI agent, given an indication, retrieves all drugs comprising the competitive landscape of that indication and extracts canonical attributes for these drugs. The competitor definition is investor-specific, and the underlying data is paywalled or licensed, fragmented across registries, ontology-mismatched by indication, alias-heavy for drug names, multimodal, and rapidly changing. Although LLM-based AI systems are considered the best available tools for this problem, current systems cannot reliably retrieve all competing drug names, and there is no accepted public benchmark for this task. To address the lack of evaluation, we use LLM-based agents to transform five years of multimodal, unstructured diligence memos from a private biotech VC fund into a structured evaluation corpus mapping indications to competitor drugs with normalized attributes. We also introduce a competitor-validating LLM-as-a-judge agent that filters false positives from the list of predicted competitors to maximize precision and suppress hallucinations. On this benchmark, our competitor-discovery agent (Bioptic Agent) achieves 83% recall, exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). The system is deployed in production with enterprise users; in a case study with a biotech VC investment fund, analyst turnaround time for competitive analysis dropped from 2.5 days to ~3 hours (~20x).
In the user growth scenario, Internet companies invest heavily in paid acquisition channels to acquire new users, but sustainable growth depends on acquired users generating lifetime value (LTV) that exceeds customer acquisition cost (CAC). To maximize the LTV/CAC ratio, it is crucial to predict channel-level LTV at an early stage so that budget allocation can be further optimized. The LTV forecasting problem differs significantly from traditional time series forecasting and poses three main challenges. First, it is an unaligned multi-time series forecasting problem in which each channel has a number of LTV series with different activation dates. Second, predicting at an early stage faces the imbalanced short-input long-output (SILO) challenge. Moreover, compared with commonly used time series datasets, real LTV series are volatile and non-stationary, with more frequent fluctuations and higher variance. In this work, we propose a novel framework called Trapezoidal Temporal Fusion (TTF) to address these challenges. We introduce a trapezoidal multi-time series module to deal with the data unalignment and SILO challenges, and output accurate predictions with a multi-tower structure called MT-FusionNet. The framework has been deployed in Douyin's online system. Compared to the previously deployed online model, MAPE_p decreased by 4.3% and MAPE_a decreased by 3.2%, where MAPE_p denotes the point-wise MAPE of the LTV curve and MAPE_a denotes the MAPE of the aggregated LTV.
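Under the abstract's definitions, the two reported metrics can be written directly: MAPE_p averages the per-point relative error along the LTV curve, while MAPE_a compares the totals over the horizon. A short sketch with a hypothetical LTV curve:

```python
import numpy as np

def mape_p(y_true, y_pred):
    """Point-wise MAPE of the LTV curve: mean relative error per time point."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_pred - y_true) / np.abs(y_true))

def mape_a(y_true, y_pred):
    """MAPE of the aggregated LTV: relative error of the summed curve."""
    return abs(np.sum(y_pred) - np.sum(y_true)) / abs(np.sum(y_true))

ltv_true = [1.0, 1.8, 2.4, 2.9, 3.3]     # hypothetical channel LTV curve
ltv_pred = [0.9, 1.9, 2.3, 3.1, 3.2]
print(mape_p(ltv_true, ltv_pred), mape_a(ltv_true, ltv_pred))
```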
There is growing interest in applying artificial intelligence (AI) to automate and support complex decision-making tasks. However, it remains unclear how algorithms compare to human judgment in contexts requiring semantic understanding and domain expertise. We examine this in the context of the judge assignment problem, matching submissions to suitably qualified judges. Specifically, we tackled this problem at the Harvard President’s Innovation Challenge, the university’s premier venture competition awarding over $500,000 to student and alumni startups. This setting represents a real-world environment where high-quality judge assignment is essential. We developed an AI-based judge assignment algorithm, the Hybrid Lexical-Semantic Similarity Ensemble (HLSE), and deployed it at the competition. We then evaluated its performance against human expert assignments using blinded match-quality scores from judges on 309 judge-venture pairs. Using a Mann-Whitney U statistic-based test, we found no statistically significant difference in assignment quality between the two approaches (AUC=0.48, p=0.40); on average, algorithmic matches were rated 3.90 and manual matches 3.94 on a 5-point scale, where 5 indicates an excellent match. Furthermore, manual assignments that previously required a full week could be automated in several hours by the algorithm during deployment. These results demonstrate that HLSE achieves human-expert-level matching quality while offering greater scalability and efficiency, underscoring the potential of AI-driven solutions to support and enhance human decision-making for judge assignment in high-stakes settings.
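HLSE's blend of lexical and semantic similarity can be pictured as a weighted sum of TF-IDF cosine and embedding cosine scores over all judge-venture pairs. The `embed` function, the toy hash-based stand-in for a real sentence-embedding model, and the `alpha` weight below are all assumptions; the abstract does not specify the ensemble's exact components:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def hlse_scores(judge_bios, venture_summaries, embed, alpha=0.5):
    """Hybrid lexical-semantic ensemble sketch: blend TF-IDF cosine
    (lexical overlap) with embedding cosine (semantic similarity) for
    every judge-venture pair; pairs are then matched by score."""
    tfidf = TfidfVectorizer().fit(judge_bios + venture_summaries)
    lex = cosine_similarity(tfidf.transform(judge_bios),
                            tfidf.transform(venture_summaries))
    sem = cosine_similarity(embed(judge_bios), embed(venture_summaries))
    return alpha * lex + (1 - alpha) * sem

def embed(texts, dim=64):
    """Toy stand-in for a sentence-embedding model: bag of hashed words."""
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            vecs[i, hash(w) % dim] += 1.0
    return vecs

judges = ["biotech investor focused on oncology startups"]
ventures = ["early-stage oncology therapeutics company"]
print(hlse_scores(judges, ventures, embed))   # one score per judge-venture pair
```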