My doctoral research develops a unified framework for offline imitation learning (IL) that tackles three central challenges: achieving sample efficiency in strictly batch settings, ensuring robustness and generalization under dynamics shifts, and learning from demonstrations of varying quality. At the core of this work is a new paradigm for strictly offline IL based on enforcing the Markov Balance Equation (MBE), a fundamental structural property of trajectory data. Using advanced conditional density estimation, I developed two algorithms, CKIL and MBIL, which achieve state-of-the-art performance in high-dimensional continuous-control tasks. Building upon this foundation, I developed the first Distributionally Robust Offline IL framework under a stationarity constraint, enabling robustness to transition-model mismatch without requiring any additional interaction. I am now extending this direction through Robust Behavior Foundation Models (RBFMs), which aim to generalize across dynamics shifts for a wide range of tasks. Finally, I propose a variational approach for learning from crowdsourced demonstrations by inferring and accounting for demonstrator expertise. Together, these contributions yield principled and practical IL algorithms with strong performance and robustness, broadening the applicability of IL to real-world domains such as robotics, healthcare, and autonomous systems.
Explainability has emerged as a pillar of Trustworthy AI for ensuring safety in high-risk application domains. However, incorporating explainability to boost the transparency of black-box AI systems can inadvertently introduce unforeseen vulnerabilities. Previous research has drawn attention to privacy leakage, malicious or otherwise, from explainable interfaces, leading to the identification of individuals and the exposure of sensitive personal information. Privacy preservation methods used in response to this leakage have been found to adversely affect the utility of the system, including degradation of model accuracy and explanation quality. The proposed thesis will examine the advancement of Privacy Enhancing Technologies (PETs) in Explainable AI (XAI) while ensuring that users remain at the core of the design process. The main objectives of this research are: (1) determining defenses against privacy attacks in XAI, (2) building interpretable algorithms for private models, and (3) examining user requirements for privacy-preserving XAI. This research is expected to yield characteristics of privacy-preserving XAI, along with guidelines and recommendations for effectively building privacy-compliant XAI that considers the diverse needs of end users. The research outcomes will enable developers and researchers to design XAI that is safe for deployment and balances privacy, explainability, and utility.
Theory of Mind (ToM) enables agents to model others' mental states, but in mixed-motive games, this capacity can lead to deceptive behaviour and alignment risks. My research investigates how ToM affects strategic behaviour in partially observed games, contributing: (1) a formal model of ToM-driven manipulation in a preference elicitation task, (2) evidence that excessive ToM leads to paranoid-like overmentalisation, and (3) the Aleph-IPOMDP model, a framework for multi-agent systems that balances ToM reasoning with game-theoretic principles to prevent manipulation, deterring capable agents from deceiving. My work contributes to the understanding of deceptive AI, to overcoming deception in multi-agent systems, and to computational models of human cognition.
While Reinforcement Learning (RL) has demonstrated remarkable success in solving complex sequential decision-making problems, its application in real-world, safety-critical systems is hindered by its reliance on carefully engineered reward functions. Designing effective rewards is notoriously challenging and can lead to unintended or unsafe behaviors, a phenomenon known as reward hacking. Specification-guided RL has emerged as a principled alternative, leveraging formal methods to directly encode high-level objectives, safety requirements, and behavioral constraints. However, the practical utility of this approach is often limited by coarse or under-specified logical formulas and the computational challenge of enforcing safety at scale. This thesis addresses these limitations by developing a unified framework for the automated refinement, scalable enforcement, and flexible adaptation of formal specifications in RL.
The multi-agent path finding (MAPF) problem is a combinatorial search problem that aims to find collision-free paths for multiple agents in an environment, subject to constraints on path lengths. Real-world applications of MAPF require flexible, lifelong, robust, and explainable solutions; this study addresses these challenges.
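As a hedged illustration (my own sketch, not part of this thesis), the collision constraints that define MAPF are commonly formalized as vertex conflicts (two agents occupying the same location at the same timestep) and edge conflicts (two agents swapping locations between consecutive timesteps). A minimal validator for time-indexed paths might look like:

```python
def find_conflict(paths):
    """Return the first vertex or edge (swap) conflict between any two
    time-indexed agent paths, or None if the paths are collision-free.
    Each path is a list of locations; agents that reach their goal early
    are assumed to wait there."""
    horizon = max(len(p) for p in paths)
    at = lambda p, t: p[min(t, len(p) - 1)]  # agents wait at their goals
    for t in range(horizon):
        for i in range(len(paths)):
            for j in range(i + 1, len(paths)):
                # Vertex conflict: same location at the same timestep.
                if at(paths[i], t) == at(paths[j], t):
                    return ("vertex", i, j, t)
                # Edge conflict: agents swap locations between t-1 and t.
                if t > 0 and at(paths[i], t) == at(paths[j], t - 1) \
                        and at(paths[j], t) == at(paths[i], t - 1):
                    return ("edge", i, j, t)
    return None
```

For example, two agents trading cells, `find_conflict([[(0, 0), (0, 1)], [(0, 1), (0, 0)]])`, is reported as an edge conflict at timestep 1; solvers such as conflict-based search branch on exactly these conflicts.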
Causal discovery is the task of learning a causal model from a source of information. Traditionally, the community has focused on algorithms that infer causal models from observational and/or interventional data, while alternative approaches have been only marginally explored. The proposed work aims to contribute to the theoretical foundations connecting agent-based systems with causal modeling, and to identify conditions under which newly developed causal discovery algorithms can be applied to elicit causal knowledge from agents.
Autonomous driving has shown significant progress in recent years. The combination of advanced sensors, ample data, and machine learning algorithms has led to the deployment of autonomous vehicles (AVs) in cities like Los Angeles, San Francisco, and Phoenix. Because human drivers are imperfect, AVs must be able to plan, adapt, and react to environmental disturbances, including irrational human drivers. My research focuses on applying reinforcement learning (RL) techniques to validate AV-related cyber-physical systems (CPS) in realistic environments. I develop a custom RL environment that simulates highway driving scenarios with multiple vehicles. This environment includes a CPS model of adaptive cruise control (ACC), a lane-changing model (MOBIL), and an adversarial agent that learns to drive irrationally. My work extends interpretable RL techniques to continuous control tasks like autonomous driving.
Large language models (LLMs) have achieved remarkable success in natural language processing tasks but still struggle with complex causal and logical reasoning. Previous neuro-symbolic methods can be summarized into a two-stage framework: first translating natural language (NL) problems into a symbolic language (SL) representation, and then performing the symbolic reasoning process. To facilitate this direction, we provide a comprehensive survey, summarizing two main challenges, complex logical question answering (QA) and cross-question logical consistency, and further propose a new taxonomy. To achieve precise symbolic representation and enhance the accuracy of LLMs' logical reasoning, we propose several effective and efficient approaches, including adaptively selecting the most suitable SL for each QA problem, a data-driven approach to determining the order of fine-tuning samples, and an efficient multi-agent debate framework with sparse communication. Our future research will focus on theoretical analysis of optimal SL selection, translation refinement, and robust neuro-symbolic approaches to improve LLMs' reasoning.
The safe deployment of artificial intelligence systems hinges on their ability to recognize and appropriately handle inputs they have not been trained for. Out-of-Distribution (OOD) detection aims to provide this capability, yet most existing methods are developed under idealized assumptions that do not hold in the real world. This thesis challenges these assumptions by systematically addressing four key practical challenges: the semantic ambiguity of unlabeled data, the presence of domain shifts and class imbalances, the scarcity of labeled training data, and the need to operate on dynamic video streams instead of static images. The core of this research is a suite of four novel deep learning frameworks, each designed to overcome one of these specific limitations. My contributions push the field of OOD detection from a laboratory problem towards a robust and practical technology, essential for building trustworthy AI.
The author's PhD studies focus on the Stable Roommates problem and its variations, a human-centered and computationally challenging interdisciplinary problem. Motivated by real-world applications, and by the fact that the Stable Roommates problem does not always admit a stable solution, the goal is to develop novel computational methods that are not only computationally efficient but also yield solutions that are fair, personalized, and applicable in the real world, to the benefit of humans.
Model development in AI is shaped by developer decisions. While there is significant research on the opportunities and risks of multiplicity (the existence of many distinct models with comparable performance on the same task), little attention has been paid to how developer decisions impact multiplicity. My thesis focuses on (a) introducing broader frameworks to better situate and analyze developer decisions in AI, (b) identifying theoretical connections to characterize the influence of these decisions on multiplicity, and (c) operationalizing these insights across various applications, thus building responsible AI models under multiplicity.
Multi-agent reinforcement learning enables sophisticated collaborative behaviors in autonomous systems, yet fundamental scalability barriers persist: existing methods struggle to coordinate large agent populations and face challenges with extended decision-making horizons. This research develops hierarchical approaches to scale up multi-agent learning systems through two complementary directions: structural scaling for coordinating increasing numbers of agents and temporal scaling for extending decision-making horizons. This paper presents four integrated contributions: a taxonomic survey establishing hierarchical architectures as the theoretical foundation for scalable multi-agent learning systems, a benchmark for long-horizon multi-objective multi-agent reinforcement learning, a framework integrating self-organizing neural networks with multiple reinforcement learning agents for hierarchical tri-level control, and a framework leveraging large language models for zero-shot multi-agent planning. Through comprehensive validation, this work demonstrates that hierarchical, heterogeneous, modular architectures provide unified, interpretable solutions to multi-agent scalability, bridging theoretical multi-agent reinforcement learning research with real-world deployment requirements.
The rise of generative AI presents a profound duality. On one hand, it offers a powerful solution to data scarcity and privacy challenges in biometrics. On the other, it is weaponized to create deepfakes that threaten digital integrity. Existing detectors for these deepfakes are brittle, failing against real-world transformations and novel generative models. This dissertation confronts this duality head-on. First, I establish the viability of synthetic data for building fair and private biometric systems. Second, to counter the malicious use of this technology, this dissertation develops deepfake detectors designed to be robust, generalizable, and efficient by construction. My work introduces novel, lightweight feature sets based on different cues (e.g., the colour-based Relative Chrominance Difference, gradient features, and depth cues) that are inherently resilient to OSN transformations and improve generalisation to unseen forgeries. Results to date confirm state-of-the-art performance, achieving high accuracy in challenging real-world scenarios with a significant reduction in model complexity; my current and future work focuses on achieving superior generalisation while remaining resistant to OSN manipulations.
Autonomous systems operating in uncertain environments without human intervention must consider several factors, including safety, reliability, and task success. State-of-the-art methods have made progress in addressing these factors individually, but often fail to unify them for deployment in real-world systems. My dissertation aims to combine methods in planning under uncertainty, failure recovery, and explainability, providing a holistic framework for comprehensive safe autonomy in real-world deployment.
Time-series data, which represent the evolution of one or more variables over time, are ubiquitous across domains such as finance, medicine, industry, and security. Time-Series Anomaly Detection (TSAD) is essential for identifying irregular events such as equipment failures, fraudulent activities, and neurological disorders. Despite significant progress, TSAD remains challenging due to the complexity of time-series signals, the diversity of anomaly types, and the scarcity of high-quality labeled data. This thesis contributes: (i) the first comprehensive surveys of Graph-based TSAD (G-TSAD) and Self-Supervised Learning for Anomaly Detection (SSL-AD), showing how graph modeling and SSL proxy tasks yield robust representations for TSAD while mapping limits and future directions; (ii) EEG-CGS, a contrastive–generative SSL framework that encodes fine-grained subgraph structure without anomaly labels, improving multivariate TSAD and localizing anomalous sensors and regions; (iii) TSAD-C, which integrates graph representations with diffusion models to capture long-range temporal and spatial dependencies while explicitly handling contaminated training data; and (iv) extending TSAD beyond benchmark datasets into other impactful domains, and developing foundation models specialized for biosignals to detect novel anomalies in drug-resistant epilepsy patients.
Recent advances in deep neural networks have highlighted the importance of geometric shape in various image analysis and computer vision tasks. However, most current approaches rely on coarse or simplified shape representations, such as binary masks, meshes, or point clouds, that are primarily designed to capture global structures of objects presented in images. While effective for general image and visual understanding, these methods often fail to learn fine-grained geometric information that is critical for accurately modeling complex shapes and subtle anatomical variations. This limitation is particularly consequential in healthcare applications, where understanding fine-grained anatomical shapes and their changes is crucial for accurate disease detection and diagnosis. My research focuses on developing a set of advanced deep learning frameworks that learn robust and complex shape representations from dense image data and integrate them into the current paradigm of image appearance and texture learning.
Tabular data is a fundamental form of information in real-world applications, ranging from finance and healthcare to scientific research. Unlike traditional views that treat tables as isolated structured data, tables are often inherently multimodal—appearing as images, embedded in documents, or coexisting with text and other modalities. My research explores multimodal tabular data learning, aiming to bridge structured tabular knowledge with diverse input forms and tasks. To this end, our work investigates leveraging tabular data as expert knowledge to provide guidance for visual modalities and enable cross-modal transfer learning. We also study more common scenarios where tables appear as images, conducting comprehensive investigations from evaluation to method development for table-based question answering and reasoning. Beyond these works, we extend tabular learning to more general scenarios, developing unified models capable of handling diverse table tasks within a single framework, and further expanding from tables to broader document-level parsing and understanding.
My research investigates how to evaluate and enhance large language models’ (LLMs) alignment with human values in collective decision-making scenarios. I focus on three inter-related aspects of this challenge: (i) normative alignment, (ii) procedural competence, and (iii) personalization.
Autonomous driving must handle motion blur, low light, and fast-changing scenes, where RGB frames and event cameras provide complementary strengths. This thesis explores how to fuse them across the perception–reasoning–planning pipeline. It introduces FlexEvent, a frequency-robust detector with adaptive fusion and label-efficient training; Talk2Event, the first benchmark for event–language grounding with attribute-aware modeling; and EventDrive, an event–frame VLM covering the full driving loop. Together, these contributions advance robust perception, interpretable reasoning, and reliable planning for safety-critical driving through event–frame fusion.
Learning from human feedback enables AI systems and robots to learn policies that align with human intent. While existing work has primarily examined learning from demonstrations, corrections, and preferences in single-agent settings, these ideas have yet to be fully extended to multi-agent domains—where cooperation, decentralization, and non-stationary dynamics demand new methods. In this thesis summary, I highlight my current work and outline future directions for multi-robot learning from human feedback, offering deployment strategies that align supervisor intent with robot teams in the real world.
Transformers have reshaped modern artificial intelligence, yet their theoretical foundations remain incomplete. This thesis investigates the approximation power and memory limitations of transformers. I combine tools from approximation theory and statistical learning theory to provide provable guarantees on expressivity, memorization capacity, and inherent architectural constraints. My contributions include the first rigorous proof of memory bottlenecks in prompt tuning and new results on the expressivity of transformers. The long-term goal of my doctoral research is to develop a principled theoretical framework that grounds the empirical behavior of large-scale transformer models in formal approximation-theoretic results.
AI systems often fail on challenging or out-of-distribution inputs—a critical limitation in domains such as healthcare, finance, and autonomous driving. Learning to Defer (L2D) addresses this by training models not only to predict but also to decide when to defer to external experts. This thesis develops a unified and robust framework for L2D that advances its theoretical foundations, reliability, and applicability. It characterizes Bayes-optimal routing policies, establishes surrogate-consistency guarantees, and introduces a unified adversarial framework for attacking and defending L2D with Bayes-optimal robustness. It further proposes the first top-k deferral methods in both two-stage and one-stage settings. Empirical studies validate these ideas in multi-task learning and extractive question answering with large language models. Ongoing work explores token-level routing in LLMs, online adaptation with dynamic experts, and partial deferral.
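As a hedged illustration of the core deferral decision (a standard confidence-threshold baseline, not the learned Bayes-optimal rejectors this thesis develops), a model can defer whenever the expert's expected accuracy, net of a deferral cost, exceeds the model's own confidence on the input; all names and parameters here are my own:

```python
import numpy as np

def defer_policy(model_probs, expert_accuracy, defer_cost=0.0):
    """Confidence-based deferral baseline: defer to the expert when the
    expert's expected accuracy, minus the cost of querying the expert,
    exceeds the model's confidence (its max predicted probability).
    Learned L2D rejectors replace this fixed rule; it is shown only to
    illustrate the predict-or-defer decision."""
    confidence = np.max(model_probs)
    return (expert_accuracy - defer_cost) > confidence
```

For instance, with `model_probs = [0.4, 0.3, 0.3]` and a 90%-accurate expert, the rule defers; raising `defer_cost` shifts the trade-off back toward the model, which is the trade-off an L2D objective optimizes jointly with the classifier.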
Global biodiversity is declining at unprecedented rates, yet traditional monitoring at the necessary scales remains costly and biased toward what can be seen. Sound offers a complementary lens: many species are detected more reliably by their vocalizations, microphones are inexpensive and unobtrusive, and they can cover greater spatial and temporal scales. These advantages have made passive acoustic monitoring a fast-growing paradigm, yet robust, generalizable sound distinction in complex soundscapes remains a central obstacle. My thesis addresses this by combining data-driven human-inspired representation learning with knowledge-guided unsupervised learning to prioritize hierarchical organization and structure discovery prior to labelling. Human-in-the-loop oversight is incorporated as targeted verification under uncertainty, drawing on active learning and weak supervision to direct effort where it has the highest value.
The rapid advancement of generative models has created new opportunities for addressing core challenges in computer vision, including data scarcity, image quality, and efficient personalization. My research develops principled, resource-aware methods that enable models to generalize effectively from limited supervision, adapt efficiently to new concepts, and generate high-fidelity visual content. I first address few-shot learning through augmentation-driven uncertainty-guided mixup, improving robustness in data-constrained regimes. Building on this, I propose caption-guided multi-modal augmentation techniques that enrich visual diversity while mitigating real-to-synthetic domain gaps. To enhance the quality and realism of generated images, I introduce diffusion models grounded in natural image statistics, yielding perceptually aligned outputs suitable for downstream tasks. To advance personalization, I develop parameter-efficient mechanisms for combining low-rank adapters, enabling fine-grained control over content and style without retraining. I further extend personalization to a zero-shot setting through a training-free textual-inversion-based method that customizes arbitrary objects directly within the diffusion process. Finally, I present a frequency-guided multi-LoRA fusion framework that leverages wavelet-domain cues and timestep-aware weighting for accurate, training-free concept composition. Collectively, these contributions move toward a unified vision of generative models that are efficient, adaptive, and capable of high-quality, customizable image synthesis.
Nature is inherently structured! The entities in the real world are naturally organized in rich relationships. For example, dolphins and sharks, despite their striking visual resemblance in body shape and fins, are actually from entirely different branches of the animal hierarchy, i.e., mammals and fishes, respectively. This remarkable similarity is a prime example of ‘convergent evolution’, where unrelated species develop similar features because they face similar environmental challenges. This illustrates how nature’s underlying organization often transcends superficial visual resemblances. Although humans intuitively grasp and utilize these profound natural constraints, they are typically underutilized in most AI systems. As a result, trained AI models tend to align with statistical patterns in the data, such as sampling biases or class imbalance, rather than adhering to the underlying relational consistency. This thesis argues that AI systems must evolve beyond learning “flat” feature representations, which are domain-agnostic and derived purely from data correlations, to “explicitly model the domain-specific structural relationships”. A key benefit of encoding relational priors in the learning process is that it can inject domain knowledge as an inductive bias, leading to more robust and reliable models. My research investigates incorporating domain knowledge by leveraging “graph-based structural priors” that explicitly model relational constraints in various visual recognition tasks. This work spans three distinct dimensions of visual recognition, progressing from coarse-level (image-level) to fine-grained (scene-level) understanding. My research highlights a crucial limitation in existing AI models: they often fail to incorporate real-world constraints, leading to significant errors. I show that even powerful, pre-trained neural networks can make severe mistakes due to a lack of domain knowledge. 
I argue that standard metrics like top-1 accuracy, precision, and recall are insufficient for evaluating model robustness, and propose a new metric based on the rank order of predictions as a better indicator of reliability. Benchmarks on various large-scale datasets confirm that existing solutions do not sufficiently capture domain knowledge, which is often available as a taxonomy tree, motivating our design of better learning frameworks. I also examine complex visual re-identification (Re-ID) tasks, such as monitoring animals in the wild. I find that existing foundational models struggle with new species and environments. This challenge is compounded by the high cost of manual annotation for adapting these systems to new settings. While existing unsupervised learning methods can help reduce the need for extensive labeling, they often suffer from under- and over-segmentation errors, which led me to develop more effective active learning strategies. Finally, I address the limitations of the classic Kalman filter, a widely used tool for dynamic systems. I point out that this filter makes a flawed assumption: that the movement of each individual object is independent of its dynamic surroundings. In the real world, this is rarely the case. I demonstrate the need for a new filtering mechanism that considers not only an object's past movements but also its spatial relationship with other dynamic entities in its environment. In my analysis, I observed that vision foundation models for all recognition tasks, i.e., classification, detection, and segmentation, lack domain knowledge. I believe that our learning framework, which was designed specifically for classification, can be adapted to other recognition tasks, and I speculate that a unified learning framework could make vision foundation models aware of the available taxonomy.
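For concreteness, here is a hedged sketch of the standard Kalman filter step being criticized (generic linear-Gaussian notation of my own choosing, not the thesis's proposed filter). Note that nothing in the predict/update cycle couples one tracked object to any other: each track sees only its own state, covariance, and measurement.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of a standard Kalman filter for a single
    track (state x, covariance P, measurement z, motion model F,
    observation model H, process/measurement noise Q/R). The flaw
    discussed above is visible here: no term couples this object to
    other moving objects in the scene."""
    # Predict: propagate state and covariance through the motion model.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the measurement via the gain K.
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

An interaction-aware filter of the kind advocated above would instead add terms that condition each object's predicted motion on the states of nearby dynamic entities.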