Machine learning models have become ubiquitous over the last decade, and with their increasing use in critical applications (e.g., healthcare, financial systems, and crime forecasting), it is vital that ML developers and practitioners understand and trust the models' decisions. This problem has become paramount in the era of frontier models, which are built by training billion-parameter models on broad, uncurated datasets with extensive compute. In this talk, we will first examine the (un)reliability of existing explainability techniques for large language and multimodal models, and the robustness and safety implications of mechanistic interpretability tools. We will then turn to two complementary threads: i) domain-specific safety and trustworthiness evaluation that surfaces risks missed by generic red-teaming, focusing on multilingual and distribution-shifted settings; and ii) methods that explicitly train and assess reasoning in medical LLMs.