Convolutional neural networks have long dominated 3D medical image segmentation but may be limited by their small receptive fields. Transformer models excel at capturing global relationships through self-attention but are challenged by high computational costs at high resolutions. Recently, Mamba, a state space model, has emerged as an effective approach for sequential modeling. Inspired by its success, we introduce a novel Mamba-based 3D medical image segmentation model called EM-Net. It not only efficiently captures attentive interaction between regions by integrating and selecting channels, but also effectively utilizes the frequency domain to harmonize the learning of features across varying scales, all while accelerating training. Comprehensive experiments on two challenging multi-organ datasets against other state-of-the-art (SOTA) algorithms show that our method achieves better segmentation accuracy while requiring nearly half the parameters of SOTA models and training twice as fast. Our code is publicly available at https://github.com/zang0902/EM-Net.
Segmentation models for thyroid ultrasound images are challenged by domain gaps across multi-center data. Some methods address this issue by enforcing consistency across multiple domains or by simulating domain gaps through augmentation of a single domain. Among them, single-domain generalization methods offer a more universal solution, but their heavy reliance on data augmentation causes two issues for ultrasound image segmentation. First, the corruption introduced by data augmentation may alter the distribution of diagnostically significant grayscale values, degrading the model's segmentation ability. Second, the real domain gap between ultrasound images is difficult to simulate, so the learned features remain correlated with the domain, which in turn prevents the construction of a domain-independent latent space. To address these issues, given that the shape distribution of nodules is task-relevant but domain-independent, we propose the SHape-prior Affine Network (SHAN). SHAN uses the shape prior as a stable latent mapping space, learning the aspect ratio, size, and location of nodules through affine transformation of the prior. Thus, our method enhances the segmentation capability and cross-domain generalization of the model without any data augmentation. Additionally, SHAN is designed as a plug-and-play method that can improve the performance of segmentation models with an encoder-decoder structure. Our experiments are performed on the public TN3K dataset and a private TUI dataset with 6 domains. By combining SHAN with several segmentation methods and comparing them with other single-domain generalization methods, we show that SHAN performs best on both source- and target-domain data.
Automatic surgical video analysis is pivotal in enhancing the effectiveness and safety of robot-assisted minimally invasive surgery. This study introduces a novel procedure planning task aimed at predicting target-conditioned actions in surgical videos to achieve desired visual goals, thereby addressing the question of "What to do to achieve a desired visual goal?". Leveraging recent advancements in deep learning, particularly diffusion models, our work proposes the Multi-Scale Phase-Condition Diffusion (MS-PCD) framework. This innovative approach incorporates multi-scale visual features into the diffusion process, conditioned by phase class, to generate goal-conditioned plans. By cascading multiple diffusion models with inputs at different scales, MS-PCD adaptively extracts fine-grained visual features, significantly enhancing procedure planning performance in unstructured robotic surgical videos. We establish a new benchmark for procedure planning in robotic surgical videos using the publicly available PSI-AVA dataset, demonstrating that our method notably outperforms existing baselines on several metrics. Our research not only presents an innovative approach to surgical video analysis but also opens new avenues for automation in surgical procedures, contributing to both patient safety and surgical training.
The rising interest in pooling neuroimaging data from various sources presents challenges regarding scanner variability, known as scanner effects. While numerous harmonization methods aim to tackle these effects, they face issues with model robustness, brain structural modifications, and over-correction. To combat these issues, we propose a novel harmonization approach centered on simulating scanner effects through augmentation methods. This strategy enhances model robustness by providing extensive simulated matched data, comprising sets of images with similar brains but varying scanner effects. Our proposed method, ESPA, is an unsupervised harmonization framework based on Enhanced Structure Preserving Augmentation. Additionally, we introduce two domain-adaptation augmentations: tissue-type contrast augmentation and GAN-based residual augmentation, both focusing on appearance-based changes to address structural modifications. While the former adapts images to the tissue-type contrast distribution of a target scanner, the latter generates residuals added to the original image for more complex scanner adaptation. These augmentations assist ESPA in mitigating over-correction through data stratification or population matching strategies during augmentation configuration. Notably, we leverage our unique in-house matched dataset as a benchmark to compare ESPA against supervised and unsupervised state-of-the-art (SOTA) harmonization methods. Our study marks the first attempt, to the best of our knowledge, to address harmonization by simulating scanner effects. Our results demonstrate the successful simulation of scanner effects, with ESPA outperforming SOTA methods using this harmonization approach.
Improving the fairness of federated learning (FL) benefits healthy and sustainable collaboration, especially for medical applications. However, existing fair FL methods ignore the specific characteristics of medical FL applications, i.e., domain shift among the datasets from different hospitals. In this work, we propose Fed-LWR to improve performance fairness from the perspective of feature shift, a key issue influencing the performance of medical FL systems caused by domain shift. Specifically, we dynamically perceive the bias of the global model across all hospitals by estimating the layer-wise difference in feature representations between local and global models. To minimize global divergence, we assign higher weights to hospitals with larger differences. The estimated client weights help us to re-aggregate the local models per layer to obtain a fairer global model. We evaluate our method on two widely used federated medical image segmentation benchmarks. The results demonstrate that our method achieves better and fairer performance compared with several state-of-the-art fair FL methods.
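To make the layer-wise re-aggregation concrete, the following is a minimal PyTorch sketch assuming that per-layer divergence scores between local and global feature representations have already been estimated; the softmax weighting and temperature are illustrative choices in the spirit of Fed-LWR, not necessarily the paper's exact formulation.

```python
# Minimal sketch of layer-wise re-aggregation in the spirit of Fed-LWR.
# The divergence scores and weighting scheme here are illustrative assumptions.
import torch

def layerwise_reaggregate(local_states, divergences, temperature=1.0):
    """local_states: list of state_dicts (one per hospital/client).
    divergences: dict {layer_name: tensor of shape [n_clients]} holding the
    estimated local-vs-global feature differences for that layer.
    Clients with larger divergence receive larger weights, pulling the
    global model toward under-represented feature distributions."""
    global_state = {}
    n_clients = len(local_states)
    for name in local_states[0]:
        d = divergences.get(name, torch.zeros(n_clients))
        w = torch.softmax(d / temperature, dim=0)  # higher divergence -> higher weight
        stacked = torch.stack([s[name].float() for s in local_states])
        global_state[name] = (w.view(-1, *([1] * (stacked.dim() - 1))) * stacked).sum(0)
    return global_state
```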
Semi-supervised medical image segmentation (SSMIS) has demonstrated the potential to mitigate the issue of limited labeled medical data. However, confirmation and cognitive biases may affect the prevalent teacher-student based SSMIS methods due to erroneous pseudo-labels. To tackle this challenge, we improve the mean teacher approach and propose the Students Discrepancy-Informed Correction Learning (SDCL) framework, which includes two students and one non-trainable teacher and utilizes the segmentation difference between the two students to guide self-correcting learning. The essence of SDCL is to identify areas of segmentation discrepancy as potential bias areas, and then encourage the model to review the correct cognition and rectify its own biases in these areas. To facilitate bias correction learning with continuous review and rectification, two correction loss functions are employed to minimize the correct segmentation voxel distance and maximize the erroneous segmentation voxel entropy. We conducted experiments on three public medical image datasets: two 3D datasets (CT and MRI) and one 2D dataset (MRI). The results show that our SDCL surpasses the current state-of-the-art (SOTA) methods by 2.57%, 3.04%, and 2.34% in Dice score on the Pancreas, LA, and ACDC datasets, respectively. In addition, the accuracy of our method is very close to that of the fully supervised method on the ACDC dataset, and even exceeds it on the Pancreas and LA datasets. (Code available at https://github.com/pascalcpp/SDCL)
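As a rough illustration of the two correction losses described above, the PyTorch sketch below masks discrepancy voxels between the two students and applies a distance term on the correctly segmented student and an entropy-maximization term on the erroneous one; the exact masking rules and loss forms are assumptions, and the symmetric term with the students swapped is omitted for brevity.

```python
# Hedged sketch of the two correction losses described for SDCL.
import torch
import torch.nn.functional as F

def sdcl_correction_losses(p1, p2, pseudo_label, eps=1e-6):
    """p1, p2: softmax probabilities from the two students, shape [B, C, ...].
    pseudo_label: teacher hard labels (long), shape [B, ...].
    Discrepancy voxels (students disagree) are treated as potential bias areas."""
    pred1, pred2 = p1.argmax(1), p2.argmax(1)
    discrepancy = (pred1 != pred2)                     # potential bias region
    correct1 = discrepancy & (pred1 == pseudo_label)   # student 1 right here
    wrong2 = discrepancy & (pred2 != pseudo_label)     # student 2 wrong here

    onehot = F.one_hot(pseudo_label, p1.shape[1]).movedim(-1, 1).float()
    # (1) pull the correct student's probabilities toward the pseudo-label
    dist = ((p1 - onehot) ** 2).sum(1)
    loss_correct = (dist * correct1).sum() / (correct1.sum() + eps)
    # (2) push the erroneous student toward high entropy ("forget" its bias)
    entropy2 = -(p2 * (p2 + eps).log()).sum(1)
    loss_entropy = -(entropy2 * wrong2).sum() / (wrong2.sum() + eps)
    return loss_correct, loss_entropy
```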
We propose a self-supervised model producing 3D anatomical positional embeddings (APE) of individual medical image voxels. APE encodes voxels’ anatomical closeness, i.e., voxels of the same organ or nearby organs always have closer positional embeddings than the voxels of more distant body parts. In contrast to the existing models of anatomical positional embeddings, our method is able to efficiently produce a map of voxel-wise embeddings for a whole volumetric input image, which makes it an optimal choice for different downstream applications. We train our APE model on 8400 publicly available CT images of abdomen and chest regions. We demonstrate its superior performance compared with the existing models on anatomical landmark retrieval and weakly-supervised few-shot localization of 13 abdominal organs. As a practical application, we show how to cheaply train APE to crop raw CT images to different anatomical regions of interest with 0.99 recall, while reducing the image volume by 10-100 times. The code and the pre-trained APE model are available at https://github.com/mishgon/ape.
High myopia significantly increases the risk of irreversible vision loss. Traditional perimetry-based visual field (VF) assessment provides systematic quantification of visual loss but it is subjective and time-consuming. Consequently, machine learning models utilizing fundus photographs to estimate VF have emerged as promising alternatives. However, due to the high variability and the limited availability of VF data, existing VF estimation models fail to generalize well, particularly when facing out-of-distribution data across diverse centers and populations. To tackle this challenge, we propose a novel, parameter-efficient framework to enhance the generalized robustness of VF estimation on both in- and out-of-distribution data. Specifically, we design a Refinement-by-Denoising (RED) module for feature refinement and adaptation from pretrained vision models, aiming to learn high-entropy feature representations and to mitigate the domain gap effectively and efficiently. Through independent validation on two distinct real-world datasets from separate centers, our method significantly outperforms existing approaches in RMSE, MAE and correlation coefficient for both internal and external validation. Our proposed framework benefits both in- and out-of-distribution VF estimation, offering significant clinical implications and potential utility in real-world ophthalmic practices.
Generating realistic images to accurately predict changes in the structure of brain MRI can be a crucial tool for clinicians. Such applications can help assess patients’ outcomes and analyze how diseases progress at the individual level. However, existing methods developed for this task present some limitations. Some approaches attempt to model the distribution of MRI scans directly by conditioning the model on patients’ ages, but they fail to explicitly capture the relationship between structural changes in the brain and time intervals, especially on age-unbalanced datasets. Other approaches simply rely on interpolation between scans, which limits their clinical application as they do not predict future MRIs. To address these challenges, we propose a Temporally-Aware Diffusion Model (TADM), which introduces a novel approach to accurately infer progression in brain MRIs. TADM learns the distribution of structural changes in terms of intensity differences between scans and combines the prediction of these changes with the initial baseline scans to generate future MRIs. Furthermore, during training, we propose to leverage a pre-trained Brain-Age Estimator (BAE) to refine the model’s training process, enhancing its ability to produce accurate MRIs that match the expected age gap between baseline and generated scans. Our assessment, conducted on 634 subjects from the OASIS-3 dataset, uses similarity metrics and region sizes computed by comparing predicted and real follow-up scans on 3 relevant brain regions. TADM achieves large improvements over existing approaches, with an average decrease of 24% in region size error and an improvement of 4% in similarity metrics. These evaluations demonstrate the improvement of our model in mimicking temporal brain neurodegenerative progression compared to existing methods. We believe that our approach will significantly benefit clinical applications, such as predicting patient outcomes or improving treatments for patients.
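The residual formulation can be summarized in a short, hedged sketch: the diffusion model is trained on intensity differences (follow-up minus baseline), and a sampled difference is added back to the baseline at inference. The `diffusion_model` object and its `sample` interface are hypothetical placeholders, not TADM's actual API.

```python
# Illustrative sketch of a residual (intensity-difference) formulation,
# in the spirit of TADM; the sampling interface is a hypothetical placeholder.
def tadm_training_target(baseline_mri, followup_mri):
    # what the diffusion model learns to generate
    return followup_mri - baseline_mri

def tadm_predict_followup(baseline_mri, time_gap, diffusion_model):
    # hypothetical API: sample an intensity difference conditioned on the
    # baseline scan and the desired time gap, then add it back
    delta = diffusion_model.sample(cond_image=baseline_mri, cond_gap=time_gap)
    return baseline_mri + delta
```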
Single domain generalization (single-DG) for medical image segmentation aims to learn a style-invariant representation, which can generalize to a variety of unseen target domains, using data from a single source. However, due to the limited sample diversity in the single source domain, the robustness of the generalized features yielded by existing single-DG methods is still unsatisfactory. In this paper, we propose a novel single-DG framework, namely Hallucinated Style Distillation (HSD), to generate robust style-invariant feature representations. Particularly, our HSD first expands the style diversity of the single source domain by hallucinating samples with random styles. Then, a hallucinated cross-domain distillation paradigm is proposed to distill style-invariant knowledge between the original and style-hallucinated medical images. Since hallucinated styles close to the source domain may overfit our distillation paradigm, we further propose a learning objective to diversify the style-invariant representation, which alleviates the over-fitting issue and smooths the learning of generalized features. Extensive experiments on two standard domain-generalized medical image segmentation datasets show the state-of-the-art performance of our HSD. Source code will be publicly available.
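One common way to realize "hallucinating samples with random styles" from a single source domain is to perturb channel-wise feature statistics; the sketch below follows that convention as an assumption and may differ from HSD's actual hallucination operator.

```python
# Minimal sketch of style hallucination via randomly perturbed channel-wise
# statistics (an AdaIN-like convention); HSD's exact operator may differ.
import torch

def hallucinate_style(x, alpha=1.0):
    """x: image or feature batch [B, C, H, W]. Replaces each sample's
    channel-wise mean/std with randomly perturbed statistics."""
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True) + 1e-6
    normalized = (x - mu) / sigma
    # hallucinated statistics: random multiplicative jitter around the source stats
    new_mu = mu * (1.0 + 0.1 * alpha * torch.randn_like(mu))
    new_sigma = sigma * (1.0 + 0.1 * alpha * torch.randn_like(sigma))
    return normalized * new_sigma + new_mu
```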
3D scene reconstruction from stereo endoscopic video data is crucial for advancing surgical interventions. In this work, we present an online framework for real-time, dense 3D scene reconstruction and tracking, aimed at enhancing surgical scene understanding and assisting interventions. Our method dynamically extends a canonical scene representation using Gaussian splatting, while modeling tissue deformations through a sparse set of control points. We introduce an efficient online fitting algorithm that optimizes the scene parameters, enabling consistent tracking and accurate reconstruction. Through experiments on the StereoMIS dataset, we demonstrate the effectiveness of our approach, outperforming state-of-the-art tracking methods and achieving comparable performance to offline reconstruction techniques. Our work enables various downstream applications thus contributing to advancing the capabilities of surgical assistance systems.
Accurate segmentation of the pulp cavity, root canals, and inferior alveolar nerve (IAN) in dental imaging is essential for effective orthodontic interventions. Despite the availability of numerous Cone Beam Computed Tomography (CBCT) scans annotated for individual dental-anatomical structures, there is no comprehensive dataset covering all necessary parts. As a result, existing deep learning models have encountered challenges due to the scarcity of datasets encompassing all relevant anatomical structures. We present our novel Pulpy3D dataset, specifically curated to address the segmentation and identification needs of dental-anatomical structures. Additionally, we noticed that many current deep learning methods in dental imaging prefer 2D segmentation, missing out on the benefits of 3D segmentation. Our study proposes a UNet-based approach capable of segmenting dental structures using 3D volume segmentation, providing a better understanding of spatial relationships and a more precise representation of dental anatomy. A seeding model built from 150 Pulpy3D scans was used to help annotate the remainder of the dataset. Further architectural variants, such as separate per-structure networks, a single semantic network, and a multi-task network, are described in the model description to show how versatile the Pulpy3D dataset is and how different models, architectures, and tasks can run on it. Additionally, we stress the lack of attention to pulp segmentation tasks in existing studies, underlining the need for specialized methods in this area. The code and Pulpy3D links can be found at https://github.com/mahmoudgamal0/Pulpy3D
Modern neuroimaging technologies set the stage for studying structural connectivity (SC) and functional connectivity (FC) \textit{in-vivo}. Due to the distinct biological wiring underpinnings of SC and FC, however, it is challenging to understand their coupling mechanism using statistical association approaches. We seek to answer this challenging neuroscience question through the lens of a novel perspective rooted in network topology. Specifically, our assumption is that each FC instance is either locally supported by the direct link of SC or collaboratively sustained by a group of alternative SC pathways which form a topological notion of \textit{detour}. In this regard, we propose a new connectomic representation, coined detour connectivity (DC), to characterize the complex relationship between SC and FC by representing direct FC with the weighted connectivity strength along indirect SC routes. Furthermore, we present the SC-FC Detour Network (SFDN), a graph neural network that integrates DC embedding through a self-attention mechanism, to optimize detours such that the coupling of SC and FC is closely aligned with the evolution of cognitive states. We have applied the concept of DC in network community detection, while the clinical value of our SFDN is evaluated in cognitive task recognition and early diagnosis of Alzheimer's disease. After benchmarking on three public datasets under various brain parcellations, our detour-based computational approach shows significant improvement over current state-of-the-art counterpart methods.
Cell nuclei segmentation is crucial in digital pathology for various diagnoses and treatments, and is predominantly performed using semantic segmentation methods that focus on scalable receptive fields and multi-scale information. In such segmentation tasks, U-Net based task-specific encoders excel at capturing fine-grained information but fall short in integrating high-level global context. Conversely, foundation models inherently grasp coarse-level features but are not as proficient as task-specific models at providing fine-grained details. To this end, we propose utilizing the foundation model to guide task-specific supervised learning by dynamically combining their global and local latent representations via our proposed X-Gated Fusion Block, which uses a gated squeeze-and-excitation block followed by cross-attention to dynamically fuse latent representations. Through our experiments across datasets and visualization analysis, we demonstrate that integrating task-specific knowledge with general insights from foundation models can drastically increase performance, even outperforming domain-specific semantic segmentation models to achieve state-of-the-art results, increasing the Dice score and mIoU by approximately 12% and 17.22% on CryoNuSeg, 15.55% and 16.77% on NuInsSeg, and 9% on both metrics for the CoNIC dataset. Our code will be released at https://cvpr-kit.github.io/SAM-Guided-Enhanced-Nuclei-Segmentation/.
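Below is a hedged PyTorch sketch of a gated squeeze-and-excitation step followed by cross-attention, in the spirit of the described X-Gated Fusion Block; the token shapes, gating source, and layer sizes are assumptions rather than the released implementation.

```python
# Hedged sketch: SE-style channel gating of task-specific tokens by the
# foundation-model features, followed by cross-attention fusion.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim, heads=4, reduction=4):
        super().__init__()
        self.se = nn.Sequential(                      # squeeze-and-excitation gate
            nn.AdaptiveAvgPool1d(1),                  # squeeze over tokens
            nn.Conv1d(dim, dim // reduction, 1), nn.ReLU(),
            nn.Conv1d(dim // reduction, dim, 1), nn.Sigmoid())
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_feat, global_feat):
        """local_feat: task-specific tokens [B, N, C];
        global_feat: foundation-model tokens [B, M, C]."""
        gate = self.se(global_feat.transpose(1, 2)).transpose(1, 2)  # [B, 1, C]
        gated = local_feat * gate                                    # channel re-weighting
        fused, _ = self.attn(query=gated, key=global_feat, value=global_feat)
        return self.norm(gated + fused)                              # residual fusion
```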
Automated diabetic retinopathy (DR) lesion segmentation aids in improving the efficiency of DR detection. However, obtaining lesion annotations for model training heavily relies on domain expertise and is a labor-intensive process. Beyond classical methods for alleviating label scarcity, such as self-supervised and semi-supervised learning, the rapid development of generative models has led several studies to indicate that utilizing synthetic image-mask pairs as data augmentation is promising. Due to the insufficient labeled data available to train powerful generative models, however, synthetic fundus data suffers from two drawbacks: 1) unrealistic anatomical structures, 2) limited lesion diversity. In this paper, we propose a novel framework to synthesize fundus images with DR lesion masks under limited labels. To increase lesion variation, we design a learnable module to generate anatomically plausible masks as the condition, rather than directly using lesion masks from the limited dataset. To reduce the difficulty of learning intricate structures, we avoid generating images solely from lesion mask conditions. Instead, we develop an inpainting strategy that enables the model to generate lesions only within the mask area, based on easily accessible healthy fundus images. Subjective evaluations indicate that our approach generates more realistic fundus images with lesions than other generative methods. The downstream lesion segmentation experiments demonstrate that our synthetic data yields the largest improvement across multiple network architectures, surpassing state-of-the-art methods.
The Magnetic Resonance Fingerprinting (MRF) approach aims to estimate multiple MR or physiological parameters simultaneously with a single fast acquisition sequence. Most of the MRF studies proposed so far have used simple MR sequence types to measure relaxation times (T1, T2). In that case, deep learning algorithms have been successfully used to speed up the reconstruction process. In theory, the MRF concept could be used with a variety of other MR sequence types and should be able to provide more information about tissue microstructures. Yet, increasing the complexity of the numerical models often leads to prohibitive simulation times, and estimating multiple parameters from one sequence implies new dictionary dimensions whose sizes become too large for standard computers and DL architectures. In this paper, we propose to analyze the MRF signal coming from a complex balanced Steady-State Free Precession (bSSFP) type sequence to simultaneously estimate relaxometry maps (T1, T2), field maps (B1, B0), as well as microvascular properties such as the local Cerebral Blood Volume (CBV) or the averaged vessel Radius (R). To bypass the curse of dimensionality, we propose an efficient way to simulate the MR signal coming from numerical voxels containing realistic microvascular networks, as well as a Bidirectional Long Short-Term Memory network that replaces the matching process. On top of standard MRF maps, our results on 3 human volunteers suggest that our approach can quickly produce high-quality quantitative maps of microvascular parameters that are otherwise obtained using longer dedicated sequences and intravenous injection of a contrast agent. This approach could be used for the management of multiple pathologies and could be tuned to provide other types of microstructural information.
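For intuition, here is a minimal sketch of a bidirectional LSTM regressor that replaces dictionary matching by mapping each voxel's fingerprint time course to the quantitative parameters (T1, T2, B1, B0, CBV, R); the layer sizes and the use of the last time step are assumptions, not the paper's exact architecture.

```python
# Illustrative BiLSTM regressor standing in for dictionary matching.
import torch
import torch.nn as nn

class MRFBiLSTM(nn.Module):
    def __init__(self, hidden=128, n_params=6):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                            num_layers=2, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_params)

    def forward(self, signal):                  # signal: [B, T] fingerprint magnitude
        out, _ = self.lstm(signal.unsqueeze(-1))
        return self.head(out[:, -1])            # regress parameters from the last step
```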
Accurate segmentation of ovarian tumors from medical images is crucial for early diagnosis, treatment planning, and patient management. However, the diverse morphological characteristics and heterogeneous appearances of ovarian tumors pose significant challenges to automated segmentation methods. In this paper, we propose MBA-Net, a novel architecture that integrates the powerful segmentation capabilities of the Segment Anything Model (SAM) with domain-specific knowledge for accurate and robust ovarian tumor segmentation. MBA-Net employs a hybrid encoder architecture, where the encoder consists of a prior branch, which inherits the SAM encoder to capture robust segmentation priors, and a domain branch, specifically designed to extract domain-specific features. The bidirectional flow of information between the two branches is facilitated by the robust feature injection network (RFIN) and the domain knowledge integration network (DKIN), enabling MBA-Net to leverage the complementary strengths of both branches. We extensively evaluate MBA-Net on the public multi-modality ovarian tumor ultrasound dataset and the in-house multi-site ovarian tumor MRI dataset. Our proposed method consistently outperforms state-of-the-art segmentation approaches. Moreover, MBA-Net demonstrates superior generalization capability across different imaging modalities and clinical sites.
The dynamic 3D shape of a cell acts as a signal of its physiological state, reflecting the interplay of environmental stimuli and intra- and extra-cellular processes. However, there is little quantitative understanding of cell shape determination in 3D, largely due to the lack of data-driven methods that analyse 3D cell shape dynamics. To address this, we have developed MorphoSense, an interpretable, variable-length multivariate time series classification (TSC) pipeline based on multiple instance learning (MIL). We use this pipeline to classify 3D cell shape dynamics of perturbed cancer cells and learn hallmark 3D shape changes associated with clinically relevant and shape-modulating small molecule treatments. To show the generalisability across datasets, we apply our pipeline to classify migrating T-cells in collagen matrices and assess interpretability on a synthetic dataset. Across datasets, our pipeline offers increased predictive performance and higher-quality interpretations. To our knowledge, our work is the first to utilise MIL for multivariate, variable-length TSC, focusing on interpretable 3D morphodynamic profiling of biological cells.
Since its introduction, UNet has been leading a variety of medical image segmentation tasks. Although numerous follow-up studies have been dedicated to improving the performance of the standard UNet, few have conducted in-depth analyses of the patterns UNet actually learns in medical image segmentation. In this paper, we explore the patterns learned in a UNet and observe two important factors that potentially affect its performance: (i) irrelevant features learned due to asymmetric supervision; (ii) feature redundancy in the feature map. To this end, we propose to balance the supervision between encoder and decoder and to reduce the redundant information in the UNet. Specifically, we use the feature map that contains the most semantic information (i.e., the last layer of the decoder) to provide additional supervision to other blocks, and we reduce feature redundancy by leveraging feature distillation. The proposed method can be easily integrated into existing UNet architectures in a plug-and-play fashion with negligible computational cost. The experimental results suggest that the proposed method consistently improves the performance of standard UNets on four medical image segmentation datasets.
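A small sketch of the kind of feature distillation described above, where the most semantic decoder feature map supervises earlier blocks; the channel-averaging and MSE choices are simplifying assumptions (the actual method may use learned projections).

```python
# Hedged sketch: distill the last decoder feature map into shallower blocks.
import torch
import torch.nn.functional as F

def distillation_loss(shallow_feats, last_decoder_feat):
    """shallow_feats: list of intermediate feature maps [B, C_i, H_i, W_i];
    last_decoder_feat: final decoder feature map [B, C, H, W] (teacher signal)."""
    target = last_decoder_feat.detach()
    loss = 0.0
    for f in shallow_feats:
        f = F.interpolate(f, size=target.shape[2:], mode="bilinear", align_corners=False)
        f = f.mean(1, keepdim=True)              # collapse channels to compare spatial layouts
        t = target.mean(1, keepdim=True)
        loss = loss + F.mse_loss(f, t)
    return loss / len(shallow_feats)
```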
The Prostate Imaging Reporting and Data System (PI-RADS) is pivotal in the diagnosis of clinically significant prostate cancer through MRI imaging. Current deep learning-based PI-RADS scoring methods often lack the incorporation of the common PI-RADS clinical guideline (PICG) used by radiologists, potentially compromising scoring accuracy. This paper introduces a novel approach that adapts a multi-modal large language model (MLLM) to incorporate the PICG into a PI-RADS scoring model without additional annotations or network parameters. We present a two-stage fine-tuning process that adapts an MLLM originally trained on natural images to MRI images while effectively integrating the PICG. Specifically, in the first stage, we develop a domain adapter layer tailored for processing 3D MRI inputs and instruct the MLLM to differentiate MRI sequences. In the second stage, we translate the PICG into guiding instructions for the model to generate PICG-guided image features. Through such a feature distillation step, we align the scoring network's features with the PICG-guided image features, which enables the model to effectively incorporate the PICG information. We develop our model on a public dataset and evaluate it on an in-house dataset. Experimental results demonstrate that our approach effectively improves the performance of current scoring networks. Code is available at: https://github.com/med-air/PICG2scoring
Catheter ablation is a prevalent procedure for treating atrial fibrillation, primarily utilizing catheters equipped with electrodes to gather electrophysiological signals. However, the localization of catheters in fluoroscopy images presents a challenge for clinicians due to the complexity of the intervention processes. In this paper, we propose SIX-Net, a novel algorithm designed to precisely localize electrode landmarks in fluoroscopy images by mixing spatial-context information from three aspects: First, we propose a new network architecture specially designed for global-local spatial feature aggregation; Then, we exploit spatial correlations between segmentation and landmark detection by sequentially connecting the two tasks with the help of the Segment Anything Model; Finally, a weighted loss function is carefully designed considering the relative spatial-arrangement information among electrodes in the same image. Experimental results on the test set and two clinically challenging subsets reveal that our method outperforms several state-of-the-art landmark detection methods (~50% improvement for RF and ~25% improvement for CS).
Scoliosis poses significant diagnostic challenges, particularly in adolescents, where early detection is crucial for effective treatment. Traditional diagnostic and follow-up methods, which rely on physical examinations and radiography, face limitations due to the need for clinical expertise and the risk of radiation exposure, thus restricting their use for widespread early screening. In response, we introduce a novel, video-based, non-invasive method for scoliosis classification using gait analysis, which circumvents these limitations. This study presents Scoliosis1K, the first large-scale dataset tailored for video-based scoliosis classification, encompassing over one thousand adolescents. Leveraging this dataset, we developed ScoNet, an initial model that encountered challenges in dealing with the complexities of real-world data. This led to the creation of ScoNet-MT, an enhanced model incorporating multi-task learning, which exhibits promising diagnostic accuracy for application purposes. Our findings demonstrate that gait can be a non-invasive biomarker for scoliosis, revolutionizing screening practices with deep learning and setting a precedent for non-invasive diagnostic methodologies. The dataset and code are publicly available at \url{https://zhouzi180.github.io/Scoliosis1K/}.
We focus on the problem of Unsupervised Domain Adaptation (UDA) for breast cancer detection from mammograms (BCDM). Recent advancements have shown that masked image modeling serves as a robust pretext task for UDA. However, when applied to cross-domain BCDM, these techniques struggle with breast abnormalities such as masses, asymmetries, and micro-calcifications, in part due to the typically much smaller size of the region of interest compared to natural images. This often results in more false positives per image (FPI) and significant noise in the pseudo-labels typically used to bootstrap such techniques. Recognizing these challenges, we introduce a transformer-based Domain-invariant Mask Annealed Student Teacher autoencoder (D-MASTER) framework. D-MASTER adaptively masks and reconstructs multi-scale feature maps, enhancing the model's ability to capture reliable target-domain features. D-MASTER also includes adaptive confidence refinement to filter pseudo-labels, ensuring only high-quality detections are considered. We also provide a bounding box annotated subset of 1000 mammograms from the RSNA Breast Screening Dataset (referred to as RSNA-BSD1K) to support further research in BCDM. We evaluate D-MASTER on multiple BCDM datasets acquired from diverse domains. Experimental results show a significant improvement of 9% and 13% in sensitivity at 0.3 FPI over state-of-the-art UDA techniques on the publicly available INBreast and DDSM benchmark datasets, respectively. We also report improvements of 11% and 17% on the in-house and RSNA-BSD1K datasets, respectively. The source code, pre-trained D-MASTER model, and RSNA-BSD1K dataset annotations are available at https://dmaster-iitd.github.io/webpage.
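As an illustration of adaptive confidence refinement for pseudo-label filtering, the sketch below keeps only detections above a per-image adaptive threshold; the percentile rule and floor value are assumptions, not D-MASTER's exact criterion.

```python
# Minimal sketch of adaptive confidence refinement for detection pseudo-labels.
import torch

def refine_pseudo_labels(boxes, scores, floor=0.5, percentile=0.8):
    """boxes: [N, 4] teacher detections; scores: [N] confidences.
    The threshold adapts to the score distribution of the current image,
    keeping only relatively high-quality detections."""
    if scores.numel() == 0:
        return boxes, scores
    adaptive_thr = torch.quantile(scores, percentile)
    thr = torch.maximum(adaptive_thr, torch.tensor(floor, device=scores.device))
    keep = scores >= thr
    return boxes[keep], scores[keep]
```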
Personalized medicine based on medical images, including predicting future individualized clinical disease progression and treatment response, would have an enormous impact on healthcare and drug development, particularly for diseases (e.g., multiple sclerosis (MS)) with long-term, complex, heterogeneous evolutions and no cure. In this work, we present the first stochastic causal temporal framework to model the continuous temporal evolution of disease progression via Neural Stochastic Differential Equations (NSDE). The proposed causal inference model takes as input the patient's high-dimensional images (MRI) and tabular data, and predicts both factual and counterfactual progression trajectories on different treatments in latent space. The NSDE permits the estimation of high-confidence personalized trajectories and treatment effects. Extensive experiments were performed on a large, multi-centre, proprietary dataset of patient 3D MRI and clinical data acquired during several randomized clinical trials for MS treatments. Our results present the first successful uncertainty-based causal Deep Learning (DL) model to: (a) accurately predict future patient MS disability evolution (e.g., EDSS) and treatment effects leveraging baseline MRI, and (b) permit the discovery of subgroups of patients for which the model has high confidence in their response to treatment even in clinical trials which did not reach their clinical endpoints.
Transthoracic Echocardiography (TTE) is the most widely-used screening method for the detection of pulmonary hypertension (PH), a life-threatening cardiopulmonary disorder that requires accurate and timely detection for effective management. Automated PH risk detection from TTE can flag subtle indicators of PH that might be easily missed, thereby decreasing variability between operators and enhancing the positive predictive value of the screening test. Previous algorithms for assessing PH risk still rely on pre-identified, single TTE views which might ignore useful information contained in other recordings. Additionally, these methods focus on discerning PH from healthy controls, limiting their utility as a tool to differentiate PH from conditions that mimic its cardiovascular or respiratory presentation. To address these issues, we propose EchoFM, an architecture that combines self-supervised learning (SSL) and a transformer model for view-independent detection of PH from TTE. EchoFM 1) incorporates a powerful encoder for feature extraction from frames, 2) overcomes the need for explicit TTE view classification by merging features from all available views, 3) uses a transformer to attend to frames of interest without discarding others, and 4) is trained on a realistic clinical dataset which includes mimicking conditions as controls. Extensive experimentation demonstrates that EchoFM significantly improves PH risk detection over state-of-the-art Convolutional Neural Networks (CNNs).