FedUMM: A General Framework for Federated Learning with Unified Multimodal Models

Authors: Zhaolong Su, Leheng Zhao, Xiaoying Wu, Ziyue Xu, Jindong Wang

Unified multimodal models (UMMs) are emerging as strong foundation models that can perform both generation and understanding tasks within a single architecture. However, they are typically trained in centralized settings where all training and downstream datasets are gathered on a central server, which limits deployment in privacy-sensitive and geographically distributed scenarios. In this paper, we present FedUMM, a general federated learning framework for UMMs under non-IID multimodal data with low communication cost. Built on NVIDIA FLARE, FedUMM instantiates federation for a BLIP3o backbone via parameter-efficient fine-tuning: clients train lightweight LoRA adapters while freezing the foundation models, and the server aggregates only adapter updates. We evaluate on VQA v2 and the GenEval compositional generation benchmark under Dirichlet-controlled heterogeneity with up to 16 clients. Results show slight degradation as client count and heterogeneity increase, while performance remains competitive with centralized training. We further analyze computation-communication trade-offs and demonstrate that adapter-only federation reduces per-round communication by over an order of magnitude compared to full fine-tuning, enabling practical federated UMM training. This work offers empirical experience to inform future research on privacy-preserving federated unified multimodal models.
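The two mechanisms named in the abstract, Dirichlet-controlled heterogeneity and adapter-only aggregation, can be illustrated with a minimal sketch. This is not the FedUMM implementation and does not use the NVIDIA FLARE API. It assumes that samples carry a discrete label to partition on, that the server uses a sample-weighted average (FedAvg-style), and that adapters are plain NumPy arrays. The names `dirichlet_partition`, `fedavg_adapters`, and `lora_param_count` are hypothetical, as are the layer size and rank in the final example.

```python
import numpy as np


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients, class by class.

    For each class, client shares are drawn from Dirichlet(alpha).
    Smaller alpha gives more skewed (more non-IID) client datasets.
    Assumption: each sample has one discrete label.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Convert cumulative shares into split points within this class.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for cid, part in enumerate(np.split(idx, cuts)):
            client_indices[cid].extend(part.tolist())
    return client_indices


def fedavg_adapters(client_states, client_sizes):
    """Sample-weighted average of adapter tensors only.

    client_states: list of dicts mapping adapter names to arrays.
    The frozen backbone is never transmitted or averaged.
    Assumption: FedAvg-style weighting by local sample count.
    """
    total = float(sum(client_sizes))
    return {
        key: sum(state[key] * (n / total)
                 for state, n in zip(client_states, client_sizes))
        for key in client_states[0]
    }


def lora_param_count(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer: a 4096 x 4096 projection with LoRA rank 16.
full_params = 4096 * 4096
adapter_params = lora_param_count(4096, 4096, 16)
reduction = full_params / adapter_params
```

For this hypothetical layer the adapter is 128 times smaller than the full weight matrix, which shows how adapter-only updates can cut per-round communication by more than an order of magnitude. The actual ratio depends on the rank and on which layers receive adapters.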

Subject: Machine Learning

Publish: 2026-01-21 19:02:52 UTC

arXiv: 2601.15390