Total: 1
Deep learning models are increasingly employed for perception, prediction, and control in robotic systems. For for achieving realistic and consistent outputs, it is crucial to embed physical knowledge into their learned representations. However, doing so is difficult due to high-dimensional observation data, such as images, particularly under conditions of incomplete system knowledge and imprecise state sensing. To address this, we propose Physically Interpretable World Models, a novel architecture that aligns learned latent representations with real-world physical quantities. To this end, our architecture combines three key elements: (1) a vector-quantized image autoencoder, (2) a transformer-based physically interpretable autoencoder, and (3) a partially known dynamical model. The training incorporates weak interval-based supervision to eliminate the impractical reliance on ground-truth physical knowledge. Three case studies demonstrate that our approach achieves physical interpretability and accurate state predictions, thus advancing representation learning for robotics.