Natural Language Inference (NLI) determines whether a hypothesis is entailed by, contradicts, or is neutral with respect to a premise. While text-based NLI is well studied, its multimodal and multilingual extensions remain underexplored. This paper introduces a multilingual, multimodal NLI framework that classifies premise-hypothesis pairs as entailment, contradiction, or neutral across four modality combinations (text-text, text-speech, speech-text, and speech-speech) in both same-language and cross-lingual settings. A key motivation is improving translation assessment, where similarity-based approaches may miss contradictions. By detecting entailment and contradiction in addition to semantic similarity, the framework complements existing evaluation methods and helps identify inconsistencies. The paper also extends existing text-based NLI datasets with speech-text and speech-speech pairs to support multilingual, multimodal inference. Experiments show that the model outperforms BLASER at distinguishing entailment from non-entailment, with F1 gains of 0.19 on speech-speech pairs and 0.13 on speech-text pairs.
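To make the evaluation setting concrete, the following is a minimal sketch of how the four modality combinations can be enumerated and how three-way NLI labels can be collapsed into entailment versus non-entailment before computing F1. The function names (`modality_pairs`, `to_binary`, `binary_f1`) and the toy labels are illustrative assumptions, not the paper's code or data, and the paper's exact evaluation protocol may differ.

```python
from itertools import product

# Three-way NLI label set and the two input modalities named in the abstract.
LABELS = ("entailment", "contradiction", "neutral")
MODALITIES = ("text", "speech")


def modality_pairs():
    """Enumerate premise-hypothesis modality combinations, e.g. 'speech-text'."""
    return [f"{premise}-{hypothesis}"
            for premise, hypothesis in product(MODALITIES, repeat=2)]


def to_binary(label):
    """Collapse a three-way label to True (entailment) or False (non-entailment)."""
    if label not in LABELS:
        raise ValueError(f"unknown NLI label: {label!r}")
    return label == "entailment"


def binary_f1(gold, pred):
    """F1 for the entailment class after collapsing to a binary task."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_ent, p_ent = to_binary(g), to_binary(p)
        if g_ent and p_ent:
            tp += 1
        elif p_ent:
            fp += 1
        elif g_ent:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels for illustration only (hypothetical, not results from the paper).
    gold = ["entailment", "contradiction", "neutral", "entailment"]
    pred = ["entailment", "entailment", "neutral", "neutral"]
    print(modality_pairs())
    print(binary_f1(gold, pred))
```

In this sketch, contradiction and neutral are both treated as non-entailment, so a system that confuses those two classes with each other is not penalized by the binary score.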