The task of one-shot face video re-enactment aims to generate a target video that preserves the identity of a single source frame while following the facial deformations of a driving video. To achieve high-quality generation, it is essential to precisely disentangle identity-related and identity-independent characteristics while building expressive features that retain high-frequency facial details, two challenges that remain unaddressed by existing approaches. To deal with these two challenges, we propose a two-stage generation model based on StyleGAN, whose key novel techniques lie in better disentangling identity and deformation codes in the latent space through identity-based modeling, and in manipulating intermediate StyleGAN features at the second stage to augment the facial details of the generated targets. To further improve identity consistency, a data augmentation method is introduced during training to enhance the key features affecting identity, such as hair and wrinkles. Extensive experimental results demonstrate the superiority of our approach compared to state-of-the-art methods.
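A minimal sketch of the disentanglement idea in a StyleGAN-style latent space, in which an identity code from the source frame and a deformation code from the driving frame are combined into a w+ style vector; the encoders, dimensions, and additive combination below are illustrative placeholders rather than the paper's actual design:

```python
import torch
import torch.nn as nn

class LatentDisentangler(nn.Module):
    """Splits the latent prediction into an identity branch (from the source)
    and a deformation branch (from the driving frame), then combines them
    into a w+ style code for a pretrained StyleGAN generator."""

    def __init__(self, feat_dim=512, w_dim=512, n_styles=18):
        super().__init__()
        self.id_enc = nn.Linear(feat_dim, w_dim)    # identity-related code
        self.def_enc = nn.Linear(feat_dim, w_dim)   # identity-independent deformation code
        self.n_styles = n_styles

    def forward(self, src_feat, drv_feat):
        w_id = self.id_enc(src_feat)                       # (B, w_dim)
        w_def = self.def_enc(drv_feat)                     # (B, w_dim)
        w = w_id + w_def                                   # additive edit in latent space
        return w.unsqueeze(1).repeat(1, self.n_styles, 1)  # (B, 18, w_dim) w+ code

model = LatentDisentangler()
w_plus = model(torch.randn(2, 512), torch.randn(2, 512))
print(w_plus.shape)   # torch.Size([2, 18, 512])
```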
The advancement of multimodal research has increased the focus on Emotion Recognition in Conversations (ERC), which aims to accurately identify emotional changes. Methods based on graph convolution can better capture the dynamic changes of emotions and improve the accuracy and robustness of emotion recognition. However, existing methods do not distinguish the interaction patterns of a conversation, which limits their ability to model contextual emotional relationships. In this paper, we propose Dynamic Interactive Bimodal HyperGraph Convolutional Networks (DIB-HGCN), which creatively constructs two types of sub-hypergraphs, i.e., the monologic sub-hypergraph and the dialogic sub-hypergraph, for modeling emotional relationships under different interaction patterns. The monologic sub-hypergraph is used to explore the contextually consistent emotions during a speaker's monologue interactions, while the dialogic sub-hypergraph focuses on capturing the emotional transfers in dialogic interactions. Meanwhile, a single window-partitioning mechanism fails to accommodate the distinct variations in emotional velocity across the two interaction patterns. Therefore, we set up dynamic windows in the monologic interactions to fully utilize the information of sentence nodes with consistent emotions, and we add fragment windows to the dialogic interactions to prevent information interference caused by frequent emotional transfers. The experimental results show that our proposed method outperforms existing methods on two benchmark multimodal ERC datasets.
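The sub-hypergraph message passing can be pictured with the standard hypergraph convolution; the NumPy sketch below assumes a binary incidence matrix per sub-hypergraph, and the toy windows, hyperedge weights, and dimensions are illustrative rather than taken from DIB-HGCN:

```python
import numpy as np

def hypergraph_conv(X, H, Theta):
    """One hypergraph convolution step: X' = Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta.

    X:     (n_nodes, d_in)    utterance-node features
    H:     (n_nodes, n_edges) binary incidence matrix of a sub-hypergraph
    Theta: (d_in, d_out)      learnable projection
    """
    W = np.eye(H.shape[1])                      # uniform hyperedge weights
    Dv = np.diag(1.0 / np.sqrt(H.sum(axis=1)))  # node degree^-1/2
    De = np.diag(1.0 / H.sum(axis=0))           # hyperedge degree^-1
    A = Dv @ H @ W @ De @ H.T @ Dv              # normalized propagation matrix
    return np.maximum(A @ X @ Theta, 0.0)       # ReLU

# Toy example: 5 utterances, one "monologic" hyperedge over nodes 0-2
# and one "dialogic" hyperedge over nodes 2-4 (the window choices are made up).
X = np.random.randn(5, 8)
H = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
Theta = np.random.randn(8, 4)
print(hypergraph_conv(X, H, Theta).shape)  # (5, 4)
```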
The rapid evolution of multimedia technology has revolutionized human perception, paving the way for multi-view learning. However, traditional multi-view learning approaches are tailored to scenarios with fixed data views, falling short of emulating the intricate cognitive process by which the human brain handles signals sequentially. Our cerebral architecture seamlessly integrates sequential data through intricate feed-forward and feedback mechanisms. In stark contrast, traditional methods struggle to generalize effectively when confronted with data spanning diverse domains, highlighting the need for innovative strategies that can mimic the brain's adaptability and dynamic integration capabilities. In this paper, we propose a bio-neurologically inspired multi-view incremental framework named MVIL, aimed at emulating the brain's fine-grained fusion of sequentially arriving views. MVIL comprises two fundamental modules: structured Hebbian plasticity and synaptic partition learning. The structured Hebbian plasticity reshapes the structure of the weights to express the high correlation between view representations, facilitating a fine-grained fusion of view representations. Moreover, synaptic partition learning is effective in alleviating drastic changes in weights and retains old knowledge by inhibiting a subset of synapses. These modules bionically play a central role in reinforcing crucial associations between newly acquired information and existing knowledge repositories, thereby enhancing the network's capacity for generalization. Experimental results on six benchmark datasets show MVIL's effectiveness over state-of-the-art methods.
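A toy sketch of the two ideas, an outer-product Hebbian update plus a simple magnitude-based synaptic partition mask; the learning rate, partition criterion, and keep ratio are assumptions that only illustrate the general mechanism, not MVIL's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_old, d_new = 16, 16
W = rng.normal(scale=0.1, size=(d_old, d_new))   # weights fusing old and new view representations

def hebbian_update(W, h_old, h_new, lr=0.01):
    """Classic Hebbian rule: strengthen weights between co-active units,
    expressing the correlation between the old and newly arrived view."""
    return W + lr * np.outer(h_old, h_new)

def partition_mask(W, keep_ratio=0.5):
    """Freeze ("inhibit") the synapses with the largest magnitude so the knowledge
    encoded in them is retained; only the remaining partition keeps adapting."""
    thresh = np.quantile(np.abs(W), 1.0 - keep_ratio)
    return (np.abs(W) < thresh).astype(W.dtype)   # 1 = plastic, 0 = frozen

h_old = rng.normal(size=d_old)   # representation of previously seen views
h_new = rng.normal(size=d_new)   # representation of the newly arriving view
mask = partition_mask(W)
delta = hebbian_update(W, h_old, h_new) - W
W = W + mask * delta             # only the plastic partition is updated
```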
Emotional support conversation (ESC) helps reduce people's psychological stress and provides emotional value through interactive dialogues. Due to the high cost of crowdsourcing a large ESC corpus, recent attempts use large language models for dialogue augmentation. However, existing approaches largely overlook the social dynamics inherent in ESC, leading to less effective simulations. In this paper, we introduce SocialSim, a novel framework that simulates ESC by integrating key aspects of social interaction: social disclosure and social awareness. On the seeker side, we facilitate social disclosure by constructing a comprehensive persona bank that captures diverse and authentic help-seeking scenarios. On the supporter side, we enhance social awareness by eliciting cognitive reasoning to generate logical and supportive responses. Building upon SocialSim, we construct SSConv, a large-scale synthetic ESC corpus whose quality can even surpass crowdsourced ESC data. We further train a chatbot on SSConv and demonstrate its state-of-the-art performance in both automatic and human evaluations. We believe SocialSim offers a scalable way to synthesize ESC, making emotional care more accessible and practical.
Functional decomposition is the process of breaking down a function f into a composition f=g(f_1,...,f_k) of simpler functions f_1,...,f_k belonging to some class F. This fundamental notion can be used to model applications arising in a wide variety of contexts, ranging from machine learning to formal language theory. In this work, we study functional decomposition by leveraging the notion of functional reconfiguration. In this setting, constraints are imposed not only on the factor functions f_1,...,f_k but also on the intermediate functions arising during the composition process. We introduce a symbolic framework to address functional reconfiguration and decomposition problems. In our framework, functions arising during the reconfiguration process are represented symbolically, using ordered binary decision diagrams (OBDDs). The function g used to specify the reconfiguration process is represented by a Boolean circuit C. Finally, the function class F is represented by a second-order finite automaton A. Our main result states that functional reconfiguration, and hence functional decomposition, can be solved in fixed-parameter linear time when parameterized by the width of the input OBDD, by structural parameters associated with the reconfiguration circuit C, and by the size of the second-order finite automaton A.
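A minimal concrete instance of the decomposition setting (the particular Boolean functions are chosen purely for illustration):

```latex
% A Boolean function decomposed into simpler factors:
%   f(x_1,x_2,x_3) = g\bigl(f_1(x_1,x_2),\, f_2(x_3)\bigr)
\[
  f(x_1,x_2,x_3) \;=\; (x_1 \wedge x_2) \vee \neg x_3,
  \qquad
  g(a,b) = a \vee b, \quad
  f_1(x_1,x_2) = x_1 \wedge x_2, \quad
  f_2(x_3) = \neg x_3 .
\]
% In the reconfiguration view, the intermediate functions produced while
% composing f_1 and f_2 under g must also satisfy the constraints imposed
% by the class F (here represented by a second-order finite automaton A).
```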
Dynamic Vision Sensors (DVS) capture event data with high temporal resolution and low power consumption, presenting a more efficient solution for visual processing in dynamic and real-time scenarios than conventional video capture methods. Event data augmentation serves as an essential method for overcoming the limitations of scale and diversity in event datasets. Our comparative experiments demonstrate that two factors, spatial integrity and temporal continuity, significantly affect the capacity of event data augmentation, as they guarantee the preservation of the sparsity and high-dynamic-range characteristics unique to event data. However, existing augmentation methods often neglect the preservation of spatial integrity and temporal continuity. To address this, we develop a novel event data augmentation strategy, EventZoom, which employs a temporal progressive strategy that embeds transformed samples into the original samples through progressive scaling and shifting. The scaling process avoids the spatial information loss associated with cropping, while the progressive strategy prevents interruptions or abrupt changes in temporal information. We validated EventZoom across various supervised learning frameworks, and the experimental results show that EventZoom consistently outperforms existing event data augmentation methods, achieving state-of-the-art performance. For the first time, we concurrently employ semi-supervised and unsupervised learning to verify the feasibility of event augmentation algorithms, demonstrating the applicability and effectiveness of EventZoom as a powerful event-based data augmentation tool for handling real-world scenes with high dynamics and variability.
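A rough sketch of the embedding step, where an auxiliary event stream is scaled progressively over time and shifted into the original stream; the (x, y, t, p) event format, linear zoom schedule, and sensor resolution are assumptions rather than EventZoom's exact procedure:

```python
import numpy as np

def progressive_embed(orig, aux, width=128, height=128,
                      scale_start=0.2, scale_end=0.6, shift=(20, 10)):
    """Embed auxiliary events into the original stream with a zoom factor
    that grows linearly over time, so spatial content is rescaled rather
    than cropped and the temporal dimension has no abrupt jumps.

    orig, aux: arrays of shape (N, 4) with columns (x, y, t, p),
               t normalized to [0, 1].
    """
    t = aux[:, 2]
    scale = scale_start + (scale_end - scale_start) * t      # progressive zoom
    cx, cy = width / 2.0, height / 2.0
    x = cx + (aux[:, 0] - cx) * scale + shift[0]
    y = cy + (aux[:, 1] - cy) * scale + shift[1]
    keep = (x >= 0) & (x < width) & (y >= 0) & (y < height)  # stay on the sensor
    embedded = np.stack([x[keep], y[keep], t[keep], aux[keep, 3]], axis=1)
    out = np.concatenate([orig, embedded], axis=0)
    return out[np.argsort(out[:, 2])]                        # keep temporal order

orig = np.column_stack([np.random.randint(0, 128, 1000),
                        np.random.randint(0, 128, 1000),
                        np.sort(np.random.rand(1000)),
                        np.random.randint(0, 2, 1000)]).astype(float)
aux = orig.copy()
print(progressive_embed(orig, aux).shape)
```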
Children with Autism Spectrum Disorder (ASD) often misunderstand social situations and struggle to participate in daily routines. Social Stories™ are traditionally crafted by psychology experts under strict constraints to address these challenges, but they are costly and limited in diversity. As Large Language Models (LLMs) advance, there is an opportunity to develop more automated, affordable, and accessible methods to generate Social Stories in real time with broad coverage. However, adapting LLMs to meet the unique and strict constraints of Social Stories is a challenging issue. To this end, we propose SS-GEN, a Social Story GENeration framework with LLMs. First, we develop a constraint-driven strategy named StarSow to hierarchically prompt LLMs to generate Social Stories at scale, followed by rigorous human filtering to build a high-quality dataset. Additionally, we introduce quality assessment criteria to evaluate the effectiveness of the generated stories. Considering that powerful closed-source large models require very complex instructions and expensive API fees, we finally fine-tune smaller language models on our curated high-quality dataset, achieving comparable results at lower cost and with simpler instructions and deployment. This work marks a significant step in leveraging AI to personalize Social Stories cost-effectively for autistic children at scale, and we hope it can encourage future research on special groups.
Multimodal sentiment analysis, which learns a model to process multiple modalities simultaneously and predict a sentiment value, is an important area of affective computing. Modeling sequential intra-modal information and enhancing cross-modal interactions are crucial to multimodal sentiment analysis. In this paper, we propose MSAmba, a novel hybrid Mamba-based architecture for multimodal sentiment analysis consisting of two core blocks: an Intra-Modal Sequential Mamba (ISM) block and a Cross-Modal Hybrid Mamba (CHM) block, which together address the above challenges with hybrid state space models. First, the ISM block models the sequential information within each modality in a bi-directional manner with the assistance of global information. Subsequently, the CHM blocks explicitly model centralized cross-modal interaction with a hybrid combination of Mamba and the attention mechanism to facilitate information fusion across modalities. Finally, joint learning of the intra-modal and cross-modal tokens is used to predict sentiment values. This paper is one of the pioneering works to demonstrate the strong performance and research potential of Mamba-based methods for multimodal sentiment analysis. Experiments on CMU-MOSI, CMU-MOSEI and CH-SIMS demonstrate the superior performance of the proposed MSAmba over prior Transformer-based and CNN-based methods.
Humans excel at understanding and reasoning about novel, compositionally structured knowledge, largely due to their capacity for compositional generalization—a cognitive skill that has recently been validated in structured neural networks. However, most existing research has focused primarily on semantic translation within canonical language environments, often neglecting the explicit connection to compositional generalization behavior. In contrast, humans typically demonstrate this ability through interaction with their environments rather than solely through internal reasoning. To address this gap, we propose CraftFactory, a benchmark designed for evaluating compositional generalization in an interactive control environment. This benchmark introduces a new challenge for testing compositional generalization in a more realistic and comprehensive manner. CraftFactory stands out due to three key features: (1) it offers an open-ended interactive control environment with thousands of items and flexible actions; (2) it requires advanced compositional inference through various combinations and complex permutations of instructions; and (3) it evaluates compositional generalization intuitively through interactive behavior. By leveraging CraftFactory, we aim to promote the development of more advanced compositional generalization methods, thereby contributing to the broader field of general AI.
The field of music generation has seen a surge of interest from both academia and industry, with innovative platforms such as Suno, Udio, and SkyMusic earning widespread recognition. However, the challenge of music infilling—modifying specific music segments without reconstructing the entire piece—remains a significant hurdle for both audio-based and symbolic-based models, limiting their adaptability and practicality. In this paper, we address symbolic music infilling by introducing the Collaborative Music Inpainter (CMI), an advanced human-in-the-loop (HITL) model for music infilling. The CMI features the Joint Embedding Predictive Autoregressive Generative Architecture (JEP-AGA), which learns the high-level predictive representations of the masked part that needs to be infilled during the autoregressive generative process, akin to how humans perceive and interpret music. The newly developed Dynamic Interaction Learner (DIL) achieves HITL by iteratively refining the infilled output based on user interactions alone, significantly reducing the interaction cost without requiring further input. Experimental results confirm CMI’s superior performance in music infilling, demonstrating its efficiency in producing high-quality music.
The scarcity of high-quality, large-scale labeled datasets poses a huge challenge for employing deep learning models in video deception detection. To address this issue, inspired by psychological theory on the relation between deception and expressions, we propose a novel method called AFFAKT, which enhances classification performance by transferring useful and correlated knowledge from a large facial expression dataset. Two key challenges arise in knowledge transfer: 1) how much knowledge of the facial expression data should be transferred, and 2) how to effectively leverage the transferred knowledge in the deception classification model during inference. Specifically, the optimal relation mapping between facial expression classes and deception samples is first quantified using the proposed H-OTKT module, which then transfers knowledge from the facial expression dataset to deception samples. Moreover, a correlation prototype within another proposed module, SRKB, is designed to retain the invariant correlations between facial expression classes and deception classes through momentum updating. During inference, the transferred knowledge is fine-tuned with the correlation prototype using a sample-specific re-weighting strategy. Experimental results on two deception detection datasets demonstrate the superior performance of our proposed method. The interpretability study reveals high associations between deception and negative affect, which coincides with psychological theory.
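A minimal sketch of a momentum-updated correlation prototype and a sample-specific re-weighting of transferred expression knowledge; the shapes, momentum value, and blending rule are assumptions standing in for the actual H-OTKT/SRKB design:

```python
import torch

n_expr, n_decep, d = 7, 2, 64             # expression classes, deception classes, feature dim
prototype = torch.zeros(n_decep, n_expr)  # correlation prototype (momentum-updated)

def update_prototype(prototype, batch_corr, momentum=0.9):
    """EMA update so expression-deception correlations stay stable across batches."""
    return momentum * prototype + (1.0 - momentum) * batch_corr

def reweight_transfer(expr_knowledge, sample_corr, prototype, labels_hint):
    """Blend the sample-specific correlation with the stored prototype, then use
    it to re-weight the knowledge transferred from each expression class.

    expr_knowledge: (n_expr, d)  class-level features from the expression dataset
    sample_corr:    (B, n_expr)  per-sample mapping produced by the transport module
    labels_hint:    (B,)         predicted deception class used to pick a prototype row
    """
    w = 0.5 * sample_corr + 0.5 * prototype[labels_hint]   # sample-specific re-weighting
    w = torch.softmax(w, dim=-1)
    return w @ expr_knowledge                              # (B, d) transferred feature

batch_corr = torch.rand(n_decep, n_expr)
prototype = update_prototype(prototype, batch_corr)
feat = reweight_transfer(torch.randn(n_expr, d), torch.rand(4, n_expr),
                         prototype, torch.tensor([0, 1, 1, 0]))
print(feat.shape)  # torch.Size([4, 64])
```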
The ability to estimate temporal relationships is critical for both animals and artificial agents. Cognitive science and neuroscience provide remarkable insights into behavioral and neural aspects of temporal credit assignment. In particular, scale invariance of learning dynamics, observed in behavior and supported by neural data, is one of the key principles that governs animal perception: proportional rescaling of temporal relationships does not alter the overall learning efficiency. Here we integrate a computational neuroscience model of scale invariant memory into deep reinforcement learning (RL) agents. We first provide a theoretical analysis and then demonstrate through experiments that such agents can learn robustly across a wide range of temporal scales, unlike agents built with commonly used recurrent memory architectures such as LSTM. This result illustrates that incorporating computational principles from neuroscience and cognitive science into deep neural networks can enhance adaptability to complex temporal dynamics, mirroring some of the core properties of human learning.
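One common way to realize scale-invariant memory computationally is a bank of leaky traces with geometrically spaced time constants; the sketch below follows that idea, with the number of traces and their spacing chosen arbitrarily and the surrounding RL agent omitted:

```python
import numpy as np

class ScaleInvariantMemory:
    """Bank of leaky integrators whose time constants are geometrically spaced,
    so rescaling all temporal intervals by a constant factor shifts activity
    across the bank without changing its overall shape."""

    def __init__(self, n_taus=16, tau_min=1.0, tau_max=1000.0):
        self.taus = np.geomspace(tau_min, tau_max, n_taus)
        self.state = np.zeros(n_taus)

    def step(self, x):
        decay = np.exp(-1.0 / self.taus)          # per-step decay for each trace
        self.state = decay * self.state + (1.0 - decay) * x
        return self.state                         # feature vector fed to the policy

mem = ScaleInvariantMemory()
for t in range(500):
    obs = 1.0 if t == 0 else 0.0                  # a single reward-predictive cue at t=0
    features = mem.step(obs)
print(features.round(3))                          # slow traces retain the cue far longer than fast ones
```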
The emergence of human-like behaviors in Large Language Models (LLMs) has led to a closer connection between NLP and human psychology. However, research on the personalities exhibited by LLMs has largely been confined to limited investigations using individual psychological tests, primarily focusing on a small number of commercially licensed LLMs. This approach overlooks the extensive use of and significant advancements in open-source LLMs. This work aims to address both of the above limitations by conducting an in-depth investigation of 12 LLM agents based on the most representative open models, through the two most well-known psychological assessment tests, namely the Myers-Briggs Type Indicator (MBTI) and the Big Five Inventory (BFI). Our approach involves evaluating the intrinsic personality traits of LLM agents and determining the extent to which these agents can mimic human personalities when conditioned by specific personalities and roles. Our findings reveal that (i) each LLM agent showcases distinct human personalities; (ii) personality-conditioned prompting produces varying effects on the agents, with only a few successfully mirroring the imposed personality, while most of them remain ``closed-minded'' (i.e., they retain their intrinsic traits); and (iii) combining role and personality conditioning can enhance the agents' ability to mimic human personalities. Our work represents a step forward in understanding the close relationship between NLP and human psychology through the lens of LLMs.
Spiking Neural Networks (SNNs) have a low-power advantage but perform poorly on image segmentation tasks. The reason is that directly converting neural networks with complex architectural designs for segmentation tasks into spiking versions leads to performance degradation and non-convergence. To address this challenge, we first identify the modules in the architecture design that lead to the severe reduction in spike firing, make targeted improvements, and propose the Spike2Former architecture. Second, we propose normalized integer spiking neurons to solve the training stability problem of SNNs with complex architectures. We set a new state of the art for SNNs on various semantic segmentation datasets, with a significant improvement of +12.7% mIoU and 5.0x efficiency on ADE20K, +14.3% mIoU and 5.2x efficiency on VOC2012, and +9.1% mIoU and 6.6x efficiency on CityScapes.
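A rough sketch of an integer-spiking neuron whose output is rescaled into a bounded range, in the spirit of the normalized integer neurons described above; the clamp range, normalization, and soft reset are assumptions:

```python
import torch

class NormalizedIntSpike(torch.nn.Module):
    """Emits an integer number of spikes per step (0..max_spikes) and rescales it
    to [0, 1], which keeps activations from collapsing in deep segmentation heads."""

    def __init__(self, threshold=1.0, max_spikes=4):
        super().__init__()
        self.threshold = threshold
        self.max_spikes = max_spikes
        self.mem = None

    def forward(self, x):
        if self.mem is None or self.mem.shape != x.shape:
            self.mem = torch.zeros_like(x)
        self.mem = self.mem + x
        spikes = torch.clamp(torch.floor(self.mem / self.threshold), 0, self.max_spikes)
        self.mem = self.mem - spikes * self.threshold   # soft reset
        return spikes / self.max_spikes                 # normalized integer spikes

neuron = NormalizedIntSpike()
out = neuron(torch.randn(2, 8, 4, 4))
print(out.unique())   # a subset of {0, 0.25, 0.5, 0.75, 1.0}
```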
Abstract visual reasoning (AVR) is a critical ability of humans, and it has been widely studied, but arithmetic visual reasoning, a unique task in AVR to reason over number sense, is less studied in the literature. To facilitate this research, we construct a Machine Number Reasoning (MNR) dataset to assess the model's ability in arithmetic visual reasoning over number sense and spatial layouts. To solve the MNR tasks, we propose a Dual-branch Arithmetic Regression Reasoning (DARR) framework, which includes an Intra-Image Arithmetic Regression Reasoning (IIARR) module and a Cross-Image Arithmetic Regression Reasoning (CIARR) module. The IIARR includes a set of Intra-Image Regression Blocks to identify the correct number orders and the underlying arithmetic rules within individual images, and an Order Gate to determine the correct number order. The CIARR establishes the arithmetic relations across different images through a `3-to-1' regressor and a set of `2-to-1' regressors, with a Selection Gate to select the most suitable `2-to-1' regressor and a gated fusion to combine the two kinds of regressors. Experiments on the MNR dataset show that the DARR outperforms state-of-the-art models for arithmetic visual reasoning.
The Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) is an effective mechanism to protect websites and online applications from malicious bot programs. Image-based CAPTCHA is one of the most widely used schemes. However, deep learning techniques have significantly weakened the security of some image-based CAPTCHA schemes. Mooney images (MIs) are important research materials in the field of cognitive science. Compared to natural images, MIs exhibit fewer visual cues, fragmented content, and greater ambiguity, so the perception of MIs relies more on the iterative process between feedforward and feedback mechanisms. In this paper, we raise an intriguing question: can MIs be used to enhance the security of CAPTCHA? To explore this question, we first propose a novel framework, HiMI, that generates high-quality MIs from natural images and also allows flexible adjustment of the perceived difficulty. Based on MIs, we design two MI-CAPTCHA schemes for object detection and instance segmentation tasks, respectively. We experimentally demonstrate that HiMI performs better than other baseline methods in terms of both image quality and application potential in the two MI-CAPTCHA schemes. Additionally, we conduct experiments to explore the solving performance of humans and CAPTCHA solvers under different parameter settings of the schemes, providing a valuable reference for practical applications.
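Mooney-style images are classically produced by smoothing a grayscale photograph and then binarizing it; the sketch below follows that textbook recipe, with blur width and threshold percentile acting as crude difficulty knobs, whereas HiMI's actual generation and difficulty control are more sophisticated than this:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mooney(gray, sigma=4.0, threshold_pct=50):
    """Classic two-tone (Mooney-style) rendering: blur, then binarize.

    gray:          2-D array in [0, 255]
    sigma:         blur width; larger values remove more visual cues
    threshold_pct: percentile used as the black/white cut-off
    """
    smooth = gaussian_filter(gray.astype(float), sigma=sigma)
    cut = np.percentile(smooth, threshold_pct)
    return (smooth > cut).astype(np.uint8) * 255

img = np.random.randint(0, 256, size=(256, 256))
easy = mooney(img, sigma=2.0)    # more detail retained -> easier to perceive
hard = mooney(img, sigma=8.0)    # heavier smoothing -> fewer cues, more ambiguity
```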
Hashing has been widely applied in large-scale multimodal retrieval by mapping heterogeneous modality data into binary codes. However, most cross-modal hashing methods cannot make the most of semantic information to construct the association relations of sample pairs, resulting in unsatisfactory retrieval accuracy. The concept lattice is a powerful tool for data mining and information retrieval, and to the best of our knowledge, this is the first work to combine formal concept analysis and hash learning to improve cross-modal hashing retrieval performance. In this paper, we propose a novel framework for Asymmetric Cross-modal Hashing based on Formal Concept Analysis, denoted ACHFCA. Initially, a flash-projection three-layer semantic enhancement descriptor is designed to extract latent representations from heterogeneous modalities. Subsequently, an asymmetric hash learning framework is established to enhance the semantics of different layers based on the fine-grained similarity values reconstructed from the concept lattice, reinforcing the discriminative competence of the model. Finally, an effective discrete optimization algorithm is proposed, which can directly learn compact hash codes. Comprehensive experiments on the MIRFlickr, NUS-WIDE and IAPR-TC12 datasets demonstrate the superior performance of ACHFCA over state-of-the-art hashing approaches.
Binary Spiking Neural Networks (BSNNs) inherit the event-driven paradigm of SNNs while also adopting the reduced storage burden of binarization techniques. These distinct advantages grant BSNNs lightweight and energy-efficient characteristics, rendering them ideal for deployment on resource-constrained edge devices. However, due to the binary synaptic weights and the non-differentiable spike function, effectively training BSNNs remains an open question. In this paper, we conduct an in-depth analysis of the key challenge in BSNN learning, namely the frequent weight sign flipping problem. To mitigate this issue, we propose an Adaptive Gradient Modulation Mechanism (AGMM), which is designed to reduce the frequency of weight sign flipping by adaptively adjusting the gradients during the learning process. The proposed AGMM enables BSNNs to achieve faster convergence and higher accuracy, effectively narrowing the gap between BSNNs and their full-precision equivalents. We validate AGMM on both static and neuromorphic datasets, and the results indicate that it achieves state-of-the-art results among BSNNs. This work substantially reduces storage demands and enhances SNNs' inherent energy efficiency, making them highly feasible for resource-constrained environments.
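A minimal sketch of the underlying idea of damping gradient components that would flip the sign of a binary weight's latent value; the fixed damping factor below is a simple stand-in for AGMM's adaptive modulation:

```python
import torch

def modulated_grad(latent_w, grad, alpha=0.1):
    """Shrink gradient components that push a latent weight across zero
    (which would flip the sign of its binarized counterpart), reducing
    the frequency of weight sign flipping during training.

    latent_w: real-valued latent weights (binarized with sign() in the forward pass)
    grad:     gradient w.r.t. the latent weights
    """
    would_flip = (latent_w - grad).sign() != latent_w.sign()
    factor = torch.where(would_flip, torch.full_like(grad, alpha),
                         torch.ones_like(grad))
    return grad * factor

w = torch.randn(6)
g = torch.randn(6) * 2.0
print(w.sign())
print((w - modulated_grad(w, g)).sign())   # fewer sign flips than with the raw gradient
```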
Although multimodal sentiment analysis is a fertile research ground that merits further investigation, current approaches incur high annotation costs and suffer from label ambiguity, which hinders the acquisition of high-quality labeled data. Furthermore, choosing the right interactions is essential because the significance of intra- or inter-modal interactions can differ among samples. To this end, we propose Semi-IIN, a Semi-supervised Intra-inter modal Interaction learning Network for multimodal sentiment analysis. Semi-IIN integrates masked attention and gating mechanisms, enabling effective dynamic selection after independently capturing intra- and inter-modal interactive information. Combined with a self-training approach, Semi-IIN fully utilizes the knowledge learned from unlabeled data. Experimental results on two public datasets, MOSI and MOSEI, demonstrate the effectiveness of Semi-IIN, establishing a new state of the art on several metrics.
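A small PyTorch sketch of the gating idea, a learned sigmoid gate that decides per sample how much intra-modal versus inter-modal information to keep; the dimensions and exact gate form are assumptions:

```python
import torch
import torch.nn as nn

class InteractionGate(nn.Module):
    """Dynamically mixes intra-modal and inter-modal representations per sample."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, h_intra, h_inter):
        g = self.gate(torch.cat([h_intra, h_inter], dim=-1))  # (B, dim) values in (0, 1)
        return g * h_intra + (1.0 - g) * h_inter               # per-sample selection

fuse = InteractionGate(dim=128)
h_intra = torch.randn(32, 128)   # e.g. masked self-attention output within a modality
h_inter = torch.randn(32, 128)   # e.g. cross-modal attention output
print(fuse(h_intra, h_inter).shape)   # torch.Size([32, 128])
```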
Spiking Neural Networks (SNNs) are biologically inspired models that process visual inputs over multiple time steps. However, they often struggle with limited feature discrimination along the temporal dimension due to inherent spatiotemporal invariance. This limitation arises from the redundant activation of certain regions and shared supervision for multiple time steps, constraining the network’s ability to adapt and learn diverse features. To address this challenge, we propose a novel Temporal-Self-Erasing (TSE) supervision method that dynamically adapts the learning regions of interest for different time steps. The TSE method operates by identifying highly activated regions from predictions across multiple time steps and adaptively suppressing them during model training, thereby encouraging the network to focus on less activated yet potentially informative regions. This approach not only enhances the feature discrimination capability of SNNs but also facilitates more effective multi-time-step inference by exploiting more semantic information. Experimental results on benchmark datasets demonstrate that our TSE method significantly improves the classification accuracy and robustness of SNNs.
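A rough sketch of the erasing step: average the activation maps over time steps and zero out the most activated regions before computing the supervision for later steps; the percentile and the use of plain feature-map averages are assumptions:

```python
import torch

def temporal_self_erase(feature_maps, top_pct=20.0):
    """feature_maps: (T, B, C, H, W) activations from T time steps.
    Returns a (B, 1, H, W) mask that suppresses regions already highly
    activated across time, steering later steps toward other regions."""
    heat = feature_maps.mean(dim=(0, 2))                           # (B, H, W)
    thresh = torch.quantile(heat.flatten(1), 1.0 - top_pct / 100.0, dim=1)
    mask = (heat < thresh.view(-1, 1, 1)).float().unsqueeze(1)     # 0 where highly activated
    return mask

feats = torch.rand(4, 2, 16, 8, 8)          # T=4 time steps, batch of 2
mask = temporal_self_erase(feats)
erased = feats * mask                        # apply before computing the next-step loss
print(mask.mean().item())                    # roughly 0.8 of each map is kept
```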
Imitating how humans move their gaze in a visual scene is a vital research problem for both visual understanding and psychology, enabling crucial applications such as building lifelike virtual characters. Previous studies aim to predict gaze trajectories when humans are free-viewing an image, searching for required targets, or looking for clues to answer questions about an image. While these tasks focus on visual-centric scenarios, in more common scenarios humans also move their gaze in response to audio inputs. To fill this gap, we introduce a new task that predicts human gaze trajectories in a visual scene with synchronized audio inputs, and we provide a new dataset containing 20k gaze points from 8 subjects. To effectively integrate audio information and simulate the dynamic process of human gaze motion, we propose a novel learning framework called EyEar (Eye moving while Ear listening) based on physics-informed dynamics, which considers three key factors to predict gazes: inherent eye motion tendency, visual salient attraction, and audio semantic attraction. We also propose a probability density score to overcome the high individual variability of gaze trajectories, thereby improving the stability of optimization and the reliability of evaluation. Experimental results show that EyEar outperforms all baselines on all evaluation metrics, thanks to the proposed components of the learning model.
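A toy version of a physics-informed gaze update in which the next gaze point follows its own momentum while being pulled toward a visually salient location and an audio-referred location; the coefficients and the way the two attractors are obtained are assumptions:

```python
import numpy as np

def gaze_step(gaze, velocity, salient_pt, audio_pt,
              inertia=0.8, w_vis=0.15, w_audio=0.15, dt=1.0):
    """One step of a spring-like gaze dynamic combining three forces:
    inherent motion tendency, visual salient attraction, audio semantic attraction."""
    force = (w_vis * (salient_pt - gaze) +     # pull toward the visually salient point
             w_audio * (audio_pt - gaze))      # pull toward the audio-referred point
    velocity = inertia * velocity + force      # inherent motion tendency (momentum)
    return gaze + dt * velocity, velocity

gaze, vel = np.array([0.1, 0.1]), np.zeros(2)
for _ in range(50):
    gaze, vel = gaze_step(gaze, vel,
                          salient_pt=np.array([0.8, 0.3]),
                          audio_pt=np.array([0.2, 0.7]))
print(gaze.round(3))   # settles between the visual and audio attractors
```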
Multimodal emotion recognition is a crucial research area in the field of affective brain-computer interfaces. However, in practical applications, it is often challenging to obtain all modalities simultaneously. To deal with this problem, researchers focus on using cross-modal methods to learn multimodal representations with fewer modalities. However, due to the significant differences in the distributions of different modalities, it is challenging to enable any single modality to fully learn multimodal features. To address this limitation, we propose a Multi-to-Single (M2S) emotion recognition model that leverages contrastive learning and incorporates two innovative modules: 1) a spatial and temporal-sparse (STS) attention mechanism that enhances the encoders' ability to extract features from data; and 2) a novel Multi-to-Multi Contrastive Predictive Coding (M2M CPC) that learns and fuses features across different modalities. At test time, we use only a single modality for emotion recognition, reducing the dependence on multimodal data. Extensive experiments on five public multimodal emotion datasets demonstrate that our model achieves state-of-the-art performance on cross-modal tasks and maintains multimodal performance using only a single modality.
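A minimal sketch of a contrastive predictive objective between two modalities (InfoNCE with in-batch negatives), the general family to which the M2M CPC module belongs; the projection, temperature, and pairing scheme are assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss pulling together time-aligned features of two modalities
    (e.g. EEG and eye movement) and pushing apart mismatched pairs in the batch."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0))           # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

z_eeg = torch.randn(16, 128, requires_grad=True)  # features from one modality encoder
z_eye = torch.randn(16, 128, requires_grad=True)  # features from another modality encoder
loss = info_nce(z_eeg, z_eye)
loss.backward()   # gradients flow into both modality encoders
```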
Emotion recognition based on body movements is vital in human-computer interaction. However, existing emotion recognition methods predominantly focus on enhancing classification accuracy, often neglecting the provision of textual explanations to justify their classifications. In this paper, we propose an Emotion-Action Interpreter powered by Large Language Models (EAI-LLM), which not only recognizes emotions but also generates textual explanations by treating 3D body movement data as unique input tokens within large language models (LLMs). Specifically, we propose a multi-granularity skeleton tokenizer designed for LLMs, which separately extracts spatio-temporal tokens and semantic tokens from the skeleton data. This approach allows LLMs to generate more nuanced classification descriptions while maintaining robust classification performance. Furthermore, we treat the skeleton sequence as a specific language and propose a unified skeleton token module. This module leverages the extensive background knowledge and language processing capabilities of LLMs to address the challenges of joint training on heterogeneous datasets, thereby significantly enhancing recognition accuracy on individual datasets. Experimental results demonstrate that our model achieves recognition accuracy comparable to existing methods. More importantly, with the support of background knowledge from LLMs, our model can generate detailed emotion descriptions based on classification results, even when trained on a limited amount of labeled skeleton data.
This paper presents a novel method for automatically recognizing people's apparent personality traits as perceived by others. In previous studies, apparent personality trait recognition from multimodal human behavior is often modeled as directly estimating personality trait scores, i.e., the ``Big Five'' scores. In the model training phase, ground-truth personality trait scores were often determined from personality test results scored by many other people using fine-grained questionnaires; however, the rich information in these personality test results has not been leveraged for anything other than determining the ground-truth Big Five scores. The scores assigned to each questionnaire item are thought to capture more meta-level differences in personality characteristics. Therefore, we propose joint modeling methods that can estimate not only the Big Five scores but also questionnaire item-level scores, enabling a finer-grained awareness of multimodal human behavior. In addition, we present a newly created self-introduction video dataset with 50-item Big Five questionnaire results, since previous apparent personality trait recognition datasets do not provide such personality test results. Experiments using the created dataset demonstrate that our proposed joint modeling methods with a multimodal transformer backbone improve Big Five score estimation and effectively estimate questionnaire item-level scores. We also verify that the estimation performance reaches human evaluation performance.
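A minimal sketch of the joint modeling idea, one shared multimodal representation feeding two regression heads for the Big Five scores and the 50 item-level scores; the backbone dimensions and equal loss weighting are assumptions:

```python
import torch
import torch.nn as nn

class JointPersonalityHead(nn.Module):
    """Shared backbone features feed two heads: one for the five trait scores
    and one for the 50 questionnaire item-level scores, trained jointly."""

    def __init__(self, feat_dim=768, n_items=50, n_traits=5):
        super().__init__()
        self.item_head = nn.Linear(feat_dim, n_items)     # item-level scores
        self.trait_head = nn.Linear(feat_dim, n_traits)   # Big Five scores

    def forward(self, fused_feat):
        return self.trait_head(fused_feat), self.item_head(fused_feat)

head = JointPersonalityHead()
feat = torch.randn(8, 768)                                # multimodal transformer output
big5_pred, item_pred = head(feat)
loss = (nn.functional.mse_loss(big5_pred, torch.rand(8, 5)) +
        nn.functional.mse_loss(item_pred, torch.rand(8, 50)))   # joint objective
```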
In recent years, with advancements in brain science, spiking neural networks (SNNs) have garnered significant attention. SNNs generate spikes that mimic neuronal transmission in the human brain, thereby significantly reducing computational costs thanks to their event-driven nature during training. While deep SNNs have shown impressive performance on classification tasks, they still face challenges in more complex tasks such as object detection. In this paper, we propose SpikingYOLOX, which extends the structure of the original YOLOX by introducing signed spiking neurons and fast Fourier convolution (FFC). The designed ternary signed spiking neurons can generate three kinds of spikes to obtain more robust features in the deep layers of the backbone. Meanwhile, we integrate FFC with SNN modules to enhance object detection performance, because its global receptive field is beneficial to the object detection task. Extensive experiments demonstrate that the proposed SpikingYOLOX achieves state-of-the-art performance among SNN-based object detection methods.
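A rough sketch of a ternary signed spiking neuron that fires +1, 0, or -1 around symmetric thresholds; the threshold value and soft reset are assumptions:

```python
import torch

class TernarySignedSpike(torch.nn.Module):
    """Fires +1 when the membrane exceeds +threshold, -1 below -threshold,
    and stays silent otherwise, giving deep backbone layers signed features."""

    def __init__(self, threshold=1.0):
        super().__init__()
        self.threshold = threshold
        self.mem = None

    def forward(self, x):
        if self.mem is None or self.mem.shape != x.shape:
            self.mem = torch.zeros_like(x)
        self.mem = self.mem + x
        spikes = (self.mem >= self.threshold).float() - (self.mem <= -self.threshold).float()
        self.mem = self.mem - spikes * self.threshold   # soft reset toward zero
        return spikes

neuron = TernarySignedSpike()
print(neuron(torch.randn(1, 4, 3, 3)).unique())   # a subset of {-1, 0, 1}
```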