Total: 1
Neuron segmentation from electron microscopy (EM) volumes is crucial for understanding brain circuits, yet the complex neuronal structures in high-resolution EM images present significant challenges. EM data exhibits unique characteristics including high noise levels, anisotropic voxel dimensions, and ultra-long spatial dependencies that make traditional vision models inadequate. Inspired by autoregressive pretraining in language models, we propose TokenUnify, a hierarchical predictive coding framework that captures multi-scale dependencies through three complementary learning objectives. TokenUnify integrates random token prediction, next-token prediction, and next-all token prediction to create a comprehensive representational space with emergent properties. From an information-theoretic perspective, these three tasks are complementary and provide optimal coverage of visual data structure, with our approach reducing autoregressive error accumulation from O(K) to O(\sqrt K ) for sequences of length K. We also introduce a large-scale EM dataset with 1.2 billion annotated voxels, offering ideal long-sequence visual data with spatial continuity. Leveraging the Mamba architecture's linear-time sequence modeling capabilities, TokenUnify achieves a 44% performance improvement on downstream neuron segmentation and outperforms MAE by 25%. Our approach demonstrates superior scaling properties as model size increases, effectively bridging the gap between pretraining strategies for language and vision models. Code is available at https://github.com/ydchen0806/TokenUnify.