Applied Sciences, Vol. 15, Pages 3230: Multi-Head Structural Attention-Based Vision Transformer with Sequential Views for 3D Object Recognition
Applied Sciences doi: 10.3390/app15063230
Authors: Jianjun Bao, Ke Luo, Qiqi Kou, Liang He, Guo Zhao
Multi-view image classification tasks require the effective extraction of both spatial and temporal features to fully leverage the complementary information across views. In this study, we propose a lightweight yet powerful model, the Multi-head Sparse Structural Attention-based Vision Transformer (MSSAViT), which integrates structural self-attention mechanisms into a compact framework optimized for multi-view inputs. The model employs a frozen MobileNetV3 as a Feature Extraction Module (FEM) to ensure consistent feature patterns across views, followed by Spatial Sparse Self-Attention (SSSA) and Temporal Sparse Self-Attention (TSSA) modules that capture long-range spatial dependencies and inter-view temporal dynamics, respectively. By leveraging these structural attention mechanisms, the model achieves effective fusion of spatial and temporal information. Importantly, the total model size is only 6.1 M parameters, of which just 1.5 M are trainable, making it highly efficient. Comprehensive experiments demonstrate the proposed model’s superior performance and robustness in multi-view classification tasks, outperforming baseline methods while maintaining a lightweight design. These results highlight the potential of MSSAViT as a practical solution for real-world applications under resource constraints.
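The spatial-then-temporal attention fusion described in the abstract can be sketched as follows. This is a minimal single-head NumPy illustration, not the authors' implementation: the tensor shapes, the mean-pooling fusion, and the dense (non-sparse) single-head attention are all simplifying assumptions, whereas the paper's SSSA and TSSA modules are multi-head and sparse and operate on frozen MobileNetV3 features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (tokens, dim); single-head scaled dot-product self-attention
    # (dense stand-in for the paper's sparse structural attention)
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores) @ x

def mssavit_sketch(views):
    # views: (V, T, D) features from a frozen backbone:
    # V views, T spatial tokens per view, D channels
    # Spatial stage (SSSA-like): attend over tokens within each view
    spatial = np.stack([self_attention(v) for v in views])
    # Temporal stage (TSSA-like): attend across views at each token position
    temporal = np.stack(
        [self_attention(spatial[:, t]) for t in range(spatial.shape[1])],
        axis=1,
    )  # (T, V, D)
    # Fuse spatial and temporal information by global average pooling
    return temporal.mean(axis=(0, 1))  # (D,) embedding for classification

rng = np.random.default_rng(0)
feats = rng.normal(size=(12, 49, 64))  # e.g. 12 views, 7x7 token grid, 64 channels
emb = mssavit_sketch(feats)
print(emb.shape)  # (64,)
```

In a full model, the pooled embedding would feed a classification head; keeping the backbone frozen, as the abstract notes, is what confines the trainable parameters to the attention and head layers.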