Attention-guided Dynamic inference for model compression
Description
Attention models have recently gained popularity in Machine Learning with Transformer architectures, which dominate Natural Language Processing (NLP) and challenge CNN-based architectures in computer vision tasks. This success is due to the self-attention mechanism, the building block of Transformers, which assigns importance weights to different regions of the input sequence, enabling the model to focus on the information relevant to each prediction.

Recent work leverages the attention mechanism inherent in Transformers to reduce model complexity in image classification. Specifically, it uses attention to identify the most important regions of the input image, so that computation can be allocated to these salient spatial locations only. The motivation for compressing neural networks stems from their computational complexity, which is quadratic (O(N²)) for Transformers, where N is the number of input tokens, and from their memory requirements, both of which hinder their energy efficiency.

In our case, we explore a novel approach, named dynamic compression, which aims to reduce complexity during inference by dynamically allocating resources based on each input sample. Through a preliminary study, we observed that Transformers partition input images into tokens in a suboptimal way: for the image classification task, small models (fewer tokens) classify a subset of images better than bigger models (more tokens).
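To make the idea of allocating computation to salient locations concrete, the following is a minimal sketch (not the method studied in this work) of attention-guided token pruning in PyTorch: the attention paid by the [CLS] token to each patch token is used as an importance score, and only the top-scoring patch tokens are kept for subsequent layers. The function name, the keep ratio, and the shapes are illustrative assumptions.

```python
import torch

def prune_tokens_by_cls_attention(tokens, attn, keep_ratio=0.5):
    """Illustrative sketch: keep the patch tokens that receive the most
    attention from the [CLS] token, plus the [CLS] token itself.

    tokens: (B, N, D) token embeddings, tokens[:, 0] assumed to be [CLS]
    attn:   (B, H, N, N) attention weights from a self-attention layer
    keep_ratio: fraction of patch tokens to keep (hypothetical parameter)
    """
    B, N, D = tokens.shape
    # Attention from [CLS] to every patch token, averaged over heads: (B, N-1)
    cls_attn = attn[:, :, 0, 1:].mean(dim=1)
    num_keep = max(1, int(keep_ratio * (N - 1)))
    # Indices of the most-attended (salient) patch tokens
    topk = cls_attn.topk(num_keep, dim=1).indices            # (B, num_keep)
    idx = topk.unsqueeze(-1).expand(-1, -1, D)               # (B, num_keep, D)
    patches = tokens[:, 1:].gather(1, idx)
    # Re-attach the [CLS] token in front of the retained patches
    return torch.cat([tokens[:, :1], patches], dim=1)

if __name__ == "__main__":
    B, H, N, D = 2, 4, 197, 64        # e.g. ViT-style: 196 patches + [CLS]
    tokens = torch.randn(B, N, D)
    attn = torch.softmax(torch.randn(B, H, N, N), dim=-1)
    pruned = prune_tokens_by_cls_attention(tokens, attn, keep_ratio=0.5)
    print(pruned.shape)               # torch.Size([2, 99, 64])
```

Because self-attention cost scales quadratically with the number of tokens N, halving the token count in this way reduces the attention compute of the remaining layers by roughly a factor of four.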