Attention mechanisms, as exemplified in the Transformer model, have significantly advanced the field of sequence modeling, particularly in Natural Language Processing (NLP) and various branches of Signal Processing such as Speech Signal Processing and Digital Image Signal Processing. These mechanisms are adept at capturing dependencies within the context window, although their effectiveness can vary with the relative placement of tokens, and their quadratic complexity inherently limits the handling of long-range dependencies. Ongoing research continues to address these challenges, seeking more efficient ways to model long sequences and capture global context dependencies.
In recent years, the self-attention mechanism (or variations of it that are also dot-product based) integral to Transformer layers has become central to the encoders of Pre-Trained Models (PTMs) trained via Self-Supervised Learning (SSL) methods. Examples include WavLM and HuBERT for Speech Encoding, Llama 2 for Text Encoding, and BEiT for Digital Image Encoding. These PTMs excel at generating contextualized embeddings, surpassing traditional feature engineering methods, and are versatile across downstream tasks tailored to their respective training modalities.
The impetus for this work stems from the broad spectrum of downstream use cases that could benefit from an enhanced attention mechanism, given the inherent limitations of self-attention (and other dot-product attention variants) in Transformer models, which rely on the normalized dot-product, and the potential advantages of adopting a more robust and explainable approach.
Self-attention’s fixed-length context window can lead to sub-optimal performance, especially for long sequences where distant elements may not be relevant. Without inductive biases such as locality, self-attention layers may require more data to learn patterns that other methods capture more easily. Despite its theoretical capability, self-attention can struggle to capture long-term dependencies in practice, particularly as sequence length increases. Interpreting self-attention is also challenging: its activations are correlation-based and therefore inherit correlation’s drawbacks, such as a primary focus on pairwise similarities, making it difficult to understand why certain parts of the input are prioritized. Finally, although self-attention allows each token to attend to all others, it may not always effectively capture the most relevant context.
In our work, we introduce a significant enhancement to the Transformer model’s attention mechanism: the (Multi-Head) Gaussian Adaptive Attention Mechanism (GAAM). GAAM is designed to improve upon the standard self-attention mechanism in Transformers. Unlike conventional Transformer attention, which computes weights from the dot-product between query and key projections of the input, GAAM employs a Gaussian-based modulation of the input features. This approach enables the model to concentrate on the most pertinent features in a context-sensitive manner, thereby improving its capability to interpret and process sequential and spatial data.
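To make the contrast with dot-product attention concrete, the following is a minimal, hedged sketch of Gaussian-based feature modulation. The function name, the fixed values of mu and sigma, and the exact normalization are our own illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def gaussian_modulation(x, mu, sigma):
    # Weight each feature by a Gaussian of its deviation from the mean:
    # features near mu pass through nearly unchanged, outliers are damped.
    # No pairwise dot-products between positions are computed.
    weights = np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
    return weights * x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # (sequence length, feature dim)
y = gaussian_modulation(x, mu=0.0, sigma=1.0)  # modulated features, same shape
```

In GAAM itself, mu and sigma are learned from the data rather than being the fixed constants used here.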
GAAM’s attention mechanism, applied in various domains like multimedia recommendation (as in ), image classification (aligning with Patrick et al.’s robustness strategies), and text classification (enhancing accuracy in contexts like e-commerce as shown by ), can significantly enhance model performance. Its ability to dynamically recalibrate feature significance based on Gaussian parameters proves particularly beneficial, offering improved accuracy, robustness, and user experience across diverse and challenging real-world applications. Furthermore, GAAM’s Gaussian-based modulation offers a more interpretable framework for Artificial Intelligence (AI), addressing the critical need for transparency and trustworthiness in real-world AI systems.
Our proposed GAAM mechanism learns both the mean and variance of the input features in a Multi-Headed setting. The mechanism operates independently across heads, each focusing on a distinct, non-overlapping subspace of the input features. By employing Gaussian modulation, GAAM assigns varying levels of importance to each feature, effectively generating local attention outputs from each head. These outputs are then combined to construct a comprehensive Global Attention map. Because each head independently adjusts its mean and variance, it can focus on the skewness of its own data subset, so that collectively the heads capture a broader range of data characteristics, including asymmetries and non-Gaussian traits. This contrasts with previous approaches in the literature in which no parameters of the Gaussian distribution are learned (they are hard-coded and thus non-specific to the data they are applied to), in which only multiplicative parameters, such as a scaled variance alongside a pre-defined amplitude updated during training, are learned, or in which the attention framework is limited because it can only explicitly model Gaussian behavior.
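The multi-head description above can be sketched as follows. This is a hedged illustration under our own naming and parameterization (per-head mean and log-variance vectors, a reshape-based split into non-overlapping subspaces), not the authors' reference implementation, and training of the parameters is not shown:

```python
import numpy as np

class MultiHeadGAAMSketch:
    """Illustrative sketch: each head owns a disjoint feature subspace and
    holds a learnable mean and variance used to Gaussian-modulate its slice."""

    def __init__(self, d_model, n_heads, seed=0):
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        rng = np.random.default_rng(seed)
        # Learnable per-head parameters (initialized here; updates via
        # backpropagation are omitted from this sketch).
        self.mu = rng.normal(scale=0.1, size=(n_heads, self.d_head))
        self.log_var = np.zeros((n_heads, self.d_head))

    def __call__(self, x):
        # x: (seq_len, d_model) -> split into non-overlapping head subspaces
        seq_len, d_model = x.shape
        heads = x.reshape(seq_len, self.n_heads, self.d_head)
        var = np.exp(self.log_var)  # log-parameterization keeps variance positive
        # Gaussian modulation: local attention weights, independent per head
        weights = np.exp(-((heads - self.mu) ** 2) / (2.0 * var))
        local = weights * heads
        # Concatenate the local outputs into the global attention map
        return local.reshape(seq_len, d_model)

gaam = MultiHeadGAAMSketch(d_model=16, n_heads=4)
x = np.random.default_rng(1).normal(size=(5, 16))
global_map = gaam(x)  # same shape as x: per-head local outputs, concatenated
```

Because each head's mu and log_var cover only its own subspace, the heads can settle on different means and spreads, which is how the mechanism described above accommodates asymmetric, collectively non-Gaussian data.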