ListenNet: A Lightweight Spatio-Temporal Enhancement Nested Network for Auditory Attention Detection (2025)

Cunhang Fan  Xiaoke Yang  Hongyu Zhang  Ying Chen  Lu Li  Jian Zhou  Zhao Lv
Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University
{cunhang.fan, jzhou, kjlz}@ahu.edu.cn,
{e22201014, e22201103, e23201035, e12314059}@stu.ahu.edu.cn
Corresponding Author

Abstract

Auditory attention detection (AAD) aims to identify the direction of the attended speaker in multi-speaker environments from brain signals, such as Electroencephalography (EEG) signals. However, existing EEG-based AAD methods overlook the spatio-temporal dependencies of EEG signals, limiting their decoding and generalization abilities. To address these issues, this paper proposes a Lightweight Spatio-Temporal Enhancement Nested Network (ListenNet) for AAD. The ListenNet has three key components: Spatio-temporal Dependency Encoder (STDE), Multi-scale Temporal Enhancement (MSTE), and Cross-Nested Attention (CNA). The STDE reconstructs dependencies between consecutive time windows across channels, improving the robustness of dynamic pattern extraction. The MSTE captures temporal features at multiple scales to represent both fine-grained and long-range temporal patterns. In addition, the CNA integrates hierarchical features more effectively through novel dynamic attention mechanisms to capture deep spatio-temporal correlations. Experimental results on three public datasets demonstrate the superiority of ListenNet over state-of-the-art methods in both subject-dependent and challenging subject-independent settings, while reducing the trainable parameter count by approximately 7 times. Code is available at: https://github.com/fchest/ListenNet.

1 Introduction

In multi-speaker environments, humans with normal hearing have the ability to focus on a specific speaker while ignoring interference from other sound sources, a phenomenon known as the cocktail party effect Cherry (1953). The mechanism behind it is commonly referred to as selective auditory attention. This inherent ability plays a crucial role in human communication and has attracted growing interest in auditory attention detection (AAD), which aims to localize the attended speaker using brain signals Dai et al. (2018). AAD could potentially enhance the design of human-centered intelligent interaction systems, such as hearing aids.

Neuroscientific studies have demonstrated a nonlinear relationship between auditory attention and brain activity Choi et al. (2013); Mesgarani and Chang (2012), which involves higher cognitive processing in the cerebral cortex. Electroencephalography (EEG) signals are widely used due to their non-invasive nature, ease of acquisition, and high temporal resolution DeTaillez et al. (2020); Fan et al. (2024a). Spatio-temporal patterns of EEG reveal attentional regulation during selective listening Tune et al. (2021). Findings on inter-subject correlation (ISC) suggest that EEG signals are synchronized across subjects during perception of the same naturalistic visual and narrative speech stimuli Dmochowski et al. (2014); Shen et al. (2022). From this perspective, EEG signals exhibit temporal correlations, spatial correlations across channels, and spatio-temporal dependencies, which could provide valuable information for discriminating different attention states and advancing robust AAD methods.

Despite significant progress made by existing EEG-based AAD methods, three major challenges still limit their performance and practical application. Firstly, many existing methods have made substantial strides in spatio-temporal modeling, effectively capturing dynamic spatial patterns, leading to improved detection performance. These methods typically treat space and time separately, as shown in Figure 1(a) and (b). Spatial dependencies are captured independently, and temporal dependencies are subsequently extracted. However, these methods overlook the temporal context under dynamic time conditions, as well as the spatio-temporal dependencies across different channels during auditory stimulus processing. Secondly, the individual differences and the non-stationary characteristics of EEG signals lead to significant performance degradation when applying AAD methods across subjects. Cai et al. (2024); Fan et al. (2024b) effectively leverage individual-specific features to demonstrate strong performance in the subject-dependent setting, but they lack good generalization ability, which makes it difficult to develop subject-independent robust methods. Lastly, the pursuit of accuracy in current methods Jiang et al. (2022); Ni et al. (2024) leads to large model sizes and high computational complexity, which are often attributed to complex feature extraction methods and transformer attention mechanisms, making them impractical for low-power devices.

[Figure 1]

To address these issues, this paper proposes a Lightweight Spatio-Temporal Enhancement Nested Network (ListenNet) with low parameter count and computational complexity. As shown in Figure 1(c), it captures multi-channel spatio-temporal dependencies and multi-scale dynamic temporal patterns, ensuring high accuracy and strong generalization. Specifically, the proposed ListenNet consists of three components: (1) Spatio-temporal Dependency Encoder (STDE) captures consecutive time steps and multi-channel features, differing from previous studies that first focus on channel features. It expands the input EEG signals within each channel to capture temporal dependencies and extracts spatial features both within and across channels, enhancing spatio-temporal representation capacity. (2) Multi-scale Temporal Enhancement (MSTE) captures temporal dependencies at multiple time scales, adding dynamic temporal context to build robust temporal embeddings. (3) Cross-Nested Attention (CNA) groups spatio-temporal features in parallel, extracts sub-feature context, and recalibrates weights by encoding global information, enhancing deep spatio-temporal correlations. Finally, the effective features are passed to a classifier to predict the subject's attended speaker. The major contributions of this work are summarized as follows:

  • The proposed ListenNet overcomes the performance and efficiency limitations of existing methods for AAD by efficiently capturing spatio-temporal dependencies in both subject-dependent and subject-independent settings.

  • A novel MSTE module is designed to efficiently extract multi-channel dependencies across multiple scales and time steps to integrate multi-level features, enhancing and complementing robust temporal representations.

  • Experimental results show that ListenNet achieves outstanding accuracy while reducing the trainable parameter count by approximately 7 times. Specifically, it surpasses the best baseline by 6.1% on the DTU dataset under the subject-dependent setting and by 8.2% on the KUL dataset under the subject-independent setting, all within a 1-second decision window.

2 Related Works

For spatial dependency modeling, existing methods can be divided into those based on physical dependencies and those based on dynamic dependencies. Cai et al. (2021); Jiang et al. (2022) project differential entropy (DE) features in the frequency domain onto 2D topological maps using the known electrode positions, calculate spatial dependency based on physical distance, and achieve good performance. Although physical dependency conforms to prior physiological paradigms, the positional relations between electrodes cannot be directly equated to the functional connections between channels Liu et al. (2024). Alternatively, some researchers learn spatial dependency relationships autonomously during training. Fan et al. (2024b) extracts DE features as nodes to construct graph neural networks (GNN) and utilizes an updated parameter matrix to represent spatial dependency. Su et al. (2021); Cai et al. (2023, 2024) design channel-wise attention mechanisms that learn to assign distinct weights to capture spatial patterns. Ni et al. (2024) utilizes a dual-branch approach to extract features from the temporal and frequency domains in parallel: the frequency branch projects DE onto 2D maps and uses their topological patterns, while the temporal branch embeds a single cross-channel time step as an input token for a transformer encoder to learn features autonomously. The current state-of-the-art (SOTA) study Yan et al. (2024) employs spatial convolution operations across all channels to effectively capture spatial dependencies, resulting in competitive AAD performance.

For temporal dependency modeling, existing methods typically capture temporal dependencies using convolutional neural networks (CNN) and attention mechanisms. Monesi et al. (2020) independently uses long short-term memory (LSTM) networks to capture dependencies within EEG signals and achieves decent decoding performance. Vandecappelle et al. (2021) applies a simple one-layer CNN model to directly process EEG data, where the time series is reduced to a single value. Su et al. (2022) sequentially processes temporal information after spatial attention, multiplying attention maps with EEG signals for adaptive feature refinement. Wang et al. (2023) utilizes a temporal attention mechanism after a GNN that assigns varying weights to a sequence of EEG signals, enabling the capture of complex temporal dynamics and the detection of even subtle changes in attentional states over time. Recently, EskandariNasab et al. (2024) employs gated recurrent units (GRU) and CNN to consider both historical and new temporal information when calculating the current state value, thereby inferring the temporal dependencies between time steps.

The methods mentioned above often focus separately on spatial and temporal features, or adopt a two-step processing strategy in which spatial dependencies are captured, followed by the modeling of temporal dependencies. However, these approaches tend to overlook the rich temporal contextual information under dynamic time conditions, as well as the spatio-temporal distribution characteristics of different brain regions during the reception, processing, and response to auditory stimuli. As a result, the failure to capture critical spatio-temporal dependencies significantly limits model performance.

[Figure 2]

3 The Proposed ListenNet Method

The proposed ListenNet is designed to comprehensively integrate spatio-temporal dependencies in EEG signals, addressing the limitations of existing methods by modeling dependencies across both multiple channels and time scales. Figure 2 illustrates the overall structure of ListenNet. The method is specified in the following subsections.

Given the EEG data split by a moving window, a series of decision windows is obtained, each containing a short time segment of EEG signals. Consider the original EEG data of a decision window, represented by $X=[x_1,\dots,x_i,\dots,x_T]\in\mathbb{R}^{C\times T}$, where $C$ is the number of EEG channels and $T$ is the length of the decision window. Here, $x_i\in\mathbb{R}^{C\times 1}$ is the EEG data at the $i$-th time step of $X$. We aim to learn a representation $F(\cdot)$ that maps $x$ to the corresponding label $y=F(x)$, where $y$ denotes the locus (i.e., left or right) of auditory attention. Before inputting the EEG data into ListenNet, a Euclidean alignment (EA) method Miao et al. (2022) is employed, which standardizes the EEG data by calculating the average covariance matrix to extract shared features across different brain states. $\tilde{X}\in\mathbb{R}^{C\times T}$ is obtained by normalizing and aligning $X$.
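To make the alignment step concrete, the following is a minimal NumPy sketch of Euclidean alignment as we read it from the cited work: every trial is whitened by the inverse square root of the average spatial covariance. The function name and toy trial shapes are illustrative and not taken from the ListenNet code.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def euclidean_alignment(trials):
    """Align EEG trials (n_trials, C, T) by whitening with the inverse square root
    of the average spatial covariance (standard Euclidean alignment)."""
    ref_cov = np.mean([x @ x.T / x.shape[1] for x in trials], axis=0)
    whitener = fractional_matrix_power(ref_cov, -0.5)
    return np.stack([whitener @ x for x in trials])

# Toy example: 8 trials of 64-channel EEG, 1-second windows at 128 Hz
aligned = euclidean_alignment(np.random.randn(8, 64, 128))
```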

3.1 Spatio-temporal Dependency Encoder (STDE)

EEG signals are derived from different brain regions and exhibit dynamic changes in connectivity patterns between brain regions over time. Previous studies neglect the spatio-temporal characteristics of EEG signals. Meanwhile, as networks become increasingly complex Zhang et al. (2023); Chen et al. (2023); Niu et al. (2024), the limited size of EEG data makes these networks prone to overfitting. CNN-based networks have demonstrated sufficient feature extraction capabilities in brain-computer interface (BCI) tasks Lawhern et al. (2018); Miao et al. (2023). Considering these characteristics, we design a spatio-temporal dependency encoder to extract robust dynamic patterns using depthwise separable convolutions, which consists of the temporal feature component (STDE-T) and the spatial feature component (STDE-S), as shown in Figure 2(a).

Firstly, STDE-T extracts dynamic features from EEG signals through temporal convolution layers, capturing temporal dependencies and constructing the temporal patterns $E_t$. This can be expressed as:

$E_t = \mathrm{GELU}(\mathrm{DepthwiseConv}(\mathrm{Conv}(\tilde{X})))$  (1)

where $E_t \in \mathbb{R}^{d_{\text{depth}} \times C \times T'}$, and $\mathrm{Conv}(\cdot)$ represents convolutional filters with a $1 \times 1$ kernel size that perform spatio-temporal reshaping of the input signals. $\mathrm{DepthwiseConv}(\cdot)$ performs convolution independently on each input channel along the time dimension with a kernel size of $1 \times k_0$ and a group size of $d_{\text{depth}}$, followed by the $\mathrm{GELU}(\cdot)$ activation function.
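As a hedged illustration of Eq. (1), the snippet below sketches STDE-T with standard PyTorch layers, using the hyperparameters reported in Section 4.3 ($k_0 = 8$, $d_{\text{depth}} = 16$); padding and other details of the official implementation may differ.

```python
import torch
import torch.nn as nn

class STDE_T(nn.Module):
    """Temporal component of the STDE (a sketch of Eq. (1))."""
    def __init__(self, d_depth=16, k0=8):
        super().__init__()
        # 1x1 conv expands the single input map to d_depth feature maps
        self.pointwise = nn.Conv2d(1, d_depth, kernel_size=1)
        # depthwise conv along time, one filter per feature map (groups = d_depth)
        self.depthwise = nn.Conv2d(d_depth, d_depth, kernel_size=(1, k0),
                                   groups=d_depth)
        self.act = nn.GELU()

    def forward(self, x):                    # x: (B, 1, C, T)
        return self.act(self.depthwise(self.pointwise(x)))     # (B, d_depth, C, T')

e_t = STDE_T()(torch.randn(2, 1, 64, 128))   # 64-channel EEG, 1-second window at 128 Hz
```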

Subsequently, STDE-S encodes the spatial distribution information across all channels through spatial convolution layers, capturing the spatial distribution features $E_s$ from EEG signals, which facilitate a comprehensive understanding of the brain's activity patterns in response to various auditory stimuli. This can be expressed as:

$E_s = \mathrm{GELU}(\mathrm{DepthwiseConv}(\mathrm{Conv}(E_t)))$  (2)

where $E_s \in \mathbb{R}^{d_{\text{depth}} \times 1 \times T'}$, and $\mathrm{Conv}(\cdot)$ represents convolutional filters with a $1 \times 1$ kernel size for initial channel mapping and channel-wise feature fusion. $\mathrm{DepthwiseConv}(\cdot)$ performs convolution with a $C \times 1$ kernel and a group size of $d_{\text{depth}}$ to capture inter-channel dependencies, followed by the $\mathrm{GELU}(\cdot)$ activation function. We integrate the spatial distribution features with the temporal patterns to form a comprehensive spatio-temporal embedding $E_s$.
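A corresponding sketch of Eq. (2) is given below: the pointwise convolution fuses feature maps and the depthwise convolution spans all $C$ EEG channels, collapsing the spatial axis to 1. This is our reading of the text, not the released code.

```python
import torch
import torch.nn as nn

class STDE_S(nn.Module):
    """Spatial component of the STDE: pointwise fusion followed by a depthwise
    convolution spanning all EEG channels (a sketch of Eq. (2))."""
    def __init__(self, n_channels=64, d_depth=16):
        super().__init__()
        self.pointwise = nn.Conv2d(d_depth, d_depth, kernel_size=1)
        # depthwise conv over the full channel axis collapses C to 1
        self.depthwise = nn.Conv2d(d_depth, d_depth, kernel_size=(n_channels, 1),
                                   groups=d_depth)
        self.act = nn.GELU()

    def forward(self, e_t):                  # e_t: (B, d_depth, C, T')
        return self.act(self.depthwise(self.pointwise(e_t)))   # (B, d_depth, 1, T')

e_s = STDE_S()(torch.randn(2, 16, 64, 121))  # toy E_t from the previous sketch
```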

3.2 Multi-scale Temporal Enhancement (MSTE)

The auditory system is sensitive to the temporal patterns Puffay et al. (2022). Inspired by the concept of multi-scale modeling Wu et al. (2020); Fan et al. (2024c), we propose a novel MSTE module. As shown in Figure 2(b), the module captures dynamic brain activity across multiple time scales, offering a comprehensive representation of temporal patterns.

MSTE integrates dilated convolutions with the inception strategy to capture temporal features across multiple scales, thereby enabling a more comprehensive representation of multi-level temporal dependencies and enhancing the modeling of complex temporal patterns. The dilated convolution filters use four different kernel sizes to capture patterns at different time scales, with the same dilation factor progressively expanding the effective receptive field. This enables the module to capture both fine-grained and long-term temporal dependencies more efficiently without increasing the number of parameters. Given the input from the temporal convolution layers, the module applies four convolutional filters, each with a fixed dilation factor, to extract multi-scale temporal features. The outputs are truncated to match the output size of the largest kernel, concatenated along the channel dimension, and normalized using batch normalization, ultimately generating the multi-scale feature map. The above process can be formulated as:

$U = \left[\,\mathrm{DilatedConv}_{1\times k}(E_t) \mid k \in \{k_1, k_2, k_3, k_4\}\,\right]$  (3)

where $U \in \mathbb{R}^{d_{\text{depth}} \times C \times T^{k}_{\min}}$, and $T^{k}_{\min}$ denotes the minimum time dimension among the branch outputs. $[\cdot]$ denotes the concatenation operation, and $\mathrm{DilatedConv}_{1\times k}(\cdot)$ is implemented as a set of dilated convolutions with $k \in \{k_1, k_2, k_3, k_4\}$. For each kernel size $k$, a convolution is applied along the temporal dimension with a fixed dilation factor $d$.

The skip connection is implemented using a depthwise convolution with a kernel size of $C \times 1$ and a group size of $d_{\text{depth}}$, which transforms spatial information while preserving the channel structure and standardizes the sequence length for consistent transmission to the output module. The features are resized via bilinear interpolation to match the dimensions required by the subsequent layer, resulting in $S \in \mathbb{R}^{d_{\text{depth}} \times 1 \times T'}$, which is then added to $E_s$, producing a robust representation of spatio-temporal dynamics $E_s' \in \mathbb{R}^{d_{\text{depth}} \times 1 \times T'}$.
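The sketch below follows this description: four dilated temporal convolutions with kernel sizes $\{1, 2, 3, 5\}$, truncation to the shortest output, concatenation, batch normalization, a depthwise $C \times 1$ skip convolution, and bilinear resizing before the residual addition to $E_s$. The per-branch channel width is our assumption, chosen so that the concatenation matches the stated shape of $U$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSTE(nn.Module):
    """Multi-scale temporal enhancement (a sketch, not the official implementation)."""
    def __init__(self, n_channels=64, d_depth=16, kernels=(1, 2, 3, 5), dilation=1):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(d_depth, d_depth // len(kernels), kernel_size=(1, k),
                      dilation=(1, dilation))
            for k in kernels])                       # four dilated temporal convolutions
        self.bn = nn.BatchNorm2d(d_depth)
        # skip path: depthwise conv over all EEG channels (kernel C x 1, groups = d_depth)
        self.skip = nn.Conv2d(d_depth, d_depth, kernel_size=(n_channels, 1),
                              groups=d_depth)

    def forward(self, e_t, e_s):                     # e_t: (B, d, C, T'), e_s: (B, d, 1, T')
        outs = [b(e_t) for b in self.branches]
        t_min = min(o.shape[-1] for o in outs)       # truncate to the shortest branch output
        u = self.bn(torch.cat([o[..., :t_min] for o in outs], dim=1))
        s = self.skip(u)                             # (B, d_depth, 1, t_min)
        s = F.interpolate(s, size=e_s.shape[-2:], mode="bilinear", align_corners=False)
        return e_s + s                               # E_s'

e_s_prime = MSTE()(torch.randn(2, 16, 64, 121), torch.randn(2, 16, 1, 121))
```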

3.3 Cross-Nested Attention (CNA)

The multi-head attention mechanism in transformer models achieves significant results but incurs high computational cost. Inspired by the parallel strategy for cross-dimensional spatial information aggregation Wang et al. (2020); Ouyang et al. (2023), we propose a novel cross-nested attention module that efficiently integrates hierarchical spatio-temporal features and reduces computational cost.

CNA employs dual-branch decomposition and interactive enhancement, extracting deep spatio-temporal features through attention weighting. Prior to processing, the input temporal feature $E_t$ is depth-aligned with $E_s'$ to produce $E_t' \in \mathbb{R}^{d_{\text{depth}} \times d_{\text{depth}} \times T'}$. As shown in Figure 2(c), both $E_t'$ and $E_s'$ are divided into $G$ groups along the depth dimension, where $G = \lfloor d_{\text{depth}}/2 \rfloor$ and $\lfloor\cdot\rfloor$ denotes the floor operation. The dimension-adjusted features are denoted as $F_t$ and $F_s$, respectively. Then, a dual-branch spatio-temporal module is applied to decompose and capture global information in both directions, producing two enhanced features, $F_1$ and $F_2$, as formulated below:

$F_1 = \mathrm{GN}\big(F_t \odot \sigma(\mathrm{AAP}_S(F_t)) \odot \sigma(\mathrm{AAP}_T(F_t))\big)$
$F_2 = \mathrm{GN}\big(F_s \odot \sigma(\mathrm{AAP}_S(F_s)) \odot \sigma(\mathrm{AAP}_T(F_s))\big)$  (4)

where $\mathrm{GN}(\cdot)$ denotes the group normalization operation, $\mathrm{AAP}_S(\cdot)$ denotes spatial adaptive average pooling, $\mathrm{AAP}_T(\cdot)$ denotes temporal adaptive average pooling, $\sigma(\cdot)$ denotes the sigmoid activation function, and $\odot$ denotes element-wise multiplication.

To capture long-range dependencies and global context, global average pooling and softmax are applied to each input branch to produce attention vectors. These are reshaped and used to compute cross-attention maps with features from the opposite branch via matrix multiplication. The resulting maps are concatenated and passed through a shared $1 \times 1$ convolution for feature fusion and dimensionality reduction, yielding the final attention weights $W \in \mathbb{R}^{(B \times G) \times 1 \times 1 \times T'}$, with $B$ denoting the batch size. Finally, the output deep spatio-temporal features $E \in \mathbb{R}^{d_{\text{depth}} \times 1 \times T'}$ are obtained by applying element-wise multiplication between $F_s$ and the sigmoid-activated $W$.
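As a partial illustration, the snippet below sketches only the per-branch gating of Eq. (4): directional average pooling, sigmoid gates, and group normalization. The cross-branch attention fusion that produces $W$ is omitted, and the pooling axes and grouping convention are assumptions.

```python
import torch
import torch.nn as nn

class CNAGate(nn.Module):
    """Per-branch gating of Eq. (4): directional average pooling, sigmoid gates,
    and group normalization (a sketch; pooling axes are assumptions)."""
    def __init__(self, channels=16, groups=8):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.pool_s = nn.AdaptiveAvgPool2d((None, 1))  # average over the temporal axis
        self.pool_t = nn.AdaptiveAvgPool2d((1, None))  # average over the spatial axis

    def forward(self, f):                              # f: (B, channels, H, T')
        gate = torch.sigmoid(self.pool_s(f)) * torch.sigmoid(self.pool_t(f))
        return self.gn(f * gate)                       # F_1 or F_2

f2 = CNAGate()(torch.randn(2, 16, 1, 121))             # gating applied to a toy E_s'
```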

3.4 Classifier

The classifier is designed to provide the final auditory attention results. Global average pooling is applied to reduce the dimensions of the features output by the CNA module. Then, the normalized feature maps are flattened into a 1D vector and fed into a fully connected layer to produce the final result. In the training stage, we apply the binary cross-entropy function to update the parameters.

$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log Q_i + (1 - y_i)\log(1 - Q_i)\big]$  (5)

where $y_i$ denotes the ground-truth label of the $i$-th decision window, $N$ denotes the number of samples, and $Q_i$ is the predicted probability of the attended direction obtained via softmax processing.
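A minimal sketch of this classification head and loss is shown below; a two-class cross-entropy (softmax followed by Eq. (5)) is used, and all layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Classification head: global average pooling -> flatten -> fully connected."""
    def __init__(self, d_depth=16, n_classes=2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)    # reduce (1, T') to one value per feature map
        self.fc = nn.Linear(d_depth, n_classes)

    def forward(self, e):                       # e: (B, d_depth, 1, T') from CNA
        return self.fc(self.pool(e).flatten(1))

logits = Classifier()(torch.randn(4, 16, 1, 121))
labels = torch.tensor([0, 1, 1, 0])             # attended locus: 0 = left, 1 = right
loss = nn.CrossEntropyLoss()(logits, labels)    # softmax + Eq. (5) for two classes
```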

4 Experiment

4.1 Datasets

We evaluate ListenNet on three publicly available datasets: KUL Das et al. (2019, 2016), DTU Fuglsang et al. (2018, 2017), and AVED ZHANG et al. (2024). We summarize the details of the above datasets in Table 1.

1) KUL: This dataset consists of 16 normal-hearing subjects, with 64-channel EEG data recorded. Each subject was instructed to attend to one of two competing voices presented at 90° to the left or right. Each subject completed 8 trials, each lasting 6 minutes.

2) DTU: This dataset consists of 18 normal-hearing subjects, with 64-channel EEG data recorded. Each subject was instructed to perform a target speaker tracking task in an environment with reverberation and dynamic background noise interference, attending to one of two competing voices from speakers positioned at ±60° relative to the subject. Each subject completed 60 trials, each lasting 50 seconds.

3) AVED: This dataset consists of 20 normal-hearing subjects, with 32-channel EEG data recorded. Subjects were evenly divided into two experimental conditions: audio-only and audio-visual, with 10 subjects in each condition. Each subject was instructed to attend to one of two competing voices presented at 90° to the left or right. In the audio-visual condition, subjects not only listened to the stories but also watched the video of the narrator they were instructed to focus on. Each subject completed 16 trials, each lasting 152 seconds.

Table 1: Summary of the KUL, DTU, and AVED datasets.

| Dataset | Scene | Subjects | Channels | Stimulus Direction | Duration (minutes) |
|---|---|---|---|---|---|
| KUL | audio-only | 16 | 64 | ±90° | 48 |
| DTU | audio-only | 18 | 64 | ±60° | 50 |
| AVED | audio-only | 10 | 32 | ±90° | 40 |
| AVED | audio-visual | 10 | 32 | ±90° | 40 |

4.2 Data Processing

To eliminate artifact noise and obtain cleaner EEG signals, specific preprocessing steps are applied to the three datasets to ensure consistency and comparability across experiments. For the KUL dataset, EEG signals are band-pass filtered (0.1–50 Hz) to remove irrelevant frequencies and downsampled to 128 Hz. For the DTU dataset, 50 Hz power line interference is filtered out, followed by downsampling to 128 Hz and high-pass filtering at 0.1 Hz; eye artifacts are removed using joint decorrelation, and the data are re-referenced to the average EEG channel response. For the AVED dataset, 50 Hz power line interference is removed, and the signals are band-pass filtered (0.1–50 Hz) and downsampled to 128 Hz; ocular and muscle artifacts are then eliminated using independent component analysis (ICA), and all EEG channels are re-referenced.
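The band-pass filtering and downsampling steps can be reproduced with standard signal-processing tools. The sketch below uses SciPy, omits artifact removal (ICA / joint decorrelation) and re-referencing, and passes in an illustrative original sampling rate.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def bandpass_and_downsample(eeg, fs_in, fs_out=128, band=(0.1, 50.0)):
    """Zero-phase band-pass filter and downsample one EEG trial (channels x samples)."""
    sos = butter(4, band, btype="bandpass", fs=fs_in, output="sos")
    filtered = sosfiltfilt(sos, eeg, axis=-1)          # 0.1-50 Hz band-pass
    return resample_poly(filtered, fs_out, fs_in, axis=-1)   # downsample to 128 Hz

# Example: a 64-channel trial recorded at 512 Hz (illustrative rate), 10 seconds long
clean = bandpass_and_downsample(np.random.randn(64, 512 * 10), fs_in=512)
```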

To evaluate ListenNet, we compare it with other SOTA AAD methods under both subject-dependent and the more challenging subject-independent settings. Specifically, four open-source models are selected as baselines: SSF-CNN Cai et al. (2021), MBSSFCC Jiang et al. (2022), DBPNet Ni et al. (2024), and DARNet Yan et al. (2024).

4.3 Implementation Details

We evaluate the performance of ListenNet on the KUL, DTU, and AVED datasets under both subject-dependent and subject-independent settings. For the subject-dependent condition, each subject's data is split into training, validation, and test sets in an 8:1:1 ratio. The batch size is set to 32, the maximum number of epochs to 100, and an early stopping strategy is employed. Moreover, the model is trained using an Adam optimizer with a learning rate of 5e-4 and a weight decay of 3e-4. For the subject-independent condition, the leave-one-subject-out (LOSO) cross-validation strategy is used: one subject's EEG data constitutes the test data, and the remaining subjects' EEG data constitute the training data. Here, the batch size is set to 128, with a maximum of 100 epochs. An Adam optimizer is also used, with a learning rate of 1e-3 and a weight decay of 3e-4.

The following describes the implementation details, including the training settings and network configuration. The hyperparameters of ListenNet are kept fixed across the three datasets to ensure a fair comparison of its generalizability. For STDE, the kernel size $k_0$ is set to 8, and the group size $d_{\text{depth}}$ is set to 16. For MSTE, the kernel sizes used in the 2D dilated convolutional filters are $k \in \{1, 2, 3, 5\}$, and the dilation factor $d$ is set to 1. Consequently, the number of groups $G$ in CNA is configured as 8. All experiments are conducted using PyTorch on an RTX 4090 GPU.
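The subject-dependent training configuration can be expressed as the PyTorch skeleton below; the placeholder model and random tensors stand in for ListenNet and the preprocessed EEG windows, and early stopping on the validation set is omitted for brevity.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and random data purely for illustration; only the optimizer
# settings and loop structure mirror the configuration described above.
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 128, 2))
train_loader = DataLoader(
    TensorDataset(torch.randn(320, 1, 64, 128), torch.randint(0, 2, (320,))),
    batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=3e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):              # early stopping on a validation set is omitted
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```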

5 Results

5.1 Comparison with Prior Art

Table 2: AAD accuracy (%, mean ± std) under the subject-dependent (SD) and subject-independent (SI) settings for different decision windows.

| Dataset | Scene | Model | SD 0.1-second | SD 1-second | SD 2-second | SI 1-second | SI 2-second |
|---|---|---|---|---|---|---|---|
| KUL | audio-only | CNN Vandecappelle et al. (2021) | 74.3 | 84.1 | 85.7 | 56.8 ± 5.58 | 59.5 ± 8.21 |
| KUL | audio-only | SSF-CNN Cai et al. (2021) | 76.3 ± 8.47 | 84.4 ± 8.67 | 87.8 ± 7.87 | 59.3 ± 6.69 | 60.8 ± 8.40 |
| KUL | audio-only | MBSSFCC Jiang et al. (2022) | 79.0 ± 7.34 | 86.5 ± 7.16 | 89.5 ± 6.74 | 62.7 ± 8.08 | 64.7 ± 8.62 |
| KUL | audio-only | EEGraph Cai et al. (2023) | 88.7 ± 6.59 | 96.1 ± 3.22 | 96.5 ± 3.34 | - | - |
| KUL | audio-only | DGSD Fan et al. (2024b) | - | 90.3 ± 7.29 | 93.3 ± 6.53 | 63.6 ± 8.00 | - |
| KUL | audio-only | DBPNet Ni et al. (2024) | 85.3 ± 6.22 | 94.4 ± 4.62 | 95.3 ± 4.63 | 61.1 ± 8.26 | 62.3 ± 7.37 |
| KUL | audio-only | DARNet Yan et al. (2024) | 89.2 ± 5.50 | 94.8 ± 4.53 | 95.5 ± 4.89 | 69.9 ± 11.82 | 71.9 ± 13.01 |
| KUL | audio-only | ListenNet (ours) | 92.5 ± 5.24 | 96.9 ± 3.01 | 97.3 ± 2.62 | 78.1 ± 13.50 | 79.6 ± 14.60 |
| DTU | audio-only | CNN Vandecappelle et al. (2021) | 56.7 | 63.3 | 65.2 | 51.8 ± 3.03 | 52.9 ± 3.42 |
| DTU | audio-only | SSF-CNN Cai et al. (2021) | 62.5 ± 3.40 | 69.8 ± 5.12 | 73.3 ± 6.21 | 52.3 ± 3.50 | 53.4 ± 4.16 |
| DTU | audio-only | MBSSFCC Jiang et al. (2022) | 66.9 ± 5.00 | 75.6 ± 6.55 | 78.7 ± 6.75 | 52.5 ± 4.35 | 53.9 ± 5.80 |
| DTU | audio-only | EEGraph Cai et al. (2023) | 72.5 ± 7.41 | 78.7 ± 6.47 | 79.4 ± 7.16 | - | - |
| DTU | audio-only | DGSD Fan et al. (2024b) | - | 79.6 ± 6.76 | 82.4 ± 6.86 | 55.2 ± 4.07 | - |
| DTU | audio-only | DBPNet Ni et al. (2024) | 74.0 ± 5.20 | 79.8 ± 6.91 | 80.2 ± 6.79 | 55.5 ± 6.33 | 55.8 ± 6.11 |
| DTU | audio-only | DARNet Yan et al. (2024) | 74.6 ± 6.09 | 80.1 ± 6.85 | 81.2 ± 6.34 | 55.6 ± 4.13 | 55.6 ± 4.04 |
| DTU | audio-only | ListenNet (ours) | 79.4 ± 7.00 | 86.2 ± 5.55 | 86.6 ± 4.82 | 56.8 ± 7.32 | 57.2 ± 5.83 |
| AVED | audio-only | SSF-CNN Cai et al. (2021) | 53.4 ± 1.47 | 58.4 ± 3.79 | 58.9 ± 5.35 | 51.7 ± 0.85 | 52.5 ± 1.55 |
| AVED | audio-only | MBSSFCC Jiang et al. (2022) | 55.9 ± 1.80 | 70.2 ± 4.10 | 74.2 ± 7.24 | 52.2 ± 1.52 | 52.7 ± 1.87 |
| AVED | audio-only | DBPNet Ni et al. (2024) | 53.6 ± 2.65 | 58.9 ± 3.65 | 62.8 ± 5.93 | 52.1 ± 1.19 | 53.3 ± 1.88 |
| AVED | audio-only | DARNet Yan et al. (2024) | 49.7 ± 1.05 | 80.2 ± 14.67 | 83.6 ± 12.10 | 51.3 ± 0.21 | 52.1 ± 1.54 |
| AVED | audio-only | ListenNet (ours) | 57.7 ± 1.71 | 74.6 ± 3.36 | 77.1 ± 5.31 | 52.8 ± 1.30 | 53.8 ± 1.98 |
| AVED | audio-visual | SSF-CNN Cai et al. (2021) | 54.5 ± 1.79 | 59.2 ± 5.44 | 63.1 ± 6.55 | 52.4 ± 2.29 | 53.8 ± 2.27 |
| AVED | audio-visual | MBSSFCC Jiang et al. (2022) | 57.5 ± 2.75 | 69.6 ± 5.57 | 75.5 ± 4.34 | 52.8 ± 1.57 | 54.1 ± 1.86 |
| AVED | audio-visual | DBPNet Ni et al. (2024) | 56.1 ± 2.68 | 61.5 ± 4.33 | 64.1 ± 6.09 | 53.3 ± 2.39 | 54.0 ± 1.61 |
| AVED | audio-visual | DARNet Yan et al. (2024) | 50.3 ± 0.77 | 83.6 ± 12.10 | 88.7 ± 13.15 | 51.4 ± 0.32 | 52.6 ± 0.29 |
| AVED | audio-visual | ListenNet (ours) | 57.9 ± 2.16 | 74.9 ± 4.63 | 76.5 ± 5.07 | 53.7 ± 1.60 | 54.1 ± 1.83 |

In this work, we maintain the same subject-dependent setup as most existing models and evaluate our model in a more challenging subject-independent setup to better align with real-world applications, as detailed in Table 2.

5.1.1 Performance of Subject-Dependent

The comparison of subject-dependent AAD performance between ListenNet and other baselines on the KUL, DTU, and AVED datasets is presented in Table 2. Our method significantly outperforms the current SOTA on both the KUL and DTU datasets. Specifically, on the KUL dataset, ListenNet achieves higher accuracies by 3.3%, 2.1%, and 1.8% for the 0.1-second, 1-second, and 2-second decision windows, respectively. Similarly, on the DTU dataset, it achieves improvements of 4.8%, 6.1%, and 5.4% in the same decision windows. On the AVED dataset, ListenNet performs slightly worse than DARNet in the 1-second and 2-second decision windows, but still achieves the best performance in the very short 0.1-second window. One possible explanation is that DARNet's transformer attention is better able to capture long-range cross-modal dependencies in the AVED dataset.

We observe that ListenNet's decoding accuracy increases as the decision window is enlarged, since longer decision windows provide more information. The proposed ListenNet exhibits satisfactory performance at a temporal resolution of 1 second, which is close to the time lag required for humans to switch attention. Moreover, its advantage is even more pronounced under the highly challenging 0.1-second decision window, contributing to the subsequent realization of real-time auditory attention decoding.

5.1.2 Performance of Subject-Independent

Apart from its excellent results in the subject-dependent setup, the proposed ListenNet also delivers leading classification performance in the more challenging subject-independent setup across all three datasets for the two commonly used decision window sizes. ListenNet achieves better results by more comprehensively and effectively integrating dynamic temporal patterns and spatio-temporal dependencies, enabling the model to flexibly utilize subject-invariant representations. The results further confirm this capability. On the KUL dataset in particular, ListenNet achieves notable gains, with accuracy increases of 8.2% and 7.7% over the current SOTA model for the 1-second and 2-second decision windows, respectively. Furthermore, ListenNet outperforms the baselines on DTU and AVED as well.

Compared to the widely used KUL dataset, the DTU and AVED datasets pose a more challenging AAD task. Specifically, DTU presents speech at a narrower angle, and its recording environment includes reverberation and background noise, whereas AVED introduces complex multi-modal stimulus materials. The results show that ListenNet outperforms the baseline methods across diverse datasets, with lower variability in its results, further highlighting the stability and reliability of our approach across different decision windows. It learns common feature-distribution patterns across subjects, thereby more effectively simulating real-world scenarios. These results highlight the robustness and generalization capabilities of the proposed model, emphasizing its potential in EEG-based applications.

5.2 Ablation Analysis

Table 3: Ablation results (accuracy %, mean ± std) with a 1-second decision window.

| Dataset | Model | Subject-Dependent | Subject-Independent |
|---|---|---|---|
| KUL | w/o STDE-T | 91.1 ± 6.05 | 62.6 ± 11.10 |
| KUL | w/o STDE-S | 94.6 ± 6.18 | 76.0 ± 15.05 |
| KUL | w/o MSTE | 96.7 ± 3.46 | 77.8 ± 13.39 |
| KUL | w/o CNA | 96.3 ± 2.76 | 77.7 ± 14.74 |
| KUL | ListenNet | 96.9 ± 3.01 | 78.1 ± 13.50 |
| DTU | w/o STDE-T | 72.5 ± 5.53 | 52.3 ± 2.01 |
| DTU | w/o STDE-S | 84.3 ± 5.89 | 54.3 ± 8.36 |
| DTU | w/o MSTE | 84.9 ± 6.59 | 56.7 ± 7.91 |
| DTU | w/o CNA | 85.8 ± 5.75 | 56.5 ± 5.83 |
| DTU | ListenNet | 86.2 ± 5.55 | 56.8 ± 7.32 |
| AVED (audio-only) | w/o STDE-T | 64.2 ± 6.62 | 51.1 ± 1.43 |
| AVED (audio-only) | w/o STDE-S | 66.2 ± 4.50 | 52.6 ± 1.71 |
| AVED (audio-only) | w/o MSTE | 71.8 ± 3.00 | 52.5 ± 1.48 |
| AVED (audio-only) | w/o CNA | 74.3 ± 3.36 | 52.5 ± 1.32 |
| AVED (audio-only) | ListenNet | 74.6 ± 3.36 | 52.8 ± 1.30 |
| AVED (audio-visual) | w/o STDE-T | 64.9 ± 5.30 | 53.3 ± 3.03 |
| AVED (audio-visual) | w/o STDE-S | 66.2 ± 5.27 | 53.2 ± 2.14 |
| AVED (audio-visual) | w/o MSTE | 72.8 ± 3.40 | 53.6 ± 1.80 |
| AVED (audio-visual) | w/o CNA | 74.6 ± 3.08 | 53.2 ± 2.53 |
| AVED (audio-visual) | ListenNet | 74.9 ± 4.63 | 53.7 ± 1.60 |

Ablation studies are conducted on the three datasets using a 1-second window setting, which most closely aligns with human attention switching Jiang et al. (2022); Fan et al. (2025). ListenNet constructs robust spatio-temporal representations, enabling the model to capture the full spatio-temporal information in EEG signals and thereby improving the interpretation of brain activity. Table 3 compares the full ListenNet model with four ablated variants across the three datasets.

STDE-T and STDE-S are each removed to disrupt the integrity of STDE, thereby assessing the critical role of these components in the model’s performance. Removing the STDE-T module for spatio-temporal dependency encoding has the most significant impact on the model’s performance. The effectiveness of STDE-T can be attributed to the fact that EEG signals, as high temporal-resolution time series, exhibit strong temporal dependencies. Prioritizing the modeling of temporal continuity allows for the extraction of more effective and accurate spatio-temporal feature embeddings. Removing the STDE-S module results in accuracy decline, as full-channel spatial convolution captures inter-channel dependencies and establishes a robust spatio-temporal feature framework.

The removal of the MSTE module leads to the loss of multi-scale temporal information and the disruption of potential dependencies between temporal segments, increasing the risk of missing critical temporal features. Finally, removing the CNA module eliminates the model’s ability to dynamically assign feature weights and enhance spatio-temporal representations, thereby weakening the extraction and integration of multi-level spatio-temporal features and further reducing accuracy.

5.3 Computational Cost

Table 4: Trainable parameter counts and MACs on the KUL dataset.

| Model | Params (M) | MACs (M) |
|---|---|---|
| MBSSFCC Jiang et al. (2022) | 83.91 | 89.15 |
| DBPNet Ni et al. (2024) | 0.91 | 96.55 |
| DARNet Yan et al. (2024) | 0.08 | 16.36 |
| ListenNet (ours) | 0.01 | 12.16 |

[Figure 3]

Table 4 compares the parameter counts and MACs of ListenNet with those of MBSSFCC, DBPNet, and DARNet on the KUL dataset. With only 0.01 M trainable parameters, ListenNet achieves remarkable parameter efficiency, requiring approximately 8390 times fewer parameters than MBSSFCC, 90 times fewer than DBPNet, and 7 times fewer than DARNet. Additionally, ListenNet's computational demand is also markedly reduced, with its MACs totaling only 12.16 M, approximately 86% lower than MBSSFCC, 87% lower than DBPNet, and 26% lower than DARNet. These substantial reductions in both parameter count and computational complexity highlight ListenNet's efficiency, making it especially suitable for deployment on devices with limited computational resources.
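Trainable-parameter counts such as those in Table 4 can be obtained with a short PyTorch utility; applied to ListenNet it should give roughly the 0.01 M figure reported above (the module used here is only a placeholder).

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    """Number of trainable parameters, as reported in the 'Params' column of Table 4."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"{count_trainable_params(nn.Linear(64, 2)) / 1e6:.4f} M")  # placeholder module
```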

5.4 Visualization Analysis

To assess the effect of extracting subject-invariant features, we randomly select 30 samples from each subject in the KUL dataset and visualize them using t-SNE Vander Maaten and Hinton (2008). The resulting plots are shown in Figure 3. Different colors represent subjects, with circles and squares indicating attention to the left or right speaker, respectively. In Figure 3(a), the raw features are scattered with significant overlap between subjects and labels, lacking clear structure and separability. In Figure 3(b), preprocessing improves feature quality to some extent, but notable overlap and insufficient separability still remain. In Figure 3(c), features extracted using STDE form clearer attention-related subgroups. By capturing spatio-temporal cross dependencies, the STDE module learns dynamic patterns and enhances feature separability, though some class boundaries remain indistinct. In Figure 3(d), features extracted by ListenNet exhibit more distinct clustering for attention labels across subjects, and the distributions become more organized. This demonstrates that ListenNet learns subject-invariant features while maintaining clear boundaries between attention categories.

6 Conclusion

This paper introduces ListenNet, a lightweight, highly accurate, and generalizable network for AAD. By combining spatio-temporal convolution operations across time steps and all channels, it effectively utilizes the spatial information embedded in temporal EEG signals. Additionally, it uses multi-scale dilated convolutions to capture temporal patterns at multiple scales that were previously overlooked, and integrates hierarchical spatio-temporal features through cross-nested attention mechanisms. Subject-dependent and subject-independent experiments are conducted on three AAD datasets. Experimental results show that ListenNet achieves competitive accuracy, especially in the very short 0.1-second decision window and across subjects. Furthermore, the compact size of the model and its reduced computational cost open new possibilities for deployment on low-power devices. In future work, we intend to extend ListenNet to streaming architectures, integrating incremental learning for real-time adaptation to AAD scenarios.

Acknowledgements

This work is supported by the STI 2030—Major Projects (No.2021ZD0201500), the National Natural Science Foundation of China (NSFC) (No.62201002, 6247077204), Excellent Youth Foundation of Anhui Scientific Committee (No. 2408085Y034), Distinguished Youth Foundation of Anhui Scientific Committee (No.2208085J05), Special Fund for Key Program of Science and Technology of Anhui Province (No.202203a07020008), Cloud Ginger XR-1.

References

  • Cai et al. [2021]Siqi Cai, Pengcheng Sun, Tanja Schultz, and Haizhou Li.Low-latency auditory spatial attention detection based on spectro-spatial features from eeg.In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 5812–5815. IEEE, 2021.
  • Cai et al. [2023]Siqi Cai, Tanja Schultz, and Haizhou Li.Brain topology modeling with eeg-graphs for auditory spatial attention detection.IEEE Transactions on Biomedical Engineering, 2023.
  • Cai et al. [2024]Siqi Cai, Ran Zhang, and Haizhou Li.Robust decoding of the auditory attention from eeg recordings through graph convolutional networks.In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2320–2324. IEEE, 2024.
  • Chen et al. [2023]Xiaoyu Chen, Changde Du, Qiongyi Zhou, and Huiguang He.Auditory attention decoding with task-related multi-view contrastive learning.In Proceedings of the 31st ACM International Conference on Multimedia, pages 6025–6033, 2023.
  • Cherry [1953]EColin Cherry.Some experiments on the recognition of speech, with one and with two ears.The Journal of the acoustical society of America, 25(5):975–979, 1953.
  • Choi et al. [2013]Inyong Choi, Siddharth Rajaram, LennyA Varghese, and BarbaraG Shinn-Cunningham.Quantifying attentional modulation of auditory-evoked cortical responses from single-trial electroencephalography.Frontiers in human neuroscience, 7:115, 2013.
  • Dai et al. [2018]Bohan Dai, Chuansheng Chen, Yuhang Long, Lifen Zheng, Hui Zhao, Xialu Bai, Wenda Liu, Yuxuan Zhang, LiLiu, Taomei Guo, etal.Neural mechanisms for selectively tuning in to the target speaker in a naturalistic noisy situation.Nature communications, 9(1):2405, 2018.
  • Das et al. [2016]Neetha Das, Wouter Biesmans, Alexander Bertrand, and Tom Francart.The effect of head-related filtering and ear-specific decoding bias on auditory attention detection.Journal of neural engineering, 13(5):056014, 2016.
  • Das et al. [2019]Neetha Das, Tom Francart, and Alexander Bertrand.Auditory attention detection dataset kuleuven.Zenodo, 2019.
  • DeTaillez et al. [2020]Tobias DeTaillez, Birger Kollmeier, and BerndT Meyer.Machine learning for decoding listeners’ attention from electroencephalography evoked by continuous speech.European Journal of Neuroscience, 51(5):1234–1241, 2020.
  • Dmochowski et al. [2014]JacekP Dmochowski, MatthewA Bezdek, BrianP Abelson, JohnS Johnson, EricH Schumacher, and LucasC Parra.Audience preferences are predicted by temporal reliability of neural processing.Nature communications, 5(1):4567, 2014.
  • EskandariNasab et al. [2024]MohammadReza EskandariNasab, Zahra Raeisi, RezaAhmadi Lashaki, and Hamidreza Najafi.A gru–cnn model for auditory attention detection using microstate and recurrence quantification analysis.Scientific Reports, 14(1):8861, 2024.
  • Fan et al. [2024a]Cunhang Fan, Jinqin Wang, Wei Huang, Xiaoke Yang, Guangxiong Pei, Taihao Li, and Zhao Lv.Light-weight residual convolution-based capsule network for eeg emotion recognition.Advanced Engineering Informatics, 61:102522, 2024.
  • Fan et al. [2024b]Cunhang Fan, Hongyu Zhang, Wei Huang, Jun Xue, Jianhua Tao, Jiangyan Yi, Zhao Lv, and Xiaopei Wu.Dgsd: Dynamical graph self-distillation for eeg-based auditory spatial attention detection.Neural Networks, 179:106580, 2024.
  • Fan et al. [2024c]Cunhang Fan, Jingjing Zhang, Hongyu Zhang, Wang Xiang, Jianhua Tao, Xinhui Li, Jiangyan Yi, Dianbo Sui, and Zhao Lv.Msfnet: Multi-scale fusion network for brain-controlled speaker extraction.In Proceedings of the 32nd ACM International Conference on Multimedia, pages 1652–1661, 2024.
  • Fan et al. [2025]Cunhang Fan, Hongyu Zhang, Qinke Ni, Jingjing Zhang, Jianhua Tao, Jian Zhou, Jiangyan Yi, Zhao Lv, and Xiaopei Wu.Seeing helps hearing: A multi-modal dataset and a mamba-based dual branch parallel network for auditory attention decoding.Information Fusion, page 102946, 2025.
  • Fuglsang et al. [2017]SørenAsp Fuglsang, Torsten Dau, and Jens Hjortkjær.Noise-robust cortical tracking of attended speech in real-world acoustic scenes.Neuroimage, 156:435–444, 2017.
  • Fuglsang et al. [2018]SørenA Fuglsang, DanielDE Wong, and Jens Hjortkjær.Eeg and audio dataset for auditory attention decoding.Zenodo, 2018.
  • Jiang et al. [2022]Yifan Jiang, Ning Chen, and Jing Jin.Detecting the locus of auditory attention based on the spectro-spatial-temporal analysis of eeg.Journal of Neural Engineering, 19(5):056035, 2022.
  • Lawhern et al. [2018]VernonJ Lawhern, AmeliaJ Solon, NicholasR Waytowich, StephenM Gordon, ChouP Hung, and BrentJ Lance.Eegnet: a compact convolutional neural network for eeg-based brain–computer interfaces.Journal of neural engineering, 15(5):056013, 2018.
  • Liu et al. [2024]Chenyu Liu, Xinliang Zhou, Jiaping Xiao, Zhengri Zhu, Liming Zhai, Ziyu Jia, and Yang Liu.Vsgt: variational spatial and gaussian temporal graph models for eeg-based emotion recognition.In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 3078–3086, 2024.
  • Mesgarani and Chang [2012]Nima Mesgarani and EdwardF Chang.Selective cortical representation of attended speaker in multi-talker speech perception.Nature, 485(7397):233–236, 2012.
  • Miao et al. [2022]Zhengqing Miao, Xin Zhang, Carlo Menon, Yelong Zheng, Meirong Zhao, and Dong Ming.Priming cross-session motor imagery classification with a universal deep domain adaptation framework.arXiv preprint arXiv:2202.09559, 2022.
  • Miao et al. [2023]Zhengqing Miao, Meirong Zhao, Xin Zhang, and Dong Ming.Lmda-net: A lightweight multi-dimensional attention network for general eeg-based brain-computer interfaces and interpretability.NeuroImage, 276:120209, 2023.
  • Monesi et al. [2020]MohammadJalilpour Monesi, Bernd Accou, Jair Montoya-Martinez, Tom Francart, and Hugo VanHamme.An lstm based architecture to relate speech stimulus to eeg.In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 941–945. IEEE, 2020.
  • Ni et al. [2024]Qinke Ni, Hongyu Zhang, Cunhang Fan, Shengbing Pei, Chang Zhou, and Zhao Lv.Dbpnet: Dual-branch parallel network with temporal-frequency fusion for auditory attention detection.In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 3115–3123, 2024.
  • Niu et al. [2024]Yixiang Niu, Ning Chen, Hongqing Zhu, Zhiying Zhu, Guangqiang Li, and Yibo Chen.Auditory spatial attention detection based on feature disentanglement and brain connectivity-informed graph neural networks.In Proc. Interspeech 2024, pages 887–891, 2024.
  • Ouyang et al. [2023]Daliang Ouyang, SuHe, Guozhong Zhang, Mingzhu Luo, Huaiyong Guo, Jian Zhan, and Zhijie Huang.Efficient multi-scale attention module with cross-spatial learning.In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
  • Puffay et al. [2022]Corentin Puffay, Jana VanCanneyt, Jonas Vanthornhout, Hugo VanHamme, and Tom Francart.Relating the fundamental frequency of speech with eeg using a dilated convolutional network.arXiv preprint arXiv:2207.01963, 2022.
  • Shen et al. [2022]Xinke Shen, Xianggen Liu, Xin Hu, Dan Zhang, and Sen Song.Contrastive learning of subject-invariant eeg representations for cross-subject emotion recognition.IEEE Transactions on Affective Computing, 14(3):2496–2511, 2022.
  • Su et al. [2021]Enze Su, Siqi Cai, Peiwen Li, Longhan Xie, and Haizhou Li.Auditory attention detection with eeg channel attention.In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 5804–5807. IEEE, 2021.
  • Su et al. [2022]Enze Su, Siqi Cai, Longhan Xie, Haizhou Li, and Tanja Schultz.Stanet: A spatiotemporal attention network for decoding auditory spatial attention from eeg.IEEE Transactions on Biomedical Engineering, 69(7):2233–2242, 2022.
  • Tune et al. [2021]Sarah Tune, Mohsen Alavash, Lorenz Fiedler, and Jonas Obleser.Neural attentional-filter mechanisms of listening success in middle-aged and older individuals.Nature Communications, 12(1):4533, 2021.
  • Vander Maaten and Hinton [2008]Laurens Vander Maaten and Geoffrey Hinton.Visualizing data using t-sne.Journal of machine learning research, 9(11), 2008.
  • Vandecappelle et al. [2021]Servaas Vandecappelle, Lucas Deckers, Neetha Das, AmirHossein Ansari, Alexander Bertrand, and Tom Francart.Eeg-based detection of the locus of auditory attention with convolutional neural networks.Elife, 10:e56481, 2021.
  • Wang et al. [2020]Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu.Eca-net: Efficient channel attention for deep convolutional neural networks.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11534–11542, 2020.
  • Wang et al. [2023]Ruicong Wang, Siqi Cai, and Haizhou Li.Eeg-based auditory attention detection with spatiotemporal graph and graph convolutional network.In Proceedings of INTERSPEECH, pages 1144–1148, 2023.
  • Wu et al. [2020]Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, Xiaojun Chang, and Chengqi Zhang.Connecting the dots: Multivariate time series forecasting with graph neural networks.In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 753–763, 2020.
  • Yan et al. [2024]Sheng Yan, Cunhang Fan, Hongyu Zhang, Xiaoke Yang, Jianhua Tao, and Zhao Lv.Darnet: Dual attention refinement network with spatiotemporal construction for auditory attention detection.In The Thirty-eighth Annual Conference on Neural Information Processing Systems, pages 31688–31707, 2024.
  • Zhang et al. [2023]Yuanming Zhang, Haoxin Ruan, Ziyan Yuan, Haoliang Du, Xia Gao, and Jing Lu.A learnable spatial mapping for decoding the directional focus of auditory attention using eeg.In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
  • ZHANG et al. [2024]Hongyu ZHANG, Jingjing ZHANG, DONG Xingguang, LÜ Zhao, TAO Jianhua, ZHOU Jian, WU Xiaopei, and FAN Cunhang.Based on audio-video evoked auditory attention detection electroencephalogram dataset.Journal of Tsinghua University (Science and Technology), 64(11):1919–1926, 2024.