New articles on Electrical Engineering and Systems Science


[1] 2601.16225

ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation

Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information such as prosody, tone, and emotional intensity for affective understandings. Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations, often weakening affective information and contextual coherence in multi-turn dialogues. To address this, we propose \textbf{ES4R}, a framework for speech-based empathetic response generation. Our core innovation lies in explicitly modeling structured affective context before speech encoding, rather than relying on implicit learning by the encoder or explicit emotion supervision. Specifically, we introduce a dual-level attention mechanism to capture turn-level affective states and dialogue-level affective dynamics. The resulting affective representations are then integrated with textual semantics through speech-guided cross-modal attention to generate empathetic responses. For speech output, we employ energy-based strategy selection and style fusion to achieve empathetic speech synthesis. ES4R consistently outperforms strong baselines in both automatic and human evaluations and remains robust across different LLM backbones.


[2] 2601.16230

Zero-Shot Speech LLMs for Multi-Aspect Evaluation of L2 Speech: Challenges and Opportunities

An accurate assessment of L2 English pronunciation is crucial for language learning, as it provides personalized feedback and ensures a fair evaluation of individual progress. However, automated scoring remains challenging due to the complexity of sentence-level fluency, prosody, and completeness. This paper evaluates the zero-shot performance of Qwen2-Audio-7B-Instruct, an instruction-tuned speech-LLM, on 5,000 Speechocean762 utterances. The model generates rubric-aligned scores for accuracy, fluency, prosody, and completeness, showing strong agreement with human ratings within +-2 tolerance, especially for high-quality speech. However, it tends to overpredict low-quality speech scores and lacks precision in error detection. These findings demonstrate the strong potential of speech LLMs in scalable pronunciation assessment and suggest future improvements through enhanced prompting, calibration, and phonetic integration to advance Computer-Assisted Pronunciation Training.


[3] 2601.16240

Test-Time Adaptation for Speech Emotion Recognition

The practical utility of Speech Emotion Recognition (SER) systems is undermined by their fragility to domain shifts, such as speaker variability, the distinction between acted and naturalistic emotions, and cross-corpus variations. While domain adaptation and fine-tuning are widely studied, they require either source data or labelled target data, which are often unavailable or raise privacy concerns in SER. Test-time adaptation (TTA) bridges this gap by adapting models at inference using only unlabeled target data. Yet, having been predominantly designed for image classification and speech recognition, the efficacy of TTA for mitigating the unique domain shifts in SER has not been investigated. In this paper, we present the first systematic evaluation and comparison covering 11 TTA methods across three representative SER tasks. The results indicate that backpropagation-free TTA methods are the most promising. Conversely, entropy minimization and pseudo-labeling generally fail, as their core assumption of a single, confident ground-truth label is incompatible with the inherent ambiguity of emotional expression. Further, no single method universally excels, and its effectiveness is highly dependent on the distributional shifts and tasks.


[4] 2601.16283

BESTOpt: A Modular, Physics-Informed Machine Learning based Building Modeling, Control and Optimization Framework

Modern buildings are increasingly interconnected with occupancy, heating, ventilation, and air-conditioning (HVAC) systems, distributed energy resources (DERs), and power grids. Modeling, control, and optimization of such multi-domain systems play a critical role in achieving building-sector decarbonization. However, most existing tools lack scalability and physical consistency for addressing these complex, multi-scale ecosystem problems. To bridge this gap, this study presents BESTOpt, a modular, physics-informed machine learning (PIML) framework that unifies building applications, including benchmarking, evaluation, diagnostics, control, optimization, and performance simulation. The framework adopts a cluster-domain-system/building-component hierarchy and a standardized state-action-disturbance-observation data typology. By embedding physics priors into data-driven modules, BESTOpt improves model accuracy and physical consistency under unseen conditions. Case studies on single-building and cluster scenarios demonstrate its capability for multi-level centralized and decentralized control. Looking ahead, BESTOpt lays the foundation for an open, extensible platform that accelerates interdisciplinary research toward smart, resilient, and decarbonized building ecosystems.


[5] 2601.16301

Gesture Recognition from body-Worn RFID under Missing Data

We explore hand-gesture recognition through the use of passive body-worn reflective tags. A data processing pipeline is proposed to address the issue of missing data. Specifically, missing information is recovered through linear and exponential interpolation and extrapolation. Furthermore, imputation and proximity-based inference are employed. We represent tags as nodes in a temporal graph, with edges formed based on correlations between received signal strength (RSS) and phase values across successive timestamps, and we train a graph-based convolutional neural network that exploits graph-based self-attention. The system outperforms state-of-the-art methods with an accuracy of 98.13% for the recognition of 21 gestures. We achieve 89.28% accuracy under leave-one-person-out cross-validation. We further investigate the contribution of various body locations on the recognition accuracy. Removing tags from the arms reduces accuracy by more than 10%, while removing the wrist tag only reduces accuracy by around 2%. Therefore, tag placements on the arms are more expressive for gesture recognition than on the wrist.


[6] 2601.16303

Angle of Arrival Estimation for Gesture Recognition from reflective body-worn tags

We investigate hand gesture recognition by leveraging passive reflective tags worn on the body. Considering a large set of gestures, distinct patterns are difficult to be captured by learning algorithms using backscattered received signal strength (RSS) and phase signals. This is because these features often exhibit similarities across signals from different gestures. To address this limitation, we explore the estimation of Angle of Arrival (AoA) as a distinguishing feature, since AoA characteristically varies during body motion. To ensure reliable estimation in our system, which employs Smart Antenna Switching (SAS), we first validate AoA estimation using the Multiple SIgnal Classification (MUSIC) algorithm while the tags are fixed at specific angles. Building on this, we propose an AoA tracking method based on Kalman smoothing. Our analysis demonstrates that, while RSS and phase alone are insufficient for distinguishing certain gesture data, AoA tracking can effectively differentiate them. To evaluate the effectiveness of AoA tracking, we implement gesture recognition system benchmarks and show that incorporating AoA features significantly boosts their performance. Improvements of up to 15% confirm the value of AoA-based enhancement.


[7] 2601.16315

Optimal County-Level Siting of Data Centers in the United States

Data centers are growing rapidly, creating the pressing need for the development of critical infrastructure build out to support these resource-intensive large loads. Their immense consumption of electricity and, often, freshwater, continues to stress an already constrained and aging power grid and water resources. This paper presents a comprehensive modeling approach to determine the optimal locations to construct such facilities by quantifying their resource use and minimizing associated costs. The interdisciplinary modeling approach incorporates a number of factors including the power grid, telecommunications, climate, water use, and collocated generation potential. This work establishes the base model whose functionality is shown through several test cases focusing on carbon-free generation collocation on a county-level in the United States. The results suggest that while capital costs are the biggest driver, having a longer future outlook and allowing more variable generation collocation influences the model to choose sites with higher renewable potential.


[8] 2601.16316

EdgeSpot: Efficient and High-Performance Few-Shot Model for Keyword Spotting

We introduce an efficient few-shot keyword spotting model for edge devices, EdgeSpot, that pairs an optimized version of a BC-ResNet-based acoustic backbone with a trainable Per-Channel Energy Normalization frontend and lightweight temporal self-attention. Knowledge distillation is utilized during training by employing a self-supervised teacher model, optimized with Sub-center ArcFace loss. This study demonstrates that the EdgeSpot model consistently provides better accuracy at a fixed false-alarm rate (FAR) than strong BC-ResNet baselines. The largest variant, EdgeSpot-4, improves the 10-shot accuracy at 1% FAR from 73.7% to 82.0%, which requires only 29.4M MACs with 128k parameters.


[9] 2601.16358

TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice

The development of robust, multilingual speaker recognition systems is hindered by a lack of large-scale, publicly available and multilingual datasets, particularly for the read-speech style crucial for applications like anti-spoofing. To address this gap, we introduce the TidyVoice dataset derived from the Mozilla Common Voice corpus after mitigating its inherent speaker heterogeneity within the provided client IDs. TidyVoice currently contains training and test data from over 212,000 monolingual speakers (Tidy-M) and around 4,500 multilingual speakers (Tidy-X) from which we derive two distinct conditions. The Tidy-M condition contains target and non-target trials from monolingual speakers across 81 languages. The Tidy-X condition contains target and non-target trials from multilingual speakers in both same- and cross-language trials. We employ two architectures of ResNet models, achieving a 0.35% EER by fine-tuning on our comprehensive Tidy-M partition. Moreover, we show that this fine-tuning enhances the model's generalization, improving performance on unseen conversational interview data from the CANDOR corpus. The complete dataset, evaluation trials, and our models are publicly released to provide a new resource for the community.


[10] 2601.16359

Experience with Single Domain Generalization in Real World Medical Imaging Deployments

A desirable property of any deployed artificial intelligence is generalization across domains, i.e. data generation distribution under a specific acquisition condition. In medical imagining applications the most coveted property for effective deployment is Single Domain Generalization (SDG), which addresses the challenge of training a model on a single domain to ensure it generalizes well to unseen target domains. In multi-center studies, differences in scanners and imaging protocols introduce domain shifts that exacerbate variability in rare class characteristics. This paper presents our experience on SDG in real life deployment for two exemplary medical imaging case studies on seizure onset zone detection using fMRI data, and stress electrocardiogram based coronary artery detection. Utilizing the commonly used application of diabetic retinopathy, we first demonstrate that state-of-the-art SDG techniques fail to achieve generalized performance across data domains. We then develop a generic expert knowledge integrated deep learning technique DL+EKE and instantiate it for the DR application and show that DL+EKE outperforms SOTA SDG methods on DR. We then deploy instances of DL+EKE technique on the two real world examples of stress ECG and resting state (rs)-fMRI and discuss issues faced with SDG techniques.


[11] 2601.16383

On The Robustness of Foundational 3D Medical Image Segmentation Models Against Imprecise Visual Prompts

While 3D foundational models have shown promise for promptable segmentation of medical volumes, their robustness to imprecise prompts remains under-explored. In this work, we aim to address this gap by systematically studying the effect of various controlled perturbations of dense visual prompts, that closely mimic real-world imprecision. By conducting experiments with two recent foundational models on a multi-organ abdominal segmentation task, we reveal several facets of promptable medical segmentation, especially pertaining to reliance on visual shape and spatial cues, and the extent of resilience of models towards certain perturbations. Codes are available at: this https URL


[12] 2601.16418

Robust Grid-Forming Control Based on Virtual Flux Observer

This paper investigates the design and analysis of a novel grid-forming (GFM) control method for grid-connected converters (GCCs). The core novelty lies in a virtual flux observer-based synchronization and load angle control method. The terminal voltage of the converter is directly regulated to provide voltage-source behavior. The control parameters are designed for decoupling and pole placement. The proposed method exhibits strong robustness in stability and dynamical performance across varying and uncertain grid strengths. The robust control performance of the proposed method is first demonstrated by small-signal analysis, then validated by experiments on a 20 kVA power conversion system.


[13] 2601.16421

TransfoREM: Transformer aided 3D Radio Environment Mapping

Providing reliable cellular connectivity to Unmanned Aerial Vehicles (UAV) is a key challenge, as existing terrestrial networks are deployed mainly for ground-level coverage. The cellular network coverage may be available for a limited range from the antenna side lobes, with poor connectivity further exacerbated by UAV flight dynamics. In this work, we propose TransfoREM, a 3D Radio Environment Map (REM) generation method that combines deterministic channel models and real-world data to map terrestrial network coverage at higher altitudes. At the core of our solution is a transformer model that translates radio propagation mapping into a sequence prediction task to construct REMs. Our results demonstrate that TransfoREM offers improved interpolation capability on real-world data compared against conventional Kriging and other machine learning (ML) techniques. Furthermore, TransfoREM is designed for holistic integration into cellular networks at the base station (BS) level, where it can build REMs, which can then be leveraged for enhanced resource allocation, interference management, and spatial spectrum utilization.


[14] 2601.16442

Auditory Attention Decoding without Spatial Information: A Diotic EEG Study

Auditory attention decoding (AAD) identifies the attended speech stream in multi-speaker environments by decoding brain signals such as electroencephalography (EEG). This technology is essential for realizing smart hearing aids that address the cocktail party problem and for facilitating objective audiometry systems. Existing AAD research mainly utilizes dichotic environments where different speech signals are presented to the left and right ears, enabling models to classify directional attention rather than speech content. However, this spatial reliance limits applicability to real-world scenarios, such as the "cocktail party" situation, where speakers overlap or move dynamically. To address this challenge, we propose an AAD framework for diotic environments where identical speech mixtures are presented to both ears, eliminating spatial cues. Our approach maps EEG and speech signals into a shared latent space using independent encoders. We extract speech features using wav2vec 2.0 and encode them with a 2-layer 1D convolutional neural network (CNN), while employing the BrainNetwork architecture for EEG encoding. The model identifies the attended speech by calculating the cosine similarity between EEG and speech representations. We evaluate our method on a diotic EEG dataset and achieve 72.70% accuracy, which is 22.58% higher than the state-of-the-art direction-based AAD method.


[15] 2601.16483

FlowSE-GRPO: Training Flow Matching Speech Enhancement via Online Reinforcement Learning

Generative speech enhancement offers a promising alternative to traditional discriminative methods by modeling the distribution of clean speech conditioned on noisy inputs. Post-training alignment via reinforcement learning (RL) effectively aligns generative models with human preferences and downstream metrics in domains such as natural language processing, but its use in speech enhancement remains limited, especially for online RL. Prior work explores offline methods like Direct Preference Optimization (DPO); online methods such as Group Relative Policy Optimization (GRPO) remain largely uninvestigated. In this paper, we present the first successful integration of online GRPO into a flow-matching speech enhancement framework, enabling efficient post-training alignment to perceptual and task-oriented metrics with few update steps. Unlike prior GRPO work on Large Language Models, we adapt the algorithm to the continuous, time-series nature of speech and to the dynamics of flow-matching generative models. We show that optimizing a single reward yields rapid metric gains but often induces reward hacking that degrades audio fidelity despite higher scores. To mitigate this, we propose a multi-metric reward optimization strategy that balances competing objectives, substantially reducing overfitting and improving overall performance. Our experiments validate online GRPO for speech enhancement and provide practical guidance for RL-based post-training of generative audio models.


[16] 2601.16502

Sequential Operating Simulation of Solid State Transformer-Driven Next-Generation 800 VDC Data Center

Artificial-intelligence (AI) workloads are driving rapid growth in data-center electricity use and rack power density, increasing demand for power-delivery systems that are efficient and robust to fast load transients. Conventional uninterruptible power supply (UPS) based AC distribution chains involve multiple conversion stages and line-frequency transformers, which compound losses and are less compatible with dynamic AI power profiles. Although solid-state transformers (SSTs) and 800 VDC distribution architecture are widely discussed, implementable topology/control details, and long-horizon validation with realistic operating profiles remain limited. This paper develops an SST-driven 800 VDC architecture that converts 10 kV MVAC to an 800V LVDC bus using a three-phase H-bridge AC/DC stage cascaded with a dual-active-bridge (DAB) DC/DC stage. A coordinated closed-loop control scheme, combining rectifier voltage/current regulation and DAB phase-shift control, is designed to maintain DC-bus voltage stability. The proposed system is implemented on the real-time digital simulation (RTDS) platform and evaluated via sequential simulations using real-world day- and month-scale operating profiles of data centers, benchmarked against a UPS supply chain. Numerical studies demonstrate tight 800 VDC regulation, reduced input-side energy consumption compared with the UPS baseline, and satisfactory power-quality performance. A capacitance sensitivity test quantifies tradeoffs between DC-bus ripple and low-frequency input-power oscillations, yielding a practical capacitance range for design. Overall, the work provides a reproducible evaluation workflow and actionable guidance for next-generation AI data centers.


[17] 2601.16543

Cell-Free MIMO with Rotatable Antennas: When Macro-Diversity Meets Antenna Directivity

Cell-free networks leverage distributed access points (APs) to achieve macro-diversity, yet their performance is often constrained by large disparities in channel quality arising from user geometry and blockages. To address this, rotatable antennas (RAs) add a lightweight hardware degree of freedom by steering the antenna boresight toward dominant propagation directions to strengthen unfavorable links, thereby enabling the network to better exploit macro-diversity for higher and more uniform performance. This paper investigates an RA-enabled cell-free downlink network and formulates a max-min rate problem that jointly optimizes transmit beamforming and antenna orientations. To tackle this challenging problem, we develop an alternating-optimization-based algorithm that iteratively updates the beamformers via a second-order cone program (SOCP) and optimizes the antenna orientations using successive convex approximation. To reduce complexity, we further propose an efficient two-stage scheme that first designs orientations by maximizing a proportional-fair log-utility using manifold-aware Frank-Wolfe updates, and then computes the beamformers using an SOCP-based design. Simulation results demonstrate that the proposed orientation-aware designs achieve a substantially higher worst-user rate than conventional beamforming-only benchmarks. Furthermore, larger antenna directivity enhances fairness with proper orientation but can degrade the worst-user performance otherwise.


[18] 2601.16550

Spiking Neural Networks for Communication Systems: Encoding Schemes, Learning Algorithms, and Equalization~Techniques

Machine learning with artificial neural networks (ANNs), provides solutions for the growing complexity of modern communication systems. This complexity, however, increases power consumption, making the systems energy-intensive. Spiking neural networks (SNNs) represent a novel generation of neural networks inspired by the highly efficient human brain. By emulating its event-driven and energy-efficient mechanisms, SNNs enable low-power, real-time signal processing. They differ from ANNs in two key ways: they exhibit inherent temporal dynamics and process and transmit information as short binary signals called spikes. Despite their promise, major challenges remain, e.g., identifying optimal learning rules and effective neural encoding. This thesis investigates the design of SNN-based receivers for nonlinear time-invariant frequency-selective channels. Backpropagation through time with surrogate gradients is identified as a promising update rule and the novel quantization encoding (QE) as promising neural encoding. Given the model of the intensity modulation with direct detection link, we compare two different receiver architectures based on equalization performance and spike count. Using decision feedback and QE achieves both strong performance and low spike counts. Notably, SNN-based receivers significantly outperform ANN-based counterparts. We furthermore introduce policy gradient-based update (PGU), an reinforcement learning-based update algorithm that requires no backpropagation. Using PGU, encoding parameters are optimized, drastically reducing runtime, complexity, and spikes per inference while maintaining performance. This thesis contributes a successful design and optimization framework for SNN-based receivers. By addressing key challenges in SNN optimization, it facilitates future advances in the design and deployment of energy-efficient SNN receivers.


[19] 2601.16565

Agentic AI-RAN Empowering Synergetic Sensing, Communication, Computing, and Control

Future sixth-generation (6G) networks are expected to support low-altitude wireless networks (LAWNs), where unmanned aerial vehicles (UAVs) and aerial robots operate in highly dynamic three-dimensional environments under stringent latency, reliability, and autonomy requirements. In such scenarios, autonomous task execution at the network edge demands holistic coordination among sensing, communication, computing, and control (SC3) processes. Agentic Artificially Intelligent Radio Access Networks (Agentic AI-RAN) offer a promising paradigm by enabling the edge network to function as an autonomous decision-making entity for low-altitude agents with limited onboard resources. In this article, we propose and design a task-oriented Agentic AI-RAN architecture that enables SC3 task execution within a single edge node. This integrated design tackles the fundamental problem of coordinating heterogeneous workloads in resource-constrained edge environments. Furthermore, a representative low-altitude embodied intelligence system is prototyped based on a general-purpose Graphics Processing Unit (GPU) platform to demonstrate autonomous drone navigation in realistic settings. By leveraging the Multi-Instance GPU (MIG) partitioning technique and the containerized deployment, the demonstration system achieves physical resource isolation while supporting tightly coupled coordination between real-time communication and multimodal inference under a unified task framework. Experimental results demonstrate low closed-loop latency, robust bidirectional communication, and stable performance under dynamic runtime conditions, highlighting its viability for mission-critical low-altitude wireless networks in 6G.


[20] 2601.16577

Real-Time Evaluation of an Ultra-Tight GNSS/INS Integration Based on Adaptive PLL Bandwidth

In this contribution, we propose a GNSS/INS ultra-tight coupling in which the GNSS receiver architecture is based on a vector tracking loop type architecture. In the proposed approach, the phase lock loop bandwidth is adapted according to the inertial navigation system information. The latter has the advantage to be easily implementable on a System-on-Chip component such as an FPGA (Field-Programmable Gate Arrays), and can be implemented with minor modifications on an existing GNSS receiver platform. Moreover, compared to classical vector-based solutions, the proposed architecture decodes the navigation message in the loop, without the need to run scalar loops in parallel or having to store pre-downloaded ephemeris data. This architecture therefore does not increase the area occupied on the FPGA and does not use additional resources for storage. The proposed GNSS receiver architecture uses GPS L1/C and Galileo E1 signals and is composed of one acquisition module and 16 tracking channels (8 GPS and 8 Galileo) which are implemented within a FPGA (Zynq-Ultrascale).


[21] 2601.16586

Learning Successive Interference Cancellation for Low-Complexity Soft-Output MIMO Detection

Low-complexity multiple-input multiple-output (MIMO) detection remains a key challenge in modern wireless systems, particularly for 5G reduced capability (RedCap) and internet-of-things (IoT) devices. In this context, the growing interest in deploying machine learning on edge devices must be balanced against stringent constraints on computational complexity and memory while supporting high-order modulation. Beyond accurate hard detection, reliable soft information is equally critical, as modern receivers rely on soft-input channel decoding, imposing additional requirements on the detector design. In this work, we propose recurSIC, a lightweight learning-based MIMO detection framework that is structurally inspired by successive interference cancellation (SIC) and incorporates learned processing stages. It generates reliable soft information via multi-path hypothesis tracking with a tunable complexity parameter while requiring only a single forward pass and a minimal parameter count. Numerical results in realistic wireless scenarios show that recurSIC achieves strong hard- and soft-detection performance at very low complexity, making it well suited for edge-constrained MIMO receivers.


[22] 2601.16602

Unsupervised Super-Resolution of Hyperspectral Remote Sensing Images Using Fully Synthetic Training

Considerable work has been dedicated to hyperspectral single image super-resolution to improve the spatial resolution of hyperspectral images and fully exploit their potential. However, most of these methods are supervised and require some data with ground truth for training, which is often non-available. To overcome this problem, we propose a new unsupervised training strategy for the super-resolution of hyperspectral remote sensing images, based on the use of synthetic abundance data. Its first step decomposes the hyperspectral image into abundances and endmembers by unmixing. Then, an abundance super-resolution neural network is trained using synthetic abundances, which are generated using the dead leaves model in such a way as to faithfully mimic real abundance statistics. Next, the spatial resolution of the considered hyperspectral image abundances is increased using this trained network, and the high resolution hyperspectral image is finally obtained by recombination with the endmembers. Experimental results show the training potential of the synthetic images, and demonstrate the method effectiveness.


[23] 2601.16606

Assessment of Errors of Fundamental Frequency Estimation Methods in the Presence of Voltage Fluctuations and Distortions

The fundamental frequency is one of the parameters that define power quality. Correctly determining this parameter under the conditions that prevail in modern power grids is crucial. Diagnostic purposes often require an efficient estimation of this parameter within short time windows. Therefore, this article presents the results of numerical simulation studies that allow the assessment of errors in various fundamental frequency estimation methods, including the standard IEC 61000-4-30 method, when the analyzed signal has a form similar to that found in modern power grids. For the purposes of this study, a test signal was adopted recreating the states of the power grid, including the simultaneous occurrence of voltage fluctuations and distortions. Conclusions are presented based on conducted research.


[24] 2601.16612

Challenges in the Proper Metrological Verification of Smart Energy Meters

The most common instruments currently measuring active/reactive energy and power quality indicators are smart energy meters. Unfortunately, the verification of such meters is currently performed under ideal conditions or with simple signal models, which do not recreate actual states occurring in the power grid and do not ensure the verification of the properties of their signal chains. This paper presents challenges in the proper metrological verification of smart energy meters. It presents existing legal and normative requirements and scientific research directions regarding these meters. Selected test results are presented, which show that although the tested meters meet the normative and legal requirements because they have been approved for sale, numerous imperfections in the signal and measurement chains of the analyzed instruments are revealed for the selected test signal. On the basis of the presented research results, further directions of research in the field of smart energy meters have been determined.


[25] 2601.16631

PanopMamba: Vision State Space Modeling for Nuclei Panoptic Segmentation

Nuclei panoptic segmentation supports cancer diagnostics by integrating both semantic and instance segmentation of different cell types to analyze overall tissue structure and individual nuclei in histopathology images. Major challenges include detecting small objects, handling ambiguous boundaries, and addressing class imbalance. To address these issues, we propose PanopMamba, a novel hybrid encoder-decoder architecture that integrates Mamba and Transformer with additional feature-enhanced fusion via state space modeling. We design a multiscale Mamba backbone and a State Space Model (SSM)-based fusion network to enable efficient long-range perception in pyramid features, thereby extending the pure encoder-decoder framework while facilitating information sharing across multiscale features of nuclei. The proposed SSM-based feature-enhanced fusion integrates pyramid feature networks and dynamic feature enhancement across different spatial scales, enhancing the feature representation of densely overlapping nuclei in both semantic and spatial dimensions. To the best of our knowledge, this is the first Mamba-based approach for panoptic segmentation. Additionally, we introduce alternative evaluation metrics, including image-level Panoptic Quality ($i$PQ), boundary-weighted PQ ($w$PQ), and frequency-weighted PQ ($fw$PQ), which are specifically designed to address the unique challenges of nuclei segmentation and thereby mitigate the potential bias inherent in vanilla PQ. Experimental evaluations on two multiclass nuclei segmentation benchmark datasets, MoNuSAC2020 and NuInsSeg, demonstrate the superiority of PanopMamba for nuclei panoptic segmentation over state-of-the-art methods. Consequently, the robustness of PanopMamba is validated across various metrics, while the distinctiveness of PQ variants is also demonstrated. Code is available at this https URL.


[26] 2601.16648

A Cognitive Framework for Autonomous Agents: Toward Human-Inspired Design

This work introduces a human-inspired reinforcement learning (RL) architecture that integrates Pavlovian and instrumental processes to enhance decision-making in autonomous systems. While existing engineering solutions rely almost exclusively on instrumental learning, neuroscience shows that humans use Pavlovian associations to leverage predictive cues to bias behavior before outcomes occur. We translate this dual-system mechanism into a cue-guided RL framework in which radio-frequency (RF) stimuli act as conditioned (Pavlovian) cues that modulate action selection. The proposed architecture combines Pavlovian values with instrumental policy optimization, improving navigation efficiency and cooperative behavior in unknown, partially observable environments. Simulation results demonstrate that cue-driven agents adapt faster, achieving superior performance compared to traditional instrumental-solo agents. This work highlights the potential of human learning principles to reshape digital agents intelligence.


[27] 2601.16660

Fast, faithful and photorealistic diffusion-based image super-resolution with enhanced Flow Map models

Diffusion-based image super-resolution (SR) has recently attracted significant attention by leveraging the expressive power of large pre-trained text-to-image diffusion models (DMs). A central practical challenge is resolving the trade-off between reconstruction faithfulness and photorealism. To address inference efficiency, many recent works have explored knowledge distillation strategies specifically tailored to SR, enabling one-step diffusion-based approaches. However, these teacher-student formulations are inherently constrained by information compression, which can degrade perceptual cues such as lifelike textures and depth of field, even with high overall perceptual quality. In parallel, self-distillation DMs, known as Flow Map models, have emerged as a promising alternative for image generation tasks, enabling fast inference while preserving the expressivity and training stability of standard DMs. Building on these developments, we propose FlowMapSR, a novel diffusion-based framework for image super-resolution explicitly designed for efficient inference. Beyond adapting Flow Map models to SR, we introduce two complementary enhancements: (i) positive-negative prompting guidance, based on a generalization of classifier free-guidance paradigm to Flow Map models, and (ii) adversarial fine-tuning using Low-Rank Adaptation (LoRA). Among the considered Flow Map formulations (Eulerian, Lagrangian, and Shortcut), we find that the Shortcut variant consistently achieves the best performance when combined with these enhancements. Extensive experiments show that FlowMapSR achieves a better balance between reconstruction faithfulness and photorealism than recent state-of-the-art methods for both x4 and x8 upscaling, while maintaining competitive inference time. Notably, a single model is used for both upscaling factors, without any scale-specific conditioning or degradation-guided mechanisms.


[28] 2601.16662

Low-Power On-Device Gesture Recognition with Einsum Networks

We design a gesture-recognition pipeline for networks of distributed, resource constrained devices utilising Einsum Networks. Einsum Networks are probabilistic circuits that feature a tractable inference, explainability, and energy efficiency. The system is validated in a scenario of low-power, body-worn, passive Radio Frequency Identification-based gesture recognition. Each constrained device includes task-specific processing units responsible for Received Signal Strength (RSS) and phase processing or Angle of Arrival (AoA) estimation, along with feature extraction, as well as dedicated Einsum hardware that processes the extracted features. The output of all constrained devices is then fused in a decision aggregation module to predict gestures. Experimental results demonstrate that the method outperforms the benchmark models.


[29] 2601.16664

OFDM-Based ISAC Imaging of Extended Targets via Inverse Virtual Aperture Processing

This work investigates the performance of an integrated sensing and communication (ISAC) system exploiting inverse virtual aperture (IVA) for imaging moving extended targets in vehicular scenarios. A base station (BS) operates as a monostatic sensor using MIMO-OFDM waveforms. Echoes reflected by the target are processed through motion-compensation techniques to form an IVA range-Doppler (cross-range) image. A case study considers a 5G NR waveform in the upper mid-band, with the target model defined in 3GPP Release 19, representing a vehicle as a set of spatially distributed scatterers. Performance is evaluated in terms of image contrast (IC) and the root mean squared error (RMSE) of the estimated target-centroid range. Finally, the trade-off between sensing accuracy and communication efficiency is examined by varying the subcarrier allocation for IVA imaging. The results provide insights for designing effective sensing strategies in next-generation radio networks.


[30] 2601.16682

Computation-Accuracy Trade-Off in Service-Oriented Model-Based Control

Representing a control system as a Service-Oriented Architecture (SOA)-referred to as Service-Oriented Model-Based Control (SOMC)-enables runtime-flexible composition of control loop elements. This paper presents a framework that optimizes the computation-accuracy trade-off by formulating service orchestration as an A$^\star$search problem, complemented by Contextual Bayesian Optimization (BO) to tune the multi-objective cost weights. A vehicle longitudinal-velocity control case study demonstrates online, performancedriven reconfiguration of the control architecture. We show that our framework not only combines control and software structure but also considers the real-time requirements of the control system during performance optimization.


[31] 2601.16719

On a Coupled Adoption-Opinion Framework for Competing Innovations

In this paper, we propose a two-layer adoption-opinion model to study the diffusion of two competing technologies within a population whose opinions evolve under social influence and adoption-driven feedback. After adopting one technology, individuals may become dissatisfied and switch to the alternative. We prove the existence and uniqueness of the adoption-diffused equilibrium, showing that both technologies coexist and that neither partial-adoption nor monopoly can arise. Numerical simulations show that while opinions shape the equilibrium adoption levels, the relative market share between the two technologies depends solely on their user-experience. As a consequence, interventions that symmetrically boost opinions or adoption can disproportionately favor the higher-quality technology, illustrating how symmetric control actions may generate asymmetric outcomes.


[32] 2601.16722

Model Predictive Control for Coupled Adoption-Opinion Dynamics

This paper investigates an optimal control problem for an adoption-opinion model that couples opinion dynamics with a compartmental adoption framework on a multilayer network to study the diffusion of sustainable behaviors. Adoption evolves through social contagion and perceived benefits, while opinions are shaped by social interactions and feedback from adoption levels. Individuals may also stop adopting virtuous behavior due to external constraints or shifting perceptions, affecting overall diffusion. After the stability analysis of equilibria, both in the presence and absence of adopters, we introduce a Model Predictive Control (MPC) framework that optimizes interventions by shaping opinions rather than directly enforcing adoption. This nudge-based control strategy allows policymakers to influence diffusion indirectly, making interventions more effective and scalable. Numerical simulations demonstrate that, in the absence of control, adoption stagnates, whereas MPC-driven interventions sustain and enhance adoption across communities.


[33] 2601.16727

Precise Low-Current Measurement Techniques for IoT Devices: A Case Study on MoleNet

Power consumption is a crucial aspect of IoT devices which often have to run on a battery for an extended period of time. Therefore, supply current measurements are crucial before deploying a device in the field. Multimeters and oscilloscopes are not well suited when it comes to measuring very small currents which occur e.g. when an IoT device is in sleep mode. In this report, we compare dedicated source measurement units (SMUs) which allow to measure very small currents with high precision. As an application example, we demonstrate current measurements on our MoleNet IoT sensor board.


[34] 2601.16764

ReLU Networks for Model Predictive Control: Network Complexity and Performance Guarantees

Recent years have witnessed a resurgence in using ReLU neural networks (NNs) to represent model predictive control (MPC) policies. However, determining the required network complexity to ensure closed-loop performance remains a fundamental open problem. This involves a critical precision-complexity trade-off: undersized networks may fail to capture the MPC policy, while oversized ones may outweigh the benefits of ReLU network approximation. In this work, we propose a projection-based method to enforce hard constraints and establish a state-dependent Lipschitz continuity property for the optimal MPC cost function, which enables sharp convergence analysis of the closed-loop system. For the first time, we derive explicit bounds on ReLU network width and depth for approximating MPC policies with guaranteed closed-loop performance. To further reduce network complexity and enhance closed-loop performance, we propose a non-uniform error framework with a state-aware scaling function to adaptively adjust both the input and output of the ReLU network. Our contributions provide a foundational step toward certifiable ReLU NN-based MPC.


[35] 2601.16780

PocketDVDNet: Realtime Video Denoising for Real Camera Noise

Live video denoising under realistic, multi-component sensor noise remains challenging for applications such as autofocus, autonomous driving, and surveillance. We propose PocketDVDNet, a lightweight video denoiser developed using our model compression framework that combines sparsity-guided structured pruning, a physics-informed noise model, and knowledge distillation to achieve high-quality restoration with reduced resource demands. Starting from a reference model, we induce sparsity, apply targeted channel pruning, and retrain a teacher on realistic multi-component noise. The student network learns implicit noise handling, eliminating the need for explicit noise-map inputs. PocketDVDNet reduces the original model size by 74% while improving denoising quality and processing 5-frame patches in real-time. These results demonstrate that aggressive compression, combined with domain-adapted distillation, can reconcile performance and efficiency for practical, real-time video denoising.


[36] 2601.16792

A Dynamic Parametric Simulator for Fetal Heart Sounds

Research on fetal phonocardiogram (fPCG) is challenged by the limited number of abdominal recordings, substantial maternal interference, and marked transmissioninduced signal attenuation that complicate reproducible benchmarking. We present a reproducible dynamic parametric simulator that generates long abdominal fPCG sequences by combining cycle-level fetal S1/S2 event synthesis with a convolutional transmission module and configurable interference and background noise. Model parameters are calibrated cyclewise from real abdominal recordings to capture beat-to-beat variability and to define data-driven admissible ranges for controllable synthesis. The generated signals are validated against real recordings in terms of envelope-based temporal structure and frequency-domain characteristics. The simulator is released as open software to support rapid, reproducible evaluation of fPCG processing methods under controlled acquisition conditions.


[37] 2601.16827

Identification of Port-Hamiltonian Differential-Algebraic Equations from Input-Output Data

Many models of physical systems, such as mechanical and electrical networks, exhibit algebraic constraints that arise from subsystem interconnections and underlying physical laws. Such systems are commonly formulated as differential-algebraic equations (DAEs), which describe both the dynamic evolution of system states and the algebraic relations that must hold among them. Within this class, port-Hamiltonian differential-algebraic equations (pH-DAEs) offer a structured, energy-based representation that preserves interconnection and passivity properties. This work introduces a data-driven identification method that combines port-Hamiltonian neural networks (pHNNs) with a differential-algebraic solver to model such constrained systems directly from noisy input-output data. The approach preserves the passivity and interconnection structure of port-Hamiltonian systems while employing a backward Euler discretization with Newton's method to solve the coupled differential and algebraic equations consistently. The performance of the proposed approach is demonstrated on a DC power network, where the identified model accurately captures system behaviour and maintains errors proportional to the noise amplitude, while providing reliable parameter estimates.


[38] 2601.16847

Hierarchical Distribution Matcher Design for Probabilistic Constellation Shaping Based on a Novel Semi-Analytical Optimization Approach

A novel design procedure for practical hierarchical distribution matchers (HiDMs) in probabilistically shaped constellation systems is presented. The proposed approach enables the determination of optimal parameters for any target distribution matcher rate. Specifically, lower bounds on energy loss, rate loss, and memory requirements are analytically estimated for HiDM architectures approximating the Maxwell Boltzmann (MB) distribution. A semi analytical optimization framework is employed to jointly optimize rate and energy loss, allowing the selection of the number of hierarchical layers, memory size, and block length required to optimize channel capacity. The accuracy of the proposed model is validated through probabilistic amplitude shaping of 16QAM (PAS 16QAM), showing good agreement between analytical predictions and simulated results. The proposed analytical tool facilitates the design of HiDM structures that are compatible with practical hardware and implementation constraints, such as those imposed by state-of-the-art application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs). Furthermore, the performance of the optimized HiDM structure, incorporating layer selection based on lower-bound energy loss, is evaluated over the AWGN channel in terms of normalized generalized mutual information (NGMI) as a function of the optical signal-to-noise ratio (OSNR). At a net data rate of 200 Gbps with 25% forward error correction (FEC) overhead, the proposed scheme achieves a shaping gain improvement of 2.8% compared to previously reported solutions.


[39] 2601.16876

Performance of Differential Protection Applied to Collector Cables of Offshore Wind Farms with MMC-HVDC Transmission

The ongoing global transition towards low-carbon energy has propelled the integration of offshore wind farms, which, when combined with Modular Multilevel Converter-based High-Voltage Direct Current (MMC-HVDC) transmission, present unique challenges for power system protection. In collector cables connecting wind turbines to offshore MMC, both ends are supplied by Inverter-Based Resources (IBRs), which modify the magnitude and characteristics of fault currents. In this context, this paper investigates the limitations of conventional differential protection schemes under such conditions and compares them with enhanced strategies that account for sequence components. Using electromagnetic transient simulations of a representative offshore wind farm modeled in PSCAD/EMTDC software, internal and external fault scenarios are assessed, varying fault types and resistances. The comparative evaluation provides insights into the sensitivity and selectivity of differential protection and guides a deeper conceptual understanding of the evolving protection challenges inherent to future converter-dominated grids.


[40] 2601.16915

IRS Compensation of Hyper-Rayleigh Fading: How Many Elements Are Needed?

The letter introduces and studies the problem of defining the minimum number of Intelligent Reflecting Surface (IRS) elements needed to compensate for heavy fading conditions in multipath fading channels. The fading severity is quantified in terms of Hyper-Rayleigh Regimes (HRRs) (i.e., full-HRR (worst-case conditions), strong-, weak-, and no-HRR), and the channel model used (Inverse Power Lomax (IPL)) was chosen since it can account for all HRRs. The research presents the derived closed-form channel coefficient envelope statistics for the single IRS-element channel with IPL statistics in both subchannels and total IRS-assisted channel, as well as tight approximations for the channel coefficient and instantaneous signal-to-noise ratio (SNR) statistics for the latter. The derived expressions helped estimate channel parameters corresponding to the specific HRRs of the total channel and demonstrate that while both single links (i.e., ''source-IRS'' and ''IRS-destination'') are in full-HRR, the minimum number of IRS elements needed to bring the total IRS-assisted link (''source-IRS-destination'') out of full-HRR is no less than $6$ (for the whole range on the IPL scale parameter corresponding full-HRR). Furthermore, the minimum number of IRS elements required to bring the total IRS-assisted link into no-HRR is $14$ (under the same conditions).


[41] 2601.16231

SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models

Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks, yet their robustness to adversarial manipulation remains poorly understood. We study a realistic and underexplored threat model: untargeted, audio-only adversarial attacks on trimodal audio-video-language models. We analyze six complementary attack objectives that target different stages of multimodal processing, including audio encoder representations, cross-modal attention, hidden states, and output likelihoods. Across three state-of-the-art models and multiple benchmarks, we show that audio-only perturbations can induce severe multimodal failures, achieving up to 96% attack success rate. We further show that attacks can be successful at low perceptual distortions (LPIPS <= 0.08, SI-SNR >= 0) and benefit more from extended optimization than increased data scale. Transferability across models and encoders remains limited, while speech recognition systems such as Whisper primarily respond to perturbation magnitude, achieving >97% attack success under severe distortion. These results expose a previously overlooked single-modality attack surface in multimodal systems and motivate defenses that enforce cross-modal consistency.


[42] 2601.16235

Contrastive Knowledge Distillation for Embedding Refinement in Personalized Speech Enhancement

Personalized speech enhancement (PSE) has shown convincing results when it comes to extracting a known target voice among interfering ones. The corresponding systems usually incorporate a representation of the target voice within the enhancement system, which is extracted from an enrollment clip of the target voice with upstream models. Those models are generally heavy as the speaker embedding's quality directly affects PSE performances. Yet, embeddings generated beforehand cannot account for the variations of the target voice during inference time. In this paper, we propose to perform on-thefly refinement of the speaker embedding using a tiny speaker encoder. We first introduce a novel contrastive knowledge distillation methodology in order to train a 150k-parameter encoder from complex embeddings. We then use this encoder within the enhancement system during inference and show that the proposed method greatly improves PSE performances while maintaining a low computational load.


[43] 2601.16273

The CMU-AIST submission for the ICME 2025 Audio Encoder Challenge

This technical report describes our submission to the ICME 2025 audio encoder challenge. Our submitted system is built on BEATs, a masked speech token prediction based audio encoder. We extend the BEATs model using 74,000 hours of data derived from various speech, music, and sound corpora and scale its architecture upto 300 million parameters. We experiment with speech-heavy and balanced pre-training mixtures to study the impact of different domains on final performance. Our submitted system consists of an ensemble of the Dasheng 1.2 billion model with two custom scaled-up BEATs models trained on the aforementioned pre-training data mixtures. We also propose a simple ensembling technique that retains the best capabilities of constituent models and surpasses both the baseline and Dasheng 1.2B. For open science, we publicly release our trained checkpoints via huggingface at this https URL and this https URL.


[44] 2601.16290

Maximizing Reach-Avoid Probabilities for Linear Stochastic Systems via Control Architectures

The maximization of reach-avoid probabilities for stochastic systems is a central topic in the control literature. Yet, the available methods are either restricted to low-dimensional systems or suffer from conservative approximations. To address these limitations, we propose control architectures that combine the flexibility of Markov Decision Processes with the scalability of Model Predictive Controllers. The Model Predictive Controller tracks reference signals while remaining agnostic to the stochasticity and reach-avoid objective. Instead, the reach-avoid probability is maximized by optimally updating the controller's reference online. To achieve this, the closed-loop system, consisting of the system and Model Predictive Controller, is abstracted as a Markov Decision Process in which a new reference can be chosen at every time-step. A feedback policy generating optimal references is then computed via Dynamic Programming. If the state space of the system is continuous, the Dynamic Programming algorithm must be executed on a finite system approximation. Modifications to the Model Predictive Controller enable a computationally efficient robustification of the Dynamic Programming algorithm to approximation errors, preserving bounds on the achieved reach-avoid probability. The approach is validated on a perturbed 12D quadcopter model in cluttered reach-avoid environments proving its flexibility and scalability.


[45] 2601.16292

AMBER: A Columnar Architecture for High-Performance Agent-Based Modeling in Python

Agent-based modeling (ABM) has emerged as an indispensable methodology for studying complex adaptive systems across the natural and social sciences. However, Python-based ABM frameworks face a fundamental tension between the accessibility that has made Python dominant in scientific computing and the performance requirements of large-scale simulations. This paper introduces AMBER, a framework that resolves this tension through a novel architectural approach: replacing the conventional object-per-agent representation with columnar state management using the Polars DataFrame library. We analyze the computational characteristics of both paradigms, present the architectural design of AMBER including its core abstractions, spatial environments, experiment management, and optimization capabilities. Empirical evaluation on three canonical benchmarks demonstrates that AMBER achieves speedups of 1.2x to 93x depending on workload characteristics, with the greatest advantages for models dominated by population-wide attribute operations. Memory profiling reveals 30-50% reduction in peak usage compared to object-oriented frameworks. Our results establish columnar state management as a viable architectural foundation for high-performance ABM in interpreted languages.


[46] 2601.16323

Multi-User Content Diversity in Wireless Networks

Immersive applications such as eXtended Reality (XR), cloud gaming, and real-time video streaming are central to the vision of 6G networks. These applications require not only low latency and high data rates, but also consistent and high-quality User Experience (UX). Traditional rate allocation and congestion control mechanisms in wireless networks treat users uniformly based on channel conditions, rely only on network-centric Key Performance Indicators (KPIs), and ignore the content diversity, which can lead to inefficient resource utilization and degraded UX. In this paper, we introduce the concept of Multi-User Content Diversity, which recognizes that different users concurrently consume media with varying complexity, and therefore have different bitrate requirements to achieve satisfactory UX. We propose multiple different frameworks that exploit multi-user content diversity and lead to overall network-wide gains in terms of UX. For each framework, we demonstrate the required information exchange between Application Servers (ASs), Application Clients (ACs), and the network, and the algorithms that run in each of these components to optimize a network-wide UXbased objective. Simulation results demonstrate that exploiting multi-user content diversity leads to significant gains in UX capacity, UX fairness, and network utilization, when compared to conventional rate control methods. These findings highlight the potential of content-aware networking as a key enabler for emerging wireless systems.


[47] 2601.16332

Efficient Gaussian process learning via subspace projections

We propose a novel training objective for GPs constructed using lower-dimensional linear projections of the data, referred to as \emph{projected likelihood} (PL). We provide a closed-form expression for the information loss related to the PL and empirically show that it can be reduced with random projections on the unit sphere. We show the superiority of the PL, in terms of accuracy and computational efficiency, over the exact GP training and the variational free energy approach to sparse GPs over different optimisers, kernels and datasets of moderately large sizes.


[48] 2601.16367

Game-to-Real Gap: Quantifying the Effect of Model Misspecification in Network Games

Game-theoretic models and solution concepts provide rigorous tools for predicting collective behavior in multi-agent systems. In practice, however, different agents may rely on different game-theoretic models to design their strategies. As a result, when these heterogeneous models interact, the realized outcome can deviate substantially from the outcome each agent expects based on its own local model. In this work, we introduce the game-to-real gap, a new metric that quantifies the impact of such model misspecification in multi-agent environments. The game-to-real gap is defined as the difference between the utility an agent actually obtains in the multi-agent environment (where other agents may have misspecified models) and the utility it expects under its own game model. Focusing on quadratic network games, we show that misspecifications in either (i) the external shock or (ii) the player interaction network can lead to arbitrarily large game-to-real gaps. We further develop novel network centrality measures that allow exact evaluation of this gap in quadratic network games. Our analysis reveals that standard network centrality measures fail to capture the effects of model misspecification, underscoring the need for new structural metrics that account for this limitation. Finally, through illustrative numerical experiments, we show that existing centrality measures in network games may provide a counterintuitive understanding of the impact of model misspecification.


[49] 2601.16472

Secure Intellicise Wireless Network: Agentic AI for Coverless Semantic Steganography Communication

Semantic Communication (SemCom), leveraging its significant advantages in transmission efficiency and reliability, has emerged as a core technology for constructing future intellicise (intelligent and concise) wireless networks. However, intelligent attacks represented by semantic eavesdropping pose severe challenges to the security of SemCom. To address this challenge, Semantic Steganographic Communication (SemSteCom) achieves ``invisible'' encryption by implicitly embedding private semantic information into cover modality carriers. The state-of-the-art study has further introduced generative diffusion models to directly generate stega images without relying on original cover images, effectively enhancing steganographic capacity. Nevertheless, the recovery process of private images is highly dependent on the guidance of private semantic keys, which may be inferred by intelligent eavesdroppers, thereby introducing new security threats. To address this issue, we propose an Agentic AI-driven SemSteCom (AgentSemSteCom) scheme, which includes semantic extraction, digital token controlled reference image generation, coverless steganography, semantic codec, and optional task-oriented enhancement modules. The proposed AgentSemSteCom scheme obviates the need for both cover images and private semantic keys, thereby boosting steganographic capacity while reinforcing transmission security. The simulation results on open-source datasets verify that, AgentSemSteCom achieves better transmission quality and higher security levels than the baseline scheme.


[50] 2601.16495

Load Balanced ISAC Systems for URLLC Users

This paper presents an energy-efficient downlink cell-free massive multiple-input multiple-output (CF-mMIMO) integrated sensing and communication (ISAC) network that serves ultra-reliable low-latency communication (URLLC) users while simultaneously detecting a target. We propose a load-balancing algorithm that minimizes the total network power consumption; including transmit power, fixed static power, and traffic-dependent fronthaul power at the access points (APs) without degrading system performance. To this end, we formulate a mixed-integer non-convex optimization problem and introduce an iterative joint power allocation and AP load balancing (JPALB) algorithm. The algorithm aims to reduce total power usage while meeting both the communication quality-of-service (QoS) requirements of URLLC users and the sensing QoS needed for target detection. Proposed JPALB algorithm for ISAC systems was simulated with maximum-ratio transmission (MRT) and regularized zero-forcing (RZF) precoders. Simulation results show approximately 33% reduction in power consumption, using JPALB algorithm compared to a baseline with no load balancing, without compromising communication and sensing QoS requirements.


[51] 2601.16540

Do Models Hear Like Us? Probing the Representational Alignment of Audio LLMs and Naturalistic EEG

Audio Large Language Models (Audio LLMs) have demonstrated strong capabilities in integrating speech perception with language understanding. However, whether their internal representations align with human neural dynamics during naturalistic listening remains largely unexplored. In this work, we systematically examine layer-wise representational alignment between 12 open-source Audio LLMs and Electroencephalogram (EEG) signals across 2 datasets. Specifically, we employ 8 similarity metrics, such as Spearman-based Representational Similarity Analysis (RSA), to characterize within-sentence representational geometry. Our analysis reveals 3 key findings: (1) we observe a rank-dependence split, in which model rankings vary substantially across different similarity metrics; (2) we identify spatio-temporal alignment patterns characterized by depth-dependent alignment peaks and a pronounced increase in RSA within the 250-500 ms time window, consistent with N400-related neural dynamics; (3) we find an affective dissociation whereby negative prosody, identified using a proposed Tri-modal Neighborhood Consistency (TNC) criterion, reduces geometric similarity while enhancing covariance-based dependence. These findings provide new neurobiological insights into the representational mechanisms of Audio LLMs.


[52] 2601.16547

CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal Distillation

Large Audio Language Models (LALMs) have garnered significant research interest. Despite being built upon text-based large language models (LLMs), LALMs frequently exhibit a degradation in knowledge and reasoning capabilities. We hypothesize that this limitation stems from the failure of current training paradigms to effectively bridge the acoustic-semantic gap within the feature representation space. To address this challenge, we propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Leveraging the text modality as an internal teacher, CORD performs multi-granularity alignment throughout the audio rollout process. At the token level, it employs on-policy reverse KL divergence with importance-aware weighting to prioritize early and semantically critical tokens. At the sequence level, CORD introduces a judge-based global reward to optimize complete reasoning trajectories via Group Relative Policy Optimization (GRPO). Empirical results across multiple benchmarks demonstrate that CORD consistently enhances audio-conditioned reasoning and substantially bridges the audio-text performance gap with only 80k synthetic training samples, validating the efficacy and data efficiency of our on-policy, multi-level cross-modal alignment approach.


[53] 2601.16578

Zero-Shot MARL Benchmark in the Cyber-Physical Mobility Lab

We present a reproducible benchmark for evaluating sim-to-real transfer of Multi-Agent Reinforcement Learning (MARL) policies for Connected and Automated Vehicles (CAVs). The platform, based on the Cyber-Physical Mobility Lab (CPM Lab) [1], integrates simulation, a high-fidelity digital twin, and a physical testbed, enabling structured zero-shot evaluation of MARL motion-planning policies. We demonstrate its use by deploying a SigmaRL-trained policy [2] across all three domains, revealing two complementary sources of performance degradation: architectural differences between simulation and hardware control stacks, and the sim-to-real gap induced by increasing environmental realism. The open-source setup enables systematic analysis of sim-to-real challenges in MARL under realistic, reproducible conditions.


[54] 2601.16603

Omni-directional attention mechanism based on Mamba for speech separation

Mamba, a selective state-space model (SSM), has emerged as an efficient alternative to Transformers for speech modeling, enabling long-sequence processing with linear complexity. While effective in speech separation, existing approaches, whether in the time or time-frequency domain, typically decompose the input along a single dimension into short one-dimensional sequences before processing them with Mamba, which restricts it to local 1D modeling and limits its ability to capture global dependencies across the 2D spectrogram. In this work, we propose an efficient omni-directional attention (OA) mechanism built upon unidirectional Mamba, which models global dependencies from ten different directions on the spectrogram. We expand the proposed mechanism into two baseline separation models and evaluate on three public datasets. Experimental results show that our approach consistently achieves significant performance gains over the baselines while preserving linear complexity, outperforming existing state-of-the-art (SOTA) systems.


[55] 2601.16663

A Categorical Approach to Semantic Interoperability across Building Lifecycle

Buildings generate heterogeneous data across their lifecycle, yet integrating these data remains a critical unsolved challenge. Despite three decades of standardization efforts, over 40 metadata schemas now span the building lifecycle, with fragmentation accelerating rather than resolving. Current approaches rely on point-to-point mappings that scale quadratically with the number of schemas, or universal ontologies that become unwieldy monoliths. The fundamental gap is the absence of mathematical foundations for structure-preserving transformations across heterogeneous building data. Here we show that category theory provides these foundations, enabling systematic data integration with $O(n)$ specification complexity for $n$ ontologies. We formalize building ontologies as first-order theories and demonstrate two proof-of-concept implementations in Categorical Query Language (CQL): 1) generating BRICK models from IFC design data at commissioning, and 2) three-way integration of IFC, BRICK, and RealEstateCore where only two explicit mappings yield the third automatically through categorical composition. Our correct-by-construction approach treats property sets as first-class schema entities and provides automated bidirectional migrations, and enables cross-ontology queries. These results establish feasibility of categorical methods for building data integration and suggest a path toward an app ecosystem for buildings, where mathematical foundations enable reliable component integration analogous to smartphone platforms.


[56] 2601.16675

I Guess That's Why They Call it the Blues: Causal Analysis for Audio Classifiers

It is well-known that audio classifiers often rely on non-musically relevant features and spurious correlations to classify audio. Hence audio classifiers are easy to manipulate or confuse, resulting in wrong classifications. While inducing a misclassification is not hard, until now the set of features that the classifiers rely on was not well understood. In this paper we introduce a new method that uses causal reasoning to discover features of the frequency space that are sufficient and necessary for a given classification. We describe an implementation of this algorithm in the tool FreqReX and provide experimental results on a number of standard benchmark datasets. Our experiments show that causally sufficient and necessary subsets allow us to manipulate the outputs of the models in a variety of ways by changing the input very slightly. Namely, a change to one out of 240,000 frequencies results in a change in classification 58% of the time, and the change can be so small that it is practically inaudible. These results show that causal analysis is useful for understanding the reasoning process of audio classifiers and can be used to successfully manipulate their outputs.


[57] 2601.16689

Neural Agonist-Antagonist Coupling in the Absence of Mechanical Coupling after Targeted Muscle Reinnervation

Following limb amputation and targeted muscle reinnervation (TMR), nerves supplying agonist and antagonist muscles are rerouted into separate targeted muscles, disrupting natural neuromechanical coupling between muscle groups. Using high-density intramuscular microelectrode arrays in reinnervated muscles, we show that neural signals for agonist and antagonist tasks remain functionally coupled: motor units active during agonist tasks were also recruited during corresponding antagonist tasks, despite no visual feedback on coactivation being provided.


[58] 2601.16733

Using Shadows in Circular Synthetic Aperture Sonar Imaging for Target Analysis

Circular Synthetic Aperture Sonar (CSAS) provides a 360° azimuth view of the seabed, surpassing the limited aperture and mono-view image of conventional side-scan SAS. This makes CSAS a valuable tool for target recognition in mine warfare where the diversity of point of view is essential for reducing false alarms. CSAS processing typically produces a very high-resolution two-dimensional image. However, the parallax introduced by the circular displacement of the illuminator fill-in the shadow regions, and the shadow cast by an object on the seafloor is lost in favor of azimuth coverage and resolution. Yet the shadows provide complementary information on target shape useful for target recognition. In this paper, we explore a way to retrieve shadow information from CSAS data to improve target analysis and carry 3D reconstruction. Sub-aperture filtering is used to get a collection of images at various points of view along the circular trajectory and fixed focus shadow enhancement (FFSE) is applied to obtain sharp shadows. An interactive interface is also proposed to allow human operators to visualize these shadows along the circular trajectory. A space-carving reconstruction method is applied to infer the 3D shape of the object from the segmented shadows. The results demonstrate the potential of shadows in circular SAS for improving target analysis and 3D reconstruction.


[59] 2601.16758

Noise Resilience and Robust Convergence Guarantees for the Variational Quantum Eigensolver

Variational Quantum Algorithms (VQAs) are a class of hybrid quantum-classical algorithms that leverage on classical optimization tools to find the optimal parameters for a parameterized quantum circuit. One relevant application of VQAs is the Variational Quantum Eigensolver (VQE), which aims at steering the output of the quantum circuit to the ground state of a certain Hamiltonian. Recent works have provided global convergence guarantees for VQEs under suitable local surjectivity and smoothness hypotheses, but little has been done in characterizing convergence of these algorithms when the underlying quantum circuit is affected by noise. In this work, we characterize the effect of different coherent and incoherent noise processes on the optimal parameters and the optimal cost of the VQE, and we study their influence on the convergence guarantees of the algorithm. Our work provides novel theoretical insight into the behavior of parameterized quantum circuits. Furthermore, we accompany our results with numerical simulations implemented via Pennylane.


[60] 2601.16774

E2E-AEC: Implementing an end-to-end neural network learning approach for acoustic echo cancellation

We propose a novel neural network-based end-to-end acoustic echo cancellation (E2E-AEC) method capable of streaming inference, which operates effectively without reliance on traditional linear AEC (LAEC) techniques and time delay estimation. Our approach includes several key strategies: First, we introduce and refine progressive learning to gradually enhance echo suppression. Second, our model employs knowledge transfer by initializing with a pre-trained LAECbased model, harnessing the insights gained from LAEC training. Third, we optimize the attention mechanism with a loss function applied on attention weights to achieve precise time alignment between the reference and microphone signals. Lastly, we incorporate voice activity detection to enhance speech quality and improve echo removal by masking the network output when near-end speech is absent. The effectiveness of our approach is validated through experiments conducted on public datasets.


[61] 2601.16793

A Novel Transfer Learning Approach for Mental Stability Classification from Voice Signal

This study presents a novel transfer learning approach and data augmentation technique for mental stability classification using human voice signals and addresses the challenges associated with limited data availability. Convolutional neural networks (CNNs) have been employed to analyse spectrogram images generated from voice recordings. Three CNN architectures, VGG16, InceptionV3, and DenseNet121, were evaluated across three experimental phases: training on non-augmented data, augmented data, and transfer learning. This proposed transfer learning approach involves pre-training models on the augmented dataset and fine-tuning them on the non-augmented dataset while ensuring strict data separation to prevent data leakage. The results demonstrate significant improvements in classification performance compared to the baseline approach. Among three CNN architectures, DenseNet121 achieved the highest accuracy of 94% and an AUC score of 99% using the proposed transfer learning approach. This finding highlights the effectiveness of combining data augmentation and transfer learning to enhance CNN-based classification of mental stability using voice spectrograms, offering a promising non-invasive tool for mental health diagnostics.


[62] 2601.16812

Sample-wise Constrained Learning via a Sequential Penalty Approach with Applications in Image Processing

In many learning tasks, certain requirements on the processing of individual data samples should arguably be formalized as strict constraints in the underlying optimization problem, rather than by means of arbitrary penalties. We show that, in these scenarios, learning can be carried out exploiting a sequential penalty method that allows to properly deal with constraints. The proposed algorithm is shown to possess convergence guarantees under assumptions that are reasonable in deep learning scenarios. Moreover, the results of experiments on image processing tasks show that the method is indeed viable to be used in practice.


[63] 2601.16863

Mixture-of-Models: Unifying Heterogeneous Agents via N-Way Self-Evaluating Deliberation

This paper introduces the N-Way Self-Evaluating Deliberation (NSED) protocol, a Runtime Mixture-of-Models (MoM) architecture that constructs emergent composite models from a plurality of distinct expert agents. Unlike traditional Mixture-of-Experts (MoE) which rely on static gating networks, NSED employs a Dynamic Expertise Broker - a runtime optimization engine that treats model selection as a variation of the Knapsack Problem, binding heterogeneous checkpoints to functional roles based on live telemetry and cost constraints. At the execution layer, we formalize deliberation as a Macro-Scale Recurrent Neural Network (RNN), where the consensus state loops back through a semantic forget gate to enable iterative refinement without proportional VRAM scaling. Key components include an orchestration fabric for trustless N-to-N peer review, a Quadratic Voting activation function for non-linear consensus, and a feedback-driven state update. Empirical validation on challenging benchmarks (AIME 2025, LiveCodeBench) demonstrates that this topology allows ensembles of small (less than 20B) consumer-grade models to match or exceed the performance of state-of-the-art 100B+ parameter models, establishing a new hardware arbitrage efficiency frontier. Furthermore, testing on the DarkBench safety suite reveals intrinsic alignment properties, with peer-mediated correction reducing sycophancy scores below that of any individual agent.


[64] 2601.16904

Clinical Feasibility of Label-Free Digital Staining Using Mid-Infrared Microscopy at Subcellular Resolution

We present a rapid, large-field bimodal imaging platform that integrates conventional brightfield microscopy with a lensless IR imaging scanner, enabling whole-slide IR image stack acquisition in minutes. Using a dedicated deep learning model, we implement an optical HE staining strategy based on subcellular morpho-spectral fingerprinting.


[65] 2601.16950

Evaluating Wi-Fi Performance for VR Streaming: A Study on Realistic HEVC Video Traffic

Cloud-based Virtual Reality (VR) streaming presents significant challenges for 802.11 networks due to its high throughput and low latency requirements. When multiple VR users share a Wi-Fi network, the resulting uplink and downlink traffic can quickly saturate the channel. This paper investigates the capacity of 802.11 networks for supporting realistic VR streaming workloads across varying frame rates, bitrates, codec settings, and numbers of users. We develop an emulation framework that reproduces Air Light VR (ALVR) operation, where real HEVC video traffic is fed into an 802.11 simulation model. Our findings explore Wi-Fi's performance anomaly and demonstrate that Intra-refresh (IR) coding effectively reduces latency variability and improves QoS, supporting up to 4 concurrent VR users with Constant Bitrate (CBR) 100 Mbps before the channel is saturated.


[66] 2411.14172

TaQ-DiT: Time-aware Quantization for Diffusion Transformers

Transformer-based diffusion models, dubbed Diffusion Transformers (DiTs), have achieved state-of-the-art performance in image and video generation tasks. However, their large model size and slow inference speed limit their practical applications, calling for model compression methods such as quantization. Unfortunately, existing DiT quantization methods overlook (1) the impact of reconstruction and (2) the varying quantization sensitivities across different layers, which hinder their achievable performance. To tackle these issues, we propose innovative time-aware quantization for DiTs (TaQ-DiT). Specifically, (1) we observe a non-convergence issue when reconstructing weights and activations separately during quantization and introduce a joint reconstruction method to resolve this problem. (2) We discover that Post-GELU activations are particularly sensitive to quantization due to their significant variability across different denoising steps as well as extreme asymmetries and variations within each step. To address this, we propose time-variance-aware transformations to facilitate more effective quantization. Experimental results show that when quantizing DiTs' weights to 4-bit and activations to 8-bit (W4A8), our method significantly surpasses previous quantization methods.


[67] 2503.00368

Stacked Intelligent Metasurface-Enhanced MIMO OFDM Wideband Communication Systems

Multiple-input multiple-output (MIMO) orthogonal frequency-division multiplexing (OFDM) systems rely on digital or hybrid digital and analog designs for beamforming against frequency-selective fading, which suffer from high hardware complexity and energy consumption. To address this, this work introduces a fully-analog stacked intelligent metasurfaces (SIM) architecture that directly performs wave-domain beamforming, enabling diagonalization of the end-to-end channel matrix and inherently eliminating inter-antenna interference (IAI) for MIMO OFDM transmission. By leveraging cascaded programmable metasurface layers, the proposed system establishes multiple parallel subchannels, significantly improving multi-carrier transmission efficiency while reducing hardware complexity. To optimize the SIM phase shift matrices, a block coordinate descent and penalty convex-concave procedure (BCD-PCCP) algorithm is developed to iteratively minimize the channel fitting error across subcarriers. Simulation results validate the proposed approach, determining the maximum effective bandwidth and demonstrating substantial performance improvements. Moreover, for a MIMO OFDM system operating at 28 GHz with 16 subcarriers, the proposed SIM configuration method achieves over 300% enhancement in channel capacity compared to conventional SIM configuration that only accounts for the center frequency.


[68] 2503.13583

Stability results for MIMO LTI systems via Scaled Relative Graphs

This paper proposes a frequency-wise approach for stability analysis of multi-input, multi-output (MIMO) Linear Time-Invariant (LTI) feedback systems through Scaled Relative Graphs (SRGs). Unlike traditional methods, such as the Generalized Nyquist Criterion (GNC), which relies on a coupled analysis that requires the multiplication of models, our approach enables the evaluation of system stability in a decoupled fashion, system by system, each of which is represented by its SRG (or an over-approximation thereof), and it provides an intuitive, visual representation of system behavior. Our results provide conditions for certifying the stability of stable and square MIMO LTI systems connected in closed loop.


[69] 2504.10244

Towards contrast- and pathology-agnostic clinical fetal brain MRI segmentation using SynthSeg

Magnetic resonance imaging (MRI) has played a crucial role in fetal neurodevelopmental research. Structural annotations of MR images are an important step for quantitative analysis of the developing human brain, with Deep Learning providing an automated alternative for this otherwise tedious manual process. However, segmentation performances of Convolutional Neural Networks often suffer from domain shift, where the network fails when applied to subjects that deviate from the distribution with which it is trained on. In this work, we aim to train networks capable of automatically segmenting fetal brain MRIs with a wide range of domain shifts pertaining to differences in subject physiology and acquisition environments, in particular shape-based differences commonly observed in pathological cases. We introduce a novel data-driven train-time sampling strategy that seeks to fully exploit the diversity of a given training dataset to enhance the domain generalizability of the trained networks. We adapted our sampler, together with other existing data augmentation techniques, to the SynthSeg framework, a generator that utilizes domain randomization to generate diverse training data. We ran thorough experimentations and ablation studies on a wide range of training/testing data to test the validity of the approaches. Our networks achieved notable improvements in the segmentation quality on testing subjects with intense anatomical abnormalities (p < 1e-4), though at the cost of a slighter decrease in performance in cases with fewer abnormalities. Our work also lays the foundation for future works on creating and adapting data-driven sampling strategies for other training pipelines.


[70] 2505.18182

Machine Learning-Based Analysis of ECG and PCG Signals for Rheumatic Heart Disease Detection: A Scoping Review (2015-2025)

AI-powered stethoscopes offer a promising alternative for screening rheumatic heart disease (RHD), particularly in regions with limited diagnostic infrastructure. Early detection is vital, yet echocardiography, the gold standard tool, remains largely inaccessible in low-resource settings due to cost and workforce constraints. This review systematically examines machine learning (ML) applications from 2015 to 2025 that analyze electrocardiogram (ECG) and phonocardiogram (PCG) data to support accessible, scalable screening of all RHD variants in relation to the World Heart Federation's "25 by 25" goal to reduce RHD mortality. Using PRISMA-ScR guidelines, 37 peer-reviewed studies were selected from PubMed, IEEE Xplore, Scopus, and Embase. Convolutional neural networks (CNNs) dominate recent efforts, achieving a median accuracy of 97.75%, F1-score of 0.95, and AUROC of 0.89. However, challenges remain: 73% of studies used single-center datasets, 81.1% relied on private data, only 10.8% were externally validated, and none assessed cost-effectiveness. Although 45.9% originated from endemic regions, few addressed demographic diversity or implementation feasibility. These gaps underscore the disconnect between model performance and clinical readiness. Bridging this divide requires standardized benchmark datasets, prospective trials in endemic areas, and broader validation. If these issues are addressed, AI-augmented auscultation could transform cardiovascular diagnostics in underserved populations, thereby aiding early detection. This review also offers practical recommendations for building accessible ML-based RHD screening tools, aiming to close the diagnostic gap in low-resource settings where conventional auscultation may miss up to 90% of cases and echocardiography remains out of reach.


[71] 2505.24160

Beyond the LUMIR challenge: The pathway to foundational registration models

Medical image challenges have played a transformative role in advancing the field, catalyzing innovation and establishing new performance benchmarks. Image registration, a foundational task in neuroimaging, has similarly advanced through the Learn2Reg initiative. Building on this, we introduce the Large-scale Unsupervised Brain MRI Image Registration (LUMIR) challenge, a next-generation benchmark for unsupervised brain MRI registration. Previous challenges relied upon anatomical label maps, however LUMIR provides 4,014 unlabeled T1-weighted MRIs for training, encouraging biologically plausible deformation modeling through self-supervision. Evaluation includes 590 in-domain test subjects and extensive zero-shot tasks across disease populations, imaging protocols, and species. Deep learning methods consistently achieved state-of-the-art performance and produced anatomically plausible, diffeomorphic deformation fields. They outperformed several leading optimization-based methods and remained robust to most domain shifts. These findings highlight the growing maturity of deep learning in neuroimaging registration and its potential to serve as a foundation model for general-purpose medical image registration.


[72] 2509.04060

Physics-Informed Detection of Friction Anomalies in Satellite Reaction Wheels

As the number of satellites in orbit has increased exponentially in recent years, ensuring their correct functionality has started to require automated methods to decrease human workload. In this work, we present an algorithm that analyzes the on-board data related to friction from the Reaction Wheel Assemblies (RWA) of a satellite and determines their operating status, distinguishing between nominal status and several possible anomalies that require preventive measures to be taken. The algorithm first uses a model based on hybrid systems theory to extract the information relevant to the problem. The extraction process combines techniques in changepoint detection, dynamic programming, and maximum likelihood in a structured way. A classifier then uses the extracted information to determine the status of the RWA. This last classifier has been previously trained with a labelled dataset produced by a high-fidelity simulator, comprised for the most part of nominal data. The final algorithm combines model-based and data-based approaches to obtain satisfactory results with an accuracy around 95%.


[73] 2509.14069

Lightweight Implicit Neural Network for Binaural Audio Synthesis

High-fidelity binaural audio synthesis is crucial for immersive listening, but existing methods require extensive computational resources, limiting their edge-device application. To address this, we propose the Lightweight Implicit Neural Network (Lite-INN), a novel two-stage framework. Lite-INN first generates initial estimates using a time-domain warping, which is then refined by an Implicit Binaural Corrector (IBC) module. IBC is an implicit neural network that predicts amplitude and phase corrections directly, resulting in a highly compact model architecture. Experimental results show that Lite-INN achieves statistically comparable perceptual quality to the best-performing baseline model while significantly improving computational efficiency. Compared to the previous state-of-the-art method (NFS), Lite-INN achieves a 72.7% reduction in parameters and requires significantly fewer compute operations (MACs). This demonstrates that our approach effectively addresses the trade-off between synthesis quality and computational efficiency, providing a new solution for high-fidelity edge-device spatial audio applications.


[74] 2509.14076

A Lightweight Fourier-based Network for Binaural Speech Enhancement with Spatial Cue Preservation

Binaural speech enhancement faces a severe trade-off challenge, where state-of-the-art performance is achieved by computationally intensive architectures, while lightweight solutions often come at the cost of significant performance degradation. To bridge this gap, we propose the Global Adaptive Fourier Network (GAF-Net), a lightweight deep complex network that aims to establish a balance between performance and computational efficiency. The GAF-Net architecture consists of three components. First, a dual-feature encoder combining short-time Fourier transform and gammatone features enhances the robustness of acoustic representation. Second, a channel-independent globally adaptive Fourier modulator efficiently captures long-term temporal dependencies while preserving the spatial cues. Finally, a dynamic gating mechanism is implemented to reduce processing artifacts. Experimental results show that GAF-Net achieves competitive performance, particularly in terms of binaural cues (ILD and IPD error) and objective intelligibility (MBSTOI), with fewer parameters and computational cost. These results confirm that GAF-Net provides a feasible way to achieve high-fidelity binaural processing on resource-constrained devices.


[75] 2509.19592

Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation

Speech generation models based on large language models (LLMs) typically operate on discrete acoustic codes, which differ fundamentally from text tokens due to their multicodebook structure. At each timestep, models must predict N codebook entries jointly, introducing dependencies that challenge simple parallel prediction approaches. Parallel prediction assumes independence among codebooks, yielding efficient decoding but often at the cost of reduced fidelity. To address this, hierarchical strategies employ a local transformer (LT) to refine predictions and capture intra-timestep dependencies. In this work, we systematically investigate two LT architectures: an autoregressive transformer that generates codebooks sequentially, and a MaskGIT-based transformer that performs iterative masked prediction. Both designs further enable frame stacking, where the primary transformer predicts multiple frames jointly, and the LT decodes their codebooks, offering improvements in speed without compromising perceptual quality. Through extensive analysis, we characterize the tradeoffs between parallel and iterative sampling strategies across different throughput and quality regimes. Finally, we propose practical guidelines for selecting decoding strategies based on deployment priorities such as computational efficiency and synthesis fidelity.


[76] 2509.21463

Enhanced Generative Machine Listener

We present GMLv2, a reference-based model designed for the prediction of subjective audio quality as measured by MUSHRA scores. GMLv2 introduces a Beta distribution-based loss to model the listener ratings and incorporates additional neural audio coding (NAC) subjective datasets to extend its generalization and applicability. Extensive evaluations on diverse testset demonstrate that proposed GMLv2 consistently outperforms widely used metrics, such as PEAQ and ViSQOL, both in terms of correlation with subjective scores and in reliably predicting these scores across diverse content types and codec configurations. Consequently, GMLv2 offers a scalable and automated framework for perceptual audio quality evaluation, poised to accelerate research and development in modern audio coding technologies.


[77] 2509.22148

Speaker Anonymisation for Speech-based Suicide Risk Detection

Adolescent suicide is a critical global health issue, and speech provides a cost-effective modality for automatic suicide risk detection. Given the vulnerable population, protecting speaker identity is particularly important, as speech itself can reveal personally identifiable information if the data is leaked or maliciously exploited. This work presents the first systematic study of speaker anonymisation for speech-based suicide risk detection. A broad range of anonymisation methods are investigated, including techniques based on traditional signal processing, neural voice conversion, and speech synthesis. A comprehensive evaluation framework is built to assess the trade-off between protecting speaker identity and preserving information essential for suicide risk detection. Results show that combining anonymisation methods that retain complementary information yields detection performance comparable to that of original speech, while achieving protection of speaker identity for vulnerable populations.


[78] 2510.16813

Audio dequantization using instantaneous frequency

We present a dequantization method that employs a phase-aware regularizer, originally successfully applied in an audio inpainting problem. The method promotes a temporal continuity of sinusoidal components in time-frequency representation of the audio signal, and avoids energy loss artifacts commonly encountered with l1-based regularization approaches. The proposed method is called the Phase-Aware Audio Dequantizer (PHADQ). The method are evaluated against the state-of-the-art using the SDR and PEMO-Q ODG objective metrics, and a~subjective MUSHRA-like test.


[79] 2510.26635

SAMRI: Segment Anything Model for MRI

Accurate magnetic resonance imaging (MRI) segmentation is crucial for clinical decision-making, but remains labor-intensive when performed manually. Convolutional neural network (CNN) based methods can be accurate and efficient but often generalize poorly to MRI variable contrast, intensity inhomogeneity, and sequences. Although the transformer-based Segment Anything Model (SAM) has demonstrated remarkable generalizability in natural images, existing adaptations often treat MRI as another imaging modality, overlooking these modality-specific challenges. We present SAMRI, an MRI-specialized SAM trained and validated on 1.1 million labeled MR slices spanning whole-body organs and pathologies. We demonstrate that SAM can be effectively adapted to MRI by fine-tuning its mask decoder using a two-stage strategy, reducing training time by 94 percent and trainable parameters by 96 percent compared to full-model retraining. Across diverse MRI segmentation tasks, SAMRI achieves a mean Dice of 0.87, delivering state-of-the-art accuracy across anatomical regions and robust generalization on unseen structures, particularly small clinically important structures. In addition, we provide a complete training-to-inference pipeline and a user-friendly local graphical interface that enables interactive application of pretrained SAMRI models on standard machines, facilitating practical deployment for real-world MRI segmentation.


[80] 2511.14652

Robust Offset-free Kernelized Data-Driven Predictive Control for Nonlinear Systems

This paper proposes a novel Kernelized Data-Driven Predictive Control (KDPC) scheme for robust, offset-free tracking of nonlinear systems. Our computationally efficient hybrid approach separates the prediction: (1) kernel ridge regression learns the nonlinear map from past trajectories, and (2) analytical linearization of the kernel map approximates the effect of future inputs. This linearization is key, allowing the controller to be formulated as a standard Quadratic Program (QP) for efficient real-time implementation. Offset-free tracking is inherently achieved by using input increments. We provide theoretical guarantees for recursive feasibility and asymptotic stability. The algorithm is validated on a nonlinear Van der Pol oscillator, where it successfully rejects unmeasured disturbances and eliminates steady-state errors, outperforming a standard model-based controller.


[81] 2511.15480

Robust H-infinity control and worst-case search in constrained parametric space

Standard H-infinity/H2 robust control and analysis tools operate on uncertain parameters assumed to vary independently within prescribed bounds. This paper extends their capabilities in the presence of constraints coupling these parameters and restricting the parametric space. Focusing on the worst-case search, we demonstrate - based on the theory of upper-C1 functions - the validity of using standard, readily available smooth optimization algorithms to address this nonsmooth constrained optimization problem. In particular, we prove that the sequential quadratic programming algorithm converges to Karush-Kuhn-Tucker points, and that such conditions are satisfied by any subgradient at a local minimum. This worst-case search then enables robust controller synthesis: identified worst-case configurations are iteratively added to an active set on which a non-smooth multi-models optimization of the controller is performed. The methodology is illustrated on a satellite benchmark with flexible appendages, of order 50 with 43 uncertain parameters. From a practical point of view, we combine the local exploitation proposed above with a global exploration using either Monte-Carlo sampling or particle swarm optimization. We show that the proposed constrained optimization effectively complements Monte-Carlo sampling by enabling fast detection of rare worst-case configurations, and that the robust controller optimization converges with less than 10 active configurations.


[82] 2512.02091

Fine-tuned Transformer Models for Breast Cancer Detection and Classification

Breast cancer is still the second top cause of cancer deaths worldwide and this emphasizes the importance of necessary steps for early detection. Traditional diagnostic methods, such as mammography, ultrasound, and thermography, which have limitations when it comes to catching subtle patterns and reducing false positives. New technologies like artificial intelligence (AI) and deep learning have brought about the revolution in medical imaging analysis. Nevertheless, typical architectures such as Convolutional Neural Networks (CNNs) often have problems with modeling long-range dependencies. It explores the application of visual transformer models (here: Swin Tiny, DeiT, BEiT, ViT, and YOLOv8) for breast cancer detection through a collection of mammographic image sets. The ViT model reached the highest accuracy of 99.32% which showed its superiority in detecting global patterns as well as subtle image features. Data augmenting approaches, such as resizing croppings, flippings, and normalization, were further applied to the model for achieving higher performance. Although there were interesting results, the issues of dataset diversity and model optimization which present new avenues of research are also still present. Through this study, the crystal potential of transformer-based AI models in changing the detecting process of breast cancer and, thus, to patients health, is suggested.


[83] 2512.04892

Small-Signal Stability Oriented Real-Time Operation of Power Systems with a High Penetration of Inverter-Based Resources

This study proposes a control strategy to ensure the safe operation of modern power systems with high penetration of inverter-based resources (IBRs) within an optimal operation framework. The objective is to obtain operating points that satisfy the optimality conditions of a predefined problem while guaranteeing small-signal stability. The methodology consists of two stages. First, an offline analysis of a set of operating points is performed to derive a data-driven regression-based expression that captures a damping-based stability index as a function of the operating conditions. Second, an Online Feedback Optimization (OFO) controller is employed to drive the system toward an optimal operating point while maintaining a secure distance from the instability region. The proposed strategy is evaluated on an academic test case based on a modified version of the IEEE 9-bus system, in which synchronous generators are replaced by IBRs operating under both grid-following and grid-forming control modes. The results demonstrate the effectiveness of the method and are discussed in detail.


[84] 2512.06369

Data Generation for Stability Studies of Power Systems with High Penetration of Inverter-Based Resources

The increasing penetration of inverter-based resources (IBRs) is fundamentally reshaping power system dynamics and creating new challenges for stability assessment. Data-driven approaches, and in particular machine learning models, require large and representative datasets that capture how system stability varies across a wide range of operating conditions and control settings. This paper presents an open-source, high-performance computing framework for the systematic generation of such datasets. The proposed tool defines a scalable operating space for large-scale power systems, explores it through an adaptive sampling strategy guided by sensitivity analysis, and performs small-signal stability assessments to populate a high-information-content dataset. The framework efficiently targets regions near the stability margin while maintaining broad coverage of feasible operating conditions. The workflow is fully implemented in Python and designed for parallel execution. The resulting tool enables the creation of high-quality datasets that support data-driven stability studies in modern power systems with high IBR penetration.


[85] 2601.00973

Learned Hemodynamic Coupling Inference in Resting-State Functional MRI

Functional magnetic resonance imaging (fMRI) provides an indirect measurement of neuronal activity via hemodynamic responses that vary across brain regions and individuals. Ignoring this hemodynamic variability can bias downstream connectivity estimates. Furthermore, the hemodynamic parameters themselves may serve as important imaging biomarkers. Estimating spatially varying hemodynamics from resting-state fMRI (rsfMRI) is therefore an important but challenging blind inverse problem, since both the latent neural activity and the hemodynamic coupling are unknown. In this work, we propose a methodology for inferring hemodynamic coupling on the cortical surface from rsfMRI. Our approach avoids the highly unstable joint recovery of neural activity and hemodynamics by marginalizing out the latent neural signal and basing inference on the resulting marginal likelihood. To enable scalable, high-resolution estimation, we employ a deep neural network combined with conditional normalizing flows to accurately approximate this intractable marginal likelihood, while enforcing spatial coherence through priors defined on the cortical surface that admit sparse representations. Uncertainty in the hemodynamic estimates is quantified via a double-bootstrap procedure. The proposed approach is extensively validated using synthetic data and real fMRI datasets, demonstrating clear improvements over current methods for hemodynamic estimation and downstream connectivity analysis.


[86] 2601.14953

Deep Learning assisted Port-Cycling based Channel Sounding for Precoder Estimation in Massive MIMO Arrays

Future wireless systems are expected to employ a substantially larger number of transmit ports for channel state information (CSI) estimation compared to current specifications. Although scaling ports improves spectral efficiency, it also increases the resource overhead to transmit reference signals across the time-frequency grid, ultimately reducing achievable data throughput. In this work, we propose an deep learning (DL)-based CSI reconstruction framework that serves as an enabler for reliable CSI acquisition in future 6G systems. The proposed solution involves designing a port-cycling mechanism that sequentially sounds different portions of CSI ports across time, thereby lowering the overhead while preserving channel observability. The proposed CSI Adaptive Network (CsiAdaNet) model exploits the resulting sparse measurements and captures both spatial and temporal correlations to accurately reconstruct the full-port CSI. The simulation results show that our method achieves overhead reduction while maintaining high CSI reconstruction accuracy.


[87] 2601.16198

Stochastic Control Barrier Functions under State Estimation: From Euclidean Space to Lie Groups

Ensuring safety for autonomous systems under uncertainty remains challenging, particularly when safety of the true state is required despite the true state not being fully known. Control barrier functions (CBFs) have become widely adopted as safety filters. However, standard CBF formulations do not explicitly account for state estimation uncertainty and its propagation, especially for stochastic systems evolving on manifolds. In this paper, we propose a safety-critical control framework with a provable bound on the finite-time safety probability for stochastic systems under noisy state information. The proposed framework explicitly incorporates the uncertainty arising from both process and measurement noise, and synthesizes controllers that adapt to the level of uncertainty. The framework admits closed-form solutions in linear settings, and experimental results demonstrate its effectiveness on systems whose state spaces range from Euclidean space to Lie groups.


[88] 2403.11762

Full-Duplex Multiuser MISO Under Coarse Quantization: Per-Antenna SQNR Analysis and Beamforming Design

We investigate full-duplex (FD) multi-user multiple input single-output systems with coarse quantization, aiming to characterize the impact of employing low-resolution analog-to-digital converters (ADCs) on self-interference (SI) and to develop a quantization- and SI-aware beamforming method that alleviates quantization-induced performance degradation in the FD systems. We first present an analysis on the perantenna signal-to-quantization noise ratio for conventional linear beamformers to provide the desired range of the number of analog-to-digital converter (ADC) bits, providing system insights for reliable FD operation in regard to the ADC resolution and beamforming strategy. Motivated by the insights, we then propose an SI-aware beamforming method that mitigates residual SI and quantization distortion. The resulting spectral efficiency (SE) maximization problem is decomposed into two tractable subproblems solved via alternating optimization: precoder and combiner design. The precoder optimization is formulated as a generalized eigenvalue problem, where the dominant eigenvector yields the best stationary solution through power iteration, while the combiner is derived as a quantization-aware minimum meansquared error (MMSE) filter. Numerical studies show that the number of required ADC bits with the proposed beamforming falls within the derived theoretical range while achieving the highest SE compared to benchmarks.


[89] 2507.12884

From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation

We present NeckSense, a novel wearable system for head pose tracking that leverages multi-channel bio-impedance sensing with soft, dry electrodes embedded in a lightweight, necklace-style form factor. NeckSense captures dynamic changes in tissue impedance around the neck, which are modulated by head rotations and subtle muscle activations. To robustly estimate head pose, we propose a deep learning framework that integrates anatomical priors, including joint constraints and natural head rotation ranges, into the loss function design. We validate NeckSense on 7 participants using the current SOTA pose estimation model as ground truth. Our system achieves a mean per-vertex error of 25.9 mm across various head movements with a leave-one-person-out cross-validation method, demonstrating that a compact, line-of-sight-free bio-impedance wearable can deliver head-tracking performance comparable to SOTA vision-based methods.


[90] 2509.04744

WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning

Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs' capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs' symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.


[91] 2509.12694

Soft Graph Transformer for MIMO Detection

We propose the Soft Graph Transformer (SGT), a soft-input-soft-output neural architecture designed for MIMO detection. While Maximum Likelihood (ML) detection achieves optimal accuracy, its exponential complexity makes it infeasible in large systems, and conventional message-passing algorithms rely on asymptotic assumptions that often fail in finite dimensions. Recent Transformer-based detectors show strong performance but typically overlook the MIMO factor graph structure and cannot exploit prior soft information. SGT addresses these limitations by combining self-attention, which encodes contextual dependencies within symbol and constraint subgraphs, with graph-aware cross-attention, which performs structured message passing across subgraphs. Its soft-input interface allows the integration of auxiliary priors, producing effective soft outputs while maintaining computational efficiency. Experiments demonstrate that SGT achieves near-ML performance and offers a flexible and interpretable framework for receiver systems that leverage soft priors.


[92] 2509.15703

SONAR: Self-Distilled Continual Pre-training for Domain Adaptive Audio Representation

Self-supervised learning (SSL) on large-scale datasets like AudioSet has become the dominant paradigm for audio representation learning. While the continuous influx of new, unlabeled audio presents an opportunity to enrich these static representations, a naive approach is to retrain the model from scratch using all available data. However, this method is computationally prohibitive and discards the valuable knowledge embedded in the previously trained model weights. To address this inefficiency, we propose SONAR (Self-distilled cONtinual pre-training for domain adaptive Audio Representation), a continual pre-training framework built upon BEATs. SONAR effectively adapts to new domains while mitigating catastrophic forgetting by tackling three key challenges: implementing a joint sampling strategy for new and prior data, applying regularization to balance specificity and generality, and dynamically expanding the tokenizer codebook for novel acoustic patterns. Experiments across four distinct domains demonstrate that our method achieves both high adaptability and robust resistance to forgetting.


[93] 2509.16522

Etude: Piano Cover Generation with a Three-Stage Approach -- Extract, strucTUralize, and DEcode

Piano cover generation aims to automatically transform a pop song into a piano arrangement. While numerous deep learning approaches have been proposed, existing models often fail to maintain structural consistency with the original song, likely due to the absence of beat-aware mechanisms or the difficulty of modeling complex rhythmic patterns. Rhythmic information is crucial, as it defines structural similarity (e.g., tempo, BPM) and directly impacts the overall quality of the generated music. In this paper, we introduce Etude, a three-stage architecture consisting of Extract, strucTUralize, and DEcode stages. By pre-extracting rhythmic information and applying a novel, simplified REMI-based tokenization, our model produces covers that preserve proper song structure, enhance fluency and musical dynamics, and support highly controllable generation through style injection. Subjective evaluations with human listeners show that Etude substantially outperforms prior models, achieving a quality level comparable to that of human composers.


[94] 2512.14961

Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities

Person identification systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently result in missing or degraded modalities. To address this challenge, we propose a multimodal person identification framework that utilizes gesture as a situational enhancer to supplement traditional modalities like voice and face. Our model employs a unified hybrid fusion strategy, integrating both feature-level and score-level information to maximize representational richness and decision accuracy. Specifically, it leverages multi-task learning to process modalities independently, followed by cross-attention and gated fusion mechanisms. Finally, a confidence-weighted strategy dynamically adapts to missing data, ensuring that our single classification head achieves optimal performance even in unimodal and bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark in this work for the first time. Our results demonstrate that the proposed trimodal system achieves 99.51% Top-1 accuracy on person identification tasks. In addition, we evaluate our model on the VoxCeleb1 dataset as a benchmark and reach 99.92% accuracy in bimodal mode, outperforming conventional approaches. Moreover, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.


[95] 2512.20332

RIS-Empowered OTFS Modulation With Faster-than-Nyquist Signaling in High-Mobility Wireless Communications

High-mobility wireless communication systems suffer from severe Doppler spread and multi-path delay, which degrade the reliability and spectral efficiency of conventional modulation schemes. Orthogonal time frequency space (OTFS) modulation offers strong robustness in such environments by representing symbols in the delay-Doppler (DD) domain, while faster-than-Nyquist (FTN) signaling can further enhance spectral efficiency through intentional symbol packing. Meanwhile, reconfigurable intelligent surfaces (RIS) provide a promising means to improve link quality via passive beamforming. Motivated by these advantages, we propose a novel RIS-empowered OTFS modulation with FTN signaling (RIS-OTFS-FTN) scheme. First, we establish a unified DD-domain input-output relationship that jointly accounts for RIS passive beamforming, FTN-induced inter-symbol interference, and DD-domain channel characteristics. Based on this model, we provide comprehensive analytical performance for the frame error rate, spectral efficiency, and peak-to-average power ratio (PAPR), etc. Furthermore, a practical RIS phase adjustment strategy with quantized phase selection is designed to maximize the effective channel gain. Extensive Monte Carlo simulations under a standardized extended vehicular A (EVA) channel model validate the theoretical results and provide key insights into the trade-offs among spectral efficiency, PAPR, input back-off (IBO), and error performance, with some interesting this http URL proposed RIS-OTFS-FTN scheme demonstrates notable performance gains in both reliability and spectral efficiency, offering a viable solution for future high-mobility and spectrum-constrained wireless systems.