SpaAudioLM: Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding

2025

73.36% F1-Micro Score
+29.12 pp vs. Gemini 2.5 Pro
28 Sound Event Classes
2,697 Training Samples

Overview

Existing environmental sound classification (ESC) methods treat sounds as isolated signals, ignoring where they occur. A car horn and a train horn may sound similar, but their geographic context (highway vs. railway station) makes them easy to distinguish. SpaAudioLM bridges this gap by jointly reasoning over audio and geospatial Point-of-Interest (POI) metadata, enabling spatially grounded sound understanding.

We fine-tune Qwen2.5-Omni through a three-phase pipeline: difficulty profiling → Chain-of-Thought SFT → difficulty-aware GRPO with a composite reward (weighted F1 + format + POI consistency). The model generates structured reasoning before predicting multi-label sound events across 28 categories.
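As a rough illustration of the composite reward and the group-relative scoring used by GRPO, the sketch below combines an F1 term, a format term, and a POI-consistency term, then standardizes rewards within a group of sampled responses. Everything specific here is an assumption rather than the authors' implementation: the `<think>`/`<answer>` output tags, the weights `(0.7, 0.1, 0.2)`, the plain per-sample set F1 standing in for the "weighted F1" (whose weighting scheme is not described on this page), and the keyword-match proxy for POI consistency. The difficulty-aware component is not modeled.

```python
import re
from statistics import mean, pstdev

# Hypothetical output format: reasoning in <think>, labels in <answer>.
OUTPUT_PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)


def parse_output(text):
    """Return (reasoning, predicted label set), or None if the format is violated."""
    match = OUTPUT_PATTERN.match(text.strip())
    if match is None:
        return None
    reasoning, answer = match.groups()
    labels = {label.strip().lower() for label in answer.split(",") if label.strip()}
    return reasoning, labels


def set_f1(predicted, gold):
    """F1 between the predicted and gold label sets of a single sample."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def composite_reward(text, gold, poi_values, weights=(0.7, 0.1, 0.2)):
    """Weighted sum of F1, format, and POI-consistency terms (weights are assumed)."""
    parsed = parse_output(text)
    if parsed is None:
        return 0.0  # in this sketch, malformed output earns no reward at all
    reasoning, predicted = parsed
    f1_term = set_f1(predicted, {label.lower() for label in gold})
    format_term = 1.0  # the output parsed, so the format check passed
    # Assumed proxy for POI consistency: the reasoning mentions a POI value.
    reasoning_lower = reasoning.lower()
    mentioned = any(
        value.lower().replace("_", " ") in reasoning_lower for value in poi_values
    )
    poi_term = 1.0 if mentioned else 0.0
    w_f1, w_format, w_poi = weights
    return w_f1 * f1_term + w_format * format_term + w_poi * poi_term


def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(reward - mu) / sigma for reward in rewards]
```

With these assumed weights, a well-formed response that predicts every label correctly and refers to a nearby POI scores close to 1.0, while a response that ignores the output format scores 0.0.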

Figure: SpaAudioLM training pipeline overview.

Main Results

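Comparison with baseline models on multi-label audio event classification. Results are mean ± std over 5 independent runs, reported in percent. Mi, Ma, and W denote micro-, macro-, and weighted averaging; EM is exact match; ROC and PR are areas under the ROC and precision-recall curves.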

Model              F1-Mi       F1-Ma       F1-W        Jaccard     EM          ROC-Mi      ROC-Ma      PR-Mi       PR-Ma
Qwen2-Audio-7B     4.73±0.43   2.86±0.30   5.27±0.53   1.96±0.19   0.00±0.00   50.41±0.20  49.72±0.21  5.72±0.03   5.83±0.05
Qwen2.5-Omni-7B    34.36±0.09  25.90±0.24  37.35±0.15  18.31±0.10  9.97±0.15   68.93±0.13  64.59±0.16  15.61±0.07  12.98±0.09
Qwen3-Omni-30B     29.66±0.43  20.26±0.82  28.80±0.25  14.81±0.34  14.02±0.23  60.97±0.14  58.57±0.47  13.53±0.29  12.27±0.48
GPT-4o Audio       30.09±0.47  26.47±1.21  34.07±0.97  17.18±0.58  9.43±0.91   70.26±0.41  64.70±0.42  13.74±0.29  14.86±0.77
Gemini 2.5 Pro     44.24±0.65  40.35±1.04  47.65±0.83  28.04±0.63  15.58±1.04  73.04±0.51  69.47±0.57  22.70±0.55  25.29±1.09
SpaAudioLM (Ours)  73.36±0.74  63.48±0.59  72.98±0.68  53.57±0.81  54.47±0.72  84.58±0.41  79.57±0.34  55.58±1.04  46.45±1.07
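For readers who want to reproduce this style of reporting, the following is a minimal sketch of two of the metrics above (micro-averaged F1 and exact match) for multi-label predictions, plus a mean ± std summary over runs. It assumes predictions and ground truth are represented as Python sets of label names and that the spread is the sample standard deviation; neither detail is confirmed by this page.

```python
from statistics import mean, stdev


def micro_f1(gold_sets, predicted_sets):
    """Micro-averaged F1: pool TP, FP, and FN counts over all samples and labels."""
    pairs = list(zip(gold_sets, predicted_sets))
    true_pos = sum(len(gold & pred) for gold, pred in pairs)
    false_pos = sum(len(pred - gold) for gold, pred in pairs)
    false_neg = sum(len(gold - pred) for gold, pred in pairs)
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0


def exact_match(gold_sets, predicted_sets):
    """Fraction of samples whose predicted label set equals the gold set exactly."""
    hits = [1.0 if gold == pred else 0.0 for gold, pred in zip(gold_sets, predicted_sets)]
    return mean(hits)


def summarize_runs(scores):
    """Return (mean, sample std) in percent for scores given as fractions."""
    return 100 * mean(scores), 100 * stdev(scores)


gold = [{"siren"}, {"waves", "wind"}, {"dog", "speech"}]
pred = [{"siren"}, {"waves"}, {"dog", "speech", "bird sounds"}]
print(f"F1-Mi = {micro_f1(gold, pred):.2f}, EM = {exact_match(gold, pred):.2f}")
```

The headline gap follows directly from the F1-Mi column: 73.36 − 44.24 = 29.12 percentage points over Gemini 2.5 Pro.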

Reasoning Demos

SpaAudioLM generates Chain-of-Thought reasoning grounded in both audio evidence and geospatial context before predicting sound events.
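To make the pairing of audio and geospatial context more concrete, here is a small sketch of how POI tags like those in the demos below might be grouped and rendered into a text prompt that accompanies the audio. The function names and the prompt wording are hypothetical; the page does not show the authors' actual prompt template.

```python
from collections import defaultdict


def group_poi_tags(tags):
    """Group (category, value) pairs such as ("Highway", "Bus_stop") by category."""
    grouped = defaultdict(list)
    for category, value in tags:
        grouped[category].append(value.replace("_", " ").lower())
    return dict(grouped)


def build_prompt(tags, label_names):
    """Compose a text prompt; the wording is illustrative, not the authors' template."""
    grouped = group_poi_tags(tags)
    poi_lines = [
        f"- {category}: {', '.join(values)}" for category, values in grouped.items()
    ]
    return "\n".join(
        [
            "Nearby points of interest:",
            *poi_lines,
            "Listen to the audio and reason step by step about the sounds and the location,",
            "then list every sound event present, chosen from: "
            + ", ".join(label_names)
            + ".",
        ]
    )


coastal_tags = [("Natural", "Beach"), ("Natural", "Bare_rock"), ("Leisure", "Park")]
print(build_prompt(coastal_tags, ["Waves", "Wind", "Siren"]))
```

Grouping by category keeps the prompt short when a location has many tags of the same kind.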

Urban Safety
GT: Explosion
Pred: Explosion
POI Context
Building: Apartments, Building: Commercial, Highway: Bus_stop, Landuse: Construction, Landuse: Industrial, Leisure: Park
Chain-of-Thought Reasoning

In the recording, the environment is likely a mix of residential and possibly commercial areas. The sound of bird wings flapping suggests the presence of pigeons, common in urban areas with buildings. The most striking sound is the explosion—a sudden, loud, and impactful noise. Given the residential context, this could be fireworks, as indicated by the whistling and squealing sounds characteristic of firecrackers. The presence of a dog barking indicates human activity; the dog's bark could be a reaction to the loud sounds.

Emergency Detection
GT: Siren
Pred: Siren
POI Context
Building: Government, Building: Industrial, Highway: Crossing, Landuse: Cemetery, Railway: Abandoned, Railway: Buffer_stop
Chain-of-Thought Reasoning

The primary sound that stands out is the siren of an emergency vehicle, characterized by a high-pitched, continuous tone that suggests urgency and movement. The vehicle is likely navigating through traffic at speed. Given the context of a location near roads with vehicle traffic, the presence of a siren is not unexpected. The siren's high-pitched, continuous nature helps in identifying it as a siren rather than a car horn, which would typically be louder and more abrupt.

Wildlife Monitoring
GT: Bird sounds, Insects
Pred: Bird sounds, Insects
POI Context
Building: Barn, Highway: Footway, Highway: Path, Landuse: Farmyard, Landuse: Forest, Natural: Water, Leisure: Nature_reserve
Chain-of-Thought Reasoning

The soundscape is dominated by the distinct calls of birds, characterized by melodic chirping and tweeting, suggesting a natural setting where birds are active. Alongside the bird sounds, there is a subtle but persistent buzzing of insects, consistent with the proximity to farmland and forest areas. The spatial context confirms the proximity to vegetation and natural water features, supporting the identification of these biophonic signals.

Coastal Monitoring
GT: Waves, Wind
Pred: Waves, Wind
POI Context
Highway: Cycleway, Landuse: Construction, Natural: Bare_rock, Natural: Beach, Waterway: Drain, Leisure: Park
Chain-of-Thought Reasoning

The audio captures the ambiance of a coastal area, likely a beach or seaside neighborhood. The dominant sounds are the ocean, with waves rhythmically crashing onto the shore. The wind is also present, adding to the natural soundscape typical of such environments. The spatial context near natural beach and water features confirms the presence of these coastal sounds.

Street Performance
GT: Musical instrument, Singing
Pred: Musical instrument, Singing
POI Context
Building: Retail, Highway: Bus_stop, Highway: Crossing, Landuse: Commercial, Railway: Buffer_stop, Railway: Crossing
Chain-of-Thought Reasoning

The audio captures a lively scene characterized by a group of people singing in unison, with a steady beat accompanying them. The atmosphere suggests a choir or a large group participating in a musical performance. The steady beat indicates the accompaniment of musical instruments, possibly percussion. The environment is likely an open space such as a public square near a commercial area, where such gatherings are common.

Assistive Living
GT: Bird sounds, Dog, Speech
Pred: Bird sounds, Dog, Speech
POI Context
Building: Church, Highway: Bus_stop, Highway: Cycleway, Landuse: Forest, Natural: Tree, Natural: Water
Chain-of-Thought Reasoning

The environment is a blend of natural and urban elements. Bird sounds indicate vegetation and possibly a water feature nearby. The bird sounds are consistent and clear, suggesting a stable outdoor environment. The barking of a dog is intermittent but distinct, indicating a domestic animal in the vicinity. Human speech is also audible, indicating social activity in a residential neighborhood near a church and forested area.

BibTeX

@article{hou2025spaaudioLM,
  title={SpaAudioLM: Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding},
  author={Hou, Yuanbo and Yu, Shiran and Zhi, Zhuo},
  year={2025}
}