Overview
Existing environmental sound classification (ESC) methods treat sounds as isolated signals and ignore where they occur. A car horn and a train horn may sound similar, but their geographic context (highway vs. railway station) makes them easy to distinguish. SpaAudioLM bridges this gap by jointly reasoning over audio and geospatial Point-of-Interest (POI) metadata, enabling spatially grounded sound understanding.
We fine-tune Qwen2.5-Omni through a three-phase pipeline: difficulty profiling → Chain-of-Thought supervised fine-tuning (SFT) → difficulty-aware Group Relative Policy Optimization (GRPO) with a composite reward that combines weighted F1, format, and POI consistency terms. The model generates structured reasoning before predicting multi-label sound events across 28 categories.
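The composite reward can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the source: the mixing weights in `WEIGHTS`, the `<think>`/`<answer>` output template, the per-label weighting inside `weighted_f1`, and the keyword-overlap definition of POI consistency. Only the three-term structure (weighted F1 + format + POI consistency) comes from the description above.

```python
import re

# Hypothetical weights: the source names the three terms but not how they are mixed.
WEIGHTS = {"f1": 0.7, "format": 0.1, "poi": 0.2}

# Assumed output template: reasoning in <think>, comma-separated labels in <answer>.
TEMPLATE = re.compile(r"^<think>(.+?)</think>\s*<answer>(.*?)</answer>$", re.DOTALL)


def parse_output(text):
    """Return (reasoning, label_set), or None if the template is not followed."""
    match = TEMPLATE.match(text.strip())
    if match is None:
        return None
    labels = {s.strip().lower() for s in match.group(2).split(",") if s.strip()}
    return match.group(1), labels


def weighted_f1(pred, gold, label_weights=None):
    """F1 over two label sets, scaling each label's contribution by its weight."""
    def weight(label):
        return 1.0 if label_weights is None else label_weights.get(label, 1.0)

    tp = sum(weight(label) for label in pred & gold)
    fp = sum(weight(label) for label in pred - gold)
    fn = sum(weight(label) for label in gold - pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # both sets empty counts as a match


def poi_consistency(reasoning, poi_tags):
    """Assumed proxy: fraction of the given POI tags mentioned in the reasoning."""
    if not poi_tags:
        return 1.0
    text = reasoning.lower()
    return sum(tag.lower() in text for tag in poi_tags) / len(poi_tags)


def composite_reward(output, gold, poi_tags, label_weights=None):
    """Weighted sum of the three terms; malformed outputs receive zero reward."""
    parsed = parse_output(output)
    if parsed is None:
        return 0.0
    reasoning, pred = parsed
    return (
        WEIGHTS["f1"] * weighted_f1(pred, gold, label_weights)
        + WEIGHTS["format"] * 1.0  # format term is earned by matching the template
        + WEIGHTS["poi"] * poi_consistency(reasoning, poi_tags)
    )
```

For example, `composite_reward("<think>Waves crash near the beach.</think><answer>waves, wind</answer>", {"waves", "wind"}, ["beach"])` earns all three terms, while an output that omits the assumed tags earns none.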
Main Results
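Comparison with baseline models on multi-label audio event classification. Results are mean ± std over 5 independent runs. Mi, Ma, and W denote micro, macro, and weighted averaging; EM is exact match; ROC and PR are the areas under the ROC and precision-recall curves.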
| Model | F1-Mi | F1-Ma | F1-W | Jaccard | EM | ROC-Mi | ROC-Ma | PR-Mi | PR-Ma |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio-7B | 4.73±0.43 | 2.86±0.30 | 5.27±0.53 | 1.96±0.19 | 0.00±0.00 | 50.41±0.20 | 49.72±0.21 | 5.72±0.03 | 5.83±0.05 |
| Qwen2.5-Omni-7B | 34.36±0.09 | 25.90±0.24 | 37.35±0.15 | 18.31±0.10 | 9.97±0.15 | 68.93±0.13 | 64.59±0.16 | 15.61±0.07 | 12.98±0.09 |
| Qwen3-Omni-30B | 29.66±0.43 | 20.26±0.82 | 28.80±0.25 | 14.81±0.34 | 14.02±0.23 | 60.97±0.14 | 58.57±0.47 | 13.53±0.29 | 12.27±0.48 |
| GPT-4o Audio | 30.09±0.47 | 26.47±1.21 | 34.07±0.97 | 17.18±0.58 | 9.43±0.91 | 70.26±0.41 | 64.70±0.42 | 13.74±0.29 | 14.86±0.77 |
| Gemini 2.5 Pro | 44.24±0.65 | 40.35±1.04 | 47.65±0.83 | 28.04±0.63 | 15.58±1.04 | 73.04±0.51 | 69.47±0.57 | 22.70±0.55 | 25.29±1.09 |
| SpaAudioLM (Ours) | 73.36±0.74 | 63.48±0.59 | 72.98±0.68 | 53.57±0.81 | 54.47±0.72 | 84.58±0.41 | 79.57±0.34 | 55.58±1.04 | 46.45±1.07 |
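For readers who want to reproduce this style of reporting, the following is a minimal, standard-library sketch of how the F1 variants, exact match, and the mean ± std aggregation are commonly computed for multi-label predictions. It is not the authors' evaluation code. The function names are hypothetical, the use of the sample standard deviation is an assumption, and Jaccard and the AUC metrics are left out because the source does not state their averaging conventions.

```python
from statistics import mean, stdev


def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_scores(y_true, y_pred, labels):
    """Micro, macro, and support-weighted F1 plus exact match over label sets."""
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn] per label
    for gold, pred in zip(y_true, y_pred):
        for label in labels:
            if label in pred and label in gold:
                counts[label][0] += 1
            elif label in pred:
                counts[label][1] += 1
            elif label in gold:
                counts[label][2] += 1

    per_label = {label: f1_from_counts(*c) for label, c in counts.items()}
    support = {label: c[0] + c[2] for label, c in counts.items()}  # gold occurrences
    total_support = sum(support.values())
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))

    return {
        "f1_micro": f1_from_counts(tp, fp, fn),  # pool counts, then compute F1
        "f1_macro": mean(per_label.values()),    # unweighted mean of per-label F1
        "f1_weighted": (
            sum(per_label[l] * support[l] for l in labels) / total_support
            if total_support else 0.0
        ),
        "exact_match": mean(float(g == p) for g, p in zip(y_true, y_pred)),
    }


def mean_std(values):
    """Mean and sample standard deviation across independent runs."""
    return mean(values), stdev(values)


if __name__ == "__main__":
    labels = ["bird", "dog", "siren"]
    y_true = [{"bird", "dog"}, {"siren"}, {"bird"}]
    y_pred = [{"bird"}, {"siren"}, {"bird", "dog"}]
    print(multilabel_scores(y_true, y_pred, labels))
```

Micro averaging pools counts across all labels and so favors frequent classes, whereas macro averaging treats each of the labels equally. This is one reason the two columns can differ noticeably for the same model.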
Reasoning Demos
SpaAudioLM generates Chain-of-Thought reasoning grounded in both audio evidence and geospatial context before predicting sound events.
Chain-of-Thought Reasoning
In the recording, the environment is likely a mix of residential and possibly commercial areas. The sound of bird wings flapping suggests the presence of pigeons, common in urban areas with buildings. The most striking sound is the explosion—a sudden, loud, and impactful noise. Given the residential context, this could be fireworks, as indicated by the whistling and squealing sounds characteristic of firecrackers. The presence of a dog barking indicates human activity; the dog's bark could be a reaction to the loud sounds.
Chain-of-Thought Reasoning
The primary sound that stands out is the siren of an emergency vehicle, characterized by a high-pitched, continuous tone that suggests urgency and movement. The vehicle is likely navigating through traffic at speed. Given the context of a location near roads with vehicle traffic, the presence of a siren is not unexpected. The siren's high-pitched, continuous nature helps in identifying it as a siren rather than a car horn, which would typically be louder and more abrupt.
Chain-of-Thought Reasoning
The soundscape is dominated by the distinct calls of birds, characterized by melodic chirping and tweeting, suggesting a natural setting where birds are active. Alongside the bird sounds, there is a subtle but persistent buzzing of insects, consistent with the proximity to farmland and forest areas. The spatial context confirms the proximity to vegetation and natural water features, supporting the identification of these biophonic signals.
Chain-of-Thought Reasoning
The audio captures the ambiance of a coastal area, likely a beach or seaside neighborhood. The dominant sounds are the ocean, with waves rhythmically crashing onto the shore. The wind is also present, adding to the natural soundscape typical of such environments. The spatial context near natural beach and water features confirms the presence of these coastal sounds.
Chain-of-Thought Reasoning
The audio captures a lively scene characterized by a group of people singing in unison, with a steady beat accompanying them. The atmosphere suggests a choir or a large group participating in a musical performance. The steady beat indicates the accompaniment of musical instruments, possibly percussion. The environment is likely an open space such as a public square near a commercial area, where such gatherings are common.
Chain-of-Thought Reasoning
The environment is a blend of natural and urban elements. Bird sounds indicate vegetation and possibly a water feature nearby. The bird sounds are consistent and clear, suggesting a stable outdoor environment. The barking of a dog is intermittent but distinct, indicating a domestic animal in the vicinity. Human speech is also audible, indicating social activity in a residential neighborhood near a church and forested area.
BibTeX
@article{hou2025spaaudioLM,
title={SpaAudioLM: Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding},
author={Hou, Yuanbo and Yu, Shiran and Zhi, Zhuo},
year={2025}
}