GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities (2024)

Sreyan Ghosh1∗, Sonal Kumar1∗, Ashish Seth1, Chandra Kiran Reddy Evuru1,
Utkarsh Tyagi1, S Sakshi1, Oriol Nieto2, Ramani Duraiswami1, Dinesh Manocha1
1University of Maryland, College Park, USA 2Adobe, USA
{sreyang,sonalkum,dmanocha}@umd.edu
Project: https://sreyan88.github.io/gamaaudio/

Abstract

Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former and a multi-layer aggregator that aggregates features from multiple layers of an audio encoder. We fine-tune GAMA on a large-scale audio-language dataset, which equips it with audio understanding capabilities. Next, we propose CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated instruction-tuning (IT) dataset with instructions that require the model to perform complex reasoning on the input audio. We instruction-tune GAMA with CompA-R to endow it with complex reasoning abilities, where we further add a soft prompt as input, carrying high-level semantic evidence in the form of event tags of the input audio. Finally, we also propose CompA-R-test, a human-labeled evaluation dataset for evaluating the capabilities of LALMs on open-ended audio question answering that requires complex reasoning. Through automated and expert human evaluations, we show that GAMA outperforms all other LALMs in the literature on diverse audio understanding tasks by margins of 1%-84%. Further, GAMA IT-ed on CompA-R proves to be superior in its complex reasoning and instruction-following capabilities.

Co-leads with equal contribution.

1 Introduction

Large Language Models (LLMs) possess impressive abilities to understand and reason about the world through language (Zhao et al., 2023). While spoken language understanding tasks, like automatic speech recognition, have a long history of benefiting from language comprehension with (L)LMs (Watanabe et al., 2018; Hu et al., 2024), the ability to improve the perception and understanding of non-speech sounds and non-verbal speech through language has been less explored (from here on, we refer to these kinds of sounds as “audio” in the paper). Beyond visual and language perception, the ability to understand audio is unarguably important and necessary for autonomous agents to interact with the world.

[Figure 1: figures/gama1.pdf]

Contrastive Language-Audio Pre-training (CLAP) (Elizalde et al., 2023a) was one of the first Audio-Language Models (ALMs) to improve audio understanding through a language interface. Following this, several attempts have been made to improve CLAP and its reasoning abilities (Ghosh et al., 2024b). On the other hand, Deshmukh et al. (2023) propose Pengi, a pre-trained decoder-only LLM coupled with an audio encoder that can solve a wide range of audio tasks by framing them as open-ended text-generation tasks. Similarly, Large Audio Language Models (LALMs) like LTU (Gong et al., 2024) and SALMONN (Tang et al., 2024) follow a similar architecture and attempt to solve audio tasks by empowering the model with instruction-following capabilities (Wei et al., 2022). Specifically, all audio tasks are first framed as instruction-response pairs. The model is then fine-tuned on these pairs to learn audio reasoning and, thereby, instruction following. As an emergent ability, these models also show remarkable capabilities in open-ended question answering by reasoning over the input audio. However, two significant problems still persist: (1) All these models employ simple connection modules between the audio encoder and the language decoder to enable the latter with audio understanding capabilities. This hinders comprehensive multimodal connection and alignment, thereby increasing the risk of hallucinations and leading to suboptimal performance (Liu et al., 2023a). (2) Complex reasoning with LALMs is still under-explored. While these models excel at audio event detection (in various forms like captioning, event classification, etc.) and information-seeking questions (e.g., close-ended audio questions like “How many birds are squawking?”), they fail to provide a faithful response to questions involving complex reasoning like “Identifying the context of laughter and its relationship with the automotive sounds in the recording. Draw a conclusion on the possible scenario occurring.”. We define complex reasoning for LALMs in Section 3.2 and show examples in Fig. 1 and Fig. 4.

Main Contributions. Our primary contributions are as follows:

  • A Novel LALM. We introduce GAMA, an LALM with advanced audio understanding and complex reasoning abilities. To improve audio perception and understanding, we propose integrating an LLM with multiple types of audio features that encode diverse aspects of information about the input audio. Specifically, we couple the output features from an Audio Q-Former and an Audio Spectrogram Transformer (AST) (Gong et al., 2021), where the AST is further equipped with an aggregation module. While the Audio Q-Former possesses impressive semantic generalization capabilities (Li et al., 2023), the AST possesses strong knowledge of surface-level audio properties. Additionally, inspired by the fact that different layers in audio models learn audio information at different scales (Singla et al., 2022), the aggregation module aggregates the features from multiple layers of the AST, which helps encode diverse knowledge. Both representations are passed through MLP layers that project them into the word-embedding space before they are added as a prefix. As a result, GAMA possesses improved audio understanding capabilities by moving away from the simple coupling of audio encoders and linear layers commonly employed as connection modules to align the audio and textual modalities, which generally fails to achieve comprehensive multimodal alignment (Liu et al., 2023a). GAMA is first fine-tuned on a large-scale audio-language corpus, and the resulting model outperforms all other models on standard audio and music understanding benchmarks.

  • A Novel Instruction Tuning Dataset. To endow an LALM with complex reasoning abilities, we propose CompA-R, a dataset synthetically generated with multi-aspect information and human-written in-context examples. Specifically, we prompt GPT-4 to synthesize an instruction-response pair by guiding it with various metadata related to the audio.

  • A Novel Evaluation Dataset. To evaluate an LALM’s complex reasoning abilities, we develop CompA-R-test, a human-labeled benchmark. Specifically, CompA-R-test evaluates an LALM on open-ended AQA that demands complex reasoning over the audio. GAMA-IT (GAMA instruction-tuned on CompA-R) shows significant improvements on CompA-R-test over all other baselines from the literature.

[Figure 2: Overall architecture of GAMA (figures/ltu-main.drawio.pdf)]

2 Related Work

Large Multi-Modal and Audio-Language Models. Prior to the exploration of LLMs as efficient reasoners, encoder-based multi-modal language models, trained to learn a shared space between language and other modalities, have shown great promise. For example, CLAP, inspired by CLIP (Radford et al., 2021) in vision, showed state-of-the-art performance on audio-language tasks like retrieval, zero-shot classification, etc.

LLMs pre-trained at an incredible scale with the next-token prediction objective implicitly compress world knowledge in their parameters (Zhao et al., 2023). These models learn general-purpose representations, which can then be aligned with the desired response characteristics (Zhang et al., 2023). Instruction Tuning (IT), the process of fine-tuning an LLM with instruction-response pairs, has proved to be one of the most popular forms of alignment. Recent work shows that LLMs can also be instruction-tuned for multi-modal alignment. LLaVA (Liu et al., 2024), a pioneering work on multi-modal vision-language alignment, showed that fine-tuning an LLM on visual instruction-response pairs with additional vision features as a prefix can endow the model with visual reasoning and understanding abilities. Several works following LLaVA improve aspects of LVLMs and have achieved impressive performance on several vision-language tasks (Zhang et al., 2024). On the other hand, LALMs like LTU and SALMONN have shown impressive performance on several audio-language tasks by reasoning over the audio. Though these models extensively evaluate several closed- and open-ended tasks, their ability to perform complex reasoning is largely under-explored.

Instruction Tuning and Complex Reasoning. IT-based alignment has also shown significant improvements for LLMs on Natural Language Understanding tasks, unlocking impressive capabilities (Bubeck et al., 2023) and suggesting that fine-tuning is key to building and improving LLM-based agents. Very recently, Xu et al. (2024) and Cui and Wang (2024) showed that well-curated IT data can improve various reasoning capabilities in LLMs, such as logical, mathematical, and complex reasoning. More specifically, IT teaches LLMs better and more effective methods to reason about a problem presented in the input instruction (like step-by-step reasoning (Kojima et al., 2022)).

3 Methodology

In the next sub-sections, we first describe the GAMA architecture and its components in detail, followed by fine-tuning GAMA on audio-language pairs, CompA-R creation, and instruction-tuning GAMA on CompA-R.

3.1 GAMA Architecture

Fig. 2 illustrates the architecture of GAMA. GAMA builds on the same base architecture proposed in prior work (Gong et al., 2024) but introduces several novel components for improving audio perception. More specifically, we feed the pre-trained LLM with features from multiple audio encoders, including a pre-trained Audio Q-Former and a pre-trained AST, which encode diverse audio knowledge. Additionally, unlike prior work, we do not just use the last-layer features of the AST but couple it with a multi-layer aggregator that takes features from multiple layers as input and outputs a feature that is aware of various low-level and high-level properties of the input audio. Finally, to endow the model with effective complex reasoning abilities, we employ the AST again to extract high-level semantic knowledge, i.e., audio event tags, as supplementary information.

3.1.1 Audio Spectrogram Transformer (AST)

The Audio Spectrogram Transformer (AST) was one of the first attempts to model audio signals with a pure Transformer network. We employ an AST model fine-tuned on the AudioSet dataset. AST has been employed as an audio encoder and a feature extractor in a wealth of prior work due to its high informativeness (Gong et al., 2023, 2024). For feature extraction, we drop the audio classification head; the head is used only to predict the event tags that feed the soft prompt (Section 3.1.4).

3.1.2 Audio Q-Former

Motivation. Our primary goal is to integrate GAMA with an audio encoder that possesses strong semantic generalization capabilities for any input audio. Prior work has extensively explored CLAP-style training for learning audio-language encoders. However, other methods and architectures have rarely been explored. As a more powerful alternative, we explore the Q-Former architecture proposed by Li et al. (2023).

Architecture. The architecture of our Audio Q-Former is based on the Querying Transformer proposed by Li et al. (2023), which is initialized from BERT (Devlin et al., 2018) and has Q learnable query tokens. We employ AST as the audio encoder (in place of the ViT-based vision encoder) and keep the rest of the architecture the same. Similar to the original implementation, we train the model in two stages. In the first stage, we optimize three objectives, namely the Audio-Text Matching loss, the Audio-Grounded Text Generation loss, and the Audio-Text Contrastive Learning loss. In the second stage, we employ LLaMA-2-7B as the language decoder and optimize the language-modeling loss. For training, we use 2.5M+ audio-caption pairs (detailed in Section E.2). For architectural details, we refer our readers to Li et al. (2023).
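For illustration, below is a minimal sketch of one of the stage-1 objectives, the Audio-Text Contrastive loss, written in the BLIP-2 style; the function name, projection dimensions, and temperature are our assumptions, and the other two objectives (matching and grounded generation) follow Li et al. (2023) and are omitted.

```python
import torch
import torch.nn.functional as F

def audio_text_contrastive(query_embeds, text_embeds, temperature=0.07):
    """Simplified Audio-Text Contrastive loss (sketch).

    query_embeds: (B, Q, d) projected outputs of the Q query tokens
    text_embeds:  (B, d)    projected text [CLS] embeddings
    As in BLIP-2, the audio-text similarity of a pair is the maximum
    similarity over the Q query tokens.
    """
    q = F.normalize(query_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    sim = torch.einsum("bqd,cd->bqc", q, t).max(dim=1).values / temperature  # (B, B)
    targets = torch.arange(sim.size(0), device=sim.device)
    # Symmetric InfoNCE over audio-to-text and text-to-audio directions.
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))
```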

Training with Caption Augmentation. Additionally, due to the lack of large-scale audio-caption pairs, we adopt a caption-augmentation methodology to augment the existing audios with diverse additional captions. More specifically, we instruct an LLM to generate k rewrites of the original caption. We employ two different prompts that rewrite the input caption with two different objectives:

Prompts. For Prompt 1, our primary aim is that the resultant rewrite should describe each acoustic event in the caption similarly but more vividly. These augmentations help the model learn various distinctive characteristics of the audio concepts corresponding to the acoustic events. For Prompt 2, our primary aim is that the resultant rewrite should describe each acoustic event in the caption differently from the original caption. These augmentations aid the model in understanding the diverse linguistic expressions that can describe a single audio concept. We show examples below (more examples in Table 11):

(1) Original Caption: Someone made a cool vocal for a dubstep track.

(1) Rewritten Caption by Prompt 1: A captivating vocal performance ignites the dubstep track, delivering a hypnotic and enthralling sound that reverberates through the air.

(1) Rewritten Caption by Prompt 2: The dubstep track features a slick, stylish vocal performance that adds a layer of sophistication to its heavy beats and basslines.

(2) Original Caption: Someone eating crisps and talking.

(2) Rewritten Caption by Prompt 1: Crunchy crisps mingle with the sound of a lively conversation, creating a cozy and intimate atmosphere.

(2) Rewritten Caption by Prompt 2: The crunch of crisps and the rustle of papers create a cozy, intimate atmosphere, accompanied by the gentle hum of a conversation.

During training, for each audio sample, we choose the original caption with probability p = 0.4 or one of the rewritten versions (with probability 1 - p), where each rewritten caption has an equal probability of selection. Both prompts are provided in Appendix B. We employ LLaMA-2-13B (Touvron et al., 2023) with human-written in-context examples. We randomly sample 5 in-context examples from a collection of 50.
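A minimal sketch of this sampling rule (the function and argument names are ours, not from the released code):

```python
import random

def sample_training_caption(original: str, rewrites: list[str], p_original: float = 0.4) -> str:
    """Pick the caption used for an audio sample in one training step.

    With probability p_original we keep the original caption; otherwise one
    of the k LLM-generated rewrites is chosen uniformly at random.
    """
    if not rewrites or random.random() < p_original:
        return original
    return random.choice(rewrites)
```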

3.1.3 Multi-Layer Aggregator

Motivation. To extract additional details about the input audio, we devise a multi-layer aggregator that integrates multi-level hidden features of the pre-trained AST. Although AST has a global receptive field in all layers, different layers learn auditory information at different scales (Singla et al., 2022), i.e., the middle layers encode more generic features (e.g., basic sounds, textures), while deeper layers capture high-level concepts (e.g., speech intonations, complex sound patterns). By aggregating these features, the multi-layer aggregator outputs features that encode a more holistic and fine-grained understanding of the audio. Thus, our multi-layer aggregator makes fine-grained auditory knowledge more likely to be learned during training.

Architecture. Our multi-layer aggregator is a transformer-style network consisting of two transformer layers for aggregating the hidden features of the audio encoder. Given the hidden features $A_j$ and $A_k$ from the middle layers of the audio encoder, the aggregation module uses two blocks to sequentially integrate these two features with the last-layer feature $A_i$. Each block $\mathcal{B}$ is composed of self-attention, cross-attention, and a feed-forward network (FFN) arranged sequentially. Finally, the output feature $\bar{A}$ is generated as follows:

$$\bar{A} = \mathcal{B}_2\left(\mathcal{B}_1\left(A_i; A_j\right); A_k\right) \qquad (1)$$
$$\mathcal{B}(X; Y) = \operatorname{FFN}\left(\operatorname{Cross\text{-}Attn}\left(\operatorname{Attn}(X), Y\right)\right) \qquad (2)$$

In practice, we take $A_j$ and $A_k$ from layers j = 4 and k = 8 of the AST as inputs to the multi-layer aggregator.
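A minimal PyTorch sketch of one aggregator block and the two-block composition in Eqs. (1)-(2); the module names and default dimensions are our assumptions rather than the released implementation, and residual connections and layer norms are omitted for brevity:

```python
import torch
import torch.nn as nn

class AggregatorBlock(nn.Module):
    """One block B(X; Y) = FFN(Cross-Attn(Attn(X), Y))."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        x, _ = self.self_attn(x, x, x)    # Attn(X): self-attention over X
        x, _ = self.cross_attn(x, y, y)   # Cross-Attn(., Y): queries from X, keys/values from Y
        return self.ffn(x)                # FFN

class MultiLayerAggregator(nn.Module):
    """A_bar = B2(B1(A_i; A_j); A_k), with A_i the last-layer AST feature
    and A_j, A_k hidden features from intermediate AST layers (j = 4, k = 8)."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        self.block1 = AggregatorBlock(d_model)
        self.block2 = AggregatorBlock(d_model)

    def forward(self, a_i, a_j, a_k):
        return self.block2(self.block1(a_i, a_j), a_k)
```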

3.1.4 Soft Prompt

Motivation. Though models like AST and the Audio Q-Former have shown much promise in audio tasks, a major problem still exists: real-world audio generally contains multiple, overlapping acoustic events, and understanding all such events from model features proves to be inherently complex (Ghosh et al., 2024b). This eventually leads to sub-optimal performance for complex reasoning, where explicit knowledge of the plausible acoustic events in the audio can improve model responses. Thus, to improve fine-grained audio perception capabilities, we augment GAMA with a high-level semantic understanding of the input audio. To do this, we employ an off-the-shelf audio model to extract high-level semantic knowledge, i.e., audio event tags, as supplementary information. However, as audio event classification is not a solved problem, errors in tag predictions are inevitable. Thus, to mitigate the potential adverse effects of inaccurate predictions, we take inspiration from prompt tuning and introduce a soft-prompting technique that enables the model to utilize the embedded tags within the instructions adaptively.

Architecture. Fig. 2 shows an example of how we design our soft prompt together with an instruction. Specifically, we construct a fixed instruction template in which we add the audio event tags along with the soft prompt, where the soft prompt is a trainable vector. In contrast to standard prompt tuning, where model activations are generally steered towards completing the task for which the prompt is optimized, in our version the direction is specified by a tailored input sentence, “According to <hint>, you are allowed to use or partially use the following tags:”, where “<hint>” is replaced by the soft prompt. This design allows the model to select valuable information from the tags adaptively rather than serving a specific task, as in standard prompt tuning methods. We employ the soft prompt only in the instruction-tuning stage for complex reasoning and not in the fine-tuning stage. We provide a rationale in Appendix C.1.
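A minimal sketch of how such a template could be assembled at the embedding level, assuming a HuggingFace-style tokenizer and access to the LLM's word-embedding table; the class name, argument names, and initialization scale are our assumptions:

```python
import torch
import torch.nn as nn

class SoftPromptedInstruction(nn.Module):
    """Builds the instruction embeddings with a trainable <hint> vector (sketch).

    The template is fixed; only the soft-prompt embedding is learned, so the
    model can decide how much to trust the (possibly noisy) AST event tags.
    """

    def __init__(self, embed_tokens: nn.Embedding, n_soft_tokens: int = 1):
        super().__init__()
        self.embed_tokens = embed_tokens          # the LLM's word-embedding table
        d_model = embed_tokens.embedding_dim
        self.soft_prompt = nn.Parameter(torch.randn(n_soft_tokens, d_model) * 0.02)

    def forward(self, tokenizer, event_tags: list[str], instruction: str) -> torch.Tensor:
        prefix = "According to "
        suffix = (", you are allowed to use or partially use the following tags: "
                  + ", ".join(event_tags) + ". " + instruction)
        ids = lambda s: torch.tensor(tokenizer.encode(s, add_special_tokens=False))
        pre = self.embed_tokens(ids(prefix))      # embeddings before <hint>
        post = self.embed_tokens(ids(suffix))     # embeddings after <hint>
        # Replace <hint> with the trainable soft-prompt vector(s).
        return torch.cat([pre, self.soft_prompt, post], dim=0)
```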

3.1.5 Connection Module

We employ multi-layer perceptrons (MLPs) to project the audio features into the word-embedding space. Each feature type is passed through a separate MLP before being prepended as a prefix to the word embeddings of the text instruction prompt.
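A minimal PyTorch sketch of this connection module; the two-layer MLP shape, activation, and dimensions are illustrative assumptions rather than the released configuration:

```python
import torch
import torch.nn as nn

class ConnectionModule(nn.Module):
    """Projects each audio representation into the LLM word-embedding space
    and prepends the results to the text embeddings (sketch)."""

    def __init__(self, d_qformer: int, d_ast: int, d_llm: int):
        super().__init__()
        self.proj_qformer = nn.Sequential(nn.Linear(d_qformer, d_llm), nn.GELU(),
                                          nn.Linear(d_llm, d_llm))
        self.proj_ast = nn.Sequential(nn.Linear(d_ast, d_llm), nn.GELU(),
                                      nn.Linear(d_llm, d_llm))

    def forward(self, qformer_feats, aggregated_ast_feats, text_embeds):
        # Each feature type goes through its own MLP before becoming part of the prefix.
        prefix = torch.cat([self.proj_qformer(qformer_feats),
                            self.proj_ast(aggregated_ast_feats)], dim=1)
        return torch.cat([prefix, text_embeds], dim=1)  # (B, prefix_len + text_len, d_llm)
```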

3.2 CompA-R

Motivation. We define complex reasoning as the capability of an LALM to understand the input audio and every individual acoustic event in it, and to reason about the scene in which the audio might have occurred, such that it can infer nuanced relationships between the events and their underlying contexts, thereby enabling it to draw sophisticated conclusions. We design CompA-R with the primary goal of endowing LALMs with complex reasoning abilities. We are motivated by the primary finding that current SOTA LALMs only perform well on prompts that require describing the audio (e.g., “Describe the audio”) or reasoning-based prompts where identifying the acoustic events present in the audio suffices for a faithful response (e.g., “What type of video can this audio be used for dubbing?”). However, when posed with complex reasoning questions, these models often hallucinate or fail to provide a faithful response (see Fig. 4). Inspired by a wealth of prior work showing that IT on well-curated datasets can align model behaviors for the execution of novel skills like reasoning and complex problem solving (Xu et al., 2024), we propose a systematic multi-stage pipeline to synthesize instruction-response pairs for CompA-R. CompA-R trains a model to engage in complex reasoning by querying it with instructions that cannot be answered by identifying individual audio events alone and instead require analyzing each event and its context in relation to other scene elements and world knowledge.

Synthesis Pipeline. We employ the AudioSet-strong subset to synthesize CompA-R. Our data synthesis pipeline consists of 3 stages (a sketch follows below): i) Caption Generation. To generate a caption that is aware of both the audio and the visual scene, we feed GPT-4 with multiple types of information about the audio and its corresponding video. These include a caption of the middle frame of the video generated using BLIP-2 (Li et al., 2023), objects in the frame identified using Grounding DINO (Liu et al., 2023c), image labels for the frame using the ImageNet (Deng et al., 2009) ontology obtained from CLIP, environment context using PlaceCNN (Zhou et al., 2017), a caption of the audio obtained using RECAP (Ghosh et al., 2024a), and audio event tags using the AudioSet ontology obtained from AST. Finally, we prompt GPT-4 to aggregate these descriptions into a comprehensive caption. ii) Dataset Synthesis. We pass the generated caption together with the ground-truth acoustic event information and their corresponding time slices to GPT-4. We prompt GPT-4 with 3 human-written exemplars (randomly sampled from a pool of 50 exemplars) to synthesize an instruction-response pair. The exemplars and prompt are designed such that the synthesized instructions demand complex reasoning. We synthesize a total of 25,000 instruction-response pairs. iii) Human Verification. We discard instructions affected by unintended noise and hallucinations. We, the authors of this paper, manually verify a subset of CompA-R corresponding to 500 unique audios to create the test set, i.e., CompA-R-test. The remainder of the synthesized dataset is used as the training set. We describe the process and annotation details further in Appendix G.1. This finally leads to 200,234 unique pairs for training and 1,561 for testing.
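A compressed sketch of stages (i) and (ii), assuming the per-modality metadata has already been extracted with the models listed above; the prompt wording here is illustrative (the actual prompts are in Appendix B), and the OpenAI chat API is used for the GPT-4 calls:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def build_caption(metadata: dict) -> str:
    """Stage (i): aggregate the per-modality descriptions into one comprehensive caption."""
    prompt = (
        "Aggregate the following descriptions of a video and its audio into a "
        "single comprehensive caption:\n"
        f"Frame caption (BLIP-2): {metadata['frame_caption']}\n"
        f"Detected objects (Grounding DINO): {', '.join(metadata['objects'])}\n"
        f"Image labels (CLIP/ImageNet): {', '.join(metadata['image_labels'])}\n"
        f"Scene (PlaceCNN): {metadata['scene']}\n"
        f"Audio caption (RECAP): {metadata['audio_caption']}\n"
        f"Audio event tags (AST/AudioSet): {', '.join(metadata['audio_tags'])}"
    )
    out = client.chat.completions.create(model="gpt-4",
                                         messages=[{"role": "user", "content": prompt}])
    return out.choices[0].message.content

def synthesize_pair(caption: str, events_with_times: list[str], exemplars: list[str]) -> str:
    """Stage (ii): prompt GPT-4 with 3 human-written exemplars to produce an
    instruction-response pair that demands complex reasoning."""
    prompt = ("Given the audio description and time-stamped events, write an "
              "instruction-response pair that requires complex reasoning.\n"
              "Examples:\n" + "\n".join(exemplars) +
              f"\nDescription: {caption}\nEvents: {'; '.join(events_with_times)}")
    out = client.chat.completions.create(model="gpt-4",
                                         messages=[{"role": "user", "content": prompt}])
    return out.choices[0].message.content
# Stage (iii), human verification of a subset, is manual and not shown here.
```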

[Figure 3: figures/ltu-Page2.drawio4.pdf]

3.3 Training

Fine-tuning. We fine-tune GAMA on the OpenAQA training set released by Gong et al. (2024). We use a fraction of all the instances due to the unavailability of the entire AudioSet and resource constraints. Dataset details are provided in Appendix H.1. Additionally, we augment OpenAQA with 4 more datasets, including MusicCaps, MusicQA, NSynth, and Magna, to improve GAMA's music understanding capabilities. For fine-tuning, we follow the exact same 4-stage method proposed by Gong et al. (2024), where all parameters of all encoders are trainable and only the LoRA modules of the LLM are trained. We refer our readers to Gong et al. (2024) for more details.
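A hedged sketch of this trainability split using the peft library; the LoRA rank, alpha, dropout, target modules, and model identifier shown here are illustrative assumptions, not the configuration of Gong et al. (2024):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Wrap the LLM with LoRA adapters; only the adapters receive gradients.
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
llm = get_peft_model(llm, lora_cfg)   # base LLM weights stay frozen

# The audio encoders (AST, Audio Q-Former, aggregator, connection MLPs)
# remain fully trainable, e.g.:
# for p in audio_encoder.parameters():
#     p.requires_grad = True
```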

Table 1: Comparison of GAMA with baselines on audio and music classification, captioning, and AQA benchmarks.

Audio-language encoder-based models. They are generalizable to unseen labels, but a pre-defined label set is required for inference.

| Model | ESC50# (Acc) | DCASE# (Mi-F1) | VS (Acc) | TUT (Acc) | BJO (Acc) | VGG (Acc) | FSD (mAP) | NS-ins. (Acc) | NS-src. (Acc) | GTZAN (Acc) | MSD (Acc) | AudioSet (mAP) | Classif. Avg. | AudioCaps (SPICE) | Clotho (SPICE) | Cap. Avg. | ClothoAQA (Acc) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AudioCLIP | 69.4 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| CLAP (Elizalde et al., 2023a) | 82.6 | 30.0 | 48.4 | 29.6 | 47.5 | 24.0 | 30.2 | 22.7 | 16.4 | 25.0 | 44.0 | 5.8 | 29.4 | - | - | - | - |
| CLAP (Wu* et al., 2023a) | 89.1 | 31.3 | 47.1 | 35.6 | 48.0 | 26.3 | 30.8 | 25.2 | 18.9 | 26.3 | 46.9 | 6.2 | 36.0 | - | - | - | - |
| CompA-CLAP | 90.1 | 30.6 | 49.5 | 35.8 | 48.2 | 29.5 | 31.5 | 24.9 | 17.0 | 26.1 | 46.2 | 6.2 | 36.3 | - | - | - | - |

Audio-language generation-based models. They directly output label names and do not need a pre-defined label set at inference.

| Model | ESC50# (Acc) | DCASE# (Mi-F1) | VS (Acc) | TUT (Acc) | BJO (Acc) | VGG (Acc) | FSD (mAP) | NS-ins. (Acc) | NS-src. (Acc) | GTZAN (Acc) | MSD (Acc) | AudioSet (mAP) | Classif. Avg. | AudioCaps (SPICE) | Clotho (SPICE) | Cap. Avg. | ClothoAQA (Acc) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-Audio-Chat | 71.7 | 32.4 | 74.2 | 16.9 | 50.8 | 17.5 | 39.8 | 30.2 | 41.3 | 41.6 | 69.1 | 13.4 | 41.1 | 14.7 | 9.8 | 12.3 | 32.3 |
| LTU | 81.7 | 37.5 | 53.3 | 19.9 | 67.8 | 50.3 | 43.9 | 28.0 | 41.8 | 9.9 | 74.2 | 18.3 | 42.4 | 16.9 | 11.7 | 15.8 | 25.1 |
| SALMONN | 16.4 | 18.0 | 16.9 | 7.8 | 25.0 | 23.3 | 22.1 | 16.2 | 33.7 | 10.1 | 28.8 | 13.4 | 17.9 | 8.3 | 7.6 | 8.0 | 23.1 |
| Pengi | 80.8 | 29.6 | 46.4 | 18.4 | 47.3 | 16.6 | 35.8 | 39.2 | 46.0 | 11.9 | 93.0 | 11.5 | 39.7 | 12.7 | 7.0 | 9.9 | 63.6 |
| AudioGPT | 41.3 | 20.9 | 35.8 | 14.9 | 21.6 | 5.6 | 18.8 | 40.9 | 15.6 | 11.9 | 28.5 | 12.7 | 22.4 | 6.9 | 6.2 | 6.6 | 33.4 |
| GAMA (ours) | 82.6 | 38.4 | 52.4 | 21.5 | 69.5 | 52.2 | 47.8 | 63.9 | 99.5 | 13.8 | 85.6 | 19.2 | 53.9 | 18.5 | 13.5 | 16.0 | 71.6 |
| – w/o AST & Aggregator | 80.5 | 36.9 | 51.6 | 19.2 | 66.2 | 50.8 | 45.3 | 62.4 | 89.6 | 11.6 | 83.2 | 17.3 | 51.2 | 17.2 | 12.4 | 14.8 | 68.3 |
| – w/ Last Layer Features | 81.3 | 37.6 | 50.2 | 20.4 | 68.2 | 51.7 | 45.8 | 62.6 | 92.3 | 11.2 | 81.5 | 18.1 | 51.7 | 17.7 | 12.8 | 15.3 | 69.5 |
| – w/o Audio Q-Former | 79.7 | 37.4 | 51.3 | 20.2 | 68.0 | 51.6 | 46.4 | 60.1 | 90.4 | 11.6 | 79.8 | 18.4 | 51.2 | 16.9 | 11.9 | 14.4 | 61.2 |
| – w/ CLAP | 81.8 | 38.4 | 52.2 | 21.6 | 69.1 | 52.0 | 47.5 | 58.8 | 99.5 | 12.4 | 77.9 | 19.0 | 52.5 | 17.2 | 13.1 | 15.1 | 66.4 |

Table 2: Comparison on CompA-R-test (scores shown as GPT-4 / Human), OpenAQA, and dense captioning.

| Model | CompA-R-test Clarity | CompA-R-test Correctness | CompA-R-test Engagement | CompA-R-test Avg. | OpenAQA Clarity | OpenAQA Correctness | OpenAQA Engagement | OpenAQA Avg. | Dense Cap. AudioCaps | Dense Cap. Clotho | Dense Cap. Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-Audio-Chat | 3.5 / 3.4 | 3.3 / 3.4 | 3.6 / 3.7 | 3.5 / 3.5 | 3.6 | 3.6 | 3.5 | 3.6 | 3.8 | 3.6 | 3.7 |
| LTU | 3.5 / 4.0 | 3.2 / 3.3 | 3.4 / 3.5 | 3.4 / 3.6 | 3.5 | 3.7 | 3.5 | 3.6 | 3.5 | 3.6 | 3.5 |
| SALMONN | 2.6 / 2.8 | 2.4 / 2.3 | 2.0 / 2.2 | 2.3 / 2.4 | 2.4 | 2.5 | 2.7 | 2.5 | 2.8 | 3.1 | 2.9 |
| Pengi | 1.8 / 1.6 | 1.5 / 1.4 | 1.3 / 1.2 | 1.5 / 1.4 | 1.7 | 1.5 | 1.4 | 1.5 | 2.6 | 2.8 | 2.7 |
| AudioGPT | 1.3 / 1.4 | 1.6 / 1.5 | 1.4 / 1.7 | 1.4 / 1.5 | 1.6 | 1.5 | 1.5 | 1.5 | 2.7 | 2.9 | 2.8 |
| LTU w/ CompA-R | 3.5 / 4.0 | 3.2 / 3.3 | 3.4 / 3.5 | 3.6 / 3.6 | 3.5 | 3.7 | 3.5 | 3.6 | 3.7 | 3.8 | 3.8 |
| GAMA-IT (ours) | 4.3 / 4.5 | 3.9 / 4.1 | 3.9 / 4.3 | 4.0 / 4.3 | 4.0 | 4.2 | 3.8 | 4.0 | 4.3 | 4.1 | 4.2 |
| – w/o Soft Prompt | 4.1 / 4.2 | 3.7 / 3.8 | 3.6 / 3.4 | 3.8 / 3.8 | 3.9 | 3.8 | 3.7 | 3.8 | 4.1 | 3.9 | 4.0 |
| – w/o Aggregator | 4.0 / 4.2 | 3.5 / 3.5 | 3.6 / 3.5 | 3.7 / 3.7 | 3.7 | 3.7 | 3.5 | 3.6 | 3.7 | 3.8 | 3.8 |
| – w/o Audio Q-Former | 3.8 / 3.7 | 3.4 / 3.6 | 3.5 / 3.3 | 3.6 / 3.5 | 3.4 | 3.9 | 3.5 | 3.6 | 3.7 | 3.5 | 3.6 |

Instruction Tuning on CompA-R. Post fine-tuning, we instruction-tune GAMA on CompA-R to endow it with complex reasoning abilities. Following common conventions (Liu et al., 2023b), we fine-tune only the LoRA modules. We call the instruction-tuned GAMA GAMA-IT. Although fine-tuning on AQA also endows GAMA with instruction-following capabilities, CompA-R differs in the nature of its training instances (and thereby the capabilities it endows), and thus we differentiate with such a naming convention for ease of reading.

[Figure 4: Qualitative comparison of GAMA-IT with other LALMs on instances from CompA-R-test (figures/gama-it-compressed.pdf)]

3.4 Experimental Setup

Hyper-parameters. For the fine-tuning stage, we follow the exact same hyper-parameter setup proposed by Gong et al. (2024). However, we scale down our batch sizes to 4, 2, 2, and 2 (due to compute constraints) with an effective batch size of 256 in all stages. For instruction tuning, we employ a batch size of 2, an effective batch size of 256, and a learning rate of 1e-4. For both training and evaluation, we sample audio at 16 kHz.
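Since the per-device batch sizes are much smaller than the effective batch size, the gap is closed with gradient accumulation (and data parallelism); below is a minimal helper to compute the required accumulation steps, with the GPU count treated as an assumption since it is not stated for this stage:

```python
def grad_accum_steps(effective_batch: int, per_device_batch: int, num_gpus: int = 1) -> int:
    """Gradient-accumulation steps needed to reach the effective batch size.

    E.g. an effective batch of 256 with a per-device batch of 2 on a single
    GPU requires 128 accumulation steps (16 steps if 8 GPUs are used).
    """
    assert effective_batch % (per_device_batch * num_gpus) == 0
    return effective_batch // (per_device_batch * num_gpus)
```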

Baselines. We compare GAMA with i) generation-based LALMs: LTU, Qwen-Audio, SALMONN, Pengi, and AudioGPT. We only employ the original checkpoints open-sourced by the authors and do not re-train the models due to compute constraints (except LTU, which we retrain on our version of OpenAQA with the same batch size as GAMA and with LLaMA-2 as the LLM). We do not compare with Audio Flamingo (Kong et al., 2024), as its checkpoint was not available at the time of writing, and compute constraints prevent us from training it from scratch. ii) audio-language encoders: CLAP by Wu* et al. (2023b) and Elizalde et al. (2023b), CompA-CLAP (Ghosh et al., 2024b), AudioCLIP (Guzhov et al., 2021), and the Audio Q-Former. For dense captioning and close- and open-ended AQA, we evaluate using GAMA-IT. For all other tasks, we evaluate using only the fine-tuned version of GAMA (rationale in Appendix C).

Evaluation Datasets and Metrics. The evaluation metrics used for all evaluation datasets are listed in Table 2, and detailed statistics for each dataset are given in Section H.2. For classification, zero-shot evaluation refers to datasets that GAMA has never seen during training; weak zero-shot evaluation refers to datasets that GAMA has not seen during training but that are sourced from the same projects as parts of its training data; and seen datasets refer to datasets GAMA has been trained on. Similar to Deshmukh et al. (2023) and Gong et al. (2024), we first caption the audio and retrieve the most similar label using Sentence-BERT. We employ either accuracy (Acc), Micro-F1 (Mi-F1), or Mean Average Precision (mAP) for classification evaluation. For captioning, we also propose dense captioning, which evaluates a model's capability to identify every event in the audio and the context of its occurrence with respect to the other events in the audio (more in Section 4). For evaluation, we randomly select a subset of 500 samples from AudioCaps and Clotho. We also employ human evaluation for OpenAQA, CompA-R-test, and dense captioning; we ask human annotators to score each response on a scale of 1-5 and report the score averaged across the 3 annotators. More details on the recruitment and background of annotators can be found in Appendix D. Finally, because human evaluation is prohibitively expensive, we also propose an automated evaluation methodology for complex open-ended AQA on CompA-R-test: we evaluate model responses using text-only GPT-4, providing it with the audio caption generated in Section 3.2 and the gold-standard audio events with timestamps (prompt in Appendix B).
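A minimal sketch of the caption-to-label retrieval used for classification scoring; the specific Sentence-BERT checkpoint named here is an assumption, since the exact model is not stated in this section:

```python
from sentence_transformers import SentenceTransformer, util

def retrieve_label(caption: str, label_set: list[str],
                   model_name: str = "all-MiniLM-L6-v2") -> str:
    """Map a generated audio caption to the closest label in a dataset's label set.

    The caption and every candidate label are embedded with a Sentence-BERT
    model, and the label with the highest cosine similarity is returned.
    """
    model = SentenceTransformer(model_name)
    cap_emb = model.encode(caption, convert_to_tensor=True)
    label_embs = model.encode(label_set, convert_to_tensor=True)
    scores = util.cos_sim(cap_emb, label_embs)[0]   # (num_labels,)
    return label_set[int(scores.argmax())]
```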

4 Results and Analysis

Quantitative Results. Table 1 compares GAMA with other baselines on classification and captioning tasks. For zero-shot classification evaluation on VocalSound (VS) (Gong et al., 2022), TUT 2017 (TUT) (Mesaros et al., 2018), Beijing Opera (BJO) (Tian et al., 2014), GTZAN (GTZ) (Park et al., 2022), and Medley-solos-DB (MDB) (Lostanlen et al., 2018), GAMA outperforms our baselines by 2%-67%. For weak zero-shot evaluation on ESC-50 (Piczak, 2015) and DCASE2017 Task 4 (DCASE) (Mesaros et al., 2017), GAMA outperforms our baselines by 1%-66%. Finally, for in-domain evaluation on VGGSound (VGG) (Chen et al., 2020), FSD50K (FSD) (Fonseca et al., 2021), AudioSet (AS) (Gemmeke et al., 2017), and NSynth (NS) (Engel et al., 2017), GAMA outperforms our baselines by 1%-84%. GAMA sees the steepest drop in performance when the AST and Aggregator are removed (i.e., when only the Audio Q-Former is employed).

Table 2 compares GAMA with other baselines on AQA (open-ended and complex open-ended) and dense captioning. GAMA outperforms all our baselines in all settings, showing absolute improvements of 4%-50% on OpenAQA, 8%-58% on CompA-R-test, and 8%-30% on dense captioning. Similar to the tasks in Table 1, performance suffers the most without the Audio Q-Former (i.e., when only the AST and Aggregator are employed). The Audio Q-Former proves to be especially effective (over employing CLAP) for AQA.

Qualitative Results. Fig. 4 compares GAMA-IT against other LALMs from the literature on instances from CompA-R-test. All compared models possess audio chat or open-ended AQA capabilities by default. GAMA-IT provides more faithful responses that are both correct and preferred more by humans. We provide additional comparisons in Figs. 8, 9, 10, 11, and 12, and on our demo page, where we also show comparisons of dense captioning.

5 Conclusion

In this paper, we propose GAMA, an LALM with improved audio perception abilities. We integrate an LLM with multiple types of audio representations, which are responsible for providing diverse knowledge about the input audio. GAMA fine-tuned on a mixture of open-source datasets outperforms prior audio-language models by significant margins on 16 datasets spanning 4 tasks. Next, we propose CompA-R, an instruction-tuning dataset that we synthesize using a robust pipeline for endowing an LALM with complex reasoning abilities. GAMA IT-ed on CompA-R outperforms baselines on complex open-ended AQA and dense captioning.

Limitations and Future Work

GAMA and our experimental setup have several limitations, including:

  • For the scope of our experiments, we do not evaluate and compare music understanding extensively, as we do not train GAMA on diverse and large-scale music datasets. We acknowledge that it is possible to employ the GAMA architecture for comprehensive music understanding if trained on large-scale music understanding datasets. As part of future work, we plan to release a music-only version of GAMA, similar to Gardner et al. (2024).

  • We do not employ larger LLMs, for example, the 13B versions of the LLaMA family, similar to Tang et al. (2024) and Gong et al. (2024), due to compute constraints.

  • The audio-encoder(s) in GAMA have more parameters than in our baselines. However, we also acknowledge that this adds to only a fraction of the total parameter count of the LALM.

References

  • BBC (2018)2018.A dump of BBC’s sound effects library.This dump was created using the script found at https://github.com/FThompson/BBCSoundDownloader. Identifier: BBCSoundEffectsComplete.
  • sou (2023)2023.SoundBible - Free Sound Clips, Sound Bites, and Sound Effects.Accessed: 25 September 2023.
  • Agostinelli et al. (2023) Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank. 2023. MusicLM: Generating music from text.
  • Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
  • Chen etal. (2020)Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. 2020.Vggsound: A large-scale audio-visual dataset.
  • Chu etal. (2023)Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. 2023.Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919.
  • Cui and Wang (2024)Wanyun Cui and Qianle Wang. 2024.Ada-instruct: Adapting instruction generators for complex reasoning.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.
  • Deshmukh etal. (2023)Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang. 2023.Pengi: An audio language model for audio tasks.
  • Deshmukh etal. (2022)Soham Deshmukh, Benjamin Elizalde, and Huaming Wang. 2022.Audio retrieval with wavtext5k and clap training.arXiv preprint arXiv:2209.14275.
  • Devlin etal. (2018)Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018.Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805.
  • Drossos etal. (2020)Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. 2020.Clotho: An audio captioning dataset.In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 736–740. IEEE.
  • Elizalde etal. (2023a)Benjamin Elizalde, Soham Deshmukh, Mahmoud AlIsmail, and Huaming Wang. 2023a.Clap learning audio concepts from natural language supervision.In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
  • Elizalde etal. (2023b)Benjamin Elizalde, Soham Deshmukh, MahmoudAl Ismail, and Huaming Wang. 2023b.Clap learning audio concepts from natural language supervision.In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5.
  • Engel etal. (2017)Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. 2017.Neural audio synthesis of musical notes with wavenet autoencoders.In International Conference on Machine Learning, pages 1068–1077. PMLR.
  • Fonseca etal. (2021)Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. 2021.Fsd50k: an open dataset of human-labeled sound events.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:829–852.
  • Fonseca etal. (2022)Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. 2022.Fsd50k: An open dataset of human-labeled sound events.
  • Gardner etal. (2024)JoshuaP Gardner, Simon Durand, Daniel Stoller, and RachelM Bittner. 2024.LLark: A multimodal foundation model for music.
  • Gemmeke et al. (2017) Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780. IEEE.
  • Ghosh et al. (2024a) Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, and Dinesh Manocha. 2024a. RECAP: Retrieval-augmented audio captioning. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1161–1165.
  • Ghosh et al. (2024b) Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Reddy Evuru, Ramaneswaran S, S Sakshi, Oriol Nieto, Ramani Duraiswami, and Dinesh Manocha. 2024b. CompA: Addressing the gap in compositional reasoning in audio-language models. In The Twelfth International Conference on Learning Representations.
  • Gong etal. (2021)Yuan Gong, Yu-An Chung, and James Glass. 2021.Ast: Audio spectrogram transformer.arXiv preprint arXiv:2104.01778.
  • Gong etal. (2024)Yuan Gong, Hongyin Luo, AlexanderH. Liu, Leonid Karlinsky, and JamesR. Glass. 2024.Listen, think, and understand.In The Twelfth International Conference on Learning Representations.
  • Gong etal. (2023)Yuan Gong, Andrew Rouditchenko, AlexanderH. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and JamesR. Glass. 2023.Contrastive audio-visual masked autoencoder.In The Eleventh International Conference on Learning Representations.
  • Gong etal. (2022)Yuan Gong, Jin Yu, and James Glass. 2022.Vocalsound: A dataset for improving human vocal sounds recognition.In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 151–155. IEEE.
  • Gudibande etal. (2023)Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. 2023.The false promise of imitating proprietary llms.arXiv preprint arXiv:2305.15717.
  • Guzhov etal. (2022)Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel. 2022.Audioclip: Extending clip to image, text and audio.In ICASSP 2022.
  • Guzhov etal. (2021)Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel. 2021.Audioclip: Extending clip to image, text and audio.
  • Hu et al. (2024) Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, and Eng Siong Chng. 2024. Large language models are efficient learners of noise-robust speech recognition. In The Twelfth International Conference on Learning Representations.
  • Huang etal. (2024)Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, etal. 2024.Audiogpt: Understanding and generating speech, music, sound, and talking head.In Proceedings of the AAAI Conference on Artificial Intelligence, volume38, pages 23802–23804.
  • Kim etal. (2019)ChrisDongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. 2019.Audiocaps: Generating captions for audios in the wild.In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 119–132.
  • Kojima etal. (2022)Takeshi Kojima, ShixiangShane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022.Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213.
  • Kong etal. (2024)Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. 2024.Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities.
  • Li etal. (2023)Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023.Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.In International conference on machine learning, pages 19730–19742. PMLR.
  • Lipping etal. (2022)Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, and Tuomas Virtanen. 2022.Clotho-aqa: A crowdsourced dataset for audio question answering.In 2022 30th European Signal Processing Conference (EUSIPCO), pages 1140–1144. IEEE.
  • Liu etal. (2023a)Haotian Liu, Chunyuan Li, Yuheng Li, and YongJae Lee. 2023a.Improved baselines with visual instruction tuning.arXiv preprint arXiv:2310.03744.
  • Liu etal. (2023b)Haotian Liu, Chunyuan Li, Qingyang Wu, and YongJae Lee. 2023b.Visual instruction tuning.
  • Liu etal. (2024)Haotian Liu, Chunyuan Li, Qingyang Wu, and YongJae Lee. 2024.Visual instruction tuning.Advances in neural information processing systems, 36.
  • Liu etal. (2023c)Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, etal. 2023c.Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499.
  • Lostanlen etal. (2018)Vincent Lostanlen, Carmine-Emanuele Cella, Rachel Bittner, and Slim Essid. 2018.Medley-solos-db: a crosscollection dataset for musical instrument recognition.Zenodo.
  • Lostanlen etal. (2019)Vincent Lostanlen, Carmine-Emanuele Cella, Rachel Bittner, and Slim Essid. 2019.Medley-solos-DB: a cross-collection dataset for musical instrument recognition.
  • Mesaros etal. (2017)Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Benjamin Elizalde, Ankit Shah, Emmanuel Vincent, Bhiksha Raj, and Tuomas Virtanen. 2017.Dcase 2017 challenge setup: Tasks, datasets and baseline system.In DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events.
  • Mesaros etal. (2018)Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. 2018.A multi-device dataset for urban acoustic scene classification.arXiv preprint arXiv:1807.09840.
  • Morato and Mesaros (2021)IreneMartin Morato and Annamaria Mesaros. 2021.Macs - multi-annotator captioned soundscapes.
  • Park etal. (2022)Junwoo Park, Youngwoo Cho, Gyuhyeon Sim, Hojoon Lee, and Jaegul Choo. 2022.Enemy spotted: in-game gun sound dataset for gunshot classification and localization.In 2022 IEEE Conference on Games (CoG), pages 56–63. IEEE.
  • Piczak (2015)KarolJ Piczak. 2015.Esc: Dataset for environmental sound classification.In Proceedings of the 23rd ACM international conference on Multimedia, pages 1015–1018.
  • Radford etal. (2021)Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etal. 2021.Learning transferable visual models from natural language supervision.In International conference on machine learning, pages 8748–8763. PMLR.
  • Singla etal. (2022)YamanKumar Singla, Jui Shah, Changyou Chen, and RajivRatn Shah. 2022.What do audio transformers hear? probing their representations for language delivery & structure.In 2022 IEEE International Conference on Data Mining Workshops (ICDMW), pages 910–925. IEEE.
  • Sonniss Limited (2022)Sonniss Limited. 2022.Sonniss Game Audio.Registered in England, UK. Company number: 09377364. Accessed: 25 September 2023.
  • Tang et al. (2024) Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. 2024. SALMONN: Towards generic hearing abilities for large language models. In The Twelfth International Conference on Learning Representations.
  • Tian etal. (2014)MiTian, Ajay Srinivasamurthy, Mark Sandler, and Xavier Serra. 2014.A study of instrument-wise onset detection in beijing opera percussion ensembles.In 2014 ieee international conference on acoustics, speech and signal processing (icassp), pages 2159–2163. IEEE.
  • Touvron etal. (2023)Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, CristianCanton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, PunitSingh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, EricMichael Smith, Ranjan Subramanian, XiaoqingEllen Tan, Binh Tang, Ross Taylor, Adina Williams, JianXiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and ThomasScialom. 2023.Llama 2: Open foundation and fine-tuned chat models.
  • Tzanetakis etal. (2001)George Tzanetakis, Georg Essl, and Perry Cook. 2001.Automatic musical genre classification of audio signals.
  • Watanabe et al. (2018) Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. 2018. ESPnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015.
  • Wei etal. (2022)Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, AdamsWei Yu, Brian Lester, Nan Du, AndrewM. Dai, and QuocV Le. 2022.Finetuned language models are zero-shot learners.In International Conference on Learning Representations.
  • Wu* et al. (2023a) Yusong Wu*, Ke Chen*, Tianyu Zhang*, Yuchen Hui*, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. 2023a. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP.
  • Wu* et al. (2023b) Yusong Wu*, Ke Chen*, Tianyu Zhang*, Yuchen Hui*, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. 2023b. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP.
  • Xu etal. (2024)Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, PuZhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. 2024.WizardLM: Empowering large pre-trained language models to follow complex instructions.In The Twelfth International Conference on Learning Representations.
  • Zhang etal. (2024)Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu. 2024.Mm-llms: Recent advances in multimodal large language models.arXiv preprint arXiv:2401.13601.
  • Zhang etal. (2023)Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, etal. 2023.Instruction tuning for large language models: A survey.arXiv preprint arXiv:2308.10792.
  • Zhao etal. (2023)WayneXin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, etal. 2023.A survey of large language models.arXiv preprint arXiv:2303.18223.
  • Zhou etal. (2017)Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017.Places: A 10 million image database for scene recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence.

Appendix A Additional Results

Appendix B Prompts employed for LLMs

Fig. 5 illustrates the prompt employed for synthesizing CompA-R. Fig. 6 illustrates the prompt employed for evaluating model responses on CompA-R. For dense captioning, we simply prompt the model: “Write an audio caption describing the sound in detail.”

Appendix C GAMA-IT vs GAMA and Evaluation Choices.

GAMA is first fine-tuned on OpenAQA and then instruction-tuned on CompA-R for complex reasoning. We call the instruction-tuned version GAMA-IT. We do not evaluate GAMA-IT on general tasks like classification and vanilla captioning. Note that both depend on the description of the input audio generated by the model. GAMA-IT is aligned to generate detailed descriptions as part of the complex reasoning stage, and we found a lack of metrics and methods that can faithfully evaluate such descriptions for classification or captioning. For example, the retrieval-based classification evaluation method, employed extensively in prior work, including ours, uses Sentence-BERT to retrieve the label closest to the description for classification evaluation. During our preliminary analysis, we found that Sentence-BERT, which just performs retrieval using semantic matching, is unable to faithfully retrieve the correct label despite the caption mentioning the label as an audio event. We further investigated CLAP as our retrieval model for evaluation and found that it suffers from the same limitations. We attribute this to the detailed and dense nature of the descriptions and the fact that these models only focus on high-level semantic meaning for retrieval. Our initial experiments show that LLM prompting serves as a feasible alternative for automatic evaluation (beyond human evaluation) of such dense descriptions, but due to the lack of resources and a formal framework, we leave this as part of future research.

C.1 Soft Prompts

We employ the soft prompt only in the instruction tuning stage for learning complex reasoning and not in the fine-tuning step. We do this for 2 reasons: (i) Fine-tuned GAMA is only expected to solve generic audio tasks like classification, captioning, etc. Thus, we hypothesize that such high-level semantic cues are not necessary for effective and optimal performance. (ii) Since fine-tuning is done on a large-scale dataset and acoustic event classification is far from accurate, our soft prompt method might add unwanted noise to the training process, thereby leading to sub-optimal performance. On the contrary, our instruction-tuning stage, which is done on relatively low-resource data and is only responsible for aligning a model for complex reasoning, is robust to inaccurate audio tags due to our soft-prompting methodology.

Appendix D Additional Details: Human Study

Note. Our institution’s Institutional Review Board (IRB) has granted approval for both human studies presented in the paper.

Background and Recruitment for Dense Captioning and CompA-R-test Evaluation. We recruit 3 professionals for human evaluation of dense captioning and CompA-R-test. All 3 professionals hold at least a Ph.D. in Engineering or the Sciences and were asked to use headphones to first analyze the audio and then judge the response quality. The authors of this paper gave these annotators 5 examples of responses and the corresponding judgments. The work was done voluntarily and was not paid. We refrain from recruiting crowd raters, as prior research has noted discrepancies in their evaluations (Gudibande et al., 2023). More precisely, crowd raters have been shown to rate an answer highly based only on the style of the answer rather than the exact factual information making up the response.

All 3 human annotators score each response between 1-5, and we report scores averaged across the 3. Prior to evaluation, all annotators were given at least 10 examples of generations and their corresponding scores by the authors of the paper. For evaluation, only the audio was provided to them, along with software that could play the audio and had fields to input the scores.

Background and Recruitment for OpenAQA. Since OpenAQA is considerably larger than CompA-R-test, we perform its evaluation on Amazon Mechanical Turk, similar to Gong et al. (2024). Evaluation was done with a total of 267 unique human evaluators, and each generation was scored by 2 evaluators. The same software was used for evaluation as for CompA-R-test.

Appendix E Additional Details: Audio Q-Former

E.1 Audio Q-Former Training Details

Pre-training Hyper-parameters. For Stage 1 of training, we employ a training batch size of 192, an initial learning rate of 1e-4, a minimum learning rate of 1e-5, and a warm-up learning rate of 1e-6. We use cosine decay as the learning rate schedule, with a warm-up of 5000 steps. Stage 1 was pre-trained on 8 A6000 GPUs for 100 epochs. For Stage 2 of training, we keep the exact same settings as Stage 1 but change the batch size to 128.
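A minimal sketch of this schedule; the linear shape of the warm-up is our assumption, as only the warm-up learning rate and step count are stated:

```python
import math

def qformer_lr(step: int, total_steps: int, warmup_steps: int = 5000,
               peak_lr: float = 1e-4, min_lr: float = 1e-5,
               warmup_start_lr: float = 1e-6) -> float:
    """Learning rate at a given step: linear warm-up followed by cosine decay."""
    if step < warmup_steps:
        # Linear warm-up from the warm-up LR up to the initial (peak) LR.
        return warmup_start_lr + (peak_lr - warmup_start_lr) * step / warmup_steps
    # Cosine decay from the peak LR down to the minimum LR.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```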

Fine-tuning. For zero-shot audio classification evaluation, we fine-tune the Audio Q-Former after Stage 1 pre-training on the same corpus presented in Table3 and using the same Stage 1 objective. The only difference in the fine-tuning step is that we train the AST model, which is otherwise kept frozen in the pre-training stage.

Fine-tuning Hyper-parameter. For fine-tuning, we again use the same hyper-parameter setting as Stage 1 pre-training but use a batch size of 64.

E.2 Training Dataset Details

Table 3 provides statistics for all individual datasets used to train the Audio Q-Former. We employ ≈2.2M audio-caption pairs for training, with no speech-transcription pairs.

Table 3: Datasets used for training the Audio Q-Former.

| Dataset | # Audio-Caption Pairs |
|---|---|
| AudioSet (Gemmeke et al., 2017) | 1,591,364 |
| Free Sound (Fonseca et al., 2022) | 259,020 |
| VGGSound (Chen et al., 2020) | 185,161 |
| AudioSet Strong (CompA version) (Ghosh et al., 2024b) | 108,311 |
| MACS (Morato and Mesaros, 2021) | 14,400 |
| BBC (BBC, 2018) | 31,201 |
| AudioCaps (Kim et al., 2019) | 48,649 |
| Clotho (Drossos et al., 2020) | 18,735 |
| SONISS (Sonniss Limited, 2022) | 1,602 |
| Musical Instrument (Agostinelli et al., 2023) | 7,990 |
| SoundBible (sou, 2023) | 1,232 |
| WavText5K (Deshmukh et al., 2022) | 4,347 |
| MusicCaps (Agostinelli et al., 2023) | 2,645 |
| GTZAN (Tzanetakis et al., 2001) | 6,014 |
| Medley-solos-DB (Lostanlen et al., 2019) | 732 |


E.3 Augmentation Examples

Table 9 illustrates prompt augmentations for two categories from each dataset. Table 10 illustrates caption augmentations for training the Audio Q-Former.

Appendix F Baseline Details

AudioCLIP (Guzhov et al., 2022). AudioCLIP is an extension of the CLIP model that can handle audio in addition to text and images by incorporating the ESResNeXt audio model into the CLIP framework. It was trained on the AudioSet dataset, which contains millions of audio clips with corresponding labels.

CLAP (Elizalde et al., 2023a). CLAP (Contrastive Language-Audio Pre-training), similar to CLIP, is an audio-language model trained with contrastive learning between audio data and their corresponding natural language descriptions, where representations are obtained from audio and text encoders. Wu* et al. (2023b) further extend this by adding a feature fusion mechanism and keyword-to-caption augmentation to the model design, enabling the model to process audio inputs of variable lengths and enhancing performance.

CompA-CLAP (Ghosh et al., 2024b). CompA-CLAP, an extension of CLAP, is trained on completely open-sourced datasets and further fine-tuned using specific algorithms and datasets to improve compositional reasoning.

Pengi (Deshmukh et al., 2023). Pengi was one of the first efforts to achieve general-purpose audio understanding through free-form language generation with transfer learning. Precisely, Pengi integrates an audio encoder with a decoder-only pre-trained language model (LM), where the audio features serve as a prefix for the LM during response generation. Following this, similar to our evaluation strategy, they prompt the model to caption the input audio and calculate the similarity between the caption and the ground-truth audio label for zero-shot classification.

LTU (Gong et al., 2024). As a concurrent work to Pengi, LTU took a step forward and showed that substituting the pre-trained language model with an LLM can endow an LALM with reasoning capabilities. Precisely, they achieve this by integrating an audio encoder with LLaMA (Touvron et al., 2023) and fine-tuning the model on close-ended and open-ended instruction-tuning datasets. Finally, beyond close-ended tasks, they also evaluate their model on open-ended reasoning tasks and show superior performance compared to baselines.
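The key data-side step, shared by these instruction-tuned LALMs, is framing each audio task as an instruction-response pair. A hedged sketch of what such a pair might look like is below; the template and field names are hypothetical, not LTU's exact format.

```python
# Hypothetical illustration of turning an audio task into an instruction-response pair.
# The template and keys are illustrative only and do not reproduce LTU's exact format.
def make_instruction_pair(audio_path, question, answer):
    return {
        "audio": audio_path,                                       # path to the input clip
        "instruction": f"Listen to the audio and answer: {question}",
        "response": answer,
    }

pair = make_instruction_pair(
    "clip_0001.wav",
    "What sound event occurs at the start of the clip?",
    "A dog barks twice before a car passes by.",
)
```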

AudioGPT (Huang et al., 2024). AudioGPT differs from Pengi and LTU in how the audio models and the LLM are integrated for completing audio tasks. More specifically, instead of end-to-end training and alignment, it integrates a closed-source model (ChatGPT) with pre-trained audio models that are already capable of completing the required task, using a modality-transfer transformer τ. The interaction between the two components is accomplished through prompts. Additionally, AudioGPT can solve a broader range of tasks, including tasks over verbal human speech, beyond the non-verbal audio handled by Pengi and LTU.
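The integration pattern (an LLM routing a request to a pre-trained audio model via prompts) can be sketched roughly as follows. The router heuristic and tool registry here are hypothetical simplifications, not AudioGPT's actual pipeline, where the routing decision is made by ChatGPT itself.

```python
# Hypothetical simplification of prompt-driven routing between an LLM "controller"
# and task-specific pre-trained audio models (the pattern AudioGPT follows).
AUDIO_TOOLS = {
    "transcribe": lambda path: f"[transcript of {path}]",  # stand-in for an ASR model
    "caption":    lambda path: f"[caption of {path}]",     # stand-in for an audio captioner
}

def route_request(user_prompt, audio_path):
    # In AudioGPT the routing decision is made by ChatGPT via prompts;
    # here a keyword heuristic stands in for that step.
    task = "transcribe" if "transcribe" in user_prompt.lower() else "caption"
    return AUDIO_TOOLS[task](audio_path)

print(route_request("Please transcribe this recording.", "meeting.wav"))
```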

SALMONN (Tang et al., 2024). SALMONN follows an architecture similar to LTU and Pengi and performs prefix conditioning with an LLM. However, in addition to an audio encoder, it also integrates a speech encoder for verbal speech understanding. Precisely, the audio and speech features are concatenated before being fed as a prefix to the LLM. SALMONN shows unique reasoning capabilities over speech inputs overlaid with non-verbal audio.
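At a high level, the concatenation step amounts to stacking the two feature sequences along the time axis and projecting them to the LLM embedding space before the text tokens. A minimal sketch follows; the shapes, hidden sizes, and single linear projection are illustrative assumptions rather than SALMONN's exact connection module.

```python
# Minimal sketch of concatenating speech and non-speech audio features
# before using them as a prefix for the LLM. Shapes and the projection
# layer are illustrative assumptions.
import torch
import torch.nn as nn

speech_feats = torch.randn(1, 50, 768)   # (batch, frames, dim) from a speech encoder
audio_feats  = torch.randn(1, 32, 768)   # (batch, frames, dim) from an audio encoder

prefix = torch.cat([speech_feats, audio_feats], dim=1)  # concatenate along the time axis
to_llm = nn.Linear(768, 4096)                           # project to an assumed LLM hidden size
prefix_tokens = to_llm(prefix)                          # (1, 82, 4096), fed before the text embeddings
```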

Qwen-Audio (Chu et al., 2023). Qwen-Audio follows an architecture similar to LTU, Pengi, and SALMONN, i.e., audio features are added as a prefix to the model, and additionally employs a novel multi-task learning formulation for pre-training. More specifically, they append specific tags to specific parts of the instruction-response text pairs and train the model on diverse speech, non-speech, and music tasks. After pre-training, similar to GAMA, they employ an instruction-tuning stage for alignment. The resulting model, Qwen-Audio-Chat, is able to respond to diverse queries about the input speech and audio.
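The tag-based multi-task formulation can be pictured as prepending task- and language-identifying markers to each training example. The sketch below is purely illustrative; the tag names are made up and are not Qwen-Audio's actual special tokens.

```python
# Hypothetical illustration of tagging text for multi-task pre-training.
# The tag names below are invented for illustration only.
def tag_example(task, language, text):
    return f"<|task:{task}|><|lang:{language}|> {text}"

print(tag_example("audio_caption", "en", "Describe the sounds in the clip."))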

Appendix G Additional Details: CompA-R

G.1 Annotation and Annotator Details

As mentioned earlier, CompA-R was cleaned and CompA-R-test was verified by the paper authors themselves. To preserve anonymity, we briefly provide some details about the authors. All authors of the paper are either enrolled in or have completed a graduate degree (MS and/or Ph.D.). All authors have at least 2 years of professional research experience at an academic or industry lab, spanning speech, audio, and language processing. This provides them with adequate knowledge to faithfully complete the process.

For CompA-R-test verification, after at least 3 authors verified the test set and provided rationales for their decisions, the lead author cross-verified all instances. The verification was done manually on local laptops, and no application built specifically for this purpose was used. More details will be provided in the camera-ready version.

Appendix H Additional Details: General

H.1 GAMA Training Dataset Details

Table4 shows statistics of all datasets used for fine-tuning and instruction-tuning GAMA. Table5 shows statistics of CompA-R, which is sourced entirely from the AudioSet-Strong dataset.

Table 4: Datasets used for fine-tuning and instruction-tuning GAMA.

Dataset | # Audio Samples | # QA Pairs
AudioSet-Strong | 102K | 636K
AudioSet | 500K | 441K
VGGSound | 184K | 336K
FSD50K | 41K | 82K
AudioCaps | 46K | 90K
FreeSound | 91K | 91K
Clotho | 5K | 32K
Sound Bible | 1.2K | 12K
NSynth (Instrument + Source) | 301K | 602K
Clotho AQA | 1.5K | 4.2K
MusicCaps | 5.5K | 2.8K
MusicQA | 13.1K | 118K
Magna | 51.7K | 51.7K
Sum (Closed-Ended) | 1,217K | 2,555K
AudioSet-Strong (Open-Ended) | 91K | 901K
AudioSet-20K | 19K | 184K
VGGSound (Open-Ended) | 184K | 907K
FSD50K (Open-Ended) | 41K | 403K
AudioCaps (Open-Ended) | 46K | 478K
Freesound (Open-Ended) | 91K | 791K
Clotho (Open-Ended) | 5K | 89K
Sound Bible (Open-Ended) | 1.2K | 10K
Sum (Open-Ended) | 453K | 3,764K
Total | 1,670K | 6,319K

Table 5: CompA-R statistics.

Dataset | # Audio Samples | # QA Pairs
AudioSet-Strong | 62613 | 200234
Total | 62613 | 200234

H.2 GAMA Evaluation Dataset Details

Table6 shows statistics of all datasets used for evaluating GAMA. Table8 shows statistics of CompA-R-test, which is sourced entirely from the AudioSet-Strong dataset.

Table 6: Datasets used for evaluating GAMA.

Dataset | # Instances
AudioSet-Strong | 102K
AudioSet | 500K
VGGSound | 184K
FSD50K | 41K
AudioCaps | 46K
FreeSound | 91K
Clotho | 5K
Sound Bible | 1.2K
NSynth (instrument) | 4K
NSynth (source) | 4K
Clotho AQA | 1.3K
GTZAN | 3K
Medley-solos-DB | 12.2K

Dataset | Evaluation Metric
Classification (zero-shot)
VocalSound (VS) (Gong et al., 2022) | Acc.
TUT 2017 (TUT) (Mesaros et al., 2018) | Acc.
Beijing Opera (BJO) (Tian et al., 2014) | Acc.
GTZAN (GTZ) (Park et al., 2022) | Acc.
Medley-solos-DB (MDB) (Lostanlen et al., 2018) | Acc.
Classification (weak zero-shot)
DCASE2017 Task 4 (DCASE) (Mesaros et al., 2017) | Mi-F1
ESC-50 (Piczak, 2015) | Acc.
Classification (seen)
VGGSound (VGG) (Chen et al., 2020) | Acc.
FSD50K (FSD) (Fonseca et al., 2021) | mAP
AudioSet (AS) (Gemmeke et al., 2017) | mAP
NSynth (NS) (Engel et al., 2017) | Acc.
Captioning (vanilla & dense)
AudioCaps (Kim et al., 2019) | SPICE & Human
Clotho (Drossos et al., 2020) | SPICE & Human
AQA (close-ended)
Clotho AQA (Lipping et al., 2022) | Acc.
AQA (open-ended)
OpenAQA (Gong et al., 2024) | Human
AQA (complex open-ended)
CompA-R-test (ours) | GPT-4 & Human

Dataset sources: AudioSet-Strong: https://www.kaggle.com/datasets/modaresimr/sound-event-detection-audioset-strong; FSD50K: https://zenodo.org/records/4060432; NSynth: https://www.tensorflow.org/datasets/catalog/nsynth; Clotho AQA: https://zenodo.org/records/6473207.
Table 8: CompA-R-test statistics.

Dataset | # Audio Samples | # QA Pairs
CompA-R-test | 500 | 1561
Total | 500 | 1561
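For the classification entries in the evaluation-metric table above (Acc., Mi-F1, mAP), the scores can be computed with standard tooling once predictions are available. A brief sketch using scikit-learn follows; the label and score arrays are placeholders for real evaluation outputs.

```python
# Sketch of the classification metrics listed above, computed with scikit-learn.
# y_true / y_pred / y_score are placeholders for real evaluation outputs.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, average_precision_score

y_true = np.array([0, 1, 2, 1])          # single-label ground truth (e.g., VocalSound, NSynth)
y_pred = np.array([0, 1, 1, 1])
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Micro-F1:", f1_score(y_true, y_pred, average="micro"))   # e.g., DCASE2017 Task 4

# mAP for multi-label datasets such as FSD50K / AudioSet:
y_true_ml = np.array([[1, 0, 1], [0, 1, 0]])                    # multi-hot labels
y_score_ml = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3]])       # per-class scores
print("mAP:", average_precision_score(y_true_ml, y_score_ml, average="macro"))
```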

H.3 Other Details

Model Parameters: GAMA has a total of \approx7B parameters. Out of these, LLaMA-2-7B has 32 transformer-decoder layers and \approx6.7B parameters, the Audio Q-Former has \approx280M parameters, and our LoRA modules introduce 4.2M learnable parameters for fine-tuning. The AST used in our experiments (the audio encoder of CAV-MAE, Gong et al. (2023)) has \approx85M parameters with 12 transformer-encoder layers, a hidden size of 768, and 12 attention heads.
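To give a sense of where a LoRA parameter count of this magnitude comes from, below is a back-of-the-envelope sketch. The rank and the set of adapted projection matrices are assumptions made for illustration, not GAMA's exact LoRA configuration.

```python
# Back-of-the-envelope LoRA parameter count. Rank and target modules are
# illustrative assumptions, not the exact configuration used for GAMA.
def lora_params(d_in, d_out, rank):
    # Each adapted weight W (d_out x d_in) gets A (rank x d_in) and B (d_out x rank).
    return rank * d_in + d_out * rank

hidden = 4096           # LLaMA-2-7B hidden size
n_layers = 32
rank = 8                # assumed LoRA rank
targets_per_layer = 2   # e.g., adapting the query and value projections (assumption)

total = n_layers * targets_per_layer * lora_params(hidden, hidden, rank)
print(f"{total/1e6:.1f}M learnable LoRA parameters")  # ~4.2M under these assumptions
```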

Compute Infrastructure: All our experiments are conducted on four NVIDIA A6000 GPUs. Training GAMA requires four days of continuous training, and training GAMA-IT requires 4 hours. Pre-training the Audio Q-Former requires 7 days each for Stages 1 and 2.

Implementation Software and Packages: We implement all our models in PyTorch (https://pytorch.org/) and use the HuggingFace (https://huggingface.co/) implementation of T5-large and the original implementation of HTSAT-tiny (https://github.com/RetroCirce/HTS-Audio-Transformer).

For our baselines, we use the original GitHub repositories provided by the authors: LAION-CLAP (https://github.com/LAION-AI/CLAP/tree/main), CompA-CLAP (https://github.com/Sreyan88/CompA), CLAP (https://github.com/microsoft/CLAP), Wav2CLIP (https://github.com/descriptinc/lyrebird-wav2clip), AudioCLIP (https://github.com/AndreyGuzhov/AudioCLIP), MMT (https://github.com/akoepke/audio-retrieval-benchmark), ML-ACT (https://github.com/akoepke/audio-retrieval-benchmark), Pengi (https://github.com/microsoft/pengi), LTU (https://github.com/YuanGongND/ltu), AudioGPT (https://github.com/aigc-audio/audiogpt), SALMONN (https://github.com/bytedance/salmonn), and Qwen-Audio (https://github.com/QwenLM/Qwen-Audio).

Potential Risks. GAMA might encode biases inherited from the pre-trained LLM or acquired during its fine-tuning stages. Additionally, the Audio Q-Former, if used as a backbone for audio or music generation systems, might be used to produce synthetic audio that could be misused.

[Figure: gama_prompt.pdf]
[Figure: figures/aqa_prompt_gama.pdf]
[Figure: figures/caption_prompt_GAMA.pdf]
[Figure: figures/app1.pdf]
[Figure: figures/app2.pdf]
[Figure: figures/app3.pdf]
[Figure: figures/app4.pdf]
[Figure: figures/app5.pdf]

Dataset | Category | Acoustic Property | Captions
AudioSetBaby cry, infant cry1: "a high-pitched, piercing wail"2: "a persistent, rhythmic sobbing"1. A tiny voice pierced the stillness of the night, demanding attention with its high-pitched piercing wail.1. In the midst of the bustling market, a high-pitched wail rose above the chatter, signaling a tiny dissenter among us.2. The persistent, rhythmic sobbing of a newborn punctuates the night’s silence.2. Amidst the rustling leaves, a baby’s rhythmic sobs weave an unexpected lullaby.
\cdashline2-4Stomach rumble1: "a low, gurgling growl"2: "a grumbling undercurrent"1. As she leaned in for the kiss, a low, gurgling growl betrayed her pre-date fasting.1. In the midst of the exam, a low, gurgling growl echoed from her stomach, punctuating the tension.2. As the classroom fell into a hushed anticipation of the next lecture, a grumbling undercurrent betrayed the student’s skipped breakfast.2. Amidst the solemnity of the library, a grumbling undercurrent served as a comical reminder of lunchtime’s approach.
ESCC50siren1: "Continuous, shrill alarm"2: "Ear-splitting, cyclic blare"1. The night was pierced by the continuous, shrill alarm of the siren, signaling an urgency that couldn’t be ignored.1. The relentless, shrill alarm of the siren wove through the corridors of the hospital, prompting swift movements from the staff.2. The ear-splitting, cyclic blare cut through the night as the ambulance raced down the street.2. The ear-splitting, cyclic blare of the air raid siren cast a shadow of dread over the city.
\cdashline2-4sheep1: "bleating",2: "baaing"1. In the tranquil meadow, a chorus of woolly creatures serenaded the dawn with their gentle bleating.1. The shepherd smiled as the flock’s bleating echoed through the valley, signaling a return to the fold.2. A chorus of baaing accompanied the farmer as he made his morning rounds in the misty fields.2. Under the starry sky, the gentle baaing of the flock blended with the whispers of the night.
NSynthflute1: "Melodious and silvery, carrying a light, airy tune that seems to float on the breeze.",2: "Clear and resonant, with a pure tone that sings above the orchestra like a bird in the morning sky.1. The flute’s melody weaved through the garden, mingling with the rustle of leaves.1. Amidst the hustle of the market, the silvery flute tune danced above the clamor, a ribbon of tranquility.2. The flute’s crystalline melody soared above the orchestra, a lark greeting the dawn.2. Amidst the rustle of the forest, the flute’s song danced through the leaves, pure and high.
\cdashline2-4bass1: "Thumping, providing a rhythmic pulse that can drive the beat of music.",2: "Booming, with a powerful, pervasive quality that can reverberate through a space."1. The bass pulsated through the dance floor, a heartbeat synchronizing every dancer’s move.1. Amidst the serene silence of the night, the bass from the distant festival throbbed like a gentle earthquake.2. The bass pulsated like a second heartbeat, filling the room with its unyielding presence.2. As the bassline dropped, it seemed to command the very air, a force unseen yet unforgotten.
FSD50KSlam1: "an abrupt, resonant boom that startles anyone nearby"2: "a sharp, impactful smack as two hard surfaces collide with force"1. The mailbox lid clapped shut, a resonant signal marking the departure of the day’s correspondence.1. The oven door’s heavy thud resonated in the kitchen, a prelude to the aroma of freshly baked bread.2. The kitchen was filled with the aroma of spices and the sharp smack of dough being forcefully thrown onto the countertop.2. In the crisp morning air, the sharp smack of the newspaper hitting the doorstep announced the arrival of daily news.
\cdashline2-4Dishes, pots, and pans1: "Clanging and clattering"2: "Metallic clinking and clunking"1. A symphony of clanging and clattering announces the busy bustle of a restaurant kitchen in full swing.1. The rhythmic clanging and clattering of pots and pans punctuate the air as grandma orchestrates her holiday feast.2. The metallic clinking and clunking heralded the start of the dinner rush in the bustling restaurant kitchen.2. A symphony of metallic clinking and clunking rose from the sink as grandma washed up after the family feast.
TUT Urbanbus1: "a deep, rumbling engine", "2": "the low, steady hum of the diesel motor"1.The city pulse beats with a deep, rumbling engine, heralding the arrival of the morning commute.1. A gentle giant purrs in the stillness of dawn, its deep, rumbling engine announcing the start of a journey.2. Market stalls buzz with life, their vibrant colors and smells underscored by the bus’s diesel hum rolling down the avenue.2. Leaves rustle in the autumn breeze, a natural chorus to the bus’s diesel motor humming along the cobblestone path.
\cdashline2-4residential area1: "The symphony of children’s laughter and chatter fills the air, punctuated by the occasional bark of a dog and the hum of lawn mowers in the distance."2: "A serene hush blankets the neighborhood, broken occasionally by the soft whoosh of passing cars and the rustle of leaves stirred by a gentle breeze.",1. The neighborhood comes alive with the melody of playful banter and the sporadic chorus of canines.1. Amidst the gentle drone of distant lawn mowers, the air vibrates with juvenile mirth and convivial exchanges.2. The neighborhood rests under a tranquil silence, punctuated now and then by the whisper of tires on asphalt and the soft dance of leaves in the wind.2. Calmness envelops the streets, save for the faint hum of vehicles gliding by and the tender shuffling of foliage in the zephyr’s caress.
Urban- Sound 8Kair conditioner1: "a steady humming"2: "a low, monotonous droning"1. The room filled with the steady humming of the air conditioner as they focused intently on their chess match.1. A steady humming enveloped the library, where pages turned almost in rhythm with the air conditioning’s constant song.2. The air conditioner’s low, monotonous droning became the unlikely lullaby for a midsummer’s nap.2. Amid the quiet study hall, the air conditioner’s low, monotonous droning was a steady companion to the students’ focused brows.
\cdashline2-4gun shot1: "A loud, sharp crack that echoes through the air.2: "A thunderous boom that startles and reverberates."1. The night’s silence shattered with a loud, sharp crack echoing through the air.1. A burst of sudden, sharp noise split the tranquil afternoon, reverberating off the canyon walls.2. A thunderous boom startles a flock of birds into the sky, their wings flapping frantically against the silence that had just been.2. The night’s silence was shattered by a boom, reverberating through the alleyways and causing stray cats to scurry.
VGG Soundmouse squeaking1: "a high-pitched, sharp chirp"2: "a soft, repetitive squeal"1. In the moonlit barn, a tiny silhouette pauses to release its high-pitched, sharp chirp, disturbing the stillness of the hay-strewn loft.2. Amidst the rustling leaves, a diminutive creature contributes its sharp chirp to the dusk chorus, a minuscule soloist in nature’s vast orchestra.3. A soft, repetitive squeal punctuated the silence of the old attic.4. The cheese plate on the kitchen counter became the stage for a soft, repetitive squeal.
\cdashline2-4typing on typewriter1: "a rhythmic series of sharp clicks"2: "a steady clatter of keys striking paper"1. Fingers dance across keys, a rhythmic series of sharp clicks punctuating the silence of the library.1. In the attic, a story unfolds to the staccato beat of a rhythmic series of sharp clicks.2. Each steady clatter of keys striking paper weaves a tapestry of words, painting stories on the blank canvas.2. In the dimly lit corner of the library, the rhythmic dance of metallic hammers against the page composes a silent symphony.
Original Caption | Augmented Captions
A man speaks followed by the sound of shuffling cards in a small room.1. A deep, resonant voice fills the small room, accompanied by the soft shuffle of cards as they change hands, creating an intimate and deliberate atmosphere.2. The sound of a man’s voice echoes through the small space, punctuated by the subtle rustle of cards as they are shuffled and arranged, invoking a sense of purposeful deliberation.3. A deep voice speaks, followed by the subtle shuffle of cards, creating an intimate and anticipatory atmosphere in the small room.4. The gentle rustle of cards breaks the silence, punctuated by a man’s voice, evoking a sense of anticipation and private reflection in the cozy space.
A person strums an acoustic guitar, creating melodic music with the sound of a bell ringing in the background.1. Soothing melodies flow from the acoustic guitar, harmonizing with the soft chime of a distant bell, crafting a peaceful ambiance.2. The acoustic guitar’s strings vibrate with grace, weaving a melodic tapestry that intertwines with the gentle ring of a bell, transporting the listener to a serene realm.3. The gentle strumming of an acoustic guitar weaves a melodic tapestry, intertwined with the soft chime of a background bell, creating a soothing and harmonious atmosphere.4. The rhythmic plucking of an acoustic guitar crafts a lively and uplifting melody, complemented by the delicate ringing of a background bell, transporting the listener to a serene and joyful realm.
Dogs bark while people talk in the background, creating a lively atmosphere in a field.1. Lively chatter and joyful barks fill the air, capturing the playful spirit of a sunny day in a field.2. The rhythmic sounds of dogs barking and people talking blend together, creating a vibrant and lively ambiance in the open field.3. The chatter of people and the joyful barks of dogs fill the air, creating a vibrant and lively atmosphere in the field.4. The sound of playful dogs and lively conversation fills the field, evoking a sense of happiness and energy.
A man’s voice is heard speaking over a radio as a vehicle passes by in the background.1. A clear, crisp voice pierces the airwaves, intertwining with the distant hum of a vehicle, creating an engaging audio experience.2. The man’s voice on the radio blends seamlessly with the subtle rumble of a passing vehicle, forming a captivating auditory tapestry.3. A voiceover speaks over a radio, complemented by the distant hum of a vehicle passing by, creating a dynamic and engaging audio experience.4. A man’s voice broadcasts over the radio, intertwining with the subtle rumble of a vehicle in the background, forming a captivating audio landscape.
A woman speaks while a bird chirps in the background, creating a tranquil atmosphere in a natural setting.1. A gentle voice echoes through the forest, harmonizing with the chirping of birds, creating a soothing ambiance.2. The sound of a gentle voice blends seamlessly with the melodic chirping of birds, transporting the listener to a serene natural setting.3. The woman’s gentle voice blends with the soothing chirps of a bird, creating a serene ambiance reminiscent of a peaceful afternoon in nature.4. The woman’s words are accompanied by the melodic chirping of a bird, transporting the listener to a calming and picturesque outdoor setting.
Water rushes as people talk in the background near a hot spring, creating a serene ambiance.1. Soothing waters create a peaceful ambiance, punctuated by the gentle chatter of people nearby, as if they are harmonizing with the soothing sounds of the hot spring.2. The calm trickle of water creates an intimate atmosphere, with the soft murmur of voices in the background adding a sense of connection and tranquility to the space.3. A soothing, babbling sound fills the air as people converse near a steaming hot spring, creating a tranquil atmosphere.4. The gentle gurgling of water intertwines with the chatter of people in the background, crafting a peaceful and relaxing ambiance.
Soft music plays in the background as a speech is heard faintly, creating a calm and peaceful atmosphere.1. A soothing melody floats in the background, complementing the faint speech, creating a tranquil ambiance.2. The soft strains of music blend with the subtle speech, fostering a sense of serenity and calmness in the atmosphere.3. Soothing tunes fill the air, complemented by a gentle speech, creating an atmosphere of tranquility and serenity.4. Mellow music and soft speech blend together, crafting a calming environment that soothes the senses.’
A car engine revs up and then slows down, creating a vroom sound, as the vehicle accelerates in the audio.1. The car’s engine purrs and then decelerates, emitting a smooth and powerful vroom sound as it shifts gears, creating a dynamic and energizing atmosphere.2. The vehicle’s engine roars to life, producing a bold and intense vroom sound as it speeds up, then gradually slows down, immersing the listener in a thrilling and exhilarating experience.3. The car’s engine purrs powerfully, then decelerates, creating a smooth and steady vroom sound as the vehicle gains speed.4. The car’s engine roars to life, building momentum with a series of sharp vroom sounds before shifting gears and slowing down.
Background music plays softly as the theme music gradually fades in, creating a melodic ambiance in an arena/performance setting.1. The arena comes alive with a subtle, soothing melody that gradually builds in intensity, creating an electrifying ambiance.2. The soft strains of background music fill the air, setting the tone for an exhilarating performance in a vibrant arena setting.3. Soft, melodic strains fill the air as the theme music subtly builds, establishing a harmonious ambiance in the arena.4. The arena comes alive with a gentle, orchestral tune that gradually gains momentum, creating an uplifting and energetic atmosphere.


!Instruction-Response PairsAudioSet IDCaptionTimestamp EventsInstruction:Analyze the audio to understand the potential emotional state or mood of the man. How does the progression from typing to speech to chewing reflect his transition through different phases of work or activity? Output:The man initially seems engaged and focused during the typing and speaking portion, which might then transition into relaxation during the break, suggested by the chewing sound.YCecEf0abd4YA man speaks while typing on a keyboard in a small room, followed by the sound of chewing.’(Generic impact sounds-0.0-1.037)’, ’(Background noise-0.0-10.0)’, ’(Generic impact sounds-1.191-1.421)’, ’(Generic impact sounds-2.01-2.202)’, ’(Generic impact sounds-2.343-2.574)’, ’(Male speech, man speaking-2.727-3.393)’, ’(Generic impact sounds-3.163-3.406)’, ’(Generic impact sounds-3.585-3.905)’, ’(Generic impact sounds-4.136-4.379)’, ’(Breathing-4.405-4.917)’, ’(Generic impact sounds-4.93-5.288)’, ’(Generic impact sounds-5.442-5.608)’, ’(Generic impact sounds-5.736-6.12)’, ’(Generic impact sounds-6.274-6.569)’, ’(Breathing-6.825-7.26)’, ’(Generic impact sounds-6.863-7.042)’, ’(Male speech, man speaking-7.81-8.873)’, ’(Generic impact sounds-8.041-8.348)’, ’(Breathing-9.001-9.36)’, ’(Human sounds-9.014-9.181)’, ’(Generic impact sounds-9.309-9.565)’, ’(Scrape-9.449-10.0)’Instruction:Considering the diverse array of sounds present in the audio, from insects to birds to a dog, infer the type of ecosystem this outdoor setting might represent. What does the combination of these sounds tell us about the biodiversity and potential human impact in this area? Output:The ecosystem is likely a suburban or rural area with a mix of wildlife and human habitation, indicated by the variety of animal sounds and intermittent female speech.YcQiEI7HLGJgBirds chirp in the distance as a dog barks, creating a lively atmosphere in a peaceful outdoor setting.’(Insect-0.0-0.724)’, ’(Mechanisms-0.0-9.777)’, ’(Female speech, woman speaking-0.737-1.434)’, ’(Bird vocalization, bird call, bird song-1.243-1.775)’, ’(Insect-2.376-3.182)’, ’(Female speech, woman speaking-3.386-3.509)’, ’(Insect-4.397-5.23)’, ’(Dog-7.906-8.78)’, ’(Surface contact-8.603-9.654)’Instruction:Given the presence of mechanisms throughout the audio and the interspersed generic impact sounds, infer the type of machinery that is likely operating in the background and its commonality in an office environment. Output:The continuous mechanism sound suggests a printer or copier, which are common in office settings.YXQ2XAXx7mKsA printer hums while people converse in the background, creating a typical office ambiance.’(Generic impact sounds-0.0-0.622)’, ’(Mechanisms-0.0-10.0)’, ’(Generic impact sounds-0.815-1.227)’, ’(Generic impact sounds-1.632-2.134)’, ’(Child speech, kid speaking-3.591-6.684)’, ’(Squeal-7.385-7.612)’, ’(Child speech, kid speaking-8.437-10.0)’


!Instruction-Response PairsAudioSet IDCaptionTimestamp EventsInstruction:From the sequencing and overlapping of different sound events, infer the likely cause and process of the dripping sounds heard intermittently throughout the audio. Output:The dripping sounds may be a result of water overflowing from a filled sink or bath, supported by the earlier sounds of splashing and liquid gurgling.YCU9A5xL3TVcWater splashes and gurgles as it drips inside a small room, creating a soothing ambiance reminiscent of a tranquil bathroom.’(Background noise-0.0-10.0)’, ’(Generic impact sounds-0.083-0.331)’, ’(Splash, splatter-0.67-1.174)’, ’(Liquid-1.385-1.956)’, ’(Splash, splatter-2.325-3.138)’, ’(Liquid-3.085-4.131)’, ’(Liquid-4.372-5.5)’, ’(Drip-4.949-5.047)’, ’(Drip-5.279-5.458)’, ’(Generic impact sounds-8.819-9.142)’, ’(Drip-9.511-9.649)’Instruction:Assess the style and elements of the rapping and music in this audio clip. Based on this, what genre of music might the DJ be playing on stage? Output: Given the presence of rapping and electronic music, the DJ is likely playing Hip Hop or Electronic Dance Music (EDM).YRjUZjMPP-nAElectronic music plays as a whoosh sound follows, creating a lively atmosphere for the DJ performing on stage.’(Rapping-0.0-0.376)’, ’(Music-0.0-10.0)’, ’(Rapping-0.685-1.663)’, ’(Rapping-2.295-2.837)’, ’(Sound effect-2.423-5.222)’, ’(Sound effect-7.427-10.0)’


!Instruction-Response PairsAudioSet IDCaptionTimestamp EventsInstruction:Identify the role of the crowd’s continuous conversation during the woman’s speech. How does it contribute to the atmosphere of the scene? Output: The crowd’s constant chatter indicates an informal or relaxed environment, suggesting that while the woman’s speech is central, other conversations are also ongoing in the background.Y6fRYeClf5U4A woman delivers a speech while a crowd of people engage in conversation in an urban setting.’(Crowd-0.0-10.0)’, ’(Wind-0.008-10.0)’, ’(Female speech, woman speaking-0.074-1.65)’, ’(Female speech, woman speaking-2.879-5.427)’, ’(Female speech, woman speaking-5.604-6.083)’, ’(Female speech, woman speaking-6.9-10.0)’Instruction:Considering the presence of bird sounds and chainsaw noise, identify the probable location where this sound sequence is occurring. Output:The location is likely an outdoor area, possibly in a woodland or suburban setting where birds can be heard alongside human activity and power tools.YbkG4M4TiXZgA man speaks while a power tool revs up, indicating a speech event followed by the sound of a chainsaw.’(Male speech, man speaking-0.0-0.268)’, ’(Chainsaw-0.0-10.0)’, ’(Male speech, man speaking-1.772-4.425)’, ’(Male speech, man speaking-5.008-8.118)’, ’(Bird vocalization, bird call, bird song-5.362-7.512)’, ’(Bird vocalization, bird call, bird song-8.244-8.709)’, ’(Bird vocalization, bird call, bird song-8.937-9.283)’, ’(Male speech, man speaking-9.661-10.0)’


!Instruction-Response PairsAudioSet IDCaptionTimestamp EventsInstruction:Analyze the frequency and consistency of the ticking sound. What is the likely source of this sound, and what implication might it have on the depicted setting? Output:The steady ticking likely comes from a clock, which coupled with the soft music, suggests a relaxed, cozy domestic setting, perhaps aimed at unwinding or relaxation.YCoBAR5MbjysThe clock ticks steadily as soft music plays in the background, creating a calming atmosphere in a cozy living room.’(Mechanisms-0.0-10.0)’, ’(Alarm clock-0.008-10.0)’, ’(Tick-0.386-0.583)’, ’(Tick-1.071-1.22)’, ’(Tick-1.764-1.906)’, ’(Tick-2.465-2.638)’, ’(Tick-3.197-3.331)’, ’(Tick-3.772-3.976)’, ’(Tick-4.346-4.48)’, ’(Tick-4.646-4.787)’, ’(Tick-5.087-5.22)’, ’(Tick-5.669-5.795)’, ’(Tick-6.031-6.15)’, ’(Tick-6.37-6.528)’, ’(Tick-6.724-6.795)’, ’(Tick-6.969-7.118)’, ’(Tick-7.386-7.614)’, ’(Tick-8.134-8.354)’, ’(Tick-8.882-9.094)’, ’(Tick-9.315-9.425)’, ’(Tick-9.575-9.685)’Instruction:Identify the type of vocal music that is being depicted in the audio based on the presence of singing and beatboxing. Output:This audio resembles A Capella, where voices impersonate the sounds of instruments, including rhythms often mimicked through beatboxing.Y6SvDRiIG2NYA group of people sing and harmonize, creating vocal music with occasional beatboxing, in a room with a piano.’(Male singing-0.0-6.594)’, ’(Music-0.0-10.0)’, ’(Mechanisms-0.0-10.0)’, ’(Breathing-7.064-8.314)’, ’(Breathing-8.911-10.0)’, ’(Male singing-9.713-10.0)’Instruction:Based on the audio, ascertain the possible relationship between the gunfire sounds, artillery fire, and music. How does the sequencing and manner of these sounds contribute to the atmosphere of the scene? Output: The gunfire and artillery sounds likely serve as a ceremonial display, with the music adding to the grandeur and solemnity of a military parade.YbJvOp4gmHBgGunshots and artillery fire echo through the air as music plays during a military parade at a raceway.’(Music-0.0-10.0)’, ’(Generic impact sounds-0.166-0.307)’, ’(Artillery fire-0.32-0.704)’, ’(Generic impact sounds-0.781-0.948)’, ’(Generic impact sounds-1.063-1.165)’, ’(Generic impact sounds-1.524-1.677)’, ’(Generic impact sounds-2.625-2.881)’, ’(Artillery fire-3.035-3.521)’, ’(Generic impact sounds-3.611-3.777)’, ’(Generic impact sounds-4.213-4.43)’, ’(Generic impact sounds-5.096-5.262)’, ’(Artillery fire-5.288-5.762)’, ’(Generic impact sounds-5.89-6.095)’, ’(Generic impact sounds-6.479-6.812)’, ’(Generic impact sounds-6.94-7.106)’, ’(Artillery fire-7.222-7.606)’, ’(Generic impact sounds-8.207-8.425)’, ’(Artillery fire-8.476-8.988)’, ’(Generic impact sounds-9.206-9.385)’, ’(Generic impact sounds-9.654-9.795)’
