DI-UMONS: Institutional Repository of the University of Mons

2019-10-13 - Conference/Paper in peer-reviewed proceedings - English - 6 page(s)

Brousmiche Mathilde, Dupont Stéphane, Rouat Jean, "Audio-Visual Fusion And Conditioning With Neural Networks For Event Recognition" in International Workshop on Machine Learning for Signal Processing, Pittsburgh, USA, 2019

  • CREF codes: Artificial intelligence (DI1180)
  • UMONS research units: Circuit Theory and Signal Processing (F105)
  • UMONS institutes: NUMEDIART Institute for Digital Art Technologies (Numédiart)

Abstract(s):

(English) Video event recognition based on audio and visual modalities is an open research problem. The mainstream literature on video event recognition focuses on the visual modality and does not take into account the relevant information present in the audio modality. We propose to study several fusion architectures for the audio-visual recognition task of video events. We first build classical fusion architectures using concatenation, addition or Multimodal Compact Bilinear pooling (MCB). Then, we propose to create connections between visual and audio processing with Feature-Wise Linear Modulation (FiLM) layers. For instance, the information present in the audio modality is exploited to change the visual classification behaviour. We found that multimodal event classification performance is always better than unimodal performance, whatever the fusion or conditioning method used. Classification accuracy based on one modality improves when we add the modulation of the other modality through FiLM layers.
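The two mechanisms the abstract contrasts can be sketched in a few lines. Below is a minimal NumPy illustration, not the paper's actual architecture: the feature dimensions, random features, and the weight matrices producing the FiLM parameters are all hypothetical stand-ins for learned components. Concatenation fusion simply stacks the two feature vectors, while FiLM conditioning lets the audio features produce a per-channel scale (gamma) and shift (beta) that modulate the visual features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (illustrative, not from the paper)
d_audio, d_visual = 128, 512

audio_feat = rng.standard_normal(d_audio)    # stand-in for an audio embedding
visual_feat = rng.standard_normal(d_visual)  # stand-in for a visual embedding

# --- Classical fusion by concatenation: stack both modalities ---
fused_concat = np.concatenate([audio_feat, visual_feat])  # shape (640,)

# --- FiLM conditioning: audio modulates the visual pathway ---
# In a real network, gamma and beta come from learned layers; here we use
# random linear maps purely to show the feature-wise affine transformation.
W_gamma = rng.standard_normal((d_visual, d_audio)) * 0.01
W_beta = rng.standard_normal((d_visual, d_audio)) * 0.01

gamma = W_gamma @ audio_feat  # per-channel scale, shape (512,)
beta = W_beta @ audio_feat    # per-channel shift, shape (512,)

# Feature-wise linear modulation: y = gamma * x + beta
modulated_visual = gamma * visual_feat + beta
```

The key difference shown here is structural: concatenation merges the modalities only at the fusion point, whereas FiLM injects audio information directly into the visual processing stream, which matches the paper's finding that modulating one modality by the other improves single-modality accuracy.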


Keywords:
  • (English) Event recognition
  • (English) Multimodal deep learning
  • (English) Audio-visual fusion
  • (English) Modalities conditioning