DI-UMONS: Institutional repository of the University of Mons

2020-10-20 - Supervised work/Doctorate - English - 170 page(s)

Laraba Sohaib, "Deep Learning for Skeleton-Based Human Action Recognition", Dutoit Thierry (p), Tilmanne Joëlle, defended on 2020-10-20

  • CREF codes: Engineering sciences (DI2000), Mathematical computer science (DI1160)
  • Jury: Siebert Xavier (p), Gosselin Bernard, Sahli Hichem, Taleb-Ahmed Abdelmalik, Wannous Hazem
  • UMONS research units: Information, Signal et Intelligence artificielle (F105)
  • UMONS institutes: Institut de Recherche en Technologies de l'Information et Sciences de l'Informatique (InforTech), Institut NUMEDIART pour les Technologies des Arts Numériques (Numédiart)
  • UMONS centers: Centre de Recherche en Technologie de l'Information (CRTI)

Abstract:

(English) Human action recognition from videos has a wide range of applications, including video surveillance and security, human-computer interaction, robotics, and health care. Nowadays, 3D skeleton-based action recognition has drawn increasing attention thanks to the availability of low-cost motion capture devices, the accessibility of large-scale 3D skeleton datasets, and real-time skeleton estimation algorithms.

In the first part of this thesis, we present a novel representation of motion capture sequences for 3D skeleton-based action recognition. The proposed approach represents 3D skeleton sequences as RGB image-like data and leverages recent convolutional neural networks (CNNs) to model the long-term temporal and spatial structural information for action recognition. Extensive experiments have shown the superiority of the proposed approach over state-of-the-art methods for 3D skeleton-based action recognition.

To obtain skeleton sequences, devices first capture depth information using various technologies (stereo, time-of-flight, etc.), and 3D skeleton poses are then estimated with different algorithms. In recent years, new research has proposed extracting skeleton sequences directly from RGB videos; the most precise methods extract 2D skeletons in real time and with high accuracy. In the second part of this thesis, we leverage these tools to extend our approach to RGB videos: we first extract 2D skeleton sequences from RGB videos and then, following approximately the same process as in the first part, use CNNs for human action recognition. Experiments showed that the proposed method outperforms several state-of-the-art methods on a large benchmark dataset.

Another contribution of this thesis concerns the interpretability of deep learning models, which are still often likened to alchemy because their internal operations are poorly understood. Interpretability is crucial for understanding and trusting the decisions made by a machine learning model. In the third part of this thesis, we therefore use CNN interpretation methods to understand the behavior of our classifier and extract the most informative joints during the execution of a particular action. This method lets us see, from the CNN's point of view, which joints matter most, and understand why certain actions are confused by the proposed classifier.
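The core idea of the first part — mapping a 3D skeleton sequence to an RGB image so that a standard CNN can consume it — can be sketched as follows. This is a minimal illustration, not the thesis's exact pipeline: the function name, the joints-as-rows/frames-as-columns layout, and the per-axis min-max normalization are assumptions for the sketch.

```python
import numpy as np

def skeleton_to_rgb_image(sequence):
    """Encode a skeleton sequence as an RGB image-like array.

    sequence: array of shape (frames, joints, 3) holding x, y, z
    joint coordinates. In the output, rows index joints, columns
    index frames, and the three channels carry the normalized
    x, y, z values.
    """
    seq = np.asarray(sequence, dtype=np.float64)
    # Normalize each coordinate axis independently to [0, 255].
    mins = seq.min(axis=(0, 1), keepdims=True)
    maxs = seq.max(axis=(0, 1), keepdims=True)
    scaled = (seq - mins) / np.maximum(maxs - mins, 1e-8) * 255.0
    # Transpose to (joints, frames, 3): joints read vertically,
    # time reads horizontally, like the rows of a texture.
    return scaled.transpose(1, 0, 2).astype(np.uint8)

# Toy example: 4 frames, 2 joints.
demo = np.random.rand(4, 2, 3)
img = skeleton_to_rgb_image(demo)
print(img.shape)  # (2, 4, 3)
```

The resulting arrays can be fed to any image classifier, which is what makes pretrained CNNs directly reusable for motion data.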
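One simple way to estimate which joints a classifier relies on — in the spirit of the interpretability analysis described above, though not necessarily the method used in the thesis — is occlusion: blank out one joint's row in the encoded image and measure how much the classifier's score drops. The toy `score_fn` below stands in for a trained CNN and exists only to keep the sketch self-contained.

```python
import numpy as np

def joint_importance(image, score_fn):
    """Occlusion-style importance score for each joint row.

    image: (joints, frames, 3) encoded skeleton image.
    score_fn: callable returning the classifier's confidence for
    the target action (here, any scalar function of the image).
    Returns one value per joint: the score drop when that joint's
    row is zeroed out over the whole sequence.
    """
    base = score_fn(image)
    importances = np.empty(image.shape[0])
    for j in range(image.shape[0]):
        occluded = image.copy()
        occluded[j] = 0  # hide this joint across all frames
        importances[j] = base - score_fn(occluded)
    return importances

# Toy scorer standing in for a trained CNN: mean pixel intensity.
score = lambda img: img.mean() / 255.0
demo = np.arange(2 * 4 * 3, dtype=np.uint8).reshape(2, 4, 3)
drops = joint_importance(demo, score)
print(drops)
```

Joints whose occlusion causes the largest score drop are, from the classifier's point of view, the most informative for the action — and inspecting them joint by joint can also hint at why two actions get confused.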