Keynote 1: Fernando Pereira
Deep Learning-based Extended Reality: Making Humans and Machines Speak the Same Visual Language
Abstract: The key goal of Extended Reality (XR) is to offer human users immersive and interactive experiences, notably the sense of being in a virtual or augmented environment and interacting with virtual beings or objects. A fundamental element in this goal is the visual content: its realism and its levels of interactivity and immersion. Recent advances in visual data acquisition and consumption have led to the emergence of so-called plenoptic visual models, where light fields and point clouds play an increasingly important role, offering 6DoF experiences beyond the more common and limited 2D image- and video-based experiences. This increased immersion is critical for emerging applications and services, notably virtual and augmented reality, personal communications and meetings, education, medical applications, and virtual museum tours. For effective remote experiences across the globe, it is critical that all types of visual information be efficiently compressed to fit the available bandwidth. In this context, deep learning (DL)-based technologies have recently come to play a central role, already surpassing the compression performance of the best previous hand-crafted coding solutions. However, this breakthrough extends well beyond coding, since DL-based tools are nowadays also the most effective for computer vision tasks such as classification, recognition, detection, and segmentation. This double win opens, for the first time, the door to a common visual representation language, associated with the novel DL-based latents/coefficients, which may simultaneously serve human and machine consumption. While humans will use the DL-based coded streams to decode immersive visual content, machines will use the very same streams for computer vision tasks, thus ‘speaking’ a common visual language.
This is not possible with conventional visual representations, where machine vision processors deal with decoded content, thus suffering from compression artifacts and incurring additional complexity. This dual-consumption representation approach will enable a more powerful and immersive Extended Reality in which humans and machines can participate more seamlessly and at lower complexity. In this context, the main objective of this keynote talk is to discuss this DL-based dual-consumption paradigm, how it is being realized, and what its impacts are. Special attention will be dedicated to the ongoing standardization projects in this domain, notably in JPEG and MPEG.
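The dual-consumption idea described above can be sketched in a few lines of code. The sketch below is purely illustrative and hypothetical: the linear "encoder", "decoder", and "classifier head" are random stand-ins, not any real JPEG or MPEG model. It only shows the structural point of the paradigm, that one shared latent stream can feed both a human-oriented reconstruction path and a machine-oriented analysis path, without the machine ever touching decoded pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and random weights, chosen only for illustration.
D_IN, D_LATENT, N_CLASSES = 64, 16, 4
W_enc = rng.normal(size=(D_LATENT, D_IN)) / np.sqrt(D_IN)       # shared encoder
W_dec = rng.normal(size=(D_IN, D_LATENT)) / np.sqrt(D_LATENT)   # human path: decoder
W_cls = rng.normal(size=(N_CLASSES, D_LATENT))                  # machine path: task head

def encode(x):
    """Map an input signal to a compact latent (the shared stream payload)."""
    return W_enc @ x

def decode_for_human(z):
    """Human consumption: reconstruct viewable content from the latent."""
    return W_dec @ z

def infer_for_machine(z):
    """Machine consumption: run a vision task directly on the same latent,
    with no pixel-domain reconstruction step in between."""
    logits = W_cls @ z
    return int(np.argmax(logits))

x = rng.normal(size=D_IN)             # toy stand-in for an image
z = encode(x)                         # one latent stream...
reconstruction = decode_for_human(z)  # ...serves the human viewer
label = infer_for_machine(z)          # ...and the machine's analysis task
print(z.shape, reconstruction.shape, label)
```

The contrast with the conventional pipeline is that `infer_for_machine` consumes `z` directly, whereas a classical system would have to run its task on `reconstruction`, inheriting compression artifacts and paying for a full decode.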