Multimodal communicative and social behaviours

The study of multiparty interactions provides a wealth of information regarding the communicative behaviors of humans. These behaviors are inherently multimodal, as people employ a rich set of verbal and nonverbal elements to convey information, emotion and affect.

The ability of humans to unconsciously generate and interpret composite multimodal behavior signals is still far beyond what computers are capable of. Studying, understanding and modelling a construct as complex and challenging as the multiparty configuration, together with the underlying affective and social behaviors of its participants, can bring valuable insight into how humans follow and participate in the conversation, how they keep track of the state of the conversation and of the other participants, how they employ various skills to control the interaction flow, and how they respond to the verbal and nonverbal cues of their interlocutors with appropriate timing and in an appropriate manner. This insight can then be exploited in designing socially aware, affective artificial systems that can imitate such behaviors and interact with humans in ways far more natural and intuitive than they do today.

Human communicative behavior involves a number of complex, intertwined, low-level signals from different modalities, such as speech, prosody, gaze, head movements, facial expressions and body posture. These low-level signals are not independent of each other, however. There exist perceptual and temporal bindings among them, where the content and the timing of one signal can complement, reinforce or modify the meaning of another. This interplay leads to higher-level, composite, multimodal signals that provide a more solid basis for studying human communicative behavior patterns.

To study human communicative behavior, its low-level constructs, i.e. the low-level signals from the different modalities, need to be captured and analyzed. Particular focus is given to formal or semi-formal communicative situations such as meetings, interviews, group sessions, collaborative problem solving or the performance of predefined collaborative tasks, as these tend to be more structured and tractable and are thus better suited for analysis and research.

Speech can be captured through microphones and analyzed through speech processing modules such as speech recognizers or speech emotion detectors. Facial expressions and the emotional content they carry can be captured through cameras and face motion capture devices, and analyzed through advanced facial expression analysis modules. Gaze and body pose can be tracked through eye trackers or depth sensors for live interactions, or derived through specialized video analysis modules for pre-recorded interactions.

Capturing all these different biometric signals needs to be properly orchestrated and controlled with high precision. These signals are time-sensitive, some on the order of milliseconds, and their relative timing is critical, as it significantly affects how humans perceive and correlate them. To ensure proper synchronization and handling of such large amounts of time-sensitive data, specialized biometric analysis platforms are available. These can ensure that data is captured, processed, stored and handled efficiently. Furthermore, they facilitate fusing these low-level signals to derive higher-level information regarding the state of the participants (such as their engagement, attention, arousal, valence or emotional state) and their interaction (for example, turn-taking, overlaps, dominance, group performance, etc.).
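To make the synchronization and fusion steps more concrete, the following is a minimal Python sketch, not the method of any particular platform. It assumes that every stream has already been timestamped against a common clock, and it shows two small operations: aligning an irregularly sampled signal (for example gaze) to a reference timeline (for example video frames) by nearest timestamp within a tolerance, and deriving speech overlaps between two participants from their voice activity intervals. The names `align_nearest` and `speech_overlaps`, the data layout and the tolerance value are all hypothetical and chosen purely for illustration.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Sample = Tuple[float, float]    # (timestamp in seconds, value); assumed layout
Interval = Tuple[float, float]  # (start, end) in seconds; assumed layout


def align_nearest(reference: List[float], stream: List[Sample],
                  tolerance: float) -> List[Optional[float]]:
    """For each reference timestamp, return the value of the nearest sample
    in `stream`, or None if no sample lies within `tolerance` seconds.
    `stream` must be sorted by timestamp."""
    times = [t for t, _ in stream]
    aligned: List[Optional[float]] = []
    for ref in reference:
        i = bisect_left(times, ref)
        # Only the samples just before and just after `ref` can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        best = min(candidates, key=lambda j: abs(times[j] - ref), default=None)
        if best is not None and abs(times[best] - ref) <= tolerance:
            aligned.append(stream[best][1])
        else:
            aligned.append(None)  # gap: no sample close enough to this frame
    return aligned


def speech_overlaps(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the intervals during which both speakers are talking.
    Each input must be sorted and free of internal overlaps."""
    overlaps: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            overlaps.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlaps


# Illustrative data only: 25 fps video frames and three gaze samples.
frames = [0.00, 0.04, 0.08]
gaze = [(0.001, 1.0), (0.039, 2.0), (0.2, 3.0)]
gaze_per_frame = align_nearest(frames, gaze, tolerance=0.01)

# Illustrative voice activity intervals for two participants.
speaker_a = [(0.0, 2.0), (5.0, 7.0)]
speaker_b = [(1.5, 3.0), (6.0, 6.5)]
both_talking = speech_overlaps(speaker_a, speaker_b)
```

Returning `None` for frames without a sufficiently close sample keeps gaps explicit instead of silently pairing signals that are too far apart in time, which matters when the relative timing is significant at the millisecond level. A real capture setup would additionally need to deal with clock drift and device latency, which this sketch does not address.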