Existing computer vision-based emotion recognition systems are trained to classify images of faces into a very limited number of emotions. In this work, we take a different approach and train a convolutional neural network (CNN) that can distinguish subtle differences in facial expressions appearing in both individual images and videos. To achieve this, we train a feature embedding network that maps a facial image to a point in an embedding space, such that images with similar facial expressions lie closer together than images with dissimilar expressions. We train the feature embedding network using triplet loss on the publicly available facial expression comparison (FEC) data set. The proposed facial expression model is lightweight (4.7M parameters) and obtains a triplet prediction accuracy of 84.5%, very close to the average human performance of 86.2%. A key contribution of this work is adapting the model, which was trained on images, to work reliably on video content. To this end, we propose a novel automatic face quality assessment method (FaceQA), which allows us to filter out faces that are unsuited for expression analysis. We validate the proposed approach on several applications, including searching video content for a specific expression from a single query face, generating facial expression statistics for the actors appearing in a video, and generating facial expression summaries.
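
To make the training objective and the reported metric concrete, the following is a minimal NumPy sketch of a triplet loss and of triplet prediction accuracy. It is an illustration under stated assumptions, not the authors' implementation: it assumes each triplet is ordered so that the first two faces are the most similar pair, uses squared Euclidean distance, and uses a hypothetical margin of 0.2. The function names are likewise hypothetical.

```python
import numpy as np


def squared_distance(a, b):
    """Squared Euclidean distance between batches of embeddings, shape (N, D)."""
    return np.sum((a - b) ** 2, axis=-1)


def triplet_loss(e1, e2, e3, margin=0.2):
    """Mean hinge loss over a batch of triplets.

    Assumption: (e1, e2) is the most similar pair in every triplet, so the
    loss asks d(e1, e2) to be smaller than both d(e1, e3) and d(e2, e3)
    by at least `margin` (a hypothetical value, not taken from the paper).
    """
    d12 = squared_distance(e1, e2)
    d13 = squared_distance(e1, e3)
    d23 = squared_distance(e2, e3)
    per_triplet = np.maximum(0.0, d12 - d13 + margin) + np.maximum(0.0, d12 - d23 + margin)
    return float(np.mean(per_triplet))


def triplet_prediction_accuracy(e1, e2, e3):
    """Fraction of triplets in which the annotated pair is the closest pair."""
    d12 = squared_distance(e1, e2)
    d13 = squared_distance(e1, e3)
    d23 = squared_distance(e2, e3)
    correct = (d12 < d13) & (d12 < d23)
    return float(np.mean(correct))


# Two toy triplets with 2-D embeddings: the first is ordered correctly,
# the second places the "odd one out" closest to the first face.
e1 = np.array([[0.0, 0.0], [0.0, 0.0]])
e2 = np.array([[0.1, 0.0], [1.0, 0.0]])
e3 = np.array([[1.0, 0.0], [0.1, 0.0]])

loss = triplet_loss(e1, e2, e3)
accuracy = triplet_prediction_accuracy(e1, e2, e3)
```

In this toy batch only the first triplet satisfies both distance constraints, so the accuracy is 0.5 and only the second triplet contributes to the loss. In practice the embeddings would come from the CNN rather than being fixed arrays.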
Print ISSN: 1545-0279
Electronic ISSN: 2160-2492
Published: 2022-04
Content type: Original Research
Keywords: Facial Expression Analysis, Face Quality Assessment, One-shot Learning, Real-time Video Analysis