<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "http://dtd.nlm.nih.gov/publishing/2.0/journalpublishing.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="2.0">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">JPP</journal-id>
      <journal-id journal-id-type="nlm-ta">JMIR Pediatr Parent</journal-id>
      <journal-title>JMIR Pediatrics and Parenting</journal-title>
      <issn pub-type="epub">2561-6722</issn>
      <publisher>
        <publisher-name>JMIR Publications</publisher-name>
        <publisher-loc>Toronto, Canada</publisher-loc>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="publisher-id">v5i2e26760</article-id>
      <article-id pub-id-type="pmid">35394438</article-id>
      <article-id pub-id-type="doi">10.2196/26760</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Original Paper</subject>
        </subj-group>
        <subj-group subj-group-type="article-type">
          <subject>Original Paper</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Improved Digital Therapy for Developmental Pediatrics Using Domain-Specific Artificial Intelligence: Machine Learning Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="editor">
          <name>
            <surname>Badawy</surname>
            <given-names>Sherif</given-names>
          </name>
        </contrib>
      </contrib-group>
      <contrib-group>
        <contrib contrib-type="reviewer">
          <name>
            <surname>Franzoni</surname>
            <given-names>Valentina</given-names>
          </name>
        </contrib>
        <contrib contrib-type="reviewer">
          <name>
            <surname>Das</surname>
            <given-names>Anthony Vipin</given-names>
          </name>
        </contrib>
        <contrib contrib-type="reviewer">
          <name>
            <surname>Lin</surname>
            <given-names>Yuchen</given-names>
          </name>
        </contrib>
      </contrib-group>
      <contrib-group>
        <contrib id="contrib1" contrib-type="author" corresp="yes">
          <name name-style="western">
            <surname>Washington</surname>
            <given-names>Peter</given-names>
          </name>
          <degrees>BA, MS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <address>
            <institution>Departments of Pediatrics (Systems Medicine) and Biomedical Data Science</institution>
            <institution>Stanford University</institution>
            <addr-line>Stanford, CA</addr-line>
            <country>United States</country>
            <phone>1 5126800926</phone>
            <email>peterwashington@stanford.edu</email>
          </address>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0003-3276-4411</ext-link>
        </contrib>
        <contrib id="contrib2" contrib-type="author">
          <name name-style="western">
            <surname>Kalantarian</surname>
            <given-names>Haik</given-names>
          </name>
          <degrees>PhD</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-7107-7908</ext-link>
        </contrib>
        <contrib id="contrib3" contrib-type="author">
          <name name-style="western">
            <surname>Kent</surname>
            <given-names>John</given-names>
          </name>
          <degrees>BA, MA</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-7989-6596</ext-link>
        </contrib>
        <contrib id="contrib4" contrib-type="author">
          <name name-style="western">
            <surname>Husic</surname>
            <given-names>Arman</given-names>
          </name>
          <degrees>BS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-9180-5212</ext-link>
        </contrib>
        <contrib id="contrib5" contrib-type="author">
          <name name-style="western">
            <surname>Kline</surname>
            <given-names>Aaron</given-names>
          </name>
          <degrees>BS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-0077-5485</ext-link>
        </contrib>
        <contrib id="contrib6" contrib-type="author">
          <name name-style="western">
            <surname>Leblanc</surname>
            <given-names>Emilie</given-names>
          </name>
          <degrees>MS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-3492-3554</ext-link>
        </contrib>
        <contrib id="contrib7" contrib-type="author">
          <name name-style="western">
            <surname>Hou</surname>
            <given-names>Cathy</given-names>
          </name>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0001-6766-5128</ext-link>
        </contrib>
        <contrib id="contrib8" contrib-type="author">
          <name name-style="western">
            <surname>Mutlu</surname>
            <given-names>Onur Cezmi</given-names>
          </name>
          <degrees>BS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-9263-9332</ext-link>
        </contrib>
        <contrib id="contrib9" contrib-type="author">
          <name name-style="western">
            <surname>Dunlap</surname>
            <given-names>Kaitlyn</given-names>
          </name>
          <degrees>MS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0003-4423-5269</ext-link>
        </contrib>
        <contrib id="contrib10" contrib-type="author">
          <name name-style="western">
            <surname>Penev</surname>
            <given-names>Yordan</given-names>
          </name>
          <degrees>MS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0001-8520-9417</ext-link>
        </contrib>
        <contrib id="contrib11" contrib-type="author">
          <name name-style="western">
            <surname>Varma</surname>
            <given-names>Maya</given-names>
          </name>
          <degrees>BS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0003-0693-7753</ext-link>
        </contrib>
        <contrib id="contrib12" contrib-type="author">
          <name name-style="western">
            <surname>Stockham</surname>
            <given-names>Nate Tyler</given-names>
          </name>
          <degrees>MS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-0752-6801</ext-link>
        </contrib>
        <contrib id="contrib13" contrib-type="author">
          <name name-style="western">
            <surname>Chrisman</surname>
            <given-names>Brianna</given-names>
          </name>
          <degrees>MS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-7157-607X</ext-link>
        </contrib>
        <contrib id="contrib14" contrib-type="author">
          <name name-style="western">
            <surname>Paskov</surname>
            <given-names>Kelley</given-names>
          </name>
          <degrees>MS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-5252-1401</ext-link>
        </contrib>
        <contrib id="contrib15" contrib-type="author">
          <name name-style="western">
            <surname>Sun</surname>
            <given-names>Min Woo</given-names>
          </name>
          <degrees>BS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0003-1049-1854</ext-link>
        </contrib>
        <contrib id="contrib16" contrib-type="author">
          <name name-style="western">
            <surname>Jung</surname>
            <given-names>Jae-Yoon</given-names>
          </name>
          <degrees>PhD</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0001-7948-9803</ext-link>
        </contrib>
        <contrib id="contrib17" contrib-type="author">
          <name name-style="western">
            <surname>Voss</surname>
            <given-names>Catalin</given-names>
          </name>
          <degrees>MS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0001-6480-7020</ext-link>
        </contrib>
        <contrib id="contrib18" contrib-type="author">
          <name name-style="western">
            <surname>Haber</surname>
            <given-names>Nick</given-names>
          </name>
          <degrees>PhD</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0001-8804-7804</ext-link>
        </contrib>
        <contrib id="contrib19" contrib-type="author">
          <name name-style="western">
            <surname>Wall</surname>
            <given-names>Dennis Paul</given-names>
          </name>
          <degrees>PhD</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-7889-9146</ext-link>
        </contrib>
      </contrib-group>
      <aff id="aff1">
        <label>1</label>
        <institution>Departments of Pediatrics (Systems Medicine) and Biomedical Data Science</institution>
        <institution>Stanford University</institution>
        <addr-line>Stanford, CA</addr-line>
        <country>United States</country>
      </aff>
      <author-notes>
        <corresp>Corresponding Author: Peter Washington <email>peterwashington@stanford.edu</email></corresp>
      </author-notes>
      <pub-date pub-type="collection">
        <season>Apr-Jun</season>
        <year>2022</year>
      </pub-date>
      <pub-date pub-type="epub">
        <day>8</day>
        <month>4</month>
        <year>2022</year>
      </pub-date>
      <volume>5</volume>
      <issue>2</issue>
      <elocation-id>e26760</elocation-id>
      <history>
        <date date-type="received">
          <day>23</day>
          <month>12</month>
          <year>2020</year>
        </date>
        <date date-type="rev-request">
          <day>4</day>
          <month>2</month>
          <year>2021</year>
        </date>
        <date date-type="rev-recd">
          <day>24</day>
          <month>3</month>
          <year>2021</year>
        </date>
        <date date-type="accepted">
          <day>3</day>
          <month>1</month>
          <year>2022</year>
        </date>
      </history>
      <copyright-statement>©Peter Washington, Haik Kalantarian, John Kent, Arman Husic, Aaron Kline, Emilie Leblanc, Cathy Hou, Onur Cezmi Mutlu, Kaitlyn Dunlap, Yordan Penev, Maya Varma, Nate Tyler Stockham, Brianna Chrisman, Kelley Paskov, Min Woo Sun, Jae-Yoon Jung, Catalin Voss, Nick Haber, Dennis Paul Wall. Originally published in JMIR Pediatrics and Parenting (https://pediatrics.jmir.org), 08.04.2022.</copyright-statement>
      <copyright-year>2022</copyright-year>
      <license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
        <p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Pediatrics and Parenting, is properly cited. The complete bibliographic information, a link to the original publication on https://pediatrics.jmir.org, as well as this copyright and license information must be included.</p>
      </license>
      <self-uri xlink:href="https://pediatrics.jmir.org/2022/2/e26760" xlink:type="simple"/>
      <abstract>
        <sec sec-type="background">
          <title>Background</title>
          <p>Automated emotion classification could aid those who struggle to recognize emotions, including children with developmental behavioral conditions such as autism. However, most computer vision emotion recognition models are trained on adult emotion and therefore underperform when applied to child faces.</p>
        </sec>
        <sec sec-type="objective">
          <title>Objective</title>
          <p>We designed a strategy to gamify the collection and labeling of child emotion–enriched images to boost the performance of automatic child emotion recognition models to a level closer to what will be needed for digital health care approaches.</p>
        </sec>
        <sec sec-type="methods">
          <title>Methods</title>
          <p>We leveraged our prototype therapeutic smartphone game, GuessWhat, which was designed in large part for children with developmental and behavioral conditions, to gamify the secure collection of video data of children expressing a variety of emotions prompted by the game. Independently, we created a secure web interface to gamify the human labeling effort, called HollywoodSquares, tailored for use by any qualified labeler. We gathered and labeled 2155 videos, 39,968 emotion frames, and 106,001 labels on all images. With this drastically expanded pediatric emotion–centric database (&#62;30 times larger than existing public pediatric emotion data sets), we trained a convolutional neural network (CNN) computer vision classifier of happy, sad, surprised, fearful, angry, disgust, and neutral expressions evoked by children.</p>
        </sec>
        <sec sec-type="results">
          <title>Results</title>
          <p>The classifier achieved a 66.9% balanced accuracy and 67.4% F1-score on the entirety of the Child Affective Facial Expression (CAFE) as well as a 79.1% balanced accuracy and 78% F1-score on CAFE Subset A, a subset containing at least 60% human agreement on emotion labels. This performance is at least 10% higher than all previously developed classifiers evaluated against CAFE, the best of which reached a 56% balanced accuracy even when combining “anger” and “disgust” into a single class.</p>
        </sec>
        <sec sec-type="conclusions">
          <title>Conclusions</title>
          <p>This work validates that mobile games designed for pediatric therapies can generate high volumes of domain-relevant data sets to train state-of-the-art classifiers to perform tasks helpful to precision health efforts.</p>
        </sec>
      </abstract>
      <kwd-group>
        <kwd>computer vision</kwd>
        <kwd>emotion recognition</kwd>
        <kwd>affective computing</kwd>
        <kwd>autism spectrum disorder</kwd>
        <kwd>pediatrics</kwd>
        <kwd>mobile health</kwd>
        <kwd>digital therapy</kwd>
        <kwd>convolutional neural network</kwd>
        <kwd>machine learning</kwd>
        <kwd>artificial intelligence</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec sec-type="introduction">
      <title>Introduction</title>
      <p>Automated emotion classification can serve in pediatric care solutions, particularly to aid those who struggle to recognize emotion, such as children with autism who have trouble with emotion evocation and recognizing emotions displayed by others [<xref ref-type="bibr" rid="ref1">1</xref>-<xref ref-type="bibr" rid="ref3">3</xref>]. In prior work, computer vision models for emotion recognition [<xref ref-type="bibr" rid="ref4">4</xref>-<xref ref-type="bibr" rid="ref6">6</xref>] used in digital therapeutics have shown significant treatment effects in children with autism [<xref ref-type="bibr" rid="ref7">7</xref>-<xref ref-type="bibr" rid="ref17">17</xref>]. The increasing use of signals from sensors on mobile devices, such as the selfie camera, opens many possibilities for real-time analysis of image data for continuous phenotyping and repeated diagnoses in home settings [<xref ref-type="bibr" rid="ref18">18</xref>-<xref ref-type="bibr" rid="ref33">33</xref>]. However, facial emotion classifiers and the underlying data sets on which they are trained have been tailored to neurotypical adults, as demonstrated by repeatedly low performance on image data sets of pediatric emotion expressions [<xref ref-type="bibr" rid="ref34">34</xref>-<xref ref-type="bibr" rid="ref39">39</xref>].</p>
      <p>The Child Affective Facial Expression (CAFE) data set is currently the most popular facial expression data set pertaining to children. Prior machine learning efforts that do not include CAFE images in the training set have reached 56% accuracy on CAFE [<xref ref-type="bibr" rid="ref36">36</xref>,<xref ref-type="bibr" rid="ref37">37</xref>,<xref ref-type="bibr" rid="ref39">39</xref>], even after combining facial expressions (eg, “anger” and “disgust”) into a single class, thus limiting granularity. We do not discuss prior publications that report higher accuracy using subsets of the CAFE data set in the training and testing sets. This overall lack of performance in prior work highlights the need for developing facial emotion classifiers that work for children. With a lack of labeled data being the fundamental bottleneck to achieving clinical-grade performance, low-cost and speedy data generation and labeling techniques are pertinent.</p>
      <p>As a first step toward the creation of a large-scale data set of child emotions, we have previously designed GuessWhat, a dual-purpose smartphone app that serves as a therapeutic for children with autism while simultaneously collecting highly structured image data enriched for emoting in children. GuessWhat was designed for children aged 2 and above to encourage prosocial interaction with a gameplay partner (eg, mom or dad), focusing the camera on the child while presenting engaging but challenging prompts for the child to try to act out [<xref ref-type="bibr" rid="ref40">40</xref>-<xref ref-type="bibr" rid="ref43">43</xref>]. We have previously tested GuessWhat’s potential to increase socialization in children with autism as well as its potential to collect structured videos of children emoting facial expressions [<xref ref-type="bibr" rid="ref44">44</xref>]. In addition to collecting videos enriched with emotions, GuessWhat gameplay generates user-derived labels of emotion by leveraging the charades-style gameplay structure of the therapy.</p>
      <p>Here, we document the full pipeline for training a classifier using emotion-enriched video streams coming from GuessWhat gameplay, resulting in a state-of-the-art pediatric facial emotion classifier that outperforms all prior classifiers when evaluated on CAFE. We first recruited parents and children from around the world to play GuessWhat and share videos recorded by the smartphone app during gameplay. We next extracted frames from the videos, automatically discarding some frames through quality control algorithms, and uploaded the frames on a custom behavioral annotation labeling platform named HollywoodSquares. We prioritized the high entropy frames and shared them with a group of 9 human annotators who annotated emotions in the frames. In total, we have collected 39,968 unique labeled frames of emotions that appear in the CAFE data set. Using the resulting frames and labels, we trained a facial emotion classifier that can distinguish happy, sad, surprised, fearful, angry, disgust, and neutral expressions in naturalistic images, achieving state-of-the-art performance on CAFE and outperforming existing classifiers by over 10%. This work demonstrates that therapeutic games, while primarily providing a behavioral intervention, can simultaneously generate sufficient data for training state-of-the-art domain-specific computer vision classifiers.</p>
    </sec>
    <sec sec-type="methods">
      <title>Methods</title>
      <sec>
        <title>Data Collection</title>
        <p>The primary methodological contribution of this work is a general-purpose paradigm and pipeline (<xref rid="figure1" ref-type="fig">Figure 1</xref>) consisting of (1) passive collection of prelabeled structured videos from therapeutic interventions, (2) active learning to rank the collected frames leveraging the user-derived labels generated during gameplay, (3) human annotation of the frames in the order produced in the previous step, and (4) training a classifier while artificially augmenting the training set. We describe our instantiation of this general paradigm in the following sections.</p>
        <fig id="figure1" position="float">
          <label>Figure 1</label>
          <caption>
            <p>Pipeline of the model training process. Structured videos enriched with child emotion evocation are collected from a mobile autism therapeutic deployed in the wild. The frames are ranked for their contribution to the target classifier by a maximum entropy active learning algorithm and receive human labels on a rating platform named HollywoodSquares. The frames and corresponding labels are transferred onto a ResNet-152 neural network pretrained on the ImageNet data set.</p>
          </caption>
          <graphic xlink:href="pediatrics_v5i2e26760_fig1.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
      </sec>
      <sec>
        <title>Ethical Considerations</title>
        <p>All study procedures, including data collection, were approved by the Stanford University Institutional Review Board (IRB number 39562) and the Stanford University Privacy Office. In addition, informed consent was obtained from all participants, all of whom had the opportunity to participate in the study without sharing videos.</p>
      </sec>
      <sec>
        <title>Recruitment</title>
        <p>To recruit child video subjects, we ran a marketing campaign to gather rich and diverse video inputs of children playing GuessWhat while evoking a range of emotions. We posted advertisements on social media (Facebook, Instagram, and Twitter) and contacted prior study participants for other digital smartphone therapeutics developed by the lab [<xref ref-type="bibr" rid="ref13">13</xref>-<xref ref-type="bibr" rid="ref15">15</xref>]. All recruitment and study procedures were approved by the Stanford University IRB.</p>
      </sec>
      <sec>
        <title>User Interfaces</title>
        <sec>
          <title>GuessWhat Smartphone Therapeutic</title>
          <p>GuessWhat is a mobile autism therapy implemented on iOS and Android, which has been previously documented as a useful tool for the collection of structured video streams of children behaving in constrained manners [<xref ref-type="bibr" rid="ref40">40</xref>-<xref ref-type="bibr" rid="ref44">44</xref>], including evocation of targeted emotions. GuessWhat features a charades game where the parents place the phone on their forehead facing the child, while the child acts out the emotion prompt displayed on the screen. The front-facing camera on the phone records a video of the child in addition to corresponding prompt metadata. All sessions last for 90 seconds. Upon approval by the parent, each session video is uploaded to a Simple Storage Service (S3) bucket on Amazon Web Services (AWS). The app has resulted in 2155 videos shared by 456 unique children. Parents are asked to sign an electronic consent and assent form prior to playing GuessWhat. After each gameplay session, parents can (1) delete the videos, (2) share the videos with the research team only, or (3) share the videos publicly.</p>
        </sec>
      </sec>
      <sec>
        <title>Emotions Considered</title>
        <p>We sought labels for Paul Ekman’s list of six universal emotions: anger, disgust, fear, happiness, sadness, and surprise [<xref ref-type="bibr" rid="ref45">45</xref>-<xref ref-type="bibr" rid="ref48">48</xref>]. Ekman originally included contempt in the list of emotions but has since revised the list of universal emotions. Because CAFE does not include labels of contempt, we did not train our classifier to predict contempt. We added a seventh category named neutral, indicating the absence of an expressed emotion. Our aim was to train a 7-way emotion classifier distinguishing among Ekman’s 6 universal emotions plus neutral.</p>
        <sec>
          <title>HollywoodSquares Frame Labeling</title>
          <p>We developed a frame-labeling website named HollywoodSquares. The website provides human labelers with an interface to speedily annotate a sequential grid of frames (<xref rid="figure2" ref-type="fig">Figure 2</xref>) that were collected during the GuessWhat gameplay. To enable rapid annotation, HollywoodSquares enables users to label frames by pressing hot keys, where each key corresponds to a particular emotion label. To provide a label, users can hover their mouse over a frame and press the hot key corresponding to the emotion they want to label. As more frames are collected by GuessWhat, they continue to appear on the interface. Because the HollywoodSquares system displays over 20 images on the screen at once, it encourages rapid annotation and enables simultaneous engagement by many independent labelers. This permits rapid convergence of a majority rules consensus on image labels.</p>
          <p>We ran a labeling contest with 9 undergraduate and high school annotators, where we challenged each annotator to produce labels that would result in the highest performing classifier on the CAFE data set. Raters were aged between 15 and 24 years and were from the Bay Area, Northeastern United States, and Texas. The raters included 2 males and 7 females. For the frames produced by each individual annotator, we trained a ResNet-152 model (see Model Training). We updated annotators about the number of frames they labeled each week and the performance of the classifier trained with their individual labels. We awarded a cash prize to the annotator with the highest performance at the end of the 9-week labeling period.</p>
          <fig id="figure2" position="float">
            <label>Figure 2</label>
            <caption>
              <p>HollywoodSquares rating interface. Annotators use keyboard shortcuts and the mouse to speedily annotate a sequence of frames acquired during GuessWhat gameplay.</p>
            </caption>
            <graphic xlink:href="pediatrics_v5i2e26760_fig2.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          </fig>
          <p>HollywoodSquares was also used for a testing phase, during which the frame-labeling practices were iteratively refined between the research and annotation teams. All the labeled frames acquired during this testing phase were excluded from final classifier training.</p>
          <p>All annotators were registered as research team members through completion of the Health Insurance Portability and Accountability Act of 1996 and Collaborative Institutional Training Initiative training protocols in addition to encrypting their laptop with Stanford Whole Disk Encryption. This provided annotators with read-only access to all the videos and derived frames from GuessWhat gameplay that were shared with the research team.</p>
          <p>The final labels were chosen by the following process. If all annotators agreed unanimously about the final frame label, then this label was assigned as the final frame label. If disagreements existed between raters, then the emotion gameplay prompt associated with that frame (the “automatic label”) was assigned as the final label for that frame, as long as at least 1 of the human annotators agreed with the automatic label. If disagreements existed between raters but the automatic label did not match any human annotations, then the frame was not included in the final training data set.</p>
        </sec>
      </sec>
      <sec>
        <title>Machine Learning</title>
        <sec>
          <title>Model Training</title>
          <p>We leveraged an existing CNN architecture, ResNet-152 [<xref ref-type="bibr" rid="ref49">49</xref>], with pretrained weights from ImageNet [<xref ref-type="bibr" rid="ref50">50</xref>]. We used categorical cross entropy loss and Adam optimization with a learning rate of 3 × 10<sup>-4</sup>, with <italic>β</italic><sub>1</sub> set to .99 and <italic>β</italic><sub>2</sub> set to .999. We retrained every layer of the network until the training accuracy converged. The model converged when it did not improve against a validation data set for 20 consecutive epochs. We applied the following data augmentation strategies in conjunction and at random for each training image and each batch of training: rotation of frames between –15 and 15 degrees, zooming by a factor between 0.85 and 1.15, shifting images in every direction by up to 1/10th of the width and height, changing brightness by a factor between 80% and 120%, and potential horizontal flipping.</p>
          <p>The CNN was trained in parallel on 16 graphics processing unit (GPU) cores with a p2.16xlarge Elastic Cloud Compute instance on AWS using the Keras library in Python with a Tensorflow 2 backend. With full GPU usage, the training time was 35 minutes and 41 seconds per epoch for a batch size of 1643, translating to US $14.4 per hour.</p>
          <p>We trained 2 versions of the model, with 1 exclusively using non-GuessWhat public data set frames from (1) the Japanese Female Facial Expression (JAFFE) [<xref ref-type="bibr" rid="ref51">51</xref>], (2) a random subset of 30,000 AffectNet [<xref ref-type="bibr" rid="ref52">52</xref>] images (a subset was acquired to avoid an out of memory error), and (3) the Extended Cohn-Kanade (CK+) data set [<xref ref-type="bibr" rid="ref53">53</xref>]; the other model was trained with these public data set frames plus all 39,968 labeled and relevant GuessWhat frames.</p>
        </sec>
        <sec>
          <title>Model Evaluation</title>
          <p>We evaluated our models against the entirety of the CAFE data set [<xref ref-type="bibr" rid="ref54">54</xref>], a set of front-facing images of racially and ethnically diverse children aged 2 to 8 years expressing happy, sad, surprised, fearful, angry, disgust, and neutral emotions. CAFE is currently the largest data set of facial expressions from children and has become a standard benchmark for this field.</p>
          <p>Although existing studies have evaluated models exclusively against the entirety of the CAFE data set [<xref ref-type="bibr" rid="ref34">34</xref>-<xref ref-type="bibr" rid="ref39">39</xref>], we additionally evaluated them on Subset A and Subset B of CAFE, as defined by the authors of the data set. Subset A contains images that were identified with an accuracy of 60% or above by 100 adult participants [<xref ref-type="bibr" rid="ref54">54</xref>], with a Cronbach α internal consistency score of .82 (versus .77 for the full CAFE data set). Subset B contains images showing “substantial variability while minimizing floor and ceiling effects” [<xref ref-type="bibr" rid="ref54">54</xref>], with a Cronbach α score of .768 (close to the score of .77 for the full data set).</p>
        </sec>
      </sec>
    </sec>
    <sec sec-type="results">
      <title>Results</title>
      <sec>
        <title>Frame Processing</title>
        <p>The HollywoodSquares annotators processed 106,001 unique frames (273,493 including the testing phase and 491,343 unique labels when counting multiple labels for the same frame as a different label). Of the 106,001 unique frames labeled, 39,968 received an emotion label corresponding to 1 of the 7 CAFE emotions (not including the testing phase labels). <xref ref-type="table" rid="table1">Table 1</xref> contains the number of frames that were included in the training set for each emotion class, including how many children and videos are represented for each emotion category. The frames that were not included received labels of “None” (corresponding to a situation where no face or an incomplete face appears in the frame), “Unknown” (corresponding to the face not expressing a clear emotion), or “Contempt” (corresponding to the face not expressing an emotion in the CAFE set). The large number of curated frames displaying emotion demonstrates the usefulness of HollywoodSquares in filtering out emotion events from noisy data streams. The lack of balance across emotion categories is a testament particularly to the difficulty of evoking anger and sadness as well as disgust and fear, although to a lesser extent.</p>
        <p>Of the children who completed 1 session of the Emoji challenge in GuessWhat and uploaded a video to share with the research team, 75 were female, 141 were male, and 51 did not specify their gender. <xref ref-type="table" rid="table2">Table 2</xref> presents the racial and ethnic makeup of the participant cohort. Representative GuessWhat frames and cropped faces used to train the classifier, obtained from the subset of participants who consented explicitly to public sharing of their images, are displayed in <xref rid="figure3" ref-type="fig">Figure 3</xref>.</p>
        <table-wrap position="float" id="table1">
          <label>Table 1</label>
          <caption>
            <p>Emotions represented in the HollywoodSquares data set, including how many children and videos are represented for each emotion category.</p>
          </caption>
          <table width="1000" cellpadding="5" cellspacing="0" border="1" rules="groups" frame="hsides">
            <col width="250"/>
            <col width="250"/>
            <col width="250"/>
            <col width="250"/>
            <thead>
              <tr valign="top">
                <td>Emotion</td>
                <td>Frequency</td>
                <td>Number of children</td>
                <td>Number of videos</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td>Anger</td>
                <td>643</td>
                <td>28</td>
                <td>62</td>
              </tr>
              <tr valign="top">
                <td>Disgust</td>
                <td>1723</td>
                <td>46</td>
                <td>95</td>
              </tr>
              <tr valign="top">
                <td>Fear</td>
                <td>1875</td>
                <td>41</td>
                <td>89</td>
              </tr>
              <tr valign="top">
                <td>Happy</td>
                <td>13,332</td>
                <td>73</td>
                <td>228</td>
              </tr>
              <tr valign="top">
                <td>Neutral</td>
                <td>16,055</td>
                <td>87</td>
                <td>289</td>
              </tr>
              <tr valign="top">
                <td>Sad</td>
                <td>947</td>
                <td>31</td>
                <td>93</td>
              </tr>
              <tr valign="top">
                <td>Surprise</td>
                <td>5393</td>
                <td>52</td>
                <td>135</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap position="float" id="table2">
          <label>Table 2</label>
          <caption>
            <p>Representation of race and ethnicity of children who played the “Emoji” charades category and uploaded a video to the cloud.</p>
          </caption>
          <table width="1000" cellpadding="5" cellspacing="0" border="1" rules="groups" frame="hsides">
            <col width="500"/>
            <col width="500"/>
            <thead>
              <tr valign="top">
                <td>Race/ethnicity</td>
                <td>Frequency</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td>Arab</td>
                <td>6</td>
              </tr>
              <tr valign="top">
                <td>Black or African</td>
                <td>16</td>
              </tr>
              <tr valign="top">
                <td>East Asian</td>
                <td>16</td>
              </tr>
              <tr valign="top">
                <td>Hispanic</td>
                <td>36</td>
              </tr>
              <tr valign="top">
                <td>Native American</td>
                <td>7</td>
              </tr>
              <tr valign="top">
                <td>Pacific Islander</td>
                <td>5</td>
              </tr>
              <tr valign="top">
                <td>South Asian</td>
                <td>14</td>
              </tr>
              <tr valign="top">
                <td>Southeast Asian</td>
                <td>7</td>
              </tr>
              <tr valign="top">
                <td>White or Caucasian</td>
                <td>100</td>
              </tr>
              <tr valign="top">
                <td>Not specified</td>
                <td>60</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <fig id="figure3" position="float">
          <label>Figure 3</label>
          <caption>
            <p>Example of frames collected from GuessWhat gameplay, including examples of cropped (A) and original (B) frames. We have displayed these images after obtaining consent from the participants for public sharing.</p>
          </caption>
          <graphic xlink:href="pediatrics_v5i2e26760_fig3.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
      </sec>
      <sec>
        <title>Performance on CAFE, CAFE-Defined Subsets, and CAFE Subset Balanced in Terms of Race, Gender, and Emotions</title>
        <p>The ResNet-152 network trained on the entire labeled HollywoodSquares data set as well as the JAFFE, AffectNet subset, and CK+ data sets achieved a balanced accuracy of 66.9% and an F1-score of 67.4% on the entirety of the CAFE data set (confusion matrix in <xref rid="figure4" ref-type="fig">Figure 4</xref>). When only the HollywoodSquares data set was included in the training set, the model achieved a balanced accuracy of 64.12% and an F1-score of 64.2%. When only including the JAFFE, AffectNet subset, and CK+ sets, the classifier achieved an F1-score of 56.14% and a balanced accuracy of 52.5%, highlighting the contribution of the HollywoodSquares data set.</p>
        <fig id="figure4" position="float">
          <label>Figure 4</label>
          <caption>
            <p>Confusion matrix for the entirety of the Child Affective Facial Expression data set.</p>
          </caption>
          <graphic xlink:href="pediatrics_v5i2e26760_fig4.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
        <p>To quantify the contribution of the neural network architecture itself, we compared the performance of several state-of-the-art neural network architectures when only including the HollywoodSquares data set in the training set (<xref ref-type="table" rid="table3">Table 3</xref>). We evaluated the following models: ResNet152V2 [<xref ref-type="bibr" rid="ref49">49</xref>], ResNet50V2 [<xref ref-type="bibr" rid="ref49">49</xref>], InceptionV3 [<xref ref-type="bibr" rid="ref55">55</xref>], MobileNetV2 [<xref ref-type="bibr" rid="ref56">56</xref>], DenseNet121 [<xref ref-type="bibr" rid="ref57">57</xref>], DenseNet201 [<xref ref-type="bibr" rid="ref57">57</xref>], and Xception [<xref ref-type="bibr" rid="ref58">58</xref>]. The same training conditions and hyperparameters were used across all models. We found that ResNet152V2 performed better than the other networks when trained with our data, so we used this model for the remainder of our experiments.</p>
        <p>The performance improved, resulting in a balanced accuracy of 79.1% and an F1-score of 78% on CAFE Subset A (confusion matrix in <xref rid="figure5" ref-type="fig">Figure 5</xref>), a subset containing more universally accepted emotion labels. When only including the non-GuessWhat public images in the training set, the model achieved a balanced accuracy of 65.3% and an F1-score of 69.2%. On CAFE Subset B, the balanced accuracy was 66.4% and the F1-score was 67.2% (confusion matrix in <xref rid="figure6" ref-type="fig">Figure 6</xref>); the balanced accuracy was 57.2% and F1-score was 57.3% when exclusively training on the non-GuessWhat public images.</p>
        <table-wrap position="float" id="table3">
          <label>Table 3</label>
          <caption>
            <p>Comparison of several popular neural network architectures trained on the same data set<sup>a</sup>.</p>
          </caption>
          <table width="1000" cellpadding="5" cellspacing="0" border="1" rules="groups" frame="hsides">
            <col width="251"/>
            <col width="256"/>
            <col width="152"/>
            <col width="341"/>
            <thead>
              <tr valign="top">
                <td>Model</td>
                <td>Balanced accuracy (%)</td>
                <td>F1-score (%)</td>
                <td>Number of network parameters</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td>ResNet152V2; He et al [<xref ref-type="bibr" rid="ref49">49</xref>]</td>
                <td>64.12</td>
                <td>64.2</td>
                <td>60,380,648</td>
              </tr>
              <tr valign="top">
                <td>ResNet50V2; He et al [<xref ref-type="bibr" rid="ref49">49</xref>]</td>
                <td>63.67</td>
                <td>63.12</td>
                <td>25,613,800</td>
              </tr>
              <tr valign="top">
                <td>InceptionV3; Szegedy et al [<xref ref-type="bibr" rid="ref55">55</xref>]</td>
                <td>59</td>
                <td>59.66</td>
                <td>23,851,784</td>
              </tr>
              <tr valign="top">
                <td>MobileNetV2; Sandler et al [<xref ref-type="bibr" rid="ref56">56</xref>]</td>
                <td>57.63</td>
                <td>58.19</td>
                <td>3,538,984</td>
              </tr>
              <tr valign="top">
                <td>DenseNet121; Huang et al [<xref ref-type="bibr" rid="ref57">57</xref>]</td>
                <td>58.2</td>
                <td>59.19</td>
                <td>8,062,504</td>
              </tr>
              <tr valign="top">
                <td>DenseNet201; Huang et al [<xref ref-type="bibr" rid="ref57">57</xref>]</td>
                <td>57.02</td>
                <td>58.95</td>
                <td>20,242,984</td>
              </tr>
              <tr valign="top">
                <td>Xception; Chollet F [<xref ref-type="bibr" rid="ref58">58</xref>]</td>
                <td>58.16</td>
                <td>60.58</td>
                <td>22,910,480</td>
              </tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <fn id="table3fn1">
              <p><sup>a</sup>Default hyperparameters were used for all networks.</p>
            </fn>
          </table-wrap-foot>
        </table-wrap>
        <fig id="figure5" position="float">
          <label>Figure 5</label>
          <caption>
            <p>Confusion matrix for Child Affective Facial Expression Subset A.</p>
          </caption>
          <graphic xlink:href="pediatrics_v5i2e26760_fig5.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
        <fig id="figure6" position="float">
          <label>Figure 6</label>
          <caption>
            <p>Confusion matrix for Child Affective Facial Expression Subset B.</p>
          </caption>
          <graphic xlink:href="pediatrics_v5i2e26760_fig6.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
      </sec>
      <sec>
        <title>Classifier Performance Based on Image Difficulty</title>
        <p>CAFE images were labeled by 100 adults, and the percentage of participants who labeled the correct class is reported with the data set [<xref ref-type="bibr" rid="ref54">54</xref>]. We binned frames into 10 difficulty classes (ie, 90%-100% correct human labels, 80%-90% correct human labels, etc). <xref rid="figure7" ref-type="fig">Figure 7</xref> shows that our classifier performs exceedingly well on unambiguous images. Of the 233 images with 90%-100% agreement between the original CAFE labelers, our classifier correctly classifies 90.1% of the images. The true label makeup of these images is as follows: 131 happy, 58 neutral, 20 anger, 9 sad, 8 surprise, 7 disgust, and 0 fear images. This confirms that humans have trouble identifying nonhappy and nonneutral facial expressions. Of the 455 images with 80%-100% agreement between the original CAFE labelers, our classifier correctly classifies 81.1% of the images.</p>
        <fig id="figure7" position="float">
          <label>Figure 7</label>
          <caption>
            <p>Classifier performance versus original CAFE annotator performance for 10 difficulty bins. The classifier tends to perform well when humans agree on the class and poorly otherwise. The numbers in parentheses represent the number of images in each bin. This highlights the issue of ambiguous labels in affective computing and demonstrates that our model performance scales proportionally to human performance. CAFE: Child Affective Facial Expression.</p>
          </caption>
          <graphic xlink:href="pediatrics_v5i2e26760_fig7.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
      </sec>
    </sec>
    <sec sec-type="discussion">
      <title>Discussion</title>
      <sec>
        <title>Principal Results</title>
        <p>Through the successful application of an in-the-wild child developmental health therapeutic that simultaneously captures video data, we show that a pipeline for intelligently and continuously labeling image frames collected passively from mobile gameplay can generate sufficient training data for a high-performing computer vision classifier (relative to prior work). We curated a data set that contains images enriched for naturalistic facial expressions of children, including but not limited to children with autism.</p>
        <p>We demonstrate the best-performing pediatric facial emotion classifier to date according to the CAFE data set. The best-performing classifiers evaluated in earlier studies involving facial emotion classification on the CAFE data set, including images from CAFE in the training set, achieved an accuracy of up to 56% on CAFE [<xref ref-type="bibr" rid="ref36">36</xref>,<xref ref-type="bibr" rid="ref37">37</xref>,<xref ref-type="bibr" rid="ref39">39</xref>] and combined “anger” and “disgust” into a single class. By contrast, we achieved a balanced accuracy of 66.9% and an F1-score of 67.4% without including any CAFE images in the training set. This is a clear illustration of the power of parallel data curation from distributed mobile devices in conjunction with deep learning, and this approach can possibly be generalized to the collection of training data for other domains.</p>
        <p>We collected a sufficiently large training sample to alleviate the need for extracting facial keypoint features, as was the case in prior works. Instead, we used the unaltered images as inputs to a deep CNN.</p>
      </sec>
      <sec>
        <title>Limitations and Future Work</title>
        <p>A major limitation of this work is the use of 7 discrete and distinct emotion categories. Some images in the training set might have exhibited more than 1 emotion, such as “happily surprised” or “fearfully surprised.” This could be addressed in future work by a more thorough investigation of the final emotion classes. Another limitation is that similar to existing emotion data sets, our generated data set contains fake emotion evocations by the children. This is due to limitations imposed by ethics review committees and the IRB who, understandably so, do not allow provoking real fear or sadness in participants, especially young children who may have a developmental delay. This issue of fake emotion evocation has been documented in prior studies [<xref ref-type="bibr" rid="ref4">4</xref>,<xref ref-type="bibr" rid="ref5">5</xref>,<xref ref-type="bibr" rid="ref59">59</xref>,<xref ref-type="bibr" rid="ref60">60</xref>]. Finding a solution to this issue that would appease ethical review committees is an open research question.</p>
        <p>Another limitation is that we did not address the possibility of complex or compound emotions [<xref ref-type="bibr" rid="ref61">61</xref>]. A particular facial expression can consist of multiple universal expressions. For example, “happily surprised,” “fearfully surprised,” and even “angrily surprised” are all separate subclasses of “surprised.” We have not separated these categories in this study. We recommend that future studies explore the possibility of predicting compound and complex facial expressions.</p>
        <p>There are several fruitful avenues for future work. The paradigm of passive data collection during mobile intervention gameplay could be expanded to other digital intervention modalities, such as wearable autism systems with front-facing cameras [<xref ref-type="bibr" rid="ref7">7</xref>,<xref ref-type="bibr" rid="ref8">8</xref>,<xref ref-type="bibr" rid="ref11">11</xref>,<xref ref-type="bibr" rid="ref13">13</xref>-<xref ref-type="bibr" rid="ref17">17</xref>]. This paradigm can also be applied toward the curation of data and subsequent training of other behavioral classifiers. Relevant computer vision models for diagnosing autism could include computer vision–powered quantification of hand stimming, eye contact, and repetitive behavior, as well as audio-based classification of abnormal prosody, among others.</p>
        <p>The next major research step will be to evaluate how systems like GuessWhat can benefit from the incorporation of the machine learning models back into the system in a closed-loop fashion while preserving privacy and trust [<xref ref-type="bibr" rid="ref62">62</xref>]. Quantification of autistic behaviors during gameplay via machine learning models trained with gameplay videos can enable a feedback loop that provides a dynamic and adaptive therapy for the child. Models can be further personalized to the child’s unique characteristics, providing higher performance through customized fine-tuning of the network.</p>
      </sec>
      <sec>
        <title>Conclusions</title>
        <p>We have demonstrated that gamified digital therapeutic interventions can generate sufficient data for training state-of-the-art computer vision classifiers, in this case for pediatric facial emotion. Using this data curation and labeling paradigm, we trained a state-of-the-art 7-way pediatric facial emotion classifier.</p>
      </sec>
    </sec>
  </body>
  <back>
    <app-group/>
    <glossary>
      <title>Abbreviations</title>
      <def-list>
        <def-item>
          <term id="abb1">AWS</term>
          <def>
            <p>Amazon Web Services</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb2">CAFE</term>
          <def>
            <p>Child Affective Facial Expression data set</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb3">CK+</term>
          <def>
            <p>Extended Cohn-Kanade data set</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb4">CNN</term>
          <def>
            <p>convolutional neural network</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb5">GPU</term>
          <def>
            <p>graphics processing unit</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb6">IRB</term>
          <def>
            <p>Institutional Review Board</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb7">JAFFE</term>
          <def>
            <p>Japanese Female Facial Expression data set</p>
          </def>
        </def-item>
      </def-list>
    </glossary>
    <ack>
      <p>We would like to acknowledge all nine high school and undergraduate emotion annotators: Natalie Park, Chris Harjadi, Meagan Tsou, Belle Bankston, Hadley Daniels, Sky Ng-Thow-Hing, Bess Olshen, Courtney McCormick, and Jennifer Yu. The work was supported in part by funds to DPW from the National Institutes of Health (grants 1R01EB025025-01, 1R01LM013364-01, 1R21HD091500-01, and 1R01LM013083), the National Science Foundation (Award 2014232), The Hartwell Foundation, Bill and Melinda Gates Foundation, Coulter Foundation, Lucile Packard Foundation, Auxiliaries Endowment, the Islamic Development Bank Transform Fund, the Weston Havens Foundation, and program grants from Stanford’s Human-Centered Artificial Intelligence Program, Precision Health and Integrated Diagnostics Center, Beckman Center, Bio-X Center, Predictives and Diagnostics Accelerator, Spectrum, Spark Program in Translational Research, MediaX, and from the Wu Tsai Neurosciences Institute's Neuroscience:Translate Program. We also acknowledge generous support from David Orr, Imma Calvo, Bobby Dekesyer, and Peter Sullivan. PW would like to acknowledge support from Mr. Schroeder and the Stanford Interdisciplinary Graduate Fellowship (SIGF) as the Schroeder Family Goldman Sachs Graduate Fellow.</p>
    </ack>
    <fn-group>
      <fn fn-type="conflict">
        <p>DPW is the founder of Cognoa.com. This company is developing digital health solutions for pediatric care. AK works as a part-time consultant with Cognoa.com. All other authors declare no conflict of interests.</p>
      </fn>
    </fn-group>
    <ref-list>
      <ref id="ref1">
        <label>1</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Harms</surname>
              <given-names>MB</given-names>
            </name>
            <name name-style="western">
              <surname>Martin</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Wallace</surname>
              <given-names>GL</given-names>
            </name>
          </person-group>
          <article-title>Facial emotion recognition in autism spectrum disorders: a review of behavioral and neuroimaging studies</article-title>
          <source>Neuropsychol Rev</source>
          <year>2010</year>
          <month>9</month>
          <volume>20</volume>
          <fpage>290</fpage>
          <lpage>322</lpage>
          <pub-id pub-id-type="doi">10.1007/s11065-010-9138-6</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref2">
        <label>2</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Hobson</surname>
              <given-names>RP</given-names>
            </name>
            <name name-style="western">
              <surname>Ouston</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Lee</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <article-title>Emotion recognition in autism: coordinating faces and voices</article-title>
          <source>Psychol Med</source>
          <year>2009</year>
          <month>07</month>
          <volume>18</volume>
          <issue>4</issue>
          <fpage>911</fpage>
          <lpage>923</lpage>
          <pub-id pub-id-type="doi">10.1017/S0033291700009843</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref3">
        <label>3</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Rieffe</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Oosterveld</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Terwogt</surname>
              <given-names>MM</given-names>
            </name>
            <name name-style="western">
              <surname>Mootz</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>van Leeuwen</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Stockmann</surname>
              <given-names>L</given-names>
            </name>
          </person-group>
          <article-title>Emotion regulation and internalizing symptoms in children with autism spectrum disorders</article-title>
          <source>Autism</source>
          <year>2011</year>
          <month>07</month>
          <volume>15</volume>
          <issue>6</issue>
          <fpage>655</fpage>
          <lpage>670</lpage>
          <pub-id pub-id-type="doi">10.1177/1362361310366571</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref4">
        <label>4</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Carolis</surname>
              <given-names>BD</given-names>
            </name>
            <name name-style="western">
              <surname>D’Errico</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Paciello</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Palestra</surname>
              <given-names>G</given-names>
            </name>
          </person-group>
          <article-title>Cognitive emotions recognition in e-learning: exploring the role of age differences and personality traits</article-title>
          <source>Methodologies and Intelligent Systems for Technology Enhanced Learning, 9th International Conference</source>
          <year>2019</year>
          <month>06</month>
          <conf-name>International Conference in Methodologies and intelligent Systems for Technology Enhanced Learning</conf-name>
          <conf-date>June 26-28, 2019</conf-date>
          <conf-loc>Ávila, Spain</conf-loc>
          <fpage>97</fpage>
          <lpage>104</lpage>
          <pub-id pub-id-type="doi">10.1007/978-3-030-23990-9_12</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref5">
        <label>5</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>De Carolis</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>D’Errico</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Rossano</surname>
              <given-names>V</given-names>
            </name>
          </person-group>
          <article-title>Socio-affective technologies [SI 1156 T]</article-title>
          <source>Multimed Tools Appl</source>
          <year>2020</year>
          <month>10</month>
          <volume>79</volume>
          <fpage>35779</fpage>
          <lpage>35783</lpage>
          <pub-id pub-id-type="doi">10.1007/s11042-020-10015-3</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref6">
        <label>6</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Franzoni</surname>
              <given-names>V</given-names>
            </name>
            <name name-style="western">
              <surname>Biondi</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Perri</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Gervasi</surname>
              <given-names>O</given-names>
            </name>
          </person-group>
          <article-title>Enhancing mouth-based emotion recognition using transfer learning</article-title>
          <source>Sensors</source>
          <year>2020</year>
          <month>09</month>
          <volume>20</volume>
          <issue>18</issue>
          <fpage>5222</fpage>
          <pub-id pub-id-type="doi">10.3390/s20185222</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref7">
        <label>7</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Daniels</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Tamura</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Fazel</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Kline</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Phillips</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Winograd</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Feinstein</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Feasibility testing of a wearable behavioral aid for social learning in children with autism</article-title>
          <source>Appl Clin Inform</source>
          <year>2018</year>
          <month>02</month>
          <volume>09</volume>
          <issue>01</issue>
          <fpage>129</fpage>
          <lpage>140</lpage>
          <pub-id pub-id-type="doi">10.1055/s-0038-1626727</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref8">
        <label>8</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Daniels</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>JN</given-names>
            </name>
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Fazel</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Kline</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Feinstein</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Winograd</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Exploratory study examining the at-home feasibility of a wearable tool for social-affective learning in children with autism</article-title>
          <source>NPJ Digital Med</source>
          <year>2018</year>
          <month>08</month>
          <volume>1</volume>
          <fpage>32</fpage>
          <pub-id pub-id-type="doi">10.1038/s41746-018-0035-3</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref9">
        <label>9</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Fazel</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Winograd</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>A practical approach to real-time neutral feature subtraction for facial expression recognition</article-title>
          <source>2016 IEEE Winter Conference on Applications of Computer Vision (WACV)</source>
          <year>2016</year>
          <conf-name>IEEE Winter Conference on Applications of Computer Vision (WACV)</conf-name>
          <conf-date>March 7-10, 2016</conf-date>
          <conf-loc>Lake Placid, United States</conf-loc>
          <fpage>1</fpage>
          <lpage>9</lpage>
          <pub-id pub-id-type="doi">10.1109/WACV.2016.7477675</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref10">
        <label>10</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>D</given-names>
            </name>
          </person-group>
          <article-title>Making emotions transparent: Google Glass helps autistic kids understand facial expressions through augmented-reality therapy</article-title>
          <source>IEEE Spectrum</source>
          <year>2020</year>
          <month>4</month>
          <volume>57</volume>
          <issue>4</issue>
          <fpage>46</fpage>
          <lpage>52</lpage>
          <pub-id pub-id-type="doi">10.1109/MSPEC.2020.9055973</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref11">
        <label>11</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Kline</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Tariq</surname>
              <given-names>Q</given-names>
            </name>
            <name name-style="western">
              <surname>Winograd</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Feinstein</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Superpower glass</article-title>
          <source>GetMobile: Mobile Comp Comm</source>
          <year>2019</year>
          <month>11</month>
          <volume>23</volume>
          <issue>2</issue>
          <fpage>35</fpage>
          <lpage>38</lpage>
          <pub-id pub-id-type="doi">10.1145/3372300.3372308</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref12">
        <label>12</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Nag</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Tamura</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Daniels</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Ma</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Chiang</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Ramachandran</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Winograd</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Feinstein</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Toward continuous social phenotyping: analyzing gaze patterns in an emotion recognition task for children with autism through wearable smart glasses</article-title>
          <source>J Med Internet Res</source>
          <year>2020</year>
          <month>04</month>
          <volume>22</volume>
          <issue>4</issue>
          <fpage>e13810</fpage>
          <pub-id pub-id-type="doi">10.2196/13810</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref13">
        <label>13</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>The potential for machine learning–based wearables to improve socialization in teenagers and adults with autism spectrum disorder—reply</article-title>
          <source>JAMA Pediatr</source>
          <year>2019</year>
          <month>11</month>
          <volume>173</volume>
          <issue>11</issue>
          <fpage>1106</fpage>
          <pub-id pub-id-type="doi">10.1001/jamapediatrics.2019.2969</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref14">
        <label>14</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Daniels</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Kline</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Tariq</surname>
              <given-names>Q</given-names>
            </name>
            <name name-style="western">
              <surname>Robinson</surname>
              <given-names>TN</given-names>
            </name>
            <name name-style="western">
              <surname>Desai</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Phillips</surname>
              <given-names>JM</given-names>
            </name>
            <name name-style="western">
              <surname>Feinstein</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Winograd</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Effect of wearable digital intervention for improving socialization in children with autism spectrum disorder</article-title>
          <source>JAMA Pediatr</source>
          <year>2019</year>
          <month>05</month>
          <volume>173</volume>
          <issue>5</issue>
          <fpage>446</fpage>
          <lpage>454</lpage>
          <pub-id pub-id-type="doi">10.1001/jamapediatrics.2019.0285</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref15">
        <label>15</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Kline</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Daniels</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Fazel</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>De</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>McCarthy</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Feinstein</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Winograd</surname>
              <given-names>TA</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Superpower glass: delivering unobtrusive real-time social cues in wearable systems</article-title>
          <source>UbiComp '16: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct</source>
          <year>2016</year>
          <month>09</month>
          <conf-name>UbiComp '16: The 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing</conf-name>
          <conf-date>September 12-16, 2016</conf-date>
          <conf-loc>Heidelberg, Germany</conf-loc>
          <fpage>1218</fpage>
          <lpage>1226</lpage>
          <pub-id pub-id-type="doi">10.1145/2968219.2968310</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref16">
        <label>16</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Tanaka</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Daniels</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Feinstein</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Winograd</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>A wearable social interaction aid for children with autism</article-title>
          <source>CHI EA '16: Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems</source>
          <year>2016</year>
          <month>05</month>
          <conf-name>CHI'16: CHI Conference on Human Factors in Computing Systems</conf-name>
          <conf-date>May 7-12, 2016</conf-date>
          <conf-loc>San Jose, United States</conf-loc>
          <fpage>2348</fpage>
          <lpage>2354</lpage>
          <pub-id pub-id-type="doi">10.1145/2851581.2892282</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref17">
        <label>17</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Kline</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Daniels</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Fazel</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>De</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Feinstein</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Winograd</surname>
              <given-names>TA</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>SuperpowerGlass: a wearable aid for the at-home therapy of children with autism</article-title>
          <source>Proc ACM Interact Mob Wearable Ubiquitous Technol</source>
          <year>2017</year>
          <month>09</month>
          <volume>1</volume>
          <issue>3</issue>
          <fpage>1</fpage>
          <lpage>22</lpage>
          <pub-id pub-id-type="doi">10.1145/3130977</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref18">
        <label>18</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Abbas</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Garberson</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Glover</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Machine learning for early detection of autism (and other conditions) using a parental questionnaire and home video screening</article-title>
          <year>2017</year>
          <conf-name>IEEE International Conference on Big Data (Big Data)</conf-name>
          <conf-date>December 11-14, 2017</conf-date>
          <conf-loc>Boston, United States</conf-loc>
          <fpage>3558</fpage>
          <lpage>3561</lpage>
          <pub-id pub-id-type="doi">10.1109/bigdata.2017.8258346</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref19">
        <label>19</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Abbas</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Garberson</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Liu-Mayo</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Glover</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Multi-modular AI approach to streamline autism diagnosis in young children</article-title>
          <source>Sci Rep</source>
          <year>2020</year>
          <month>03</month>
          <volume>10</volume>
          <fpage>5014</fpage>
          <pub-id pub-id-type="doi">10.1038/s41598-020-61213-w</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref20">
        <label>20</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Duda</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Kosmicki</surname>
              <given-names>JA</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Testing the accuracy of an observation-based classifier for rapid detection of autism risk</article-title>
          <source>Transl Psychiatry</source>
          <year>2014</year>
          <month>08</month>
          <volume>4</volume>
          <fpage>e424</fpage>
          <pub-id pub-id-type="doi">10.1038/tp.2014.65</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref21">
        <label>21</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Duda</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Ma</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Use of machine learning for behavioral distinction of autism and ADHD</article-title>
          <source>Transl Psychiatry</source>
          <year>2016</year>
          <month>2</month>
          <volume>6</volume>
          <fpage>e732</fpage>
          <pub-id pub-id-type="doi">10.1038/tp.2015.221</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref22">
        <label>22</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Duda</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Daniels</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Crowdsourced validation of a machine-learning classification system for autism and ADHD</article-title>
          <source>Transl Psychiatry</source>
          <year>2017</year>
          <month>5</month>
          <volume>7</volume>
          <fpage>e1133</fpage>
          <pub-id pub-id-type="doi">10.1038/tp.2017.86</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref23">
        <label>23</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Fusaro</surname>
              <given-names>VA</given-names>
            </name>
            <name name-style="western">
              <surname>Daniels</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Duda</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>DeLuca</surname>
              <given-names>TF</given-names>
            </name>
            <name name-style="western">
              <surname>D’Angelo</surname>
              <given-names>O</given-names>
            </name>
            <name name-style="western">
              <surname>Tamburello</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Maniscalco</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>The potential of accelerating early detection of autism through content analysis of YouTube videos</article-title>
          <source>PLoS ONE</source>
          <year>2014</year>
          <month>4</month>
          <volume>9</volume>
          <issue>4</issue>
          <fpage>e93533</fpage>
          <pub-id pub-id-type="doi">10.1371/journal.pone.0093533</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref24">
        <label>24</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Levy</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Duda</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Sparsifying machine learning models identify stable subsets of predictive features for behavioral detection of autism</article-title>
          <source>Mol Autism</source>
          <year>2017</year>
          <month>12</month>
          <volume>8</volume>
          <fpage>65</fpage>
          <pub-id pub-id-type="doi">10.1186/s13229-017-0180-6</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref25">
        <label>25</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Leblanc</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Varma</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Dunlap</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Penev</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Kline</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Feature replacement methods enable reliable home video analysis for machine learning detection of autism</article-title>
          <source>Sci Rep</source>
          <year>2020</year>
          <month>12</month>
          <volume>10</volume>
          <fpage>21245</fpage>
          <pub-id pub-id-type="doi">10.1038/s41598-020-76874-w</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref26">
        <label>26</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Stark</surname>
              <given-names>DE</given-names>
            </name>
            <name name-style="western">
              <surname>Kumar</surname>
              <given-names>RB</given-names>
            </name>
            <name name-style="western">
              <surname>Longhurst</surname>
              <given-names>CA</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>The quantified brain: a framework for mobile device-based assessment of behavior and neurological function</article-title>
          <source>Appl Clin Inform</source>
          <year>2017</year>
          <month>12</month>
          <volume>07</volume>
          <issue>02</issue>
          <fpage>290</fpage>
          <lpage>298</lpage>
          <pub-id pub-id-type="doi">10.4338/ACI-2015-12-LE-0176</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref27">
        <label>27</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Tariq</surname>
              <given-names>Q</given-names>
            </name>
            <name name-style="western">
              <surname>Daniels</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>JN</given-names>
            </name>
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Kalantarian</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Mobile detection of autism through machine learning on home video: a development and prospective validation study</article-title>
          <source>PLoS Med</source>
          <year>2018</year>
          <month>11</month>
          <volume>15</volume>
          <issue>11</issue>
          <fpage>e1002705</fpage>
          <pub-id pub-id-type="doi">10.1371/journal.pmed.1002705</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref28">
        <label>28</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Tariq</surname>
              <given-names>Q</given-names>
            </name>
            <name name-style="western">
              <surname>Fleming</surname>
              <given-names>SL</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>JN</given-names>
            </name>
            <name name-style="western">
              <surname>Dunlap</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Corbin</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Kalantarian</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Khan</surname>
              <given-names>NZ</given-names>
            </name>
            <name name-style="western">
              <surname>Darmstadt</surname>
              <given-names>GL</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Detecting developmental delay and autism through machine learning models using home videos of Bangladeshi children: development and validation study</article-title>
          <source>J Med Internet Res</source>
          <year>2019</year>
          <month>04</month>
          <volume>21</volume>
          <issue>4</issue>
          <fpage>e13822</fpage>
          <pub-id pub-id-type="doi">10.2196/13822</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref29">
        <label>29</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Kalantarian</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Tariq</surname>
              <given-names>Q</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Dunlap</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Chrisman</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Varma</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Ning</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Kline</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Stockham</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Paskov</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Validity of online screening for autism: crowdsourcing study comparing paid and unpaid diagnostic tasks</article-title>
          <source>J Med Internet Res</source>
          <year>2019</year>
          <month>05</month>
          <volume>21</volume>
          <issue>5</issue>
          <fpage>e13668</fpage>
          <pub-id pub-id-type="doi">10.2196/13668</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref30">
        <label>30</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Leblanc</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Dunlap</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Penev</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Kline</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Paskov</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Sun</surname>
              <given-names>MW</given-names>
            </name>
            <name name-style="western">
              <surname>Chrisman</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Stockham</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Varma</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Precision telemedicine through crowdsourced machine learning: testing variability of crowd workers for video-based autism feature recognition</article-title>
          <source>J Pers Med</source>
          <year>2020</year>
          <month>08</month>
          <volume>10</volume>
          <issue>3</issue>
          <fpage>86</fpage>
          <pub-id pub-id-type="doi">10.3390/jpm10030086</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref31">
        <label>31</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Leblanc</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Dunlap</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Penev</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Varma</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Jung</surname>
              <given-names>J-Y</given-names>
            </name>
            <name name-style="western">
              <surname>Chrisman</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Sun</surname>
              <given-names>MW</given-names>
            </name>
            <name name-style="western">
              <surname>Stockham</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Paskov</surname>
              <given-names>KM</given-names>
            </name>
            <name name-style="western">
              <surname>Kalantarian</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Selection of trustworthy crowd workers for telemedical diagnosis of pediatric autism spectrum disorder</article-title>
          <source>Biocomputing 2021: Proceedings of the Pacific Symposium</source>
          <year>2020</year>
          <fpage>14</fpage>
          <lpage>25</lpage>
          <pub-id pub-id-type="doi">10.1142/9789811232701_0002</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref32">
        <label>32</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Paskov</surname>
              <given-names>KM</given-names>
            </name>
            <name name-style="western">
              <surname>Kalantarian</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Stockham</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Kline</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Patnaik</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Chrisman</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Varma</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Tariq</surname>
              <given-names>Q</given-names>
            </name>
            <name name-style="western">
              <surname>Dunlap</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Feature selection and dimension reduction of social autism data</article-title>
          <source>Biocomputing 2020</source>
          <year>2020</year>
          <fpage>707</fpage>
          <lpage>718</lpage>
          <pub-id pub-id-type="doi">10.1142/9789811215636_0062</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref33">
        <label>33</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Park</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Srivastava</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Voss</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Kline</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Varma</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Tariq</surname>
              <given-names>Q</given-names>
            </name>
            <name name-style="western">
              <surname>Kalantarian</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Patnaik</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Chrisman</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Stockham</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Paskov</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Data-driven diagnostics and the potential of mobile artificial intelligence for digital therapeutic phenotyping in computational psychiatry</article-title>
          <source>Biol Psychiatry Cogn Neurosci Neuroimaging</source>
          <year>2020</year>
          <month>08</month>
          <volume>5</volume>
          <issue>8</issue>
          <fpage>759</fpage>
          <lpage>769</lpage>
          <pub-id pub-id-type="doi">10.1016/j.bpsc.2019.11.015</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref34">
        <label>34</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Baker</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>LoBue</surname>
              <given-names>V</given-names>
            </name>
            <name name-style="western">
              <surname>Bonawitz</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Shafto</surname>
              <given-names>P</given-names>
            </name>
          </person-group>
          <article-title>Towards automated classification of emotional facial expressions</article-title>
          <source>CogSci</source>
          <year>2017</year>
          <fpage>1574</fpage>
          <lpage>1579</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref35">
        <label>35</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Florea</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Florea</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Badea</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Vertan</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Racoviteanu</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <article-title>Annealed label transfer for face expression recognition</article-title>
          <source>BMVC</source>
          <year>2019</year>
          <fpage>104</fpage>
          <pub-id pub-id-type="doi">10.1109/ecai50035.2020.9223242</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref36">
        <label>36</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Lopez-Rincon</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <article-title>Emotion recognition using facial expressions in children using the NAO Robot</article-title>
          <year>2019</year>
          <conf-name>2019 International Conference on Electronics, Communications and Computers (CONIELECOMP)</conf-name>
          <conf-date>February 27 - March 1, 2019</conf-date>
          <conf-loc>Cholula, Mexico</conf-loc>
          <fpage>146</fpage>
          <lpage>153</lpage>
          <pub-id pub-id-type="doi">10.1109/CONIELECOMP.2019.8673111</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref37">
        <label>37</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Nagpal</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Singh</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Vatsa</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Singh</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Noore</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <article-title>Expression classification in children using mean supervised deep Boltzmann machine</article-title>
          <year>2019</year>
          <month>06</month>
          <conf-name>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</conf-name>
          <conf-date>June 16-17, 2019</conf-date>
          <conf-loc>Long Beach, United States</conf-loc>
          <fpage>236</fpage>
          <lpage>245</lpage>
          <pub-id pub-id-type="doi">10.1109/CVPRW.2019.00033</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref38">
        <label>38</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Rao</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Ajri</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Guragol</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Suresh</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Tripathi</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>Emotion recognition from facial expressions in children and adults using deep neural network</article-title>
          <source>Intelligent Systems, Technologies and Applications</source>
          <year>2020</year>
          <month>05</month>
          <fpage>43</fpage>
          <lpage>51</lpage>
          <pub-id pub-id-type="doi">10.1007/978-981-15-3914-5_4</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref39">
        <label>39</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Witherow</surname>
              <given-names>MA</given-names>
            </name>
            <name name-style="western">
              <surname>Samad</surname>
              <given-names>MD</given-names>
            </name>
            <name name-style="western">
              <surname>Iftekharuddin</surname>
              <given-names>KM</given-names>
            </name>
          </person-group>
          <article-title>Transfer learning approach to multiclass classification of child facial expressions</article-title>
          <source>Applications of Machine Learning</source>
          <year>2019</year>
          <month>09</month>
          <fpage>1113911</fpage>
          <pub-id pub-id-type="doi">10.1117/12.2530397</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref40">
        <label>40</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Kalantarian</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Jedoui</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Tariq</surname>
              <given-names>Q</given-names>
            </name>
            <name name-style="western">
              <surname>Dunlap</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Labeling images with facial emotion and the potential for pediatric healthcare</article-title>
          <source>Artif Intell Med</source>
          <year>2019</year>
          <month>07</month>
          <volume>98</volume>
          <fpage>77</fpage>
          <lpage>86</lpage>
          <pub-id pub-id-type="doi">10.1016/j.artmed.2019.06.004</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref41">
        <label>41</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Kalantarian</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Jedoui</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>A mobile game for automatic emotion-labeling of images</article-title>
          <source>IEEE Trans Games</source>
          <year>2020</year>
          <month>06</month>
          <volume>12</volume>
          <issue>2</issue>
          <fpage>213</fpage>
          <lpage>218</lpage>
          <pub-id pub-id-type="doi">10.1109/TG.2018.2877325</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref42">
        <label>42</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Kalantarian</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Daniels</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>D</given-names>
            </name>
          </person-group>
          <article-title>A gamified mobile system for crowdsourcing video for autism research</article-title>
          <year>2018</year>
          <month>07</month>
          <conf-name>2018 IEEE International Conference on Healthcare Informatics (ICHI)</conf-name>
          <conf-date>June 4-7, 2018</conf-date>
          <conf-loc>New York City, United States</conf-loc>
          <fpage>350</fpage>
          <lpage>352</lpage>
          <pub-id pub-id-type="doi">10.1109/ICHI.2018.00052</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref43">
        <label>43</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Kalantarian</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Daniels</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Haber</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Guess What? Towards Understanding Autism from Structured Video Using Facial Affect</article-title>
          <source>J Healthc Inform Res</source>
          <year>2018</year>
          <month>10</month>
          <volume>3</volume>
          <fpage>43</fpage>
          <lpage>66</lpage>
          <pub-id pub-id-type="doi">10.1007/s41666-018-0034-9</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref44">
        <label>44</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Kalantarian</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Jedoui</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Dunlap</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Schwartz</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Husic</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Tariq</surname>
              <given-names>Q</given-names>
            </name>
            <name name-style="western">
              <surname>Ning</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Kline</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>The performance of emotion classifiers for children with parent-reported autism: quantitative feasibility study</article-title>
          <source>JMIR Ment Health</source>
          <year>2020</year>
          <month>04</month>
          <volume>7</volume>
          <issue>4</issue>
          <fpage>e13174</fpage>
          <pub-id pub-id-type="doi">10.2196/13174</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref45">
        <label>45</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Ekman</surname>
              <given-names>P</given-names>
            </name>
          </person-group>
          <article-title>Are there basic emotions?</article-title>
          <source>Psychol Rev</source>
          <year>1992</year>
          <volume>99</volume>
          <issue>3</issue>
          <fpage>550</fpage>
          <lpage>553</lpage>
          <pub-id pub-id-type="doi">10.1037/0033-295x.99.3.550</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref46">
        <label>46</label>
        <nlm-citation citation-type="book">
          <person-group person-group-type="editor">
            <name name-style="western">
              <surname>Ekman</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Scherer</surname>
              <given-names>KR</given-names>
            </name>
          </person-group>
          <article-title>Expression and the nature of emotion</article-title>
          <source>Approaches to Emotion</source>
          <year>1984</year>
          <publisher-loc>United Kingdom</publisher-loc>
          <publisher-name>Taylor &#38; Francis</publisher-name>
        </nlm-citation>
      </ref>
      <ref id="ref47">
        <label>47</label>
        <nlm-citation citation-type="book">
          <person-group person-group-type="editor">
            <name name-style="western">
              <surname>Molnar</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Segerstrale</surname>
              <given-names>U</given-names>
            </name>
          </person-group>
          <article-title>Universal facial expressions of emotion</article-title>
          <source>Nonverbal Communication: Where Nature Meets Culture</source>
          <year>1997</year>
          <publisher-loc>United Kingdom</publisher-loc>
          <publisher-name>Routledge</publisher-name>
          <fpage>27</fpage>
          <lpage>46</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref48">
        <label>48</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Ekman</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Friesen</surname>
              <given-names>WV</given-names>
            </name>
          </person-group>
          <article-title>Constants across cultures in the face and emotion</article-title>
          <source>J Pers Soc Psychol</source>
          <year>1971</year>
          <volume>17</volume>
          <issue>2</issue>
          <fpage>124</fpage>
          <lpage>129</lpage>
          <pub-id pub-id-type="doi">10.1037/h0030377</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref49">
        <label>49</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>He</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Zhang</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Ren</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Sun</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>Deep residual learning for image recognition</article-title>
          <year>2016</year>
          <month>12</month>
          <conf-name>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>
          <conf-date>June 27-30, 2016</conf-date>
          <conf-loc>Las Vegas, United States</conf-loc>
          <fpage>770</fpage>
          <lpage>778</lpage>
          <pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref50">
        <label>50</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Deng</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Dong</surname>
              <given-names>W</given-names>
            </name>
            <name name-style="western">
              <surname>Socher</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Li</surname>
              <given-names>L-J</given-names>
            </name>
            <name name-style="western">
              <surname>Li</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Fei-Fei</surname>
              <given-names>L</given-names>
            </name>
          </person-group>
          <article-title>ImageNet: a large-scale hierarchical image database</article-title>
          <source>2009 IEEE Conference on Computer Vision and Pattern Recognition</source>
          <year>2009</year>
          <conf-name>IEEE Conference on Computer Vision and Pattern Recognition</conf-name>
          <conf-date>June 20-25, 2009</conf-date>
          <conf-loc>Miami, United States</conf-loc>
          <fpage>248</fpage>
          <lpage>255</lpage>
          <pub-id pub-id-type="doi">10.1109/CVPR.2009.5206848</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref51">
        <label>51</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Lyons</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Akamatsu</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Kamachi</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Gyoba</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>Coding facial expressions with Gabor wavelets</article-title>
          <year>1998</year>
          <conf-name>Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition</conf-name>
          <conf-date>April 14-16, 1998</conf-date>
          <conf-loc>Nara, Japan</conf-loc>
          <fpage>200</fpage>
          <lpage>205</lpage>
          <pub-id pub-id-type="doi">10.1109/afgr.1998.670949</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref52">
        <label>52</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Mollahosseini</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Hasani</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Mahoor</surname>
              <given-names>MH</given-names>
            </name>
          </person-group>
          <article-title>AffectNet: a database for facial expression, valence, and arousal computing in the wild</article-title>
          <source>IEEE Trans Affective Comput</source>
          <year>2019</year>
          <month>01</month>
          <volume>10</volume>
          <issue>1</issue>
          <fpage>18</fpage>
          <lpage>31</lpage>
          <pub-id pub-id-type="doi">10.1109/TAFFC.2017.2740923</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref53">
        <label>53</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Lucey</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Cohn</surname>
              <given-names>JF</given-names>
            </name>
            <name name-style="western">
              <surname>Kanade</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Saragih</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Ambadar</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Matthews</surname>
              <given-names>I</given-names>
            </name>
          </person-group>
          <article-title>The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression</article-title>
          <year>2010</year>
          <month>08</month>
          <conf-name>2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops</conf-name>
          <conf-date>June 13-18, 2010</conf-date>
          <conf-loc>San Francisco, United States</conf-loc>
          <fpage>94</fpage>
          <lpage>101</lpage>
          <pub-id pub-id-type="doi">10.1109/CVPRW.2010.5543262</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref54">
        <label>54</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>LoBue</surname>
              <given-names>V</given-names>
            </name>
            <name name-style="western">
              <surname>Thrasher</surname>
              <given-names>C</given-names>
            </name>
          </person-group>
          <article-title>The Child Affective Facial Expression (CAFE) set: validity and reliability from untrained adults</article-title>
          <source>Front Psychol</source>
          <year>2015</year>
          <month>01</month>
          <volume>5</volume>
          <fpage>1532</fpage>
          <pub-id pub-id-type="doi">10.3389/fpsyg.2014.01532</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref55">
        <label>55</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Szegedy</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Vanhoucke</surname>
              <given-names>V</given-names>
            </name>
            <name name-style="western">
              <surname>Ioffe</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Shlens</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Wojna</surname>
              <given-names>Z</given-names>
            </name>
          </person-group>
          <article-title>Rethinking the inception architecture for computer vision</article-title>
          <year>2016</year>
          <conf-name>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>
          <conf-date>June 27-30, 2016</conf-date>
          <conf-loc>Las Vegas, United States</conf-loc>
          <fpage>2818</fpage>
          <lpage>2826</lpage>
          <pub-id pub-id-type="doi">10.1109/cvpr.2016.308</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref56">
        <label>56</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Sandler</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Howard</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Zhu</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Zhmoginov</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>L-C</given-names>
            </name>
          </person-group>
          <article-title>MobileNetV2: inverted residuals and linear bottlenecks</article-title>
          <year>2018</year>
          <conf-name>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>
          <conf-date>June 18-23, 2018</conf-date>
          <conf-loc>Salt Lake City, United States</conf-loc>
          <fpage>4510</fpage>
          <lpage>4520</lpage>
          <pub-id pub-id-type="doi">10.1109/CVPR.2018.00474</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref57">
        <label>57</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Huang</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Van Der Maaten</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Weinberger</surname>
              <given-names>KQ</given-names>
            </name>
          </person-group>
          <article-title>Densely connected convolutional networks</article-title>
          <year>2017</year>
          <month>11</month>
          <conf-name>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>
          <conf-date>July 21-26, 2017</conf-date>
          <conf-loc>Honolulu, United States</conf-loc>
          <fpage>2261</fpage>
          <lpage>2269</lpage>
          <pub-id pub-id-type="doi">10.1109/CVPR.2017.243</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref58">
        <label>58</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Chollet</surname>
              <given-names>F</given-names>
            </name>
          </person-group>
          <article-title>Xception: deep learning with depthwise separable convolutions</article-title>
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          <year>2017</year>
          <month>11</month>
          <conf-name>IEEE Conference on Computer Vision and Pattern Recognition</conf-name>
          <conf-date>July 21-26, 2017</conf-date>
          <conf-loc>Honolulu, United States</conf-loc>
          <fpage>1251</fpage>
          <lpage>1258</lpage>
          <pub-id pub-id-type="doi">10.1109/CVPR.2017.195</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref59">
        <label>59</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Dawel</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Wright</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Irons</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Dumbleton</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Palermo</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>O’Kearney</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>McKone</surname>
              <given-names>E</given-names>
            </name>
          </person-group>
          <article-title>Perceived emotion genuineness: normative ratings for popular facial expression stimuli and the development of perceived-as-genuine and perceived-as-fake sets</article-title>
          <source>Behav Res</source>
          <year>2016</year>
          <month>12</month>
          <volume>49</volume>
          <fpage>1539</fpage>
          <lpage>1562</lpage>
          <pub-id pub-id-type="doi">10.3758/s13428-016-0813-2</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref60">
        <label>60</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Vallverdú</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Nishida</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Ohmoto</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Moran</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Lázare</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>Fake empathy and human-robot interaction (HRI): A preliminary study</article-title>
          <source>IJTHI</source>
          <year>2018</year>
          <volume>14</volume>
          <issue>1</issue>
          <fpage>44</fpage>
          <lpage>59</lpage>
          <pub-id pub-id-type="doi">10.4018/IJTHI.2018010103</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref61">
        <label>61</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Du</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Tao</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Martinez</surname>
              <given-names>AM</given-names>
            </name>
          </person-group>
          <article-title>Compound facial expressions of emotion</article-title>
          <source>Proc Natl Acad Sci</source>
          <year>2014</year>
          <month>03</month>
          <volume>111</volume>
          <issue>15</issue>
          <fpage>E1454</fpage>
          <lpage>E1462</lpage>
          <pub-id pub-id-type="doi">10.1073/pnas.1322355111</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref62">
        <label>62</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Washington</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Yeung</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Percha</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Tatonetti</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Liphardt</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Wall</surname>
              <given-names>DP</given-names>
            </name>
          </person-group>
          <article-title>Achieving trustworthy biomedical data solutions</article-title>
          <source>Biocomputing 2021: Proceedings of the Pacific Symposium</source>
          <year>2021</year>
          <fpage>1</fpage>
          <lpage>13</lpage>
          <pub-id pub-id-type="doi">10.1142/9789811232701_0001</pub-id>
        </nlm-citation>
      </ref>
    </ref-list>
  </back>
</article>
