Improved Digital Therapy for Developmental Pediatrics Using Domain-Specific Artificial Intelligence: Machine Learning Study

Background Automated emotion classification could aid those who struggle to recognize emotions, including children with developmental behavioral conditions such as autism. However, most computer vision emotion recognition models are trained on adult emotion and therefore underperform when applied to child faces. Objective We designed a strategy to gamify the collection and labeling of child emotion–enriched images to boost the performance of automatic child emotion recognition models to a level closer to what will be needed for digital health care approaches. Methods We leveraged our prototype therapeutic smartphone game, GuessWhat, which was designed in large part for children with developmental and behavioral conditions, to gamify the secure collection of video data of children expressing a variety of emotions prompted by the game. Independently, we created a secure web interface to gamify the human labeling effort, called HollywoodSquares, tailored for use by any qualified labeler. We gathered and labeled 2155 videos, 39,968 emotion frames, and 106,001 labels on all images. With this drastically expanded pediatric emotion–centric database (>30 times larger than existing public pediatric emotion data sets), we trained a convolutional neural network (CNN) computer vision classifier of happy, sad, surprised, fearful, angry, disgust, and neutral expressions evoked by children. Results The classifier achieved a 66.9% balanced accuracy and 67.4% F1-score on the entirety of the Child Affective Facial Expression (CAFE) as well as a 79.1% balanced accuracy and 78% F1-score on CAFE Subset A, a subset containing at least 60% human agreement on emotions labels. This performance is at least 10% higher than all previously developed classifiers evaluated against CAFE, the best of which reached a 56% balanced accuracy even when combining “anger” and “disgust” into a single class. Conclusions This work validates that mobile games designed for pediatric therapies can generate high volumes of domain-relevant data sets to train state-of-the-art classifiers to perform tasks helpful to precision health efforts.

The Child Affective Facial Expression (CAFE) data set is currently the most popular facial expression data set pertaining to children.Prior machine learning efforts that do not include CAFE images in the training set have reached 56% accuracy on CAFE [36,37,39], even after combining facial expressions (eg, "anger" and "disgust") into a single class, thus limiting granularity.We do not discuss prior publications that report higher accuracy using subsets of the CAFE data set in the training and testing sets.This overall lack of performance in prior work highlights the need for developing facial emotion classifiers that work for children.With a lack of labeled data being the fundamental bottleneck to achieving clinical-grade performance, low-cost and speedy data generation and labeling techniques are pertinent.
As a first step toward the creation of a large-scale data set of child emotions, we have previously designed GuessWhat, a dual-purpose smartphone app that serves as a therapeutic for children with autism while simultaneously collecting highly structured image data enriched for emoting in children.GuessWhat was designed for children aged 2 and above to encourage prosocial interaction with a gameplay partner (eg, mom or dad), focusing the camera on the child while presenting engaging but challenging prompts for the child to try to act out [40][41][42][43].We have previously tested GuessWhat's potential to increase socialization in children with autism as well as its potential to collect structured videos of children emoting facial expressions [44].In addition to collecting videos enriched with emotions, GuessWhat gameplay generates user-derived labels of emotion by leveraging the charades-style gameplay structure of the therapy.
Here, we document the full pipeline for training a classifier using emotion-enriched video streams coming from GuessWhat gameplay, resulting in a state-of-the-art pediatric facial emotion classifier that outperforms all prior classifiers when evaluated on CAFE.We first recruited parents and children from around the world to play GuessWhat and share videos recorded by the smartphone app during gameplay.We next extracted frames from the videos, automatically discarding some frames through quality control algorithms, and uploaded the frames on a custom behavioral annotation labeling platform named HollywoodSquares.We prioritized the high entropy frames and shared them with a group of 9 human annotators who annotated emotions in the frames.In total, we have collected 39,968 unique labeled frames of emotions that appear in the CAFE data set.Using the resulting frames and labels, we trained a facial emotion classifier that can distinguish happy, sad, surprised, fearful, angry, disgust, and neutral expressions in naturalistic images, achieving state-of-the-art performance on CAFE and outperforming existing classifiers by over 10%.This work demonstrates that therapeutic games, while primarily providing a behavioral intervention, can simultaneously generate sufficient data for training state-of-the-art domain-specific computer vision classifiers.

Data Collection
The primary methodological contribution of this work is a general-purpose paradigm and pipeline (Figure 1) consisting of (1) passive collection of prelabeled structured videos from therapeutic interventions, (2) active learning to rank the collected frames leveraging the user-derived labels generated during gameplay, (3) human annotation of the frames in the order produced in the previous step, and (4) training a classifier while artificially augmenting the training set.We describe our instantiation of this general paradigm in the following sections.

Ethical Considerations
All study procedures, including data collection, were approved by the Stanford University Institutional Review Board (IRB number 39562) and the Stanford University Privacy Office.In addition, informed consent was obtained from all participants, all of whom had the opportunity to participate in the study without sharing videos.

Recruitment
To recruit child video subjects, we ran a marketing campaign to gather rich and diverse video inputs of children playing GuessWhat while evoking a range of emotions.We posted advertisements on social media (Facebook, Instagram, and Twitter) and contacted prior study participants for other digital smartphone therapeutics developed by the lab [13][14][15].All recruitment and study procedures were approved by the Stanford University IRB.

GuessWhat Smartphone Therapeutic
GuessWhat is a mobile autism therapy implemented on iOS and Android, which has been previously documented as a useful tool for the collection of structured video streams of children behaving in constrained manners [40][41][42][43][44], including evocation of targeted emotions.GuessWhat features a charades game where the parents place the phone on their forehead facing the child, while the child acts out the emotion prompt displayed on the screen.The front-facing camera on the phone records a video of the child in addition to corresponding prompt metadata.All sessions last for 90 seconds.Upon approval by the parent, each session video is uploaded to a Simple Storage Service (S3) bucket on Amazon Web Services (AWS).The app has resulted in 2155 videos shared by 456 unique children.Parents are asked to sign an electronic consent and assent form prior to playing GuessWhat.After each gameplay session, parents can (1) delete the videos, (2) share the videos with the research team only, or (3) share the videos publicly.

Emotions Considered
We sought labels for Paul Ekman's list of six universal emotions: anger, disgust, fear, happiness, sadness, and surprise [45][46][47][48].Ekman originally included contempt in the list of emotions but has since revised the list of universal emotions.
Because CAFE does not include labels of contempt, we did not train our classifier to predict contempt.We added a seventh category named neutral, indicating the absence of an expressed emotion.Our aim was to train a 7-way emotion classifier distinguishing among Ekman's 6 universal emotions plus neutral.

HollywoodSquares Frame Labeling
We developed a frame-labeling website named HollywoodSquares.The website provides human labelers with an interface to speedily annotate a sequential grid of frames (Figure 2) that were collected during the GuessWhat gameplay.To enable rapid annotation, HollywoodSquares enables users to label frames by pressing hot keys, where each key corresponds to a particular emotion label.To provide a label, users can hover their mouse over a frame and press the hot key corresponding to the emotion they want to label.As more frames are collected by GuessWhat, they continue to appear on the interface.Because the HollywoodSquares system displays over 20 images on the screen at once, it encourages rapid annotation and enables simultaneous engagement by many independent labelers.This permits rapid convergence of a majority rules consensus on image labels.
We ran a labeling contest with 9 undergraduate and high school annotators, where we challenged each annotator to produce labels that would result in the highest performing classifier on the CAFE data set.Raters were aged between 15 and 24 years and were from the Bay Area, Northeastern United States, and Texas.The raters included 2 males and 7 females.For the frames produced by each individual annotator, we trained a ResNet-152 model (see Model Training).We updated annotators about the number of frames they labeled each week and the performance of the classifier trained with their individual labels.We awarded a cash prize to the annotator with the highest performance at the end of the 9-week labeling period.

RenderX
HollywoodSquares was also used for a testing phase, during which iterations of the frame-labeling practices were made between the research and annotation teams.All the labeled frames acquired during this testing phase were discarded for final classifier training.
All annotators were registered as research team members through completion of the Health Insurance Portability and Accountability Act of 1996 and Collaborative Institutional Training Initiative training protocols in addition to encrypting their laptop with Stanford Whole Disk Encryption.This provided annotators with read-only access to all the videos and derived frames from GuessWhat gameplay that were shared with the research team.
The final labels were chosen by the following process.If all annotators agreed unanimously about the final frame label, then this label was assigned as the final frame label.If disagreements existed between raters, then the emotion gameplay prompt associated with that frame (the "automatic label") was assigned as the final label for that frame, as long as at least 1 of the human annotators agreed with the automatic label.If disagreements existed between raters but the automatic label did not match any human annotations, then the frame was not included in the final training data set.

Model Training
We leveraged an existing CNN architecture, ResNet-152 [49], with pretrained weights from ImageNet [50].We used categorical cross entropy loss and Adam optimization with a learning rate of 3 × 10 -4 , with β 1 set to .99 and β 2 set to .999.We retrained every layer of the network until the training accuracy converged.The model converged when it did not improve against a validation data set for 20 consecutive epochs.We applied the following data augmentation strategies in conjunction and at random for each training image and each batch of training: rotation of frames between -15 and 15 degrees, zooming by a factor between 0.85 and 1.15, shifting images in every direction by up to 1/10th of the width and height, changing brightness by a factor between 80% and 120%, and potential horizontal flipping.
The CNN was trained in parallel on 16 graphics processing unit (GPU) cores with a p2.16xlargeElastic Cloud Compute instance on AWS using the Keras library in Python with a Tensorflow 2 backend.With full GPU usage, the training time was 35 minutes and 41 seconds per epoch for a batch size of 1643, translating to US $14.4 per hour.
We trained 2 versions of the model, with 1 exclusively using non-GuessWhat public data set frames from (1) the Japanese Female Facial Expression (JAFFE) [51], (2) a random subset of 30,000 AffectNet [52] images (a subset was acquired to avoid an out of memory error), and (3) the Extended Cohn-Kanade (CK+) data set [53]; the other model was trained with these public data set frames plus all 39,968 labeled and relevant GuessWhat frames.

Model Evaluation
We evaluated our models against the entirety of the CAFE data set [54], a set of front-facing images of racially and ethnically diverse children aged 2 to 8 years expressing happy, sad, surprised, fear, angry, fearful, and neutral emotions.CAFE is currently the largest data set of facial expressions from children and has become a standard benchmark for this field.
Although existing studies have evaluated models exclusively against the entirety of the CAFE data set [34][35][36][37][38][39], we additionally evaluated them on Subset A and Subset B of CAFE, as defined by the authors of the data set.Subset A contains images that were identified with an accuracy of 60% or above by 100 adult participants [54], with a Cronbach α internal consistency score of .82(versus .77for the full CAFE data set).Subset B contains images showing "substantial variability while minimizing floor and ceiling effects" [54], with a Cronbach α score of .768(close to the score of .77for the full data set).

Frame Processing
The HollywoodSquares annotators processed 106,001 unique frames (273,493 including the testing phase and 491,343 unique labels when counting multiple labels for the same frame as a different label).Of the 106,001 unique frames labeled, 39,968 received an emotion label corresponding to 1 of the 7 CAFE emotions (not including the testing phase labels).Table 1 contains the number of frames that were included in the training set for each emotion class, including how many children and videos are represented for each emotion category.The frames that were not included received labels of "None" (corresponding to a situation where no face or an incomplete face appears in the frame), "Unknown" (corresponding to the face not expressing a clear emotion), or "Contempt" (corresponding to the face not expressing an emotion in the CAFE set).The large number of curated frames displaying emotion demonstrates the usefulness of HollywoodSquares in filtering out emotion events from noisy data streams.The lack of balance across emotion categories is a testament particularly to the difficulty of evoking anger and sadness as well as disgust and fear, although to a lesser extent.
Of the children who completed 1 session of the Emoji challenge in GuessWhat and uploaded a video to share with the research team, 75 were female, 141 were male, and 51 did not specify their gender.Table 2 presents the racial and ethnic makeup of the participant cohort.Representative GuessWhat frames and cropped faces used to train the classifier, obtained from the subset of participants who consented explicitly to public sharing of their images, are displayed in Figure 3.

Performance on CAFE, CAFE-Defined Subsets, and CAFE Subset Balanced in Terms of Race, Gender, and Emotions
The ResNet-152 network trained on the entire labeled HollywoodSquares data set as well as the JAFFE, AffectNet subset, and CK+ data sets achieved a balanced accuracy of 66.9% and an F1-score of 67.4% on the entirety of the CAFE data set (confusion matrix in Figure 4).When only the HollywoodSquares data set was included in the training set, the model achieved a balanced accuracy of 64.12% and an F1-score of 64.2%.When only including the JAFFE, AffectNet subset, and CK+ sets, the classifier achieved an F1-score of 56.14% and a balanced accuracy of 52.5%, highlighting the contribution of the HollywoodSquares data set.To quantify the contribution of the neural network architecture itself, we compared the performance of several state-of-the-art neural network architectures when only including the HollywoodSquares data set in the training set (Table 3).We evaluated the following models: ResNet152V2 [49], ResNet50V2 [49], InceptionV3 [55], MobileNetV2 [56], DenseNet121 [57], DenseNet201 [57], and Xception [58].The same training conditions and hyperparameters were used across all models.We found that ResNet152V2 performed better than the other networks when trained with our data, so we used this model for the remainder of our experiments.
The performance improved, resulting in a balanced accuracy of 79.1% and an F1-score of 78% on CAFE Subset A (confusion matrix in Figure 5), a subset containing more universally accepted emotions labels.When only including the non-GuessWhat public images in the training set, the model achieved a balanced accuracy of 65.3% and an F1-score of 69.2%.On CAFE Subset B, the balanced accuracy was 66.4% and the F1-score was 67.2% (confusion matrix in Figure 6); the balanced accuracy was 57.2% and F1-score was 57.3% when exclusively training on the non-GuessWhat public images.[58] a Default hyperparameters were used for all networks.

Classifier Performance Based on Image Difficulty
CAFE images were labeled by 100 adults, and the percentage of participants who labeled the correct class are reported with the data set [54].We binned frames into 10 difficulty classes (ie, 90%-100% correct human labels, 80%-90% correct human labels, etc).Figure 7 shows that our classifier performs exceedingly well on unambiguous images.Of the 233 images with 90%-100% agreement between the original CAFE labelers, our classifier correctly classifies 90.1% of the images.The true label makeup of these images is as follows: 131 happy, 58 neutral, 20 anger, 9 sad, 8 surprise, 7 disgust, and 0 fear images.This confirms that humans have trouble identifying nonhappy and nonneutral facial expressions.Of the 455 images with 80%-100% agreement between the original CAFE labelers, our classifier correctly classifies 81.1% of the images.

Principal Results
Through the successful application of an in-the-wild child developmental health therapeutic that simultaneously captures video data, we show that a pipeline for intelligently and continuously labeling image frames collected passively from mobile gameplay can generate sufficient training data for a high-performing computer vision classifier (relative to prior work).We curated a data set that contains images enriched for naturalistic facial expressions of children, including but not limited to children with autism.
We demonstrate the best-performing pediatric facial emotion classifier to date according to the CAFE data set.The best-performing classifiers evaluated in earlier studies involving facial emotion classification on the CAFE data set, including images from CAFE in the training set, achieved an accuracy of up to 56% on CAFE [36,37,39] and combined "anger" and "disgust" into a single class.By contrast, we achieved a balanced accuracy of 66.9% and an F1-score of 67.4% without including any CAFE images in the training set.This is a clear illustration of the power of parallel data curation from distributed mobile devices in conjunction with deep learning, and this approach can possibly be generalized to the collection of training data for other domains.
We collected a sufficiently large training sample to alleviate the need for extracting facial keypoint features, as was the case in prior works.Instead, we used the unaltered images as inputs to a deep CNN.

Limitations and Future Work
A major limitation of this work is the use of 7 discrete and distinct emotion categories.Some images in the training set might have exhibited more than 1 emotion, such as "happily surprised" or "fearfully surprised."This could be addressed in future work by a more thorough investigation of the final emotion classes.Another limitation is that similar to existing emotion data sets, our generated data set contains fake emotion evocations by the children.This is due to limitations imposed by ethics review committees and the IRB who, understandably so, do not allow provoking real fear or sadness in participants, especially young children who may have a developmental delay.This issue of fake emotion evocation has been documented in prior studies [4,5,59,60].Finding a solution to this issue that would appease ethical review committees is an open research question.
Another limitation is that we did not address the possibility of complex or compound emotions [61].A particular facial expression can consist of multiple universal expressions.For example, "happily surprised," "fearfully surprised," and even "angrily surprised" are all separate subclasses of "surprised."We have not separated these categories in this study.We recommend that future studies explore the possibility of predicting compound and complex facial expressions.
There are several fruitful avenues for future work.The paradigm of passive data collection during mobile intervention gameplay could be expanded to other digital intervention modalities, such as wearable autism systems with front-facing cameras [7,8,11,[13][14][15][16][17].This paradigm can also be applied toward the curation of data and subsequent training of other behavioral classifiers.Relevant computer vision models for diagnosing autism could include computer vision-powered quantification of hand stimming, eye contact, and repetitive behavior, as well as audio-based classification of abnormal prosody, among others.
The next major research step will be to evaluate how systems like GuessWhat can benefit from the incorporation of the machine learning models back into the system in a closed-loop fashion while preserving privacy and trust [62].Quantification of autistic behaviors during gameplay via machine learning models trained with gameplay videos can enable a feedback loop that provides a dynamic and adaptive therapy for the child.Models can be further personalized to the child's unique characteristics, providing higher performance through customized fine-tuning of the network.

Conclusions
We have demonstrated that gamified digital therapeutic interventions can generate sufficient data for training state-of-the-art computer vision classifiers, in this case for pediatric facial emotion.Using this data curation and labeling paradigm, we trained a state-of-the-art 7-way pediatric facial emotion classifier.

Figure 1 .
Figure 1.Pipeline of the model training process.Structured videos enriched with child emotion evocation are collected from a mobile autism therapeutic deployed in the wild.The frames are ranked for their contribution to the target classifier by a maximum entropy active learning algorithm and receive human labels on a rating platform named HollywoodSquares.The frames are corresponding labels that are transferred onto a ResNet-152 neural network pretrained on the ImageNet data set.

Figure 2 .
Figure 2. HollywoodSquares rating interface.Annotators use keyboard shortcuts and the mouse to speedily annotate a sequence of frames acquired during GuessWhat gameplay.

Figure 3 .
Figure 3. Example of frames collected from GuessWhat gameplay, including examples of cropped (A) and original (B) frames.We have displayed these images after obtaining consent from the participants for public sharing.

Figure 4 .
Figure 4. Confusion matrix for the entirety of the Child Affective Facial Expression data set.

Figure 5 .
Figure 5. Confusion matrix for Child Affective Facial Expression Subset A.

Figure 6 .
Figure 6.Confusion matrix for Child Affective Facial Expression Subset B.

Figure 7 .
Figure 7. Classifier performance versus original CAFE annotator performance for 10 difficulty bins.The classifier tends to perform well when humans agree on the class and poorly otherwise.The numbers in parentheses represent the number of images in each bin.This highlights the issue of ambiguous labels in affective computing and demonstrates that our model performance scales proportionally to human performance.CAFE: Child Affective Facial Expression.

Table 1 .
Emotions represented in the HollywoodSquares data set, including how many children and videos are represented for each emotion category.

Table 2 .
Representation of race and ethnicity of children whose who played the "Emoji" charades category and uploaded a video to the cloud.

Table 3 .
Comparison of several popular neural network architectures trained on the same data set a .