Human vs. GPT-3: The challenges of extracting emotions from child responses

Conducting interviews with abused children requires a specific skill set that is acquired through special training. Beyond the initial training, these skills have to be refreshed regularly to maintain a high level of quality. Technologies such as synthetic video generation and natural language processing have reached a stage that could allow the construction of a system that makes this task easier. We therefore aim to design a training system, aided by machine learning, that supports interview training with an interactive child avatar capable of meaningful interactions with trainees. Emotions play an important role in these interviews, so we conduct three user studies in a remote setting to analyze the emotions children express in them. In these studies, participants classified transcript excerpts as one of a predefined set of emotions. We use these human annotations to measure the performance of sentiment analysis with GPT-3. We investigate different approaches to obtaining correct classifications by varying the amount of context that the participants and the model get to see. Our experiments show that humans have a hard time agreeing when choosing between seven different emotions; agreement improves when we reduce the set to four. In addition, we found that context is needed to make a well-motivated choice, but too much context can make the choice vague, reducing the quality of the judgment.
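
To make the evaluation idea concrete, the following is a minimal sketch of how agreement among annotators and a model's match with the human majority could be computed. It is an illustration under stated assumptions, not the procedure used in the studies: the function names, the use of raw pairwise agreement and majority voting, and the example labels are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_agreement(annotations):
    """Average, over excerpts, of the fraction of annotator pairs
    that chose the same label (raw agreement, not chance-corrected)."""
    per_item = []
    for labels in annotations:
        pairs = list(combinations(labels, 2))
        if pairs:  # skip excerpts with fewer than two annotators
            per_item.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_item) / len(per_item)


def majority_label(labels):
    """Most frequent label, or None when first place is tied."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def accuracy_vs_majority(annotations, predictions):
    """Share of model predictions that match the human majority label,
    ignoring excerpts where the annotators produced no clear majority."""
    scored = [(majority_label(labels), pred)
              for labels, pred in zip(annotations, predictions)]
    scored = [(gold, pred) for gold, pred in scored if gold is not None]
    return sum(gold == pred for gold, pred in scored) / len(scored)


# Hypothetical data: three excerpts, three annotators each.
# The label names are placeholders, not the emotion set from the studies.
annotations = [
    ["sad", "sad", "angry"],
    ["neutral", "neutral", "neutral"],
    ["happy", "sad", "angry"],  # no majority, excluded from accuracy
]
predictions = ["sad", "happy", "sad"]  # assumed model outputs

print(pairwise_agreement(annotations))
print(accuracy_vs_majority(annotations, predictions))
```

Raw pairwise agreement is used here only because it is easy to read; a chance-corrected statistic would be a reasonable alternative when the number of emotion categories changes, since fewer categories raise the agreement expected by chance.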