Geoffrey Boushey

UCSF Library

Contextualizing AI Generated Transcription Accuracy for Researchers

INTRODUCTION: UCSF Library technical staff often provide researchers with AI-generated transcripts from diverse image, audio, and video collections. Accuracy and completion rates often vary across and within collections, and unless we communicate this variability to researchers, we risk introducing bias into the data we provide. To illustrate this risk, we evaluated the variance and accuracy of transcripts generated from a small collection of videos from the UCSF Tobacco Archives.
METHODS: We selected a small set of videos consisting of advertisements and legal courtroom depositions, and used Google AutoML to extract transcripts along with self-reported transcription accuracy scores. We then evaluated accuracy by document type, length, year, category, and media source. We also compared the results with human-corrected transcriptions using BLEU scores and sentiment scores.
RESULTS and DISCUSSION: Overall, we found substantial variance in transcript accuracy, but we also found that AutoML's self-reported accuracy score provided a reliable metric for transcription accuracy. The AutoML confidence score, the fellow accuracy rating, and Word Accuracy Rate (WAR) are all significantly positively correlated, which means that the AutoML confidence score is a relatively good proxy for transcript accuracy. Court proceedings tend to have somewhat higher average transcription scores than advertisements, most likely because advertisements contain stylized speaking or singing. We recommend informing researchers that transcripts of media containing singing or stylized speaking may be less accurate. For sentiment, we found that commercials tend to have more positive sentiment than court proceedings. We advise caution when using BLEU scores, because AutoML transcripts lack the punctuation needed for sentence-level comparison.
CONCLUSION: Although the variance in accuracy and completeness of AutoML transcription can be substantial, we found that AutoML confidence scores provide a reliable estimate of the accuracy of the resulting computer-generated text. We emphasize that it is essential for technical staff to provide accuracy measurements alongside the document transcription data they deliver.
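
The comparison between AutoML confidence scores and Word Accuracy Rate can be sketched as below. The abstract does not specify exactly how WAR or the correlations were computed, so this is a minimal sketch under stated assumptions: WAR is taken as 1 - WER (word-level edit distance divided by the number of words in the human-corrected reference), both transcripts are lowercased and stripped of punctuation before comparison (since AutoML output lacks punctuation), and Pearson correlation stands in for whatever significance test the study used. All function names, sample transcripts, and scores here are hypothetical illustrations, not study data.

```python
import math
import re
from typing import List, Sequence


def normalize(text: str) -> List[str]:
    """Lowercase and drop punctuation so a punctuated human transcript can be
    compared word by word with an unpunctuated machine transcript (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over words: the minimum number of substitutions,
    deletions, and insertions needed to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # reference word missing from hyp
                          curr[j - 1] + 1,     # extra word inserted in hyp
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy_rate(reference: str, hypothesis: str) -> float:
    """Hypothetical WAR definition: 1 - (word edits / reference word count)."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    return 1.0 - word_edit_distance(ref, hyp) / len(ref)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical human-corrected vs. machine transcript of one clip.
    human = "Smoking is a personal choice, and we respect that."
    machine = "smoking is a personal choice and we respect that"
    print(word_accuracy_rate(human, machine))  # → 1.0 (only punctuation differs)

    # Hypothetical per-video confidence scores and WAR values.
    confidence = [0.95, 0.80, 0.60]
    war = [0.97, 0.85, 0.55]
    print(pearson(confidence, war))
```

Stripping punctuation before scoring is the design choice that sidesteps the issue noted for BLEU: without it, every comma and period in the human transcript would count as a word error against the unpunctuated AutoML output.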

Bio: Geoff is a Software Application Developer and member of the Data Science team at the UCSF Library.