Laura Nixon

ReThink Media

Making sense of the news: Building an NLP pipeline to analyze news articles

Introduction: ReThink Media helps non-profit advocacy groups build media and strategic communications capacity. As part of that work, we track and analyze news coverage of the issues our partner organizations work on. However, analyzing news articles by hand is very resource-intensive, particularly when there is a high volume of coverage. To automate parts of that process, we are working on an NLP pipeline that classifies news articles as straight news or opinion, identifies direct and indirect quotes, identifies the speaker of each quote, and infers the gender of the speakers. We used code from the Gender Gap Tracker tool, developed by Simon Fraser University’s Discourse Lab, as a starting point for the quote extraction, NER, and gender inference tasks.

In the spring of 2022, a team of UC Berkeley students in the MIMS program built a working version of the pipeline as part of their capstone project, and we are now conducting error analyses and trying out different approaches to improve accuracy. The current version of the pipeline classifies news and opinion with an ensemble model composed of an MLP neural network trained on sentence embeddings and a DistilBERT base cased model, yielding an average macro F1 score of 0.93. The quote extraction and NER tasks use spaCy, with NeuralCoref for coreference resolution. The quote extractor currently achieves 0.91 precision and 0.89 recall. For name-to-gender inference, we use the gender-guesser package and GenderAPI. On our test datasets, the gender classifier predicts speaker gender with 94.9% to 96.4% accuracy.

The current version of the pipeline has yielded promising results, but we would like to improve accuracy further and develop mechanisms for human annotators to conduct quality control, particularly for the parts of the pipeline where we know inaccuracies can be introduced. We are also exploring, with the help of a UC Berkeley Data Science Discovery team, the possibility of classifying the types of speakers present in an article (e.g., elected officials, academics, voting rights advocates) and the topics of the identified quotes.
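
To make the news/opinion classification step and its reported metric more concrete, here is a minimal Python sketch of one way a two-model ensemble could be combined and scored with macro F1. The abstract does not say how the MLP and DistilBERT outputs are combined, so the simple probability averaging ("soft voting") shown here is an assumption, and the function names `ensemble_predict` and `macro_f1` are hypothetical, not part of the actual pipeline.

```python
from typing import Sequence

LABELS = ("news", "opinion")


def ensemble_predict(p_opinion_mlp: float, p_opinion_bert: float,
                     threshold: float = 0.5) -> str:
    """Combine two models' P(opinion) by simple averaging.

    Assumption: soft voting with equal weights. The real pipeline may
    combine its MLP and DistilBERT models differently.
    """
    avg = (p_opinion_mlp + p_opinion_bert) / 2
    return "opinion" if avg >= threshold else "news"


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    # Toy data for illustration only; not results from the pipeline.
    gold = ["news", "news", "opinion", "opinion"]
    model_probs = [(0.1, 0.2), (0.7, 0.6), (0.9, 0.3), (0.8, 0.9)]
    predicted = [ensemble_predict(a, b) for a, b in model_probs]
    print(predicted, round(macro_f1(gold, predicted), 3))
```

Macro F1 weights the news and opinion classes equally regardless of how many articles fall into each, which makes it a reasonable summary when one class is much rarer than the other.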

Bio: Laura Nixon is the Director of Research & Analysis at ReThink Media, a non-profit organization that helps advocacy groups build media and communications capacity. To support the work of ReThink Media’s partner organizations, she analyzes news media and social media coverage, and conducts messaging and public opinion research. Laura has a particular interest in using programming tools to collect and analyze data. Prior to joining ReThink, Laura spent seven years at Berkeley Media Studies Group analyzing media coverage of public health issues, and helping public health advocates use the media to advance public health policy. She has also worked on research at the UC Berkeley Labor Occupational Health Program, the California Department of Public Health, and the United States Census Bureau. Laura graduated from Pomona College with a bachelor’s degree in sociology and earned a master’s degree in public health at UC Berkeley.