Each scientific discovery builds on a body of knowledge embodied by latent topics that are typically named within a manuscript and accompanied by citations to the founding papers. For example, papers focusing on anti-de Sitter/conformal field theory (AdS/CFT), which explore the correspondence between general relativity and quantum field theory, cite the 1999 paper that introduced the concept. Yet many papers on AdS/CFT use language that, for experts, unambiguously defines the paper's topic without citing the founding work. To identify such hidden citations, we use the Latent Dirichlet Allocation (LDA) model to detect topics in the text of a publication, inferring latent topical structures from a corpus of full-text citation contexts based on symbolic natural language processing and Bayesian inference. LDA is an unsupervised machine learning approach that is interpretable, allowing us to associate its outcomes with confidence levels through transparent probabilistic logic. We find that hidden citations often outnumber explicit citation counts, emerging regardless of publishing venue and discipline. We show that the prevalence of hidden citations is not driven by citation counts, but rather by the degree of discourse on the topic within the text of the manuscripts, indicating that the more a discovery is discussed, the less visible it is to standard bibliometric analysis. We find that the proportion of hidden citations increases by approximately 20% by 20 years after publication. Our analysis also suggests that for a paper to accumulate hidden citations, it must develop exclusive catchphrases that are synonymous with the paper itself, becoming a 'conceptual symbol' within the field.
Acknowledging the discoveries on which new research builds is an integral part of scientific discourse. Yet, with the exponential growth of science and limits on the number of allowed references, a paper's ability to credit all sources of inspiration is limited. Such limitations lead to inevitable credit distortion, manifested in situations where the textual context indicates that credit is due but is not accompanied by explicit citations to the pertinent work. Hidden citations capture the 'burden' of success in science: the more widely a concept is used by the scientific community, the more likely it is to accrue hidden citations.
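To make the topic-detection step more concrete, the following is a minimal sketch of LDA inference by collapsed Gibbs sampling, written in plain Python. It is an illustration only, not the pipeline used in this work: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the number of topics, and the four toy "citation contexts" are all assumptions made for the example.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic, vocab, topic_word_counts), where doc_topic[d][k]
    is the estimated share of topic k in document d.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    ndk = [[0] * n_topics for _ in docs]            # topic counts per document
    nkw = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    nk = [0] * n_topics                             # total tokens per topic
    z = []                                          # topic of every token

    # Start from a random topic assignment for each token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], z[d][i]
                # Remove the token from the counts before resampling it.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][v] + beta) / (nk[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1

    # Smoothed per-document topic proportions.
    doc_topic = [
        [(ndk[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return doc_topic, vocab, nkw

# Toy citation contexts, invented for illustration only.
docs = [
    "holographic duality between gravity and gauge theory".split(),
    "gauge gravity duality in anti de sitter space".split(),
    "scale free network degree distribution".split(),
    "degree distribution of a scale free network".split(),
]
doc_topic, vocab, topic_word = lda_gibbs(docs, n_topics=2)
for proportions in doc_topic:
    print([round(p, 2) for p in proportions])
```

Each printed row is a document's estimated mixture over the two topics. In this simplified picture, a text whose mixture is dominated by a topic tied to a founding paper, but whose reference list omits that paper, would be a candidate hidden citation; how such candidates are thresholded and assigned confidence levels is not specified here.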
Bio: Xiangyi is a postdoctoral research associate at the Center for Complex Network Research of Northeastern University, Boston. He graduated from Peking University, China with a B.S. degree in microelectronics and earned his Ph.D. in physics five years later at Boston University. As a physicist and a network scientist, he has worked on a wide range of topics, including scale-free network theory and information diffusion on complex networks, forecasting chaotic dynamics using tensorized recurrent neural network architectures, and interdisciplinary applications of network science to brain tractography and econophysics. He is currently working with Prof. Albert-László Barabási on applying text analysis to the science of science and computational social science, focusing on using interpretable natural language processing tools to reveal connections between academic publications, citations, and scientific topics.