Peter Leonard

Stanford Libraries

Text & visual cultural heritage collections: evocative possibilities

Large-scale digitized cultural heritage collections, consisting of hundreds of thousands of pictures that are not reducible to text alone, are a growing site of digital research practice (Wevers & Smits, 2020). APIs and standards such as the International Image Interoperability Framework (IIIF) have also made it possible to work at more granular levels – with bounding boxes determined by curators, by convolutional neural networks, or by a combination of human and machine intelligence. It is now possible to produce sub-corpora of illustrated “initial” letters from medieval manuscripts, or of faces from 20th-century photographs, creating highly specialized image datasets out of larger materials.

Information retrieval across these visual datasets remains a difficult problem, especially in digital search systems that are text-centric. How would a library or museum search engine return relevant results among thousands of undescribed illustrations or unlabeled photographs? One answer is “visual similarity”: computing cosine distance in an embedding space derived from the penultimate layer of a captioning CNN, as in EPFL’s Replica or the National Library of Norway’s Maken experiment, among others. But image similarity does not solve the prior problem of finding a starting image against which to gauge similarity. Given hundreds of thousands of unlabeled images, how does one know where to start?

One answer may lie in leveraging text-to-image networks such as CLIP (Contrastive Language-Image Pre-Training). These architectures generally seek to predict the most likely text snippet (word or phrase) for an unseen image. Although most commonly encountered in image-generation contexts (DALL-E; various diffusion models), they can also be used to generate descriptions of visual information that are bracingly unmoored from the scholarly or archival description practices common in the GLAM sector. For good and for ill, they can make image search systems responsive to evocative phrases such as “a man alone on a road at night” or “I am feeling cold” (see https://huggingface.co/spaces/NbAiLab/maken-clip-text for a working example from the National Library of Norway).

Important caveats remain: these models may be doubly affected by bias and incomplete data, at both their linguistic and visual ends. They are likewise doubly anachronistic, assuming the images in question are not contemporary pictures. And questions of reproducibility and of integration into existing search methodologies have only begun to be explored. Nevertheless, given the astounding scale of mass digitization projects currently underway, it seems prudent to examine this frontier of “evocative” text search across visual collections. Doing so may re-center text as a human search methodology with a long and well-studied history, and even create the potential for more rewarding results for searchers who lack access to the precise terminology used to describe elements of our common visual cultural heritage. This talk will examine the results of such CLIP-based search models on image datasets, including examples drawn from the Meserve-Kunhardt collection of 19th-century American photography, images in Vogue magazine (1892-2013), and the photography of Andy Warhol.
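As an illustration of the granular access the IIIF Image API affords, the following minimal sketch requests a bounding-box crop from a IIIF-compliant image server. The server URL, image identifier, and coordinates are placeholder assumptions rather than references to any particular collection; the URL pattern itself follows the published IIIF Image API syntax.

    # Minimal sketch: extract a bounding-box crop via the IIIF Image API.
    # The endpoint and identifier below are hypothetical; any IIIF-compliant
    # server follows the same {identifier}/{region}/{size}/{rotation}/{quality}.{format} pattern.
    import requests

    IIIF_BASE = "https://iiif.example.org/iiif/2"   # hypothetical image server
    identifier = "manuscript_0042_f001r"            # hypothetical image identifier

    def iiif_crop_url(identifier: str, x: int, y: int, w: int, h: int) -> str:
        """Build a IIIF Image API URL for a pixel-coordinate bounding box, at full size."""
        return f"{IIIF_BASE}/{identifier}/{x},{y},{w},{h}/full/0/default.jpg"

    # e.g. a curator- or CNN-supplied bounding box around an illustrated initial letter
    url = iiif_crop_url(identifier, x=310, y=1240, w=420, h=460)
    response = requests.get(url, timeout=30)
    with open("initial_letter.jpg", "wb") as fh:
        fh.write(response.content)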
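The “visual similarity” approach mentioned above can be sketched, in simplified form, as cosine similarity between penultimate-layer CNN embeddings. The example below uses a pretrained torchvision ResNet with its classification head removed and placeholder file names; Replica and Maken rely on their own models and pipelines, so this is illustrative only.

    # Sketch of visual similarity as cosine similarity in a penultimate-layer embedding space.
    import torch
    import torch.nn.functional as F
    from torchvision import models
    from PIL import Image

    weights = models.ResNet50_Weights.DEFAULT
    model = models.resnet50(weights=weights)
    model.fc = torch.nn.Identity()          # drop the classification head; keep penultimate features
    model.eval()
    preprocess = weights.transforms()

    def embed(path: str) -> torch.Tensor:
        """Return a 2048-d embedding for one image."""
        img = Image.open(path).convert("RGB")
        with torch.no_grad():
            return model(preprocess(img).unsqueeze(0)).squeeze(0)

    # Cosine similarity between two digitized photographs (paths are placeholders).
    a = embed("photo_001.jpg")
    b = embed("photo_002.jpg")
    print(f"cosine similarity: {F.cosine_similarity(a, b, dim=0).item():.3f}")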
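An “evocative” text query of the kind discussed above can likewise be sketched with an off-the-shelf CLIP checkpoint via the Hugging Face transformers library. The checkpoint, image directory, and query below are assumptions made for illustration; the Maken demo linked above uses the National Library of Norway’s own models.

    # Sketch of CLIP-based text search over an unlabeled image collection.
    from pathlib import Path
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    model.eval()

    image_paths = sorted(Path("images/").glob("*.jpg"))   # an unlabeled collection
    images = [Image.open(p).convert("RGB") for p in image_paths]
    query = "a man alone on a road at night"

    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Rank images by the scaled text-image similarity scores CLIP produces.
    scores = outputs.logits_per_text.squeeze(0)
    for rank, idx in enumerate(scores.argsort(descending=True).tolist()[:5], start=1):
        print(f"{rank}. {image_paths[idx]}  (score {scores[idx]:.2f})")

In a real search system the image embeddings would presumably be precomputed and indexed once, with only the text query encoded at search time.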

Bio: