The visualisation of news through photographs has exploded since the second half of the 20th century (Kester & Kleppe 2015). Press photographs are not only being used more often in all sorts of media, historical photographs are also increasingly being reused. In some cases the reuse of these images leads to a recontextualisation of the photograph. A well-known example in the Dutch context is a photograph of Dutch socialist party leader Pieter Jelles Troelstra who is known for a failed attempt to overthrow the royal family in 1918. When Troelstra’s failed coup d’état is being discussed in history textbooks, the text is very often accompanied by a photograph of Troelstra holding a speech. However, research has shown that this photo was not made in 1918 but in 1912 when Troelstra was advocating the introduction of women’s suffrage (Kleppe 2013).
Figure 1. Photograph of Dutch socialist partyleader Pieter Jelles Troelstra giving a speech during a demonstration in 1912 while the photo nowadays often is used for Troelstra’s coupe d’etat of 1918. (photograph by Cornelis Leenheer. Source: IISG Amsterdam)
This example illustrates that in the field of history and visual culture there is a need for the thorough study of the origin and reuse of visual materials. However, methods that are employed to analyse the (re)use of visual materials are cumbersome and labour-intensive since Humanities researchers tend to analyse their sources manually (Burke 2001). To estimate the increase in the use of pressphotographs in Dutch newspapers, Kester & Kleppe (2015) e.g manually analysed a sample of 385 newspapers and 5.877 pressphotographs of the period 1870-2013, creating the Foto’s in Nederlandse Kranten (FiNK) (Photos in Dutch Newspaper) dataset (Kleppe, Zeijl, Kester 2014). To find the recurring use of the Troelstra image, Kleppe (2012) followed a same approach by manually analysing over 5.000 photographs in 400 history textbooks, creating the Foto’s in Nederlandse Geschiedenisschoolboeken (FiNGS) (Photos in Dutch History textbooks) dataset (Kleppe 2012).
While manually created and annotated datasets such as FiNK & FiNGS contain rich & well-annotated data that can be reused for all sorts of research questions, their scope remains limited given its labour-intensive creation process. To find the recurring imagery in the FiNGS dataset, Kleppe (2012) e.g. manually created and assessed all images and metadata, leading to inevitabel human errors. However, digitised historical imagery is increasingly becoming available allowing researchers to undertake the first steps in the field of ‘Visual Big Data’, following the footsteps of Barry Salt’s study on the characteristics of opening shots of 20th century films (Salt 1974) and Scott McCloud work on the visual language of Japanese manga and comics from the West (McCloud 1994). More recent, the work of Lev Manovich on exploring large scale visual datasets such as Manga Comics (2012), Time covers, and selfies (Manovich & Tifentale 2015) is seen as a new way of what he calls doing ‘cultural analytics’ (Manovich 2012).
While the focus of these studies is on characteristics of the images, others using large scale image datasets focus on the recurrance of imagery in different types of contexts, aiming 1) to assess the impact of scholarly images online (Kousha 2010), 2) to analyse the reuse of digital images of cultural and heritage material on the internet (Terras & Kirton 2013) or within a closed dataset (Resig 2014; Reside 2014) and 3) to detect poetic content in historical newspapers (Lorang et al 2015).
To cater their research questions, these scholars all created visual datasets on their own. However, large datasets containing photographs that are available for researchers are scarce. Only within the Computer Vision and Natural Language Processing community we found some datasets (Ordonez et al, 2011; Chen et al, 2015a; Chen et al, 2015b; Hodosh et al., 2013), but these are mainly created for training purposes of algorithms, not for Humanities research questions.
Therefor this poster presents the KBK-1M dataset, that was created specifically for (Digital) Humanities researchers. This dataset contains a collection of 1,6 million captioned images of the period 1922-1994, extracted from Dutch digitised newspapers stored in the Dutch National Library (KB) newspaper archive. The images cover black & white photographs, sketches, line-drawn cartoons, and line-drawn weather forecasts and are available as year-by-year ZIP files at the National Library of the Netherlands. A request for access can be submitted to email@example.com. On our poster, we will describe how we obtained the images and what types of research questions it could tailor.