DH 2016 Abstracts

1 Million Dutch Newspaper Images available for researchers: The KBK-1M Dataset

The visualisation of news through photographs has exploded since the second half of the 20th century (Kester & Kleppe 2015). Press photographs are not only being used more often in all sorts of media, historical photographs are also increasingly being reused. In some cases the reuse of these images leads to a recontextualisation of the photograph. A well-known example in the Dutch context is a photograph of Dutch socialist party leader Pieter Jelles Troelstra who is known for a failed attempt to overthrow the royal family in 1918. When Troelstra’s failed coup d’état is being discussed in history textbooks, the text is very often accompanied by a photograph of Troelstra holding a speech. However, research has shown that this photo was not made in 1918 but in 1912 when Troelstra was advocating the introduction of women’s suffrage (Kleppe 2013).

Figure 1. Photograph of Dutch socialist partyleader Pieter Jelles Troelstra giving a speech during a demonstration in 1912 while the photo nowadays often is used for Troelstra’s coupe d’etat of 1918. (photograph by Cornelis Leenheer. Source: IISG Amsterdam)

This example illustrates that in the field of history and visual culture there is a need for the thorough study of the origin and reuse of visual materials. However, methods that are employed to analyse the (re)use of visual materials are cumbersome and labour-intensive since Humanities researchers tend to analyse their sources manually (Burke 2001). To estimate the increase in the use of pressphotographs in Dutch newspapers, Kester & Kleppe (2015) e.g manually analysed a sample of 385 newspapers and 5.877 pressphotographs of the period 1870-2013, creating the Foto’s in Nederlandse Kranten (FiNK) (Photos in Dutch Newspaper) dataset (Kleppe, Zeijl, Kester 2014). To find the recurring use of the Troelstra image, Kleppe (2012) followed a same approach by manually analysing over 5.000 photographs in 400 history textbooks, creating the Foto’s in Nederlandse Geschiedenisschoolboeken (FiNGS) (Photos in Dutch History textbooks) dataset (Kleppe 2012).

While manually created and annotated datasets such as FiNK & FiNGS contain rich & well-annotated data that can be reused for all sorts of research questions, their scope remains limited given its labour-intensive creation process. To find the recurring imagery in the FiNGS dataset, Kleppe (2012) e.g. manually created and assessed all images and metadata, leading to inevitabel human errors. However, digitised historical imagery is increasingly becoming available allowing researchers to undertake the first steps in the field of ‘Visual Big Data’, following the footsteps of Barry Salt’s study on the characteristics of opening shots of 20th century films (Salt 1974) and Scott McCloud work on the visual language of Japanese manga and comics from the West (McCloud 1994). More recent, the work of Lev Manovich on exploring large scale visual datasets such as Manga Comics (2012), Time covers, and selfies (Manovich & Tifentale 2015) is seen as a new way of what he calls doing ‘cultural analytics’ (Manovich 2012).

While the focus of these studies is on characteristics of the images, others using large scale image datasets focus on the recurrance of imagery in different types of contexts, aiming 1) to assess the impact of scholarly images online (Kousha 2010), 2) to analyse the reuse of digital images of cultural and heritage material on the internet (Terras & Kirton 2013) or within a closed dataset (Resig 2014; Reside 2014) and 3) to detect poetic content in historical newspapers (Lorang et al 2015).

To cater their research questions, these scholars all created visual datasets on their own. However, large datasets containing photographs that are available for researchers are scarce. Only within the Computer Vision and Natural Language Processing community we found some datasets (Ordonez et al, 2011; Chen et al, 2015a; Chen et al, 2015b; Hodosh et al., 2013), but these are mainly created for training purposes of algorithms, not for Humanities research questions.

Therefor this poster presents the KBK-1M dataset, that was created specifically for (Digital) Humanities researchers. This dataset contains a collection of 1,6 million captioned images of the period 1922-1994, extracted from Dutch digitised newspapers stored in the Dutch National Library (KB) newspaper archive. The images cover black & white photographs, sketches, line-drawn cartoons, and line-drawn weather forecasts and are available as year-by-year ZIP files at the National Library of the Netherlands. A request for access can be submitted to dataservices@kb.nl. On our poster, we will describe how we obtained the images and what types of research questions it could tailor.

Bibliography

Burke, P. (2001). Eyewitnessing. The uses of images as historical evidence. London: Cornell University Press.
Chen, J., Kuznetsova, P., Warren, D. S. and Choi, Y. (2015a). Deja image-captions: A corpus of expressive descriptions in repetition. NAACL, pp. 504–14.
Chen, X., Fang, H., Lin, T., Vedantam, R., Gupta, S., Dollar, P. and Zitnick, C. L. (2015b). Microsoft COCO captions: Data collection and evaluation server. CoRR. abs/1504.00325.
Hodosh, M., Young, P. and Hockenmaier, J. (2013). Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research,47: 853–99.
Kester, B. and Kleppe, M. (2015). Acceptatie, professionalisering en innovatie. Persfotografie in Nederland, 1837-2014. In Bardoel, J. & Wijfjes, H., Journalistieke Cultuur in Nederland. Amsterdam University Press, pp. 53-76.
Kleppe, M. (2013). Canonieke Icoonfoto's. De rol van (pers)foto's in de Nederlandse geschiedschrijving. Delft: Eburon.
Kleppe, M. (2012). Foto’s in Nederlandse Geschiedenisschoolboeken (FiNGS). DANS
http://dx.doi.org/10.17026/dans-zfn-u8k4.
Kleppe, M., Zeijl, L. and Kester, B. (2014). Foto’s in Nederlandse kranten (FiNK). DANS
http://dx.doi.org/10.17026/dans-2cz-x7rh.
Kleppe, M. (2012). Wat is het onderwerp op een foto? De kansen en problemen bij het opzetten van een eigen fotodatabase. Tijdschrift voor Mediageschiedenis 14(2): 93-107.
Kousha, K., Thelwall, M. and Rezaie, S. (2010). Can the impact of scholarly images be assessed online? An exploratory study using image identificationtechnology. Journal of the American Society for Information Science and Technology 61(9): 1734-44.
Manovich, L. (2009). Cultural analytics: Visualing cultural patterns in the era of more media. Domus (923).
Manovich, L. (2012). How to compare one million Images? In Berry, D. M., Understanding Digital Humanities, pp. 249-78.
Manovich. L. and Tifentale, A. (2015). Selfiecity: Exploring Photography and Self-Fashioning in Social Media. In Berry, David M., Dieter, M. (eds), Postdigital Aesthetics: Art, Computation and Design, pp. 109-22.
McCLoud, S . (1994). Understanding Comics: The Invisible Art. New York: HarperPerennial.
Ordonez, V., Kulkarni, G. and Berg, T. L. (2011). Im2text: Describing images using 1 million captioned photographs. In NIPS.
Reser, G. and Bauman, J. (2012). The Past, Present, and Future of Embedded Metadata for the Long-Term Maintenance of and Access to Digital Image Files. International Journal of Digital Library Systems (IJDLS), 3(1): 53-64.
Reside, D. (2014). Using Computer Vision to Improve Image Metadata. Digital Humanities 2014. http://dharchive.org/paper/DH2014/Paper-294.xml. .
Resig, J. (2013). Using Computer Vision to Increase the Research Potential of Photo Archives. http://ejohn.org/research/computer-vision-photo-archives/#analysis-implementations..
Salt, B. (1974). The Statistical Style Analysis of Motion Pictures. Film  Quarterly, 28(1): 13-22.
Terras, M. M. and Kirton, I. (2013). Where do images of art go once they go online? A Reverse Image Lookup study to assess the dissemination of digitized cultural heritage. Selected papers from Museums and the Web North America, pp. 237–48.