DH 2016 Abstracts

Concept Modeling the Advertising Chinese Modern Society

In the late 19th century, thousands of industrially produced consumer items flooded into China's extraterritorially governed, internationally regulated treaty port cities. These foreign commodities formed the backbone of a new, urban, popular consumer culture. Consequently, the advertising industry infiltrated commodity brands and branding techniques into everyday life, making commodity images a paramount symbol of civilized urban life. Advertising ephemera thus provide researchers with the conditions for thinking about modernity par excellence, since they break data free of its origins to demonstrate how concepts embedded in ads insinuate themselves into all consumer cultures (Barlow, 2012).

To force advertising to speak clearly, we launched the Chinese Commercial Advertisements Archive (“CCAA”) and ‘metadated’ (Lev Manovich’s term) more than ten thousand high-quality images from microfilm copies of three major Chinese commercial newspapers published between 1880 and 1940 (Manovich, 2002). CCAA applies a customized metadata schema, based on the Dublin Core structural standard, to each digital advertisement image, recording all relevant information, e.g., cartoons, brand icons, text and syntax, street names and business titles. Our metadata include descriptive content, contextual information, and bibliographical, technical and image-source details such as location, copyright status, and owning institution.
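As an illustration only, a record in a Dublin Core based schema of this kind might be sketched as follows. The fifteen element names are those of the Dublin Core Metadata Element Set; the class name `AdRecord`, the extension fields (`brand`, `street_name`, `has_cartoon`) and every value shown are hypothetical placeholders, not the actual CCAA schema or data.

```python
from dataclasses import dataclass, field

# The fifteen element names of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


@dataclass
class AdRecord:
    """One advertisement image: Dublin Core elements plus assumed extensions."""
    dc: dict                                  # Dublin Core element -> value
    ext: dict = field(default_factory=dict)   # project-specific fields (hypothetical names)

    def __post_init__(self):
        # Reject keys that are not Dublin Core elements.
        unknown = set(self.dc) - DC_ELEMENTS
        if unknown:
            raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")


# Placeholder values only; none of this is real archive data.
record = AdRecord(
    dc={
        "identifier": "example-000001",
        "type": "StillImage",                 # term from the DCMI Type Vocabulary
        "source": "microfilm copy of a newspaper page",
        "language": "zh",
        "rights": "copyright status goes here",
    },
    ext={"brand": "(brand name)", "street_name": "(street)", "has_cartoon": True},
)
```

Keeping the standard elements and the project-specific fields in separate dictionaries is one possible design choice: it makes it easy to export the Dublin Core part unchanged while the customized part evolves.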

Scholars have already studied categories like hygienic/卫生, modern/现代, human/人类, eugenics/优生, and female/女性 in commercial and common ideas. They sampled image-based advertisements in libraries using newspapers, facsimiles and microfilm/fiche. Though more recent research projects have done a poor job of digitizing advertising, we cannot ignore ad digitization, because historians are still generalizing from a fraction of the ads that would make up any potential archive. To avoid wasting time and to collaborate with other scholars developing what Franco Moretti calls ‘distant reading,’ we seek to connect concepts appearing in advertisements to concepts found in sociological texts by applying statistical text mining to advertising copy (Hayles, 2012).

Space prevents a full literature survey here, but we have met with pioneering researchers Professor Peter Bol of the ‘China Biographical Database Project’ at Harvard University and Professors Zheng Wenhui and Liu Zhaolin, co-PIs of the ‘Database for the Study of Modern Chinese Thought and Literature (1830–1930)’ at Taiwan National Chengchi University, and we now have available over 30,000 annotated ad images, which our proposed paper will use to augment evidence and expand analysis.

On the basis of these 30,000 annotated newspaper advertising images, our work is generating a text-mining model for advertising language, a language that presents historically anchored technical difficulties, as follows:

Lack of word boundaries and punctuation. Word boundaries in Chinese are invisible; worse, ad slogans are not punctuated. The raw data are just a sequence of unsegmented Chinese characters, which makes text mining in Chinese comparatively hard.
Lack of vocabulary definitions. Ad texts contain many unstable, idiosyncratic technical terms, such as company names written in different ways, transliterated brand names, product names and so on, which we discover only during the text mining process.
Lack of training data. Most Chinese text mining methods depend on high-quality training data and will fail if the target texts differ markedly from the training data. Because the advertisements that interest us come from regional newspapers over a long period, ad writing style varies with local linguistic differences. We cannot rely on current training data in modern Chinese to establish models for mining the 1920s syntax, vocabulary, punctuation (or lack of it), word use, semantic references and ideograph variation of 100-year-old print media.
Difficulty distinguishing technical and background words. Ad texts are a mixture of technical and background phrases, so it is not a trivial task to distinguish technical terms, our true interest, from noise, including words rarely used a century after the ads were published.

We have overcome many of these roadblocks using statistical methods for Chinese text mining and knowledge discovery. Text mining allows us to: 1) discover potential associations among features and terms extracted from advertisements; and 2) build links between these and ideological trends in China's treaty port urban areas during our period. To do so we are developing Deng Ke’s statistical text mining method to establish indices of technical terms (“TT”) and metadated association patterns among technical terms (“APTT”) (Deng, Geng and Liu, 2014). The Word Dictionary Model (“WDM”) and the Advanced Word Dictionary Model (“AWDM”) are tools for word discovery, text segmentation and entity recognition in Chinese texts when training data are not available. WDM can be extended into an AWDM to achieve automatic recognition of TT (distinguishing technical terms from background words and phrases). Here, technical terms are the specific phrases that we choose from the datasets or from image metadata and establish as concepts in the network.
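To make the segmentation problem concrete, here is a minimal sketch of dictionary-based segmentation: given a word list with probabilities, dynamic programming finds the most probable way to split an unsegmented string. This is a generic unigram segmenter offered as an assumption about the broad family of technique, not the WDM or AWDM itself; the function name `segment`, the `max_len` parameter, and the toy dictionary with its probabilities are all invented for illustration. In an unsupervised setting the dictionary and its probabilities would themselves have to be estimated from the ad texts rather than supplied by hand.

```python
import math


def segment(text, word_probs, max_len=4):
    """Most probable split of `text` into dictionary words (unigram model)."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)           # back[i]: start of the last word in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = word_probs.get(text[start:end])
            if p is None or best[start] == -math.inf:
                continue
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end], back[end] = score, start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this dictionary")
    # Walk the back-pointers from the end to recover the words.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy dictionary with made-up probabilities (illustration only).
word_probs = {
    "上海": 0.2, "上": 0.02, "海": 0.02,
    "卫生": 0.2, "卫": 0.01, "生": 0.01,
    "牙粉": 0.2, "牙": 0.05, "粉": 0.05,
}
print(segment("上海卫生牙粉", word_probs))  # ['上海', '卫生', '牙粉']
```

The multi-character words win here because each has a higher probability than the product of its single-character parts, which is the basic mechanism by which a word dictionary resolves invisible word boundaries.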

To this end we are developing the following indices: 1) Bibliographical (volume, issue, page numbers, location, date), enabling statistical analysis of ad publication frequency in one or several newspapers over the course of one or many years; 2) Contextual information (brand, product category, company, agency, retailer’s address, registered nationality), allowing users to establish a statistical picture of a commodity in specific newspapers, geographical locations and decades; 3) Content (sorting by depictions of males, females, elders, youths, middle-aged people, infants, humans, animals, plants, Chinese, foreigners), since ad images are hybrid artifacts mixing text and cartoons; 4) Theoretical categories (the modern, human, woman), identifying categories used aggressively in ads. Once the TT in every advertisement have been located and the indices of TT identified, we can reveal the APTT of ads, defined as subsets of technical terms that tend to co-occur frequently in an advertisement. With the theme dictionary model (TDM) of Deng, Geng and Liu (2014), association pattern discovery can be converted into a statistical inference problem and solved by statistical means.
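The idea of an association pattern, a subset of technical terms that tends to co-occur in the same advertisement, can be illustrated with plain frequency counting over term pairs. This is a deliberately simple stand-in and not the model-based statistical inference described above; the function name `cooccurring_pairs`, the `min_count` threshold and the four toy "ads" are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(ads, min_count=2):
    """Count term pairs that appear together in at least `min_count` ads."""
    counts = Counter()
    for terms in ads:
        # Sort so each unordered pair is always stored under one key.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


# Each toy "ad" is the set of technical terms found in it (invented data).
ads = [
    {"卫生", "现代", "女性"},
    {"卫生", "女性"},
    {"现代", "人类"},
    {"卫生", "现代", "女性"},
]
pairs = cooccurring_pairs(ads)
for pair, count in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(pair, count)
```

Raw counts like these favor terms that are simply common; a statistical model is needed to decide which co-occurrences are more frequent than chance would predict.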

Second, we seek concept networks that connect key concepts embedded in ads to sociological theories. The Concept Network (CN) is a graph that can efficiently represent domain knowledge and support reasoning based on it. Each CN node is a concept corresponding to an entity or a technical term. If concept A appears in the definition of concept B, we add a directed link from A to B, and the domain topology will eventually reflect the structure of the knowledge system: closely related concepts are direct neighbors or lie in the same neighborhood, while concepts belonging to different disciplines or areas will be far away from each other in the graph. Building a CN requires indices of concepts and their descriptions. Traditional dictionaries might be one source, and our period shows an efflorescence of dictionary publication. Another source is online knowledge databases like Wikipedia. However, domain knowledge of sociological theories is not represented in any language or for any period anywhere on the World Wide Web. To compensate, we are building an ontological knowledge database of sociological theories as these appeared in journals, books, articles and archived documents, ‘Social Thought in Modern China, 1830–1940’ (STMC). With an ontological database, we can open our sharing platform to define and describe key concepts and the relationships among them. Interdisciplinary by design, CNMACMS welcomes users to participate by entering data into the databases to improve our model.
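The link rule described above (an edge from A to B whenever A appears in the definition of B) and the notion of distance between concepts can be sketched in a few lines. The function names `build_network` and `distance`, the whitespace tokenization, and the English toy definitions are all invented for illustration; real definitions would come from period dictionaries or the STMC database and would need proper Chinese segmentation first.

```python
from collections import deque


def build_network(definitions):
    """Add a directed edge A -> B when concept A occurs in B's definition."""
    edges = {concept: set() for concept in definitions}
    for b, text in definitions.items():
        tokens = set(text.split())
        for a in definitions:
            if a != b and a in tokens:
                edges[a].add(b)
    return edges


def distance(edges, src, dst):
    """Number of links between two concepts, ignoring edge direction."""
    neighbors = {c: set(out) for c, out in edges.items()}
    for a, outs in edges.items():
        for b in outs:
            neighbors[b].add(a)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            queue.append((nxt, d + 1))
    return None  # the two concepts are not connected


# Invented toy definitions, for illustration only.
definitions = {
    "modern": "of the present era",
    "hygiene": "practices that keep the modern body healthy",
    "eugenics": "a doctrine applying hygiene to heredity",
    "human": "a member of the modern species",
    "woman": "a category of the human",
}
network = build_network(definitions)
print(distance(network, "modern", "hygiene"))   # 1
print(distance(network, "eugenics", "woman"))   # 4
```

In this toy network "modern" sits at the center, directly linked to "hygiene" and "human", while "eugenics" and "woman" lie four links apart, which is the sense in which graph distance can stand in for conceptual distance.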

Notes

For studies by scholars on categories like hygienic/卫生, modern/现代, human/人类, eugenics/优生, and female/女性 in commercial and common ideas, see Jin Guantao (金观涛) and Liu Qingfeng (刘青峰), Studies in History of Idea: The Building of Basic Political Concepts in Modern China (观念史研究:中国现代重要政治术语的形成), Falv Press, 2009. This book investigates the origins and transformations of ten basic concepts, including “gonghe” (republicanism), “minzhu” (democracy), “quanli” (rights), “geren” (individual), “geming” (revolution), and “kexue” (science), in modern Chinese history, using data from the “Database for the Study of Modern Chinese Thought” (1830–1930). Huang Kewu (黄克武), “從申報醫藥廣告看民初上海的醫療文化與社會生活 1912–1926,” explores the idea of “disease” in advertisements published in Shen Bao during the early 20th century. Tani Barlow’s published papers and her book, In the Event of Women (Durham: Duke University Press, 2017), establish a historical parallel connecting advertising ephemera, social theory and the woman category.

Bibliography

Barlow, T. (2012). Advertising Ephemera and the Angel of History. Positions: asia critique, 20: 111–58.
Bergere, M. C. (1989). The Golden Age of the Chinese Bourgeois, 1911–1937. Cambridge: Cambridge University Press.
Deng, K., Geng, Z. and Liu, J. S. (2014). Association pattern discovery via theme dictionary models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76: 319–47. doi: 10.1111/rssb.12032.
Hayles, N. K. (2012). How We Think: Transforming Power and Digital Technologies. In Berry, D. M., (ed.), Understanding Digital Humanities. London: Palgrave Macmillan, pp. 42–66.
Manovich, L. (2002). Metadata, Mon Amour. http://manovich.net/index.php/projects/metadata-mon-amour (accessed 14 March 2016).