Computational stylistics now has thirty years or so of publications and has been celebrated as one of the success stories of Digital Humanities (McCarty, 2014: 289). It brings together statistical methods and literary analysis, inferring meaning from the frequency of literary features. In this paper I explore this connection between frequency and meaning, and consider some of the objections which have been made to the statistical approach to style.
Much research in linguistics focuses on individual sentence-level structures. Stylistics introduces a new dimension of extension and cumulation, placing a net of continuing co-occurrence over a language sequence. The added dimension of time, or extent, opens up the analysis of meaning not only in the instance itself but in the series that it forms with other instances of the same feature. Computational stylistics takes a routinely quantitative approach to this cumulative aspect.
Critics have laid down significant challenges to this frequentist approach. They have questioned whether language features are really countable, whether frequency matters in meaning, whether the inevitable choice of features to count undermines the objectivity of the results, and whether quantitative results can ever usefully relate the text to any wider context.
2. Are language features countable?
The first key enabling assumption of computational stylistics is that the language features being counted are homogeneous. In his 1970s articles attacking stylistics, Stanley Fish argues that meaning is constructed by the reader at the moment of reading and concludes that stylistics is therefore an invalid practice. There is no meaning in the word on the page, so it is pointless to count instances of the word, of combinations of words, or of any other language feature (Fish, 1973; Fish and Graham, 1979). John Frow, likewise, argues that in literary study features are not stable or commensurate but relational, so counting them is pointless (cited in Bennett, 2009: 287).
3. Is frequency meaningful?
Many would also question the relation between frequency and salience. It seems unwise to assume that an unusual accumulation of a feature is necessarily noticed by writer or reader. In a recent online post critiquing digital humanities, Fish argues that only patterns intended by the author are worth discussing (Fish, 2012). Different frameworks influence the noticeability of language elements: a single instance may be highly salient, while a cluster of instances may pass without any conscious reaction.
4. Function words
Considering function words as a basis for counting helps counter these objections. Computational stylistics has a natural alliance with function words. Function words lend themselves to computation since they are easy for a machine to recognise and appear regularly and in large numbers, offering opportunities for analysis by statistical methods whose power is well established in other domains. Conversely, computation has a special benefit for function-word analysis, because counting on a scale not possible for the unaided reader reveals hitherto latent patterns in the behaviour of these words.
Function words do not have a semantics in the usual sense: "if" has a structural function rather than a meaning. The stylistic import of the word only becomes clear in repetition. By contrast, lexical words are rich in meaning in the individual instance and do not necessarily achieve any cumulative effect through a series. Function words bear traces of larger structures and hence, though not salient in themselves, their frequencies bear meaning as indexes to wider discourse orientations. They help show how a language feature can be sufficiently homogeneous to justify counting, and how frequencies can have a literary dimension.
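The kind of counting described above can be illustrated with a minimal sketch. The word list, the tokenising rule, and the sample sentence are all assumptions made for the illustration, not features of any particular study; published work typically uses far longer word lists and far larger samples.

```python
import re
from collections import Counter

# Hypothetical, deliberately short list of function words (an assumption
# for this sketch; real studies use much longer lists).
FUNCTION_WORDS = ("the", "and", "of", "to", "a", "in", "if", "but", "not", "that")


def function_word_rates(text, words=FUNCTION_WORDS):
    """Return occurrences per 1,000 tokens for each listed function word."""
    # Crude tokeniser: runs of lower-case letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {w: 0.0 for w in words}
    counts = Counter(tokens)
    return {w: 1000.0 * counts[w] / len(tokens) for w in words}


# Invented sample sentence, used only to show the call.
sample = "If the rain stops and the wind drops, we go; if not, we stay in the house."
for word, rate in function_word_rates(sample).items():
    print(f"{word:>5}  {rate:7.1f} per 1,000 tokens")
```

Expressing counts as rates per 1,000 tokens is one common way of making samples of different lengths comparable; it is a convenience of the sketch rather than a claim about the method of any study cited here.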
There are two other important objections to consider: the possible bias arising from the fact that a judgement has to be made about which features to count, and the difficulty of relating patterns found within a corpus to extra-textual factors.
5. Features have to be chosen, so results are arbitrary
Tony Bennett points out that researchers have to choose the units to count (there are no "given units") and argues that this choice has a necessary influence on results, which undermines any claims to objectivity (2009: 290, 291).
This is a fundamental critique of quantitative study as such. The logical extension would be that the choice of units always determines the results, so there can be no surprises and nothing new can be learned. It is easy to show that there are cases where this is not so. If we ask whether women write differently from men, we have a way of validating the units: if the pattern of use of a given unit shows a significant difference between balanced and commensurate sets of samples of the writing of women and of men, then it does not matter how the unit was chosen. Here we have an external basis, the difference between two objectively based classes, on which to discard some units and accept others. Then there are cases of classification, e.g. by author and by date. We can seek markers of the classes, check them against known members of the classes, and then apply them to disputed cases. We have an objective way of validating the units, so we need not care much about where they came from.
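One way of making this validation step concrete is a permutation test on the rates of a candidate unit in two classes of samples. The sketch below is an illustration under stated assumptions: the permutation test is chosen here for its simplicity and is not named in the argument above, the function name is hypothetical, and the two lists of rates are invented figures, not measurements.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in mean rates."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign class labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)


# Invented rates per 1,000 tokens of one candidate unit in two classes.
class_a_rates = [30.1, 28.7, 31.4, 29.9, 30.8, 29.2]
class_b_rates = [24.3, 25.9, 23.8, 26.1, 24.7, 25.2]

p = permutation_p_value(class_a_rates, class_b_rates)
print(f"estimated p-value: {p:.4f}")
```

A unit whose rates separate the two classes under such a test would be retained and one that fails would be discarded, whatever led to its being proposed in the first place. When many candidate units are screened at once, some allowance for multiple comparisons would also be needed; the sketch leaves that out.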
Computational stylistics begins with textual features, focuses on finding patterns in their use, provides striking visualisations of the patterns, and then struggles to relate the patterns to extra-literary events. The textual data is well defined and easy to explore, and with the help of statistics it can be shown that there are robust structures within it. The world of possible causation beyond is hard to limit and hard to quantify. If there is (say) a consistent and marked increase in the Shannon entropy of the language of Victorian novels from early in the period to late, how could that be described in terms of the reading experience? And how could that be related to the forces acting on the novel? Computational stylistics is lop-sided: very well developed on the textual side, but weak (tentative and fragmentary) in relating statistical findings to the extra-textual world. Another way of saying this is to call computational stylistics formalist. In this sort of approach the evidential force of the explanation for a pattern will always be less than that for the pattern itself. However, it is only fair to point out that in this it is in the same situation as other literary methods. A literary effect may be demonstrable, but its genesis in composition, and the larger forces to which it relates, are always matters of judgement and selective contextualisation. The text is available, even if dauntingly complex, but the conditions which made it possible have to be painstakingly and always speculatively recreated. It is easier to show that Hamlet changes in the course of his play than that this observed change relates to Early Modern beliefs about the typical course of melancholia.
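The Shannon entropy mentioned above shows how easily the textual side of such a question can be computed, in contrast to the interpretive side. The sketch below takes entropy over the distribution of word forms in a text, H = -sum(p * log2(p)), where p is the relative frequency of each form. Measuring at the level of word forms and the crude tokeniser are both assumptions of the illustration; entropy could equally be taken over characters or word sequences.

```python
import math
import re
from collections import Counter


def shannon_entropy(text):
    """Entropy, in bits per token, of the word-form distribution of a text."""
    # Crude tokeniser: runs of lower-case letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Invented strings: pure repetition versus four distinct word forms.
print(shannon_entropy("never never never never"))
print(shannon_entropy("never again shall we"))
```

A text that repeats one word has the lowest possible entropy, and a text in which every token is a different word has the highest for its length. Because the figure depends on sample length, any comparison across a period would need samples of equal size. What such a rise would mean for a reader is precisely what the number does not say.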
Computational stylistics has proved itself in the realm of classification. In this area the methods can be thoroughly tested and success or otherwise can be demonstrated. There are some well-established and significant findings, leading to a reassessment of some commonplaces such as the downplaying of authorship as a factor in style (Egan, 2014). This presents a problem for those who think that counting literary features is inherently unsafe, that frequencies in language cannot have any real force, and that all feature choice is fatally arbitrary. Beyond classification, though, these objections still have some force, and a new one intrudes, the argument that computational stylistics is disablingly formalist. Computational stylistics now needs to produce findings in more properly stylistic areas of the same weight as its justly celebrated classification ones, findings which match the style within a corpus to the world beyond it. Only then will we be confident that frequency in literary language is linked to meaning, and that computational stylistics has the methods to do justice to this link.