A comparative study of the vocabulary of the greatest Western authors - and the winner is . . .

by Ray

With the help of modern technology and a clever engineering student, we have analysed in depth the vocabulary in 25 of the most famous one-volume works in the history of Western literature - celebrated texts by Dante, Cervantes, Shakespeare, Balzac, Flaubert, Victor Hugo, Charles Dickens, William Thackeray, Herman Melville, Marcel Proust, Jack London, Thomas Mann, Robert Musil, James Joyce, Ernest Hemingway, F. Scott Fitzgerald, and John Steinbeck.

We have simply counted the number of different words in each text, ignoring punctuation marks (other than embedded dashes and apostrophes), numbers, special characters, initials and Roman numerals.

However, grammatical and spelling variations of the same base word have not been merged: each variant counts as a separate word in the total for each work.

On the other hand, upper-case and lower-case variants of the same word have been counted as a single word.
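For readers who would like to try this kind of count themselves, here is a minimal sketch in Python of the rules just described. It is not the program used for this study: the name `distinct_words` and the three patterns are our own assumptions, and the tests for initials and Roman numerals are rough heuristics (a capital "I." at the end of a sentence, for instance, would be mistaken for an initial).

```python
import re

# Assumption: a word is a run of letters, optionally joined by embedded
# apostrophes or hyphens ("don't", "well-known"). Digits never match.
WORD = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")

# Assumption: an all-capitals token of two or more letters that parses as
# a Roman numeral (e.g. "XIV") is a chapter number, not a word.
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

# Assumption: an initial is a single capital letter followed by a full stop.
INITIAL = re.compile(r"\b[A-Z]\.")


def distinct_words(text: str) -> set[str]:
    """Return the set of distinct word forms in text, ignoring case."""
    text = INITIAL.sub(" ", text)  # drop initials such as "F."
    words = set()
    for token in WORD.findall(text):
        if len(token) > 1 and ROMAN.fullmatch(token):
            continue  # skip Roman numerals
        words.add(token.lower())  # "The" and "THE" count as one word
    return words


sample = "The cat's cat sat. THE well-known CAT, in Chapter XIV, met F. Scott in 1925; cats sat."
print(len(distinct_words(sample)))
```

Note that "cat", "cats" and "cat's" are counted as three different words here, in keeping with the decision not to merge grammatical variations.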


There is a case to be made that grammatical variations of the same word (noun plurals, conjugated verbs, declined adjectives, etc.) should not be included in comparative vocabulary studies.

Although doing that for all of the masterworks listed above would be an undertaking well beyond the scope of this essay, we have done a partial study of the rate of elimination of grammatical variations across a sampling of the works in question, shown below:

This shows a “grammatical adjustment” rate of some 22-23% for most of the works in English, with a somewhat higher rate for Charles Dickens (26%) and James Joyce (26%), and higher rates still for Thomas Mann (German) and Gustave Flaubert (French): 28% and 38% respectively. And we have reason to believe that the grammatical adjustment rate for works in the other Romance languages is similar to that for French.
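To make the idea of an adjustment rate concrete, here is a small Python sketch. It assumes the rate is the share of distinct word forms that disappear once grammatical variants are merged into their base word. The function name `adjustment_rate`, the hand-made `base_of` table and the eight toy words are purely illustrative and are not data from this study.

```python
def adjustment_rate(forms: set[str], base_of: dict[str, str]) -> float:
    """Fraction of distinct forms eliminated by merging grammatical variants.

    base_of maps a variant to its base word; forms missing from the
    table are treated as their own base word.
    """
    if not forms:
        raise ValueError("forms must not be empty")
    bases = {base_of.get(form, form) for form in forms}
    return (len(forms) - len(bases)) / len(forms)


# Toy illustration only: eight word forms that reduce to four base words.
toy_forms = {"walk", "walks", "walked", "cat", "cats", "blue", "run", "ran"}
toy_bases = {"walks": "walk", "walked": "walk", "cats": "cat", "ran": "run"}
print(f"{adjustment_rate(toy_forms, toy_bases):.0%}")
```

The hard part in practice is building the table of base words, which is why the adjusted counts in this study were necessarily semi-manual.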


The conclusion is clear: whether one considers the straightforward (and error-free) computer count or the necessarily semi-manual "adjusted" count, the relative vocabulary counts of the various works are largely unchanged.

So in purely quantitative terms (quality is exceptionally high throughout this selection of great works, obviously) THERE IS ONE CLEAR WINNER!!