[PW] "Bag of Words" as an insult

John Cowan cowan at ccil.org
Mon Jan 9 12:53:01 PST 2017


On Mon, Jan 9, 2017 at 3:35 PM, Claire <clairefromclare at gmail.com> wrote:

> The concept "bag of words" started in linguistics in 1954 and got picked up
> by the computer programming world, which is mostly what you get when you
> try to work your way through this. Neither is the meaning we're after.
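In the computing sense, at any rate, a "bag of words" is just a multiset of
tokens: word counts with all ordering discarded. A minimal sketch in Python,
using only the standard library:

```python
from collections import Counter

def bag_of_words(text):
    """Return a multiset (bag) of lowercase word tokens: counts only, no order."""
    return Counter(text.lower().split())

# Two sentences with very different meanings...
a = bag_of_words("Mohammed will come to the mountain")
b = bag_of_words("the mountain will come to Mohammed")

# ...are identical once reduced to bags of words.
print(a == b)  # True
```

Which is exactly why the representation makes a poor theory of language, as the
article below argues: sentences that differ only in structure collapse into the
same bag.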

Geoffrey Pullum (linguistics) and the late Barbara Scholz (philosophy) both
spoke of the "big bag of words theory", which is the folk theory that the
important thing about a language is its words.  This excerpt from a
Language Log post at
<http://itre.cis.upenn.edu/~myl/languagelog/archives/000256.html>
describes the idea and provides a more formal citation:

> Watson, of course, like just about every non-linguist who ever writes about
> language, presupposes that a language is just a big bag of words. Barbara
> Scholz and I have attacked that idea (in Nature 413, 27 September 2001,
> p.367), but it's not that we think anyone will listen or anything will
> change. Everybody thinks that the key thing about a language is which words
> it has -- and above all, how many.


The derogatory flavor of this version seems to connect it to the quotation
you give, though perhaps as an influence or analogue rather than an actual
source.  Linguists are to be found in English departments from time to
time, after all.

Here's the text of the Nature article:

In the popular view, a language is merely a fixed stock of words. Purists
worry about foreign loanwords; conservatives decry slang; and groundless
claims that there are hundreds of Eskimo words for snow are constantly made
in popular writing, as if nothing matters about languages but their
lexicons.

But the popular view cannot be right, because (as linguist Paul Postal has
observed) membership in the word stock of a natural language is open.
Consider this example: “GM’s new Zabundra makes even the massive Ford
Expedition look economical.” If English had an antecedently given set of
words, then this expression would not be an English sentence at all,
because ‘Zabundra’ is not a word (we just invented it). Yet the sentence is
not just grammatical English; it is readily interpretable (it clearly
implies that the Zabundra is a large, fuel-hungry sport utility vehicle
produced by General Motors). Similar points could be made regarding word
borrowing, personal names, scientific nomenclature, onomatopoeia,
acronyms, and so on; English is not a fixed set of words.

A more fundamental reason that a language cannot just be a word stock is
that expressions have syntactic structure. For example, in most languages,
the order of words can be significant: “Mohammed will come to the mountain”
contains the same words as “The mountain will come to Mohammed”, but the
expressions are very different. Inclusion within phrases is also an
important part of syntactic structure. In the expression “We could not tell
him”, ambiguity arises from the fact that ‘not’ may belong in the same
phrase as ‘tell him’ — in which case the meaning is that keeping him in the
dark is possible or permitted — or it may be outside the ‘tell him’ phrase,
in which case ‘not’ belongs with ‘could’ and the meaning is that telling
him is impossible or forbidden.

The syntactic structure of natural languages has several important
features. One is revealed by the example just considered; there is no
guarantee of what mathematical logicians call ‘unique readability’ — there
is no one-to-one correspondence between sound strings and syntactic
structures, or between syntactic structures and meanings. Natural languages
are replete with ambiguity.
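The scope ambiguity in "We could not tell him" can be made concrete by
writing out the two bracketings; a sketch in Python, with parse trees as
nested tuples (the phrase labels are mine, not the authors'):

```python
# Two distinct syntactic structures over the same word string,
# as nested tuples of the form (label, child, child, ...).
reading_1 = ("S", "we", ("VP", "could", ("VP", "not", ("VP", "tell", "him"))))
reading_2 = ("S", "we", ("VP", ("Mod", "could", "not"), ("VP", "tell", "him")))

def words(tree):
    """Flatten a parse tree back to its surface word string."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in words(child)]

# Same surface string, different structures: no unique readability.
print(" ".join(words(reading_1)))  # we could not tell him
print(" ".join(words(reading_2)))  # we could not tell him
print(reading_1 == reading_2)      # False
```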

A second feature is that there is no upper limit on the complexity of
expressions. A verb phrase such as ‘run away’ can be embedded in a larger
verb phrase such as ‘see Spot run away’, and there is no syntactic limit on
further embedding, so expressions can be of arbitrary complexity: “Tell him
they think he overheard someone ask her to confirm that they saw him
watching us waiting for you to see Spot run away in order to ...”. Hence,
natural languages are productive, as they possess the structural resources
for indefinite recombination.
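The indefinite recombination the authors describe is just recursion; a
hypothetical sketch in Python of unbounded embedding (the embedding frames
are mine, chosen to echo the article's example):

```python
# Hypothetical embedding frames: each wraps a verb phrase in a larger one.
FRAMES = [
    "see Spot {}",
    "ask her to confirm that they saw him {}",
    "tell him they think he overheard someone {}",
]

def embed(depth):
    """Recursively embed a verb phrase inside larger verb phrases.
    There is no syntactic limit: depth can grow without bound."""
    phrase = "run away"
    for level in range(depth):
        phrase = FRAMES[level % len(FRAMES)].format(phrase)
    return phrase

print(embed(0))  # run away
print(embed(1))  # see Spot run away
# embed(n) is well-formed for every n: the structural resources
# for recombination never run out.
```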

A third feature of natural language syntax is its variability. Even within
a single speech community (which we can roughly define as a human group
whose members broadly understand each other’s speech and recognize it as
being characteristic of the group), there are quite sharp differences
concerning the relevant regularities of syntactic structure, both between
subgroups (dialect differences) and between individuals, whose
idiosyncratic divergences mostly go unnoticed.

Fourth, malformations of syntax vary in their severity — some partially
ill-structured expressions are more deviant than others. President George
W. Bush’s departures from standard English syntax are well known; Tarzan’s
departures are more extreme, and Yoda’s even more so (“Already know you
that which you need”); yet the structure of English is partially respected
in each case. Likewise, familiar phenomena such as hesitation (“It’s in the
... in the drawer”) and use of fragments (“And that would be ...?”)
partially conform with English expressions; they are not random jumbles of
words.

Within mathematical logic and computer science, invented formal symbolic
systems are called languages, but in some respects they are strikingly
different. Their syntactic structure allows embedding and re-embedding, but
their vocabularies are fixed; ambiguity is ruthlessly excluded; structures
must be completely well-formed, and disrupted or fragmented expressions
simply do not belong at all. Most of the linguistic work on syntactic
theory over the past 50 years has used the mathematical methods devised for
defining formal languages of this sort. This suggests a possible confusion
of formal tools with subject matter.

To formulate grammars that describe natural languages precisely, we need a
formal metalanguage that has the properties of any other scientific, formal
language: a recursively defined, unambiguous syntax that defines a
countably infinite set of expressions. But natural languages themselves —
the focus of scientific linguistics — are not precisely delineated sets of
expressions, any more than they are precisely delineated sets of words.
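A formal language of the kind the authors contrast with natural language can
be pinned down in a few lines; a sketch in Python of a toy recursively defined
language (fully parenthesized expressions over a fixed vocabulary), just to
make the contrast vivid:

```python
def wf(tokens):
    """True iff tokens form a well-formed expression of the toy language:
       E ::= p | q | ( E and E ) | ( E or E )
    Membership is all-or-nothing: no ambiguity, no partial deviance,
    and the vocabulary is closed."""
    def parse(i):
        # Return the index just past a well-formed E starting at i, or None.
        if i < len(tokens) and tokens[i] in ("p", "q"):
            return i + 1
        if i < len(tokens) and tokens[i] == "(":
            j = parse(i + 1)
            if j is not None and j < len(tokens) and tokens[j] in ("and", "or"):
                k = parse(j + 1)
                if k is not None and k < len(tokens) and tokens[k] == ")":
                    return k + 1
        return None
    return parse(0) == len(tokens)

print(wf("( p and ( q or p ) )".split()))  # True
print(wf("( p and Zabundra )".split()))    # False: the word stock is fixed
```

A newly coined word like 'Zabundra' simply has no status here, whereas English
absorbed it instantly; that is the difference between a defined formal set and
an open natural language.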

The aspects of language that can be learned by certain great apes,
particularly the chimpanzees Pan troglodytes and Pan paniscus, are
curiously the very same ones that fit the popular conception of languages.
Apes can learn to name things using hand signs or visual symbols, and to
express some basic demands by uttering them (“Open fridge! Give apple
give!”), but they seem to be incapable of developing a productive grasp of
syntax.

Human languages exhibit a unique combination of characteristics: first,
semantic word-to-world relations that we share with other primates; second,
syntactic structures as complex and exact as in formal languages; and
third, an openness, flexibility and ambiguity that formal languages do not
allow.

[author's addresses omitted; they are obsolete]

 --
John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
Pour moi, les villes du Silmarillion ont plus de realite que Babylone.
                --Christopher Tolkien, as interviewed by Le Monde


More information about the Project-Wombat-FM mailing list