To appear in Trends in Cognitive Sciences. © 1997 Elsevier Science. There may be minor discrepancies between this document and the published version.

Probabilistic and Distributional Approaches to Language Acquisition

Martin Redington
Department of Experimental Psychology,
University of Oxford.

m.redington@ucl.ac.uk

Nick Chater
Department of Psychology,
University of Warwick.

nick.chater@warwick.ac.uk

Recent computational research on natural language corpora has revealed that relatively simple statistical learning mechanisms could make an important contribution to certain aspects of language acquisition. For example, statistical and connectionist methods can provide valuable cues to word segmentation, and to the acquisition of inflectional morphology, syntactic classes, and aspects of word meaning. In each case, these cues are partial, and must be integrated with additional information, whether from other environmental cues or innate knowledge, to provide a complete solution to the acquisition problem. The success of these methods with real natural language corpora demonstrates their feasibility as part of the language acquisition mechanism, where much previous research has been limited to highly idealized artificial input or to a priori considerations regarding the feasibility of acquisition mechanisms. Exploring probabilistic learning mechanisms with natural language input provides both an empirical basis for assessing how innate constraints interact with information derived from the environment, and a source of hypotheses for experimental test.

Theoretical accounts of language acquisition have emphasized the role of innate linguistic knowledge, with the influence of the child's environment playing a relatively minor role (e.g., Chomsky[1]). However, psychologists studying language development have to explain how the interaction of innate knowledge and the child's environment accounts for the developmental progression of language ability. No matter how great the contribution of innate knowledge to language acquisition, some aspects of language (e.g., vocabulary) must be learnt. Moreover, even putatively innate knowledge must be tuned (e.g., by "parameter setting") to the specific properties of the language to be learned.

It has recently become possible to study the potential contribution of learning in language acquisition from a new perspective. Potential models can be coded as computer programs, and exposed to (some approximation of) natural language input. This work explores the utility of important classes of language-internal, or distributional, information, derived from the relationships between linguistic units such as phonemes, morphemes, words, and phrases. Distributional information can be readily extracted by a range of probabilistic learning mechanisms, including connectionist networks and conventional statistics, which we shall collectively term distributional learning mechanisms. This approach is inspired by, and builds on, work in structural linguistics, where distributional methods were used as a methodology for deriving linguistic theories, rather than as models of acquisition[2]. The research we review shows that distributional information provides valuable cues to many aspects of language, which may potentially be exploited by the child.

Distributional information contrasts with extra-linguistic sources of information which infants might exploit, including features of the physical and social environment or the meaning of an utterance. Extra-linguistic information undoubtedly plays a major role in the acquisition of language, but its utility is difficult to evaluate computationally, because the child's representation of the environment is unknown: even if the resources to compile "corpora" relating language and environment were available, it would still be unclear how the environment should be encoded. This provides a methodological reason to focus on language-internal aspects of environmental input, although this approach is consistent with the possibility that distributional information may only be relevant to some aspects of language acquisition (see Box 1), and is compatible with the innateness of both domain-specific language learning mechanisms and knowledge of many universal properties of language[3,4]. By empirically evaluating probabilistic learning mechanisms with natural language input, it may be possible to assess how language-external factors and innate constraints interact with distributional information.

Box 1: Can distributional methods account for all language acquisition?

Two extreme views concerning the utility of distributional methods are possible. One is that distributional methods can learn all of language unaided. The other is that distributional methods can provide no useful information about any aspect of language. Debates concerning specific distributional methods often implicitly adopt, or are (mis)interpreted as advocating, one of these extreme positions. Our position is that distributional learning methods are valuable in a number of domains, such as those outlined in this article. But there are many aspects of language (e.g., syntax and compositional semantics) which exhibit highly complex and structured regularities. It has been argued that these are intractable to any learning method, including distributional methods, and hence require the existence of innate symbolic linguistic knowledge[a]. Whatever the strengths of these arguments, they are not in any way undermined by the success of distributional methods described in this paper. Indeed, it might be suggested that distributional methods are useful in learning to encode the aspects of the language which are specific to particular languages, so that innate language universal knowledge can be brought to bear. More generally, this might suggest a possible division of labour between distributional methods and traditional formal learning theory[b]. Nonetheless, we believe that the success of distributional methods in the limited aspects of language so far attacked does show that empirical research may produce better results than may be expected from considerations of linguistic theory. We therefore believe that pushing distributional methods as far as possible is an important enterprise, which is likely to illuminate both the value of distributional information and the nature of innate constraints.

References

a Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.

b Osherson, D. N., Stob, M. & Weinstein, S. (1985). Systems that learn. Cambridge, MA: MIT Press.


Below we outline the application of probabilistic methods to four important aspects of language acquisition, presenting a case study of recent research for each.

Segmentation

A problem faced early in language acquisition is segmenting the continuous speech stream into discrete words. This is difficult because conversational speech contains no "gaps" or obvious acoustic markers to signal word boundaries[5].

Theories of adult segmentation (e.g., the Cohort Model[6]) propose that the lexicon crucially constrains segmentation. But the child, possessing no initial lexicon, faces a chicken and egg problem. Somehow, the child must "bootstrap" the ability to segment and learn the lexicon of the natural language.

Many language-internal cues that the child may use in segmentation have been suggested, ranging from bootstrapping a vocabulary from single word utterances[7] and exploiting subtle acoustic/phonetic boundary markers in the speech signal[8], to prosody (including pauses, segmental lengthening, metrical patterns, intonation contours[9] and stress patterns[10]) and phonotactic constraints (i.e., sequential regularities between phonemes)[11].

Probabilistic computational models have focussed primarily on lexical stress[12] and phonotactic constraints[13,14,15]. The work we describe here shows how these cues can be combined within a single learning mechanism. Studying combinations of cues is important, because no single cue is likely to produce a complete solution to any problem in language acquisition. Christiansen, Allen and Seidenberg[16] trained a simple recurrent network (see Figure 1) to predict the next input from a representation of previous inputs (where inputs are coded in terms of phonological features, utterance boundaries [but not word boundaries], and stress), using a corpus of maternal speech to preverbal infants[17]. Stress patterns for words were obtained from a standard database (the MRC Psycholinguistic Database). The network's connection strengths are initially random. During training, it learns to exploit phonotactic regularities (e.g., that, in English, /a/ is rarely followed by another /a/, but quite frequently by a /b/) in the input. Because certain combinations of phonemes are more likely to occur at the starts and ends of words, these regularities provide a potentially useful cue for word segmentation.

[Figure 1]

Figure 1. The simple recurrent network model used by Christiansen et al. The current phoneme is represented as a set of features on the input layer. At the output layer, individual units represent each phoneme. The activation of the output units represents the network's prediction of the next phoneme in the sequence. At each timestep the hidden unit activations are copied back onto the context units, allowing the network to maintain prediction-relevant information. The "Ubm" unit codes for utterance boundaries in the input, and is a useful predictor of word boundaries in the output. (Reprinted with permission from Ref. 16).

Although the only boundary information that the net received concerned utterance boundaries, there was a good correlation between the SRN's prediction of boundaries (the activation of the output boundary unit) and the occurrence of word boundaries in the corpus: Figure 2 shows the activation of the network's boundary output unit over a short stretch of the corpus. Although the model has no lexicon, over the entire corpus, 43% of words were correctly segmented, and over 45% of segmented units correspond to words. Performance dropped marginally when stress was ignored, and dropped significantly if phonology was ignored. Distributional methods are capable of even better performance: A state-of-the-art specialized statistical method proposed by Brent and Cartwright[14] correctly segments 72% of words, and 65% of segmented units correspond to words. Christiansen et al. argue that their model is more psychologically plausible because it uses a general purpose sequential learning mechanism, which can combine different cues to segmentation, whereas Brent and Cartwright's model is cast at a relatively abstract level.
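The two scores quoted above (the proportion of words correctly segmented, and the proportion of segmented units that are words) can be illustrated with a minimal sketch. It assumes that a boundary is predicted wherever the boundary unit's activation exceeds its mean, a rule suggested by the horizontal line in Figure 2 but not confirmed here as the exact criterion, and that a word counts as correctly segmented only when both of its edges are predicted with no boundary in between. The function names and the toy activations are hypothetical.

```python
def spans_from_boundaries(n, boundaries):
    """Turn boundary positions into (start, end) spans.

    A boundary at position i means a break after phoneme i.
    """
    spans, start = [], 0
    for i in sorted(boundaries):
        spans.append((start, i + 1))
        start = i + 1
    if start < n:
        spans.append((start, n))
    return spans


def segmentation_scores(activations, true_boundaries):
    """Return (proportion of words found, proportion of units that are words)."""
    n = len(activations)
    mean = sum(activations) / n
    # Assumption: predict a boundary wherever activation exceeds the mean.
    predicted = {i for i, a in enumerate(activations) if a > mean}
    predicted.add(n - 1)  # the utterance end is given in the input
    true = set(true_boundaries) | {n - 1}
    true_words = set(spans_from_boundaries(n, true))
    predicted_units = set(spans_from_boundaries(n, predicted))
    hits = true_words & predicted_units
    return len(hits) / len(true_words), len(hits) / len(predicted_units)


# Toy utterance of nine phonemes, "the dog ran", with word boundaries
# after positions 2 and 5. Activations are invented for illustration.
activations = [0.1, 0.2, 0.8, 0.1, 0.6, 0.7, 0.1, 0.2, 0.3]
true_boundaries = [2, 5]
words_found, units_correct = segmentation_scores(activations, true_boundaries)
print(words_found, units_correct)
```

In this toy case the spurious boundary inside "dog" costs one word on both measures, which shows why the two percentages reported for a model need not be equal.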

[Figure 2]

Figure 2. The activation of the output boundary unit over a short stretch of the training corpus. Activation of the boundary unit at a particular position corresponds to the network's hypothesis that a boundary follows this phoneme. Black bars indicate the activation at lexical boundaries, whereas the grey bars correspond to activation at word internal positions. The horizontal line indicates the mean boundary unit activation across the whole corpus. A gloss of the input utterances is found beneath the input phoneme tokens (with "#" denoting an utterance boundary). (Reprinted with permission from Ref. 16).

The work of Christiansen et al. illustrates how a simple and general learning method can find a considerable amount of information about the structure of language, even though that structure, i.e., discrete words, is not overtly marked in the input. Moreover, it illustrates how computational analyses of corpora can shed light on how the informational value of different cues can interact (see Box 2).

Box 2: Interaction of cues

Many problems in language acquisition are difficult because no single feature of the input correlates with the relevant aspects of language structure. Although it is a natural starting point for computational and empirical research to study cues in isolation, it may be that the problem of acquisition is easier when multiple cues are taken into account. Figure A below shows how three constraints A, B and C, represented by regions of the hypothesis space, are insufficient to identify the correct hypothesis when considered in isolation. It is only by combining these cues that the hypothesis space can be substantially narrowed down. Thus, as the number of cues that the learner considers increases, the difficulty of the learning problem may decrease. This suggests that the cognitive system may aim to exploit as many sources of information as possible in language acquisition.

[Figure A]

Figure A. A conceptual illustration of three hypothesis spaces given the information provided by the cues A, B, and C. The x's correspond to hypotheses that are consistent with all three cues. (Reprinted with permission from Ref. 16).

Moreover, it is possible that cues may only be useful when considered together. For example, in the sequences in Figure B below, each cue X and Y seems completely random with respect to the target Z; but when considered together X and Y determine Z perfectly (specifically, Z has value 1 exactly when just one of X and Y has the value 1). Considering cues in isolation implicitly assumes that there is a simple additive relation between cues.

X: 1 0 0 1 1 1 0 1 0 1 1 0 0 0 1 1 0 1 0 1 0 1
Y: 0 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 1 1 1 0 0 1
Z: 1 1 1 1 0 1 0 0 1 1 0 1 0 0 1 1 1 0 1 1 0 0


Figure B. A sequence of cues. The Z sequence is independent of the value of X, and independent of the value of Y, but can be predicted exactly from X and Y together (it is their XOR).
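The claim in Figure B can be checked directly on the three sequences. The sketch below is only an illustration (the helper name is hypothetical): it compares how often Z is 1 for each value of a single cue, and then tests whether Z equals the exclusive-or of X and Y at every position. In a sample this short, the single-cue proportions are close to one another rather than exactly equal.

```python
X = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1]
Y = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
Z = [1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]


def proportion_z_given(cue, value):
    """Proportion of positions with Z = 1 among those where the cue equals value."""
    zs = [z for c, z in zip(cue, Z) if c == value]
    return sum(zs) / len(zs)


# Each cue on its own: Z is 1 about equally often whatever the cue's value.
for name, cue in (("X", X), ("Y", Y)):
    print(name, round(proportion_z_given(cue, 0), 2), round(proportion_z_given(cue, 1), 2))

# Both cues together: Z is exactly X XOR Y.
xor_matches = all(z == (x ^ y) for x, y, z in zip(X, Y, Z))
print("Z equals X XOR Y at every position:", xor_matches)
```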

Inflectional morphology

Acquiring morphology involves identifying the morphological processes in the language. Across languages, these processes are very diverse, including suffixes, prefixes, infixes, circumfixes, ablaut/umlaut, vowel-tier morphemes, tonal morphemes, metatheses and truncations[18]. We focus here on how computational analysis has addressed a key theoretical question: Whether inflectional morphology requires two "routes," one to handle regular morphology (e.g., add "-ed") and one to handle irregulars (e.g., "go" --> "went").

Connectionist studies with idealized languages patterned on English past tense morphology suggest that a single route may handle both cases[19,20,21]. However, Prasada and Pinker[22] argued that the success of these models results from the distributional statistics of English. Many regular English /-ed/ verbs have low token frequencies, which a connectionist model can handle by learning to add /-ed/ as a default. For irregular verbs, token frequency is typically high, allowing the network to override the default. Prasada and Pinker argued that a default regular mapping with both low type and token frequency could not be learned by a connectionist network. The putative default /-s/ inflection of plural nouns in German[23] appears to provide an example of such a "minority default mapping." Marcus et al.[24] proposed that the German plural system must be modelled by two routes: A pattern associator which memorizes specific cases (both irregular and regular), and a default rule (add "-s") which applies when the pattern associator fails.

Nakisa and Hahn[25] asked whether single-route associative models (the nearest neighbour algorithm, the Generalised Context Model[26], and a simple feed-forward connectionist net with one hidden layer) could learn the German plural system, and generalise appropriately to novel regular and irregular nouns. The associative models' task was to predict to which of 15 different plural types the input stem belonged. The inputs to the learning mechanisms were phonetic representations of approximately 4,000 German nouns taken from the CELEX database (token frequency was ignored). The three simple associative models scored, respectively, 71%, 75%, and 84% correct classifications on a test set of 4,000 previously unseen nouns.

Nakisa and Hahn also simulated the Marcus et al. model, by assuming that any test word which is not close to a training word, according to the associative model (i.e., for which the lexical memory fails), will be dealt with by a default "add /-s/" rule. The associative models were trained on the irregular nouns, and the models were tested as before. Nakisa and Hahn found that for all three models, the presence of the rule led to a decrement in performance. In general, the higher the threshold for memory failure (the more similar a test item had to be to a training item to be irregularised via the associative memory), the greater the decrement in performance (see Figure 3).
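The contrast between the two architectures can be sketched with a nearest neighbour memory. This is a toy illustration under stated assumptions, not Nakisa and Hahn's implementation: the four-feature stems, the similarity measure (proportion of matching features), the plural labels and the function names are all hypothetical. The only point carried over from the text is that the dual route version hands an item to the default rule when no stored item is similar enough.

```python
def similarity(a, b):
    """Proportion of matching positions between two equal-length feature tuples."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def nearest(stem, memory):
    """Return the stored (features, plural_type) pair most similar to the stem."""
    return max(memory, key=lambda item: similarity(stem, item[0]))


def single_route(stem, memory):
    """Always copy the plural type of the nearest stored item."""
    return nearest(stem, memory)[1]


def dual_route(stem, memory, threshold, default="-s"):
    """Use the memory only if the best match reaches the threshold,
    otherwise fall back on the default rule."""
    best = nearest(stem, memory)
    if similarity(stem, best[0]) < threshold:
        return default
    return best[1]


# Hypothetical associative memory of stored nouns.
memory = [
    ((1, 0, 1, 1), "-en"),
    ((0, 0, 1, 0), "-e"),
    ((1, 1, 0, 0), "-er"),
]

novel = (0, 1, 0, 1)  # best match is the "-er" item, with similarity 0.5
print(single_route(novel, memory))
print(dual_route(novel, memory, threshold=0.75))
print(dual_route(novel, memory, threshold=0.5))
```

Raising the threshold sends more test items to the default rule, which is the manipulation plotted along the x-axis of Figure 3.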

[Figure 3]

Figure 3. The percentage of German nouns correctly pluralized by a single route neural network model, and by a dual route model, with a default "add -s" rule taking over when the output activation of the network is low. The x-axis corresponds to a measure of output activation, with dual route performance depending on the criterion for using the default rule, but always being lower than the single route model. (© 1996 The Cognitive Science Society. Adapted with permission from Ref. 25).

The use of a default rule could only have improved performance for regular nouns occupying regions of phonemic space surrounding clusters of irregulars (see Figure 4). Nakisa and Hahn's findings demonstrate that, in real German, very few regular nouns occur in these regions. The extension of Nakisa and Hahn's findings to the production of the plural form (instead of merely indicating the plural type), and to more realistic input (e.g., taking account of token frequency) remains to be performed. Further work might also focus on the extent to which different single and dual route models are able to capture changes in detailed error patterns of under- and over-regularisation during development, as well as considering overall levels of performance.

[Figure 4]

Figure 4. In this artificially generated data, diamonds correspond to regular nouns, whereas the other symbols represent irregular nouns, which are clustered together in phonological space. Single and dual route models' classifications differ for the shaded regions surrounding the clusters of irregulars. Regular nouns in these areas are correctly classified by the dual route model, via the default rule, and incorrectly classified as irregulars by the single route model. However, Nakisa and Hahn's results suggest that in real German, as in this artificial data, very few regular nouns are found in these regions of phonological space. (© 1996 The Cognitive Science Society. Adapted with permission from Ref. 25).

This is an excellent illustration of how distributional analysis of the statistical structure of real language is crucial in assessing the feasibility of psychological proposals, such as whether default rules are involved in learning inflectional morphology.

Word classes

A central problem in language acquisition is the acquisition of syntactic categories such as noun and verb. This encompasses both discovering that there are different classes and ascertaining which words belong to each class. Even for theorists who assume that the child innately possesses a universal grammar and syntactic categories, identifying the category of particular words must primarily be a matter of learning. Universal grammatical features can only be mapped on to the specific surface appearance of a particular natural language once the identification of words with syntactic categories has been made. Although once some identifications have been made, it may be possible to use prior grammatical knowledge to facilitate further identifications, the contribution of innate knowledge to initial linguistic categories must be relatively slight.

Both language-external and -internal cues may be relevant to learning syntactic categories. One language-external approach[27], "semantic bootstrapping," exploits the putative correlation between linguistic categories (in particular, noun and verb) and the child's perception of the environment (in terms of objects and actions). This may provide a means of "breaking in" to the system of syntactic categories. There may also be many relevant language-internal factors: Regularities between phonology and syntactic categories[28], prosody (i.e., relations between intonation and syntactic structure)[29] and distributional analysis, both over morphological variations between lexical items (e.g., affixes such as "-ed" are correlated with syntactic category)[30], and at the word level. We focus on this last approach, which has a long history[31,32,33], although such approaches to finding word classes have often been dismissed on a priori grounds within the language learning literature[27].

The "distributional test" in linguistics[34] is based on the observation that if all occurrences of word A can be replaced by word B, without loss of syntactic well-formedness, then they share the same syntactic category. For example, dog can be substituted freely for cat, in phrases such as the cat sat on the mat, nine out of ten cats prefer ..., indicating that these items have the same category. The distributional test is not a foolproof method of grouping words by their syntactic category, because distribution is a function of many factors other than syntactic category (e.g., word meaning). Thus, for example, cat and barnacle might appear in very different contexts in some corpora, although they have the same word class. Nevertheless, it may be possible to exploit the general principle underlying the distributional test to obtain useful information about word classes. The method described here records the contexts in which the words to be classified appear in a corpus of language, and groups together words with similar distributions of contexts. "Context," here, is defined in terms of cooccurrence statistics (see Box 3).

Box 3: Distributional methods, statistics and connectionism

Many distributional methods exploit simple properties such as cooccurrence statistics. Given the corpus "to be or not to be", the cooccurrence statistics for adjacent words in this corpus are that "to be" occurs twice, whilst "be or", "or not", and "not to" all occur once. Such statistics can be easily represented in a contingency table, as in Figure A below.

                  word n + 1
word n      to    be    or    not
to           0     2     0     0
be           0     0     1     0
or           0     0     0     1
not          1     0     0     0


Figure A. A contingency table. Each cell of the table gives the number of times that word n was immediately followed by word n + 1.
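The table in Figure A can be reproduced in a few lines. The sketch below counts adjacent word pairs in the example corpus and lays them out with rows for word n and columns for word n + 1; the function name is hypothetical.

```python
from collections import Counter


def bigram_counts(corpus):
    """Count how often each word is immediately followed by each other word."""
    words = corpus.split()
    return Counter(zip(words, words[1:]))


counts = bigram_counts("to be or not to be")
vocab = ["to", "be", "or", "not"]

# Rows index word n, columns index word n + 1, as in Figure A.
table = [[counts[(w1, w2)] for w2 in vocab] for w1 in vocab]
for word, row in zip(vocab, table):
    print(f"{word:4}", row)
```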

Cooccurrence statistics can also be easily captured by a connectionist network. In the network shown in Figure B, units in the first layer are activated to represent the "current word", and units in the second layer are activated to represent the "next word". The connections between two units are strengthened whenever both units are active (i.e., a form of Hebbian learning). The weights of the network will reflect the cooccurrence statistics of the corpus in exactly the same way that the contingency table does.

[Figure B]

Figure B. A network with a Hebbian learning rule. The weights of the trained network reflect the same statistics as the contingency table shown in A. For clarity, only nonzero weights are shown.
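A minimal sketch of the Hebbian network follows, assuming one-hot ("localist") coding of words and a learning rate of 1; both are illustrative choices, and the function names are hypothetical. Each weight grows by the product of the activations of the two units it connects, so after one pass through the corpus the weight matrix holds the adjacent-word counts.

```python
def one_hot(word, vocab):
    """Activate exactly one unit, the one standing for the given word."""
    return [1 if w == word else 0 for w in vocab]


def hebbian_weights(corpus, vocab, rate=1):
    """Train current-word -> next-word weights with a simple Hebbian rule."""
    words = corpus.split()
    n = len(vocab)
    weights = [[0] * n for _ in range(n)]
    for current, following in zip(words, words[1:]):
        pre = one_hot(current, vocab)     # "current word" layer
        post = one_hot(following, vocab)  # "next word" layer
        for i in range(n):
            for j in range(n):
                # Strengthen the connection only when both units are active.
                weights[i][j] += rate * pre[i] * post[j]
    return weights


vocab = ["to", "be", "or", "not"]
weights = hebbian_weights("to be or not to be", vocab)
for word, row in zip(vocab, weights):
    print(f"{word:4}", row)
```

With a learning rate other than 1 the weights would be a scaled copy of the counts, so the relative statistics are preserved.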

Clearly there are many other possible distributional properties. A more complex property is the presence/absence of different combinations of phonetic features in the spoken form of a word. Rumelhart & McClelland[a] showed how a single layer connectionist network can map from present to past tense for both regular and irregular English verbs, using this kind of distributional information. The problem of optimally training a single-layer neural network is directly analogous to a conventional statistical technique: Multiple linear regression. So, Rumelhart & McClelland's model can be interpreted as picking up simple distributional statistics. Moreover, at a more general level, many connectionist learning algorithms can be viewed as implementing general statistical principles, such as maximizing the probability of the weights chosen according to Bayesian principles[b], or minimizing description length (Zemel, R. S. [1993]. A minimum description length framework for unsupervised learning. Unpublished doctoral diss., Dept. of Comp. Sci., U. of Toronto).

It is remarkable that such simple statistics, which ignore so much important language structure, are nonetheless so informative about word boundaries, word classes and lexical semantics.

References

a Rumelhart, D. & McClelland, J. (1986). On learning the past tenses of English verbs. Implicit rules or parallel distributed processing. In McClelland, J., Rumelhart, D. (Eds.), Parallel distributed processing, Vol. 2, (pp. 216-271). Cambridge, MA: MIT Press.

b MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neur. Comp., 4, 448-472.


Finch, Chater, and Redington[35,36,37] used the two words before and after each target word as context. Vectors (rows of a contingency table [see Box 3]) representing the cooccurrence statistics for these positions were constructed from a 2.5 million word corpus of transcribed adult speech taken from the CHILDES corpus (much of which was child-directed). The vectors for each position were concatenated to form a single vector for each of 1,000 target words. The similarity of distribution between the vectors was calculated using Spearman's rank correlation, and hierarchical cluster analysis was used to group similar words together.
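The first two steps of this procedure, building concatenated context vectors and comparing them with Spearman's rank correlation, can be sketched as follows. This is an illustration on a toy corpus, not the authors' code: the corpus, the choice of context words and the function names are hypothetical, ties are given average ranks (one common convention), and the hierarchical cluster analysis step is omitted.

```python
from collections import Counter

OFFSETS = (-2, -1, 1, 2)  # two words before and two words after the target


def context_vectors(words, targets, context_words):
    """One vector per target: counts of each context word at each offset,
    with the four position-specific vectors concatenated."""
    counts = {t: Counter() for t in targets}
    for i, w in enumerate(words):
        if w not in counts:
            continue
        for off in OFFSETS:
            j = i + off
            if 0 <= j < len(words) and words[j] in context_words:
                counts[w][(off, words[j])] += 1
    return {t: [counts[t][(off, c)] for off in OFFSETS for c in context_words]
            for t in targets}


def ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(u, v):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    ru, rv = ranks(u), ranks(v)
    n = len(ru)
    mu, mv = sum(ru) / n, sum(rv) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    var_u = sum((a - mu) ** 2 for a in ru)
    var_v = sum((b - mv) ** 2 for b in rv)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / (var_u * var_v) ** 0.5


words = "the cat sat on the mat the dog sat on the mat".split()
context_words = sorted(set(words))
vectors = context_vectors(words, ["cat", "dog", "sat"], context_words)
print(spearman(vectors["cat"], vectors["dog"]))
print(spearman(vectors["cat"], vectors["sat"]))
```

On this toy corpus the two nouns share most of their contexts and correlate positively, whereas the noun and the verb do not, which is the kind of contrast the cluster analysis then groups on.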

This approach does not partition words into distinct groups corresponding to the syntactic categories, but produces a hierarchical tree, or dendrogram, whose structure reflects to some extent the syntactic relationships between words. Figure 5 shows the high level structure of the dendrogram resulting from the above analysis. Figure 6 shows examples of the structure of the dendrogram, and its relation to syntactic category at a very fine level.

[Figure 5]

Figure 5. A dendrogram resulting from a word level analysis of the distributional statistics of the CHILDES corpus. The dendrogram has been truncated at a chosen level of similarity, and the resulting discrete clusters labelled by hand with the syntactic categories to which they correspond. The number of items in each cluster is shown in parentheses. Only clusters with 10 or more members are shown here.

[Figure 6]

Figure 6. Low-level clusters of nouns, verbs, and adverbs, taken from the dendrogram presented in summary form in Figure 5.

A quantitative analysis (see Figure 7) compared the structure of the dendrogram with a canonical syntactic classification of the target words, defined as their most common syntactic usage in English. The measure used was the mutual information between the derived and canonical classifications, expressed as a percentage of the joint information in both. This revealed that at all levels of similarity, the dendrogram conveyed useful information about the syntactic structure of English: Words which were clustered together tended to belong to the same syntactic category, and words that were clustered apart tended to belong to different syntactic categories.
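One way to read this measure is as the mutual information I(D;C) between the derived clusters D and the canonical categories C, divided by their joint entropy H(D,C). The sketch below computes that ratio from two parallel lists of labels. It is an interpretation under stated assumptions, since the exact estimator used in the original analysis is not given here; the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def informativeness(derived, canonical):
    """Mutual information as a fraction of the joint information."""
    h_joint = entropy(list(zip(derived, canonical)))
    if h_joint == 0:
        return 0.0
    mutual = entropy(derived) + entropy(canonical) - h_joint
    return mutual / h_joint


canonical = ["noun", "noun", "verb", "verb"]
matching = informativeness([0, 0, 1, 1], canonical)   # clusters mirror categories
unrelated = informativeness([0, 1, 0, 1], canonical)  # clusters cut across them
print(matching, unrelated)
```

The ratio is 1 when the two classifications coincide and 0 when they are statistically unrelated, so it can be compared across levels of the dendrogram that contain different numbers of clusters.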

[Figure 7]

Figure 7. Distributional information at the word level is highly informative with regard to syntactic category. The informativeness of the hierarchical classification of the target words with respect to the most common syntactic categories of those words. Informativeness is an information theoretic measure of the degree to which words belonging to the same syntactic category are grouped together, and words belonging to different syntactic categories are separated in the dendrogram. The lower line is a baseline value for informativeness, where words were clustered together randomly. The plot shows that the distributional analysis provides information about syntactic categories at all levels of the dendrogram. The most informative level (0.8) is the level at which the summary dendrogram shown in Figure 5 was cut.

Thus computational analysis of real language corpora shows that distributional information at the word level is highly informative of syntactic category, despite a priori objections to its utility.

Lexical semantics

Acquiring lexical semantics involves identifying the meanings of particular words. Even for concrete nouns, this problem is complicated by the difficulty of detecting which part of the physical environment a speaker is referring to. Even if this can be ascertained, it may still remain unclear whether the term used by the speaker refers to a particular object, a part of that object, or a class of objects. For abstract nouns, and other words which have no concrete referents, these difficulties are compounded further.

The primary sources of information for the development of lexical semantics are presumably language-external. Relationships between the physical, and especially the social, environment of the child are likely to play a major role in the development of lexical semantic knowledge.

However, it also seems plausible that language-internal information might be used to constrain the identification of the possible meanings of words. For instance, just as semantics might constrain the identity of a word's syntactic category (words referring to concrete objects are likely to be nouns), so knowing a word's syntactic category provides some constraint on its meaning; knowing that a word is a noun, perhaps because it occurs in a particular set of local contexts, generally implies that it will refer to a concrete object or an abstract concept, rather than an action or process[38].

Because there are potentially informative relationships between aspects of language at all levels, this means that even relatively low level properties of language, such as morphology and phonology, might provide some constraints on lexical semantics.

Gleitman[39] has proposed that syntax is a potentially powerful cue for the acquisition of meaning. Gleitman assumes that the child possesses a relatively high degree of syntactic knowledge. However, an examination of Figure 6 above shows that the distributional method used above to provide information about syntactic categories also captures some degree of semantic relatedness, without any knowledge of syntax proper. More effective methods for deriving semantic relationships have been discussed by Burgess and Lund[40,41], Schütze[42], and Landauer and Dumais[43].

We focus here on Burgess and Lund's work. Semantic representations are constructed by collecting "collocation" statistics, capturing the cooccurrence of target and context words within a ten word window of the input corpus (typically a large [160 million word] corpus of USENET news), weighted according to the separation of the two words within this window. The output of this process is a matrix representing the extent to which a set of context words occurred within the same window as the target word. The row and column of the matrix corresponding to each word are concatenated to form a "semantic vector." The claim is that the similarity between semantic vectors for different words captures aspects of the semantic relationships between these words.
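The construction just described can be sketched on a toy corpus. Two details below are assumptions made for illustration, since the text says only that cooccurrences are "weighted according to the separation": the weight falls off linearly (adjacent words score the full window size, words at the edge of the window score 1), and each matrix row records the words that precede the target, so that the matching column records the words that follow it. The function names are hypothetical.

```python
from collections import defaultdict


def cooccurrence_matrix(words, window=10):
    """matrix[target][context] sums the weights of context words that
    precede the target within the window."""
    matrix = defaultdict(lambda: defaultdict(float))
    for i, target in enumerate(words):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break
            # Assumed weighting: closer words contribute more.
            matrix[target][words[j]] += window - distance + 1
    return matrix


def semantic_vector(word, matrix, vocab):
    """Concatenate the word's row and column of the matrix."""
    row = [matrix[word][c] for c in vocab]     # words preceding `word`
    column = [matrix[c][word] for c in vocab]  # words following `word`
    return row + column


words = "the cat chased the dog".split()
vocab = ["the", "cat", "chased", "dog"]
matrix = cooccurrence_matrix(words, window=2)
cat_vector = semantic_vector("cat", matrix, vocab)
print(cat_vector)
```

Similarity between two words can then be measured by any distance between their vectors; the particular distance measure is left open here.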

[Figure 8]

Figure 8. Spatial relationships between vectors representing words from different categories. The distance relationships between semantic vectors from Lund and Burgess' (1994) analysis for words belonging to three categories (animals, locations, and body parts) are shown here in two dimensions (via multidimensional scaling). These distance relationships clearly capture something of the semantic relationships between these words. (© 1997, Lawrence Erlbaum Associates. Reprinted with permission from Ref. 41).

Figure 8 shows the spatial relationships between vectors representing words from the categories of animal names, body parts, and geographical locations. Multidimensional scaling was used to re-represent the distance relationships within the high-dimensional space of the semantic vectors in two dimensions. Clearly the semantic vectors do capture aspects of the semantic distinctions between these categories: Distributional statistics do carry information about semantic relationships. The distance between vectors has also been shown to correlate reliably with psychological phenomena such as semantic priming effects in lexical decision tasks[44].

Burgess and Lund[41] have also shown that a model of spreading activation through the space of semantic vectors is able to account for cerebral asymmetries in the time course of semantic priming of multiple meanings. Ambiguous words (e.g., bank) presented to the right visual field prime both meanings initially (35 ms), but only the dominant meaning after a 70 ms delay. Ambiguous words presented to the left visual field prime only the dominant meaning initially, but both meanings after a 70 ms delay. Using semantic representations derived from HAL, Burgess and Lund were able to model this difference in terms of differing initial activation, and differing rates of spread and decay between the hemispheres, without appealing to representational differences or modulation of information by the corpus callosum.

We have seen that distributional methods are informative about semantic relatedness, but clearly, language-internal information alone cannot be the basis for the acquisition of word meaning, because learning word meaning requires relating words to the world, to which distributional methods have no access. Nonetheless, language-internal distributional information about semantic relatedness may be important in helping the child constrain hypotheses about word meaning.

Conclusion

Computational studies using natural language corpora show that distributional information is a potentially valuable cue for many aspects of language acquisition (see Box 4). Does the child use these sources of information? As with all theories of language acquisition, empirical evidence regarding distributional methods is difficult to obtain and interpret[45]. It is encouraging that recent experimental evidence in both children and adults shows that the cognitive system is sensitive to features of the input (e.g., cooccurrence statistics) which underlie the mechanisms described here[46,47,48]. It seems a reasonable working assumption that, given the immense difficulty of the language acquisition problem, the cognitive system is likely to exploit such simple and useful sources of information.

Box 4: How good are distributional methods?

In all the examples in this paper, the distributional methods are shown to provide useful information, but do not approach human levels of linguistic knowledge or performance. Indeed, human level performance would not be expected if distributional methods are, as we suggest, only one among many sources of information involved in language acquisition. If human level performance is too ambitious a standard, how can we assess how good distributional methods are? One approach is to compare them against "random" benchmarks. Thus, Christiansen et al.[16] show that their segmentation model greatly exceeds the performance obtained by randomly assigning word boundaries to respect mean word length, and Figure 7 shows a comparison of the Redington and Chater syntactic classification against a random classification. Although this shows that the method is finding some useful information, a better comparison is between different algorithms and/or sources of information. Of course, this requires that competing proposals are computationally explicit, and applied to appropriate corpora. Currently, most non-distributional proposals in language acquisition are described in purely conceptual terms, which makes comparison difficult. Only when a variety of sources of information and/or algorithms can be directly compared will it be possible to assess accurately the potential contribution of distributional methods. More importantly, it may then be possible to study how different sources of information can be combined to obtain something close to human level performance.
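One way to build a random segmentation benchmark of the kind mentioned above is sketched below. It is an assumption-laden illustration rather than the procedure used in the cited work: boundaries are placed independently after each character with probability one over the mean word length, characters stand in for phonemes, and all function names are hypothetical.

```python
import random

def true_boundaries(words):
    """Character offsets of the word boundaries inside an utterance."""
    positions, offset = set(), 0
    for word in words[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions

def random_boundaries(n_chars, mean_word_length, rng):
    """Place a boundary after each character with probability
    1 / mean_word_length, so that the expected spacing between
    boundaries matches the mean word length."""
    p = 1.0 / mean_word_length
    return {i for i in range(1, n_chars) if rng.random() < p}

def precision_recall(proposed, actual):
    """Proportion of proposed boundaries that are correct, and
    proportion of actual boundaries that were found."""
    hits = len(proposed & actual)
    precision = hits / len(proposed) if proposed else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall

words = ["the", "dog", "barks"]
n_chars = sum(len(w) for w in words)
rng = random.Random(0)  # fixed seed so that runs are repeatable
proposed = random_boundaries(n_chars, n_chars / len(words), rng)
print(precision_recall(proposed, true_boundaries(words)))
```

A segmentation model is informative to the extent that its precision and recall exceed those of this baseline, averaged over many utterances and random seeds.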

Outstanding questions

Acknowledgments

M. Redington was supported by the U.K. Economic and Social Research Council (ESRC) Research Studentship R00429234268, and N. Chater was partially supported by Research Grant SPG9029590 from the Joint Councils Initiative in Cognitive Science/HCI. Both authors are currently supported by ESRC grant R000236214.

Correspondence concerning this article should be addressed to Martin Redington, now at the Department of Psychology, University College London, 26, Bedford Way, London, WC1E 6BT, or to Nick Chater, at the Department of Psychology, University of Warwick, Coventry, CV4 7AL, U.K. Electronic mail may be sent via internet, to m.redington@ucl.ac.uk or to nick.chater@warwick.ac.uk.

References

  1. Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.

  2. Harris, Z. (1951). Methods in structural linguistics. Chicago, IL: University of Chicago Press.

  3. Kirsh, D. (1991, December). PDP learnability and innate knowledge of language. Center for Research in Language Newsletter, 6, 3-17.

  4. Plunkett, K. (1996, February). Development in a connectionist framework: Rethinking the nature-nurture debate. Center for Research in Language Newsletter, 10, 3-14.

  5. Cole, R. A. (1980). Perception and production of fluent speech. Hillsdale, NJ: LEA.

  6. Marslen-Wilson, W. & Welsh, A. (1978). Processing interactions and lexical access during word recognition in continuous speech. Cog. Psychol., 10, 29-63.

  7. Suomi, K. (1993). An outline of a developmental model of adult phonological organization and behavior. J. of Phonetics, 21, 29-60.

  8. Lehiste, I. (1971). The timing of utterances and linguistic boundaries. J. Acoustical Soc. Am., 51, 2018-2024.

  9. Gleitman, L. et al. (1988). Where learning begins: Initial representations for language learning. In F.J. Newmeyer (Ed.), Linguistics: The Cambridge survey, Vol. 3, (pp. 150-193). Cambridge: CUP.

  10. Cutler, A. & Mehler, J. (1993). The periodicity bias. J. of Phonetics, 21, 103-108.

  11. Jusczyk, P. W. (1993). Discovering sound patterns in the native language. In Proc. 15th Ann. Conf. Cog. Sci. Soc., (pp. 49-60). Hillsdale, NJ: LEA.

  12. Cutler, A. & Carter, D. M. (1987). The predominance of strong initial syllables in the English vocabulary. Comp. Speech and Lang., 2, 133-142.

  13. Wolff, J. G. (1988). Learning syntax and meanings through optimization and distributional analysis. In Y. Levy, I. M. Schlesinger & M. D. S. Braine (Eds.), Categories and processes in language acquisition, (pp. 179-215). Hillsdale, NJ: LEA.

  14. Brent, M. R. & Cartwright, T. A. (in press). Distributional regularity and phonotactic constraints are useful for segmentation. Cognition.

  15. Cairns, P. et al. (1994). Modelling the acquisition of lexical segmentation. In Proc. of the Child Lang. Res. Forum, 1994; CSLI, Stanford, CA: U. of Chicago Press.

  16. Christiansen, M. H., Allen, J. & Seidenberg, M. S. (in press). Learning to segment speech using multiple cues: A connectionist model. Lang. and Cog. Processes.

  17. Korman, M. (1984). Adaptive aspects of maternal vocalizations in differing contexts at ten weeks. First Lang., 5, 44-45.

  18. Anderson, S. R. (1992). A-morphous morphology. New York: CUP.

  19. Rumelhart, D. & McClelland, J. (1986). On learning the past tenses of English verbs. Implicit rules or parallel distributed processing. In McClelland, J., Rumelhart, D. (Eds.), Parallel distributed processing, Vol. 2, (pp. 216-271). Cambridge, MA: MIT Press.

  20. Plunkett, K. & Marchman, V. (1991). U-shaped learning and frequency effects in a multi-layered perceptron: Implications for child language acquisition. Cognition, 38, 43-102.

  21. Plunkett, K. & Marchman, V. (1993). From rote learning to system building: acquiring verb morphology in children and connectionist nets. Cognition, 48, 1-49.

  22. Prasada, S. & Pinker, S. (1993). Similarity-based and rule-based generalizations in inflectional morphology. Lang. and Cog. Processes, 8, 1-56.

  23. Clahsen, H. et al. (1993). Regular and irregular inflection in the acquisition of German plural nouns. Cognition, 45, 225-255.

  24. Marcus, G. et al. (1995). German inflection: The exception that proves the rule. Cog. Psych., 29, 189-256.

  25. Nakisa, R. C. & Hahn, U. (1996). Where defaults don't help: The case of the German plural system. In G. W. Cottrell (Ed.), Proc. 18th Ann. Conf. Cog. Sci. Soc., (pp. 177-182). Mahwah, NJ: LEA.

  26. Nosofsky, R. M. (1990). Relations between exemplar similarity and likelihood models of classification. J. Math. Psychol., 34, 393-418.

  27. Pinker, S. (1984). Language learnability and language development. Cambridge, Mass: Harvard U. Press.

  28. Kelly, M. H. (1992). Using sound to solve syntactic problems: The role of phonology in grammatical category assignments. Psychol. Rev., 99, 349-364.

  29. Morgan, J. & Newport, E. (1981). The role of constituent structure in the induction of an artificial language. J. Verbal Learning and Verbal Behav., 20, 67-85.

  30. Maratsos, M. & Chalkley, M. (1980). The internal language of children's syntax: The ontogenesis and representation of syntactic categories. In K. Nelson (Ed.), Children's language, Vol. 2. New York: Gardner Press.

  31. Brill, E., Magerman, D., Marcus, M. & Santorini, B. (1990). Deducing linguistic structure from the statistics of large corpora. DARPA Speech and Natural Lang. Workshop. Hidden Valley, Penn: Morgan Kaufmann.

  32. Kiss, G. R. (1973). Grammatical word classes: A learning process and its simulation. Psychol. of Learning and Motiv., 7, 1-41.

  33. Rosenfeld, A., Huang, H. K. & Schneider, V. B. (1969). An application of cluster detection to text and picture processing. IEEE Trans. on Info. Theory, 15, 672-681.

  34. Radford, A. (1988). Transformational grammar, 2nd Edition. Cambridge: CUP.

  35. Finch, S. P. & Chater, N. (1991). A hybrid approach to the automatic learning of linguistic categories. Artif. Intell. and Simul. Behav. Qtrly., 78, 16-24.

  36. Finch, S. P., Chater, N. & Redington, M. (1995). Acquiring syntactic information from distributional statistics. In Levy, J. et al. (Eds.), Connectionist models of memory and language, (pp. 229-242). London: UCL Press.

  37. Redington, M. & Chater, N. (in press). Connectionist and statistical approaches to language acquisition: A distributional perspective. Lang. and Cog. Processes.

  38. Brown, R. (1957). Linguistic determinism and the part of speech. J. Abn. and Soc. Psychol., 55, 1-5.

  39. Gleitman, L. R. (1990). The structural sources of word meaning. Lang. Acq., 1, 3-55.

  40. Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behav. Res. Methods, Instrumentation, and Computers, 28, 203-208.

  41. Burgess, C., & Lund, K. (1997). Modeling cerebral asymmetries of semantic memory using high-dimensional semantic space. In Beeman, M., & Chiarello, C. (Eds.), Right hemisphere language comprehension: Perspectives from cognitive neuroscience. Hillsdale, NJ: LEA.

  42. Schütze, H. (1993). Word Space. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), NIPS 5. San Mateo, CA: Morgan Kaufmann.

  43. Landauer, T. K. & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol. Rev., 104, 211-240.

  44. Burgess, C. & Lund, K. (in press). Modeling parsing constraints with high-dimensional context space. Lang. and Cog. Processes.

  45. Ninio, A. & Snow, C. E. (1988). Language acquisition through language use: The functional sources of children's early utterances. In Y. Levy, I. M. Schlesinger, & M. D. S. Braine (Eds.), Categories and processes in language acquisition. Hillsdale, NJ: LEA.

  46. Jusczyk, P. W. (1997). The discovery of spoken language. Cambridge, MA: MIT Press.

  47. Saffran, J. R., Aslin, R. N. & Newport, E. L. (1996). Statistical cues in language acquisition: Word segmentation by infants. In G. W. Cottrell (Ed.), Proc. 18th Ann. Conf. of the Cognitive Sci. Soc., (pp. 376-380). Mahwah, NJ: LEA.

  48. Saffran, J. R., Newport, E. L. & Aslin, R. N. (1996). Word segmentation: The role of distributional cues. J. of Memory and Lang., 35, 606-621.