Extending Computational Grammars by Learning

Proposal to TMR Network

1. Research Topic

Precis: The practice of implementing grammars on the computer has improved the quality, and especially the reliability, of syntactic description, and it has opened the door to a number of applications in natural language processing (NLP). The best of these systems---whether theoretical or applied---are invariably the result of person-decades of specialized work and become difficult to extend. The focus of the network Extending Computational Grammars by Learning (ECGL) is the investigation of ways to improve these state-of-the-art systems by machine learning applied to current best practice.

Background: Computational grammar models have been developed both for linguistic generality and elaboration and for computational implementation. A grammar is a precise linguistic description, and clearly the domain of linguists. A computational grammar involves at least the implementation of such a grammar in a machine-interpretable formalism. To a large extent this should not replace purely linguistic work, and it must be linguistically well-informed. The needs of implementation immediately improve the precision, detail and reliability of grammatical description. Novel theoretical proposals have also grown out of mixed linguistic and computational roots. Thus [Kaplan 88a] develops a novel theory of unbounded movement motivated by the wish to constrain the LFG formalism; [Abeille 91] derives original analyses of idioms in a way transparently motivated in the TAG framework, which was developed almost exclusively by computer scientists; and [Pollard 85] originally motivated the abandonment of metarules in GPSG for computational reasons. Computational grammars are now being applied to increasingly complex fragments of natural language, normally with the goal of supporting the analysis (parsing) and production (generation) of language.

There is nonetheless a complexity bottleneck in grammar development. Grammars that are developed for wide coverage tend to use dozens of multivalued features (in relatively free combination). Their lexicons need tens of thousands of items, each of which may appear in dozens of inflectional variants, and the number of grammatical constructions tends to run in the hundreds (but this number may be hidden in further lexical complexity). In point of fact, it is very difficult to train new people to work on further development of these grammars, simply because of their complexity. While there has been a resurgence of interest in nonlinguistic statistical methods for language processing in the last five years, it is safe to say that none of these promises to replace linguistically informed grammars. This suggests that alternatives need to be sought.

Computational grammar also allows applications which were unthinkable in the recent past. These include information retrieval (IR), computer-assisted language learning (CALL), grammar checking, language aids for the handicapped, translation tools, documentation standards tools, speech control and speech query---all of which may be found in primitive forms on the market today.

Beyond the importance of commercial exploitation, language technology holds the cultural promise of easing communication among the peoples of Europe. This has already begun to take place as language technology finds its way into CALL, multilingual IR, and software supporting reading and translation (intelligent dictionaries).

2. Objectives

The project will apply several of the currently interesting techniques for machine learning of natural language to a common problem, that of learning noun-phrase syntax.

Common Base: In order to function well, the network must build on state-of-the-art work in several areas, including writing and implementing extended grammars (for different languages), the linguistic theory informing such computational grammars, and language processing algorithms and implementations. The partners within the proposed network are separately involved in these efforts so that they are well-positioned to focus on the critical problem of extending grammars via methods from machine learning.

Common Foci: The network will be focused by its tasks: the common application and evaluation of machine-learning techniques as used to learn natural language syntax from a given knowledge base. To further focus the network, a subarea of syntax will be chosen, preferably NP syntax. Test material is increasingly available through the work of the Linguistic Data Consortium, the Text Encoding Initiative, the Parallel Grammar Project, and the LRE project ``Test Suites for NLP'', and it can be used within the normal work of the consortium. Network partner ISSCO has supported (and in some cases led) projects in corpus collection and standardization, and measures for diagnosis, evaluation and assessment.

Evaluation: The degree to which objectives are reached may be measured. The appropriate measures are given by the task: recognizing and analyzing the noun phrases (NPs) in free text. The success of the methods may be measured in standard ways: for the recognition task, recall (% of NPs in the text that are found) and precision (% of items found that are NPs); and for the analysis task, the degree to which the correct constituent structure is assigned.
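The two recognition measures can be illustrated with a small sketch. It assumes, for illustration only, that an NP counts as found when its token span matches a gold-standard span exactly; the proposal does not fix a matching criterion, and the function name and the toy spans are hypothetical.

```python
def np_recognition_scores(gold_spans, predicted_spans):
    """Return (recall, precision) for NP recognition.

    Spans are (start, end) token offsets. Exact span match is an
    assumption made for this sketch, not a criterion from the proposal.
    """
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    correct = len(gold & predicted)
    # Recall: share of the NPs in the text that were found.
    recall = correct / len(gold) if gold else 0.0
    # Precision: share of the items found that really are NPs.
    precision = correct / len(predicted) if predicted else 0.0
    return recall, precision


# Toy data: four gold NPs, three proposed NPs, two of them correct.
gold = [(0, 2), (3, 5), (6, 9), (10, 11)]
predicted = [(0, 2), (3, 4), (6, 9)]
recall, precision = np_recognition_scores(gold, predicted)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

A looser criterion (for example, partial overlap) would change both figures, so any comparison across subprojects would need to agree on the criterion first.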

Expected Results: The network expects to contribute to the knowledge of whether and which machine learning techniques are well suited to the task of learning language.

An especially interesting task is that of combining techniques from statistical and symbolic machine learning. This is especially focused in the subprojects FE, GEN and TBEDL (see § 4 ``Research Method'' for an explanation of subproject names). But MBL, SDL and DOSP also proceed (partially) from annotated data, providing a potential link to symbolic categories. NN cannot be expected to proceed from knowledge-based grammars, but is regarded as an interesting area to follow because of its affinity to human neuroprocessing, its sensitivity to higher-order correlation, and its current dynamism in the learning field.

3. Scientific Originality

Previous Approaches: The consortium can build on previous work on automating lexical acquisition. There are a number of proposals for how this can be done, including those associated with AQUILEX ([Briscoe 93]) and DATR ([Evans 90]). These efforts have focused on the existing patterns of structure within the lexicon (e.g., the relation between predicative adjectives used with the verb ``be'', etc. and attributive adjectives used to modify nouns), on the processes available to create new words (e.g., using ``re'' with a verb such as ``fax'' to create a new word, perhaps for the first time), and on the theoretical apparatus needed for a correct description. The AQUILEX effort was particularly important because it combined a well-founded theoretical view of lexical processes with practical experiments in AQUIring LEXical information automatically (or semi-automatically) from machine-readable dictionaries (MRDs). Such efforts may be more successful if combined with the learning techniques mentioned above.

Promise: Learning techniques constitute a promising area: not only because purely knowledge-based approaches have been difficult to extend, but also because the knowledge they derive is less sensitive to expert knowledge, and expert error, something rule-based approaches often founder on. The current project is well-poised not only to experiment in machine learning techniques, but also to evaluate more exactly the quality of the results (in the context of grammar extension, this means both the absolute quality and the degree to which it is faithful to the original). GEN is intended to investigate applications of genetic algorithms, which is novel but promising as long as a locally sensible fitness function is possible.

Benefits of Collaboration: The main benefit of collaboration is control in evaluating the very varied approaches currently being taken to the machine learning of natural language. None of the laboratories has the resources (or expertise) to experiment simultaneously in all of these areas. The benefits may also be seen concretely in the opportunity for sharing resources such as data and information about development systems (present at all sites), and in the sharpening of questions about the role of specification vis-à-vis learning/training in the syntax problem.

Specific Innovations are discussed in the following section, in which the various techniques to be applied are presented.

4. Research Method

We describe the projects in turn.

Transformation-Based Error-Driven Learning (TBEDL)
String Transformations were introduced in ([Brill 95]) and have been applied to part-of-speech (POS) tagging and other problems. Brill's techniques combine the empirical breadth of statistics-based approaches with the more transparent representation of knowledge-based methods. Moreover, they have been shown to work as successfully as other methods (in the POS problem). The focus of the current project will be to extend TBEDL to the case of tree learning. A leading hypothesis is that Brill's techniques, which focused on learning finite automata, may first be extended to the case of learning tree automata ([Rounds 70],[Gecseg 84]).
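As a rough illustration of the string case that this subproject starts from (not of the proposed extension to trees), the following sketch learns tag-rewriting rules greedily from the errors of an initial tagging. It is a heavy simplification: it assumes a single rule template ("change tag A to B when the previous tag is C"), a toy corpus, and hypothetical names throughout.

```python
def count_errors(tags, gold):
    """Number of positions where the current tagging disagrees with gold."""
    return sum(t != g for t, g in zip(tags, gold))


def apply_rule(tags, rule):
    """Apply (from_tag, to_tag, prev_tag), reading contexts from the input tags."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def learn_rules(initial, gold, max_rules=10):
    """Greedily pick, at each step, the rule with the largest net error reduction."""
    current, rules = list(initial), []
    for _ in range(max_rules):
        # Error-driven: only propose rules that would repair an existing error.
        candidates = {
            (current[i], gold[i], current[i - 1])
            for i in range(1, len(gold))
            if current[i] != gold[i]
        }
        best_rule, best_gain = None, 0
        for rule in sorted(candidates):
            gain = count_errors(current, gold) - count_errors(apply_rule(current, rule), gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:
            break  # no candidate rule improves the tagging any further
        rules.append(best_rule)
        current = apply_rule(current, best_rule)
    return rules, current


# Toy corpus: "can" is initially always tagged as a modal (MD).
words = ["the", "can", "is", "old", "we", "can", "go", "the", "can", "leaks"]
gold = ["DT", "NN", "VBZ", "JJ", "PRP", "MD", "VB", "DT", "NN", "VBZ"]
lexicon = {"the": "DT", "can": "MD", "is": "VBZ", "old": "JJ",
           "we": "PRP", "go": "VB", "leaks": "VBZ"}
initial = [lexicon[w] for w in words]
rules, tagged = learn_rules(initial, gold)
print(rules)
```

The learned rules form an ordered, human-readable list, which is the transparency the paragraph above refers to.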

Feature-Estimation (FE)
Stochastic regular and context-free grammars have been thoroughly researched in computational linguistics, but extension to feature-based grammars (or attribute-value grammars) is required for this work to be applied to current grammars. Until recently, a barrier to this extension was the reentrancies in feature grammars (essential variables), which introduce inherent context sensitivity. Steve Abney has shown (in recent work at Tübingen) that such grammars can be correctly parameterized using random fields. And Stefan Riezler (also at Tübingen) has provided a logical foundation for such parameterized grammars within fuzzy logic-based model theory. The goal of this project is to extend this theoretical research: (i) to provide practical algorithms for parameter estimation, with or without an already parsed corpus, (ii) to apply these techniques to available typed feature structure-based systems such as the Troll system described by Dale Gerdemann and Paul King (COLING '94), and (iii) to develop practical (heuristic) processing techniques.
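To make the idea of a random-field parameterization concrete, here is a minimal sketch. It assumes the usual log-linear form, in which an analysis x receives probability proportional to exp(sum_i w_i * f_i(x)), and it assumes the candidate analyses can be enumerated in a short list; in a real grammar the normalization would range over a much larger (possibly infinite) set. The feature names and weights are hypothetical.

```python
import math
from collections import defaultdict


def analysis_probabilities(analyses, weights):
    """Log-linear distribution over a finite list of analyses.

    Each analysis is a dict mapping feature names to counts.
    """
    scores = [
        math.exp(sum(weights.get(name, 0.0) * count for name, count in feats.items()))
        for feats in analyses
    ]
    z = sum(scores)  # normalizing constant over the enumerated analyses
    return [s / z for s in scores]


def expected_feature_counts(analyses, weights):
    """Expected count of every feature under the current weights."""
    probs = analysis_probabilities(analyses, weights)
    expected = defaultdict(float)
    for p, feats in zip(probs, analyses):
        for name, count in feats.items():
            expected[name] += p * count
    return dict(expected)


# Two hypothetical analyses; the first contains a reentrancy-like feature.
analyses = [
    {"np_rule": 1, "shared_agreement": 1},
    {"np_rule": 2},
]
weights = {"np_rule": 0.0, "shared_agreement": math.log(3.0)}
probs = analysis_probabilities(analyses, weights)
expected = expected_feature_counts(analyses, weights)
print(probs, expected)
```

For log-linear models of this kind, maximum-likelihood estimation compares these expected counts with the counts observed in training data, which is one reason practical estimation algorithms depend on being able to compute or approximate the expectations.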

Genetic Algorithms (GEN)
Genetic Algorithms have been used to learn context-free grammars, but with inconclusive results ([Lankhorst 94],[Lankhorst 96]). The reason for this appears to be the `global' nature of such grammars: the validity of an overall parse can depend on some particular rule which in itself may be minor, but which interacts crucially with others. This makes it difficult to devise a sensible `fitness function' for the evaluation of candidate grammars thrown up by crossover or mutation. Other more `local' grammatical formalisms, like those used in the `finite state constraint' approach (used in Helsinki and Grenoble), do not suffer from this problem. In preliminary experiments, a set of negative constraints which gave around 70% precision in POS tagging was learned from a small training corpus. The reason that this seems a more tractable problem is that although there are some interactions between constraints, to a large extent they can operate independently, and thus a simple fitness function will rate a good constraint highly. We intend to develop this line of work, learning finite state constraints for POS tagging, phrasal parsing, and grammatical function annotation.

Memory-Based Learning (MBL)
The Antwerp group will explore memory-based (exemplar-based, instance-based) learning approaches to learning syntactic processing. Previous research in this group on memory-based phonology and morphology learning has shown that the approach, while attaining the same or better accuracy than knowledge-based and statistical approaches, has a number of interesting additional properties: (i) it is incremental: the system keeps learning continuously from its own processing experience; (ii) it allows effective integration of different information sources through feature weighting methods; and (iii) computationally efficient implementations have been developed. We propose to extend the approach for syntactic processing. Syntactic analysis will be interpreted as a cascade of memory-based classifiers (morphosyntactic disambiguation, constituent detection and labelling, attachment disambiguation). Encouraging results have already been obtained with memory-based morphosyntactic disambiguation and PP-attachment.
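The core of a memory-based classifier can be sketched in a few lines. The example below assumes a nearest-neighbour classifier with a feature-weighted overlap distance, applied to a toy PP-attachment task; the stored instances, the weights and the function names are hypothetical and are not taken from the Antwerp systems.

```python
from collections import Counter


def weighted_overlap_distance(a, b, weights):
    """Sum the weights of the features on which two instances differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def classify(instance, memory, weights, k=1):
    """Return the majority class among the k nearest stored instances."""
    ranked = sorted(
        memory,
        key=lambda example: weighted_overlap_distance(instance, example[0], weights),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Instances are (verb, noun1, preposition, noun2); the class is the
# attachment site of the PP: "V" (verb) or "N" (noun).
memory = [
    (("eat", "pizza", "with", "fork"), "V"),
    (("eat", "pizza", "with", "anchovies"), "N"),
    (("see", "man", "with", "telescope"), "V"),
    (("buy", "book", "of", "poems"), "N"),
]
# Hypothetical feature weights; in practice they would be estimated from data.
weights = (1.0, 1.0, 3.0, 2.0)

query = ("eat", "salad", "with", "fork")
label = classify(query, memory, weights)
print(label)

# Incremental learning amounts to storing the newly processed instance.
memory.append((query, label))
```

Because learning is just storage, the cost moves to classification time, which is why efficient implementations matter for this approach.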

Connectionist Learning (NN)
Pure connectionist approaches to grammar learning, while limited in many aspects, have met with some success ([Christ 94]). Most work in grammar learning using recurrent connectionist networks has relied on an impoverished form of lexical representation, where even morphological similarity is not preserved. An obvious alternative is to ``bootstrap'' from a lexical representation that carries usable grammatical information. One possibility currently being explored at UCD is to use vector representations generated from a principal component analysis (PCA) of lexical co-occurrence statistics derived from a large corpus of written English ([Zavrel 96]). An approach to the problem of grammar learning from a different angle is suggested by recent successes in other domains in the use of collections of networks that collaborate in carrying out a complex task. It is proposed to apply this approach to building a connectionist system capable of learning and exploiting a natural language grammar. The combination of a richer lexical representation and an improved learning paradigm should yield significant benefits in harnessing the strengths of connectionist networks for natural language processing.

Semantic-Driven Learning (SDL)
will focus on methods for automatically identifying semantic structures in sets of texts in restricted domains. The goal is to develop methods for elaborating and refining semantic classification schemata that serve to characterize meaningful textual units, and to apply the results to noun phrase analysis. This approach assumes that semantic ``tags'' display regular distributional behavior and can thus be disambiguated by well-known stochastic sequence-learning techniques such as Hidden Markov Models. The tags then serve as the basis for the automatic identification of ever larger hierarchically structured textual ``chunks'' (phrases and sentences) in an iterative cycle. Work to date has focused on the use of entropy measures to determine chunk boundaries in the sequences of tags. The chunks were subsequently classified and replaced by new tags (intuitively representing the semantic head of the phrase). This process was reiterated until no significant chunks were detected [Hutchens 95]. New algorithms will be explored for recognizing and classifying the chunks using different parameters, and these will be applied to various semantic domains.

Data-Oriented Structure Processing (DOSP)
is a method of learning how to provide appropriate linguistic representations for an unlimited set of utterances by generalizing from a given corpus of properly annotated exemplars. This probabilistic technique operates by decomposing the given representations into fragments and recomposing those pieces to analyze previously unseen utterances. It depends only on the nature of the representations, not on any grammatical notation or other descriptive machinery. The technique was originally developed for phrase-structure representations [Bod 95], but it has been generalized to include semantic annotations [Bod 96a], and recent proposals have focused on the correspondences between trees and feature structures that are provided by LFG representations ([Kaplan 96],[Bod 96b]). One outcome of the Rank Xerox/Xerox PARC/University of Stuttgart Parallel Grammar Project will be a corpus of LFG-annotated training materials that this new approach can be tested on.

Explanation-based learning (EBL)
is a technique from symbolic artificial intelligence and is often viewed as a processing strategy: frequently encountered combinations of ``chunks'' of material are, once analyzed, committed to memory, obviating reprocessing. The SRI Cambridge group has applied this technique in combination with a generalization step. This results in vastly improved analysis speed, often with some small cost in terms of coverage. However, the resulting complex structures have been used to `learn' lexical entries for unknown words with good success ([Rayner 88]), the sort of learning that ECGL focuses on. This strand of the work has to date focused on lexical learning, which will now be extended to (i) the learning of shallow syntactic and semantic structure and (ii) the automatic testing and refinement of hypotheses generated via EBL. The network shall seek external funding for this subproject.

5. Work Plan

Overview of Proposed Subprojects

Title   Site         Lang.          Key Techniques
-----------------------------------------------------------
TBEDL   Groningen    Eng, Ger       Finite-State Methods
FE      Tübingen     Ger, Eng       Parameter Estimation
GEN     SRI          Eng            Variation Control
MBL     Antwerp      Eng            Case-Based Learning
NN      Dublin       Eng            bootstrapping
SDL     ISSCO        Fr, Eng        Maximum Entropy
DOSP    Rank Xerox   Eng, Fr, Ger   Full-Structure Sensitivity

Schedule: The first year of each project should emphasize the following: (i) the extension and codification of existing methods, especially as these have been applied to the problem of lexical extension; (ii) theoretical investigation of the local methodology; and (iii) the detailed design of an experiment involving NP syntax. The second two years focus on (i) running and refining the experiments; and (ii) analysis of results.

Milestones: Given the project definitions specified at three years each, checkpoints are at 15 months---in time for the mid-term review---and at 33 months (with a final three months for wrap-up, reporting, and responding to criticism, if needed). The major milestones are thus a bit less than halfway through the project, when all of the initial two-year subprojects may be evaluated. The definitions of the milestones are not given here, but are implicit in the ``goals'' noted in the table above. In every case the initial milestone should show:

References

[Abeille 91] Anne Abeillé. Une grammaire lexicalisée d'arbres adjoints pour le français. PhD thesis, Université de Paris, 1991.

[Bod 95] Rens Bod. Enriching linguistics with statistics: Performance models of natural language. PhD thesis, University of Amsterdam, 1995.

[Bod 96a] Rens Bod, Remko Bonnema, and Remko Scha. A data-oriented approach to semantic interpretation. In Proceedings of the Workshop on Corpus-Oriented Semantic Analysis, Budapest, 1996. ECAI-96.

[Bod 96b] Rens Bod, Ronald Kaplan, Remko Scha, and Khalil Sima'an. A data-oriented approach to lexical functional grammar. In Jan Landsbergen, editor, Computational Linguistics in the Netherlands, Eindhoven, 1996. IPO, Philips.

[Brill 95] Eric Brill. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543--566, 1995.

[Briscoe 93] Ted Briscoe. Introduction. In Ted Briscoe, Ann Copestake, and Valeria de Paiva, editors, Inheritance, Defaults and the Lexicon, pages 1--12. Cambridge University Press, Cambridge, 1993.

[Evans 90] Roger Evans and Gerald Gazdar. The DATR papers. Technical Report CSRP 139, School of Cognitive and Computing Sciences, University of Sussex, 1990.

[Gecseg 84] Ferenc Gecseg and Magnus Steinby. Tree automata. Akademiai Kiado, Budapest, 1984.

[Hutchens 95] Jason L. Hutchens. Natural Language Grammatical Inference. PhD thesis, Univ. of Western Australia, 1995. Dept. of Information Technology.

[Kaplan 96] Ronald Kaplan. A probabilistic approach to lexical functional grammar. Paper Presented at the LFG Colloquium and Workshops, 1996.

[Kaplan 88a] Ronald Kaplan and Annie Zaenen. Long-distance dependencies, constituent structure, and functional uncertainty. In Mark Baltin and Anthony Kroch, editors, Alternative Conceptions of Phrase Structure, pages xxx--yyy. University of Chicago Press, Chicago, 1988.

[Lankhorst 94] Marc M. Lankhorst. A genetic algorithm for the induction of context-free grammars. In Gosse Bouma and Gertjan van Noord, editors, Proc. of CLIN IV, pages 87--100, Groningen, 1994. Alfa-informatica.

[Lankhorst 96] Marc M. Lankhorst. Genetic Algorithms in Data Analysis. PhD thesis, University of Groningen, 1996.

[Christ 94] M. H. Christiansen. Infinite languages, finite minds: Connectionism, learning, and linguistic structure. PhD thesis, University of Edinburgh, 1994.

[Pollard 85] Carl Pollard. Phrase structure grammar without metarules. In Proc. of the 4th Annual Meeting of the West Coast Conference on Formal Linguistics, pages 246--261, 1985.

[Rayner 88] Manny Rayner. Applying explanation-based generalization to natural language processing. In Fifth Generation Computer Systems 1988, volume 3. Springer Verlag, 1988.

[Rounds 70] William Rounds. Mappings and grammars on trees. Math. Systems Theory, 4(3):257--87, 1970.

[Zavrel 96] Jacob Zavrel and Jorn Veenstra. The language environment and syntactic word-class acquisition. In Charlotte Koster and Frank Wijnen, editors, Proc. of the Groningen Assembly on Language Acquisition (GALA95), Groningen, 1996. Groningen University.

6. Collective Experience of Research Teams

See individual pages for description of institution and mission, key personnel and their experience and expertise, and recent publications. These are marked ``Extending Computational Grammar'', Partner 1, etc.


University of Groningen Linguistics

The host institution for the TMR Network Proposal is the Centre for Language and Cognition, University of Groningen. The host group is Computational Linguistics. Computational Linguistics is one branch of Alfa-Informatica (Humanities Computing), founded in the Faculty of Arts at the University of Groningen in 1986. The work in computational linguistics is concentrated on grammar and parsing, esp. as these are applied in speech understanding, computer-assisted language learning, information systems, and text representation and processing. Foci of this work are the use of computers as laboratories for linguistic research, particularly in syntax and semantics, but with significant efforts in lexical structure, phonology, morphology and language learning. The research takes place under the auspices both of Groningen's Centre for Behavioral and Cognitive Neurosciences and of the Dutch graduate school in logic. @Alfa is a spin-off focusing on WWW applications.

For more detailed information, including personnel, opportunities and requirements for study, etc. see our World-Wide Web page at http://www.let.rug.nl/

Computational Linguistics: Teaching/Research Staff

Prof. John Nerbonne,
Ph.D. Ohio State University, 1984 had worked at HP Labs (Palo Alto, 1985-90), and the German Research Centre for Artificial Intelligence (Saarbrücken, 1990-93) before assuming the Groningen chair. Also chair of the European Association for Computational Linguistics. Work foci: grammar, syntax-semantics interface, and applied linguistics.
Dr. Gertjan van Noord
Ph.D. Utrecht, 1992 also worked at the University of Saarbrücken (1991-2). Work foci: parsing, generation, bidirectional processing, and grammar. Currently project leader of the 4-man NLP project cooperating in the Dutch National Science Foundation's speech-language program.
Dr. Gosse Bouma
Ph.D. Groningen, 1993 also worked on the LILOG project (IBM Deutschland, 1988-90). Work foci: default reasoning (esp. lexical specification), categorial grammar, and syntax-semantics interface.
Dr. Mark-Jan Nederhof
Ph.D. Nijmegen, 1994 is a postdoc specializing in parsing on the speech-language project.
Dr. Dimitra Kolliakou
Ph.D. Edinburgh, 1995, is a postdoc investigating nominal syntax.
Dr. Harry Gaylord
Ph.D. Jerusalem, 1983 specializes in character coding and text representation. Member of TEI, ISO committees.

Gertjan van Noord. ``An Efficient Implementation of the Head-Corner Parser.'' Accepted to appear in Computational Linguistics, 1997.

Gosse Bouma and Gertjan van Noord, ``Constraint-Based Categorial Grammar'' in: Proc. 32nd ACL, 1994, 147-54.

John Nerbonne, ``A Feature-Based Syntax/Semantics Interface'' in: Annals of Mathematics and Artificial Intelligence, 1993, 107-132.

John Nerbonne, ``Nominal Comparatives and Generalized Quantifiers'' in Journal of Logic, Language and Information 4, 1995, 273-300.

Mark-Jan Nederhof and Ewald Bertsch. ``Linear-time suffix parsing for deterministic languages.'' Journal of the ACM, 43(3), 1996, 524-554.


University of Tübingen: Dept. of Linguistics

The University of Tübingen Department of Linguistics (Seminar für Sprachwissenschaft) incorporates three sections: Computational Linguistics (Prof. Erhard Hinrichs), Mathematical Linguistics (Prof. Uwe Mönnich) and General Theoretical Linguistics (Prof. Arnim von Stechow). Cooperating professors in other departments include Prof. Marga Reis (German) and Prof. Bernard Drubig (English Linguistics).

Computational Linguistics: Teaching/Research Staff

Prof. Erhard Hinrichs
Ph.D. in Linguistics from Ohio State. Previously held positions at BBN Laboratories and the University of Illinois. Major interests in computational linguistics, formal syntax (especially HPSG), formal semantics (especially tense and aspect).
Dr. Dale Gerdemann
Ph.D. in Linguistics from the University of Illinois. Major interests in parsing & generation of unification grammars, typed feature structures, categorial grammar.
Dr. Steve Abney
Ph.D. in Linguistics from MIT. Previously at Bellcore Labs. Major interests in government & binding, dependency grammar, feature structure grammars.
Gerald Penn
M.A. in Computational Linguistics from Carnegie Mellon. B.A. in Mathematics from University of Chicago. Co-developer of ALE system. Major interests in compilation techniques for attribute-value grammars.
Tom Cornell
Ph.D. in Linguistics from UCLA. Previously a fellow at the University of Arizona. Major interests in the formal foundations and parsing of Government-Binding theory with some applications to neurolinguistics.
Helmut Feldweg
Master in Sinology from the University of Göttingen. Previously held position at the Max-Planck-Institute for Psycholinguistics. Major interests in corpus linguistics, lexical databases and lexicography.
Thilo Götz
Master in Linguistics from the University of Tübingen. Major interests in formal syntax (especially HPSG) and grammar compilation techniques.

Steven Abney. Parsing by chunks. In Berwick, Abney, and Tenny, editors, Principle-Based Parsing, pages 257--278. Kluwer, 1991.

Dale Gerdemann and Paul John King. The correct and efficient implementation of appropriateness specifications for typed feature structures. In COLING 94, Proceedings, pages 956--960, 1994.

Thilo Götz and Walt Detmar Meurers. Compiling HPSG type constraints into definite clause programs. In Proceedings of ACL 1995,

Erhard W. Hinrichs and Tsuneko Nakazawa. Linearizing AUXs in German verbal complexes. In Nerbonne et al., editors, German in Head-Driven Phrase Structure Grammar, CSLI, 1994.

Bob Carpenter and Gerald Penn. Compiling Typed Attribute-Value Logic Grammars. In H. Bunt and M. Tomita, eds., Recent Advances in Parsing Technology, Kluwer, (1996)


SRI International
Cambridge Computer Science Research Centre

SRI International is a not-for-profit research organisation founded over 50 years ago by Stanford University. SRI Cambridge was SRI's first research laboratory outside California, and has been in existence for 10 years, concentrating on natural language processing and formal methods. It carries out research and consulting for government and commercial clients, and has very close links with the University of Cambridge Computer Laboratory.

Natural Language Processing Research Staff

Dr. Stephen Pulman
Ph.D. in Linguistics from University of Essex. Director of SRI Cambridge: also faculty member of University of Cambridge Computer Laboratory. Main research interests in computational linguistics, especially semantics and contextual representation and reasoning.
Dr. David Carter
Ph.D. in Computer Science from Cambridge University. Previously held position at Cambridge University. Major interest: computational linguistics, especially statistical methods for speech-language interface, ambiguity resolution and domain adaptation.
Dr. Ian Lewin
Ph.D. in Artificial Intelligence from the University of Edinburgh. Major interests in computational linguistics, especially in formal and computational theories of semantics and dialogue.
Dr. David Milward
Ph.D. in Computer Science from Cambridge University. Previously held positions at Edinburgh University and the University of the Saarland, Saarbruecken. Major research interests: natural language processing, syntax and semantics.
Dr. Manny Rayner
Ph.D. in Computer Science from University of Stockholm. Previously held positions at Uppsala University and the Swedish Institute for Computer Science, Stockholm. Major interests in computational linguistics, spoken language translation, knowledge representation and reasoning, machine learning for computational linguistics, logic programming.

Alshawi, H. and Carter, D. 1994, `Training and Scaling Preference Functions for Disambiguation', Computational Linguistics, 20:4, pp 635-648.

Lewin, I. 1995, `Indexical Dynamics', in Polos, L. and Masuch M. (eds), Applied Logic: How, What and Why, Kluwer, pp 121-152.

Milward, D. 1994, 'Dynamic Dependency Grammar', Linguistics and Philosophy, 17, pp 561-605.

Pulman, S. 1996, 'Unification Encodings of Grammatical Notations', Computational Linguistics, 22:3, pp 295-327.

Rayner, M., and Carter, D. 1996, 'Fast Parsing using Pruning and Grammar Specialisation', in Proceedings of the 34th ACL, Santa Cruz, pp 223-230.


University of Antwerp Centre for Dutch Language and Speech

Current central research topics of CNTS include machine learning of natural language (symbolic induction of lexical and grammatical knowledge), speech synthesis, lexical organisation and acquisition (design of object-oriented lexical databases, acquisition of lexical knowledge from corpora), data-oriented parsing and part-of-speech tagging, machine learning of user models and pragmatic knowledge, and intelligent text processing (spelling checking, hyphenation, report generation).

CNTS is an associate node of ELSnet since 1993, member of Erasmus ICP NL-1022/09 on Natural Language Processing since 1991, and its follow-up ACO*HUM (Computing in the Humanities thematic network), and coordinator of CLIF (Computational Linguistics in Flanders), the Flemish FWO-funded research community on Computational Linguistics and Language Technology since 1995. The group is also the European headquarters of CHILDES (electronic archive of corpora) and participates in the development of data retrieval and manipulation tools for these corpora. CNTS is currently involved in several externally funded research projects (FWO, VNC, IWT etc.).

Center for Dutch Language and Speech: Teaching/Research Staff

Prof. Dr. Georges De Schutter.
Ph.D. University of Gent, 1972, teaches and researches general and Dutch syntax, morphology and phonology, and applications.
Prof. Dr. Walter Daelemans.
Ph.D. University of Leuven, 1987, worked at Language Technology group University of Nijmegen, managed an office automation Esprit project at the AI LAB Brussels, and is presently affiliated to both University of Antwerp and Tilburg University. Work focus: machine learning of natural language, linguistic knowledge representation.
Prof. Dr. Steven Gillis.
Ph.D. University of Antwerp, 1984, worked as a research manager at RIKS, currently research director at the NFWO (Belgian NSF). Work focus: language technology, user modeling, child language acquisition.
Drs. Masja Kempen, Gert Durieux, Drs. Peter Berck.
Research staff with backgrounds in psychology, linguistics, and computational linguistics, respectively.

Daelemans, W., Van den Bosch, A., & Weijters, A. `IGTree: Using Trees for Compression and Classification in Lazy Learning Algorithms.' To appear in D. Aha (ed.) Artificial Intelligence Review, special issue on Lazy Learning, 1997.

Daelemans, W., J. Zavrel, P. Berck, S. Gillis. `MBT: A Memory-Based Part of Speech Tagger-Generator'. In: E. Ejerhed and I. Dagan (eds.) Proceedings of the Fourth Workshop on Very Large Corpora, Copenhagen, Denmark, 14-27, 1996.

Daelemans, W. `Memory-Based Lexical Acquisition and Processing.' In: P. Steffens (ed.) Machine Translation and the Lexicon, Springer Lecture Notes in Artificial Intelligence 898, 85-98, 1995.

Daelemans, W., S. Gillis and G. Durieux. `The Acquisition of Stress, a data-oriented approach.' Computational Linguistics 20 (3), 421-451, 1994.

Daelemans, W. and K. De Smedt. `Inheritance in an Object-Oriented Representation of Linguistic Categories.' International Journal of Human-Computer Studies, 41, 149-177, 1994.


University College Dublin: Dept. of Computer Science

The largest research group in the Department of Computer Science at UCD is the AI group, and work on natural language processing accounts for a significant portion of the group's research effort. The dominant research themes are connectionist NLP, discourse modelling, and robust parsing. There are contacts with the Antwerp group, with whom Ronan Reilly spent a year at NIAS, and with the Groningen Behavioural and Cognitive Neuroscience center.

NLP Group: Teaching/Research Staff

Dr. Arthur Cater
Ph.D. in Computer Science from Cambridge University. Interests in robust parsing, generation, commonsense inference, compound noun interpretation.

Gemma Lyons
M.Sc. in Computational Linguistics from UCD. Ph.D. on generation in progress. Major interests in bi-directional grammars, categorial grammars, parsing and bi-directional generation.

Ronan Reilly
Ph.D. in Psychology from UCD. Interests in connectionist parsing, eye movements in reading, computational modelling of cortical function as it relates to NLP.

Arthur Cater and Dermot McLoughlin Compound noun interpretation using taxonomic links: An experiment in progress. Cognitive Science of Natural Language Processing Proceedings, Dublin City University, 1996.

Arthur Cater Lexical knowledge required for natural language processing. In Cheng-Ming Guo, ed., Machine tractable dictionaries: Design and construction, Ablex, 1995.

Gemma Lyons Presupposition: Its use in Discourse. In Cognitive Science of Natural Language Processing Proceedings, Dublin City University, 1992.

Gemma Lyons Natural Language Generation: An Intermediate Step. In Proceedings of AICS'94 Conference, Trinity College Dublin, 1994.

Ronan Reilly Sandy ideas and coloured days: The computational implications of embodiment. Artificial Intelligence Review, 9, pages 305--322, 1995.

Ronan Reilly A connectionist technique for on-line parsing. Network, 3, pages 37--45, 1992.

Ronan Reilly and Noel Sharkey, eds., Connectionist approaches to natural language processing. Hillsdale, NJ: Erlbaum, 1993.


ISSCO (``Istituto Dalle Molle per gli Studi Semantici e Cognitivi'')

ISSCO was established by the Fondazione Dalle Molle and is affiliated with the University of Geneva. It has been active in NLP for twenty years, especially focusing on multilingual language processing, evaluation of NLP systems and products, and the development of corpus-based techniques. ISSCO members have participated in numerous EC-funded projects, among which are TEMAA, TSNLP, MULTEXT, and EAGLES (Evaluation and Corpus Groups). ISSCO is a member of the European network of excellence for language and speech (ELSNET) and maintains the secretariat for the European Chapter of the Association for Computational Linguistics (EACL) and the European Association for Machine Translation (EAMT). Corpus collection initiatives in which ISSCO has played a major role include the European Corpus Initiative (ECI) and the Multilingual Corpora for Cooperation (MLCC), which have resulted in two of the largest collections of parallel and multilingual data currently available to the NLP research community. These data provide the necessary resource for many of the learning methods currently under development.

Recent work directly related to this proposal focuses on the extraction of semantic information from text corpora. The work attempts to recognize semantic equivalences across portions of texts.

As an ``External Partner'' ISSCO will seek funding from Swiss sources.

Computational Linguistics: Research Staff

Susan Armstrong-Warwick
is professor of Computational Linguistics at the Université de Genève and senior researcher at ISSCO. She has worked extensively on corpus-based methods, translation, and lexicography.
Pierrette Bouillon
is a researcher also affiliated with the Université Libre de Bruxelles and is particularly interested in corpus processing, lexica, and lexical semantics.
Sabine Lehmann
is a researcher with special interests in syntax, and test and evaluation procedures.
Graham Russell
(Ph.D., Sussex) is a researcher specializing in the lexicon, morphology, unification-based linguistic technology.

S. Armstrong-Warwick. Acquisition and Exploitation of Textual Resources for NLP. In: Proceedings of the KB & KS Workshop, Tokyo, 1994.

Susan Armstrong. Using Large Corpora, MIT Press, Cambridge, 1995.

S. Armstrong, G. Russell, D. Petitpierre, and G. Robert. An Open Architecture for Multilingual Text Processing. In: Proceedings of the ACL Sigdat Workshop -- From Texts to Tags: Issues in Multilingual Language Analysis. Dublin, 1995. pp. 30--34.

P. Bouillon, S. Lehmann, D. Petitpierre, G. Russell. Definition and Exploitation of Sublanguage Descriptions for MT in a Finite Domain. Final Scientific Report - Projet FNRS no. 1213-42173.94, 1996.

P. Bouillon, S. Lehmann, D. Petitpierre. Inférence statistique de structures sémantiques. In: Journées Scientifiques et Techniques 1997, Avignon, France, 15-16 April, forthcoming.


MLTT Rank Xerox Research Centre, Grenoble

Rank Xerox European Research Centre comprises two laboratories, one in Cambridge (UK), in existence since 1988, and one in Grenoble (France), created in September 1993. The research will be conducted in the laboratory in France.

The Grenoble Laboratory's aim is to enhance the understanding of business processes in a multilingual, distributed environment around multimedia documents and to create technology which helps businesses and individuals to become more efficient in these environments. MLTT concentrates on developing technologies which effectively support the work of individuals and groups in multilingual settings: creation, manipulation, modification, and translation of the natural language content of documents. Most relevant to this project is the work done on 'light parsing' and on constraint-based grammar development (LFG). Some of this work is carried out in collaboration with PARC.

Multilingual Theory and Technology Research Staff

Dr. Annie Zaenen
Ph.D. in Linguistics from Harvard University. Area manager, MLTT. Main research interests in computational linguistics, especially syntax and lexical semantics.
Dr. Jean Pierre Chanod
Agrege in mathematics of Ecole Normale Superieure. Major interest: computational linguistics, especially light parsing.
Dr. Ronald Kaplan
Ph.D. in Psychology from Harvard University. Major interests in computational linguistics, especially finite-state methods, parsing.
Dr. Frederique Segond
Doctorat in Applied Mathematics of the Ecole des Hautes Etudes en Sciences Sociales. Major interests in computational linguistics, syntax and lexicography.

Breidt, Lisa and Segond, Frederique, 1995, "Comprehension automatique des expressions a mots multiples en francais et en allemand", to be presented at the Quatriemes journees scientifiques de Lyon.

Chanod J.-P., Tapanainen P. "Tagging French: comparing a statistical and a constraint-based method" Proc. 1995 EACL, Dublin, 1995.

Chanod J.-P., Tapanainen P. "Creating a Tagset, Lexicon and Guesser for a French Tagger" From texts to tags: issues in multilingual language analysis, ACL SIGDAT Workshop, Dublin, 1995

Zaenen, Annie and Mary Dalrymple, 1995, "Polymorphic Causatives", to appear in Klavans (ed.) Representation and Acquisition of Lexical Knowledge, Proceedings of the AAAI symposium.

Karttunen, L., R. Kaplan and A. Zaenen, "Finite State Morphology with Composition", in Proceedings of COLING, 1992.


7. Collaboration

The teams will collaborate and interact regularly through email, and further through three network meetings. A kick-off meeting must guarantee that the task is well-defined and that common training and test material are available. This will be the primary responsibility of ISSCO and Groningen. At this point the ``common base'' mentioned in § 2 must be inventoried (informally) to avoid duplication of effort. A midterm meeting around 16 months into the project must review milestones, providing feedback and course-correction where needed. The final meeting is to be devoted to evaluation of results, dissemination and discussion of plans for potential further exploitation.

As noted in § 3, ``Originality'', the main benefit of collaboration is the control it provides in evaluating the very varied approaches currently being taken to the machine learning of natural language. None of the laboratories has the resources (or expertise) to experiment simultaneously in all of these areas. The benefits may also be seen concretely in the opportunity to share resources such as data and information about development systems (present at all sites), and in the sharpening of questions about the role of specification vis-à-vis learning/training in the syntax problem.

8. Network Organization and Management

The network is organized as seven teams with a single coordinator. There is a single focus problem which is to be attacked by related methods. Travel has been budgeted to allow visits by postdocs and also by permanent staff within the focal groups.

The network consists of very experienced teams who require no micro-management. The projects have been defined to be of the same duration so that review and presentation can take place simultaneously at annual meetings. This was done deliberately to make attendance maximally attractive. We will regard the annual meetings as particularly successful if they attract other site members not directly attached to ECGL. To further enhance the quality of meetings, we propose to invite leading researchers for extended presentations of their work, and to consider tutorial-like presentations among these, given opportunity and cooperation.

The teams in the proposed network know each other to some degree. They are involved in European projects together (DYANA, COMPASS, GLOSSER), participate in exchange programs (ERASMUS), share duties in the European ACL and in the Foundation for Logic, Language and Information, and have experience in organizing professional meetings and summer schools. This, too, should ease the management task.

In order to facilitate communication, each postdoc will report quarterly on progress. We have in mind reports for initiates of approx. one page in length. In order to promote dialogue, the quarterly reports will be distributed throughout the network. We do not foresee a formal reporting procedure for ``reviewing'' these, although the coordinator will report, naturally.

The coordinator, John Nerbonne, is experienced in project management. Having managed groups in industry (Hewlett-Packard Labs) and contract research (German AI Center, Saarbrücken), he is currently the chair of a department of five permanent staff and six on temporary contracts. He furthermore serves as the chair of the European chapter of the ACL ('97-98), and is on the board of the Foundation for Logic, Language and Information, and the Dutch National Science Foundation. He has organized conferences (HPSG '91, Linguistic Databases '95), ACL tutorials and ESSLLI and ELSNET summer school sections.

9. Training Need

The particular training need in the area of applying machine-learning techniques to NLP arises because the techniques are novel and not widely understood, and because the growing demand for NLP in the marketplace is competing with research for the small number of trained professionals.

10. Training Content

Each of the postdocs will work in one of the leading labs of the continent. The site descriptions detail the highly qualified personnel related to ECGL's project goals at each of the sites. We estimate the project supervision---all of which represents donated time---at 10% of project time, totalling 18 person-months.

Furthermore, the labs involved are almost all located at or associated with universities where advanced courses in natural language processing, descriptive linguistics, and machine learning---the main feeder disciplines for ECGL---are given. If each postdoc takes a single course (estimated at 0.5 person months) each year, this contributes a further 7.5 person months of training.

At several of the participating sites, postdocs also have the opportunity to conduct graduate courses. If half of them avail themselves of this opportunity, each holding a course for five graduate students (counting as 0.5 person-months of training each), this adds a further 7.5 person-months of training.

Finally, the participants in the ECGL network are committed to professional services such as the teaching of summer school courses and professional tutorials. If we estimate that three of these are given over the three-year time span, for thirty participants each at 0.25 person-months, this adds 22.5 person-months. These participants tend to be young postdocs. (One such course will be given in the 1997 European Summer School in Logic, Language and Information.)

We estimate the training effect at approx. 55.5 person-months, about half of this directly to the special target group (postdoctoral researchers).
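The figures above can be re-added with a short calculation. This is only a minimal sketch: the head count of five postdocs and the reading of ``half of them'' as three are assumptions inferred from the stated totals (18 and 7.5 person-months), not numbers given explicitly in the proposal, and all variable names are illustrative.

```python
# Re-adding the training estimate in person-months.
# ASSUMPTION: five postdocs over three years (inferred from the 18 and 7.5 totals).
# ASSUMPTION: "half of them" teaching is taken as three postdocs.
POSTDOCS = 5
YEARS = 3
TEACHING_POSTDOCS = 3

project_months = POSTDOCS * YEARS * 12          # total postdoc project time
supervision = project_months / 10               # 10% donated supervision time
courses_taken = POSTDOCS * YEARS * 0.5          # one course per postdoc per year
courses_taught = TEACHING_POSTDOCS * 5 * 0.5    # five graduate students per course
summer_schools = 3 * 30 * 0.25                  # three courses, thirty participants

total = supervision + courses_taken + courses_taught + summer_schools
direct_to_postdocs = supervision + courses_taken

print(f"total training effect: {total} person-months")
print(f"directly to postdocs:  {direct_to_postdocs} person-months "
      f"({direct_to_postdocs / total:.0%} of total)")
```

Under these assumptions the four components sum to the 55.5 person-months stated above, with a little under half going directly to the postdocs.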

11. Involvement of Industry

There is direct industrial involvement through the participation of Rank Xerox, perhaps the world's leader in marketing natural language technology. SRI, one of the leading private laboratories charged with the development of language technology, also participates. It is part of SRI's charter to perform contract research for industrial clients, so that the network has a direct channel to industry. Finally, several of the other partners have personnel with industrial experience (e.g., Nerbonne in Groningen and Hinrichs in Tübingen) who will seek industrial application of the technology developed here wherever this seems promising. The partners in the network collaborate with dozens of companies in other projects, most of which are very interested in applying this technology.

12. Financial Information

All the partners have budgeted money for postdocs, including their salaries and benefits, computers, incidentals such as copying and telephone, and overhead. There is furthermore a generous budget for travel since collaboration of the different partners will require a good deal of it.

Budget Summary

All figures in kECU/annum.

Partner              Budget/annum
-------              ------------
1. Groningen               69
2. Tübingen                68
3. SRI Cambridge           70
4. Antwerp                 70
5. Dublin                  70
6. ISSCO                    0
7. Rank Xerox              70
                          ---
Total                     417
Average                   ~69

(ISSCO, which does not seek funding, is not included in the average. With ISSCO, the average would be ~60 kECU/annum.)

Breakdown into main titles/annum

Salaries and Benefits     270
Overhead (20%)             54
Computing & Incidentals    54
Travel & Networking        39
                          ---
Total                     417
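
The budget figures can be cross-checked in the same way. The sketch below simply re-adds the numbers from the two tables (in kECU/annum); the variable names are illustrative, and reading the 20% overhead as a percentage of salaries and benefits is an assumption that happens to reproduce the stated 54.

```python
# Cross-check of the budget summary, all figures in kECU/annum.
partners = {
    "Groningen": 69,
    "Tübingen": 68,
    "SRI Cambridge": 70,
    "Antwerp": 70,
    "Dublin": 70,
    "ISSCO": 0,        # external partner, seeks Swiss funding
    "Rank Xerox": 70,
}
breakdown = {
    "Salaries and Benefits": 270,
    "Overhead (20%)": 54,
    "Computing & Incidentals": 54,
    "Travel & Networking": 39,
}

total = sum(partners.values())
funded = [amount for amount in partners.values() if amount > 0]
avg_funded = total / len(funded)      # average over partners that seek funding
avg_all = total / len(partners)       # average including ISSCO

# ASSUMPTION: overhead is 20% of salaries and benefits (integer arithmetic).
overhead = breakdown["Salaries and Benefits"] * 20 // 100

print(f"total: {total}, breakdown total: {sum(breakdown.values())}")
print(f"average (funded partners): {avg_funded:.1f}")
print(f"average (all partners):    {avg_all:.1f}")
print(f"overhead at 20% of salaries: {overhead}")
```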