A "parser" is a system that transforms sentences (strings of characters) into a representation that describes the groupings of words (phrases) and their relations (e.g. subject and object). The representation of choice for such information is a syntactic tree in which nodes refer to phrases, word categories, or words, and links refer to relations between these objects:
Sample output from MBSP visualized with GraphViz.
In computational linguistics, different approaches exist to building a parser. One option is to construct a grammar – a set of rules that describes the syntactic structures expected to occur in the sentences of a language. Combined with a computer-readable dictionary and a search algorithm, a grammar can produce (generate) all possible syntactic structures of a sentence. Unfortunately, there can be many of them (most of them nonsensical), and the parser will not assign more weight to the more relevant ones. A grammar can be constructed by hand by a linguist, but it can also be induced automatically from a "treebank" (a text corpus in which each sentence is annotated with its syntactic structure).
Although the annotation of a sizable corpus of text with syntactic structures is time-consuming, the effort pays off: from the treebank a grammar can be induced based on actual language use rather than on the often idealized image a linguist has about what is "grammatical". Moreover, the induced grammar and dictionary are probabilistic. They contain statistics on the grammar rules, and on the association of words with syntactic categories. The many syntactic analyses that can be constructed for any sentence can at the same time be ordered according to probability.
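The induction step described above can be sketched with a toy example: given a (tiny, invented) treebank of nested-tuple trees, we count how often each grammar rule occurs and turn the counts into conditional probabilities. The tree encoding and the data are made up for illustration; real treebanks use richer formats such as Penn Treebank bracketing.

```python
from collections import Counter, defaultdict

# Toy treebank: each tree is a nested tuple (label, children...); a word is a string.
treebank = [
    ("S", ("NP", ("DT", "the"), ("NN", "dog")),
          ("VP", ("VBZ", "barks"))),
    ("S", ("NP", ("DT", "a"), ("NN", "cat")),
          ("VP", ("VBZ", "sleeps"))),
]

def rules(tree):
    """Yield (lhs, rhs) productions from a nested-tuple tree."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, (children[0],))  # lexical rule, e.g. NN -> dog
        return
    yield (label, tuple(child[0] for child in children))
    for child in children:
        for r in rules(child):
            yield r

counts = Counter(r for tree in treebank for r in rules(tree))
lhs_totals = defaultdict(int)
for (lhs, rhs), n in counts.items():
    lhs_totals[lhs] += n

# P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
probs = {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

print(probs[("NP", ("DT", "NN"))])  # 1.0: both NPs in the treebank expand as DT NN
```

These rule probabilities are exactly the "statistics on the grammar rules" mentioned above; a probabilistic parser uses them to rank the competing analyses of a sentence.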
The type of syntactic analysis a sentence gets depends on the type of grammar used. Constituent-based grammars focus on the hierarchical phrase structure of a sentence. Dependency-based grammars focus on (grammatical) relations between words. However, all of them attempt to provide a syntactic analysis that is as complete as possible. Given that syntactic analysis is mainly useful as an intermediate step toward semantic interpretation, detailed syntactic information is not always necessary. What we need in many text-mining applications is a robust, efficient, accurate, and deterministic analysis of a sentence in terms of its main constituents and the relations between them.
Shallow parsing is an approach to achieving this. The syntactic parsing process is carved up into a set of classification problems, each of which can be learned separately using standard supervised machine learning methods. The MBSP modules include part-of-speech tagging, phrase labeling, and grammatical relation finding for English. In combination, these modules produce a syntactic analysis of sentences that is detailed enough to drive further semantic and application-oriented processing. Shallow parsing is especially useful in applications such as information retrieval, question answering, and information extraction, where large volumes of (often ungrammatical) text have to be analyzed in an efficient and robust way.
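To make "parsing as classification" concrete, here is a minimal sketch of chunking recast as per-token classification: each (word, POS) token receives an IOB tag (B-NP begins a noun phrase, I-NP continues one, O is outside any phrase). A 1-nearest-neighbor classifier over feature overlap stands in for the memory-based learner; the feature names and the three training instances are invented for illustration.

```python
# Training instances: (features, IOB tag). In a real system these would be
# extracted from a treebank, with far richer features and many more examples.
training = [
    ({"pos": "DT", "prev": "O"},     "B-NP"),
    ({"pos": "NN", "prev": "B-NP"},  "I-NP"),
    ({"pos": "VBZ", "prev": "I-NP"}, "O"),
]

def overlap(a, b):
    """Number of features on which two instances agree."""
    return sum(1 for k in a if a.get(k) == b.get(k))

def classify(features):
    # Memory-based: keep all training instances, return the tag of the closest.
    return max(training, key=lambda ex: overlap(ex[0], features))[1]

def chunk(tagged):  # tagged = [(word, POS), ...]
    prev, out = "O", []
    for word, pos in tagged:
        tag = classify({"pos": pos, "prev": prev})
        out.append((word, pos, tag))
        prev = tag
    return out

print(chunk([("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]))
```

The key point is that no grammar is consulted: each token's chunk tag is a local classification decision, which is what makes the process deterministic and fast.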
The concept of shallow parsing, proposed by Abney (1991), has no clearly defined meaning. It is sometimes used in a very limited sense, referring only to tagging and chunking, and sometimes in a broader sense, referring also to semantic tasks such as named-entity recognition. It is best interpreted as a family of related tasks that attempt to recover some syntactic-semantic information in a robust and deterministic way, at the expense of ignoring detailed configurational syntactic information. In our approach to shallow parsing, we use memory-based learning as the machine learning method, and the current version of MBSP defines the following software modules:
- Part-of-speech tagging and chunking
After tokenization and sentence splitting, each word in each sentence of an input text is assigned its grammatical category. At the same time, constituents such as noun phrases, verb phrases, and prepositional phrases are detected.
- Grammatical relations
On the basis of the predicted constituents, grammatical relations between them are predicted (e.g. subject and object).
- Prepositional phrase attachments
Prepositional phrases are related to the constituent to which they belong.
A machine learning approach to automatic text analysis is only as good as the treebank material on which it is trained. Moreover, no available treebank is representative of language as a whole, but only (if it is large enough) of the domain from which it was sampled. This means, for example, that a shallow parser trained on a corpus of machine-part descriptions will not perform very well on text about astronomy. Our shallow parser currently comes in two versions: one trained on general newspaper language, and one trained on biomedical language (included in the module as an example).
Reference: Abney, S. (1991). Parsing By Chunks. In: Robert Berwick, Steven Abney and Carol Tenny (eds.), Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht.