Wednesday, January 27, 2010

A few thoughts on defining the core issue of NLP: Parsing

With reference to (and in continuation of) my post dated 5th November, 2009:
In natural languages, a sentence expresses a proposition, idea or thought, and says something about some real or imaginary world. Extracting the meaning from a sentence is therefore non-trivial, because a sentence is not just a linear sequence of words. Each sentence has to be analysed to determine its syntactic structure, which is itself based on a grammar (an abstract formal system of rules and principles). This procedure is widely known as parsing. Parsing is not a goal in itself, however, but an intermediate step towards further processing, such as assigning an appropriate meaning to a natural language sentence.
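
To make the idea of "determining syntactic structure" concrete, here is a minimal sketch of a chart (CKY) parser over a toy grammar. The grammar rules, the lexicon and the function name `parse` are my own illustrative assumptions, not a serious grammar of English or of any other language; the point is only that the output is a nested tree rather than a flat word sequence.

```python
# Toy grammar in Chomsky normal form. Rules and lexicon are illustrative
# assumptions, chosen only to keep the example small.
BINARY = {
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
LEXICON = {"the": "Det", "a": "Det", "dog": "N", "cat": "N", "chased": "V"}


def parse(words):
    """CKY parsing: return one tree for the whole sentence, or None."""
    n = len(words)
    # chart[i][j] maps a category to one tree covering words[i:j].
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        cat = LEXICON.get(word)
        if cat is not None:
            chart[i][i + 1][cat] = (cat, word)
    # Build longer spans from pairs of adjacent shorter spans.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lcat, ltree in chart[i][k].items():
                    for rcat, rtree in chart[k][j].items():
                        parent = BINARY.get((lcat, rcat))
                        if parent is not None and parent not in chart[i][j]:
                            chart[i][j][parent] = (parent, ltree, rtree)
    return chart[0][n].get("S")


if __name__ == "__main__":
    print(parse("the dog chased a cat".split()))
```

The tree groups "the dog" and "a cat" into noun phrases and attaches the second one inside the verb phrase, which is exactly the kind of structure a later meaning-assignment step would work from. A word order the toy grammar does not license simply yields `None`.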

Although parsers have proved uncontroversially useful in the processing of programming languages, parsing in Natural Language Processing (NLP) has long been a source of tension between the computational and the linguistic perspectives. In the past, the controversy arose from a divergence of objectives between natural language application developers, who were oriented towards building practical parsers, and psycholinguists, who were concerned with the psychological process of language comprehension. In recent times, however, developers of natural language applications have questioned the usefulness of parsing in practical NLP systems. This is primarily because no available grammar has complete coverage of freely occurring natural language text, and no parser is robust enough to cope with that inadequacy. The limitation is compounded by the inherent ambiguity of natural languages, which forces parsers to operate at speeds far from real-time requirements.
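
The cost of ambiguity can be illustrated with a small sketch that counts, rather than builds, the analyses a grammar allows. Again, the grammar, the lexicon and the name `count_parses` are illustrative assumptions of mine; the grammar deliberately lets a prepositional phrase attach either to a noun phrase or to a verb phrase, which is the classic source of attachment ambiguity.

```python
# Toy ambiguous grammar in Chomsky normal form (an illustrative assumption):
# a PP may attach to an NP or to a VP.
BINARY = {
    ("NP", "VP"): "S",
    ("V", "NP"): "VP",
    ("VP", "PP"): "VP",
    ("Det", "N"): "NP",
    ("NP", "PP"): "NP",
    ("P", "NP"): "PP",
}
LEXICON = {
    "I": "NP", "saw": "V", "the": "Det",
    "man": "N", "telescope": "N", "park": "N",
    "with": "P", "in": "P",
}


def count_parses(words):
    """Count the distinct S trees the toy grammar assigns to the sentence."""
    n = len(words)
    # chart[i][j] maps a category to the number of trees covering words[i:j].
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        cat = LEXICON.get(word)
        if cat is not None:
            chart[i][i + 1][cat] = 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = chart[i][j]
            for k in range(i + 1, j):
                for lcat, lcount in chart[i][k].items():
                    for rcat, rcount in chart[k][j].items():
                        parent = BINARY.get((lcat, rcat))
                        if parent is not None:
                            # Every left tree combines with every right tree.
                            cell[parent] = cell.get(parent, 0) + lcount * rcount
    return chart[0][n].get("S", 0)


if __name__ == "__main__":
    for s in (
        "I saw the man",
        "I saw the man with the telescope",
        "I saw the man with the telescope in the park",
    ):
        print(count_parses(s.split()), s)
```

Under this toy grammar the three sentences receive 1, 2 and 5 analyses respectively. Each additional prepositional phrase multiplies the possibilities, so a parser that enumerates every analysis of a long sentence quickly falls behind real-time requirements.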

It is quite evident that there is a close relationship between a parser and the linguistic representation it manipulates. In recent times, however, there has been increasing debate over what that representation should be, how linguistically detailed it should be, and how one can go about constructing it.

A parser based on a linguistically motivated wide-coverage grammar has many advantages. Depending on how directly a grammar framework encodes linguistic facts, a linguistically motivated grammar developed in that framework could produce output that is quite detailed and directly amenable to further processing. Furthermore, if a grammar for one language is worked out in detail and its structures are organized systematically, then it is conceivable that grammars for closely related languages could be generated automatically by abstracting away from language-specific features.

There are numerous linguistic theories that are embedded in mathematically restricted formal systems. Work in syntactic description along the lines of GB Theory and Minimalism, proposed by Chomsky (1981, 1995) and others, has always been the most thoroughly detailed and worked-out aspect of linguistic inquiry, and NLP researchers have borrowed a great deal from generative linguistics. Lexical Functional Grammar (LFG) (Kaplan and Bresnan, 1982), Head-driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994) and Tree Adjoining Grammar (TAG) (Joshi et al., 1975; Joshi, 1985, 1987; among others) are probably the most efficient of these so far. Each of these formal systems has its own limitations. However, considering the advantages mentioned in the preceding paragraph, I intend to adopt Minimalism for developing a computational grammar for Assamese (an Eastern Indo-Aryan language) in the context of Asian language processing. I wish to address at least two theoretical questions:
  • How thoroughly can grammatical information be incorporated into a computational parsing model?
  • Can a parser attempt to infer grammatical structure from raw text annotated only with Part-of-Speech (POS) tags, just as a child attempts to do while learning his or her first language?
How?
