Annotation-based finite-state processing in a large-scale NLP architecture

Branimir Boguraev, IBM

Abstract and presentation

There are well-articulated arguments promoting the deployment of finite-state processing techniques for NLP application development. This talk adopts the point of view of designing industrial-strength NLP frameworks, where emerging notions include a pipelined architecture, open-ended inter-component communication, and the adoption of linguistic annotations as the fundamental descriptive/analytic device. For such frameworks, certain issues arise concerning the underlying data stream over which the FS machinery operates. I will review recent work on finite-state processing of annotations and highlight some essential features required from a 'congenial' architecture for NLP aiming to be broadly applicable to, and configurable for, an open-ended set of tasks.
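
A rough illustration of what 'finite-state processing over annotations' amounts to in practice: the sketch below runs a simple finite-state pattern (DT JJ* NN) over a stream of typed, offset-bearing annotations rather than over raw characters. The annotation model, feature names and pattern are hypothetical and purely illustrative; they are not the architecture described in the talk.

    # Minimal sketch of finite-state matching over a stream of typed annotations
    # rather than over raw text. Names and the annotation model are illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class Annotation:
        type: str                       # e.g. "Token", "NP"
        begin: int                      # character offsets into the underlying text
        end: int
        features: dict = field(default_factory=dict)

    def match_np(tokens, i):
        """Accept the pattern DT JJ* NN starting at position i; return end index or None."""
        state, j = "START", i
        while j < len(tokens):
            pos = tokens[j].features.get("pos")
            if state == "START" and pos == "DT":
                state = "MODS"
            elif state == "MODS" and pos == "JJ":
                pass                    # stay in MODS on adjectives
            elif state == "MODS" and pos == "NN":
                return j + 1            # accepting state
            else:
                return None
            j += 1
        return None

    def chunk(tokens):
        """Add NP annotations over token annotations, taking the leftmost matches."""
        nps, i = [], 0
        while i < len(tokens):
            end = match_np(tokens, i)
            if end is not None:
                nps.append(Annotation("NP", tokens[i].begin, tokens[end - 1].end))
                i = end
            else:
                i += 1
        return nps

    # "the big dog barked"
    toks = [Annotation("Token", 0, 3, {"pos": "DT"}),
            Annotation("Token", 4, 7, {"pos": "JJ"}),
            Annotation("Token", 8, 11, {"pos": "NN"}),
            Annotation("Token", 12, 18, {"pos": "VBD"})]
    print(chunk(toks))                  # one NP annotation spanning offsets 0..11

The point of the sketch is that the pattern is stated over annotation types and features, so the same finite-state machinery applies to the annotation stream regardless of which pipeline component produced it.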

Presentation



A Type-Theoretic Approach to Anaphora and Ellipsis Resolution

Chris Fox and Shalom Lappin, University of Essex and King's College London

Abstract and presentation

We present an approach to anaphora and ellipsis resolution in which pronouns and elided structures are interpreted by the dynamic identification in discourse of type constraints on their semantic representations. The content of these conditions is recovered in context from an antecedent expression. The constraints define separation types (sub-types) in Property Theory with Curry Typing (PTCT), an expressive first-order logic with Curry typing that we have proposed as a formal framework for natural language semantics.
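
Schematically, and only schematically (the notation below is generic separation-type notation, not the precise PTCT syntax of the paper): for a discourse such as 'John bought a car. It was expensive.', the pronoun is interpreted as a variable whose type is a sub-type carved out by conditions recovered from the antecedent clause,

    x : \{\, y : e \mid \mathrm{car}(y) \wedge \mathrm{buy}(\mathrm{john}, y) \,\}, \qquad \mathrm{expensive}(x).

Resolution then consists in identifying, dynamically in discourse, the right such type constraint for the pronoun's (or elided structure's) semantic representation.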

Presentation



New Developments in Extracting Temporal Information

Inderjeet Mani, Georgetown University

Abstract and presentation

The growing interest in practical NLP applications such as text summarization and question-answering places increasing demands on the processing of temporal information in natural languages. One would like to be able to distinguish distinct but similar events, determine when an event occurs, and find out what events occurred prior to a particular event. To support this, several new capabilities have emerged. These include the ability to represent the meanings of time expressions, and to temporally anchor and order events. Such capabilities require a variety of sources of linguistic knowledge, including temporal adverbials, tense, aspect, discourse relations, and background knowledge. This talk will describe some of the new annotation schemes, tools, and corpora that have resulted from this research, and will conclude with a discussion of the major opportunities and challenges for future research.
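
As a concrete, if toy, illustration of the 'anchor and order' capabilities mentioned above (hypothetical code; not the annotation schemes or tools discussed in the talk), the sketch below normalizes relative time expressions against a document creation date and orders events by the resulting anchors.

    # Toy sketch: anchor relative time expressions to a document creation time
    # and order events by their anchors. Real systems use much richer schemes
    # (TimeML-style annotation, temporal reasoning) than this lookup table.
    from datetime import date, timedelta

    def normalize(expr, doc_date):
        """Map a small set of relative expressions to calendar dates."""
        table = {"today": 0, "yesterday": -1, "tomorrow": 1}
        if expr in table:
            return doc_date + timedelta(days=table[expr])
        raise ValueError(f"unhandled time expression: {expr}")

    doc_creation = date(2004, 3, 15)            # hypothetical document timestamp
    events = [("announced", normalize("yesterday", doc_creation)),
              ("meets",     normalize("tomorrow",  doc_creation))]

    # temporal ordering follows directly from the anchored values
    for name, anchor in sorted(events, key=lambda e: e[1]):
        print(name, anchor.isoformat())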

Inderjeet Mani is an Associate Professor of Linguistics at Georgetown University. He has published two books on summarization: Automatic Summarization (Benjamins 2001) and (co-edited) Advances in Automatic Text Summarization (MIT Press 1999), and is also co-editing the forthcoming book The Language of Time: A Reader (Oxford University Press 2004). He has led projects in summarization, temporal information extraction, and information retrieval funded by ARDA, DARPA, MITRE, and NSF.

Presentation



Learning Domain Theories

Stephen Pulman, Oxford University

Abstract and presentation

By 'domain theories' we mean collections of non-linguistic facts and generalisations or 'commonsense theories' about some domain. Traditional approaches to many problems in AI, and in NL disambiguation, presupposed that such theories were to hand for the relevant domain of application. Notoriously these were difficult to formalise and fragile to use, and over recent years the traditional approach has been almost entirely superseded by statistical learning mechanisms. However, this does not mean that the traditional view was entirely wrong: it is clear that many aspects of linguistic interpretation are based on reasoning to novel conclusions of a type that is unlikely to be replicable by purely statistical methods. This talk reports some recent experiments on learning domain theories automatically, starting with parsed corpora, and using various machine learning techniques, particularly Inductive Logic Programming, to derive some simple domain theories from them.
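
A hypothetical sketch of the pipeline the abstract describes, from parsed corpora to relational facts to a candidate generalisation. The data, predicate names, and the trivial induction step are illustrative only, and stand in for what a real ILP learner (e.g. a Progol-style system) would do.

    # Toy sketch of the "parsed corpus -> relational facts -> generalisation" idea.
    from collections import defaultdict

    # Pretend output of a parser: (subject, verb, object) triples plus noun classes.
    facts = [("driver", "drives", "bus"),
             ("driver", "drives", "taxi"),
             ("pilot",  "flies",  "plane")]
    noun_class = {"bus": "vehicle", "taxi": "vehicle", "plane": "vehicle"}

    # Naive induction: if every observed object of a verb falls in one class,
    # propose the generalisation  verb(X, Y) -> class(Y)  as a domain axiom.
    objects_by_verb = defaultdict(set)
    for subj, verb, obj in facts:
        objects_by_verb[verb].add(obj)

    for verb, objs in objects_by_verb.items():
        classes = {noun_class[o] for o in objs}
        if len(classes) == 1:
            print(f"{verb}(X, Y) :- {classes.pop()}(Y).")   # candidate rule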

Presentation



Language Technology for Digital Memories

Hans Uszkoreit, University of Saarland

Abstract and presentation

One of the strongest limitations on human cognition is the capacity constraint on memory. Digital information processing offers the potential to extend personal memories beyond their biologically determined physical capacity. However, in order to become powerful external extensions of our memory, our personal data repositories have to be associatively structured in a high-dimensional information space. Moreover, the dimensions have to be meaningful with respect to our cognitive memory. Similar structuring mechanisms can be employed for the creation of dynamic collective memories. There is a strong demand for the associative dynamic structuring of collective information spaces, since one of the strongest limitations on social development results from constraints on the structure, growth, and utilization of collective memory. Collective memory is composed of the personal memories of communicating individuals and of libraries, archives, and other information repositories. Language is not only the predominant medium on the WWW and in large personal information archives; it is also the only medium rich enough to provide the dimensions of useful associative structuring. Language technology is urgently needed for the creation of effective digital memories and for the implementation of associative interfaces to archives and applications. The required technology needs to go beyond traditional IR technology, but it does not need the power of the nearly full semantic text analysis that would be needed to turn digital texts automatically into the knowledge format of the envisaged Semantic Web. We will show how information extraction technology can be utilized for building associative memories and interfaces.
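
A minimal sketch of the closing claim, that information extraction can feed an associative memory: extracted entities become dimensions along which personal documents are cross-linked. Everything below, the documents, the 'extraction' (a mere keyword lookup), and the index, is hypothetical and illustrative.

    # Toy sketch: index documents by extracted entities so that a query entity
    # retrieves associatively related documents. Extraction is faked with a
    # lookup against a known-entity list; a real system would run IE proper.
    from collections import defaultdict

    docs = {1: "Meeting with Anna about the Prague trip",
            2: "Photos from Prague, old town square",
            3: "Anna's birthday dinner"}
    known_entities = {"Anna", "Prague"}

    index = defaultdict(set)                 # entity -> set of document ids
    for doc_id, text in docs.items():
        for entity in known_entities:
            if entity in text:
                index[entity].add(doc_id)

    def associated(entity):
        """Documents linked to an entity: one dimension of the associative memory."""
        return sorted(index.get(entity, ()))

    print(associated("Prague"))   # [1, 2]
    print(associated("Anna"))     # [1, 3]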

Presentation



Cross-language algorithms: the progressive conflation of machine translation and information retrieval

Yorick Wilks, Sheffield University

Abstract and presentation

MT and NLP researchers of a certain age will remember that, about fifteen years ago, the group under Jelinek and Brown at IBM mounted an attack on the idea of MT as a purely linguistic/symbolic enterprise, and argued that engineering methods based purely on text statistics, derived from their success in speech recognition, could yield fundamental advances in MT. There were debates at conferences and in newsletters, and matters came to a head in the DARPA MT competitions of the early Nineties, where both types of system (supported by DARPA) were pitted against each other and against commercial systems, including SYSTRAN. The answer was pretty clear: statistical MT did well, better than many expected, but never beat SYSTRAN over texts and domains for which neither had been trained.

Many believe that nothing much has happened since, but I will argue that that is not so. What has happened above all is the web, which has both provided a new, easily accessible market for MT, through page translation, and has also provided a source of vast corpora, unimaginable before. However, that availability has not yet been cashed in: there is an enormous amount of work, of both sorts and above all on hybrids of both, but nothing fundamental has yet enabled purely empirical methods to overcome the data-sparseness problem, not even the web itself, viewed as a corpus. It seems pretty clear that some form of symbolic methods will be needed to do that. Again, that opposition is increasingly hard to make, as "symbolic" methods now themselves tend to be empirically based, and refer only to information types, rather than to structures written down directly from intuition. Most striking has been the division of the old MT task into sub-tasks, each tackled and evaluated independently (the most MT-relevant case has been Word Sense Discrimination), but whose limited successes have not, so far, been built back into more advanced MT itself.

Again, MT has disintegrated in another way, in that multilingual functionality over a whole range of tasks, from simple text editing up to summarization and information extraction, is now such that one cannot really say whether they are MT or not. None of this should matter if real advances for language workers are being made, and they are. But intellectually it can be bewildering, as in the recent turning of the tables in which it has been argued that information retrieval should be seen as a form of machine translation (as opposed to vice versa!). This last, and most interesting, reversal is a main topic of the lecture.
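
The 'IR as a form of MT' reversal mentioned at the end is usually stated via a noisy-channel model; the formulation below is given as standard background, not as a claim about the lecture's specific content. Documents d are ranked for a query q by

    P(d \mid q) \;\propto\; P(q \mid d)\, P(d), \qquad
    P(q \mid d) \;=\; \prod_{i} \sum_{w} t(q_i \mid w)\, P(w \mid d),

where t(q_i | w) is a word-to-word 'translation' probability estimated much as in statistical MT, and P(w | d) is a document language model; retrieval thus literally reuses the machinery of translation modelling.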

Presentation
