Audun Rømcke gives his Master's lecture on
Named Entity Recognition using the Web
The Truth is Out There
Named Entity Recognition (NER) is the task of identifying and classifying proper names for entities such as people and locations. It is an important task in Natural Language Processing, as proper names often contribute significantly to the content of a message. Proper names can be classified into broad or narrow classes. This work uses four classes: Person, Organisation, Location and Miscellaneous.
Classification is performed on the Dutch texts used as the test corpus for the NER task in CoNLL-2002. The method looks for candidate words in contexts that are highly predictive of a word's class, such as “His name is X”. If the contexts are chosen carefully, this simple method can give results as good as those of more elaborate methods.
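The idea of predictive contexts can be illustrated with a minimal sketch. It is not the Entity Cruncher implementation: the patterns, the class labels, the fallback to Miscellaneous, and the `hit_count` callable (a stand-in for counting how often a phrase occurs in a corpus or on the web) are all assumptions made for illustration, and the patterns are in English rather than Dutch for readability.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical context patterns per class; "{}" marks the candidate slot.
PATTERNS: Dict[str, List[str]] = {
    "PER": ["his name is {}", "mr. {} said"],
    "ORG": ["shares in {}", "{} inc."],
    "LOC": ["the city of {}", "lives in {}"],
}


def classify(candidate: str,
             hit_count: Callable[[str], int],
             default: str = "MISC") -> str:
    """Pick the class whose patterns yield the most hits for the candidate."""
    scores: Counter = Counter()
    for label, patterns in PATTERNS.items():
        for pattern in patterns:
            scores[label] += hit_count(pattern.format(candidate))
    # Assumption: fall back to the default class when no pattern matches.
    if max(scores.values()) == 0:
        return default
    return scores.most_common(1)[0][0]


# Toy stand-in for a web corpus, so the sketch runs without network access.
TOY_CORPUS = [
    "his name is audun",
    "she lives in oslo",
    "the city of oslo is busy",
]


def toy_hit_count(phrase: str) -> int:
    """Count case-insensitive occurrences of the phrase in the toy corpus."""
    phrase = phrase.lower()
    return sum(doc.count(phrase) for doc in TOY_CORPUS)


if __name__ == "__main__":
    for word in ["Audun", "Oslo", "Xyzzy"]:
        print(word, classify(word, toy_hit_count))
```

In a real setting, `toy_hit_count` would be replaced by a function that queries a large corpus, and the quality of the result would depend mostly on how carefully the patterns are chosen.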
There are several problems for successful NER:
1. Approaches using lists of names and name elements are very precise, but fail to generalise beyond their intended domain and the language they were tailored for.
2. Statistical and computational approaches have broader coverage, but often lack the level of precision needed.
3. Training of statistical models requires large amounts of annotated data.
Entity Cruncher is a program for Named Entity Recognition using the web as a linguistic corpus. The program was developed and evaluated using the Dutch data from the CoNLL 2002 shared task (Tjong Kim Sang, 2002). The final F-score for the system was 63.20, well above the baseline results given in the CoNLL literature.
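For readers unfamiliar with the metric, the F-score used in the CoNLL shared tasks is the harmonic mean of precision and recall, usually reported as a percentage. The sketch below shows the arithmetic only; the function name and the counts in the usage example are hypothetical and are not the system's actual figures.

```python
def f_score(correct: int, predicted: int, gold: int) -> float:
    """Harmonic mean of precision and recall (F with beta = 1).

    correct:   entities the system found with the right span and class
    predicted: all entities the system proposed
    gold:      all entities in the reference annotation
    """
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the calculation.
    print(100 * f_score(correct=50, predicted=100, gold=100))
```

Because the harmonic mean is dominated by the lower of the two values, a system cannot reach a high F-score through precision or recall alone.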
Entity Cruncher gives reason for optimism about using web data for name and word classification. Many elaborations of the basic scheme are possible, including machine learning and the use of additional information such as more context words and POS information.
The system's results support the view, expressed by Modjeska et al. (2003) among others, that the web is a goldmine for statistical linguistics. The sheer amount of data gives consistently high coverage across domains and languages.
Additional information from the text in which the candidate word was found could also be used, via machine learning, to improve the results.