SCARRIE Final Report

WordFinder SCARRIE
Project ref. no. LE3-4239
Project title SCARRIE
Deliverable status Public
Contractual date of delivery February, 1999
Actual date of delivery June, 1999
Deliverable number DEL 0.4
Deliverable title Final Project Report
Type Report
Status & version Final
Number of pages 29
WP contributing to the deliverable WP0 – Project Management
WP/Task responsible WordFinder Software AB, Box 155, SE-351 04 Växjö
Phone: +46 470 700000 Fax: +46 470 700099
E-mail: ola@wordfinder.se
Author(s) Claus Povlsen, Anna Sågvall Hein, Koenraad de Smedt (with contributions from Bo Löfvendahl, Patrizia Paggio, Ola Persson, Victoria Rosén)
EC Project Officer Antonio Sanfilippo
Keywords Proof-reading, spell checking, grammar checking, Scandinavian languages, validation
Abstract The report describes the achievements made in the SCARRIE project. In particular, three demonstrator systems for proof-reading of Danish, Norwegian, and Swedish are described, together with the lingware and software that were developed for the systems. Evaluation and validation issues are also treated.

1. Executive summary

Three demonstrator systems supporting spell checking and grammar checking have been developed, one for each language. In the development of the demonstrator systems a variety of new lingware and software has been developed.

Besides having a user-specific lexical coverage, Danish SCARRIE has focused on meeting the end user requirements for treating context sensitive spelling errors. All the word forms in the general dictionary were thus matched against an end-user specific corpus before inclusion in the lexical coverage. Moreover, the compound grammar together with the assignment of information about combining potential in the dictionary has improved the lexical coverage in a precise way.

The linguistic functionality in terms of treating context sensitive spelling errors is based on a thorough analysis of spelling errors collected by the end users. The coverage of the syntactic grammar is therefore in agreement with the overall aspect in the SCARRIE project, i.e. to involve the end users in the definition of the system’s linguistic functionality.

Norwegian SCARRIE has addressed spelling and grammar correction in Norwegian Bokmål, one of the two major written norms in Norway. SCARRIE has emphasised correction in the context of different styles or written subnorms. Whereas current spelling checkers do not discriminate between different possible corrections depending on style or written norm, SCARRIE does so for Norwegian. This gives a more reliable and coherent result. Another prominent achievement is grammar checking which not only detects but also corrects various nominal and verbal agreement errors, although it must be added that the grammar does not offer unlimited coverage. Among the most appreciated achievements is better recognition of newly found compounds.

Swedish SCARRIE addresses both spell checking and grammar checking. A new software for grammar checking was developed, ScarCheck. It is based on a chart parser that was developed earlier by the Swedish partner. The grammar checker of Swedish SCARRIE targets more than thirty error types. A subset of them turned out to be productive in the validation phase. They refer to errors in the nominal phrase, errors in the adjectival phrase, errors in the verb phrase, word order errors, and erroneously split words. The selection of error types addressed by Swedish SCARRIE was made on the basis of an extensive error database including close to 9,000 authentic error tokens with corrections. Materials for the database as well as for the dictionary and the validation were supplied by the two users, Svenska Dagbladet and Upsala Nya Tidning. The errors are categorised in accordance with a detailed error typology that was also worked out in the project. For updating and maintenance of the dictionary a lexical database with a graphical interface for searching and updating was implemented. A demonstrator of Swedish SCARRIE is to be found at: http://stp.ling.uu.se/~ljo/scarrie-pub/scarrie.html

The technology baseline for spell checking in all the three systems is the Dutch software CORRie that was delivered by Cognitech. CORRie also provides the technology baseline for grammar checking in the Danish and Norwegian systems, whereas grammar checking in the Swedish system is based on ScarCheck. ScarCheck has been successfully integrated with the CORRie spell checker. The CORRie platform was extended with additional functionality to adapt to the needs of the Scandinavian languages and grammar checking. The platform was considerably rewritten and extended to make it a generic multilingual platform.

ScarCheck is a new software for grammar checking that was developed in the project by the Swedish partner. It is based on a chart parser, Uppsala Chart Processor, UCP. In addition to the chart parser, ScarCheck comprises an error scanning and message generating function ReportChart. UCP performs partial parsing and builds as much structure as the syntactic grammar allows. The resulting structures that are stored in the chart represent correct constructions as well as constructions with errors in them. Errors are assigned codes in accordance with the SCARRIE error typology. ReportChart scans the chart for errors and generates error messages accordingly. An error message comprises information about error type and error span.

Dictionaries of single word forms and phrases with grammatical information of relevance for compound analysis and syntactic analysis were produced for all the three languages, in addition to compound grammars and syntactic grammars for compound analysis and grammar checking. On average, the dictionaries include some 250,000 items. A number of grapheme-to-phoneme rules have been formulated in order to generate a phonetic representation of the above-mentioned lexicons. The purpose of the phonetic transcription is to treat real spelling errors such as heterographic homophones. The usefulness of these linguistic resources is not limited to a language checking setting.

A hierarchical error typology was worked out comprising close to 600 different error types, and error data bases of substantial sizes were produced in which the error tokens are marked in accordance with the typology. For easy search in the Swedish error database, a web-based graphical interface guiding the user through the hierarchical error typology was implemented.

A database with a graphical interface was also implemented for storing the complete set of lexical resources used by the Swedish version of SCARRIE. It was heavily used in the development and fine-tuning of the Swedish dictionary, and it is intended as a support tool for future commercial use.

In addition to the software mentioned above, two kinds of validation software were produced, as well as several interfaces for testing, developing, demonstrating, and validating the prototypes. The set of interfaces includes a Windows-based interface and three web-based interfaces.

Evaluation and validation have been key issues of the project. Like project internal evaluation, validation has largely been based on an evaluation methodology inspired by the EAGLES project (EAGLES 1996). The project has accordingly focussed on adequacy evaluation. The main object of evaluation has been the system’s linguistic functionality. However, the usability of the interface developed has also been validated by the end users.

Linguistic functionality is split up into three main attributes: coverage (recall), flagging (precision) and suggestion adequacy. Coverage subsumes lexical, grammar and error coverage, each of which is again split up into sub-attributes having to do with portions of the lexicon, grammatical constructions and error types. Suggestion adequacy subsumes sub-attributes having to do with whether the suggestion produced is correct and whether the user is presented with a diagnosis. For a discussion of the single attributes, see (Paggio & Underwood 1999) and (Paggio & Music 1988).

During project internal evaluation, dedicated test suites have been used to test and evaluate the system’s recall, precision and suggestion adequacy for different types of error. Validation, on the other hand, has been carried out by testing the system on sets of text excerpts where the errors were randomly distributed. Evaluation measures were computed automatically by way of two evaluation tools implemented by the project, Kraut and Scareval. Validation results were then analysed manually to see which error types were particularly problematic.

Although no precise goals had been set up in terms of desired recall, precision and suggestion adequacy measures, an assessment of the results obtained as regards spelling checking has been carried out by comparing the results with the measures obtained by other commercial systems in similar tests. Due to the absence of commercial grammar checkers for the Scandinavian languages, no such comparison could be carried out with regard to the grammar checking capacity of the SCARRIE prototypes.

The validation of SCARRIE has shown that the Danish version of SCARRIE is a system with the potential of becoming a viable software product on the market of proofreading systems in Denmark. The immediate potential of the Norwegian prototype is a product of higher quality than MS Word. The Swedish version of SCARRIE is a system with the potential of becoming a viable software product including spell checking and grammar checking of selected error types.

Exploitation will mainly be carried out by WordFinder, while the other partners receive royalties. The Swedish partner, UU, is considering exploiting the results in an on-going co-operation with Scania on language checking of controlled language. Another area that is being considered for exploitation is computer-assisted language learning (CALL), specifically second-language learning focusing on grammatical aspects. Finally, the results that were achieved with regard to partial parsing seem to provide an interesting starting point for research on sophisticated information retrieval and extraction.

Windows versions of SCARRIE for all the three languages will be introduced on the market according to the following schedule: Swedish version in December 99, Danish version in March 00, and Norwegian version in December 00. The Windows versions will be followed by Macintosh versions. The product will be sold via WordFinder Software AB in Sweden, WordFinder Software AS in Norway and WordFinder Software A/S in Denmark.

Strong user involvement in the project has been a deliberate strategy from the very outset. Users have contributed in a highly dedicated manner in every stage of the project, from the initial specification of user requirements all the way through phases such as definition of the linguistic functionality and deliveries of material for databases, dictionaries, etc., clear up to the concluding validation phase. The following users have been directly involved in the project: Bergen Trykk, Norway, Berlingske Tidende, Denmark, Fagbokforlaget, Norway, Munksgaard International Publishers, Denmark, Svenska Dagbladet, Sweden, and Upsala Nya Tidning, Sweden. User co-ordination has been managed by the Swedish newspaper Svenska Dagbladet.

2. Project timetable

1 Dec. 1996 Project start date

2-3 Dec. 1996 LE3 Concertation meeting in Brussels

9 Dec. 1996 Consortium meeting in Växjö

9 Dec. 1996 Consortium agreement signed

14-15 Jan. 1997 CORRie course and workshop in Copenhagen

14 Jan 1997 WP2 working meeting in Copenhagen

11 Feb 1997 Official project kick-off meeting in Luxembourg

11 Feb 1997 Consortium meeting in Luxembourg

25-26 Feb 1997 LE3 Concertation meeting in Luxembourg

21-22 May 1997 User event: SNDS newspaper conference in Billund

22 May 1997 Consortium meeting with EC Project Officer in Billund

1 June 1997 WP6 working meeting in Uppsala

30 Oct-2 Nov 1997 User event: Scandinavian Book & Library Fair in Göteborg

20 Nov 1997 Coordinator/PO meeting in Luxembourg

26 Nov 1997 Consortium meeting in Copenhagen

19 March 1998 Mid-term review in Luxembourg

31 March 1998 Consortium meeting in Uppsala

April-May 1998 Technology baseline investigation

13 May 1998 Consortium meeting in Stockholm

14-15 May 1998 User event: SNDS newspaper conference in Oslo

19 May 1998 Coordinator/PO meeting in Luxembourg

29-30 June 1998 In-depth on-site review in Växjö

July 1998 3 months project extension applied for

22-25 Oct 1998 User event: Scandinavian Book & Library Fair in Göteborg

11 Nov. 1998 3 months project extension granted

30 Nov. 1998 Original project end date

28 Feb. 1999 Extended project end date

3. Achievements

1. Demonstrator systems

Three demonstrator systems have been developed, one for each language.

1.1. Main functions supported

The demonstrator systems support spell checking and grammar checking with a slightly different emphasis on the relation between these two functions in the different language versions. Before going into the technologies and modules used by the systems, the characteristic features of the three systems are briefly mentioned below:

1.1.1. Danish SCARRIE

Besides having a user-specific lexical coverage, Danish SCARRIE has focused on meeting the end user requirements for treating context sensitive spelling errors. All the word forms in the general dictionary were thus matched against an end-user specific corpus before inclusion in the lexical coverage. Moreover, the compound grammar together with the assignment of information about combining potential in the dictionary has improved the lexical coverage in a precise way.

The linguistic functionality in terms of treating context sensitive spelling errors is based on a thorough analysis of spelling errors collected by the end users. The coverage of the syntactic grammar is therefore in agreement with the overall aspect in the SCARRIE project, i.e. to involve the end users in the definition of the system’s linguistic functionality. For a demonstration of Danish SCARRIE, see the screencam movie at SCARRIE’s shared workspace in the following folder: SCARRIE/Dissemination/The_Danish_SCARRIE_demo.

1.1.2. Norwegian SCARRIE

Norwegian SCARRIE has addressed spelling and grammar correction in Norwegian Bokmål, one of the two major written norms in Norway. SCARRIE has emphasised correction in the context of different styles or written subnorms. Whereas current spelling checkers do not discriminate between different possible corrections depending on style or written norm, SCARRIE does so for Norwegian. This gives a more reliable and coherent result. Another prominent achievement is grammar checking which not only detects but also corrects various nominal and verbal agreement errors, although it must be added that the grammar does not offer unlimited coverage. Among the most appreciated achievements is better recognition of newly found compounds.

1.1.3. Swedish SCARRIE

Swedish SCARRIE addresses both spell checking and grammar checking. However, some emphasis has been on grammar checking as being a crucial aspect for the success of SCARRIE on the Swedish market. Thus a new software for grammar checking was developed, ScarCheck. It is based on a chart parser that was developed earlier by the Swedish partner. The grammar checker of Swedish SCARRIE targets more than 30 error types. A subset of 7 turned out to be productive in the validation of a user corpus of some 15,000 words. They refer to errors in the nominal phrase, errors in the adjectival phrase, errors in the verb phrase, word order errors, and, finally, erroneously split words. The motivation for keeping the remaining error types in the target has to be based on validation studies of a larger corpus. The selection of error types to be addressed by Swedish SCARRIE was made on the basis of an extensive error database including close to 9,000 authentic error tokens with corrections. Materials for the database as well as for the dictionary and the validation were supplied by the two users, Svenska Dagbladet and Upsala Nya Tidning. The errors are categorised in accordance with a detailed error typology that was also worked out in the project (see An error typology for automatic proofreading purposes – Del 2.1). Grammar error messages are based on these error codes.

The Swedish grammar checker does not suggest corrections. Rather than focusing on error correction, the project decided to give priority to maintenance issues, notably the updating and maintenance of the dictionary, an important issue for the commercial user. This was the motivation for building a lexical database with a graphical interface for searching and updating. Via the database interface a runnable dictionary is also compiled for the checker. Should correction prove to be important in the commercial version of SCARRIE, an implementation strategy is in store. For a demonstration of Swedish SCARRIE, see section 6.1: http://stp.ling.uu.se/~ljo/scarrie-pub/scarrie.html.

1.2. Technologies

The technology baseline for spell checking in all the three systems is the Dutch software CORRie that was delivered by Cognitech. CORRie also provides the technology baseline for grammar checking in the Danish and Norwegian systems, whereas grammar checking in the Swedish system is based on ScarCheck. ScarCheck has been successfully integrated with the CORRie spell checker.

1.2.1. CORRie

The CORRie platform was extended with additional functionality to adapt to the needs of the Scandinavian languages and grammar checking. The platform was considerably rewritten and extended to make it a generic multilingual platform. The following extensions of its functionality are direct results of the project:

  • possibility of full parsing with error weights, partial parsing, or interface to external parser
  • compound grammar based on multiple regular expressions
  • grapheme-to-phoneme module with multiple-level rules
  • support for partitioning of vocabulary in style registers and written norms
  • multiple dictionaries in user-readable format
  • multi-word expressions with style codes and grammatical information
  • recognition of split compounds and incorrectly joined words
  • support for different ways of character and mark-up encoding

1.2.2. ScarCheck

ScarCheck is a new software for grammar checking that was developed in the project by the Swedish partner. It is based on a chart parser, Uppsala Chart Processor, UCP. In addition to the chart parser, ScarCheck comprises an error scanning and message generating function ReportChart. UCP performs partial parsing and builds as much structure as the syntactic grammar allows. The resulting structures that are stored in the chart represent correct constructions as well as constructions with errors in them. Errors are assigned codes in accordance with the SCARRIE error typology. ReportChart scans the chart for errors and generates error messages accordingly. An error message comprises information about error type and error span. In scanning the chart ReportChart selects the longest possible span, going from left to right. If there are two edges spanning the same part of the input, one with an error in it and one with no error, ReportChart selects the correct one ignoring the erroneous alternative. These are the two basic strategies that the grammar writer may use in order to avoid (override) false alarms.
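The two selection strategies can be illustrated with a minimal Python sketch. It is not the ScarCheck implementation: the Edge class, the report_errors function and the error code ERR_X are hypothetical, and the chart is reduced to a flat list of edges over word positions. Only the code GPAGNA03 is taken from this report.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    start: int                         # index of the first word covered
    end: int                           # index one past the last word covered
    error_code: Optional[str] = None   # None means a correct construction


def report_errors(edges: List[Edge], n_words: int) -> List[Edge]:
    """Scan left to right, taking the longest edge at each position.

    If two edges cover the same span, the error-free edge is preferred,
    so the erroneous alternative is ignored.
    """
    reports = []
    pos = 0
    while pos < n_words:
        candidates = [e for e in edges if e.start == pos and e.end > pos]
        if not candidates:
            pos += 1                   # no structure starts here, move on
            continue
        # Longest span first; on equal spans prefer the edge without an error.
        best = max(candidates, key=lambda e: (e.end, e.error_code is None))
        if best.error_code is not None:
            reports.append(best)
        pos = best.end
    return reports


# Hypothetical chart for a six-word sentence.
edges = [
    Edge(0, 2),
    Edge(2, 5, "GPAGNA03"),   # longest edge from position 2, contains an error
    Edge(2, 4),
    Edge(5, 6, "ERR_X"),      # same span as the correct edge below
    Edge(5, 6),
]
reports = report_errors(edges, n_words=6)
```

In this sketch only the edge from position 2 to 5 is reported. The erroneous edge at position 5 is suppressed because an error-free edge covers the same span, which is the mechanism a grammar writer would rely on to override a false alarm.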

The input to ScarCheck is a list of lemmas and grammatical codes representing the words in the sentence. Typically, there are several alternatives for each word. The input is generated in the interface between CORRie and ScarCheck. It is forwarded to the grammar checker, sentence by sentence. A total of 365 different codes are used.

1.2.3. Software for dictionary maintenance

A lexical database, ScarrieLex, comprising all the different lexical resources used by Swedish SCARRIE was developed by the Swedish partner. It may be searched and updated via a graphical interface. It includes single words as well as phrases, approved words and phrases as well as unapproved words and phrases with suggestions for corrections. Also the grammatical codes in terms of which information about single words is forwarded to the grammar checker are stored in the database. By means of an export function accessible via the interface, a compiled version of the dictionary in the format required by SCARRIE is automatically generated.

1.2.4. Software for grammar development

A www-based software for testing ScarCheck has been developed. The user inputs the sentence to be checked as a string of codes and lemmas, and the program returns the complete chart. Input strings for working with the software are generated via the SCARRIE demonstration interface, see below. Mostly, the information provided by the chart is sufficient for tracing the grammatical analysis. In addition to that, however, there are several tracing options. For instance, a trace of every step of every task may be displayed.

Location http://stp.ling.uu.se/~ljo/scarrie-pub/ucp_light.html

1.2.5. Software for demonstration and testing

A www-based interface for demonstrating Swedish SCARRIE to external users has been developed. Via this interface, the user may test the program one sentence at a time. The user inputs a sentence, and the system returns it with the errors highlighted. Red is used for word errors, green for grammar errors, and yellow for some types of typographical errors such as multiple spaces and for compounds that are identified via the compound analysis and which the system finds questionable. The motivation for highlighting compounds is to point out to the user that he should verify the adequacy of the analysis. When the user points at a highlighted segment, an error message is given.

For instance, if the sentence “Det nya fyravåningshuset var vacker.” is analysed, the segment “fyravåningshuset var vacker”, representing a grammatical error, will be flashed green. If the green segment is pointed at, the following error message will be displayed: GPAGNA03 wrong gender in the complement.

Location: http://stp.ling.uu.se/~ljo/scarrie-pub/scarrie.html

The developer may work in a similar interface, where not only single sentences but full texts may be analysed. In this interface the input string to the grammar checker in terms of codes and lemmas may be displayed. This is an important option, since, certainly, errors in the input to the grammar checker are fatal.

Location: http://stp.ling.uu.se/~ljo/scarrie/scarrie.html

A Windows-based interface, Corriewin, was developed by WordFinder for testing and demonstrating the systems.

1.2.6. Software and hardware requirements

The Danish system runs under Windows, using the Corriewin interface. The Norwegian and the Swedish systems run on the Unix platform. The Swedish users do not work with Windows, and thus it was decided to remain on the Unix platform during the project phase. Access to the prototype is granted via three web-based interfaces, see above. During the exploitation phase, a decision will be made as to the end-user environment.

1.2.7. Software for automatic validation

The involvement of end-users in SCARRIE has, among other things, resulted in access to a set of parallel unedited/proofread texts, which has made it possible to automate the evaluation of the proofreading tools’ linguistic functionality with respect to these corpora. Two evaluation programs were implemented for this purpose, Kraut and Scareval.

Kraut was developed by CST. It is a Perl program which takes as arguments the filenames corresponding to the proofreading tool output and the manually proofread version of the text (all in ASCII format). It aligns the texts, then compares the original input, the proofreading tool’s suggested corrections, and the manually proofread version, deriving information on what the valid and invalid words of the input text are, as well as how well the system identified these and whether it suggested appropriate corrections. In this way, the results produced by the running prototypes of the proofreading tools in terms of recall, precision and suggestion adequacy can be tested and evaluated automatically.
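The kind of computation described above can be sketched in Python as follows. This is not Kraut itself, which is a Perl program. The sketch assumes that the three text versions have already been aligned token by token, that a token counts as an error when the proofreader changed it, and that a missing suggestion (None) means the system did not flag the token. The function name evaluate and the sample tokens are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (original token, system suggestion or None if not flagged, proofread token)
Triple = Tuple[str, Optional[str], str]


def evaluate(aligned: List[Triple]) -> Dict[str, float]:
    errors = flagged = hits = good = 0
    for original, suggestion, proofread in aligned:
        is_error = original != proofread      # the human proofreader changed it
        is_flagged = suggestion is not None   # the system reacted to it
        errors += is_error
        flagged += is_flagged
        if is_error and is_flagged:
            hits += 1
            good += suggestion == proofread   # suggestion matches the proofreader
    return {
        "recall": hits / errors if errors else 0.0,
        "precision": hits / flagged if flagged else 0.0,
        "suggestion_adequacy": good / hits if hits else 0.0,
    }


# Invented sample: four errors, four flags, three correctly flagged errors,
# two of which received the proofreader's correction as suggestion.
aligned = [
    ("et", None, "et"),
    ("stor", "stort", "stort"),
    ("huss", "hus", "hus"),
    ("blev", "blive", "bliver"),
    ("ligge", None, "lægge"),
    ("Berlingske", "Berlinske", "Berlingske"),
]
scores = evaluate(aligned)
```

A checker that flags errors without proposing a correction, as the Swedish grammar checker does, would need a separate flag field; the sketch leaves that case out for brevity.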

Scareval was developed by UU. It supports the analysis of the grammar errors reported by Swedish SCARRIE by presenting four kinds of output: errors detected by SCARRIE and the human proofreader and assigned the same error code, errors detected by SCARRIE and the human proofreader and assigned different error codes, errors detected by SCARRIE only, and errors detected by the human proofreader only. In addition to tables with figures on error code occurrences, Scareval presents the sentences in which the errors are found for manual inspection. Prior to the application of the program, errors in the manually proofread text have to be assigned error codes, and this text version is used by Scareval as a “gold standard”. It does happen, however, that SCARRIE spots real errors that are overlooked by the human proofreader; thus the gold standard is, in fact, only an approximate standard. The errors detected by SCARRIE only have to be examined carefully to distinguish real errors from false alarms. Errors assigned different error codes by SCARRIE and the human proofreader are also an important source of information in the validation process.
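The four-way comparison can be pictured with a short Python sketch. It is an illustration only, not the Scareval code. It assumes that each error is identified by its sentence number and word span and that two errors are the same error only if these match exactly; the function name compare and the codes ERR_A to ERR_D are hypothetical, while GPAGNA03 is taken from this report.

```python
from typing import Dict, Set, Tuple

Key = Tuple[int, int, int]   # (sentence number, first word, last word)


def compare(system: Dict[Key, str], human: Dict[Key, str]) -> Dict[str, Set[Key]]:
    both = system.keys() & human.keys()
    same_code = {k for k in both if system[k] == human[k]}
    return {
        "same_code": same_code,                       # found by both, same code
        "different_code": both - same_code,           # found by both, codes differ
        "system_only": system.keys() - human.keys(),  # real errors or false alarms
        "human_only": human.keys() - system.keys(),   # errors the system missed
    }


system = {(1, 2, 4): "GPAGNA03", (2, 0, 1): "ERR_A", (3, 5, 5): "ERR_B"}
human = {(1, 2, 4): "GPAGNA03", (2, 0, 1): "ERR_C", (4, 1, 2): "ERR_D"}
result = compare(system, human)
```

The "system_only" group is the one that needs manual inspection, since it mixes false alarms with real errors that the proofreader overlooked.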

2. Components

2.1. Lexicons

2.1.1. Danish lexicons

2.1.1.1. The Danish full form lexicon

The main idea behind defining the lexical coverage of the Danish full form dictionary was that the selection of entries should be fully corpus based. So instead of simply taking all the expanded lemmas from various “official” dictionaries and including them in the Danish full form dictionary, it was examined whether the full forms had been used in real life, as represented by a domain-relevant text corpus. This text corpus was composed of one volume of three Danish papers issued by Berlingske Hus and consisted of about 26 million tokens. The result of the matching process, i.e. the set of full forms common to the dictionary and the text corpus, was included in the lexicon. Subsequently, frequently occurring full forms and proper nouns in the text corpus were identified and included as well. The final number of full form entries ended up being approximately 120,000.
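The selection procedure amounts to an intersection followed by a frequency-based addition, which the following Python sketch illustrates. The function name select_full_forms, the threshold min_freq and the sample data are assumptions made for the example; the report does not state the frequency criterion that was actually used, and the separate identification of proper nouns is not modelled.

```python
from collections import Counter
from typing import Iterable, Set


def select_full_forms(dictionary_forms: Set[str],
                      corpus_tokens: Iterable[str],
                      min_freq: int) -> Set[str]:
    counts = Counter(corpus_tokens)
    # Step 1: keep only dictionary forms that are attested in the corpus.
    attested = {form for form in dictionary_forms if counts[form] > 0}
    # Step 2: add frequent corpus forms that the dictionary lacks
    # (min_freq is a hypothetical cut-off).
    frequent = {token for token, n in counts.items() if n >= min_freq}
    return attested | frequent


dictionary_forms = {"hus", "huset", "abekat"}
corpus_tokens = ["huset", "Berlingske", "Berlingske", "Berlingske", "hus", "xyz"]
lexicon = select_full_forms(dictionary_forms, corpus_tokens, min_freq=3)
```

In the sample, "abekat" is dropped because it never occurs in the corpus, while "Berlingske" is added because it is frequent even though the dictionary does not list it.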

The information types tagged at each entry are part of speech, number, gender, case and frequency, and, for nouns, information about potential combinations with other nouns (cf. the description below of the compound grammar).

2.1.1.2. The Danish idiom lexicon

The idiom list was made by extracting idioms from existing dictionaries of Danish and then deleting forms that did not occur in the corpus. In this way, the idiom list contains a considerable number of domain-relevant forms. The procedure was the following: first idioms were identified by collecting all lemmas containing one or more blanks and all complex prepositions (to the forms thus obtained were added more idioms coming from CST internal lists). This resulted in a preliminary list of 1433 idioms, comprising prepositions, adverbs, common and proper nouns. Then the list was expanded morphologically (e.g. to include all forms of compound nouns) and then matched against the corpus to prune out non-occurring forms. The final result is a list of 717 idioms.

2.1.1.3. Phonetic lexicon

A number of grapheme-to-phoneme rules have been formulated in order to generate a phonetic representation of the above-mentioned lexicon. The purpose of the phonetic transcription is to treat real spelling errors such as heterographic homophones (for a thorough description see Molbæk Hansen 1999).
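How a phonetic representation helps with heterographic homophones can be shown with a toy Python sketch. The two rules below (silent h before v and j at the start of a word) are illustrative only and are not the project's rule set, which works with multiple rule levels; the names RULES, to_phonetic and homophone_candidates are hypothetical.

```python
import re
from typing import Iterable, List

# Toy grapheme-to-phoneme rules, applied in order (illustrative only).
RULES = [
    (r"^hv", "v"),   # word-initial "hv" is treated as "v"
    (r"^hj", "j"),   # word-initial "hj" is treated as "j"
]


def to_phonetic(word: str) -> str:
    key = word.lower()
    for pattern, replacement in RULES:
        key = re.sub(pattern, replacement, key)
    return key


def homophone_candidates(word: str, lexicon: Iterable[str]) -> List[str]:
    """Return other lexicon words that share the phonetic key of `word`."""
    key = to_phonetic(word)
    return sorted(w for w in lexicon if w != word and to_phonetic(w) == key)


lexicon = ["hjul", "jul", "hvid", "vid", "hus"]
candidates = homophone_candidates("jul", lexicon)
```

A correctly spelled word that is wrong in its context can thus be linked to differently spelled words with the same phonetic key, which a grammar can then test against the context.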

2.1.2. Norwegian lexicons

The lexical information for Norwegian has been coded in several word lists. The main lexicon comprises open class words for Bokmål: adjectives, adverbs, nouns and main verbs. This dictionary contains 360,933 wordform entries organised in 72,626 lemmas (corresponding to citation forms). This means that for each citation form, on the average 5 inflected word forms are stored. Additional separate word lists have been made for closed class (grammatical) words, affixes, abbreviations and words occurring only in multi-word expressions.

The total coverage of the dictionaries corresponds closely to the words in Bokmålsordboka, which is the official dictionary of the language written by the University of Oslo. For texts not requiring specific terminology, this coverage turned out to be satisfactory during the validation, thanks to the fact that SCARRIE has mechanisms for identifying newly found compounds and proper names.

The Norwegian lexicons for SCARRIE have been provided with new information not available before, specifically verb subcategorization and lexical variants (style replacements). The dictionaries with inflected forms contain massive information for replacement under given styles. In the main dictionary, 136,048 of the entries, which is more than 1/3, are replaced under one or more given styles. The style encoding has been done manually both for lexical stylistic choices (e.g. the choice between snø and sne “snow”) and for morphological paradigms (e.g. the choice between the alternate suffixes -a and -et in the preterite and past participle in verbs like kaste “to throw”, kasta/kastet).
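Style-dependent replacement can be pictured as a lookup table per style, as in the following Python sketch. The style labels style_a and style_b are placeholders, and the table layout is an assumption made for the example rather than the format of the SCARRIE dictionaries; only the word pairs snø/sne and kasta/kastet come from this report.

```python
from typing import Dict, List

# For each (hypothetical) style, the forms that are replaced under that style.
REPLACEMENTS: Dict[str, Dict[str, str]] = {
    "style_a": {"snø": "sne", "kasta": "kastet"},
    "style_b": {"sne": "snø", "kastet": "kasta"},
}


def normalise(tokens: List[str], style: str) -> List[str]:
    """Replace every form that is not accepted under the chosen style."""
    table = REPLACEMENTS[style]
    return [table.get(token, token) for token in tokens]


result_a = normalise(["snø", "kasta", "hus"], "style_a")
result_b = normalise(["sne", "kastet", "hus"], "style_b")
```

Keeping the replacements per style is what allows a checker to propose only corrections that are coherent with the style the user has chosen.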

2.1.3. Swedish lexicons

2.1.3.1. The main lexicon

The SCARRIE word form dictionary is based on a corpus of newspaper text. It comprises all the articles that were published in 1995 and 1996 by the two prominent Swedish newspapers Svenska Dagbladet and Upsala Nya Tidning. The SvD/UNT corpus holds over 220,000 articles and consists of more than 70 million tokens and 1.5 million types. The texts were dumped in ANSI format from the database archives of the two newspapers. This huge corpus material was segmented into tokens and types, word lists were formed and sorted, and word forms were grouped into categories according to the types of characters they contain. This sub-categorisation provided a starting-point for a rough process of approval or rejection for further dictionary processing. In this process, frequency was also taken into account. Based on frequency and the character-based sub-categorisation, the word types were referred to one of the following basic groups: a) Irrelevant for the dictionary b) Dictionary candidates c) Misspellings. The most frequent types (350,000) of b) were extracted for further morphological processing. They constitute the basis for the first version of the Swedish word-form dictionary for SCARRIE. The character-based classification of the word types of the corpus proved helpful in identifying well-formed and not so well-formed types for further morphological processing before the words were entered into the dictionary.
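The character-based grouping of word types can be sketched in Python as follows. The character classes and the names char_signature and group_types are assumptions made for the illustration; the report does not list the categories that were actually used, nor how frequency was weighed against them.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def char_signature(word: str) -> str:
    """Describe a word type by the kinds of characters it contains."""
    classes = set()
    for ch in word:
        if ch.isalpha():
            classes.add("upper" if ch.isupper() else "lower")
        elif ch.isdigit():
            classes.add("digit")
        elif ch == "-":
            classes.add("hyphen")
        else:
            classes.add("other")
    return "+".join(sorted(classes))


def group_types(tokens: Iterable[str]) -> Dict[str, List[Tuple[str, int]]]:
    """Group word types by signature, most frequent types first."""
    counts = Counter(tokens)
    groups = defaultdict(list)
    for word, freq in counts.most_common():
        groups[char_signature(word)].append((word, freq))
    return dict(groups)


tokens = ["Upsala", "tidning", "tidning", "1995", "e-post", "tidn1ng"]
groups = group_types(tokens)
```

A group such as "digit+lower" with low frequencies is a natural place to look for misspellings, whereas frequent all-lowercase types are likely dictionary candidates.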

The Swedish lexical material is stored in ScarrieLex, a lexical database. It comprises 257,136 single word forms, and 4,899 phrases. 256,930 of the word forms are approved words, and 139 are minus words with suggestions for replacements. The word forms are marked with respect to style and user conventions. Typically a word form record in ScarrieLex contains information about lemma, word class, and morpho-syntactic features. Information about morpho-syntactic features and a few semantic aspects is wrapped up in a grammatical code. All in all, there are 358 different codes. Information about compounding properties is also stored in the lexical database.

2.1.3.2. Phonetic lexicon

A number of grapheme-to-phoneme rules have been formulated in order to generate a phonetic representation of the above-mentioned lexicon. The purpose of the phonetic transcription is to treat real spelling errors such as heterographic homophones.

2.2. Grammars

2.2.1. Danish grammars

2.2.1.1. Compound grammar

Due to the huge number of exceptions with respect to compounding in Danish, a general compound grammar would result in the acceptance of many invalid compounds. For this reason, additional information about the combinatory potential of nouns in compounds has been added to the dictionary. For each noun, a feature expressing which combining element the noun requires is thus coded in the dictionary. This information has been exploited in the compound grammar rules for Danish (expressed as regular expressions) and has substantially increased the system’s ability to distinguish between invalid and valid compounds.
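The idea of a lexically controlled combining element can be illustrated with a small Python sketch that builds a regular expression from dictionary information. The toy lexicon, the names LINKER, HEADS and is_valid_compound, and the restriction to two-part compounds are assumptions made for the example and do not reproduce the SCARRIE rule format.

```python
import re

# Toy dictionary: first-part nouns with the combining element each requires.
LINKER = {"dag": "s", "hund": "e", "bil": ""}
# Toy list of nouns that may occur as the last part.
HEADS = {"lys", "hus", "dør"}

_firsts = "|".join(re.escape(noun + link) for noun, link in LINKER.items())
_heads = "|".join(re.escape(head) for head in sorted(HEADS))
COMPOUND = re.compile(rf"^(?:{_firsts})(?:{_heads})$")


def is_valid_compound(word: str) -> bool:
    """Accept a compound only if the first part carries its required linker."""
    return COMPOUND.match(word) is not None


accepted = [w for w in ["dagslys", "daglys", "hundehus", "hundhus", "bildør"]
            if is_valid_compound(w)]
```

A general rule of the form noun plus optional linker plus noun would accept all five test words. With the per-noun feature, only the forms with the required combining element pass.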

2.2.1.2. Syntactic grammar

The grammar focuses on the treatment of the most common types of context sensitive spelling errors in the Danish corpus, i.e. NP internal agreement and wrong verb form combinations (as well as missing finite verb). The application of the grammar results in a fast shallow syntactic analysis of each sentence in which the various constituents found are attached under the topmost sentence node as fragments.

The grammar consists of two types of grammar rules for treating feature mismatches and structural errors, respectively. The feature overriding mechanism makes it possible for the system when applying a grammar rule to suggest a correct replacement in cases of feature inconsistency between elements in constituents (e.g. agreement errors in nominal phrases). Structural errors in which no unambiguous correction can be suggested are captured by error rules. In both cases, the system relies on weights to find the best possible analysis. The output of these error rules is presented explicitly to the user as an error diagnosis in a specific log file (for a thorough description of the treatment of context sensitive spelling errors in the Danish pilot system see Paggio 1999).
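The feature overriding idea can be sketched for a two-word nominal phrase. The lexicon, the weight values and the function `check_np` are hypothetical; the sketch only shows the principle of accepting an inconsistent phrase at a cost and proposing a form with the matching feature, not the project's rule formalism.

```python
# Toy lexicon (assumption): form -> (lemma, category, gender).
LEXICON = {
    "en": ("en", "det", "common"),
    "et": ("en", "det", "neuter"),
    "hus": ("hus", "noun", "neuter"),
    "bil": ("bil", "noun", "common"),
}

def forms_of(lemma, category, gender):
    """Return all forms of a lemma that carry the requested gender."""
    return [
        form
        for form, entry in LEXICON.items()
        if entry == (lemma, category, gender)
    ]

def check_np(det, noun):
    """Return (weight, suggestion); weight 0 means the NP is accepted as it stands."""
    det_lemma, _, det_gender = LEXICON[det]
    _, _, noun_gender = LEXICON[noun]
    if det_gender == noun_gender:
        return 0, None
    # Feature overriding: accept the NP at a cost and propose the determiner
    # form whose gender agrees with the head noun.
    replacements = forms_of(det_lemma, "det", noun_gender)
    return 1, (f"{replacements[0]} {noun}" if replacements else None)

print(check_np("en", "hus"))
print(check_np("en", "bil"))
```

With several competing analyses, the one with the lowest total weight would be preferred, so that an analysis without overriding wins whenever one exists.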

The results obtained by running the grammar on the test suites generated show that the rules perform very well at least on contexts of not too great complexity.

2.2.2. Norwegian grammars

2.2.2.1. Compound grammar

The Norwegian compound grammar has wide but selective coverage thanks to its reliance on a detailed coding of categories and features in the lexicon. The compound rules allow the combination of open-class words only if they occur with the right features, e.g. a plural noun can be the last part, but normally not the first part of a compound. In addition, there are separate codings for prefixes and suffixes, digits, and compound numbers. The binding morphemes e and s are handled in the rules, although their behaviour is not completely predictable. The reduction of triple consonants to double ones (e.g. klubb+ball is written as klubball) is adequately handled by special rules.
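The triple-consonant reduction can be illustrated with a short sketch. The function name and the consonant inventory are assumptions, and the sketch covers only this one spelling rule, not the binding morphemes or feature checks.

```python
CONSONANTS = set("bcdfghjklmnpqrstvwxz")

def join_compound(first, second):
    """Concatenate two compound parts, reducing a triple consonant to a double one."""
    if (
        len(first) >= 2
        and second
        and first[-1] == first[-2] == second[0]
        and first[-1] in CONSONANTS
    ):
        return first + second[1:]
    return first + second

print(join_compound("klubb", "ball"))
```

For analysis rather than generation, the rule has to be applied in reverse: when a compound contains a double consonant at a possible boundary, the analyser also tries a split that restores the third consonant.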

2.2.2.2. Syntactic grammar for sentence-level correction

The syntactic grammar focuses on useful mechanisms for detecting as well as correcting several kinds of errors in the NP and the VP. For this purpose, it was decided to attempt full sentence parsing rather than partial parsing. This is motivated by the fact that it is notoriously difficult to reliably identify noun phrases or to find whether a verb should be finite or non-finite without having a full structural account of the sentence. The following kinds of grammatical errors can be automatically corrected by the Norwegian SCARRIE grammar:

  1. Lack of gender, number and/or definiteness agreement between (a) determiner, adjective phrase and noun in NP, (b) subject or object and nominal or adjectival complement in S, and (c) noun and postposed possessive in NP.
  2. Errors involving (a) the wrong sequence of verb forms in VPs and (b) finite vs. non-finite verb forms.
  3. Errors involving case forms for object pronouns in topicalized position and for corresponding subject pronouns in inverted position.

An error type that is correctly handled by the grammar but which cannot be automatically corrected in SCARRIE is the omission of s in a genitive NP. Such errors cannot be corrected because, although the compound grammar correctly identifies genitives, they are not associated with specific lemmas; the program therefore cannot find another form of the same lemma with the appropriate grammatical features.

An extraordinary feature of Norwegian SCARRIE is that a single grammar is used to enforce agreement for four different gender systems. This is achieved by using different translation tables providing each lexical item with up to four different grammatical categories and features, depending on style or written norm.
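The translation-table mechanism can be sketched as a lookup that maps one lexical code to different grammatical genders depending on the selected norm. Only two hypothetical tables are shown, and the table names, lexical codes and word list are assumptions for illustration; the report does not list the four actual gender systems.

```python
# Hypothetical translation tables: lexical code -> gender used by the grammar.
TRANSLATION_TABLES = {
    "three_gender": {"N_FEM": "f", "N_MASC": "m", "N_NEUT": "n"},
    "two_gender": {"N_FEM": "m", "N_MASC": "m", "N_NEUT": "n"},
}
# Toy lexicon with a single code per noun, independent of norm.
LEXICON = {"bok": "N_FEM", "bil": "N_MASC", "hus": "N_NEUT"}
DETERMINERS = {"ei": "f", "en": "m", "et": "n"}

def agrees(det, noun, norm):
    """One agreement rule; the norm only changes the table that is consulted."""
    noun_gender = TRANSLATION_TABLES[norm][LEXICON[noun]]
    return DETERMINERS[det] == noun_gender

print(agrees("ei", "bok", "three_gender"), agrees("ei", "bok", "two_gender"))
```

The agreement rule itself is identical for every norm. Only the table changes, which is what allows a single grammar to serve several written norms.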

The grammar has good coverage but is not exhaustive. Failing to find a parse is, however, not disastrous, since the system then merely overlooks an error. However, homonymy combined with extensive interaction between grammar rules leads to many alternative parses, among which it is hard for the system to identify the correct one. This may lead to wrong suggestions for correction.

2.2.3. Swedish grammars

2.2.3.1. Compounding grammar

The Swedish compound grammar has wide but selective coverage thanks to its reliance on a detailed coding of categories and features in the lexicon. The compound rules allow the combination of open-class words only if they occur with the right features, e.g. a plural noun can be the last part, but normally not the first part of a compound. In addition, there are separate codings for prefixes and suffixes, digits, and compound numbers. The binding morphemes e and s are handled in the rules, although their behaviour is not completely predictable. The reduction of triple consonants to double ones (e.g. upp+pushade is written as uppushade) is adequately handled by special rules.

2.2.3.2. Syntactic grammar

The grammar is a phrase structure type grammar. It is at the same time a positive and a negative grammar. If applied to a correct text, it will provide exactly the same kinds of descriptions as a traditional phrase structure grammar. It contains rules for the recognition of NPs, APs, AdvPs, VPs (in a limited sense), and PPs. Clause rules for the recognition of clause fragments are also included. The rules are not weighted. The parser, UCP, builds as much structure as the grammar allows, and the error scanning mechanism interprets the chart. So far, valency information is not included in the grammar in a systematic way. The grammar rules are formulated in the procedural UCP formalism. It is to a large extent unification-based.

The grammar may be used for partial parsing outside a grammar checking environment. An example of a future application is partial parsing as a basis for sophisticated information extraction. A sister grammar is used by the parsing component of a machine-translation system that was also developed by the Swedish partner. In this grammar valency information is heavily used.

3. Other achievements

3.1. Error data bases

3.1.1. The Danish error database

In the initial phase of the project, a text corpus consisting of real-life spelling errors was provided by the SCARRIE end users. The text corpus was a parallel corpus of unedited and manually proofread texts; automatic alignment of the two versions eased the analysis of the spelling errors. The classification of the Danish spelling errors identified was based on the error typology made by the Swedish partner in SCARRIE (cf. Rambell 1998). The result of the analysis was an error database containing about 1120 classified spelling errors. These data are unique in the sense that no other information of this kind seems to have been collected and classified systematically.
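Extracting corrections from a parallel corpus can be sketched with token-level alignment. The example uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; the alignment method actually used in the project is not described in this report, and the sentence pair is invented.

```python
from difflib import SequenceMatcher

def extract_corrections(unedited, proofread):
    """Return (original, corrected) token spans where the two versions differ."""
    a, b = unedited.split(), proofread.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

print(extract_corrections("et stor hus ved vandet", "et stort hus ved vandet"))
```

Each extracted pair is a candidate error with its correction, which can then be classified manually according to the error typology.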

3.1.2. The Norwegian error database

A database was constructed containing 630 actual errors collected in proofreading books and newspapers. Each error was classified according to a typology which, for Norwegian, was extended with a detailed subtypology for actual spelling errors, as opposed to punctuation and other changes. One important finding is that the error corpus seems to be much too small to provide any kind of representative picture of the distribution of errors in unproofread prose. Constructing the error database was the least useful and most time wasting of all tasks in the project.

3.1.3. The Swedish error database

The Swedish error database, ECD, consists of two components, a relational database for data storage and a www-interface for searching and updating the base. In the database erroneous and corrected versions of text fragments, often full sentences, are stored. An error type code has been assigned to each error according to the SCARRIE error typology. Information about newspaper, publishing date, text section (e.g. domestic news, editorial) and text type (e.g. headline, plain text) is also given. The error database comprises close to 9,000 error tokens. Basic materials for the database were delivered by the two newspapers, Svenska Dagbladet (SvD) and Upsala Nya Tidning (UNT). SvD delivered proofread and uncorrected text electronically, and via alignment the corrected sentences were identified and extracted. The corrections were then manually marked in accordance with the error typology. UNT delivered paper proofs, and the error fragments with their corrections had to be registered in the base manually. The error typology is quite complex, and via the www-interface the user is guided through the error code hierarchy. From the error database detailed error frequencies were produced (see An Error Database of Swedish – Del 2.1.3.2). They were used as a basis for selecting the error types to be targeted by Swedish SCARRIE. The error database also provided examples for development, testing and demonstration.

Location: http://strindberg.ling.uu.se/cgi-bin/w3-msql/ECD/index.html
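The kind of record stored in the error database, and the way error frequencies can be derived from it, can be sketched with an in-memory relational table. SQLite is used here purely as a convenient stand-in; the table layout, column names, example rows and the lower levels of the error codes are assumptions, and only the two-character top-level categories come from the SCARRIE typology.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE error_token (
        id INTEGER PRIMARY KEY,
        erroneous TEXT NOT NULL,
        corrected TEXT NOT NULL,
        error_code TEXT NOT NULL,   -- 8-character error type code
        newspaper TEXT,
        published TEXT,
        section TEXT,
        text_type TEXT
    )
    """
)

# Invented example rows; the codes below the top level are placeholders.
rows = [
    ("en cyckel", "en cykel", "SE010101", "SvD", "1996-03-01", "domestic news", "plain text"),
    ("en hus", "ett hus", "GP010101", "UNT", "1996-03-02", "editorial", "plain text"),
    ("biblotek", "bibliotek", "SE010102", "UNT", "1996-03-02", "domestic news", "headline"),
]
conn.executemany(
    "INSERT INTO error_token "
    "(erroneous, corrected, error_code, newspaper, published, section, text_type) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)

def error_frequencies(connection):
    """Count error tokens per top-level category, most frequent first."""
    cursor = connection.execute(
        "SELECT substr(error_code, 1, 2) AS top_level, COUNT(*) "
        "FROM error_token GROUP BY top_level ORDER BY COUNT(*) DESC, top_level"
    )
    return cursor.fetchall()

print(error_frequencies(conn))
```

Frequency tables of this kind, computed at any level of the code hierarchy, are the sort of evidence that can guide the choice of error types to target.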

3.2. An error typology

The error typology is a hierarchically organised classification system of all kinds of language-related errors found in the newspaper material. The hierarchy has four levels. Each level is given a 2-character code, resulting in an 8-character error type code to be assigned to every token in the error database. At the top level there are five different categories, referring to spelling errors (SE), grammar problems (GP), punctuation problems (PU), graphical problems (GR), and, finally, style, meaning, and reference (SP). The total number of error types in each major group is:

  • SE 54
  • GP 314
  • PU 70
  • GR 79
  • SP 56

The main focus of the typology has been on the recognition of errors, that is, on the information needed to detect and recognise different language errors. The correction made by the proofreader and the performance of the proofreading tool have also been taken into consideration when constructing the error typology. The error cause has been given the lowest priority.
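The structure of the error type code (four levels of two characters each) can be made concrete with a small sketch. The top-level codes and the counts per group are taken from the list above; the example code `GPNPAGDF` and its lower-level parts are invented placeholders.

```python
TOP_LEVEL = {
    "SE": "spelling errors",
    "GP": "grammar problems",
    "PU": "punctuation problems",
    "GR": "graphical problems",
    "SP": "style, meaning, and reference",
}
TYPE_COUNTS = {"SE": 54, "GP": 314, "PU": 70, "GR": 79, "SP": 56}

def split_error_code(code):
    """Split an 8-character error type code into its four 2-character levels."""
    if len(code) != 8:
        raise ValueError("an error type code has exactly 8 characters")
    levels = [code[i:i + 2] for i in range(0, 8, 2)]
    if levels[0] not in TOP_LEVEL:
        raise ValueError(f"unknown top-level category: {levels[0]}")
    return levels

print(split_error_code("GPNPAGDF"))   # hypothetical code
print(sum(TYPE_COUNTS.values()))      # total number of error types: 573
```

Because every level has a fixed width, any prefix of a code identifies a whole subtree of the typology, which makes it easy to aggregate or filter errors at a chosen level of detail.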

4. Other results

4.1. Project level dissemination/LE concertation

LE3 Concertation Meeting in Brussels, 2-3 December 1996 (Ola Persson, WF, and Koenraad de Smedt, HIT).

LE3 Concertation Meeting in Luxembourg, 25-26 Feb. 1997, including project presentation (Berith Brännström, WF).

Project Officer “hand-over” meeting in Luxembourg 20 Nov. 1997 with resigning EC Project Officer Iain Urquhart, new Project Officer Paul-Pierre Sondag and asst. Project Officer Ray Hudson (Berith Brännström, WF).

Mid-term review in Luxembourg, 17 March 1998 (Ola Persson, WF, and Anna Sågvall Hein, UU).

EC/Coordinator meeting in preparation for the in-depth review, Luxembourg, 19 May 1998 (Ola Persson, WF, Anna Sågvall Hein, UU, Claus Povlsen, CST).

On-site in-depth review in Växjö, 29-30 June, 1998 with reviewer Stelios Piperidis, EC Project officers Ray Hudson and Antonio Sanfilippo (all SCARRIE management).

4.2. User contacts

User coordination has been managed by the Swedish newspaper Svenska Dagbladet.

An Internet user forum has been established: the SCARRIE User List is a closed unmoderated discussion list related to the SCARRIE project. The list has averaged approx. 60 subscribers from all three Scandinavian countries (plus Swedish-language Finnish newspapers).

SCARRIE has been represented at the yearly SNDS (Society for Newspaper Design/Scandinavia) conferences: May 1997, in Billund, Denmark (with project presentation), and May 1998, in Oslo, Norway.

SCARRIE has also been represented at the yearly Scandinavian Book & Library exhibition in Göteborg, Sweden, November 1997 and October 1998.

Separate meetings where SCARRIE was discussed have been held with other major Scandinavian newspapers, viz. Aftenposten (Oslo), Hufvudstadsbladet (Helsinki) and Dagens Nyheter (Stockholm).

4.3. General promotion

A SCARRIE brochure with basic information about the project and references to other sources of information has been produced. Distribution has included all Scandinavian newspapers and printing houses.

A Powerpoint presentation of the SCARRIE project has been produced and used on a number of occasions.

WordFinder Software has held continuous customer seminars (about 12 per year) which have included SCARRIE information and distribution of material.

4.4. Press coverage

A SCARRIE press release was distributed to Swedish news media in connection with the Älvsjö Computer fair, Stockholm, January 1997 (WF).

An article about SCARRIE by Claus Povlsen, CST, was published in the April issue of the trade magazine Dansk Presse (Danish Press).

The Swedish magazine Datateknik published an article about SCARRIE in No. 4, 1997.

An article about SCARRIE with an interview with Anna Sågvall Hein, UU, was published in Upsala Nya Tidning, 25 Sep. 1997.

4.5. Presentations/demonstrations at conferences, exhibitions & meetings

Kick-off meeting for the new phase of the national Swedish language technology program in Göteborg, August, 1997 (UU)

National meeting for Swedish authorised language consultants in Stockholm, September 1997 (UU).

MONS7 (National meeting about the Norwegian language) in Trondheim, November 1997 (HIT).

Vetenskapsfestivalen (Science Festival) in Göteborg, May 1998 (UU).

DALF meeting (annual meeting of the union of Danish computational linguists) in Copenhagen, May 1999 (CST – demonstration of the running prototype).

The National Danish Seminar on the 5th Framework Programme (“Sproget i informationssamfundet”) in Copenhagen, January 1999 (CST – demonstration of the running prototype).

Visit of the Danish minister of research at CST, March 1999 (CST – demonstration of the running prototype).

Meetings with language consultants and IT experts from the Swedish state office and the Swedish parliament in Stockholm, June 1998, and Uppsala, August 1998 (UU).

Nordisk Språkmøte (Nordic Language Meeting), supported by the Language Academies in the Nordic countries, September 1998 (HIT).

4.6. Web site information

Note: the following URLs were valid at the time of writing this report, but some are no longer available.

SCARRIE home page (URL: http://www.scarrie.com)

http://cst.ku.dk

http://fasting.hf.uib.no/~desmedt/scarrie/

http://www.ling.uu.se

http://www.svd.se

http://www.ling.uu.se

http://stp.ling.uu.se/~ljo/scarrie-pub/ucp-light.html

http://stp.ling.uu.se/~ljo/scarrie-pub/scarrie_html.

http://stp.ling.uu.se/~ljo/scarrie/scarrie.html

http://strindberg.ling.uu.se/cgi-bin/w3-msql/ECD/index.html

Newspaper trade organisations such as Tidningsutgivarna (SE), Freelancegruppen i Dansk Journalistforbund (DK) and Society of Newspaper Design/Scandinavia have all had SCARRIE information on their web sites.

4. Evaluation and assessment

4.1. Validation

4.1.1. The methodology

Like project internal evaluation, validation has largely been based on an evaluation methodology inspired by the EAGLES project (EAGLES 1996). The project has accordingly focussed on adequacy evaluation. The main object of evaluation has been the system’s linguistic functionality. However, the usability of the interface developed has also been validated by the end users.

Linguistic functionality is split up into three main attributes: coverage (recall), flagging (precision) and suggestion adequacy. Coverage subsumes lexical, grammar and error coverage, each of which is again split up into sub-attributes having to do with portions of the lexicon, grammatical constructions and error types. Suggestion adequacy subsumes sub-attributes having to do with whether the suggestion produced is correct and whether the user is presented with a diagnosis. For a discussion of the individual attributes, see (Paggio & Underwood 1999) and (Paggio & Music 1998).

During project internal evaluation, dedicated test suites have been used to test and evaluate the system’s recall, precision and suggestion adequacy for different types of error. Validation, on the other hand, has been carried out by testing the system on sets of text excerpts where the errors were randomly distributed. Evaluation measures were computed automatically by way of two evaluation tools implemented by the project, Kraut and Scareval. Validation results were then analysed manually to see which error types were particularly problematic.
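The measures reported in the following sections can be read with the help of a minimal sketch. The definitions below are an interpretation consistent with the way the figures are described in this report (valid words accepted, real errors spotted, good flags, correct suggestions); they are not the implementation of Kraut or Scareval, and the counts in the example are invented.

```python
from dataclasses import dataclass

@dataclass
class Counts:
    valid_words: int          # correct word tokens in the test text
    valid_accepted: int       # correct tokens the system did not flag
    errors: int               # real errors in the test text
    errors_flagged: int       # real errors the system flagged
    correct_suggestions: int  # flagged errors with a correct replacement

    @property
    def flags(self):
        """All flags raised: false flags on valid words plus good flags."""
        return (self.valid_words - self.valid_accepted) + self.errors_flagged

    def lexical_recall(self):
        return self.valid_accepted / self.valid_words

    def error_recall(self):
        return self.errors_flagged / self.errors

    def precision(self):
        return self.errors_flagged / self.flags

    def suggestion_adequacy(self):
        return self.correct_suggestions / self.errors_flagged

c = Counts(valid_words=1000, valid_accepted=980, errors=50,
           errors_flagged=40, correct_suggestions=30)
print(c.lexical_recall(), c.error_recall(), round(c.precision(), 3),
      c.suggestion_adequacy())
```

The example shows why precision can be low even when lexical recall is high: correct words vastly outnumber errors, so a small share of rejected valid words produces many false flags relative to the good ones.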

Although no precise goals had been set up in terms of desired recall, precision and suggestion adequacy measures, an assessment of the results obtained as regards spelling checking has been carried out by comparing the results with the measures obtained by other commercial systems in similar tests. Due to the absence of commercial grammar checkers for the Scandinavian languages, no such comparison could be carried out with regard to the grammar checking capacity of the SCARRIE prototypes.

As far as the usability of the interface is concerned, users were asked to fill in a questionnaire. The usability attributes taken into consideration in the questionnaire are inspired by the TEMAA project (Manzi et al 1996).

4.1.2. Validation of the Danish prototype

The validation of the Danish prototype has been carried out in cooperation with the publishing house Berlingske Tidende. The system’s linguistic functionality was validated by running SCARRIE on a set of 33 different newspaper article excerpts (see Validation of the Danish prototype – Del 8.1.1).

The recall obtained (95.88% on lexical recall and 42.17% on error recall) is quite satisfactory. The vast majority of the real errors missed by the system are in fact typographical errors (e.g. wrong quotation signs) and punctuation errors which are clearly outside the system’s intended coverage.

Precision and suggestion adequacy should and can be improved. Nevertheless at least on precision (17.51% good flags), the Danish SCARRIE did better than the Danish spelling checker in Word in a comparative test (Word scored 12.4% on good flags, see Evaluation Report – Del 7.2). Many of the incorrect flags produced by SCARRIE were due to an unsatisfactory treatment of abbreviations, numbers, and typographical signs such as quotes. None of these areas have been in focus in the project, but they should of course be covered in a commercial system. It must be noted that unknown compound forms, which are a notorious problem for commercial spelling checkers of Danish, and an area specifically addressed by the project, as a general rule are successfully recognised by SCARRIE.

Suggestion adequacy is the most problematic attribute for the Danish SCARRIE. The high number of errors for which no replacement is suggested (33.88%) is especially surprising. A large number of these are typos (some of them typos in compounds), again an area to which no great attention has been devoted during the course of the project. It must be noted, however, that CORRie (the technical baseline chosen by the project) makes several different methods available for spelling correction and that the method used for the validation of the Danish prototype does not rely on trie-based dictionary storage. Adopting a trie-based method may improve the results. The proportion of wrong replacements suggested (51.36% of the cases) can also be improved. About half of these cases are in fact due to the system wrongly suggesting that a word it cannot otherwise analyse be split up. The mechanism that does this was specifically developed by Cognitech for the treatment of run-ons in Danish, but it was integrated in the prototype at a very late stage and needs further development.

As for usability, the Danish validator preferred to give his comments in a discursive manner rather than filling in the questionnaire. He provided a number of suggestions (see Del 8.1.1), most notably that the system should work from within a word processor and not as a stand-alone application, and that it should be able to remember a correct replacement provided interactively by the user throughout a correction session.

4.1.3. Validation of the Norwegian prototype

For the Norwegian prototype, validation was done in cooperation with Fagbokforlaget, a publisher of textbooks. Human proofreading at the user’s site is always supported by computational tools. Even with their limitations, current commercial products are found to contribute to quality and efficiency.

The validation was performed by comparing an author’s version of a 3791-word excerpt of a university-level psychology textbook with a version checked by a human proofreader. The author’s version seemed to have already been checked rather carefully at the linguistic level. There were in fact very few orthographical and typographical errors in the text. An analysis of the proofreader’s changes shows that only 13 of 114 changes are of this type. In all, only 41 of the 114 changes are errors that could conceivably be corrected automatically.

The performance of the SCARRIE system on the text provided by the user was assessed by using automatic validation software. From a careful comparison of the statistics, it appears that the Norwegian SCARRIE prototype performs better than MS Word 98 on all counts.

SCARRIE accepted 98.4% of the valid words against 96.6% for Word and spotted 28.6% of the real errors compared to 26.2% for Word. These differences may seem small in percentage points, but in fact Word rejected more than twice as many valid words as SCARRIE did. The fact that both programs missed over 70% of the proofreader’s changes may seem poor. However, it must be noted that, as pointed out above, there are very few real linguistic errors in the text, the large majority of the proofreader’s changes being changes pertaining to punctuation and style.
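The claim that Word rejected more than twice as many valid words follows directly from the acceptance rates, since both programs were run on the same text. A short worked calculation (variable names are illustrative):

```python
scarrie_accept = 98.4  # % of valid words accepted by SCARRIE
word_accept = 96.6     # % of valid words accepted by Word

scarrie_reject = 100.0 - scarrie_accept  # about 1.6% of valid words rejected
word_reject = 100.0 - word_accept        # about 3.4% of valid words rejected

ratio = word_reject / scarrie_reject     # about 2.1
print(ratio > 2)
```

A difference of 1.8 percentage points in acceptance thus corresponds to roughly 2.1 times as many false rejections.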

SCARRIE’s precision is at 37.5% compared to 21.7% for Word, a considerable difference which is appreciated by the user. For identified real errors, 30.6% of SCARRIE’s suggestions for correction were correct, compared to 23.5% in Word. High scores on recall and precision are judged to be of prime importance by the user. SCARRIE’s relative strength in recognizing new compounds is especially appreciated due to the fact that textbooks contain a very large number of compounds. SCARRIE’s lexical coverage is adequate, even though the lexicon is based almost exclusively on a normal dictionary, and is nicely supplemented by mechanisms for recognizing newly found compounds and proper names.

Consistency is also deemed to be a very desirable characteristic of error correction. In this respect, the user appreciates the capabilities of SCARRIE for enforcing different written norms in Bokmål. SCARRIE identified possible inconsistencies (so-called radical forms in an otherwise rather conservative text) which were not changed by the human proofreader.

The total impression after validation is that the results are surprisingly good for a brand new, unadjusted prototype, even in comparison with leading commercial software. It must be added, though, that the validation test for Norwegian was run without the parser. With the parser switched on, the prototype functions very impressively on selected short sentences, but it does not produce any useful output on many longer sentences in the user’s text. It must be concluded that on the one hand, the grammar does not reach sufficient coverage to deal with authentic texts, and on the other hand, if the parser produces a set of possible analyses, it does not always succeed in selecting the correct one. This is the biggest area for further research.

4.1.4. Validation of the Swedish prototype

The validation of the Swedish prototype has been carried out in co-operation with the two newspapers Svenska Dagbladet and Upsala Nya Tidning. The system’s linguistic functionality was validated by running SCARRIE on a corpus of text comprising some 15,000 current (see Validation of the Swedish prototype – Del 8.1.3) from a randomly sampled set of articles.

Spell checking

The spell checking recall obtained (98.0% on lexical recall and 96.5% on error recall) seems to be very good. The vast majority of the real errors missed are punctuation errors, in particular errors in the use of the comma. A few errors in the use of capital letters and some typos (split words) are also overlooked.

Precision is good, 41.3% good flags as compared to 20.0% in a comparative test with Swedish MS Word (see Evaluation Report – Del 7.2, and Dahlqvist, B., 1999). Many of the incorrect flags produced by SCARRIE are due to an unsatisfactory treatment of abbreviations, numbers, and typographical signs such as quotes. These areas have not been in focus in the project, but they should of course be covered in a commercial system. Unknown compound forms are as a rule successfully recognised by SCARRIE. Some over-generation remains to be handled, though.

Suggestion adequacy is the most problematic attribute for the spell checker of Swedish SCARRIE. The high number of errors for which no replacement is suggested (60.6%) is especially unsatisfactory. A large number of these are typos (some of them typos in compounds), again an area to which no great attention has been devoted during the course of the project. The proportion of wrong replacements suggested (9.3% of the cases) may also be improved. About half of these cases, in fact, are due to the fact that the user has wrongly split words at unexpected places.

Grammar checking

The grammar checker of Swedish SCARRIE targets more than 30 error types. Seven of them were represented in the validation corpus. They refer to errors in the nominal phrase, errors in the verb phrase, word order errors, and erroneously split words. For these error types, an overall recall of 83.3% and a precision of 77.0% were obtained. (Two errors that SCARRIE detected were overlooked by the human proofreader.)

The results obtained for the error types in the validation corpus seem to be quite good. Still they may be further improved by fine-tuning the grammar and the input to the grammar checker. Two types of shortcomings with regard to the input were encountered. One is due to cases where the spell checker has made an incorrect analysis of a word that is outside the dictionary. The other is due to wrong sentence segmentation. The system may be improved in both respects. However, in order to arrive at a reliable sentence splitting, the typographical markings in the newspaper articles have to be taken into account and presented to SCARRIE in a way that it can handle, e.g. in an SGML format.

The validity of the remaining error types currently targeted by SCARRIE has to be assessed by validation studies of a larger corpus.

The dominating error type outside the defined scope of SCARRIE is punctuation, notably errors in the use of the comma. Except for quite special contexts, the detection of errors in the use of the comma must be based on a reliable recognition of clause boundaries. This is feasible within the ScarCheck framework, but requires further development of the grammar to an extent that was outside the scope of the project. Another major breakthrough into more error types would be feasible if valency information could be taken into account in a systematic way. So far, valency information in ScarCheck is limited to individual lexical items. Including valency aspects in a systematic way would require a substantial extension of the dictionary, which was also outside the scope of the project but is compatible with the ScarCheck grammar checking strategy.

The grammar checking validation results presented above are the first that we know of for Swedish. A grammar checking competitor was announced by the Finnish company LingSoft in October 1998 to be included in Office 2000. It will work together with the Word spell checker. As soon as it is on the market, comparative studies will be made.

4.1.5. Validation conclusion

In conclusion, the validation results obtained show that the three prototypes developed by the project perform well even in comparison with leading commercial spelling checkers. As expected, SCARRIE seen as a whole can handle linguistic phenomena which similar systems do not cover, e.g. it can perform compound analysis, it can flag and to a certain extent correct certain types of context-dependent spelling errors, and it can enforce style consistency. The three prototypes display a certain variation with respect to linguistic functionality, due to the fact that different areas have been in focus for the three languages. In general, however, lexical coverage and coverage of errors are very satisfactory for all of them.

Precision is also good compared to e.g. the spelling checkers in Word, and most incorrect flags are caused by the incorrect treatment of e.g. punctuation, abbreviations, and digits, all of them areas on which the project did not focus. Suggestion adequacy is deemed to be the most problematic attribute. Two areas for further improvements which would improve suggestion adequacy are the correction of typos, which again has not been emphasised in the project, and the treatment of split-ups and run-ons, which is a very interesting, but still somewhat unstable feature of the system.

The results obtained with the Swedish grammar checker seem to be good for the error types that were represented in the validation corpus. Validation of a larger corpus is needed to assess the validity of the remaining error types. Still there is no product on the market with grammar checking functionality for Swedish to compare with.

4.2. Feedback

Throughout the project, the users have contributed in a highly dedicated manner. As regards the commercial product, the following user views were expressed.

  • The Danish users would be happy to see Danish SCARRIE integrated in a word processing system, in the first place Word.
  • The Norwegian users would only consider using a new product if it is clearly better than TANSA and offers added functionality. They prefer a mode of operation where a text is fully proofread after writing, not during writing.
  • The Swedish users would prefer a product that may be interconnected with different word processors but not as an integrated part. In particular, they would like to influence the design and functionality of the interface, and quite concrete views on this matter were expressed.

4.3. Internal collaboration

Partly due to common characteristics of the Scandinavian languages, it made sense for this project to be set up as an international project rather than three separate projects. On the one hand, common approaches and tools were shared by the partners. On the other hand, the partners also each explored some alternative methods which could be compared.

5. Conclusions and future prospects

5.1. Synthesis and conclusions

5.1.1. Technical feasibility, implications for exploitation

5.1.1.1. Danish SCARRIE

The end user validation of Danish SCARRIE revealed various advantages and drawbacks with the current Danish prototype. Concerning the innovative elements – syntactic and compound analysis – the performance of the Danish SCARRIE proved superior in comparison with similar commercial systems. On the other hand, standard text-handling functionality such as treatment of different format standards and punctuation markers is not for the time being covered adequately by SCARRIE.

In all, however, the validation of SCARRIE showed that the Danish version of SCARRIE is a system with the potential of becoming a viable software product on the market of proofreading systems in Denmark. Besides being based on the validation of the system, this assessment also rests on the fact that the Danish market is not nearly as competitive as the Swedish and Norwegian markets. As an indicator of this fact, not only Berlingske Tidende but also other Danish newspapers have shown a great deal of interest in using the Danish version of SCARRIE.

5.1.1.2. Norwegian SCARRIE

As to the Norwegian results, the project has successfully shown that linguistic functionality can be greatly enhanced by syntactic analysis. But although working mechanisms have been implemented, the project has neither built a grammar with complete coverage, nor built mechanisms allowing precision in syntactic disambiguation. The latter goals would merit separate follow-up projects.

Unfortunately, much of the Norwegian prototype’s functionality that is highly desired by the user does require syntactic analysis. The following consequences can therefore be sketched with respect to the prospects for a commercial application. (1) At present, commercial exploitation of the prototype is only feasible if the grammar is left out, because it is too unreliable on real-life texts. The immediate potential is a product which has higher quality than MS Word but which still lacks some of the desired functionality. (2) In the long run, only a big investment in large-scale grammar development and sophisticated parsing can result in a product where the full functionality successfully demonstrated in this project will work reliably on authentic texts.

5.1.1.3. Swedish SCARRIE

Swedish SCARRIE is a system with the potential of becoming a viable software product on the Swedish market. Evaluation and validation of its spell checking functionality have shown that SCARRIE is superior to Swedish MS Word. Validation of the prototype’s grammar checking capacity has shown good results for the error types that were represented in the validation corpus. The remaining targeted error types should be examined further by running the prototype on the full set of validation texts provided by the users. Due to project limitations, only a subset of these texts was actually used for validation. In connection with an extended validation, some fine-tuning of the grammar and the dictionary may be foreseen.

In a commercial product, the user should be free to choose what error types to activate or deactivate. Such a choice is supported by the use of well-defined error type codes in Swedish SCARRIE. Proper sentence splitting also has to be guaranteed. The solution seems to be to convert the information inherent in the typographical codes used by the newspapers into some general format such as SGML. Standard text-handling functionality such as treatment of different format standards and punctuation markers has to be adequately covered by SCARRIE. Also, the assignment of grammatical codes to word forms outside the dictionary has to be made more precise.
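The activation and deactivation of error types by code, as described above, could look roughly like the following. This is only a minimal sketch: the codes `AGR-NP` and `SPELL`, the `Diagnosis` record and the function `filter_diagnoses` are hypothetical illustrations and are not taken from the Swedish SCARRIE implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    code: str      # hypothetical error type code, not an actual SCARRIE code
    message: str   # human-readable description shown to the user


def filter_diagnoses(diagnoses, active_codes):
    """Keep only the diagnoses whose error type the user has activated."""
    return [d for d in diagnoses if d.code in active_codes]


# Two made-up diagnoses for one sentence.
found = [
    Diagnosis("AGR-NP", "agreement error in noun phrase"),
    Diagnosis("SPELL", "unknown word form"),
]

# The user has switched off everything except spelling errors.
active = {"SPELL"}
shown = filter_diagnoses(found, active)

for d in shown:
    print(d.code, d.message)
```

Because every diagnosis carries exactly one code, switching an error type on or off only changes the set of active codes and leaves the checking itself untouched.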

With the announcement of a new grammar checker for Swedish, SWEGC, by the Finnish company LingSoft, competition on the Swedish market has increased. SWEGC targets grammar errors of the same type as SCARRIE. It works with the Word spell checker as a basis. When SWEGC is released on the market, one would like to see a comparison of the two alternative systems, Word and SWEGC on the one hand, and SCARRIE on the other. If the linguistic functionality and performance of the two systems turn out to be analogous, Word users will certainly be happy about SWEGC, whereas non-Word users, such as the newspapers that we know of and others, will prefer SCARRIE.

5.1.2. Economic viability

The project has been an ambitious one from the start. With an EC contribution of 700,000 ECU, it has produced useful results, but it has been impossible to fully develop the intended prototypes to the level desired for direct exploitation, even with immense extra contributions from the partners beyond the intended budget. There is still good reason to believe that the SCARRIE product will reach break-even within the first three years, counted from the time when the product is introduced.

At present, WordFinder Software sells between 1,000 and 2,000 copies of English, German and French combined spell checkers/grammar checkers annually in Sweden alone. The market for the SCARRIE product is at least 5-10 times larger. People generally write in their native language, so everyone in a Scandinavian country with a computer is a potential client.

WordFinder estimates that it will be possible to sell approximately 3,000 copies in the year 2000, twice that amount in 2001 and a further increase in sales by 50% in 2002. All partners will receive revenues from sales and royalties.
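The estimate above can be worked out year by year. The sketch below assumes that the 50% increase in 2002 is measured against the 2001 figure; the report does not state this explicitly, and the variable names are illustrative only.

```python
# Projected unit sales, following the estimate in the paragraph above.
sales_2000 = 3000                 # "approximately 3,000 copies in the year 2000"
sales_2001 = 2 * sales_2000       # "twice that amount in 2001"
sales_2002 = sales_2001 * 3 // 2  # assumption: 50% more than the 2001 figure

projection = {2000: sales_2000, 2001: sales_2001, 2002: sales_2002}
total = sum(projection.values())

for year, copies in projection.items():
    print(year, copies)
print("total", total)
```

Under this reading the projection is roughly 3,000, 6,000 and 9,000 copies, or about 18,000 copies over the first three years.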

5.2. Business perspective

Grammar checkers and spell checkers for English, French and German have been on the market for several years. Clients in Sweden, especially newspapers and publishing houses, have asked for a Swedish combined spell checker and grammar checker for many years, and still none has been developed. The situation on the other Scandinavian markets is the same. In some of the countries, there is not even a good spell checker. The product developed in the SCARRIE project will be the first combined spell checker and grammar checker for these languages.

In October 1998, at the Swedish Book & Library Fair, the Finnish company LingSoft announced a new grammar checker for Swedish, SWEGC, to be launched as part of MS Office 2000. A fairly detailed description of the error types targeted by SWEGC was presented on the web. To a great extent they overlap with those targeted by Swedish SCARRIE. As opposed to SCARRIE, SWEGC handles grammar checking only; it will work as an add-on to the spell checker provided by Word. The SWEGC demonstration sentences that were presented on the web were included in the test material to be used in the evaluation of the Swedish SCARRIE prototype. No possibility of testing SWEGC via the web was provided.

In addition, some companies have developed spell checkers and text checkers, but none of them with grammar checking included. Skribent on the Swedish market and Tansa on the Norwegian market are today the only competitors on the Scandinavian markets. MS Office 2000 will be one, and of course a tough one, when Microsoft introduces the new version. The SCARRIE product must therefore prove much stronger than Lingsoft’s add-on in order to attract the users. Nevertheless, it should be remembered that Reference Software managed to sell more than 2 million copies of Grammatik (English version), even though Microsoft delivered Correct Grammar within MS Word.

5.3. Exploitation planning

5.3.1. Benefits to partners and consortium

Exploitation will mainly be carried out by WordFinder; the other partners will receive royalties.

UIB/HIT will not exploit the results in other ways.

UU is considering exploiting the results in an on-going cooperation with Scania on language checking of controlled language. Another area that is being considered for exploitation is CALL, specifically second-language learning (Swedish) focusing on grammatical aspects. Finally, the results that were achieved with regard to partial parsing seem to provide an interesting starting point for research on sophisticated information retrieval and extraction.

All the academic partners have gained research experience which they will apply in future research.

5.3.2. Business plan

5.3.2.1. Timetable for developing the SCARRIE product

June 98 – February 99 The Swedish, Danish and Norwegian prototypes will be finalized during this period and integrated into a simple user interface for Windows.

Feb. 99 – October 99 Further development of the user interface in cooperation with a user group will be done.

The user interface for Windows will be finished.

November 99 Beta test phase for the Windows version.

December 99 Production of the Windows version

January 00 – June 00 Further development of the user interface for Macintosh. The user interface for Macintosh will be finished. But only if Macintosh still exists on the market and the newspapers and the publishing houses still use it. We need orders from a couple of newspapers and publishing houses to cover the extra development costs for the Macintosh version.

July 00 Beta test phase for the Macintosh version.

August 00 Production of the Macintosh version.

5.3.2.2. Timetable for introducing SCARRIE on the market

Step I

December 99

Introducing the Swedish version of SCARRIE for Windows in December 99 on the market. The product will be sold through WordFinder Software AB in Sweden.

Step II

March 00

Introducing the Danish version of SCARRIE for Windows in March 00 on the market. The product will be sold through WordFinder Software A/S in Denmark.

Step III

September 00

Introducing the Norwegian version of SCARRIE for Windows, and the Swedish, Danish and Norwegian versions of SCARRIE for Macintosh, in December 00 on the market. The product will be sold through WordFinder Software AB in Sweden, WordFinder Software AS in Norway and WordFinder Software A/S in Denmark.

6. Appendices

6.1. Runnable demonstrator of Swedish SCARRIE

A demonstrator of Swedish SCARRIE may be found at url: http://stp.ling.uu.se/~ljo/scarrie-pub/scarrie.html. The demonstrator takes one sentence at a time. It expects you to finish your input with a major sign of punctuation; if not, only spell checking will be performed.
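The input rule described above can be sketched as follows. This is not the demonstrator's actual code: the function name `checks_to_run` is hypothetical, and the sketch assumes that a "major sign of punctuation" means a full stop, exclamation mark or question mark.

```python
# Assumption: these three characters count as "major" punctuation.
MAJOR_PUNCTUATION = (".", "!", "?")


def checks_to_run(sentence: str) -> list[str]:
    """Return which checks would apply to one input sentence."""
    text = sentence.rstrip()  # ignore trailing spaces and newlines
    if text.endswith(MAJOR_PUNCTUATION):
        return ["spelling", "grammar"]
    # Without final punctuation, only spell checking is performed.
    return ["spelling"]


print(checks_to_run("Det här är en mening."))
print(checks_to_run("Det här är en mening"))
```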

6.2. List of deliverables

Deliverable id. Qty Title
D0.1 13 Bimonthly management report
D0.2 4 Semestrial progress report
D0.3 2 Public annual report
D0.4 1 Final project report
D0.5 1 Dissemination and concertation plan
D1.2 1 Installation of the CORRie software at the partners
D2.1 1 Error typology for automatic proofreading purposes
D2.1.1.2 1 An error database of Danish
D2.1.1.3 1 Parallel corpus of proofread and non-proofread text in Danish
D2.1.2.2 1 An error database of Norwegian
D2.1.3.2 1 An error database of Swedish
D2.1.3.3 1 Parallel corpus of proofread and non-proofread text in Swedish
D2.2.1.2 1 BET Test corpus
D2.2.2.1 1 BTR Test corpus
D2.2.3.1 1 SVD Test corpus
D2.2.3.2 1 UNT Test corpus
D3.1.3 1 A Swedish text corpus for generating dictionaries
D3.2.1 1 A Danish wordform dictionary
D3.2.3 1 A Swedish wordform dictionary
D3.3.1 1 Specification of tagset
D3.3.2 1 Adjusted tagset
D3.3.3 1 Wordform dictionary files for Norwegian
D3.3.4 1 Wordform files with additions for Norwegian
D3.4.1 1 A Swedish hyphenation marker
D4.1.1 1 Specification of phonemic representations, Norwegian
D4.1.2 1 Specification of phonemic representations, Danish
D4.1.3 1 Specification of phonemic representations, Swedish
D5.1.1 1 Compounding rules for Norwegian
D5.1.2 1 Compounding rules for Danish
D5.1.3 1 Compounding rules for Swedish
D5.3.1 1 List of multi-word expressions for Norwegian
D5.3.2 1 List of multi-word expressions for Danish
D5.3.3 1 List of multi-word expressions for Swedish
D6.1 1 A study of three commercial grammar checkers
D6.2.1 1 Three types of grammatical errors in Danish
D6.2.2 1 Three types of grammatical errors in Norwegian
D6.2.3 1 Three types of grammatical errors in Swedish
D6.3.2 1 Norwegian phrase constituent rules
D6.3.3 1 Swedish phrase constituent rules
D6.4 1 A formalism for the expression of local error rules
D6.4.3 1 Local error rules for Swedish
D6.5.1 1 A test version of the grammar checker for Swedish
D6.5.1b 1 A chart-based framework for grammar checking
D6.5.2 1 A specification of the required grammar checking machinery
D6.6.1 1 A grammar checking module for Danish
D6.6.2 1 A grammar checking module for Norwegian
D6.6.3 1 A grammar checking module for Swedish
D7.1.1 1 Test suites covering the functional specifications of the subcomponents of the Danish prototype
D7.1.2 1 Test suites covering the functional specifications of the subcomponents of the Norwegian prototype
D7.1.3 1 Test suites covering the functional specifications of the subcomponents of the Swedish prototype
D7.2 1 Evaluation report
D8.0 1 Validation plan
D8.1a 1 Evaluation sheet
D8.1.1 1 Evaluation report for the Danish prototype
D8.1.2 1 Evaluation report for the Norwegian prototype
D8.1.3 1 Evaluation report for the Swedish prototype
D8.2a 1 Specification of a simple user interface
D8.2b 1 Automatic validation software
D9.1 1 Guidelines for implementing products from project results
D9.2 1 An updated market survey
D9.3 1 An exploitation plan

6.3. References

6.3.1. Project Reports

Paggio, P., Dahlqvist, B., Sågvall Hein, A., Olsson, L-J. and Povlsen, C. (1999) Evaluation report. SCARRIE Project Report, Del. 7.2.

Paggio, P. and Povlsen, C. (1999) Validation of the Danish prototype. SCARRIE Project Report, Del. 8.1.1.

Wedbjer Rambell, Olga (1998) Error Typology for Automatic Proofreading Purposes. SCARRIE Project Report, Del. 2.1, version 1.1.

Wedbjer Rambell, Olga, Dahlqvist, Bengt, Tjong Kim Sang, Erik and Hein, Nils (1998) Error Database of Swedish. SCARRIE Project Report, Del. 2.1.3.2.

6.3.2. Project-relevant Papers

Paggio, Patrizia and Music, Bradley (1998) Evaluation in the SCARRIE Project. Proceedings of the First International Conference on Language Resources & Evaluation. Granada, Spain, pp. 277-282.

Sågvall Hein, Anna, Paggio, Patrizia and Wedbjer Rambell, Olga (with contributions from Bart Jongejan, Leif-Jöran Olsson, Claus Poulsen, and Per Starbäck). An inquiry on software for combined spell checking and grammar checking. SCARRIE Common Workspace.

Dahlqvist, Bengt (1997) Word Frequency Lists for the Uppsala Newspaper Corpus. Collection of Swedish word lists from the 67 million word Uppsala Newspaper Corpus.

Dahlqvist, Bengt (1999) Protokoll över stavningskontroll med MS Word 97 på testtext fredag4.txt. Uppsala universitet. Institutionen för lingvistik. Arbetsrapport.

6.3.3. Other references

EAGLES (1996) EAGLES Evaluation of natural language processing systems. Final Report. EAGLES Document EAG-EWG-PR2. ISBN 87-90708-00-8.

Manzi, S., King, M. and Douglas, S. (1996) Working towards user-oriented evaluation. In Proceedings of the International Conference on Natural Language Processing and Industrial Applications (NLP+IA 96), (pp 155–160). Moncton, New-Brunswick, Canada.

Paggio, Patrizia and Underwood, Nancy L. (1998) Validating the TEMAA LE evaluation methodology: a case study on Danish spelling checkers. Journal of Natural Language Engineering, 4 (3), Cambridge University Press, pp. 211-228.

6.3.4. Papers presented at conferences

Hansen, Peter Molbæk (1999) Grapheme-to-Phoneme Rules for the Danish Component of the SCARRIE Project. Datalingvistisk Forenings årsmøde 1998 i København. Proceedings. Handelshøjskolen i København, Copenhagen, pp. 79-90.

Paggio, Patrizia and Music, Bradley (1998) Evaluation in the SCARRIE Project. Proceedings of the First International Conference on Language Resources & Evaluation. Granada, Spain, pp. 277-282.

Paggio, Patrizia (1999) Treatment of grammatical errors and evaluation in SCARRIE. Datalingvistisk Forenings årsmøde 1998 i København. Proceedings. Handelshøjskolen i København, Copenhagen, pp. 65-78.

Povlsen, Claus (1999) SCARRIE – et skandinavisk forskningsprojekt. Datalingvistisk Forenings årsmøde 1998 i København. Proceedings. Handelshøjskolen i København, Copenhagen, pp. 55-64.

Rosén, V., & De Smedt, K. (1998). SCARRIE: Automatisk korrekturlesning for skandinaviske språk. Faarlund, J.T., Mæhlum, B. & Nordgård, T. (eds.) Mons 7: Utvalde artiklar frå det 7. Møtet Om Norsk Språk i Trondheim 1997 (pp. 197-210). Oslo: Novus. (see full text at http://fasting.hf.uib.no/~desmedt/scarrie/)

Sågvall Hein, Anna (1998) A Chart-Based Framework for Grammar Checking. Initial Studies. Nodalida ’98, Proceedings of the 11th Nordic Conference on Computational Linguistics, Copenhagen, Denmark, pp. 68-80.

6.4. Project brochure

Order address: WordFinder Software AB, Box 155, SE-351 04 Växjö.

Fax: +46 470 700099

E-mail: info@wordfinder.se

6.5. User groups

6.5.1. List of user groups directly involved in the project work

Bergen Trykk, Norway

Berlingske Tidende, Denmark

Fagbokforlaget, Norway

Munksgaard International Publishers, Denmark

Svenska Dagbladet, Sweden

Upsala Nya Tidning, Sweden

6.5.2. Other user contacts

Members of the Internet user forum ‘scarrie-users’:

grayseal@grayseal.pp.se

lena.tingstam@svd.se

torgny.hinnemo@svd.se

catharina.grunbaum@dn.se

stefan-l@stab.sr.se

hansw@stab.sr.se

bo.lofvendahl@svd.se

sanna.kilner@gp.se

berit.norman@op.se

milan.galik@unt.se

paul.heisholt@gw250.ntb.no

bergentrykk@graficonn.no

hje@mail.munksgaard.dk

Info.Berith@mailbox.swipnet.se

desmedt@hf.uib.no

thomas.breinstrup@berlingske.dk

b.skovsende@jp.dk

bent.nordbo@aftenposten.no

johan@lindgren.pp.se

per.amnestal@pressinstitute.se

bengt.dahlqvist@ling.uu.se

iain.urquhart@lux.dg13.cec.be

anders.rhodiner@tt.se

beth.antonisen@vk.se

nils.thostrup@jv.dk

uffe.stambej@mail.hbl.fi

ebbe.andersen@online.pol.dk

peter.mortensen@borsen.dk

fl.hvidtfeldt@jv.dk

mats.lundman@ksdbladet.se

bengt.engwall@nt.se

stig.grimelid@dn.nhst.no

per.ahlstrom@vpress.se

johan.hjelm@medialab.bonnier.se

hans.gundestrup@jv.dk

truls.lie@morgenbladet.no

mari-ann.jonsson@sds.se

nicolai.doellner@borsen.dk

ole-kristian.lyngstad@nordi.no

baumann@sj-dagblade.dk

marc.lofgren@aftonbladet.se

marita.granroth@mail.hbl.fi

herning.folkeblad@euroconnect.dk

wisby@image.dk

anna.wahlen@di.se

forlag@dkdfh.dk

krister.wistbacka@kkuriren.se

lise.nestelso@borsen.dk

anna.ostlund@fri-kopenskap.se

ihp@dbc.dk

hsn@fyens-stiftstidende.dk

rpc@post5.tele.dk

kari.westengen@ntb.no

per-erik.skramstad@dagbladet.no

kt@berlingske.dk

alf.zettersten@ljusnan.se

susanne_derby/berlingske.berlingske@mailserver.berlingske.dk

dyvikk@sjo.statkart.no

o.hoel@samlaget.no

josven@stud.ntnu.no

bertil.eriksson@expressen.se

lotta.chakir@expressen.se