
Report on the CMS forward backward MSGC Milestone

Abstract : In this paper, we address the problem of Part-of-Speech (PoS) tagging of unannotated specialized corpora. Existing taggers are trained on non-specialized corpora and most often give inconsistent results on specialized texts. To learn rules adapted to a specialized field, the usual approach is to manually label a large corpus from that field, which is extremely time-consuming. We propose a semi-automatic approach to PoS tagging of specialized corpora. ETIQ, the new tagger we are building, makes it possible to correct the rule base obtained by Brill's tagger and to adapt it to a specialized corpus. The domain expert views an initial tagging and corrects it by inserting specialized contextual lexical rules. The inserted rules are more expressive than Brill's rules. To help the user in this task, we designed an inductive algorithm biased by the "correct" knowledge previously acquired from the user. By using machine learning techniques while allowing the expert to incorporate domain knowledge in an interactive and user-friendly way, we improve the tagging of a specialized corpus. Our approach has been applied to a molecular biology corpus.
Document type : Journal articles
Contributor : Yvette Heyd
Submitted on : Thursday, August 31, 2000 - 2:35:10 PM
Last modification on : Thursday, April 23, 2020 - 2:26:14 PM


  • HAL Id : in2p3-00005961, version 1



J.M. Brom, U. Goerlach, A. Lounis, I. Ripp-Baudot, A. Zghiche. Report on the CMS forward backward MSGC Milestone. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, Elsevier, 1998, 419, pp.375. ⟨in2p3-00005961⟩


