Comparable corpora BootCaT

2011 
The BootCaT method (Baroni and Bernardini, 2004) has proved a fast, effective and versatile approach to corpus building. The method has been applied to small specialist corpora for finding terminology and translations (as originally envisaged by Baroni and Bernardini), and to large, general corpora for many languages. We first review BootCaT and present figures for the sizes of corpora that can be built in a few minutes under various parameter settings. To date BootCaT has not been applied multilingually. We explore this by building matching corpora for different languages from matching seeds. We consider three ways of obtaining matching seeds: manual translation, automatic translation, and extracting keywords from corresponding Wikipedia articles. In one experiment, we present a bilingual word sketch based on seeds translated by Google Translate. In another, seeds are taken from Wikipedia, and we evaluate the corpora in two ways: first, by how many domain terms they deliver, and second, by how often the terms in one language are translation equivalents of the terms in the other.
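
As an illustration of building matching corpora from matching seeds, the sketch below turns aligned seed lists into parallel search queries. It relies on the general BootCaT idea of combining seed words into random tuples that are sent to a search engine as queries. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `matching_queries`, the example seed words, the tuple size, the number of queries, and the choice to reuse the same seed positions across languages. The search and download steps are omitted.

```python
import itertools
import random


def matching_queries(seed_lists, tuple_size=3, n_queries=5, seed=0):
    """Build one list of query strings per language from aligned seed lists.

    seed_lists maps a language code to a list of seed words; position i in
    every list is assumed to name the same concept (hypothetical alignment).
    """
    lengths = {len(seeds) for seeds in seed_lists.values()}
    if len(lengths) != 1:
        raise ValueError("seed lists must be aligned and of equal length")
    n_seeds = lengths.pop()

    # Pick tuples of seed *positions* once, so every language receives
    # queries built from the same underlying concepts.
    index_tuples = list(itertools.combinations(range(n_seeds), tuple_size))
    random.Random(seed).shuffle(index_tuples)
    chosen = index_tuples[:n_queries]

    return {
        lang: [" ".join(seeds[i] for i in idx) for idx in chosen]
        for lang, seeds in seed_lists.items()
    }


# Hypothetical matching seeds (illustrative only, single-word for simplicity).
seeds_en = ["corpus", "terminology", "translation", "lexicon", "glossary"]
seeds_it = ["corpus", "terminologia", "traduzione", "lessico", "glossario"]

queries = matching_queries({"en": seeds_en, "it": seeds_it})
for q_en, q_it in zip(queries["en"], queries["it"]):
    print(q_en, "|", q_it)
```

Sampling positions rather than words keeps the queries for the two languages aligned tuple by tuple, which is one simple way to make the resulting corpora comparable; independent sampling per language would be an equally plausible design.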