A monolingual approach to detection of text reuse in Russian-English collection

Oleg Bakhteev, Rita Kuznetsova, Alexey Romanov, Anton Khritankov

2015

In this paper we develop a method for cross-lingual (Russian-English) text reuse detection. The method follows the monolingual approach: texts are translated into one language, which reduces the task to a monolingual text similarity problem. We split texts into non-overlapping fragments and compare the fragments to each other using several metrics: BLEU(1–2), METEOR, cosine similarity between bag-of-words representations of the fragments, and cosine similarity between vectors obtained from a trained doc2vec model. We explore how the choice of metric affects the quality of text reuse detection. We assess the quality of the method on a sample of one hundred scientific documents, originally written in Russian and machine-translated into English. Preliminary findings demonstrate the feasibility of the approach.
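
The fragment-comparison step can be illustrated with a minimal sketch covering only the bag-of-words cosine similarity metric. This is not the authors' implementation: the fragment size of five words, the 0.8 threshold, the regex tokenizer, and the function names `split_into_fragments`, `cosine_similarity`, and `find_reuse` are all assumptions made for illustration, and both texts are assumed to be already in the same language (for example, after machine translation).

```python
import math
import re
from collections import Counter


def split_into_fragments(text, size=5):
    """Split text into non-overlapping fragments of `size` lowercase word tokens.

    The fixed word-count fragment size is an illustrative assumption.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between bag-of-words (term count) vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def find_reuse(suspicious, source, size=5, threshold=0.8):
    """Return (i, j, score) for fragment pairs scoring at or above `threshold`.

    The threshold value is hypothetical and would need tuning on real data.
    """
    source_fragments = split_into_fragments(source, size)
    pairs = []
    for i, frag_a in enumerate(split_into_fragments(suspicious, size)):
        for j, frag_b in enumerate(source_fragments):
            score = cosine_similarity(frag_a, frag_b)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    suspicious = "the method detects text reuse in translated scientific documents quickly"
    source = "we present results today here the method detects text reuse"
    for i, j, score in find_reuse(suspicious, source):
        print(f"suspicious fragment {i} matches source fragment {j}: {score:.2f}")
```

Comparing every fragment pair is quadratic in the number of fragments, which is acceptable for a sketch; the other metrics named in the abstract (BLEU, METEOR, doc2vec vectors) could be swapped in for `cosine_similarity` without changing the surrounding loop.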

Keywords:

Reuse
BLEU
Natural language processing
Snippet
Information retrieval
Cosine similarity
Evaluation of machine translation
Artificial intelligence
Computer science
Machine translated
