Learning Similarity-Preserving Meta-Embedding for Text Mining

2020 
Publicly available pre-trained word embeddings are rich sources for turning high-dimensional representations of large text repositories into compact, meaningful vectors essential for text-mining applications. With many such pre-trained embedding sources available, each has limitations in how well its language use fits the downstream text-mining task. Meta-embeddings aim to tackle this ambiguity by fusing multiple embedding sources into one feature space. However, current meta-embedding methods assume that vocabularies across sources are similar or even identical, which stands in sharp contrast to the fact that many sources barely overlap. Further, these methods encode a meta-embedding for each word by reconstructing its actual embedding values (word-encoder), while valuable information about relationships (distances) among words within each source is not directly considered. In this work, we instead propose a novel relation-encoder learning approach, Similarity-Preserving Meta-Embedding (SimME), that directly integrates word-pair relationships from partially overlapping embedding sources. SimME embeds words such that their similarities are learned from those observed in multiple pre-trained sources. To handle relations between words that are not present in all sources, we introduce maskout, a new loss term that steers learning selectively toward the sources containing those relations. SimME consistently outperforms state-of-the-art methods by 10% on average, and by up to 20%, across several core metrics in 4 popular mining tasks on 23 datasets.
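The abstract describes the core idea: learn meta-embeddings whose pairwise similarities match those observed in the pre-trained sources, with a masking mechanism so that a word pair only contributes to the loss for sources that contain both words. The sketch below is an illustrative interpretation of that idea, not the authors' implementation; the names `maskout_loss`, `meta_emb`, and `sources`, as well as the use of cosine similarity and a squared-error penalty, are assumptions for the example.

```python
# Illustrative sketch (assumed, not the paper's code): a similarity-preserving
# meta-embedding loss where each word-pair term is restricted to the sources
# that actually contain both words ("maskout").
import torch
import torch.nn.functional as F

def maskout_loss(meta_emb, pairs, sources):
    """
    meta_emb : (V, d) learnable meta-embedding matrix over the merged vocabulary.
    pairs    : (P, 2) long tensor of word-index pairs (i, j) sampled for training.
    sources  : list of dicts, one per pre-trained source, each with
               'emb'  : (V, d_s) source embeddings (rows for missing words are unused),
               'mask' : (V,) bool tensor, True where the word exists in that source.
    """
    i, j = pairs[:, 0], pairs[:, 1]
    # Similarities of the word pairs under the current meta-embedding.
    meta_sim = F.cosine_similarity(meta_emb[i], meta_emb[j], dim=-1)  # (P,)

    loss, n_terms = 0.0, 0
    for src in sources:
        # Keep only pairs whose words both occur in this source.
        valid = src['mask'][i] & src['mask'][j]                       # (P,)
        if valid.any():
            src_sim = F.cosine_similarity(src['emb'][i], src['emb'][j], dim=-1)
            # Penalize deviation of meta similarity from the source similarity.
            loss = loss + ((meta_sim - src_sim)[valid] ** 2).sum()
            n_terms += int(valid.sum())
    return loss / max(n_terms, 1)
```

Under these assumptions, pairs absent from a given source simply contribute nothing for that source, which is how partially overlapping vocabularies are handled without imputing missing embeddings.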