Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database

2016 
We present a nearest-neighbor approach to ethnicity classification. Given an author name, all of its instances (or the most similar ones) in PubMed are identified and coupled with their respective country of affiliation, and then probabilistically mapped to a set of 26 predefined ethnicities. The dominant ethnicity (or pair of ethnicities) is assigned as the class. The predictions are also used to upgrade Genni (Smith, Singh, and Torvik, 2013) to provide ethnicity-specific gender predictions for cases like Italian vs. English Andrea, Turkish vs. Korean Bora, Israeli vs. Nordic Eli, and Slavic vs. Japanese Renko. Ethnea and Genni 2.0 are available at http://abel.lis.illinois.edu.

Methods: Existing approaches have focused on machine learning techniques that extract features of names with known ethnicities, harvested from online sources (e.g., TextMap: Ambekar et al. 2009, and EthnicSeer: Treeratpituk and Giles 2012). TextMap provides a hierarchical classification with 12 leaves, while EthnicSeer has a flat set of 12 slightly different classes (e.g., it excludes Jewish, Nordic, and African, and includes Chinese, Korean, and Vietnamese in place of EastAsian). Our approach differs in several respects. First, it is instance-based. That is, no machine training or feature selection occurs; we simply look up author name instances previously geocoded and mapped to countries worldwide (Torvik, 2015). A temporally weighted multiclass logistic regression model then probabilistically maps the country distribution to ethnicities, reducing the undesirable effects of outliers and highly unbalanced classes. To enable fast partial name matching, all 3- and 4-character n-grams of each author name were indexed using MySQL + Sphinx, which has rankers that allow higher weighting of, e.g., name endings. Partial matching is used only when the name in question occurs fewer than 100 times in our database.
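The partial-matching fallback described above can be illustrated with a minimal Python sketch. It is not the MySQL + Sphinx implementation: the in-memory `index` dictionary, the function names, the `ending_boost` weight, and the `top_k` cutoff are all assumptions made for illustration. Only the 3- and 4-character n-grams, the higher weight on name endings, and the 100-instance threshold come from the text.

```python
MIN_EXACT_INSTANCES = 100  # threshold stated in the text


def char_ngrams(name, sizes=(3, 4)):
    """Return the set of 3- and 4-character n-grams of a lowercased name."""
    s = name.lower()
    return {s[i:i + n] for n in sizes for i in range(len(s) - n + 1)}


def ngram_score(query, candidate, ending_boost=2.0):
    """Count shared n-grams; n-grams that end the query name count more.

    ending_boost is a hypothetical weight standing in for a ranker setting.
    """
    q = query.lower()
    shared = char_ngrams(q) & char_ngrams(candidate)
    return sum(ending_boost if q.endswith(g) else 1.0 for g in shared)


def find_instances(name, index, top_k=5):
    """Return country codes for a name, pooling similar names if it is rare.

    index is a hypothetical dict: lowercased name -> list of country codes,
    one entry per geocoded author name instance.
    """
    key = name.lower()
    exact = index.get(key, [])
    if len(exact) >= MIN_EXACT_INSTANCES:
        return list(exact)  # enough exact instances, no partial matching
    # Rank the other names by n-gram overlap and keep the top_k neighbors.
    neighbors = sorted(
        (n for n in index if n != key),
        key=lambda n: ngram_score(key, n),
        reverse=True,
    )[:top_k]
    pooled = list(exact)
    for n in neighbors:
        if ngram_score(key, n) > 0:  # skip names with no shared n-grams
            pooled.extend(index[n])
    return pooled
```

For example, with `index = {"jorvik": ["NO", "NO"], "smith": ["US"]}`, calling `find_instances("Torvik", index)` pools only the instances of "jorvik", because "smith" shares no n-grams with the query.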
Instance-based classifiers are often better than feature-based classifiers at capturing highly non-linear classification boundaries, but they rely more heavily on a large, dense set of instances. Our database has tens of millions of author name instances distributed across 200+ countries over 20+ years. Second, we picked the 26 ethnic classes (see Table 2) to be as specific as possible, yet separable, and to broadly cover the ethnicities observed with significant frequency in PubMed. As a result, some countries were pooled regionally (e.g., non-Arab African countries map to African), and countries with no single super-majority were mapped to multiple classes (e.g., Canada maps to both French and English).
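The step from a country distribution to an ethnicity class can be sketched as follows. This is a deliberately simplified stand-in: it replaces the temporally weighted multiclass logistic regression with a plain weighted average over a fixed country-to-ethnicity table. The table weights, the ISO-style country codes, and the `pair_ratio` rule for reporting a pair of ethnicities are assumptions for illustration, not values from the paper.

```python
from collections import Counter

# Hypothetical weights; the text says only that Canada maps to both
# French and English and that non-Arab African countries map to African.
COUNTRY_TO_ETHNICITY = {
    "CA": {"French": 0.5, "English": 0.5},
    "FR": {"French": 1.0},
    "GB": {"English": 1.0},
    "NG": {"African": 1.0},
}


def ethnicity_distribution(countries):
    """Turn a list of country codes (one per name instance) into
    ethnicity scores by averaging the per-country weights.

    Countries missing from the table contribute no score, so the
    values can sum to less than 1.
    """
    counts = Counter(countries)
    total = sum(counts.values())
    probs = Counter()
    for country, n in counts.items():
        for ethnicity, weight in COUNTRY_TO_ETHNICITY.get(country, {}).items():
            probs[ethnicity] += weight * n / total
    return dict(probs)


def assign_class(probs, pair_ratio=0.5):
    """Return the dominant ethnicity, or a pair when the runner-up is close.

    pair_ratio is an assumed cutoff: the runner-up is reported when its
    score is at least pair_ratio times the top score.
    """
    ranked = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))
    if not ranked:
        return ()
    if len(ranked) > 1 and ranked[1][1] >= pair_ratio * ranked[0][1]:
        return (ranked[0][0], ranked[1][0])
    return (ranked[0][0],)
```

Under these assumed weights, a name seen twice in Canada and twice in France scores 0.75 French and 0.25 English and is assigned French alone, while a name seen only in Canada is assigned the pair English and French.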