Model for Estimating the Optimal Parameter Values of the Scoring Matrix in the Entity Resolution of Unstandardized References

2021 
This research describes how to obtain the best linking results when using the scoring matrix to perform entity resolution (ER) on unstandardized references. The accuracy of the linking results produced by the scoring matrix depends on three critical parameters: the blocking frequency threshold, the stop word frequency threshold, and the scoring (matching) threshold. This paper describes results from building a regression model for estimating the optimal parameter values, i.e., the parameter values giving the best ER results in terms of F-measure. The experimental method used 20 fully annotated sets of unstandardized references of varying size and data quality. The reference sets were a mixture of synthetically created person references and real-world business references. For each reference set, a grid search was used to find the parameter settings giving the best results, and seven statistical values were collected. For each combination of statistics from the 20 training sets, three linear regression models were built, one to predict each of the critical scoring matrix parameters. The final result was that using the combination of reference set size and the standard deviation of the token frequency distribution as independent variables produced the best linear regression models for estimating the three critical scoring matrix parameters. The linear regression model developed in this research will help users generate more accurate estimates of the three critical scoring matrix parameters in practical applications. This research proposes a solution and opens the door to a number of new research questions for improving the performance of the scoring matrix approach to ER.
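The two-stage procedure summarized in the abstract (a grid search for the F-measure-maximizing parameters, then a linear regression from reference set statistics to those parameters) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the pairwise form of the F-measure, the `evaluate` callback, and every numeric value below are assumptions made for illustration; in particular, the toy targets are generated from an arbitrary linear rule and are not results from the paper.

```python
from itertools import product

import numpy as np


def f_measure(tp, fp, fn):
    """F-measure from true-positive, false-positive, and false-negative link counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def grid_search(evaluate, blocking_values, stop_word_values, match_values):
    """Try every parameter combination and keep the one with the highest score.

    `evaluate` is a hypothetical callback that runs ER with the given
    (blocking, stop word, matching) thresholds and returns an F-measure.
    """
    best_params, best_score = None, float("-inf")
    for params in product(blocking_values, stop_word_values, match_values):
        score = evaluate(*params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fit_linear_model(features, targets):
    """Ordinary least squares with an intercept; returns [b0, b1, b2]."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef


def predict(coef, size, token_freq_std):
    """Estimate one parameter from set size and token-frequency standard deviation."""
    return coef[0] + coef[1] * size + coef[2] * token_freq_std


# Toy inputs (made up): one row per reference set, columns are
# [reference set size, standard deviation of token frequency].
features = np.array(
    [[1000, 2.0], [2000, 3.5], [5000, 4.0], [10000, 7.5], [20000, 9.0]],
    dtype=float,
)
# Toy targets (made up): stand-ins for the grid-search optimum of one
# parameter, generated from an exact linear rule so the fit is checkable.
optimal_blocking = 2.0 + 0.001 * features[:, 0] + 0.5 * features[:, 1]

coef = fit_linear_model(features, optimal_blocking)
estimate = predict(coef, 8000, 6.0)
```

In this sketch one such model would be fitted per parameter, giving three models in total, and `estimate` is the predicted blocking threshold for a hypothetical new reference set with 8000 references and a token-frequency standard deviation of 6.0.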