|Enis Ceyhun Alp||Ecole Polytechnique Fédérale de Lausanne|
|Paul Barford||comScore, University of Wisconsin|
The authors demonstrate how measurement, tracking, and other internet entities can associate multiple identifiers with a single device or user after coarse associations are made. The authors employ a Bayesian similarity algorithm
Datasets that organize and associate the many identifiers produced by PCs, smartphones, and tablets accessing the internet are referred to as internet device graphs . In this paper, we demonstrate how measurement, tracking, and other internet entities can associate multiple identifiers with a single device or user after coarse associations, e.g ., based on IP-colocation , are made. We employ a Bayesian similarity algorithm that relies on examples of pairs of identifiers and their associated telemetry, including user agent, screen size, and domains visited, to establish pair-wise scores. Community detection algorithms are applied to group identifiers that belong to the same device or user. We train and validate our methodology using a unique dataset collected from a client panel with full visibility, apply it to a dataset of 700 million device identifiers collected over the course of six weeks in the United States, and show that it outperforms several unsupervised learning approaches. Results show mean precision and recall exceeding 90% for association of identifiers at both the device and user levels.