An animal health example of managing and analyzing a large volume of data on a PC: Modeling body weight and age of over 13 million cats for explanatory and predictive purposes

2020 
Abstract
Large amounts of animal health data are available to researchers, but are often stored in different formats and information silos. Analysis of this existing information can provide new insights into the health and welfare of animals and possibly reduce the need to collect additional data. The objective of this study was to develop a method of managing and analyzing large amounts of data on a personal computer that can be run within 24 h to limit the time and resources spent deploying models on larger servers. This paper describes an overall approach that makes use of existing methods for data acquisition and modeling, but adapts and combines them in a way that allows manipulation and analysis of large volumes of data on a PC. The approach comprises five steps: removing errors; removing data points outside the scope of a specific hypothesis; creating descriptive statistics; developing explanatory and/or predictive models; and assessing the fit or accuracy of the models created. The approach was developed using electronic medical records for 19,416,753 feline patients from 3972 anonymized veterinary clinics in the United States and Canada, recorded between January 1981 and June 2016. Data regarding patient signalment (age, sex, breed, reproductive status) and body weight were extracted from the records and used to create linear regression models to describe body weight in cats of different ages, breeds, sexes and reproductive statuses. Ordinary least squares linear regression and stochastic gradient descent linear regression were compared, using 10-fold cross-validation, to determine their effectiveness and suitability for creating predictive models with large datasets. This approach could be used to build workflows to create models to determine exploratory and predictive properties of health parameters for animals and people. The ability to work with large datasets on a PC or equivalent technology was demonstrated.
Significant interactions were present among sex, reproductive status and age. A peak in weight occurred between 6 and 9 years, depending on the sex, reproductive status and breed. The predictive ability of the two models was similar: both produced a root mean square error of 1.45, a mean absolute error of 1.09, and a mean error of approximately zero on the validation dataset.
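The comparison described in the abstract (ordinary least squares versus stochastic gradient descent, scored by 10-fold cross-validation on root mean square error, mean absolute error and mean error) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the model uses age as its only predictor, and the function names, learning rate and number of epochs are hypothetical. It is not the authors' code and does not reproduce their results.

```python
import math
import random


def make_data(n, seed=0):
    """Synthetic (age in years, weight) pairs. NOT the study data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(1.0, 15.0)
        weight = 3.5 + 0.1 * age + rng.gauss(0.0, 0.5)  # assumed linear trend plus noise
        rows.append((age, weight))
    return rows


def fit_ols(train):
    """Closed-form least squares for weight = a + b * age."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    b = sxy / sxx
    return my - b * mx, b


def fit_sgd(train, epochs=20, lr=0.01, seed=0):
    """Same model fitted by stochastic gradient descent, one record per update."""
    rng = random.Random(seed)
    n = len(train)
    mx = sum(x for x, _ in train) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in train) / n)
    a, b = 0.0, 0.0  # coefficients on standardized age
    rows = list(train)
    for epoch in range(epochs):
        rng.shuffle(rows)
        step = lr / (1 + epoch)  # simple decaying learning rate (an assumption)
        for x, y in rows:
            z = (x - mx) / sx
            err = (a + b * z) - y
            a -= step * err
            b -= step * err * z
    slope = b / sx  # convert back to the original age scale
    return a - slope * mx, slope


def evaluate(model, test):
    """Return (RMSE, MAE, mean error) of predictions on held-out records."""
    a, b = model
    errors = [(a + b * x) - y for x, y in test]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mean_error = sum(errors) / n
    return rmse, mae, mean_error


def cross_validate(rows, fit, k=10, seed=0):
    """k-fold cross-validation; returns each metric averaged over the folds."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(evaluate(fit(train), folds[i]))
    return tuple(sum(s[m] for s in scores) / k for m in range(3))


data = make_data(2000)
ols_scores = cross_validate(data, fit_ols)
sgd_scores = cross_validate(data, fit_sgd)
print("OLS  RMSE, MAE, ME:", ols_scores)
print("SGD  RMSE, MAE, ME:", sgd_scores)
```

One reason to consider stochastic gradient descent for data of this size is that it updates the coefficients one record (or one small batch) at a time, so the full dataset does not need to be held in memory, whereas the closed-form solution here is computed from the complete training set.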