Creating Recommender Systems Datasets in Scientific Fields

M. Barros, Francisco M. Couto, Matilde Pato, Pedro Ruas

2021

Recommender systems (RS) have been successfully explored in a vast number of domains, e.g. movies and TV shows, music, or e-commerce. In these domains a large number of datasets are freely available for testing and evaluating new recommender algorithms, for example the MovieLens and Netflix datasets for movies, Spotify for music, and Amazon for e-commerce, which has translated into a large number of algorithms applied to these fields. In scientific fields, such as Health and Chemistry, standard and open access datasets with information about the preferences of the users are scarce. Three considerations matter in these fields. First, the application domain, i.e. "what the recommended item is". Second, the end users: researchers, pharmacists, clinicians or policy makers. Third, the availability of data. Because such data are scarce, anyone wishing to develop an algorithm for recommending scientific items has no access to datasets with information about the past preferences of a group of users. Given this limitation, we developed a methodology called LIBRETTI (LIterature Based RecommEndaTion of scienTific Items), whose goal is the creation of datasets related to scientific fields. These datasets are created from the major resource of knowledge that Science has: scientific literature. We consider the users to be the authors of the publications, the items to be the scientific entities (for example chemical compounds or diseases), and the ratings to be the number of publications an author wrote about an entity. In this tutorial we will cover state-of-the-art recommender systems in scientific fields, explain what Named Entity Recognition/Linking (NER/NEL) in research literature is, and demonstrate how to create a dataset for recommending drugs and diseases from research literature related to COVID-19. Our goal is to spread the use of the LIBRETTI methodology in order to help in the development of recommender algorithms in scientific fields.
More info about the tutorial at https://lasigebiotm.github.io/RecSys.Scifi/.
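The user/item/rating definition given in the abstract can be illustrated with a minimal Python sketch. It is not the LIBRETTI implementation: the input structure, the function name `build_ratings`, and all author and entity identifiers are hypothetical, and the entities are assumed to have already been extracted from each publication by an NER/NEL step. The sketch also assumes that an entity mentioned several times in one publication counts once for that publication.

```python
from collections import Counter

# Hypothetical toy corpus: each publication lists its authors and the
# entities found in it by a prior NER/NEL step. All identifiers are made up.
publications = [
    {"authors": ["author_a", "author_b"], "entities": ["drug_x", "disease_y"]},
    {"authors": ["author_a"], "entities": ["drug_x"]},
    {"authors": ["author_b"], "entities": ["disease_y", "disease_y"]},
]


def build_ratings(pubs):
    """Count, for each (author, entity) pair, the publications linking them.

    Authors play the role of users, entities the role of items, and the
    publication count the role of the rating.
    """
    ratings = Counter()
    for pub in pubs:
        # set() ensures a repeated mention inside one publication counts once.
        for author in set(pub["authors"]):
            for entity in set(pub["entities"]):
                ratings[(author, entity)] += 1
    return ratings


ratings = build_ratings(publications)

# Flatten into (user, item, rating) rows, the usual input shape for
# recommender algorithms.
rows = sorted((user, item, count) for (user, item), count in ratings.items())
for row in rows:
    print(row)
```

In this toy corpus `author_a` wrote two publications that mention `drug_x`, so that pair receives a rating of 2. The resulting rows can be written to a CSV file and loaded by most recommender libraries.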
