Automatic identification of relevant chemical compounds from patents
2019
In commercial research and development projects, public disclosure of new chemical
compounds often takes place in patents. Only a small proportion of these compounds
are published in journals, usually a few years after the patent. Patent authorities make
the patents available but do not provide systematic, continuous chemical annotations.
Content databases such as Elsevier’s Reaxys provide such services, mostly based on
manual excerption, which is time-consuming and costly. Automatic text-mining
approaches help overcome some of the limitations of the manual process. Different
text-mining approaches exist to extract chemical entities from patents. The majority
of them have been developed using sub-sections of patent documents and focus on
mentions of compounds. Less attention has been given to the relevancy of a compound in a
patent. Relevancy of a compound to a patent is based on the patent’s context. A relevant
compound plays a major role within a patent. Identification of relevant compounds
reduces the size of the extracted data and improves the usefulness of patent resources
(e.g. supports identifying the main compounds). Annotators of databases like Reaxys
only annotate relevant compounds. In this study, we design an automated system
that extracts chemical entities from patents and classifies their relevance. The gold-standard set contained 18 789 chemical entity annotations. Of these, 10% were relevant
compounds, 88% were irrelevant and 2% were equivocal. Our compound recognition
system was based on proprietary tools. The performance (F-score) of the system on
compound recognition was 84% on the development set and 86% on the test set. The
relevancy classification system had an F-score of 86% on the development set and 82% on the test set. Our system can extract chemical compounds from patents and
classify their relevance with high performance. This enables the extension of the Reaxys
database by means of automation.
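
The F-scores reported above combine precision and recall into a single number. As a minimal sketch, assuming the balanced F1 variant (the abstract does not state which F-measure was used), the computation could look as follows. The function name `f_score` and the counts in the usage line are hypothetical and are not taken from the study.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-score (F1): the harmonic mean of precision and recall.

    Assumption: the abstract's "F-score" is read here as F1.
    """
    predicted = true_pos + false_pos   # entities the system extracted
    actual = true_pos + false_neg      # entities in the gold standard
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not results from the paper).
example = f_score(true_pos=8, false_pos=2, false_neg=2)
```

With these made-up counts, precision and recall are both 0.8, so the F-score is also 0.8. The measure drops quickly when either precision or recall is low, which is why it is a common single-number summary for entity recognition and classification tasks.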