Query Optimization for Massive RDF Data Based on Spark

2018 
SPARQL (SPARQL Protocol and RDF Query Language) is a query language and data access protocol designed for RDF. Although it is defined for the RDF data model developed by the W3C, it can be used with any data resource that is represented in RDF. With the explosive growth of web information resources, more and more data is published in RDF form, and retrieving useful information from such massive data has become a major challenge. Efficient search and query processing have therefore become a focus of research. In this paper, we design an efficient optimization method, implemented in a system called SparkIlink, that works by finding a semantic connection chain. Data is stored on the Hadoop Distributed File System (HDFS). Built on the Spark framework with its efficient distributed in-memory processing, the system achieves efficient search and query optimization for massive RDF data. Our work includes the following mechanisms: (1) vertical partitioning as the data storage structure; (2) two rounds of data statistics; (3) an information connection chain based on semantics. The system supports queries over massive sets of triples in a distributed environment with efficient query processing. Our experiments use SPARQLGX, a recent Spark-based RDF system, as the baseline. Compared with SPARQLGX, our system is more efficient in data search.
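The abstract names vertical partitioning and data statistics without giving details, so the following is only a minimal sketch of the general idea in plain Python, not the SparkIlink implementation. It assumes that vertical partitioning means one (subject, object) table per predicate and that the statistics are per-predicate row counts used to evaluate the most selective triple pattern first. The function names and the sample triples are hypothetical.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Group (subject, predicate, object) triples into one table per predicate.

    Each table holds (subject, object) pairs, so a query that names a
    predicate only has to read that predicate's table.
    """
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def predicate_counts(tables):
    """Collect a simple statistic: the number of rows in each predicate table."""
    return {p: len(rows) for p, rows in tables.items()}


def order_patterns(patterns, counts):
    """Sort triple patterns so that the smallest predicate table comes first.

    A pattern is a (subject, predicate, object) tuple whose subject and
    object may be variables such as "?x". Unknown predicates count as 0 rows.
    """
    return sorted(patterns, key=lambda pat: counts.get(pat[1], 0))


# Hypothetical sample data, for illustration only.
triples = [
    ("alice", "type", "Student"),
    ("bob", "type", "Student"),
    ("carol", "type", "Professor"),
    ("alice", "advisor", "carol"),
]
tables = vertical_partition(triples)
counts = predicate_counts(tables)
plan = order_patterns([("?x", "type", "?t"), ("?x", "advisor", "?y")], counts)
```

Here `plan` places the `advisor` pattern before the `type` pattern because its table is smaller, which keeps the intermediate result of a later join small. In a Spark setting each predicate table would typically be a separate file or dataset on HDFS rather than an in-memory list.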