An Efficient Parallel Approach of Parsing and Indexing for Large-Scale XML Datasets

Kunfang Song, Hong-Wei Lu, Xiao Qin

2016

MapReduce is a widely adopted computing framework for data-intensive applications running on clusters. We propose an approach that exploits data parallelism in XML processing using MapReduce in Hadoop. Our solution seamlessly integrates data storage, labelling, indexing, and parallel queries to process massive amounts of XML data. Specifically, we introduce an SDN labelling algorithm and a distributed hierarchical index using DHTs, and we develop an efficient data retrieval approach called B-SLCA. More importantly, we design an advanced two-phase MapReduce solution that efficiently addresses labelling, indexing, and query processing on big XML data. We implemented our solution on a real-world Hadoop cluster processing real-world datasets. Our experimental results show that SDN outperforms NCIM by up to a factor of 1.36 with an average of 1.17, and that B-SLCA outperforms BwdSLCA by up to a factor of 1.96 with an average of 1.2.

