Graph-Based AJAX Crawl: Mining Data from Rich Internet Applications

2012 
AJAX (Asynchronous JavaScript and XML) has become increasingly popular with the rise of Web 2.0. However, traditional crawlers fail to retrieve information from AJAX applications because of their complex JavaScript operations. Moreover, a single AJAX application with one URL may have many different page states, which violates the assumption that one URL corresponds to one unique page. An AJAX application can be modeled as a state transition graph, and crawling it amounts to traversing that graph without prior knowledge of its structure. In this paper, we distinguish between different kinds of AJAX events, which were not well defined in previous work, and propose a Graph-based AJAX State Traversal (GAST) algorithm that crawls AJAX applications with minimal edge visits. If the topology of the graph is given, this optimization problem becomes a Directed Rural Postman Problem (DRPP), and the optimal lower bound can be obtained. Experimental results show that the proposed algorithm approaches the optimum and performs better than existing work.
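The state-transition-graph view described above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the GAST algorithm itself: the application is a hypothetical in-memory graph (`HIDDEN_APP`), the crawler can observe only the events available in its current state and the state reached after firing one, and it uses a simple greedy rule (walk to the nearest unexplored event over already known edges). All names (`SimulatedApp`, `crawl`, `path_to_unexplored`, the state and event labels) are invented for illustration.

```python
from collections import deque

# Hypothetical hidden application: state -> {event: next_state}.
# The crawler never reads this dict directly; it only uses events() and fire().
HIDDEN_APP = {
    "home":   {"click_tab": "list", "click_help": "help"},
    "list":   {"click_item": "detail", "click_home": "home"},
    "detail": {"click_back": "list"},
    "help":   {"click_home": "home"},
}


class SimulatedApp:
    """Stand-in for a single-URL AJAX page with several page states."""

    def __init__(self, graph, start):
        self._graph = graph
        self.state = start
        self.fired = 0  # number of events fired, i.e. edge visits

    def events(self):
        """Events that can be fired in the current state."""
        return sorted(self._graph[self.state])

    def fire(self, event):
        """Fire an event, move to the resulting state, and count the visit."""
        self.state = self._graph[self.state][event]
        self.fired += 1
        return self.state


def path_to_unexplored(known, start):
    """BFS over known edges; returns a list of events whose last entry
    is an event that has not been fired yet, or None if none is reachable."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        for event, target in known[state].items():
            if target is None:
                return path + [event]
        for event, target in known[state].items():
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [event]))
    return None


def crawl(app):
    """Discover the graph by firing events, without prior knowledge of it."""
    known = {}  # state -> {event: target state, or None if not fired yet}

    def observe():
        known.setdefault(app.state, {e: None for e in app.events()})

    observe()
    while True:
        path = path_to_unexplored(known, app.state)
        if path is None:
            break
        for event in path:
            source = app.state
            known[source][event] = app.fire(event)
            observe()
    return known


app = SimulatedApp(HIDDEN_APP, "home")
discovered = crawl(app)
print(len(discovered), "states discovered with", app.fired, "edge visits")
```

In this sketch the number of edge visits can exceed the number of edges, because the crawler sometimes has to re-traverse known edges to reach an unexplored event. That overhead is the quantity an approach with minimal edge visits tries to reduce, and the greedy rule here carries no optimality guarantee.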