w2c logo Missouri S&T
About People News Projects Publications Services Grants Contact Us
Projects

A Change detection algorithm for Unordered XML Document (XSQL)

“The eXtensible Markup language(XML) is the universal format for structured documents and data on the web” and it is likely to replace HTML as the standard format for publishing and transporting documents on the web[W3C]. In order to efficiently handle huge online information, which is constantly changing, an effective change detection tool is important. A lot of research is done in finding an effective storage and retrieval mechanism for XML documents and data. Previous work in XML change detection focused on detecting changes by constructing XML Trees from XML documents and comparing two tree structures. This approach may not be efficient in handling large XML documents due to the fact that an equivalent XML tree will be twice as large as the original document and the entire tree has to be memory resident during the comparison. Also, in case of unordered XML documents, Zhang et al. [ZSS92] proved that general unordered tree-to-tree correction problem is NP-Complete. We propose an algorithm (XSQL) for detecting changes in unordered XML documents using the simple yet novel approach of storing XML documents in relational databases and using iterative SQL queries to detect changes. Using relational databases was also due to the commercial success and the ability to store large volumes of XML documents. XSQL iterates over every node that has one or more child elements from bottom-up and identifies a key or a combination of keys to compare elements of the same type in two given XML documents. We have implemented a prototype system in Java using JDBC and we chose XREL[1] as XML storage scheme, storing XML documents in Oracle database. Our experimental results show that XSQL approach performs better both in terms of memory constraints(application heap size) and execution time than other currently available tools (X-Diff[2], DeltaXML[3]) in case of large sized shallow or semi-shallow XML documents. It also has comparable efficiency and result quality to these tools for small and medium sized XML documents. However, in case of large sized deep XML documents with depths of more than 15, DeltaXML was clearly the best of the alogirthms, while X-Diff couldn’t complete the change detection for file sizes of more than 2MB (with maximum heap size of 700mb). In case of applications like online query system, the tree comparison approach may be more suitable and for applications like mining and knowledge discovery from XML data, the relational approach may be more suitable. Also for data mining applications and XML versioning systems, the relational approach will be more feasible.

Researcher

Dr. Sanjay Madria

Dr. Sourav Bhowmick, Nanyang Technological University, Singapore