A Change detection algorithm
for Unordered XML Document (XSQL)
“The eXtensible Markup language(XML) is the
universal format for structured documents and data
on the web” and it is likely to replace HTML
as the standard format for publishing and transporting
documents on the web[W3C]. In order to efficiently
handle huge online information, which is constantly
changing, an effective change detection tool is important.
A lot of research is done in finding an effective
storage and retrieval mechanism for XML documents
and data. Previous work in XML change detection focused
on detecting changes by constructing XML Trees from
XML documents and comparing two tree structures. This
approach may not be efficient in handling large XML
documents due to the fact that an equivalent XML tree
will be twice as large as the original document and
the entire tree has to be memory resident during the
comparison. Also, in case of unordered XML documents,
Zhang et al. [ZSS92] proved that general unordered
tree-to-tree correction problem is NP-Complete. We
propose an algorithm (XSQL) for detecting changes
in unordered XML documents using the simple yet novel
approach of storing XML documents in relational databases
and using iterative SQL queries to detect changes.
Using relational databases was also due to the commercial
success and the ability to store large volumes of
XML documents. XSQL iterates over every node that
has one or more child elements from bottom-up and
identifies a key or a combination of keys to compare
elements of the same type in two given XML documents.
We have implemented a prototype system in Java using
JDBC and we chose XREL[1] as XML storage scheme, storing
XML documents in Oracle database. Our experimental
results show that XSQL approach performs better both
in terms of memory constraints(application heap size)
and execution time than other currently available
tools (X-Diff[2], DeltaXML[3]) in case of large sized
shallow or semi-shallow XML documents. It also has
comparable efficiency and result quality to these
tools for small and medium sized XML documents. However,
in case of large sized deep XML documents with depths
of more than 15, DeltaXML was clearly the best of
the alogirthms, while X-Diff couldn’t complete
the change detection for file sizes of more than 2MB
(with maximum heap size of 700mb). In case of applications
like online query system, the tree comparison approach
may be more suitable and for applications like mining
and knowledge discovery from XML data, the relational
approach may be more suitable. Also for data mining
applications and XML versioning systems, the relational
approach will be more feasible.
Researcher
Dr. Sanjay Madria
Dr. Sourav Bhowmick, Nanyang Technological University, Singapore
|