Change Management of Web Data
The Web offers access to large amounts of heterogeneous
information and contains a mix of many different data
types. Specifically, it contains data types that were
not previously available on a large scale, including
hyperlinks and massive amounts of (indirect) user usage
information. Spanning all these data types is the
dimension of time, since data on the Web can change
at any time and in any way. These changes take two
general forms. The first is existence: Web pages and
Web sites exhibit varied longevity patterns. The second
is structure and content modification: a new version
of a Web page replaces its antecedent, usually leaving
no trace of the previous document. In addition, some
data is generated dynamically in response to user input
and programmatic scripts. The availability of these
various types of data, which often change rapidly and
unpredictably, creates a new problem: mining the data
to discover useful and hidden information and knowledge.
In this project, we focus on mining changes to Web
data (also called web deltas). This is a challenging
problem because information sources on the Web are
autonomous, so typical database approaches to detecting
and mining these changes are not usable. In conventional
databases, detecting changes to data is made easier by
the availability of facilities such as transaction logs
and triggers. However, such facilities are often absent
for web sites. Even where they are available, they may
not be accessible to outside users.
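As a rough illustration of what detecting a web delta
can look like without transaction logs or triggers, the
sketch below compares two locally stored snapshots of the
same page line by line. It is not the method used in this
project; the function names (fingerprint, web_delta) and
the sample snapshots are assumptions made up for the
example, and a real system would compare page structure
rather than raw lines.

```python
import difflib
import hashlib


def fingerprint(text: str) -> str:
    """Hash a snapshot so unchanged pages can be skipped cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def web_delta(old: str, new: str) -> dict:
    """Return the lines inserted and deleted between two snapshots.

    Hypothetical helper: a line-level diff stands in for the
    structural comparison a real change-mining system would need.
    """
    if fingerprint(old) == fingerprint(new):
        return {"inserted": [], "deleted": []}

    old_lines = old.splitlines()
    new_lines = new.splitlines()
    inserted, deleted = [], []

    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            inserted.extend(new_lines[j1:j2])
    return {"inserted": inserted, "deleted": deleted}


# Made-up snapshots of one product page, fetched at two different times.
snapshot_v1 = "Widget A\nPrice: $10\nIn stock"
snapshot_v2 = "Widget A\nPrice: $12\nIn stock\nFree shipping"

delta = web_delta(snapshot_v1, snapshot_v2)
print(delta)
```

Because the observer has to keep its own snapshots, the
history that the site itself discards is preserved on the
observer's side, which is what makes later analysis of
the deltas possible.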
A system for mining web deltas has a wide range of
applications. For instance, it can be used to monitor
E-commerce web sites and analyze product features and
trends over a period of time. Companies can monitor
the evolution of their competitors' web sites to discover
new directions or offerings that may influence their
market positions. Observe that such analysis is impossible
if it is performed directly on the web sites, as these
sites do not keep track of historical data in a format
that can be analyzed by their competitors.
Researchers
Dr. Sanjay Madria
Yan Chen (MS Thesis), UMR
Dr. Sourav Bhowmick, Nanyang Technological University, Singapore