w2c logo Missouri S&T
Projects

Change Management of Web Data

The Web offers access to large amounts of heterogeneous information and contains a mix of many different data types. In particular, it contains data types that were not previously available at large scale, including hyperlinks and massive amounts of (indirect) user usage information. Spanning all of these data types is the dimension of time, since data on the Web can change at any time and in any way. These changes take two general forms. The first is existence: Web pages and Web sites exhibit varied longevity patterns. The second is structure and content modification: a Web page replaces its antecedent, usually leaving no trace of the previous document. In addition, some data is generated dynamically in response to user input and programmatic scripts. The availability of these various types of data, which often change rapidly and unpredictably, creates a new problem: mining the data to discover useful, hidden information and knowledge.

In this project, we focus on mining changes to Web data (also called web deltas). This is a challenging problem because information sources on the Web are autonomous, and typical database approaches to detecting and mining these changes are not usable. In conventional databases, detecting changes to data is made easier by facilities such as transaction logs and triggers. Such facilities are often absent for web sites, and even where they are available, they may not be accessible to outside users.
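Without transaction logs or triggers, one way an outside observer can obtain a web delta is to compare two locally stored snapshots of a site. The following is a minimal sketch of that idea, not the project's actual method: it assumes each snapshot is a dictionary mapping a URL to the page content, and the names `WebDelta`, `fingerprint`, and `compute_delta` are hypothetical. It classifies changes into the two forms described above: existence (pages inserted or deleted) and content modification.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class WebDelta:
    """Changes between two snapshots of a site (hypothetical structure)."""
    inserted: list[str] = field(default_factory=list)   # URLs only in the new snapshot
    deleted: list[str] = field(default_factory=list)    # URLs only in the old snapshot
    modified: list[str] = field(default_factory=list)   # URLs whose content differs


def fingerprint(content: str) -> str:
    """Return a SHA-256 digest of the page content for cheap comparison."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def compute_delta(old_snapshot: dict[str, str],
                  new_snapshot: dict[str, str]) -> WebDelta:
    """Classify every URL seen in either snapshot as inserted, deleted, or modified."""
    delta = WebDelta()
    for url in sorted(old_snapshot.keys() | new_snapshot.keys()):
        if url not in old_snapshot:
            delta.inserted.append(url)
        elif url not in new_snapshot:
            delta.deleted.append(url)
        elif fingerprint(old_snapshot[url]) != fingerprint(new_snapshot[url]):
            delta.modified.append(url)
    return delta


if __name__ == "__main__":
    # Illustrative data only; the URLs and page contents are made up.
    old = {
        "http://example.com/": "<h1>Home</h1>",
        "http://example.com/p1": "<p>Widget, $10</p>",
        "http://example.com/p2": "<p>Gadget, $20</p>",
    }
    new = {
        "http://example.com/": "<h1>Home</h1>",
        "http://example.com/p1": "<p>Widget, $12</p>",
        "http://example.com/p3": "<p>Gizmo, $30</p>",
    }
    print(compute_delta(old, new))
```

Hashing whole pages only reveals that a page changed, not what changed inside it. Mining structural changes within a page would require comparing parsed document trees, which this sketch does not attempt.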

A system for mining web deltas has a wide range of applications. For instance, it can be used to monitor e-commerce web sites and analyze product features and trends over a period of time. Companies can monitor the evolution of their competitors' web sites to discover new directions or offerings that may influence their market positions. Note that such analysis is impossible if it is performed directly on the web sites, since they do not keep historical data in a format that their competitors can analyze.

Researchers

Dr. Sanjay Madria

Yan Chen (MS Thesis), UMR

Dr. Sourav Bhowmick, Nanyang Technological University, Singapore