WEB & WIRELESS COMPUTING LABORATORY :: MISSOURI University of Science & Technology

Projects

WHOWEDA : Warehouse of Web Data

From a user's perspective, the World Wide Web is a broadcast medium where a wide range of up-to-date information can be obtained at a low cost. Information on the WWW is important not only to individual users but also to business organizations, especially when critical decision making is concerned. While most users obtain WWW information using a combination of search engines and browsers, these two types of retrieval mechanisms do not necessarily address all of a user's information needs.

It is instructive to examine why available tools such as search engines fall short of delivering the requisite knowledge. First, search engines are purely resource locators with no capability to reliably suggest the contents of the websites they return in response to a query. Furthermore, the task of information retrieval still burdens the user, who has to manually sift through 'potential' sites to discover the relevant information. Often, the repetitive, sequential list of sites returned adds redundancy and tedium to the process. Thus, business organizations currently lack suitable tools to systematically harness strategic information from the Web that may affect them.

To overcome the limitations of search engines and provide the user with a powerful and friendly query mechanism for accessing information on the web, the critical problem is to find effective ways to build web data models of the information of interest, and to provide a mechanism for manipulating this information to derive additional useful information. Until now, knowledge discovery on the WWW has been limited to mining path traversal patterns by analysing server access logs (Web Log Mining) and extracting semistructured information from HTML documents. This leaves much to be desired in deriving interesting, non-explicit patterns from the web information base.

WHOWEDA Approach

Our goal is to design and implement a web warehouse that materializes and manages useful information from the web to support strategic decision making. We aim to build a web warehouse containing strategic information coupled from the web that may also inter-operate with conventional data warehouses.

WHOWEDA is a meta-data repository of useful, relevant web information, available for querying and analysis. As relevant information becomes available on the WWW, it is coupled from various sources, translated into a common web data model (the Web Information Coupling Model), and integrated with existing data in WHOWEDA. At the warehouse, queries can be answered and web data analysis can be performed quickly and efficiently because the information is directly available. Accessing data at the warehouse does not incur the costs that may be associated with accessing data from information sources scattered across different geographical locations. In a web warehouse, data remains available even when the WWW sources are inaccessible.
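The idea of materializing web information locally can be illustrated with a minimal sketch. It is not the WHOWEDA implementation: the names WebNode, ToyWarehouse, couple, and query are hypothetical, the document model is reduced to a URL, a title, and plain text, and the fetch function is supplied by the caller, so no real network access is assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class WebNode:
    """A materialized copy of one web document (hypothetical, simplified)."""
    url: str
    title: str
    text: str


@dataclass
class ToyWarehouse:
    """Keeps materialized nodes locally so queries never contact the sources."""
    nodes: Dict[str, WebNode] = field(default_factory=dict)

    def couple(self, url: str, fetch: Callable[[str], Optional[WebNode]]) -> bool:
        """Fetch a document from its source and integrate it into the warehouse.

        Returns False, keeping any previously stored copy, when the
        source is unreachable (modelled here as fetch returning None).
        """
        node = fetch(url)
        if node is None:
            return False
        self.nodes[url] = node
        return True

    def query(self, keyword: str) -> List[WebNode]:
        """Answer a keyword query from local data only."""
        kw = keyword.lower()
        return [n for n in self.nodes.values() if kw in n.text.lower()]


# Assumed example data: the source is reachable once, then goes offline.
warehouse = ToyWarehouse()
warehouse.couple(
    "http://example.org/report",
    lambda u: WebNode(u, "Report", "Quarterly earnings report"),
)
warehouse.couple("http://example.org/report", lambda u: None)  # source offline

# The stored copy still answers the query.
hits = warehouse.query("earnings")
```

In this sketch the second call to couple fails, yet the query still succeeds, which mirrors the point that warehouse data stays available when the original sources do not.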

WHOWEDA consists of two major components: a data manipulation module called the Web Information Coupling System (WICS) and a data mining module called the Web Information Mining System (WIMS). WICS focuses on the manipulation of information in the WHOWEDA system. It includes the following tasks:

· extraction and retrieval of information from the WWW,

· storage and organization of information in the warehouse,

· data manipulation via web operators such as web select, web join, and web project,

· knowledge discovery and web mining,

· ranking,

· schema evaluation.

WICS brings WWW information into the warehouse and provides various operators for preprocessing and storing this information. The information is then fed into WIMS for various forms of mining and knowledge discovery. Thus, WICS plays a supporting role in the overall WHOWEDA system.
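To give a feel for operators such as web select, web project, and web join, here is a deliberately simplified sketch. It is an assumption-laden illustration, not the operator semantics defined by WHOWEDA: a "web tuple" is reduced to a dictionary from a node label to a URL, a "web table" is a list of such tuples, links between nodes are ignored, and all function names and sample URLs are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

# Toy model (assumption): a web tuple maps a node label to the URL bound to it.
WebTuple = Dict[str, str]
WebTable = List[WebTuple]


def web_select(table: WebTable, predicate: Callable[[WebTuple], bool]) -> WebTable:
    """Keep only the tuples that satisfy the predicate."""
    return [t for t in table if predicate(t)]


def web_project(table: WebTable, labels: Iterable[str]) -> WebTable:
    """Keep only the named node labels in each tuple, dropping duplicates."""
    keep = list(labels)
    seen = set()
    result = []
    for t in table:
        projected = {k: t[k] for k in keep if k in t}
        key = tuple(sorted(projected.items()))
        if key not in seen:
            seen.add(key)
            result.append(projected)
    return result


def web_join(left: WebTable, right: WebTable, on_left: str, on_right: str) -> WebTable:
    """Combine tuples whose joining nodes are bound to the same URL."""
    joined = []
    for lt in left:
        for rt in right:
            if on_left in lt and lt[on_left] == rt.get(on_right):
                merged = dict(lt)
                merged.update(rt)  # on a label collision the right tuple wins
                joined.append(merged)
    return joined


# Assumed sample data: company pages and independent product reviews.
companies = [
    {"home": "http://a.example/", "product": "http://a.example/p1"},
    {"home": "http://b.example/", "product": "http://b.example/p9"},
]
reviews = [
    {"reviewed": "http://a.example/p1", "review": "http://r.example/1"},
]

reviewed_companies = web_join(companies, reviews, "product", "reviewed")
summary = web_project(reviewed_companies, ["home", "review"])
only_b = web_select(companies, lambda t: t["home"].startswith("http://b."))
```

Here the join pairs a company with a review because both tuples contain a node bound to the same product URL, and the projection then keeps just the home page and the review page.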

We explored other aspects of the warehouse such as improving the data model to handle XML and HTML data, ranking, query processing, web data mining, maintenance, and so on.

This project has led to many international publications in conferences and journals.

Researchers

Dr. Sanjay Madria

Dr. Sourav Bhowmick, Nanyang Technological University, Singapore

Dr. Wee Keong, Nanyang Technological University, Singapore

Dr. Lim Ee Peng, Nanyang Technological University, Singapore