From a user's perspective, the World Wide Web is
a broadcast medium where a wide range of up-to-date
information can be obtained at a low cost. Information
on the WWW is important not only to individual users,
but also to business organizations, especially when
critical decision making is concerned. While most
users obtain WWW information using a combination of
search engines and browsers, these two types of retrieval
mechanisms do not necessarily address all of a user's
information needs.
It is instructive to examine why available tools
such as search engines fall short of delivering the
requisite knowledge. First, search engines are purely
resource locators with no capability to reliably suggest
the contents of the websites they return in response
to a query. Furthermore, the task of information retrieval
still burdens the user, who has to manually sift through
'potential' sites to discover the relevant information.
Often, the repetitive, sequential list of sites
returned adds redundancy and tedium to the process.
Thus, business organizations currently lack
suitable tools to systematically harness strategic
information from the Web that may affect them.
To overcome the limitations of search engines and
provide the user with a powerful and friendly query
mechanism for accessing information on the web, the
critical problem is to find effective ways to
build web data models of the information of interest,
and to provide a mechanism for manipulating this information
to derive additional useful information. Until now,
knowledge discovery on the WWW has been limited to
mining path traversal patterns by analysing server
access logs (Web Log Mining) and extracting semistructured
information from HTML documents. This leaves much
to be desired in deriving interesting, non-explicit
patterns from the web information base.
WHOWEDA Approach
The goal of WHOWEDA is to design and implement a web warehouse
that materializes and manages useful information from the web
to support strategic decision making. The warehouse contains
strategic information coupled from the web and may also
interoperate with conventional data warehouses.
WHOWEDA is a meta-data repository of useful, relevant
web information, available for querying and analysis.
As relevant information becomes available on the WWW,
it is coupled from various sources,
translated into a common web data model (Web Information
Coupling Model), and integrated with existing data
in WHOWEDA. At the warehouse, queries can be answered
and web data analysis can be performed quickly and
efficiently since the information is directly available.
Accessing data at the warehouse does not incur the costs
that may be associated with accessing data from
information sources scattered across different geographical
locations. In a web warehouse, data is available even
when the WWW sources are inaccessible.
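The materialization idea described above can be illustrated with a small Python sketch. It is not WHOWEDA's implementation: the `WebRecord` and `WebWarehouse` classes, the `couple` and `query` methods, the fetch functions, and the example URL are all hypothetical names chosen for illustration, and a flat record with a keyword search stands in for the Web Information Coupling Model, which is not specified here. The sketch only shows why a locally stored copy can still answer queries when a source becomes unreachable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class WebRecord:
    """Hypothetical common-model record for one coupled web document."""
    url: str
    title: str
    text: str


@dataclass
class WebWarehouse:
    """Toy warehouse: stores records locally so queries never touch the sources."""
    records: Dict[str, WebRecord] = field(default_factory=dict)

    def couple(self, url: str, fetch: Callable[[str], Optional[dict]]) -> bool:
        """Fetch a source, translate it into a WebRecord, and integrate it.

        Returns False, keeping any previously stored copy, when the
        source cannot be reached (fetch returns None).
        """
        raw = fetch(url)
        if raw is None:
            return False
        self.records[url] = WebRecord(url, raw.get("title", ""), raw.get("body", ""))
        return True

    def query(self, keyword: str) -> List[WebRecord]:
        """Answer a keyword query from the materialized copies only."""
        kw = keyword.lower()
        return [r for r in self.records.values()
                if kw in r.title.lower() or kw in r.text.lower()]


# Simulated sources (assumption: a real system would fetch over HTTP).
SOURCES = {
    "http://example.org/a": {"title": "Market report", "body": "Quarterly sales rose."},
}


def online_fetch(url: str) -> Optional[dict]:
    return SOURCES.get(url)


def offline_fetch(url: str) -> Optional[dict]:
    return None  # simulates an inaccessible WWW source


warehouse = WebWarehouse()
warehouse.couple("http://example.org/a", online_fetch)

# The source goes offline: the refresh fails, but the stored copy remains.
refreshed = warehouse.couple("http://example.org/a", offline_fetch)
hits = warehouse.query("sales")
```

In this sketch the failed refresh leaves the earlier record in place, so `query` still returns the market report even though the source is no longer reachable.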
WHOWEDA consists of two major components: a data
manipulation module called the Web Information Coupling
System (WICS) and a data mining module called the Web
Information Mining System (WIMS). WICS focuses on
the manipulation of information in the WHOWEDA system.
It includes the following tasks:
· extraction and retrieval of information from the WWW,
· storage and organization of information in the warehouse,
· data manipulation via various web operators such as web select, web join, and web project,
· knowledge discovery and web mining,
· ranking,
· schema evaluation.
WICS brings WWW information into the warehouse and
provides various operators for preprocessing and storing
this information, which is then fed into
WIMS for various forms of mining and knowledge discovery.
Thus, WICS plays a supporting role in the overall
WHOWEDA system.
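To give a rough feel for how operators such as web select, web join, and web project might be chained before data reaches a mining step, the following Python sketch uses relational-style analogues over lists of dictionaries. This is an analogy only: WHOWEDA's operators are defined over its own web data model, whose semantics are not described here, and the function signatures, attribute names, and sample data below are assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Assumption: a "web tuple" is modelled here as a flat dict of attributes.
WebTuple = Dict[str, str]


def web_select(table: List[WebTuple],
               predicate: Callable[[WebTuple], bool]) -> List[WebTuple]:
    """Keep only the tuples that satisfy the predicate."""
    return [t for t in table if predicate(t)]


def web_project(table: List[WebTuple], attrs: List[str]) -> List[WebTuple]:
    """Keep only the named attributes of each tuple."""
    return [{a: t[a] for a in attrs if a in t} for t in table]


def web_join(left: List[WebTuple], right: List[WebTuple],
             key: str) -> List[WebTuple]:
    """Combine tuples from both tables that agree on the key attribute."""
    index: Dict[str, List[WebTuple]] = {}
    for r in right:
        if key in r:
            index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left if key in l for r in index.get(l[key], [])]


# Hypothetical sample data.
pages = [
    {"url": "u1", "title": "Drug A trial", "host": "h1"},
    {"url": "u2", "title": "Sports news", "host": "h2"},
]
hosts = [{"host": "h1", "org": "Clinic X"}]

selected = web_select(pages, lambda t: "drug" in t["title"].lower())
joined = web_join(selected, hosts, "host")
result = web_project(joined, ["url", "org"])
```

Here `result` holds a single tuple pairing the page `u1` with the organization `Clinic X`; in a pipeline like the one described above, such preprocessed output would be what the manipulation step hands to the mining step.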
We have also explored other aspects of the warehouse, such as
improving the data model to handle XML and HTML data,
ranking, query processing, web data mining, maintenance,
and so on.
This project has led to many publications
in international conferences and journals.