MetaWeb Search Engine Design Document

Zhimin Zhan

Software Engineer

Distributed Systems Technology Centre

1. Introduction

This document gives a high-level overview of the internal architecture of the MetaWeb Search Engine.

The MetaWeb Search Engine is a metadata search engine, which fetches Web resources from the Internet, indexes the metadata extracted from them and provides a user interface to search on this metadata. The MetaWeb Search Engine is a product of the MetaWeb Project, an initiative partially funded by the National Priority (Reserve) Fund allocation for Improved Library Infrastructure (administered by the AV-CC Standing Committee on Information Resources).

2. Components

The MetaWeb Search Engine follows the same basic design as the Harvest system and consists of two main components: a Gatherer and a Broker. Both were written in Java using the Java Development Kit (JDK) version 1.1 and can be run using either the JDK or the Java Runtime Environment version 1.1. The software has been tested on a Sun server running SunOS version 5.5.1.

See Figure 1 for an overview.

2.1 Gatherer

The Gatherer is responsible for fetching Web-accessible resources from the Internet, parsing metadata from <META> tags embedded in HTML files into attribute-value pairs, encoding the metadata into SOIF (Summary Object Interchange Format), and sending the SOIF abstracts to a Repository managed by the Broker. Web-accessible resources are available at URLs or PURLs.


2.2 Broker

The Broker receives indexing information (as SOIFs) from one or more Gatherers, removes duplicate information, incrementally indexes the collected information (saving into the metadata repository), and provides a WWW query interface to it.

Fig. 1 Overview of MetaWeb Search Engine

3. Gatherer

The MetaWeb Gatherer has been designed using object-oriented technologies. To avoid Gatherers fetching URLs too frequently, a URL database is used to register the URLs which have been gathered. With the help of this information, a TIME_INTERVAL parameter can be set in the gatherer’s configuration file to decide how often to check the URLs. The URL database is accessed via a JDBC interface.
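The TIME_INTERVAL check described above can be sketched as follows. This is a minimal illustration written in current Java (java.time needs Java 8 or later) rather than JDK 1.1. The class and method names (RegatherCheck, isDue) are hypothetical, and the sketch assumes the last check date has already been read from the URL database.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper: decides whether a URL is due for (re)gathering.
class RegatherCheck {
    // lastCheck:    the last date the URL was checked, or null if never gathered
    // today:        the current date
    // intervalDays: the TIME_INTERVAL setting from the configuration file
    static boolean isDue(LocalDate lastCheck, LocalDate today, int intervalDays) {
        if (lastCheck == null) {
            return true; // never gathered before
        }
        long elapsedDays = ChronoUnit.DAYS.between(lastCheck, today);
        return elapsedDays >= intervalDays;
    }
}
```

With TIME_INTERVAL set to 14, a URL last checked eight days ago would be skipped, while one last checked fourteen or more days ago would be fetched again.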


3.1 URL Database Structure

The URL database has been set up using MiniSQL, version M*SQL 2.0.31. There are four fields in its table:

 

No.   Field Name          Description
1.*   Identifier          URL names
2.    Metadata Standard   Metadata standard used by this page, e.g. DC
3.    Last Check Date     The last date on which the gatherer checked this URL
4.    ID                  Assigned ID for this URL

* the key for this table

Fig. 2 Table structure of the URL database


3.2 Components of MetaWeb Gatherer

Gatherer Classes

The following is a list of Java classes used in the Gatherer:

Gatherer class

A class used to represent the Gatherer. This class reads the configuration file to customise the Gatherer, calls one of the following three gathering classes based on a configuration file parameter, schedules (re)gathering, and generates the log files. See Appendix III for an illustration of the log files.

GatherSite class

The GatherSite class harvests the root URLs one by one, fetches URLs from the URL Database, generates a fetcher to fetch web pages, adds their hyperlinks back into the URL Database, then sends the SOIF abstracts to the broker.  

GatherMulti class

The GatherMulti class does the same thing as the GatherSite class but provides better performance by generating multiple fetchers in parallel. The limitation of this class compared to the GatherSite class is that there is no depth control when harvesting.

GatherImport class

The GatherImport class adds the URLs listed in a given file, then harvests them without retrieving the pages they link to. This option is applicable where metadata records are held separately from the resources they describe.

RobotExclusion class

The RobotExclusion class checks the standard robots.txt file on the targeted server, analyses it and saves the information, to ensure that only permitted URLs are harvested and that there is no conflict with any site policy.
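A robots.txt check of this kind might look like the sketch below. It is a simplified illustration, not the RobotExclusion implementation: the class name RobotRules is hypothetical, only records addressed to all robots ("User-agent: *") are honoured, and only Disallow prefixes are interpreted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a robots.txt check (handles "User-agent: *" only).
class RobotRules {
    private final List<String> disallowed = new ArrayList<String>();

    // Collects the Disallow prefixes that apply to all robots.
    RobotRules(String robotsTxt) {
        boolean appliesToUs = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#');          // strip trailing comments
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                          // blank or malformed line
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("User-agent")) {
                appliesToUs = value.equals("*");
            } else if (field.equalsIgnoreCase("Disallow") && appliesToUs && !value.isEmpty()) {
                disallowed.add(value);
            }
        }
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A gatherer would build one RobotRules object per server and consult isAllowed() before adding a URL to the URL Database.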

 

Fetcher Classes

Fetcher class

The Fetcher is a threaded class which fetches the content of a URL supplied by the gatherer classes and sends it to the HTMLMetaParser class, which parses it and returns the SOIF abstract, the hyperlinks and the metadata standard used in this URL. The SOIF abstract is then sent to the broker for indexing. A SOIF abstract appears as follows:

@FILE { http://www7.conf.au/
DC.Creator{30}: Andrew Wood, woody@dstc.edu.au
DC.Title{47}: The 7th International World Wide Web Conference
}

Fig. 3 An example of a SOIF abstract used by the MetaWeb search engine
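The layout of the SOIF abstract above can be reproduced with a few lines of Java. The sketch below is illustrative only: SoifWriter is a hypothetical name, the number in braces is assumed to be the length of the value in bytes (which matches the sizes 30 and 47 in the example), and a tab is assumed between the colon and the value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical encoder for the simple SOIF layout shown in the example above.
class SoifWriter {
    // pairs: each element is {attribute name, value}; repeated names are allowed.
    static String encode(String url, String[][] pairs) {
        StringBuilder out = new StringBuilder();
        out.append("@FILE { ").append(url).append('\n');
        for (String[] pair : pairs) {
            String name = pair[0];
            String value = pair[1];
            // Assumption: the size field counts the bytes of the value.
            int size = value.getBytes(StandardCharsets.UTF_8).length;
            out.append(name).append('{').append(size).append("}:\t")
               .append(value).append('\n');
        }
        out.append("}\n");
        return out.toString();
    }
}
```

Calling encode() with the URL http://www7.conf.au/ and the two DC attributes from the example yields the same attribute names and sizes as shown there.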

 

HTMLMetaParser class

The HTMLMetaParser class parses the HTML content and sends the <META> tags to the meta-tag-parsers to do further parsing and return the attribute-value pairs in formatted standard SOIF. It also extracts all of the hyperlinks inside each document (within the limit of the MAX_LINKS parameter).

Parser Classes


MetaTagParser class

The MetaTagParser class is an abstract class. It is the parent class for the individual meta_tag_parsers. The MetaWeb search engine can be extended to support a new metadata standard by writing a new parser that inherits from the MetaTagParser class. This design makes the MetaWeb search engine flexible enough to handle the increasingly complex data structures of new metadata standards.

 

DefaultMetaTagParser class

The DefaultMetaTagParser class is used to parse the common <META> tags, like <META NAME="ATTRIBUTE" CONTENT="VALUE">. This class is called first by the HTMLMetaParser class to get the attribute of the <META> tag and choose the right meta_tag_parser to parse it. For instance, for the attribute "DC.Creator", the DublinCoreMetaTagParser will be chosen. If the expected parser cannot be found or the metadata format is not recognisable, the DefaultMetaTagParser class can either return the metadata as an attribute-value pair as it is, or ignore it.

DublinCoreMetaTagParser class

The DublinCoreMetaTagParser class is one specific meta_tag_parser designed to parse the metadata using the Dublin Core Metadata standard in <META> tags.
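The relationship between the three parser classes can be sketched as below. The class names carry a "Sketch" suffix to make clear that they are hypothetical stand-ins, not the MetaWeb sources. The sketch assumes that the parser is chosen from the namespace prefix before the first full stop in the attribute name (for example "DC" in "DC.Creator").

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Parent of all metadata-standard parsers (stands in for MetaTagParser).
abstract class MetaTagParserSketch {
    // Returns {attribute, value}, or null to ignore the tag.
    abstract String[] parse(String name, String content);
}

// Stands in for DublinCoreMetaTagParser; here it only tidies the value.
class DublinCoreParserSketch extends MetaTagParserSketch {
    String[] parse(String name, String content) {
        return new String[] { name, content.trim() };
    }
}

// Stands in for DefaultMetaTagParser: picks a specific parser by prefix
// and falls back to itself when none is registered.
class DefaultParserSketch extends MetaTagParserSketch {
    private final Map<String, MetaTagParserSketch> byPrefix =
        new LinkedHashMap<String, MetaTagParserSketch>();

    void register(String prefix, MetaTagParserSketch parser) {
        byPrefix.put(prefix.toUpperCase(Locale.ROOT), parser);
    }

    MetaTagParserSketch choose(String name) {
        int dot = name.indexOf('.');
        if (dot > 0) {
            String prefix = name.substring(0, dot).toUpperCase(Locale.ROOT);
            MetaTagParserSketch parser = byPrefix.get(prefix);
            if (parser != null) {
                return parser;
            }
        }
        return this; // unknown standard: handle the tag generically
    }

    String[] parse(String name, String content) {
        return new String[] { name, content }; // pass through unchanged
    }
}
```

Supporting a further metadata standard then amounts to writing one more subclass and registering it under its prefix.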

 

Communication Classes

Msg_ToBroker class

After the fetchers have finished fetching, the generated SOIF abstracts are sent to the Broker by calling this class.

Msg_FromBroker class

The Msg_FromBroker class is called to receive data from the broker.

JDBC Interface

SQLURLDatabase class

The SQLURLDatabase class provides some basic functions such as Insert(), Delete(), and Update() for the Gatherer to communicate with the URL database via the JDBC interface.

JDBC Driver

The JDBC driver chosen by the MetaWeb Project for the URL database was the one from Imaginary, available at http://www.imaginary.com/Java/java.html. It was selected from the list of JDBC drivers for SQL databases at http://java.sun.com/products/jdbc/jdbc.drivers.html.

 

3.3 Gatherer Details

The core part of the MetaWeb Gatherer is the threaded fetching process, which is called by all three Gatherer components: GatherSite, GatherMulti and GatherImport. The gathering process provides a URL (from the URL database) to the fetcher. The fetcher retrieves the SOIF abstract of the given URL, then updates the URL database with the data received.

Fig. 4 Fetching one URL

Figure 4 illustrates how the fetching process works. It consists of the following steps:

    1. The Fetcher receives the URL Name provided by the Gatherer.
    2. The Fetcher retrieves the content of the given URL, up to 16,000 characters (to ensure the capture of the <META> tag in full, and relevant links).
    3. The Fetcher sends the URL’s content to the HTMLMetaParser to process.
    4. HTMLMetaParser locates the <META> tags in the given HTML text, and sends them one by one to the MetaTagParser to parse.
    5. The DefaultMetaTagParser is called first to get the attribute of one <META> tag. Depending on the attribute name, a specific MetaTagParser will be invoked. For example, if the attribute name is "DC.Creator", then the DublinCoreMetaTagParser will be invoked.
    6. The individual MetaTagParser recognises the data format within the <META> tags, parses them, and sends each attribute-value pair to the HTMLMetaParser.
    7. When the HTMLMetaParser gets all attribute-value pairs, it formats them into SOIF abstracts, then sends them back to the Fetcher.
    8. As each SOIF abstract is received, the Fetcher sends it to the Broker over a standard HTTP connection, then waits for a response.
    9. Although the fetching itself is now finished, some important data has also been collected, such as the hyperlinks in the URL (which are added to the URL database) and the metadata standard used in this URL (which is useful for statistics displayed in the log files). The Fetcher waits for the Gatherer to retrieve this information before it stops.
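Steps 2 to 4 above (reading a bounded amount of content and locating the <META> tags) can be illustrated with the sketch below. It is regex-based and deliberately simple, not a full HTML parser and not the HTMLMetaParser implementation. The class name MetaExtractor is hypothetical, and the sketch assumes double-quoted NAME and CONTENT attributes that contain no ">" character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based extractor of <META NAME=... CONTENT=...> pairs.
class MetaExtractor {
    private static final Pattern META =
        Pattern.compile("<\\s*meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    private static final Pattern NAME =
        Pattern.compile("\\bname\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT =
        Pattern.compile("\\bcontent\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns {name, content} pairs found in the first readSize characters.
    // NAME and CONTENT may appear in either order inside the tag.
    static List<String[]> extract(String html, int readSize) {
        String head = html.length() > readSize ? html.substring(0, readSize) : html;
        List<String[]> pairs = new ArrayList<String[]>();
        Matcher tag = META.matcher(head);
        while (tag.find()) {
            String attributes = tag.group(1);
            Matcher name = NAME.matcher(attributes);
            Matcher content = CONTENT.matcher(attributes);
            if (name.find() && content.find()) {
                pairs.add(new String[] { name.group(1), content.group(1) });
            }
        }
        return pairs;
    }
}
```

Each returned pair would then be handed to the appropriate meta tag parser, as described in steps 5 and 6.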

3.4 Gathering Policy

The gathering policy is controlled by the parameters in the Gatherer’s configuration file. See Appendix I for more information about parameter settings.

3.4.1 The MetaWeb Gatherer only fetches URLs within Australia by default.

The Gatherer does not restrict its search to the .au domain because Australian sites use other domains as well. The Gatherer will search any domain with significant Australian metadata and content, as notified to the URL database.

Notification to the URL database is performed using the MetaWeb Control Panel at http://purl.nla.gov.au/metaweb/control.

Any site’s URL which is provided to the Gatherer but does not contain tagged metadata will be held for future gathering.

3.4.2 URLs are also restricted to the same domain server as a given Root URL.

URLs may be notified to the URL database at site (root) or sub-directory (leaf) level. It may be necessary to specify a sub-directory in the event that metadata is created for part of a site, or for only one service at a site. Sub-directory names may be different from that provided for the site to distinguish between services.

If sub-directory naming conventions are different from the site’s highest level naming convention, then it is possible to gather across the site by providing the site level URL and setting the gatherer's configuration file to traverse the site.

If the gatherer’s configuration file is set to traverse the site, and a sub-directory URL is provided for traversing, only the sub-directory will be gathered. However, the site's other sub-directories will be checked to confirm whether they conform to the directory naming convention.
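A scope check along these lines is sketched below. It is an interpretation of the policy above, not the Gatherer's code: the class name UrlScope and its parameters are hypothetical, "same server" is taken to mean an identical host name, and "sub-directory" is taken to mean the path of the root URL up to its last slash.

```java
import java.net.URI;

// Hypothetical scope check: same server, optionally same sub-directory.
class UrlScope {
    static boolean inScope(String rootUrl, String candidateUrl, boolean underThisDirOnly) {
        URI root = URI.create(rootUrl);
        URI candidate = URI.create(candidateUrl);
        if (root.getHost() == null || !root.getHost().equalsIgnoreCase(candidate.getHost())) {
            return false; // different server (or no host at all)
        }
        if (!underThisDirOnly) {
            return true;
        }
        // Directory part of the root URL: everything up to the last '/'.
        String rootPath = root.getPath();
        String dir = rootPath.substring(0, rootPath.lastIndexOf('/') + 1);
        if (dir.isEmpty()) {
            dir = "/";
        }
        String path = candidate.getPath().isEmpty() ? "/" : candidate.getPath();
        return path.startsWith(dir);
    }
}
```

For the root URL http://www.dstc.edu.au/RDU/, a link to http://www.dstc.edu.au/RDU/MetaWeb/ is in scope in both modes, while a link to another directory on the same server is in scope only when the sub-directory restriction is switched off.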

3.4.3 The Gatherer will ignore non-recognisable metadata formats in <META> tags, as it has been encoded to recognise only Dublin Core metadata. The Gatherer accepts the <META> tag in either version 3.2 or version 4.0 of HTML.

The encoding to recognise DC metadata is based on the "namespace" concept, where any tag name is preceded by the definitive element name set (in this case, DC for Dublin Core). This provides a path forward for both the use of metadata and the Resource Description Framework, within which it will interoperate.

Where an investment has been made in the creation of metadata tags such as

<META name = "description" content="Short desc. Phrase">
<META name = "keywords" content="List of keywords">,

their consistent encoding will permit an easy conversion to be applied as a separate process outside the gathering.

There is no restriction on the order of the metadata elements in the <META> tag.

3.4.4 Before inserting any URL into the URL Database, the robots.txt file on the selected site's server is checked by the MetaWeb Gatherer to ensure it does not conflict with any site policy.

The traversing of a site by the Gatherer may result in unexpected workloads on the selected server, although performance parameters have been provided to address this possibility. It is recommended that your site policy be checked before supplying your target URL to the Gatherer.

Before the Gatherer is activated, the site policy is checked by the Gatherer’s administrator.

 

3.4.5 The maximum number of URLs and the search depth are specified to control the number of URLs to be indexed.

The depth parameter is DEPTH. It is only applied against a site level URL.


3.4.6 A TIME_INTERVAL parameter is set to determine how often to do the Regathering process.

The default gathering schedule is set in the configuration file to 14 days. The gathering process may be resource intensive when it accesses a server, so the schedule is set to this timeframe in order to avoid over-utilisation of sites.

It is recommended that the gathering process be scheduled at night, between the hours of midnight and 6am, to avoid impacting business usage. However, arrangements can be made to target a particular site if required, outside the regular schedule.

If a site does not change for a pre-specified timeframe, such as one month, the Gatherer will not revisit the site for another month.

 

3.4.7 Newly identified URLs are indexed into the repository first, then URLs which already exist are updated.

This processing allows new URLs to be displayed first in a search results set.

Existing URLs are recognised by an exact string match. The metadata associated with the URL is completely overwritten.

3.4.8 The number of simultaneous fetchers for the GatherMulti Class is set to three in the configuration file.

Tests have shown that if more than four fetchers are activated, performance is degraded. Using three fetchers is 1.5 times faster than four fetchers. Performance does, however, also depend on the site’s configuration of URLs.
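Running a fixed number of fetchers in parallel can be sketched with a thread pool, as below. This uses java.util.concurrent, which did not exist in JDK 1.1, so it illustrates the idea rather than the GatherMulti code. FetcherPool is a hypothetical name and the "fetch" is a placeholder that only returns a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: at most fetcherNumber fetches run at the same time.
class FetcherPool {
    // Returns one result per URL, in the same order as the input list.
    static List<String> fetchAll(List<String> urls, int fetcherNumber) {
        ExecutorService pool = Executors.newFixedThreadPool(fetcherNumber);
        try {
            List<Future<String>> pending = new ArrayList<Future<String>>();
            for (final String url : urls) {
                // Placeholder task; a real fetcher would download and parse the page.
                pending.add(pool.submit(() -> "fetched " + url));
            }
            List<String> results = new ArrayList<String>();
            for (Future<String> future : pending) {
                results.add(future.get()); // blocks until that fetch has finished
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("fetching failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size corresponds to the number of simultaneous fetchers set in the configuration file, three in the case described above.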

3.4.9 URLs will be removed from the URL Database when they are invalid or unreachable for one month (default setting).

If this occurs after an initial successful retrieval of metadata from a URL, a message will be placed in the repository record indicating that the URL is no longer reachable.

3.4.10 The time between successive fetches to the same site by a single Fetcher is controlled by an interval setting.

The default is set to two seconds between each access. This reduces the impact on a server.
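One way to enforce such an interval is to remember, per host, when the last access was made and to compute how long the next access must wait. The sketch below is an assumption about how this could be done, not a description of the Fetcher. FetchThrottle and reserve are hypothetical names, and the current time is passed in as a parameter so that the logic can be exercised without sleeping.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-host throttle enforcing a minimum gap between accesses.
class FetchThrottle {
    private final long intervalMillis;
    private final Map<String, Long> nextSlot = new HashMap<String, Long>();

    FetchThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Returns the number of milliseconds the caller should wait before
    // contacting the host, and reserves that access time for the caller.
    synchronized long reserve(String host, long nowMillis) {
        Long lastAccess = nextSlot.get(host);
        long wait = 0;
        if (lastAccess != null) {
            wait = Math.max(0, lastAccess + intervalMillis - nowMillis);
        }
        nextSlot.put(host, nowMillis + wait);
        return wait;
    }
}
```

A fetcher would call reserve() with the host name and System.currentTimeMillis(), sleep for the returned time, and then open its connection.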

3.4.11 Fetches done by multiple Fetchers are managed by allowing the Gatherer to switch between sites while gathering.

3.4.12 It is possible to have a URL removed from the URL database by sending a request to the repository administrator, in order to address situations where the Gatherer incorrectly targets a virtual or proxy URL. The request may be sent via the feedback option on the control panel.

4. Broker

The MetaWeb Broker is an HTTP daemon, also implemented in Java. It listens on a specified port of the host it runs on, waits for requests and handles them in parallel as they arrive. The MetaWeb Broker defines five requests: PUT, GET, FIND, BROWSE and DELETE, explained in further detail in section 4.3.

By supporting these five kinds of requests, the Broker can receive the metadata abstract from the Gatherers and save them into the metadata repository; return the results matching users’ queries; enable users to find the SOIF abstract of a supplied URL and browse the repository; and if authorised, delete the metadata at a particular URL.


4.1 Table Structure of the MetaWeb Metadata Repository

The Metadata Repository is a database used to provide storage and search capabilities for the metadata harvested by the Gatherers. SQL databases were chosen as the backend metadata repository by the MetaWeb Project. Since the Broker talks to the Repository via a JDBC interface, it is theoretically capable of interacting with any SQL database for which a JDBC driver is available.

The MetaWeb Repository has the following table structure (Figure 5). Figure 6 shows how data is stored in the Repository.

 

No.   Field Name          Description
1.*   URI_RESOURCE        The identifier, usually a URL
2.*   COMPLETE_PROPERTY   The full attribute name, like DC.Creator.Email
3.*   SEQ_NUMBER          The serial number of the same attributes within one document
4.    LANG                The LANGUAGE qualifier
5.    PROPERTY_VALUE      The value or content of the attribute
6.    VALUE_TYPE          The data type for this value
7.    SCHEME              The SCHEME qualifier

* these fields are the key for this table

Figure 5. Table Structure of the Repository


uri_resource                complete_property  seq_number  lang  property_value  value_type  scheme
http://www.dstc.edu.au/RDU  DC.Date.Modified   2           en    1997-01-28      String      ISO8601

Figure 6. Data as stored in the Repository

4.2 Components of MetaWeb Broker

The following is a list of Java classes used in the Broker.

Broker.class

A class used to represent the Broker. This class reads the configuration file to customise the Broker and listens on the specified port of the running host. Whenever a request arrives, an instance of Comm.class is invoked to handle it (in parallel with others).

Comm.class

The Comm.class handles the incoming requests, parses the query argument inside, and starts the HttpCmd to handle them.

HttpCmd.class

The HttpCmd class is a thread class that does the main work. Depending on the request, corresponding functions will be called to answer it.

SOIFParser.class

The SOIFParser is used to parse the SOIF abstract received from gatherers into attribute-value pairs, the format used to store them in the metadata Repository.

SQLRepository.class

The SQLRepository class provides basic functions such as Insert(), Delete(), and Update() for the Broker to communicate with the Repository via a JDBC interface. If the underlying database is changed, this class may need to have some minor modifications made.

JDBC Driver

The JDBC driver chosen by the MetaWeb Project for the Repository database was the Oracle Thin driver from the Oracle company, available at http://www.oracle.com/nca/java_nca/jdbc/v7/html/download.html. It was selected from a list of JDBC drivers for SQL databases at http://java.sun.com/products/jdbc/jdbc.drivers.html.

Figure 7 shows the internal structure of MetaWeb broker and the relationships between these six components.


Fig. 7 Internal structure of MetaWeb Broker


4.3 Broker Details

4.3.1 Five kinds of requests

(1). PUT Request - receiving the SOIF abstract from gatherers:

Format: /CMD/PUT SOIF abstract

Broker’s action: Calls the SOIFParser class to parse the SOIF (as shown in the example below) into attribute-value pairs, and saves them into the Repository.

@FILE { http://www.dstc.edu.au/RDU/
DC.Title{33}:	Resource Discovery Unit Home Page
DC.Subject{69}:	Resource Discovery, URN, URC, Metadata, Z39.50, Information Retrieval
DC.Description{153}:	The Resource Discovery Unit researches emerging technologies for the seamless discovery and retrieval of information and services on the Internet and WWW
DC.Creator{15}:	Renato Iannella
DC.Creator.Email{18}:	renato@dstc.edu.au
DC.Publisher{12}:	DSTC Pty Ltd
DC.Date.Created{38}:	1995-01-01	|QUALIFIERS:	SCHEME=ISO8601
DC.Date.Modified{38}:	1997-01-28	|QUALIFIERS:	SCHEME=ISO8601
DC.Format{33}:	text/html	|QUALIFIERS:	SCHEME=IMT
DC.Identifier{51}:	http://www.dstc.edu.au/RDU/	|QUALIFIERS:	SCHEME=URL
DC.Identifier{48}:	urn:inet:dstc.edu.au:rdu	|QUALIFIERS:	SCHEME=URN
DC.Relation{47}:	http://www.dstc.edu.au/	|QUALIFIERS:	SCHEME=URI
DC.Language{33}:	en-gb	|QUALIFIERS:	SCHEME=RFC1766
DC.Rights{37}:	Copyright DSTC Pty Ltd 1995,1996,1997
}

Figure 8. An example of a SOIF abstract received from a Gatherer

The qualifiers such as SCHEME and LANG are appended after each value, and saved into the Repository for potential use. However, they are not currently displayed on the results screen.
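Splitting one attribute line of Figure 8 into its property name, value and SCHEME qualifier could be done as in the sketch below. It is not the SOIFParser class: SoifLine is a hypothetical name, the sketch assumes single-line values, it does not verify the size given in braces, and it only picks out the SCHEME qualifier.

```java
// Hypothetical parser for one attribute line of the SOIF shown in Figure 8.
class SoifLine {
    private static final String QUALIFIERS = "|QUALIFIERS:";

    final String property; // e.g. "DC.Date.Modified"
    final String value;    // e.g. "1997-01-28"
    final String scheme;   // e.g. "ISO8601", or null when absent

    private SoifLine(String property, String value, String scheme) {
        this.property = property;
        this.value = value;
        this.scheme = scheme;
    }

    static SoifLine parse(String line) {
        int brace = line.indexOf('{');
        int colon = line.indexOf("}:");
        if (brace <= 0 || colon < brace) {
            throw new IllegalArgumentException("not a SOIF attribute line: " + line);
        }
        String property = line.substring(0, brace);
        String rest = line.substring(colon + 2).trim();
        String scheme = null;
        int q = rest.indexOf(QUALIFIERS);
        if (q >= 0) {
            String qualifiers = rest.substring(q + QUALIFIERS.length()).trim();
            rest = rest.substring(0, q).trim();
            for (String part : qualifiers.split("\\s+")) {
                if (part.startsWith("SCHEME=")) {
                    scheme = part.substring("SCHEME=".length());
                }
            }
        }
        return new SoifLine(property, rest, scheme);
    }
}
```

The three fields map naturally onto the COMPLETE_PROPERTY, PROPERTY_VALUE and SCHEME columns of the Repository table.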


(2). GET Request -- searching request from users:

Format: /CMD/get?mode=simple&val=metadata&att=DC.Title&number=10
/CMD/get?mode=advanced&nratt=2&number=10&att0=DC.Title&
op0=+&val0=metadata&att1=DC.Creator&op1=+&val1=renato

Broker’s action: Finds the matching URLs in the repository, formats the result pages in an HTML file and returns them to the user.

The MetaWeb Broker supports the following five types of queries:


(3). FIND request - users want to see the full SOIF abstract of one URL:

Format: /CMD/find?url=http://www.dstc.edu.au/RDU/MetaWeb/

Broker’s action: Returns the full SOIF abstract of the given URL in HTML format.


(4). BROWSE request -- users want to view harvested URLs:

Format: /CMD/browse?lasturl=http://www.dstc.edu.au/RDU/MetaWeb/

Broker’s action: Lists the URLs in alphabetical order starting from the given one, each linked to its full SOIF abstract, 50 URLs per page in HTML format.

If the given URL is "http://", then the Broker will list the URLs starting from the first one.

(5). DELETE request:

Format: /CMD/delete?wholehost=no&url=http://www7.conf.au/

Broker’s action: Deletes given URL or URLs in the repository.

Deletion is supported only when the host where the request came from is the same as the host which the broker is running on, for security reasons. Users may provide one URL, and choose to delete this URL or all of the URLs which start with this one. Note that if "http://" is provided, the whole database will be deleted.
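The query-style requests above (GET, FIND, BROWSE and DELETE) share one shape: a command name after /CMD/ followed by name=value pairs. The sketch below shows how such a request path could be split up. BrokerRequest is a hypothetical name, the PUT format (which carries a SOIF body) is not handled, and values are deliberately not URL-decoded because "+" appears as an operator value in the advanced GET format.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical parser for paths such as "/CMD/get?mode=simple&val=metadata".
class BrokerRequest {
    private static final String PREFIX = "/CMD/";

    final String command; // upper-cased, e.g. "GET", "FIND", "BROWSE", "DELETE"
    final Map<String, String> params = new LinkedHashMap<String, String>();

    BrokerRequest(String path) {
        if (!path.startsWith(PREFIX)) {
            throw new IllegalArgumentException("unknown request: " + path);
        }
        int q = path.indexOf('?');
        String name = q < 0 ? path.substring(PREFIX.length())
                            : path.substring(PREFIX.length(), q);
        command = name.toUpperCase(Locale.ROOT);
        if (q >= 0) {
            for (String pair : path.substring(q + 1).split("&")) {
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    // Split at the first '=' only, so values may contain '='.
                    params.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
    }
}
```

A dispatcher can then switch on the command field and read, for example, params.get("url") for a FIND or DELETE request.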



4.3.2 Querying

Since the underlying database for the MetaWeb repository is an SQL database, SQL is used to query the repository. For the query "DC.Title:metaweb", the SQL statement would appear as follows:

SELECT uri_resource FROM REPOSITORY
WHERE complete_property LIKE 'DC.Title%'
AND UPPER(property_value) LIKE '%METAWEB%'

And the SQL for the query "any:metaweb" is

SELECT uri_resource FROM REPOSITORY
WHERE UPPER(property_value) LIKE '%METAWEB%'

After getting results from the metadata repository by sending the SQL queries via the JDBC interface, the Broker will do subsequent processing to retrieve the final matching results in HTML format, and return them to the users.
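The two SQL statements above differ only in whether the complete_property condition is present. The sketch below builds the equivalent statement text for either case. It is illustrative, not the Broker's code: QueryBuilder is a hypothetical name, and "?" placeholders are used instead of inlined literals on the assumption that the statement would be executed through java.sql.PreparedStatement, with the collected parameters bound in order. LIKE wildcards inside the keyword are not escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical builder for the single-attribute queries shown above.
class QueryBuilder {
    final String sql;                                         // statement text with '?' placeholders
    final List<String> parameters = new ArrayList<String>();  // values to bind, in order

    // attribute: e.g. "DC.Title", or "any" to search every property.
    QueryBuilder(String attribute, String keyword) {
        StringBuilder s = new StringBuilder("SELECT uri_resource FROM REPOSITORY WHERE ");
        if (!attribute.equalsIgnoreCase("any")) {
            s.append("complete_property LIKE ? AND ");
            parameters.add(attribute + "%");
        }
        s.append("UPPER(property_value) LIKE ?");
        parameters.add("%" + keyword.toUpperCase(Locale.ROOT) + "%");
        sql = s.toString();
    }
}
```

For the query "DC.Title:metaweb" the bound parameters are DC.Title% and %METAWEB%, which correspond to the literals in the first statement above.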

Any search keyword with fewer than two characters will be ignored.

 

4.3.3 Stemming

Stemming is applied when multiple search keywords are provided in a query. Stemming is defined as "a form of automatic right truncation of each word in the index to its root" [2]. The Broker matches the word stem during the query. For example, the search query "networked education" will also retrieve records containing "education" and "network".

The matching routine is based on searching each related word to see if it contains the stem of the word. For example, a search query on "principles" will also retrieve records containing "principle".

The STEM algorithm has been implemented as a Java class. Precision is approximated; for example, "library librarian" will be presented as "librari" on the results page.

4.3.4 Stopwords

Stopword removal is applied when multiple keywords are provided in one query: words contained in the stopword list are ignored if they appear in the query. The stopwords are stored in a file called stopwords.txt in the Broker software package directory. Users can add or delete words to suit their needs. The current stopwords file contains: and, or, australia, the, a, an, not, of, in, on.
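The keyword filtering described above can be sketched as follows. KeywordFilter is a hypothetical name and the stopword list is hard-coded from the list quoted above, whereas the Broker reads it from stopwords.txt. The sketch also applies the minimum keyword length mentioned in section 4.3.2 and assumes that keywords are separated by white space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical keyword filter: drops short words and, for multi-word
// queries, words on the stopword list.
class KeywordFilter {
    static final Set<String> STOPWORDS = new HashSet<String>(Arrays.asList(
        "and", "or", "australia", "the", "a", "an", "not", "of", "in", "on"));

    static List<String> filter(String query) {
        List<String> kept = new ArrayList<String>();
        String[] words = query.trim().split("\\s+");
        for (String word : words) {
            if (word.length() < 2) {
                continue; // keywords shorter than two characters are ignored
            }
            boolean multiWord = words.length > 1;
            if (multiWord && STOPWORDS.contains(word.toLowerCase(Locale.ROOT))) {
                continue; // stopwords only apply to multi-keyword queries
            }
            kept.add(word);
        }
        return kept;
    }
}
```

Under these assumptions the query "the history of Australia" is reduced to the single keyword "history", while the single-keyword query "australia" is left as it is.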



References:
_____________________________

1 MiniSQL is a free lightweight database, which provides fast access to small datasets. MiniSQL is the product of Hughes Technologies, Australia. The MiniSQL software can be downloaded from http://www.Hughes.com.au/.

Its JDBC Driver is located at http://www.imaginary.com/~borg/Java/java.html.

2 http://www.hsc.missouri.edu/library/engines/help/stem.html.




Appendix I: The configuration file for the MetaWeb Gatherer

[GATHERER]
# Fetching Threads Number (multiple mode), 2-4 is recommended
FETCHER_NUMBER 3
 
# Check Time Interval in Days
TIME_INTERVAL 14

# Display debug message? 
DEBUG OFF

# Traverse the Hyper Links?
FOLLOW_LINKS YES

# Only parse the HTML files with suffix ".htm" or ".html"
HTML_FILE_ONLY YES

# Restrict to the HTML files under this directory. If "yes", LOCAL_DOMAIN_ONLY is ignored.
UNDER_THIS_DIR_ONLY YES

# Restrict to current domain
LOCAL_DOMAIN_ONLY YES

# Restrict to Australian Sites, default is Yes
AUS_SITES_ONLY NO

# Only gathering new sites, do not need to do regathering (equal to running -add mode)
GATHER_NEW_SITES_ONLY NO

# Fetching Timeout: the maximum time for fetchers to fetch a valid page, in seconds.
FETCHING_TIMEOUT 60

# Fetching Interval: the time between two fetches, in seconds
FETCHING_INTERVAL 4 

# Source file name if input is to be taken from a file (ignored otherwise)
IN_FILENAME infile.txt

# Result file name for output to a file (ignored otherwise)
OUT_FILENAME outfile.txt

# Name of file to place URLs failed to fetch
LOST_FILENAME lost.txt

# If YES, the SCHEME tag will be saved as part of SOIF. It may not be compatible
# with other brokers as they only recognize the simple SOIF format
SUPPORT_SCHEME YES

# Broker address (if any). In this case the same host the fetcher is running on.
SERVER_ADDRESS sunshine.dstc.edu.au 

# Server port number (if any)
SERVER_PORT 9018	 

# size of file which will be read to parse the tags
READ_SIZE 16384

# Maximum links in one URL
MAX_LINKS 200

# Maximum attributes in one URL
MAX_ATTRIBUTES 100

# Search depth, only valid for harvesting site by site, 0 means search all possible URLs
DEPTH 0

# Option to save gathered SOIF into text file
SAVE_SOIF YES
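Reading a file in the format above is straightforward: each setting is a name followed by white space and a value, lines starting with "#" are comments, and the first line is a section header in square brackets. The sketch below shows one way to parse it. GathererConfig is a hypothetical name and this is not the Gatherer's own configuration reader.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for the "KEY VALUE" configuration format shown above.
class GathererConfig {
    // Skips blank lines, "#" comments and "[SECTION]" headers.
    static Map<String, String> parse(String text) {
        Map<String, String> settings = new LinkedHashMap<String, String>();
        for (String raw : text.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith("[")) {
                continue;
            }
            // Split into the key and the remainder of the line.
            String[] parts = line.split("\\s+", 2);
            String value = parts.length > 1 ? parts[1].trim() : "";
            settings.put(parts[0], value);
        }
        return settings;
    }
}
```

Numeric settings such as FETCHER_NUMBER or READ_SIZE can then be converted with Integer.parseInt() where they are used.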



Appendix II: The configuration file for MetaWeb Broker


[BROKER]
# Port number for Broker to listen to
PORT 9018 

# Buffer to read SOIF from the gatherer 
BUFFER_SIZE 4096 

# Response items per page in results page, default value
# For Quick search, user can choose it on the search page as well
ITEMS_PER_PAGE 10

# Max clients supported simultaneously
MAX_CONNECTIONS 10

# Timeout for network operations (seconds)
TIMEOUT 600

# Stemming ON - Y or N (using Stem Class)
STEMMING N

# Removing stopwords ON - Y or N 
REMOVE_STOPWORDS Y

# Maximum number of characters to appear in DC.Title field
MAX_TITLE_LENGTH 100

# Maximum number of characters to appear in DC.Subject field
MAX_SUBJECT_LENGTH 100

# Maximum number of characters to appear in DC.Description field
MAX_DESCRIPTION 200

#When Browsing URLs indexed, display the number per page
MAX_URLS_PER_LIST 50

#Save the qualifiers into the repository, like SCHEME=ISO31, LANG=en, etc
SAVE_QUALIFIERS Y

#The attributes to be displayed in the default return page, with lengths
[DEFAULT_DISPLAYED_ATTRIBUTES] START
DC.Title 75
DC.Subject 100
DC.Description 200
DC.Identifier 100
[END_DEFAULT_DISPLAYED_ATTRIBUTES] END




Appendix III: The log files

 

There are two log files.

The data in the Gatherer's Log File appear as follows:


-----* FINAL STATISTICS OF http://www.transport.qld.gov.au/ * ------

Total URLs --> 34
URLs contain Dublin Core Metadata -> 21
URLs without Metadata --> 13
Unreachable URLs --> 0
The total time cost of gathering :-> http://www.transport.qld.gov.au/ is 0:3:42!
Next gathering will be done at 09 07 1998 20:25:21 AET

The data in the Broker's log file appear as follows:


The MetaWeb Broker starts running at 26 06 1998 00:05:12 AET
Querying{simple} -> any: tech edge
The cost is :->30164 ms
Finding the SOIF of <-http://www.nla.gov.au/dstc/12/09/41/12094124.html
Querying{combined} ->DC.Title:metaweb DC.Creator:Debbie Campbell
The cost is :->16038 ms