EMIR - Electronic Mathematical Information Retrieval

EMIR is the proposal of the Euromath Center for a simple, inexpensive and quick implementation of an information retrieval system for locally maintained information at mathematical institutes. The design of EMIR enables the system to evolve into more sophisticated future systems without a total redesign.

Introduction

The daily work of mathematicians and other scientists depends to a great extent on collecting existing information to which value is added. The speed with which the necessary information can be found is therefore crucial for the progress of the work. Most mathematicians today use modern EDP (Electronic Data Processing) equipment, enabling them to access public databases, to send messages to colleagues, to run computer calculation programs and to write articles for publication. To fully exploit these possibilities, access to distributed local information files will further enhance the effectiveness of their work.

This paper proposes a simple, inexpensive, quick way to exploit the already existing information at the individual institutes.

The Need for an Information Service

All Mathematical Departments keep records of data concerning employees, telephone numbers, courses, publications etc. Most of these are (or are planned to be) stored, maintained and made available locally in electronic form.

These islands of information are normally not interconnected, so a mathematician at one institute cannot easily access data at another institute. Thus the daily problems of remembering the e-mail address of a colleague at another institute, or the time of an interesting Ph.D. lecture at a neighbouring university, remain.

Even where interconnection is possible, the data files are stored in different formats and are normally out of reach for a person using remote login or ftp.

Naively speaking, the right solution would be to get rid of the existing local files with their different formats and instead dictate a common format that all local files have to adopt, thereby creating a huge common distributed database with sophisticated searching facilities. Even though this might be the end goal, EmC has been asked to investigate and come up with a proposal for a quickly implementable solution that utilizes the presently available information systems, networks and information, and to develop conversion tools in an inexpensive, simple and user friendly way.

EMIR

EmC has looked into the problem of non-interconnected information of interest to mathematicians and has come up with the EMIR proposal (Electronic Mathematical Information Retrieval). In doing so, EmC has set up the following requirements:

  1. required changes to existing data files must be kept to a minimum;
  2. the maintenance of the local data files shall be the responsibility of the local departments and must be possible without any high level knowledge of EDP;
  3. all existing accessible platforms (UNIX WS, PC, ..) containing the relevant data files must be used with a minimum of required adaptations;
  4. the usage of EMIR must be user friendly, easily accessible, with a non-abbreviated set of options to select among at all times;
  5. EMIR must be implementable within a few months after the go-ahead decision has been made;
  6. EMIR must be implemented in a general, well documented, standardized way enabling it to migrate to higher level solutions without total reconstruction;
  7. the proposed implementation must be inexpensive.

What Information Is Provided

In this section we outline the kind of information which we think will be of interest to mathematicians outside the local institute. It will not be difficult to adapt the classes of information to suit special needs, such as those of the Danish mathematicians or those of the EmNet projects. As mentioned in the document Dan-EMIR, the first implementation, which will take place in Denmark, will only contain information about a subset (persons and publications) of the intended information service. In the final implementation, however, each participating mathematical institute should be motivated to maintain local data about:

  1. persons;
  2. publications;
  3. seminars;
  4. teaching;
  5. conferences;
  6. journals subscribed to;
  7. vacant positions;
  8. access to departmental library databases.

Many other items are of interest, such as department profiles, research projects, special expertise, etc. After a small field study performed by EmC, the 8 classes above were, however, those which most users requested. In the following we shall therefore describe specifically what information should be available in each of these 8 classes.

Persons

Persons at a mathematical institute can be divided into staff, postdocs, research students and guests. For each person the following must be provided: last name, first name, affiliation, position, e-mail address, telephone number, fax number, and professional interests (including willingness to referee or review, to offer consulting services, to participate in projects, etc.). For guests a permanent address should also be included. For a member of staff the data could, for example, be organized as follows:

<LastName>       Feddersen
<FirstName>      Henrik
<Position>       Ph.D.
<Affiliation>    Københavns Universitet, EmC
<E-mail>         henrik@euromath.dk
<Tel>            35 32 07 12
<Fax>            35 32 07 19
<Interests>      Nonlinear dynamics, solitons
</>
Note that the above is only meant as an example that sketches the organization of the data as provided by the institute (here EmC). The actual format will be different (here the emphasis is on the data that the institutes should provide, not on the markup; technical details about the format of the data files will be given to the participating institutes at a later stage). For a guest we would need an additional field for the permanent address, e.g.

<Permanent>      University of Honolulu, Hawaii, USA

Publications

By publications we mean bibliographic data about publications by people affiliated with the institute. There are several fields associated with publications. The type of publication (article, proceedings, thesis, preprint, etc.) determines which fields should be included:
<Author>          
<Title>
<Journal>
<Volume>
<Number>
<Pages>
<Editor>
<Booktitle>
<Edition>
<Publisher>
<Chapter>
<Series>
<Address>
<Year>
<Abstract>
<Classification>
<Comments>
<FTP>
</>
The last field <FTP> is used if the full text of the publication (typically a preprint) is available by anonymous ftp. The user interface will enable you to fetch the publication simply by clicking the mouse where it says something like click here to fetch document. (See also section 3).

Seminars

Information about seminars, colloquia, thesis defences, etc. should be organized like the following example:
<Title>          Singular Integral Equations on Curves with Cusps
<Speaker>        Roland Duduchava
<Affiliation>    Tbilisi
<When>           22 Feb. 1994, 14.15-15.00
<Where>          Aud. 10, HCØ
<Abstract>       In the lecture I shall explain how one-dimensional
                 singular integral equations with complex conjugated
                 unknown functions appear in boundary value problems.
                 I shall show that Fredholm properties and the index
                 of such equations depend on angles in corner points
                 and orders in cusp points of the underlying curve.
</>

Teaching

The Teaching section should include all courses, from 1st year undergraduate to Ph.D. level, given by the institute. The information that should be included about a course is illustrated in the example below:
<Title>          Solitons
<Level>          4th year
<Prerequisite>   Course A3 + B17
<Teacher>        NN
<Affiliation>    Københavns Universitet, Matematisk Institut
<Syllabus>       Solitons, inverse scattering transform, complete
                 integrability of Hamiltonian systems, fundamental
                 models, Poisson brackets.
<Text book>      L D Faddeev and L A Takhtajan: Hamiltonian Methods in
                 the Theory of Solitons, Springer, 1987
<Extra litt>     M J Ablowitz and P A Clarkson: Solitons, Nonlinear
                 Evolution Equations and Inverse Scattering, Cambridge
                 Univ. Press, 1991.
<When>           Sep 98 - Jan 99
<Hours per week> 2
<Evaluation>     To be seen!
</>

Conferences

The Conferences class includes every conference worldwide that the institute feels may be of general interest. The information provided to the service by an institute should include some or all of the fields indicated in the example below:
<Conf>           Workshop on Groups and Three-Manifolds
<When>           Spring 1995
<Where>          Centre de Recherches Mathematique, Universite de
                 Montreal
<Program>        There will be a program of visitors, both short- and
                 long-term, with more informal activities organized in
                 consequence. Special emphasis will be placed on...
<Committee>      L Vinet, director of CRM; S Boyer, UQAM...
<Topics>         Progress on Thurston's Geometrization conjectures,
                 group actions on trees,...
<Papers>         People wishing to present a paper or poster should...
                 Particularly encouraged are papers relating to...
                 Deadline for submission of abstracts is...
<Invited>        NN,...
<Registration>   M Louis Pelletier, CRM, Universite de Montreal,...
</>

Journals Subscribed to

It may be of interest to others which journals your institute subscribes to. The name of the journal and the starting date of the subscription should be provided, e.g.
<Journal>        Journal of Differential Equations
<From>           Vol. 4, 1970
</>

Vacant Positions

Positions and possible Ph.D. grants can be advertised on the information service, e.g.
<Position>       Lecturer
<Where>          Københavns Universitet, Matematisk Institut
<Date>           28 Feb 1994
<Text>           We are looking for ...
<Contact>        NN, E-mail: nn@math.ku.dk, Tel: 12 34 56 78
</>

Access to Departmental Library Databases

This point is different from the rest. Here the institutes do not have to type in information specifically for the information service. We "just" want to have an Internet connection to the institute's library database so that anyone can search in this database.

At a later point in time a common interface might be included.

User Interface

You can find information in two ways: browsing and searching. As mentioned in section Initial implementation, we have suggested making the service available from a World Wide Web server, so if you are familiar with NCSA Mosaic you may keep the program xmosaic in the back of your mind when you read the following.

In order to ease browsing, the information will be organized hierarchically, with the participating mathematical institutes at the top of the hierarchy and information about persons, publications, etc. under each institute, as illustrated in figure 1.

There will be one common entry point to the information service. Under this point there will be links to the participating institutes (the entries under one institute are illustrated in figure 1) as well as links to the individual classes of information (persons, publications, etc.) collected from all institutes. That is, when you connect to the information service you will get a menu from which you can select either one of the participating institutes or one of the classes of information.

So if you, e.g., select the menu entry Persons, you will get a list of all persons affiliated with all the participating institutes. There is no subdivision according to institute (but for Persons we suggest a division into Staff, Postdocs, Research students and Guests).

Under each menu point, on every level lower than institute in the hierarchy, there will be a Search function which you can use for free text searching on that level. For example, if you have selected the menu item "Persons" under a particular institute, say "Mathematical Institute, University of Copenhagen", you can select "Search" and search among all persons at the Mathematical Institute, University of Copenhagen. If you only want to search among the staff of that institute, you select "Staff" under "Persons" and then "Search" under "Staff".
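Conceptually, such a free text search simply matches the query against every field of every record below the selected node. The sketch below is purely illustrative (written in Python; the actual service will rely on the search facilities of the WWW server, and the record structure shown is an assumption):

# Minimal sketch of free text search on one level of the hierarchy.
# Records are assumed to be simple field/value dictionaries.
def search(records, query):
    """Return the records in which any field contains the query string."""
    q = query.lower()
    return [rec for rec in records
            if any(q in value.lower() for value in rec.values())]

# Example: searching among "Staff" under "Persons" at one institute,
# using the person record shown earlier in this document.
staff = [
    {"LastName": "Feddersen", "FirstName": "Henrik",
     "Interests": "Nonlinear dynamics, solitons"},
]
print(search(staff, "solitons"))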

Technical overview

This chapter gives an overview of the implementation needed at the central server site, e.g. at EmC. Although it is an overview, it still lists the necessary tasks, without going into trivial details of each subtask.

Design objectives

The basic idea of EMIR is to collect useful information and make it available for search and retrieval at an information server. The implementation of such a system must be able to cope with a variety of scenarios, depending on exactly what information will be handled in the system.

The first implementation of EMIR will include only a few types of information (most likely about people at Danish mathematical institutes and their publications). The system design, however, allows for extensions both in what information types are handled (width) and by the number and variety of information sources that supply information to the service (depth).

Wherever possible, the implementation will use existing software tools.

The fundamental requirement for such a system is that it must be flexible. More specifically, the system design permits several information types, several data sources per type, and several retrieval and delivery methods and formats to coexist.

Naturally the first implementation will only handle a few information types and retrieval & delivery methods, but the design is such that new information types and retrieval & delivery methods can fit easily into the existing architecture.

The actual behaviour of the system is controllable via a configuration file. Typical changes (such as the change, addition or removal of a data source) are possible merely by changing the configuration file. Extensions to the system itself (such as the addition of an information retrieval or delivery method and/or format) will be possible without changing the configuration file format (syntax) substantially.
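To make the role of the configuration file concrete, the sketch below shows the kind of information it must carry for each data source. It is purely illustrative: the actual configuration language will be TCL-based (see section Tools), and all host names and file paths are invented examples.

# Illustrative sketch only: the real EMIR configuration will be a
# TCL-based file; this Python structure merely shows what a data
# source description must contain.  Hosts and paths are invented.
DATA_SOURCES = [
    {"institute": "Matematisk Institut, Københavns Universitet",
     "type":      "persons",                  # information type
     "host":      "ftp.math.example.dk",      # hypothetical FTP server
     "path":      "pub/emir/persons.dat",     # hypothetical data file
     "account":   "anonymous"},               # or an EMIR-specific account
    {"institute": "Matematisk Institut, Københavns Universitet",
     "type":      "publications",
     "host":      "ftp.math.example.dk",
     "path":      "pub/emir/publications.dat",
     "account":   "anonymous"},
]

Adding, changing or removing a source then amounts to editing this description; the processing and delivery machinery need not change.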

Functionality

The ultimate objective of the system is to provide central access to otherwise widely spread information. The system functionality can thus basically be divided into three classes (see figure 2):

  1. Retrieval: retrieval of data files from the institute servers.
  2. Processing: validation of data (syntax check), conversion from institute formats to a unified data format (normalization), conversion to delivery formats, indexing.
  3. Delivery: actual access to data (e.g. via menus, hypertext, or a search mechanism) for users' search and retrieval.

The above describes the functionality classes; the following sections provide a more detailed description of each of the three classes.

Retrieval

Each participating institute provides information to the system. For each institute one or more information sources are identified. Each information source represents an information type (e.g. information about people at the institute) and contains the institute's information of that type. Each information source is placed in an identified file on an identified server.

At regular intervals (once a week, say) the set of (updated) information sources is retrieved for processing and -- eventually -- delivery.
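As an illustration of this retrieval step, the sketch below fetches each configured data file by FTP into a local spool directory at the server site. It is not the actual implementation (which will be TCL-based); it assumes the hypothetical DATA_SOURCES structure sketched in section Design objectives, and the spool directory name is invented.

import os
from ftplib import FTP

SPOOL_DIR = "/var/spool/emir"   # hypothetical spool area at the server site

def retrieve_all(sources):
    """Fetch every configured institute data file via FTP."""
    os.makedirs(SPOOL_DIR, exist_ok=True)
    for src in sources:
        # one spool file per (institute, information type) pair
        name = "%s_%s.dat" % (src["institute"].replace(" ", "_"), src["type"])
        ftp = FTP(src["host"])
        ftp.login(src["account"])            # anonymous or EMIR-specific account
        with open(os.path.join(SPOOL_DIR, name), "wb") as out:
            ftp.retrbinary("RETR " + src["path"], out.write)
        ftp.quit()

# e.g. invoked once a week, during the night, from cron:
#   retrieve_all(DATA_SOURCES)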

Processing

The purpose of the processing is to prepare the collected information for delivery. The institutes may deliver the information in various formats. (Although having just one format for each information type would allow a much simpler implementation.) Each of the files must be checked syntactically, and then -- if the validation is successful -- converted into a unified format. All subsequent processing is then performed on the unified format.

The unified format will be defined using SGML (Standard Generalized Markup Language, ISO 8879) and SGML tools will thus be used for processing.

The processing converts the collected information into deliverable formats (e.g. into hypertext formats, ASCII, menu structures, etc., depending on the chosen delivery methods), and indexes are created for fast search response times. All data is converted to a single character set (Latin 1, ISO 8859-1).
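To give a flavour of this processing step, the sketch below parses tagged records of the kind shown in section What Information Is Provided, turns them into a unified in-memory form, and reports records that lack required fields. It is illustrative only: the actual processing will use SGML tools and Perl, and the list of required fields is an invented example.

import re

RECORD_END = "</>"
# Hypothetical minimal field set for a "persons" record.
REQUIRED_PERSON_FIELDS = ("LastName", "FirstName", "Affiliation", "E-mail")

def parse_records(text):
    """Split an institute data file into records of {field: value}."""
    records = []
    for chunk in text.split(RECORD_END):
        fields, current = {}, None
        for line in chunk.splitlines():
            m = re.match(r"<([^>]+)>\s*(.*)", line)
            if m:                                   # a new <Field> line
                current = m.group(1)
                fields[current] = m.group(2).strip()
            elif current and line.strip():          # continuation line
                fields[current] += " " + line.strip()
        if fields:
            records.append(fields)
    return records

def validate_persons(records):
    """Return a list of error messages for incomplete person records."""
    errors = []
    for i, rec in enumerate(records, 1):
        missing = [f for f in REQUIRED_PERSON_FIELDS if f not in rec]
        if missing:
            errors.append("record %d: missing %s" % (i, ", ".join(missing)))
    return errors

Records or files failing such a validation could then be reported back to the providing institute rather than converted.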

Delivery

The delivery functionality provides access to the collected information. Basically, the delivery consists of responding to information retrieval requests from users. Users are expected to access the information service via the Internet. Various systems exist (WWW, Gopher, WAIS, X.500, etc.) that could be used for this access, and the exact functionality thus depends on the system in question.

The basic problem that must be addressed in any case is that the access method used must be able to use the information that is to be served, either by importing the information into an internal database or by using access programs. Both methods are supported.

Tools

The solution will run on standard UNIX equipment. The configuration file is essentially a small formal language that describes almost all aspects of the system's behaviour. The implementation of the configuration language will use (an extension of) the programming language TCL (Tool Command Language). In addition, many standard UNIX facilities (such as FTP) are used. For the text processing needed to validate the retrieved institute data files, Perl is used.

Initial implementation

As described above EMIR is a concept that can be implemented in many ways. The idea is to start with a limited initial implementation (which supports only two information types, one retrieval method and one access method). If desirable, this first implementation can later be expanded to support more information types, retrieval methods, and access methods.

The initial implementation will support retrieval via FTP, either anonymously or via an EMIR-specific account on the institute server. For access, the initial implementation will support access via WWW.

The processing in the first implementation will convert the data retrieved from the institutes to the hypertext format of WWW (called HTML), and deliver hypertext documents. These documents will contain mostly text (and thus not many hyperlinks). To permit use of the search facilities of WWW, indexes of the content are generated.
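As a sketch of this conversion, the function below turns one parsed person record (in the unified form of the processing sketch above) into a minimal HTML document of the kind the WWW server would deliver. The actual page layout, file naming and index generation are still open design decisions; everything here is illustrative.

# Illustrative sketch only: convert one parsed person record into a
# minimal HTML page.  The real layout and file naming are still open.
def person_to_html(rec):
    title = "%s %s" % (rec.get("FirstName", ""), rec.get("LastName", ""))
    lines = ["<html>",
             "<head><title>%s</title></head>" % title,
             "<body>",
             "<h1>%s</h1>" % title,
             "<dl>"]
    for field, value in rec.items():
        lines.append("<dt>%s</dt><dd>%s</dd>" % (field, value))
    lines += ["</dl>", "</body>", "</html>"]
    return "\n".join(lines)

One such page could be written per record, with a simple index generated alongside for the search facilities of WWW.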

Prerequisites

The proposal for the technical solution of the EMIR task has, as mentioned in section Introduction, been designed to require as few local investments in new hardware and software as possible. This section describes these minimal requirements.

The requirements described are the one-time ``investments'' that must be made before starting or joining the EMIR project.

Hardware

The hardware requirements may be split into three areas: the hardware needed to keep the local datafiles for retrieval by the server-site (e.g. EmC), the hardware needed at the server-site to retrieve this data and serve queries, and the local hardware needed to perform these queries.

Local datafile servers

Since the collection of data is performed via FTP, each local institute needs to keep its datafiles on a machine offering FTP-access. Whether this machine runs UNIX, is a PC-server, a VMS VAX or anything else is of no importance.

Likewise, the FTP-server doesn't necessarily have to be physically located at the local institute. Institutes purchasing computing power externally (e.g. through UNI-C) may use any such machine.

A practical requirement is naturally that the person(s) responsible for the daily management of the data on this machine have easy access to it.

The server-site hardware

As mentioned in section Functionality, the server-site hardware shall perform two functions: collecting and processing local datafiles using FTP, and answering HTTP queries for the collected data.

The first of these tasks can be performed on for instance a weekly basis, during the night. Thus this task will not require significant processing power.

The other task -- processing HTTP queries -- is more difficult to evaluate precisely. First of all, the load incurred by such queries -- especially searches -- varies widely, and secondly the number of queries (i.e. the success of EMIR!) is not yet known.

EmC proposes a two-stage strategy. To begin with, a machine with processing power roughly equivalent to that of a DECstation 3100 (32 MB RAM) is used. Such a server should easily be able to serve up to 10 simultaneous requests.

If the load frequently exceeds 10 simultaneous queries, a larger server may have to be purchased. Such a load level would, however, indicate such a success for EMIR that the investments already made cannot be considered wasted.

The local client hardware

The EMIR proposal has explicitly been designed to be usable on any hardware platform. Thus PCs running DOS or Windows, UNIX machines, and possibly even Macs can be used.

Software

Only public domain software will be required; thus this item incurs no additional costs.

Networking

Since EMIR is a networked database, Internet access is a must for ALL the machines involved. This includes the software drivers for the TCP/IP protocol suite.

The last point -- TCP/IP drivers -- is especially important for PCs (both DOS and Windows), since they usually don't include network support. Several public domain TCP/IP software drivers are available (EmC may supply a list), but they will have to be installed, supported and maintained by the local institutes.

Organization

The last and crucial point is that the data must be correct and up to date. Many grand schemes in distributed databases have collapsed because the data was not maintained properly.

Thus, EmC suggests that only data for which a well defined local need exists be included in EMIR. This means that institutes are not forced to maintain information for which they do not themselves see a local need. Keep in mind that the individual institutes are not required to supply all the information listed in section What Information Is Provided.

How the responsibility for maintaining the local data is delegated is naturally only the business of the individual institute. Since most institutes already maintain local data-collections, such a delegation has probably already been made.

Continuous maintenance requirements

This section describes the daily work imposed on institutes participating in the EMIR project. Though the project has been designed to minimize this work, some maintenance tasks must be expected.

Maintaining local data

The local data collected by the central server-site must naturally be kept up to date. Since the guiding principle is that only information for which a clear local need can be identified -- for instance data that is already being maintained -- shall be included in EMIR, the amount of work involved is up to the local institutes. An institute will presumably only offer data through EMIR if it is willing to allocate the resources needed to maintain it.

An essential point is, however, that the person(s) responsible for the local support are known also outside the institute, so that questions on data reliability can be addressed to the right point.

Local system maintenance

For the EMIR system to work, the local FTP-server from which the data is collected must be up when the central server-site tries to collect data.

Thus efforts must be made to keep this local FTP-server highly reliable. If the institute already has and uses the intended server, system administration procedures to ensure this must be assumed to be in place. In that case, EMIR will not incur additional system-administration overhead.

Again it must be stressed that the person(s) responsible for local system administration must be known to the other partners.

Server-site system maintenance

The server-site is expected to allocate the resources for retrieving and processing the institute data files, for running and maintaining the central server, and for answering user queries.

Extending EMIR

EMIR can be extended in both width and depth. Section What Information Is Provided outlines the classes of information which are believed to be the natural choice today. However, as time goes by, one or several institutes might see the need for additional information classes, and the question then arises if and how EMIR should be extended in depth.

New techniques or extended facilities might require a redesign of EMIR to allow it to exploit new possibilities. The question therefore arises whether EMIR locks you into the presently suggested design, or whether it can be extended in width to provide new functionality without the original investments being seen as wasted.

Adding new classes of information

EMIR encompasses users from institutes which do not meet each other every day. The institutes' needs for information might evolve in different directions, and the "server-site" might therefore receive conflicting proposals from the participating institutes for how EMIR should be extended in depth. To avoid local optimization, a common organization should own EMIR and thereby have the power to decide in which direction -- if any -- EMIR shall evolve.

EmC is not very happy suggesting yet another bureaucratic entity. However, with the concept of distributed users/owners of EMIR, these users/owners should meet at regular intervals and decide upon the suggestions relating to EMIR that have come up since the previous meeting. In most cases an existing organization, such as the national mathematical society, could most easily take over the responsibility for EMIR. One of the important tasks will be to ensure the quality of EMIR by continuously evaluating its performance.

Possibilities for future migration to other distributed database schemes

If the central-server idea is maintained, one may want to widen the EMIR scheme in two areas: the method used for collecting the local data, and the method used for accessing (browsing/searching/querying) this data.

If the EMIR project becomes such a success that several servers are needed, several solutions are possible.

Common to all the extensions mentioned is that -- though they may require substantial work to implement -- they are truly extensions of EMIR, i.e. the work put into the presently suggested solution will not be wasted.

Adding collection methods

As mentioned in section Initial implementation, the technical EMIR proposal is designed to accommodate multiple collection methods, though only one (FTP) is implemented to begin with.

Alternate collection methods may be: NFS, querying existing information systems such as WWW, X.500, Gopher or others.

The only theoretical requirement on the collection method is that the data provided (FTP files at present; gopher menu entries, WWW nodes or X.500 objects in the future) is sufficiently structured to allow automatic processing into the EMIR information.

Note, however, that new retrieval methods will require additional work for implementation.

Adding query methods

The EMIR technical proposal is also designed to accommodate multiple query methods. The collected data will in the present proposal only be accessible through HTTP (the WWW). But access methods such as gopher, X.500 and others may be added, if sufficient resources are provided later on.

Several servers

If EMIR becomes an intensively used daily tool for many mathematicians, the single-server approach may become a bottleneck. In this case, more servers may need to be set up, either to distribute responsibility for parts of the information tree described in section User Interface, or to replicate the existing server.

Distributing the information tree

If the load on the central server or its network connection becomes too high, servers may have to be added to share this load. This may reasonably be done by conceptually splitting the information tree from figure 1 into parts, and assigning the responsibility for each of these parts to an individual server.

The data collection for this new scheme will not require many modifications to the EMIR project. Since HTTP, gopher and X.500 all support referrals from one server to another, the query methods may remain unchanged.

Note that this distribution can thus be performed transparently to the end-users of EMIR.

Such distribution will, however, incur a non-negligible workload on the sites hosting the additional servers: the servers must be up and available at all times. The guiding principle must be: better slow service than no service.

Replicating servers

Server replication can be useful for fault-tolerance. The EMIR data-collection and processing system allows this without significant modification. The HTTP/WWW query mechanism, however, does not, so server replication is not a realistic possibility.