EMIR - Electronic Mathematical Information Retrieval
This paper proposes a simple, inexpensive, quick way to exploit the already existing information at the individual institutes.
These islands of information are normally not interconnected in a way
that enables a mathematician at one institute to access the data of
another institute. As a result, the daily problem of remembering the
e-mail address of a colleague at another institute, or the time of an
interesting Ph.D. lecture at a neighbouring university, persists.
Even if interconnection is made possible, the data files are stored in
different formats and normally outside the reach of a person logging
in remotely or using FTP.
Naively speaking, the right solution would be to get rid of the
existing local files with their different formats and instead dictate a
common format to which all local files must adapt, thereby creating a
huge common distributed database with sophisticated searching
facilities. Even though this might be the end goal, EmC has been asked
to investigate and propose a solution that can be implemented quickly,
that utilizes the presently available information systems, networks
and information, and that develops conversion tools in an inexpensive,
simple and user-friendly way.
At a later point in time a common interface might be included.
In order to ease browsing, the information will be organized
hierarchically, with the participating mathematical institutes at the
top of the hierarchy and information about persons, publications, etc.
under each institute, as illustrated in figure 1.
There will be one common entry point to the information service. Under
this point there will be links to the participating institutes
(the entries under one institute are illustrated in figure 1) as well
as links to the individual classes of information (persons,
publications, etc.) collected from the institutes. That is, when you
connect to this information service you
will get a menu where you can select from
Under each menu point, on every level lower than institute in the
hierarchy, there will be a Search function which you can use for free
text searching on that level. For example, if you have selected the
menu item ``Persons'' under a particular institute, say ``Mathematical
Institute, University of Copenhagen'', you can select ``Search'' and do
a search on all persons at the Mathematical Institute, University of
Copenhagen. If you only want to search among staff at that institute,
you select ``Staff'' under ``Persons'' and then ``Search'' under
``Staff''.
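The way Search is scoped to one level of the hierarchy can be sketched
as follows. This is a minimal illustration in Python and not the EMIR
implementation; the Node class, its field names and the sample entries
are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One menu point: an institute, a class, a subclass or a leaf entry."""
    name: str
    text: str = ""  # free text of a leaf entry; empty for pure menu points
    children: list["Node"] = field(default_factory=list)

    def find(self, *path: str) -> "Node":
        """Walk down the menu along the given names (no error handling)."""
        node = self
        for name in path:
            node = next(c for c in node.children if c.name == name)
        return node

    def search(self, term: str) -> list[str]:
        """Free-text search restricted to this node and everything below it."""
        hits = [self.name] if term.lower() in self.text.lower() else []
        for child in self.children:
            hits.extend(child.search(term))
        return hits


# Hypothetical sample data, loosely based on the examples in this paper.
tree = Node("EMIR", children=[
    Node("Mathematical Institute, University of Copenhagen", children=[
        Node("Persons", children=[
            Node("Staff", children=[
                Node("Feddersen, Henrik", "Nonlinear dynamics, solitons"),
            ]),
            Node("Guests", children=[
                Node("NN", "Solitons, University of Honolulu"),
            ]),
        ]),
    ]),
])

persons = tree.find("Mathematical Institute, University of Copenhagen", "Persons")
print(persons.search("solitons"))                # all persons at the institute
print(persons.find("Staff").search("solitons"))  # staff only
```

Selecting ``Search'' under ``Persons'' corresponds to calling search on
the Persons node, while ``Search'' under ``Staff'' only visits the
Staff subtree.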
The first implementation of EMIR will include only a few types of
information (most likely about people at Danish mathematical
institutes and their publications). The system design, however,
allows for extensions both in what information types are handled
(width) and by the number and variety of information sources that
supply information to the service (depth).
Wherever possible, the implementation will use existing software
tools.
The fundamental requirement for such a system is that it must be
flexible. More specifically, the system design permits:
The actual behaviour of the system is controllable via a
configuration file. Typical changes (such as the change, addition or
removal of a data source) are possible merely by changing the
configuration file. Extensions to the system itself (such as the
addition of an information retrieval or delivery method and/or format)
will be possible without changing the configuration file format
(syntax) substantially.
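To illustrate how a data source could be added or removed by editing
the configuration alone, here is a small sketch in Python. The
line-based syntax, the keyword source, the institute code ku and the
host name ftp.math.example are all hypothetical; the actual
configuration language is not specified in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    institute: str   # short institute code (hypothetical)
    info_type: str   # e.g. persons, publications
    method: str      # retrieval method, e.g. ftp
    location: str    # where the data file is found


# Hypothetical configuration text; one data source per line.
CONFIG = """
# keyword institute type          method  location
source    ku        persons       ftp     ftp.math.example/emir/persons.dat
source    ku        publications  ftp     ftp.math.example/emir/publ.dat
"""


def load_sources(text: str) -> list[Source]:
    """Parse the sketch syntax above into Source objects."""
    sources = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        keyword, *args = line.split()
        if keyword != "source" or len(args) != 4:
            raise ValueError(f"bad configuration line: {line!r}")
        sources.append(Source(*args))
    return sources


for src in load_sources(CONFIG):
    print(src.institute, src.info_type, src.method, src.location)
```

Under these assumptions, adding a new institute means adding one line
to the configuration text; no program code has to change.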
At regular intervals (once a week, say) the set of (updated)
information sources is retrieved for processing and -- eventually --
delivery.
The unified format will be defined using SGML (Standard Generalized
Markup Language, ISO 8879) and SGML tools will thus be used for
processing.
The processing converts the collected information into deliverable
formats (e.g. into hypertext formats, ASCII, menu structures, etc.,
depending on the chosen delivery methods), and indexes are created for
fast search response times. All data is converted to a single
character set (Latin 1, ISO 8859-1).
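The index creation and the character set conversion mentioned above
can be sketched as follows. This is only an illustration in Python of
a simple inverted index; the function names and the sample documents
are assumptions, and the indexing software actually used by EMIR may
work differently.

```python
import re
from collections import defaultdict


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the ids of documents containing all words of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


def to_latin1(text: str) -> bytes:
    """Encode text in ISO 8859-1; raises UnicodeEncodeError if impossible."""
    return text.encode("iso-8859-1")


# Hypothetical sample documents.
docs = {
    "person-1": "Nonlinear dynamics, solitons",
    "course-1": "Solitons, inverse scattering transform",
}
index = build_index(docs)
print(sorted(search(index, "solitons")))
print(sorted(search(index, "inverse solitons")))
```

Because the index is built once during processing, a query only has to
look up the words of the search string instead of scanning all files.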
The basic problem that must be addressed in any case is that the access
method must be able to use the information that is to be served,
either by importing the information into an internal database or by using
access programs. Both methods are supported.
The initial implementation will support retrieval via FTP, either
anonymously or via an EMIR-specific account on the institute server.
For access, the initial implementation will support WWW.
The processing in the first implementation will convert the data
retrieved from the institutes to the hypertext format of WWW (called
HTML), and deliver hypertext documents. These documents will contain
mostly text (and thus not many hyperlinks). To permit use of the
search facilities of WWW, indexes of the content are generated.
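The conversion of one retrieved record into an HTML document can be
sketched as follows. This is a Python illustration only; the function
name, the page layout and the use of a definition list are assumptions
and do not describe the final EMIR output.

```python
from html import escape


def person_to_html(record: dict[str, str]) -> str:
    """Render one person record as a small, mostly textual HTML page."""
    title = escape(f"{record['FirstName']} {record['LastName']}")
    rows = "\n".join(
        f"<dt>{escape(name)}</dt><dd>{escape(value)}</dd>"
        for name, value in record.items()
    )
    return (
        f"<html><head><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n<dl>\n{rows}\n</dl></body></html>"
    )


# Sample record; the field names follow the staff example in this paper.
record = {
    "LastName": "Feddersen",
    "FirstName": "Henrik",
    "E-mail": "henrik@euromath.dk",
    "Interests": "Nonlinear dynamics & solitons",
}
print(person_to_html(record))
```

All field values are escaped, so characters such as & or < in the
institute data cannot break the generated document.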
The requirements described are the one-time ``investments'' that must
be made before starting or joining the EMIR project.
Likewise, the FTP-server doesn't necessarily have to be
physically located at the local institute. Institutes purchasing
computing power externally (e.g. through UNI-C) may use any such
machine.
A practical requirement will naturally be that the person(s)
responsible for the daily management of the data on this machine have
easy access to it.
The first of these tasks can be performed on, for instance, a weekly
basis, during the night. Thus this task will not require significant
processing power.
The other task -- processing HTTP queries -- is more difficult to
evaluate precisely. First of all, the load incurred by such queries --
especially searches -- varies widely, and secondly the number of queries
(i.e. the success of EMIR!) is not yet known.
EmC proposes a two-stage strategy. To begin with, a machine with
processing power roughly equivalent to that of a DECstation 3100
(32 MB RAM) is used. Such a server should easily be able to serve up
to 10 simultaneous requests.
If the load frequently exceeds 10 simultaneous queries, a larger
server may have to be purchased. Such a load level will, however,
indicate such a success for EMIR that the investments already made
cannot be considered a waste.
The last point -- TCP/IP drivers -- is especially important for
PCs (both DOS and Windows), since they usually don't include network
support. Several public domain TCP/IP software drivers are available
(EmC may supply a list), but they will have to be installed, supported
and maintained by the local institutes.
Thus, EmC suggests that only data for which a well-defined
local need exists be included in EMIR. This means that institutes are
not forced to maintain information for which they do not themselves
see a local need. Keep in mind that the individual institutes are not
required to supply all the information listed in section
What Information Is Provided.
How the responsibility for maintaining the local data is delegated is
naturally only the business of the individual institute. Since most
institutes already maintain local data-collections, such a delegation
has probably already been made.
An essential point is, however, that the person(s) responsible for
the local support are known also outside the institute, so questions
on data reliability can be addressed to the right point.
Thus efforts must be made to keep this local FTP-server highly
reliable. If the institute already has and uses the intended server,
system administration procedures to ensure this must be assumed to be
in place. In that case, EMIR will not incur additional
system-administration overhead.
Again it must be stressed that the person(s) responsible for local
system administration must be known to the other partners.
New techniques or extended facilities might require a redesign of EMIR
to allow it to exploit new possibilities. The question therefore
arises whether EMIR locks you to the presently suggested design or
whether it can be extended in width to provide new functionality
without the original investments being wasted!
EmC is not very happy suggesting yet another bureaucratic entity.
However, with the concept of distributed users/owners of EMIR, these
users/owners should meet at regular intervals and decide upon
suggestions relating to EMIR which have come up since the previous
meeting. In most cases an existing organization, such as the National
Mathematical Society, could most easily take over the responsibility
for EMIR.
One of the important tasks will be to ensure the quality of EMIR by
continuously evaluating the performance.
If the EMIR project becomes such a success that several servers are
needed, several solutions are possible.
Common to all the extensions mentioned is that -- though they may
require substantial work to implement -- they are truly
extensions to EMIR, i.e. the work put into the presently suggested
solution will not be wasted.
Alternate collection methods may be: NFS, querying existing
information systems such as WWW, X.500, Gopher or others.
The only theoretical requirement on the collection method is that the
data provided (FTP files at present, Gopher menu entries/WWW
nodes/X.500 objects in the future) is sufficiently structured to allow
automatic processing into the EMIR information.
Note, however, that new retrieval methods will require
additional work for implementation.
The data collection for this new scheme will not require many
modifications to the EMIR project. Since HTTP, Gopher and X.500 all
support referrals from one server to another, the query methods may
remain unchanged.
Note that this distribution thus can be performed transparently to the
end-users of EMIR.
Such distribution will, however, incur a non-negligible workload on the
sites hosting the additional servers: the servers must always be up
and available. The guiding principle must be: Better
slow service than no service.
The Need for an Information Service
All Mathematical Departments keep records of data concerning
employees, telephone numbers, courses, publications etc. Most of these
are (or are planned to be) stored, maintained and made available
locally in electronic form.
EMIR
EmC has looked into the problems of the non-interconnected information of
interest to mathematicians and has come up with the EMIR proposal
(Electronic Mathematical Information Retrieval). In doing so, EmC has set
up the following set of demands:
What Information Is Provided
In this section we outline the kind of information which we think will
be of interest to mathematicians outside the local institute. It will
not be difficult to adapt the classes of information to suit special
needs, such as those of the Danish mathematicians or those of the
EmNet projects. As mentioned in the document Dan-EMIR, the first
implementation, which will take place in Denmark, will only contain
information about a subset (persons and publications) of the intended
information service.
In the final implementation, however, each participating mathematical
institute should be motivated to maintain local data about:
Many other items are of interest, such as department profiles, research
projects, special expertise, etc. After a small field research
performed by EmC, the 8 classes above were, however, those which most
users requested. In the following we shall therefore describe
specifically what information should be available in each of these 8
classes.
Persons
Persons at a mathematical institute can be divided into
For each person must be provided last name, first name, affiliation,
position, e-mail address, telephone number, fax number, and
professional interests (including willingness to referee reviews,
offer consulting services, participation in projects, etc.). For
guests a permanent address should also be included. For a member of
staff data could, for example, be organized as follows:
<LastName> Feddersen
<FirstName> Henrik
<Position> Ph.D.
<Affiliation> Ko/benhavns Universitet, EmC
<E-mail> henrik@euromath.dk
<Tel> 35 32 07 12
<Fax> 35 32 07 19
<Interests> Nonlinear dynamics, solitons
</>
Note that the above is only meant as an example that sketches the
organization of data as provided by the institute (here EmC). The
actual format will be different (here the emphasis is on the
data that the institutes should provide, not on the markup -
technical details about the format of data files will be given to the
participating institutes at a later stage).
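To make the organization of the example concrete, the following Python
sketch reads records written in the illustrative layout above: each
field starts with a name in angle brackets, unmarked lines continue
the previous field, and a line containing only the end marker closes
the record. Since the actual format will be different, this parser is
only an assumption-laden illustration; the function name and the error
handling are our own.

```python
import re

# A field line: "<Name> value". The end-of-record marker is excluded.
FIELD = re.compile(r"^<([^/>][^>]*)>\s*(.*)$")
END = "</>"


def parse_records(text: str) -> list[dict[str, str]]:
    """Split text in the sketch layout into a list of field dictionaries."""
    records, current, last = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == END:                      # end of one record
            records.append(current)
            current, last = {}, None
            continue
        match = FIELD.match(line)
        if match:                            # a new field starts
            last = match.group(1)
            current[last] = match.group(2)
        elif last is not None:               # continuation of previous field
            current[last] = (current[last] + " " + line).strip()
        else:
            raise ValueError(f"text outside any field: {line!r}")
    if current:
        raise ValueError("last record is not closed")
    return records


SAMPLE = """\
<LastName> Feddersen
<FirstName> Henrik
<E-mail> henrik@euromath.dk
<Interests> Nonlinear dynamics,
solitons
</>
"""
print(parse_records(SAMPLE))
```

Field names containing spaces, which occur in some of the later
examples, are accepted by the same pattern.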
For a guest we would need an additional field about his permanent
address, e.g.
<Permanent> University of Honolulu, Hawaii, USA
Publications
By publications we mean bibliographic data about publications by
people affiliated with the institute. There are several fields
associated with publications. The type of publication (article,
proceedings, thesis, preprint, etc.) determines which
fields should be included:
<Author>
<Title>
<Journal>
<Volume>
<Number>
<Pages>
<Editor>
<Booktitle>
<Edition>
<Publisher>
<Chapter>
<Series>
<Address>
<Year>
<Abstract>
<Classification>
<Comments>
<FTP>
</>
The last field <FTP> is used if the full text publication
(typically a preprint) is available by anonymous ftp. The user
interface will enable you to fetch the publication simply by clicking
the mouse where it says something like click here to fetch
document. (See also section 3).
Seminars
Information about seminars, colloquia, thesis defences, etc. should
be organized like the following example:
<Title> Singular Integral Equations on Curves with Cusps
<Speaker> Roland Duduchava
<Affiliation> Tbilisi
<When> 22 Feb. 1994, 14.15-15.00
<Where> Aud. 10, HCO/
<Abstract> In the lecture I shall explain how one-dimensional
singular integral equations with complex conjugated
unknown functions appear in boundary value problems.
I shall show that Fredholm properties and the index
of such equations depend on angles in corner points
and orders in cusp points of the underlying curve.
</>
Teaching
The Teaching section should include all courses, from 1st year
undergraduate to Ph.D. level, given by the institute. The information
that should be included about a course is illustrated in the example
below:
<Title> Solitons
<Level> 4th year
<Prerequisite> Course A3 + B17
<Teacher> NN
<Affiliation> Ko/benhavns Universitet, Matematisk Institut
<Syllabus> Solitons, inverse scattering transform, complete
integrability of Hamiltonian systems, fundamental
models, Poisson brackets.
<Text book> L D Faddeev and L A Takhtajan: Hamiltonian Methods in
the Theory of Solitons, Springer, 1987
<Extra litt> M J Ablowitz and P A Clarkson: Solitons, Nonlinear
Evolution Equations and Inverse Scattering, Cambridge
Univ. Press, 1991.
<When> Sep 98 - Jan 99
<Hours per week> 2
<Evaluation> To be seen!
</>
Conferences
Conferences include every conference worldwide that the institute
feels may be of general interest. The information that will be
provided to the service by an institute should include some or all of
the fields indicated in the example below:
<Conf> Workshop on Groups and Three-Manifolds
<When> Spring 1995
<Where> Centre de Recherches Mathematique, Universite de
Montreal
<Program> There will be a program of visitors, both short- and
long-term, with more informal activities organized in
consequence. Special emphasis will be placed on...
<Committee> L Vinet, director of CRM; S Boyer, UQAM...
<Topics> Progress on Thurston's Geometrization conjectures,
group actions on trees,...
<Papers> People wishing to present a paper or poster should...
Particularly encouraged are papers relating to...
Deadline for submission of abstracts is...
<Invited> NN,...
<Registration> M Louis Pelletier, CRM, Universite de Montreal,...
</>
Journals Subscribed to
It may be of interest to others which journals your institute
subscribes to. The name of the journal and how far back your
subscription dates are needed in the provided information, e.g.
<Journal> Journal of Differential Equations
<From> Vol. 4, 1970
</>
Vacant Positions
Positions and possible Ph.D. grants can be advertised on the
information service, e.g.
<Position> Lecturer
<Where> Ko/benhavns Universitet, Matematisk Institut
<Date> 28 Feb 1994
<Text> We are looking for ...
<Contact> NN, E-mail: nn@math.ku.dk, Tel: 12 34 56 78
</>
Access to Departmental Library Databases
This point is different from the rest. Here the institutes do not have
to type in information specifically for the information service. We
``just'' want to have an Internet connection to the institute's library
database so that anyone can search in this database.
User Interface
You can find information in two ways: browsing and searching. As
mentioned in section Initial implementation, we have suggested making
the service available from a World Wide Web server, so if you are
familiar with NCSA Mosaic you may keep the program xmosaic in
the back of your mind when you read the following.
So if you, e.g., select the menu entry Persons, you will get a list of
all persons affiliated with all the participating institutes.
There is no subdivision according to institute (but for Persons we
suggest a division into Staff, Postdocs, Research students and
Guests).
Technical overview
This chapter gives an overview of the implementation needed at the
central server site, e.g. at EmC. Although it is an overview, it
lists the necessary tasks, but without going into trivial details of
each subtask.
Design objectives
The basic idea of EMIR is to collect useful information and make it
available for search and retrieval at an information server. The
implementation of such a system must be able to cope with a variety of
scenarios, depending on exactly what information will be handled in
the system.
Naturally the first implementation will only handle a few information
types and retrieval & delivery methods, but the design is such
that new information types and retrieval & delivery methods can fit
easily into the existing architecture.
Functionality
The ultimate objective of the system is to provide central access to
otherwise widely spread information. The system functionality can thus
basically be divided into three classes (See figure 2):
The above describes the functionality classes; the following sections
provide a more detailed description of each of the three classes.
Retrieval
Each participating institute provides information to the system. For
each institute one or more information sources are identified.
Each information source represents an information type (e.g.
information about people at the institute) and contains the
institute's information of that type. Each information source is
placed in an identified file on an identified server.
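Retrieving one information source via FTP can be sketched as follows,
using the ftplib module of the Python standard library. The function
name, the default password string and the ftp_factory parameter (which
only exists so that the function can be tried without a network) are
assumptions for this illustration; host and file names in the usage
comment are placeholders.

```python
import ftplib
import io


def fetch_source(host: str, path: str, user: str = "anonymous",
                 password: str = "emir@", ftp_factory=ftplib.FTP) -> bytes:
    """Fetch one identified file from one identified server via FTP."""
    buffer = io.BytesIO()
    ftp = ftp_factory(host)          # connects to the institute's server
    try:
        ftp.login(user, password)    # anonymous or EMIR-specific account
        ftp.retrbinary(f"RETR {path}", buffer.write)
    finally:
        ftp.close()                  # always release the connection
    return buffer.getvalue()


# Example call (placeholder host and path, not executed here):
#   data = fetch_source("ftp.math.example", "emir/persons.dat")
```

A weekly job would call this function once for every information
source listed in the configuration and hand the result on to the
processing step.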
Processing
The purpose of the processing is to prepare the collected information
for delivery. The institutes may deliver the information in various
formats. (Although having just one format for each information
type would allow a much simpler implementation.) Each of the files
must be checked syntactically, and then -- if the validation is
successful -- converted into a unified format. All subsequent
processing is then performed on the unified format.
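The syntactic check of a retrieved record can be sketched as follows.
The list of required fields is taken from the description of the
Persons class, and the field names follow the staff example; both may
differ in the final format. The function name and the very simple
e-mail test are assumptions made for this Python illustration.

```python
# Fields that must be provided for each person (names as in the example).
REQUIRED_PERSON_FIELDS = (
    "LastName", "FirstName", "Position", "Affiliation",
    "E-mail", "Tel", "Fax", "Interests",
)


def validate_person(record: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    problems = [
        f"missing field <{name}>"
        for name in REQUIRED_PERSON_FIELDS
        if not record.get(name, "").strip()
    ]
    email = record.get("E-mail", "")
    if email and "@" not in email:
        problems.append(f"malformed e-mail address: {email!r}")
    return problems


record = {
    "LastName": "Feddersen", "FirstName": "Henrik", "Position": "Ph.D.",
    "Affiliation": "EmC", "E-mail": "henrik@euromath.dk",
    "Tel": "35 32 07 12", "Fax": "35 32 07 19",
    "Interests": "Nonlinear dynamics, solitons",
}
print(validate_person(record))
```

Only records for which the list of problems is empty would be passed
on to the conversion into the unified format; the problems found could
be reported to the person responsible for the local data.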
Delivery
The delivery functionality provides access to the collected information.
Basically, the delivery consists of responding to information retrieval
requests from users. Users are expected to access the information service
via the Internet. Various systems exist (WWW, Gopher, WAIS, X.500, etc.)
that could be used for this access, and the exact functionality thus depends
on the system in question.
Tools
The solution will run on standard UNIX equipment. The configuration
file is essentially a small formal language that describes almost all
aspects of the system's behaviour. The implementation of the
configuration language will use (an extension of) the programming
language TCL (Tool Command Language). In addition, many standard UNIX
facilities (such as FTP) are used. For the text processing needed to
validate the retrieved institute data files, Perl is used.
Initial implementation
As described above EMIR is a concept that can be implemented in many
ways. The idea is to start with a limited initial implementation
(which supports only two information types, one retrieval method and
one access method). If desirable, this first implementation can later
be expanded to support more information types, retrieval methods, and
access methods.
Prerequisites
The proposal for the technical solution of the EMIR task has, as
mentioned in section Introduction, been designed to require as few
local investments in new hardware and software as possible. This
section describes these minimal requirements.
Hardware
The hardware requirements may be split into three areas: the hardware
needed to keep the local datafiles for retrieval by the server-site
(e.g. EmC), the hardware needed at the server-site to retrieve this
data and serve queries, and the local hardware needed to perform these
queries.
Local datafile servers
Since the collection of data is performed via FTP, each local
institute needs to keep its datafiles on a machine offering
FTP access. Whether this machine runs UNIX, is a PC server, a VMS VAX
or anything else is of no importance.
The server-site hardware
As mentioned in section Functionality, the server-site hardware shall
perform two functions: collecting and processing local datafiles using FTP, and
answering HTTP queries for the collected data.
The local client hardware
The EMIR proposal has explicitly been designed to be usable on any
hardware platform. Thus PCs running DOS, UNIX machines, PCs running
Windows and possibly even Macs running DOS can be used.
Software
Only public domain software will be required. Thus, this item incurs
no additional costs.
Networking
Since EMIR is a networked database, Internet access is a must for ALL
the machines involved. This includes the software drivers for the TCP/IP
protocol suite.
Organization
The last and crucial point is that the data must be correct and
up-to-date. Many grand schemes in distributed databases have collapsed
because the data was not maintained properly.
Continuous maintenance requirements
This section describes the daily work imposed on institutes
participating in the EMIR project. Though the project has been
designed to minimize this, some maintenance tasks must be expected.
Maintaining local data
The local data being collected by the central server-site must
naturally be kept up-to-date. Since the guiding principle is that
only information for which a clear local need can be identified
-- for instance data that is already being maintained -- shall
be included in EMIR, the work involved will be up to the local
institutes. An institute will presumably only offer data through EMIR
if the institute is willing to allocate the resources needed to
maintain it.
Local system maintenance
For the EMIR system to work, the local FTP-server from which the data
is collected must be up when the central server-site tries to collect
data.
Server-site system maintenance
The server-site is expected to allocate the resources for
Extending EMIR
EMIR can be extended both in width and in depth. Section What
Information Is Provided outlines the classes of information which
are believed to be the natural choice today. However, as time goes
by, one or several institutes might see the need for additional
information classes, and the question then arises whether and how
EMIR should be extended in width.
Adding new classes of information
EMIR encompasses users from institutes which do not meet each other
every day. The needs of the institutes for information might evolve in
different directions, and therefore the "server-site" might get
conflicting proposals from the participating institutes for how EMIR
should be extended in width. To avoid local optimization, a
common organization should own EMIR and thereby have the power to
decide in which - if any - direction EMIR shall evolve.
Possibilities for future migration to other distributed
database schemes
If the central-server idea is maintained, one may want to widen the
EMIR scheme in two areas: the method used for collecting the local
data, and the method used for accessing (browsing/searching/querying)
this data.
Adding collection methods
As mentioned in section Initial implementation, the technical
EMIR proposal is designed to accommodate multiple collection methods,
though only one (FTP) is implemented to begin with.
Adding query methods
The EMIR technical proposal is also designed to accommodate multiple
query methods. In the present proposal the collected data will only be
accessible through HTTP (the WWW), but access methods such as Gopher,
X.500 and others may be added if sufficient resources are provided
later on.
Several servers
If EMIR becomes a tool used intensively every day by many
mathematicians, the single-server approach may be a bottleneck. In
this case, more servers may need to be set up, either for distributing
responsibility for parts of the information tree described in section
User Interface, or for replicating the existing server.
Distributing the information tree
If the load on the central server or its network connection becomes
too high, servers may have to be added to share this load. This may
reasonably be done by conceptually splitting the information tree from
figure 1 into parts, and assigning the
responsibility for each of these parts to an individual server.
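The assignment of parts of the information tree to individual servers
can be sketched as a table of path prefixes. This is a Python
illustration only; the server addresses, the table contents and the
longest-prefix rule are assumptions and not part of the present
proposal.

```python
# Hypothetical assignment of subtrees to servers (placeholder addresses).
ASSIGNMENTS = {
    ("Persons",): "http://emir-persons.example/",
    ("Institutes", "Mathematical Institute, University of Copenhagen"):
        "http://emir-ku.example/",
}
DEFAULT_SERVER = "http://emir.example/"   # the original central server


def server_for(path: tuple[str, ...]) -> str:
    """Return the server responsible for a menu path; longest prefix wins."""
    for length in range(len(path), 0, -1):
        server = ASSIGNMENTS.get(path[:length])
        if server is not None:
            return server
    return DEFAULT_SERVER


print(server_for(("Persons", "Staff")))
print(server_for(("Publications",)))
```

The central server would answer a request for a delegated subtree with
a referral to the address found in the table, and would serve
everything else itself.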
Replicating servers
Server replication can be useful for fault-tolerance. The EMIR
data-collection and processing system allows this without significant
modification. The HTTP/WWW query mechanism, however, does not, so
server-replication is not a realistic possibility.