Research Technologies Service,
Oxford University Computing Services
13 Banbury Road, Oxford OX2 6NN
lou.burnard@oucs.ox.ac.uk
tony.dodd@btinternet.com
tel: +44 1865 273285
fax: +44 1865 273275
Xara is the
working name for a new version of SARA, the `SGML aware retrieval application'
originally developed for use with the British National Corpus (BNC) in 1994.
The system has been completely rewritten as a general purpose tool for
searching large XML corpora, with a particular focus on the needs of corpus
linguists, with close attention to new XML-based encoding standards, and with
the benefit of hindsight derived from a decade of feedback from hundreds of
SARA-users world wide.
The Xara
system combines the following components: (1) an indexer, which
creates inverted file style indexes to a large collection of discrete XML
documents; (2) a server, which handles all interaction between the
client programs and the data files; (3) a Windows client, which
handles interaction between the server and the user. The modularity of this
architecture has several advantages, permitting, for example, the development
of multiple specialized client programs for different applications or styles of
usage.
In
addition, an index building utility, called Indextools, is supplied with Xara,
which simplifies the process of constructing a Xara database. Its chief
function is to collect information about the corpus to be supplied additional
to that present in any pre-existing corpus header, and to produce a validated
and extended form of the corpus header. It can also be used to run the indexer
and test its output.
Rather than review extensively
its history and design philosophy in this short presentation, we give here some
general comments covering the following aspects of the system:
Xara will
process any XML encoded corpus. The more detail present in the tagging, the
more facilities are available to the client but the minimal requirement is only
that the text be well-formed XML. If the corpus to be processed specifies a
document type definition (DTD), then Xara will validate it at indexing time,
and will not proceed if any validity errors are discovered. Unlike earlier
versions of the program, any DTD may be used for this purpose, though Xara was
(naturally) designed with the likely needs of such widely used DTDs as the TEI
or CES in mind. Also unlike earlier versions of the program, as previously
noted, no DTD at all need be supplied.
Xara can
thus do something useful with the full range of digital material one might wish
to build into a corpus. At one extreme, we will demonstrate how it can be used
to provide basic searching facilities for a collection of Project Gutenberg
style texts, innocent of any explicit descriptive markup at all; at the other,
we will show how it can also take full advantage of the rich annotation present
in a multilingual corpus produced in full conformance with the XCES Guidelines,
containing detailed feature structure analysis, POS-tagging, and explicit
lemmatization. Oddly enough, the most problematic material is likely to be
texts which have been marked up in loose (syntactically incorrect) HTML, such
as that generated by automatic conversion from Microsoft Word; fortunately,
utilities such as Dave Ragget's Tidy are readily available to generate
well-formed XML from such conversions.
With XML
support, comes Unicode support. Xara uses Unicode internally to represent all
character data: it can thus handle text in any language, and any combination of
languages. To take full advantage of this, the user of the system needs
convenient methods both for displaying and for inserting Unicode characters at
their workstation.
Good
Unicode fonts are now available for display of texts in several different
scripts (we have so far tested Chinese, Eastern European, Medieval, and Ancient
Greek scripts) and the number continues to grow. Xara does not assume the
existence of any particular font, however, and allows the user to select the
display font at run time. In common with other XML systems, the system will
correctly process character entity references found in the data, such
references being retained untranslated in the underlying corpus index, but
rendered as the appropriate Unicode code point when displayed in non-XML modes.
To enter
characters not found on the keyboard, for example to search for words
containing them, however, an appropriately configurable input system is necessary.
Dynamic keyboard redefinition is built into more recent operating systems, but
is still a little user-unfriendly in most cases; the Xara client therefore
includes its own keyboard redefinition facilities, which allow selection of
specific characters from a Unicode table by point and click, temporary mapping
of keyboard keys, or complete redefinition of a new keyboard map, which can be
loaded as needed.
Any XML or SGML aware search
engine has the ability to locate specific tagged components and to carry out
searches within the context of such components. Most also have the ability to
reorganize and display search results in a variety of forms. SARA extended
these facilities with some additional, more lexically-motivated, abilities.
These include:
Xara
supports the full range of facilities originally provided by Sara, but with
several modifications and simplifications to the interface. A number of new
facilities have also been added.
Xara
inherits from SARA a rich range of query facilities. The user
can search for substrings or regexp-style patterns, words, phrases, or the tags
which delimit XML elements and their descriptive attributes, either simply
checking for their presence or absence in the lexical index maintained for the
corpus (and inspecting their frequency or collocative patterns therein), or
retrieving and displaying actual instances of the words etc. sought for.
Searches
can be made using additional keys such as part of speech, or root form (lemma)
of a token, specified either explicitly in the tagging of the texts, or
implicitly by means of a named algorithm. Xara also supports a variety of
scoped queries, searching for combinations of words etc. in particular
contexts, which may be defined as XML elements, or as combinations of other
identifiable entities, or as stretches of text. Such searches may be
order-sensitive or insensitive.
Different
kinds of search can be combined to form complex queries of
various kinds, using either a simple graphical interface, or a (rather
esoteric) `Corpus Query Language' (CQL). This language defines the full
capability of the query interface and forms a major part of the protocol by
means of which client and server modules communicate. In Xara, it has been
re-expressed using an XML syntax, in line with the desire to leverage XML
standards wherever possible. The client also supports a simple ECMAscript-like
scripting language which makes direct calls on an API defined in very similar
terms to the CQL.
Like its
predecessor, Xara displays the results of searches either one
at a time or within a traditional KWIC style window which can be sorted,
thinned, expanded etc. The context of hits located in this way can be explored
by expanding it, up to the full text level if necessary. Unlike its
predecessor, Xara allows user-definition of a range of formatting properties
for KWIC and single hit displays, using a subset of the formatting properties
defined by the W3C standard Cascading Stylesheets (CSS); it also allows the
user to export results in a simple XML format which can then be reprocessed,
either by Xara, or by any other XML-competent application, such as a word
processor or XSLT engine.
Corpora can
be reorganized or partitioned in a user-defined way, using the
results of any query, the values of specified element/attribute combinations,
or a manual classification. Searches carried out across partitioned corpora can
be analysed by partition: so the client can display the relative frequencies of
a given lexical phenomenon in texts of different categories identified in a
corpus.
For
existing users of SARA, the main new feature of Xara will probably be the
facilities it offers for indexing of new corpora. As noted above, all the
parameters required by the indexer and server are now gathered and stored
transparently within a TEI-conformant XML header file, rather than being
supplied at runtime by various obscure control files, command line options, or
hard-coded declarations. A new Windows utility has been included to facilitate
creation and modification of this data: it allows the user to control the
behaviour of the indexer by selecting available options from a series of
dialogues, and saves the results of these decisions in an appropriate part of
the TEI/XCES header. The same utility can be used to check validity of the
supplied corpus files and to run the indexing utility, and provides an
interface for testing that the system index files have been correctly
generated. Its use is not however essential: other XML-aware software can be
used to define a corpus header which Xara will use, and the indexer utility can
be run independently.
The indexer needs to be provided
with the following information:
Much,
perhaps most, of this information is implicit in the XML structure for a richly
tagged corpus: one could imagine a corpus in which every index term was
explicitly tagged, with lemma, part of speech, representation etc. In practice
however, such richly tagged corpora remain imaginary, and software performs the
largely automatic task of adding value to a document marked up with only a
basic XML structure. The new indextools utility specifies how the task is to be
performed in a standardized way, using the existing structure of the TEI header
(see http://www.tei-c.org/Guidelines/HD.htm) to
hold the specification. Because our original design goal was to avoid any need
for extension or modification of the existing TEI DTD; we have used very
general constructions of the type<list
type="foo"/> rather than defining more `application focussed' tags such as <foo>. We have
however done this is a fairly consistent and easily validated form, so that
when TEI support for external namespaces is available, it will be easy to make
our generic tags more specific.
The Xara system is currently under beta-test and will be demonstrated at
the conference. We hope to make an initial general release available by the
summer of 2003, but participants wishing to experiment with it earlier are very
welcome to participate in the beta test. The first release will operate in any
modern Microsoft Windows environment; it is intended, however, to port both
indexer and server to Unix environments as soon as possible; it is also
intended to licence open source versions of at least those components of the
system at that time (open source licensing of the Microsoft-dependent
components is rather less likely, for obvious reasons)