1 <chapter id="examples">
2 <!-- $Id: examples.xml,v 1.17 2002-11-08 17:00:57 mike Exp $ -->
3 <title>Example Configurations</title>
6 <title>Overview</title>
9 <literal>zebraidx</literal> and <literal>zebrasrv</literal> are both
10 driven by a master configuration file, which may refer to other
11 subsidiary configuration files. By default, they try to use
12 <filename>zebra.cfg</filename> in the working directory as the
13 master file; but this can be changed using the <literal>-c</literal>
14 option to specify an alternative master configuration file.
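As a minimal sketch of the <literal>-c</literal> option, the fragment below builds the two command lines and runs the indexer only when it is installed. The file name <filename>alt.cfg</filename> and the <literal>records</literal> directory are assumptions chosen for illustration.

```shell
# Assumption: alt.cfg is a hypothetical alternative master file.
ZEBRA_CFG="alt.cfg"

# The same -c option is given to both programs.
IDX_CMD="zebraidx -c $ZEBRA_CFG update records"
SRV_CMD="zebrasrv -c $ZEBRA_CFG"
echo "$IDX_CMD"
echo "$SRV_CMD"

# Run the indexer only if it is installed and the file exists.
if command -v zebraidx > /dev/null 2> /dev/null; then
    if [ -f "$ZEBRA_CFG" ]; then
        $IDX_CMD
    fi
fi
```

Without <literal>-c</literal>, both programs fall back to <filename>zebra.cfg</filename> in the working directory.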
17 The master configuration file tells Zebra:
22 Where to find subsidiary configuration files, including
23 <literal>default.idx</literal>
24 which specifies the default indexing rules.
30 What attribute sets to recognise in searches.
36 Policy details such as what record type to expect, what
37 low-level indexing algorithm to use, how to identify potential
38 duplicate records, etc.
45 Now let's see what goes in the <literal>zebra.cfg</literal> file
46 for some example configurations.
51 <title>Example 1: XML Indexing And Searching</title>
54 This example shows how Zebra can be used with absolutely minimal
55 configuration to index a body of
56 <ulink url="http://www.w3.org/XML/">XML</ulink>
57 documents, and search them using
58 <ulink url="http://www.w3.org/TR/xpath">XPath</ulink>
59 expressions to specify access points.
62 Go to the <literal>examples/zthes</literal> subdirectory
63 of the distribution archive.
64 There you will find a <literal>Makefile</literal> that will
65 populate the <literal>records</literal> subdirectory with a file of
66 <ulink url="http://zthes.z3950.org/">Zthes</ulink>
67 records representing a taxonomic hierarchy of dinosaurs. (The
68 records are generated from the family tree in the file
69 <literal>dino.tree</literal>.)
70 Type <literal>make records/dino.xml</literal>
71 to make the XML data file.
74 Now we need to create a Zebra database to hold and index the XML
75 records. We do this with the
76 Zebra indexer, <literal>zebraidx</literal>, which is
77 driven by the <literal>zebra.cfg</literal> configuration file.
78 For our purposes, we don't need any
79 special behaviour - we can use the defaults - so we start with a
80 minimal file that just tells <literal>zebraidx</literal> where to
81 find the default indexing rules, and how to parse the records:
83 profilePath: .:../../tab
88 That's all you need for a minimal Zebra configuration. Now you can
89 roll the XML records into the database and build the indexes:
91 zebraidx update records
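The configuration and indexing steps can be collected into one short script. This is only a sketch: the <literal>recordType: grs.sgml</literal> line is an assumption about how to tell <literal>zebraidx</literal> to parse the records, and the scratch directory name is hypothetical.

```shell
# Write a minimal zebra.cfg into a scratch directory.
# Assumption: "recordType: grs.sgml" is one plausible record-type
# setting for XML records; adjust it to suit your Zebra version.
workdir="${TMPDIR:-/tmp}/zebra-example.$$"
mkdir -p "$workdir"
printf '%s\n' \
    'profilePath: .:../../tab' \
    'recordType: grs.sgml' > "$workdir/zebra.cfg"

cat "$workdir/zebra.cfg"

# Index the records only when zebraidx is installed and records exist.
if command -v zebraidx > /dev/null 2> /dev/null; then
    if [ -d records ]; then
        zebraidx -c "$workdir/zebra.cfg" update records
    fi
fi
```

The relative <literal>profilePath</literal> assumes the script is run from the <literal>examples/zthes</literal> directory.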
Now start the server. Like the indexer, its behaviour is
controlled by the
<literal>zebra.cfg</literal> file; and like the indexer, it works
just fine with this minimal configuration.
By default, the server listens on TCP port 9999, although
this can easily be changed - see
<xref linkend="zebrasrv"/>.
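As a rough illustration, the listening address can be given explicitly on the command line, using the same <literal>tcp:@:PORT</literal> form that the client uses. The <literal>RUN_SERVER</literal> switch is a hypothetical guard added here so that the sketch does not block unless asked to.

```shell
# Assumption: 9999 mirrors the default port mentioned above;
# any free port can be substituted.
PORT=9999
LISTEN="tcp:@:$PORT"
echo "zebrasrv $LISTEN"

# RUN_SERVER is a hypothetical switch for this sketch: the server
# runs in the foreground, so it is only started on request.
if [ "${RUN_SERVER:-0}" = "1" ]; then
    zebrasrv "$LISTEN"
fi
```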
Now you can use the Z39.50 client program of your choice to execute
XPath-based boolean queries and fetch the XML records that satisfy
them:
111 $ yaz-client tcp:@:9999
113 Z> find @attr 1=/Zthes/termName Sauroposeidon
118 <termId>22</termId>
119 <termName>Sauroposeidon</termName>
120 <termType>PT</termType>
122 <relationType>BT</relationType>
123 <termId>21</termId>
124 <termName>Brachiosauridae</termName>
125 <termType>PT</termType>
128 <idzebra xmlns="http://www.indexdata.dk/zebra/">
129 <size>245</size>
130 <localnumber>23</localnumber>
131 <filename>records/dino.xml</filename>
137 Now wasn't that easy?
142 <sect1 id="example2">
143 <title>Example 2: Supporting Interoperable Searches</title>
146 The problem with the previous example is that you need to know the
147 structure of the documents in order to find them. For example,
148 when we wanted to find the record for the taxon
149 <foreignphrase role="taxon">Sauroposeidon</foreignphrase>,
150 we had to formulate a complex XPath
151 <literal>/Zthes/termName</literal>
which embodies the knowledge that taxon names are specified in a
<literal>&lt;termName&gt;</literal> element inside the top-level
<literal>&lt;Zthes&gt;</literal> element.
157 This is bad not just because it requires a lot of typing, but more
158 significantly because it ties searching semantics to the physical
159 structure of the searched records. You can't use the same search
160 specification to search two databases if their internal
161 representations are different. Consider an alternative taxonomy
database in which the records have taxon names specified
inside a <literal>&lt;name&gt;</literal> element nested within a
<literal>&lt;identification&gt;</literal> element
inside a top-level <literal>&lt;taxon&gt;</literal> element: then
you'd need to search for them using
<literal>1=/taxon/identification/name</literal>.
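To make the coupling concrete, the sketch below builds the query for each of the two databases. The helper name <literal>xpath_query</literal> is hypothetical; the point is that the same search term needs a different path for every record layout.

```shell
# Build an XPath-based query from an element path and a search term.
# xpath_query is a hypothetical helper for this illustration.
xpath_query() {
    printf '@attr 1=%s %s\n' "$1" "$2"
}

# One query per record layout, for the same taxon name.
ZTHES_Q=$(xpath_query /Zthes/termName Sauroposeidon)
TAXON_Q=$(xpath_query /taxon/identification/name Sauroposeidon)
echo "$ZTHES_Q"
echo "$TAXON_Q"
```

Each line could be typed after <literal>find</literal> in a client session, but only against the database whose records have the matching structure.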
170 How, then, can we build broadcasting Information Retrieval
171 applications that look for records in many different databases?
172 The Z39.50 protocol offers a powerful and general solution to this:
abstract <quote>access points</quote>. In the Z39.50 model, an access point
174 is simply a point at which searches can be directed. Nothing is
175 said about implementation: in a given database, an access point
176 might be implemented as an index, a path into physical records, an
177 algorithm for interrogating relational tables or whatever works.
The key point is that the semantics of an access point are fixed
and well defined.
182 For convenience, access points are gathered into <firstterm>attribute
183 sets</firstterm>. For example, the BIB-1 attribute set is supposed to
184 contain bibliographic access points such as author, title, subject
185 and ISBN; the GEO attribute set contains access points pertaining
186 to geospatial information (bounding coordinates, stratum, latitude
187 resolution, etc.); the CIMI
188 attribute set contains access points to do with museum collections
(provenance, inscriptions, etc.).
192 In practice, the BIB-1 attribute set has tended to be a dumping
193 ground for all sorts of access points, so that, for example, it
194 includes some geospatial access points as well as strictly
195 bibliographic ones. Nevertheless, the key point is that this model
196 allows a layer of abstraction over the physical representation of
197 records in databases.
200 In the BIB-1 attribute set, a taxon name is probably best
201 interpreted as a title - that is, a phrase that identifies the item
in question. BIB-1 represents title searches by
access point 4. (See
<ulink url="ftp://ftp.loc.gov/pub/z3950/defs/bib1.txt"
>The BIB-1 Attribute Set Semantics</ulink>.)
So we need to configure our dinosaur database so that searches for
BIB-1 access point 4 look in the
<literal>&lt;termName&gt;</literal> element,
inside the top-level
<literal>&lt;Zthes&gt;</literal> element.
213 This is a two-step process. First, we need to tell Zebra that we
214 want to support the BIB-1 attribute set. Then we need to tell it
215 which elements of its record pertain to access point 4.
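Once both steps are in place, a client should be able to search by title without knowing the record structure. The sketch below builds such a query; the helper name <literal>title_query</literal> is hypothetical, and the query only works against a database configured to map access point 4.

```shell
# Build a BIB-1 title search (use attribute 4) for a given term.
# title_query is a hypothetical helper for this illustration.
title_query() {
    printf '@attr 1=4 %s\n' "$1"
}

TITLE_Q=$(title_query Sauroposeidon)
echo "$TITLE_Q"
```

The same query string could be sent unchanged to any other database that supports BIB-1 access point 4, whatever its internal record layout.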
218 We need to create an <link linkend="abs-file">Abstract Syntax
219 file</link> named after the document element of the records we're
220 working with, plus a <literal>.abs</literal> suffix - in this case,
221 <literal>Zthes.abs</literal> - as follows:
239 The simplest hello-world example could go like this:
244 <title>The art of motorcycle maintenance</title>
245 <subject scheme="Dewey">zen</subject>
250 f @attr 1=/book/title motorcycle
252 f @attr 1=/book/subject[@scheme=Dewey] zen
254 If you suddenly decide you want broader interop, you can add
255 an abs file (more or less like this):
260 elm (2,1) title title
261 elm (2,21) subject subject
265 How to include images:
269 <imagedata fileref="system.eps" format="eps">
272 <imagedata fileref="system.gif" format="gif">
275 <phrase>The Multi-Lingual Search System Architecture</phrase>
279 <emphasis role="strong">
280 The Multi-Lingual Search System Architecture.
283 Network connections across local area networks are
284 represented by straight lines, and those over the
285 internet by jagged lines.
Where the three <*object> thingies inside the top-level <mediaobject>
are decreasingly preferred versions to include, depending on what the
rendering engine can handle. I generated the EPS version of the image
by exporting a line-drawing done in TGIF, then converted that to the
GIF using a shell-script called "epstogif" which used an appallingly
baroque sequence of conversions, which I would prefer not to pollute
the Zebra build environment with:
299 # Yes, what follows is stupidly convoluted, but I can't find a
300 # more straightforward path from the EPS generated by tgif's
301 # "Print" command into a browser-friendly format.
303 file=`echo "$1" | sed 's/\.eps//'`
304 ps2pdf "$1" "$file".pdf
305 pdftopbm "$file".pdf "$file"
306 pnmscale 0.50 < "$file"-000001.pbm | pnmcrop | ppmtogif
307 rm -f "$file".pdf "$file"-000001.pbm
311 <!-- Keep this comment at the end of the file
316 sgml-minimize-attributes:nil
317 sgml-always-quote-attributes:t
320 sgml-parent-document: "zebra.xml"
321 sgml-local-catalogs: nil
322 sgml-namecase-general:t