From 20e4253a40eb337adedf5496b64baaa6010485d8 Mon Sep 17 00:00:00 2001 From: Marc Cromme Date: Wed, 18 Jan 2006 14:00:54 +0000 Subject: [PATCH] started re-modelling documentation to be less Z39.50- centric, GRS-centric and BIB1-centric added general chapter on architecture, describing different record models (TEXT, GRS, ALVIS XML/DOM), which is not finished at all. --- doc/Makefile.am | 80 ++++-- doc/administration.xml | 4 +- doc/architecture.xml | 730 ++++++++++++++++++++++++++++++++++++++++++++++++ doc/recordmodel.xml | 290 +++++++++---------- 4 files changed, 926 insertions(+), 178 deletions(-) create mode 100644 doc/architecture.xml diff --git a/doc/Makefile.am b/doc/Makefile.am index 2dc1c1e..a95728b 100644 --- a/doc/Makefile.am +++ b/doc/Makefile.am @@ -1,4 +1,4 @@ -## $Id: Makefile.am,v 1.29 2006-01-17 21:26:43 marc Exp $ +## $Id: Makefile.am,v 1.30 2006-01-18 14:00:54 marc Exp $ docdir=$(datadir)/doc/@PACKAGE@ SUPPORTFILES = \ @@ -9,32 +9,62 @@ SUPPORTFILES = \ xml.dcl XMLFILES = \ - zebra.xml.in \ - introduction.xml \ - installation.xml \ - quickstart.xml \ - examples.xml \ - administration.xml \ - zebraidx.xml \ - server.xml \ - recordmodel.xml \ - license.xml \ - indexdata.xml \ - zebraidx-options.xml \ - zebraidx-commands.xml \ - zebrasrv-options.xml \ - zebrasrv-synopsis.xml \ - zebrasrv-virtual.xml + zebra.xml.in \ + administration.xml \ + architecture.xml \ + examples.xml \ + indexdata.xml \ + installation.xml \ + introduction.xml \ + license.xml \ + marc_indexing.xml \ + quickstart.xml \ + recordmodel.xml \ + server.xml \ + zebra.xml \ + zebraidx-commands.xml \ + zebraidx-options.xml \ + zebraidx.xml \ + zebrasrv-options.xml \ + zebrasrv-synopsis.xml \ + zebrasrv-virtual.xml HTMLFILES = \ - administration.html apps.html configuration-file.html data-model.html \ - example1.html example2.html examples.html features.html file-ids.html \ - formats.html future.html generic-ids.html indexdata.html \ - installation.html installation.debian.html 
installation.win32.html \ - internal-representation.html introduction.html \ - license.html locating-records.html protocol-support.html quick-start.html \ - record-model.html register-location.html server.html shadow-registers.html \ - simple-indexing.html support.html zebra.html zebraidx.html + administration.html \ + apps.html \ + architecture.html \ + configuration-file.html \ + example1.html \ + example2.html \ + examples.html \ + features.html \ + file-ids.html \ + future.html \ + generic-ids.html \ + grs-exchange-formats.html \ + grs-internal-representation.html \ + grs-record-model.html \ + indexdata.html \ + installation.debian.html \ + installation.html \ + installation.win32.html \ + introduction.html \ + license.html \ + locating-records.html \ + maincomponents.html \ + protocol-support.html \ + quick-start.html \ + record-model.html \ + register-location.html \ + server.html \ + shadow-registers.html \ + simple-indexing.html \ + support.html \ + workflow.html \ + zebra.html \ + zebraidx.html + + PNGFILES=zebra.png EPSFILES=zebra.eps diff --git a/doc/administration.xml b/doc/administration.xml index b9e88fe..e5cdfcd 100644 --- a/doc/administration.xml +++ b/doc/administration.xml @@ -1,5 +1,5 @@ - + Administrating Zebra + Overview of Zebra Architecture + + + + Local Representation + + + As mentioned earlier, Zebra places few restrictions on the type of + data that you can index and manage. Generally, whatever the form of + the data, it is parsed by an input filter specific to that format, and + turned into an internal structure that Zebra knows how to handle. This + process takes place whenever the record is accessed - for indexing and + retrieval. + + + + The RecordType parameter in the zebra.cfg file, or + the -t option to the indexer tells Zebra how to + process input records. + Two basic types of processing are available - raw text and structured + data. Raw text is just that, and it is selected by providing the + argument text to Zebra. 
Structured records are + all handled internally using the basic mechanisms described in the + subsequent sections. + Zebra can read structured records in many different formats. + + + + + + Indexing and Retrieval Workflow + + + Records pass through three different states during processing in the + system. + + + + + + + + + When records are accessed by the system, they are represented + in their local, or native format. This might be SGML or HTML files, + News or Mail archives, or MARC records. If the system doesn't already + know how to read the type of data you need to store, you can set up an + input filter by preparing conversion rules based on regular + expressions and possibly augmented by a flexible scripting language + (Tcl). + The input filter produces as output an internal representation, + a tree structure. + + + + + + + When records are processed by the system, they are represented + in a tree structure, constructed by tagged data elements hanging off a + root node. The tagged elements may contain data or yet more tagged + elements in a recursive structure. The system performs various + actions on this tree structure (indexing, element selection, schema + mapping, etc.). + + + + + + + Before transmitting records to the client, they are first + converted from the internal structure to a form suitable for exchange + over the network - according to the Z39.50 standard. + + + + + + + + + + + Main Components + + The Zebra system is designed to support a wide range of data management + applications. The system can be configured to handle virtually any + kind of structured data. Each record in the system is associated with + a record schema which lends context to the data + elements of the record. + Any number of record schemas can coexist in the system. + Although it may be wise to use only a single schema within + one database, the system poses no such restrictions.
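The three record states in the workflow described above (native format, internal tree, exchange format) can be sketched in a few lines. This is an illustration only, under the assumption that an ElementTree stands in for Zebra's internal node structure; the real input filters are loadable C modules, not Python.

```python
# Illustrative only: models the native -> internal tree -> exchange
# workflow described in the text. Zebra's real internal representation
# is not an ElementTree; this just shows the shape of the pipeline.
import xml.etree.ElementTree as ET

# 1. A record in its local, native format (here: a tiny XML record).
native = "<gils><title>Zebra and the Art of Indexing</title></gils>"

# 2. The "input filter" step: parse into a tree of tagged elements.
tree = ET.fromstring(native)

# Indexing-style processing: walk the tagged elements of the tree.
terms = [(elem.tag, (elem.text or "").strip()) for elem in tree.iter()]

# 3. The exchange step: serialize the tree for transmission.
exchange = ET.tostring(tree, encoding="unicode")

print(terms)      # [('gils', ''), ('title', 'Zebra and the Art of Indexing')]
print(exchange)
```

The same shape holds regardless of the concrete filter: only step 1 changes when a different record format is read.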
+ + + The Zebra indexing and retrieval software consists of the + following main applications: the zebraidx + indexing maintenance utility, and the zebrasrv + information query and retrieval server. Both use some of the + same main components, which are presented here. + + + This virtual package installs all the necessary packages to start + working with IDZebra - including utility programs, development libraries, + documentation and modules. + idzebra1.4 + + + + Core Zebra Module Containing Common Functionality + + - loads external filter modules used for presenting + the records in a search response. + - executes search requests in PQF/RPN, which are handed over from + the YAZ server frontend API + - calls resorting/reranking algorithms on the hit sets + - returns - possibly ranked - result sets, hit + numbers, and similar internal data to the YAZ server backend API. + + + This package contains all run-time libraries for IDZebra. + libidzebra1.4 + This package includes documentation for IDZebra in PDF and HTML. + idzebra1.4-doc + This package includes common essential IDZebra configuration files. + idzebra1.4-common + + + + + + Zebra Indexer + + the core Zebra indexer which + - loads external filter modules used for indexing data records of + different types. + - creates, updates and drops databases and indexes + + + This package contains IDZebra utilities such as the zebraidx indexer + utility and the zebrasrv server. + idzebra1.4-utils + + + + + Zebra Searcher/Retriever + + the core Zebra searcher/retriever which + + + This package contains IDZebra utilities such as the zebraidx indexer + utility and the zebrasrv server, and their associated man pages. + idzebra1.4-utils + + + + + YAZ Server Frontend + + The YAZ server frontend is + a full-fledged stateful Z39.50 server taking client + connections, and forwarding search and scan requests to the + Zebra core indexer.
+ + + In addition to Z39.50 requests, the YAZ server frontend acts + as an HTTP server, honouring + SRW SOAP requests, and SRU REST requests. Moreover, it can + translate incoming CQL queries to PQF/RPN queries, if + correctly configured. + + + YAZ is a toolkit that allows you to develop software using the + ANSI Z39.50/ISO23950 standard for information retrieval. + SRW/SRU + libyazthread.so + libyaz.so + libyaz + + + + + Record Models and Filter Modules + + all filter modules which do indexing and record display filtering: +This virtual package contains all base IDZebra filter modules. EMPTY ??? + libidzebra1.4-modules + + + + TEXT Record Model and Filter Module + + Plain ASCII text filter + + + + + + GRS Record Model and Filter Modules + + Chapter + + - grs.danbib GRS filters of various kind (*.abs files) +IDZebra filter grs.danbib (DBC DanBib records) + This package includes the grs.danbib filter which parses DanBib records. + DanBib is the Danish Union Catalogue hosted by DBC + (Danish Bibliographic Centre). + libidzebra1.4-mod-grs-danbib + + + - grs.marc + - grs.marcxml + This package includes the grs.marc and grs.marcxml filters that allow + IDZebra to read MARC records based on ISO2709. + + libidzebra1.4-mod-grs-marc + + - grs.regx + - grs.tcl GRS TCL scriptable filter + This package includes the grs.regx and grs.tcl filters. + libidzebra1.4-mod-grs-regx + + + - grs.sgml + libidzebra1.4-mod-grs-sgml not packaged yet ?? + + - grs.xml + This package includes the grs.xml filter which uses Expat to + parse records in XML and turn them into IDZebra's internal grs node.
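As a concrete illustration, a zebra.cfg fragment selecting one of the filter modules above might look as follows. This is a sketch only: the recordType setting is the directive discussed in the text, but the profile path and the particular filter chosen here are examples, not recommendations.

```
# zebra.cfg - illustrative fragment (path and filter choice are examples)
profilePath: /usr/share/idzebra/tab
recordType: grs.xml     # Expat-based XML filter (libidzebra1.4-mod-grs-xml)
# recordType: text      # alternative: unstructured plain-text indexing
```

The same choice can also be made per invocation with the indexer's -t option, as in "zebraidx -t grs.xml update ...".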
+ libidzebra1.4-mod-grs-xml + + + + + ALVIS Record Model and Filter Module + + - alvis Experimental Alvis XSLT filter + mod-alvis.so + libidzebra1.4-mod-alvis + + + + + SAFARI Record Model and Filter Module + + - safari + + + + + + + + Configuration Files + + - yazserver XML based config file + - core Zebra ascii based config files + - filter module config files in many flavours + - CQL to PQF ascii based config file + + + + + + + + + + diff --git a/doc/recordmodel.xml b/doc/recordmodel.xml index 1832d7f..fd52ee0 100644 --- a/doc/recordmodel.xml +++ b/doc/recordmodel.xml @@ -1,105 +1,20 @@ - - The Record Model + + GRS Record Model and Filter Modules - - The Zebra system is designed to support a wide range of data management - applications. The system can be configured to handle virtually any - kind of structured data. Each record in the system is associated with - a record schema which lends context to the data - elements of the record. - Any number of record schemas can coexist in the system. - Although it may be wise to use only a single schema within - one database, the system poses no such restrictions. - The record model described in this chapter applies to the fundamental, structured record type grs, introduced in - . - - - - - Records pass through three different states during processing in the - system. - - - - - - - - - When records are accessed by the system, they are represented - in their local, or native format. This might be SGML or HTML files, - News or Mail archives, MARC records. If the system doesn't already - know how to read the type of data you need to store, you can set up an - input filter by preparing conversion rules based on regular - expressions and possibly augmented by a flexible scripting language - (Tcl). - The input filter produces as output an internal representation, - a tree structure. 
- - - - - - - When records are processed by the system, they are represented - in a tree-structure, constructed by tagged data elements hanging off a - root node. The tagged elements may contain data or yet more tagged - elements in a recursive structure. The system performs various - actions on this tree structure (indexing, element selection, schema - mapping, etc.), - - - - - - - Before transmitting records to the client, they are first - converted from the internal structure to a form suitable for exchange - over the network - according to the Z39.50 standard. - - - - - + . - - Local Representation - - - As mentioned earlier, Zebra places few restrictions on the type of - data that you can index and manage. Generally, whatever the form of - the data, it is parsed by an input filter specific to that format, and - turned into an internal structure that Zebra knows how to handle. This - process takes place whenever the record is accessed - for indexing and - retrieval. - - - - The RecordType parameter in the zebra.cfg file, or - the -t option to the indexer tells Zebra how to - process input records. - Two basic types of processing are available - raw text and structured - data. Raw text is just that, and it is selected by providing the - argument text to Zebra. Structured records are - all handled internally using the basic mechanisms described in the - subsequent sections. - Zebra can read structured records in many different formats. - How this is done is governed by additional parameters after the - "grs" keyword, separated by "." characters. - + + GRS Record Filters - Four basic subtypes to the grs type are + Many basic subtypes of the grs type are currently available: @@ -109,38 +24,62 @@ grs.sgml - This is the canonical input format — - described below. It is a simple SGML-like syntax. + This is the canonical input format + described . It is using + simple SGML-like syntax. + + - grs.regx.filter + grs.marc - This enables a user-supplied input - filter. 
The mechanisms of these filters are described below. + This allows Zebra to read + records in the ISO2709 (MARC) encoding standard. + + + The loadable grs.marc filter module + is packaged in the GNU/Debian package + libidzebra1.4-mod-grs-marc + - grs.tcl.filter + grs.marcxml - Similar to grs.regx but using Tcl for rules. + This allows Zebra to read + records in MARCXML, an XML encoding of MARC records. + + The loadable grs.marcxml filter module + is also contained in the GNU/Debian package + libidzebra1.4-mod-grs-marc + - grs.marc.abstract syntax + grs.danbib - This allows Zebra to read - records in the ISO2709 (MARC) encoding standard. In this case, the - last parameter abstract syntax names the - .abs file (see below) - which describes the specific MARC structure of the input record as - well as the indexing rules. + The grs.danbib filter parses DanBib + records, a Danish MARC record variant called DANMARC. + DanBib is the Danish Union Catalogue hosted by the + Danish Bibliographic Centre (DBC). + + The loadable grs.danbib filter module + is packaged in the GNU/Debian package + libidzebra1.4-mod-grs-danbib. @@ -148,18 +87,55 @@ grs.xml - This filter reads XML records. Only one record per file + This filter reads XML records and uses Expat to + parse them and convert them into IDZebra's internal + grs record model. + Only one record per file is supported. The filter is only available if Zebra/YAZ is compiled with EXPAT support. + + The loadable grs.xml filter module + is packaged in the GNU/Debian package + libidzebra1.4-mod-grs-xml + + + + + grs.regx + + + This enables a user-supplied regular expression input + filter described in + . + + + The loadable grs.regx filter module + is packaged in the GNU/Debian package + libidzebra1.4-mod-grs-regx + + + + + grs.tcl + + + Similar to grs.regx but using Tcl for rules, described in + .
+ + + The loadable grs.tcl filter module + is also packaged in the GNU/Debian package + libidzebra1.4-mod-grs-regx + - - Canonical Input Format + + GRS Canonical Input Format Although input data can take any form, it is sometimes useful to @@ -239,7 +215,7 @@ makes up the total record. In the canonical input format, the root tag should contain the name of the schema that lends context to the elements of the record - (see ). + (see ). The following is a GILS record that contains only a single element (strictly speaking, that makes it an illegal GILS record, since the GILS profile includes several mandatory @@ -359,8 +335,8 @@ - - Input Filters + + GRS REGX And TCL Input Filters In order to handle general input formats, Zebra allows the @@ -606,8 +582,8 @@ - - Internal Representation + + GRS Internal Record Representation When records are manipulated by the system, they're represented in a @@ -730,12 +706,13 @@ - - Configuring Your Data Model + + GRS Record Model Configuration The following sections describe the configuration files that govern - the internal management of data records. The system searches for the files + the internal management of grs records. + The system searches for the files in the directories specified by the profilePath setting in the zebra.cfg file. @@ -1959,16 +1936,18 @@ target - This directive introduces a - mapping between each of the members of the value-set on the left to - the character on the right. The character on the right must occur in - the value set (the lowercase directive) of - the character set, but - it may be a paranthesis-enclosed multi-octet character. This directive - may be used to map diacritics to their base characters, or to map - HTML-style character-representations to their natural form, etc. The map directive - can also be used to ignore leading articles in searching and/or sorting, and to perform - other special transformations. See section . 
+ This directive introduces a mapping between each of the + members of the value-set on the left to the character on the + right. The character on the right must occur in the value + set (the lowercase directive) of the + character set, but it may be a parenthesis-enclosed + multi-octet character. This directive may be used to map + diacritics to their base characters, or to map HTML-style + character-representations to their natural form, etc. The + map directive can also be used to ignore leading articles in + searching and/or sorting, and to perform other special + transformations. See section . @@ -1977,47 +1956,55 @@ Ignoring leading articles - In addition to specifying sort orders, space (blank) handling, and upper/lowercase folding, - you can also use the character map files to make Zebra ignore leading articles in sorting - records, or when doing complete field searching. + In addition to specifying sort orders, space (blank) handling, + and upper/lowercase folding, you can also use the character map + files to make Zebra ignore leading articles in sorting records, + or when doing complete field searching. - This is done using the map directive in the character map file. In a - nutshell, what you do is map certain sequences of characters, when they occur - in the beginning of a field, to a space. Assuming that the character "@" is - defined as a space character in your file, you can do: + This is done using the map directive in the + character map file. In a nutshell, what you do is map certain + sequences of characters, when they occur in the + beginning of a field, to a space. Assuming that the + character "@" is defined as a space character in your file, you + can do: map (^The\s) @ map (^the\s) @ - The effect of these directives is to map either 'the' or 'The', followed by a space - character, to a space.
The hat ^ character denotes beginning-of-field only when - complete-subfield indexing or sort indexing is taking place; otherwise, it is treated just + The effect of these directives is to map either 'the' or 'The', + followed by a space character, to a space. The hat ^ character + denotes beginning-of-field only when complete-subfield indexing + or sort indexing is taking place; otherwise, it is treated just as any other character. - Because the default.idx file can be used to associate different - character maps with different indexing types -- and you can create additional indexing - types, should the need arise -- it is possible to specify that leading articles should be - ignored either in sorting, in complete-field searching, or both. + Because the default.idx file can be used to + associate different character maps with different indexing types + -- and you can create additional indexing types, should the need + arise -- it is possible to specify that leading articles should + be ignored either in sorting, in complete-field searching, or + both. - If you ignore certain prefixes in sorting, then these will be eliminated from the index, - and sorting will take place as if they weren't there. However, if you set the system up - to ignore certain prefixes in searching, then these are deleted both - from the indexes and from query terms, when the client specifies complete-field - searching. This has the effect that a search for 'the science journal' and 'science - journal' would both produce the same results. + If you ignore certain prefixes in sorting, then these will be + eliminated from the index, and sorting will take place as if + they weren't there. However, if you set the system up to ignore + certain prefixes in searching, then these + are deleted both from the indexes and from query terms, when the + client specifies complete-field searching. 
This has the effect + that a search for 'the science journal' and 'science journal' + would both produce the same results. - - Exchange Formats + + GRS Exchange Formats - Converting records from the internal structure to en exchange format + Converting records from the internal structure to an exchange format is largely an automatic process. Currently, the following exchange formats are supported: @@ -2085,6 +2072,8 @@ + + - + --> -- 1.7.10.4
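The effect of the two map directives discussed above on complete-field search keys can be sketched as follows. This is an illustration only: Zebra applies the real mapping inside its indexing code, and the rules here merely mimic map (^The\s) @ and map (^the\s) @ with "@" defined as a space.

```python
# Mimics the charmap rules  map (^The\s) @  and  map (^the\s) @
# on complete-field keys; illustration only, not Zebra's own code.
import re

ARTICLE_RULES = [re.compile(r"^The\s"), re.compile(r"^the\s")]

def fold_key(field: str) -> str:
    """Map a leading article to a space, then normalize the key."""
    for rule in ARTICLE_RULES:
        field = rule.sub(" ", field, count=1)
    return field.strip().lower()

# 'the science journal' and 'science journal' produce the same key,
# so a complete-field search for either finds the same records.
assert fold_key("the science journal") == fold_key("science journal")
```

Note that the anchoring matters: a word like "Theater" is untouched, because the rule requires whitespace after the article, and the hat only denotes beginning-of-field, just as the text describes.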