--- /dev/null
+ <chapter id="architecture">
+ <!-- $Id: architecture.xml,v 1.1 2006-01-18 14:00:54 marc Exp $ -->
+ <title>Overview of Zebra Architecture</title>
+
+
+ <sect1 id="local-representation">
+ <title>Local Representation</title>
+
+ <para>
+ As mentioned earlier, Zebra places few restrictions on the type of
+ data that you can index and manage. Generally, whatever the form of
+ the data, it is parsed by an input filter specific to that format, and
+ turned into an internal structure that Zebra knows how to handle. This
+ process takes place whenever the record is accessed - for indexing and
+ retrieval.
+ </para>
+
+ <para>
+     The <literal>RecordType</literal> parameter in the <literal>zebra.cfg</literal> file, or
+     the <literal>-t</literal> option to the indexer, tells Zebra how to
+     process input records.
+ Two basic types of processing are available - raw text and structured
+ data. Raw text is just that, and it is selected by providing the
+ argument <emphasis>text</emphasis> to Zebra. Structured records are
+ all handled internally using the basic mechanisms described in the
+ subsequent sections.
+ Zebra can read structured records in many different formats.
+ <!--
+ How this is done is governed by additional parameters after the
+ "grs" keyword, separated by "." characters.
+ -->
+ </para>
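+    <para>
+     For example, a minimal <literal>zebra.cfg</literal> might select a
+     record type like this (the profile path and record directory shown
+     here are illustrative):
+     <screen>
+      # zebra.cfg - record type selection (sketch)
+      recordType: grs.xml
+      profilePath: /usr/share/idzebra/tab
+     </screen>
+     The same choice can be made on the command line when indexing, e.g.
+     <literal>zebraidx -t text update records/</literal>.
+    </para>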
+ </sect1>
+
+ <sect1 id="workflow">
+ <title>Indexing and Retrieval Workflow</title>
+
+ <para>
+ Records pass through three different states during processing in the
+ system.
+ </para>
+
+ <para>
+
+ <itemizedlist>
+ <listitem>
+
+ <para>
+ When records are accessed by the system, they are represented
+ in their local, or native format. This might be SGML or HTML files,
+ News or Mail archives, MARC records. If the system doesn't already
+ know how to read the type of data you need to store, you can set up an
+ input filter by preparing conversion rules based on regular
+ expressions and possibly augmented by a flexible scripting language
+ (Tcl).
+ The input filter produces as output an internal representation,
+ a tree structure.
+
+ </para>
+ </listitem>
+ <listitem>
+
+ <para>
+ When records are processed by the system, they are represented
+ in a tree-structure, constructed by tagged data elements hanging off a
+ root node. The tagged elements may contain data or yet more tagged
+ elements in a recursive structure. The system performs various
+ actions on this tree structure (indexing, element selection, schema
+ mapping, etc.),
+
+ </para>
+ </listitem>
+ <listitem>
+
+ <para>
+ Before transmitting records to the client, they are first
+ converted from the internal structure to a form suitable for exchange
+ over the network - according to the Z39.50 standard.
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ </para>
+ </sect1>
+
+
+ <sect1 id="maincomponents">
+ <title>Main Components</title>
+ <para>
+ The Zebra system is designed to support a wide range of data management
+ applications. The system can be configured to handle virtually any
+ kind of structured data. Each record in the system is associated with
+ a <emphasis>record schema</emphasis> which lends context to the data
+ elements of the record.
+ Any number of record schemas can coexist in the system.
+ Although it may be wise to use only a single schema within
+ one database, the system poses no such restrictions.
+ </para>
+ <para>
+    The Zebra indexer and information retrieval server consist of the
+    following main applications: the <literal>zebraidx</literal>
+    indexing maintenance utility, and the <literal>zebrasrv</literal>
+    information query and retrieval server. Both share some of the
+    same main components, which are presented here.
+ </para>
+ <para>
+ This virtual package installs all the necessary packages to start
+ working with IDZebra - including utility programs, development libraries,
+ documentation and modules.
+ <literal>idzebra1.4</literal>
+ </para>
+
+ <sect2 id="componentcore">
+ <title>Core Zebra Module Containing Common Functionality</title>
+    <para>
+     The core Zebra module:
+     <itemizedlist>
+      <listitem><para>
+       loads external filter modules used for presenting
+       the records in a search response;
+      </para></listitem>
+      <listitem><para>
+       executes search requests in PQF/RPN, which are handed over from
+       the YAZ server frontend API;
+      </para></listitem>
+      <listitem><para>
+       calls resorting/reranking algorithms on the hit sets;
+      </para></listitem>
+      <listitem><para>
+       returns - possibly ranked - result sets, hit
+       counts, and similar internal data to the YAZ server backend API.
+      </para></listitem>
+     </itemizedlist>
+    </para>
+    <para>
+     The GNU/Debian package <literal>libidzebra1.4</literal>
+     contains all run-time libraries for IDZebra,
+     the package <literal>idzebra1.4-doc</literal>
+     includes documentation for IDZebra in PDF and HTML, and
+     the package <literal>idzebra1.4-common</literal>
+     includes the common essential IDZebra configuration files.
+    </para>
+ </sect2>
+
+
+ <sect2 id="componentindexer">
+ <title>Zebra Indexer</title>
+    <para>
+     The core Zebra indexer:
+     <itemizedlist>
+      <listitem><para>
+       loads external filter modules used for indexing data records of
+       different types;
+      </para></listitem>
+      <listitem><para>
+       creates, updates and drops databases and indexes.
+      </para></listitem>
+     </itemizedlist>
+    </para>
+    <para>
+     The GNU/Debian package <literal>idzebra1.4-utils</literal>
+     contains IDZebra utilities such as the <literal>zebraidx</literal>
+     indexer utility and the <literal>zebrasrv</literal> server.
+    </para>
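+    <para>
+     A typical indexing run might look like this (the configuration
+     file name and record directory are illustrative):
+     <screen>
+      zebraidx -c zebra.cfg update records/
+      zebraidx -c zebra.cfg commit
+     </screen>
+     The <literal>commit</literal> step is only needed if the shadow
+     register is enabled in <literal>zebra.cfg</literal>.
+    </para>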
+ </sect2>
+
+ <sect2 id="componentsearcher">
+ <title>Zebra Searcher/Retriever</title>
+    <para>
+     The core Zebra searcher/retriever executes searches on, and
+     retrieves records from, the indexed databases.
+    </para>
+    <para>
+     It is invoked through the <literal>zebrasrv</literal> server, which
+     is contained - together with the <literal>zebraidx</literal> utility
+     and the associated man pages - in the GNU/Debian package
+     <literal>idzebra1.4-utils</literal>.
+    </para>
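+    <para>
+     The server can be started on a given listening port like this
+     (the port number and configuration file name are illustrative):
+     <screen>
+      zebrasrv -c zebra.cfg @:9999
+     </screen>
+    </para>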
+ </sect2>
+
+ <sect2 id="componentyazserver">
+ <title>YAZ Server Frontend</title>
+ <para>
+ The YAZ server frontend is
+ a full fledged stateful Z39.50 server taking client
+ connections, and forwarding search and scan requests to the
+ Zebra core indexer.
+ </para>
+    <para>
+     In addition to Z39.50 requests, the YAZ server frontend acts
+     as an HTTP server, honouring
+     SRW SOAP requests and SRU REST requests. Moreover, it can
+     translate incoming CQL queries to PQF/RPN queries, if
+     correctly configured.
+    </para>
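+    <para>
+     A simple Z39.50 session against a running server might look like
+     this (host, port and index name are illustrative):
+     <screen>
+      yaz-client localhost:9999
+      Z> format xml
+      Z> find text=zebra
+      Z> show 1
+     </screen>
+    </para>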
+    <para>
+     YAZ is a toolkit that allows you to develop software using the
+     ANSI Z39.50/ISO23950 standard for information retrieval, as well
+     as SRW/SRU. The relevant run-time libraries are
+     <literal>libyazthread.so</literal> and
+     <literal>libyaz.so</literal>, contained in the
+     <literal>libyaz</literal> package.
+    </para>
+ </sect2>
+
+ <sect2 id="componentmodules">
+ <title>Record Models and Filter Modules</title>
+    <para>
+     The filter modules do the actual indexing and record display
+     filtering. The virtual GNU/Debian package
+     <literal>libidzebra1.4-modules</literal> contains all base
+     IDZebra filter modules.
+    </para>
+
+ <sect3 id="componentmodulestext">
+ <title>TEXT Record Model and Filter Module</title>
+     <para>
+      The <emphasis>text</emphasis> filter indexes plain ASCII text.
+      <!--
+      <literal>text module missing as deb file</literal>
+      -->
+     </para>
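+     <para>
+      It is selected in <literal>zebra.cfg</literal> like this:
+      <screen>
+       recordType: text
+      </screen>
+     </para>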
+ </sect3>
+
+ <sect3 id="componentmodulesgrs">
+ <title>GRS Record Model and Filter Modules</title>
+     <para>
+      See <xref linkend="record-model"/> for a full description of the
+      GRS record model and its filters.
+     </para>
+     <para>
+      <itemizedlist>
+       <listitem><para>
+        The <literal>grs.danbib</literal> filter parses DanBib records.
+        DanBib is the Danish Union Catalogue hosted by DBC
+        (Danish Bibliographic Centre). It is contained in the GNU/Debian
+        package <literal>libidzebra1.4-mod-grs-danbib</literal>.
+       </para></listitem>
+       <listitem><para>
+        The <literal>grs.marc</literal> and <literal>grs.marcxml</literal>
+        filters allow IDZebra to read MARC records based on ISO2709.
+        They are contained in the package
+        <literal>libidzebra1.4-mod-grs-marc</literal>.
+       </para></listitem>
+       <listitem><para>
+        The <literal>grs.regx</literal> and <literal>grs.tcl</literal>
+        filters provide regular-expression based and Tcl scriptable
+        filtering. They are contained in the package
+        <literal>libidzebra1.4-mod-grs-regx</literal>.
+       </para></listitem>
+       <listitem><para>
+        The <literal>grs.sgml</literal> filter reads the canonical
+        SGML-like input format.
+        <!-- libidzebra1.4-mod-grs-sgml not packaged yet ?? -->
+       </para></listitem>
+       <listitem><para>
+        The <literal>grs.xml</literal> filter uses Expat to parse
+        records in XML and turn them into IDZebra's internal GRS nodes.
+        It is contained in the package
+        <literal>libidzebra1.4-mod-grs-xml</literal>.
+       </para></listitem>
+      </itemizedlist>
+     </para>
+ </sect3>
+
+ <sect3 id="componentmodulesalvis">
+ <title>ALVIS Record Model and Filter Module</title>
+     <para>
+      The experimental <literal>alvis</literal> XSLT filter is provided
+      by the loadable module <literal>mod-alvis.so</literal>, contained
+      in the GNU/Debian package <literal>libidzebra1.4-mod-alvis</literal>.
+     </para>
+ </sect3>
+
+ <sect3 id="componentmodulessafari">
+ <title>SAFARI Record Model and Filter Module</title>
+     <para>
+      The <literal>safari</literal> record filter.
+      <!--
+      <literal>safari module missing as deb file</literal>
+      -->
+     </para>
+ </sect3>
+
+ </sect2>
+
+ <sect2 id="componentconfig">
+ <title>Configuration Files</title>
+    <para>
+     <itemizedlist>
+      <listitem><para>the XML based YAZ server config file;</para></listitem>
+      <listitem><para>the ASCII based core Zebra config files;</para></listitem>
+      <listitem><para>filter module config files in many flavours;</para></listitem>
+      <listitem><para>the ASCII based CQL-to-PQF config file.</para></listitem>
+     </itemizedlist>
+    </para>
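+    <para>
+     For example, the CQL-to-PQF configuration maps CQL indexes to
+     Type-1 query attributes with lines like these (the second index
+     name is illustrative):
+     <screen>
+      index.cql.all = 1=text
+      index.dc.title = 1=title
+     </screen>
+    </para>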
+ </sect2>
+
+ </sect1>
+
+ <!--
+
+
+ <sect1 id="cqltopqf">
+ <title>Server Side CQL To PQF Conversion</title>
+ <para>
+ The cql2pqf.txt yaz-client config file, which is also used in the
+ yaz-server CQL-to-PQF process, is used to to drive
+ org.z3950.zing.cql.CQLNode's toPQF() back-end and the YAZ CQL-to-PQF
+ converter. This specifies the interpretation of various CQL
+ indexes, relations, etc. in terms of Type-1 query attributes.
+
+ This configuration file generates queries using BIB-1 attributes.
+ See http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html
+ for the Maintenance Agency's work-in-progress mapping of Dublin Core
+ indexes to Attribute Architecture (util, XD and BIB-2)
+ attributes.
+
+ a) CQL set prefixes are specified using the correct CQL/SRW/U
+ prefixes for the required index sets, or user-invented prefixes for
+ special index sets. An index set in CQL is roughly speaking equivalent to a
+ namespace specifier in XML.
+
+ b) The default index set to be used if none explicitely mentioned
+
+ c) Index mapping definitions of the form
+
+ index.cql.all = 1=text
+
+ which means that the index "all" from the set "cql" is mapped on the
+ bib-1 RPN query "@attr 1=text" (where "text" is some existing index
+ in zebra, see indexing stylesheet)
+
+ d) Relation mapping from CQL relations to bib-1 RPN "@attr 2= " stuff
+
+ e) Relation modifier mapping from CQL relations to bib-1 RPN "@attr
+ 2= " stuff
+
+ f) Position attributes
+
+ g) structure attributes
+
+ h) truncation attributes
+
+ See
+ http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map for config
+ file details.
+
+
+ </para>
+ </sect1>
+
+
+ <sect1 id="ranking">
+ <title>Static and Dynamic Ranking</title>
+ <para>
+ Zebra uses internally inverted indexes to look up term occurencies
+ in documents. Multiple queries from different indexes can be
+ combined by the binary boolean operations AND, OR and/or NOT (which
+ is in fact a binary AND NOT operation). To ensure fast query execution
+ speed, all indexes have to be sorted in the same order.
+
+ The indexes are normally sorted according to document ID in
+ ascending order, and any query which does not invoke a special
+ re-ranking function will therefore retrieve the result set in document ID
+ order.
+
+ If one defines the
+
+ staticrank: 1
+
+ directive in the main core Zebra config file, the internal document
+ keys used for ordering are augmented by a preceeding integer, which
+ contains the static rank of a given document, and the index lists
+ are ordered
+ - first by ascending static rank
+ - then by ascending document ID.
+
+ This implies that the default rank "0" is the best rank at the
+ beginning of the list, and "max int" is the worst static rank.
+
+ The "alvis" and the experimental "xslt" filters are providing a
+ directive to fetch static rank information out of the indexed XML
+ records, thus making _all_ hit sets orderd after ascending static
+ rank, and for those doc's which have the same static rank, ordered
+ after ascending doc ID.
+ If one wants to do a little fiddeling with the static rank order,
+ one has to invoke additional re-ranking/re-ordering using dynamic
+ reranking or score functions. These functions return positive
+ interger scores, where _highest_ score is best, which means that the
+ hit sets will be sorted according to _decending_ scores (in contrary
+ to the index lists which are sorted according to _ascending_ rank
+ number and document ID)
+
+
+ Those are defined in the zebra C source files
+
+ "rank-1" : zebra/index/rank1.c
+ default TF/IDF like zebra dynamic ranking
+ "rank-static" : zebra/index/rankstatic.c
+ do-nothing dummy static ranking (this is just to prove
+ that the static rank can be used in dynamic ranking functions)
+ "zvrank" : zebra/index/zvrank.c
+ many different dynamic TF/IDF ranking functions
+
+ The are in the zebra config file enabled by a directive like:
+
+ rank: rank-static
+
+ Notice that the "rank-1" and "zvrank" do not use the static rank
+ information in the list keys, and will produce the same ordering
+ with our without static ranking enabled.
+
+ The dummy "rank-static" reranking/scoring function returns just
+ score = max int - staticrank
+ in order to preserve the ordering of hit sets with and without it's
+ call.
+
+ Obviously, one wants to make a new ranking function, which combines
+ static and dynamic ranking, which is left as an exercise for the
+ reader .. (Wray, this is your's ...)
+
+
+ </para>
+
+
+ <para>
+ yazserver frontend config file
+
+ db/yazserver.xml
+
+ Setup of listening ports, and virtual zebra servers.
+ Note path to server-side CQL-to-PQF config file, and to
+ SRW explain config section.
+
+ The <directory> path is relative to the directory where zebra.init is placed
+ and is started up. The other pathes are relative to <directory>,
+ which in this case is the same.
+
+ see: http://www.indexdata.com/yaz/doc/server.vhosts.tkl
+
+ </para>
+
+ <para>
+c) Main "alvis" XSLT filter config file:
+ cat db/filter_alvis_conf.xml
+
+ <?xml version="1.0" encoding="UTF8"?>
+ <schemaInfo>
+ <schema name="alvis" stylesheet="db/alvis2alvis.xsl" />
+ <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
+ stylesheet="db/alvis2index.xsl" />
+ <schema name="dc" stylesheet="db/alvis2dc.xsl" />
+ <schema name="dc-short" stylesheet="db/alvis2dc_short.xsl" />
+ <schema name="snippet" snippet="25" stylesheet="db/alvis2snippet.xsl" />
+ <schema name="help" stylesheet="db/alvis2help.xsl" />
+ <split level="1"/>
+ </schemaInfo>
+
+ the pathes are relative to the directory where zebra.init is placed
+ and is started up.
+
+ The split level decides where the SAX parser shall split the
+ collections of records into individual records, which then are
+ loaded into DOM, and have the indexing XSLT stylesheet applied.
+
+ The indexing stylesheet is found by it's identifier.
+
+ All the other stylesheets are for presentation after search.
+
+- in data/ a short sample of harvested carnivorous plants
+ ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
+
+- in root also one single data record - nice for testing the xslt
+ stylesheets,
+
+ xsltproc db/alvis2index.xsl carni*.xml
+
+ and so on.
+
+- in db/ a cql2pqf.txt yaz-client config file
+ which is also used in the yaz-server CQL-to-PQF process
+
+ see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
+
+- in db/ an indexing XSLT stylesheet. This is a PULL-type XSLT thing,
+ as it constructs the new XML structure by pulling data out of the
+ respective elements/attributes of the old structure.
+
+ Notice the special zebra namespace, and the special elements in this
+ namespace which indicate to the zebra indexer what to do.
+
+ <z:record id="67ht7" rank="675" type="update">
+ indicates that a new record with given id and static rank has to be updated.
+
+ <z:index name="title" type="w">
+ encloses all the text/XML which shall be indexed in the index named
+ "title" and of index type "w" (see file default.idx in your zebra
+ installation)
+
+
+ </para>
+
+ <para>
+ Z39.50 searching:
+
+ search like this (using client-side CQL-to-PQF conversion):
+
+ yaz-client -q db/cql2pqf.txt localhost:9999
+ > format xml
+ > querytype cql2rpn
+ > f text=(plant and soil)
+ > s 1
+ > elements dc
+ > s 1
+ > elements index
+ > s 1
+ > elements alvis
+ > s 1
+ > elements snippet
+ > s 1
+
+
+ search like this (using server-side CQL-to-PQF conversion):
+ (the only difference is "querytype cql" instead of
+ "querytype cql2rpn" and the call without specifying a local
+ conversion file)
+
+ yaz-client localhost:9999
+ > format xml
+ > querytype cql
+ > f text=(plant and soil)
+ > s 1
+ > elements dc
+ > s 1
+ > elements index
+ > s 1
+ > elements alvis
+ > s 1
+ > elements snippet
+ > s 1
+
+ NEW: static relevance ranking - see examples in alvis2index.xsl
+
+ > f text = /relevant (plant and soil)
+ > elem dc
+ > s 1
+
+ > f title = /relevant a
+ > elem dc
+ > s 1
+
+
+
+SRW/U searching
+ Surf into http://localhost:9999
+
+ firefox http://localhost:9999
+
+ gives you an explain record. Unfortunately, the data found in the
+ CQL-to-PQF text file must be added by hand-craft into the explain
+ section of the yazserver.xml file. Too bad, but this is all extreme
+ new alpha stuff, and a lot of work has yet to be done ..
+
+ Searching via SRU: surf into the URL (lines broken here - concat on
+ URL line)
+
+ - see number of hits:
+ http://localhost:9999/?version=1.1&operation=searchRetrieve
+ &query=text=(plant%20and%20soil)
+
+
+ - fetch record 5-7 in DC format
+ http://localhost:9999/?version=1.1&operation=searchRetrieve
+ &query=text=(plant%20and%20soil)
+ &startRecord=5&maximumRecords=2&recordSchema=dc
+
+
+ - even search using PQF queries using the extended verb "x-pquery",
+ which is special to YAZ/Zebra
+
+ http://localhost:9999/?version=1.1&operation=searchRetrieve
+ &x-pquery=@attr%201=text%20@and%20plant%20soil
+
+ More info: read the fine manuals at http://www.loc.gov/z3950/agency/zing/srw/
+ Search via SRW:
+ read the fine manual at
+ http://www.loc.gov/z3950/agency/zing/srw/
+
+
+and so on. The list of available indexes is found in db/cql2pqf.txt
+
+
+7) How do you add to the index attributes of any other type than "w"?
+I mean, in the context of making CQL queries. Let's say I want a date
+attribute in there, so that one could do date > 20050101 in CQL.
+
+Currently for example 'date-modified' is of type 'w'.
+
+The 2-seconds-of-though solution:
+
+ in alvis2index.sl:
+
+ <z:index name="date-modified" type="d">
+ <xsl:value-of
+ select="acquisition/acquisitionData/modifiedDate"/>
+ </z:index>
+
+But here's the catch...doesn't the use of the 'd' type require
+structure type 'date' (@attr 4=5) in PQF? But then...how does that
+reflect in the CQL->RPN/PQF mapping - does it really work if I just
+change the type of an element in alvis2index.sl? I would think not...?
+
+
+
+
+ Kimmo
+
+
+Either do:
+
+ f @attr 4=5 @attr 1=date-modified 20050713
+
+or do
+
+querytype cql
+
+ f date-modified=20050713
+
+ Search ERROR 121 4 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @att
+r 4=1 @attr 2=3 @attr "1=date-modified" 20050713
+
+
+
+ f date-modified eq 20050713
+
+Search OK 23 3 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @attr 4=5
+ @attr 2=3 @attr "1=date-modified" 20050713
+
+
+ </para>
+
+ <para>
+E) EXTENDED SERVICE LIFE UPDATES
+
+The extended services are not enabled by default in zebra - due to the
+fact that they modify the system.
+
+In order to allow anybody to update, use
+perm.anonymous: rw
+in zebra.cfg.
+
+Or, even better, allow only updates for a particular admin user. For
+user 'admin', you could use:
+perm.admin: rw
+passwd: passwordfile
+
+And in passwordfile, specify users and passwords ..
+admin:secret
+
+We can now start a yaz-client admin session and create a database:
+
+$ yaz-client localhost:9999 -u admin/secret
+Authentication set to Open (admin/secret)
+Connecting...OK.
+Sent initrequest.
+Connection accepted by v3 target.
+ID : 81
+Name : Zebra Information Server/GFS/YAZ
+Version: Zebra 1.4.0/1.63/2.1.9
+Options: search present delSet triggerResourceCtrl scan sort
+extendedServices namedResultSets
+Elapsed: 0.007046
+Z> adm-create
+Admin request
+Got extended services response
+Status: done
+Elapsed: 0.045009
+:
+Now Default was created.. We can now insert an XML file (esdd0006.grs
+from example/gils/records) and index it:
+
+Z> update insert 1 esdd0006.grs
+Got extended services response
+Status: done
+Elapsed: 0.438016
+
+The 3rd parameter.. 1 here .. is the opaque record id from Ext update.
+It a record ID that _we_ assign to the record in question. If we do not
+assign one the usual rules for match apply (recordId: from zebra.cfg).
+
+Actually, we should have a way to specify "no opaque record id" for
+yaz-client's update command.. We'll fix that.
+
+Elapsed: 0.438016
+Z> f utah
+Sent searchRequest.
+Received SearchResponse.
+Search was a success.
+Number of hits: 1, setno 1
+SearchResult-1: term=utah cnt=1
+records returned: 0
+Elapsed: 0.014179
+
+Let's delete the beast:
+Z> update delete 1
+No last record (update ignored)
+Z> update delete 1 esdd0006.grs
+Got extended services response
+Status: done
+Elapsed: 0.072441
+Z> f utah
+Sent searchRequest.
+Received SearchResponse.
+Search was a success.
+Number of hits: 0, setno 2
+SearchResult-1: term=utah cnt=0
+records returned: 0
+Elapsed: 0.013610
+
+If shadow register is enabled you must run the adm-commit command in
+order write your changes..
+
+ </para>
+
+
+
+ </sect1>
+-->
+
+ </chapter>
+
+ <!-- Keep this comment at the end of the file
+ Local variables:
+ mode: sgml
+ sgml-omittag:t
+ sgml-shorttag:t
+ sgml-minimize-attributes:nil
+ sgml-always-quote-attributes:t
+ sgml-indent-step:1
+ sgml-indent-data:t
+ sgml-parent-document: "zebra.xml"
+ sgml-local-catalogs: nil
+ sgml-namecase-general:t
+ End:
+ -->
<chapter id="record-model">
- <!-- $Id: recordmodel.xml,v 1.21 2005-01-04 19:10:29 quinn Exp $ -->
- <title>The Record Model</title>
+ <!-- $Id: recordmodel.xml,v 1.22 2006-01-18 14:00:54 marc Exp $ -->
+ <title>GRS Record Model and Filter Modules</title>
- <para>
- The Zebra system is designed to support a wide range of data management
- applications. The system can be configured to handle virtually any
- kind of structured data. Each record in the system is associated with
- a <emphasis>record schema</emphasis> which lends context to the data
- elements of the record.
- Any number of record schemas can coexist in the system.
- Although it may be wise to use only a single schema within
- one database, the system poses no such restrictions.
- </para>
<para>
The record model described in this chapter applies to the fundamental,
structured
record type <literal>grs</literal>, introduced in
- <xref linkend="record-types"/>.
- <!--
- FIXME - Need to describe the simple string-tag model, or at least
- refer to it here. -H
- -->
- </para>
-
- <para>
- Records pass through three different states during processing in the
- system.
- </para>
-
- <para>
-
- <itemizedlist>
- <listitem>
-
- <para>
- When records are accessed by the system, they are represented
- in their local, or native format. This might be SGML or HTML files,
- News or Mail archives, MARC records. If the system doesn't already
- know how to read the type of data you need to store, you can set up an
- input filter by preparing conversion rules based on regular
- expressions and possibly augmented by a flexible scripting language
- (Tcl).
- The input filter produces as output an internal representation,
- a tree structure.
-
- </para>
- </listitem>
- <listitem>
-
- <para>
- When records are processed by the system, they are represented
- in a tree-structure, constructed by tagged data elements hanging off a
- root node. The tagged elements may contain data or yet more tagged
- elements in a recursive structure. The system performs various
- actions on this tree structure (indexing, element selection, schema
- mapping, etc.),
-
- </para>
- </listitem>
- <listitem>
-
- <para>
- Before transmitting records to the client, they are first
- converted from the internal structure to a form suitable for exchange
- over the network - according to the Z39.50 standard.
- </para>
- </listitem>
-
- </itemizedlist>
-
+ <xref linkend="componentmodulesgrs"/>.
</para>
- <sect1 id="local-representation">
- <title>Local Representation</title>
-
- <para>
- As mentioned earlier, Zebra places few restrictions on the type of
- data that you can index and manage. Generally, whatever the form of
- the data, it is parsed by an input filter specific to that format, and
- turned into an internal structure that Zebra knows how to handle. This
- process takes place whenever the record is accessed - for indexing and
- retrieval.
- </para>
-
- <para>
- The RecordType parameter in the <literal>zebra.cfg</literal> file, or
- the <literal>-t</literal> option to the indexer tells Zebra how to
- process input records.
- Two basic types of processing are available - raw text and structured
- data. Raw text is just that, and it is selected by providing the
- argument <emphasis>text</emphasis> to Zebra. Structured records are
- all handled internally using the basic mechanisms described in the
- subsequent sections.
- Zebra can read structured records in many different formats.
- How this is done is governed by additional parameters after the
- "grs" keyword, separated by "." characters.
- </para>
+ <sect1 id="grs-record-filters">
+ <title>GRS Record Filters</title>
<para>
- Four basic subtypes to the <emphasis>grs</emphasis> type are
+ Many basic subtypes of the <emphasis>grs</emphasis> type are
currently available:
</para>
<term>grs.sgml</term>
<listitem>
<para>
- This is the canonical input format —
- described below. It is a simple SGML-like syntax.
+      This is the canonical input format,
+      described in <xref linkend="grs-canonical-format"/>. It uses a
+      simple SGML-like syntax.
+ </para>
+ <!--
+ <para>
+ <literal>libidzebra1.4-mod-grs-sgml not packaged yet ??</literal>
</para>
+ -->
</listitem>
</varlistentry>
<varlistentry>
- <term>grs.regx.<emphasis>filter</emphasis></term>
+ <term>grs.marc<!--.<emphasis>abstract syntax</emphasis>--></term>
<listitem>
<para>
- This enables a user-supplied input
- filter. The mechanisms of these filters are described below.
+ This allows Zebra to read
+ records in the ISO2709 (MARC) encoding standard.
+ <!-- In this case, the
+ last parameter <emphasis>abstract syntax</emphasis> names the
+ <literal>.abs</literal> file (see below)
+ which describes the specific MARC structure of the input record as
+ well as the indexing rules. -->
</para>
+ <para>
+ The loadable <literal>grs.marc</literal> filter module
+ is packaged in the GNU/Debian package
+ <literal>libidzebra1.4-mod-grs-marc</literal>
+ </para>
</listitem>
</varlistentry>
<varlistentry>
- <term>grs.tcl.<emphasis>filter</emphasis></term>
+ <term>grs.marcxml<!--.<emphasis>abstract syntax</emphasis>--></term>
<listitem>
<para>
- Similar to grs.regx but using Tcl for rules.
+      This allows Zebra to read
+      records in the MARCXML format, the XML representation of
+      MARC records.
</para>
+ <para>
+ The loadable <literal>grs.marcxml</literal> filter module
+ is also contained in the GNU/Debian package
+ <literal>libidzebra1.4-mod-grs-marc</literal>
+ </para>
</listitem>
</varlistentry>
<varlistentry>
- <term>grs.marc.<emphasis>abstract syntax</emphasis></term>
+ <term>grs.danbib</term>
<listitem>
<para>
- This allows Zebra to read
- records in the ISO2709 (MARC) encoding standard. In this case, the
- last parameter <emphasis>abstract syntax</emphasis> names the
- <literal>.abs</literal> file (see below)
- which describes the specific MARC structure of the input record as
- well as the indexing rules.
+      The <literal>grs.danbib</literal> filter parses DanBib
+      records, a Danish MARC record variant called DANMARC.
+      DanBib is the Danish Union Catalogue hosted by the
+      Danish Bibliographic Centre (DBC).
+ </para>
+     <para>The loadable <literal>grs.danbib</literal> filter module
+      is packaged in the GNU/Debian package
+      <literal>libidzebra1.4-mod-grs-danbib</literal>.
</para>
</listitem>
</varlistentry>
<term>grs.xml</term>
<listitem>
<para>
- This filter reads XML records. Only one record per file
+ This filter reads XML records and uses Expat to
+ parse them and convert them into IDZebra's internal
+ <literal>grs</literal> record model.
+ Only one record per file
is supported. The filter is only available if Zebra/YAZ
is compiled with EXPAT support.
</para>
+ <para>
+      The loadable <literal>grs.xml</literal> filter module
+      is packaged in the GNU/Debian package
+ <literal>libidzebra1.4-mod-grs-xml</literal>
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>grs.regx<!--.<emphasis>filter</emphasis>--></term>
+ <listitem>
+ <para>
+      This enables a user-supplied regular expression based input
+      filter, described in
+ <xref linkend="grs-regx-tcl"/>.
+ </para>
+ <para>
+ The loadable <literal>grs.regx</literal> filter module
+ is packaged in the GNU/Debian package
+ <literal>libidzebra1.4-mod-grs-regx</literal>
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>grs.tcl<!--.<emphasis>filter</emphasis>--></term>
+ <listitem>
+ <para>
+ Similar to grs.regx but using Tcl for rules, described in
+ <xref linkend="grs-regx-tcl"/>.
+ </para>
+ <para>
+ The loadable <literal>grs.tcl</literal> filter module
+ is also packaged in the GNU/Debian package
+ <literal>libidzebra1.4-mod-grs-regx</literal>
+ </para>
</listitem>
</varlistentry>
</variablelist>
</para>
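+  <para>
+   A subtype is selected with the <literal>recordType</literal>
+   directive in <literal>zebra.cfg</literal>, for example:
+   <screen>
+    recordType: grs.xml
+   </screen>
+  </para>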
- <sect2>
- <title>Canonical Input Format</title>
+ <sect2 id="grs-canonical-format">
+ <title>GRS Canonical Input Format</title>
<para>
Although input data can take any form, it is sometimes useful to
makes up the total record. In the canonical input format, the root tag
should contain the name of the schema that lends context to the
elements of the record
- (see <xref linkend="internal-representation"/>).
+ (see <xref linkend="grs-internal-representation"/>).
The following is a GILS record that
contains only a single element (strictly speaking, that makes it an
illegal GILS record, since the GILS profile includes several mandatory
</sect2>
- <sect2>
- <title>Input Filters</title>
+ <sect2 id="grs-regx-tcl">
+ <title>GRS REGX And TCL Input Filters</title>
<para>
In order to handle general input formats, Zebra allows the
</sect1>
- <sect1 id="internal-representation">
- <title>Internal Representation</title>
+ <sect1 id="grs-internal-representation">
+ <title>GRS Internal Record Representation</title>
<para>
When records are manipulated by the system, they're represented in a
</sect1>
- <sect1 id="data-model">
- <title>Configuring Your Data Model</title>
+ <sect1 id="grs-record-model">
+ <title>GRS Record Model Configuration</title>
<para>
The following sections describe the configuration files that govern
- the internal management of data records. The system searches for the files
+ the internal management of <literal>grs</literal> records.
+ The system searches for the files
in the directories specified by the <emphasis>profilePath</emphasis>
setting in the <literal>zebra.cfg</literal> file.
</para>
<emphasis>target</emphasis></term>
<listitem>
<para>
- This directive introduces a
- mapping between each of the members of the value-set on the left to
- the character on the right. The character on the right must occur in
- the value set (the <literal>lowercase</literal> directive) of
- the character set, but
- it may be a paranthesis-enclosed multi-octet character. This directive
- may be used to map diacritics to their base characters, or to map
- HTML-style character-representations to their natural form, etc. The map directive
- can also be used to ignore leading articles in searching and/or sorting, and to perform
- other special transformations. See section <xref linkend="leading-articles"/>.
+ This directive introduces a mapping between each of the
+ members of the value-set on the left to the character on the
+ right. The character on the right must occur in the value
+ set (the <literal>lowercase</literal> directive) of the
+       character set, but it may be a parenthesis-enclosed
+ multi-octet character. This directive may be used to map
+ diacritics to their base characters, or to map HTML-style
+ character-representations to their natural form, etc. The
+ map directive can also be used to ignore leading articles in
+ searching and/or sorting, and to perform other special
+ transformations. See section <xref
+ linkend="leading-articles"/>.
</para>
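+ <para>
+ For example, assuming the relevant base characters are already
+ present in the value set, diacritics might be folded to their
+ base characters like this (an illustrative sketch, not a
+ complete map):
+ <screen>
+ map (ä) a
+ map (ö) o
+ map (ü) u
+ </screen>
+ </para>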
</listitem></varlistentry>
</variablelist>
<sect3 id="leading-articles">
<title>Ignoring leading articles</title>
<para>
- In addition to specifying sort orders, space (blank) handling, and upper/lowercase folding,
- you can also use the character map files to make Zebra ignore leading articles in sorting
- records, or when doing complete field searching.
+ In addition to specifying sort orders, space (blank) handling,
+ and upper/lowercase folding, you can also use the character map
+ files to make Zebra ignore leading articles in sorting records,
+ or when doing complete field searching.
</para>
<para>
- This is done using the <literal>map</literal> directive in the character map file. In a
- nutshell, what you do is map certain sequences of characters, when they occur <emphasis>
- in the beginning of a field</emphasis>, to a space. Assuming that the character "@" is
- defined as a space character in your file, you can do:
+ This is done using the <literal>map</literal> directive in the
+ character map file. In a nutshell, you map certain sequences of
+ characters, when they occur <emphasis>at the beginning of a
+ field</emphasis>, to a space. Assuming that the character "@" is
+ defined as a space character in your file, you can do:
<screen>
map (^The\s) @
map (^the\s) @
</screen>
- The effect of these directives is to map either 'the' or 'The', followed by a space
- character, to a space. The hat ^ character denotes beginning-of-field only when
- complete-subfield indexing or sort indexing is taking place; otherwise, it is treated just
+ The effect of these directives is to map either 'the' or 'The',
+ followed by a space character, to a space. The hat (^) character
+ denotes beginning-of-field only when complete-subfield indexing
+ or sort indexing is taking place; otherwise, it is treated just
as any other character.
</para>
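+ <para>
+ The "@" character is not a space by default; one way to define
+ it as one is to include it in the <literal>space</literal>
+ directive of the same character map file (the value set shown
+ here is only an example):
+ <screen>
+ space {\001-\040}@
+ </screen>
+ </para>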
<para>
+ Because the <literal>default.idx</literal> file can be used to
+ associate different character maps with different indexing
+ types (and you can create additional indexing types, should
+ the need arise), it is possible to specify that leading
+ articles should be ignored either in sorting, in complete-field
+ searching, or both.
</para>
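+ <para>
+ For instance, a <literal>default.idx</literal> entry along the
+ following lines (the character map file name is illustrative)
+ would let the sort index use a character map of its own, so
+ that articles are stripped only for sorting:
+ <screen>
+ sort s
+ completeness 1
+ charmap sort-string.chr
+ </screen>
+ </para>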
<para>
- If you ignore certain prefixes in sorting, then these will be eliminated from the index,
- and sorting will take place as if they weren't there. However, if you set the system up
- to ignore certain prefixes in <emphasis>searching</emphasis>, then these are deleted both
- from the indexes and from query terms, when the client specifies complete-field
- searching. This has the effect that a search for 'the science journal' and 'science
- journal' would both produce the same results.
+ If you ignore certain prefixes in sorting, then these will be
+ eliminated from the index, and sorting will take place as if
+ they weren't there. However, if you set the system up to ignore
+ certain prefixes in <emphasis>searching</emphasis>, then these
+ are deleted both from the indexes and from query terms, when the
+ client specifies complete-field searching. This has the effect
+ that searches for 'the science journal' and for 'science
+ journal' produce the same results.
</para>
</sect3>
</sect2>
</sect1>
- <sect1 id="formats">
- <title>Exchange Formats</title>
+ <sect1 id="grs-exchange-formats">
+ <title>GRS Exchange Formats</title>
<para>
- Converting records from the internal structure to en exchange format
+ Converting records from the internal structure to an exchange format
is largely an automatic process. Currently, the following exchange
formats are supported:
</para>
</para>
</listitem>
+ <!-- FIXME - Is this used anywhere ? -H -->
+ <!--
<listitem>
<para>
SOIF. Support for this syntax is experimental, and is currently
abstract syntaxes can be mapped to the SOIF format, although nested
elements are represented by concatenation of the tag names at each
level.
- <!-- FIXME - Is this used anywhere ? -H -->
</para>
</listitem>
-
+ -->
</itemizedlist>
</para>
</sect1>