-<chapter id="administration">
- <title>Administrating &zebra;</title>
- <!-- ### It's a bit daft that this chapter (which describes half of
- the configuration-file formats) is separated from
- "recordmodel-grs.xml" (which describes the other half) by the
- instructions on running zebraidx and zebrasrv. Some careful
- re-ordering is required here.
- -->
+ <chapter id="administration">
+ <title>Administrating &zebra;</title>
+ <!-- ### It's a bit daft that this chapter (which describes half of
+ the configuration-file formats) is separated from
+ "recordmodel-grs.xml" (which describes the other half) by the
+ instructions on running zebraidx and zebrasrv. Some careful
+ re-ordering is required here.
+ -->
- <para>
- Unlike many simpler retrieval systems, &zebra; supports safe, incremental
- updates to an existing index.
- </para>
-
- <para>
- Normally, when &zebra; modifies the index it reads a number of records
- that you specify.
- Depending on your specifications and on the contents of each record
- one the following events take place for each record:
- <variablelist>
-
- <varlistentry>
- <term>Insert</term>
- <listitem>
- <para>
- The record is indexed as if it never occurred before.
- Either the &zebra; system doesn't know how to identify the record or
- &zebra; can identify the record but didn't find it to be already indexed.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>Modify</term>
- <listitem>
- <para>
- The record has already been indexed.
- In this case either the contents of the record or the location
- (file) of the record indicates that it has been indexed before.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>Delete</term>
- <listitem>
- <para>
- The record is deleted from the index. As in the
- update-case it must be able to identify the record.
- </para>
- </listitem>
- </varlistentry>
- </variablelist>
- </para>
-
- <para>
- Please note that in both the modify- and delete- case the &zebra;
- indexer must be able to generate a unique key that identifies the record
- in question (more on this below).
- </para>
-
- <para>
- To administrate the &zebra; retrieval system, you run the
- <literal>zebraidx</literal> program.
- This program supports a number of options which are preceded by a dash,
- and a few commands (not preceded by dash).
-</para>
-
- <para>
- Both the &zebra; administrative tool and the &acro.z3950; server share a
- set of index files and a global configuration file.
- The name of the configuration file defaults to
- <literal>zebra.cfg</literal>.
- The configuration file includes specifications on how to index
- various kinds of records and where the other configuration files
- are located. <literal>zebrasrv</literal> and <literal>zebraidx</literal>
- <emphasis>must</emphasis> be run in the directory where the
- configuration file lives unless you indicate the location of the
- configuration file by option <literal>-c</literal>.
- </para>
-
- <sect1 id="record-types">
- <title>Record Types</title>
-
- <para>
- Indexing is a per-record process, in which either insert/modify/delete
- will occur. Before a record is indexed search keys are extracted from
- whatever might be the layout the original record (sgml,html,text, etc..).
- The &zebra; system currently supports two fundamental types of records:
- structured and simple text.
- To specify a particular extraction process, use either the
- command line option <literal>-t</literal> or specify a
- <literal>recordType</literal> setting in the configuration file.
- </para>
-
- </sect1>
-
- <sect1 id="zebra-cfg">
- <title>The &zebra; Configuration File</title>
-
- <para>
- The &zebra; configuration file, read by <literal>zebraidx</literal> and
- <literal>zebrasrv</literal> defaults to <literal>zebra.cfg</literal>
- unless specified by <literal>-c</literal> option.
- </para>
-
- <para>
- You can edit the configuration file with a normal text editor.
- parameter names and values are separated by colons in the file. Lines
- starting with a hash sign (<literal>#</literal>) are
- treated as comments.
- </para>
-
<para>
- If you manage different sets of records that share common
- characteristics, you can organize the configuration settings for each
- type into "groups".
- When <literal>zebraidx</literal> is run and you wish to address a
- given group you specify the group name with the <literal>-g</literal>
- option.
- In this case settings that have the group name as their prefix
- will be used by <literal>zebraidx</literal>.
- If no <literal>-g</literal> option is specified, the settings
- without prefix are used.
+ Unlike many simpler retrieval systems, &zebra; supports safe, incremental
+ updates to an existing index.
</para>
-
- <para>
- In the configuration file, the group name is placed before the option
- name itself, separated by a dot (.). For instance, to set the record type
- for group <literal>public</literal> to <literal>grs.sgml</literal>
- (the &acro.sgml;-like format for structured records) you would write:
- </para>
-
- <para>
- <screen>
- public.recordType: grs.sgml
- </screen>
- </para>
-
- <para>
- To set the default value of the record type to <literal>text</literal>
- write:
- </para>
-
- <para>
- <screen>
- recordType: text
- </screen>
- </para>
-
- <para>
- The available configuration settings are summarized below. They will be
- explained further in the following sections.
- </para>
-
- <!--
- FIXME - Didn't Adam make something to have multiple databases in multiple dirs...
- -->
-
+
<para>
+   Normally, when &zebra; modifies the index, it reads a number of records
+   that you specify.
+   Depending on your specifications and on the contents of each record,
+   one of the following events takes place for each record:
<variablelist>
-
- <varlistentry>
- <term>
- <emphasis>group</emphasis>
- .recordType[<emphasis>.name</emphasis>]:
- <replaceable>type</replaceable>
- </term>
- <listitem>
- <para>
- Specifies how records with the file extension
- <emphasis>name</emphasis> should be handled by the indexer.
- This option may also be specified as a command line option
- (<literal>-t</literal>). Note that if you do not specify a
- <emphasis>name</emphasis>, the setting applies to all files.
- In general, the record type specifier consists of the elements (each
- element separated by dot), <emphasis>fundamental-type</emphasis>,
- <emphasis>file-read-type</emphasis> and arguments. Currently, two
- fundamental types exist, <literal>text</literal> and
- <literal>grs</literal>.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term><emphasis>group</emphasis>.recordId:
- <replaceable>record-id-spec</replaceable></term>
- <listitem>
- <para>
- Specifies how the records are to be identified when updated. See
- <xref linkend="locating-records"/>.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term><emphasis>group</emphasis>.database:
- <replaceable>database</replaceable></term>
- <listitem>
- <para>
- Specifies the &acro.z3950; database name.
- <!-- FIXME - now we can have multiple databases in one server. -H -->
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term><emphasis>group</emphasis>.storeKeys:
- <replaceable>boolean</replaceable></term>
- <listitem>
- <para>
- Specifies whether key information should be saved for a given
- group of records. If you plan to update/delete this type of
- records later this should be specified as 1; otherwise it
- should be 0 (default), to save register space.
- <!-- ### this is the first mention of "register" -->
- See <xref linkend="file-ids"/>.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term><emphasis>group</emphasis>.storeData:
- <replaceable>boolean</replaceable></term>
- <listitem>
- <para>
- Specifies whether the records should be stored internally
- in the &zebra; system files.
- If you want to maintain the raw records yourself,
- this option should be false (0).
- If you want &zebra; to take care of the records for you, it
- should be true(1).
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <!-- ### probably a better place to define "register" -->
- <term>register: <replaceable>register-location</replaceable></term>
- <listitem>
- <para>
- Specifies the location of the various register files that &zebra; uses
- to represent your databases.
- See <xref linkend="register-location"/>.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>shadow: <replaceable>register-location</replaceable></term>
- <listitem>
- <para>
- Enables the <emphasis>safe update</emphasis> facility of &zebra;, and
- tells the system where to place the required, temporary files.
- See <xref linkend="shadow-registers"/>.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>lockDir: <replaceable>directory</replaceable></term>
- <listitem>
- <para>
- Directory in which various lock files are stored.
- </para>
- </listitem>
- </varlistentry>
+
<varlistentry>
- <term>keyTmpDir: <replaceable>directory</replaceable></term>
+ <term>Insert</term>
<listitem>
<para>
- Directory in which temporary files used during zebraidx's update
- phase are stored.
+	 The record is indexed as if it had never been seen before:
+	 either the &zebra; system doesn't know how to identify the record, or
+	 &zebra; can identify the record but did not find it already indexed.
</para>
</listitem>
</varlistentry>
<varlistentry>
- <term>setTmpDir: <replaceable>directory</replaceable></term>
+ <term>Modify</term>
<listitem>
<para>
- Specifies the directory that the server uses for temporary result sets.
- If not specified <literal>/tmp</literal> will be used.
+ The record has already been indexed.
+ In this case either the contents of the record or the location
+ (file) of the record indicates that it has been indexed before.
</para>
</listitem>
</varlistentry>
<varlistentry>
- <term>profilePath: <replaceable>path</replaceable></term>
+ <term>Delete</term>
<listitem>
<para>
- Specifies a path of profile specification files.
- The path is composed of one or more directories separated by
- colon. Similar to <literal>PATH</literal> for UNIX systems.
+	 The record is deleted from the index. As in the
+	 modify case, the indexer must be able to identify the record.
</para>
</listitem>
</varlistentry>
+ </variablelist>
+ </para>
+
+ <para>
+   Please note that in both the modify and delete cases the &zebra;
+   indexer must be able to generate a unique key that identifies the record
+   in question (more on this below).
+ </para>
+
+ <para>
+ To administrate the &zebra; retrieval system, you run the
+ <literal>zebraidx</literal> program.
+   This program supports a number of options, which are preceded by a dash,
+   and a few commands (not preceded by a dash).
+ </para>
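+
+  <para>
+   For example, a typical invocation combines options and a command like
+   this (the directory name <literal>records</literal> is only an
+   illustration):
+   <screen>
+    $ zebraidx update records
+   </screen>
+  </para>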
+
+ <para>
+ Both the &zebra; administrative tool and the &acro.z3950; server share a
+ set of index files and a global configuration file.
+ The name of the configuration file defaults to
+ <literal>zebra.cfg</literal>.
+ The configuration file includes specifications on how to index
+ various kinds of records and where the other configuration files
+ are located. <literal>zebrasrv</literal> and <literal>zebraidx</literal>
+ <emphasis>must</emphasis> be run in the directory where the
+   configuration file lives, unless you indicate the location of the
+   configuration file with the <literal>-c</literal> option.
+ </para>
+
+ <sect1 id="record-types">
+ <title>Record Types</title>
+
+ <para>
+    Indexing is a per-record process in which an insert, modify, or delete
+    takes place. Before a record is indexed, search keys are extracted from
+    whatever the layout of the original record might be (SGML, HTML, plain text, etc.).
+ The &zebra; system currently supports two fundamental types of records:
+ structured and simple text.
+ To specify a particular extraction process, use either the
+ command line option <literal>-t</literal> or specify a
+ <literal>recordType</literal> setting in the configuration file.
+ </para>
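+
+   <para>
+    To illustrate, the simple text extraction can be selected either on
+    the command line or in <literal>zebra.cfg</literal> (both forms assume
+    plain-text input files in a directory called <literal>records</literal>,
+    names purely illustrative):
+    <screen>
+     $ zebraidx -t text update records
+    </screen>
+    or, equivalently, in the configuration file:
+    <screen>
+     recordType: text
+    </screen>
+   </para>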
+
+ </sect1>
+
+ <sect1 id="zebra-cfg">
+ <title>The &zebra; Configuration File</title>
+
+ <para>
+    The &zebra; configuration file, read by <literal>zebraidx</literal> and
+    <literal>zebrasrv</literal>, defaults to <literal>zebra.cfg</literal>
+    unless specified with the <literal>-c</literal> option.
+ </para>
+
+ <para>
+ You can edit the configuration file with a normal text editor.
+    Parameter names and values are separated by colons in the file. Lines
+ starting with a hash sign (<literal>#</literal>) are
+ treated as comments.
+ </para>
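+
+   <para>
+    A small, hypothetical fragment illustrating the syntax:
+    <screen>
+     # this line is a comment
+     profilePath: /usr/local/idzebra/tab
+    </screen>
+   </para>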
+
+ <para>
+ If you manage different sets of records that share common
+ characteristics, you can organize the configuration settings for each
+ type into "groups".
+ When <literal>zebraidx</literal> is run and you wish to address a
+ given group you specify the group name with the <literal>-g</literal>
+ option.
+ In this case settings that have the group name as their prefix
+ will be used by <literal>zebraidx</literal>.
+ If no <literal>-g</literal> option is specified, the settings
+ without prefix are used.
+ </para>
+
+ <para>
+ In the configuration file, the group name is placed before the option
+ name itself, separated by a dot (.). For instance, to set the record type
+ for group <literal>public</literal> to <literal>grs.sgml</literal>
+ (the &acro.sgml;-like format for structured records) you would write:
+ </para>
+
+ <para>
+ <screen>
+ public.recordType: grs.sgml
+ </screen>
+ </para>
+
+ <para>
+ To set the default value of the record type to <literal>text</literal>
+ write:
+ </para>
+
+ <para>
+ <screen>
+ recordType: text
+ </screen>
+ </para>
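+
+   <para>
+    With the <literal>public</literal> group configured as above, you would
+    select it on the command line like this (the directory name is
+    illustrative):
+    <screen>
+     $ zebraidx -g public update records
+    </screen>
+   </para>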
+
+ <para>
+ The available configuration settings are summarized below. They will be
+ explained further in the following sections.
+ </para>
+
+ <!--
+ FIXME - Didn't Adam make something to have multiple databases in multiple dirs...
+ -->
+
+ <para>
+ <variablelist>
+
+ <varlistentry>
+ <term>
+ <emphasis>group</emphasis>
+ .recordType[<emphasis>.name</emphasis>]:
+ <replaceable>type</replaceable>
+ </term>
+ <listitem>
+ <para>
+ Specifies how records with the file extension
+ <emphasis>name</emphasis> should be handled by the indexer.
+ This option may also be specified as a command line option
+ (<literal>-t</literal>). Note that if you do not specify a
+ <emphasis>name</emphasis>, the setting applies to all files.
+	 In general, the record type specifier consists of elements separated
+	 by dots: <emphasis>fundamental-type</emphasis>,
+	 <emphasis>file-read-type</emphasis>, and arguments. Currently, two
+ fundamental types exist, <literal>text</literal> and
+ <literal>grs</literal>.
+ </para>
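+	<para>
+	 For example (group and type names purely illustrative), to handle
+	 files with the <literal>.xml</literal> extension differently from
+	 all other files in the group:
+	 <screen>
+	  public.recordType.xml: grs.sgml
+	  public.recordType: text
+	 </screen>
+	</para>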
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><emphasis>group</emphasis>.recordId:
+ <replaceable>record-id-spec</replaceable></term>
+ <listitem>
+ <para>
+ Specifies how the records are to be identified when updated. See
+ <xref linkend="locating-records"/>.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><emphasis>group</emphasis>.database:
+ <replaceable>database</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the &acro.z3950; database name.
+ <!-- FIXME - now we can have multiple databases in one server. -H -->
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><emphasis>group</emphasis>.storeKeys:
+ <replaceable>boolean</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether key information should be saved for a given
+	 group of records. If you plan to update or delete this type of
+	 record later, this should be set to 1; otherwise it
+	 should be 0 (the default), to save register space.
+ <!-- ### this is the first mention of "register" -->
+ See <xref linkend="file-ids"/>.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><emphasis>group</emphasis>.storeData:
+ <replaceable>boolean</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether the records should be stored internally
+ in the &zebra; system files.
+ If you want to maintain the raw records yourself,
+ this option should be false (0).
+ If you want &zebra; to take care of the records for you, it
+	 should be true (1).
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <!-- ### probably a better place to define "register" -->
+ <term>register: <replaceable>register-location</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the location of the various register files that &zebra; uses
+ to represent your databases.
+ See <xref linkend="register-location"/>.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>shadow: <replaceable>register-location</replaceable></term>
+ <listitem>
+ <para>
+ Enables the <emphasis>safe update</emphasis> facility of &zebra;, and
+ tells the system where to place the required, temporary files.
+ See <xref linkend="shadow-registers"/>.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>lockDir: <replaceable>directory</replaceable></term>
+ <listitem>
+ <para>
+ Directory in which various lock files are stored.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>keyTmpDir: <replaceable>directory</replaceable></term>
+ <listitem>
+ <para>
+ Directory in which temporary files used during zebraidx's update
+ phase are stored.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>setTmpDir: <replaceable>directory</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the directory that the server uses for temporary result sets.
+ If not specified <literal>/tmp</literal> will be used.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>profilePath: <replaceable>path</replaceable></term>
+ <listitem>
+ <para>
+ Specifies a path of profile specification files.
+	 The path is composed of one or more directories separated by
+	 colons, similar to <literal>PATH</literal> on UNIX systems.
+ </para>
+ </listitem>
+ </varlistentry>
<varlistentry>
<term>modulePath: <replaceable>path</replaceable></term>
<term>sortmax: <replaceable>integer</replaceable></term>
<listitem>
<para>
- Specifies the maximum number of records that will be sorted
- in a result set. If the result set contains more than
- <replaceable>integer</replaceable> records, records after the
- limit will not be sorted. If omitted, the default value is
- 1,000.
+ Specifies the maximum number of records that will be sorted
+ in a result set. If the result set contains more than
+ <replaceable>integer</replaceable> records, records after the
+ limit will not be sorted. If omitted, the default value is
+ 1,000.
</para>
</listitem>
</varlistentry>
- <varlistentry>
- <term>attset: <replaceable>filename</replaceable></term>
- <listitem>
- <para>
+ <varlistentry>
+ <term>attset: <replaceable>filename</replaceable></term>
+ <listitem>
+ <para>
Specifies the filename(s) of attribute set files for use in
searching. In many configurations <filename>bib1.att</filename>
is used, but that is not required. If Classic Explain
	  attributes are to be used for searching,
<filename>explain.att</filename> must be given.
- The path to att-files in general can be given using
+	 The path to att-files can in general be given using the
<literal>profilePath</literal> setting.
See also <xref linkend="attset-files"/>.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>memMax: <replaceable>size</replaceable></term>
- <listitem>
- <para>
- Specifies <replaceable>size</replaceable> of internal memory
- to use for the zebraidx program.
- The amount is given in megabytes - default is 4 (4 MB).
- The more memory, the faster large updates happen, up to about
- half the free memory available on the computer.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>tempfiles: <replaceable>Yes/Auto/No</replaceable></term>
- <listitem>
- <para>
- Tells zebra if it should use temporary files when indexing. The
- default is Auto, in which case zebra uses temporary files only
- if it would need more that <replaceable>memMax</replaceable>
- megabytes of memory. This should be good for most uses.
- </para>
- </listitem>
- </varlistentry>
+ </para>
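+	<para>
+	 For example, to load both files mentioned above (assuming they can be
+	 found via <literal>profilePath</literal>):
+	 <screen>
+	  attset: bib1.att
+	  attset: explain.att
+	 </screen>
+	</para>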
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>memMax: <replaceable>size</replaceable></term>
+ <listitem>
+ <para>
+	 Specifies the <replaceable>size</replaceable> of internal memory
+	 to use for the zebraidx program.
+	 The amount is given in megabytes; the default is 4 (4 MB).
+ The more memory, the faster large updates happen, up to about
+ half the free memory available on the computer.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>tempfiles: <replaceable>Yes/Auto/No</replaceable></term>
+ <listitem>
+ <para>
+	 Tells &zebra; whether it should use temporary files when indexing. The
+	 default is Auto, in which case &zebra; uses temporary files only
+	 if it would need more than <replaceable>memMax</replaceable>
+ megabytes of memory. This should be good for most uses.
+ </para>
+ </listitem>
+ </varlistentry>
- <varlistentry>
- <term>root: <replaceable>dir</replaceable></term>
- <listitem>
- <para>
- Specifies a directory base for &zebra;. All relative paths
- given (in profilePath, register, shadow) are based on this
- directory. This setting is useful if your &zebra; server
- is running in a different directory from where
- <literal>zebra.cfg</literal> is located.
- </para>
- </listitem>
- </varlistentry>
+ <varlistentry>
+ <term>root: <replaceable>dir</replaceable></term>
+ <listitem>
+ <para>
+ Specifies a directory base for &zebra;. All relative paths
+ given (in profilePath, register, shadow) are based on this
+ directory. This setting is useful if your &zebra; server
+ is running in a different directory from where
+ <literal>zebra.cfg</literal> is located.
+ </para>
+ </listitem>
+ </varlistentry>
- <varlistentry>
- <term>passwd: <replaceable>file</replaceable></term>
- <listitem>
- <para>
- Specifies a file with description of user accounts for &zebra;.
- The format is similar to that known to Apache's htpasswd files
- and UNIX' passwd files. Non-empty lines not beginning with
- # are considered account lines. There is one account per-line.
- A line consists of fields separate by a single colon character.
- First field is username, second is password.
- </para>
- </listitem>
- </varlistentry>
+ <varlistentry>
+ <term>passwd: <replaceable>file</replaceable></term>
+ <listitem>
+ <para>
+	 Specifies a file with descriptions of user accounts for &zebra;.
+	 The format is similar to that of Apache's htpasswd files
+	 and UNIX passwd files. Non-empty lines not beginning with
+	 # are considered account lines. There is one account per line.
+	 A line consists of fields separated by a single colon character.
+	 The first field is the username, the second the password.
+ </para>
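+	<para>
+	 A hypothetical account file (usernames and passwords invented for
+	 illustration):
+	 <screen>
+	  # format: username:password
+	  admin:secret
+	  guest:guest
+	 </screen>
+	</para>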
+ </listitem>
+ </varlistentry>
- <varlistentry>
- <term>passwd.c: <replaceable>file</replaceable></term>
- <listitem>
- <para>
- Specifies a file with description of user accounts for &zebra;.
- File format is similar to that used by the passwd directive except
- that the password are encrypted. Use Apache's htpasswd or similar
- for maintenance.
- </para>
- </listitem>
- </varlistentry>
+ <varlistentry>
+ <term>passwd.c: <replaceable>file</replaceable></term>
+ <listitem>
+ <para>
+	 Specifies a file with descriptions of user accounts for &zebra;.
+	 The file format is similar to that used by the passwd directive, except
+	 that the passwords are encrypted. Use Apache's htpasswd or a similar
+	 tool for maintenance.
+ </para>
+ </listitem>
+ </varlistentry>
- <varlistentry>
- <term>perm.<replaceable>user</replaceable>:
- <replaceable>permstring</replaceable></term>
- <listitem>
- <para>
- Specifies permissions (privilege) for a user that are allowed
- to access &zebra; via the passwd system. There are two kinds
- of permissions currently: read (r) and write(w). By default
- users not listed in a permission directive are given the read
- privilege. To specify permissions for a user with no
- username, or &acro.z3950; anonymous style use
+ <varlistentry>
+ <term>perm.<replaceable>user</replaceable>:
+ <replaceable>permstring</replaceable></term>
+ <listitem>
+ <para>
+	 Specifies permissions (privileges) for a user allowed
+	 to access &zebra; via the passwd system. There are currently two kinds
+	 of permissions: read (r) and write (w). By default,
+	 users not listed in a permission directive are given the read
+	 privilege. To specify permissions for a user with no
+	 username (&acro.z3950; anonymous style), use
<literal>anonymous</literal>. The permstring consists of
- a sequence of characters. Include character <literal>w</literal>
- for write/update access, <literal>r</literal> for read access and
- <literal>a</literal> to allow anonymous access through this account.
- </para>
- </listitem>
- </varlistentry>
+	 a sequence of characters. Include the character <literal>w</literal>
+	 for write/update access, <literal>r</literal> for read access, and
+	 <literal>a</literal> to allow anonymous access through this account.
+ </para>
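+	<para>
+	 For instance (the <literal>admin</literal> user is hypothetical), to
+	 grant anonymous users read-only access and a named user full access:
+	 <screen>
+	  perm.anonymous: r
+	  perm.admin: rw
+	 </screen>
+	</para>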
+ </listitem>
+ </varlistentry>
- <varlistentry>
+ <varlistentry>
<term>dbaccess: <replaceable>accessfile</replaceable></term>
<listitem>
- <para>
- Names a file which lists database subscriptions for individual users.
- The access file should consists of lines of the form
- <literal>username: dbnames</literal>, where dbnames is a list of
- database names, separated by '+'. No whitespace is allowed in the
- database list.
- </para>
+ <para>
+ Names a file which lists database subscriptions for individual users.
+	 The access file should consist of lines of the form
+ <literal>username: dbnames</literal>, where dbnames is a list of
+ database names, separated by '+'. No whitespace is allowed in the
+ database list.
+ </para>
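+	<para>
+	 A sketch of such an access file (user and database names are
+	 invented):
+	 <screen>
+	  alice: books+journals
+	  bob: books
+	 </screen>
+	</para>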
</listitem>
- </varlistentry>
+ </varlistentry>
- <varlistentry>
+ <varlistentry>
<term>encoding: <replaceable>charsetname</replaceable></term>
<listitem>
- <para>
- Tells &zebra; to interpret the terms in Z39.50 queries as
- having been encoded using the specified character
- encoding. The default is <literal>ISO-8859-1</literal>; one
- useful alternative is <literal>UTF-8</literal>.
- </para>
+ <para>
+ Tells &zebra; to interpret the terms in Z39.50 queries as
+ having been encoded using the specified character
+ encoding. The default is <literal>ISO-8859-1</literal>; one
+ useful alternative is <literal>UTF-8</literal>.
+ </para>
</listitem>
- </varlistentry>
+ </varlistentry>
- <varlistentry>
+ <varlistentry>
<term>storeKeys: <replaceable>value</replaceable></term>
<listitem>
- <para>
- Specifies whether &zebra; keeps a copy of indexed keys.
- Use a value of 1 to enable; 0 to disable. If storeKeys setting is
- omitted, it is enabled. Enabled storeKeys
- are required for updating and deleting records. Disable only
- storeKeys to save space and only plan to index data once.
- </para>
+ <para>
+ Specifies whether &zebra; keeps a copy of indexed keys.
+	 Use a value of 1 to enable; 0 to disable. If the storeKeys setting is
+	 omitted, it is enabled. Stored keys
+	 are required for updating and deleting records. Disable
+	 storeKeys only to save space, and only if you plan to index the data once.
+ </para>
</listitem>
- </varlistentry>
+ </varlistentry>
- <varlistentry>
+ <varlistentry>
<term>storeData: <replaceable>value</replaceable></term>
<listitem>
- <para>
- Specifies whether &zebra; keeps a copy of indexed records.
- Use a value of 1 to enable; 0 to disable. If storeData setting is
- omitted, it is enabled. A storeData setting of 0 (disabled) makes
- Zebra fetch records from the original locaction in the file
- system using filename, file offset and file length. For the
- DOM and ALVIS filter, the storeData setting is ignored.
- </para>
+ <para>
+ Specifies whether &zebra; keeps a copy of indexed records.
+ Use a value of 1 to enable; 0 to disable. If storeData setting is
+ omitted, it is enabled. A storeData setting of 0 (disabled) makes
+	 &zebra; fetch records from their original location in the file
+	 system using the filename, file offset, and file length. For the
+ DOM and ALVIS filter, the storeData setting is ignored.
+ </para>
</listitem>
- </varlistentry>
+ </varlistentry>
- </variablelist>
- </para>
-
- </sect1>
-
- <sect1 id="locating-records">
- <title>Locating Records</title>
-
- <para>
- The default behavior of the &zebra; system is to reference the
- records from their original location, i.e. where they were found when you
- run <literal>zebraidx</literal>.
- That is, when a client wishes to retrieve a record
- following a search operation, the files are accessed from the place
-    where you originally put them - if you remove the files (without
-    running <literal>zebraidx</literal> again), the server will return
-    diagnostic number 14 ("System error in presenting records") to
- the client.
- </para>
-
- <para>
- If your input files are not permanent - for example if you retrieve
- your records from an outside source, or if they were temporarily
- mounted on a CD-ROM drive,
- you may want &zebra; to make an internal copy of them. To do this,
- you specify 1 (true) in the <literal>storeData</literal> setting. When
- the &acro.z3950; server retrieves the records they will be read from the
- internal file structures of the system.
- </para>
-
- </sect1>
-
- <sect1 id="simple-indexing">
- <title>Indexing with no Record IDs (Simple Indexing)</title>
-
- <para>
- If you have a set of records that are not expected to change over time
-    you can build your database without record IDs.
- This indexing method uses less space than the other methods and
- is simple to use.
- </para>
-
- <para>
- To use this method, you simply omit the <literal>recordId</literal> entry
- for the group of files that you index. To add a set of records you use
- <literal>zebraidx</literal> with the <literal>update</literal> command. The
- <literal>update</literal> command will always add all of the records that it
- encounters to the index - whether they have already been indexed or
-    not. If the set of indexed files changes, you should delete all of the
- index files, and build a new index from scratch.
- </para>
-
- <para>
- Consider a system in which you have a group of text files called
- <literal>simple</literal>.
- That group of records should belong to a &acro.z3950; database called
- <literal>textbase</literal>.
- The following <literal>zebra.cfg</literal> file will suffice:
- </para>
- <para>
-
- <screen>
- profilePath: /usr/local/idzebra/tab
- attset: bib1.att
- simple.recordType: text
- simple.database: textbase
- </screen>
+ </variablelist>
+ </para>
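+
+   <para>
+    To tie several of these settings together, a minimal, hypothetical
+    <literal>zebra.cfg</literal> might look like this (the database name
+    and paths are illustrative):
+    <screen>
+     profilePath: /usr/local/idzebra/tab
+     attset: bib1.att
+     storeData: 1
+     public.recordType: grs.sgml
+     public.database: mybase
+    </screen>
+   </para>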
- </para>
-
- <para>
- Since the existing records in an index can not be addressed by their
- IDs, it is impossible to delete or modify records when using this method.
- </para>
-
- </sect1>
-
- <sect1 id="file-ids">
- <title>Indexing with File Record IDs</title>
-
- <para>
- If you have a set of files that regularly change over time: Old files
- are deleted, new ones are added, or existing files are modified, you
- can benefit from using the <emphasis>file ID</emphasis>
- indexing methodology.
- Examples of this type of database might include an index of WWW
- resources, or a USENET news spool area.
- Briefly speaking, the file key methodology uses the directory paths
- of the individual records as a unique identifier for each record.
- To perform indexing of a directory with file keys, again, you specify
- the top-level directory after the <literal>update</literal> command.
- The command will recursively traverse the directories and compare
-    each one with whatever has been indexed before in that same directory.
- If a file is new (not in the previous version of the directory) it
- is inserted into the registers; if a file was already indexed and
- it has been modified since the last update, the index is also
- modified; if a file has been removed since the last
- visit, it is deleted from the index.
- </para>
-
- <para>
- The resulting system is easy to administrate. To delete a record you
- simply have to delete the corresponding file (say, with the
- <literal>rm</literal> command). And to add records you create new
- files (or directories with files). For your changes to take effect
- in the register you must run <literal>zebraidx update</literal> with
- the same directory root again. This mode of operation requires more
- disk space than simpler indexing methods, but it makes it easier for
- you to keep the index in sync with a frequently changing set of data.
- If you combine this system with the <emphasis>safe update</emphasis>
- facility (see below), you never have to take your server off-line for
- maintenance or register updating purposes.
- </para>
-
- <para>
- To enable indexing with pathname IDs, you must specify
- <literal>file</literal> as the value of <literal>recordId</literal>
- in the configuration file. In addition, you should set
- <literal>storeKeys</literal> to <literal>1</literal>, since the &zebra;
- indexer must save additional information about the contents of each record
- in order to modify the indexes correctly at a later time.
- </para>
-
- <!--
- FIXME - There must be a simpler way to do this with Adams string tags -H
- -->
+ </sect1>
+
+ <sect1 id="locating-records">
+ <title>Locating Records</title>
- <para>
- For example, to update records of group <literal>esdd</literal>
- located below
- <literal>/data1/records/</literal> you should type:
- <screen>
- $ zebraidx -g esdd update /data1/records
- </screen>
- </para>
-
- <para>
- The corresponding configuration file includes:
- <screen>
- esdd.recordId: file
- esdd.recordType: grs.sgml
- esdd.storeKeys: 1
- </screen>
- </para>
-
- <note>
- <para>You cannot start out with a group of records with simple
- indexing (no record IDs as in the previous section) and then later
- enable file record Ids. &zebra; must know from the first time that you
- index the group that
- the files should be indexed with file record IDs.
- </para>
- </note>
-
- <para>
- You cannot explicitly delete records when using this method (using the
- <literal>delete</literal> command to <literal>zebraidx</literal>. Instead
- you have to delete the files from the file system (or move them to a
- different location)
- and then run <literal>zebraidx</literal> with the
- <literal>update</literal> command.
- </para>
- <!-- ### what happens if a file contains multiple records? -->
-</sect1>
-
- <sect1 id="generic-ids">
- <title>Indexing with General Record IDs</title>
-
- <para>
- When using this method you construct an (almost) arbitrary, internal
- record key based on the contents of the record itself and other system
- information. If you have a group of records that explicitly associates
- an ID with each record, this method is convenient. For example, the
- record format may contain a title or a ID-number - unique within the group.
- In either case you specify the &acro.z3950; attribute set and use-attribute
- location in which this information is stored, and the system looks at
- that field to determine the identity of the record.
- </para>
-
- <para>
- As before, the record ID is defined by the <literal>recordId</literal>
- setting in the configuration file. The value of the record ID specification
- consists of one or more tokens separated by whitespace. The resulting
- ID is represented in the index by concatenating the tokens and
- separating them by ASCII value (1).
- </para>
-
- <para>
- There are three kinds of tokens:
- <variablelist>
-
- <varlistentry>
- <term>Internal record info</term>
- <listitem>
- <para>
- The token refers to a key that is
- extracted from the record. The syntax of this token is
- <literal>(</literal> <emphasis>set</emphasis> <literal>,</literal>
- <emphasis>use</emphasis> <literal>)</literal>,
- where <emphasis>set</emphasis> is the
- attribute set name <emphasis>use</emphasis> is the
- name or value of the attribute.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>System variable</term>
- <listitem>
- <para>
- The system variables are preceded by
-
- <screen>
- $
- </screen>
- and immediately followed by the system variable name, which
- may one of
- <variablelist>
-
- <varlistentry>
- <term>group</term>
- <listitem>
- <para>
- Group name.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>database</term>
- <listitem>
- <para>
- Current database specified.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>type</term>
- <listitem>
- <para>
- Record type.
- </para>
- </listitem>
- </varlistentry>
- </variablelist>
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>Constant string</term>
- <listitem>
- <para>
- A string used as part of the ID — surrounded
- by single- or double quotes.
- </para>
- </listitem>
- </varlistentry>
- </variablelist>
- </para>
-
- <para>
- For instance, the sample GILS records that come with the &zebra;
- distribution contain a unique ID in the data tagged Control-Identifier.
- The data is mapped to the &acro.bib1; use attribute Identifier-standard
- (code 1007). To use this field as a record id, specify
- <literal>(bib1,Identifier-standard)</literal> as the value of the
- <literal>recordId</literal> in the configuration file.
- If you have other record types that uses the same field for a
- different purpose, you might add the record type
- (or group or database name) to the record id of the gils
- records as well, to prevent matches with other types of records.
- In this case the recordId might be set like this:
-
- <screen>
- gils.recordId: $type (bib1,Identifier-standard)
- </screen>
-
- </para>
-
- <para>
- (see <xref linkend="grs"/>
- for details of how the mapping between elements of your records and
- searchable attributes is established).
- </para>
-
- <para>
- As for the file record ID case described in the previous section,
- updating your system is simply a matter of running
- <literal>zebraidx</literal>
- with the <literal>update</literal> command. However, the update with general
- keys is considerably slower than with file record IDs, since all files
- visited must be (re)read to discover their IDs.
- </para>
-
- <para>
- As you might expect, when using the general record IDs
- method, you can only add or modify existing records with the
- <literal>update</literal> command.
- If you wish to delete records, you must use the,
- <literal>delete</literal> command, with a directory as a parameter.
- This will remove all records that match the files below that root
- directory.
- </para>
-
- </sect1>
-
- <sect1 id="register-location">
- <title>Register Location</title>
-
- <para>
- Normally, the index files that form dictionaries, inverted
- files, record info, etc., are stored in the directory where you run
- <literal>zebraidx</literal>. If you wish to store these, possibly large,
- files somewhere else, you must add the <literal>register</literal>
- entry to the <literal>zebra.cfg</literal> file.
- Furthermore, the &zebra; system allows its file
- structures to span multiple file systems, which is useful for
- managing very large databases.
- </para>
-
- <para>
- The value of the <literal>register</literal> setting is a sequence
- of tokens. Each token takes the form:
-
- <emphasis>dir</emphasis><literal>:</literal><emphasis>size</emphasis>
-
- The <emphasis>dir</emphasis> specifies a directory in which index files
- will be stored and the <emphasis>size</emphasis> specifies the maximum
- size of all files in that directory. The &zebra; indexer system fills
- each directory in the order specified and use the next specified
- directories as needed.
- The <emphasis>size</emphasis> is an integer followed by a qualifier
- code,
- <literal>b</literal> for bytes,
- <literal>k</literal> for kilobytes.
- <literal>M</literal> for megabytes,
- <literal>G</literal> for gigabytes.
- Specifying a negative value disables the checking (it still needs the unit,
- use <literal>-1b</literal>).
- </para>
-
- <para>
- For instance, if you have allocated three disks for your register, and
- the first disk is mounted
- on <literal>/d1</literal> and has 2GB of free space, the
- second, mounted on <literal>/d2</literal> has 3.6 GB, and the third,
- on which you have more space than you bother to worry about, mounted on
- <literal>/d3</literal> you could put this entry in your configuration file:
-
- <screen>
- register: /d1:2G /d2:3600M /d3:-1b
- </screen>
- </para>
-
- <para>
- Note that &zebra; does not verify that the amount of space specified is
- actually available on the directory (file system) specified - it is
- your responsibility to ensure that enough space is available, and that
- other applications do not attempt to use the free space. In a large
- production system, it is recommended that you allocate one or more
- file system exclusively to the &zebra; register files.
- </para>
-
- </sect1>
-
- <sect1 id="shadow-registers">
- <title>Safe Updating - Using Shadow Registers</title>
-
- <sect2 id="shadow-registers-description">
- <title>Description</title>
-
<para>
- The &zebra; server supports <emphasis>updating</emphasis> of the index
- structures. That is, you can add, modify, or remove records from
- databases managed by &zebra; without rebuilding the entire index.
- Since this process involves modifying structured files with various
- references between blocks of data in the files, the update process
- is inherently sensitive to system crashes, or to process interruptions:
- Anything but a successfully completed update process will leave the
- register files in an unknown state, and you will essentially have no
- recourse but to re-index everything, or to restore the register files
- from a backup medium.
- Further, while the update process is active, users cannot be
- allowed to access the system, as the contents of the register files
- may change unpredictably.
+      The default behavior of the &zebra; system is to reference the
+      records from their original location, i.e. where they were found when you
+      ran <literal>zebraidx</literal>.
+      That is, when a client wishes to retrieve a record
+      following a search operation, the files are accessed from the place
+      where you originally put them - if you remove the files (without
+      running <literal>zebraidx</literal> again), the server will return
+      diagnostic number 14 (``System error in presenting records'') to
+      the client.
</para>
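The "reference in place" behavior can be pictured with a small sketch. This is plain Python, not &zebra; code; the index layout and function names are invented for illustration, and only the diagnostic number is taken from the text above:

```python
import os

# Sketch of "reference in place" retrieval: the index stores only the
# path where each record was found at indexing time (hypothetical layout).
index = {"doc1": "/data1/records/doc1.sgml"}

def present(sysno: str) -> str:
    """Return the record contents, or fail as with diagnostic 14."""
    path = index[sysno]
    if not os.path.exists(path):
        # Mirrors the failure mode described above: the register still
        # points at the old location, so presenting the record fails.
        raise LookupError("diagnostic 14: System error in presenting records")
    with open(path) as f:
        return f.read()
```

Deleting or moving the underlying file between indexing and retrieval triggers the error path, which is why non-permanent input files call for the <literal>storeData</literal> setting discussed next.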
-
+
<para>
- You can solve these problems by enabling the shadow register system in
- &zebra;.
- During the updating procedure, <literal>zebraidx</literal> will temporarily
- write changes to the involved files in a set of "shadow
- files", without modifying the files that are accessed by the
- active server processes. If the update procedure is interrupted by a
- system crash or a signal, you simply repeat the procedure - the
- register files have not been changed or damaged, and the partially
- written shadow files are automatically deleted before the new updating
- procedure commences.
+      If your input files are not permanent - for example, if you retrieve
+      your records from an outside source, or if they were temporarily
+      mounted on a CD-ROM drive - you may want &zebra; to make an internal
+      copy of them. To do this,
+      you specify 1 (true) in the <literal>storeData</literal> setting. When
+ the &acro.z3950; server retrieves the records they will be read from the
+ internal file structures of the system.
</para>
-
+
+ </sect1>
+
+ <sect1 id="simple-indexing">
+ <title>Indexing with no Record IDs (Simple Indexing)</title>
+
<para>
- At the end of the updating procedure (or in a separate operation, if
- you so desire), the system enters a "commit mode". First,
- any active server processes are forced to access those blocks that
- have been changed from the shadow files rather than from the main
- register files; the unmodified blocks are still accessed at their
- normal location (the shadow files are not a complete copy of the
- register files - they only contain those parts that have actually been
- modified). If the commit process is interrupted at any point during the
- commit process, the server processes will continue to access the
- shadow files until you can repeat the commit procedure and complete
- the writing of data to the main register files. You can perform
- multiple update operations to the registers before you commit the
- changes to the system files, or you can execute the commit operation
- at the end of each update operation. When the commit phase has
- completed successfully, any running server processes are instructed to
- switch their operations to the new, operational register, and the
- temporary shadow files are deleted.
+      If you have a set of records that are not expected to change over time
+      you can build your database without record IDs.
+      This indexing method uses less space than the other methods and
+      is simple to use.
</para>
-
- </sect2>
-
- <sect2 id="shadow-registers-how-to-use">
- <title>How to Use Shadow Register Files</title>
-
+
<para>
- The first step is to allocate space on your system for the shadow
- files.
- You do this by adding a <literal>shadow</literal> entry to the
- <literal>zebra.cfg</literal> file.
- The syntax of the <literal>shadow</literal> entry is exactly the
- same as for the <literal>register</literal> entry
- (see <xref linkend="register-location"/>).
- The location of the shadow area should be
- <emphasis>different</emphasis> from the location of the main register
- area (if you have specified one - remember that if you provide no
- <literal>register</literal> setting, the default register area is the
- working directory of the server and indexing processes).
+ To use this method, you simply omit the <literal>recordId</literal> entry
+ for the group of files that you index. To add a set of records you use
+ <literal>zebraidx</literal> with the <literal>update</literal> command. The
+ <literal>update</literal> command will always add all of the records that it
+ encounters to the index - whether they have already been indexed or
+      not. If the set of indexed files changes, you should delete all of the
+ index files, and build a new index from scratch.
</para>
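As a toy illustration (plain Python, nothing to do with the actual register format), an index without record IDs behaves like an append-only list - re-running <literal>update</literal> over the same files simply adds duplicate entries, which is why a changed file set requires a rebuild from scratch:

```python
# Toy model of simple indexing: with no record IDs, "update" can only
# append - it has no way to recognise a previously indexed record.
index = []

def update(files):
    for path in files:
        index.append(path)   # always inserted, even if seen before

update(["a.txt", "b.txt"])
update(["a.txt", "b.txt"])   # re-running over the same files duplicates them
```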
-
+
<para>
- The following excerpt from a <literal>zebra.cfg</literal> file shows
- one example of a setup that configures both the main register
- location and the shadow file area.
- Note that two directories or partitions have been set aside
- for the shadow file area. You can specify any number of directories
- for each of the file areas, but remember that there should be no
- overlaps between the directories used for the main registers and the
- shadow files, respectively.
+ Consider a system in which you have a group of text files called
+ <literal>simple</literal>.
+ That group of records should belong to a &acro.z3950; database called
+ <literal>textbase</literal>.
+ The following <literal>zebra.cfg</literal> file will suffice:
</para>
<para>
-
+
<screen>
- register: /d1:500M
- shadow: /scratch1:100M /scratch2:200M
+ profilePath: /usr/local/idzebra/tab
+ attset: bib1.att
+ simple.recordType: text
+ simple.database: textbase
</screen>
-
+
</para>
-
+
<para>
- When shadow files are enabled, an extra command is available at the
- <literal>zebraidx</literal> command line.
- In order to make changes to the system take effect for the
- users, you'll have to submit a "commit" command after a
- (sequence of) update operation(s).
+      Since the existing records in an index cannot be addressed by their
+      IDs, it is impossible to delete or modify records when using this method.
</para>
-
+
+ </sect1>
+
+ <sect1 id="file-ids">
+ <title>Indexing with File Record IDs</title>
+
<para>
-
- <screen>
- $ zebraidx update /d1/records
- $ zebraidx commit
- </screen>
-
+      If you have a set of files that changes regularly over time - old files
+      are deleted, new ones are added, or existing files are modified - you
+      can benefit from using the <emphasis>file ID</emphasis>
+      indexing methodology.
+ Examples of this type of database might include an index of WWW
+ resources, or a USENET news spool area.
+ Briefly speaking, the file key methodology uses the directory paths
+ of the individual records as a unique identifier for each record.
+ To perform indexing of a directory with file keys, again, you specify
+ the top-level directory after the <literal>update</literal> command.
+      The command will recursively traverse the directories and compare
+      each file with whatever has been indexed before in that same directory.
+ If a file is new (not in the previous version of the directory) it
+ is inserted into the registers; if a file was already indexed and
+ it has been modified since the last update, the index is also
+ modified; if a file has been removed since the last
+ visit, it is deleted from the index.
</para>
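The insert/modify/delete decision described above can be sketched by diffing two snapshots of a directory. This is hypothetical Python, not &zebra;'s implementation: paths serve as record keys and modification times stand in for change detection:

```python
def diff_snapshots(previous, current):
    """Compare {path: mtime} snapshots and classify each file.

    Mirrors the file-record-ID update logic: new paths are inserted,
    changed mtimes are modified, vanished paths are deleted.
    """
    inserted = [p for p in current if p not in previous]
    modified = [p for p in current
                if p in previous and current[p] != previous[p]]
    deleted = [p for p in previous if p not in current]
    return inserted, modified, deleted
```

Storing the previous snapshot alongside the index is what costs the extra disk space mentioned below, but it is also what lets the update remain incremental.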
-
+
<para>
- Or you can execute multiple updates before committing the changes:
+ The resulting system is easy to administrate. To delete a record you
+ simply have to delete the corresponding file (say, with the
+ <literal>rm</literal> command). And to add records you create new
+ files (or directories with files). For your changes to take effect
+ in the register you must run <literal>zebraidx update</literal> with
+ the same directory root again. This mode of operation requires more
+ disk space than simpler indexing methods, but it makes it easier for
+ you to keep the index in sync with a frequently changing set of data.
+ If you combine this system with the <emphasis>safe update</emphasis>
+ facility (see below), you never have to take your server off-line for
+ maintenance or register updating purposes.
</para>
-
+
<para>
-
- <screen>
- $ zebraidx -g books update /d1/records /d2/more-records
- $ zebraidx -g fun update /d3/fun-records
- $ zebraidx commit
- </screen>
-
+ To enable indexing with pathname IDs, you must specify
+ <literal>file</literal> as the value of <literal>recordId</literal>
+ in the configuration file. In addition, you should set
+ <literal>storeKeys</literal> to <literal>1</literal>, since the &zebra;
+ indexer must save additional information about the contents of each record
+ in order to modify the indexes correctly at a later time.
</para>
-
+
+ <!--
+ FIXME - There must be a simpler way to do this with Adams string tags -H
+ -->
+
<para>
- If one of the update operations above had been interrupted, the commit
- operation on the last line would fail: <literal>zebraidx</literal>
- will not let you commit changes that would destroy the running register.
- You'll have to rerun all of the update operations since your last
- commit operation, before you can commit the new changes.
+ For example, to update records of group <literal>esdd</literal>
+ located below
+ <literal>/data1/records/</literal> you should type:
+ <screen>
+ $ zebraidx -g esdd update /data1/records
+ </screen>
</para>
-
+
<para>
- Similarly, if the commit operation fails, <literal>zebraidx</literal>
- will not let you start a new update operation before you have
- successfully repeated the commit operation.
- The server processes will keep accessing the shadow files rather
- than the (possibly damaged) blocks of the main register files
- until the commit operation has successfully completed.
+ The corresponding configuration file includes:
+ <screen>
+ esdd.recordId: file
+ esdd.recordType: grs.sgml
+ esdd.storeKeys: 1
+ </screen>
</para>
-
+
+ <note>
+ <para>You cannot start out with a group of records with simple
+ indexing (no record IDs as in the previous section) and then later
+      enable file record IDs. &zebra; must know from the first time that you
+ index the group that
+ the files should be indexed with file record IDs.
+ </para>
+ </note>
+
<para>
- You should be aware that update operations may take slightly longer
- when the shadow register system is enabled, since more file access
- operations are involved. Further, while the disk space required for
- the shadow register data is modest for a small update operation, you
- may prefer to disable the system if you are adding a very large number
- of records to an already very large database (we use the terms
- <emphasis>large</emphasis> and <emphasis>modest</emphasis>
- very loosely here, since every application will have a
- different perception of size).
 -      To update the system without the use of the shadow files,
 -      simply run <literal>zebraidx</literal> with the <literal>-n</literal>
 -      option (note that you do not have to execute the
 -      <emphasis>commit</emphasis> command of <literal>zebraidx</literal>
 -      when you temporarily disable the use of the shadow registers in
 -      this fashion).
- Note also that, just as when the shadow registers are not enabled,
- server processes will be barred from accessing the main register
- while the update procedure takes place.
+      You cannot explicitly delete records when using this method (using the
+      <literal>delete</literal> command to <literal>zebraidx</literal>). Instead
+ you have to delete the files from the file system (or move them to a
+ different location)
+ and then run <literal>zebraidx</literal> with the
+ <literal>update</literal> command.
</para>
-
- </sect2>
-
- </sect1>
+ <!-- ### what happens if a file contains multiple records? -->
+ </sect1>
+ <sect1 id="generic-ids">
+ <title>Indexing with General Record IDs</title>
- <sect1 id="administration-ranking">
- <title>Relevance Ranking and Sorting of Result Sets</title>
-
- <sect2 id="administration-overview">
- <title>Overview</title>
<para>
- The default ordering of a result set is left up to the server,
- which inside &zebra; means sorting in ascending document ID order.
- This is not always the order humans want to browse the sometimes
- quite large hit sets. Ranking and sorting comes to the rescue.
+ When using this method you construct an (almost) arbitrary, internal
+ record key based on the contents of the record itself and other system
+ information. If you have a group of records that explicitly associates
+ an ID with each record, this method is convenient. For example, the
+      record format may contain a title or an ID number, unique within the group.
+ In either case you specify the &acro.z3950; attribute set and use-attribute
+ location in which this information is stored, and the system looks at
+ that field to determine the identity of the record.
</para>
- <para>
- In cases where a good presentation ordering can be computed at
- indexing time, we can use a fixed <literal>static ranking</literal>
- scheme, which is provided for the <literal>alvis</literal>
- indexing filter. This defines a fixed ordering of hit lists,
- independently of the query issued.
+ <para>
+ As before, the record ID is defined by the <literal>recordId</literal>
+ setting in the configuration file. The value of the record ID specification
+ consists of one or more tokens separated by whitespace. The resulting
+      ID is represented in the index by concatenating the tokens,
+      separated by ASCII value 1.
</para>
<para>
- There are cases, however, where relevance of hit set documents is
- highly dependent on the query processed.
- Simply put, <literal>dynamic relevance ranking</literal>
- sorts a set of retrieved records such that those most likely to be
- relevant to your request are retrieved first.
- Internally, &zebra; retrieves all documents that satisfy your
- query, and re-orders the hit list to arrange them based on
- a measurement of similarity between your query and the content of
- each record.
+ There are three kinds of tokens:
+ <variablelist>
+
+ <varlistentry>
+ <term>Internal record info</term>
+ <listitem>
+ <para>
+ The token refers to a key that is
+ extracted from the record. The syntax of this token is
+ <literal>(</literal> <emphasis>set</emphasis> <literal>,</literal>
+ <emphasis>use</emphasis> <literal>)</literal>,
+            where <emphasis>set</emphasis> is the
+            attribute set name and <emphasis>use</emphasis> is the
+            name or value of the attribute.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>System variable</term>
+ <listitem>
+          <para>
+            The system variables are preceded by
+            <literal>$</literal>
+            and immediately followed by the system variable name, which
+            may be one of
+ <variablelist>
+
+ <varlistentry>
+ <term>group</term>
+ <listitem>
+ <para>
+ Group name.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>database</term>
+ <listitem>
+ <para>
+ Current database specified.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>type</term>
+ <listitem>
+ <para>
+ Record type.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>Constant string</term>
+ <listitem>
+ <para>
+            A string used as part of the ID, surrounded
+            by single or double quotes.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
</para>
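Putting the three token kinds together, a spec such as <literal>$type (bib1,Identifier-standard)</literal> resolves each token and joins the results with ASCII value 1, as described above. A rough Python sketch - the parsing is simplified (no spaces inside parentheses) and the field-lookup structure is hypothetical:

```python
SEP = "\x01"   # ASCII value 1, the separator described above

def build_record_id(spec, sysvars, fields):
    """Resolve a recordId spec into the internal key.

    spec    - whitespace-separated tokens, e.g. "$type (bib1,Identifier-standard)"
    sysvars - {"group": ..., "database": ..., "type": ...}
    fields  - {(set, use): value} extracted from the record (simplified)
    """
    parts = []
    for tok in spec.split():
        if tok.startswith("$"):                          # system variable
            parts.append(sysvars[tok[1:]])
        elif tok.startswith("(") and tok.endswith(")"):  # internal record info
            attset, use = tok[1:-1].split(",")
            parts.append(fields[(attset, use)])
        else:                                            # constant string
            parts.append(tok.strip("'\""))
    return SEP.join(parts)
```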
<para>
- Finally, there are situations where hit sets of documents should be
- <literal>sorted</literal> during query time according to the
- lexicographical ordering of certain sort indexes created at
- indexing time.
- </para>
- </sect2>
+ For instance, the sample GILS records that come with the &zebra;
+ distribution contain a unique ID in the data tagged Control-Identifier.
+ The data is mapped to the &acro.bib1; use attribute Identifier-standard
+      (code 1007). To use this field as a record ID, specify
+      <literal>(bib1,Identifier-standard)</literal> as the value of the
+      <literal>recordId</literal> in the configuration file.
+      If you have other record types that use the same field for a
+      different purpose, you might add the record type
+      (or group or database name) to the record ID of the GILS
+      records as well, to prevent matches with other types of records.
+      In this case the recordId might be set like this:
+ <screen>
+ gils.recordId: $type (bib1,Identifier-standard)
+ </screen>
- <sect2 id="administration-ranking-static">
- <title>Static Ranking</title>
-
- <para>
- &zebra; uses internally inverted indexes to look up term frequencies
- in documents. Multiple queries from different indexes can be
- combined by the binary boolean operations <literal>AND</literal>,
- <literal>OR</literal> and/or <literal>NOT</literal> (which
- is in fact a binary <literal>AND NOT</literal> operation).
- To ensure fast query execution
- speed, all indexes have to be sorted in the same order.
</para>
+
<para>
- The indexes are normally sorted according to document
- <literal>ID</literal> in
- ascending order, and any query which does not invoke a special
- re-ranking function will therefore retrieve the result set in
- document
- <literal>ID</literal>
- order.
+ (see <xref linkend="grs"/>
+ for details of how the mapping between elements of your records and
+ searchable attributes is established).
</para>
+
<para>
- If one defines the
- <screen>
- staticrank: 1
- </screen>
- directive in the main core &zebra; configuration file, the internal document
- keys used for ordering are augmented by a preceding integer, which
- contains the static rank of a given document, and the index lists
- are ordered
- first by ascending static rank,
- then by ascending document <literal>ID</literal>.
- Zero
- is the ``best'' rank, as it occurs at the
- beginning of the list; higher numbers represent worse scores.
+ As for the file record ID case described in the previous section,
+ updating your system is simply a matter of running
+ <literal>zebraidx</literal>
+ with the <literal>update</literal> command. However, the update with general
+ keys is considerably slower than with file record IDs, since all files
+ visited must be (re)read to discover their IDs.
</para>
+
<para>
- The experimental <literal>alvis</literal> filter provides a
- directive to fetch static rank information out of the indexed &acro.xml;
- records, thus making <emphasis>all</emphasis> hit sets ordered
- after <emphasis>ascending</emphasis> static
- rank, and for those doc's which have the same static rank, ordered
- after <emphasis>ascending</emphasis> doc <literal>ID</literal>.
- See <xref linkend="record-model-alvisxslt"/> for the gory details.
+      As you might expect, when using the general record IDs
+      method, you can only add new records or modify existing ones with the
+      <literal>update</literal> command.
+      If you wish to delete records, you must use the
+      <literal>delete</literal> command, with a directory as a parameter.
+ This will remove all records that match the files below that root
+ directory.
</para>
- </sect2>
+ </sect1>
+
+ <sect1 id="register-location">
+ <title>Register Location</title>
- <sect2 id="administration-ranking-dynamic">
- <title>Dynamic Ranking</title>
<para>
- In order to fiddle with the static rank order, it is necessary to
- invoke additional re-ranking/re-ordering using dynamic
- ranking or score functions. These functions return positive
- integer scores, where <emphasis>highest</emphasis> score is
- ``best'';
- hit sets are sorted according to <emphasis>descending</emphasis>
- scores (in contrary
- to the index lists which are sorted according to
- ascending rank number and document ID).
+ Normally, the index files that form dictionaries, inverted
+ files, record info, etc., are stored in the directory where you run
+ <literal>zebraidx</literal>. If you wish to store these, possibly large,
+ files somewhere else, you must add the <literal>register</literal>
+ entry to the <literal>zebra.cfg</literal> file.
+ Furthermore, the &zebra; system allows its file
+ structures to span multiple file systems, which is useful for
+ managing very large databases.
</para>
+
<para>
- Dynamic ranking is enabled by a directive like one of the
- following in the zebra configuration file (use only one of these a time!):
- <screen>
 -      rank: rank-1        # default TF/IDF like
- rank: rank-static # dummy do-nothing
- </screen>
+ The value of the <literal>register</literal> setting is a sequence
+ of tokens. Each token takes the form:
+
+ <emphasis>dir</emphasis><literal>:</literal><emphasis>size</emphasis>
+
+ The <emphasis>dir</emphasis> specifies a directory in which index files
+ will be stored and the <emphasis>size</emphasis> specifies the maximum
+      size of all files in that directory. The &zebra; indexer system fills
+      each directory in the order specified and uses the next specified
+      directories as needed.
+ The <emphasis>size</emphasis> is an integer followed by a qualifier
+ code,
+ <literal>b</literal> for bytes,
+      <literal>k</literal> for kilobytes,
+ <literal>M</literal> for megabytes,
+ <literal>G</literal> for gigabytes.
+ Specifying a negative value disables the checking (it still needs the unit,
+ use <literal>-1b</literal>).
</para>
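The <emphasis>dir</emphasis>:<emphasis>size</emphasis> tokens can be parsed as in this sketch. It is plain Python, not &zebra; code; the qualifier letters and the negative-size convention come from the text above, while the power-of-two byte multipliers are an assumption:

```python
# Assumed multipliers; the b/k/M/G letters follow the register setting syntax.
MULT = {"b": 1, "k": 1024, "M": 1024**2, "G": 1024**3}

def parse_register(setting):
    """Parse a register setting like "/d1:2G /d2:3600M /d3:-1b".

    Returns (directory, max_bytes) pairs; a negative size means the
    space check is disabled for that directory (None here).
    """
    areas = []
    for token in setting.split():
        directory, size = token.rsplit(":", 1)
        number, unit = int(size[:-1]), size[-1]
        limit = None if number < 0 else number * MULT[unit]
        areas.append((directory, limit))
    return areas
```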
-
+
<para>
- Dynamic ranking is done at query time rather than
- indexing time (this is why we
- call it ``dynamic ranking'' in the first place ...)
- It is invoked by adding
- the &acro.bib1; relation attribute with
- value ``relevance'' to the &acro.pqf; query (that is,
- <literal>@attr 2=102</literal>, see also
- <ulink url="&url.z39.50;bib1.html">
- The &acro.bib1; Attribute Set Semantics</ulink>, also in
- <ulink url="&url.z39.50.attset.bib1;">HTML</ulink>).
- To find all articles with the word <literal>Eoraptor</literal> in
- the title, and present them relevance ranked, issue the &acro.pqf; query:
+      For instance, if you have allocated three disks for your register, and
+      the first disk is mounted
+      on <literal>/d1</literal> and has 2 GB of free space, the
+      second, mounted on <literal>/d2</literal>, has 3.6 GB, and the third,
+      on which you have more space than you care to worry about, mounted on
+      <literal>/d3</literal>, you could put this entry in your configuration file:
+
<screen>
- @attr 2=102 @attr 1=4 Eoraptor
+ register: /d1:2G /d2:3600M /d3:-1b
</screen>
</para>
- <sect3 id="administration-ranking-dynamic-rank1">
- <title>Dynamically ranking using &acro.pqf; queries with the 'rank-1'
- algorithm</title>
-
<para>
- The default <literal>rank-1</literal> ranking module implements a
 -     TF/IDF (Term Frequency over Inverse Document Frequency) like
- algorithm. In contrast to the usual definition of TF/IDF
- algorithms, which only considers searching in one full-text
- index, this one works on multiple indexes at the same time.
- More precisely,
- &zebra; does boolean queries and searches in specific addressed
- indexes (there are inverted indexes pointing from terms in the
- dictionary to documents and term positions inside documents).
- It works like this:
- <variablelist>
- <varlistentry>
- <term>Query Components</term>
- <listitem>
- <para>
- First, the boolean query is dismantled into its principal components,
- i.e. atomic queries where one term is looked up in one index.
- For example, the query
- <screen>
- @attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer
- </screen>
- is a boolean AND between the atomic parts
- <screen>
- @attr 2=102 @attr 1=1010 Utah
- </screen>
- and
- <screen>
- @attr 2=102 @attr 1=1018 Springer
- </screen>
- which gets processed each for itself.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Atomic hit lists</term>
- <listitem>
- <para>
- Second, for each atomic query, the hit list of documents is
- computed.
- </para>
- <para>
- In this example, two hit lists for each index
- <literal>@attr 1=1010</literal> and
- <literal>@attr 1=1018</literal> are computed.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Atomic scores</term>
- <listitem>
- <para>
- Third, each document in the hit list is assigned a score (_if_ ranking
- is enabled and requested in the query) using a TF/IDF scheme.
- </para>
- <para>
- In this example, both atomic parts of the query assign the magic
- <literal>@attr 2=102</literal> relevance attribute, and are
- to be used in the relevance ranking functions.
- </para>
- <para>
- It is possible to apply dynamic ranking on only parts of the
- &acro.pqf; query:
- <screen>
- @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
- </screen>
- searches for all documents which have the term 'Utah' on the
- body of text, and which have the term 'Springer' in the publisher
- field, and sort them in the order of the relevance ranking made on
- the body-of-text index only.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Hit list merging</term>
- <listitem>
- <para>
- Fourth, the atomic hit lists are merged according to the boolean
- conditions to a final hit list of documents to be returned.
- </para>
- <para>
- This step is always performed, independently of the fact that
- dynamic ranking is enabled or not.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Document score computation</term>
- <listitem>
- <para>
- Fifth, the total score of a document is computed as a linear
- combination of the atomic scores of the atomic hit lists
- </para>
- <para>
- Ranking weights may be used to pass a value to a ranking
- algorithm, using the non-standard &acro.bib1; attribute type 9.
- This allows one branch of a query to use one value while
- another branch uses a different one. For example, we can search
- for <literal>utah</literal> in the
- <literal>@attr 1=4</literal> index with weight 30, as
- well as in the <literal>@attr 1=1010</literal> index with weight 20:
- <screen>
- @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city
- </screen>
- </para>
- <para>
- The default weight is
- sqrt(1000) ~ 34 , as the &acro.z3950; standard prescribes that the top score
- is 1000 and the bottom score is 0, encoded in integers.
- </para>
- <warning>
- <para>
- The ranking-weight feature is experimental. It may change in future
- releases of zebra.
- </para>
- </warning>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Re-sorting of hit list</term>
- <listitem>
- <para>
- Finally, the final hit list is re-ordered according to scores.
- </para>
- </listitem>
- </varlistentry>
- </variablelist>
-
-
-<!--
-Still need to describe the exact TF/IDF formula. Here's the info, need -->
-<!--to extract it in human readable form .. MC
-
-static int calc (void *set_handle, zint sysno, zint staticrank,
- int *stop_flag)
-{
- int i, lo, divisor, score = 0;
- struct rank_set_info *si = (struct rank_set_info *) set_handle;
-
- if (!si->no_rank_entries)
- return -1; /* ranking not enabled for any terms */
-
- for (i = 0; i < si->no_entries; i++)
- {
- yaz_log(log_level, "calc: i=%d rank_flag=%d lo=%d",
- i, si->entries[i].rank_flag, si->entries[i].local_occur);
- if (si->entries[i].rank_flag && (lo = si->entries[i].local_occur))
- score += (8+log2_int (lo)) * si->entries[i].global_inv *
- si->entries[i].rank_weight;
- }
- divisor = si->no_rank_entries * (8+log2_int (si->last_pos/si->no_entries));
- score = score / divisor;
- yaz_log(log_level, "calc sysno=" ZINT_FORMAT " score=%d", sysno, score);
- if (score > 1000)
- score = 1000;
- /* reset the counts for the next term */
- for (i = 0; i < si->no_entries; i++)
- si->entries[i].local_occur = 0;
- return score;
-}
-
-
-where lo = si->entries[i].local_occur is the local documents term-within-index frequency, si->entries[i].global_inv represents the IDF part (computed in static void *begin()), and
-si->entries[i].rank_weight is the weight assigner per index (default 34, or set in the @attr 9=xyz magic)
-
-Finally, the IDF part is computed as:
-
-static void *begin (struct zebra_register *reg,
- void *class_handle, RSET rset, NMEM nmem,
- TERMID *terms, int numterms)
-{
- struct rank_set_info *si =
- (struct rank_set_info *) nmem_malloc (nmem,sizeof(*si));
- int i;
-
- yaz_log(log_level, "rank-1 begin");
- si->no_entries = numterms;
- si->no_rank_entries = 0;
- si->nmem=nmem;
- si->entries = (struct rank_term_info *)
- nmem_malloc (si->nmem, sizeof(*si->entries)*numterms);
- for (i = 0; i < numterms; i++)
- {
- zint g = rset_count(terms[i]->rset);
- yaz_log(log_level, "i=%d flags=%s '%s'", i,
- terms[i]->flags, terms[i]->name );
- if (!strncmp (terms[i]->flags, "rank,", 5))
- {
- const char *cp = strstr(terms[i]->flags+4, ",w=");
- si->entries[i].rank_flag = 1;
- if (cp)
- si->entries[i].rank_weight = atoi (cp+3);
- else
- si->entries[i].rank_weight = 34; /* sqrroot of 1000 */
- yaz_log(log_level, " i=%d weight=%d g="ZINT_FORMAT, i,
- si->entries[i].rank_weight, g);
- (si->no_rank_entries)++;
- }
- else
- si->entries[i].rank_flag = 0;
- si->entries[i].local_occur = 0; /* FIXME */
- si->entries[i].global_occur = g;
- si->entries[i].global_inv = 32 - log2_int (g);
- yaz_log(log_level, " global_inv = %d g = " ZINT_FORMAT,
- (int) (32-log2_int (g)), g);
- si->entries[i].term = terms[i];
- si->entries[i].term_index=i;
- terms[i]->rankpriv = &(si->entries[i]);
- }
- return si;
-}
-
-
-where g = rset_count(terms[i]->rset) is the count of all documents in this specific index hit list, and the IDF part then is
-
- si->entries[i].global_inv = 32 - log2_int (g);
- -->
-
+ Note that &zebra; does not verify that the amount of space specified is
+ actually available on the directory (file system) specified - it is
+ your responsibility to ensure that enough space is available, and that
+ other applications do not attempt to use the free space. In a large
+ production system, it is recommended that you allocate one or more
+ file systems exclusively to the &zebra; register files.
</para>
+ </sect1>
+
+ <sect1 id="shadow-registers">
+ <title>Safe Updating - Using Shadow Registers</title>
+
+ <sect2 id="shadow-registers-description">
+ <title>Description</title>
+
+ <para>
+ The &zebra; server supports <emphasis>updating</emphasis> of the index
+ structures. That is, you can add, modify, or remove records from
+ databases managed by &zebra; without rebuilding the entire index.
+ Since this process involves modifying structured files with various
+ references between blocks of data in the files, the update process
+ is inherently sensitive to system crashes, or to process interruptions:
+ Anything but a successfully completed update process will leave the
+ register files in an unknown state, and you will essentially have no
+ recourse but to re-index everything, or to restore the register files
+ from a backup medium.
+ Further, while the update process is active, users cannot be
+ allowed to access the system, as the contents of the register files
+ may change unpredictably.
+ </para>
+
+ <para>
+ You can solve these problems by enabling the shadow register system in
+ &zebra;.
+ During the updating procedure, <literal>zebraidx</literal> will temporarily
+ write changes to the involved files in a set of "shadow
+ files", without modifying the files that are accessed by the
+ active server processes. If the update procedure is interrupted by a
+ system crash or a signal, you simply repeat the procedure - the
+ register files have not been changed or damaged, and the partially
+ written shadow files are automatically deleted before the new updating
+ procedure commences.
+ </para>
+
+ <para>
+ At the end of the updating procedure (or in a separate operation, if
+ you so desire), the system enters a "commit mode". First,
+ any active server processes are forced to access those blocks that
+ have been changed from the shadow files rather than from the main
+ register files; the unmodified blocks are still accessed at their
+ normal location (the shadow files are not a complete copy of the
+ register files - they only contain those parts that have actually been
+ modified). If the commit process is interrupted at any
+ point, the server processes will continue to access the
+ shadow files until you can repeat the commit procedure and complete
+ the writing of data to the main register files. You can perform
+ multiple update operations to the registers before you commit the
+ changes to the system files, or you can execute the commit operation
+ at the end of each update operation. When the commit phase has
+ completed successfully, any running server processes are instructed to
+ switch their operations to the new, operational register, and the
+ temporary shadow files are deleted.
+ </para>
+
+ </sect2>
+
+ <sect2 id="shadow-registers-how-to-use">
+ <title>How to Use Shadow Register Files</title>
+
+ <para>
+ The first step is to allocate space on your system for the shadow
+ files.
+ You do this by adding a <literal>shadow</literal> entry to the
+ <literal>zebra.cfg</literal> file.
+ The syntax of the <literal>shadow</literal> entry is exactly the
+ same as for the <literal>register</literal> entry
+ (see <xref linkend="register-location"/>).
+ The location of the shadow area should be
+ <emphasis>different</emphasis> from the location of the main register
+ area (if you have specified one - remember that if you provide no
+ <literal>register</literal> setting, the default register area is the
+ working directory of the server and indexing processes).
+ </para>
+
+ <para>
+ The following excerpt from a <literal>zebra.cfg</literal> file shows
+ one example of a setup that configures both the main register
+ location and the shadow file area.
+ Note that two directories or partitions have been set aside
+ for the shadow file area. You can specify any number of directories
+ for each of the file areas, but remember that there should be no
+ overlaps between the directories used for the main registers and the
+ shadow files, respectively.
+ </para>
+ <para>
+
+ <screen>
+ register: /d1:500M
+ shadow: /scratch1:100M /scratch2:200M
+ </screen>
+
+ </para>
+
+ <para>
+ When shadow files are enabled, an extra command is available at the
+ <literal>zebraidx</literal> command line.
+ In order to make changes to the system take effect for the
+ users, you'll have to submit a "commit" command after a
+ (sequence of) update operation(s).
+ </para>
+
+ <para>
+
+ <screen>
+ $ zebraidx update /d1/records
+ $ zebraidx commit
+ </screen>
+
+ </para>
+
+ <para>
+ Or you can execute multiple updates before committing the changes:
+ </para>
+
+ <para>
+
+ <screen>
+ $ zebraidx -g books update /d1/records /d2/more-records
+ $ zebraidx -g fun update /d3/fun-records
+ $ zebraidx commit
+ </screen>
+
+ </para>
+
+ <para>
+ If one of the update operations above had been interrupted, the commit
+ operation on the last line would fail: <literal>zebraidx</literal>
+ will not let you commit changes that would destroy the running register.
+ You'll have to rerun all of the update operations since your last
+ commit operation, before you can commit the new changes.
+ </para>
+
+ <para>
+ Similarly, if the commit operation fails, <literal>zebraidx</literal>
+ will not let you start a new update operation before you have
+ successfully repeated the commit operation.
+ The server processes will keep accessing the shadow files rather
+ than the (possibly damaged) blocks of the main register files
+ until the commit operation has successfully completed.
+ </para>
+
+ <para>
+ You should be aware that update operations may take slightly longer
+ when the shadow register system is enabled, since more file access
+ operations are involved. Further, while the disk space required for
+ the shadow register data is modest for a small update operation, you
+ may prefer to disable the system if you are adding a very large number
+ of records to an already very large database (we use the terms
+ <emphasis>large</emphasis> and <emphasis>modest</emphasis>
+ very loosely here, since every application will have a
+ different perception of size).
+ To update the system without the use of the shadow files,
+ simply run <literal>zebraidx</literal> with the <literal>-n</literal>
+ option (note that you do not have to execute the
+ <emphasis>commit</emphasis> command of <literal>zebraidx</literal>
+ when you temporarily disable the use of the shadow registers in
+ this fashion).
+ Note also that, just as when the shadow registers are not enabled,
+ server processes will be barred from accessing the main register
+ while the update procedure takes place.
+ </para>
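+ <para>
+ For example, assuming the record directory
+ <literal>/d1/records</literal> used earlier, a one-off update that
+ bypasses the shadow registers could be run as:
+ <screen>
+ $ zebraidx -n update /d1/records
+ </screen>
+ No subsequent <literal>commit</literal> is needed in this case.
+ </para>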
+
+ </sect2>
+
+ </sect1>
+
+
+ <sect1 id="administration-ranking">
+ <title>Relevance Ranking and Sorting of Result Sets</title>
+
+ <sect2 id="administration-overview">
+ <title>Overview</title>
+ <para>
+ The default ordering of a result set is left up to the server,
+ which inside &zebra; means sorting in ascending document ID order.
+ This is not always the order in which humans want to browse the
+ sometimes quite large hit sets. Ranking and sorting come to the rescue.
+ </para>
+
+ <para>
+ In cases where a good presentation ordering can be computed at
+ indexing time, we can use a fixed <literal>static ranking</literal>
+ scheme, which is provided for the <literal>alvis</literal>
+ indexing filter. This defines a fixed ordering of hit lists,
+ independently of the query issued.
+ </para>
+
+ <para>
+ There are cases, however, where relevance of hit set documents is
+ highly dependent on the query processed.
+ Simply put, <literal>dynamic relevance ranking</literal>
+ sorts a set of retrieved records such that those most likely to be
+ relevant to your request are retrieved first.
+ Internally, &zebra; retrieves all documents that satisfy your
+ query, and re-orders the hit list to arrange them based on
+ a measurement of similarity between your query and the content of
+ each record.
+ </para>
+
+ <para>
+ Finally, there are situations where hit sets of documents should be
+ <literal>sorted</literal> during query time according to the
+ lexicographical ordering of certain sort indexes created at
+ indexing time.
+ </para>
+ </sect2>
+
+
+ <sect2 id="administration-ranking-static">
+ <title>Static Ranking</title>
+
+ <para>
+ &zebra; internally uses inverted indexes to look up term frequencies
+ in documents. Multiple queries from different indexes can be
+ combined by the binary boolean operations <literal>AND</literal>,
+ <literal>OR</literal> and/or <literal>NOT</literal> (which
+ is in fact a binary <literal>AND NOT</literal> operation).
+ To ensure fast query execution
+ speed, all indexes have to be sorted in the same order.
+ </para>
+ <para>
+ The indexes are normally sorted according to document
+ <literal>ID</literal> in
+ ascending order, and any query which does not invoke a special
+ re-ranking function will therefore retrieve the result set in
+ document
+ <literal>ID</literal>
+ order.
+ </para>
+ <para>
+ If one defines the
+ <screen>
+ staticrank: 1
+ </screen>
+ directive in the main &zebra; configuration file, the internal document
+ keys used for ordering are augmented by a preceding integer, which
+ contains the static rank of a given document, and the index lists
+ are ordered
+ first by ascending static rank,
+ then by ascending document <literal>ID</literal>.
+ Zero
+ is the ``best'' rank, as it occurs at the
+ beginning of the list; higher numbers represent worse scores.
+ </para>
+ <para>
+ The experimental <literal>alvis</literal> filter provides a
+ directive to fetch static rank information out of the indexed &acro.xml;
+ records, thus making <emphasis>all</emphasis> hit sets ordered
+ by <emphasis>ascending</emphasis> static
+ rank and, for those documents which share the same static rank,
+ by <emphasis>ascending</emphasis> document <literal>ID</literal>.
+ See <xref linkend="record-model-alvisxslt"/> for the gory details.
+ </para>
+ </sect2>
+
+
+ <sect2 id="administration-ranking-dynamic">
+ <title>Dynamic Ranking</title>
+ <para>
+ In order to fiddle with the static rank order, it is necessary to
+ invoke additional re-ranking/re-ordering using dynamic
+ ranking or score functions. These functions return positive
+ integer scores, where <emphasis>highest</emphasis> score is
+ ``best'';
+ hit sets are sorted according to <emphasis>descending</emphasis>
+ scores (in contrast
+ to the index lists, which are sorted according to
+ ascending rank number and document ID).
+ </para>
+ <para>
+ Dynamic ranking is enabled by a directive like one of the
+ following in the zebra configuration file (use only one of these at a time!):
+ <screen>
+ rank: rank-1 # default TF/IDF like
+ rank: rank-static # dummy do-nothing
+ </screen>
+ </para>
<para>
- The <literal>rank-1</literal> algorithm
- does not use the static rank
- information in the list keys, and will produce the same ordering
- with or without static ranking enabled.
+ Dynamic ranking is done at query time rather than at
+ indexing time (which is why it is
+ called ``dynamic ranking'' in the first place).
+ It is invoked by adding
+ the &acro.bib1; relation attribute with
+ value ``relevance'' to the &acro.pqf; query (that is,
+ <literal>@attr 2=102</literal>, see also
+ <ulink url="&url.z39.50;bib1.html">
+ The &acro.bib1; Attribute Set Semantics</ulink>, also in
+ <ulink url="&url.z39.50.attset.bib1;">HTML</ulink>).
+ To find all articles with the word <literal>Eoraptor</literal> in
+ the title, and present them relevance ranked, issue the &acro.pqf; query:
+ <screen>
+ @attr 2=102 @attr 1=4 Eoraptor
+ </screen>
</para>
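+ <para>
+ Using a &acro.z3950; client such as <literal>yaz-client</literal>
+ (the host and port here are examples only), the same ranked search
+ could be issued interactively:
+ <screen>
+ Z> open localhost:9999
+ Z> find @attr 2=102 @attr 1=4 Eoraptor
+ Z> show 1
+ </screen>
+ </para>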
-
- <!--
<sect3 id="administration-ranking-dynamic-rank1">
- <title>Dynamically ranking &acro.pqf; queries with the 'rank-static'
+ <title>Dynamically ranking &acro.pqf; queries with the 'rank-1'
algorithm</title>
- <para>
- The dummy <literal>rank-static</literal> reranking/scoring
- function returns just
- <literal>score = max int - staticrank</literal>
- in order to preserve the static ordering of hit sets that would
- have been produced had it not been invoked.
- Obviously, to combine static and dynamic ranking usefully,
- it is necessary
- to make a new ranking
- function; this is left
- as an exercise for the reader.
- </para>
- </sect3>
- -->
-
- <warning>
+
+ <para>
+ The default <literal>rank-1</literal> ranking module implements a
+ TF/IDF (Term Frequency times Inverse Document Frequency)-like
+ algorithm. In contrast to the usual definition of TF/IDF
+ algorithms, which only considers searching in one full-text
+ index, this one works on multiple indexes at the same time.
+ More precisely,
+ &zebra; does boolean queries and searches in specific addressed
+ indexes (there are inverted indexes pointing from terms in the
+ dictionary to documents and term positions inside documents).
+ It works like this:
+ <variablelist>
+ <varlistentry>
+ <term>Query Components</term>
+ <listitem>
+ <para>
+ First, the boolean query is dismantled into its principal components,
+ i.e. atomic queries where one term is looked up in one index.
+ For example, the query
+ <screen>
+ @attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer
+ </screen>
+ is a boolean AND between the atomic parts
+ <screen>
+ @attr 2=102 @attr 1=1010 Utah
+ </screen>
+ and
+ <screen>
+ @attr 2=102 @attr 1=1018 Springer
+ </screen>
+ each of which is processed separately.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Atomic hit lists</term>
+ <listitem>
+ <para>
+ Second, for each atomic query, the hit list of documents is
+ computed.
+ </para>
+ <para>
+ In this example, two hit lists, one for each of the indexes
+ <literal>@attr 1=1010</literal> and
+ <literal>@attr 1=1018</literal>, are computed.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Atomic scores</term>
+ <listitem>
+ <para>
+ Third, each document in the hit list is assigned a score (<emphasis>if</emphasis> ranking
+ is enabled and requested in the query) using a TF/IDF scheme.
+ </para>
+ <para>
+ In this example, both atomic parts of the query carry the magic
+ <literal>@attr 2=102</literal> relevance attribute, and are
+ therefore used in the relevance ranking functions.
+ </para>
+ <para>
+ It is possible to apply dynamic ranking on only parts of the
+ &acro.pqf; query:
+ <screen>
+ @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
+ </screen>
+ searches for all documents which have the term 'Utah' in the
+ body of text, and which have the term 'Springer' in the publisher
+ field, and sorts them in the order of the relevance ranking made on
+ the body-of-text index only.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Hit list merging</term>
+ <listitem>
+ <para>
+ Fourth, the atomic hit lists are merged according to the boolean
+ conditions into a final hit list of documents to be returned.
+ </para>
+ <para>
+ This step is always performed, regardless of whether
+ dynamic ranking is enabled or not.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Document score computation</term>
+ <listitem>
+ <para>
+ Fifth, the total score of a document is computed as a linear
+ combination of the atomic scores of the atomic hit lists.
+ </para>
+ <para>
+ Ranking weights may be used to pass a value to a ranking
+ algorithm, using the non-standard &acro.bib1; attribute type 9.
+ This allows one branch of a query to use one value while
+ another branch uses a different one. For example, we can search
+ for <literal>utah</literal> in the
+ <literal>@attr 1=4</literal> index with weight 30, as
+ well as in the <literal>@attr 1=1010</literal> index with weight 20:
+ <screen>
+ @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city
+ </screen>
+ </para>
+ <para>
+ The default weight is
+ sqrt(1000) ~ 34, as the &acro.z3950; standard prescribes that the top score
+ is 1000 and the bottom score is 0, encoded in integers.
+ </para>
+ <warning>
+ <para>
+ The ranking-weight feature is experimental. It may change in future
+ releases of &zebra;.
+ </para>
+ </warning>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Re-sorting of hit list</term>
+ <listitem>
+ <para>
+ Finally, the merged hit list is re-ordered according to these scores.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ </para>
+
+
<para>
- <literal>Dynamic ranking</literal> is not compatible
- with <literal>estimated hit sizes</literal>, as all documents in
- a hit set must be accessed to compute the correct placing in a
- ranking sorted list. Therefore the use attribute setting
- <literal>@attr 2=102</literal> clashes with
- <literal>@attr 9=integer</literal>.
+ The <literal>rank-1</literal> algorithm
+ does not use the static rank
+ information in the list keys, and will produce the same ordering
+ with or without static ranking enabled.
</para>
- </warning>
- <!--
- we might want to add ranking like this:
- UNPUBLISHED:
- Simple BM25 Extension to Multiple Weighted Fields
- Stephen Robertson, Hugo Zaragoza and Michael Taylor
- Microsoft Research
- ser@microsoft.com
- hugoz@microsoft.com
- mitaylor2microsoft.com
- -->
+
+ <!--
+ <sect3 id="administration-ranking-dynamic-rank1">
+ <title>Dynamically ranking &acro.pqf; queries with the 'rank-static'
+ algorithm</title>
+ <para>
+ The dummy <literal>rank-static</literal> reranking/scoring
+ function returns just
+ <literal>score = max int - staticrank</literal>
+ in order to preserve the static ordering of hit sets that would
+ have been produced had it not been invoked.
+ Obviously, to combine static and dynamic ranking usefully,
+ it is necessary
+ to make a new ranking
+ function; this is left
+ as an exercise for the reader.
+ </para>
+ </sect3>
+ -->
+
+ <warning>
+ <para>
+ <literal>Dynamic ranking</literal> is not compatible
+ with <literal>estimated hit sizes</literal>, as all documents in
+ a hit set must be accessed to compute the correct placing in a
+ ranking sorted list. Therefore the use attribute setting
+ <literal>@attr 2=102</literal> clashes with
+ <literal>@attr 9=integer</literal>.
+ </para>
+ </warning>
+
+ <!--
+ we might want to add ranking like this:
+ UNPUBLISHED:
+ Simple BM25 Extension to Multiple Weighted Fields
+ Stephen Robertson, Hugo Zaragoza and Michael Taylor
+ Microsoft Research
+ ser@microsoft.com
+ hugoz@microsoft.com
+ mitaylor2microsoft.com
+ -->
</sect3>
<screen>
relationModifier.relevant = 2=102
</screen>
- invokes dynamic ranking each time a &acro.cql; query of the form
+ invokes dynamic ranking each time a &acro.cql; query of the form
<screen>
Z> querytype cql
Z> f alvis.text =/relevant house
<screen>
index.alvis.text = 1=text 2=102
</screen>
- which then invokes dynamic ranking each time a &acro.cql; query of the form
+ which then invokes dynamic ranking each time a &acro.cql; query of the form
<screen>
Z> querytype cql
Z> f alvis.text = house
</screen>
is issued.
</para>
-
+
</sect3>
- </sect2>
+ </sect2>
- <sect2 id="administration-ranking-sorting">
- <title>Sorting</title>
- <para>
+ <sect2 id="administration-ranking-sorting">
+ <title>Sorting</title>
+ <para>
&zebra; sorts efficiently using special sorting indexes
(type=<literal>s</literal>), so each sortable index must be known
at indexing time and specified in the configuration of record
indexing. For example, to enable sorting according to the &acro.bib1;
<literal>Date/time-added-to-db</literal> field, one could add the line
<screen>
- xelm /*/@created Date/time-added-to-db:s
+ xelm /*/@created Date/time-added-to-db:s
</screen>
to any <literal>.abs</literal> record-indexing configuration file.
Similarly, one could add an indexing element of the form
- <screen><![CDATA[
+ <screen><![CDATA[
<z:index name="date-modified" type="s">
- <xsl:value-of select="some/xpath"/>
- </z:index>
+ <xsl:value-of select="some/xpath"/>
+ </z:index>
]]></screen>
to any <literal>alvis</literal>-filter indexing stylesheet.
- </para>
- <para>
- Indexing can be specified at searching time using a query term
- carrying the non-standard
- &acro.bib1; attribute-type <literal>7</literal>. This removes the
- need to send a &acro.z3950; <literal>Sort Request</literal>
- separately, and can dramatically improve latency when the client
- and server are on separate networks.
- The sorting part of the query is separate from the rest of the
- query - the actual search specification - and must be combined
- with it using OR.
- </para>
- <para>
- A sorting subquery needs two attributes: an index (such as a
- &acro.bib1; type-1 attribute) specifying which index to sort on, and a
- type-7 attribute whose value is be <literal>1</literal> for
- ascending sorting, or <literal>2</literal> for descending. The
- term associated with the sorting attribute is the priority of
- the sort key, where <literal>0</literal> specifies the primary
- sort key, <literal>1</literal> the secondary sort key, and so
- on.
- </para>
+ </para>
+ <para>
+ Sorting can be specified at search time using a query term
+ carrying the non-standard
+ &acro.bib1; attribute-type <literal>7</literal>. This removes the
+ need to send a &acro.z3950; <literal>Sort Request</literal>
+ separately, and can dramatically improve latency when the client
+ and server are on separate networks.
+ The sorting part of the query is separate from the rest of the
+ query - the actual search specification - and must be combined
+ with it using OR.
+ </para>
+ <para>
+ A sorting subquery needs two attributes: an index (such as a
+ &acro.bib1; type-1 attribute) specifying which index to sort on, and a
+ type-7 attribute whose value is <literal>1</literal> for
+ ascending sorting, or <literal>2</literal> for descending. The
+ term associated with the sorting attribute is the priority of
+ the sort key, where <literal>0</literal> specifies the primary
+ sort key, <literal>1</literal> the secondary sort key, and so
+ on.
+ </para>
<para>For example, a search for water, sort by title (ascending),
- is expressed by the &acro.pqf; query
+ is expressed by the &acro.pqf; query
<screen>
- @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
+ @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
</screen>
- whereas a search for water, sort by title ascending,
+ whereas a search for water, sort by title ascending,
then date descending would be
<screen>
- @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
+ @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
</screen>
</para>
<para>
Notice the fundamental differences between <literal>dynamic
- ranking</literal> and <literal>sorting</literal>: there can be
+ ranking</literal> and <literal>sorting</literal>: there can be
only one ranking function defined and configured; but multiple
sorting indexes can be specified dynamically at search
time. Ranking does not need to use specific indexes, so
dynamic ranking can be enabled and disabled without
re-indexing; whereas, sorting indexes need to be
defined before indexing.
- </para>
+ </para>
+
+ </sect2>
- </sect2>
+ </sect1>
- </sect1>
+ <sect1 id="administration-extended-services">
+ <title>Extended Services: Remote Insert, Update and Delete</title>
- <sect1 id="administration-extended-services">
- <title>Extended Services: Remote Insert, Update and Delete</title>
-
<note>
<para>
Extended services are only supported when accessing the &zebra;
not support extended services.
</para>
</note>
-
- <para>
+
+ <para>
The extended services are not enabled by default in &zebra;, since
they modify the system. &zebra; can be configured
to allow anybody to
perm.admin: rw
passwd: passwordfile
</screen>
- And in the password file
+ And in the password file
<filename>passwordfile</filename>, you have to specify users and
- encrypted passwords as colon separated strings.
- Use a tool like <filename>htpasswd</filename>
- to maintain the encrypted passwords.
- <screen>
+ encrypted passwords as colon separated strings.
+ Use a tool like <filename>htpasswd</filename>
+ to maintain the encrypted passwords.
+ <screen>
admin:secret
</screen>
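+ For example, a password file with user <literal>admin</literal> could
+ be created with the Apache <filename>htpasswd</filename> tool (the
+ encryption scheme accepted may depend on your installation):
+ <screen>
+ $ htpasswd -c passwordfile admin
+ </screen>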
- It is essential to configure &zebra; to store records internally,
+ It is essential to configure &zebra; to store records internally,
and to support
modifications and deletion of records:
<screen>
The general record type should be set to any record filter which
is able to parse &acro.xml; records; you may use either of the two
declarations (but not both simultaneously!)
- <screen>
+ <screen>
recordType: dom.filter_dom_conf.xml
# recordType: grs.xml
</screen>
Notice the difference from the specific instructions
- <screen>
+ <screen>
recordType.xml: dom.filter_dom_conf.xml
# recordType.xml: grs.xml
- </screen>
+ </screen>
which only work when indexing XML files from the filesystem using
the <literal>*.xml</literal> naming convention.
</para>
<screen>
shadow: directoryname: size (e.g. 1000M)
</screen>
- See <xref linkend="zebra-cfg"/> for additional information on
- these configuration options.
+ See <xref linkend="zebra-cfg"/> for additional information on
+ these configuration options.
</para>
<note>
<para>
limitations of the <ulink url="&url.z39.50;">&acro.z3950;</ulink>
protocol. Therefore, indexing filters cannot be chosen on a
per-record basis. One and only one general &acro.xml; indexing filter
- must be defined.
+ must be defined.
<!-- but because it is represented as an OID, we would need some
form of proprietary mapping scheme between record type strings and
OIDs. -->
<!--
However, as a minimum, it would be extremely useful to enable
people to use &acro.marc21;, assuming grs.marcxml.marc21 as a record
- type.
+ type.
-->
</para>
</note>
servers to accept special binary <emphasis>extended services</emphasis>
protocol packages, which may be used to insert, update and delete
records on servers. These carry control and update
- information to the servers, which are encoded in seven package fields:
+ information to the servers, which are encoded in seven package fields:
</para>
<table id="administration-extended-services-z3950-table" frame="top">
<title>Extended services &acro.z3950; Package Fields</title>
- <tgroup cols="3">
- <thead>
+ <tgroup cols="3">
+ <thead>
<row>
- <entry>Parameter</entry>
- <entry>Value</entry>
- <entry>Notes</entry>
- </row>
+ <entry>Parameter</entry>
+ <entry>Value</entry>
+ <entry>Notes</entry>
+ </row>
</thead>
- <tbody>
- <row>
- <entry><literal>type</literal></entry>
- <entry><literal>'update'</literal></entry>
- <entry>Must be set to trigger extended services</entry>
- </row>
- <row>
- <entry><literal>action</literal></entry>
- <entry><literal>string</literal></entry>
+ <tbody>
+ <row>
+ <entry><literal>type</literal></entry>
+ <entry><literal>'update'</literal></entry>
+ <entry>Must be set to trigger extended services</entry>
+ </row>
+ <row>
+ <entry><literal>action</literal></entry>
+ <entry><literal>string</literal></entry>
<entry>
- Extended service action type with
+ Extended service action type with
one of four possible values: <literal>recordInsert</literal>,
<literal>recordReplace</literal>,
<literal>recordDelete</literal>,
and <literal>specialUpdate</literal>
</entry>
- </row>
- <row>
- <entry><literal>record</literal></entry>
- <entry><literal>&acro.xml; string</literal></entry>
- <entry>An &acro.xml; formatted string containing the record</entry>
- </row>
+ </row>
+ <row>
+ <entry><literal>record</literal></entry>
+ <entry><literal>&acro.xml; string</literal></entry>
+ <entry>An &acro.xml; formatted string containing the record</entry>
+ </row>
<row>
<entry><literal>syntax</literal></entry>
<entry><literal>'xml'</literal></entry>
The default filter (record type) as given by recordType in
zebra.cfg is used to parse the record.</entry>
</row>
- <row>
- <entry><literal>recordIdOpaque</literal></entry>
- <entry><literal>string</literal></entry>
- <entry>
+ <row>
+ <entry><literal>recordIdOpaque</literal></entry>
+ <entry><literal>string</literal></entry>
+ <entry>
Optional client-supplied, opaque record
identifier used under insert operations.
</entry>
- </row>
- <row>
- <entry><literal>recordIdNumber </literal></entry>
- <entry><literal>positive number</literal></entry>
- <entry>&zebra;'s internal system number,
- not allowed for <literal>recordInsert</literal> or
+ </row>
+ <row>
+ <entry><literal>recordIdNumber </literal></entry>
+ <entry><literal>positive number</literal></entry>
+ <entry>&zebra;'s internal system number,
+ not allowed for <literal>recordInsert</literal> or
<literal>specialUpdate</literal> actions which result in fresh
record inserts.
</entry>
- </row>
- <row>
- <entry><literal>databaseName</literal></entry>
- <entry><literal>database identifier</literal></entry>
+ </row>
+ <row>
+ <entry><literal>databaseName</literal></entry>
+ <entry><literal>database identifier</literal></entry>
<entry>
- The name of the database to which the extended services should be
+ The name of the database to which the extended services should be
applied.
</entry>
- </row>
+ </row>
</tbody>
- </tgroup>
- </table>
+ </tgroup>
+ </table>
- <para>
- The <literal>action</literal> parameter can be any of
- <literal>recordInsert</literal> (will fail if the record already exists),
- <literal>recordReplace</literal> (will fail if the record does not exist),
- <literal>recordDelete</literal> (will fail if the record does not
- exist), and
- <literal>specialUpdate</literal> (will insert or update the record
- as needed, record deletion is not possible).
- </para>
+ <para>
+ The <literal>action</literal> parameter can be any of
+ <literal>recordInsert</literal> (will fail if the record already exists),
+ <literal>recordReplace</literal> (will fail if the record does not exist),
+ <literal>recordDelete</literal> (will fail if the record does not
+ exist), and
+ <literal>specialUpdate</literal> (will insert or update the record
+ as needed, record deletion is not possible).
+ </para>
<para>
During all actions, the
usual rules for internal record ID generation apply, unless an
optional <literal>recordIdNumber</literal> &zebra; internal ID or a
- <literal>recordIdOpaque</literal> string identifier is assigned.
+ <literal>recordIdOpaque</literal> string identifier is assigned.
The default ID generation is
configured using the <literal>recordId:</literal> directive from
- <filename>zebra.cfg</filename>.
- See <xref linkend="zebra-cfg"/>.
+ <filename>zebra.cfg</filename>.
+ See <xref linkend="zebra-cfg"/>.
</para>
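For example, content-based ID generation is configured with a
<literal>recordId:</literal> line in <filename>zebra.cfg</filename>;
the field chosen here is only an illustration and must be one that
your records actually carry:

```
# zebra.cfg: derive the internal record ID from an indexed field,
# so that re-submitted records with the same identifier match up
recordId: (bib1,Identifier-standard)
```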
- <para>
- Setting of the <literal>recordIdNumber</literal> parameter,
- which must be an existing &zebra; internal system ID number, is not
- allowed during any <literal>recordInsert</literal> or
+ <para>
+ Setting of the <literal>recordIdNumber</literal> parameter,
+ which must be an existing &zebra; internal system ID number, is not
+ allowed during any <literal>recordInsert</literal> or
<literal>specialUpdate</literal> action resulting in fresh record
- inserts.
+ inserts.
</para>
<para>
When retrieving existing
- records indexed with &acro.grs1; indexing filters, the &zebra; internal
+ records indexed with &acro.grs1; indexing filters, the &zebra; internal
ID number is returned in the field
- <literal>/*/id:idzebra/localnumber</literal> in the namespace
- <literal>xmlns:id="http://www.indexdata.dk/zebra/"</literal>,
- where it can be picked up for later record updates or deletes.
+ <literal>/*/id:idzebra/localnumber</literal> in the namespace
+ <literal>xmlns:id="http://www.indexdata.dk/zebra/"</literal>,
+ where it can be picked up for later record updates or deletes.
</para>
-
+
<para>
A new element set for retrieval of internal record
data has been added, which can be used to access minimal records
See <xref linkend="special-retrieval"/>.
</para>
- <para>
+ <para>
The <literal>recordIdOpaque</literal> string parameter
is a client-supplied, opaque record
- identifier, which may be used under
+ identifier, which may be used under
insert, update and delete operations. The
client software is responsible for assigning these to
records. This identifier will
replace zebra's own automagic identifier generation with a unique
- mapping from <literal>recordIdOpaque</literal> to the
+ mapping from <literal>recordIdOpaque</literal> to the
&zebra; internal <literal>recordIdNumber</literal>.
<emphasis>The opaque <literal>recordIdOpaque</literal> string
- identifiers
+ identifiers
are not visible in retrieval records, nor are they
searchable, so the value of this parameter is
questionable. It serves mostly as a convenient mapping from
application domain string identifiers to &zebra; internal ID's.
- </emphasis>
+ </emphasis>
</para>
</sect2>
-
- <sect2 id="administration-extended-services-yaz-client">
- <title>Extended services from yaz-client</title>
- <para>
- We can now start a yaz-client admin session and create a database:
- <screen>
- <![CDATA[
- $ yaz-client localhost:9999 -u admin/secret
- Z> adm-create
- ]]>
- </screen>
- Now the <literal>Default</literal> database was created,
- we can insert an &acro.xml; file (esdd0006.grs
- from example/gils/records) and index it:
- <screen>
- <![CDATA[
- Z> update insert id1234 esdd0006.grs
- ]]>
- </screen>
- The 3rd parameter - <literal>id1234</literal> here -
- is the <literal>recordIdOpaque</literal> package field.
- </para>
- <para>
- Actually, we should have a way to specify "no opaque record id" for
- yaz-client's update command.. We'll fix that.
- </para>
- <para>
- The newly inserted record can be searched as usual:
- <screen>
- <![CDATA[
- Z> f utah
- Sent searchRequest.
- Received SearchResponse.
- Search was a success.
- Number of hits: 1, setno 1
- SearchResult-1: term=utah cnt=1
- records returned: 0
- Elapsed: 0.014179
- ]]>
- </screen>
- </para>
- <para>
- Let's delete the beast, using the same
+ <sect2 id="administration-extended-services-yaz-client">
+ <title>Extended services from yaz-client</title>
+
+ <para>
+ We can now start a yaz-client admin session and create a database:
+ <screen>
+ <![CDATA[
+ $ yaz-client localhost:9999 -u admin/secret
+ Z> adm-create
+ ]]>
+ </screen>
+      Now that the <literal>Default</literal> database has been created,
+ we can insert an &acro.xml; file (esdd0006.grs
+ from example/gils/records) and index it:
+ <screen>
+ <![CDATA[
+ Z> update insert id1234 esdd0006.grs
+ ]]>
+ </screen>
+ The 3rd parameter - <literal>id1234</literal> here -
+ is the <literal>recordIdOpaque</literal> package field.
+ </para>
+ <para>
+ Actually, we should have a way to specify "no opaque record id" for
+      yaz-client's update command. We'll fix that.
+ </para>
+ <para>
+ The newly inserted record can be searched as usual:
+ <screen>
+ <![CDATA[
+ Z> f utah
+ Sent searchRequest.
+ Received SearchResponse.
+ Search was a success.
+ Number of hits: 1, setno 1
+ SearchResult-1: term=utah cnt=1
+ records returned: 0
+ Elapsed: 0.014179
+ ]]>
+ </screen>
+ </para>
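A modified version of the record can be written back with the replace
action in the same fashion (a sketch, not verified against a live
server; it assumes yaz-client's <literal>update</literal> command also
accepts <literal>replace</literal> and that the record is still indexed
under the opaque id <literal>id1234</literal>):

```
Z> update replace id1234 esdd0006.grs
Got extended services response
Status: done
```

Since a <literal>recordReplace</literal> action fails if the record
does not already exist, this is a safe way to make sure you are
modifying, not duplicating, a record.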
+ <para>
+ Let's delete the beast, using the same
<literal>recordIdOpaque</literal> string parameter:
- <screen>
- <![CDATA[
- Z> update delete id1234
- No last record (update ignored)
- Z> update delete 1 esdd0006.grs
- Got extended services response
- Status: done
- Elapsed: 0.072441
- Z> f utah
- Sent searchRequest.
- Received SearchResponse.
- Search was a success.
- Number of hits: 0, setno 2
- SearchResult-1: term=utah cnt=0
- records returned: 0
- Elapsed: 0.013610
- ]]>
+ <screen>
+ <![CDATA[
+ Z> update delete id1234
+ No last record (update ignored)
+ Z> update delete 1 esdd0006.grs
+ Got extended services response
+ Status: done
+ Elapsed: 0.072441
+ Z> f utah
+ Sent searchRequest.
+ Received SearchResponse.
+ Search was a success.
+ Number of hits: 0, setno 2
+ SearchResult-1: term=utah cnt=0
+ records returned: 0
+ Elapsed: 0.013610
+ ]]>
</screen>
</para>
<para>
- If shadow register is enabled in your
- <filename>zebra.cfg</filename>,
- you must run the adm-commit command
- <screen>
- <![CDATA[
- Z> adm-commit
- ]]>
- </screen>
+      If the shadow register is enabled in your
+ <filename>zebra.cfg</filename>,
+ you must run the adm-commit command
+ <screen>
+ <![CDATA[
+ Z> adm-commit
+ ]]>
+ </screen>
after each update session in order to write your changes from the
shadow to the live register space.
- </para>
- </sect2>
+ </para>
+ </sect2>
-
- <sect2 id="administration-extended-services-yaz-php">
- <title>Extended services from yaz-php</title>
- <para>
- Extended services are also available from the &yaz; &acro.php; client layer. An
- example of an &yaz;-&acro.php; extended service transaction is given here:
- <screen>
- <![CDATA[
- $record = '<record><title>A fine specimen of a record</title></record>';
-
- $options = array('action' => 'recordInsert',
- 'syntax' => 'xml',
- 'record' => $record,
- 'databaseName' => 'mydatabase'
- );
-
- yaz_es($yaz, 'update', $options);
- yaz_es($yaz, 'commit', array());
- yaz_wait();
-
- if ($error = yaz_error($yaz))
- echo "$error";
- ]]>
- </screen>
+ <sect2 id="administration-extended-services-yaz-php">
+ <title>Extended services from yaz-php</title>
+
+ <para>
+ Extended services are also available from the &yaz; &acro.php; client layer. An
+ example of an &yaz;-&acro.php; extended service transaction is given here:
+ <screen>
+ <![CDATA[
+ $record = '<record><title>A fine specimen of a record</title></record>';
+
+ $options = array('action' => 'recordInsert',
+ 'syntax' => 'xml',
+ 'record' => $record,
+ 'databaseName' => 'mydatabase'
+ );
+
+ yaz_es($yaz, 'update', $options);
+ yaz_es($yaz, 'commit', array());
+ yaz_wait();
+
+ if ($error = yaz_error($yaz))
+ echo "$error";
+ ]]>
+ </screen>
</para>
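Deleting the record again follows the same pattern (a sketch under the
assumption that <literal>$yaz</literal> is the same open connection and
<literal>$record</literal> still holds the record content &zebra; uses
for matching):

```php
// same connection ($yaz) and record as in the insert example above
$options = array('action'       => 'recordDelete',
                 'syntax'       => 'xml',
                 'record'       => $record,
                 'databaseName' => 'mydatabase'
                );

yaz_es($yaz, 'update', $options);
yaz_es($yaz, 'commit', array());
yaz_wait();

if ($error = yaz_error($yaz))
    echo "$error";
```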
- </sect2>
+ </sect2>
<sect2 id="administration-extended-services-debugging">
<title>Extended services debugging guide</title>
<itemizedlist>
<listitem>
<para>
- Make sure you have a nice record on your filesystem, which you can
+ Make sure you have a nice record on your filesystem, which you can
index from the filesystem by use of the zebraidx command.
Do it exactly as you planned, using one of the GRS-1 filters,
- or the DOMXML filter.
+ or the DOMXML filter.
When this works, proceed.
</para>
</listitem>
<listitem>
<para>
- Check that your server setup is OK before you even coded one single
+      Check that your server setup is OK before you even code a single
line of PHP using ES.
- Take the same record form the file system, and send as ES via
+      Take the same record from the file system, and send it as an ES request via
<literal>yaz-client</literal> as described in
<xref linkend="administration-extended-services-yaz-client"/>,
and
remember the <literal>-a</literal> option which tells you what
goes over the wire! Notice also the section on permissions:
- try
+ try
<screen>
perm.anonymous: rw
</screen>
- in <literal>zebra.cfg</literal> to make sure you do not run into
- permission problems (but never expose such an insecure setup on the
+ in <literal>zebra.cfg</literal> to make sure you do not run into
+ permission problems (but never expose such an insecure setup on the
internet!!!). Then, make sure to set the general
<literal>recordType</literal> instruction, pointing correctly
to the GRS-1 filters,
</listitem>
<listitem>
<para>
- If you insist on using the <literal>sysno</literal> in the
- <literal>recordIdNumber</literal> setting,
- please make sure you do only updates and deletes. Zebra's internal
+ If you insist on using the <literal>sysno</literal> in the
+ <literal>recordIdNumber</literal> setting,
+ please make sure you do only updates and deletes. Zebra's internal
system number is not allowed for
- <literal>recordInsert</literal> or
- <literal>specialUpdate</literal> actions
+ <literal>recordInsert</literal> or
+ <literal>specialUpdate</literal> actions
which result in fresh record inserts.
</para>
</listitem>
<listitem>
<para>
- If <literal>shadow register</literal> is enabled in your
- <literal>zebra.cfg</literal>, you must remember running the
+      If the <literal>shadow register</literal> is enabled in your
+      <literal>zebra.cfg</literal>, you must remember to run the
<screen>
Z> adm-commit
</screen>
</sect2>
- </sect1>
+ </sect1>
-</chapter>
+ </chapter>
<!-- Keep this comment at the end of the file
Local variables:
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
- sgml-parent-document: "zebra.xml"
+ sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
<section id="architecture-representation">
<title>Local Representation</title>
-
+
<para>
As mentioned earlier, &zebra; places few restrictions on the type of
data that you can index and manage. Generally, whatever the form of
indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server. Both are using some of the
same main components, which are presented here.
- </para>
- <para>
+ </para>
+ <para>
The virtual Debian package <literal>idzebra-2.0</literal>
installs all the necessary packages to start
working with &zebra; - including utility programs, development libraries,
- documentation and modules.
- </para>
-
+ documentation and modules.
+ </para>
+
<section id="componentcore">
<title>Core &zebra; Libraries Containing Common Functionality</title>
<para>
The core &zebra; module is the meat of the <command>zebraidx</command>
indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server binaries. Shortly, the core
- libraries are responsible for
+ libraries are responsible for
<variablelist>
<varlistentry>
<term>Dynamic Loading</term>
<listitem>
<para>of external filter modules, in case the application is
not compiled statically. These filter modules define indexing,
- search and retrieval capabilities of the various input formats.
+ search and retrieval capabilities of the various input formats.
</para>
</listitem>
</varlistentry>
construction of hit lists according to boolean combinations
of simpler searches. Fast performance is achieved by careful
use of index structures, and by evaluating specific index hit
- lists in correct order.
+          lists in the correct order.
</para>
</listitem>
</varlistentry>
components call resorting/re-ranking algorithms on the hit
sets. These might also be pre-sorted not only using the
assigned document ID's, but also using assigned static rank
- information.
+ information.
</para>
</listitem>
</varlistentry>
</varlistentry>
</variablelist>
</para>
- <para>
- The Debian package <literal>libidzebra-2.0</literal>
- contains all run-time libraries for &zebra;, the
- documentation in PDF and HTML is found in
+ <para>
+ The Debian package <literal>libidzebra-2.0</literal>
+ contains all run-time libraries for &zebra;, the
+ documentation in PDF and HTML is found in
<literal>idzebra-2.0-doc</literal>, and
<literal>idzebra-2.0-common</literal>
includes common essential &zebra; configuration files.
</para>
</section>
-
+
<section id="componentindexer">
<title>&zebra; Indexer</title>
<para>
The <command>zebraidx</command>
- indexing maintenance utility
+ indexing maintenance utility
loads external filter modules used for indexing data records of
different type, and creates, updates and drops databases and
indexes according to the rules defined in the filter modules.
- </para>
- <para>
+ </para>
+ <para>
The Debian package <literal>idzebra-2.0-utils</literal> contains
the <command>zebraidx</command> utility.
</para>
<para>
This is the executable which runs the &acro.z3950;/&acro.sru;/&acro.srw; server and
glues together the core libraries and the filter modules into one
- great Information Retrieval server application.
- </para>
- <para>
+ great Information Retrieval server application.
+ </para>
+ <para>
The Debian package <literal>idzebra-2.0-utils</literal> contains
the <command>zebrasrv</command> utility.
</para>
<section id="componentyazserver">
<title>&yaz; Server Frontend</title>
<para>
- The &yaz; server frontend is
+ The &yaz; server frontend is
a full fledged stateful &acro.z3950; server taking client
- connections, and forwarding search and scan requests to the
+ connections, and forwarding search and scan requests to the
&zebra; core indexer.
</para>
<para>
In addition to &acro.z3950; requests, the &yaz; server frontend acts
as HTTP server, honoring
- <ulink url="&url.sru;">&acro.sru; &acro.soap;</ulink>
- requests, and
+ <ulink url="&url.sru;">&acro.sru; &acro.soap;</ulink>
+ requests, and
&acro.sru; &acro.rest;
requests. Moreover, it can
- translate incoming
+ translate incoming
<ulink url="&url.cql;">&acro.cql;</ulink>
queries to
<ulink url="&url.yaz.pqf;">&acro.pqf;</ulink>
queries, if
- correctly configured.
+ correctly configured.
</para>
<para>
<ulink url="&url.yaz;">&yaz;</ulink>
- is an Open Source
+ is an Open Source
toolkit that allows you to develop software using the
&acro.ansi; &acro.z3950;/ISO23950 standard for information retrieval.
- It is packaged in the Debian packages
+ It is packaged in the Debian packages
<literal>yaz</literal> and <literal>libyaz</literal>.
</para>
</section>
-
+
<section id="componentmodules">
<title>Record Models and Filter Modules</title>
<para>
- The hard work of knowing <emphasis>what</emphasis> to index,
+ The hard work of knowing <emphasis>what</emphasis> to index,
<emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
part of the records to send in a search/retrieve response is
- implemented in
+ implemented in
various filter modules. It is their responsibility to define the
exact indexing and record display filtering rules.
</para>
<para>
The virtual Debian package
<literal>libidzebra-2.0-modules</literal> installs all base filter
- modules.
+ modules.
</para>
<section id="componentmodulesdom">
<title>&acro.dom; &acro.xml; Record Model and Filter Module</title>
<para>
The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as
- internal data model, and can thus parse, index, and display
+ internal data model, and can thus parse, index, and display
any &acro.xml; document.
</para>
<para>
A parser for binary &acro.marc; records based on the ISO2709 library
standard is provided; it transforms these into the internal
- &acro.marcxml; &acro.dom; representation.
+ &acro.marcxml; &acro.dom; representation.
</para>
<para>
The internal &acro.dom; &acro.xml; representation can be fed into four
different pipelines, consisting of arbitrarily many successive
- &acro.xslt; transformations; these are for
+ &acro.xslt; transformations; these are for
<itemizedlist>
<listitem><para>input parsing and initial
transformations,</para></listitem>
static ranks.
</para>
<para>
- Details on the experimental &acro.dom; &acro.xml; filter are found in
+ Details on the experimental &acro.dom; &acro.xml; filter are found in
<xref linkend="record-model-domxml"/>.
</para>
<para>
<note>
<para>
The functionality of this record model has been improved and
- replaced by the &acro.dom; &acro.xml; record model. See
+ replaced by the &acro.dom; &acro.xml; record model. See
<xref linkend="componentmodulesdom"/>.
</para>
</note>
<para>
The Alvis filter for &acro.xml; files is an &acro.xslt; based input
- filter.
+ filter.
It indexes element and attribute content of any thinkable &acro.xml; format
using full &acro.xpath; support, a feature which the standard &zebra;
&acro.grs1; &acro.sgml; and &acro.xml; filters lacked. The indexed documents are
according to availability of memory.
</para>
<para>
- The Alvis filter
+ The Alvis filter
uses &acro.xslt; display stylesheets, which let
the &zebra; DB administrator associate multiple, different views on
the same &acro.xml; document type. These views are chosen on-the-fly in
Finally, the Alvis filter allows for static ranking at index
time, and for sorting hit lists according to predefined
static ranks. This imposes no overhead at all; both
- search and indexing perform still
+      search and indexing still perform in
<emphasis>O(1)</emphasis> time irrespective of document
collection size. This feature resembles Google's pre-ranking using
their PageRank algorithm.
</para>
<para>
- Details on the experimental Alvis &acro.xslt; filter are found in
+ Details on the experimental Alvis &acro.xslt; filter are found in
<xref linkend="record-model-alvisxslt"/>.
</para>
<para>
<note>
<para>
The functionality of this record model has been improved and
- replaced by the &acro.dom; &acro.xml; record model. See
+ replaced by the &acro.dom; &acro.xml; record model. See
<xref linkend="componentmodulesdom"/>.
</para>
</note>
<para>
- The &acro.grs1; filter modules described in
+ The &acro.grs1; filter modules described in
<xref linkend="grs"/>
are all based on the &acro.z3950; specifications, and it is absolutely
mandatory to have the reference pages on &acro.bib1; attribute sets on
to the <filename>*.abs</filename> configuration file suffix.
</para>
<para>
- The <emphasis>grs.marc</emphasis> and
+ The <emphasis>grs.marc</emphasis> and
<emphasis>grs.marcxml</emphasis> filters are suited to parse and
- index binary and &acro.xml; versions of traditional library &acro.marc; records
+ index binary and &acro.xml; versions of traditional library &acro.marc; records
based on the ISO2709 standard. The Debian package for both
- filters is
+ filters is
<literal>libidzebra-2.0-mod-grs-marc</literal>.
</para>
<para>
&acro.grs1; TCL scriptable filters for extensive user configuration come
- in two flavors: a regular expression filter
+ in two flavors: a regular expression filter
<emphasis>grs.regx</emphasis> using TCL regular expressions, and
- a general scriptable TCL filter called
- <emphasis>grs.tcl</emphasis>
- are both included in the
+ a general scriptable TCL filter called
+ <emphasis>grs.tcl</emphasis>
+ are both included in the
<literal>libidzebra-2.0-mod-grs-regx</literal> Debian package.
</para>
<para>
A general purpose &acro.sgml; filter is called
<emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
- but planned to be in the
+ but planned to be in the
<literal>libidzebra-2.0-mod-grs-sgml</literal> Debian package.
</para>
<para>
- The Debian package
- <literal>libidzebra-2.0-mod-grs-xml</literal> includes the
+ The Debian package
+ <literal>libidzebra-2.0-mod-grs-xml</literal> includes the
<emphasis>grs.xml</emphasis> filter which uses <ulink
- url="&url.expat;">Expat</ulink> to
+ url="&url.expat;">Expat</ulink> to
parse records in &acro.xml; and turn them into ID&zebra;'s internal &acro.grs1; node
trees. Also have a look at the Alvis &acro.xml;/&acro.xslt; filter described in
the next section.
</para>
</section>
-
+
<section id="componentmodulestext">
<title>TEXT Record Model and Filter Module</title>
<para>
<itemizedlist>
<listitem>
-
+
<para>
When records are accessed by the system, they are represented
in their local, or native format. This might be &acro.sgml; or HTML files,
</entry>
<entry>
Get facet of a result set. The facet result is returned
- as if it was a normal record, while in reality is a
- recap of most "important" terms in a result set for the fields
+            as if it were a normal record, while in reality it is a
+            recap of the most "important" terms in a result set for the fields
given.
The facet facility first appeared in Zebra 2.0.20.
</entry>
</screen>
</para>
<para>
- The special
- <literal>zebra::data</literal> element set name is
- defined for any record syntax, but will always fetch
+ The special
+ <literal>zebra::data</literal> element set name is
+ defined for any record syntax, but will always fetch
the raw record data in exactly the original form. No record syntax
- specific transformations will be applied to the raw record data.
+ specific transformations will be applied to the raw record data.
</para>
<para>
- Also, &zebra; internal metadata about the record can be accessed:
+ Also, &zebra; internal metadata about the record can be accessed:
<screen>
Z> f @attr 1=title my
Z> format xml
Z> elements zebra::meta::sysno
Z> s 1+1
- </screen>
+ </screen>
displays in <literal>&acro.xml;</literal> record syntax only internal
- record system number, whereas
+ record system number, whereas
<screen>
Z> f @attr 1=title my
Z> format xml
Z> elements zebra::meta
Z> s 1+1
- </screen>
+ </screen>
displays all available metadata on the record. These include system
number, database name, indexed filename, filter used for indexing,
score and static ranking information and finally bytesize of record.
indexed how and in which indexes. Using the indexing stylesheet of
the Alvis filter, one can at least see which portion of the record
went into which index, but a similar aid does not exist for all
- other indexing filters.
+ other indexing filters.
</para>
<para>
The special
<literal>zebra::index</literal> element set names are provided to
access information on per record indexed fields. For example, the
- queries
+ queries
<screen>
Z> f @attr 1=title my
Z> format sutrs
</screen>
will display all indexed tokens from all indexed fields of the
first record, and it will display them in <literal>&acro.sutrs;</literal>
- record syntax, whereas
+ record syntax, whereas
<screen>
Z> f @attr 1=title my
Z> format xml
Z> s 1+1
Z> elements zebra::index::title:p
Z> s 1+1
- </screen>
+ </screen>
displays in <literal>&acro.xml;</literal> record syntax only the content
of the zebra string index <literal>title</literal>, or
even only the type <literal>p</literal> phrase indexed part of it.
</note>
</section>
- </chapter>
+ </chapter>
<!-- Keep this comment at the end of the file
Local variables:
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
- sgml-parent-document: "zebra.xml"
+ sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
-<chapter id="examples">
- <title>Example Configurations</title>
-
- <sect1 id="examples-overview">
- <title>Overview</title>
-
- <para>
- <command>zebraidx</command> and
- <command>zebrasrv</command> are both
- driven by a master configuration file, which may refer to other
- subsidiary configuration files. By default, they try to use
- <filename>zebra.cfg</filename> in the working directory as the
- master file; but this can be changed using the <literal>-c</literal>
- option to specify an alternative master configuration file.
- </para>
- <para>
- The master configuration file tells &zebra;:
- <itemizedlist>
-
- <listitem>
- <para>
- Where to find subsidiary configuration files, including both
- those that are named explicitly and a few ``magic'' files such
- as <literal>default.idx</literal>,
- which specifies the default indexing rules.
- </para>
- </listitem>
+ <chapter id="examples">
+ <title>Example Configurations</title>
- <listitem>
- <para>
- What record schemas to support. (Subsidiary files specify how
- to index the contents of records in those schemas, and what
- format to use when presenting records in those schemas to client
- software.)
- </para>
- </listitem>
+ <sect1 id="examples-overview">
+ <title>Overview</title>
- <listitem>
- <para>
- What attribute sets to recognise in searches. (Subsidiary files
- specify how to interpret the attributes in terms
- of the indexes that are created on the records.)
- </para>
- </listitem>
+ <para>
+ <command>zebraidx</command> and
+ <command>zebrasrv</command> are both
+ driven by a master configuration file, which may refer to other
+ subsidiary configuration files. By default, they try to use
+ <filename>zebra.cfg</filename> in the working directory as the
+ master file; but this can be changed using the <literal>-c</literal>
+ option to specify an alternative master configuration file.
+ </para>
+ <para>
+ The master configuration file tells &zebra;:
+ <itemizedlist>
- <listitem>
- <para>
- Policy details such as what type of input format to expect when
- adding new records, what low-level indexing algorithm to use,
- how to identify potential duplicate records, etc.
- </para>
- </listitem>
-
- </itemizedlist>
- </para>
- <para>
- Now let's see what goes in the <literal>zebra.cfg</literal> file
- for some example configurations.
- </para>
- </sect1>
-
- <sect1 id="example1">
- <title>Example 1: &acro.xml; Indexing And Searching</title>
-
- <para>
- This example shows how &zebra; can be used with absolutely minimal
- configuration to index a body of
- <ulink url="&url.xml;">&acro.xml;</ulink>
- documents, and search them using
- <ulink url="&url.xpath;">XPath</ulink>
- expressions to specify access points.
- </para>
- <para>
- Go to the <literal>examples/zthes</literal> subdirectory
- of the distribution archive.
- There you will find a <literal>Makefile</literal> that will
- populate the <literal>records</literal> subdirectory with a file of
- <ulink url="http://zthes.z3950.org/">Zthes</ulink>
- records representing a taxonomic hierarchy of dinosaurs. (The
- records are generated from the family tree in the file
- <literal>dino.tree</literal>.)
- Type <literal>make records/dino.xml</literal>
- to make the &acro.xml; data file.
- (Or you could just type <literal>make dino</literal> to build the &acro.xml;
- data file, create the database and populate it with the taxonomic
- records all in one shot - but then you wouldn't learn anything,
- would you? :-)
- </para>
- <para>
- Now we need to create a &zebra; database to hold and index the &acro.xml;
- records. We do this with the
- &zebra; indexer, <command>zebraidx</command>, which is
- driven by the <literal>zebra.cfg</literal> configuration file.
- For our purposes, we don't need any
- special behaviour - we can use the defaults - so we can start with a
- minimal file that just tells <command>zebraidx</command> where to
- find the default indexing rules, and how to parse the records:
- <screen>
- profilePath: .:../../tab
- recordType: grs.sgml
- </screen>
- </para>
- <para>
- That's all you need for a minimal &zebra; configuration. Now you can
- roll the &acro.xml; records into the database and build the indexes:
- <screen>
- zebraidx update records
- </screen>
- </para>
- <para>
- Now start the server. Like the indexer, its behaviour is
- controlled by the
- <literal>zebra.cfg</literal> file; and like the indexer, it works
- just fine with this minimal configuration.
- <screen>
- zebrasrv
- </screen>
- By default, the server listens on IP port number 9999, although
- this can easily be changed - see
- <xref linkend="zebrasrv"/>.
- </para>
- <para>
- Now you can use the &acro.z3950; client program of your choice to execute
- XPath-based boolean queries and fetch the &acro.xml; records that satisfy
- them:
- <screen>
- $ yaz-client @:9999
- Connecting...Ok.
- Z> find @attr 1=/Zthes/termName Sauroposeidon
- Number of hits: 1
- Z> format xml
- Z> show 1
- <Zthes>
+ <listitem>
+ <para>
+ Where to find subsidiary configuration files, including both
+ those that are named explicitly and a few ``magic'' files such
+ as <literal>default.idx</literal>,
+ which specifies the default indexing rules.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ What record schemas to support. (Subsidiary files specify how
+ to index the contents of records in those schemas, and what
+ format to use when presenting records in those schemas to client
+ software.)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ What attribute sets to recognise in searches. (Subsidiary files
+ specify how to interpret the attributes in terms
+ of the indexes that are created on the records.)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Policy details such as what type of input format to expect when
+ adding new records, what low-level indexing algorithm to use,
+ how to identify potential duplicate records, etc.
+ </para>
+ </listitem>
+
+ </itemizedlist>
+ </para>
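+  <para>
+   A small master configuration touching each of the points above
+   might look like the following sketch (the values are illustrative,
+   not required):
+   <screen>
+    # where to look for subsidiary files such as default.idx
+    profilePath: .:/usr/local/share/idzebra-2.0/tab
+    # how to parse incoming records
+    recordType: grs.sgml
+    # an attribute set to recognise in searches
+    attset: bib1.att
+   </screen>
+  </para>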
+ <para>
+ Now let's see what goes in the <literal>zebra.cfg</literal> file
+ for some example configurations.
+ </para>
+ </sect1>
+
+ <sect1 id="example1">
+ <title>Example 1: &acro.xml; Indexing And Searching</title>
+
+ <para>
+ This example shows how &zebra; can be used with absolutely minimal
+ configuration to index a body of
+ <ulink url="&url.xml;">&acro.xml;</ulink>
+ documents, and search them using
+ <ulink url="&url.xpath;">XPath</ulink>
+ expressions to specify access points.
+ </para>
+ <para>
+ Go to the <literal>examples/zthes</literal> subdirectory
+ of the distribution archive.
+ There you will find a <literal>Makefile</literal> that will
+ populate the <literal>records</literal> subdirectory with a file of
+ <ulink url="http://zthes.z3950.org/">Zthes</ulink>
+ records representing a taxonomic hierarchy of dinosaurs. (The
+ records are generated from the family tree in the file
+ <literal>dino.tree</literal>.)
+ Type <literal>make records/dino.xml</literal>
+ to make the &acro.xml; data file.
+ (Or you could just type <literal>make dino</literal> to build the &acro.xml;
+ data file, create the database and populate it with the taxonomic
+ records all in one shot - but then you wouldn't learn anything,
+ would you? :-)
+ </para>
+ <para>
+ Now we need to create a &zebra; database to hold and index the &acro.xml;
+ records. We do this with the
+ &zebra; indexer, <command>zebraidx</command>, which is
+ driven by the <literal>zebra.cfg</literal> configuration file.
+ For our purposes, we don't need any
+ special behaviour - we can use the defaults - so we can start with a
+ minimal file that just tells <command>zebraidx</command> where to
+ find the default indexing rules, and how to parse the records:
+ <screen>
+ profilePath: .:../../tab
+ recordType: grs.sgml
+ </screen>
+ </para>
+ <para>
+ That's all you need for a minimal &zebra; configuration. Now you can
+ roll the &acro.xml; records into the database and build the indexes:
+ <screen>
+ zebraidx update records
+ </screen>
+ </para>
+ <para>
+ Now start the server. Like the indexer, its behaviour is
+ controlled by the
+ <literal>zebra.cfg</literal> file; and like the indexer, it works
+ just fine with this minimal configuration.
+ <screen>
+ zebrasrv
+ </screen>
+ By default, the server listens on IP port number 9999, although
+ this can easily be changed - see
+ <xref linkend="zebrasrv"/>.
+ </para>
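+  <para>
+   If port 9999 is inconvenient, a different listening address can be
+   given on the command line - for example, to listen on port 2100
+   (see <xref linkend="zebrasrv"/> for the exact listener syntax):
+   <screen>
+    zebrasrv tcp:@:2100
+   </screen>
+  </para>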
+ <para>
+ Now you can use the &acro.z3950; client program of your choice to execute
+ XPath-based boolean queries and fetch the &acro.xml; records that satisfy
+ them:
+ <screen>
+ $ yaz-client @:9999
+ Connecting...Ok.
+ Z> find @attr 1=/Zthes/termName Sauroposeidon
+ Number of hits: 1
+ Z> format xml
+ Z> show 1
+ <Zthes>
<termId>22</termId>
<termName>Sauroposeidon</termName>
<termType>PT</termType>
<termNote>The tallest known dinosaur (18m)</termNote>
<relation>
- <relationType>BT</relationType>
- <termId>21</termId>
- <termName>Brachiosauridae</termName>
- <termType>PT</termType>
+ <relationType>BT</relationType>
+ <termId>21</termId>
+ <termName>Brachiosauridae</termName>
+ <termType>PT</termType>
</relation>
- <idzebra xmlns="http://www.indexdata.dk/zebra/">
- <size>300</size>
- <localnumber>23</localnumber>
- <filename>records/dino.xml</filename>
- </idzebra>
- </Zthes>
- </screen>
- </para>
- <para>
- Now wasn't that nice and easy?
- </para>
- </sect1>
-
-
- <sect1 id="example2">
- <title>Example 2: Supporting Interoperable Searches</title>
-
- <para>
- The problem with the previous example is that you need to know the
- structure of the documents in order to find them. For example,
- when we wanted to find the record for the taxon
- <foreignphrase role="taxon">Sauroposeidon</foreignphrase>,
- we had to formulate a complex XPath
- <literal>/Zthes/termName</literal>
- which embodies the knowledge that taxon names are specified in a
- <literal><termName></literal> element inside the top-level
- <literal><Zthes></literal> element.
- </para>
- <para>
- This is bad not just because it requires a lot of typing, but more
- significantly because it ties searching semantics to the physical
- structure of the searched records. You can't use the same search
- specification to search two databases if their internal
- representations are different. Consider a different taxonomy
- database in which the records have taxon names specified
- inside a <literal><name></literal> element nested within a
- <literal><identification></literal> element
- inside a top-level <literal><taxon></literal> element: then
- you'd need to search for them using
- <literal>1=/taxon/identification/name</literal>
- </para>
- <para>
- How, then, can we build broadcasting Information Retrieval
- applications that look for records in many different databases?
- The &acro.z3950; protocol offers a powerful and general solution to this:
- abstract ``access points''. In the &acro.z3950; model, an access point
- is simply a point at which searches can be directed. Nothing is
- said about implementation: in a given database, an access point
- might be implemented as an index, a path into physical records, an
- algorithm for interrogating relational tables or whatever works.
- The only important thing is that the semantics of an access
- point is fixed and well defined.
- </para>
- <para>
- For convenience, access points are gathered into <firstterm>attribute
- sets</firstterm>. For example, the &acro.bib1; attribute set is supposed to
- contain bibliographic access points such as author, title, subject
- and ISBN; the GEO attribute set contains access points pertaining
- to geospatial information (bounding coordinates, stratum, latitude
- resolution, etc.); the CIMI
- attribute set contains access points to do with museum collections
- (provenance, inscriptions, etc.)
- </para>
- <para>
- In practice, the &acro.bib1; attribute set has tended to be a dumping
- ground for all sorts of access points, so that, for example, it
- includes some geospatial access points as well as strictly
- bibliographic ones. Nevertheless, this model
- allows a layer of abstraction over the physical representation of
- records in databases.
- </para>
- <para>
- In the &acro.bib1; attribute set, a taxon name is probably best
- interpreted as a title - that is, a phrase that identifies the item
- in question. &acro.bib1; represents title searches by
- access point 4. (See
- <ulink url="&url.z39.50.bib1.semantics;">The &acro.bib1; Attribute
- Set Semantics</ulink>)
- So we need to configure our dinosaur database so that searches for
- &acro.bib1; access point 4 look in the
- <literal><termName></literal> element,
- inside the top-level
- <literal><Zthes></literal> element.
- </para>
- <para>
- This is a two-step process. First, we need to tell &zebra; that we
- want to support the &acro.bib1; attribute set. Then we need to tell it
- which elements of its record pertain to access point 4.
+ <idzebra xmlns="http://www.indexdata.dk/zebra/">
+ <size>300</size>
+ <localnumber>23</localnumber>
+ <filename>records/dino.xml</filename>
+ </idzebra>
+ </Zthes>
+ </screen>
+ </para>
+ <para>
+ Now wasn't that nice and easy?
+ </para>
+ </sect1>
+
+
+ <sect1 id="example2">
+ <title>Example 2: Supporting Interoperable Searches</title>
+
+ <para>
+ The problem with the previous example is that you need to know the
+ structure of the documents in order to find them. For example,
+ when we wanted to find the record for the taxon
+ <foreignphrase role="taxon">Sauroposeidon</foreignphrase>,
+ we had to formulate a complex XPath
+ <literal>/Zthes/termName</literal>
+ which embodies the knowledge that taxon names are specified in a
+ <literal><termName></literal> element inside the top-level
+ <literal><Zthes></literal> element.
+ </para>
+ <para>
+ This is bad not just because it requires a lot of typing, but more
+ significantly because it ties searching semantics to the physical
+ structure of the searched records. You can't use the same search
+ specification to search two databases if their internal
+ representations are different. Consider a different taxonomy
+ database in which the records have taxon names specified
+ inside a <literal><name></literal> element nested within a
+ <literal><identification></literal> element
+ inside a top-level <literal><taxon></literal> element: then
+ you'd need to search for them using
+ <literal>1=/taxon/identification/name</literal>.
+ </para>
+ <para>
+ How, then, can we build broadcasting Information Retrieval
+ applications that look for records in many different databases?
+ The &acro.z3950; protocol offers a powerful and general solution to this:
+ abstract ``access points''. In the &acro.z3950; model, an access point
+ is simply a point at which searches can be directed. Nothing is
+ said about implementation: in a given database, an access point
+ might be implemented as an index, a path into physical records, an
+ algorithm for interrogating relational tables or whatever works.
+ The only important thing is that the semantics of an access
+ point is fixed and well defined.
+ </para>
+ <para>
+ For convenience, access points are gathered into <firstterm>attribute
+ sets</firstterm>. For example, the &acro.bib1; attribute set is supposed to
+ contain bibliographic access points such as author, title, subject
+ and ISBN; the GEO attribute set contains access points pertaining
+ to geospatial information (bounding coordinates, stratum, latitude
+ resolution, etc.); the CIMI
+ attribute set contains access points to do with museum collections
+ (provenance, inscriptions, etc.)
+ </para>
+ <para>
+ In practice, the &acro.bib1; attribute set has tended to be a dumping
+ ground for all sorts of access points, so that, for example, it
+ includes some geospatial access points as well as strictly
+ bibliographic ones. Nevertheless, this model
+ allows a layer of abstraction over the physical representation of
+ records in databases.
+ </para>
+ <para>
+ In the &acro.bib1; attribute set, a taxon name is probably best
+ interpreted as a title - that is, a phrase that identifies the item
+ in question. &acro.bib1; represents title searches by
+ access point 4. (See
+ <ulink url="&url.z39.50.bib1.semantics;">The &acro.bib1; Attribute
+ Set Semantics</ulink>.)
+ So we need to configure our dinosaur database so that searches for
+ &acro.bib1; access point 4 look in the
+ <literal><termName></literal> element,
+ inside the top-level
+ <literal><Zthes></literal> element.
</para>
<para>
- We need to create an <link linkend="abs-file">Abstract Syntax
- file</link> named after the document element of the records we're
+ This is a two-step process. First, we need to tell &zebra; that we
+ want to support the &acro.bib1; attribute set. Then we need to tell it
+ which elements of its record pertain to access point 4.
+ </para>
+ <para>
+ We need to create an <link linkend="abs-file">Abstract Syntax
+ file</link> named after the document element of the records we're
working with, plus a <literal>.abs</literal> suffix - in this case,
<literal>Zthes.abs</literal> - as follows:
</para>
<area id="termId" coords="7"/>
<area id="termName" coords="8"/>
</areaspec>
- <programlisting>
-attset zthes.att
-attset bib1.att
-xpath enable
-systag sysno none
-
-xelm /Zthes/termId termId:w
-xelm /Zthes/termName termName:w,title:w
-xelm /Zthes/termQualifier termQualifier:w
-xelm /Zthes/termType termType:w
-xelm /Zthes/termLanguage termLanguage:w
-xelm /Zthes/termNote termNote:w
-xelm /Zthes/termCreatedDate termCreatedDate:w
-xelm /Zthes/termCreatedBy termCreatedBy:w
-xelm /Zthes/termModifiedDate termModifiedDate:w
-xelm /Zthes/termModifiedBy termModifiedBy:w
- </programlisting>
+ <programlisting>
+ attset zthes.att
+ attset bib1.att
+ xpath enable
+ systag sysno none
+
+ xelm /Zthes/termId termId:w
+ xelm /Zthes/termName termName:w,title:w
+ xelm /Zthes/termQualifier termQualifier:w
+ xelm /Zthes/termType termType:w
+ xelm /Zthes/termLanguage termLanguage:w
+ xelm /Zthes/termNote termNote:w
+ xelm /Zthes/termCreatedDate termCreatedDate:w
+ xelm /Zthes/termCreatedBy termCreatedBy:w
+ xelm /Zthes/termModifiedDate termModifiedDate:w
+ xelm /Zthes/termModifiedBy termModifiedBy:w
+ </programlisting>
<calloutlist>
<callout arearefs="attset.zthes">
<para>
After re-indexing, we can search the database using &acro.bib1;
attribute, title, as follows:
<screen>
-Z> form xml
-Z> f @attr 1=4 Eoraptor
-Sent searchRequest.
-Received SearchResponse.
-Search was a success.
-Number of hits: 1, setno 1
-SearchResult-1: Eoraptor(1)
-records returned: 0
-Elapsed: 0.106896
-Z> s
-Sent presentRequest (1+1).
-Records: 1
-[Default]Record type: &acro.xml;
-<Zthes>
- <termId>2</termId>
- <termName>Eoraptor</termName>
- <termType>PT</termType>
- <termNote>The most basal known dinosaur</termNote>
- ...
+ Z> form xml
+ Z> f @attr 1=4 Eoraptor
+ Sent searchRequest.
+ Received SearchResponse.
+ Search was a success.
+ Number of hits: 1, setno 1
+ SearchResult-1: Eoraptor(1)
+ records returned: 0
+ Elapsed: 0.106896
+ Z> s
+ Sent presentRequest (1+1).
+ Records: 1
+ [Default]Record type: &acro.xml;
+ <Zthes>
+ <termId>2</termId>
+ <termName>Eoraptor</termName>
+ <termType>PT</termType>
+ <termNote>The most basal known dinosaur</termNote>
+ ...
</screen>
</para>
- </sect1>
-</chapter>
-
-
-<!--
- The simplest hello-world example could go like this:
-
- Index the document
-
- <book>
- <title>The art of motorcycle maintenance</title>
- <subject scheme="Dewey">zen</subject>
+ </sect1>
+ </chapter>
+
+
+ <!--
+ The simplest hello-world example could go like this:
+
+ Index the document
+
+ <book>
+ <title>The art of motorcycle maintenance</title>
+ <subject scheme="Dewey">zen</subject>
</book>
-
- And search it like
-
- f @attr 1=/book/title motorcycle
-
- f @attr 1=/book/subject[@scheme=Dewey] zen
-
- If you suddenly decide you want broader interop, you can add
- an abs file (more or less like this):
-
- attset bib1.att
- tagset tagsetg.tag
-
- elm (2,1) title title
- elm (2,21) subject subject
--->
-
-<!--
-How to include images:
-
- <mediaobject>
- <imageobject>
- <imagedata fileref="system.eps" format="eps">
+
+ And search it like
+
+ f @attr 1=/book/title motorcycle
+
+ f @attr 1=/book/subject[@scheme=Dewey] zen
+
+ If you suddenly decide you want broader interop, you can add
+ an abs file (more or less like this):
+
+ attset bib1.att
+ tagset tagsetg.tag
+
+ elm (2,1) title title
+ elm (2,21) subject subject
+ -->
+
+ <!--
+ How to include images:
+
+ <mediaobject>
+ <imageobject>
+ <imagedata fileref="system.eps" format="eps">
</imageobject>
- <imageobject>
- <imagedata fileref="system.gif" format="gif">
+ <imageobject>
+ <imagedata fileref="system.gif" format="gif">
</imageobject>
- <textobject>
- <phrase>The Multi-Lingual Search System Architecture</phrase>
+ <textobject>
+ <phrase>The Multi-Lingual Search System Architecture</phrase>
</textobject>
- <caption>
- <para>
- <emphasis role="strong">
- The Multi-Lingual Search System Architecture.
+ <caption>
+ <para>
+ <emphasis role="strong">
+ The Multi-Lingual Search System Architecture.
</emphasis>
- <para>
- Network connections across local area networks are
- represented by straight lines, and those over the
- internet by jagged lines.
+ <para>
+ Network connections across local area networks are
+ represented by straight lines, and those over the
+ internet by jagged lines.
</caption>
</mediaobject>
-Where the three <*object> thingies inside the top-level <mediaobject>
-are decreasingly preferred version to include depending on what the
-rendering engine can handle. I generated the EPS version of the image
-by exporting a line-drawing done in TGIF, then converted that to the
-GIF using a shell-script called "epstogif" which used an appallingly
-baroque sequence of conversions, which I would prefer not to pollute
-the &zebra; build environment with:
+ Where the three <*object> thingies inside the top-level <mediaobject>
+ are decreasingly preferred versions to include, depending on what the
+ rendering engine can handle. I generated the EPS version of the image
+ by exporting a line-drawing done in TGIF, then converted that to the
+ GIF using a shell-script called "epstogif" which used an appallingly
+ baroque sequence of conversions, which I would prefer not to pollute
+ the &zebra; build environment with:
- #!/bin/sh
+ #!/bin/sh
- # Yes, what follows is stupidly convoluted, but I can't find a
- # more straightforward path from the EPS generated by tgif's
- # "Print" command into a browser-friendly format.
+ # Yes, what follows is stupidly convoluted, but I can't find a
+ # more straightforward path from the EPS generated by tgif's
+ # "Print" command into a browser-friendly format.
- file=`echo "$1" | sed 's/\.eps//'`
- ps2pdf "$1" "$file".pdf
- pdftopbm "$file".pdf "$file"
- pnmscale 0.50 < "$file"-000001.pbm | pnmcrop | ppmtogif
- rm -f "$file".pdf "$file"-000001.pbm
+ file=`echo "$1" | sed 's/\.eps//'`
+ ps2pdf "$1" "$file".pdf
+ pdftopbm "$file".pdf "$file"
+ pnmscale 0.50 < "$file"-000001.pbm | pnmcrop | ppmtogif
+ rm -f "$file".pdf "$file"-000001.pbm
--->
+ -->
<!-- Keep this comment at the end of the file
Local variables:
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
- sgml-parent-document: "zebra.xml"
+ sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
</simpara>
</abstract>
</bookinfo>
-
+
&chap-introduction;
&chap-installation;
<!-- &chap-quickstart; -->
&chap-recordmodel-alvisxslt;
&chap-recordmodel-grs;
&chap-field-structure;
-
+
<reference id="reference">
<title>Reference</title>
<partintro id="reference-introduction">
</partintro>
&manref;
</reference>
-
+
&app-license;
&gpl2;
&app-indexdata;
-
+
</book>
<!-- Keep this comment at the end of the file
Local variables:
-<appendix id="indexdata">
+ <appendix id="indexdata">
<title>About Index Data and the &zebra; Server</title>
-
+
<para>
Index Data is a consulting and software-development enterprise that
specializes in library and information management systems. Our
</para>
<para>
The <emphasis>Random House College Dictionary</emphasis>, 1975 edition
- offers this definition of the
+ offers this definition of the
word "Zebra":
</para>
-
+
<para>
Zebra, n., any of several horselike, African mammals of the genus Equus,
having a characteristic pattern of black or dark-brown stripes on
a whitish background.
</para>
-</appendix>
+ </appendix>
<!-- Keep this comment at the end of the file
Local variables:
mode: sgml
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
- sgml-parent-document: "zebra.xml"
+ sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
<chapter id="installation">
<title>Installation</title>
<para>
- &zebra; is written in &acro.ansi; C and was implemented with portability in mind.
- We primarily use <ulink url="&url.gcc;">GCC</ulink> on UNIX and
+ &zebra; is written in &acro.ansi; C and was implemented with portability in mind.
+ We primarily use <ulink url="&url.gcc;">GCC</ulink> on UNIX and
<ulink url="&url.vstudio;">Microsoft Visual C++</ulink> on Windows.
</para>
(sparc)</ulink>,
<ulink url="&url.windows2000;">Windows 2000</ulink>.
</para>
-
+
<para>
&zebra; can be configured to use the following utilities (most of
which are optional):
(required)</term>
<listitem>
<para>
- &zebra; uses &yaz; to support <ulink url="&url.z39.50;">&acro.z3950;</ulink> /
+ &zebra; uses &yaz; to support <ulink url="&url.z39.50;">&acro.z3950;</ulink> /
<ulink url="&url.sru;">&acro.sru;</ulink>.
Zebra also uses a lot of other utilities (not related to networking),
such as memory management and XML support.
<para>
For the <link linkend="record-model-domxml">DOM XML</link>
/ <link linkend="record-model-alvisxslt">ALVIS</link>
- record filters, &yaz; must be compiled with
+ record filters, &yaz; must be compiled with
<ulink url="&url.libxml2;">Libxml2</ulink>
and
<ulink url="&url.libxslt;">Libxslt</ulink>
</para>
</listitem>
</varlistentry>
-
+
<varlistentry>
<term><ulink url="&url.tcl;">Tcl</ulink> (optional)</term>
<listitem>
</para>
</listitem>
</varlistentry>
-
+
<varlistentry>
<term>
<ulink url="&url.autoconf;">Autoconf</ulink>,
</para>
</listitem>
</varlistentry>
-
+
<varlistentry>
<term><ulink url="&url.docbook;">Docbook</ulink>
and friends (optional)</term>
<section id="installation-unix"><title>UNIX</title>
<para>
On Unix, GCC works fine, but any native
- C compiler should be possible to use as long as it is
+ C compiler should work, as long as it is
&acro.ansi; C compliant.
</para>
-
+
<para>
Unpack the distribution archive. The <literal>configure</literal>
shell script attempts to guess correct values for various
It uses those values to create a <literal>Makefile</literal> in each
directory of &zebra;.
</para>
-
+
<para>
To run the configure script type:
-
+
<screen>
./configure
</screen>
-
+
</para>
-
+
<para>
The configure script attempts to use the C compiler specified by
the <literal>CC</literal> environment variable.
The <literal>CFLAGS</literal> environment variable holds
options to be passed to the C compiler. If you're using a
Bourne-shell compatible shell you may pass something like this:
-
+
<screen>
CC=/opt/ccs/bin/cc CFLAGS=-O ./configure
</screen>
./configure --help
</screen>
</para>
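<para>
 For example, to install somewhere other than the default
 <filename>/usr/local</filename> tree, the standard autoconf
 <literal>--prefix</literal> option can be used (the path shown is
 illustrative):
 <screen>
  ./configure --prefix=/opt/zebra
 </screen>
</para>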
-
+
<para>
Once the build environment is configured, build the software by
typing:
make
</screen>
</para>
-
+
<para>
If the build is successful, two executables are created in the
sub-directory <literal>index</literal>:
<variablelist>
-
+
<varlistentry>
<term><literal>zebrasrv</literal></term>
<listitem>
</listitem>
</varlistentry>
<varlistentry>
- <term><literal>zebraidx</literal></term>
+ <term><literal>zebraidx</literal></term>
<listitem>
<para>
The administrative indexing tool.
</varlistentry>
<varlistentry>
- <term><literal>index/*.so</literal></term>
+ <term><literal>index/*.so</literal></term>
<listitem>
<para>
The <literal>.so</literal>-files are &zebra; record filter modules.
- There are modules for reading
+ There are modules for reading
&acro.marc; (<filename>mod-grs-marc.so</filename>),
- &acro.xml; (<filename>mod-grs-xml.so</filename>) , etc.
+ &acro.xml; (<filename>mod-grs-xml.so</filename>), etc.
</para>
</listitem>
</varlistentry>
<screen>
make install
</screen>
- By default this will install the &zebra; executables in
+ By default this will install the &zebra; executables in
<filename>/usr/local/bin</filename>,
- and the standard configuration files in
+ and the standard configuration files in
<filename>/usr/local/share/idzebra-2.0</filename>. If
shared modules are built, these are installed in
<filename>/usr/local/lib/idzebra-2.0/modules</filename>.
<section id="installation-debian"><title>GNU/Debian</title>
<section id="installation-debian-linux"><title>GNU/Debian Linux on
- i686 Platform</title>
+ i686 Platform</title>
<para>
Index Data provides pre-compiled GNU/Debian i686 Linux packages
at our Debian package archive, both for
- the Sarge and the Etch release.
+ the Sarge and the Etch release.
</para>
-
+
<para>
To install these packages, you need to add two lines to your
<filename>/etc/apt/sources.list</filename> configuration file,
deb http://ftp.indexdata.dk/debian sarge main
deb-src http://ftp.indexdata.dk/debian sarge main
</screen>
- or the Etch sources from
+ or the Etch sources from
<screen>
deb http://ftp.indexdata.dk/debian etch main
deb-src http://ftp.indexdata.dk/debian etch main
<screen>
apt-get update
</screen>
- as <literal>root</literal>, the
+ as <literal>root</literal>, the
<ulink url="&url.idzebra;">&zebra;</ulink> indexer is
easily installed issuing
<screen>
</screen>
</para>
</section>
-
+
<section id="installation-debia-nother">
<title>Ubuntu/Debian and GNU/Debian on other platforms</title>
<para>
These <ulink url="&url.idzebra;">&zebra;</ulink>
packages are specifically compiled for
- GNU/Debian Linux systems. Installation on other
+ GNU/Debian Linux systems. Installation on other
GNU/Debian systems is possible by
- re-compilation the Debian way: you need to add only the
- <literal>deb-src</literal> sources lines to the
+ re-compilation the Debian way: you only need to add the
+ <literal>deb-src</literal> source lines to the
<filename>/etc/apt/sources.list</filename> configuration file,
that is either the Sarge sources
<screen>
apt-get update
apt-get build-dep idzebra-2.0
</screen>
- as <literal>root</literal>, the
+ as <literal>root</literal>, the
<ulink url="&url.idzebra;">&zebra;</ulink> indexer is
recompiled and installed issuing
<screen>
<section id="installation-win32"><title>WIN32</title>
<para>The easiest way to install &zebra; on Windows is by downloading
- an installer from
+ an installer from
<ulink url="&url.idzebra.download.win32;">here</ulink>.
The installer comes with source too - in case you wish to
compile &zebra; with different compiler options.
</para>
-
+
<para>
&zebra; is shipped with "makefiles" for the NMAKE tool that comes
with <ulink url="&url.vstudio;">Microsoft Visual C++</ulink>.
<filename>WIN</filename> where the file <filename>makefile</filename>
is located. Customize the installation by editing the
<filename>makefile</filename> file (for example by using notepad).
-
+
The following summarizes the most important settings in that file:
-
+
<variablelist>
<varlistentry><term><literal>DEBUG</literal></term>
<listitem><para>
(code generation is multi-threaded DLL).
</para></listitem>
</varlistentry>
-
+
<varlistentry>
<term><literal>YAZDIR</literal></term>
<listitem><para>
Directory of &yaz; source. &zebra;'s makefile expects to find
- <filename>yaz.lib</filename>, <filename>yaz.dll</filename>
+ <filename>yaz.lib</filename>, <filename>yaz.dll</filename>
in <replaceable>yazdir</replaceable><literal>/lib</literal> and
<replaceable>yazdir</replaceable><literal>/bin</literal> respectively.
</para>
</listitem>
</varlistentry>
-
+
<varlistentry>
<term><literal>HAVE_EXPAT</literal>,
<literal>EXPAT_DIR</literal></term>
<listitem><para>
If <literal>HAVE_EXPAT</literal> is set to 1, &zebra; is compiled
with <ulink url="&url.expat;">Expat</ulink> support.
- In this configuration, set
+ In this configuration, set
<literal>EXPAT_DIR</literal> to the Expat source directory.
A Windows version of Expat can be downloaded from
<ulink url="&url.expat;">SourceForge</ulink>.
</para></listitem>
</varlistentry>
-
+
<varlistentry>
<term><literal>HAVE_ICONV</literal>,
<literal>ICONV_DIR</literal></term>
- <listitem><para>
+ <listitem><para>
If <literal>HAVE_ICONV</literal> is set to 1, &zebra; is compiled
- with iconv support. In this configuration, set
+ with iconv support. In this configuration, set
<literal>ICONV_DIR</literal> to the iconv source directory.
Iconv binaries can be downloaded from
<ulink url="&url.libxml2.download.win32;">this site</ulink>.
</para>
</listitem>
</varlistentry>
-
+
<varlistentry>
<term><literal>BZIP2INCLUDE</literal>,
<literal>BZIP2LIB</literal>,
<ulink url="&url.bzip2;">BZIP2</ulink> record compression support.
</para></listitem>
</varlistentry>
-
+
</variablelist>
</para>
<warning>
</note>
<para>
If you wish to recompile &zebra; - for example if you modify
- settings in the <filename>makefile</filename> you can delete
+ settings in the <filename>makefile</filename> - you can delete
object files, etc., by running:
<screen>
nmake clean
</para>
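<para>
 Then rebuild by running the same tool again from that directory (this
 builds the default target):
 <screen>
  nmake
 </screen>
</para>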
<para>
The following files are generated upon successful compilation:
-
+
<variablelist>
<varlistentry><term><filename>bin/zebraidx.exe</filename></term>
<listitem><para>
The &zebra; indexer.
</para></listitem></varlistentry>
-
+
<varlistentry><term><filename>bin/zebrasrv.exe</filename></term>
<listitem><para>
The &zebra; server.
</para></listitem></varlistentry>
-
+
</variablelist>
-
+
</para>
</section>
<title>Upgrading from &zebra; version 1.3.x</title>
<para>
&zebra;'s installation directories have changed a bit. In addition,
- the new loadable modules must be defined in the
+ the new loadable modules must be defined in the
master <filename>zebra.cfg</filename> configuration file. The old
version 1.3.x configuration options
<screen>
# profilePath - where to look for config files
profilePath: some/local/path:/usr/share/idzebra/tab
</screen>
- must be changed to
+ must be changed to
<screen>
# profilePath - where to look for config files
profilePath: some/local/path:/usr/share/idzebra-2.0/tab
</note>
<para>
The attribute set definition files may no longer contain
- redirection to other fields.
+ redirection to other fields.
For example the following snippet of
- a custom <filename>custom/bib1.att</filename>
+ a custom <filename>custom/bib1.att</filename>
&acro.bib1; attribute set definition file is no
longer supported:
<screen>
att 1016 Any 1016,4,1005,62
</screen>
- and should be changed to
+ and should be changed to
<screen>
att 1016 Any
</screen>
</screen>
</para>
<para>
- It is also possible to map the numerical attribute value
+ It is also possible to map the numerical attribute value
<literal>@attr 1=1016</literal> onto another already existing huge
index. In this example, one could use the mapping
<screen>
<screen>
attset: idxpath.att
</screen>
- </para>
+ </para>
</section>
-
+
</chapter>
<!-- Keep this comment at the end of the file
Local variables:
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
- sgml-parent-document: "zebra.xml"
+ sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
-->
-<chapter id="introduction">
+ <chapter id="introduction">
<title>Introduction</title>
-
- <section id="overview">
- <title>Overview</title>
-
- <para>
- &zebra; is a free, fast, friendly information management system. It can
- index records in &acro.xml;/&acro.sgml;, &acro.marc;, e-mail archives and many other
- formats, and quickly find them using a combination of boolean
- searching and relevance ranking. Search-and-retrieve applications can
- be written using &acro.api;s in a wide variety of languages, communicating
- with the &zebra; server using industry-standard information-retrieval
- protocols or web services.
- </para>
- <para>
- &zebra; is licensed Open Source, and can be
- deployed by anyone for any purpose without license fees. The C source
- code is open to anybody to read and change under the GPL license.
- </para>
- <para>
- &zebra; is a networked component which acts as a
- reliable &acro.z3950; server
- for both record/document search, presentation, insert, update and
- delete operations. In addition, it understands the &acro.sru; family of
- webservices, which exist in &acro.rest; &acro.get;/&acro.post; and truly
- &acro.soap; flavors.
- </para>
- <para>
- &zebra; is available as MS Windows 2003 Server (32 bit) self-extracting
- package as well as GNU/Debian Linux (32 bit and 64 bit) precompiled
- packages. It has been deployed successfully on other Unix systems,
- including Sun Sparc, HP Unix, and many variants of Linux and BSD
- based systems.
- </para>
- <para>
- <ulink url="http://www.indexdata.com/zebra/">http://www.indexdata.com/zebra/</ulink>
- <ulink url="http://ftp.indexdata.dk/pub/zebra/win32/">http://ftp.indexdata.dk/pub/zebra/win32/</ulink>
- <ulink url="http://ftp.indexdata.dk/pub/zebra/debian/">http://ftp.indexdata.dk/pub/zebra/debian/</ulink>
- </para>
- <para>
- <ulink url="http://indexdata.dk/zebra/">&zebra;</ulink>
- is a high-performance, general-purpose structured text
- indexing and retrieval engine. It reads records in a
- variety of input formats (e.g. email, &acro.xml;, &acro.marc;) and provides access
- to them through a powerful combination of boolean search
- expressions and relevance-ranked free-text queries.
- </para>
-
- <para>
- &zebra; supports large databases (tens of millions of records,
- tens of gigabytes of data). It allows safe, incremental
- database updates on live systems. Because &zebra; supports
- the industry-standard information retrieval protocol, &acro.z3950;,
- you can search &zebra; databases using an enormous variety of
- programs and toolkits, both commercial and free, which understand
- this protocol. Application libraries are available to allow
- bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
- Basic, Python, &acro.php; and more - see the
- <ulink url="&url.zoom;">&acro.zoom; web site</ulink>
- for more information on some of these client toolkits.
- </para>
-
- <para>
- This document is an introduction to the &zebra; system. It explains
- how to compile the software, how to prepare your first database,
- and how to configure the server to give you the
- functionality that you need.
- </para>
- </section>
-
- <section id="features">
- <title>&zebra; Features Overview</title>
-
-
- <!--
- <row>
- <entry></entry>
- <entry></entry>
- <entry></entry>
- <entry><xref linkend=""/></entry>
- </row>
- <row>
- <entry></entry>
- <entry></entry>
- <entry></entry>
- <entry><xref linkend=""/></entry>
- </row>
- <row>
- <entry></entry>
- <entry></entry>
- <entry></entry>
- <entry><xref linkend=""/></entry>
- </row>
- -->
+ <section id="overview">
+ <title>Overview</title>
+ <para>
+ &zebra; is a free, fast, friendly information management system. It can
+ index records in &acro.xml;/&acro.sgml;, &acro.marc;, e-mail archives and many other
+ formats, and quickly find them using a combination of boolean
+ searching and relevance ranking. Search-and-retrieve applications can
+ be written using &acro.api;s in a wide variety of languages, communicating
+ with the &zebra; server using industry-standard information-retrieval
+ protocols or web services.
+ </para>
+ <para>
+ &zebra; is licensed Open Source, and can be
+ deployed by anyone for any purpose without license fees. The C source
+ code is open to anybody to read and change under the GPL license.
+ </para>
+ <para>
+ &zebra; is a networked component which acts as a
+ reliable &acro.z3950; server
+ for both record/document search, presentation, insert, update and
+ delete operations. In addition, it understands the &acro.sru; family of
+ webservices, which exist in &acro.rest; &acro.get;/&acro.post; and truly
+ &acro.soap; flavors.
+ </para>
+ <para>
+ &zebra; is available as MS Windows 2003 Server (32 bit) self-extracting
+ package as well as GNU/Debian Linux (32 bit and 64 bit) precompiled
+ packages. It has been deployed successfully on other Unix systems,
+ including Sun Sparc, HP Unix, and many variants of Linux and BSD
+ based systems.
+ </para>
+ <para>
+ <ulink url="http://www.indexdata.com/zebra/">http://www.indexdata.com/zebra/</ulink>
+ <ulink url="http://ftp.indexdata.dk/pub/zebra/win32/">http://ftp.indexdata.dk/pub/zebra/win32/</ulink>
+ <ulink url="http://ftp.indexdata.dk/pub/zebra/debian/">http://ftp.indexdata.dk/pub/zebra/debian/</ulink>
+ </para>
+
+ <para>
+ <ulink url="http://indexdata.dk/zebra/">&zebra;</ulink>
+ is a high-performance, general-purpose structured text
+ indexing and retrieval engine. It reads records in a
+ variety of input formats (e.g. email, &acro.xml;, &acro.marc;) and provides access
+ to them through a powerful combination of boolean search
+ expressions and relevance-ranked free-text queries.
+ </para>
+ <para>
+ &zebra; supports large databases (tens of millions of records,
+ tens of gigabytes of data). It allows safe, incremental
+ database updates on live systems. Because &zebra; supports
+ the industry-standard information retrieval protocol, &acro.z3950;,
+ you can search &zebra; databases using an enormous variety of
+ programs and toolkits, both commercial and free, which understand
+ this protocol. Application libraries are available to allow
+ bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
+ Basic, Python, &acro.php; and more - see the
+ <ulink url="&url.zoom;">&acro.zoom; web site</ulink>
+ for more information on some of these client toolkits.
+ </para>
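+   <para>
+    As a quick illustration, any &acro.z3950; client can search a
+    running &zebra; server. The following sketch uses the
+    <literal>yaz-client</literal> program from the YAZ toolkit; the
+    host name, port number and search term are illustrative
+    assumptions only:
+    <screen>
+     $ yaz-client localhost:9999
+     Z> find computer
+     Z> show 1
+    </screen>
+   </para>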
+
+ <para>
+ This document is an introduction to the &zebra; system. It explains
+ how to compile the software, how to prepare your first database,
+ and how to configure the server to give you the
+ functionality that you need.
+ </para>
+ </section>
+
+ <section id="features">
+ <title>&zebra; Features Overview</title>
<section id="features-document">
<title>&zebra; Document Model</title>
- <table id="table-features-document" frame="top">
- <title>&zebra; document model</title>
- <tgroup cols="4">
- <colspec colwidth="1*" colname="feature"/>
- <colspec colwidth="1*" colname="availability"/>
- <colspec colwidth="3*" colname="notes"/>
- <colspec colwidth="2*" colname="references"/>
- <thead>
- <row>
- <entry>Feature</entry>
- <entry>Availability</entry>
- <entry>Notes</entry>
- <entry>Reference</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>Complex semi-structured Documents</entry>
- <entry>&acro.xml; and &acro.grs1; Documents</entry>
- <entry>Both &acro.xml; and &acro.grs1; documents exhibit a &acro.dom; like internal
- representation allowing for complex indexing and display rules</entry>
- <entry><xref linkend="record-model-alvisxslt"/> and
- <xref linkend="grs"/></entry>
- </row>
- <row>
- <entry>Input document formats</entry>
- <entry>&acro.xml;, &acro.sgml;, Text, ISO2709 (&acro.marc;)</entry>
- <entry>
- A system of input filters driven by
- regular expressions allows most ASCII-based
- data formats to be easily processed.
- &acro.sgml;, &acro.xml;, ISO2709 (&acro.marc;), and raw text are also
- supported.</entry>
- <entry><xref linkend="componentmodules"/></entry>
- </row>
- <row>
- <entry>Document storage</entry>
- <entry>Index-only, Key storage, Document storage</entry>
- <entry>Data can be, and usually is, imported
- into &zebra;'s own storage, but &zebra; can also refer to
- external files, building and maintaining indexes of "live"
- collections.</entry>
- <entry></entry>
- </row>
-
- </tbody>
- </tgroup>
- </table>
+ <table id="table-features-document" frame="top">
+ <title>&zebra; document model</title>
+ <tgroup cols="4">
+ <colspec colwidth="1*" colname="feature"/>
+ <colspec colwidth="1*" colname="availability"/>
+ <colspec colwidth="3*" colname="notes"/>
+ <colspec colwidth="2*" colname="references"/>
+ <thead>
+ <row>
+ <entry>Feature</entry>
+ <entry>Availability</entry>
+ <entry>Notes</entry>
+ <entry>Reference</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>Complex semi-structured Documents</entry>
+ <entry>&acro.xml; and &acro.grs1; Documents</entry>
+         <entry>Both &acro.xml; and &acro.grs1; documents exhibit a &acro.dom;-like internal
+          representation, allowing for complex indexing and display rules</entry>
+ <entry><xref linkend="record-model-alvisxslt"/> and
+ <xref linkend="grs"/></entry>
+ </row>
+ <row>
+ <entry>Input document formats</entry>
+ <entry>&acro.xml;, &acro.sgml;, Text, ISO2709 (&acro.marc;)</entry>
+ <entry>
+ A system of input filters driven by
+ regular expressions allows most ASCII-based
+ data formats to be easily processed.
+ &acro.sgml;, &acro.xml;, ISO2709 (&acro.marc;), and raw text are also
+ supported.</entry>
+ <entry><xref linkend="componentmodules"/></entry>
+ </row>
+ <row>
+ <entry>Document storage</entry>
+ <entry>Index-only, Key storage, Document storage</entry>
+ <entry>Data can be, and usually is, imported
+ into &zebra;'s own storage, but &zebra; can also refer to
+ external files, building and maintaining indexes of "live"
+ collections.</entry>
+ <entry></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
</section>
<section id="features-search">
<title>&zebra; Search Features</title>
- <table id="table-features-search" frame="top">
+ <table id="table-features-search" frame="top">
<title>&zebra; search functionality</title>
- <tgroup cols="4">
- <colspec colwidth="1*" colname="feature"/>
- <colspec colwidth="1*" colname="availability"/>
- <colspec colwidth="3*" colname="notes"/>
- <colspec colwidth="2*" colname="references"/>
- <thead>
- <row>
- <entry>Feature</entry>
- <entry>Availability</entry>
- <entry>Notes</entry>
- <entry>Reference</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>Query languages</entry>
- <entry>&acro.cql; and &acro.rpn;/&acro.pqf;</entry>
- <entry>The type-1 Reverse Polish Notation (&acro.rpn;)
- and its textual representation Prefix Query Format (&acro.pqf;) are
- supported. The Common Query Language (&acro.cql;) can be configured as
- a mapping from &acro.cql; to &acro.rpn;/&acro.pqf;</entry>
- <entry><xref linkend="querymodel-query-languages-pqf"/> and
- <xref linkend="querymodel-cql-to-pqf"/></entry>
- </row>
- <row>
- <entry>Complex boolean query tree</entry>
- <entry>&acro.cql; and &acro.rpn;/&acro.pqf;</entry>
- <entry>Both &acro.cql; and &acro.rpn;/&acro.pqf; allow atomic query parts (&acro.apt;) to
- be combined into complex boolean query trees</entry>
- <entry><xref linkend="querymodel-rpn-tree"/></entry>
- </row>
- <row>
- <entry>Field search</entry>
- <entry>user defined</entry>
- <entry>Atomic query parts (&acro.apt;) are either general, or
- directed at user-specified document fields
- </entry>
- <entry><xref linkend="querymodel-atomic-queries"/>,
- <xref linkend="querymodel-use-string"/>,
- <xref linkend="querymodel-bib1-use"/>, and
- <xref linkend="querymodel-idxpath-use"/></entry>
- </row>
- <row>
- <entry>Data normalization</entry>
- <entry>user defined</entry>
- <entry>Data normalization, text tokenization and character
- mappings can be applied during indexing and searching</entry>
- <entry><xref linkend="fields-and-charsets"/></entry>
- </row>
- <row>
- <entry>Predefined field types</entry>
- <entry>user defined</entry>
- <entry>Data fields can be indexed as phrase, as into word
- tokenized text, as numeric values, URLs, dates, and raw binary
- data.</entry>
- <entry><xref linkend="character-map-files"/> and
- <xref linkend="querymodel-pqf-apt-mapping-structuretype"/>
- </entry>
- </row>
- <row>
- <entry>Regular expression matching</entry>
- <entry>available</entry>
- <entry>Full regular expression matching and "approximate
- matching" (e.g. spelling mistake corrections) are handled.</entry>
- <entry><xref linkend="querymodel-regular"/></entry>
- </row>
- <row>
- <entry>Term truncation</entry>
- <entry>left, right, left-and-right</entry>
- <entry>The truncation attribute specifies whether variations of
- one or more characters are allowed between search term and hit
- terms, or not. Using non-default truncation attributes will
- broaden the document hit set of a search query.</entry>
- <entry><xref linkend="querymodel-bib1-truncation"/></entry>
- </row>
- <row>
- <entry>Fuzzy searches</entry>
- <entry>Spelling correction</entry>
- <entry>In addition, fuzzy searches are implemented, where one
- spelling mistake in search terms is matched</entry>
- <entry><xref linkend="querymodel-bib1-truncation"/></entry>
- </row>
- </tbody>
- </tgroup>
- </table>
+ <tgroup cols="4">
+ <colspec colwidth="1*" colname="feature"/>
+ <colspec colwidth="1*" colname="availability"/>
+ <colspec colwidth="3*" colname="notes"/>
+ <colspec colwidth="2*" colname="references"/>
+ <thead>
+ <row>
+ <entry>Feature</entry>
+ <entry>Availability</entry>
+ <entry>Notes</entry>
+ <entry>Reference</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>Query languages</entry>
+ <entry>&acro.cql; and &acro.rpn;/&acro.pqf;</entry>
+ <entry>The type-1 Reverse Polish Notation (&acro.rpn;)
+ and its textual representation Prefix Query Format (&acro.pqf;) are
+ supported. The Common Query Language (&acro.cql;) can be configured as
+ a mapping from &acro.cql; to &acro.rpn;/&acro.pqf;</entry>
+ <entry><xref linkend="querymodel-query-languages-pqf"/> and
+ <xref linkend="querymodel-cql-to-pqf"/></entry>
+ </row>
+ <row>
+ <entry>Complex boolean query tree</entry>
+ <entry>&acro.cql; and &acro.rpn;/&acro.pqf;</entry>
+ <entry>Both &acro.cql; and &acro.rpn;/&acro.pqf; allow atomic query parts (&acro.apt;) to
+ be combined into complex boolean query trees</entry>
+ <entry><xref linkend="querymodel-rpn-tree"/></entry>
+ </row>
+ <row>
+ <entry>Field search</entry>
+ <entry>user defined</entry>
+ <entry>Atomic query parts (&acro.apt;) are either general, or
+ directed at user-specified document fields
+ </entry>
+ <entry><xref linkend="querymodel-atomic-queries"/>,
+ <xref linkend="querymodel-use-string"/>,
+ <xref linkend="querymodel-bib1-use"/>, and
+ <xref linkend="querymodel-idxpath-use"/></entry>
+ </row>
+ <row>
+ <entry>Data normalization</entry>
+ <entry>user defined</entry>
+ <entry>Data normalization, text tokenization and character
+ mappings can be applied during indexing and searching</entry>
+ <entry><xref linkend="fields-and-charsets"/></entry>
+ </row>
+ <row>
+ <entry>Predefined field types</entry>
+ <entry>user defined</entry>
+        <entry>Data fields can be indexed as phrases, as word-tokenized
+         text, as numeric values, URLs, dates, or as raw binary
+         data.</entry>
+ <entry><xref linkend="character-map-files"/> and
+ <xref linkend="querymodel-pqf-apt-mapping-structuretype"/>
+ </entry>
+ </row>
+ <row>
+ <entry>Regular expression matching</entry>
+ <entry>available</entry>
+ <entry>Full regular expression matching and "approximate
+ matching" (e.g. spelling mistake corrections) are handled.</entry>
+ <entry><xref linkend="querymodel-regular"/></entry>
+ </row>
+ <row>
+ <entry>Term truncation</entry>
+ <entry>left, right, left-and-right</entry>
+ <entry>The truncation attribute specifies whether variations of
+ one or more characters are allowed between search term and hit
+ terms, or not. Using non-default truncation attributes will
+ broaden the document hit set of a search query.</entry>
+ <entry><xref linkend="querymodel-bib1-truncation"/></entry>
+ </row>
+ <row>
+ <entry>Fuzzy searches</entry>
+ <entry>Spelling correction</entry>
+        <entry>In addition, fuzzy searches are implemented, matching
+         search terms that contain one spelling mistake</entry>
+ <entry><xref linkend="querymodel-bib1-truncation"/></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</section>
<section id="features-scan">
<title>&zebra; Index Scanning</title>
- <table id="table-features-scan" frame="top">
- <title>&zebra; index scanning</title>
- <tgroup cols="4">
- <colspec colwidth="1*" colname="feature"/>
- <colspec colwidth="1*" colname="availability"/>
- <colspec colwidth="3*" colname="notes"/>
- <colspec colwidth="2*" colname="references"/>
- <thead>
- <row>
- <entry>Feature</entry>
- <entry>Availability</entry>
- <entry>Notes</entry>
- <entry>Reference</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>Scan</entry>
- <entry>term suggestions</entry>
- <entry><literal>Scan</literal> on a given named index returns all the
- indexed terms in lexicographical order near the given start
- term. This can be used to create drop-down menus and search
- suggestions.</entry>
- <entry><xref linkend="querymodel-operation-type-scan"/> and
- <xref linkend="querymodel-atomic-queries"/>
- </entry>
- </row>
- <row>
- <entry>Facetted browsing</entry>
+ <table id="table-features-scan" frame="top">
+ <title>&zebra; index scanning</title>
+ <tgroup cols="4">
+ <colspec colwidth="1*" colname="feature"/>
+ <colspec colwidth="1*" colname="availability"/>
+ <colspec colwidth="3*" colname="notes"/>
+ <colspec colwidth="2*" colname="references"/>
+ <thead>
+ <row>
+ <entry>Feature</entry>
+ <entry>Availability</entry>
+ <entry>Notes</entry>
+ <entry>Reference</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>Scan</entry>
+ <entry>term suggestions</entry>
+ <entry><literal>Scan</literal> on a given named index returns all the
+ indexed terms in lexicographical order near the given start
+ term. This can be used to create drop-down menus and search
+ suggestions.</entry>
+ <entry><xref linkend="querymodel-operation-type-scan"/> and
+ <xref linkend="querymodel-atomic-queries"/>
+ </entry>
+ </row>
+ <row>
+ <entry>Facetted browsing</entry>
<entry>available</entry>
<entry>Zebra 2.1 and later allows retrieval of facets for
 a result set.
- </entry>
- <entry><xref linkend="querymodel-zebra-attr-scan"/></entry>
- </row>
- <row>
- <entry>Drill-down or refine-search</entry>
- <entry>partially</entry>
- <entry>scanning in result sets can be used to implement
- drill-down in search clients</entry>
- <entry><xref linkend="querymodel-zebra-attr-scan"/></entry>
- </row>
- </tbody>
- </tgroup>
- </table>
+ </entry>
+ <entry><xref linkend="querymodel-zebra-attr-scan"/></entry>
+ </row>
+ <row>
+ <entry>Drill-down or refine-search</entry>
+ <entry>partially</entry>
+ <entry>scanning in result sets can be used to implement
+ drill-down in search clients</entry>
+ <entry><xref linkend="querymodel-zebra-attr-scan"/></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</section>
<section id="features-presentation">
<title>&zebra; Document Presentation</title>
- <table id="table-features-presentation" frame="top">
- <title>&zebra; document presentation</title>
- <tgroup cols="4">
- <colspec colwidth="1*" colname="feature"/>
- <colspec colwidth="1*" colname="availability"/>
- <colspec colwidth="3*" colname="notes"/>
- <colspec colwidth="2*" colname="references"/>
- <thead>
- <row>
- <entry>Feature</entry>
- <entry>Availability</entry>
- <entry>Notes</entry>
- <entry>Reference</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>Hit count</entry>
- <entry>yes</entry>
- <entry>Search results include at any time the total hit count of a given
- query, either exact computed, or approximative, in case that the
- hit count exceeds a possible pre-defined hit set truncation
- level.</entry>
- <entry>
- <xref linkend="querymodel-zebra-local-attr-limit"/> and
- <xref linkend="zebra-cfg"/>
- </entry>
- </row>
- <row>
- <entry>Paged result sets</entry>
- <entry>yes</entry>
- <entry>Paging of search requests and present/display request
- can return any successive number of records from any start
- position in the hit set, i.e. it is trivial to provide search
- results in successive pages of any size.</entry>
- <entry></entry>
- </row>
- <row>
- <entry>&acro.xml; document transformations</entry>
- <entry>&acro.xslt; based</entry>
- <entry> Record presentation can be performed in many
- pre-defined &acro.xml; data
- formats, where the original &acro.xml; records are on-the-fly transformed
- through any preconfigured &acro.xslt; transformation. It is therefore
- trivial to present records in short/full &acro.xml; views, transforming to
- RSS, Dublin Core, or other &acro.xml; based data formats, or transform
- records to XHTML snippets ready for inserting in XHTML pages.</entry>
- <entry>
- <xref linkend="record-model-alvisxslt-elementset"/></entry>
- </row>
- <row>
- <entry>Binary record transformations</entry>
- <entry>&acro.marc;, &acro.usmarc;, &acro.marc21; and &acro.marcxml;</entry>
- <entry>post-filter record transformations</entry>
- <entry></entry>
- </row>
- <row>
- <entry>Record Syntaxes</entry>
- <entry></entry>
- <entry> Multiple record syntaxes
- for data retrieval: &acro.grs1;, &acro.sutrs;,
- &acro.xml;, ISO2709 (&acro.marc;), etc. Records can be mapped between
- record syntaxes and schemas on the fly.</entry>
- <entry></entry>
- </row>
- <row>
- <entry>&zebra; internal metadata</entry>
- <entry>yes</entry>
- <entry> &zebra; internal document metadata can be fetched in
- &acro.sutrs; and &acro.xml; record syntaxes. Those are useful in client
- applications.</entry>
- <entry><xref linkend="special-retrieval"/></entry>
- </row>
- <row>
- <entry>&zebra; internal raw record data</entry>
- <entry>yes</entry>
- <entry> &zebra; internal raw, binary record data can be fetched in
- &acro.sutrs; and &acro.xml; record syntaxes, leveraging %zebra; to a
- binary storage system</entry>
- <entry><xref linkend="special-retrieval"/></entry>
- </row>
- <row>
- <entry>&zebra; internal record field data</entry>
- <entry>yes</entry>
- <entry> &zebra; internal record field data can be fetched in
- &acro.sutrs; and &acro.xml; record syntaxes. This makes very fast minimal
- record data displays possible.</entry>
- <entry><xref linkend="special-retrieval"/></entry>
- </row>
- </tbody>
- </tgroup>
- </table>
+ <table id="table-features-presentation" frame="top">
+ <title>&zebra; document presentation</title>
+ <tgroup cols="4">
+ <colspec colwidth="1*" colname="feature"/>
+ <colspec colwidth="1*" colname="availability"/>
+ <colspec colwidth="3*" colname="notes"/>
+ <colspec colwidth="2*" colname="references"/>
+ <thead>
+ <row>
+ <entry>Feature</entry>
+ <entry>Availability</entry>
+ <entry>Notes</entry>
+ <entry>Reference</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>Hit count</entry>
+ <entry>yes</entry>
+        <entry>Search results always include the total hit count of a given
+         query, either computed exactly, or approximated when the
+         hit count exceeds a pre-defined hit set truncation
+         level.</entry>
+ <entry>
+ <xref linkend="querymodel-zebra-local-attr-limit"/> and
+ <xref linkend="zebra-cfg"/>
+ </entry>
+ </row>
+ <row>
+ <entry>Paged result sets</entry>
+ <entry>yes</entry>
+ <entry>Paging of search requests and present/display request
+ can return any successive number of records from any start
+ position in the hit set, i.e. it is trivial to provide search
+ results in successive pages of any size.</entry>
+ <entry></entry>
+ </row>
+ <row>
+ <entry>&acro.xml; document transformations</entry>
+ <entry>&acro.xslt; based</entry>
+ <entry> Record presentation can be performed in many
+ pre-defined &acro.xml; data
+ formats, where the original &acro.xml; records are on-the-fly transformed
+ through any preconfigured &acro.xslt; transformation. It is therefore
+ trivial to present records in short/full &acro.xml; views, transforming to
+ RSS, Dublin Core, or other &acro.xml; based data formats, or transform
+ records to XHTML snippets ready for inserting in XHTML pages.</entry>
+ <entry>
+ <xref linkend="record-model-alvisxslt-elementset"/></entry>
+ </row>
+ <row>
+ <entry>Binary record transformations</entry>
+ <entry>&acro.marc;, &acro.usmarc;, &acro.marc21; and &acro.marcxml;</entry>
+ <entry>post-filter record transformations</entry>
+ <entry></entry>
+ </row>
+ <row>
+ <entry>Record Syntaxes</entry>
+ <entry></entry>
+ <entry> Multiple record syntaxes
+ for data retrieval: &acro.grs1;, &acro.sutrs;,
+ &acro.xml;, ISO2709 (&acro.marc;), etc. Records can be mapped between
+ record syntaxes and schemas on the fly.</entry>
+ <entry></entry>
+ </row>
+ <row>
+ <entry>&zebra; internal metadata</entry>
+ <entry>yes</entry>
+ <entry> &zebra; internal document metadata can be fetched in
+ &acro.sutrs; and &acro.xml; record syntaxes. Those are useful in client
+ applications.</entry>
+ <entry><xref linkend="special-retrieval"/></entry>
+ </row>
+ <row>
+ <entry>&zebra; internal raw record data</entry>
+ <entry>yes</entry>
+ <entry> &zebra; internal raw, binary record data can be fetched in
+         &acro.sutrs; and &acro.xml; record syntaxes, turning &zebra; into a
+         binary storage system</entry>
+ <entry><xref linkend="special-retrieval"/></entry>
+ </row>
+ <row>
+ <entry>&zebra; internal record field data</entry>
+ <entry>yes</entry>
+ <entry> &zebra; internal record field data can be fetched in
+ &acro.sutrs; and &acro.xml; record syntaxes. This makes very fast minimal
+ record data displays possible.</entry>
+ <entry><xref linkend="special-retrieval"/></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</section>
<section id="features-sort-rank">
<title>&zebra; Sorting and Ranking</title>
- <table id="table-features-sort-rank" frame="top">
- <title>&zebra; sorting and ranking</title>
- <tgroup cols="4">
- <colspec colwidth="1*" colname="feature"/>
- <colspec colwidth="1*" colname="availability"/>
- <colspec colwidth="3*" colname="notes"/>
- <colspec colwidth="2*" colname="references"/>
- <thead>
- <row>
- <entry>Feature</entry>
- <entry>Availability</entry>
- <entry>Notes</entry>
- <entry>Reference</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>Sort</entry>
- <entry>numeric, lexicographic</entry>
- <entry>Sorting on the basis of alpha-numeric and numeric data
- is supported. Alphanumeric sorts can be configured for
- different data encodings and locales for European languages.</entry>
- <entry><xref linkend="administration-ranking-sorting"/> and
- <xref linkend="querymodel-zebra-attr-sorting"/></entry>
- </row>
- <row>
- <entry>Combined sorting</entry>
- <entry>yes</entry>
- <entry>Sorting on the basis of combined sorts  e.g. combinations of
- ascending/descending sorts of lexicographical/numeric/date field data
- is supported</entry>
- <entry><xref linkend="administration-ranking-sorting"/></entry>
- </row>
- <row>
- <entry>Relevance ranking</entry>
- <entry>TF-IDF like</entry>
- <entry>Relevance-ranking of free-text queries is supported
- using a TF-IDF like algorithm.</entry>
- <entry><xref linkend="administration-ranking-dynamic"/></entry>
- </row>
- <row>
- <entry>Static pre-ranking</entry>
- <entry>yes</entry>
- <entry>Enables pre-index time ranking of documents where hit
- lists are ordered first by ascending static rank, then by
- ascending document ID.</entry>
- <entry><xref linkend="administration-ranking-static"/></entry>
- </row>
- </tbody>
- </tgroup>
- </table>
+ <table id="table-features-sort-rank" frame="top">
+ <title>&zebra; sorting and ranking</title>
+ <tgroup cols="4">
+ <colspec colwidth="1*" colname="feature"/>
+ <colspec colwidth="1*" colname="availability"/>
+ <colspec colwidth="3*" colname="notes"/>
+ <colspec colwidth="2*" colname="references"/>
+ <thead>
+ <row>
+ <entry>Feature</entry>
+ <entry>Availability</entry>
+ <entry>Notes</entry>
+ <entry>Reference</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>Sort</entry>
+ <entry>numeric, lexicographic</entry>
+ <entry>Sorting on the basis of alpha-numeric and numeric data
+ is supported. Alphanumeric sorts can be configured for
+ different data encodings and locales for European languages.</entry>
+ <entry><xref linkend="administration-ranking-sorting"/> and
+ <xref linkend="querymodel-zebra-attr-sorting"/></entry>
+ </row>
+ <row>
+ <entry>Combined sorting</entry>
+ <entry>yes</entry>
+        <entry>Sorting on the basis of combined sorts, e.g. combinations of
+         ascending/descending sorts of lexicographical/numeric/date field data,
+         is supported</entry>
+ <entry><xref linkend="administration-ranking-sorting"/></entry>
+ </row>
+ <row>
+ <entry>Relevance ranking</entry>
+ <entry>TF-IDF like</entry>
+ <entry>Relevance-ranking of free-text queries is supported
+ using a TF-IDF like algorithm.</entry>
+ <entry><xref linkend="administration-ranking-dynamic"/></entry>
+ </row>
+ <row>
+ <entry>Static pre-ranking</entry>
+ <entry>yes</entry>
+ <entry>Enables pre-index time ranking of documents where hit
+ lists are ordered first by ascending static rank, then by
+ ascending document ID.</entry>
+ <entry><xref linkend="administration-ranking-static"/></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</section>
<section id="features-updates">
 <title>&zebra; Live Updates</title>
- <table id="table-features-updates" frame="top">
- <title>&zebra; live updates</title>
- <tgroup cols="4">
- <colspec colwidth="1*" colname="feature"/>
- <colspec colwidth="1*" colname="availability"/>
- <colspec colwidth="3*" colname="notes"/>
- <colspec colwidth="2*" colname="references"/>
- <thead>
- <row>
- <entry>Feature</entry>
- <entry>Availability</entry>
- <entry>Notes</entry>
- <entry>Reference</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>Incremental and batch updates</entry>
- <entry></entry>
- <entry>It is possible to schedule record inserts/updates/deletes in any
- quantity, from single individual handled records to batch updates
- in strikes of any size, as well as total re-indexing of all records
- from file system. </entry>
- <entry><xref linkend="zebraidx"/></entry>
- </row>
- <row>
- <entry>Remote updates</entry>
- <entry>&acro.z3950; extended services</entry>
- <entry>Updates can be performed from remote locations using the
- &acro.z3950; extended services. Access to extended services can be
- login-password protected.</entry>
- <entry><xref linkend="administration-extended-services"/> and
- <xref linkend="zebra-cfg"/></entry>
- </row>
- <row>
- <entry>Live updates</entry>
- <entry>transaction based</entry>
- <entry> Data updates are transaction based and can be performed
- on running &zebra; systems. Full searchability is preserved
- during life data update due to use of shadow disk areas for
- update operations. Multiple update transactions at the same
- time are lined up, to be performed one after each other. Data
- integrity is preserved.</entry>
- <entry><xref linkend="shadow-registers"/></entry>
- </row>
- </tbody>
- </tgroup>
- </table>
+ <table id="table-features-updates" frame="top">
+ <title>&zebra; live updates</title>
+ <tgroup cols="4">
+ <colspec colwidth="1*" colname="feature"/>
+ <colspec colwidth="1*" colname="availability"/>
+ <colspec colwidth="3*" colname="notes"/>
+ <colspec colwidth="2*" colname="references"/>
+ <thead>
+ <row>
+ <entry>Feature</entry>
+ <entry>Availability</entry>
+ <entry>Notes</entry>
+ <entry>Reference</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>Incremental and batch updates</entry>
+ <entry></entry>
+        <entry>It is possible to schedule record inserts/updates/deletes in any
+         quantity, from single records to batch updates
+         of any size, as well as total re-indexing of all records
+         from the file system.</entry>
+ <entry><xref linkend="zebraidx"/></entry>
+ </row>
+ <row>
+ <entry>Remote updates</entry>
+ <entry>&acro.z3950; extended services</entry>
+ <entry>Updates can be performed from remote locations using the
+ &acro.z3950; extended services. Access to extended services can be
+ login-password protected.</entry>
+ <entry><xref linkend="administration-extended-services"/> and
+ <xref linkend="zebra-cfg"/></entry>
+ </row>
+ <row>
+ <entry>Live updates</entry>
+ <entry>transaction based</entry>
+	 <entry>Data updates are transaction based and can be performed
+	  on running &zebra; systems. Full searchability is preserved
+	  during live data updates thanks to the use of shadow disk areas
+	  for update operations. Simultaneous update transactions are
+	  queued and performed one after another. Data
+	  integrity is preserved.</entry>
+ <entry><xref linkend="shadow-registers"/></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
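For a concrete sketch, a batch update driven by the <command>zebraidx</command> tool might look as follows. The configuration file name and the <filename>records/</filename> directory are illustrative assumptions, and the commands presuppose that shadow registers are enabled in the configuration; consult the zebraidx reference for the actual options.

```shell
# Index new and changed records; with shadow registers enabled,
# searching continues against the old index while this runs.
zebraidx -c zebra.cfg update records/

# Atomically switch searches over to the freshly built registers.
zebraidx -c zebra.cfg commit
```

The same tool handles record deletion (`zebraidx delete ...`) as well as full re-indexing from scratch.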
</section>
- <section id="features-protocol">
- <title>&zebra; Networked Protocols</title>
-
- <table id="table-features-protocol" frame="top">
- <title>&zebra; networked protocols</title>
- <tgroup cols="4">
- <colspec colwidth="1*" colname="feature"/>
- <colspec colwidth="1*" colname="availability"/>
- <colspec colwidth="3*" colname="notes"/>
- <colspec colwidth="2*" colname="references"/>
- <thead>
- <row>
- <entry>Feature</entry>
- <entry>Availability</entry>
- <entry>Notes</entry>
- <entry>Reference</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>Fundamental operations</entry>
- <entry>&acro.z3950;/&acro.sru; <literal>explain</literal>,
- <literal>search</literal>, <literal>scan</literal>, and
- <literal>update</literal></entry>
- <entry></entry>
- <entry><xref linkend="querymodel-operation-types"/></entry>
- </row>
- <row>
- <entry>&acro.z3950; protocol support</entry>
- <entry>yes</entry>
- <entry> Protocol facilities supported are:
- <literal>init</literal>, <literal>search</literal>,
- <literal>present</literal> (retrieval),
- Segmentation (support for very large records),
- <literal>delete</literal>, <literal>scan</literal>
- (index browsing), <literal>sort</literal>,
- <literal>close</literal> and support for the <literal>update</literal>
- Extended Service to add or replace an existing &acro.xml;
- record. Piggy-backed presents are honored in the search
- request. Named result sets are supported.</entry>
- <entry><xref linkend="protocol-support"/></entry>
- </row>
- <row>
- <entry>Web Service support</entry>
- <entry>&acro.sru;</entry>
- <entry> The protocol operations <literal>explain</literal>,
- <literal>searchRetrieve</literal> and <literal>scan</literal>
- are supported. <ulink url="&url.cql;">&acro.cql;</ulink> to internal
- query model &acro.rpn;
- conversion is supported. Extended RPN queries
- for search/retrieve and scan are supported.</entry>
- <entry><xref linkend="zebrasrv-sru-support"/></entry>
- </row>
- </tbody>
- </tgroup>
- </table>
+ <section id="features-protocol">
+ <title>&zebra; Networked Protocols</title>
+
+ <table id="table-features-protocol" frame="top">
+ <title>&zebra; networked protocols</title>
+ <tgroup cols="4">
+ <colspec colwidth="1*" colname="feature"/>
+ <colspec colwidth="1*" colname="availability"/>
+ <colspec colwidth="3*" colname="notes"/>
+ <colspec colwidth="2*" colname="references"/>
+ <thead>
+ <row>
+ <entry>Feature</entry>
+ <entry>Availability</entry>
+ <entry>Notes</entry>
+ <entry>Reference</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>Fundamental operations</entry>
+ <entry>&acro.z3950;/&acro.sru; <literal>explain</literal>,
+ <literal>search</literal>, <literal>scan</literal>, and
+ <literal>update</literal></entry>
+ <entry></entry>
+ <entry><xref linkend="querymodel-operation-types"/></entry>
+ </row>
+ <row>
+ <entry>&acro.z3950; protocol support</entry>
+ <entry>yes</entry>
+ <entry> Protocol facilities supported are:
+ <literal>init</literal>, <literal>search</literal>,
+ <literal>present</literal> (retrieval),
+ Segmentation (support for very large records),
+ <literal>delete</literal>, <literal>scan</literal>
+ (index browsing), <literal>sort</literal>,
+ <literal>close</literal> and support for the <literal>update</literal>
+ Extended Service to add or replace an existing &acro.xml;
+ record. Piggy-backed presents are honored in the search
+ request. Named result sets are supported.</entry>
+ <entry><xref linkend="protocol-support"/></entry>
+ </row>
+ <row>
+ <entry>Web Service support</entry>
+ <entry>&acro.sru;</entry>
+ <entry> The protocol operations <literal>explain</literal>,
+ <literal>searchRetrieve</literal> and <literal>scan</literal>
+ are supported. <ulink url="&url.cql;">&acro.cql;</ulink> to internal
+ query model &acro.rpn;
+ conversion is supported. Extended RPN queries
+ for search/retrieve and scan are supported.</entry>
+ <entry><xref linkend="zebrasrv-sru-support"/></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
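As an illustration of these facilities, a short interactive session with YAZ's <command>yaz-client</command> against a hypothetical server on <literal>localhost:9999</literal> might run as below; the host, port, and index choice are assumptions, not defaults.

```shell
$ yaz-client localhost:9999
# search the title index (Bib-1 use attribute 1=4)
Z> find @attr 1=4 zebra
# retrieve the first record via a present request
Z> show 1
# browse terms in the title index
Z> scan @attr 1=4 zeb
Z> quit
```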
</section>
<section id="features-scalability">
<title>&zebra; Data Size and Scalability</title>
- <table id="table-features-scalability" frame="top">
- <title>&zebra; data size and scalability</title>
- <tgroup cols="4">
- <colspec colwidth="1*" colname="feature"/>
- <colspec colwidth="1*" colname="availability"/>
- <colspec colwidth="3*" colname="notes"/>
- <colspec colwidth="2*" colname="references"/>
- <thead>
- <row>
- <entry>Feature</entry>
- <entry>Availability</entry>
- <entry>Notes</entry>
- <entry>Reference</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>No of records</entry>
- <entry>40-60 million</entry>
- <entry></entry>
- <entry></entry>
- </row>
- <row>
- <entry>Data size</entry>
- <entry>100 GB of record data</entry>
- <entry>&zebra; based applications have successfully indexed up
- to 100 GB of record data</entry>
- <entry></entry>
- </row>
- <row>
- <entry>Scale out</entry>
- <entry>multiple discs</entry>
- <entry></entry>
- <entry></entry>
- </row>
- <row>
- <entry>Performance</entry>
- <entry><literal>O(n * log N)</literal></entry>
- <entry> &zebra; query speed and performance is affected roughly by
- <literal>O(log N)</literal>,
- where <literal>N</literal> is the total database size, and by
- <literal>O(n)</literal>, where <literal>n</literal> is the
- specific query hit set size.</entry>
- <entry></entry>
- </row>
- <row>
- <entry>Average search times</entry>
- <entry></entry>
- <entry> Even on very large size databases hit rates of 20 queries per
- seconds with average query answering time of 1 second are possible,
- provided that the boolean queries are constructed sufficiently
- precise to result in hit sets of the order of 1000 to 5.000
- documents.</entry>
- <entry></entry>
- </row>
- <row>
- <entry>Large databases</entry>
- <entry>64 bit file pointers</entry>
- <entry>64 file pointers assure that register files can extend
- the 2 GB limit. Logical files can be
- automatically partitioned over multiple disks, thus allowing for
- large databases.</entry>
- <entry></entry>
- </row>
- </tbody>
- </tgroup>
- </table>
+ <table id="table-features-scalability" frame="top">
+ <title>&zebra; data size and scalability</title>
+ <tgroup cols="4">
+ <colspec colwidth="1*" colname="feature"/>
+ <colspec colwidth="1*" colname="availability"/>
+ <colspec colwidth="3*" colname="notes"/>
+ <colspec colwidth="2*" colname="references"/>
+ <thead>
+ <row>
+ <entry>Feature</entry>
+ <entry>Availability</entry>
+ <entry>Notes</entry>
+ <entry>Reference</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>No of records</entry>
+ <entry>40-60 million</entry>
+ <entry></entry>
+ <entry></entry>
+ </row>
+ <row>
+ <entry>Data size</entry>
+ <entry>100 GB of record data</entry>
+	 <entry>&zebra;-based applications have successfully indexed up
+	  to 100 GB of record data.</entry>
+ <entry></entry>
+ </row>
+ <row>
+ <entry>Scale out</entry>
+ <entry>multiple discs</entry>
+ <entry></entry>
+ <entry></entry>
+ </row>
+ <row>
+ <entry>Performance</entry>
+ <entry><literal>O(n * log N)</literal></entry>
+	 <entry>&zebra; query speed scales roughly as
+	  <literal>O(log N)</literal>,
+	  where <literal>N</literal> is the total database size, and as
+	  <literal>O(n)</literal>, where <literal>n</literal> is the
+	  hit set size of the specific query.</entry>
+ <entry></entry>
+ </row>
+ <row>
+ <entry>Average search times</entry>
+ <entry></entry>
+	 <entry>Even on very large databases, rates of 20 queries per
+	  second with an average response time of 1 second are achievable,
+	  provided that the boolean queries are constructed precisely
+	  enough to yield hit sets on the order of 1,000 to 5,000
+	  documents.</entry>
+ <entry></entry>
+ </row>
+ <row>
+ <entry>Large databases</entry>
+ <entry>64 bit file pointers</entry>
+	 <entry>64-bit file pointers ensure that register files can grow
+	  beyond the 2 GB limit. Logical files can be
+	  automatically partitioned over multiple disks, thus allowing for
+	  large databases.</entry>
+ <entry></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</section>
<section id="features-platforms">
<title>&zebra; Supported Platforms</title>
- <table id="table-features-platforms" frame="top">
- <title>&zebra; supported platforms</title>
- <tgroup cols="4">
- <colspec colwidth="1*" colname="feature"/>
- <colspec colwidth="1*" colname="availability"/>
- <colspec colwidth="3*" colname="notes"/>
- <colspec colwidth="2*" colname="references"/>
- <thead>
- <row>
- <entry>Feature</entry>
- <entry>Availability</entry>
- <entry>Notes</entry>
- <entry>Reference</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>Linux</entry>
- <entry></entry>
- <entry>GNU Linux (32 and 64bit), journaling Reiser or (better)
- JFS file system
- on disks. NFS file systems are not supported.
- GNU/Debian Linux packages are available</entry>
- <entry><xref linkend="installation-debian"/></entry>
- </row>
- <row>
- <entry>Unix</entry>
- <entry>tar-ball</entry>
- <entry>&zebra; is written in portable C, so it runs on most
- Unix-like systems.
- Usual tar-ball install possible on many major Unix systems</entry>
- <entry><xref linkend="installation-unix"/></entry>
- </row>
- <row>
- <entry>Windows</entry>
- <entry>NT/2000/2003/XP</entry>
- <entry>&zebra; runs as well on Windows (NT/2000/2003/XP).
- Windows installer packages available</entry>
- <entry><xref linkend="installation-win32"/></entry>
- </row>
- </tbody>
- </tgroup>
- </table>
+ <table id="table-features-platforms" frame="top">
+ <title>&zebra; supported platforms</title>
+ <tgroup cols="4">
+ <colspec colwidth="1*" colname="feature"/>
+ <colspec colwidth="1*" colname="availability"/>
+ <colspec colwidth="3*" colname="notes"/>
+ <colspec colwidth="2*" colname="references"/>
+ <thead>
+ <row>
+ <entry>Feature</entry>
+ <entry>Availability</entry>
+ <entry>Notes</entry>
+ <entry>Reference</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>Linux</entry>
+ <entry></entry>
+	 <entry>GNU/Linux (32 and 64 bit), with a journaling Reiser or
+	  (better) JFS file system
+	  on local disks. NFS file systems are not supported.
+	  GNU/Debian Linux packages are available.</entry>
+ <entry><xref linkend="installation-debian"/></entry>
+ </row>
+ <row>
+ <entry>Unix</entry>
+ <entry>tar-ball</entry>
+	 <entry>&zebra; is written in portable C, so it runs on most
+	  Unix-like systems.
+	  The usual tar-ball install is possible on many major Unix systems.</entry>
+ <entry><xref linkend="installation-unix"/></entry>
+ </row>
+ <row>
+ <entry>Windows</entry>
+ <entry>NT/2000/2003/XP</entry>
+	 <entry>&zebra; also runs on Windows (NT/2000/2003/XP).
+	  Windows installer packages are available.</entry>
+ <entry><xref linkend="installation-win32"/></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</section>
-
- </section>
-
+
+ </section>
+
<section id="introduction-apps">
- <title>References and &zebra; based Applications</title>
- <para>
- &zebra; has been deployed in numerous applications, in both the
- academic and commercial worlds, in application domains as diverse
- as bibliographic catalogues, Geo-spatial information, structured
- vocabulary browsing, government information locators, civic
- information systems, environmental observations, museum information
- and web indexes.
- </para>
- <para>
- Notable applications include the following:
- </para>
-
-
- <section id="koha-ils">
- <title>Koha free open-source ILS</title>
+ <title>References and &zebra; based Applications</title>
+ <para>
+ &zebra; has been deployed in numerous applications, in both the
+ academic and commercial worlds, in application domains as diverse
+ as bibliographic catalogues, Geo-spatial information, structured
+ vocabulary browsing, government information locators, civic
+ information systems, environmental observations, museum information
+ and web indexes.
+ </para>
<para>
+ Notable applications include the following:
+ </para>
+
+
+ <section id="koha-ils">
+ <title>Koha free open-source ILS</title>
+ <para>
<ulink url="http://www.koha.org/">Koha</ulink> is a full-featured
- open-source ILS, initially developed in
+ open-source ILS, initially developed in
New Zealand by Katipo Communications Ltd, and first deployed in
January of 2000 for Horowhenua Library Trust. It is currently
maintained by a team of software providers and library technology
- staff from around the globe.
+ staff from around the globe.
</para>
<para>
- <ulink url="http://liblime.com/">LibLime</ulink>,
+ <ulink url="http://liblime.com/">LibLime</ulink>,
a company that is marketing and supporting Koha, adds in
the new release of Koha 3.0 the &zebra;
database server to drive its bibliographic database.
in the Koha 2.x series. After extensive evaluations of the best
of the Open Source textual database engines - including MySQL
full-text searching, PostgreSQL, Lucene and Plucene - the team
- selected &zebra;.
+ selected &zebra;.
</para>
<para>
"&zebra; completely eliminates scalability limitations, because it
Ferraro, LibLime's Technology President and Koha's Project
Release Manager. "Our performance tests showed search results in
under a second for databases with over 5 million records on a
- modest i386 900Mhz test server."
+ modest i386 900Mhz test server."
</para>
<para>
"&zebra; also includes support for true boolean search expressions
database updates, which allow on-the-fly record
management. Finally, since &zebra; has at its heart the &acro.z3950;
protocol, it greatly improves Koha's support for that critical
- library standard."
+ library standard."
</para>
- <para>
+ <para>
Although the bibliographic database will be moved to &zebra;, Koha
3.0 will continue to use a relational SQL-based database design
for the 'factual' database. "Relational database managers have
their strengths, in spite of their inability to handle large
numbers of bibliographic records efficiently," summed up Ferraro,
"We're taking the best from both worlds in our redesigned Koha
- 3.0.
- </para>
- <para>
+	 3.0."
+ </para>
+ <para>
See also LibLime's newsletter article
- <ulink url="http://www.liblime.com/newsletter/2006/01/features/koha-earns-its-stripes/">
- Koha Earns its Stripes</ulink>.
- </para>
+ <ulink url="http://www.liblime.com/newsletter/2006/01/features/koha-earns-its-stripes/">
+ Koha Earns its Stripes</ulink>.
+ </para>
</section>
- <section id="kete-dom">
- <title>Kete Open Source Digital Library and Archiving software</title>
- <para>
+ <section id="kete-dom">
+ <title>Kete Open Source Digital Library and Archiving software</title>
+ <para>
<ulink url="http://kete.net.nz/">Kete</ulink> is a digital object
- management repository, initially developed in
+ management repository, initially developed in
New Zealand. Initial development has
been a partnership between the Horowhenua Library Trust and
Katipo Communications Ltd. funded as part of the Community
Partnership Fund in 2006.
Kete is purpose built
software to enable communities to build their own digital
- libraries, archives and repositories.
+ libraries, archives and repositories.
</para>
<para>
It is based on Ruby-on-Rails and MySQL, and integrates the &zebra; server
application.
See
how Kete <ulink
- url="http://kete.net.nz/documentation/topics/show/139-managing-zebra">manages
- Zebra.</ulink>
- </para>
- <para>
+ url="http://kete.net.nz/documentation/topics/show/139-managing-zebra">manages
+ Zebra.</ulink>
+ </para>
+ <para>
	Why does Kete want to use Zebra? Speed, scalability, and easy
- integration with Koha. Read their
- <ulink
- url="http://kete.net.nz/blog/topics/show/44-who-what-why-when-answering-some-of-the-niggly-development-questions">detailed
- reasoning here.</ulink>
+ integration with Koha. Read their
+ <ulink
+ url="http://kete.net.nz/blog/topics/show/44-who-what-why-when-answering-some-of-the-niggly-development-questions">detailed
+ reasoning here.</ulink>
</para>
</section>
- <section id="reindex-ils">
- <title>ReIndex.Net web based ILS</title>
+ <section id="reindex-ils">
+ <title>ReIndex.Net web based ILS</title>
<para>
<ulink url="http://www.reindex.net/index.php?lang=en">Reindex.net</ulink>
is a netbased library service offering all
services. Reindex.net is a comprehensive and powerful WEB system
based on standards such as &acro.xml; and &acro.z3950;.
      updates. Reindex supports &acro.marc21;, dan&acro.marc;, or Dublin Core with
- UTF8-encoding.
+	UTF-8 encoding.
</para>
<para>
Reindex.net runs on GNU/Debian Linux with &zebra; and Simpleserver
- from Index
+ from Index
Data for bibliographic data. The relational database system
Sybase 9 &acro.xml; is used for
- administrative data.
+ administrative data.
Internally &acro.marcxml; is used for bibliographical records. Update
- utilizes &acro.z3950; extended services.
+ utilizes &acro.z3950; extended services.
</para>
</section>
<title>DADS - the DTV Article Database
Service</title>
<para>
- DADS is a huge database of more than ten million records, totalling
- over ten gigabytes of data. The records are metadata about academic
- journal articles, primarily scientific; about 10% of these
- metadata records link to the full text of the articles they
- describe, a body of about a terabyte of information (although the
- full text is not indexed.)
- </para>
- <para>
- It allows students and researchers at DTU (Danmarks Tekniske
- Universitet, the Technical College of Denmark) to find and order
- articles from multiple databases in a single query. The database
- contains literature on all engineering subjects. It's available
- on-line through a web gateway, though currently only to registered
- users.
- </para>
- <para>
- More information can be found at
- <ulink url="http://www.dtic.dtu.dk/"/> and
- <ulink url="http://dads.dtv.dk"/>
- </para>
- </section>
+ DADS is a huge database of more than ten million records, totalling
+ over ten gigabytes of data. The records are metadata about academic
+ journal articles, primarily scientific; about 10% of these
+ metadata records link to the full text of the articles they
+ describe, a body of about a terabyte of information (although the
+ full text is not indexed.)
+ </para>
+ <para>
+ It allows students and researchers at DTU (Danmarks Tekniske
+ Universitet, the Technical College of Denmark) to find and order
+ articles from multiple databases in a single query. The database
+ contains literature on all engineering subjects. It's available
+ on-line through a web gateway, though currently only to registered
+ users.
+ </para>
+ <para>
+ More information can be found at
+ <ulink url="http://www.dtic.dtu.dk/"/> and
+ <ulink url="http://dads.dtv.dk"/>
+ </para>
+ </section>
- <section id="uls">
- <title>ULS (Union List of Serials)</title>
- <para>
- The M25 Systems Team
- has created a union catalogue for the periodicals of the
- twenty-one constituent libraries of the University of London and
- the University of Westminster
- (<ulink url="http://www.m25lib.ac.uk/ULS/"/>).
- They have achieved this using an
- unusual architecture, which they describe as a
- ``non-distributed virtual union catalogue''.
- </para>
- <para>
- The member libraries send in data files representing their
- periodicals, including both brief bibliographic data and summary
- holdings. Then 21 individual &acro.z3950; targets are created, each
- using &zebra;, and all mounted on the single hardware server.
- The live service provides a web gateway allowing &acro.z3950; searching
- of all of the targets or a selection of them. &zebra;'s small
- footprint allows a relatively modest system to comfortably host
- the 21 servers.
- </para>
- <para>
- More information can be found at
- <ulink url="http://www.m25lib.ac.uk/ULS/"/>
- </para>
- </section>
+ <section id="uls">
+ <title>ULS (Union List of Serials)</title>
+ <para>
+ The M25 Systems Team
+ has created a union catalogue for the periodicals of the
+ twenty-one constituent libraries of the University of London and
+ the University of Westminster
+ (<ulink url="http://www.m25lib.ac.uk/ULS/"/>).
+ They have achieved this using an
+ unusual architecture, which they describe as a
+ ``non-distributed virtual union catalogue''.
+ </para>
+ <para>
+ The member libraries send in data files representing their
+ periodicals, including both brief bibliographic data and summary
+ holdings. Then 21 individual &acro.z3950; targets are created, each
+ using &zebra;, and all mounted on the single hardware server.
+ The live service provides a web gateway allowing &acro.z3950; searching
+ of all of the targets or a selection of them. &zebra;'s small
+ footprint allows a relatively modest system to comfortably host
+ the 21 servers.
+ </para>
+ <para>
+ More information can be found at
+ <ulink url="http://www.m25lib.ac.uk/ULS/"/>
+ </para>
+ </section>
- <section id="various-web-indexes">
- <title>Various web indexes</title>
- <para>
- &zebra; has been used by a variety of institutions to construct
- indexes of large web sites, typically in the region of tens of
- millions of pages. In this role, it functions somewhat similarly
- to the engine of Google or AltaVista, but for a selected intranet
- or a subset of the whole Web.
- </para>
- <para>
- For example, Liverpool University's web-search facility (see on
- the home page at
- <ulink url="http://www.liv.ac.uk/"/>
- and many sub-pages) works by relevance-searching a &zebra; database
- which is populated by the Harvest-NG web-crawling software.
- </para>
- <para>
- For more information on Liverpool university's intranet search
- architecture, contact John Gilbertson
- <email>jgilbert@liverpool.ac.uk</email>
- </para>
- <para>
- Kang-Jin Lee
- has recently modified the Harvest web indexer to use &zebra; as
- its native repository engine. His comments on the switch over
- from the old engine are revealing:
- <blockquote>
- <para>
- The first results after some testing with &zebra; are very
- promising. The tests were done with around 220,000 SOIF files,
- which occupies 1.6GB of disk space.
- </para>
- <para>
- Building the index from scratch takes around one hour with &zebra;
- where [old-engine] needs around five hours. While [old-engine]
- blocks search requests when updating its index, &zebra; can still
- answer search requests.
- [...]
- &zebra; supports incremental indexing which will speed up indexing
- even further.
- </para>
- <para>
- While the search time of [old-engine] varies from some seconds
- to some minutes depending how expensive the query is, &zebra;
- usually takes around one to three seconds, even for expensive
- queries.
- [...]
- &zebra; can search more than 100 times faster than [old-engine]
- and can process multiple search requests simultaneously
- </para>
- <para>
- I am very happy to see such nice software available under GPL.
- </para>
- </blockquote>
- </para>
+ <section id="various-web-indexes">
+ <title>Various web indexes</title>
+ <para>
+ &zebra; has been used by a variety of institutions to construct
+ indexes of large web sites, typically in the region of tens of
+ millions of pages. In this role, it functions somewhat similarly
+ to the engine of Google or AltaVista, but for a selected intranet
+ or a subset of the whole Web.
+ </para>
+ <para>
+      For example, Liverpool University's web-search facility (see
+ the home page at
+ <ulink url="http://www.liv.ac.uk/"/>
+ and many sub-pages) works by relevance-searching a &zebra; database
+ which is populated by the Harvest-NG web-crawling software.
+ </para>
+ <para>
+      For more information on Liverpool University's intranet search
+      architecture, contact John Gilbertson at
+      <email>jgilbert@liverpool.ac.uk</email>.
+ </para>
+ <para>
+ Kang-Jin Lee
+ has recently modified the Harvest web indexer to use &zebra; as
+ its native repository engine. His comments on the switch over
+ from the old engine are revealing:
+ <blockquote>
+ <para>
+ The first results after some testing with &zebra; are very
+ promising. The tests were done with around 220,000 SOIF files,
+	 which occupy 1.6 GB of disk space.
+ </para>
+ <para>
+ Building the index from scratch takes around one hour with &zebra;
+	 whereas [old-engine] needs around five hours. While [old-engine]
+ blocks search requests when updating its index, &zebra; can still
+ answer search requests.
+ [...]
+ &zebra; supports incremental indexing which will speed up indexing
+ even further.
+ </para>
+ <para>
+ While the search time of [old-engine] varies from some seconds
+	 to some minutes depending on how expensive the query is, &zebra;
+ usually takes around one to three seconds, even for expensive
+ queries.
+ [...]
+ &zebra; can search more than 100 times faster than [old-engine]
+	 and can process multiple search requests simultaneously.
+ </para>
+ <para>
+ I am very happy to see such nice software available under GPL.
+ </para>
+ </blockquote>
+ </para>
+ </section>
</section>
- </section>
-
-
+
<section id="introduction-support">
<title>Support</title>
<para>
releases, bug fixes, etc.) and general discussion. You are welcome
to seek support there. Join by filling the form on the list home page.
</para>
- </section>
-</chapter>
+ </section>
+ </chapter>
<!-- Keep this comment at the end of the file
Local variables:
mode: sgml
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
- sgml-parent-document: "zebra.xml"
+ sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
-<appendix id="license">
- <title>License</title>
+ <appendix id="license">
+ <title>License</title>
<para>
- &zebra; Server,
- Copyright © ©right-year; Index Data ApS.
- </para>
+ &zebra; Server,
+    Copyright © &copyright-year; Index Data ApS.
+ </para>
<para>
&zebra; is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2, or (at your option) any later
version.
- </para>
+ </para>
<para>
&zebra; is distributed in the hope that it will be useful, but WITHOUT ANY
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
</para>
-
+
<para>
You should have received a copy of the GNU General Public License
along with &zebra;; see the file LICENSE.zebra. If not, write to the
- Free Software Foundation,
+ Free Software Foundation,
51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
- </para>
-
+ </para>
+
</appendix>
<!-- Keep this comment at the end of the file
Local variables:
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
- sgml-parent-document: "zebra.xml"
+ sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
<chapter id="querymodel">
<title>Query Model</title>
-
+
<section id="querymodel-overview">
- <title>Query Model Overview</title>
-
+ <title>Query Model Overview</title>
+
<section id="querymodel-query-languages">
<title>Query Languages</title>
-
+
<para>
      &zebra; was born as a networking Information Retrieval engine adhering
- to the international standards
+ to the international standards
<ulink url="&url.z39.50;">&acro.z3950;</ulink> and
<ulink url="&url.sru;">&acro.sru;</ulink>,
- and implement the
+      and implements the
type-1 Reverse Polish Notation (&acro.rpn;) query
model defined there.
Unfortunately, this model has only defined a binary
encoded representation, which is used as transport packaging in
the &acro.z3950; protocol layer. This representation is not human
- readable, nor defines any convenient way to specify queries.
+      readable, nor does it define a convenient way to specify queries.
</para>
<para>
Since the type-1 (&acro.rpn;)
representation, every client application needs to provide some
form of mapping from a local query notation or representation to it.
</para>
-
-
+
+
<section id="querymodel-query-languages-pqf">
<title>Prefix Query Format (&acro.pqf;)</title>
<para>
- Index Data has defined a textual representation in the
+ Index Data has defined a textual representation in the
<ulink url="&url.yaz.pqf;">Prefix Query Format</ulink>, short
- <emphasis>&acro.pqf;</emphasis>, which maps
- one-to-one to binary encoded
+ <emphasis>&acro.pqf;</emphasis>, which maps
+ one-to-one to binary encoded
<emphasis>type-1 &acro.rpn;</emphasis> queries.
&acro.pqf; has been adopted by other
parties developing &acro.z3950; software, and is often referred to as
- <emphasis>Prefix Query Notation</emphasis>, or in short
- &acro.pqn;. See
+      <emphasis>Prefix Query Notation</emphasis>, or &acro.pqn;
+      for short. See
<xref linkend="querymodel-rpn"/> for further explanations and
- descriptions of &zebra;'s capabilities.
+ descriptions of &zebra;'s capabilities.
</para>
- </section>
-
+ </section>
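A few small &acro.pqf; examples may help fix ideas. The Bib-1 use attributes shown here (1=4 for title, 1=1003 for author) are conventional choices, and the search terms are arbitrary:

```
@attr 1=4 sushi
@and @attr 1=4 sushi @attr 1=1003 smith
@or @attr 1=4 sushi @attr 1=4 sashimi
```

The first query matches a single term in the title index; the second intersects a title term with an author term; the third is the union of two title terms. The operator always precedes its two operands, hence "prefix" notation.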
+
<section id="querymodel-query-languages-cql">
<title>Common Query Language (&acro.cql;)</title>
<para>
The query model of the type-1 &acro.rpn;,
- expressed in &acro.pqf;/&acro.pqn; is natively supported.
+      expressed in &acro.pqf;/&acro.pqn;, is natively supported.
On the other hand, the default &acro.sru;
web services <emphasis>Common Query Language</emphasis>
<ulink url="&url.cql;">&acro.cql;</ulink> is not natively supported.
&zebra; can be configured to understand and map &acro.cql; to &acro.pqf;. See
<xref linkend="querymodel-cql-to-pqf"/>.
</para>
- </section>
-
+ </section>
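The mapping is driven by a properties file in the YAZ <filename>pqf.properties</filename> style. A minimal, purely illustrative fragment (the Dublin Core index names and attribute values here are examples, not a complete or mandated set) might read:

```
# index used when a CQL query names no index explicitly
index.cql.serverChoice = 1=1016
# map Dublin Core indexes onto Bib-1 use attributes
index.dc.title   = 1=4
index.dc.creator = 1=1003
```

With such a file in place, a CQL query such as `dc.title=sushi` would be rewritten to the &acro.pqf; query `@attr 1=4 sushi` before evaluation.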
+
</section>
<section id="querymodel-operation-types">
<para>
&zebra; supports all of the three different
&acro.z3950;/&acro.sru; operations defined in the
- standards: explain, search,
+ standards: explain, search,
and scan. A short description of the
- functionality and purpose of each is quite in order here.
+      functionality and purpose of each is in order here.
</para>
<section id="querymodel-operation-type-explain">
well known to any client, but the specific
<emphasis>semantics</emphasis> - taking into account a
particular servers functionalities and abilities - must be
- discovered from case to case. Enters the
- explain operation, which provides the means for learning which
+      discovered case by case. Enter the
+      explain operation, which provides the means for learning which
<emphasis>fields</emphasis> (also called
<emphasis>indexes</emphasis> or <emphasis>access points</emphasis>)
are provided, which default parameter the server uses, which
retrieve document formats are defined, and which specific parts
- of the general query model are supported.
+ of the general query model are supported.
</para>
<para>
The &acro.z3950; embeds the explain operation
- by performing a
- search in the magic
+ by performing a
+ search in the magic
<literal>IR-Explain-1</literal> database;
- see <xref linkend="querymodel-exp1"/>.
+ see <xref linkend="querymodel-exp1"/>.
</para>
<para>
In &acro.sru;, explain is an entirely separate
- operation, which returns an ZeeRex &acro.xml; record according to the
+      operation, which returns a ZeeRex &acro.xml; record according to the
structure defined by the protocol.
</para>
<para>
In both cases, the information gathered through
explain operations can be used to
      auto-configure a client user interface to the server's
- capabilities.
+ capabilities.
</para>
</section>
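With <command>yaz-client</command>, such an explain lookup is an ordinary search against the magic database. The record category <literal>databaseinfo</literal> used below is one of several defined by Explain; the exact records a server exposes vary:

```shell
Z> base IR-Explain-1
Z> form sutrs
Z> find @attr exp1 1=1 databaseinfo
Z> show 1
```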
<section id="querymodel-operation-type-search">
<title>Search Operation</title>
<para>
- Search and retrieve interactions are the raison d'être.
+ Search and retrieve interactions are the raison d'être.
They are used to query the remote database and
return search result documents. Search queries span from
simple free text searches to nested complex boolean queries,
<title>Scan Operation</title>
<para>
The scan operation is a helper functionality,
- which operates on one index or access point a time.
+ which operates on one index or access point at a time.
</para>
<para>
It provides
the indexes, and in addition the scan
operation returns the number of documents indexed by each term.
A search client can use this information to propose proper
- spelling of search terms, to auto-fill search boxes, or to
+ spelling of search terms, to auto-fill search boxes, or to
display controlled vocabularies.
</para>
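<para>
For example, a scan of the <literal>Any</literal> index
(&acro.bib1; use attribute 1016) around the term
<emphasis>water</emphasis> could look as follows. This is an
illustrative <application>yaz-client</application> sketch; the
terms and document counts returned depend entirely on the indexed
data.
<screen>
Z> scan @attr 1=1016 water
</screen>
</para>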
</section>
</section>
- </section>
+ </section>
<section id="querymodel-rpn">
<title>&acro.rpn; queries and semantics</title>
is documented in the &yaz; manual, and shall not be
repeated here. This textual &acro.pqf; representation
is not transmitted to &zebra; during search; rather, the
- client mapped to the equivalent &acro.z3950; binary
- query parse tree.
+ client maps it to the equivalent &acro.z3950; binary
+ query parse tree.
</para>
-
+
<section id="querymodel-rpn-tree">
<title>&acro.rpn; tree structure</title>
<para>
The &acro.rpn; parse tree - or the equivalent textual representation in &acro.pqf; -
- may start with one specification of the
+ may start with a specification of the
<emphasis>attribute set</emphasis> used. Following is a query
- tree, which
+ tree, which
consists of <emphasis>atomic query parts (&acro.apt;)</emphasis> or
<emphasis>named result sets</emphasis>, possibly
- paired by <emphasis>boolean binary operators</emphasis>, and
- finally <emphasis>recursively combined </emphasis> into
- complex query trees.
+ paired by <emphasis>binary boolean operators</emphasis>, and
+ finally <emphasis>recursively combined</emphasis> into
+ complex query trees.
</para>
-
+
<section id="querymodel-attribute-sets">
<title>Attribute sets</title>
<para>
definitions, others can easily be defined and added to the
configuration.
</para>
-
+
<table id="querymodel-attribute-sets-table" frame="top">
<title>Attribute sets predefined in &zebra;</title>
<tgroup cols="4">
<entry>Notes</entry>
</row>
</thead>
-
+
<tbody>
<row>
<entry>Explain</entry>
<entry><literal>bib-1</literal></entry>
<entry>Standard &acro.pqf; query language attribute set which defines the
semantics of &acro.z3950; searching. In addition, all of the
- non-use attributes (types 2-14) define the hard-wired
+ non-use attributes (types 2-14) define the hard-wired
&zebra; internal query
processing.</entry>
<entry>default</entry>
</tbody>
</tgroup>
</table>
-
+
<para>
The use attribute (type 1) mappings for the
predefined attribute sets are found in the
- attribute set configuration files <filename>tab/*.att</filename>.
+ attribute set configuration files <filename>tab/*.att</filename>.
</para>
-
+
<note>
<para>
- The &zebra; internal query processing is modeled after
+ The &zebra; internal query processing is modeled after
the &acro.bib1; attribute set, and the non-use
attributes type 2-6 are hard-wired in. It is therefore essential
- to be familiar with <xref linkend="querymodel-bib1-nonuse"/>.
+ to be familiar with <xref linkend="querymodel-bib1-nonuse"/>.
</para>
</note>
-
+
</section>
-
+
<section id="querymodel-boolean-operators">
<title>Boolean operators</title>
<para>
using the standard boolean operators into new query trees.
Thus, boolean operators are always internal nodes in the query tree.
</para>
-
+
<table id="querymodel-boolean-operators-table" frame="top">
<title>Boolean operators</title>
<tgroup cols="3">
</row>
<row><entry><literal>@prox</literal></entry>
<entry>binary PROXIMITY operator</entry>
- <entry>Set intersection of two atomic queries hit sets. In
- addition, the intersection set is purged for all
- documents which do not satisfy the requested query
- term proximity. Usually a proper subset of the AND
+ <entry>Set intersection of two atomic queries' hit sets. In
+ addition, the intersection set is purged of all
+ documents which do not satisfy the requested query
+ term proximity. Usually a proper subset of the AND
operation.</entry>
</row>
</tbody>
</tgroup>
</table>
-
+
<para>
- For example, we can combine the terms
- <emphasis>information</emphasis> and <emphasis>retrieval</emphasis>
+ For example, we can combine the terms
+ <emphasis>information</emphasis> and <emphasis>retrieval</emphasis>
into different searches in the default index of the default
attribute set as follows.
Querying for the union of all documents containing the
terms <emphasis>information</emphasis> OR
- <emphasis>retrieval</emphasis>:
+ <emphasis>retrieval</emphasis>:
<screen>
Z> find @or information retrieval
</screen>
<para>
Querying for the intersection of all documents containing the
terms <emphasis>information</emphasis> AND
- <emphasis>retrieval</emphasis>:
+ <emphasis>retrieval</emphasis>:
The hit set is a subset of the corresponding
OR query.
<screen>
terms <emphasis>information</emphasis> AND
<emphasis>retrieval</emphasis>, taking proximity into account:
The hit set is a subset of the corresponding
- AND query
+ AND query
(see the <ulink url="&url.yaz.pqf;">&acro.pqf; grammar</ulink> for
details on the proximity operator):
<screen>
Querying for the intersection of all documents containing the
terms <emphasis>information</emphasis> AND
<emphasis>retrieval</emphasis>, in the same order and near each
- other as described in the term list.
+ other as described in the term list.
The hit set is a subset of the corresponding
PROXIMITY query.
<screen>
</screen>
</para>
</section>
-
-
+
+
<section id="querymodel-atomic-queries">
<title>Atomic queries (&acro.apt;)</title>
<para>
Atomic queries are the query parts which work on one access point
only. These consist of <emphasis>an attribute list</emphasis>
followed by a <emphasis>single term</emphasis> or a
- <emphasis>quoted term list</emphasis>, and are often called
+ <emphasis>quoted term list</emphasis>, and are often called
<emphasis>Attributes-Plus-Terms (&acro.apt;)</emphasis> queries.
</para>
<para>
- Atomic (&acro.apt;) queries are always leaf nodes in the &acro.pqf; query tree.
+ Atomic (&acro.apt;) queries are always leaf nodes in the &acro.pqf; query tree.
Un-supplied non-use attributes of types 2-12 are either inherited from
higher nodes in the query tree, or are set to &zebra;'s default values.
- See <xref linkend="querymodel-bib1"/> for details.
+ See <xref linkend="querymodel-bib1"/> for details.
</para>
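<para>
For example, the following atomic query supplies only a use
attribute; all remaining non-use attribute types are filled in
from &zebra;'s defaults as described above (an illustrative
sketch, not tied to any particular database):
<screen>
Z> find @attr 1=4 mozart
</screen>
</para>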
-
+
<table id="querymodel-atomic-queries-table" frame="top">
<title>Atomic queries (&acro.apt;)</title>
<tgroup cols="3">
<entry>Type</entry>
<entry>Notes</entry>
</row>
- </thead>
+ </thead>
<tbody>
<row>
<entry><emphasis>attribute list</emphasis></entry>
</row>
<row>
<entry><emphasis>term</emphasis></entry>
- <entry>single <emphasis>term</emphasis>
+ <entry>single <emphasis>term</emphasis>
or <emphasis>quoted term list</emphasis> </entry>
<entry>Here the search term or list of search terms is added
to the query</entry>
</para>
<para>
- The <emphasis>scan</emphasis> operation is only supported with
+ The <emphasis>scan</emphasis> operation is only supported with
atomic &acro.apt; queries, as it is bound to one access point at a
time. Boolean query trees are not allowed during
<emphasis>scan</emphasis>.
- </para>
-
+ </para>
+
<para>
For example, we might want to scan the title index, starting with
- the term
+ the term
<emphasis>debussy</emphasis>, and displaying this and the
following terms in lexicographic order:
<screen>
</screen>
</para>
</section>
-
-
+
+
<section id="querymodel-resultset">
<title>Named Result Sets</title>
<para>
result sets are leaf nodes in the &acro.pqf; query tree, exactly as
atomic &acro.apt; queries are.
</para>
- <para>
+ <para>
After the execution of a search, the result set is available at
the server, such that the client can use it for subsequent
searches or retrieval requests. The &acro.z3950; standard actually
send a diagnostic to the effect that the requested
result set does not exist any more.
</para>
-
+
<para>
Defining a named result set and re-using it in the next query,
using <application>yaz-client</application>. Notice that the client, not
the server, assigns the string '1' to the
- named result set.
+ named result set.
<screen>
Z> f @attr 1=4 mozart
...
Number of hits: 14, setno 2
</screen>
</para>
-
+
<note>
<para>
Named result sets are only supported by the &acro.z3950; protocol.
</para>
</note>
</section>
-
+
<section id="querymodel-use-string">
<title>&zebra;'s special access point of type 'string'</title>
<para>
- The numeric <emphasis>use (type 1)</emphasis> attribute is usually
+ The numeric <emphasis>use (type 1)</emphasis> attribute is usually
referred to from a given
- attribute set. In addition, &zebra; let you use
+ attribute set. In addition, &zebra; lets you use
<emphasis>any internal index
- name defined in your configuration</emphasis>
+ name defined in your configuration</emphasis>
as use attribute value. This is a great feature for
debugging, and when you do
not need the complexity of defined use attribute values. It is
- the preferred way of accessing &zebra; indexes directly.
+ the preferred way of accessing &zebra; indexes directly.
</para>
<para>
Finding all documents which have the term list "information
<para>
It is possible to search
in any silly string index - if it's defined in your
- indexing rules and can be parsed by the &acro.pqf; parser.
+ indexing rules and can be parsed by the &acro.pqf; parser.
This is definitely not the recommended use of
this facility, as it might confuse your users with some very
unexpected results.
</screen>
</para>
<para>
- See also <xref linkend="querymodel-pqf-apt-mapping"/> for details, and
+ See also <xref linkend="querymodel-pqf-apt-mapping"/> for details, and
<xref linkend="zebrasrv-sru"/>
for the &acro.sru; &acro.pqf; query extension using string names as a fast
debugging facility.
</para>
</section>
-
+
<section id="querymodel-use-xpath">
- <title>&zebra;'s special access point of type 'XPath'
+ <title>&zebra;'s special access point of type 'XPath'
for &acro.grs1; filters</title>
<para>
As we have seen above, it is possible (albeit seldom a great
- idea) to emulate
+ idea) to emulate
<ulink url="http://www.w3.org/TR/xpath">XPath 1.0</ulink> based
search by defining <emphasis>use (type 1)</emphasis>
- <emphasis>string</emphasis> attributes which in appearance
+ <emphasis>string</emphasis> attributes which in appearance
<emphasis>resemble XPath queries</emphasis>. There are two
problems with this approach: first, the XPath-look-alike has to
be defined at indexing time, so no new, undefined
XPath queries can be entered at search time; and second, it might
confuse users very much that an XPath-alike index name in fact
gets populated from a possibly entirely different &acro.xml; element
- than it pretends to access.
+ than it pretends to access.
</para>
<para>
When using the &acro.grs1; Record Model
(see <xref linkend="grs"/>), we have the
- possibility to embed <emphasis>life</emphasis>
+ possibility to embed <emphasis>live</emphasis>
XPath expressions
in the &acro.pqf; queries, which are here called
<emphasis>use (type 1)</emphasis> <emphasis>xpath</emphasis>
- attributes. You must enable the
- <literal>xpath enable</literal> directive in your
- <literal>.abs</literal> configuration files.
+ attributes. You must enable the
+ <literal>xpath enable</literal> directive in your
+ <literal>.abs</literal> configuration files.
</para>
<note>
<para>
- Only a <emphasis>very</emphasis> restricted subset of the
- <ulink url="http://www.w3.org/TR/xpath">XPath 1.0</ulink>
+ Only a <emphasis>very</emphasis> restricted subset of the
+ <ulink url="http://www.w3.org/TR/xpath">XPath 1.0</ulink>
standard is supported as the &acro.grs1; record model is simpler than
- a full &acro.xml; &acro.dom; structure. See the following examples for
+ a full &acro.xml; &acro.dom; structure. See the following examples for
possibilities.
</para>
</note>
<para>
- Finding all documents which have the term "content"
+ Finding all documents which have the term "content"
inside a text node found in a specific &acro.xml; &acro.dom;
- <emphasis>subtree</emphasis>, whose starting element is
- addressed by XPath.
+ <emphasis>subtree</emphasis>, whose starting element is
+ addressed by XPath.
<screen>
- Z> find @attr 1=/root content
+ Z> find @attr 1=/root content
Z> find @attr 1=/root/first content
</screen>
<emphasis>Notice that the
</emphasis>
It follows that the above searches are interpreted as:
<screen>
- Z> find @attr 1=/root//text() content
+ Z> find @attr 1=/root//text() content
Z> find @attr 1=/root/first//text() content
</screen>
</para>
-
+
<para>
Searching inside attribute strings is possible:
<screen>
- Z> find @attr 1=/link/@creator morten
+ Z> find @attr 1=/link/@creator morten
</screen>
- </para>
+ </para>
- <para>
+ <para>
Filtering the addressing XPath by a predicate working on exact
string values in
attributes (in the &acro.xml; sense) is possible: return all those docs which
<screen>
Z> find @attr 1=/record/title[@lang='en'] english
Z> find @attr 1=/link[@creator='sisse'] sibelius
- Z> find @attr 1=/link[@creator='sisse']/description[@xml:lang='da'] sibelius
+ Z> find @attr 1=/link[@creator='sisse']/description[@xml:lang='da'] sibelius
</screen>
</para>
-
- <para>
- Combining numeric indexes, boolean expressions,
+
+ <para>
+ Combining numeric indexes, boolean expressions,
and xpath based searches is possible:
<screen>
Z> find @attr 1=/record/title @and foo bar
Z> find @and @attr 1=/record/title foo @attr 1=4 bar
- </screen>
+ </screen>
</para>
<para>
Escaping &acro.pqf; keywords and other non-parseable XPath constructs
</warning>
</section>
</section>
-
+
<section id="querymodel-exp1">
<title>Explain Attribute Set</title>
<para>
- The &acro.z3950; standard defines the
+ The &acro.z3950; standard defines the
<ulink url="&url.z39.50.explain;">Explain</ulink> attribute set
- Exp-1, which is used to discover information
+ Exp-1, which is used to discover information
about a server's search semantics and functional capabilities.
- &zebra; exposes a "classic"
+ &zebra; exposes a "classic"
Explain database by base name <literal>IR-Explain-1</literal>, which
- is populated with system internal information.
+ is populated with system internal information.
</para>
- <para>
- The attribute-set <literal>exp-1</literal> consists of a single
- use attribute (type 1).
+ <para>
+ The attribute-set <literal>exp-1</literal> consists of a single
+ use attribute (type 1).
</para>
<para>
In addition, the non-Use
- &acro.bib1; attributes, that is, the types
+ &acro.bib1; attributes, that is, the types
<emphasis>Relation</emphasis>, <emphasis>Position</emphasis>,
- <emphasis>Structure</emphasis>, <emphasis>Truncation</emphasis>,
- and <emphasis>Completeness</emphasis> are imported from
+ <emphasis>Structure</emphasis>, <emphasis>Truncation</emphasis>,
+ and <emphasis>Completeness</emphasis> are imported from
the &acro.bib1; attribute set, and may be used
- within any explain query.
+ within any explain query.
</para>
-
+
<section id="querymodel-exp1-use">
- <title>Use Attributes (type = 1)</title>
+ <title>Use Attributes (type = 1)</title>
<para>
The following Explain search attributes are supported:
- <literal>ExplainCategory</literal> (@attr 1=1),
- <literal>DatabaseName</literal> (@attr 1=3),
- <literal>DateAdded</literal> (@attr 1=9),
+ <literal>ExplainCategory</literal> (@attr 1=1),
+ <literal>DatabaseName</literal> (@attr 1=3),
+ <literal>DateAdded</literal> (@attr 1=9),
<literal>DateChanged</literal> (@attr 1=10).
</para>
<para>
<literal>DatabaseInfo</literal>, <literal>AttributeDetails</literal>.
</para>
<para>
- See <filename>tab/explain.att</filename> and the
+ See <filename>tab/explain.att</filename> and the
<ulink url="&url.z39.50;">&acro.z3950;</ulink> standard
for more information.
</para>
</section>
-
+
<section id="querymodel-examples">
<title>Explain searches with yaz-client</title>
<para>
via ASN.1. Practically no &acro.z3950; clients support this. Fortunately
they don't have to - &zebra; allows retrieval of this information
in other formats:
- &acro.sutrs;, &acro.xml;,
+ &acro.sutrs;, &acro.xml;,
&acro.grs1; and <literal>ASN.1</literal> Explain.
</para>
-
+
<para>
List supported categories to find out which explain commands are
- supported:
+ supported:
<screen>
Z> base IR-Explain-1
Z> find @attr exp1 1=1 categorylist
Z> show 1+2
</screen>
</para>
-
+
<para>
Get target info, that is, investigate which databases exist at
this server endpoint:
Z> show 1+1
</screen>
</para>
-
+
<para>
List all supported databases; the number of hits
is the number of databases found, which most commonly are the
Z> show 1+2
</screen>
</para>
-
+
<para>
Get database info record for database <literal>Default</literal>.
<screen>
Z> find @attrset exp1 @and @attr 1=1 databaseinfo @attr 1=3 Default
</screen>
</para>
-
+
<para>
Get attribute details record for database
<literal>Default</literal>.
</screen>
</para>
</section>
-
+
</section>
-
+
<section id="querymodel-bib1">
<title>&acro.bib1; Attribute Set</title>
<para>
Most of the information contained in this section is an excerpt of
the ATTRIBUTE SET &acro.bib1; (&acro.z3950;-1995) SEMANTICS
found at <ulink url="&url.z39.50.attset.bib1.1995;">The &acro.bib1;
- Attribute Set Semantics</ulink> from 1995, also in an updated
+ Attribute Set Semantics</ulink> from 1995, also in an updated
<ulink url="&url.z39.50.attset.bib1;">&acro.bib1;
- Attribute Set</ulink>
+ Attribute Set</ulink>
version from 2003. Index Data is not the copyright holder of this
information, except for the configuration details, the listing of
- &zebra;'s capabilities, and the example queries.
+ &zebra;'s capabilities, and the example queries.
</para>
-
-
- <section id="querymodel-bib1-use">
+
+
+ <section id="querymodel-bib1-use">
<title>Use Attributes (type 1)</title>
- <para>
- A use attribute specifies an access point for any atomic query.
- These access points are highly dependent on the attribute set used
- in the query, and are user configurable using the following
- default configuration files:
- <filename>tab/bib1.att</filename>,
- <filename>tab/dan1.att</filename>,
- <filename>tab/explain.att</filename>, and
- <filename>tab/gils.att</filename>.
+ <para>
+ A use attribute specifies an access point for any atomic query.
+ These access points are highly dependent on the attribute set used
+ in the query, and are user configurable using the following
+ default configuration files:
+ <filename>tab/bib1.att</filename>,
+ <filename>tab/dan1.att</filename>,
+ <filename>tab/explain.att</filename>, and
+ <filename>tab/gils.att</filename>.
</para>
- <para>
+ <para>
For example, some few &acro.bib1; use
attributes from the <filename>tab/bib1.att</filename> are:
<screen>
att 1036 Author-Title-Subject
</screen>
</para>
- <para>
- New attribute sets can be added by adding new
- <filename>tab/*.att</filename> configuration files, which need to
- be sourced in the main configuration <filename>zebra.cfg</filename>.
+ <para>
+ New attribute sets can be added by adding new
+ <filename>tab/*.att</filename> configuration files, which need to
+ be sourced in the main configuration <filename>zebra.cfg</filename>.
+ </para>
+ <para>
+ In addition, &zebra; allows the access of
+ <emphasis>internal index names</emphasis> and <emphasis>dynamic
+ XPath</emphasis> as use attributes; see
+ <xref linkend="querymodel-use-string"/> and
+ <xref linkend="querymodel-use-xpath"/>.
</para>
- <para>
- In addition, &zebra; allows the access of
- <emphasis>internal index names</emphasis> and <emphasis>dynamic
- XPath</emphasis> as use attributes; see
- <xref linkend="querymodel-use-string"/> and
- <xref linkend="querymodel-use-xpath"/>.
- </para>
- <para>
- Phrase search for <emphasis>information retrieval</emphasis> in
- the title-register, scanning the same register afterwards:
- <screen>
- Z> find @attr 1=4 "information retrieval"
- Z> scan @attr 1=4 information
- </screen>
- </para>
+ <para>
+ Phrase search for <emphasis>information retrieval</emphasis> in
+ the title-register, scanning the same register afterwards:
+ <screen>
+ Z> find @attr 1=4 "information retrieval"
+ Z> scan @attr 1=4 information
+ </screen>
+ </para>
</section>
</section>
<section id="querymodel-bib1-nonuse">
- <title>&zebra; general Bib1 Non-Use Attributes (type 2-6)</title>
+ <title>&zebra; general Bib1 Non-Use Attributes (type 2-6)</title>
<section id="querymodel-bib1-relation">
<title>Relation Attributes (type 2)</title>
-
+
<para>
Relation attributes describe the relationship of the access
- point (left side
+ point (left side
of the relation) to the search term as qualified by the attributes (right
side of the relation), e.g., Date-publication &lt;= 1975.
- </para>
+ </para>
<table id="querymodel-bib1-relation-table" frame="top">
<title>Relation Attributes (type 2)</title>
AlwaysMatches searches are only supported if alwaysmatches indexing
has been enabled; see <xref linkend="default-idx-file"/>.
</para>
- </note>
-
+ </note>
+
<para>
The relation attributes 1-5 are supported and work exactly as
expected.
- All ordering operations are based on a lexicographical ordering,
- <emphasis>except</emphasis> when the
+ All ordering operations are based on a lexicographical ordering,
+ <emphasis>except</emphasis> when the
structure attribute numeric (109) is used. In
- this case, ordering is numerical. See
+ this case, ordering is numerical. See
<xref linkend="querymodel-bib1-structure"/>.
<screen>
Z> find @attr 1=Title @attr 2=1 music
</para>
<para>
- The relation attribute
+ The relation attribute
<emphasis>Relevance (102)</emphasis> is supported, see
<xref linkend="administration-ranking"/> for full information.
</para>
-
+
<para>
Ranked search for <emphasis>information retrieval</emphasis> in
the title-register:
</para>
<para>
- The relation attribute
+ The relation attribute
<emphasis>AlwaysMatches (103)</emphasis> is in the default
configuration
- supported in conjunction with structure attribute
+ supported in conjecture with structure attribute
<emphasis>Phrase (1)</emphasis> (which may be omitted by
- default).
+ default).
It can be configured to work with other structure attributes,
- see the configuration file
- <filename>tab/default.idx</filename> and
- <xref linkend="querymodel-pqf-apt-mapping"/>.
+ see the configuration file
+ <filename>tab/default.idx</filename> and
+ <xref linkend="querymodel-pqf-apt-mapping"/>.
</para>
<para>
<emphasis>AlwaysMatches (103)</emphasis> is a
<section id="querymodel-bib1-position">
<title>Position Attributes (type 3)</title>
-
+
<para>
The position attribute specifies the location of the search term
within the field or subfield in which it appears.
<section id="querymodel-bib1-structure">
<title>Structure Attributes (type 4)</title>
-
+
<para>
The structure attribute specifies the type of search
term. This causes the search to be mapped on
different &zebra; internal indexes, which must have been defined
- at index time.
+ at index time.
</para>
- <para>
- The possible values of the
+ <para>
+ The possible values of the
<literal>structure attribute (type 4)</literal> can be defined
using the configuration file <filename>tab/default.idx</filename>.
The default configuration is summarized in this table.
</tbody>
</tgroup>
</table>
- <para>
- The structure attribute values
- <literal>Word list (6)</literal>
- is supported, and maps to the boolean <literal>AND</literal>
- combination of words supplied. The word list is useful when
- Google-like bag-of-word queries need to be translated from a GUI
- query language to &acro.pqf;. For example, the following queries
- are equivalent:
- <screen>
- Z> find @attr 1=Title @attr 4=6 "mozart amadeus"
- Z> find @attr 1=Title @and mozart amadeus
- </screen>
- </para>
+ <para>
+ The structure attribute value
+ <literal>Word list (6)</literal>
+ is supported, and maps to the boolean <literal>AND</literal>
+ combination of words supplied. The word list is useful when
+ Google-like bag-of-word queries need to be translated from a GUI
+ query language to &acro.pqf;. For example, the following queries
+ are equivalent:
+ <screen>
+ Z> find @attr 1=Title @attr 4=6 "mozart amadeus"
+ Z> find @attr 1=Title @and mozart amadeus
+ </screen>
+ </para>
- <para>
- The structure attribute value
- <literal>Free-form-text (105)</literal> and
- <literal>Document-text (106)</literal>
- are supported, and map both to the boolean <literal>OR</literal>
- combination of words supplied. The following queries
- are equivalent:
- <screen>
- Z> find @attr 1=Body-of-text @attr 4=105 "bach salieri teleman"
- Z> find @attr 1=Body-of-text @attr 4=106 "bach salieri teleman"
- Z> find @attr 1=Body-of-text @or bach @or salieri teleman
- </screen>
- This <literal>OR</literal> list of terms is very useful in
- combination with relevance ranking:
- <screen>
- Z> find @attr 1=Body-of-text @attr 2=102 @attr 4=105 "bach salieri teleman"
- </screen>
- </para>
+ <para>
+ The structure attribute values
+ <literal>Free-form-text (105)</literal> and
+ <literal>Document-text (106)</literal>
+ are supported, and both map to the boolean <literal>OR</literal>
+ combination of words supplied. The following queries
+ are equivalent:
+ <screen>
+ Z> find @attr 1=Body-of-text @attr 4=105 "bach salieri teleman"
+ Z> find @attr 1=Body-of-text @attr 4=106 "bach salieri teleman"
+ Z> find @attr 1=Body-of-text @or bach @or salieri teleman
+ </screen>
+ This <literal>OR</literal> list of terms is very useful in
+ combination with relevance ranking:
+ <screen>
+ Z> find @attr 1=Body-of-text @attr 2=102 @attr 4=105 "bach salieri teleman"
+ </screen>
+ </para>
- <para>
- The structure attribute value
- <literal>Local number (107)</literal>
- is supported, and maps always to the &zebra; internal document ID,
- irrespectively which use attribute is specified. The following queries
- have exactly the same unique record in the hit set:
- <screen>
- Z> find @attr 4=107 10
- Z> find @attr 1=4 @attr 4=107 10
- Z> find @attr 1=1010 @attr 4=107 10
- </screen>
- </para>
+ <para>
+ The structure attribute value
+ <literal>Local number (107)</literal>
+ is supported, and always maps to the &zebra; internal document ID,
+ irrespective of which use attribute is specified. The following queries
+ have exactly the same unique record in the hit set:
+ <screen>
+ Z> find @attr 4=107 10
+ Z> find @attr 1=4 @attr 4=107 10
+ Z> find @attr 1=1010 @attr 4=107 10
+ </screen>
+ </para>
- <para>
- In
- the GILS schema (<literal>gils.abs</literal>), the
- west-bounding-coordinate is indexed as type <literal>n</literal>,
- and is therefore searched by specifying
- <emphasis>structure</emphasis>=<emphasis>Numeric String</emphasis>.
- To match all those records with west-bounding-coordinate greater
- than -114 we use the following query:
- <screen>
- Z> find @attr 4=109 @attr 2=5 @attr gils 1=2038 -114
- </screen>
+ <para>
+ In
+ the GILS schema (<literal>gils.abs</literal>), the
+ west-bounding-coordinate is indexed as type <literal>n</literal>,
+ and is therefore searched by specifying
+ <emphasis>structure</emphasis>=<emphasis>Numeric String</emphasis>.
+ To match all those records with west-bounding-coordinate greater
+ than -114 we use the following query:
+ <screen>
+ Z> find @attr 4=109 @attr 2=5 @attr gils 1=2038 -114
+ </screen>
</para>
<note>
<para>
The exact mapping between &acro.pqf; queries and &zebra; internal indexes
- and index types is explained in
+ and index types is explained in
<xref linkend="querymodel-pqf-apt-mapping"/>.
</para>
</note>
</section>
-
+
<section id="querymodel-bib1-truncation">
<title>Truncation Attributes (type = 5)</title>
...
Number of hits: 95, setno 8
</screen>
- </para>
+ </para>
<para>
- The truncation attribute value
+ The truncation attribute value
<literal>Process # in search term (101)</literal> is a
poor-man's regular expression search. It maps
each <literal>#</literal> to <literal>.*</literal>, and
Number of hits: 89, setno 10
</screen>
</para>
-
+
<para>
- The truncation attribute value
- <literal>Regexp-1 (102)</literal> is a normal regular search,
+ The truncation attribute value
+ <literal>Regexp-1 (102)</literal> is a normal regular search,
see <xref linkend="querymodel-regular"/> for details.
<screen>
Z> find @attr 1=Body-of-text @attr 5=102 schnit+ke
</para>
<para>
- The truncation attribute value
+ The truncation attribute value
<literal>Regexp-2 (103) </literal> is a &zebra; specific extension
which allows <emphasis>fuzzy</emphasis> matches. One single
error in spelling of search terms is allowed, i.e., a document
is hit if it includes a term which can be mapped to the used
search term by one character substitution, addition, deletion or
- change of position.
+ change of position.
<screen>
Z> find @attr 1=Body-of-text @attr 5=100 schnittke
...
Number of hits: 103, setno 15
...
</screen>
- </para>
+ </para>
</section>
-
+
<section id="querymodel-bib1-completeness">
- <title>Completeness Attributes (type = 6)</title>
+ <title>Completeness Attributes (type = 6)</title>
<para>
The <literal>Completeness Attributes (type = 6)</literal>
- is used to specify that a given search term or term list is either
- part of the terms of a given index/field
+ are used to specify that a given search term or term list is either
+ part of the terms of a given index/field
(<literal>Incomplete subfield (1)</literal>), or is
what literally is found in the entire field's index
(<literal>Complete field (3)</literal>).
- </para>
+ </para>
<table id="querymodel-bib1-completeness-table" frame="top">
<title>Completeness Attributes (type = 6)</title>
The <literal>Completeness Attributes (type = 6)</literal>
are only partially and conditionally
supported, in the sense that they are ignored if the hit index is
- not of structure <literal>type="w"</literal> or
+ not of structure <literal>type="w"</literal> or
<literal>type="p"</literal>.
- </para>
+ </para>
<para>
<literal>Incomplete subfield (1)</literal> is the default, and
- makes &zebra; use
- register <literal>type="w"</literal>, whereas
+ makes &zebra; use
+ register <literal>type="w"</literal>, whereas
<literal>Complete field (3)</literal> triggers
search and scan in index <literal>type="p"</literal>.
</para>
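<para>
For example (an illustrative sketch; the hit counts depend on the
indexed data), the first query finds the single word
<emphasis>beethoven</emphasis> anywhere within the title field,
while the second matches only records whose complete title field
literally equals the quoted phrase:
<screen>
Z> find @attr 1=4 @attr 6=1 beethoven
Z> find @attr 1=4 @attr 6=3 "beethoven bibliography"
</screen>
</para>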
<note>
<para>
The exact mapping between &acro.pqf; queries and &zebra; internal indexes
- and index types is explained in
+ and index types is explained in
<xref linkend="querymodel-pqf-apt-mapping"/>.
</para>
</note>
</section>
- </section>
+ </section>
<section id="querymodel-zebra">
<title>Extended &zebra; &acro.rpn; Features</title>
and <emphasis>non-portable</emphasis>: most functional extensions
are modeled over the <literal>bib-1</literal> attribute set,
defining type 7 and higher values.
- There are also the special
+ There are also the special
<literal>string</literal> type index names for the
- <literal>idxpath</literal> attribute set.
+ <literal>idxpath</literal> attribute set.
</para>
-
+
<section id="querymodel-zebra-attr-allrecords">
<title>&zebra; specific retrieval of all records</title>
<para>
&zebra; defines a hardwired <literal>string</literal> index name
called <literal>_ALLRECORDS</literal>. It matches any record
- contained in the database, if used in conjunction with
- the relation attribute
+ contained in the database, if used in conjunction with
+ the relation attribute
<literal>AlwaysMatches (103)</literal>.
- </para>
+ </para>
<para>
The <literal>_ALLRECORDS</literal> index name is used for total database
export. The search term is ignored, it may be empty.
<title>&zebra; specific Search Extensions to all Attribute Sets</title>
<para>
&zebra; extends the &acro.bib1; attribute types, and these extensions are
- recognized regardless of attribute
+ recognized regardless of the attribute
set used in a <literal>search</literal> operation query.
</para>
-
+
<table id="querymodel-zebra-attr-search-table" frame="top">
<title>&zebra; Search Attribute Extensions</title>
<tgroup cols="4">
<thead>
<row>
- <entry>Name</entry>
+ <entry>Name</entry>
<entry>Value</entry>
<entry>Operation</entry>
<entry>&zebra; version</entry>
</row>
</tbody>
</tgroup>
- </table>
-
+ </table>
+
<section id="querymodel-zebra-attr-sorting">
<title>&zebra; Extension Embedded Sort Attribute (type 7)</title>
<para>
The embedded sort is a way to specify a sort order within a query,
thus removing the need to send a Sort Request separately. It is
faster and does not require clients to deal with the Sort
- Facility.
+ Facility.
</para>
-
+
<para>
- All ordering operations are based on a lexicographical ordering,
- <emphasis>except</emphasis> when the
+ All ordering operations are based on a lexicographical ordering,
+ <emphasis>except</emphasis> when the
<literal>structure attribute numeric (109)</literal> is used. In
- this case, ordering is numerical. See
+ this case, ordering is numerical. See
<xref linkend="querymodel-bib1-structure"/>.
</para>
-
+
<para>
The possible values after attribute <literal>type 7</literal> are
- <literal>1</literal> ascending and
- <literal>2</literal> descending.
+ <literal>1</literal> ascending and
+ <literal>2</literal> descending.
The attributes+term (&acro.apt;) node is separate from the
- rest and must be <literal>@or</literal>'ed.
+ rest and must be <literal>@or</literal>'ed.
The term associated with &acro.apt; is the sorting level in integers,
- where <literal>0</literal> means primary sort,
+ where <literal>0</literal> means primary sort,
<literal>1</literal> means secondary sort, and so forth.
See also <xref linkend="administration-ranking"/>.
</para>
<para>
- For example, searching for water, sort by title (ascending)
+ For example, searching for water, sort by title (ascending)
<screen>
Z> find @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
</screen>
</para>
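+ <para>
+ A hypothetical two-level sort (the access points are
+ illustrative): title (&acro.bib1; use attribute 4) as primary
+ sort key (level 0), and author (use attribute 1003) as
+ secondary sort key (level 1):
+ <screen>
+ Z> find @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=1 @attr 1=1003 1
+ </screen>
+ </para>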
</section>
- <!--
+ <!--
&zebra; Extension Term Set Attribute
From the manual text, I can not see what is the point with this feature.
I think it makes more sense when there are multiple terms in a query, or
something...
-
+
We decided 2006-06-03 to disable this feature, as it is covered by
scan within a resultset. Better use resources to upgrade this
feature for good performance.
-->
- <!--
+ <!--
<section id="querymodel-zebra-attr-estimation">
- <title>&zebra; Extension Term Set Attribute (type 8)</title>
+ <title>&zebra; Extension Term Set Attribute (type 8)</title>
<para>
- The Term Set feature is a facility that allows a search to store
- hitting terms in a "pseudo" resultset; thus a search (as usual) +
- a scan-like facility. Requires a client that can do named result
- sets since the search generates two result sets. The value for
- attribute 8 is the name of a result set (string). The terms in
- the named term set are returned as &acro.sutrs; records.
- </para>
+ The Term Set feature is a facility that allows a search to store
+ hitting terms in a "pseudo" resultset; thus a search (as usual) +
+ a scan-like facility. Requires a client that can do named result
+ sets since the search generates two result sets. The value for
+ attribute 8 is the name of a result set (string). The terms in
+ the named term set are returned as &acro.sutrs; records.
+ </para>
<para>
- For example, searching for u in title, right truncated, and
- storing the result in term set named 'aset'
- <screen>
- Z> find @attr 5=1 @attr 1=4 @attr 8=aset u
- </screen>
- </para>
+ For example, searching for u in title, right truncated, and
+ storing the result in term set named 'aset'
+ <screen>
+ Z> find @attr 5=1 @attr 1=4 @attr 8=aset u
+ </screen>
+ </para>
<warning>
- The model has one serious flaw: we don't know the size of term
- set. Experimental. Do not use in production code.
- </warning>
- </section>
+ The model has one serious flaw: we don't know the size of term
+ set. Experimental. Do not use in production code.
+ </warning>
+ </section>
-->
<title>&zebra; Extension Rank Weight Attribute (type 9)</title>
<para>
Rank weight is a way to pass a value to a ranking algorithm - so
- that one &acro.apt; has one value - while another as a different one.
+ that one &acro.apt; has one value - while another has a different one.
See also <xref linkend="administration-ranking"/>.
</para>
<para>
For example, searching for utah in title with weight 30 as well
- as any with weight 20:
- <screen>
+ as any with weight 20:
+ <screen>
Z> find @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 utah
</screen>
</para>
<section id="querymodel-zebra-attr-termref">
<title>&zebra; Extension Term Reference Attribute (type 10)</title>
<para>
- &zebra; supports the searchResult-1 facility.
+ &zebra; supports the searchResult-1 facility.
If the Term Reference Attribute (type 10) is
given, that specifies a subqueryId value returned as part of the
search result. It is a way for a client to name an &acro.apt; part of a
- query.
+ query.
</para>
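+ <para>
+ For example, a hypothetical query in which the subqueryId
+ values <literal>t1</literal> and <literal>t2</literal> are
+ arbitrary client-chosen names to be echoed back in the search
+ response:
+ <screen>
+ Z> find @and @attr 10=t1 @attr 1=4 water @attr 10=t2 @attr 1=1016 soil
+ </screen>
+ </para>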
<warning>
<para>
Experimental. Do not use in production code.
- </para>
+ </para>
</warning>
-
+
</section>
-
-
-
+
+
+
<section id="querymodel-zebra-local-attr-limit">
<title>Local Approximative Limit Attribute (type 11)</title>
<para>
</para>
<para>
For example, we might be interested in exact hit count for a, but
- for b we allow hit count estimates for 1000 and higher.
+ for b we allow hit count estimates for 1000 and higher.
<screen>
Z> find @and a @attr 11=1000 b
</screen>
The estimated hit count facility makes searches faster, as one
only needs to process large hit lists partially.
It is mostly used in huge databases, where you want to trade
- exactness of hit counts against speed of execution.
+ exactness of hit counts against speed of execution.
</para>
</note>
<warning>
Do not use approximative hit count limits
in conjunction with relevance ranking, as re-sorting of the
result set only works when the entire result set has
- been processed.
+ been processed.
</para>
</warning>
</section>
<para>
By default &zebra; computes precise hit counts for a query as
a whole. Setting attribute 12 makes it perform approximative
- hit counts instead. It has the same semantics as
+ hit counts instead. It has the same semantics as
<literal>estimatehits</literal> for the <xref linkend="zebra-cfg"/>.
</para>
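+ <para>
+ A hypothetical example, assuming (as for the local variant)
+ that the attribute value is the limit above which hit count
+ estimates are acceptable:
+ <screen>
+ Z> find @attr 12=1000 water
+ </screen>
+ </para>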
<para>
Do not use approximative hit count limits
in conjunction with relevance ranking, as re-sorting of the
result set only works when the entire result set has
- been processed.
+ been processed.
</para>
</warning>
</section>
<title>&zebra; specific Scan Extensions to all Attribute Sets</title>
<para>
&zebra; extends the &acro.bib1; attribute types, and these extensions are
- recognized regardless of attribute
+ recognized regardless of attribute
set used in a scan operation query.
</para>
<table id="querymodel-zebra-attr-scan-table" frame="top">
</row>
</tbody>
</tgroup>
- </table>
-
+ </table>
+
<section id="querymodel-zebra-attr-narrow">
<title>&zebra; Extension Result Set Narrow (type 8)</title>
<para>
If attribute Result Set Narrow (type 8)
is given for scan, the value is the name of a
- result set. Each hit count in scan is
- <literal>@and</literal>'ed with the result set given.
+ result set. Each hit count in scan is
+ <literal>@and</literal>'ed with the result set given.
</para>
<para>
- Consider for example
+ Consider for example
the case of scanning all title fields around the
scanterm <emphasis>mozart</emphasis>, then refining the scan by
issuing a filtering query for <emphasis>amadeus</emphasis> to
- restrict the scan to the result set of the query:
+ restrict the scan to the result set of the query:
<screen>
- Z> scan @attr 1=4 mozart
- ...
- * mozart (43)
- mozartforskningen (1)
- mozartiana (1)
- mozarts (16)
- ...
- Z> f @attr 1=4 amadeus
- ...
- Number of hits: 15, setno 2
- ...
- Z> scan @attr 1=4 @attr 8=2 mozart
- ...
- * mozart (14)
- mozartforskningen (0)
- mozartiana (0)
- mozarts (1)
- ...
+ Z> scan @attr 1=4 mozart
+ ...
+ * mozart (43)
+ mozartforskningen (1)
+ mozartiana (1)
+ mozarts (16)
+ ...
+ Z> f @attr 1=4 amadeus
+ ...
+ Number of hits: 15, setno 2
+ ...
+ Z> scan @attr 1=4 @attr 8=2 mozart
+ ...
+ * mozart (14)
+ mozartforskningen (0)
+ mozartiana (0)
+ mozarts (1)
+ ...
</screen>
</para>
-
+
<para>
&zebra; 2.0.2 and later is able to skip 0 hit counts. This, however,
is known not to scale if the number of terms to skip is high.
<para>
The &zebra; Extension Approximative Limit (type 12) is a way to
enable approximate hit counts for scan hit counts, in the same
- way as for search hit counts.
+ way as for search hit counts.
</para>
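+ <para>
+ A hypothetical example, enabling estimated hit counts in scan:
+ <screen>
+ Z> scan @attr 12=1000 @attr 1=4 water
+ </screen>
+ </para>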
</section>
</section>
-
+
<section id="querymodel-idxpath">
<title>&zebra; special &acro.idxpath; Attribute Set for &acro.grs1; indexing</title>
<para>
- The attribute-set <literal>idxpath</literal> consists of a single
- Use (type 1) attribute. All non-use attributes behave as normal.
+ The attribute-set <literal>idxpath</literal> consists of a single
+ Use (type 1) attribute. All non-use attributes behave as normal.
</para>
<para>
This feature is enabled when defining the
</warning>
<section id="querymodel-idxpath-use">
- <title>&acro.idxpath; Use Attributes (type = 1)</title>
+ <title>&acro.idxpath; Use Attributes (type = 1)</title>
<para>
This attribute set allows one to search &acro.grs1; filter indexed
- records by &acro.xpath; like structured index names.
+ records by &acro.xpath;-like structured index names.
</para>
<warning>
index names, which might clash with your own index names.
</para>
</warning>
-
+
<table id="querymodel-idxpath-use-table" frame="top">
<title>&zebra; specific &acro.idxpath; Use Attributes (type 1)</title>
<tgroup cols="4">
See <filename>tab/idxpath.att</filename> for more information.
</para>
<para>
- Search for all documents starting with root element
+ Search for all documents starting with root element
<literal>/root</literal> (either using the numeric or the string
use attributes):
<screen>
- Z> find @attrset idxpath @attr 1=1 @attr 4=3 root/
- Z> find @attr idxpath 1=1 @attr 4=3 root/
- Z> find @attr 1=_XPATH_BEGIN @attr 4=3 root/
+ Z> find @attrset idxpath @attr 1=1 @attr 4=3 root/
+ Z> find @attr idxpath 1=1 @attr 4=3 root/
+ Z> find @attr 1=_XPATH_BEGIN @attr 4=3 root/
</screen>
</para>
<para>
- Search for all documents where specific nested &acro.xpath;
+ Search for all documents where a specific nested &acro.xpath; path
<literal>/c1/c2/../cn</literal> exists. Notice the very
counter-intuitive <emphasis>reverse</emphasis> notation!
<screen>
- Z> find @attrset idxpath @attr 1=1 @attr 4=3 cn/cn-1/../c1/
- Z> find @attr 1=_XPATH_BEGIN @attr 4=3 cn/cn-1/../c1/
+ Z> find @attrset idxpath @attr 1=1 @attr 4=3 cn/cn-1/../c1/
+ Z> find @attr 1=_XPATH_BEGIN @attr 4=3 cn/cn-1/../c1/
</screen>
</para>
<para>
</screen>
</para>
<para>
- Search for CDATA string <emphasis>anothertext</emphasis> in any
- attribute:
- <screen>
+ Search for CDATA string <emphasis>anothertext</emphasis> in any
+ attribute:
+ <screen>
Z> find @attrset idxpath @attr 1=1015 anothertext
Z> find @attr 1=_XPATH_ATTR_CDATA anothertext
</screen>
</para>
<para>
- Search for all documents with have an &acro.xml; element node
- including an &acro.xml; attribute named <emphasis>creator</emphasis>
- <screen>
- Z> find @attrset idxpath @attr 1=3 @attr 4=3 creator
- Z> find @attr 1=_XPATH_ATTR_NAME @attr 4=3 creator
+ Search for all documents which have an &acro.xml; element node
+ including an &acro.xml; attribute named <emphasis>creator</emphasis>:
+ <screen>
+ Z> find @attrset idxpath @attr 1=3 @attr 4=3 creator
+ Z> find @attr 1=_XPATH_ATTR_NAME @attr 4=3 creator
</screen>
</para>
<para>
<para>
Scanning is supported on all <literal>idxpath</literal>
indexes, both specified as numeric use attributes, or as string
- index names.
+ index names.
<screen>
Z> scan @attrset idxpath @attr 1=1016 text
Z> scan @attr 1=_XPATH_ATTR_CDATA anothertext
<section id="querymodel-pqf-apt-mapping">
- <title>Mapping from &acro.pqf; atomic &acro.apt; queries to &zebra; internal
+ <title>Mapping from &acro.pqf; atomic &acro.apt; queries to &zebra; internal
register indexes</title>
<para>
The rules for &acro.pqf; &acro.apt; mapping are rather tricky to grasp in the
internal register or string index to use, according to the use
attribute or access point specified in the query. Thereafter we
deal with the rules for determining the correct structure type of
- the named register.
+ the named register.
</para>
- <section id="querymodel-pqf-apt-mapping-accesspoint">
- <title>Mapping of &acro.pqf; &acro.apt; access points</title>
- <para>
+ <section id="querymodel-pqf-apt-mapping-accesspoint">
+ <title>Mapping of &acro.pqf; &acro.apt; access points</title>
+ <para>
&zebra; understands four fundamental different types of access
- points, of which only the
+ points, of which only the
<emphasis>numeric use attribute</emphasis> type access points
are defined by the <ulink url="&url.z39.50;">&acro.z3950;</ulink>
standard.
All other access point types are &zebra; specific, and non-portable.
- </para>
+ </para>
<table id="querymodel-zebra-mapping-accesspoint-types" frame="top">
<title>Access point name mapping</title>
<entry>Grammar</entry>
<entry>Notes</entry>
</row>
- </thead>
- <tbody>
- <row>
- <entry>Use attribute</entry>
- <entry>numeric</entry>
- <entry>[1-9][1-9]*</entry>
- <entry>directly mapped to string index name</entry>
- </row>
- <row>
- <entry>String index name</entry>
- <entry>string</entry>
- <entry>[a-zA-Z](\-?[a-zA-Z0-9])*</entry>
- <entry>normalized name is used as internal string index name</entry>
- </row>
- <row>
- <entry>&zebra; internal index name</entry>
- <entry>zebra</entry>
- <entry>_[a-zA-Z](_?[a-zA-Z0-9])*</entry>
- <entry>hardwired internal string index name</entry>
- </row>
- <row>
- <entry>&acro.xpath; special index</entry>
- <entry>XPath</entry>
- <entry>/.*</entry>
- <entry>special xpath search for &acro.grs1; indexed records</entry>
- </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>Use attribute</entry>
+ <entry>numeric</entry>
+ <entry>[1-9][1-9]*</entry>
+ <entry>directly mapped to string index name</entry>
+ </row>
+ <row>
+ <entry>String index name</entry>
+ <entry>string</entry>
+ <entry>[a-zA-Z](\-?[a-zA-Z0-9])*</entry>
+ <entry>normalized name is used as internal string index name</entry>
+ </row>
+ <row>
+ <entry>&zebra; internal index name</entry>
+ <entry>zebra</entry>
+ <entry>_[a-zA-Z](_?[a-zA-Z0-9])*</entry>
+ <entry>hardwired internal string index name</entry>
+ </row>
+ <row>
+ <entry>&acro.xpath; special index</entry>
+ <entry>XPath</entry>
+ <entry>/.*</entry>
+ <entry>special xpath search for &acro.grs1; indexed records</entry>
+ </row>
</tbody>
</tgroup>
</table>
-
+
<para>
- <literal>Attribute set names</literal> and
+ <literal>Attribute set names</literal> and
<literal>string index names</literal> are normalized
according to the following rules: all <emphasis>single</emphasis>
hyphens <literal>'-'</literal> are stripped, and all upper case
letters are folded to lower case.
</para>
-
+
<para>
- <emphasis>Numeric use attributes</emphasis> are mapped
+ <emphasis>Numeric use attributes</emphasis> are mapped
to the &zebra; internal
string index according to the attribute set definition in use.
The default attribute set is &acro.bib1;, and may be
omitted in the &acro.pqf; query.
</para>
-
+
<para>
According to normalization and numeric
use attribute mapping, the following
&acro.pqf; queries are considered equivalent (assuming the default
configuration has not been altered):
<screen>
- Z> find @attr 1=Body-of-text serenade
- Z> find @attr 1=bodyoftext serenade
- Z> find @attr 1=BodyOfText serenade
- Z> find @attr 1=bO-d-Y-of-tE-x-t serenade
- Z> find @attr 1=1010 serenade
- Z> find @attrset bib1 @attr 1=1010 serenade
- Z> find @attrset bib1 @attr 1=1010 serenade
- Z> find @attrset Bib1 @attr 1=1010 serenade
- Z> find @attrset b-I-b-1 @attr 1=1010 serenade
- </screen>
- </para>
+ Z> find @attr 1=Body-of-text serenade
+ Z> find @attr 1=bodyoftext serenade
+ Z> find @attr 1=BodyOfText serenade
+ Z> find @attr 1=bO-d-Y-of-tE-x-t serenade
+ Z> find @attr 1=1010 serenade
+ Z> find @attrset bib1 @attr 1=1010 serenade
+ Z> find @attrset Bib1 @attr 1=1010 serenade
+ Z> find @attrset b-I-b-1 @attr 1=1010 serenade
+ </screen>
+ </para>
- <para>
+ <para>
The <emphasis>numerical</emphasis>
- <literal>use attributes (type 1)</literal>
+ <literal>use attributes (type 1)</literal>
are interpreted according to the
attribute sets which have been loaded in the
<literal>zebra.cfg</literal> file, and are matched against specific
fields as specified in the <literal>.abs</literal> file which
describes the profile of the records which have been loaded.
- If no use attribute is provided, a default of
+ If no use attribute is provided, a default of
&acro.bib1; Use Any (1016) is assumed.
The predefined use attribute sets
can be reconfigured by tweaking the configuration files
- <filename>tab/*.att</filename>, and
+ <filename>tab/*.att</filename>, and
new attribute sets can be defined by adding similar files in the
- configuration path <literal>profilePath</literal> of the server.
- </para>
+ configuration path <literal>profilePath</literal> of the server.
+ </para>
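+ <para>
+ For example, a minimal <literal>zebra.cfg</literal> fragment
+ loading two attribute sets (the path shown is illustrative and
+ depends on the installation):
+ <screen>
+ profilePath: .:/usr/local/share/idzebra/tab
+ attset: bib1.att
+ attset: gils.att
+ </screen>
+ </para>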
<para>
String indexes can be accessed directly,
<para>
&zebra; internal indexes can be accessed directly,
according to the same rules as the user defined
- string indexes. The only difference is that
+ string indexes. The only difference is that
&zebra; internal index names are hardwired,
all uppercase and
- must start with the character <literal>'_'</literal>.
+ must start with the character <literal>'_'</literal>.
</para>
<para>
available using the &acro.grs1; filter for indexing.
These access point names must start with the character
<literal>'/'</literal>; they are <emphasis>not
- normalized</emphasis>, but passed unaltered to the &zebra; internal
+ normalized</emphasis>, but passed unaltered to the &zebra; internal
&acro.xpath; engine. See <xref linkend="querymodel-use-xpath"/>.
</para>
</section>
- <section id="querymodel-pqf-apt-mapping-structuretype">
- <title>Mapping of &acro.pqf; &acro.apt; structure and completeness to
+ <section id="querymodel-pqf-apt-mapping-structuretype">
+ <title>Mapping of &acro.pqf; &acro.apt; structure and completeness to
register type</title>
- <para>
+ <para>
Internally &zebra; has in its default configuration several
- different types of registers or indexes, whose tokenization and
- character normalization rules differ. This reflects the fact that
+ different types of registers or indexes, whose tokenization and
+ character normalization rules differ. This reflects the fact that
searching fundamentally different tokens like dates, numbers,
- bitfields and string based text needs different rule sets.
+ bitfields, and string-based text requires different rule sets.
</para>
<table id="querymodel-zebra-mapping-structure-types" frame="top">
<tbody>
<row>
<entry>
- phrase (@attr 4=1), word (@attr 4=2),
+ phrase (@attr 4=1), word (@attr 4=2),
word-list (@attr 4=6),
free-form-text (@attr 4=105), or document-text (@attr 4=106)
</entry>
</row>
<row>
<entry>
- phrase (@attr 4=1), word (@attr 4=2),
+ phrase (@attr 4=1), word (@attr 4=2),
word-list (@attr 4=6),
free-form-text (@attr 4=105), or document-text (@attr 4=106)
</entry>
<entry>overruled</entry>
<entry>overruled</entry>
<entry>special</entry>
- <entry>Internal record ID register, used whenever
+ <entry>Internal record ID register, used whenever
Relation Always Matches (@attr 2=103) is specified</entry>
</row>
</tbody>
</tgroup>
</table>
-
+
<!-- see in util/zebramap.c -->
-
- <para>
- If a <emphasis>Structure</emphasis> attribute of
- <emphasis>Phrase</emphasis> is used in conjunction with a
- <emphasis>Completeness</emphasis> attribute of
- <emphasis>Complete (Sub)field</emphasis>, the term is matched
- against the contents of the phrase (long word) register, if one
- exists for the given <emphasis>Use</emphasis> attribute.
- A phrase register is created for those fields in the
- &acro.grs1; <filename>*.abs</filename> file that contains a
- <literal>p</literal>-specifier.
+
+ <para>
+ If a <emphasis>Structure</emphasis> attribute of
+ <emphasis>Phrase</emphasis> is used in conjunction with a
+ <emphasis>Completeness</emphasis> attribute of
+ <emphasis>Complete (Sub)field</emphasis>, the term is matched
+ against the contents of the phrase (long word) register, if one
+ exists for the given <emphasis>Use</emphasis> attribute.
+ A phrase register is created for those fields in the
+ &acro.grs1; <filename>*.abs</filename> file that contains a
+ <literal>p</literal>-specifier.
<screen>
- Z> scan @attr 1=Title @attr 4=1 @attr 6=3 beethoven
+ Z> scan @attr 1=Title @attr 4=1 @attr 6=3 beethoven
...
bayreuther festspiele (1)
* beethoven bibliography database (1)
benny carter (1)
...
- Z> find @attr 1=Title @attr 4=1 @attr 6=3 "beethoven bibliography"
+ Z> find @attr 1=Title @attr 4=1 @attr 6=3 "beethoven bibliography"
...
Number of hits: 0, setno 5
...
- Z> find @attr 1=Title @attr 4=1 @attr 6=3 "beethoven bibliography database"
+ Z> find @attr 1=Title @attr 4=1 @attr 6=3 "beethoven bibliography database"
...
Number of hits: 1, setno 6
- </screen>
- </para>
+ </screen>
+ </para>
- <para>
- If <emphasis>Structure</emphasis>=<emphasis>Phrase</emphasis> is
- used in conjunction with <emphasis>Incomplete Field</emphasis> - the
- default value for <emphasis>Completeness</emphasis>, the
- search is directed against the normal word registers, but if the term
- contains multiple words, the term will only match if all of the words
- are found immediately adjacent, and in the given order.
- The word search is performed on those fields that are indexed as
- type <literal>w</literal> in the &acro.grs1; <filename>*.abs</filename> file.
+ <para>
+ If <emphasis>Structure</emphasis>=<emphasis>Phrase</emphasis> is
+ used in conjunction with <emphasis>Incomplete Field</emphasis> - the
+ default value for <emphasis>Completeness</emphasis>, the
+ search is directed against the normal word registers, but if the term
+ contains multiple words, the term will only match if all of the words
+ are found immediately adjacent, and in the given order.
+ The word search is performed on those fields that are indexed as
+ type <literal>w</literal> in the &acro.grs1; <filename>*.abs</filename> file.
<screen>
- Z> scan @attr 1=Title @attr 4=1 @attr 6=1 beethoven
+ Z> scan @attr 1=Title @attr 4=1 @attr 6=1 beethoven
...
- beefheart (1)
+ beefheart (1)
* beethoven (18)
- beethovens (7)
+ beethovens (7)
...
- Z> find @attr 1=Title @attr 4=1 @attr 6=1 beethoven
+ Z> find @attr 1=Title @attr 4=1 @attr 6=1 beethoven
...
Number of hits: 18, setno 1
...
...
Number of hits: 2, setno 2
...
- </screen>
- </para>
+ </screen>
+ </para>
- <para>
- If the <emphasis>Structure</emphasis> attribute is
- <emphasis>Word List</emphasis>,
- <emphasis>Free-form Text</emphasis>, or
- <emphasis>Document Text</emphasis>, the term is treated as a
- natural-language, relevance-ranked query.
- This search type uses the word register, i.e. those fields
- that are indexed as type <literal>w</literal> in the
- &acro.grs1; <filename>*.abs</filename> file.
- </para>
+ <para>
+ If the <emphasis>Structure</emphasis> attribute is
+ <emphasis>Word List</emphasis>,
+ <emphasis>Free-form Text</emphasis>, or
+ <emphasis>Document Text</emphasis>, the term is treated as a
+ natural-language, relevance-ranked query.
+ This search type uses the word register, i.e. those fields
+ that are indexed as type <literal>w</literal> in the
+ &acro.grs1; <filename>*.abs</filename> file.
+ </para>
- <para>
- If the <emphasis>Structure</emphasis> attribute is
- <emphasis>Numeric String</emphasis> the term is treated as an integer.
- The search is performed on those fields that are indexed
- as type <literal>n</literal> in the &acro.grs1;
+ <para>
+ If the <emphasis>Structure</emphasis> attribute is
+ <emphasis>Numeric String</emphasis>, the term is treated as an integer.
+ The search is performed on those fields that are indexed
+ as type <literal>n</literal> in the &acro.grs1;
<filename>*.abs</filename> file.
- </para>
+ </para>
- <para>
- If the <emphasis>Structure</emphasis> attribute is
- <emphasis>URX</emphasis> the term is treated as a URX (URL) entity.
- The search is performed on those fields that are indexed as type
- <literal>u</literal> in the <filename>*.abs</filename> file.
- </para>
+ <para>
+ If the <emphasis>Structure</emphasis> attribute is
+ <emphasis>URX</emphasis>, the term is treated as a URX (URL) entity.
+ The search is performed on those fields that are indexed as type
+ <literal>u</literal> in the <filename>*.abs</filename> file.
+ </para>
- <para>
- If the <emphasis>Structure</emphasis> attribute is
- <emphasis>Local Number</emphasis> the term is treated as
- native &zebra; Record Identifier.
- </para>
+ <para>
+ If the <emphasis>Structure</emphasis> attribute is
+ <emphasis>Local Number</emphasis>, the term is treated as a
+ native &zebra; Record Identifier.
+ </para>
- <para>
- If the <emphasis>Relation</emphasis> attribute is
- <emphasis>Equals</emphasis> (default), the term is matched
- in a normal fashion (modulo truncation and processing of
- individual words, if required).
- If <emphasis>Relation</emphasis> is <emphasis>Less Than</emphasis>,
- <emphasis>Less Than or Equal</emphasis>,
- <emphasis>Greater than</emphasis>, or <emphasis>Greater than or
- Equal</emphasis>, the term is assumed to be numerical, and a
- standard regular expression is constructed to match the given
- expression.
- If <emphasis>Relation</emphasis> is <emphasis>Relevance</emphasis>,
- the standard natural-language query processor is invoked.
- </para>
+ <para>
+ If the <emphasis>Relation</emphasis> attribute is
+ <emphasis>Equals</emphasis> (default), the term is matched
+ in a normal fashion (modulo truncation and processing of
+ individual words, if required).
+ If <emphasis>Relation</emphasis> is <emphasis>Less Than</emphasis>,
+ <emphasis>Less Than or Equal</emphasis>,
+ <emphasis>Greater than</emphasis>, or <emphasis>Greater than or
+ Equal</emphasis>, the term is assumed to be numerical, and a
+ standard regular expression is constructed to match the given
+ expression.
+ If <emphasis>Relation</emphasis> is <emphasis>Relevance</emphasis>,
+ the standard natural-language query processor is invoked.
+ </para>
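+ <para>
+ For example, a hypothetical range query combining
+ <emphasis>Less Than or Equal (2)</emphasis> with the numeric
+ structure on the &acro.bib1; Date (30) access point:
+ <screen>
+ Z> find @attr 1=30 @attr 2=2 @attr 4=109 1989
+ </screen>
+ </para>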
- <para>
- For the <emphasis>Truncation</emphasis> attribute,
- <emphasis>No Truncation</emphasis> is the default.
- <emphasis>Left Truncation</emphasis> is not supported.
- <emphasis>Process # in search term</emphasis> is supported, as is
- <emphasis>Regxp-1</emphasis>.
- <emphasis>Regxp-2</emphasis> enables the fault-tolerant (fuzzy)
- search. As a default, a single error (deletion, insertion,
- replacement) is accepted when terms are matched against the register
- contents.
- </para>
+ <para>
+ For the <emphasis>Truncation</emphasis> attribute,
+ <emphasis>No Truncation</emphasis> is the default.
+ <emphasis>Left Truncation</emphasis> is not supported.
+ <emphasis>Process # in search term</emphasis> is supported, as is
+ <emphasis>Regxp-1</emphasis>.
+ <emphasis>Regxp-2</emphasis> enables the fault-tolerant (fuzzy)
+ search. As a default, a single error (deletion, insertion,
+ replacement) is accepted when terms are matched against the register
+ contents.
+ </para>
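+ <para>
+ For example, a hypothetical fuzzy title search accepting a
+ single spelling error in the term:
+ <screen>
+ Z> find @attr 5=103 @attr 1=4 sonta
+ </screen>
+ </para>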
- </section>
+ </section>
</section>
<section id="querymodel-regular">
<title>&zebra; Regular Expressions in Truncation Attribute (type = 5)</title>
-
+
<para>
Each term in a query is interpreted as a regular expression if
the truncation value is either <emphasis>Regxp-1 (@attr 5=102)</emphasis>
</row>
</tbody>
</tgroup>
- </table>
+ </table>
<para>
The above operands can be combined with the following operators:
</para>
-
+
<table id="querymodel-regular-operators-table" frame="top">
<title>Regular Expression Operators</title>
<tgroup cols="2">
<tbody>
<row>
<entry><literal>x*</literal></entry>
- <entry>Matches <literal>x</literal> zero or more times.
+ <entry>Matches <literal>x</literal> zero or more times.
Priority: high.</entry>
</row>
<row>
<entry><literal>x+</literal></entry>
- <entry>Matches <literal>x</literal> one or more times.
+ <entry>Matches <literal>x</literal> one or more times.
Priority: high.</entry>
</row>
<row>
<entry><literal>x?</literal></entry>
- <entry> Matches <literal>x</literal> zero or once.
+ <entry> Matches <literal>x</literal> zero or once.
Priority: high.</entry>
</row>
<row>
<entry>The order of evaluation may be changed by using parentheses.</entry>
</row>
</tbody>
- </tgroup>
- </table>
-
+ </tgroup>
+ </table>
+
<para>
If the first character of the <literal>Regxp-2</literal> query
is a plus character (<literal>+</literal>) it marks the
beginning of a section with non-standard specifiers.
The next plus character marks the end of the section.
Currently &zebra; only supports one specifier, the error tolerance,
- which consists one digit.
+ which consists of one digit.
<!-- TODO Nice thing, but what does
that error tolerance digit *mean*? Maybe an example would be nice? -->
</para>
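+ <para>
+ For example, a hypothetical query raising the error tolerance
+ to two errors for a title search, using the plus-delimited
+ specifier section described above:
+ <screen>
+ Z> find @attr 5=103 @attr 1=4 +2+sonta
+ </screen>
+ </para>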
</para>
</section>
-
+
<!--
<para>
- The RecordType parameter in the <literal>zebra.cfg</literal> file, or
- the <literal>-t</literal> option to the indexer tells &zebra; how to
- process input records.
- Two basic types of processing are available - raw text and structured
- data. Raw text is just that, and it is selected by providing the
- argument <literal>text</literal> to &zebra;. Structured records are
- all handled internally using the basic mechanisms described in the
- subsequent sections.
- &zebra; can read structured records in many different formats.
- </para>
+ The RecordType parameter in the <literal>zebra.cfg</literal> file, or
+ the <literal>-t</literal> option to the indexer tells &zebra; how to
+ process input records.
+ Two basic types of processing are available - raw text and structured
+ data. Raw text is just that, and it is selected by providing the
+ argument <literal>text</literal> to &zebra;. Structured records are
+ all handled internally using the basic mechanisms described in the
+ subsequent sections.
+ &zebra; can read structured records in many different formats.
+ </para>
-->
</section>
<para>
Using the
<literal><cql2rpn>l2rpn.txt</cql2rpn></literal>
- &yaz; Frontend Virtual
+ &yaz; Frontend Virtual
Hosts option, one can configure
the &yaz; Frontend &acro.cql;-to-&acro.pqf;
- converter, specifying the interpretation of various
+ converter, specifying the interpretation of various
<ulink url="&url.cql;">&acro.cql;</ulink>
indexes, relations, etc. in terms of Type-1 query attributes.
- <!-- The yaz-client config file -->
+ <!-- The yaz-client config file -->
</para>
<para>
For example, using server-side &acro.cql;-to-&acro.pqf; conversion, one might
query a zebra server like this:
<screen>
- <![CDATA[
+ <![CDATA[
yaz-client localhost:9999
Z> querytype cql
Z> find text=(plant and soil)
]]>
</screen>
- and - if properly configured - even static relevance ranking can
- be performed using &acro.cql; query syntax:
+ and - if properly configured - even static relevance ranking can
+ be performed using &acro.cql; query syntax:
<screen>
- <![CDATA[
+ <![CDATA[
Z> find text = /relevant (plant and soil)
]]>
- </screen>
+ </screen>
</para>
<para>
- By the way, the same configuration can be used to
+ By the way, the same configuration can be used to
search using client-side &acro.cql;-to-&acro.pqf; conversion:
- (the only difference is <literal>querytype cql2rpn</literal>
- instead of
+ (the only difference is <literal>querytype cql2rpn</literal>
+ instead of
<literal>querytype cql</literal>, and the call specifying a local
conversion file)
<screen>
- <![CDATA[
+ <![CDATA[
yaz-client -q local/cql2pqf.txt localhost:9999
Z> querytype cql2rpn
Z> find text=(plant and soil)
]]>
- </screen>
+ </screen>
</para>
<para>
Exhaustive information can be found in the
Section <ulink url="&url.yaz.cql2pqf;">&acro.cql; to &acro.rpn; conversion</ulink>
in the &yaz; manual.
- </para>
- <!--
- <para>
- See
+ </para>
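+ <para>
+ For orientation, a minimal server-side conversion file might map a
+ few &acro.cql; indexes and relations to Type-1 attributes like this
+ (a sketch only; the index names and attribute values are
+ illustrative, not a shipped default):
+ <screen>
+ index.cql.serverChoice = 1=text
+ index.cql.all          = 1=text
+ index.dc.title         = 1=4
+ relation.=             = 2=3
+ relation.&lt;           = 2=1
+ truncation.right       = 5=1
+ </screen>
+ </para>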
+ <!--
+ <para>
+ See
<ulink url="http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html"/>
for the Maintenance Agency's work-in-progress mapping of Dublin Core
- indexes to Attribute Architecture (util, XD and BIB-2)
+ indexes to Attribute Architecture (util, XD and BIB-2)
attributes.
</para>
-->
- </section>
+ </section>
-</chapter>
+ </chapter>
<!-- Keep this comment at the end of the file
Local variables:
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
- sgml-parent-document: "zebra.xml"
+ sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
-<chapter id="quick-start">
- <title>Quick Start </title>
+ <chapter id="quick-start">
+ <title>Quick Start </title>
- <para>
- <!-- ### ulink to GILS profile: what's the URL? -->
- In this section, we will test the system by indexing a small set of
- sample GILS records that are included with the &zebra; distribution,
- running a &zebra; server against the newly created database, and
- searching the indexes with a client that connects to that server.
- </para>
- <para>
- Go to the <literal>examples/gils</literal> subdirectory of the
- distribution archive. The 48 test records are located in the sub
- directory <literal>records</literal>. To index these, type:
- <screen>
- zebraidx update records
- </screen>
- </para>
-
- <para>
- In this command, the word <literal>update</literal> is followed
- by the name of a directory: <literal>zebraidx</literal> updates all
- files in the hierarchy rooted at that directory.
- </para>
-
- <para>
- If your indexing command was successful, you are now ready to
- fire up a server. To start a server on port 2100, type:
-
- <screen>
- zebrasrv @:2100
- </screen>
-
- </para>
+ <para>
+ <!-- ### ulink to GILS profile: what's the URL? -->
+ In this section, we will test the system by indexing a small set of
+ sample GILS records that are included with the &zebra; distribution,
+ running a &zebra; server against the newly created database, and
+ searching the indexes with a client that connects to that server.
+ </para>
+ <para>
+ Go to the <literal>examples/gils</literal> subdirectory of the
+ distribution archive. The 48 test records are located in the
+ subdirectory <literal>records</literal>. To index these, type:
+ <screen>
+ zebraidx update records
+ </screen>
+ </para>
+
+ <para>
+ In this command, the word <literal>update</literal> is followed
+ by the name of a directory: <literal>zebraidx</literal> updates all
+ files in the hierarchy rooted at that directory.
+ </para>
+
+ <para>
+ If your indexing command was successful, you are now ready to
+ fire up a server. To start a server on port 2100, type:
+
+ <screen>
+ zebrasrv @:2100
+ </screen>
- <para>
- The &zebra; index that you have just created has a single database
- named <literal>Default</literal>.
- The database contains records structured according to
- the GILS profile, and the server will
- return records in &acro.usmarc;, &acro.grs1;, or &acro.sutrs; format depending
- on what the client asks for.
- </para>
-
- <para>
- To test the server, you can use any &acro.z3950; client.
- For instance, you can use the demo command-line client that comes
- with &yaz;:
- </para>
- <para>
- <screen>
- yaz-client localhost:2100
- </screen>
- </para>
-
- <para>
- When the client has connected, you can type:
- </para>
-
- <para>
- <screen>
- Z> find surficial
- Z> show 1
- </screen>
- </para>
-
- <para>
- The default retrieval syntax for the client is &acro.usmarc;, and the
- default element set is <literal>F</literal> (``full record''). To
- try other formats and element sets for the same record, try:
- </para>
- <para>
- <screen>
- Z>format sutrs
- Z>show 1
- Z>format grs-1
- Z>show 1
- Z>format xml
- Z>show 1
- Z>elements B
- Z>show 1
- </screen>
- </para>
-
- <note>
- <para>You may notice that more fields are returned when your
- client requests &acro.sutrs;, &acro.grs1; or &acro.xml; records.
- This is normal - not all of the GILS data elements have mappings in
- the &acro.usmarc; record format.
</para>
- </note>
- <para>
- If you've made it this far, you know that your installation is
- working, but there's a certain amount of voodoo going on - for
- example, the mysterious incantations in the
- <literal>zebra.cfg</literal> file. In order to help us understand
- these fully, the next chapter will work through a series of
- increasingly complex example configurations.
- </para>
-
-</chapter>
+
+ <para>
+ The &zebra; index that you have just created has a single database
+ named <literal>Default</literal>.
+ The database contains records structured according to
+ the GILS profile, and the server will
+ return records in &acro.usmarc;, &acro.grs1;, or &acro.sutrs; format depending
+ on what the client asks for.
+ </para>
+
+ <para>
+ To test the server, you can use any &acro.z3950; client.
+ For instance, you can use the demo command-line client that comes
+ with &yaz;:
+ </para>
+ <para>
+ <screen>
+ yaz-client localhost:2100
+ </screen>
+ </para>
+
+ <para>
+ When the client has connected, you can type:
+ </para>
+
+ <para>
+ <screen>
+ Z> find surficial
+ Z> show 1
+ </screen>
+ </para>
+
+ <para>
+ The default retrieval syntax for the client is &acro.usmarc;, and the
+ default element set is <literal>F</literal> ("full record"). To
+ try other formats and element sets for the same record, try:
+ </para>
+ <para>
+ <screen>
+ Z> format sutrs
+ Z> show 1
+ Z> format grs-1
+ Z> show 1
+ Z> format xml
+ Z> show 1
+ Z> elements B
+ Z> show 1
+ </screen>
+ </para>
+
+ <note>
+ <para>You may notice that more fields are returned when your
+ client requests &acro.sutrs;, &acro.grs1; or &acro.xml; records.
+ This is normal - not all of the GILS data elements have mappings in
+ the &acro.usmarc; record format.
+ </para>
+ </note>
+ <para>
+ If you've made it this far, you know that your installation is
+ working, but there's a certain amount of voodoo going on - for
+ example, the mysterious incantations in the
+ <literal>zebra.cfg</literal> file. In order to help us understand
+ these fully, the next chapter will work through a series of
+ increasingly complex example configurations.
+ </para>
+
+ </chapter>
<!-- Keep this comment at the end of the file
Local variables:
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
- sgml-parent-document: "zebra.xml"
+ sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
-<chapter id="record-model-alvisxslt">
+ <chapter id="record-model-alvisxslt">
<title>ALVIS &acro.xml; Record Model and Filter Module</title>
- <warning>
- <para>
- The functionality of this record model has been improved and
- replaced by the DOM &acro.xml; record model, see
- <xref linkend="record-model-domxml"/>. The Alvis &acro.xml; record
- model is considered obsolete, and will eventually be removed
- from future releases of the &zebra; software.
- </para>
- </warning>
+ <warning>
+ <para>
+ The functionality of this record model has been improved and
+ replaced by the DOM &acro.xml; record model, see
+ <xref linkend="record-model-domxml"/>. The Alvis &acro.xml; record
+ model is considered obsolete, and will eventually be removed
+ from future releases of the &zebra; software.
+ </para>
+ </warning>
<para>
The record model described in this chapter applies to the fundamental,
<xref linkend="componentmodulesalvis"/>.
</para>
- <para> This filter has been developed under the
+ <para> This filter has been developed under the
<ulink url="http://www.alvis.info/">ALVIS</ulink> project funded by
the European Community under the "Information Society Technologies"
Program (2002-2006).
</para>
-
-
+
+
<section id="record-model-alvisxslt-filter">
<title>ALVIS Record Filter</title>
<para>
The experimental, loadable Alvis &acro.xml;/&acro.xslt; filter module
- <literal>mod-alvis.so</literal> is packaged in the GNU/Debian package
+ <literal>mod-alvis.so</literal> is packaged in the GNU/Debian package
<literal>libidzebra1.4-mod-alvis</literal>.
It is invoked by the <filename>zebra.cfg</filename> configuration statement
<screen>
recordtype.xml: alvis.db/filter_alvis_conf.xml
</screen>
- In this example on all data files with suffix
+ In this example the filter is applied to all data files with suffix
<filename>*.xml</filename>, where the
Alvis &acro.xslt; filter configuration file is found in the
path <filename>db/filter_alvis_conf.xml</filename>.
valid &acro.xml;. It might look like this (This example is
used for indexing and display of &acro.oai; harvested records):
<screen>
- <?xml version="1.0" encoding="UTF-8"?>
- <schemaInfo>
- <schema name="identity" stylesheet="xsl/identity.xsl" />
- <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
- stylesheet="xsl/oai2index.xsl" />
- <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
- <!-- use split level 2 when indexing whole OAI Record lists -->
- <split level="2"/>
- </schemaInfo>
- </screen>
+ <?xml version="1.0" encoding="UTF-8"?>
+ <schemaInfo>
+ <schema name="identity" stylesheet="xsl/identity.xsl" />
+ <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
+ stylesheet="xsl/oai2index.xsl" />
+ <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
+ <!-- use split level 2 when indexing whole OAI Record lists -->
+ <split level="2"/>
+ </schemaInfo>
+ </screen>
</para>
<para>
All named stylesheets defined inside
- <literal>schema</literal> element tags
+ <literal>schema</literal> element tags
are for presentation after search, including
the indexing stylesheet (which is a great debugging help). The
names defined in the <literal>name</literal> attributes must be
- unique, these are the literal <literal>schema</literal> or
- <literal>element set</literal> names used in
- <ulink url="http://www.loc.gov/standards/sru/srw/">&acro.srw;</ulink>,
- <ulink url="&url.sru;">&acro.sru;</ulink> and
+ unique; these are the literal <literal>schema</literal> or
+ <literal>element set</literal> names used in
+ <ulink url="http://www.loc.gov/standards/sru/srw/">&acro.srw;</ulink>,
+ <ulink url="&url.sru;">&acro.sru;</ulink> and
&acro.z3950; protocol queries.
The paths in the <literal>stylesheet</literal> attributes
are relative to zebra's working directory, or absolute to the file
</para>
<para>
There must be exactly one indexing &acro.xslt; stylesheet, which is
- defined by the magic attribute
+ defined by the magic attribute
<literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
</para>
<section id="record-model-alvisxslt-internal">
- <title>ALVIS Internal Record Representation</title>
+ <title>ALVIS Internal Record Representation</title>
<para>When indexing, an &acro.xml; Reader is invoked to split the input
- files into suitable record &acro.xml; pieces. Each record piece is then
- transformed to an &acro.xml; &acro.dom; structure, which is essentially the
- record model. Only &acro.xslt; transformations can be applied during
- index, search and retrieval. Consequently, output formats are
- restricted to whatever &acro.xslt; can deliver from the record &acro.xml;
- structure, be it other &acro.xml; formats, HTML, or plain text. In case
- you have <literal>libxslt1</literal> running with E&acro.xslt; support,
- you can use this functionality inside the Alvis
- filter configuration &acro.xslt; stylesheets.
+ files into suitable record &acro.xml; pieces. Each record piece is then
+ transformed to an &acro.xml; &acro.dom; structure, which is essentially the
+ record model. Only &acro.xslt; transformations can be applied during
+ index, search and retrieval. Consequently, output formats are
+ restricted to whatever &acro.xslt; can deliver from the record &acro.xml;
+ structure, be it other &acro.xml; formats, HTML, or plain text. In case
+ you have <literal>libxslt1</literal> running with E&acro.xslt; support,
+ you can use this functionality inside the Alvis
+ filter configuration &acro.xslt; stylesheets.
</para>
</section>
<section id="record-model-alvisxslt-canonical">
- <title>ALVIS Canonical Indexing Format</title>
+ <title>ALVIS Canonical Indexing Format</title>
<para>The output of the indexing &acro.xslt; stylesheets must contain
- certain elements in the magic
+ certain elements in the magic
<literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
- namespace. The output of the &acro.xslt; indexing transformation is then
- parsed using &acro.dom; methods, and the contained instructions are
- performed on the <emphasis>magic elements and their
- subtrees</emphasis>.
+ namespace. The output of the &acro.xslt; indexing transformation is then
+ parsed using &acro.dom; methods, and the contained instructions are
+ performed on the <emphasis>magic elements and their
+ subtrees</emphasis>.
</para>
<para>
- For example, the output of the command
- <screen>
+ For example, the output of the command
+ <screen>
xsltproc xsl/oai2index.xsl one-record.xml
- </screen>
+ </screen>
might look like this:
<screen>
<?xml version="1.0" encoding="UTF-8"?>
- <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
- z:id="oai:JTRS:CP-3290---Volume-I"
- z:rank="47896">
- <z:index name="oai_identifier" type="0">
- oai:JTRS:CP-3290---Volume-I</z:index>
- <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
- <z:index name="oai_setspec" type="0">jtrs</z:index>
- <z:index name="dc_all" type="w">
- <z:index name="dc_title" type="w">Proceedings of the 4th
- International Conference and Exhibition:
- World Congress on Superconductivity - Volume I</z:index>
- <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
- Burnham, Editors</z:index>
- </z:index>
- </z:record>
+ <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
+ z:id="oai:JTRS:CP-3290---Volume-I"
+ z:rank="47896">
+ <z:index name="oai_identifier" type="0">
+ oai:JTRS:CP-3290---Volume-I</z:index>
+ <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
+ <z:index name="oai_setspec" type="0">jtrs</z:index>
+ <z:index name="dc_all" type="w">
+ <z:index name="dc_title" type="w">Proceedings of the 4th
+ International Conference and Exhibition:
+ World Congress on Superconductivity - Volume I</z:index>
+ <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
+ Burnham, Editors</z:index>
+ </z:index>
+ </z:record>
</screen>
</para>
- <para>This means the following: From the original &acro.xml; file
+ <para>This means the following: From the original &acro.xml; file
<literal>one-record.xml</literal> (or from the &acro.xml; record &acro.dom; of the
same form coming from a split input file), the indexing
stylesheet produces an indexing &acro.xml; record, which is defined by
the <literal>record</literal> element in the magic namespace
<literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
- &zebra; uses the content of
+ &zebra; uses the content of
<literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
- record ID, and - in case static ranking is set - the content of
+ record ID, and - in case static ranking is set - the content of
<literal>z:rank="47896"</literal> as static rank. Following the
discussion in <xref linkend="administration-ranking"/>
we see that this record is internally ordered
<literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
<!-- The type of action performed during indexing is defined by
<literal>z:type="update"></literal>, with recognized values
- <literal>insert</literal>, <literal>update</literal>, and
- <literal>delete</literal>. -->
+ <literal>insert</literal>, <literal>update</literal>, and
+ <literal>delete</literal>. -->
</para>
<para>In this example, the following literal indexes are constructed:
<screen>
- oai_identifier
- oai_datestamp
- oai_setspec
- dc_all
- dc_title
- dc_creator
+ oai_identifier
+ oai_datestamp
+ oai_setspec
+ dc_all
+ dc_title
+ dc_creator
</screen>
- where the indexing type is defined in the
- <literal>type</literal> attribute
+ where the indexing type is defined in the
+ <literal>type</literal> attribute
(any value from the standard configuration
- file <filename>default.idx</filename> will do). Finally, any
+ file <filename>default.idx</filename> will do). Finally, any
<literal>text()</literal> node content recursively contained
inside the <literal>index</literal> will be filtered through the
appropriate char map for character normalization, and will be
<literal>oai:JTRS:CP-3290---Volume-I</literal> will be literal,
byte for byte without any form of character normalization,
inserted into the index named <literal>oai_identifier</literal>,
- the text
+ the text
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted using the <literal>w</literal> character
normalization defined in <filename>default.idx</filename> into
the index <literal>dc_creator</literal> (that is, after character
- normalization the index will keep the individual words
- <literal>kumar</literal>, <literal>krishen</literal>,
+ normalization the index will keep the individual words
+ <literal>kumar</literal>, <literal>krishen</literal>,
<literal>and</literal>, <literal>calvin</literal>,
<literal>burnham</literal>, and <literal>editors</literal>), and
finally both the texts
<literal>Proceedings of the 4th International Conference and Exhibition:
- World Congress on Superconductivity - Volume I</literal>
+ World Congress on Superconductivity - Volume I</literal>
and
- <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
+ <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted into the index <literal>dc_all</literal> using
- the same character normalization map <literal>w</literal>.
+ the same character normalization map <literal>w</literal>.
</para>
<para>
Finally, this example configuration can be queried using &acro.pqf;
- queries, either transported by &acro.z3950;, (here using a yaz-client)
+ queries, either transported by &acro.z3950; (here using yaz-client)
<screen>
<![CDATA[
Z> open localhost:9999
<title>ALVIS Record Model Configuration</title>
- <section id="record-model-alvisxslt-index">
- <title>ALVIS Indexing Configuration</title>
+ <section id="record-model-alvisxslt-index">
+ <title>ALVIS Indexing Configuration</title>
<para>
As mentioned above, there can be only one indexing
stylesheet, and configuring the indexing process amounts to
writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the
- magic elements discussed in
- <xref linkend="record-model-alvisxslt-internal"/>.
+ magic elements discussed in
+ <xref linkend="record-model-alvisxslt-internal"/>.
Obviously, there are millions of different ways to accomplish this
task, and some comments and code snippets are in order to lead
our Padawans on the right track to the good side of the force.
<emphasis>push</emphasis> type might be the only possible way to
sort out deeply recursive input &acro.xml; formats.
</para>
- <para>
+ <para>
A <emphasis>pull</emphasis> stylesheet example used to index
&acro.oai; harvested records could use some of the following template
definitions:
<screen>
<![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
- xmlns:z="http://indexdata.dk/zebra/xslt/1"
- xmlns:oai="http://www.openarchives.org/&acro.oai;/2.0/"
- xmlns:oai_dc="http://www.openarchives.org/&acro.oai;/2.0/oai_dc/"
- xmlns:dc="http://purl.org/dc/elements/1.1/"
- version="1.0">
-
- <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
-
- <!-- disable all default text node output -->
- <xsl:template match="text()"/>
-
- <!-- match on oai xml record root -->
- <xsl:template match="/">
- <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">
- <!-- you might want to use z:rank="{some &acro.xslt; function here}" -->
- <xsl:apply-templates/>
- </z:record>
- </xsl:template>
-
- <!-- &acro.oai; indexing templates -->
- <xsl:template match="oai:record/oai:header/oai:identifier">
- <z:index name="oai_identifier" type="0">
- <xsl:value-of select="."/>
- </z:index>
- </xsl:template>
-
- <!-- etc, etc -->
-
- <!-- DC specific indexing templates -->
- <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
- <z:index name="dc_title" type="w">
- <xsl:value-of select="."/>
- </z:index>
- </xsl:template>
-
- <!-- etc, etc -->
-
- </xsl:stylesheet>
+ xmlns:z="http://indexdata.dk/zebra/xslt/1"
+ xmlns:oai="http://www.openarchives.org/&acro.oai;/2.0/"
+ xmlns:oai_dc="http://www.openarchives.org/&acro.oai;/2.0/oai_dc/"
+ xmlns:dc="http://purl.org/dc/elements/1.1/"
+ version="1.0">
+
+ <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
+
+ <!-- disable all default text node output -->
+ <xsl:template match="text()"/>
+
+ <!-- match on oai xml record root -->
+ <xsl:template match="/">
+ <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">
+ <!-- you might want to use z:rank="{some &acro.xslt; function here}" -->
+ <xsl:apply-templates/>
+ </z:record>
+ </xsl:template>
+
+ <!-- &acro.oai; indexing templates -->
+ <xsl:template match="oai:record/oai:header/oai:identifier">
+ <z:index name="oai_identifier" type="0">
+ <xsl:value-of select="."/>
+ </z:index>
+ </xsl:template>
+
+ <!-- etc, etc -->
+
+ <!-- DC specific indexing templates -->
+ <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
+ <z:index name="dc_title" type="w">
+ <xsl:value-of select="."/>
+ </z:index>
+ </xsl:template>
+
+ <!-- etc, etc -->
+
+ </xsl:stylesheet>
]]>
</screen>
</para>
Notice also
that the names and types of the indexes can be defined in the
indexing &acro.xslt; stylesheet <emphasis>dynamically according to
- content in the original &acro.xml; records</emphasis>, which has
+ content in the original &acro.xml; records</emphasis>, which has
opportunities for great power and wizardry as well as grande
- disaster.
+ disaster.
</para>
<para>
The following excerpt of a <emphasis>push</emphasis> stylesheet
- <emphasis>might</emphasis>
+ <emphasis>might</emphasis>
be a good idea, depending on how strictly you control the &acro.xml;
input format (due to rigorous checking against well-defined and
tight RelaxNG or &acro.xml; Schemas, for example):
<screen>
<![CDATA[
- <xsl:template name="element-name-indexes">
- <z:index name="{name()}" type="w">
- <xsl:value-of select="'1'"/>
- </z:index>
- </xsl:template>
+ <xsl:template name="element-name-indexes">
+ <z:index name="{name()}" type="w">
+ <xsl:value-of select="'1'"/>
+ </z:index>
+ </xsl:template>
]]>
</screen>
- This template creates indexes which have the name of the working
+ This template creates indexes which have the name of the working
node of any input &acro.xml; file, and assigns a '1' to the index.
- The example query
- <literal>find @attr 1=xyz 1</literal>
+ The example query
+ <literal>find @attr 1=xyz 1</literal>
finds all files which contain at least one
<literal>xyz</literal> &acro.xml; element. In case you cannot control
which element names the input files contain, you might ask for
</para>
<para>
One variation on the theme of <emphasis>dynamically created
- indexes</emphasis> will definitely be unwise:
+ indexes</emphasis> will definitely be unwise:
<screen>
- <![CDATA[
+ <![CDATA[
<!-- match on oai xml record root -->
- <xsl:template match="/">
- <z:record>
-
- <!-- create dynamic index name from input content -->
- <xsl:variable name="dynamic_content">
- <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
- </xsl:variable>
-
- <!-- create zillions of indexes with unknown names -->
- <z:index name="{$dynamic_content}" type="w">
- <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
- </z:index>
- </z:record>
-
- </xsl:template>
+ <xsl:template match="/">
+ <z:record>
+
+ <!-- create dynamic index name from input content -->
+ <xsl:variable name="dynamic_content">
+ <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
+ </xsl:variable>
+
+ <!-- create zillions of indexes with unknown names -->
+ <z:index name="{$dynamic_content}" type="w">
+ <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
+ </z:index>
+ </z:record>
+
+ </xsl:template>
]]>
</screen>
Don't be tempted to cross the line to the dark side of the force,
Luke; this leads to suffering and pain, and universal
disintegration of your project schedule.
</para>
- </section>
+ </section>
- <section id="record-model-alvisxslt-elementset">
- <title>ALVIS Exchange Formats</title>
- <para>
+ <section id="record-model-alvisxslt-elementset">
+ <title>ALVIS Exchange Formats</title>
+ <para>
An exchange format can be anything which can be the outcome of an
&acro.xslt; transformation, as far as the stylesheet is registered in
the main Alvis &acro.xslt; filter configuration file, see
<xref linkend="record-model-alvisxslt-filter"/>.
In principle anything that can be expressed in &acro.xml;, HTML, and
- TEXT can be the output of a <literal>schema</literal> or
- <literal>element set</literal> directive during search, as long as
- the information comes from the
+ TEXT can be the output of a <literal>schema</literal> or
+ <literal>element set</literal> directive during search, as long as
+ the information comes from the
<emphasis>original input record &acro.xml; &acro.dom; tree</emphasis>
(and not the transformed and <emphasis>indexed</emphasis> &acro.xml;!).
</para>
indexer can be accessed during record retrieval. The following
example is a summary of the possibilities:
<screen>
- <![CDATA[
+ <![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
- xmlns:z="http://indexdata.dk/zebra/xslt/1"
- version="1.0">
-
- <!-- register internal zebra parameters -->
- <xsl:param name="id" select="''"/>
- <xsl:param name="filename" select="''"/>
- <xsl:param name="score" select="''"/>
- <xsl:param name="schema" select="''"/>
-
- <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
-
- <!-- use then for display of internal information -->
- <xsl:template match="/">
- <z:zebra>
- <id><xsl:value-of select="$id"/></id>
- <filename><xsl:value-of select="$filename"/></filename>
- <score><xsl:value-of select="$score"/></score>
- <schema><xsl:value-of select="$schema"/></schema>
- </z:zebra>
- </xsl:template>
-
- </xsl:stylesheet>
+ xmlns:z="http://indexdata.dk/zebra/xslt/1"
+ version="1.0">
+
+ <!-- register internal zebra parameters -->
+ <xsl:param name="id" select="''"/>
+ <xsl:param name="filename" select="''"/>
+ <xsl:param name="score" select="''"/>
+ <xsl:param name="schema" select="''"/>
+
+ <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
+
+ <!-- use then for display of internal information -->
+ <xsl:template match="/">
+ <z:zebra>
+ <id><xsl:value-of select="$id"/></id>
+ <filename><xsl:value-of select="$filename"/></filename>
+ <score><xsl:value-of select="$score"/></score>
+ <schema><xsl:value-of select="$schema"/></schema>
+ </z:zebra>
+ </xsl:template>
+
+ </xsl:stylesheet>
]]>
</screen>
</para>
- </section>
+ </section>
- <section id="record-model-alvisxslt-example">
- <title>ALVIS Filter &acro.oai; Indexing Example</title>
- <para>
+ <section id="record-model-alvisxslt-example">
+ <title>ALVIS Filter &acro.oai; Indexing Example</title>
+ <para>
The source code tarball contains a working Alvis filter example in
the directory <filename>examples/alvis-oai/</filename>, which
- should get you started.
+ should get you started.
</para>
<para>
More example data can be harvested from any &acro.oai; compliant server,
- see details at the &acro.oai;
+ see details at the &acro.oai;
<ulink url="http://www.openarchives.org/">
http://www.openarchives.org/</ulink> web site, and the community
- links at
+ links at
<ulink url="http://www.openarchives.org/community/index.html">
http://www.openarchives.org/community/index.html</ulink>.
There is a tutorial
</section>
-
+
</chapter>
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
- sgml-parent-document: "zebra.xml"
+ sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
-<chapter id="record-model-domxml">
+ <chapter id="record-model-domxml">
<title>&acro.dom; &acro.xml; Record Model and Filter Module</title>
-
+
<para>
The record model described in this chapter applies to the fundamental,
structured &acro.xml;
releases of the &zebra; Information Server.
</para>
-
-
+
+
<section id="record-model-domxml-filter">
<title>&acro.dom; Record Filter Architecture</title>
- <para>
- The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as
- internal data model, and can therefore parse, index, and display
- any &acro.xml; document type. It is well suited to work on
- standardized &acro.xml;-based formats such as Dublin Core, MODS, METS,
- MARCXML, OAI-PMH, RSS, and performs equally well on any other
- non-standard &acro.xml; format.
- </para>
- <para>
- A parser for binary &acro.marc; records based on the ISO2709 library
- standard is provided, it transforms these to the internal
- &acro.marcxml; &acro.dom; representation. Other binary document parsers
- are planned to follow.
- </para>
+ <para>
+ The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as
+ internal data model, and can therefore parse, index, and display
+ any &acro.xml; document type. It is well suited to work on
+ standardized &acro.xml;-based formats such as Dublin Core, MODS, METS,
+ MARCXML, OAI-PMH, RSS, and performs equally well on any other
+ non-standard &acro.xml; format.
+ </para>
+ <para>
+ A parser for binary &acro.marc; records based on the ISO2709 library
+ standard is provided; it transforms these to the internal
+ &acro.marcxml; &acro.dom; representation. Other binary document parsers
+ are planned to follow.
+ </para>
- <para>
- The &acro.dom; filter architecture consists of four
- different pipelines, each being a chain of arbitrarily many successive
- &acro.xslt; transformations of the internal &acro.dom; &acro.xml;
- representations of documents.
- </para>
+ <para>
+ The &acro.dom; filter architecture consists of four
+ different pipelines, each being a chain of arbitrarily many successive
+ &acro.xslt; transformations of the internal &acro.dom; &acro.xml;
+ representations of documents.
+ </para>
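+ <para>
+ As a sketch (the file and stylesheet names here are illustrative,
+ not shipped defaults), a &acro.dom; filter configuration file tying
+ the four pipelines together might look like this:
+ <screen>
+ &lt;?xml version="1.0" encoding="UTF-8"?&gt;
+ &lt;dom xmlns="http://indexdata.com/zebra-2.0"&gt;
+   &lt;input&gt;
+     &lt;xmlreader level="1"/&gt;
+   &lt;/input&gt;
+   &lt;extract&gt;
+     &lt;xslt stylesheet="indexing.xsl"/&gt;
+   &lt;/extract&gt;
+   &lt;store&gt;
+     &lt;xslt stylesheet="storing.xsl"/&gt;
+   &lt;/store&gt;
+   &lt;retrieve name="dc"&gt;
+     &lt;xslt stylesheet="retrieve-dc.xsl"/&gt;
+   &lt;/retrieve&gt;
+ &lt;/dom&gt;
+ </screen>
+ Such a configuration would be invoked from
+ <filename>zebra.cfg</filename> with a statement such as
+ <screen>
+ recordtype.xml: dom.db/filter_dom_conf.xml
+ </screen>
+ </para>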
- <figure id="record-model-domxml-architecture-fig">
- <title>&acro.dom; &acro.xml; filter architecture</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="domfilter.pdf" format="PDF" scale="50"/>
- </imageobject>
- <imageobject>
- <imagedata fileref="domfilter.png" format="PNG"/>
- </imageobject>
- <textobject>
- <!-- Fall back if none of the images can be used -->
- <phrase>
- [Here there should be a diagram showing the &acro.dom; &acro.xml;
- filter architecture, but is seems that your
- tool chain has not been able to include the diagram in this
- document.]
- </phrase>
- </textobject>
- </mediaobject>
- </figure>
-
-
- <table id="record-model-domxml-architecture-table" frame="top">
- <title>&acro.dom; &acro.xml; filter pipelines overview</title>
- <tgroup cols="5">
- <thead>
- <row>
- <entry>Name</entry>
- <entry>When</entry>
- <entry>Description</entry>
- <entry>Input</entry>
- <entry>Output</entry>
- </row>
- </thead>
-
- <tbody>
- <row>
- <entry><literal>input</literal></entry>
- <entry>first</entry>
- <entry>input parsing and initial
- transformations to common &acro.xml; format</entry>
- <entry>Input raw &acro.xml; record buffers, &acro.xml; streams and
- binary &acro.marc; buffers</entry>
- <entry>Common &acro.xml; &acro.dom;</entry>
- </row>
- <row>
- <entry><literal>extract</literal></entry>
- <entry>second</entry>
- <entry>indexing term extraction
- transformations</entry>
- <entry>Common &acro.xml; &acro.dom;</entry>
- <entry>Indexing &acro.xml; &acro.dom;</entry>
- </row>
- <row>
- <entry><literal>store</literal></entry>
- <entry>second</entry>
- <entry> transformations before internal document
- storage</entry>
- <entry>Common &acro.xml; &acro.dom;</entry>
- <entry>Storage &acro.xml; &acro.dom;</entry>
- </row>
- <row>
- <entry><literal>retrieve</literal></entry>
- <entry>third</entry>
- <entry>multiple document retrieve transformations from
- storage to different output
- formats are possible</entry>
- <entry>Storage &acro.xml; &acro.dom;</entry>
- <entry>Output &acro.xml; syntax in requested formats</entry>
- </row>
- </tbody>
- </tgroup>
- </table>
+ <figure id="record-model-domxml-architecture-fig">
+ <title>&acro.dom; &acro.xml; filter architecture</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata fileref="domfilter.pdf" format="PDF" scale="50"/>
+ </imageobject>
+ <imageobject>
+ <imagedata fileref="domfilter.png" format="PNG"/>
+ </imageobject>
+ <textobject>
+ <!-- Fall back if none of the images can be used -->
+ <phrase>
+ [Here there should be a diagram showing the &acro.dom; &acro.xml;
+          filter architecture, but it seems that your
+ tool chain has not been able to include the diagram in this
+ document.]
+ </phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+
+
+ <table id="record-model-domxml-architecture-table" frame="top">
+ <title>&acro.dom; &acro.xml; filter pipelines overview</title>
+ <tgroup cols="5">
+ <thead>
+ <row>
+ <entry>Name</entry>
+ <entry>When</entry>
+ <entry>Description</entry>
+ <entry>Input</entry>
+ <entry>Output</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><literal>input</literal></entry>
+ <entry>first</entry>
+ <entry>input parsing and initial
+ transformations to common &acro.xml; format</entry>
+ <entry>Input raw &acro.xml; record buffers, &acro.xml; streams and
+ binary &acro.marc; buffers</entry>
+ <entry>Common &acro.xml; &acro.dom;</entry>
+ </row>
+ <row>
+ <entry><literal>extract</literal></entry>
+ <entry>second</entry>
+ <entry>indexing term extraction
+ transformations</entry>
+ <entry>Common &acro.xml; &acro.dom;</entry>
+ <entry>Indexing &acro.xml; &acro.dom;</entry>
+ </row>
+ <row>
+ <entry><literal>store</literal></entry>
+ <entry>second</entry>
+       <entry>transformations before internal document
+ storage</entry>
+ <entry>Common &acro.xml; &acro.dom;</entry>
+ <entry>Storage &acro.xml; &acro.dom;</entry>
+ </row>
+ <row>
+ <entry><literal>retrieve</literal></entry>
+ <entry>third</entry>
+       <entry>multiple document retrieval transformations from
+        storage to different output
+        formats are possible</entry>
+ <entry>Storage &acro.xml; &acro.dom;</entry>
+ <entry>Output &acro.xml; syntax in requested formats</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
- <para>
- The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and if supported on
- your platform, even &acro.exslt;), it brings thus full &acro.xpath;
- support to the indexing, storage and display rules of not only
- &acro.xml; documents, but also binary &acro.marc; records.
- </para>
- </section>
+   <para>
+    The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and, if supported on
+    your platform, even &acro.exslt;), thus bringing full &acro.xpath;
+    support to the indexing, storage and display rules of not only
+    &acro.xml; documents, but also binary &acro.marc; records.
+   </para>
+ </section>
- <section id="record-model-domxml-pipeline">
- <title>&acro.dom; &acro.xml; filter pipeline configuration</title>
+ <section id="record-model-domxml-pipeline">
+ <title>&acro.dom; &acro.xml; filter pipeline configuration</title>
<para>
The experimental, loadable &acro.dom; &acro.xml;/&acro.xslt; filter module
- <literal>mod-dom.so</literal>
+ <literal>mod-dom.so</literal>
is invoked by the <filename>zebra.cfg</filename> configuration statement
<screen>
recordtype.xml: dom.db/filter_dom_conf.xml
</screen>
- In this example the &acro.dom; &acro.xml; filter is configured to work
- on all data files with suffix
+ In this example the &acro.dom; &acro.xml; filter is configured to work
+ on all data files with suffix
<filename>*.xml</filename>, where the configuration file is found in the
path <filename>db/filter_dom_conf.xml</filename>.
</para>
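   <para>
    Several such mappings may coexist, one per file suffix. For
    instance, a second, hypothetical configuration file for binary
    &acro.marc; data files could be attached to the suffix
    <filename>*.mrc</filename> like this (the file name
    <filename>db/filter_dom_conf_marc.xml</filename> is illustrative
    only):
    <screen>
     recordtype.xml: dom.db/filter_dom_conf.xml
     recordtype.mrc: dom.db/filter_dom_conf_marc.xml
    </screen>
   </para>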
<para>The &acro.dom; &acro.xslt; filter configuration file must be
valid &acro.xml;. It might look like this:
- <screen>
- <![CDATA[
- <?xml version="1.0" encoding="UTF8"?>
- <dom xmlns="http://indexdata.com/zebra-2.0">
- <input>
- <xmlreader level="1"/>
- <!-- <marc inputcharset="marc-8"/> -->
- </input>
- <extract>
- <xslt stylesheet="common2index.xsl"/>
- </extract>
- <store>
- <xslt stylesheet="common2store.xsl"/>
- </store>
- <retrieve name="dc">
- <xslt stylesheet="store2dc.xsl"/>
- </retrieve>
- <retrieve name="mods">
- <xslt stylesheet="store2mods.xsl"/>
- </retrieve>
+ <screen>
+ <![CDATA[
+       <?xml version="1.0" encoding="UTF-8"?>
+ <dom xmlns="http://indexdata.com/zebra-2.0">
+ <input>
+ <xmlreader level="1"/>
+ <!-- <marc inputcharset="marc-8"/> -->
+ </input>
+ <extract>
+ <xslt stylesheet="common2index.xsl"/>
+ </extract>
+ <store>
+ <xslt stylesheet="common2store.xsl"/>
+ </store>
+ <retrieve name="dc">
+ <xslt stylesheet="store2dc.xsl"/>
+ </retrieve>
+ <retrieve name="mods">
+ <xslt stylesheet="store2mods.xsl"/>
+ </retrieve>
</dom>
- ]]>
+ ]]>
</screen>
</para>
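   <para>
    With this configuration file in place, indexing can be sketched
    as follows (the data directory name <filename>records</filename>
    is illustrative only):
    <screen>
     zebraidx -c zebra.cfg update records
    </screen>
   </para>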
<para>
- The root &acro.xml; element <literal><dom></literal> and all other &acro.dom;
- &acro.xml; filter elements are residing in the namespace
- <literal>xmlns="http://indexdata.com/zebra-2.0"</literal>.
+    The root &acro.xml; element <literal><dom></literal> and all other &acro.dom;
+    &acro.xml; filter elements reside in the namespace
+    <literal>xmlns="http://indexdata.com/zebra-2.0"</literal>.
</para>
<para>
All pipeline definition elements - i.e. the
- <literal><input></literal>,
- <literal><extract></literal>,
- <literal><store></literal>, and
- <literal><retrieve></literal> elements - are optional.
- Missing pipeline definitions are just interpreted
- do-nothing identity pipelines.
+ <literal><input></literal>,
+ <literal><extract></literal>,
+ <literal><store></literal>, and
+ <literal><retrieve></literal> elements - are optional.
+    Missing pipeline definitions are simply interpreted as
+    do-nothing identity pipelines.
</para>
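   <para>
    Because every pipeline element is optional, a minimal
    configuration defining only an indexing transformation is already
    valid; the missing <literal><input></literal>,
    <literal><store></literal> and
    <literal><retrieve></literal> pipelines then act as identity
    pipelines. A hypothetical sketch (the stylesheet name
    <filename>index.xsl</filename> is illustrative only):
    <screen>
     <![CDATA[
     <?xml version="1.0" encoding="UTF-8"?>
     <dom xmlns="http://indexdata.com/zebra-2.0">
       <extract>
         <xslt stylesheet="index.xsl"/>
       </extract>
     </dom>
     ]]>
    </screen>
   </para>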
<para>
- All pipeline definition elements may contain zero or more
+ All pipeline definition elements may contain zero or more
<literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
&acro.xslt; transformation instructions, which are performed
    sequentially from top to bottom.
   </para>
<section id="record-model-domxml-pipeline-input">
- <title>Input pipeline</title>
- <para>
- The <literal><input></literal> pipeline definition element
- may contain either one &acro.xml; Reader definition
- <literal><![CDATA[<xmlreader level="1"/>]]></literal>, used to split
- an &acro.xml; collection input stream into individual &acro.xml; &acro.dom;
- documents at the prescribed element level,
- or one &acro.marc; binary
- parsing instruction
- <literal><![CDATA[<marc inputcharset="marc-8"/>]]></literal>, which defines
- a conversion to &acro.marcxml; format &acro.dom; trees. The allowed values
- of the <literal>inputcharset</literal> attribute depend on your
- local <productname>iconv</productname> set-up.
- </para>
- <para>
- Both input parsers deliver individual &acro.dom; &acro.xml; documents to the
- following chain of zero or more
- <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
- &acro.xslt; transformations. At the end of this pipeline, the documents
- are in the common format, used to feed both the
- <literal><extract></literal> and
+ <title>Input pipeline</title>
+ <para>
+ The <literal><input></literal> pipeline definition element
+ may contain either one &acro.xml; Reader definition
+ <literal><![CDATA[<xmlreader level="1"/>]]></literal>, used to split
+ an &acro.xml; collection input stream into individual &acro.xml; &acro.dom;
+ documents at the prescribed element level,
+ or one &acro.marc; binary
+ parsing instruction
+ <literal><![CDATA[<marc inputcharset="marc-8"/>]]></literal>, which defines
+ a conversion to &acro.marcxml; format &acro.dom; trees. The allowed values
+ of the <literal>inputcharset</literal> attribute depend on your
+ local <productname>iconv</productname> set-up.
+ </para>
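+      <para>
+       For example, an input pipeline reading binary &acro.marc;
+       records and normalizing the resulting &acro.marcxml; could be
+       sketched as follows (the stylesheet name
+       <filename>marcxml2common.xsl</filename> is illustrative only):
+       <screen>
+        <![CDATA[
+        <input>
+          <marc inputcharset="marc-8"/>
+          <xslt stylesheet="marcxml2common.xsl"/>
+        </input>
+        ]]>
+       </screen>
+      </para>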
+ <para>
+ Both input parsers deliver individual &acro.dom; &acro.xml; documents to the
+ following chain of zero or more
+ <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+ &acro.xslt; transformations. At the end of this pipeline, the documents
+ are in the common format, used to feed both the
+ <literal><extract></literal> and
<literal><store></literal> pipelines.
- </para>
+ </para>
</section>
<section id="record-model-domxml-pipeline-extract">
- <title>Extract pipeline</title>
- <para>
- The <literal><extract></literal> pipeline takes documents
- from any common &acro.dom; &acro.xml; format to the &zebra; specific
- indexing &acro.dom; &acro.xml; format.
- It may consist of zero ore more
- <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
- &acro.xslt; transformations, and the outcome is handled to the
- &zebra; core to drive the process of building the inverted
- indexes. See
- <xref linkend="record-model-domxml-canonical-index"/> for
- details.
- </para>
+ <title>Extract pipeline</title>
+ <para>
+ The <literal><extract></literal> pipeline takes documents
+ from any common &acro.dom; &acro.xml; format to the &zebra; specific
+ indexing &acro.dom; &acro.xml; format.
+      It may consist of zero or more
+ <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+      &acro.xslt; transformations, and the outcome is handed over to the
+ &zebra; core to drive the process of building the inverted
+ indexes. See
+ <xref linkend="record-model-domxml-canonical-index"/> for
+ details.
+ </para>
</section>
<section id="record-model-domxml-pipeline-store">
- <title>Store pipeline</title>
- The <literal><store></literal> pipeline takes documents
- from any common &acro.dom; &acro.xml; format to the &zebra; specific
- storage &acro.dom; &acro.xml; format.
- It may consist of zero ore more
- <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
- &acro.xslt; transformations, and the outcome is handled to the
- &zebra; core for deposition into the internal storage system.
- </section>
+     <title>Store pipeline</title>
+     <para>
+      The <literal><store></literal> pipeline takes documents
+      from any common &acro.dom; &acro.xml; format to the &zebra; specific
+      storage &acro.dom; &acro.xml; format.
+      It may consist of zero or more
+      <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+      &acro.xslt; transformations, and the outcome is handed over to the
+      &zebra; core for deposit into the internal storage system.
+     </para>
+    </section>
<section id="record-model-domxml-pipeline-retrieve">
- <title>Retrieve pipeline</title>
+ <title>Retrieve pipeline</title>
<para>
- Finally, there may be one or more
- <literal><retrieve></literal> pipeline definitions, each
- of them again consisting of zero or more
- <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
- &acro.xslt; transformations. These are used for document
- presentation after search, and take the internal storage &acro.dom;
- &acro.xml; to the requested output formats during record present
- requests.
+ Finally, there may be one or more
+ <literal><retrieve></literal> pipeline definitions, each
+ of them again consisting of zero or more
+ <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+ &acro.xslt; transformations. These are used for document
+ presentation after search, and take the internal storage &acro.dom;
+ &acro.xml; to the requested output formats during record present
+ requests.
</para>
<para>
- The possible multiple
+ The possible multiple
<literal><retrieve></literal> pipeline definitions
are distinguished by their unique <literal>name</literal>
- attributes, these are the literal <literal>schema</literal> or
- <literal>element set</literal> names used in
- <ulink url="http://www.loc.gov/standards/sru/srw/">&acro.srw;</ulink>,
- <ulink url="&url.sru;">&acro.sru;</ulink> and
- &acro.z3950; protocol queries.
- </para>
+       attributes; these are the literal <literal>schema</literal> or
+ <literal>element set</literal> names used in
+ <ulink url="http://www.loc.gov/standards/sru/srw/">&acro.srw;</ulink>,
+ <ulink url="&url.sru;">&acro.sru;</ulink> and
+ &acro.z3950; protocol queries.
+ </para>
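+      <para>
+       For example, with the configuration shown above, the
+       <literal>dc</literal> retrieve pipeline is selected in a
+       yaz-client session by choosing the corresponding element set
+       name (the search term is illustrative only):
+       <screen>
+        Z> open localhost:9999
+        Z> form xml
+        Z> elem dc
+        Z> find computer
+        Z> show 1
+       </screen>
+      </para>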
</section>
namespace <literal>xmlns:z="http://indexdata.com/zebra-2.0"</literal>.
</para>
- <section id="record-model-domxml-canonical-index-pi">
- <title>Processing-instruction governed indexing format</title>
-
- <para>The output of the processing instruction driven
+ <section id="record-model-domxml-canonical-index-pi">
+ <title>Processing-instruction governed indexing format</title>
+
+    <para>The output of the processing-instruction-driven
indexing &acro.xslt; stylesheets must contain
- processing instructions named
- <literal>zebra-2.0</literal>.
+ processing instructions named
+ <literal>zebra-2.0</literal>.
The output of the &acro.xslt; indexing transformation is then
parsed using &acro.dom; methods, and the contained instructions are
performed on the <emphasis>elements and their
- subtrees directly following the processing instructions</emphasis>.
- </para>
- <para>
- For example, the output of the command
- <screen>
+ subtrees directly following the processing instructions</emphasis>.
+ </para>
+ <para>
+ For example, the output of the command
+ <screen>
xsltproc dom-index-pi.xsl marc-one.xml
- </screen>
- might look like this:
- <screen>
- <![CDATA[
- <?xml version="1.0" encoding="UTF-8"?>
- <?zebra-2.0 record id=11224466 rank=42?>
- <record>
- <?zebra-2.0 index control:0?>
- <control>11224466</control>
- <?zebra-2.0 index any:w title:w title:p title:s?>
- <title>How to program a computer</title>
+ </screen>
+ might look like this:
+ <screen>
+ <![CDATA[
+ <?xml version="1.0" encoding="UTF-8"?>
+ <?zebra-2.0 record id=11224466 rank=42?>
+ <record>
+ <?zebra-2.0 index control:0?>
+ <control>11224466</control>
+ <?zebra-2.0 index any:w title:w title:p title:s?>
+ <title>How to program a computer</title>
</record>
- ]]>
- </screen>
- </para>
- </section>
+ ]]>
+ </screen>
+ </para>
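+       <para>
+        Such instructions are emitted from an indexing stylesheet with
+        the standard &acro.xslt;
+        <literal>xsl:processing-instruction</literal> element. A
+        minimal, hypothetical template sketch:
+        <screen>
+         <![CDATA[
+         <xsl:template match="title">
+           <xsl:processing-instruction
+             name="zebra-2.0">index any:w title:w title:p title:s</xsl:processing-instruction>
+           <title><xsl:value-of select="."/></title>
+         </xsl:template>
+         ]]>
+        </screen>
+       </para>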
+ </section>
- <section id="record-model-domxml-canonical-index-element">
- <title>Magic element governed indexing format</title>
-
- <para>The output of the indexing &acro.xslt; stylesheets must contain
- certain elements in the magic
- <literal>xmlns:z="http://indexdata.com/zebra-2.0"</literal>
- namespace. The output of the &acro.xslt; indexing transformation is then
- parsed using &acro.dom; methods, and the contained instructions are
- performed on the <emphasis>magic elements and their
- subtrees</emphasis>.
- </para>
- <para>
- For example, the output of the command
- <screen>
- xsltproc dom-index-element.xsl marc-one.xml
- </screen>
- might look like this:
- <screen>
- <![CDATA[
- <?xml version="1.0" encoding="UTF-8"?>
- <z:record xmlns:z="http://indexdata.com/zebra-2.0"
- z:id="11224466" z:rank="42">
- <z:index name="control:0">11224466</z:index>
- <z:index name="any:w title:w title:p title:s">
- How to program a computer</z:index>
+ <section id="record-model-domxml-canonical-index-element">
+ <title>Magic element governed indexing format</title>
+
+ <para>The output of the indexing &acro.xslt; stylesheets must contain
+ certain elements in the magic
+ <literal>xmlns:z="http://indexdata.com/zebra-2.0"</literal>
+ namespace. The output of the &acro.xslt; indexing transformation is then
+ parsed using &acro.dom; methods, and the contained instructions are
+ performed on the <emphasis>magic elements and their
+ subtrees</emphasis>.
+ </para>
+ <para>
+ For example, the output of the command
+ <screen>
+ xsltproc dom-index-element.xsl marc-one.xml
+ </screen>
+ might look like this:
+ <screen>
+ <![CDATA[
+ <?xml version="1.0" encoding="UTF-8"?>
+ <z:record xmlns:z="http://indexdata.com/zebra-2.0"
+ z:id="11224466" z:rank="42">
+ <z:index name="control:0">11224466</z:index>
+ <z:index name="any:w title:w title:p title:s">
+ How to program a computer</z:index>
</z:record>
- ]]>
- </screen>
- </para>
- </section>
+ ]]>
+ </screen>
+ </para>
+ </section>
- <section id="record-model-domxml-canonical-index-semantics">
- <title>Semantics of the indexing formats</title>
+ <section id="record-model-domxml-canonical-index-semantics">
+ <title>Semantics of the indexing formats</title>
- <para>
- Both indexing formats are defined with equal semantics and
- behavior in mind:
- <itemizedlist>
+ <para>
+ Both indexing formats are defined with equal semantics and
+ behavior in mind:
+ <itemizedlist>
<listitem>
- <para>&zebra; specific instructions are either
+ <para>&zebra; specific instructions are either
processing instructions named
<literal>zebra-2.0</literal> or
elements contained in the namespace
<literal>xmlns:z="http://indexdata.com/zebra-2.0"</literal>.
- </para>
+ </para>
</listitem>
<listitem>
- <para>There must be exactly one <literal>record</literal>
- instruction, which sets the scope for the following,
- possibly nested <literal>index</literal> instructions.
- </para>
+ <para>There must be exactly one <literal>record</literal>
+ instruction, which sets the scope for the following,
+ possibly nested <literal>index</literal> instructions.
+ </para>
</listitem>
<listitem>
- <para>
- The unique <literal>record</literal> instruction
- may have additional attributes <literal>id</literal>,
- <literal>rank</literal> and <literal>type</literal>.
- Attribute <literal>id</literal> is the value of the opaque ID
- and may be any string not containing the whitespace character
- <literal>' '</literal>.
- The <literal>rank</literal> attribute value must be a
- non-negative integer. See
- <xref linkend="administration-ranking"/> .
- The <literal>type</literal> attribute specifies how the record
- is to be treated. The following values may be given for
- <literal>type</literal>:
- <variablelist>
- <varlistentry>
- <term><literal>insert</literal></term>
- <listitem>
- <para>
- The record is inserted. If the record already exists, it is
- skipped (i.e. not replaced).
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term><literal>replace</literal></term>
- <listitem>
- <para>
- The record is replaced. If the record does not already exist,
- it is skipped (i.e. not inserted).
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term><literal>delete</literal></term>
- <listitem>
- <para>
- The record is deleted. If the record does not already exist,
- a warning issued and rest of records are skipped in
- from the input stream.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term><literal>update</literal></term>
- <listitem>
- <para>
- The record is inserted or replaced depending on whether the
- record exists or not. This is the default behavior but may
- be effectively changed by "outside" the scope of the DOM
- filter by zebraidx commands or extended services updates.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term><literal>adelete</literal></term>
- <listitem>
- <para>
- The record is deleted. If the record does not already exist,
- it is skipped (i.e. nothing is deleted).
- </para>
- <note>
- <para>
- Requires version 2.0.54 or later.
- </para>
- </note>
- </listitem>
- </varlistentry>
- </variablelist>
- Note that the value of <literal>type</literal> is only used to
- determine the action if and only if the Zebra indexer is running
- in "update" mode (i.e zebraidx update) or if the specialUpdate
- action of the
- <link linkend="administration-extended-services-z3950">Extended
+ <para>
+ The unique <literal>record</literal> instruction
+ may have additional attributes <literal>id</literal>,
+ <literal>rank</literal> and <literal>type</literal>.
+ Attribute <literal>id</literal> is the value of the opaque ID
+ and may be any string not containing the whitespace character
+ <literal>' '</literal>.
+ The <literal>rank</literal> attribute value must be a
+ non-negative integer. See
+          <xref linkend="administration-ranking"/>.
+ The <literal>type</literal> attribute specifies how the record
+ is to be treated. The following values may be given for
+ <literal>type</literal>:
+ <variablelist>
+ <varlistentry>
+ <term><literal>insert</literal></term>
+ <listitem>
+ <para>
+ The record is inserted. If the record already exists, it is
+ skipped (i.e. not replaced).
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><literal>replace</literal></term>
+ <listitem>
+ <para>
+ The record is replaced. If the record does not already exist,
+ it is skipped (i.e. not inserted).
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><literal>delete</literal></term>
+ <listitem>
+ <para>
+ The record is deleted. If the record does not already exist,
+                  a warning is issued and the remaining records from the
+                  input stream are skipped.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><literal>update</literal></term>
+ <listitem>
+ <para>
+ The record is inserted or replaced depending on whether the
+                  record exists or not. This is the default behavior, but it
+                  may be changed from outside the scope of the DOM
+                  filter by zebraidx commands or extended services updates.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><literal>adelete</literal></term>
+ <listitem>
+ <para>
+ The record is deleted. If the record does not already exist,
+ it is skipped (i.e. nothing is deleted).
+ </para>
+ <note>
+ <para>
+ Requires version 2.0.54 or later.
+ </para>
+ </note>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+          Note that the value of <literal>type</literal> is used to
+          determine the action only if the Zebra indexer is running
+          in "update" mode (i.e. zebraidx update) or if the specialUpdate
+ action of the
+ <link linkend="administration-extended-services-z3950">Extended
Service Update</link> is used.
- For this reason a specialUpdate may end up deleting records!
- </para>
+ For this reason a specialUpdate may end up deleting records!
+ </para>
</listitem>
<listitem>
- <para> Multiple and possible nested <literal>index</literal>
- instructions must contain at least one
+       <para>Multiple and possibly nested <literal>index</literal>
+ instructions must contain at least one
<literal>indexname:indextype</literal>
- pair, and may contain multiple such pairs separated by the
+ pair, and may contain multiple such pairs separated by the
whitespace character <literal>' '</literal>. In each index
- pair, the name and the type of the index is separated by a
+ pair, the name and the type of the index is separated by a
colon character <literal>':'</literal>.
- </para>
+ </para>
</listitem>
<listitem>
- <para>
+ <para>
       Any index name consisting of ASCII letters and following the
- standard &zebra; rules will do, see
+ standard &zebra; rules will do, see
<xref linkend="querymodel-pqf-apt-mapping-accesspoint"/>.
- </para>
+ </para>
</listitem>
<listitem>
- <para>
+ <para>
Index types are restricted to the values defined in
the standard configuration
file <filename>default.idx</filename>, see
- <xref linkend="querymodel-bib1"/> and
+ <xref linkend="querymodel-bib1"/> and
<xref linkend="fields-and-charsets"/> for details.
- </para>
+ </para>
</listitem>
<listitem>
- <para>
+ <para>
       &acro.dom; input documents which do not result in both one
- unique valid
- <literal>record</literal> instruction and one or more valid
+ unique valid
+ <literal>record</literal> instruction and one or more valid
       <literal>index</literal> instructions cannot be searched and
found. Therefore,
invalid document processing is aborted, and any content of
- the <literal><extract></literal> and
+ the <literal><extract></literal> and
<literal><store></literal> pipelines is discarded.
- A warning is issued in the logs.
- </para>
+ A warning is issued in the logs.
+ </para>
</listitem>
</itemizedlist>
- </para>
-
- <para>The examples work as follows:
- From the original &acro.xml; file
- <literal>marc-one.xml</literal> (or from the &acro.xml; record &acro.dom; of the
- same form coming from an <literal><input></literal>
- pipeline),
- the indexing
- pipeline <literal><extract></literal>
- produces an indexing &acro.xml; record, which is defined by
- the <literal>record</literal> instruction
- &zebra; uses the content of
- <literal>z:id="11224466"</literal>
- or
- <literal>id=11224466</literal>
- as internal
- record ID, and - in case static ranking is set - the content of
- <literal>rank=42</literal>
- or
- <literal>z:rank="42"</literal>
- as static rank.
- </para>
-
+ </para>
- <para>In these examples, the following literal indexes are constructed:
- <screen>
+ <para>The examples work as follows:
+ From the original &acro.xml; file
+ <literal>marc-one.xml</literal> (or from the &acro.xml; record &acro.dom; of the
+ same form coming from an <literal><input></literal>
+ pipeline),
+ the indexing
+ pipeline <literal><extract></literal>
+ produces an indexing &acro.xml; record, which is defined by
+     the <literal>record</literal> instruction.
+     &zebra; uses the content of
+ <literal>z:id="11224466"</literal>
+ or
+ <literal>id=11224466</literal>
+ as internal
+     record ID and, in case static ranking is set, the content of
+ <literal>rank=42</literal>
+ or
+ <literal>z:rank="42"</literal>
+ as static rank.
+ </para>
+
+
+ <para>In these examples, the following literal indexes are constructed:
+ <screen>
any:w
control:0
title:w
title:p
title:s
- </screen>
- where the indexing type is defined after the
- literal <literal>':'</literal> character.
- Any value from the standard configuration
- file <filename>default.idx</filename> will do.
- Finally, any
- <literal>text()</literal> node content recursively contained
- inside the <literal><z:index></literal> element, or any
- element following a <literal>index</literal> processing instruction,
- will be filtered through the
- appropriate char map for character normalization, and will be
- inserted in the named indexes.
- </para>
- <para>
- Finally, this example configuration can be queried using &acro.pqf;
- queries, either transported by &acro.z3950;, (here using a yaz-client)
- <screen>
- <![CDATA[
- Z> open localhost:9999
- Z> elem dc
- Z> form xml
- Z>
- Z> find @attr 1=control @attr 4=3 11224466
- Z> scan @attr 1=control @attr 4=3 ""
- Z>
- Z> find @attr 1=title program
- Z> scan @attr 1=title ""
- Z>
- Z> find @attr 1=title @attr 4=2 "How to program a computer"
- Z> scan @attr 1=title @attr 4=2 ""
- ]]>
- </screen>
- or the proprietary
- extensions <literal>x-pquery</literal> and
- <literal>x-pScanClause</literal> to
- &acro.sru;, and &acro.srw;
- <screen>
- <![CDATA[
- http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=@attr 1=title program
- http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr 1=title ""
- ]]>
- </screen>
- See <xref linkend="zebrasrv-sru"/> for more information on &acro.sru;/&acro.srw;
- configuration, and <xref linkend="gfs-config"/> or the &yaz;
- <ulink url="&url.yaz.cql;">&acro.cql; section</ulink>
- for the details or the &yaz; frontend server.
- </para>
- <para>
- Notice that there are no <filename>*.abs</filename>,
- <filename>*.est</filename>, <filename>*.map</filename>, or other &acro.grs1;
- filter configuration files involves in this process, and that the
- literal index names are used during search and retrieval.
- </para>
- <para>
- In case that we want to support the usual
- <literal>bib-1</literal> &acro.z3950; numeric access points, it is a
- good idea to choose string index names defined in the default
- configuration file <filename>tab/bib1.att</filename>, see
- <xref linkend="attset-files"/>
- </para>
-
- </section>
+ </screen>
+ where the indexing type is defined after the
+ literal <literal>':'</literal> character.
+ Any value from the standard configuration
+ file <filename>default.idx</filename> will do.
+ Finally, any
+ <literal>text()</literal> node content recursively contained
+ inside the <literal><z:index></literal> element, or any
+     element following an <literal>index</literal> processing instruction,
+ will be filtered through the
+ appropriate char map for character normalization, and will be
+ inserted in the named indexes.
+ </para>
+ <para>
+ Finally, this example configuration can be queried using &acro.pqf;
+     queries, either transported by &acro.z3950; (here using a yaz-client)
+ <screen>
+ <![CDATA[
+ Z> open localhost:9999
+ Z> elem dc
+ Z> form xml
+ Z>
+ Z> find @attr 1=control @attr 4=3 11224466
+ Z> scan @attr 1=control @attr 4=3 ""
+ Z>
+ Z> find @attr 1=title program
+ Z> scan @attr 1=title ""
+ Z>
+ Z> find @attr 1=title @attr 4=2 "How to program a computer"
+ Z> scan @attr 1=title @attr 4=2 ""
+ ]]>
+ </screen>
+ or the proprietary
+ extensions <literal>x-pquery</literal> and
+ <literal>x-pScanClause</literal> to
+     &acro.sru; and &acro.srw;
+ <screen>
+ <![CDATA[
+ http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=@attr 1=title program
+ http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr 1=title ""
+ ]]>
+ </screen>
+ See <xref linkend="zebrasrv-sru"/> for more information on &acro.sru;/&acro.srw;
+ configuration, and <xref linkend="gfs-config"/> or the &yaz;
+ <ulink url="&url.yaz.cql;">&acro.cql; section</ulink>
+     for the details of the &yaz; frontend server.
+ </para>
+ <para>
+ Notice that there are no <filename>*.abs</filename>,
+ <filename>*.est</filename>, <filename>*.map</filename>, or other &acro.grs1;
+     filter configuration files involved in this process, and that the
+ literal index names are used during search and retrieval.
+ </para>
+ <para>
+     If we want to support the usual
+     <literal>bib-1</literal> &acro.z3950; numeric access points, it is a
+     good idea to choose string index names defined in the default
+     configuration file <filename>tab/bib1.att</filename>, see
+     <xref linkend="attset-files"/>.
+ </para>
+
+ </section>
</section>
</section>
<title>&acro.dom; Record Model Configuration</title>
- <section id="record-model-domxml-index">
- <title>&acro.dom; Indexing Configuration</title>
+ <section id="record-model-domxml-index">
+ <title>&acro.dom; Indexing Configuration</title>
<para>
As mentioned above, there can be only one indexing pipeline,
      and configuration of the indexing process is synonymous
      with writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the
- magic processing instructions or elements discussed in
- <xref linkend="record-model-domxml-canonical-index"/>.
+ magic processing instructions or elements discussed in
+ <xref linkend="record-model-domxml-canonical-index"/>.
      Obviously, there are millions of different ways to accomplish this
task, and some comments and code snippets are in order to
enlighten the wary.
means that the output &acro.xml; structure is taken as starting point of
the internal structure of the &acro.xslt; stylesheet, and portions of
the input &acro.xml; are <emphasis>pulled</emphasis> out and inserted
- into the right spots of the output &acro.xml; structure.
+ into the right spots of the output &acro.xml; structure.
On the other
side, <emphasis>push</emphasis> &acro.xslt; stylesheets are recursively
      calling their template definitions, a process which is driven
- by the input &acro.xml; structure, and is triggered to produce
+ by the input &acro.xml; structure, and is triggered to produce
some output &acro.xml;
whenever some special conditions in the input stylesheets are
met. The <emphasis>pull</emphasis> type is well-suited for input
<emphasis>push</emphasis> type might be the only possible way to
sort out deeply recursive input &acro.xml; formats.
</para>
- <para>
+ <para>
A <emphasis>pull</emphasis> stylesheet example used to index
&acro.oai; harvested records could use some of the following template
definitions:
<screen>
<![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:z="http://indexdata.com/zebra-2.0"
 xmlns:oai="http://www.openarchives.org/OAI/2.0/"
 xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 version="1.0">

 <!-- Example pull and magic element style Zebra indexing -->
 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

 <!-- disable all default text node output -->
 <xsl:template match="text()"/>

 <!-- disable all default recursive element node traversal -->
 <xsl:template match="node()"/>

 <!-- match only on oai xml record root -->
 <xsl:template match="/">
  <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">
   <!-- you may use z:rank="{some XSLT function here}" -->

   <!-- explicitly call defined templates -->
   <xsl:apply-templates/>
  </z:record>
 </xsl:template>

 <!-- OAI indexing templates -->
 <xsl:template match="oai:record/oai:header/oai:identifier">
  <z:index name="oai_identifier:0">
   <xsl:value-of select="."/>
  </z:index>
 </xsl:template>

 <!-- etc, etc -->

 <!-- DC specific indexing templates -->
 <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
  <z:index name="dc_any:w dc_title:w dc_title:p dc_title:s">
   <xsl:value-of select="."/>
  </z:index>
 </xsl:template>

 <!-- etc, etc -->

</xsl:stylesheet>
]]>
</screen>
</para>
</section>
<section id="record-model-domxml-index-marc">
 <title>&acro.dom; Indexing &acro.marcxml;</title>
<para>
The &acro.dom; filter allows indexing of both binary &acro.marc; records
and &acro.marcxml; records, depending on its configuration.
A typical &acro.marcxml; record might look like this:
<screen>
<![CDATA[
<record xmlns="http://www.loc.gov/MARC21/slim">
 <rank>42</rank>
 <leader>00366nam 22001698a 4500</leader>
 <controlfield tag="001"> 11224466 </controlfield>
 <controlfield tag="003">DLC </controlfield>
 <controlfield tag="005">00000000000000.0 </controlfield>
 <controlfield tag="008">910710c19910701nju 00010 eng </controlfield>
 <datafield tag="010" ind1=" " ind2=" ">
  <subfield code="a"> 11224466 </subfield>
 </datafield>
 <datafield tag="040" ind1=" " ind2=" ">
  <subfield code="a">DLC</subfield>
  <subfield code="c">DLC</subfield>
 </datafield>
 <datafield tag="050" ind1="0" ind2="0">
  <subfield code="a">123-xyz</subfield>
 </datafield>
 <datafield tag="100" ind1="1" ind2="0">
  <subfield code="a">Jack Collins</subfield>
 </datafield>
 <datafield tag="245" ind1="1" ind2="0">
  <subfield code="a">How to program a computer</subfield>
 </datafield>
 <datafield tag="260" ind1="1" ind2=" ">
  <subfield code="a">Penguin</subfield>
 </datafield>
 <datafield tag="263" ind1=" " ind2=" ">
  <subfield code="a">8710</subfield>
 </datafield>
 <datafield tag="300" ind1=" " ind2=" ">
  <subfield code="a">p. cm.</subfield>
 </datafield>
</record>
]]>
</screen>
</para>
<para>
It is easy to perform string manipulation in the &acro.dom;
filter. For example, if you want to drop leading articles
when indexing sort fields, you can pick out the
&acro.marcxml; indicator attributes to chop off leading substrings. If
the above &acro.xml; example had an indicator
<literal>ind2="8"</literal> in the title field
<literal>245</literal>, i.e.
<screen>
<![CDATA[
<datafield tag="245" ind1="1" ind2="8">
 <subfield code="a">How to program a computer</subfield>
</datafield>
]]>
</screen>
one could write a template that takes this information into account
and chops the first <literal>8</literal> characters from the
sorting index <literal>title:s</literal>, like this:
<screen>
<![CDATA[
<xsl:template match="m:datafield[@tag='245']">
 <xsl:variable name="chop">
  <xsl:choose>
   <xsl:when test="not(number(@ind2))">0</xsl:when>
   <xsl:otherwise><xsl:value-of select="number(@ind2)"/></xsl:otherwise>
  </xsl:choose>
 </xsl:variable>

 <z:index name="title:w title:p any:w">
  <xsl:value-of select="m:subfield[@code='a']"/>
 </z:index>

 <z:index name="title:s">
  <xsl:value-of select="substring(m:subfield[@code='a'], $chop)"/>
 </z:index>

</xsl:template>
]]>
</screen>
The output of the above &acro.marcxml; and &acro.xslt; excerpt would then be:
<screen>
<![CDATA[
<z:index name="title:w title:p any:w">How to program a computer</z:index>
<z:index name="title:s">program a computer</z:index>
]]>
</screen>
and the record would be sorted in the title index under 'P', not 'H'.
</para>
</section>
<section id="record-model-domxml-index-wizzard">
 <title>&acro.dom; Indexing Wizardry</title>
<para>
The names and types of the indexes can be defined in the
indexing &acro.xslt; stylesheet <emphasis>dynamically according to
content in the original &acro.xml; records</emphasis>, which has
opportunities for great power and wizardry as well as grand
disaster.
</para>
<para>
The following excerpt of a <emphasis>push</emphasis> stylesheet
<emphasis>might</emphasis>
be a good idea, depending on how strictly you control the &acro.xml;
input format (for example, through rigorous checking against
well-defined and tight RelaxNG or &acro.xml; Schemas):
<screen>
<![CDATA[
<xsl:template name="element-name-indexes">
 <z:index name="{name()}:w">
  <xsl:value-of select="'1'"/>
 </z:index>
</xsl:template>
]]>
</screen>
This template creates indexes which have the name of the working
node of any input &acro.xml; file, and assigns a '1' to the index.
The example query
<literal>find @attr 1=xyz 1</literal>
finds all files which contain at least one
<literal>xyz</literal> &acro.xml; element. If you cannot control
which element names the input files contain, you might be asking for trouble.
</para>
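<para>
 If you do use this technique, it is safer to restrict it to a fixed
 set of known element names. A sketch, building on the
 <literal>element-name-indexes</literal> template above (the
 <literal>dc:</literal> element names are only illustrative):
 <screen>
 <![CDATA[
 <!-- only whitelisted elements contribute element-name indexes -->
 <xsl:template match="dc:title|dc:creator|dc:subject">
   <xsl:call-template name="element-name-indexes"/>
   <xsl:apply-templates/>
 </xsl:template>
 ]]>
 </screen>
</para>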
<para>
One variation over the theme <emphasis>dynamically created
indexes</emphasis> will definitely be unwise:
<screen>
<![CDATA[
<!-- match on oai xml record root -->
<xsl:template match="/">
 <z:record>

  <!-- create dynamic index name from input content -->
  <xsl:variable name="dynamic_content">
   <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
  </xsl:variable>

  <!-- create zillions of indexes with unknown names -->
  <z:index name="{$dynamic_content}:w">
   <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
  </z:index>
 </z:record>

</xsl:template>
]]>
</screen>
Don't be tempted to play overly clever tricks with the power of
dynamically created indexes: unpredictable index names result in severe
&zebra; index pollution.
</para>
</section>

<section id="record-model-domxml-debug">
 <title>Debugging &acro.dom; Filter Configurations</title>
<para>
It can be very hard to debug a &acro.dom; filter setup due to the many
successive &acro.marc; syntax translations, &acro.xml; stream splitting and
&acro.xslt; transformations involved. As an aid, you always have the
power of the <literal>-s</literal> command line switch to the
<literal>zebraidx</literal> indexing command at hand:
<screen>
 zebraidx -s -c zebra.cfg update some_record_stream.xml
</screen>
This command line simulates indexing and dumps a lot of debug
information in the logs, telling exactly which transformations
have been applied, what the documents look like after each
transformation, and which record ids and terms are sent to the indexer.
</para>
</section>

<!--
<section id="record-model-domxml-elementset">
 <title>&acro.dom; Exchange Formats</title>
 <para>
  An exchange format can be anything which can be the outcome of an
  &acro.xslt; transformation, as far as the stylesheet is registered in
  the main &acro.dom; &acro.xslt; filter configuration file, see
  <xref linkend="record-model-domxml-filter"/>.
  In principle anything that can be expressed in &acro.xml;, HTML, and
  TEXT can be the output of a <literal>schema</literal> or
  <literal>element set</literal> directive during search, as long as
  the information comes from the
  <emphasis>original input record &acro.xml; &acro.dom; tree</emphasis>
  (and not the transformed and <emphasis>indexed</emphasis> &acro.xml;!!).
 </para>
 <para>
  In addition, internal administrative information from the &zebra;
  indexer can be accessed during record retrieval. The following
  example is a summary of the possibilities:
  <screen>
  <![CDATA[
  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:z="http://indexdata.com/zebra-2.0"
   version="1.0">

   <!- - register internal zebra parameters - ->
   <xsl:param name="id" select="''"/>
   <xsl:param name="filename" select="''"/>
   <xsl:param name="score" select="''"/>
   <xsl:param name="schema" select="''"/>

   <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

   <!- - use then for display of internal information - ->
   <xsl:template match="/">
    <z:zebra>
     <id><xsl:value-of select="$id"/></id>
     <filename><xsl:value-of select="$filename"/></filename>
     <score><xsl:value-of select="$score"/></score>
     <schema><xsl:value-of select="$schema"/></schema>
    </z:zebra>
   </xsl:template>

  </xsl:stylesheet>
  ]]>
  </screen>
 </para>
</section>
-->
<!--
<section id="record-model-domxml-example">
<title>&acro.dom; Filter &acro.oai; Indexing Example</title>
<para>
The source code tarball contains a working &acro.dom; filter example in
the directory <filename>examples/dom-oai/</filename>, which
should get you started.
</para>
<para>
More example data can be harvested from any &acro.oai; compliant server;
see details at the &acro.oai;
<ulink url="http://www.openarchives.org/">
http://www.openarchives.org/</ulink> web site, and the community
links at
<ulink url="http://www.openarchives.org/community/index.html">
http://www.openarchives.org/community/index.html</ulink>.
There is a tutorial at
<ulink url="http://www.oaforum.org/tutorial/">
http://www.oaforum.org/tutorial/</ulink>.
</para>
</section>
-->
</section>
</chapter>
<chapter id="grs">
<title>&acro.grs1; Record Model and Filter Modules</title>
<note>
 <para>
  The functionality of this record model has been improved and
  replaced by the DOM &acro.xml; record model. See
  <xref linkend="record-model-domxml"/>.
 </para>
</note>
<para>
The record model described in this chapter applies to the fundamental,
<para>
This is the canonical input format
described in <xref linkend="grs-canonical-format"/>. It uses a
simple &acro.sgml;-like syntax.
</para>
</listitem>
</varlistentry>
<listitem>
<para>
This allows &zebra; to read
records in the ISO2709 (&acro.marc;) encoding standard.
The last parameter <replaceable>type</replaceable> names the
<literal>.abs</literal> file (see below)
which describes the specific &acro.marc; structure of the input record as
use <literal>grs.marcxml</literal> filter instead (see below).
</para>
<para>
The loadable <literal>grs.marc</literal> filter module
is packaged in the GNU/Debian package
<literal>libidzebra2.0-mod-grs-marc</literal>
</para>
</listitem>
<para>
The internal representation for <literal>grs.marcxml</literal>
is the same as for <ulink url="&url.marcxml;">&acro.marcxml;</ulink>.
It is slightly more complicated to work with than
<literal>grs.marc</literal> but &acro.xml; conformant.
</para>
<para>
<para>
This filter reads &acro.xml; records and uses
<ulink url="http://expat.sourceforge.net/">Expat</ulink> to
parse them and convert them into ID&zebra;'s internal
<literal>grs</literal> record model.
Only one record per file is supported, due to the fact that &acro.xml; does
not allow two documents to "follow" each other (there is no way
The loadable <literal>grs.xml</literal> filter module
is packaged in the GNU/Debian package
<literal>libidzebra2.0-mod-grs-xml</literal>
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>grs.tcl.</literal><replaceable>filter</replaceable></term>
<listitem>
<para>
Similar to grs.regx but using Tcl for rules, described in
<xref linkend="grs-regx-tcl"/>.
</para>
<para>
<screen>
<Distributor>
 <Name> USGS/WRD </Name>
 <Organization> USGS/WRD </Organization>
 <Street-Address>
  U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW
 </Street-Address>
 <City> ALBUQUERQUE </City>
 <State> NM </State>
 <Zip-Code> 87102 </Zip-Code>
 <Country> USA </Country>
 <Telephone> (505) 766-5560 </Telephone>
</Distributor>
</screen>
<!-- There is no indentation in the example above! -H
-note-
 -para-
  The indentation used above is used to illustrate how &zebra;
  interprets the mark-up. The indentation, in itself, has no
  significance to the parser for the canonical input format, which
  discards superfluous whitespace.
 -/para-
-/note-
-->
<screen>
<gils>
 <title>Zen and the Art of Motorcycle Maintenance</title>
</gils>
</screen>
type <literal>regx</literal>, argument
<emphasis>filter-filename</emphasis>).
</para>
<para>
Generally, an input filter consists of a sequence of rules, where each
rule consists of a sequence of expressions, followed by an action. The
and the actions normally contribute to the generation of an internal
representation of the record.
</para>
<para>
An expression can be either of the following:
</para>
<para>
Matches regular expression pattern <replaceable>reg</replaceable>
from the input record. The operators supported are the same
as for regular expression queries. Refer to
<xref linkend="querymodel-regular"/>.
</para>
</listitem>
data element. The <replaceable>type</replaceable> is one of
the following:
<variablelist>
<varlistentry>
<term>record</term>
<listitem>
/^Subject:/ BODY /$/ { data -element title $1 }
/^Date:/ BODY /$/ { data -element lastModified $1 }
/\n\n/ BODY END {
   begin element bodyOfDisplay
   begin variant body iana "text/plain"
   data -text $1
   end record
}
</screen>
<para>
<screen>
 ROOT
   TITLE "Zen and the Art of Motorcycle Maintenance"
   AUTHOR "Robert Pirsig"
</screen>
</para>
<para>
<screen>
 ROOT
   TITLE "Zen and the Art of Motorcycle Maintenance"
   AUTHOR
     FIRST-NAME "Robert"
     SURNAME "Pirsig"
</screen>
</para>
Which of the two elements is transmitted to the client by the server
depends on the specifications provided by the client, if any.
</para>
<para>
In practice, each variant node is associated with a triple of class,
type, value, corresponding to the variant mechanism of &acro.z3950;.
</para>
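<para>
 For example, the mail filter shown earlier attaches exactly such a
 triple, with class <literal>body</literal>, type
 <literal>iana</literal> and value <literal>"text/plain"</literal>,
 to the display element using the action:
 <screen>
  begin variant body iana "text/plain"
 </screen>
</para>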
</section>
<section id="grs-data-elements">
<title>Data Elements</title>
<para>
Data nodes have no children (they are always leaf nodes in the record
tree).
</para>
<!--
FIXME! Documentation needs extension here about types of nodes - numerical,
textual, etc., plus the various types of inclusion notes.
</para>
-->
</section>
</section>
<section id="grs-conf">
<title>&acro.grs1; Record Model Configuration</title>
<para>
The following sections describe the configuration files that govern
the internal management of <literal>grs</literal> records.
The system searches for the files
in the directories specified by the <emphasis>profilePath</emphasis>
setting in the <literal>zebra.cfg</literal> file.
</para>
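<para>
 For example, a <literal>zebra.cfg</literal> entry along these lines
 (the directory names are only illustrative) makes &zebra; look for
 the tables first in a local <literal>tab</literal> directory and then
 in the installed defaults:
 <screen>
  profilePath: ./tab:/usr/share/idzebra/tab
 </screen>
</para>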
<!--
FIXME - Need a diagram here, or a simple explanation how it all hangs together -H
-->
<para>
known.
</para>
</listitem>
<listitem>
<para>
The variant set which is used in the profile. This provides a
</para>
</listitem>
<listitem>
<para>
A list of element descriptions (this is the actual ARS of the
schema, in &acro.z3950; terms), which lists the ways in which the various
file. Some settings are optional (o), while others again are
mandatory (m).
</para>
</section>
<section id="abs-file">
<title>The Abstract Syntax (.abs) Files</title>
<para>
The name of this file type is slightly misleading in &acro.z3950; terms,
since, apart from the actual abstract syntax of the profile, it also
includes most of the other definitions that go into a database
profile.
</para>
<para>
When a record in the canonical, &acro.sgml;-like format is read from a file
or from the database, the first tag of the file should reference the
record is, say, <literal><gils></literal>, the system will look
for the profile definition in the file <literal>gils.abs</literal>.
Profile definitions are cached, so they only have to be read once
during the lifespan of the current process.
</para>
<para>
introduces the profile, and should always be called first thing when
introducing a new record.
</para>
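<para>
 The first few lines of such a file typically name the profile and the
 companion tables it relies on, for example (an illustrative sketch,
 not a complete profile):
 <screen>
  name gils
  attset gils.att
  tagset gils.tag
  varset var1.var
 </screen>
</para>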
<para>
The file may contain the following directives:
</para>
<para>
<variablelist>
<varlistentry>
<term>name <replaceable>symbolic-name</replaceable></term>
<listitem>
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>xelm <replaceable>xpath attributes</replaceable></term>
<listitem>
file via a header this directive is ignored.
If neither this directive is given, nor an encoding is set
within external records, ISO-8859-1 encoding is assumed.
</para>
</listitem>
</varlistentry>
<varlistentry>
<para>
If this directive is followed by <literal>enable</literal>,
then extra indexing is performed to allow for XPath-like queries.
If this directive is not specified - equivalent to
<literal>disable</literal> - no extra XPath-indexing is performed.
</para>
</listitem>
</varlistentry>
<!-- Adam's version
<varlistentry>
 <term>systag <replaceable>systemtag</replaceable> <replaceable>element</replaceable></term>
 <listitem>
  <para>
   This directive maps system information to an element during
   retrieval. This information is dynamically created. The
   following system tags are defined
   <variablelist>
    <varlistentry>
     <term>size</term>
     <listitem>
      <para>
       Size of record in bytes. By default this
       is mapped to element <literal>size</literal>.
      </para>
     </listitem>
    </varlistentry>
    <varlistentry>
     <term>rank</term>
     <listitem>
      <para>
       Score/rank of record. By default this
       is mapped to element <literal>rank</literal>.
       If no score was calculated for the record
       (non-ranked search), this directive is ignored.
      </para>
     </listitem>
    </varlistentry>

    <varlistentry>
     <term>sysno</term>
     <listitem>
      <para>
       &zebra;'s system number (record ID) for the
       record. By default this is mapped to element
       <literal>localControlNumber</literal>.
      </para>
     </listitem>
    </varlistentry>
   </variablelist>
   If you do not want a particular system tag to be applied,
   then set the resulting element to something undefined in the
   abs file (such as <literal>none</literal>).
  </para>
 </listitem>
</varlistentry>
-->
<!-- Mike's version -->
<listitem>
<para>
Specifies what information, if any, &zebra; should
automatically include in retrieval records for the
``system fields'' that it supports.
<replaceable>systemTag</replaceable> may
be any of the following:
<varlistentry>
<term><literal>rank</literal></term>
<listitem><para>
An integer indicating the relevance-ranking score
assigned to the record.
</para></listitem>
</varlistentry>
<varlistentry>
<term><literal>sysno</literal></term>
<listitem><para>
An automatically generated identifier for the record,
unique within this database. It is represented by the
<literal>&lt;localControlNumber&gt;</literal> element in
&acro.xml; and the <literal>(1,14)</literal> tag in &acro.grs1;.
</para></listitem>
</varlistentry>
<varlistentry>
<term><literal>size</literal></term>
<listitem><para>
The size, in bytes, of the retrieved record.
</para></listitem>
</varlistentry>
</variablelist>
</para>
</varlistentry>
</variablelist>
</para>
<note>
<para>
The mechanism for controlling indexing is not adequate for
configuration table eventually.
</para>
</note>
<para>
The following is an excerpt from the abstract syntax file for the GILS
profile.
elm (4,1) controlIdentifier Identifier-standard
elm (2,6) abstract Abstract
elm (4,51) purpose !
elm (4,52) originator -
elm (4,53) accessConstraints !
elm (4,54) useConstraints !
elm (4,70) availability -
<para>
This file type describes the <replaceable>Use</replaceable> elements of
an attribute set.
It contains the following directives.
</para>
<para>
<variablelist>
<varlistentry>
attribute value is stored in the index (unless a
<replaceable>local-value</replaceable> is
given, in which case this is stored). The name is used to refer to the
attribute from the <replaceable>abstract syntax</replaceable>.
</para>
</listitem></varlistentry>
</variablelist>
otherwise is noted.
</para>
</note>
<para>
The directives available in the element set file are as follows:
</para>
</para>
<!--
<emphasis>NOTE: FIXME! The schema-mapping functions are so far limited to a
straightforward mapping of elements. This should be extended with
mechanisms for conversions of the element contents, and conditional
mappings of elements based on the record contents.</emphasis>
-->
<para>
</para>
<!--
NOTE: FIXME! This will be described better. We're in the process of
re-evaluating and most likely changing the way that &acro.marc; records are
handled by the system.</emphasis>
-->
</section>
</para>
</listitem>
<!-- FIXME - Is this used anywhere ? -H -->
<listitem>
<para>
SOIF. Support for this syntax is experimental, and is currently
level.
</para>
</listitem>
</itemizedlist>
</para>
</section>
<section id="grs-extended-marc-indexing">
<title>Extended indexing of &acro.marc; records</title>
<para>Extended indexing of &acro.marc; records will help you if you need to
index a combination of subfields, index only a part of a field,
or use the embedded fields of a &acro.marc; record during indexing.
</para>
<para>Extended indexing of &acro.marc; records additionally allows:
<itemizedlist>
<listitem>
<para>to index data in the LEADER of a &acro.marc; record</para>
</listitem>
<listitem>
<para>to index data in fixed-length control fields</para>
</listitem>
<listitem>
<para>to use the values of indicators during indexing</para>
</listitem>
<listitem>
<para>to index linked fields for UNI&acro.marc; based formats</para>
</listitem>
</itemizedlist>
</para>
<note><para>Compared with the simple indexing process, extended indexing
may increase the indexing time for &acro.marc; records by a factor of
about 2-3.</para></note>
<section id="formula">
<title>The index-formula</title>
<para>At the beginning, we have to define the term
<emphasis>index-formula</emphasis> for &acro.marc; records. This term helps
in understanding the notation &zebra; uses for extended indexing of &acro.marc; records.
<ulink url="http://www.rba.ru/rusmarc/soft/Z39-50.htm">"The table
of conformity for &acro.z3950; use attributes and R&acro.usmarc; fields"</ulink>.
The document is available only in Russian.</para>
<para>
The <emphasis>index-formula</emphasis> is the combination of
subfields presented in the following way:
</para>
<screen>
71-00$a, $g, $h ($c){.$b ($c)} , (1)
</screen>
<para>
We know that &zebra; supports a &acro.bib1; attribute - right truncation.
In this case, the <emphasis>index-formula</emphasis> (1) consists of
forms, defined in the same way as (1)</para>
<screen>
71-00$a, $g, $h
71-00$a, $g
71-00$a
</screen>
<note>
<para>The original &acro.marc; record may lack some of the elements included in the <emphasis>index-formula</emphasis>.
</para>
</note>
<para>This notation uses the following operands:
<variablelist>
<varlistentry>
<term>#</term>
<listitem><para>Denotes a whitespace character.</para></listitem>
</varlistentry>
<varlistentry>
<term>-</term>
<listitem><para>The position may contain any value defined by the
&acro.marc; format.
For example, <emphasis>index-formula</emphasis></para>
<screen>
70-#1$a, $g , (2)
</screen>
<para>includes</para>
<screen>
700#1$a, $g
701#1$a, $g
702#1$a, $g
</screen>
</listitem>
</varlistentry>
<varlistentry>
<term>{...}</term>
<listitem>
<para>The repeatable elements are enclosed in curly brackets {}.
For example,
<emphasis>index-formula</emphasis></para>
<screen>
71-00$a, $g, $h ($c){.$b ($c)} , (3)
</screen>
<para>includes</para>
<screen>
71-00$a, $g, $h ($c). $b ($c)
71-00$a, $g, $h ($c). $b ($c). $b ($c)
71-00$a, $g, $h ($c). $b ($c). $b ($c). $b ($c)
</screen>
</listitem>
</varlistentry>
</variablelist>
<note>
<para>
All other operands are the same as in the &acro.marc; world.
</para>
</note>
</section>
<section id="notation">
<title>Notation of <emphasis>index-formula</emphasis> for &zebra;</title>
<para>Extended indexing overloads the <literal>path</literal> of the
<literal>elm</literal> definition in &zebra;'s abstract syntax file
(the <literal>.abs</literal> file). It means that names beginning with
<emphasis>index-formula</emphasis>. The database index is created and
linked with <emphasis>access point</emphasis> (&acro.bib1; use attribute)
according to this formula.</para>
<para>For example, <emphasis>index-formula</emphasis></para>
<screen>
71-00$a, $g, $h ($c){.$b ($c)} , (4)
</screen>
<para>in <literal>.abs</literal> file looks like:</para>
<screen>
mc-71.00_$a,_$g,_$h_(_$c_){.$b_(_$c_)}
</screen>
<para>The notation of <emphasis>index-formula</emphasis> uses the operands:
<variablelist>
<varlistentry>
<term>_</term>
<listitem><para>Denotes a whitespace character.</para></listitem>
</varlistentry>
<varlistentry>
<term>.</term>
<listitem><para>The position may contain any value defined by the
&acro.marc; format. For example,
<emphasis>index-formula</emphasis></para>
<screen>
70-#1$a, $g , (5)
</screen>
<para>matches <literal>mc-70._1_$a,_$g_</literal> and includes</para>
<screen>
700_1_$a,_$g_
701_1_$a,_$g_
</screen>
</listitem>
</varlistentry>
<varlistentry>
<term>{...}</term>
<listitem><para>The repeatable elements are enclosed in
curly brackets {}. For example,
<emphasis>index-formula</emphasis></para>
<screen>
71#00$a, $g, $h ($c) {.$b ($c)} , (6)
</screen>
<para>matches
<literal>mc-71.00_$a,_$g,_$h_(_$c_){.$b_(_$c_)}</literal> and
includes</para>
<screen>
71.00_$a,_$g,_$h_(_$c_).$b_(_$c_)
71.00_$a,_$g,_$h_(_$c_).$b_(_$c_).$b_(_$c_)
</screen>
</listitem>
</varlistentry>
<varlistentry>
<term>&lt;...&gt;</term>
<listitem><para>An embedded <emphasis>index-formula</emphasis> (for
linked fields) is enclosed in &lt;&gt;. For example,
<emphasis>index-formula</emphasis>
</para>
<screen>
4--#-$170-#1$a, $g ($c) , (7)
</screen>
<para>matches
<literal>mc-4.._._$1&lt;70._1_$a,_$g_(_$c_)&gt;_</literal> and
includes</para>
<screen>
463_._$1&lt;70._1_$a,_$g_(_$c_)&gt;_
</screen>
</listitem>
</varlistentry>
</variablelist>
</para>
<note>
<para>All other operands are the same as in the &acro.marc; world.</para>
</note>
<section id="grs-examples">
<title>Examples</title>
<para>
<orderedlist>
<listitem>
<para>indexing LEADER</para>
<para>Use the keyword "ldr" to index the leader. For example, to index
the data in the 6th and 7th positions of the LEADER:</para>
<screen>
elm mc-ldr[6] Record-type !
elm mc-ldr[7] Bib-level !
</screen>
</listitem>
<listitem>
<para>indexing data from control fields</para>
<para>indexing the date (the time the record was added to the database)</para>
<screen>
elm mc-008[0-5] Date/time-added-to-db !
</screen>
<para>or for R&acro.usmarc; (this data is included in field 100)</para>
<screen>
elm mc-100___$a[0-7]_ Date/time-added-to-db !
</screen>
</listitem>
<listitem>
<para>using indicators while indexing</para>
<para>For R&acro.usmarc; <emphasis>index-formula</emphasis>
<literal>70-#1$a, $g</literal> matches</para>
<screen>
elm 70._1_$a,_$g_ Author !:w,!:p
</screen>
<para>When &zebra; finds a field matching the
<literal>"70."</literal> pattern, it checks the indicators. In this
case the value of the first indicator doesn't matter, but the value of the
second one must be whitespace; otherwise the field is not
indexed.</para>
</listitem>
<listitem>
<para>indexing embedded (linked) fields for UNI&acro.marc; based
formats</para>
<para>For R&acro.usmarc; the <emphasis>index-formula</emphasis>
<literal>4--#-$170-#1$a, $g ($c)</literal> matches</para>
<screen><![CDATA[
elm mc-4.._._$1<70._1_$a,_$g_(_$c_)>_ Author !:w,!:p
]]></screen>
<para>Data are extracted from the record if the field matches the
<literal>"4.._."</literal> pattern and the data in the linked field
match the embedded
<emphasis>index-formula</emphasis>
<literal>70._1_$a,_$g_(_$c_)</literal>.</para>
</listitem>
</orderedlist>
</para>
</section>
</section>
</section>
</chapter>
<!-- Keep this comment at the end of the file
Local variables:
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
<chapter id="tutorial">
<title>Tutorial</title>
<sect1 id="tutorial-oai">
<title>A first &acro.oai; indexing example</title>
Go to the <literal>examples/oai-pmh</literal> subdirectory of the
distribution archive, or make a deep copy of the Debian installation
directory
<literal>/usr/share/idzebra-2.0-examples/oai-pmh</literal>.
An XML file containing multiple &acro.oai;
records is located in the sub
directory <literal>examples/oai-pmh/data</literal>.
</para>
<para>
Additional OAI test records can be downloaded by running a shell
script (you may want to abort the script when you have waited
longer than your coffee brews ..).
cd ../
</screen>
</para>
<para>
To index these &acro.oai; records, type:
<screen>
zebraidx-2.0 -c conf/zebra.cfg init
In case you have not installed zebra yet but have compiled the
binaries from this tarball, use the following command form:
<screen>
../../index/zebraidx -c conf/zebra.cfg this and that
</screen>
On some systems the &zebra; binaries are installed under generic
names; in that case use the following command form:
<screen>
zebraidx -c conf/zebra.cfg this and that
</screen>
</para>
<para>
In this command, the word <literal>update</literal> is followed
by the name of a directory: <literal>zebraidx</literal> updates all
files in the hierarchy rooted at <literal>data</literal>.
The command option
<literal>-c conf/zebra.cfg</literal> points to the proper
configuration file.
</para>
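<para>
Putting the pieces together, a complete indexing session might look
like the following sketch. It assumes the installed binary is simply
called <literal>zebraidx</literal>; substitute the binary name and
data directory used on your system.
</para>
<screen>
zebraidx -c conf/zebra.cfg init          # create an empty register
zebraidx -c conf/zebra.cfg update data   # index all files under data/
</screen>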
<para>
You might ask yourself how &acro.xml; content is indexed using &acro.xslt;
stylesheets: to satisfy your curiosity, you might want to run the
return records in the &acro.xml; format only. The indexing machine
did the splitting into individual records just behind the scenes.
</para>
</sect1>
<sect1 id="tutorial-oai-sru-pqf">
<title>Searching the &acro.oai; database by web service</title>
<para>
&zebra; has a built-in web service, which is close to the
&acro.sru; standard web service. We use it to access our new
database from any &acro.xml;-enabled web browser.
This service uses the &acro.pqf; query language.
In a later
section we show how to run a fully compliant &acro.sru; server,
<!--
relation tests:
<ulink url="">
http://localhost:9999/?version=1.1&operation=searchRetrieve
&zebra; uses &acro.xslt; stylesheets for both &acro.xml;record
indexing and
display retrieval. In this example installation, there are two
retrieval schemas defined in
<literal>conf/dom-conf.xml</literal>:
the <literal>dc</literal> schema implemented in
<literal>conf/oai2dc.xsl</literal>, and
the <literal>zebra</literal> schema implemented in
<literal>conf/oai2zebra.xsl</literal>.
The URLs for accessing both are the same, except for the different
value of the <literal>recordSchema</literal> parameter:
<ulink url="http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=dc">
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=dc
</ulink>
and
<ulink url="http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra">
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra
</ulink>
For the curious, one can see that the &acro.xslt; transformations
really do the magic.
<screen>
xsltproc conf/oai2dc.xsl data/debug-record.xml
xsltproc conf/oai2zebra.xsl data/debug-record.xml
original stored &acro.oai; &acro.xml; record.
<ulink url="http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::data">
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::data
</ulink>
</para>
</sect1>
stylesheet reveals the following word type indexes (i.e. those
with suffix <literal>:w</literal>):
<screen>
any:w
title:w
author:w
subject:w
Similarly, we can direct searches to the other defined indexes. Or we
can create boolean combinations of searches on different
indexes. In this case we search for <literal>the</literal> in
<literal>title</literal> and for <literal>fish</literal> in
<literal>description</literal> using the query
<literal>@and @attr 1=title the @attr 1=description fish</literal>.
<ulink
url="http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=@and
<literal>zebra::index::fieldname</literal>. In this example you
can see that the <literal>title</literal> index has both word
(type <literal>:w</literal>) and phrase (type
<literal>:p</literal>)
indexed fields,
<ulink url="http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::index::title">
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::index::title
</ulink>
</para>
<para>
But where in the indexes did the term match for the query occur?
Easily answered with the special &zebra; schema
<literal>zebra::snippet</literal>. The matching terms are
encapsulated by <literal>&lt;s&gt;</literal> tags.
<ulink url="http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::snippet">
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::snippet
</ulink>
</para>
<para>
<literal>title:w</literal> index.
<ulink url="http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::facet::title:w">
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::facet::title:w
</ulink>
</para>
<para>
<literal>:p</literal>.
<ulink url="http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::facet::publisher:p,title:p">
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::facet::publisher:p,title:p
</ulink>
</para>
</sect1>
</para>
<para>
We are all set to start the &acro.sru;/&acro.z3950; server including
&acro.pqf; and &acro.cql; query configuration. It uses the &yaz; frontend
server configuration - just type
<screen>
<para>
First, we'd like to be sure that we can see the &acro.explain;
&acro.xml; response correctly. You might use either of these equivalent
requests:
<ulink
url="http://localhost:9999">http://localhost:9999
</ulink>
</para>
<para>
Now we can issue true &acro.sru; requests. For example,
<literal>dc.title=the
and dc.description=fish</literal> results in the following page
<ulink
</para>
<para>
Scanning of indexes is part of the &acro.sru; server business. For example,
scanning the <literal>dc.title</literal> index gives us an idea
what search terms are found there
<ulink
url="http://localhost:9999/?version=1.1&operation=scan&scanClause=dc.title=fish">
http://localhost:9999/?version=1.1&operation=scan&scanClause=dc.title=fish
</ulink>,
whereas
<ulink
url="http://localhost:9999/?version=1.1&operation=scan&scanClause=dc.identifier=fish">
http://localhost:9999/?version=1.1&operation=scan&scanClause=dc.identifier=fish
</ulink>
accesses the indexed identifiers.
</para>
<para>
In addition, all &zebra; internal special element sets or record
schemas of the form
<literal>zebra::</literal> just work right out of the box
<ulink
url="http://localhost:9999/?version=1.1&operation=searchRetrieve&query=dc.title=the
<sect1 id="tutorial-oai-z3950">
<title>Searching the &acro.oai; database by &acro.z3950; protocol</title>
<para>
In this section we repeat the searches and presents we have done so
far, using the binary &acro.z3950; protocol; you can use any
&acro.z3950; client.
For instance, you can use the demo command-line client that comes
with &yaz;.
</para>
<para>
Connecting to the server is done by the command
<screen>
yaz-client localhost:9999
</screen>
</para>
<para>
When the client has connected, you can type:
<screen>
Z> show 1+1
</screen>
</para>
<para>
&acro.z3950; presents using presentation stylesheets:
<screen>
Z> elements dc
Z> show 2+1
Z> elements zebra
Z> show 3+1
</screen>
</para>
<para>
&acro.z3950; built-in Zebra presents (in this configuration only if
started without yaz-frontendserver):
<screen>
Z> elements zebra::meta
Z> show 4+1
Z> elements zebra::meta::sysno
Z> show 5+1
Z> format sutrs
Z> show 5+1
Z> format xml
Z> elements zebra::index
Z> show 6+1
Z> elements zebra::snippet
Z> show 7+1
Z> elements zebra::facet::any:w
Z> show 1+1
Z> elements zebra::facet::publisher:p,title:p
Z> show 1+1
</screen>
Z> find @attr 1=oai_setspec @attr 4=3 7374617475733D756E707562
Z> show 1+1
Z> find @attr 1=title communication
Z> show 1+1
Z> find @attr 1=identifier @attr 4=3
http://resolver.caltech.edu/CaltechCSTR:1986.5228-tr-86
Z> show 1+1
</screen>
etc, etc.
</para>
<para>
Z> querytype cql
Z> elements dc
Z>
Z> find harry
Z>
Z> find dc.creator = the
Z> find dc.creator = the
Z> find dc.title > some
Z>
Z> find dc.identifier="http://resolver.caltech.edu/CaltechCSTR:1978.2276-tr-78"
Z> find dc.relation = something
</screen>
</para>
<!--
etc, etc. Notice that all indexes defined by 'type="0"' in the
indexing style sheet must be searched using the 'eq'
relation.
Z> find title <> and
fails as well. ???
<tip>
<para>
&acro.z3950; scan using server-side CQL conversion -
unfortunately, this will _never_ work as it is not supported by the
&acro.z3950; standard.
If you want to use scan with server-side CQL conversion, you need to
make an SRW connection using yaz-client, or a
SRU connection using REST Web Services - any browser will do.
</para>
</tip>
<tip>
<para>
All indexes defined by 'type="0"' in the
indexing style sheet must be searched using the '@attr 4=3'
structure attribute instruction.
</para>
</tip>
<para>
Notice that searching and scanning on the indexes
<literal>contributor</literal>, <literal>language</literal>,
<literal>rights</literal>, and <literal>source</literal>
might fail, simply because none of the records in the small example set
have these fields set, and consequently, these indexes might not
have been created.
</para>
</sect1>
</chapter>
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"
[
<!ENTITY % local SYSTEM "local.ent">
<manvolnum>1</manvolnum>
<refmiscinfo class="manual">Commands</refmiscinfo>
</refmeta>
<refnamediv>
<refname>zebraidx</refname>
<refpurpose>&zebra; Administrative Tool</refpurpose>
</refnamediv>
<refsynopsisdiv>
<cmdsynopsis>
<command>zebraidx</command>
<arg choice="opt" rep="repeat"><replaceable>file</replaceable></arg>
</cmdsynopsis>
</refsynopsisdiv>
<refsect1><title>DESCRIPTION</title>
<para>
<command>zebraidx</command> allows you to insert, delete or update
</refsect1>
<refsect1>
<title>COMMANDS</title>
<variablelist>
<varlistentry>
<term>update <replaceable>directory</replaceable></term>
<term>clean</term>
<listitem><para>
Clean shadow files and "forget" changes.
</para></listitem>
</varlistentry>
<varlistentry>
<term>create <replaceable>database</replaceable></term>
<refsect1>
<title>OPTIONS</title>
<variablelist>
<varlistentry>
<term>-t <replaceable>type</replaceable></term>
<listitem>
settings for <replaceable>group</replaceable>
(see <link linkend="zebra-cfg">&zebra; Configuration File</link> in
the &zebra; manual).
</para>
</listitem>
</varlistentry>
<varlistentry>
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>-l <replaceable>file</replaceable></term>
<listitem>
<para>
Write log messages to <replaceable>file</replaceable> instead
of <literal>stderr</literal>.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>-m <replaceable>mbytes</replaceable></term>
<listitem>
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>-v <replaceable>level</replaceable></term>
<listitem>
<para>
<manvolnum>8</manvolnum>
</citerefentry>
</para>
</refsect1>
</refentry>
<!-- Keep this comment at the end of the file
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"
[
<!ENTITY % local SYSTEM "local.ent">
<manvolnum>8</manvolnum>
<refmiscinfo class="manual">Commands</refmiscinfo>
</refmeta>
<refnamediv>
<refname>zebrasrv</refname>
<refpurpose>Zebra Server</refpurpose>
<refsynopsisdiv>
&zebrasrv-synopsis;
</refsynopsisdiv>
<refsect1><title>DESCRIPTION</title>
<para>Zebra is a high-performance, general-purpose structured text indexing
and retrieval engine. It reads structured records in a variety of input
formats (e.g. email, &acro.xml;, &acro.marc;) and allows access to them through exact
boolean search expressions and relevance-ranked free-text queries.
</para>
<para>
<command>zebrasrv</command> is the &acro.z3950; and &acro.sru; frontend
server for the <command>Zebra</command> search engine and indexer.
</para>
<para>
On Unix you can run the <command>zebrasrv</command>
server from the command line - and put it
in the background. It may also operate under the inet daemon.
</refsect1>
<refsect1>
<title>OPTIONS</title>
<para>
The options for <command>zebrasrv</command> are the same
as those for &yaz;' <command>yaz-ztest</command>.
Option <literal>-c</literal> specifies a Zebra configuration
file - if omitted <filename>zebra.cfg</filename> is read.
</para>
&zebrasrv-options;
</refsect1>
<refsect1 id="protocol-support">
<title>&acro.z3950; Protocol Support and Behavior</title>
<refsect2 id="zebrasrv-search">
<title>&acro.z3950; Search</title>
<para>
The supported query types are 1 and 101. All operators are currently
supported with the restriction that only proximity units of type "word"
without limitations.
Searches may span multiple databases.
</para>
<para>
The server has full support for piggy-backed retrieval (see
also the following section).
</para>
</refsect2>
<refsect2 id="zebrasrv-present">
<title>&acro.z3950; Present</title>
<para>
Of these Zebra supports the attribute specification type in which
case the use attribute specifies the "Sort register".
Sort registers are created for those fields that are of type "sort" in
the default.idx file.
The corresponding character mapping file in default.idx specifies the
ordinal of each character used in the actual sort.
</para>
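<para>
As a sketch, a sort register declaration in
<filename>default.idx</filename> might look like the following; the
stock file shipped with your &zebra; version may differ in its
details, so treat this as illustrative only:
</para>
<screen>
sort s
completeness 1
charmap string.chr
</screen>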
timeout.
</para>
</refsect2>
<refsect2 id="zebrasrv-explain">
<title>&acro.z3950; Explain</title>
<para>
Zebra maintains a "classic"
<ulink url="&url.z39.50.explain;">&acro.z3950; Explain</ulink> database
on the side.
This database is called <literal>IR-Explain-1</literal> and can be
searched using the attribute set <literal>exp-1</literal>.
</para>
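<para>
For example, using the &yaz; command-line client, the Explain
database can be searched along these lines (a sketch; the result
depends on the databases your server actually hosts):
</para>
<screen>
Z> base IR-Explain-1
Z> find @attrset exp1 @attr 1=1 databaseinfo
Z> show 1+1
</screen>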
<para>
The records in the explain database are of type
<literal>grs.sgml</literal>.
The root element for the Explain grs.sgml records is
<literal>explain</literal>, thus
<filename>explain.abs</filename> is used for indexing.
</para>
<note>
<para>
Zebra <emphasis>must</emphasis> be able to locate
<filename>explain.abs</filename> in order to index the Explain
records properly. Zebra will work without it but the information
<para>
In addition to &acro.z3950;, Zebra supports the more recent and
web-friendly IR protocol <ulink url="&url.sru;">&acro.sru;</ulink>.
&acro.sru; can be carried over &acro.soap; or a &acro.rest;-like protocol
that uses HTTP &acro.get; or &acro.post; to request search responses. The request
itself is made of parameters such as
<literal>query</literal>,
<literal>startRecord</literal>,
<literal>maximumRecords</literal>
and
<literal>recordSchema</literal>;
the response is an &acro.xml; document containing hit-count, result-set
records, diagnostics, etc.  &acro.sru; can be thought of as a re-casting
of &acro.z3950; semantics in web-friendly terms; or as a standardisation
of the ad-hoc query parameters used by search engines such as Google
and AltaVista; or as a superset of A9's OpenSearch (which it
predates).
</para>
<para>
Zebra supports &acro.z3950;, &acro.sru; &acro.get;, SRU &acro.post;, SRU &acro.soap; (&acro.srw;)
requests and handling them accordingly. This is achieved through
the use of Deep Magic; civilians are warned not to stand too close.
</para>
- <refsect2 id="zebrasrv-sru-run">
- <title>Running zebrasrv as an &acro.sru; Server</title>
- <para>
- Because Zebra supports all protocols on one port, it would
- seem to follow that the &acro.sru; server is run in the same way as
- the &acro.z3950; server, as described above. This is true, but only in
- an uninterestingly vacuous way: a Zebra server run in this manner
- will indeed recognise and accept &acro.sru; requests; but since it
- doesn't know how to handle the &acro.cql; queries that these protocols
- use, all it can do is send failure responses.
- </para>
- <note>
+ <refsect2 id="zebrasrv-sru-run">
+ <title>Running zebrasrv as an &acro.sru; Server</title>
<para>
- It is possible to cheat, by having &acro.sru; search Zebra with
- a &acro.pqf; query instead of &acro.cql;, using the
- <literal>x-pquery</literal>
- parameter instead of
- <literal>query</literal>.
- This is a
- <emphasis role="strong">non-standard extension</emphasis>
- of &acro.cql;, and a
- <emphasis role="strong">very naughty</emphasis>
- thing to do, but it does give you a way to see Zebra serving &acro.sru;
- ``right out of the box''. If you start your favourite Zebra
- server in the usual way, on port 9999, then you can send your web
- browser to:
- </para>
- <screen>
- http://localhost:9999/Default?version=1.1
+ Because Zebra supports all protocols on one port, it would
+ seem to follow that the &acro.sru; server is run in the same way as
+ the &acro.z3950; server, as described above. This is true, but only in
+ an uninterestingly vacuous way: a Zebra server run in this manner
+ will indeed recognise and accept &acro.sru; requests; but since it
+ doesn't know how to handle the &acro.cql; queries that these protocols
+ use, all it can do is send failure responses.
+ </para>
+ <note>
+ <para>
+ It is possible to cheat, by having &acro.sru; search Zebra with
+ a &acro.pqf; query instead of &acro.cql;, using the
+ <literal>x-pquery</literal>
+ parameter instead of
+ <literal>query</literal>.
+ This is a
+ <emphasis role="strong">non-standard extension</emphasis>
+ of &acro.cql;, and a
+ <emphasis role="strong">very naughty</emphasis>
+ thing to do, but it does give you a way to see Zebra serving &acro.sru;
+ ``right out of the box''. If you start your favourite Zebra
+ server in the usual way, on port 9999, then you can send your web
+ browser to:
+ </para>
+ <screen>
+ http://localhost:9999/Default?version=1.1
&operation=searchRetrieve
&x-pquery=mineral
&startRecord=1
&maximumRecords=1
- </screen>
+ </screen>
+ <para>
+ This will display the &acro.xml;-formatted &acro.sru; response that includes the
+ first record in the result-set found by the query
+ <literal>mineral</literal>. (For clarity, the &acro.sru; URL is shown
+ here broken across lines, but the lines should be joined together
+      to make a single-line URL for the browser to submit.)
+ </para>
+ </note>
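Joined into one line, the request in the note can also be assembled programmatically. A minimal sketch using only the Python standard library, assuming (as above) a Zebra server on port 9999 serving the `Default` database:

```python
from urllib.parse import urlencode

# SRU parameters from the example above. "x-pquery" is Zebra's
# non-standard PQF extension, used here instead of a standard
# "query" parameter carrying CQL.
params = {
    "version": "1.1",
    "operation": "searchRetrieve",
    "x-pquery": "mineral",
    "startRecord": "1",
    "maximumRecords": "1",
}
url = "http://localhost:9999/Default?" + urlencode(params)
print(url)
# http://localhost:9999/Default?version=1.1&operation=searchRetrieve&x-pquery=mineral&startRecord=1&maximumRecords=1
```

The resulting single-line URL is exactly what a browser (or any HTTP client) would submit.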
<para>
- This will display the &acro.xml;-formatted &acro.sru; response that includes the
- first record in the result-set found by the query
- <literal>mineral</literal>. (For clarity, the &acro.sru; URL is shown
- here broken across lines, but the lines should be joined together
- to make single-line URL for the browser to submit.)
+ In order to turn on Zebra's support for &acro.cql; queries, it's necessary
+ to have the &yaz; generic front-end (which Zebra uses) translate them
+ into the &acro.z3950; Type-1 query format that is used internally. And
+ to do this, the generic front-end's own configuration file must be
+ used. See <xref linkend="gfs-config"/>;
+ the salient point for &acro.sru; support is that
+ <command>zebrasrv</command>
+ must be started with the
+ <literal>-f frontendConfigFile</literal>
+ option rather than the
+ <literal>-c zebraConfigFile</literal>
+ option,
+ and that the front-end configuration file must include both a
+ reference to the Zebra configuration file and the &acro.cql;-to-&acro.pqf;
+ translator configuration file.
</para>
- </note>
- <para>
- In order to turn on Zebra's support for &acro.cql; queries, it's necessary
- to have the &yaz; generic front-end (which Zebra uses) translate them
- into the &acro.z3950; Type-1 query format that is used internally. And
- to do this, the generic front-end's own configuration file must be
- used. See <xref linkend="gfs-config"/>;
- the salient point for &acro.sru; support is that
- <command>zebrasrv</command>
- must be started with the
- <literal>-f frontendConfigFile</literal>
- option rather than the
- <literal>-c zebraConfigFile</literal>
- option,
- and that the front-end configuration file must include both a
- reference to the Zebra configuration file and the &acro.cql;-to-&acro.pqf;
- translator configuration file.
- </para>
- <para>
- A minimal front-end configuration file that does this would read as
- follows:
- </para>
- <screen>
-<![CDATA[
- <yazgfs>
- <server>
- <config>zebra.cfg</config>
- <cql2rpn>../../tab/pqf.properties</cql2rpn>
- </server>
- </yazgfs>
-]]></screen>
- <para>
- The
- <literal><config></literal>
- element contains the name of the Zebra configuration file that was
- previously specified by the
- <literal>-c</literal>
- command-line argument, and the
- <literal><cql2rpn></literal>
- element contains the name of the &acro.cql; properties file specifying how
- various &acro.cql; indexes, relations, etc. are translated into Type-1
- queries.
- </para>
- <para>
- A zebra server running with such a configuration can then be
- queried using proper, conformant &acro.sru; URLs with &acro.cql; queries:
- </para>
- <screen>
- http://localhost:9999/Default?version=1.1
+ <para>
+ A minimal front-end configuration file that does this would read as
+ follows:
+ </para>
+ <screen>
+ <![CDATA[
+ <yazgfs>
+ <server>
+ <config>zebra.cfg</config>
+ <cql2rpn>../../tab/pqf.properties</cql2rpn>
+ </server>
+ </yazgfs>
+ ]]></screen>
+ <para>
+ The
+ <literal><config></literal>
+ element contains the name of the Zebra configuration file that was
+ previously specified by the
+ <literal>-c</literal>
+ command-line argument, and the
+ <literal><cql2rpn></literal>
+ element contains the name of the &acro.cql; properties file specifying how
+ various &acro.cql; indexes, relations, etc. are translated into Type-1
+ queries.
+ </para>
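The two elements can be checked mechanically; a sketch using Python's standard XML parser, assuming the minimal front-end configuration file shown above:

```python
import xml.etree.ElementTree as ET

# The minimal front-end configuration file from the example above.
conf = """\
<yazgfs>
  <server>
    <config>zebra.cfg</config>
    <cql2rpn>../../tab/pqf.properties</cql2rpn>
  </server>
</yazgfs>
"""

root = ET.fromstring(conf)
server = root.find("server")
# <config> names the Zebra configuration file (formerly given with -c);
# <cql2rpn> names the CQL-to-PQF properties file.
zebra_cfg = server.findtext("config")
cql2rpn = server.findtext("cql2rpn")
print(zebra_cfg, cql2rpn)
```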
+ <para>
+      A Zebra server running with such a configuration can then be
+ queried using proper, conformant &acro.sru; URLs with &acro.cql; queries:
+ </para>
+ <screen>
+ http://localhost:9999/Default?version=1.1
&operation=searchRetrieve
&query=title=utah and description=epicent*
&startRecord=1
&acro.cql; version 1.1. In particular, it provides support for the
following elements of the protocol.
</para>
-
+
<refsect2 id="zebrasrvr-search-and-retrieval">
<title>&acro.sru; Search and Retrieval</title>
<para>
- Zebra supports the
+ Zebra supports the
<ulink url="&url.sru.searchretrieve;">&acro.sru; searchRetrieve</ulink>
operation.
</para>
query language, &acro.cql;, and that all conforming implementations can
therefore be trusted to correctly interpret the same queries. It
is with some shame, then, that we admit that Zebra also supports
- an additional query language, our own Prefix Query Format
+ an additional query language, our own Prefix Query Format
(<ulink url="&url.yaz.pqf;">&acro.pqf;</ulink>).
A &acro.pqf; query is submitted by using the extension parameter
<literal>x-pquery</literal>,
<literal>query</literal> parameter.
</para>
</refsect2>
-
+
<refsect2 id="zebrasrv-sru-scan">
<title>&acro.sru; Scan</title>
<para>
     in the &yaz; Frontend Server configuration file that is described in
     <xref linkend="gfs-config"/>.
</para>
- <para>
- Unfortunately, the data found in the
+ <para>
+ Unfortunately, the data found in the
     &acro.cql;-to-&acro.pqf; text file must be added by hand into the explain
section of the &yaz; Frontend Server configuration file to be able
- to provide a suitable explain record.
+ to provide a suitable explain record.
     This is still early alpha-stage functionality, and a lot of
     work remains to be done.
</para>
Present operation which requests records from an established
result set. In &acro.sru;, this is achieved by sending a subsequent
<literal>searchRetrieve</literal> request with the query
- <literal>cql.resultSetId=</literal><emphasis>id</emphasis> where
+ <literal>cql.resultSetId=</literal><emphasis>id</emphasis> where
<emphasis>id</emphasis> is the identifier of the previously
generated result-set.
</para>
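Such a follow-up request can be sketched in the same style as the earlier URL examples. Here `mySet` and the record range are purely illustrative, and the server address is assumed to be the same `localhost:9999` used above:

```python
from urllib.parse import urlencode

# Follow-up searchRetrieve that pages through an existing result set.
# "mySet" is a hypothetical identifier returned by an earlier search;
# the startRecord/maximumRecords values are likewise illustrative.
params = {
    "version": "1.1",
    "operation": "searchRetrieve",
    "query": "cql.resultSetId=mySet",
    "startRecord": "3",
    "maximumRecords": "2",
}
url = "http://localhost:9999/Default?" + urlencode(params)
# urlencode percent-encodes the "=" inside the query value as %3D.
print(url)
```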
</para>
</refsect2>
</refsect1>
-
+
<refsect1 id="zebrasrv-sru-examples">
- <title>&acro.sru; Examples</title>
- <para>
- Surf into <literal>http://localhost:9999</literal>
- to get an explain response, or use
- <screen><![CDATA[
- http://localhost:9999/?version=1.1&operation=explain
- ]]></screen>
- </para>
- <para>
- See number of hits for a query
- <screen><![CDATA[
- http://localhost:9999/?version=1.1&operation=searchRetrieve
- &query=text=(plant%20and%20soil)
- ]]></screen>
- </para>
- <para>
- Fetch record 5-7 in Dublin Core format
- <screen><![CDATA[
- http://localhost:9999/?version=1.1&operation=searchRetrieve
- &query=text=(plant%20and%20soil)
- &startRecord=5&maximumRecords=2&recordSchema=dc
- ]]></screen>
- </para>
- <para>
- Even search using &acro.pqf; queries using the <emphasis>extended naughty
- parameter</emphasis> <literal>x-pquery</literal>
- <screen><![CDATA[
- http://localhost:9999/?version=1.1&operation=searchRetrieve
- &x-pquery=@attr%201=text%20@and%20plant%20soil
- ]]></screen>
- </para>
- <para>
- Or scan indexes using the <emphasis>extended extremely naughty
- parameter</emphasis> <literal>x-pScanClause</literal>
- <screen><![CDATA[
- http://localhost:9999/?version=1.1&operation=scan
- &x-pScanClause=@attr%201=text%20something
- ]]></screen>
- <emphasis>Don't do this in production code!</emphasis>
- But it's a great fast debugging aid.
- </para>
+ <title>&acro.sru; Examples</title>
+ <para>
+ Surf into <literal>http://localhost:9999</literal>
+ to get an explain response, or use
+ <screen><![CDATA[
+ http://localhost:9999/?version=1.1&operation=explain
+ ]]></screen>
+ </para>
+ <para>
+      See the number of hits for a query
+ <screen><![CDATA[
+ http://localhost:9999/?version=1.1&operation=searchRetrieve
+ &query=text=(plant%20and%20soil)
+ ]]></screen>
+ </para>
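As the examples show, spaces inside the CQL query must be percent-encoded as `%20`. A sketch of that encoding step, using Python's standard library (the `safe` argument here is an assumption chosen to keep `=`, `(` and `)` literal, matching the URLs shown):

```python
from urllib.parse import quote

# The CQL query from the example; spaces become %20 while the
# "safe" argument keeps '=', '(' and ')' unescaped.
cql = "text=(plant and soil)"
encoded = quote(cql, safe="=()")
print(encoded)
# text=(plant%20and%20soil)
```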
+ <para>
+      Fetch records 5 and 6 in Dublin Core format
+ <screen><![CDATA[
+ http://localhost:9999/?version=1.1&operation=searchRetrieve
+ &query=text=(plant%20and%20soil)
+ &startRecord=5&maximumRecords=2&recordSchema=dc
+ ]]></screen>
+ </para>
+ <para>
+      You can even search with &acro.pqf; queries using the <emphasis>extended naughty
+      parameter</emphasis> <literal>x-pquery</literal>
+ <screen><![CDATA[
+ http://localhost:9999/?version=1.1&operation=searchRetrieve
+ &x-pquery=@attr%201=text%20@and%20plant%20soil
+ ]]></screen>
+ </para>
+ <para>
+ Or scan indexes using the <emphasis>extended extremely naughty
+ parameter</emphasis> <literal>x-pScanClause</literal>
+ <screen><![CDATA[
+ http://localhost:9999/?version=1.1&operation=scan
+ &x-pScanClause=@attr%201=text%20something
+ ]]></screen>
+ <emphasis>Don't do this in production code!</emphasis>
+      But it's a great aid for quick debugging.
+ </para>
</refsect1>
<manvolnum>1</manvolnum>
</citerefentry>
</para>
-</refsect1>
+ </refsect1>
</refentry>
<!-- Keep this comment at the end of the file
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
- sgml-parent-document: "zebra.xml"
+ sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End: