1 <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook V4.1//EN"
2 "http://www.oasis-open.org/docbook/xml/4.1/docbookx.dtd"
4 <!ENTITY % local SYSTEM "local.ent">
6 <!ENTITY % entities SYSTEM "entities.ent">
8 <!ENTITY % common SYSTEM "common/common.ent">
11 <!-- $Id: pazpar2_conf.xml,v 1.24 2007-05-25 12:30:27 marc Exp $ -->
12 <refentry id="pazpar2_conf">
14 <productname>Pazpar2</productname>
15 <productnumber>&version;</productnumber>
18 <refentrytitle>Pazpar2 conf</refentrytitle>
19 <manvolnum>5</manvolnum>
23 <refname>pazpar2_conf</refname>
24 <refpurpose>Pazpar2 Configuration</refpurpose>
29 <command>pazpar2.conf</command>
33 <refsect1><title>DESCRIPTION</title>
The pazpar2 configuration file, together with any referenced XSLT files,
governs pazpar2's behavior as a client and controls the normalization and
extraction of data elements from incoming result records for the
purposes of merging, sorting, facet analysis, and display.
42 The file is specified using the option -f on the pazpar2 command line.
43 There is not presently a way to reload the configuration file without
44 restarting pazpar2, although this will most likely be added some time
49 <refsect1><title>FORMAT</title>
51 The configuration file is XML-structured. It must be valid XML. All
52 elements specific to pazpar2 should belong to the namespace
53 "http://www.indexdata.com/pazpar2/1.0" (this is assumed in the
54 following examples). The root element is named 'pazpar2'. Under the
55 root element are a number of elements which group categories of
56 information. The categories are described below.
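<para>
As a rough orientation, the overall shape described above might look like
the following skeleton. This is only a sketch: the comment marks where the
elements described in the sections below would be placed, and nothing here
is a complete working configuration.
</para>
<screen><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<pazpar2 xmlns="http://www.indexdata.com/pazpar2/1.0">
  <server>
    <!-- listen, proxy, zproxy, icu_chain, and service elements go here -->
  </server>
</pazpar2>
]]></screen>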
59 <refsect2 id="config-server"><title>server</title>
61 This section governs overall behavior of the client. The data
62 elements are described below.
64 <variablelist> <!-- level 1 -->
69 Configures the webservice -- this controls how you can connect
70 to pazpar2 from your browser or server-side code. The
71 attributes 'host' and 'port' control the binding of the
72 server. The 'host' attribute can be used to bind the server to
73 a secondary IP address of your system, enabling you to run
74 pazpar2 on port 80 alongside a conventional web server. You
can override this setting on the command line using the option -h.
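<para>
For instance, binding the webservice to a secondary address might be
sketched as follows. The address 192.0.2.10 is a placeholder from the
reserved documentation range, not a value taken from any real installation.
</para>
<screen><![CDATA[
<!-- bind the webservice to one specific (placeholder) address, on port 80 -->
<listen host="192.0.2.10" port="80"/>
]]></screen>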
84 If this item is given, pazpar2 will forward all incoming HTTP
85 requests that do not contain the filename 'search.pz2' to the
86 host and port specified using the 'host' and 'port'
attributes. The 'myurl' attribute is required, and should provide
the base URL of the server; generally, this is the HTTP URL for the host
specified in the 'listen' parameter. This functionality is
crucial if you wish to use
pazpar2 in conjunction with browser-based code (JS, Flash,
applets, etc.) which operates in a security sandbox. Such code
can only connect to the same server from which the enclosing
HTML page originated. Pazpar2's proxy functionality enables you
95 to host all of the main pages (plus images, CSS, etc) of your
96 application on a conventional webserver, while efficiently
97 processing webservice requests for metasearch status, results,
107 If this item is given, pazpar2 will send all Z39.50
108 packages through this Z39.50 proxy server.
At least one of the 'host' and 'port' attributes is required.
The 'host' attribute may contain both host name and port
number, separated by a colon ':', or only the host name.
112 An empty 'host' attribute sets the Z39.50 host address
119 <term>icu_chain</term>
Definition of the ICU tokenization and normalization rules that
are used if ICU support is compiled in. The 'id'
attribute is currently not used, and the 'locale'
125 attribute must be set to one of the locale strings
126 defined in ICU. The child elements listed below can be
127 in any order, except the 'index' element which logically
128 belongs to the end of the list. The stated tokenization,
129 normalization and charmapping instructions are performed
130 in order from top to bottom.
132 <variablelist> <!-- Level 2 -->
133 <varlistentry><term>casemap</term>
The attribute 'rule' defines the direction of the
per-character casemapping. Allowed values are "l"
(lower), "u" (upper), and "t" (title).
142 <varlistentry><term>normalize</term>
145 Normalization and transformation of tokens follows
146 the rules defined in the 'rule' attribute. For
147 possible values we refer to the extensive ICU
148 documentation found at the
149 <ulink url="&url.icu.transform;">ICU
150 transformation</ulink> home page. Set filtering
151 principles are explained at the
152 <ulink url="&url.icu.unicode.set;">ICU set and
153 filtering</ulink> page.
157 <varlistentry><term>tokenize</term>
160 Tokenization is the only rule in the ICU chain
161 which splits one token into multiple tokens. The
162 'rule' attribute may have the following values:
"s" (sentence), "l" (line-break), "w" (word), and
"c" (character), the latter probably not being
very useful in a running pazpar2 installation.
169 <varlistentry><term>index</term>
172 Finally the 'index' element instruction - without
173 any 'rule' attribute - is used to store the tokens
174 after chain processing in the relevance ranking
175 unit of Pazpar2. It will always be the last
176 instruction in the chain.
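<para>
Putting the four instructions together, a chain might be sketched as
below. The locale "en" and the 'id' value are assumptions chosen for
illustration, and the normalize rules are the ones used in the EXAMPLE
section of this page. The order shown is one plausible arrangement, not a
recommendation.
</para>
<screen><![CDATA[
<icu_chain id="en:word" locale="en">
  <!-- remove control characters before tokenizing -->
  <normalize rule="[:Control:] Any-Remove"/>
  <!-- split the input into tokens at line-break boundaries -->
  <tokenize rule="l"/>
  <!-- strip whitespace and punctuation from each token -->
  <normalize rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
  <!-- lowercase each token -->
  <casemap rule="l"/>
  <!-- store the resulting tokens; always the last instruction -->
  <index/>
</icu_chain>
]]></screen>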
188 This nested element controls the behavior of pazpar2 with
189 respect to your data model. In pazpar2, incoming records are
190 normalized, using XSLT, into an internal representation.
191 The 'service' section controls the further processing and
192 extraction of data from the internal representation, primarily
through the 'metadata' sub-element.
196 <variablelist> <!-- Level 2 -->
197 <varlistentry><term>metadata</term>
200 One of these elements is required for every data element in
201 the internal representation of the record (see
<xref linkend="data_model"/>). It governs
203 subsequent processing as pertains to sorting, relevance
204 ranking, merging, and display of data elements. It supports
205 the following attributes:
208 <variablelist> <!-- level 3 -->
209 <varlistentry><term>name</term>
212 This is the name of the data element. It is matched
213 against the 'type' attribute of the
215 in the normalized record. A warning is produced if
metadata elements with an unknown name are
218 normalized record. This name is also used to
220 data elements in the records returned by the
221 webservice API, and to name sort lists and browse
227 <varlistentry><term>type</term>
230 The type of data element. This value governs any
231 normalization or special processing that might take
232 place on an element. Possible values are 'generic'
233 (basic string), 'year' (a range is computed if
234 multiple years are found in the record). Note: This
235 list is likely to increase in the future.
240 <varlistentry><term>brief</term>
243 If this is set to 'yes', then the data element is
included in brief records in the webservice API. Note
245 that this only makes sense for metadata elements that
246 are merged (see below). The default value is 'no'.
251 <varlistentry><term>sortkey</term>
254 Specifies that this data element is to be used for
255 sorting. The possible values are 'numeric' (numeric
256 value), 'skiparticle' (string; skip common, leading
257 articles), and 'no' (no sorting). The default value is
263 <varlistentry><term>rank</term>
266 Specifies that this element is to be used to
268 records against the user's query (when ranking is
269 requested). The value is an integer, used as a
270 multiplier against the basic TF*IDF score. A value of
271 1 is the base, higher values give additional
273 elements of this type. The default is '0', which
274 excludes this element from the rank calculation.
279 <varlistentry><term>termlist</term>
282 Specifies that this element is to be used as a
283 termlist, or browse facet. Values are tabulated from
284 incoming records, and a highscore of values (with
285 their associated frequency) is made available to the
286 client through the webservice API.
288 are 'yes' and 'no' (default).
293 <varlistentry><term>merge</term>
This governs whether, and how, elements are extracted
from individual records and merged into cluster
records. The possible values are: 'unique' (include
all unique elements), 'longest' (include only the
longest element (strlen)), 'range' (calculate a range
of values across all matching records), 'all' (include
302 all elements), or 'no' (don't merge; this is the
307 </variablelist> <!-- attributes to metadata -->
311 </variablelist> <!-- Data elements in service directive -->
314 </variablelist> <!-- Data elements in server directive -->
319 <refsect1><title>EXAMPLE</title>
320 <para>Below is a working example configuration:
<?xml version="1.0" encoding="UTF-8"?>
<pazpar2 xmlns="http://www.indexdata.com/pazpar2/1.0">
  <server>
    <listen port="9004"/>
    <proxy host="us1.indexdata.com" myurl="us1.indexdata.com"/>

    <!-- <zproxy host="localhost" port="9000"/> -->
    <!-- <zproxy host="localhost:9000"/> -->
    <!-- <zproxy port="9000"/> -->

    <!-- optional ICU ranking configuration example -->
    <icu_chain id="el:word" locale="el">
      <normalize rule="[:Control:] Any-Remove"/>
      <normalize rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
    </icu_chain>

    <service>
      <metadata name="title" brief="yes" sortkey="skiparticle" merge="longest" rank="6"/>
      <metadata name="isbn" merge="unique"/>
      <metadata name="date" brief="yes" sortkey="numeric" type="year" merge="range"/>
      <metadata name="author" brief="yes" termlist="yes" merge="longest" rank="2"/>
      <metadata name="subject" merge="unique" termlist="yes" rank="3"/>
      <metadata name="url" merge="unique"/>
    </service>
  </server>
</pazpar2>
361 <refsect1 id="target_settings"><title>TARGET SETTINGS</title>
Pazpar2 features a cunning scheme by which you can associate various
kinds of attributes, or settings, with search targets. This is done
through XML files which are read at startup; each file can associate
one or more settings with one or more targets. The file format is generic
in nature, designed to support a wide range of application requirements. A
setting can be purely technical, such as how to perform a title
search against a given target, or it can associate arbitrary name=value
pairs with groups of targets -- for instance, if you would like to
371 place all commercial full-text bases in one group for selection
372 purposes, or you would like to control what targets are accessible
377 During startup, pazpar2 will recursively read a specified directory
378 (can be identified in the pazpar2.cfg file or on the command line), and
379 process any settings files found therein.
383 Clients of the pazpar2 webservice interface can selectively override
384 settings for individual targets within the scope of one session. This
385 can be used in conjunction with an external authentication system to
386 determine which resources are to be accessible to which users. Pazpar2
387 itself has no notion of end-users, and so can be used in conjunction
with any type of authentication system. The authentication
tokens submitted to access-controlled search targets can similarly be
overridden, to allow use of pazpar2 in a consortial or multi-library
391 environment, where different end-users may need to be represented to
392 some search targets in different ways. This, again, can be managed
393 using an external database or other lookup mechanism. Setting overrides
394 can be performed either using the 'init' or the 'settings' webservice
395 command (see XXX ref to pazpar2 protocol).
399 In fact, every setting that applies to a database (except pz:id, which
can only be used for filtering targets to use for a search) can be overridden
401 on a per-session basis. This allows the client to override specific CCL fields
402 for searching, etc., to meet the needs of a session or user.
406 Finally, as an extreme case of this, the webservice client can
407 introduce entirely new targets, on the fly, as part of the init or
408 settings command. This is useful if you desire to manage information
409 about your search targets in a separate application such as a database.
410 You do not need any static settings file whatsoever to run pazpar2 -- as
411 long as the webservice client is prepared to supply the necessary
412 information at the beginning of every session.
NOTE: The following discussion of practical issues related to session and settings
management is cast in terms of a user interface based on Ajax/Javascript
418 technology. It would apply equally well to many other kinds of browser-based logic.
Typically, a Javascript client is not allowed to directly alter the parameters
of a session. There are two reasons for this. One has to do with access
to information; typically, information about a user will be stored in a
system on the server side, or it will be accessible in some way from the server.
The other has to do with trust; since the Javascript client cannot be entirely
trusted (some hostile agent might in fact 'pretend' to be a regular ws client),
it is more robust to control session settings from scripting that you run as part of your
429 webserver. Typically, this can be handled during the session initialization,
434 Step 1: The Javascript client loads, and asks the webserver for a new pazpar2
435 session ID. This can be done using a Javascript call, for instance. Note that
436 it is possible to submit Ajax HTTPXmlRequest calls either to pazpar2 or to the
437 webserver that pazpar2 is proxying for. See (XXX Insert link to pazpar2 protocol).
Step 2: Code on the webserver authenticates the user, by database lookup,
LDAP access, NCIP, etc. It determines which resources the user has access to,
and any user-specific parameters that are to be applied during this session.
Step 3: The webserver initializes a new pazpar2 session, and sets user-specific
448 parameters as necessary, using the init webservice command. A new session ID is
453 Step 4: The webserver returns this session ID to the Javascript client, which then
454 uses the session ID to submit searches, show results, etc.
458 Step 5: When the Javascript client ceases to use the session, pazpar2 destroys
459 any session-specific information.
462 <refsect2><title>SETTINGS FILE FORMAT</title>
Each file contains a root element named 'settings'. It may
contain one or more 'set' elements. The settings and set
elements may contain the following attributes. Attributes in the set node
override those in the settings root element. Each set node must
specify (directly, or inherited from the parent node) at least a
target, name, and value.
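<para>
The inheritance rule can be illustrated with a small sketch. The target
db.example.org:210/books is a hypothetical placeholder. The first set
node inherits both target and name from the root and supplies only the
value; the second overrides the inherited target.
</para>
<screen><![CDATA[
<settings target="*" name="pz:requestsyntax">
  <!-- inherits target and name from the root; supplies only the value -->
  <set value="marc21"/>
  <!-- overrides the inherited target for one (hypothetical) database -->
  <set target="db.example.org:210/books" value="xml"/>
</settings>
]]></screen>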
477 This specifies the search target to which this setting should be
478 applied. Targets are identified by their Z39.50 URL, generally
479 including the host, port, and database name, (e.g.
480 bagel.indexdata.com:210/marc). Two wildcard forms are accepted:
481 * (asterisk) matches all known targets;
482 bagel.indexdata.com:210/* matches all known databases on the given
486 A precedence system determines what happens if there are
487 overlapping values for the same setting name for the same
488 target. A setting for a specific target name overrides a
setting which specifies the target using a wildcard. This makes it
490 easy to set defaults for all targets, and then override them
491 for specific targets or hosts. If there are
492 multiple overlapping settings with the same name and target
493 value, the 'precedence' attribute determines what happens.
501 The name of the setting. This can be anything you like.
502 However, pazpar2 reserves a number of setting names for
503 specific purposes, all starting with 'pz:', and it is a good
504 idea to avoid that prefix if you make up your own setting
505 names. See below for a list of reserved variables.
513 The value of the setting. Generally, this can be anything you
514 want -- however, some of the reserved settings may expect
515 specific kinds of values.
520 <term>precedence</term>
523 This should be an integer. If not provided, the default value
524 is 0. If two (or more) settings have the same content for
525 target and name, the precedence value determines the outcome.
526 If both settings have the same precedence value, they are both
527 applied to the target(s). If one has a higher value, then the
528 value of that setting is applied, and the other one is ignored.
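<para>
As a sketch of the rule above, the two set nodes below share the same
target and name, so the one with the higher precedence should win. The
target is a hypothetical placeholder; the first node relies on the default
precedence of 0.
</para>
<screen><![CDATA[
<settings target="db.example.org:210/books" name="pz:piggyback">
  <!-- default precedence (0): ignored, because a higher value exists -->
  <set value="1"/>
  <!-- same target and name, higher precedence: this value is applied -->
  <set value="0" precedence="1"/>
</settings>
]]></screen>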
535 By setting defaults for target, name, or value in the root
536 settings node, you can use the settings files in many different
537 ways. For instance, you can use a single file to set defaults for
538 many different settings, like search fields, retrieval syntaxes,
539 etc. You can have one file per server, which groups settings for
540 that server or target. You could also have one file which associates
541 a number of targets with a given setting, for instance, to associate
542 many databases with a given category or class that makes sense
543 within your application.
547 The following examples illustrate uses of the settings system to
548 associate settings with targets to meet different requirements.
552 The example below associates a set of default values that can be
553 used across many targets. Note the wildcard for targets.
554 This associates the given settings with all targets for which no
555 other information is provided.
<settings target="*">

  <!-- This file introduces default settings for pazpar2 -->
  <!-- $Id: pazpar2_conf.xml,v 1.24 2007-05-25 12:30:27 marc Exp $ -->

  <!-- mapping for unqualified search -->
  <set name="pz:cclmap:term" value="u=1016 t=l,r s=al"/>

  <!-- field-specific mappings -->
  <set name="pz:cclmap:ti" value="u=4 s=al"/>
  <set name="pz:cclmap:su" value="u=21 s=al"/>
  <set name="pz:cclmap:isbn" value="u=7"/>
  <set name="pz:cclmap:issn" value="u=8"/>
  <set name="pz:cclmap:date" value="u=30 r=r"/>

  <!-- Retrieval settings -->

  <set name="pz:requestsyntax" value="marc21"/>
  <!-- <set name="pz:elements" value="F"/> NOT YET IMPLEMENTED -->

  <!-- Result normalization settings -->

  <set name="pz:nativesyntax" value="iso2709"/>
  <set name="pz:xslt" value="../etc/marc21.xsl"/>

</settings>
The next example shows certain settings overridden for one target,
589 one which returns XML records containing DublinCore elements, and
590 which furthermore requires a username/password.
<settings target="funkytarget.com:210/db1">
  <set name="pz:requestsyntax" value="xml"/>
  <set name="pz:nativesyntax" value="xml"/>
  <set name="pz:xslt" value="../etc/dublincore.xsl"/>

  <set name="pz:authentication" value="myuser/password"/>
</settings>
603 The following example associates a specific name/value combination
604 with a number of targets. The targets below are access-restricted,
605 and can only be used by users with special credentials.
<settings name="pz:allow" value="0">
  <set target="funkytarget.com:210/*"/>
  <set target="commercial.com:2100/expensiveDb"/>
</settings>
616 <refsect2><title>RESERVED SETTING NAMES</title>
618 The following setting names are reserved by pazpar2 to control the
619 behavior of the client function.
624 <term>pz:cclmap:xxx</term>
627 This establishes a CCL field definition or other setting, for
628 the purpose of mapping end-user queries. XXX is the field or
629 setting name, and the value of the setting provides parameters
630 (e.g. parameters to send to the server, etc.). Please consult
631 the YAZ manual for a full overview of the many capabilities of
632 the powerful and flexible CCL parser.
Note that it is easy to establish a set of default parameters,
636 and then override them individually for a given target.
641 <term>pz:requestsyntax</term>
644 This specifies the record syntax to use when requesting
645 records from a given server. The value can be a symbolic name like
646 marc21 or xml, or it can be a Z39.50-style dot-separated OID.
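<para>
The two forms might be written as follows. The target is a hypothetical
placeholder, and the OID shown is the registered Z39.50 record syntax
identifier for MARC21 (USMARC), so both lines are intended to request the
same syntax.
</para>
<screen><![CDATA[
<settings target="db.example.org:210/books">
  <!-- symbolic name -->
  <set name="pz:requestsyntax" value="marc21"/>
</settings>

<settings target="db.example.org:210/journals">
  <!-- the same syntax, given as a dot-separated OID -->
  <set name="pz:requestsyntax" value="1.2.840.10003.5.10"/>
</settings>
]]></screen>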
651 <term>pz:elements</term>
654 The element set name to be used when retrieving records from a
655 server (not yet implemented).
660 <term>pz:piggyback</term>
Piggybacking enables the client to retrieve records from the
664 server as part of the search response in Z39.50. Almost all
665 servers support this (or fail it gracefully), but a few
666 servers will produce undesirable results.
667 Set to '1' to enable piggybacking, '0' to disable it. Default
668 is 1 (piggybacking enabled).
673 <term>pz:nativesyntax</term>
676 The representation (syntax) of the retrieval records. Currently
677 recognized values are iso2709 and xml.
For iso2709, you can also specify a native character set, e.g. "iso2709;latin-1".
681 If no character set is provided, MARC-8 is assumed.
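<para>
A sketch of the character set form, for a hypothetical target that
returns ISO 2709 records encoded in Latin-1 rather than MARC-8:
</para>
<screen><![CDATA[
<settings target="db.example.org:210/books">
  <!-- records arrive as ISO 2709, encoded in Latin-1 instead of MARC-8 -->
  <set name="pz:nativesyntax" value="iso2709;latin-1"/>
</settings>
]]></screen>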
689 Provides the path of an XSLT stylesheet which will be used to
690 map incoming records to the internal representation.
695 <term>pz:authentication</term>
698 Sets an authentication string for a given server. See the section on
699 authorization and authentication for discussion.
704 <term>pz:allow</term>
707 Allows or denies access to the resources it is applied to. Possible
708 values are '0' and '1'. The default is '1' (allow access to this resource).
709 See the manual section on authorization and authentication for discussion
710 about how to use this setting.
715 <term>pz:maxrecs</term>
718 Controls the maximum number of records to be retrieved from a
719 server. The default is 100 (not yet implemented).
727 This setting can't be 'set' -- it contains the ID (normally
728 ZURL) for a given target, and is useful for filtering --
729 specifically when you want to select one or more specific
730 targets in the search command.
739 <!-- Keep this comment at the end of the file
744 sgml-minimize-attributes:nil
745 sgml-always-quote-attributes:t
748 sgml-parent-document:nil
749 sgml-local-catalogs: nil
750 sgml-namecase-general:t