X-Git-Url: http://jsfdemo.indexdata.com/?a=blobdiff_plain;ds=sidebyside;f=doc%2Frecordmodel-alvisxslt.xml;h=f9ad32025c8cbae48df8e8fc46fe22e4adf66eb6;hb=3f3bb5184cb06221655717e9cfeb646dc699ead5;hp=764190da266a33d214b29a7414a86ba896fd767a;hpb=14a2dbce03d7802ab5b1e57b09d915339bb5fc54;p=idzebra-moved-to-github.git
diff --git a/doc/recordmodel-alvisxslt.xml b/doc/recordmodel-alvisxslt.xml
index 764190d..f9ad320 100644
--- a/doc/recordmodel-alvisxslt.xml
+++ b/doc/recordmodel-alvisxslt.xml
@@ -1,5 +1,5 @@
-
+
ALVIS XML Record Model and Filter Module
@@ -12,44 +12,437 @@
releases of the Zebra Information Server.
-
-
+ This filter has been developed under the
+ ALVIS project funded by
+ the European Community under the "Information Society Technologies"
+ Program (2002-2006).
+
-
- ALLVIS Record Filter
+
+ ALVIS Record Filter
- The experimental, loadable Alvis XM/XSLT filter module
+ The experimental, loadable Alvis XML/XSLT filter module
mod-alvis.so is packaged in the GNU/Debian package
libidzebra1.4-mod-alvis.
+ It is invoked by the zebra.cfg configuration statement
+
+ recordtype.xml: alvis.db/filter_alvis_conf.xml
+
+ In this example the filter is applied to all data files with suffix
+ *.xml, and the
+ Alvis XSLT filter configuration file is found at the
+ path db/filter_alvis_conf.xml.
+
+ The Alvis XSLT filter configuration file must be
+ valid XML. It might look like this (this example is
+ used for indexing and display of OAI harvested records):
+
+ <?xml version="1.0" encoding="UTF-8"?>
+ <schemaInfo>
+ <schema name="identity" stylesheet="xsl/identity.xsl" />
+ <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
+ stylesheet="xsl/oai2index.xsl" />
+ <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
+ <!-- use split level 2 when indexing whole OAI Record lists -->
+ <split level="2"/>
+ </schemaInfo>
+
+
+
+ All named stylesheets defined inside
+ schema element tags
+ are for presentation after search, including
+ the indexing stylesheet (which is a great debugging help). The
+ names defined in the name attributes must be
+ unique; these are the literal schema or
+ element set names used in
+ SRW,
+ SRU and
+ Z39.50 protocol queries.
+ The paths in the stylesheet attributes
+ are relative to Zebra's working directory, or absolute from the file
+ system root.
+
+
+ The <split level="2"/> decides where the
+ XML Reader shall split the
+ collections of records into individual records, which then are
+ loaded into DOM, and have the indexing XSLT stylesheet applied.
+
+
+ There must be exactly one indexing XSLT stylesheet, which is
+ defined by the magic attribute
+ identifier="http://indexdata.dk/zebra/xslt/1".
-
- ALLVIS Internal Record Representation
- FIXME
-
-
-
- ALLVIS Canonical Format
- FIXME
-
-
-
-
-
-
-
- ALLVIS Record Model Configuration
- FIXME
-
-
-
-
+
+ ALVIS Internal Record Representation
+ When indexing, an XML Reader is invoked to split the input
+ files into suitable record XML pieces. Each record piece is then
+ transformed to an XML DOM structure, which is essentially the
+ record model. Only XSLT transformations can be applied during
+ indexing, search and retrieval. Consequently, output formats are
+ restricted to whatever XSLT can deliver from the record XML
+ structure, be it other XML formats, HTML, or plain text. In case
+ you have libxslt1 running with EXSLT support,
+ you can use this functionality inside the Alvis
+ filter configuration XSLT stylesheets.
+
+
+
+
+ ALVIS Canonical Indexing Format
+ The output of the indexing XSLT stylesheets must contain
+ certain elements in the magic
+ xmlns:z="http://indexdata.dk/zebra/xslt/1"
+ namespace. The output of the XSLT indexing transformation is then
+ parsed using DOM methods, and the contained instructions are
+ performed on the magic elements and their
+ subtrees.
+
+
+ For example, the output of the command
+
+ xsltproc xsl/oai2index.xsl one-record.xml
+
+ might look like this:
+
+ <?xml version="1.0" encoding="UTF-8"?>
+ <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
+ z:id="oai:JTRS:CP-3290---Volume-I"
+ z:rank="47896"
+ z:type="update">
+ <z:index name="oai_identifier" type="0">
+ oai:JTRS:CP-3290---Volume-I</z:index>
+ <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
+ <z:index name="oai_setspec" type="0">jtrs</z:index>
+ <z:index name="dc_all" type="w">
+ <z:index name="dc_title" type="w">Proceedings of the 4th
+ International Conference and Exhibition:
+ World Congress on Superconductivity - Volume I</z:index>
+ <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
+ Burnham, Editors</z:index>
+ </z:index>
+ </z:record>
+
+
+ This means the following: From the original XML file
+ one-record.xml (or from the XML record DOM of the
+ same form coming from a split input file), the indexing
+ stylesheet produces an indexing XML record, which is defined by
+ the record element in the magic namespace
+ xmlns:z="http://indexdata.dk/zebra/xslt/1".
+ Zebra uses the content of
+ z:id="oai:JTRS:CP-3290---Volume-I" as internal
+ record ID, and - in case static ranking is set - the content of
+ z:rank="47896" as static rank. Following the
+ discussion in
+ we see that this record is internally ordered
+ lexicographically according to the value of the string
+ oai:JTRS:CP-3290---Volume-I47896.
+ The type of action performed during indexing is defined by
+ z:type="update", with recognized values
+ insert, update, and
+ delete.
+
+ In this example, the following literal indexes are constructed:
+
+ oai_identifier
+ oai_datestamp
+ oai_setspec
+ dc_all
+ dc_title
+ dc_creator
+
+ where the indexing type is defined in the
+ type attribute
+ (any value from the standard configuration
+ file default.idx will do). Finally, any
+ text() node content recursively contained
+ inside the index will be filtered through the
+ appropriate charmap for character normalization, and will be
+ inserted in the index.
+
+
+ Specific to this example, we see that the single word
+ oai:JTRS:CP-3290---Volume-I will be inserted literally,
+ byte for byte and without any form of character normalization,
+ into the index named oai_identifier;
+ the text
+ Kumar Krishen and *Calvin Burnham, Editors
+ will be inserted using the w character
+ normalization defined in default.idx into
+ the index dc_creator (that is, after character
+ normalization the index will keep the individual words
+ kumar, krishen,
+ and, calvin,
+ burnham, and editors); and
+ finally both the texts
+ Proceedings of the 4th International Conference and Exhibition:
+ World Congress on Superconductivity - Volume I
+ and
+ Kumar Krishen and *Calvin Burnham, Editors
+ will be inserted into the index dc_all using
+ the same character normalization map w.
+
+
+ Finally, this example configuration can be queried using PQF
+ queries, either transported by Z39.50 (here using yaz-client)
+
+ Z> open localhost:9999
+ Z> elem dc
+ Z> form xml
+ Z>
+ Z> f @attr 1=dc_creator Kumar
+ Z> scan @attr 1=dc_creator adam
+ Z>
+ Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
+ Z> scan @attr 1=dc_title abc
+
+ or the proprietary
+ extensions x-pquery and
+ x-pScanClause to
+ SRU and SRW
+
+
+
+ See for more information on SRU/SRW
+ configuration, and the YAZ
+ CQL section
+ for the details of the YAZ frontend server.
+
+
+ Notice that there are no *.abs,
+ *.est, *.map, or other GRS-1
+ filter configuration files involved in this process, and that the
+ literal index names are used during search and retrieval.
+
+
+
+
+
+
+ ALVIS Record Model Configuration
+
+
+
+ ALVIS Indexing Configuration
+
+ As mentioned above, there can be only one indexing
+ stylesheet, and configuring the indexing process is synonymous
+ with writing an XSLT stylesheet which produces XML output containing the
+ magic elements discussed in
+ .
+ Obviously, there are millions of different ways to accomplish this
+ task, and some comments and code snippets are in order to lead
+ our padawans on the right track to the good side of the force.
+
+
+ Stylesheets can be written in the pull or
+ the push style: pull
+ means that the output XML structure is taken as the starting point of
+ the internal structure of the XSLT stylesheet, and portions of
+ the input XML are pulled out and inserted
+ into the right spots of the output XML structure. On the other
+ hand, push XSLT stylesheets recursively
+ call their template definitions, a process which is driven
+ by the input XML structure, and wake up to produce some output XML
+ whenever some special conditions in the input XML are
+ met. The pull type is well-suited for input
+ XML with strong and well-defined structure and semantics, like the
+ following OAI indexing example, whereas the
+ push type might be the only possible way to
+ sort out deeply recursive input XML formats.
+
+
+ A pull stylesheet example used to index
+ OAI harvested records could use some of the following template
+ definitions:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
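+ As a minimal sketch of the pull style (not the oai2index.xsl
+ shipped with Zebra), a stylesheet producing output of the shape shown
+ in the canonical indexing format example might look as follows. It
+ assumes OAI-PMH records carrying unqualified Dublin Core metadata; the
+ XPath expressions and the choice of indexed fields are illustrative
+ assumptions, and z:rank is left out.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai oai_dc dc">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- suppress the built-in rule that copies stray text nodes -->
  <xsl:template match="text()"/>

  <!-- pull style: the output skeleton is fixed here, and values are
       pulled out of the input record into their slots -->
  <xsl:template match="oai:record">
    <z:record z:id="{normalize-space(oai:header/oai:identifier)}"
              z:type="update">
      <z:index name="oai_identifier" type="0">
        <xsl:value-of select="normalize-space(oai:header/oai:identifier)"/>
      </z:index>
      <z:index name="oai_datestamp" type="0">
        <xsl:value-of select="normalize-space(oai:header/oai:datestamp)"/>
      </z:index>
      <!-- nested indexes: text below also ends up in dc_all -->
      <z:index name="dc_all" type="w">
        <xsl:for-each select="oai:metadata/oai_dc:dc/dc:title">
          <z:index name="dc_title" type="w">
            <xsl:value-of select="."/>
          </z:index>
        </xsl:for-each>
        <xsl:for-each select="oai:metadata/oai_dc:dc/dc:creator">
          <z:index name="dc_creator" type="w">
            <xsl:value-of select="."/>
          </z:index>
        </xsl:for-each>
      </z:index>
    </z:record>
  </xsl:template>

</xsl:stylesheet>
```

+ Running such a sketch through xsltproc against a single record is a
+ quick way to inspect the indexing XML before loading anything into Zebra.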
+
+
+ Notice also
+ that the names and types of the indexes can be defined in the
+ indexing XSLT stylesheet dynamically according to
+ content in the original XML records, which offers
+ opportunities for great power and wizardry as well as grand
+ disaster.
+
+
+ The following excerpt of a push stylesheet
+ might
+ be a good idea if you have strict control of the XML
+ input format (due to rigorous checking against well-defined and
+ tight RelaxNG or XML Schemas, for example):
+
+
+
+
+
+
+
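+ As a minimal sketch of the push style (an illustration, not a
+ stylesheet shipped with Zebra), a generic template can create one index per
+ element name. The id attribute on the root element used for
+ z:id is a hypothetical assumption about the input, and the index type
+ w is chosen only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- wrap the whole result in one indexing record; the @id attribute
       on the root element is a hypothetical source for the record ID -->
  <xsl:template match="/">
    <z:record z:id="{/*/@id}" z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- push rule: fires for every element the processor walks into and
       emits an index named after that element, holding the value 1 -->
  <xsl:template match="*">
    <z:index name="{local-name()}" type="w">
      <xsl:text>1</xsl:text>
    </z:index>
    <!-- recurse into child elements only, so input text is never indexed -->
    <xsl:apply-templates select="*"/>
  </xsl:template>

</xsl:stylesheet>
```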
+ This template creates indexes which have the name of the working
+ node of any input XML file, and assigns a '1' to the index.
+ The example query
+ find @attr 1=xyz 1
+ finds all files which contain at least one
+ xyz XML element. In case you cannot control
+ which element names the input files contain, you are asking for
+ disaster and bad karma by using this technique.
+
+
+ One variation on the theme of dynamically created
+ indexes will definitely be unwise:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Don't be tempted to cross
+ the line to the dark side of the force, padawan; this leads
+ to suffering and pain, and universal
+ disintegration of your project schedule.
+
+
+
+
ALVIS Exchange Formats
- FIXME
-
-
-
+
+ An exchange format can be anything which can be the outcome of an
+ XSLT transformation, as long as the stylesheet is registered in
+ the main Alvis XSLT filter configuration file, see
+ .
+ In principle anything that can be expressed in XML, HTML, or
+ plain text can be the output of a schema or
+ element set directive during search, as long as
+ the information comes from the
+ original input record XML DOM tree
+ (and not from the transformed and indexed XML).
+
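+ As an illustration, a retrieval stylesheet for the dc
+ schema registered in the filter configuration example could simply return
+ the Dublin Core part of the original record. This is a minimal sketch under
+ the assumption that the input records are OAI-PMH records with an
+ oai_dc:dc container; it is not the oai2dc.xsl file
+ shipped with Zebra.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- drop all text outside the part we want to return -->
  <xsl:template match="text()"/>

  <!-- return the Dublin Core container of the original record unchanged -->
  <xsl:template match="oai_dc:dc">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

+ Because the transformation runs against the original record DOM, elements
+ that were never indexed remain available for display.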
+
+ In addition, internal administrative information from the Zebra
+ indexer can be accessed during record retrieval. The following
+ example is a summary of the possibilities:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ALVIS Filter OAI Indexing Example
+
+ The source code tarball contains a working Alvis filter example in
+ the directory examples/alvis-oai/, which
+ should get you started.
+
+
+ More example data can be harvested from any OAI compliant server,
+ see details at the OAI
+
+ http://www.openarchives.org/ web site, and the community
+ links at
+
+ http://www.openarchives.org/community/index.html.
+ There is a tutorial
+ found at
+
+ http://www.oaforum.org/tutorial/.
+
+
+
+
@@ -72,7 +465,7 @@ c) Main "alvis" XSLT filter config file:
- the pathes are relative to the directory where zebra.init is placed
+ the paths are relative to the directory where zebra.init is placed
and is started up.
The split level decides where the SAX parser shall split the
@@ -94,7 +487,7 @@ c) Main "alvis" XSLT filter config file:
and so on.
- in db/ a cql2pqf.txt yaz-client config file
- which is also used in the yaz-server CQL-to-PQF process
+ which is also used in the yaz-server CQL-to-PQF process
see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map