From 23ebb5f3708511db457cebc72bf11e6d9db23b9c Mon Sep 17 00:00:00 2001
From: Marc Cromme
Date: Fri, 2 Feb 2007 14:34:20 +0000
Subject: [PATCH] more feature info. tables still look like a grande disaster, but the content is there - more or less. needs pretty formating and tweaking

---
 doc/introduction.xml | 487 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 398 insertions(+), 89 deletions(-)

diff --git a/doc/introduction.xml b/doc/introduction.xml
index 47096a3..7a95367 100644
--- a/doc/introduction.xml
+++ b/doc/introduction.xml
@@ -1,5 +1,5 @@
-
+
 Introduction
@@ -73,8 +73,30 @@
 &zebra; Features Overview
-
- &zebra; Features Overview
+
+
+
+
+ &zebra; networked protocols
@@ -86,7 +108,57 @@
- Boolean query language
+ Operation types
+ &z3950;/&sru; explain, search, and scan
+
+
+
+
+ Remote update
+ &z3950; extended services
+
+
+
+
+ &z3950;
+ &z3950; protocol support
+ Protocol facilities: Init, Search, Present (retrieval),
+ Segmentation (support for very large records), Delete, Scan
+ (index browsing), Sort, Close and support for the ``update''
+ Extended Service to add or replace an existing &xml;
+ record. Piggy-backed presents are honored in the search
+ request. Named result sets are supported.
+
+
+
+ Web Service support
+ &sru_gps;
+ The protocol operations explain,
+ searchRetrieve and scan
+ are supported. &cql; to internal
+ query model &rpn; conversion is supported. Extended RPN queries
+ for search/retrieve and scan are supported.
+
+
+
+
+
+
+
+
+ &zebra; search functionality
+
+
+
+ Feature
+ Availability
+ Notes
+ Reference
+
+
+
+
+ Query languages
 &cql; and &rpn;/&pqf;
 The type-1 Reverse Polish Notation (&rpn;) and it's textual representation Prefix Query Format (&pqf;) are
@@ -96,28 +168,221 @@
- Operation types
- &z3950;/&sru; explain, search, and scan
-
-
-
-
- Recursive boolean query tree
+ Complex boolean query tree
 &cql; and &rpn;/&pqf;
 Both &cql; and &rpn;/&pqf; allow atomic query parts (&apt;) to be combined into complex boolean query trees
- Large databases
- 64 file pointers assure that register files can extend
- the 2 GB limit. Logical files can be
- automatically partitioned over multiple disks, thus allowing for
- large databases.
+ Field search
+ user defined
+ Atomic query parts (&apt;) are either general, or
+ directed at user-specified document fields
+
+
+
+
+ Data normalization
+
+ Data normalization, text tokenization and character mappings can be
+ applied during indexing and searching
+
+
+
+ Predefined field types
+
+ Data fields can be indexed as phrases, as word-tokenized text,
+ as numeric values, URLs, dates, and raw binary data.
+
+
+
+ Regular expression matching
+ Regexp
+ Full regular expression matching and "approximate
+ matching" (e.g. spelling mistake corrections) are handled.
+
+
+
+ Search truncation
+
+ Fuzzy searches
+
+ In addition, fuzzy searches are implemented, where one
+ spelling mistake in search terms is matched
+
+
+
+
+
+
+
+
+ &zebra; index scanning
+
+
+
+ Feature
+ Availability
+ Notes
+ Reference
+
+
+
+
+ Scan
+ yes
+ Scan on a given named index returns all the
+ indexed terms in lexicographical order near the given start term.
+
+
+
+ Faceted browsing
+ partial
+ &zebra; supports scan inside a hit
+ set from a previous search, thus reducing the listed
+ terms to the
+ subset of terms found in the documents/records of the hit set.
+
+
+
+ Drill-down or refine-search
+ partial
+ Scanning in result sets can be used to implement
+ drill-down in search clients
+
+
+
+
+
+
+
+
+ &zebra; document presentation
+
+
+
+ Feature
+ Availability
+ Notes
+ Reference
+
+
+
+
+ Hit count
+ yes
+ Search results always include the total hit count of a given
+ query, either computed exactly or approximated when the
+ hit count exceeds a predefined hit set truncation
+ level.
+
+
+
+
+ Paged result sets
+ yes
+ Search and present/display requests can return any
+ number of consecutive records from any start position in the hit set,
+ so it is straightforward to provide search results in successive pages of
+ any size.
+
+
+
+ &xml; document transformations
+ &xslt; based
+ Record presentation can be performed in many predefined &xml; data
+ formats, where the original &xml; records are transformed on the fly
+ through any preconfigured &xslt; transformation. It is therefore
+ straightforward to present records in short/full &xml; views, to transform them to
+ RSS, Dublin Core, or other &xml; based data formats, or to transform
+ records to XHTML snippets ready for insertion in XHTML pages.
+
+
+
+ Binary record transformations
+ &marc;, &usmarc;, &marc21; and &marcxml;
+
+
+
+
+ Record Syntaxes
+
+ Multiple record syntaxes
+ for data retrieval: &grs1;, &sutrs;,
+ &xml;, ISO2709 (&marc;), etc. Records can be mapped between record syntaxes
+ and schemas on the fly.
+
+
+
+
+
+
+
+
+ &zebra; sorting and ranking
+
+
+
+ Feature
+ Availability
+ Notes
+ Reference
+
+
+
+
+ Sort
+ numeric, lexicographic
+ Sorting on the basis of alphanumeric and numeric data
+ is supported. Alphanumeric sorts can be configured for different data encodings
+ and locales for European languages.
+
+
+
+ Combined sorting
+ yes
+ Sorting on the basis of combined sorts, e.g. combinations of
+ ascending/descending sorts of lexicographical/numeric/date field data,
+ is supported
+
+
+
+ Relevance ranking
+ TF-IDF like
+ Relevance-ranking of free-text queries is supported
+ using a TF-IDF like algorithm.
+
+
+
+ Relevance ranking
+ TF-IDF like
+
+
+
+
+
+
+
+
+
+
+ &zebra; document model
+
+
+
+ Feature
+ Availability
+ Notes
+ Reference
+
+
+
+
 Complex semi-structured Documents
 &xml; and &grs1; Documents
 Both &xml; and &grs1; documents exhibit a &dom; like internal
@@ -125,18 +390,6 @@
- Database updates
- live, incremental updates
- Robust updating - records can be added and deleted ``on the fly''
- without rebuilding the index from scratch.
- Records can be safely updated even while users are accessing
- the server.
- The update procedure is tolerant to crashes or hard interrupts
- during database updating - data can be reconstructed following
- a crash.
-
-
-
 Input document formats
 &xml;, &sgml;, Text, ISO2709 (&marc;)
@@ -148,13 +401,6 @@
- Relevance ranking
- TF-IDF like
- Relevance-ranking of free-text queries is supported
- using a TF-IDF like algorithm.
-
-
-
 Document storage
 Index-only, Key storage, Document storage
 Data can be, and usually is, imported
@@ -163,116 +409,179 @@
 collections.
+
+
+
+
+
+
+
+
+ &zebra; data size and scalability
+
+
- Regular expression matching
- Regexp
- Full regular expression matching and "approximate
- matching" (eg. spelling mistake corrections) are handled.
-
+ Feature
+ Availability
+ Notes
+ Reference
+
+
- Search truncation
+ Number of records
+ 40-60 million
+
+
+
+ Data size
+ 100 GB of record data
- Remote update
- &z3950; extended services
+ File pointers
+ 64 bit
- Supported Platforms
- UNIX, Linux, Windows (NT/2000/2003/XP)
- &zebra; is written in portable C, so it runs on most
- Unix-like systems as well as Windows (NT/2000/2003/XP). Binary
- distributions are
- available for GNU/Debian Linux and Windows
+ Scale out
+ multiple disks
+
- &z3950;
- &z3950; protocol support
- Protocol facilities: Init, Search, Present (retrieval),
- Segmentation (support for very large records), Delete, Scan
- (index browsing), Sort, Close and support for the ``update''
- Extended Service to add or replace an existing &xml;
- record. Piggy-backed presents are honored in the search
- request. Named result sets are supported.
+ Performance
+ O(n * log N)
+ &zebra; query time scales roughly with
+ O(log N),
+ where N is the total database size, and with
+ O(n), where n is the
+ specific query hit set size.
- Record Syntaxes
+ Average search times
- Multiple record syntaxes
- for data retrieval: &grs1;, &sutrs;,
- &xml;, ISO2709 (&marc;), etc. Records can be mapped between record syntaxes
- and schemas on the fly.
+ Even on very large databases, rates of 20 queries per
+ second with an average query answering time of 1 second are possible,
+ provided that the boolean queries are
+ precise enough to result in hit sets on the order of 1,000 to 5,000
+ documents.
- Web Service support
- &sru_gps;
- The protocol operations explain,
- searchRetrieve and scan
- are supported. &cql; to internal
- query model &rpn; conversion is supported. Extended RPN queries
- for search/retrieve and scan are supported.
+ Large databases
+ 64 bit file pointers ensure that register files can exceed
+ the 2 GB limit. Logical files can be
+ automatically partitioned over multiple disks, thus allowing for
+ large databases.
+
+
+
+
+
+
+
+ &zebra; live updates
+
+
+ Feature
+ Availability
+ Notes
+ Reference
+
+
+
+
+ Batch updates
-
-
+ It is possible to schedule record inserts/updates/deletes in any
+ quantity, from individually handled single records to batch updates
+ of any size, as well as total re-indexing of all records
+ from the file system.
-
+ Incremental updates
-
-
+ Remote updates
+ &z3950; extended services
+ Live updates
-
-
+ Data updates are transaction based and can be performed on running
+ &zebra; systems. Full searchability is preserved during live data updates due to the use
+ of shadow disk areas for update operations. Concurrent update transactions are queued and
+ performed one after another. Data integrity is preserved.
-
-
-
+ Database updates
+ live, incremental updates
+ Robust updating - records can be added and deleted ``on the fly''
+ without rebuilding the index from scratch.
+ Records can be safely updated even while users are accessing
+ the server.
+ The update procedure is tolerant to crashes or hard interrupts
+ during database updating - data can be reconstructed following
+ a crash.
+
+
+
+
+
+ &zebra; supported platforms
+
+
+ Feature
+ Availability
+ Notes
+ Reference
+
+
+
+
+ Linux
-
-
+ GNU/Linux (32 and 64 bit), with a journaling Reiser or (preferably) JFS filesystem
+ on disk. GNU/Debian Linux packages are available
-
-
-
+ Unix
+ tarball
+ Standard tarball installation is possible on many major Unix systems
+ Windows
-
-
+ Windows installer packages are available
-
-
-
+ Supported Platforms
+ UNIX, Linux, Windows (NT/2000/2003/XP)
+ &zebra; is written in portable C, so it runs on most
+ Unix-like systems as well as Windows (NT/2000/2003/XP). Binary
+ distributions are
+ available for GNU/Debian Linux and Windows
-- 
1.7.10.4