heikki/README-HEIKKI

Heikki's experiments with ranking

Personal notes, likely to be out of date.

Things to experiment with, and find out, and mess about

Goals:
 - Understand the ranking
 - Make a better ranking merging algorithm


Tue 19-Nov-2013. Started this branch.

Wed 20-Nov-2013. Make a script that tests ranking against yaz-zserver
(as that is the default config). Mostly to have a script to build on later.

Thu 21-Nov-2013. Start my own complete config.

Fri 22-Nov-2013. Adam defined a new sort type, relevance_h, and put it in place
in the code. Now I have a place to implement my stuff. Relevant places:
 pazpar2_config.c:1020 - minor
 session.c:1318 - call relevance_prepare_read also for my type
 reclists.c:104 - parse params
 reclists.c:166 - compare function (for quicksort)
 relevance.c:417 - calculate score
         (same function as for relevance, but with extra arg for type)

When sorting by Metadata_sortkey_position, the compare function compares positions.
It loops through the records in each cluster, finds the smallest rec->pos,
and then compares those.

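A minimal sketch of that comparison, to make the idea concrete. The struct layouts
and the names cluster_min_pos and compare_by_position are hypothetical stand-ins,
not the real pazpar2 types or the code in reclists.c; only rec->pos comes from the
notes above.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the real record and cluster types. */
struct record {
    int pos;                 /* position in the target's own result list */
    struct record *next;     /* next record in the same cluster */
};

struct cluster {
    struct record *records;  /* linked list of the records merged together */
};

/* Smallest rec->pos in the cluster; INT_MAX if the cluster has no records. */
static int cluster_min_pos(const struct cluster *c)
{
    int best = INT_MAX;
    const struct record *r;

    for (r = c->records; r; r = r->next)
        if (r->pos < best)
            best = r->pos;
    return best;
}

/* qsort-style compare over an array of cluster pointers, ascending position. */
static int compare_by_position(const void *a, const void *b)
{
    const struct cluster *ca = *(const struct cluster *const *)a;
    const struct cluster *cb = *(const struct cluster *const *)b;
    int pa = cluster_min_pos(ca);
    int pb = cluster_min_pos(cb);

    return (pa > pb) - (pa < pb);
}
```

With an array of cluster pointers v of length n, this would be used as
qsort(v, n, sizeof v[0], compare_by_position). Empty clusters sort last.
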
Next: See if I can implement a round robin.
 - clients.h declares int clients_count(void)
 - rec->client is a pointer to the client, but that does not give us an ordinal
   - keep an array of structs with the pointer, and locate the client number that way
 - robin-score = pos * n_clients + client_num

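The round-robin idea above could be sketched roughly like this. Everything except
the formula is an assumption made for illustration: struct client is a dummy
stand-in, and client_slot, client_table, MAX_CLIENTS, client_num and robin_score
are hypothetical names, not part of pazpar2.

```c
#include <stddef.h>

#define MAX_CLIENTS 100          /* hypothetical fixed limit for this sketch */

struct client { int dummy; };    /* stand-in; only pointer identity is used */

struct client_slot {
    const struct client *cl;     /* pointer as found in rec->client */
};

struct client_table {
    struct client_slot slots[MAX_CLIENTS];
    int n_clients;               /* number of slots in use */
};

/* Return the ordinal of cl, registering it on first sight; -1 if table is full. */
static int client_num(struct client_table *t, const struct client *cl)
{
    int i;

    for (i = 0; i < t->n_clients; i++)
        if (t->slots[i].cl == cl)
            return i;
    if (t->n_clients == MAX_CLIENTS)
        return -1;
    t->slots[t->n_clients].cl = cl;
    return t->n_clients++;
}

/* robin-score = pos * n_clients + client_num; a lower score sorts first. */
static int robin_score(int pos, int n_clients, int num)
{
    return pos * n_clients + num;
}
```

With two clients, position 1 gives scores 2 and 3 and position 2 gives 4 and 5, so
sorting by ascending score interleaves the targets one record at a time.
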
relevance_new_rec is called every time a new record pops up, followed by one or more
calls to count_word and exactly one to done_rec. That's where I can compare to the
ranking of the previous record. struct_relevance is the one structure I have for myself,
global (for the user session), so I can keep my stuff in there, possibly an array of
things for each target.

I should also add stuff directly to the client, and to the record, as I need.

Next: Plot the tf/idf scores against round-robin sorted order. Will be messy,
but later when we get a target that returns sorted records, it will make sense.


Wed 27-Nov
Setting up multiple SOLR targets in the same pazpar2
 - Add #999 to the z-urls, so pazpar2 won't merge them. Use a different number for each.

This URL shows the databases, with their numbers:
http://lui.indexdata.com/solr/select?q=database:*&facet=true&facet.method=fc&facet.field=author_exact&facet.field=subject_exact&facet.field=date&facet.field=medium_exact&facet.field=database&rows=0&facet.mincount=1

Add this to the target defs:
<set name="pz:extra_args" value="fq=database:4902"/>

After this, it should be possible to get records from different databases, some
with many records, some with a few. This is a good testing ground for merging
rankings! Test first with a round-robin, and plot the scores.

Thu 28-Nov
Ok, I can now merge a number of SOLR databases (harvest jobs), and plot their rankings
as SOLR gives them, in the order of different merge strategies.
Next: Add the normalizing merge strategy. Then plot different strategies against
different queries. Write a conclusion, and consider this plotting job done.


Fri 13-Dec-2013
Adam is adding a float type to pazpar2. I have made a proof of concept of normalizing
by curve fitting. I think it is time to close this branch, and start (re)implementing
things in the main branch. Keep the old branch around for reference!

Need new config options:
 - sort: native, native + position
 - or per target: native score / fake score from position / use tf/idf
 - per target: weight for combining rankings (cluster merge), so we can trust one
   target more than others
 - per target: boost rankings

Start coding:
 - in relevance_prepare_read, go through records, collect scores in arrays (per target)
 - fit the curve, normalize the scores
 - cluster scoring

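A rough sketch of the "fit the curve, normalize the scores" step for one target.
These notes do not say which curve is fitted or how the fit is applied, so the
choices here are assumptions made purely for illustration: a least-squares fit of
score = a + b / pos, and normalization by dividing every score by the fitted value
at position 1. The names fit_curve and normalize_scores are hypothetical.

```c
#include <stddef.h>

/* Least-squares fit of scores[i] ~ a + b * (1 / pos), with pos = i + 1.
   ASSUMPTION: the 1/pos curve shape is chosen for this sketch only.
   Returns 0 on success, -1 if there are fewer than two scores. */
static int fit_curve(const double *scores, size_t n, double *a, double *b)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0, denom;
    size_t i;

    if (n < 2)
        return -1;
    for (i = 0; i < n; i++) {
        double x = 1.0 / (double)(i + 1);
        sx += x;
        sy += scores[i];
        sxx += x * x;
        sxy += x * scores[i];
    }
    denom = (double)n * sxx - sx * sx;
    if (denom == 0.0)
        return -1;
    *b = ((double)n * sxy - sx * sy) / denom;
    *a = (sy - *b * sx) / (double)n;
    return 0;
}

/* Scale one target's scores in place so the fitted curve is 1.0 at position 1.
   ASSUMPTION: this normalization rule is illustrative, not taken from the notes.
   Returns 0 on success, -1 if the fit fails or its top value is not positive. */
static int normalize_scores(double *scores, size_t n)
{
    double a, b, top;
    size_t i;

    if (fit_curve(scores, n, &a, &b) != 0)
        return -1;
    top = a + b;             /* fitted score at pos = 1 */
    if (top <= 0.0)
        return -1;
    for (i = 0; i < n; i++)
        scores[i] /= top;
    return 0;
}
```

Dividing by the fitted top value rather than the raw top score means a single
outlier at position 1 does not set the scale for the whole target.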