IR Systems/Architecture & Large-scale IR
Mark Sanderson
IR systems: take documents, process them, build an index, search that index, and update it when the collection changes.
Some Problems:
Various document formats (and they change)
Character encodings (ASCII, Unicode encodings such as UTF-8, etc.)
Tokenization (in English, words are separated by spaces; this is not true of all languages)
How to search quickly
Maintaining the index (e.g. an inverted file list, IFL)
For a web-scale collection the inverted file lists become very long
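The tokenization and inverted file ideas above can be sketched in a few lines. This is a minimal illustration, not how any of the systems named in these notes implement it: the function names `tokenize` and `build_inverted_index`, the regex-based tokenizer, and the three sample documents are all assumptions made for the example. The tokenizer only works for languages that delimit words with spaces or punctuation.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return runs of word characters.

    Assumption: this is only adequate for whitespace-delimited languages.
    """
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) postings."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


# Hypothetical toy collection, invented for this sketch.
docs = {
    1: "Information retrieval systems index documents.",
    2: "An inverted index maps terms to documents.",
    3: "Web search engines index billions of documents.",
}
index = build_inverted_index(docs)
```

Each posting list grows with the number of documents containing the term, which is why lists for common terms become very long on a web-scale collection.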
Ranking documents
Rank documents by the number of query words each document contains (quorum search).
More sophisticated techniques: BM25
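The two ranking approaches above can be compared with a small sketch. The notes do not give a formula, so the BM25 variant below (with a smoothed IDF and the commonly used parameter values k1 = 1.2 and b = 0.75) is an assumption, as are the function names `quorum_score` and `bm25_scores` and the toy documents.

```python
import math
from collections import Counter


def quorum_score(query_terms, doc_terms):
    """Count the distinct query terms that occur in the document."""
    return len(set(query_terms) & set(doc_terms))


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document with one common BM25 variant.

    Assumption: k1 and b are typical default values, not tuned ones.
    """
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in docs.values() for term in set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = 0.0
        for term in set(query_terms):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents get a larger denominator.
            norm = tf[term] + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


# Hypothetical pre-tokenized documents, invented for this sketch.
docs = {
    "d1": ["fast", "index", "search"],
    "d2": ["index", "index", "update", "collection"],
    "d3": ["ranking", "documents"],
}
query = ["fast", "index"]
quorum = {doc_id: quorum_score(query, terms) for doc_id, terms in docs.items()}
bm25 = bm25_scores(query, docs)
```

Quorum search only counts matching query words, while BM25 also weights rare terms more heavily and normalizes for document length, so the two can order documents differently on larger collections.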
Systems for Academic Experiments
Lemur (http://www.lemurproject.org/)
Written in C++
Powerful query language
Support for distributed search
Terrier (http://ir.dcs.gla.ac.uk/terrier/)
Desktop application
DFR ranking scheme
Zettair (http://www.seg.rmit.edu.au/zettair/)
Written in C
Very fast indexing


