Tuesday, September 4, 2007

IR Systems/Architecture & Large-scale IR-Mark Sanderson

IR Systems/Architecture & Large-scale IR

Mark Sanderson

University of Sheffield

An IR system takes documents, processes them, builds an index, searches that index, and updates it when the collection changes.

Some Problems:

Various document formats (and they change)

Character encodings (ASCII, Unicode encodings such as UTF-8, etc.)

Tokenization (in English, words are separated by spaces; this does not hold for all languages)

How to search quickly

Maintaining the index (e.g. an inverted file list, IFL)

For a web-scale collection, the IFL becomes very long
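
The tokenization and index-maintenance problems above can be illustrated with a small sketch. This is not taken from the talk: the names tokenize and InvertedIndex are hypothetical, the tokenizer only handles languages that delimit words with non-word characters, and everything is held in memory, which a web-scale inverted file list could not be.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on runs of non-word characters.

    Assumption: this only suits languages whose words are delimited
    by spaces or punctuation; it is not a general tokenizer.
    """
    return [t for t in re.split(r"\W+", text.lower()) if t]


class InvertedIndex:
    """Hypothetical in-memory inverted file: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_terms = {}  # doc_id -> set of its terms, needed for removal

    def add(self, doc_id, text):
        """Index a document; re-adding an existing id replaces the old version."""
        if doc_id in self.doc_terms:
            self.remove(doc_id)
        tokens = tokenize(text)
        for term in tokens:
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1
        self.doc_terms[doc_id] = set(tokens)

    def remove(self, doc_id):
        """Delete a document's postings when it leaves the collection."""
        for term in self.doc_terms.pop(doc_id, set()):
            del self.postings[term][doc_id]
            if not self.postings[term]:
                del self.postings[term]

    def search_all(self, query):
        """Return the sorted ids of documents containing every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        doc_sets = [set(self.postings.get(term, {})) for term in terms]
        return sorted(set.intersection(*doc_sets))


index = InvertedIndex()
index.add(1, "Information retrieval systems")
index.add(2, "Large-scale retrieval on the web")
matches = index.search_all("web retrieval")  # ids of docs with both terms
```

Each posting list grows with the number of documents that contain the term, which is why the list for a common word becomes very long on a web collection.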

Ranking documents

Rank documents by number of query words in the document (Quorum search).

More sophisticated techniques, e.g. BM25
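
The two ranking approaches above can be compared in a short sketch. It is not from the talk: the function names are hypothetical, the BM25 parameters k1=1.2 and b=0.75 are commonly used defaults rather than values given here, and the IDF uses the non-negative variant log((N - n + 0.5) / (n + 0.5) + 1), which is one of several in use.

```python
import math
from collections import Counter


def quorum_score(query_terms, doc_terms):
    """Quorum search: count the distinct query terms present in the document."""
    return len(set(query_terms) & set(doc_terms))


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document against the query with BM25.

    docs maps doc_id -> list of tokens. k1 and b are assumed defaults.
    """
    if not docs:
        return {}
    n_docs = len(docs)
    # Fall back to 1.0 so a collection of empty documents does not divide by zero.
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs or 1.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in set(query_terms):
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalisation: long documents need more occurrences.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


docs = {
    "a": ["ir", "systems", "index"],
    "b": ["web", "search", "index", "index"],
    "c": ["cooking", "recipes"],
}
query = ["index", "web"]
ranking = sorted(bm25_scores(query, docs).items(), key=lambda kv: kv[1], reverse=True)
```

Quorum search treats every matching term equally, whereas BM25 weights rare terms more heavily, rewards repeated occurrences with diminishing returns, and normalises for document length.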

Academic Experiments

Lemur (http://www.lemurproject.org/)

Written in C++

Powerful query language

Support for distributed search

Terrier (http://ir.dcs.gla.ac.uk/terrier/)

Desktop application

DFR (Divergence From Randomness) ranking scheme

Zettair (http://www.seg.rmit.edu.au/zettair/)

Written in C

Very fast indexing
