Tuesday, September 4, 2007

Evaluation in IR - Stephen Robertson

Evaluation in Information Retrieval

by

Stephen Robertson

Microsoft Research Ltd., Cambridge, U.K.

and City University, London, U.K.

Why Evaluate:

  • To challenge ideas about what makes for good search
  • To prove that your ideas are better than someone else’s
  • To decide between alternative methods
  • To tune/train/optimize a system
  • To discover points of failure

What is a good (relevant) document:

  • Traditionally, one judged (by an expert) to be on the topic
  • More properly, one judged by the user to be helpful in resolving her/his problem

Assuming binary relevance and an input-output system, the function of the system is:

1. To retrieve relevant documents

Measure of performance for this is:

Recall = no. of relevant docs retrieved / total relevant docs in the collection

2. Not to retrieve non-relevant documents

Measure of performance for this is:

Precision = no. of relevant docs retrieved / total docs retrieved
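
A quick sketch of the two measures for a single request, with made-up document IDs:

```python
# Set-based recall and precision for one request (invented judgements).
retrieved = {"d1", "d3", "d7", "d9"}        # documents the system returned
relevant = {"d1", "d2", "d3", "d5", "d9"}   # all relevant docs in the collection

rel_ret = len(retrieved & relevant)
recall = rel_ret / len(relevant)        # 3 / 5 = 0.60
precision = rel_ret / len(retrieved)    # 3 / 4 = 0.75
print(f"recall={recall:.2f} precision={precision:.2f}")
```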

Various problems:

  • Interpolation/extrapolation
  • Averaging over requests
  • Dealing with incomplete judgements
  • Choosing which documents to judge

trec_eval: a program by Chris Buckley, used for TREC

Note: Measures like recall and precision are somewhat crude as diagnostic tools

Some other performance measures:

  • Mean Average Precision
  • Mean Reciprocal Rank
  • Success @ n
  • Cumulative gain
  • Normalized Discounted Cumulative Gain (NDCG)

Note: Some are more user-oriented than others, e.g. precision@5
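
For a feel of how the rank-based measures work, here is a rough sketch over a single ranked list with binary judgements (the ranking and the count of 4 relevant documents are invented); MAP and MRR are just these per-request values averaged over requests:

```python
import math

ranking = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]   # relevance (0/1) at ranks 1..10
total_relevant = 4                          # relevant docs in the collection

# Average precision: mean of precision@k over the ranks holding a relevant doc
hits, precisions = 0, []
for k, rel in enumerate(ranking, start=1):
    if rel:
        hits += 1
        precisions.append(hits / k)
average_precision = sum(precisions) / total_relevant

# Reciprocal rank: 1 / rank of the first relevant document
reciprocal_rank = next(1 / k for k, rel in enumerate(ranking, start=1) if rel)

# Success@5: did anything relevant appear in the top 5?
success_at_5 = int(any(ranking[:5]))

# (N)DCG with the common log2 discount; the ideal ranking puts all 1s first
dcg = sum(rel / math.log2(k + 1) for k, rel in enumerate(ranking, start=1))
idcg = sum(rel / math.log2(k + 1)
           for k, rel in enumerate(sorted(ranking, reverse=True), start=1))
ndcg = dcg / idcg

print(average_precision, reciprocal_rank, success_at_5, ndcg)
```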

IR Experiments:

  • Traditionally, run different systems on the same set of documents and requests

Good for comparison of mechanisms (not so good for many user experiments)

The Text Retrieval Conference (TREC):

  • Competition/collaboration between IR research groups worldwide
  • Run by NIST, just outside Washington DC
  • Common tasks, common test materials, common measures, common evaluation procedures

Some User Issues:

  • Interaction
  • Relevance

Relevance should be judged in relation to needs, not requests

The cognitive view

An information need arises from an anomalous state of knowledge (ASK);

The process of resolving an ASK is a cognitive process on the part of the user;

Information seeking is part of that process;

Users’ models of information seeking are strongly influenced by systems.

Some Conflicts:

In a lab test, we try to control variables (i.e. separate the different factors)

But in interactive searching, the user has access to a range of interactive mechanisms.

In a lab test, we try to keep the user outside the system.

But in interactive searching, the user/searcher is inside (part of) the system.

In a lab test, we can repeat an experiment, with variations, any number of times

But in interactive searching, repetition is difficult and expensive and unlikely to produce identical results.

Routing/filtering experiments at TREC:

The task:

  • Incoming stream of documents
  • Persistent user profile
  • Send appropriate incoming documents to the user
  • Learn from user relevance feedback

Batch routing:

  • Take a fixed time point, with a ‘history’ and a ‘future’
  • Optimise the query in relation to the history
  • Evaluate against the future (in particular, evaluate by ranking the test set)
  • Results: excellent performance, but some danger of overfitting
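
Roughly, in code (with invented documents and a crude term-weighting stand-in for the query optimisation step):

```python
from collections import Counter

# Batch-routing sketch: learn from a judged 'history', then rank the 'future'.
# Documents and judgements here are invented; real systems optimise the query
# far more carefully than this term-counting stand-in.
history = [("h1", "relevance feedback in retrieval", True),
           ("h2", "cookery and gardening tips", False)]
future = [("f1", "retrieval with relevance feedback"),
          ("f2", "gardening for beginners")]

# 'Optimise' the query: weight terms by frequency in relevant history docs
weights = Counter()
for _, text, judged_relevant in history:
    if judged_relevant:
        weights.update(text.lower().split())

# Rank the whole future (test) set with the learned query ...
ranked = sorted(future,
                key=lambda d: sum(weights[t] for t in d[1].lower().split()),
                reverse=True)
print([doc_id for doc_id, _ in ranked])
# ... and score that ranking with the usual measures (e.g. average precision)
```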

Adaptive filtering:

  • Start from scratch: a text query, possibly one or two examples of relevant documents
  • Binary decision by the system
  • Feedback only on those items ‘sent’ to the user
  • For scoring systems, thresholding is critical

Note: Evaluation measures are more difficult
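
As a toy illustration of why the threshold matters, a self-contained sketch (the term-overlap scoring and the judgements are both invented): a document is sent only when its score clears the threshold, feedback arrives only for documents that were sent, and the threshold adapts as a result:

```python
def score(profile_terms, text):
    # crude matching: fraction of profile terms present in the document
    return len(profile_terms & set(text.lower().split())) / len(profile_terms)

def run_filter(stream, profile_terms, judgements, threshold=0.5, step=0.05):
    sent = []
    for doc_id, text in stream:
        if score(profile_terms, text) >= threshold:   # binary decision
            sent.append(doc_id)
            # feedback only on items actually sent to the user
            if judgements.get(doc_id, False):
                threshold = max(0.0, threshold - step)  # hit: be more liberal
            else:
                threshold = min(1.0, threshold + step)  # miss: be stricter
    return sent

profile = {"retrieval", "evaluation", "relevance"}
stream = [("d1", "evaluation of retrieval systems"),
          ("d2", "weather report for cambridge"),
          ("d3", "relevance judgements in retrieval evaluation")]
judgements = {"d1": True, "d3": True}
print(run_filter(stream, profile, judgements))   # -> ['d1', 'd3']
```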
