Tuesday, September 4, 2007

Evaluation in IR - Stephen Robertson

Evaluation in Information Retrieval

by

Stephen Robertson

Microsoft Research Ltd., Cambridge, U.K.

and City University, London, U.K.

Why Evaluate:

  • To challenge ideas about what makes for good search
  • To prove that your ideas are better than someone else’s
  • To decide between alternative methods
  • To tune/train/optimize a system
  • To discover points of failure

What is a good (relevant) document:

  • Traditionally, one judged (by an expert) to be on the topic
  • More properly, one judged by the user to be helpful in resolving her/his problem

Assuming binary relevance and an input-output system, the function of the system is:

1. To retrieve relevant documents

Measure of performance for this is:

Recall = no. of relevant docs retrieved / total relevant docs in the collection

2. Not to retrieve non-relevant documents

Measure of performance for this is:

Precision = no. of relevant docs retrieved / total docs retrieved
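
A quick sketch of the two measures for a single request, with made-up document IDs:

```python
# Set-based recall and precision for one request (invented judgements).
retrieved = {"d1", "d3", "d7", "d9"}        # documents the system returned
relevant = {"d1", "d2", "d3", "d5", "d9"}   # all relevant docs in the collection

rel_ret = len(retrieved & relevant)
recall = rel_ret / len(relevant)        # 3 / 5 = 0.60
precision = rel_ret / len(retrieved)    # 3 / 4 = 0.75
print(f"recall={recall:.2f} precision={precision:.2f}")
```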

Various problems:

  • Interpolation/extrapolation
  • Averaging over requests
  • Dealing with incomplete judgements
  • Choosing which documents to judge

trec_eval: a program by Chris Buckley, used for TREC

Note: Measures like recall and precision are somewhat crude as diagnostic tools

Some other performance measures:

  • Mean Average Precision
  • Mean Reciprocal Rank
  • Success @ n
  • Cumulative gain
  • Normalized Discounted Cumulative Gain (NDCG)

Note: Some are more user-oriented than others, e.g. precision@5
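
For a feel of how the rank-based measures work, here is a rough sketch over a single ranked list with binary judgements (the ranking and the count of 4 relevant documents are invented); MAP and MRR are just these per-request values averaged over requests:

```python
import math

ranking = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]   # relevance (0/1) at ranks 1..10
total_relevant = 4                          # relevant docs in the collection

# Average precision: mean of precision@k over the ranks holding a relevant doc
hits, precisions = 0, []
for k, rel in enumerate(ranking, start=1):
    if rel:
        hits += 1
        precisions.append(hits / k)
average_precision = sum(precisions) / total_relevant

# Reciprocal rank: 1 / rank of the first relevant document
reciprocal_rank = next(1 / k for k, rel in enumerate(ranking, start=1) if rel)

# Success@5: did anything relevant appear in the top 5?
success_at_5 = int(any(ranking[:5]))

# (N)DCG with the common log2 discount; the ideal ranking puts all 1s first
dcg = sum(rel / math.log2(k + 1) for k, rel in enumerate(ranking, start=1))
idcg = sum(rel / math.log2(k + 1)
           for k, rel in enumerate(sorted(ranking, reverse=True), start=1))
ndcg = dcg / idcg

print(average_precision, reciprocal_rank, success_at_5, ndcg)
```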

IR Experiments:

  • Traditionally, run different systems on the same set of documents and requests

Good for comparison of mechanisms (not so good for many user experiments)

The Text Retrieval Conference (TREC):

  • Competition/collaboration between IR research groups worldwide
  • Run by NIST, just outside Washington DC
  • Common tasks, common test materials, common measures, common evaluation procedures

Some User Issues:

  • Interaction
  • Relevance

Relevance should be judged in relation to needs, not requests

The cognitive view

An information need arises from an anomalous state of knowledge (ASK);

The process of resolving an ASK is a cognitive process on the part of the user;

Information seeking is part of that process;

Users’ models of information seeking are strongly influenced by systems.

Some Conflicts:

In a lab test, we try to control variables (i.e. separate the different factors)

But in interactive searching, the user has access to a range of interactive mechanisms.

In a lab test, we try to keep the user outside the system.

But in interactive searching, the user/searcher is inside (part of) the system.

In a lab test, we can repeat an experiment, with variations, any number of times

But in interactive searching, repetition is difficult and expensive and unlikely to produce identical results.

Routing/filtering experiments at TREC:

The task:

  • Incoming stream of documents
  • Persistent user profile
  • Send appropriate incoming documents to the user
  • Learn from user relevance feedback

Batch routing:

  • Take a fixed time point, with a ‘history’ and a ‘future’
  • Optimise the query in relation to the history
  • Evaluate against the future (in particular, evaluate by ranking the test set)
  • Results: excellent performance, but some danger of overfitting
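
Roughly, in code (with invented documents and a crude term-weighting stand-in for the query optimisation step):

```python
from collections import Counter

# Batch-routing sketch: learn from a judged 'history', then rank the 'future'.
# Documents and judgements here are invented; real systems optimise the query
# far more carefully than this term-counting stand-in.
history = [("h1", "relevance feedback in retrieval", True),
           ("h2", "cookery and gardening tips", False)]
future = [("f1", "retrieval with relevance feedback"),
          ("f2", "gardening for beginners")]

# 'Optimise' the query: weight terms by frequency in relevant history docs
weights = Counter()
for _, text, judged_relevant in history:
    if judged_relevant:
        weights.update(text.lower().split())

# Rank the whole future (test) set with the learned query ...
ranked = sorted(future,
                key=lambda d: sum(weights[t] for t in d[1].lower().split()),
                reverse=True)
print([doc_id for doc_id, _ in ranked])
# ... and score that ranking with the usual measures (e.g. average precision)
```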

Adaptive filtering:

  • Start from scratch: a text query, possibly one or two examples of relevant documents
  • Binary decision by the system
  • Feedback only on those items ‘sent’ to the user
  • For scoring systems, thresholding is critical

Note: Evaluation measures are more difficult
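
As a toy illustration of why the threshold matters, a self-contained sketch (the term-overlap scoring and the judgements are both invented): a document is sent only when its score clears the threshold, feedback arrives only for documents that were sent, and the threshold adapts as a result:

```python
def score(profile_terms, text):
    # crude matching: fraction of profile terms present in the document
    return len(profile_terms & set(text.lower().split())) / len(profile_terms)

def run_filter(stream, profile_terms, judgements, threshold=0.5, step=0.05):
    sent = []
    for doc_id, text in stream:
        if score(profile_terms, text) >= threshold:   # binary decision
            sent.append(doc_id)
            # feedback only on items actually sent to the user
            if judgements.get(doc_id, False):
                threshold = max(0.0, threshold - step)  # hit: be more liberal
            else:
                threshold = min(1.0, threshold + step)  # miss: be stricter
    return sent

profile = {"retrieval", "evaluation", "relevance"}
stream = [("d1", "evaluation of retrieval systems"),
          ("d2", "weather report for cambridge"),
          ("d3", "relevance judgements in retrieval evaluation")]
judgements = {"d1": True, "d3": True}
print(run_filter(stream, profile, judgements))   # -> ['d1', 'd3']
```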
