Measures of Performance
Precision & Recall
Precision = |{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}|
Recall = |{relevant docs} ∩ {retrieved docs}| / |{relevant docs in collection}|
Confusion Matrix
True +ve (TP): Number of relevant docs that system retrieved
False +ve (FP): Number of non-relevant docs that system retrieved
True -ve (TN): Number of non-relevant docs that the system did not retrieve
False -ve (FN): Number of relevant docs that the system did not retrieve
| | Relevant | Non-Relevant |
|---|---|---|
| Retrieved | TP | FP |
| Not Retrieved | FN | TN |
Precision=TP/(TP+FP)
Recall=TP/(TP+FN)
F-Score
F_α = 1 / (α/P + (1-α)/R)
Where,
α --> weighting factor
Higher α means that precision is more important; lower α means that recall is more important (α = 0.5 gives the balanced F1 score).
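A minimal Python sketch of these formulas (the function name and example counts are illustrative, not from any library):

```python
def precision_recall_f(tp, fp, fn, alpha=0.5):
    """Compute precision, recall and the weighted F-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision == 0 or recall == 0:
        return precision, recall, 0.0
    # F_alpha = 1 / (alpha/P + (1 - alpha)/R); alpha = 0.5 gives the balanced F1
    f_alpha = 1.0 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, f_alpha

# Example: 8 relevant retrieved, 2 non-relevant retrieved, 4 relevant missed
print(precision_recall_f(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)
```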
User Oriented Measures
Coverage = relevant docs retrieved that were known to the user / relevant docs known to the user
Novelty = relevant docs retrieved that were previously unknown to the user / relevant docs retrieved
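A small sketch of these two ratios in Python, assuming the document sets are given as Python sets (illustrative document names):

```python
def coverage_and_novelty(retrieved_relevant, known_to_user):
    """Coverage and novelty, given the relevant retrieved docs and the relevant docs the user already knew."""
    coverage = len(retrieved_relevant & known_to_user) / len(known_to_user) if known_to_user else 0.0
    novelty = len(retrieved_relevant - known_to_user) / len(retrieved_relevant) if retrieved_relevant else 0.0
    return coverage, novelty

# The user already knows relevant docs d1 and d2; the system retrieves relevant docs d1, d3 and d4
print(coverage_and_novelty({"d1", "d3", "d4"}, {"d1", "d2"}))  # (0.5, 0.666...)
```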
Problems with precision & recall
Users generally do not really care about recall.
P & R do not take into account the order of the answers.
They do not take into account the degree of relevance.
They are undefined when the denominator is zero (e.g. when nothing is retrieved, or the collection contains no relevant docs).
Measuring recall requires identifying all the relevant docs in the corpus, which is difficult.
Average Precision
AP = (1/R) * Summation(k=1 to |S|) rel(s_k) * P@k
P@k = (1/k) * Summation(i=1 to k) rel(s_i)
R = Summation(d ∈ D) rel(d)
Where,
D = corpus of documents
S = returned list of documents (s_1, s_2, ..., s_n)
P@k = precision over s_1, s_2, ..., s_k
rel(d) = "true" relevance:
0: non-relevant
1: relevant
R = number of relevant docs in the corpus
Example:
Suppose the documents at ranks 1, 3, 6, 9 and 10 are relevant (5 relevant docs in total).
| Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Recall | 0.2 | 0.2 | 0.4 | 0.4 | 0.4 | 0.6 | 0.6 | 0.6 | 0.8 | 1.0 |
| Precision | 1.0 | 0.5 | 0.67 | 0.5 | 0.4 | 0.5 | 0.43 | 0.38 | 0.44 | 0.5 |
AP = (1.0 + 0.67 + 0.5 + 0.44 + 0.5) / 5 = 0.62
(Calculated by averaging the precision at each rank where recall increases, i.e. where a relevant document arrives.)
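A sketch of this computation in Python, reproducing the worked example above (the function name is illustrative):

```python
def average_precision(relevance, num_relevant_in_corpus):
    """AP: average of P@k taken at the ranks k where a relevant document appears.

    relevance: list of 0/1 judgements for the ranked results s_1..s_n.
    num_relevant_in_corpus: R, the total number of relevant docs in the corpus.
    """
    hits = 0
    precision_sum = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k      # P@k at this rank
    return precision_sum / num_relevant_in_corpus

# Ranks 1, 3, 6, 9 and 10 are relevant, as in the example above
rels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
print(round(average_precision(rels, 5), 2))  # 0.62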
Mean Average Precision (MAP)
-->Take the AP value per query
-->Calculate the mean score (average over queries)
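As a tiny illustration (the per-query AP values here are made up):

```python
ap_per_query = [0.62, 0.45, 0.80]           # AP for each query
map_score = sum(ap_per_query) / len(ap_per_query)
print(round(map_score, 3))                   # 0.623
```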
Mean Reciprocal Rank (MRR)
Suppose we have 3 sample queries for a system that tries to translate English words to their plurals. For each query, the system makes three guesses, with the first one being the one it thinks is most likely to be correct.
| Query | Results | Correct | Rank | Reciprocal Rank |
|---|---|---|---|---|
| cat | catten, cati, cats | cats | 3 | 1/3 |
| torus | torii, tori, toruses | tori | 2 | 1/2 |
| virus | viruses, virii, viri | viruses | 1 | 1 |
MRR=(1/3+1/2+1)/3=0.61
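A minimal sketch of MRR in Python for the three sample queries (the function name is illustrative; a rank of None marks a query with no correct answer):

```python
def mean_reciprocal_rank(ranks):
    """MRR: mean of 1/rank of the first correct answer for each query (None if never correct)."""
    return sum((1.0 / r) if r else 0.0 for r in ranks) / len(ranks)

# Ranks of the correct plural for the queries cat, torus, virus
print(round(mean_reciprocal_rank([3, 2, 1]), 2))  # 0.61
```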
Precision@n
-->Proportion of top n documents that are satisfactory
P@10=0.4 means that 4 of the top 10 documents were satisfactory.
A run of 50 queries would be measured according to mean P@n.
Success@n
-->The proportion of queries for which a correct answer was within the top n.
s@1=0.5 means that for half of the queries, the correct answer was at rank 1.
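A sketch of both measures in Python (function names are illustrative):

```python
def precision_at_n(relevance, n):
    """Proportion of the top n results that are satisfactory (relevance is a 0/1 list per rank)."""
    return sum(relevance[:n]) / n

def success_at_n(first_correct_ranks, n):
    """Proportion of queries whose first correct answer appears within the top n ranks."""
    return sum(1 for r in first_correct_ranks if r is not None and r <= n) / len(first_correct_ranks)

print(precision_at_n([1, 0, 1, 0, 0, 1, 0, 0, 1, 0], 10))  # 0.4 -> 4 of the top 10 are satisfactory
print(success_at_n([1, 3, None, 1], 1))                    # 0.5 -> half the queries succeed at rank 1
```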
Relevance Measurement
Term Frequency (tf)
-->tf in a document is simply the number of times a given term appears in that document.
-->This count is usually normalised to prevent a bias towards longer documents which may have a higher tf regardless of the actual importance of that term in the document.
tf_i = n_i / Σ_k n_k
Where,
n_i = number of occurrences of the considered term in the document
The denominator is the total number of occurrences of all terms in the document
Inverse Document Frequency (IDF)
idf_i = log2 ( D / |{d : t_i ∈ d}| )
Where,
D = total number of documents in the corpus
|{d : t_i ∈ d}| = number of documents in which the term t_i appears
Then, tf-idf_i = tf_i * idf_i
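A minimal sketch of tf-idf in Python under these definitions, assuming the corpus is given as a list of tokenised documents (illustrative function and data names):

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """tf-idf of a term in one document, with tf normalised by document length and idf = log2(D / df)."""
    counts = Counter(doc)
    tf = counts[term] / sum(counts.values())                  # n_i / sum_k n_k
    df = sum(1 for d in corpus if term in d)                  # |{d : t_i in d}|
    idf = math.log2(len(corpus) / df) if df else 0.0          # log2(D / df)
    return tf * idf

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran", "cat"]]
print(tf_idf("cat", docs[2], docs))  # cat occurs 2/4 times in this doc, and in 2 of 3 docs
```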
BM25
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document.
score(D,Q)=Summation(i=1 to n) IDF(qi) * {(f(qi,D)*(k1+1))/(f(qi,D)+k1*(1-b+b*|D|/avgdl))}
where f(qi,D) is qi's term frequency in the document D, |D| is the length of the document D, and avgdl is the average document length in the text collection from which documents are drawn. k1 and b are free parameters.
The variable k1 is a positive tuning parameter that calibrates the document term frequency scaling. A k1 value of 0 corresponds to a binary model (no term frequency), and a large value corresponds to using raw term frequency. b is another tuning parameter (0 ≤ b ≤ 1) which determines the scaling by document length: b = 1 corresponds to fully scaling the term weight by the document length, while b = 0 corresponds to no length normalization.
IDF(qi) is the IDF (inverse document frequency) weight of the query term qi.
It is usually computed as:
IDF(qi)=log {(N-n(qi)+0.5)/(n(qi)+0.5)}
where N is the total number of documents in the collection, and n(qi) is the number of documents containing qi.
If the query is long, then we might also use similar weighting for the query terms. This is appropriate if the queries are paragraph-long information needs, but unnecessary for short queries.
score(D,Q)=Summation(i=1 to n) IDF(qi) * {(f(qi,D)*(k1+1))/(f(qi,D)+k1*(1-b+b*|D|/avgdl))} * {(k3+1)f(qi,Q)/(k3+f(qi,Q))}
k3 being another positive tuning parameter that this time calibrates term frequency scaling of the query.
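A sketch of the first scoring function above (without the k3 query-term component) in Python, assuming the natural log and illustrative values for k1 and b; note that this unsmoothed IDF can go negative for terms occurring in more than half of the documents, which real implementations often guard against:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """BM25 score of one tokenised document for a query; corpus is a list of tokenised documents."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    N = len(corpus)
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)                   # number of docs containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))            # can go negative if q is very common
        f = doc_terms.count(q)                                   # term frequency in the document
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)    # length-normalised saturation
        score += idf * f * (k1 + 1) / denom
    return score

docs = [
    ["cheap", "flights", "to", "rome"],
    ["paris", "city", "guide"],
    ["budget", "hotels", "in", "london"],
    ["weather", "in", "london"],
]
print(bm25_score(["cheap", "rome"], docs[0], docs))  # ≈ 1.59
```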
Normalized Discounted Cumulative Gain (NDCG)
-->Devised specifically for web search evaluation
-->rewards relevant documents in the top ranked results more heavily than those ranked lower
For a query q:
Let r_q(j) be the relevance level of the document ranked j with respect to q.
The response list is inspected up to rank L.
NDCG_q = Z_q * Summation(j=1 to L) (2^r_q(j) − 1) / log(1+j)
Z_q is a normalization factor chosen so that a perfect ordering of the results for the query receives a score of one.
Suppose we run the same query on two different search engines and get the ranked relevance levels below. Both result lists contain the same documents, but in Result1 the highly relevant document appears at an earlier position (counting the first position as 1) than in Result2, so Result1 receives a greater score. This is where the discount factor (the 1/log(1+j) term) plays its role.
Result1 = 3 1 1
Gain1 = 7 1 1
Discount1 = 1 0.63 0.5
Discounted gain1 = 7 0.63 0.5 --> DCG1 = 7 + 0.63 + 0.5 = 8.13
Result2 = 1 3 1
Gain2 = 1 7 1
Discount2 = 1 0.63 0.5
Discounted gain2 = 1 4.41 0.5 --> DCG2 = 1 + 4.41 + 0.5 = 5.91
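A sketch of DCG/NDCG in Python under the formula above, assuming log base 2 (which matches the 0.63 and 0.5 discounts in the example) and using the ideal ordering as the normalization factor Z_q:

```python
import math

def dcg(relevances, L=None):
    """Discounted cumulative gain: sum of (2^rel - 1) / log2(1 + rank) up to rank L."""
    L = L or len(relevances)
    return sum((2 ** r - 1) / math.log2(1 + j) for j, r in enumerate(relevances[:L], start=1))

def ndcg(relevances, L=None):
    """Normalise by the DCG of the ideal (descending-relevance) ordering, so a perfect ranking scores 1."""
    ideal = dcg(sorted(relevances, reverse=True), L)
    return dcg(relevances, L) / ideal if ideal else 0.0

print(round(dcg([3, 1, 1]), 2), round(dcg([1, 3, 1]), 2))    # 8.13 vs 5.92
print(round(ndcg([3, 1, 1]), 2), round(ndcg([1, 3, 1]), 2))  # 1.0 vs 0.73
```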