Measures of Performance
Precision & Recall
Precision = |{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}|
Recall = |{relevant docs} ∩ {retrieved docs}| / |{relevant docs in collection}|
Confusion Matrix
True +ve (TP): Number of relevant docs that system retrieved
False +ve (FP): Number of non-relevant docs that system retrieved
True -ve (TN): Number of non-relevant docs that the system did not retrieve
False -ve (FN): Number of relevant docs that the system did not retrieve
| | Relevant | Non-Relevant |
|---|---|---|
| Retrieved | TP | FP |
| Not Retrieved | FN | TN |
Precision=TP/(TP+FP)
Recall=TP/(TP+FN)
F-Score
F_α = 1 / (α/P + (1-α)/R)
Where,
α --> weighting factor
Higher α means that precision is more important; lower α means that recall is more important (α = 0.5 gives the balanced F1 score).
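A minimal Python sketch of these formulas (the function name and example counts are illustrative, not from any library):

```python
def precision_recall_f(tp, fp, fn, alpha=0.5):
    """Compute precision, recall and the weighted F-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision == 0 or recall == 0:
        return precision, recall, 0.0
    # F_alpha = 1 / (alpha/P + (1 - alpha)/R); alpha = 0.5 gives the balanced F1
    f_alpha = 1.0 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, f_alpha

# Example: 8 relevant retrieved, 2 non-relevant retrieved, 4 relevant missed
print(precision_recall_f(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)
```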
User Oriented Measures
Coverage = relevant docs retrieved that were known to the user / relevant docs known to the user
Novelty = relevant docs retrieved that were previously unknown to the user / relevant docs retrieved
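A small sketch of these two ratios in Python, assuming the document sets are given as Python sets (illustrative document names):

```python
def coverage_and_novelty(retrieved_relevant, known_to_user):
    """Coverage and novelty, given the relevant retrieved docs and the relevant docs the user already knew."""
    coverage = len(retrieved_relevant & known_to_user) / len(known_to_user) if known_to_user else 0.0
    novelty = len(retrieved_relevant - known_to_user) / len(retrieved_relevant) if retrieved_relevant else 0.0
    return coverage, novelty

# The user already knows relevant docs d1 and d2; the system retrieves relevant docs d1, d3 and d4
print(coverage_and_novelty({"d1", "d3", "d4"}, {"d1", "d2"}))  # (0.5, 0.666...)
```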
Problems with precision & recall
Users generally do not really care about recall.
P & R do not take into account the order of the answers.
They do not take into account the degree of relevance.
They are undefined when the denominator is zero (e.g. when nothing is retrieved, or the collection contains no relevant docs).
Measuring recall requires identifying all the relevant docs in the corpus, which is difficult.
Average Precision
AP = (1/R) * Summation(k=1 to |S|) rel(s_k) * P@k
P@k = (1/k) * Summation(i=1 to k) rel(s_i)
R = Summation(d ∈ D) rel(d)
Where,
D = corpus of documents
S = returned list of documents (s_1, s_2, ..., s_n)
P@k = precision over s_1, s_2, ..., s_k
rel(d) = "true" relevance:
0: non-relevant
1: relevant
R = number of relevant docs in the corpus
Example:
Suppose the documents at ranks 1, 3, 6, 9 and 10 are relevant (5 relevant docs in total).
| Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Recall | 0.2 | 0.2 | 0.4 | 0.4 | 0.4 | 0.6 | 0.6 | 0.6 | 0.8 | 1.0 |
| Precision | 1.0 | 0.5 | 0.67 | 0.5 | 0.4 | 0.5 | 0.43 | 0.38 | 0.44 | 0.5 |
AP = (1.0 + 0.67 + 0.5 + 0.44 + 0.5) / 5 = 0.62
(Calculated by averaging the precision at each rank where recall increases, i.e. where a relevant document arrives.)
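A sketch of this computation in Python, reproducing the worked example above (the function name is illustrative):

```python
def average_precision(relevance, num_relevant_in_corpus):
    """AP: average of P@k taken at the ranks k where a relevant document appears.

    relevance: list of 0/1 judgements for the ranked results s_1..s_n.
    num_relevant_in_corpus: R, the total number of relevant docs in the corpus.
    """
    hits = 0
    precision_sum = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k      # P@k at this rank
    return precision_sum / num_relevant_in_corpus

# Ranks 1, 3, 6, 9 and 10 are relevant, as in the example above
rels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
print(round(average_precision(rels, 5), 2))  # 0.62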
Mean Average Precision (MAP)
-->Take the AP value per query
-->Calculate the mean score (average over queries)
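As a tiny illustration (the per-query AP values here are made up):

```python
ap_per_query = [0.62, 0.45, 0.80]           # AP for each query
map_score = sum(ap_per_query) / len(ap_per_query)
print(round(map_score, 3))                   # 0.623
```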
Mean Reciprocal Rank (MRR)
Suppose we have 3 sample queries for a system that tries to translate English words to their plurals. For each query, the system makes three guesses, with the first one being the one it thinks is most likely to be correct.
| Query | Results | Correct | Rank | Reciprocal Rank |
|---|---|---|---|---|
| cat | catten, cati, cats | cats | 3 | 1/3 |
| torus | torii, tori, toruses | tori | 2 | 1/2 |
| virus | viruses, virii, viri | viruses | 1 | 1 |
MRR=(1/3+1/2+1)/3=0.61
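A minimal sketch of MRR in Python for the three sample queries (the function name is illustrative; a rank of None marks a query with no correct answer):

```python
def mean_reciprocal_rank(ranks):
    """MRR: mean of 1/rank of the first correct answer for each query (None if never correct)."""
    return sum((1.0 / r) if r else 0.0 for r in ranks) / len(ranks)

# Ranks of the correct plural for the queries cat, torus, virus
print(round(mean_reciprocal_rank([3, 2, 1]), 2))  # 0.61
```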
Precision@n
-->Proportion of top n documents that are satisfactory
P@10=0.4 means that 4 of the top 10 documents were satisfactory.
A run of 50 queries would be measured according to mean P@n.
Success@n
-->The proportion of queries for which a correct answer was within the top n.
s@1=0.5 means that for half of the queries, the correct answer was at rank 1.
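A sketch of both measures in Python (function names are illustrative):

```python
def precision_at_n(relevance, n):
    """Proportion of the top n results that are satisfactory (relevance is a 0/1 list per rank)."""
    return sum(relevance[:n]) / n

def success_at_n(first_correct_ranks, n):
    """Proportion of queries whose first correct answer appears within the top n ranks."""
    return sum(1 for r in first_correct_ranks if r is not None and r <= n) / len(first_correct_ranks)

print(precision_at_n([1, 0, 1, 0, 0, 1, 0, 0, 1, 0], 10))  # 0.4 -> 4 of the top 10 are satisfactory
print(success_at_n([1, 3, None, 1], 1))                    # 0.5 -> half the queries succeed at rank 1
```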
Relevance Measurement
Term Frequency (tf)
-->tf in a document is simply the number of times a given term appears in that document.
-->This count is usually normalised to prevent a bias towards longer documents which may have a higher tf regardless of the actual importance of that term in the document.
tf_i = n_i / Σ_k n_k
Where,
n_i = number of occurrences of the considered term in the document
The denominator is the total number of occurrences of all terms in the document
Inverse Document Frequency (IDF)
idf_i = log2 ( D / |{d : t_i ∈ d}| )
Where,
D = total number of documents in the corpus
|{d : t_i ∈ d}| = number of documents in which the term t_i appears
Then, tf-idf_i = tf_i * idf_i
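A minimal sketch of tf-idf in Python under these definitions, assuming the corpus is given as a list of tokenised documents (illustrative function and data names):

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """tf-idf of a term in one document, with tf normalised by document length and idf = log2(D / df)."""
    counts = Counter(doc)
    tf = counts[term] / sum(counts.values())                  # n_i / sum_k n_k
    df = sum(1 for d in corpus if term in d)                  # |{d : t_i in d}|
    idf = math.log2(len(corpus) / df) if df else 0.0          # log2(D / df)
    return tf * idf

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran", "cat"]]
print(tf_idf("cat", docs[2], docs))  # cat occurs 2/4 times in this doc, and in 2 of 3 docs
```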
BM25
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document.
score(D,Q)=Summation(i=1 to n) IDF(qi) * {(f(qi,D)*(k1+1))/(f(qi,D)+k1*(1-b+b*|D|/avgdl))}
where f(qi,D) is qi's term frequency in the document D, |D| is the length of the document D, and avgdl is the average document length in the text collection from which documents are drawn. k1 and b are free parameters.
The variable k1 is a positive tuning parameter that calibrates the document term frequency scaling. A k1 value of 0 corresponds to a binary model (no term frequency), and a large value corresponds to using raw term frequency. b is another tuning parameter (0 ≤ b ≤ 1) which determines the scaling by document length: b = 1 corresponds to fully scaling the term weight by the document length, while b = 0 corresponds to no length normalization.
IDF(qi) is the IDF (inverse document frequency) weight of the query term qi.
It is usually computed as:
IDF(qi)=log {(N-n(qi)+0.5)/(n(qi)+0.5)}
where N is the total number of documents in the collection, and n(qi) is the number of documents containing qi.
If the query is long, then we might also use similar weighting for the query terms. This is appropriate if the queries are paragraph-long information needs, but unnecessary for short queries.
score(D,Q)=Summation(i=1 to n) IDF(qi) * {(f(qi,D)*(k1+1))/(f(qi,D)+k1*(1-b+b*|D|/avgdl))} * {(k3+1)f(qi,Q)/(k3+f(qi,Q))}
k3 being another positive tuning parameter that this time calibrates term frequency scaling of the query.
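A sketch of the first scoring function above (without the k3 query-term component) in Python, assuming the natural log and illustrative values for k1 and b; note that this unsmoothed IDF can go negative for terms occurring in more than half of the documents, which real implementations often guard against:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """BM25 score of one tokenised document for a query; corpus is a list of tokenised documents."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    N = len(corpus)
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)                   # number of docs containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))            # can go negative if q is very common
        f = doc_terms.count(q)                                   # term frequency in the document
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)    # length-normalised saturation
        score += idf * f * (k1 + 1) / denom
    return score

docs = [
    ["cheap", "flights", "to", "rome"],
    ["paris", "city", "guide"],
    ["budget", "hotels", "in", "london"],
    ["weather", "in", "london"],
]
print(bm25_score(["cheap", "rome"], docs[0], docs))  # ≈ 1.59
```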
Normalized Discounted Cumulative Gain (NDCG)
-->Devised specifically for web search evaluation
-->rewards relevant documents in the top ranked results more heavily than those ranked lower
For a query q:
Let r_q(j) be the relevance level of the document ranked j with respect to q.
The response list is inspected up to rank L.
NDCG_q = Z_q * Summation(j=1 to L) (2^r_q(j) − 1) / log(1+j)
Z_q is a normalization factor chosen so that a perfect ordering of the results for the query receives a score of one.
Suppose we run the same query on two different search engines and get the ranked relevance levels below. Both result lists contain the same documents, but in Result1 the highly relevant document appears at an earlier position (counting the first position as 1) than in Result2, so Result1 receives a greater score. This is where the discount factor (the 1/log(1+j) term) plays its role.
Result1 = 3 1 1
Gain1 = 7 1 1
Discount1 = 1 0.63 0.5
Discounted gain1 = 7 0.63 0.5 --> DCG1 = 7 + 0.63 + 0.5 = 8.13
Result2 = 1 3 1
Gain2 = 1 7 1
Discount2 = 1 0.63 0.5
Discounted gain2 = 1 4.41 0.5 --> DCG2 = 1 + 4.41 + 0.5 = 5.91
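A sketch of DCG/NDCG in Python under the formula above, assuming log base 2 (which matches the 0.63 and 0.5 discounts in the example) and using the ideal ordering as the normalization factor Z_q:

```python
import math

def dcg(relevances, L=None):
    """Discounted cumulative gain: sum of (2^rel - 1) / log2(1 + rank) up to rank L."""
    L = L or len(relevances)
    return sum((2 ** r - 1) / math.log2(1 + j) for j, r in enumerate(relevances[:L], start=1))

def ndcg(relevances, L=None):
    """Normalise by the DCG of the ideal (descending-relevance) ordering, so a perfect ranking scores 1."""
    ideal = dcg(sorted(relevances, reverse=True), L)
    return dcg(relevances, L) / ideal if ideal else 0.0

print(round(dcg([3, 1, 1]), 2), round(dcg([1, 3, 1]), 2))    # 8.13 vs 5.92
print(round(ndcg([3, 1, 1]), 2), round(ndcg([1, 3, 1]), 2))  # 1.0 vs 0.73
```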