Probabilistic Models
Goal: To estimate the probability that a document is relevant to user queries.
Probabilistic Ranking Principle (PRP)
The order in which to present documents to the user is to rank documents by their estimated probability of relevance with respect to the information need i.e. P(R=1|,d,q).
In the simple case of PRP we assume there is no retrieval cost associated that would differentially weigh action or errors.
If we just want to return a set of retrieval results, the Bayes Optimal Decision Rule is to use:
d is relevant iff P(R=1|,d,q)>P(R=0|,d,q)
PRP with retrieval cost
Let C1 be the cost of retrieval of a relevant document and C0 be the cost of retrieval of a non-relevant document. Then PRP says that if for a specific document d and for all documents d' not yet retrieved
C1*P(R=1|d)+C0*P(R=0|d)<=C1*P(R=1|d')+C0*P(R=0|d')
then d is the next document to be retrieved.
Binary Independence Retrieval Model
Documents and queries are both represented as binary term vector i.e. a document is represented by ~x=(x1,...,xn) where xt=1 if term t is present in the document d and xt=0 otherwise. Similarly for q we represent it by ~q.
Independence means that there is no association between terms.
Assume that the relevance of each document is independent of the relevance of other documents.
P(R=1|~x,~q)={P(~x|R=1,~q) * P(R=1|~q)}/P(~x,~q)
P(R=0|~x,~q)={P(~x|R=0,~q) * P(R=0|~q)}/P(~x,~q)
Ranking functions for query terms
As we are interested only in the ranking of documents, we use some other quantities that are easier to compute like odds.
O(R|~x,~q)=P(R=1|~x,~q)/P(R=0|~x,~q)
={ P(R=1|~q)/P(R=0|~q) } * { P(~x|R=1,~q)/P(~x|R=0,~q) }
The first component is a constant for a given query so there is no need to estimate it.
To estimate the second part of the equation, we make the Naive Bayes Conditional Independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word.
P(~x|R=1,~q)/P(~x|R=0,~q)=∏(t=1 to M) { P(xt|R=1,~q)/P(xt|R=0,~q) }
So,
O(R|~x,~q)=O(R|~q) *∏ (t=1 to M) P(xt|R=1,~q)/P(xt|R=0,~q)
Since xt=0 or 1, we can separate the two cases:
O(R|~x,~q)=O(R|~q) ∏(t:xt=1) { P(xt=1|R=1,~q)/P(xt=1|R=0,~q) } * ∏(t:xt=0) { P(xt=0|R=1,~q)/P(xt=0|R=0,~q) }
Let,
pt=P(xt=1|R=1,~q) be the probability of a term occurring in a document relevant to a query.
ut=P(xt=1|R=0,~q) be the probability of a term occurring in a document not-relevant to a query.
We make another assumption:
Terms not occurring in the qury are equally likely to occur in relevant and non-relevant document I.e. If qt=0 then pt=ut
So,
O(R|~q,~x)=O(R,~q) * ∏(t:xt=qt=1) pt/ut * ∏(t:xt=0,qt=1) *(1-pt)/(1-ut)
=O(R,~q) * ∏(t:xt=qt=1) { pt * (1-ut) }/{ ut * (1-pt) } * ∏(t:qt=1) (1-pt)/(1-qt)
The second part in the above equation is a constant for a query as it computes only for query terms.
Retrieval Status Value (RSV) in this model:
RSVd=log ∏(t:xt=qt=1) { pt*(1-ut) }/{ ut * (1-pt) }
=∑(t:xt=qt=1) log {(pt*(1-ut))/(ut*(1-pt))}
Let Ct=log {(pt*(1-ut))/(ut*(1-pt))}
=log pt/(1-pt) + log ut/(1-ut)
The Ct quantities function as term weights in this model and the document score for a query is
RSVd=∑(t:xt=qt=1) Ct
Let,
s=Number of relevant documents containing xt
S=total number of relevant documents
nt=Number of document containing term t
N=Number of documents presented to the user
pt=s/S
ut=(nt-s)/(N-S)
In practice ut=nt/N
Assuming that relevant documents are a very small percentage of the collection.
Binary Independence Indexing Model (BII)
BIR regards a single query with respect to a number of documents, the BII model observes one document in relation to a number of queries submitted.
BII model yields the same ranking for two different queries formulated with same set of terms.
Let
~Zk=(zk1,......,zkn)
where Zki=1 if ti є q kt and Zki=0 otherwise
So,
P(R|qk,dm)=P(R|~Zk,dm)
Using Bayes' theorem,
P(R|~Zk,dm)=P(R|dm) * { P(~Zk|R,dm)/P(~Zk|dm) }
Assuming that the distribution of terms in all queries to which a document with representation dm is relevant is independent
P(~Zk|R,dm)=∏(i=1 to n) P(zki|R,dm)
Assuming that the relevance of a document dm w.r.t. A query qk depends only on the terms from q kt
and not on other terms, we get:
P(R,~Zk,dm)=∏(i=1 to n) P(zki)/P(Zk) * P(R|dm) * ∏(zki=1) { P(R|zki=1,dm)/P(R|dm) } * ∏(zki=0) { P(R|zki=0,dm)/P(R|dm) }
The first term is constant for a given query.
Assuming P(R|ti,dm)=P(R|dm) for all ti not in dmT
P(R|qk,dm)=ck * P(R|dm) * ∏(ti єqkT ∩ dmT) P(R|ti,dm)/P(R|dm)



.gif)
0 comments:
Post a Comment