Probabilistic IR-BIR and BII ~ Blog on IR, NLP and Related Areas<br>by<br>Shailesh Pandey

Probabilistic Models

Goal: To estimate the probability that a document is relevant to user queries.

Probabilistic Ranking Principle (PRP)

The order in which to present documents to the user is to rank documents by their estimated probability of relevance with respect to the information need i.e. P(R=1|,d,q).

In the simple case of PRP we assume there is no retrieval cost associated that would differentially weigh action or errors.

If we just want to return a set of retrieval results, the Bayes Optimal Decision Rule is to use:

d is relevant iff P(R=1|,d,q)>P(R=0|,d,q)

PRP with retrieval cost

Let C1 be the cost of retrieval of a relevant document and C0 be the cost of retrieval of a non-relevant document. Then PRP says that if for a specific document d and for all documents d' not yet retrieved

C1*P(R=1|d)+C0*P(R=0|d)<=C1*P(R=1|d')+C0*P(R=0|d')

then d is the next document to be retrieved.

Binary Independence Retrieval Model

Documents and queries are both represented as binary term vector i.e. a document is represented by ~x=(x1,...,xn) where xt=1 if term t is present in the document d and xt=0 otherwise. Similarly for q we represent it by ~q.

Independence means that there is no association between terms.

Assume that the relevance of each document is independent of the relevance of other documents.

P(R=1|~x,~q)={P(~x|R=1,~q) * P(R=1|~q)}/P(~x,~q)

P(R=0|~x,~q)={P(~x|R=0,~q) * P(R=0|~q)}/P(~x,~q)

Ranking functions for query terms

As we are interested only in the ranking of documents, we use some other quantities that are easier to compute like odds.

O(R|~x,~q)=P(R=1|~x,~q)/P(R=0|~x,~q)

={ P(R=1|~q)/P(R=0|~q) } * { P(~x|R=1,~q)/P(~x|R=0,~q) }

The first component is a constant for a given query so there is no need to estimate it.

To estimate the second part of the equation, we make the Naive Bayes Conditional Independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word.

P(~x|R=1,~q)/P(~x|R=0,~q)=∏(t=1 to M) { P(xt|R=1,~q)/P(xt|R=0,~q) }

So,

O(R|~x,~q)=O(R|~q) *∏ (t=1 to M) P(xt|R=1,~q)/P(xt|R=0,~q)

Since xt=0 or 1, we can separate the two cases:

O(R|~x,~q)=O(R|~q) ∏(t:xt=1) { P(xt=1|R=1,~q)/P(xt=1|R=0,~q) } * ∏(t:xt=0) { P(xt=0|R=1,~q)/P(xt=0|R=0,~q) }

Let,

pt=P(xt=1|R=1,~q) be the probability of a term occurring in a document relevant to a query.

ut=P(xt=1|R=0,~q) be the probability of a term occurring in a document not-relevant to a query.

We make another assumption:

Terms not occurring in the qury are equally likely to occur in relevant and non-relevant document I.e. If qt=0 then pt=ut

So,

O(R|~q,~x)=O(R,~q) * ∏(t:xt=qt=1) pt/ut * ∏(t:xt=0,qt=1) *(1-pt)/(1-ut)

=O(R,~q) * ∏(t:xt=qt=1) { pt * (1-ut) }/{ ut * (1-pt) } * ∏(t:qt=1) (1-pt)/(1-qt)

The second part in the above equation is a constant for a query as it computes only for query terms.

Retrieval Status Value (RSV) in this model:

RSVd=log ∏(t:xt=qt=1) { pt*(1-ut) }/{ ut * (1-pt) }

=∑(t:xt=qt=1) log {(pt*(1-ut))/(ut*(1-pt))}

Let Ct=log {(pt*(1-ut))/(ut*(1-pt))}

=log pt/(1-pt) + log ut/(1-ut)

The Ct quantities function as term weights in this model and the document score for a query is

RSVd=∑(t:xt=qt=1) Ct

Let,

s=Number of relevant documents containing xt

S=total number of relevant documents

nt=Number of document containing term t

N=Number of documents presented to the user

pt=s/S

ut=(nt-s)/(N-S)

In practice ut=nt/N

Assuming that relevant documents are a very small percentage of the collection.

Binary Independence Indexing Model (BII)

BIR regards a single query with respect to a number of documents, the BII model observes one document in relation to a number of queries submitted.

BII model yields the same ranking for two different queries formulated with same set of terms.

Let

~Zk=(zk1,......,zkn)

where Zki=1 if ti є q _k^t and Zki=0 otherwise

So,

P(R|qk,dm)=P(R|~Zk,dm)

Using Bayes' theorem,

P(R|~Zk,dm)=P(R|dm) * { P(~Zk|R,dm)/P(~Zk|dm) }

Assuming that the distribution of terms in all queries to which a document with representation dm is relevant is independent

P(~Zk|R,dm)=∏(i=1 to n) P(zki|R,dm)

Assuming that the relevance of a document dm w.r.t. A query qk depends only on the terms from q _k^t

and not on other terms, we get:

P(R,~Zk,dm)=∏(i=1 to n) P(zki)/P(Zk) * P(R|dm) * ∏(zki=1) { P(R|zki=1,dm)/P(R|dm) } * ∏(zki=0) { P(R|zki=0,dm)/P(R|dm) }

The first term is constant for a given query.

Assuming P(R|ti,dm)=P(R|dm) for all ti not in dm^T

P(R|qk,dm)=ck * P(R|dm) * ∏(ti єqk^T ∩ dm^T) P(R|ti,dm)/P(R|dm)

Blog on IR, NLP and Related Areas
by
Shailesh Pandey

Blog primarily created to jot down things I have observed, discussed or learned that are relevant to my research. Comments are welcomed.

Sunday, September 23, 2007