Divergence From Randomness Model
The model is based on the hypothesis that INFORMATIVE WORDS are witnessed by an ELITE SET of documents, in which these words occur to a relatively greater extent than in the rest of the documents.
On the other hand, there are words that do not possess an elite set of documents, and their frequency across the collection therefore follows a random distribution.
There are three main components of a DFR model:
1) Informative content
This is an inverse measure of the probability of having, "by chance", a certain number of occurrences of a term within a document, according to the model of randomness being used.
2) Gain
Measures the probability of encountering another occurrence of the term, given that tf occurrences have already been seen. This probability tends to 1 (i.e. becomes a certainty) as tf increases, which mimics the distribution of informative words.
3) Term frequency normalisation
Takes into account the relationship between document length and term frequency within a document, so that shorter documents are not penalised relative to longer ones, which naturally accumulate higher raw term frequencies. A common instantiation is sketched just after this list.
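This post does not give a formula for the normalisation, but a common choice in the DFR framework is Normalisation 2, which rescales tf by the logarithm of the ratio between the average and the actual document length. The sketch below is illustrative only: the function name, the constant c, and the example numbers are assumptions, not values from this post.

```python
import math

def tf_normalisation_2(tf, doc_len, avg_len, c=1.0):
    """DFR Normalisation 2: tfn = tf * log2(1 + c * avg_len / doc_len).
    Shorter-than-average documents have their raw tf boosted,
    longer-than-average documents have it damped."""
    return tf * math.log2(1.0 + c * avg_len / doc_len)

# A document half the average length: a raw tf of 3 becomes ~4.75 (illustrative numbers).
print(tf_normalisation_2(tf=3, doc_len=150, avg_len=300))
```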
The DFR models are based on this simple idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d". In other words, the term weight is inversely related to the probability of the term frequency within the document d, as obtained by a model M of randomness:
weight(t|d) ∝ −log Prob_M(t ∈ d | Collection)
where the subscript M stands for the type of model of randomness employed to compute the probability. To choose an appropriate model M of randomness, we can use different urn models; there are many ways to choose M, and each provides a basic DFR model.
If the model M is the binomial distribution, then the basic model is P and computes the value
−log Prob_P(t ∈ d | Collection) = −log [ C(TF, tf) · p^tf · q^(TF − tf) ]
where:
TF is the term-frequency of the term t in the Collection
tf is the term-frequency of the term t in the document d
N is the number of documents in the Collection
p = 1/N and q = 1 − p
C(TF, tf) is the binomial coefficient "TF choose tf"
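As a small illustration, here is a sketch of the model P weight in Python, computed in log space for numerical stability. The function name and the counts in the example are hypothetical; only the formula itself comes from the definitions above. Base-2 logarithms are used, as is conventional in the DFR literature; any fixed base only rescales the weight.

```python
import math

def binomial_p_weight(TF, tf, N):
    """Informative content under model P:
    -log2[ C(TF, tf) * p^tf * q^(TF - tf) ], with p = 1/N and q = 1 - p."""
    p = 1.0 / N
    q = 1.0 - p
    # Log of the binomial probability, summed term by term in log space.
    log_prob = (math.log2(math.comb(TF, tf))
                + tf * math.log2(p)
                + (TF - tf) * math.log2(q))
    return -log_prob

# A term occurring 100 times in a 10,000-document collection,
# 5 of those occurrences in the document being scored (hypothetical counts).
print(binomial_p_weight(TF=100, tf=5, N=10_000))
```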
Gain
When a rare term does not occur in a document, it has almost zero probability of being informative for that document. On the contrary, if a rare term occurs many times in a document, then it has a very high probability (almost a certainty) of being informative for the topic described by the document. If the term frequency in the document is high, then the risk of the term not being informative is minimal. In such a case we get a high weight, but a minimal risk also has the negative effect of providing a small information gain. Therefore, instead of using the full weight, we tune or smooth the weight by considering only the portion of it that corresponds to the amount of information gained with the term:
gain(t|d) = P_risk · (−log Prob_M(t ∈ d | Collection))
The more the term occurs in the elite set, the less its term frequency is due to randomness, and thus the smaller the probability P_risk is, that is:
P_risk = 1 − Prob(t ∈ d | d ∈ Elite set)
Two models are used for computing the information gain with a term within a document: the Laplace model L and the ratio of two Bernoulli processes B:
P_risk = 1 / (tf + 1)            (Laplace model L)
P_risk = TF / (df · (tf + 1))    (ratio B of two Bernoulli processes)
where df is the number of documents containing the term.
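Continuing the sketch above, the two first-normalisation models can be written directly from these formulas. binomial_p_weight is the hypothetical function from the earlier sketch, and df = 40 is an illustrative count, not a value from this post.

```python
def p_risk_laplace(tf):
    """Laplace model L: P_risk = 1 / (tf + 1)."""
    return 1.0 / (tf + 1)

def p_risk_bernoulli(TF, df, tf):
    """Ratio B of two Bernoulli processes: P_risk = TF / (df * (tf + 1))."""
    return TF / (df * (tf + 1))

def gain(p_risk, informative_content):
    """gain(t|d) = P_risk * (-log Prob_M(t in d | Collection))."""
    return p_risk * informative_content

w = binomial_p_weight(TF=100, tf=5, N=10_000)  # defined in the earlier sketch
print(gain(p_risk_laplace(tf=5), w))                    # Laplace model L
print(gain(p_risk_bernoulli(TF=100, df=40, tf=5), w))   # Bernoulli ratio B
```

Note how a high tf shrinks P_risk under both models: the weight is large but the smoothing factor is small, exactly the trade-off described in the Gain paragraph above.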