Divergence From Randomness Model
The model is based on the hypothesis that INFORMATIVE WORDS are witnessed by an ELITE SET of documents, in which these words occur to a relatively greater extent than in the rest of the documents.
On the other hand, there are words that do not possess an elite set of documents, and their frequency across the collection therefore follows a random distribution.
There are three main components of a DFR model:
1) Informative content
This is an inverse measure of the probability of having, "by chance", a certain number of occurrences of a term within a document, according to the model of randomness being used.
2) Gain
Measures the probability of encountering another occurrence of the term, given that tf occurrences have already been seen. This probability tends to 1 (i.e. becomes a certainty) as tf increases, which mimics the distribution of informative words.
3) Term frequency normalisation
Takes into account the relationship between document length and term frequency within a document, so that shorter documents are not penalised relative to longer ones, which naturally accumulate higher raw term frequencies. A common instantiation is sketched just after this list.
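This post does not give a formula for the normalisation, but a common choice in the DFR framework is Normalisation 2, which rescales tf by the logarithm of the ratio between the average and the actual document length. The sketch below is illustrative only: the function name, the constant c, and the example numbers are assumptions, not values from this post.

```python
import math

def tf_normalisation_2(tf, doc_len, avg_len, c=1.0):
    """DFR Normalisation 2: tfn = tf * log2(1 + c * avg_len / doc_len).
    Shorter-than-average documents have their raw tf boosted,
    longer-than-average documents have it damped."""
    return tf * math.log2(1.0 + c * avg_len / doc_len)

# A document half the average length: a raw tf of 3 becomes ~4.75 (illustrative numbers).
print(tf_normalisation_2(tf=3, doc_len=150, avg_len=300))
```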
The DFR models are based on this simple idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d". In other words, the term weight is inversely related to the probability of the term frequency within the document d, as obtained by a model M of randomness:
weight(t|d) ∝ −log Prob_M(t ∈ d | Collection)
where the subscript M stands for the type of model of randomness employed to compute the probability. To choose an appropriate model M of randomness, we can use different urn models; there are many ways to choose M, and each provides a basic DFR model.
If the model M is the binomial distribution, then the basic model is P and computes the value
−log Prob_P(t ∈ d | Collection) = −log [ C(TF, tf) · p^tf · q^(TF − tf) ]
where:
TF is the term-frequency of the term t in the Collection
tf is the term-frequency of the term t in the document d
N is the number of documents in the Collection
p = 1/N and q = 1 − p
C(TF, tf) is the binomial coefficient "TF choose tf"
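As a small illustration, here is a sketch of the model P weight in Python, computed in log space for numerical stability. The function name and the counts in the example are hypothetical; only the formula itself comes from the definitions above. Base-2 logarithms are used, as is conventional in the DFR literature; any fixed base only rescales the weight.

```python
import math

def binomial_p_weight(TF, tf, N):
    """Informative content under model P:
    -log2[ C(TF, tf) * p^tf * q^(TF - tf) ], with p = 1/N and q = 1 - p."""
    p = 1.0 / N
    q = 1.0 - p
    # Log of the binomial probability, summed term by term in log space.
    log_prob = (math.log2(math.comb(TF, tf))
                + tf * math.log2(p)
                + (TF - tf) * math.log2(q))
    return -log_prob

# A term occurring 100 times in a 10,000-document collection,
# 5 of those occurrences in the document being scored (hypothetical counts).
print(binomial_p_weight(TF=100, tf=5, N=10_000))
```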
Gain
When a rare term does not occur in a document, it has almost zero probability of being informative for that document. On the contrary, if a rare term occurs many times in a document, then it has a very high probability (almost a certainty) of being informative for the topic described by the document. If the term frequency in the document is high, then the risk of the term not being informative is minimal. In such a case we get a high weight, but a minimal risk also has the negative effect of providing a small information gain. Therefore, instead of using the full weight, we tune or smooth the weight by considering only the portion of it that corresponds to the amount of information gained with the term:
gain(t|d) = P_risk · (−log Prob_M(t ∈ d | Collection))
The more the term occurs in the elite set, the less its term frequency is due to randomness, and thus the smaller the probability P_risk is, that is:
P_risk = 1 − Prob(t ∈ d | d ∈ Elite set)
Two models are used for computing the information gain with a term within a document: the Laplace model L and the ratio of two Bernoulli processes B:
P_risk = 1 / (tf + 1)            (Laplace model L)
P_risk = TF / (df · (tf + 1))    (ratio B of two Bernoulli processes)
where df is the number of documents containing the term.
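Continuing the sketch above, the two first-normalisation models can be written directly from these formulas. binomial_p_weight is the hypothetical function from the earlier sketch, and df = 40 is an illustrative count, not a value from this post.

```python
def p_risk_laplace(tf):
    """Laplace model L: P_risk = 1 / (tf + 1)."""
    return 1.0 / (tf + 1)

def p_risk_bernoulli(TF, df, tf):
    """Ratio B of two Bernoulli processes: P_risk = TF / (df * (tf + 1))."""
    return TF / (df * (tf + 1))

def gain(p_risk, informative_content):
    """gain(t|d) = P_risk * (-log Prob_M(t in d | Collection))."""
    return p_risk * informative_content

w = binomial_p_weight(TF=100, tf=5, N=10_000)  # defined in the earlier sketch
print(gain(p_risk_laplace(tf=5), w))                    # Laplace model L
print(gain(p_risk_bernoulli(TF=100, df=40, tf=5), w))   # Bernoulli ratio B
```

Note how a high tf shrinks P_risk under both models: the weight is large but the smoothing factor is small, exactly the trade-off described in the Gain paragraph above.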