Sunday, September 23, 2007

Probabilistic IR-DFR

Divergence From Randomness Model

The model is based on the hypothesis that INFORMATIVE WORDS are characterised by an ELITE SET of documents, in which these words occur to a relatively greater extent than in the rest of the documents.

On the other hand, there are words that do not possess an elite set of documents, and thus their frequency follows a random distribution.


There are three main important components of a DFR model:

1) Informative content

This is an inverse measure of the probability of observing, purely by chance, a given number of occurrences of a term within a document, according to the model of randomness being used.


2) Gain

Measures the probability of coming across another occurrence of the term given that 'tf' occurrences have already been seen. This probability tends to 1 (i.e. becomes a certainty) as 'tf' increases. This mimics the property of informative word distribution.


3) Term frequency normalisation

Takes into account the relationship between document length and within-document term frequency, so that shorter documents are not penalised relative to longer documents in the ranking.


The DFR models are based on this simple idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d". In other words the term-weight is inversely related to the probability of term-frequency within the document d obtained by a model M of randomness:


weight(t|d) ∝ -log ProbM(t∈d | Collection)


where the subscript M stands for the type of model of randomness employed to compute the probability. In order to choose the appropriate model M of randomness, we can use different urn models. There are many ways to choose M, each of these provides a basic DFR model.

If the model M is the binomial distribution, then the basic model is P and computes the value


-log ProbP(t∈d | Collection) = -log [ (TF choose tf) * p^tf * q^(TF-tf) ]


where:

  • TF is the term-frequency of the term t in the Collection

  • tf is the term-frequency of the term t in the document d

  • N is the number of documents in the Collection

  • p is 1/N and q=1-p

Gain

When a rare term does not occur in a document, it has almost zero probability of being informative for that document. On the contrary, if a rare term occurs many times in a document, it has a very high probability (almost a certainty) of being informative for the topic described by the document. If the term frequency in the document is high, then the risk of the term not being informative is minimal. In such a case we get a high weight, but a minimal risk also has the negative effect of providing a small information gain. Therefore, instead of using the full weight, we tune or smooth the weight by considering only the portion of it that corresponds to the amount of information gained with the term:

gain(t|d) = Prisk * (-log ProbM(t∈d | Collection))

The more the term occurs in the elite set, the less its term frequency is due to randomness, and thus the smaller the probability Prisk is, that is:


Prisk = 1 - Prob(t∈d | d ∈ Elite set)




Two models are used for computing the information gain of a term within a document: the Laplace model L and the ratio B of two Bernoulli processes:


Prisk=1/(tf+1) (Laplace model)

Prisk=TF/{ df*(tf+1) } (ratio B of two binomial distributions)


where df is the number of documents containing the term.
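To make the pieces above concrete, here is a minimal Python sketch (the function names are my own) of a basic DFR weight: the informative content under the binomial model P, smoothed by the Bernoulli-ratio gain B. Term frequency normalisation is deliberately left out, and TF, tf, N and df are as defined above.

import math

def binomial_self_information(TF, tf, N):
    # -log2 ProbP(t in d | Collection): probability of seeing tf occurrences in a
    # document when the term occurs TF times in a collection of N documents,
    # under the binomial model with p = 1/N and q = 1 - p.
    p = 1.0 / N
    q = 1.0 - p
    log_choose = math.lgamma(TF + 1) - math.lgamma(tf + 1) - math.lgamma(TF - tf + 1)
    log_prob = log_choose + tf * math.log(p) + (TF - tf) * math.log(q)
    return -log_prob / math.log(2)   # natural log -> base 2

def dfr_weight(TF, tf, N, df):
    # informative content smoothed by the Bernoulli-ratio gain (model B);
    # no term frequency normalisation is applied in this sketch
    p_risk = TF / (df * (tf + 1.0))
    return p_risk * binomial_self_information(TF, tf, N)

# toy example: a term with 100 collection occurrences, 5 occurrences in this
# document and 80 documents containing it, in a 10,000-document collection
print(dfr_weight(TF=100, tf=5, N=10_000, df=80))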






Probabilistic IR-BIR and BII

Probabilistic Models

Goal: To estimate the probability that a document is relevant to user queries.


Probabilistic Ranking Principle (PRP)

Documents should be presented to the user ranked by their estimated probability of relevance with respect to the information need, i.e. P(R=1|d,q).


In the simplest case of the PRP we assume there is no retrieval cost that would differentially weight actions or errors.

If we just want to return a set of retrieval results, the Bayes Optimal Decision Rule is to use:

d is relevant iff P(R=1|d,q) > P(R=0|d,q)


PRP with retrieval cost

Let C1 be the cost of retrieving a relevant document and C0 the cost of retrieving a non-relevant document. Then the PRP says that if, for a specific document d and for all documents d' not yet retrieved,

C1*P(R=1|d)+C0*P(R=0|d)<=C1*P(R=1|d')+C0*P(R=0|d')

then d is the next document to be retrieved.
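As a small illustration (not from the original notes), the rule above can be read as: among the documents not yet retrieved, return the one with the lowest expected cost. A sketch in Python, assuming p_rel holds the estimated P(R=1|d) for each candidate:

def next_document(candidates, p_rel, C1, C0):
    # pick the not-yet-retrieved document with the lowest expected retrieval cost
    # C1*P(R=1|d) + C0*P(R=0|d), as the PRP with costs prescribes
    return min(candidates, key=lambda d: C1 * p_rel[d] + C0 * (1 - p_rel[d]))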


Binary Independence Retrieval Model

Documents and queries are both represented as binary term vectors, i.e. a document is represented by ~x=(x1,...,xn) where xt=1 if term t is present in the document d and xt=0 otherwise. Similarly, the query q is represented by ~q.


Independence means that there is no association between terms.

Assume that the relevance of each document is independent of the relevance of other documents.


P(R=1|~x,~q)={P(~x|R=1,~q) * P(R=1|~q)}/P(~x|~q)

P(R=0|~x,~q)={P(~x|R=0,~q) * P(R=0|~q)}/P(~x|~q)


Ranking functions for query terms

As we are interested only in the ranking of documents, we use some other quantities that are easier to compute like odds.

O(R|~x,~q)=P(R=1|~x,~q)/P(R=0|~x,~q)

={ P(R=1|~q)/P(R=0|~q) } * { P(~x|R=1,~q)/P(~x|R=0,~q) }


The first component is a constant for a given query so there is no need to estimate it.


To estimate the second part of the equation, we make the Naive Bayes Conditional Independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word.


P(~x|R=1,~q)/P(~x|R=0,~q)=∏(t=1 to M) { P(xt|R=1,~q)/P(xt|R=0,~q) }


So,

O(R|~x,~q)=O(R|~q) * ∏(t=1 to M) P(xt|R=1,~q)/P(xt|R=0,~q)


Since xt=0 or 1, we can separate the two cases:

O(R|~x,~q)=O(R|~q) * ∏(t:xt=1) { P(xt=1|R=1,~q)/P(xt=1|R=0,~q) } * ∏(t:xt=0) { P(xt=0|R=1,~q)/P(xt=0|R=0,~q) }


Let,

pt=P(xt=1|R=1,~q) be the probability of a term occurring in a document relevant to a query.

ut=P(xt=1|R=0,~q) be the probability of a term occurring in a document not-relevant to a query.


We make another assumption:

Terms not occurring in the query are equally likely to occur in relevant and non-relevant documents, i.e. if qt=0 then pt=ut.


So,

O(R|~x,~q)=O(R|~q) * ∏(t:xt=qt=1) pt/ut * ∏(t:xt=0,qt=1) (1-pt)/(1-ut)

=O(R|~q) * ∏(t:xt=qt=1) { pt*(1-ut) }/{ ut*(1-pt) } * ∏(t:qt=1) (1-pt)/(1-ut)

The second product in the last equation is a constant for a given query, since it runs over the query terms only and does not depend on the document.


Retrieval Status Value (RSV) in this model:

RSVd=log ∏(t:xt=qt=1) { pt*(1-ut) }/{ ut*(1-pt) }

=∑(t:xt=qt=1) log { pt*(1-ut) }/{ ut*(1-pt) }


Let Ct=log { pt*(1-ut) }/{ ut*(1-pt) }

=log pt/(1-pt) + log (1-ut)/ut


The Ct quantities function as term weights in this model and the document score for a query is

RSVd=∑(t:xt=qt=1) Ct


Let,

s=Number of relevant documents containing xt

S=total number of relevant documents

nt=Number of documents containing term t

N=Total number of documents in the collection

pt=s/S

ut=(nt-s)/(N-S)


In practice we use ut=nt/N, assuming that relevant documents are a very small percentage of the collection.
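A small Python sketch of the term weights Ct and the resulting RSV. The +0.5 smoothing is an assumption added here (the notes use the raw ratios), since pt or ut can otherwise be 0 or 1 and make the log undefined.

import math

def c_t(s, S, nt, N, eps=0.5):
    # Ct = log [ pt*(1-ut) / (ut*(1-pt)) ], smoothed to keep the log finite
    pt = (s + eps) / (S + 2 * eps)            # P(term present | relevant)
    ut = (nt - s + eps) / (N - S + 2 * eps)   # P(term present | non-relevant)
    return math.log(pt * (1 - ut) / (ut * (1 - pt)))

def rsv(doc_terms, query_terms, stats):
    # RSVd: sum of Ct over terms present in both the document and the query;
    # stats maps each term to its (s, S, nt, N) counts
    return sum(c_t(*stats[t]) for t in query_terms if t in doc_terms)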


Binary Independence Indexing Model (BII)

Whereas the BIR model regards a single query with respect to a number of documents, the BII model observes one document in relation to a number of submitted queries.


The BII model yields the same ranking for two different queries formulated with the same set of terms.


Let

~Zk=(zk1,...,zkn)

where zki=1 if ti ∈ qkT and zki=0 otherwise

So,

P(R|qk,dm)=P(R|~Zk,dm)


Using Bayes' theorem,

P(R|~Zk,dm)=P(R|dm) * { P(~Zk|R,dm)/P(~Zk|dm) }


Assuming that terms are distributed independently across all queries to which a document with representation dm is relevant:

P(~Zk|R,dm)=∏(i=1 to n) P(zki|R,dm)


Assuming that the relevance of a document dm with respect to a query qk depends only on the terms in qkT and not on other terms, we get:


P(R|~Zk,dm)=∏(i=1 to n) { P(zki)/P(~Zk) } * P(R|dm) * ∏(zki=1) { P(R|zki=1,dm)/P(R|dm) } * ∏(zki=0) { P(R|zki=0,dm)/P(R|dm) }


The first term is constant for a given query.


Assuming P(R|ti,dm)=P(R|dm) for all ti not in dmT

P(R|qk,dm)=ck * P(R|dm) * ∏(ti ∈ qkT ∩ dmT) P(R|ti,dm)/P(R|dm)
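A minimal sketch of this final BII scoring formula, assuming the per-term probabilities P(R|ti,dm) and the document prior P(R|dm) have already been estimated somehow; the constant ck only scales the score and does not affect the ranking.

def bii_score(query_terms, doc_terms, p_rel_given_term, p_rel_doc, c_k=1.0):
    # P(R|qk,dm) = ck * P(R|dm) * product over ti in qkT ∩ dmT of P(R|ti,dm)/P(R|dm)
    score = c_k * p_rel_doc
    for t in query_terms:
        if t in doc_terms:
            score *= p_rel_given_term[t] / p_rel_doc
    return score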

Wednesday, September 12, 2007

Performance Measures

Measures of Performance


Precision & Recall

Precision=|{relevant docs}∩{retrieved docs}| / |{retrieved docs}|

Recall=|{relevant docs}∩{retrieved docs}| / |{relevant docs in collection}|


Confusion Matrix

True +ve (TP): Number of relevant docs that system retrieved

False +ve (FP): Number of non-relevant docs that system retrieved

True -ve (TN): Number of non-relevant docs that the system did not retrieve

False -ve (FN): Number of relevant docs that the system did not retrieve



                 Relevant    Non-Relevant
Retrieved          TP            FP
Not Retrieved      FN            TN


Precision=TP/(TP+FP)

Recall=TP/(TP+FN)


F-Score

Fα=1/{(α*1/P)+(1-α)*1/R}


Where,

α --> weighting factor

A higher α means that precision is weighted more heavily; a lower α means that recall is weighted more heavily (with α=1, Fα reduces to P; with α=0, to R).
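A quick sketch of the three measures from the confusion-matrix counts above (α=0.5 gives the usual balanced F measure):

def precision_recall_f(tp, fp, fn, alpha=0.5):
    # precision, recall and Fα computed from TP, FP and FN as defined above
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f = 1.0 / (alpha / p + (1 - alpha) / r) if p and r else 0.0
    return p, r, f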


User Oriented Measures

Coverage=Relevant docs retrieved and known to user/Relevant docs known to user


Novelty=Relevant docs unknown to user/relevant docs retrieved


Problems with precision & recall

  1. Users generally do not really care about recall.

  2. P & R do not take into account the order of answers.

  3. They do not take into account the degree of relevance.

  4. What to do if a denominator is zero?

  5. Measuring Recall involves identifying relevant docs in the corpus which is difficult.


Average Precision

AP = (1/R) * ∑(k=1 to |S|) rel(sk) * P@k

P@k = (1/k) * ∑(i=1 to k) rel(si)

R = ∑(d ∈ D) rel(d)


Where,

D=Corpus of documents

S=returned list of documents (s1,s2,...,sn)

P@K=precision of s1,s2,...,sk

rel(d)=”true” relevance

0: non-relevant

1: Relevant

R=Relevant docs in corpus


Example:


Rank       1     2     3     4     5     6     7     8     9     10
Recall     0.2   0.2   0.4   0.4   0.4   0.6   0.6   0.6   0.8   1.0
Precision  1.0   0.5   0.67  0.5   0.4   0.5   0.43  0.38  0.44  0.5


Suppose 1, 3, 6, 9 and 10 are relevant documents.


AP=(1.0+0.67+0.5+0.44+0.5)/5=0.62

(Calculated by averaging the precision at the ranks where recall increases, i.e. where the relevant documents appear.)
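The same calculation as a short Python sketch (assuming, as in the example, that every relevant document in the corpus appears somewhere in the returned list). MAP is then just the mean of these AP values over a set of queries.

def average_precision(relevances):
    # relevances[i] is 1 if the document at rank i+1 is relevant, else 0
    hits, precisions = 0, []
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # P@k at each rank where a relevant doc arrives
    return sum(precisions) / hits if hits else 0.0

# the example above: documents at ranks 1, 3, 6, 9 and 10 are relevant
print(round(average_precision([1, 0, 1, 0, 0, 1, 0, 0, 1, 1]), 2))  # 0.62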


Mean Average Precision (MAP)

-->Take the AP value per query

Calculate the mean score (average over queries)


Mean Reciprocal Rank (MRR)

Suppose we have 3 sample queries for a system that tries to translate English words to their plurals. In each case, the system makes three guesses, with the first one being the one it thinks is most likely correct.


Query    Results                 Correct    Rank    Reciprocal Rank
cat      Catten, cati, cats      cats       3       1/3
torus    Torii, tori, toruses    tori       2       1/2
virus    Viruses, virii, viri    viruses    1       1


MRR=(1/3+1/2+1)/3=0.61
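The corresponding computation as a sketch, where each entry is the rank of the first correct answer for a query (None if no correct answer was returned):

def mean_reciprocal_rank(first_correct_ranks):
    # MRR: mean of 1/rank over queries; queries with no correct answer contribute 0
    return sum(1.0 / r if r else 0.0 for r in first_correct_ranks) / len(first_correct_ranks)

print(round(mean_reciprocal_rank([3, 2, 1]), 2))  # 0.61, matching the table above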


Precision@n

-->Proportion of top n documents that are satisfactory

P@10=0.4 means that 4 of the top 10 documents were satisfactory.

A run of 50 queries would be measured according to mean P@n.


Success@n

-->The proportion of queries for which a correct answer was within the top n.

s@1=0.5 means that for half of the queries, the correct answer was at rank 1.


Relevance Measurement

Term Frequency (tf)

-->tf in a document is simply the number of times a given term appears in that document.

-->This count is usually normalised to prevent a bias towards longer documents which may have a higher tf regardless of the actual importance of that term in the document.


tfi=ni/∑k nk


Where,

ni=number of occurrences of the considered term in the document

The denominator is the total number of term occurrences in the document


Inverse Document Frequency (IDF)

idfi=log2 ( |D| / |{d: ti∈d}| )

Where,

|D|=total number of documents in the corpus

|{d: ti∈d}|=number of documents in which the term ti appears


Then, tf-idf = tf · idf
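A toy tf-idf computation over a list of tokenised documents, using the length-normalised tf and the log2 idf defined above (the tokenisation and variable names here are just illustrative):

import math
from collections import Counter

def tf_idf(docs):
    # docs is a list of token lists; returns one dict per document mapping term -> weight
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency per term
    weights = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        weights.append({t: (c / length) * math.log2(N / df[t]) for t, c in counts.items()})
    return weights

docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog"]]
print(tf_idf(docs)[0])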


BM25

BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document.

score(D,Q)=Summation(i=1 to n) IDF(qi) * {(f(qi,D)*(k1+1))/(f(qi,D)+k1*(1-b+b*|D|/avgdl))}


where f(qi,D) is qi's term frequency in the document D, | D | is the length of the document D, and avgdl is the average document length in the text collection from which documents are drawn. k1 and b are free parameters.

The variable k1 is a positive tuning parameter that calibrates the document term frequency scaling. A k1 value of 0 corresponds to a binary model (no term frequency), and a large value corresponds to using raw term frequency. b is another tuning parameter (0 ≤ b ≤ 1) which determines the scaling by document length: b = 1 corresponds to fully scaling the term weight by the document length, while b = 0 corresponds to no length normalization.

IDF(qi) is the IDF (inverse document frequency) weight of the query term qi.

It is usually computed as:
IDF(qi)=log {(N-n(qi)+0.5)/(n(qi)+0.5)}

where N is the total number of documents in the collection, and n(qi) is the number of documents containing qi.

If the query is long, then we might also use similar weighting for query terms. This is appropriate if the queries are paragraph-long information needs, but unnecessary for short queries.

score(D,Q)=Summation(i=1 to n) IDF(qi) * {(f(qi,D)*(k1+1))/(f(qi,D)+k1*(1-b+b*|D|/avgdl))} * {(k3+1)f(qi,Q)/(k3+f(qi,Q))}

k3 being another positive tuning parameter that this time calibrates term frequency scaling of the query.
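A minimal BM25 sketch following the first (document-only) formula above, without the k3 query-term component. The defaults k1=1.2 and b=0.75 are commonly used values, not something fixed by the formula itself.

import math

def bm25_score(query_terms, doc, corpus_stats, k1=1.2, b=0.75):
    # doc is a list of tokens; corpus_stats supplies N, avgdl and per-term document frequency n
    N, avgdl, n = corpus_stats["N"], corpus_stats["avgdl"], corpus_stats["n"]
    dl = len(doc)
    score = 0.0
    for q in query_terms:
        f = doc.count(q)                      # f(qi, D): term frequency in the document
        if f == 0:
            continue
        idf = math.log((N - n.get(q, 0) + 0.5) / (n.get(q, 0) + 0.5))
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
    return score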

Normalized Discounted Cumulative Gain (NDCG)

-->Devised specifically for web search evaluation

-->rewards relevant documents in the top ranked results more heavily than those ranked lower


For a query q

Let the relevance level of the document ranked j with respect to q be rq(j)

The result list is inspected up to rank L

NDCGq = Zq * ∑(j=1 to L) (2^rq(j) - 1)/log(1+j)

Zq is a normalization factor which ensures that a perfect ordering of the results for the query receives a score of one.

Suppose we try the same query on two different search engines and get the ranked results shown below. Both result lists contain documents with the same relevance grades, but in Result1 the most relevant document appears earlier (at rank 1, counting the first position as 1) than in Result2 (rank 2). So the measure rewards Result1, which receives a greater score. This is where the discount factor (the 1/log(1+j) term, here with log base 2) comes into play.

Result1 = 3 1 1
Gain1 (2^r - 1) = 7 1 1
Discount1 (1/log2(1+j)) = 1 0.63 0.5
Discounted gain1 = 7 0.63 0.5, so DCG1 = 8.13

Result2 = 1 3 1
Gain2 = 1 7 1
Discount2 = 1 0.63 0.5
Discounted gain2 = 1 4.41 0.5, so DCG2 = 5.91
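The same numbers reproduced with a short sketch, using gain 2^r - 1 and a log2 discount as in the example; NDCG divides by the DCG of the ideal ordering so that a perfect ranking scores one.

import math

def dcg(relevances, L=None):
    # discounted cumulative gain up to rank L
    L = L or len(relevances)
    return sum((2 ** r - 1) / math.log2(1 + j)
               for j, r in enumerate(relevances[:L], start=1))

def ndcg(relevances, L=None):
    # normalise by the DCG of the ideal (descending) ordering
    ideal = dcg(sorted(relevances, reverse=True), L)
    return dcg(relevances, L) / ideal if ideal > 0 else 0.0

print(dcg([3, 1, 1]), dcg([1, 3, 1]))   # about 8.13 vs 5.91, as in the example above
print(ndcg([1, 3, 1]))                  # less than 1 because the ordering is not ideal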

Friday, September 7, 2007

Machine Learning from Text Classification Applications-David D. Lewis

Machine Learning from Text Classification Applications

Dave Lewis

David D. Lewis Consulting, LLC

www.DavidDLewis.com


Text Classification

Assigning text to one of several predefined groups


Use

As a component in NLP systems

Improving information access


How to improve Information Access

Relevance Feedback (Interactive Search)

Organize search results into classes

Awareness control (Filtering porn, highlighting relevant text etc.)


Categories of Classifications

Binary (either this or that)

Multiclass (belongs to one of the classes)

Multilabel (may belong to more than one class)

Ordinal (Multiclass w/ ordered classes)

Hierarchical (class organized as trees)

Fuzzy (ordinal with numeric membership)


Types of Classifiers

Rule-Based

Numeric (weighted combination of attribute values)

Instance Based (Direct comparison of text with known labeled examples)

Automata/Pattern Matching (Based on pattern matching)

Hybrid (Weighted/voted, numeric w/ attributes defined by patterns etc.)


Why good classifiers go bad

Input data changes (change in format, meaning of word changes etc.)

Classes added, deleted, redefined

Organizational policies changed

Cost of errors change


How to deal with this

Make robust classifier

Monitor classifier effectiveness

Periodically retrain

Monitor for novel inputs

from Winning TREC to Beating Google-David Hawking

from Winning TREC to Beating Google

by

David Hawking

CSIRO & Funnelback Pty. Ltd.


This talk presented an overview of issues related to web search and a live demonstration of the search engine (specifically indexing and searching) that he and his colleagues have built.


They have used UK2006 collection which has:

  • 80 million pages (about 0.4% of the web)

  • crawled by Università degli Studi di Milano

  • distributed by Yahoo! research, Barcelona

  • intended for SPAM detection/rejection experiment

  • Available at: www.yr-bcn.es/webspam/datasets/


What is the difference between TREC ad hoc and searching the web?

  • In TREC ad hoc the task is to find all documents relevant to a topic, whereas people searching the web are usually just looking for a couple of good results.

  • The searches in TREC ad hoc are informational. Web searching, however, can serve informational, transactional or navigational needs.

  • People searching the web hardly ever go beyond the first page of results, so it is fair to say that measures should take the first page into account.

  • TREC ad hoc uses Mean Average Precision (MAP). Web search is typically measured by Normalised Discounted Cumulative Gain (NDCG).


Crawler tasks:

Maintain Seedlist of URLs

Look at Robots.txt for permissions to crawl


Problems

What to do if servers are down?

Web pages generally have lots of errors

Resolving URLs

How to handle redirects?

How to crawl URLs that are dynamically generated by scripts?


Indexing

Requires maintaining enormous vocabulary, inverted files.

But how do we do instant updates and deletions on index files that are usually compressed and huge?


Ranking

Static Scores

  • PageRank (may have some biases, e.g. a bias toward Fortune 500 companies versus the rest)

  • URL form

  • fRank

  • Spam Score

  • Adult Content Score

Document at a Time Ranking

Assign document numbers in order of descending static score.

Should index be distributed over clusters?


Caching

Cache web pages for common queries.

Caching also needed for generating summaries describing the result.


Query Processing- the Need for Speed

Rank the document

For each result, obtain: Title, URLs, Query biased summary etc.

Tuesday, September 4, 2007

Introduction to IR-Keith van Rijsbergen

Introduction to Information Retrieval

Keith van Rijsbergen

Computing Science

Glasgow University

IR: Retrieval of unstructured data (text documents, images, videos etc.)

Some general terms used in IR:

Term Frequency

Frequency of word occurrence in a document is a useful measure for word significance.

Inverse document frequency

The value of a keyword varies inversely with the log of the number of documents in which it occurs.

IR Model

Explains the structure and processes of IR systems

Clarifies the general characteristics of IR systems

There exist various models

Boolean, Vector space, Probabilistic, Language models, Cognitive etc.

Cranfield Paradigm

• Document collection

• Relevance judgements in advance

• Run strategy A and B

• Evaluate A and B in terms of Precision & Recall

• Compare A with B statistically

• State whether A is comparable to B, A is better than B, B is worse than A

IR Systems/Architecture & Large-scale IR-Mark Sanderson

IR Systems/Architecture & Large-scale IR

Mark Sanderson

University of Sheffield

IR systems: take documents (do some processing), create an index, search it and update it when the collection changes.

Some Problems:

Various document formats (and they change)

Character encodings (ASCII, Unicode, UTF-8 etc.)

Tokenization (In English words are separated by spaces; not in all languages.)

How to do searching (fast)

Maintain index (e.g. Inverted file list)

For a web collection the inverted file lists become very long

Ranking documents

Rank documents by number of query words in the document (Quorum search).

Sophisticated techniques: BM25

Academic Experiments

Lemur (http://www.lemurproject.org/)

Written in C++

Powerful query language

Support for distributed search

Terrier (http://ir.dcs.gla.ac.uk/terrier/)

Desktop application

DFR ranking scheme

Zettair (http://www.seg.rmit.edu.au/zettair/)

Written in C

Very fast indexing

Open-Source, Distributed and P2P IR-Wray Buntine

Open Source, Distributed and Peer-to-Peer IR

Wray Buntine

National ICT Australia

Helsinki Institute for Information Technology

Some Open Source systems

Technorati: http://technorati.com/

Built on Lucene backend for search

Facility to add Tags and authority

Wikia Search

“Open Search” based on distributed, social and semantic concepts.

Acquired grub (http://www.grub.org/), a P2P crawler from looksmart. It currently uses Lucene.

Other systems mentioned in the talk:

Getting higher rankings in search

Some of the methods used to push a client's website higher in the rankings are link farms, fake websites and flogs (fake blogs).

Some Academic Systems available on different license types

· Indri: U.Mass+CMU C++ system with language models on BSD license, part of Lemur.

· Lucene: Java-based industrial system sometimes used in academia due to its popularity.

· MG4J: “Managing Gigabytes for Java” system from U.Milano under LGPL.

· Terrier: feature-laden Java-based system from U.Glasgow on MPL.

· Wumpus: scalable desktop-oriented system from U.Waterloo on GPL.

· Zettair: simple, fast C-based system from RMIT University on BSD style license.

Distributed and peer-to-peer architecture

Factors in distributed IR

· Modes: cluster based, P2P etc.

· Co-operating vs Competing nodes

· Homogeneous vs Heterogeneous node

· Centralized vs Decentralized vs No control

· Nature of content

Stages in a Search Process

  • Crawling: by URL or domain
  • Pre-processing: by document
  • Indexing: a very large collection of (term,doc) elements
  • Querying: by term or document or other combinations
  • Result serving: by document

Note: The hard task is to distribute Indexing and Querying (the core of search)

Distributed Query Score computation

For a multi-term query, a problem occurs if its terms are indexed on different nodes, so we need to distribute the document-term entries.

Federated Querying

Tasks

  • Resource Selection: Selecting the nodes to distribute the queries to
  • Result merging: Assembling results from different result sets
  • Discovery: Sampling to estimate collection statistics for nodes

Distributed querying can employ either of these schemes

Document partitioning (by topic, language, genre etc.):

  • Flooding of queries
  • Selective routing

Tasks: partitioning, query routing and obtaining results

Employ hierarchical assembling of results

Term partitioning (for multi-term queries)

P2P IR

In a P2P network there is no centralized control but there can be a super-peer i.e. some nodes are more equal than others. P2P are self-organizing, failure resilient systems that encourage resource sharing. P2P systems organize peers into a network overlay. There is use of distributed hash tables to route queries.