Tuesday, October 30, 2007
Reading List
| S.No. | Title | Source |
| 1 | Indexing Low Frequency Information for Question Answering | http://riao.free.fr/papers/100.pdf |
| 2 | Splitting Complex Temporal Questions for Question Answering systems | http://acl.ldc.upenn.edu/P/P04/P04-1072.pdf |
| 3 | Different Structures for Evaluating Answers to Complex Questions: Pyramids Won't Topple, and Neither Will Human Assessors | http://www.umiacs.umd.edu/~jimmylin/publications/ |
| 4 | Answering Clinical Questions with Knowledge-Based and Statistical Techniques | http://www.umiacs.umd.edu/~jimmylin/publications/ |
| 5 | Answering Complex, list and context questions with LCC's Question-Answering Server | http://trec.nist.gov/pubs/trec10/papers/lcc-trec10.pdf |
| 6 | TREC 2006 QA Track Papers | http://trec.nist.gov/pubs/trec15/t15_proceedings.trac |
| MIT QA | ||
| UMass at TREC ciQA 2006 | ||
| TREC-2006 at Maryland: Blog, Enterprise, Legal and QA Tracks | ||
| Contextual information and assessor characteristics in complex question answering | ||
| Identifying Relationships Between Entities in Text for Complex Interactive Question Answering Task | ||
| 7 | Answering complex questions with random walk models | http://portal.acm.org/citation.cfm?id=1148170.11482 |
| 8 | The role of information retrieval in answering complex questions | http://portal.acm.org/citation.cfm?id=1273141&coll= |
| 9 | Answering English questions by computer: a survey | http://portal.acm.org/citation.cfm?id=363732&coll=G |
| 10 | Complex Answers: A Case Study Using a WWW Question Answering Syst | http://citeseer.ist.psu.edu/560017.html |
| 11 | New Directions in Question Answering | http://portal.acm.org/citation.cfm?id=1108994.11090 |
| 12 | USING SEMANTIC REPRESENTATIONS IN QUESTION ANSWERING | http://www.cs.colorado.edu/~martin/Papers/cuaq-ico |
| 13 | Question Answering Based on Semantic Structures | http://www.icsi.berkeley.edu/~snarayan/837.pdf |
| 14 | Finding Answers to Complex Questions | http://web.syr.edu/~diekemar/ruimte/Papers/Chapter |
| 15 | Automatic Question Answering: Beyond the Factoid | http://www.aclweb.org/anthology-new/N/N04/N04-100 |
| 16 | Answering complex questions and performing deep reasoning in advance question answering systems | http://www.public.asu.edu/~cbaral/cse571-s05/class |
| 17 | What do You Mean? Finding Answers to Complex Questions | http://web.syr.edu/~diekemar/ruimte/Papers/SS803A |
| 18 | Long-Answer Question Answering and Rhetorical-Semantic Relations | http://www1.cs.columbia.edu/nlp/theses/sasha_blair |
| 19 | Question Answering using Integrated Information Retrieval and Information Extraction | http://www.aclweb.org/anthology-new/N/N07/N07-106 |
| 20 | Using Semantic Roles to Improve Question Answering | http://www.aclweb.org/anthology-new/D/D07/D07-100 |
| 21 | Answering Questions Using Advanced Semantics and Probabilistic Inferenc | http://www.cs.pitt.edu/~mrotaru/comp/nlp/BBN/Nara |
| 22 | Structured Retrieval for Question Answering | http://durazno.lti.cs.cmu.edu/wiki/moin.cgi/Javelin_P |
| 23 | Answering complex questions and performing deep reasoning in advance QA systems: ARDA AQUAINT Program Phase 2 | http://www.public.asu.edu/~cbaral/aquaint04-06/site- |
| 24 | Automatic Detection of Causal Relations for Question Answering | http://portal.acm.org/citation.cfm?id=1119322 |
| 25 | Learning to Find Answers to Questions on the Web | http://portal.acm.org/citation.cfm?id=990303&coll=G |
| 26 | Bridging the lexical chasm: statistical approaches to answer-finding | http://portal.acm.org/citation.cfm?id=345576&coll=G |
| 27 | Syntactic and Semantic Decomposition Strategies for Question Answering from Multiple Resources | http://groups.csail.mit.edu/infolab/publications/Katz- |
| 28 | ANALYSIS OF SEMANTIC CLASSES: TOWARD NON-FACTOID QA | http://ftp.cs.toronto.edu/pub/gh/Niu-thesis.pdf |
| 29 | The role of information retrieval in answering complex questions | http://portal.acm.org/citation.cfm?id=1273073.12731 |
| 30 | A Statistical Classification Approach to Question Answering using Web Dat | http://www.ewdw.com/PUBLICATIONS/cw2005.pdf |
| 31 | Retrieving answers from frequently asked questions pages on the web | http://portal.acm.org/citation.cfm?id=1099571 |
| 32 | Probabilistic model for definitional question answering | http://portal.acm.org/citation.cfm?id=1148210 |
| 33 | Strategies for Advanced Question Answering | http://acl.ldc.upenn.edu/hlt-naacl2004/qa/pdf/harabag |
| 34 | Towards Answering Opinion Questions: | http://acl.ldc.upenn.edu/acl2003/emnlp/pdfs/Yu.pdf |
| 35 | Answering Definition Questions Using Multiple Knowledge Sources | http://acl.ldc.upenn.edu/N/N04/N04-1007.pdf |
| 36 | Knowledge-Based Question Answering | http://research.microsoft.com/~cyl/download/papers/ |
| 37 | Using Scenario Knowledge in Automatic Question Answering | http://acl.ldc.upenn.edu/W/W06/W06-0705.pdf |
Sunday, September 23, 2007
Probabilistic IR-DFR
Divergence From Randomness Model
The model is based on the hypothesis that informative words are characterised by an elite set of documents, in which these words occur to a relatively greater extent than in the rest of the collection.
Non-informative words, on the other hand, have no elite set, and their frequency across documents follows a random distribution.
There are three main components of a DFR model:
1) Informative content
This is an inverse measure of the probability of observing, purely by chance, a certain number of occurrences of a term within a document, according to the chosen model of randomness.
2) Gain
This measures the probability of encountering another occurrence of the term given that tf occurrences have already been seen. The probability tends to 1 (i.e. becomes a certainty) as tf increases, mimicking the distribution of informative words.
3) Term frequency normalisation
This takes into account the relationship between document length and within-document term frequency, so that shorter documents are not penalised relative to longer ones.
The DFR models are based on this simple idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d". In other words the term-weight is inversely related to the probability of term-frequency within the document d obtained by a model M of randomness:
weight(t|d) ∝ -log ProbM(t∊d|Collection)
where the subscript M stands for the type of model of randomness employed to compute the probability. In order to choose the appropriate model M of randomness, we can use different urn models. There are many ways to choose M, each of these provides a basic DFR model.
If the model M is the binomial distribution, then the basic model is P and computes the value
-log ProbP(t∊d|Collection) = -log [ C(TF, tf) * p^tf * q^(TF-tf) ]
where:
C(TF, tf) is the binomial coefficient "TF choose tf"
TF is the term frequency of the term t in the Collection
tf is the term frequency of the term t in the document d
N is the number of documents in the Collection
p = 1/N and q = 1 - p
Gain
When a rare term does not occur in a document, it has almost zero probability of being informative for that document. Conversely, if a rare term occurs many times in a document, it is almost certainly informative for the topic the document describes. If the within-document term frequency is high, the risk of the term not being informative is minimal; we then get a high weight, but a minimal risk also means a small information gain. Therefore, instead of using the full weight, we smooth it by keeping only the portion that corresponds to the information gained with the term:
gain(t|d) = Prisk * (-log ProbM(t∊d|Collection))
The more the term occurs in the elite set, the less its term frequency is due to randomness, and the smaller the probability Prisk, defined as:
Prisk = 1 - Prob(t∊d|d∊Elite set)
Two models for computing the information gain of a term within a document are the Laplace model L and the ratio B of two Bernoulli processes:
Prisk = 1/(tf+1)   (Laplace model L)
Prisk = TF/{ df*(tf+1) }   (ratio B of two Bernoulli processes)
where df is the number of documents containing the term.
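To make this concrete, here is a minimal Python sketch (with made-up counts) of a basic DFR weight combining the binomial randomness model P with the Laplace and Bernoulli-ratio first normalisations described above; the term frequency (document length) normalisation component is omitted.

```python
import math

def dfr_weights(tf, TF, N, df):
    """Basic DFR weights: binomial model P for the informative content,
    then the Laplace (L) and Bernoulli-ratio (B) gain normalisations.
    tf: term occurrences in the document, TF: occurrences in the whole
    collection, N: documents in the collection, df: documents containing
    the term. No document-length normalisation is applied."""
    p = 1.0 / N
    q = 1.0 - p
    # Inf1 = -log2 ProbP(tf occurrences by chance) under the binomial model
    log_prob = (math.log2(math.comb(TF, tf))
                + tf * math.log2(p) + (TF - tf) * math.log2(q))
    inf1 = -log_prob
    p_risk_L = 1.0 / (tf + 1)            # Laplace model L
    p_risk_B = TF / (df * (tf + 1))      # ratio B of two Bernoulli processes
    return p_risk_L * inf1, p_risk_B * inf1

# A term occurring 3 times in this document, 5 times in a 1000-document
# collection spread over 2 documents (hypothetical numbers).
print(dfr_weights(tf=3, TF=5, N=1000, df=2))
```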
Probabilistic IR-BIR and BII
Probabilistic Models
Goal: To estimate the probability that a document is relevant to user queries.
Probabilistic Ranking Principle (PRP)
The order in which to present documents to the user is to rank them by their estimated probability of relevance with respect to the information need, i.e. P(R=1|d,q).
In the simple case of the PRP we assume there is no retrieval cost that would differentially weigh actions or errors.
If we just want to return a set of retrieval results, the Bayes Optimal Decision Rule is to use:
d is relevant iff P(R=1|d,q) > P(R=0|d,q)
PRP with retrieval cost
Let C1 be the cost of retrieval of a relevant document and C0 be the cost of retrieval of a non-relevant document. Then PRP says that if for a specific document d and for all documents d' not yet retrieved
C1*P(R=1|d)+C0*P(R=0|d)<=C1*P(R=1|d')+C0*P(R=0|d')
then d is the next document to be retrieved.
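A minimal sketch of this costed decision rule, with made-up probabilities and costs; it assumes (as is conventional, though not stated above) that C1 is negative because retrieving a relevant document is a benefit rather than a cost.

```python
def next_document(candidates, c1=-1.0, c0=1.0):
    """candidates maps a not-yet-retrieved document id to P(R=1|d).
    Expected retrieval cost of d is C1*P(R=1|d) + C0*P(R=0|d);
    the PRP retrieves next the document that minimises it."""
    def expected_cost(p_rel):
        return c1 * p_rel + c0 * (1.0 - p_rel)
    return min(candidates, key=lambda d: expected_cost(candidates[d]))

print(next_document({"d1": 0.2, "d2": 0.7, "d3": 0.4}))   # -> d2
```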
Binary Independence Retrieval Model
Documents and queries are both represented as binary term vectors, i.e. a document is represented by ~x=(x1,...,xn), where xt=1 if term t is present in the document d and xt=0 otherwise. Similarly, the query q is represented by ~q.
Independence means that there is no association between terms.
Assume that the relevance of each document is independent of the relevance of other documents.
P(R=1|~x,~q)={P(~x|R=1,~q) * P(R=1|~q)}/P(~x|~q)
P(R=0|~x,~q)={P(~x|R=0,~q) * P(R=0|~q)}/P(~x|~q)
Ranking functions for query terms
As we are interested only in the ranking of documents, we use some other quantities that are easier to compute like odds.
O(R|~x,~q)=P(R=1|~x,~q)/P(R=0|~x,~q)
={ P(R=1|~q)/P(R=0|~q) } * { P(~x|R=1,~q)/P(~x|R=0,~q) }
The first component is a constant for a given query so there is no need to estimate it.
To estimate the second part of the equation, we make the Naive Bayes Conditional Independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word.
P(~x|R=1,~q)/P(~x|R=0,~q)=∏(t=1 to M) { P(xt|R=1,~q)/P(xt|R=0,~q) }
So,
O(R|~x,~q)=O(R|~q) *∏ (t=1 to M) P(xt|R=1,~q)/P(xt|R=0,~q)
Since xt=0 or 1, we can separate the two cases:
O(R|~x,~q)=O(R|~q) ∏(t:xt=1) { P(xt=1|R=1,~q)/P(xt=1|R=0,~q) } * ∏(t:xt=0) { P(xt=0|R=1,~q)/P(xt=0|R=0,~q) }
Let,
pt=P(xt=1|R=1,~q) be the probability of a term occurring in a document relevant to a query.
ut=P(xt=1|R=0,~q) be the probability of a term occurring in a document not-relevant to a query.
We make another assumption:
Terms not occurring in the query are equally likely to occur in relevant and non-relevant documents, i.e. if qt=0 then pt=ut.
So,
O(R|~x,~q)=O(R|~q) * ∏(t:xt=qt=1) pt/ut * ∏(t:xt=0,qt=1) (1-pt)/(1-ut)
=O(R|~q) * ∏(t:xt=qt=1) { pt*(1-ut) }/{ ut*(1-pt) } * ∏(t:qt=1) (1-pt)/(1-ut)
The second product in the last expression ranges over all query terms, so it is a constant for a given query and can be ignored for ranking.
Retrieval Status Value (RSV) in this model:
RSVd=log ∏(t:xt=qt=1) { pt*(1-ut) }/{ ut * (1-pt) }
=∑(t:xt=qt=1) log {(pt*(1-ut))/(ut*(1-pt))}
Let Ct=log {(pt*(1-ut))/(ut*(1-pt))}
=log pt/(1-pt) + log (1-ut)/ut
The Ct quantities function as term weights in this model and the document score for a query is
RSVd=∑(t:xt=qt=1) Ct
Let,
s=Number of relevant documents containing term t
S=Total number of relevant documents
nt=Number of documents containing term t
N=Total number of documents
pt=s/S
ut=(nt-s)/(N-S)
In practice ut ≈ nt/N, assuming that relevant documents are a very small percentage of the collection.
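A minimal sketch of how the Ct weights and the RSV could be computed, assuming hypothetical relevance judgements and a toy collection; the add-0.5 smoothing is an addition (not in the derivation above) to avoid zero or undefined weights when s = 0 or s = S.

```python
import math

def bir_term_weight(s, S, n_t, N, k=0.5):
    """Ct = log[ pt*(1-ut) / (ut*(1-pt)) ] with add-k smoothing."""
    p_t = (s + k) / (S + 2 * k)             # estimate of pt
    u_t = (n_t - s + k) / (N - S + 2 * k)   # estimate of ut
    return math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))

def rsv(doc_terms, query_terms, judgements, N, S):
    """Sum Ct over terms present in both the document and the query.
    judgements maps a term t to (s, nt)."""
    score = 0.0
    for t in query_terms & doc_terms:
        s, n_t = judgements.get(t, (0, 0))
        score += bir_term_weight(s, S, n_t, N)
    return score

# Hypothetical collection of 1000 docs, 10 of them judged relevant.
judgements = {"retrieval": (6, 80), "probabilistic": (4, 40)}
doc = {"probabilistic", "retrieval", "models"}
query = {"probabilistic", "retrieval"}
print(rsv(doc, query, judgements, N=1000, S=10))
```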
Binary Independence Indexing Model (BII)
Whereas the BIR model regards a single query with respect to a number of documents, the BII model observes one document in relation to a number of queries submitted.
The BII model yields the same ranking for two different queries formulated with the same set of terms.
Let
~Zk=(zk1,......,zkn)
where zki=1 if ti ∈ qk^T (the set of terms in query qk) and zki=0 otherwise
So,
P(R|qk,dm)=P(R|~Zk,dm)
Using Bayes' theorem,
P(R|~Zk,dm)=P(R|dm) * { P(~Zk|R,dm)/P(~Zk|dm) }
Assuming that the distribution of terms in all queries to which a document with representation dm is relevant is independent
P(~Zk|R,dm)=∏(i=1 to n) P(zki|R,dm)
Assuming that the relevance of a document dm w.r.t. a query qk depends only on the terms in qk^T and not on other terms, we get:
P(R|~Zk,dm)=∏(i=1 to n) P(zki)/P(~Zk) * P(R|dm) * ∏(i:zki=1) { P(R|zki=1,dm)/P(R|dm) } * ∏(i:zki=0) { P(R|zki=0,dm)/P(R|dm) }
The first term is constant for a given query.
Assuming P(R|ti,dm)=P(R|dm) for all ti not in dm^T (the set of terms in dm), we get:
P(R|qk,dm)=ck * P(R|dm) * ∏(ti ∈ qk^T ∩ dm^T) P(R|ti,dm)/P(R|dm)
where ck is the query-dependent constant from the first factor above.
Wednesday, September 12, 2007
Performance Measures
Measures of Performance
Precision & Recall
Precision = |{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}|
Recall = |{relevant docs} ∩ {retrieved docs}| / |{relevant docs in collection}|
Confusion Matrix
True +ve (TP): Number of relevant docs that system retrieved
False +ve (FP): Number of non-relevant docs that system retrieved
True -ve (TN): Number of non-relevant docs that the system did not retrieve
False -ve (FN): Number of relevant docs that the system did not retrieve
| | Relevant | Non-Relevant |
| Retrieved | TP | FP |
| Not-Retrieved | FN | TN |
Precision=TP/(TP+FP)
Recall=TP/(TP+FN)
F-Score
Fα=1/{(α*1/P)+(1-α)*1/R}
Where,
α-->Weighting factor
Higher α means that precision is more important and lower α means that recall is more important (α = 1/2 gives the balanced F1, the harmonic mean of P and R).
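A minimal sketch computing precision, recall and Fα from the confusion-matrix counts above (the counts in the example call are made up):

```python
def precision_recall_f(tp, fp, fn, alpha=0.5):
    """P, R and F_alpha = 1 / (alpha/P + (1-alpha)/R);
    alpha = 0.5 gives the familiar F1 = 2PR/(P+R)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 or r == 0.0:
        return p, r, 0.0
    f = 1.0 / (alpha / p + (1.0 - alpha) / r)
    return p, r, f

# 8 relevant retrieved, 12 non-relevant retrieved, 2 relevant missed
print(precision_recall_f(tp=8, fp=12, fn=2))   # P=0.4, R=0.8, F1≈0.53
```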
User Oriented Measures
Coverage = relevant docs retrieved that were already known to the user / relevant docs known to the user
Novelty = relevant docs retrieved that were previously unknown to the user / relevant docs retrieved
Problems with precision & recall
Users generally do not really care about recall.
P & R don't take into account the order of answers.
They don't take into account the degree of relevance.
What to do if the denominators are zero?
Measuring Recall involves identifying relevant docs in the corpus which is difficult.
Average Precision
AP=Summation(k=1 to |S|) {rel(sk)*P@k} / R
P@k=Summation(i=1 to k) rel(si) / k
R=Summation(d ∈ D) rel(d)
Where,
D=corpus of documents
S=returned list of documents (s1,s2,...,sn)
P@k=precision over s1,s2,...,sk
rel(d)="true" relevance
0: non-relevant
1: relevant
R=number of relevant docs in the corpus
Example:
| Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| Recall | 0.2 | 0.2 | 0.4 | 0.4 | 0.4 | 0.6 | 0.6 | 0.6 | 0.8 | 1.0 |
| Precision | 1.0 | 0.5 | 0.67 | 0.5 | 0.4 | 0.5 | 0.43 | 0.38 | 0.44 | 0.5 |
Suppose 1, 3, 6, 9 and 10 are relevant documents.
AP=(1.0+0.67+0.5+0.44+0.5)/5=0.62
(Calculated by averaging the precision values at the ranks where recall increases, i.e. where a relevant document appears.)
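The same computation as a short sketch, reproducing the example above:

```python
def average_precision(relevance):
    """relevance: 0/1 judgements for the ranked list, in rank order.
    AP = (1/R) * sum of precision at each rank where a relevant doc appears."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    ap, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            ap += hits / k          # P@k at this relevant rank
    return ap / total_relevant

# Ranks 1, 3, 6, 9 and 10 relevant, as in the table above
print(average_precision([1, 0, 1, 0, 0, 1, 0, 0, 1, 1]))   # ≈ 0.62
```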
Mean Average Precision (MAP)
-->Take the AP value for each query
-->Calculate the mean of these scores (average over queries)
Mean Reciprocal Rank (MRR)
Suppose we have 3 sample queries for a system that tries to translate English words to their plurals. In each case, the system makes three guesses, with the first one being the one it thinks is most likely correct.
| Query | Results | Correct | Rank | Reciprocal Rank |
| cat | Catten, cati, cats | cats | 3 | 1/3 |
| torus | Torii, tori, toruses | tori | 2 | 1/2 |
| virus | Viruses, virii, viri | viruses | 1 | 1 |
MRR=(1/3+1/2+1)/3=0.61
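As a sketch, MRR over the reciprocal ranks of the first correct answer per query:

```python
def mean_reciprocal_rank(first_correct_ranks):
    """first_correct_ranks: 1-based rank of the first correct answer for
    each query, or None if no correct answer was returned."""
    rr = [1.0 / r if r else 0.0 for r in first_correct_ranks]
    return sum(rr) / len(rr)

print(mean_reciprocal_rank([3, 2, 1]))   # the cat/torus/virus example: ≈ 0.61
```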
Precision@n
-->Proportion of top n documents that are satisfactory
P@10=0.4 means that 4 of the top 10 documents were satisfactory.
A run of 50 queries would be measured according to mean P@n.
Success@n
-->The proportion of queries for which a correct answer was within the top n.
s@1=0.5 means that for half of the queries, the correct answer was at rank 1.
Relevance Measurement
Term Frequency (tf)
-->tf in a document is simply the number of times a given term appears in that document.
-->This count is usually normalised to prevent a bias towards longer documents which may have a higher tf regardless of the actual importance of that term in the document.
tfi=ni/∑k nk
Where,
ni=number of occurrences of the considered term
the denominator is the total number of occurrences of all terms in the document
Inverse Document Frequency (IDF)
idfi=log2 ( |D| / |{d: ti∊d}| )
Where,
|D|=total number of documents in the corpus
|{d: ti∊d}|=number of documents in which the term ti appears
Then, tf-idf = tf × idf
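A minimal sketch of this tf-idf weighting on a toy corpus (the documents are made up):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf} dict per
    document, with tf = count/len(doc) and idf = log2(|D| / df)."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log2(N / df[t])
                        for t, c in counts.items()})
    return weights

docs = [["cats", "like", "milk"],
        ["dogs", "like", "bones"],
        ["cats", "and", "dogs"]]
print(tf_idf(docs)[0])   # "milk" gets the highest weight in the first doc
```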
BM25
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document.
score(D,Q)=Summation(i=1 to n) IDF(qi) * {(f(qi,D)*(k1+1))/(f(qi,D)+k1*(1-b+b*|D|/avgdl))}
where f(qi,D) is qi's term frequency in the document D, | D | is the length of the document D, and avgdl is the average document length in the text collection from which documents are drawn. k1 and b are free parameters.
The variable k1 is a positive tuning parameter that calibrates the document term frequency scaling. A k1 value of 0 corresponds to a binary model (no term frequency), and a large value corresponds to using raw term frequency. b is another tuning parameter (0 ≤ b ≤ 1) which determines the scaling by document length: b = 1 corresponds to fully scaling the term weight by the document length, while b = 0 corresponds to no length normalization.
IDF(qi) is the IDF (inverse document frequency) weight of the query term qi.
It is usually computed as:
IDF(qi)=log {(N-n(qi)+0.5)/(n(qi)+0.5)}
where N is the total number of documents in the collection, and n(qi) is the number of documents containing qi.
If the query is long, we might also use a similar weighting for query terms. This is appropriate if the queries are paragraph-long information needs, but unnecessary for short queries.
score(D,Q)=Summation(i=1 to n) IDF(qi) * {(f(qi,D)*(k1+1))/(f(qi,D)+k1*(1-b+b*|D|/avgdl))} * {(k3+1)f(qi,Q)/(k3+f(qi,Q))}
k3 being another positive tuning parameter that this time calibrates term frequency scaling of the query.
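A minimal sketch of the document-side BM25 score above on a toy collection; k1 and b are set to commonly used values, and the k3 query-side factor is omitted.

```python
import math

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """query and doc are token lists; docs is the whole collection,
    used for N, avgdl and the document frequencies n(qi)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for q in set(query):
        n_q = sum(1 for d in docs if q in d)        # n(qi)
        if n_q == 0:
            continue
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))
        f = doc.count(q)                            # f(qi, D)
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * (f * (k1 + 1)) / denom
    return score

docs = [["probabilistic", "retrieval", "models"],
        ["divergence", "from", "randomness"],
        ["language", "models", "for", "information", "retrieval"],
        ["bm25", "ranking", "function"]]
print(bm25_score(["probabilistic", "retrieval"], docs[0], docs))
```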
Normalized Discounted Cumulative Gain (NDCG)
-->Devised specifically for web search evaluation
-->rewards relevant documents in the top ranked results more heavily than those ranked lower
For a query q
Let the relevance level of the document ranked j with respect to q be rq(j).
The response list is inspected up to rank L.
NDCGq = Zq * Summation(j=1 to L) (2^rq(j) − 1)/log2(1+j)
Zq is a normalisation factor chosen so that a perfect ordering of the results for the query receives a score of one.
Suppose we run the same query on two different search engines and get the results below (the numbers are graded relevance levels). Both result lists contain the same documents, but in Result1 the most relevant document (grade 3) appears earlier (position 1, counting the first position as 1) than in Result2 (position 2). The discount factor 1/log2(1+j) therefore rewards Result1 with a higher score.
Result1 (relevance grades) = 3 1 1
Gain1 (2^r − 1) = 7 1 1
Discount (1/log2(1+j)) = 1 0.63 0.5
Discounted gains = 7 0.63 0.5, so DCG1 = 8.13
Result2 (relevance grades) = 1 3 1
Gain2 = 1 7 1
Discount = 1 0.63 0.5
Discounted gains = 1 4.41 0.5, so DCG2 = 5.91
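A short sketch of the DCG/NDCG computation, using the same conventions as the example (gain 2^r − 1, discount 1/log2(1+j), and Zq taken from the ideal ordering):

```python
import math

def dcg(relevances, L=None):
    """Sum of (2^r - 1) / log2(1 + j) over the first L ranks."""
    L = L or len(relevances)
    return sum((2 ** r - 1) / math.log2(1 + j)
               for j, r in enumerate(relevances[:L], start=1))

def ndcg(relevances, L=None):
    """Normalise by the DCG of the ideal (sorted) ordering, so that a
    perfect ranking scores 1."""
    ideal = dcg(sorted(relevances, reverse=True), L)
    return dcg(relevances, L) / ideal if ideal > 0 else 0.0

print(ndcg([3, 1, 1]))   # Result1: 1.0 (already ideally ordered)
print(ndcg([1, 3, 1]))   # Result2: ≈ 0.73
```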
Friday, September 7, 2007
Machine Learning from Text Classification Applications - David D. Lewis
Machine Learning from Text Classification Applications
Dave Lewis
David D. Lewis Consulting, LLC
Text Classification
Assigning text to one of several predefined groups
Use
As a component in NLP systems
Improving information access
How to improve Information Access
Relevance Feedback (Interactive Search)
Organize search results into classes
Awareness control (Filtering porn, highlighting relevant text etc.)
Categories of Classifications
Binary (either this or that)
Multiclass (belongs to one of the classes)
Multilabel (may belong to more than one class)
Ordinal (Multiclass w/ ordered classes)
Hierarchical (class organized as trees)
Fuzzy (ordinal with numeric membership)
Types of Classifiers
Rule-Based
Numeric (weighted combination of attribute values)
Instance Based (Direct comparison of text with known labeled examples)
Automata/Pattern Matching (Based on pattern matching)
Hybrid (Weighted/voted, numeric w/ attributes defined by patterns etc.)
Why good classifiers go bad
Input data changes (change in format, meaning of word changes etc.)
Classes added, deleted, redefined
Organizational policies changed
Cost of errors change
How to deal with this
Make robust classifier
Monitor classifier effectiveness
Periodically retrain
Monitor for novel inputs
From Winning TREC to Beating Google - David Hawking
From Winning TREC to Beating Google
by
David Hawking
CSIRO & Funnelback Pty. Ltd.
This talk presented an overview of issues related to web search and a live demonstration of the search engine (specifically indexing and searching) that he and his colleagues have built.
They used the UK2006 collection, which has:
80 million pages (about 0.4% of the web)
crawled by Università degli Studi di Milano
distributed by Yahoo! Research, Barcelona
intended for spam detection/rejection experiments
Available at: www.yr-bcn.es/webspam/datasets/
What is the difference between TREC ad hoc and searching the web?
In TREC ad hoc the task is to find all documents relevant to a topic, whereas people searching the web are usually just looking for a couple of results.
The searches in TREC ad hoc are informational. Web searches, however, can serve informational, transactional or navigational needs.
People doing web search hardly ever go beyond the first page, so it is fair to say that measures should take into account only the first page of results.
TREC ad hoc uses Mean Average Precision (MAP); web search is typically measured by Normalised Discounted Cumulative Gain (NDCG).
Crawler tasks:
Maintain a seed list of URLs
Look at robots.txt for permission to crawl
Problems
What to do if servers are down?
Web pages generally have lots of errors
Resolving URLs
How to handle redirects?
How to crawl URLs that are dynamically generated by a script?
Indexing
Requires maintaining an enormous vocabulary and inverted files.
But how do we do instant updates and deletions on index files that are usually compressed and huge?
Ranking
Static Scores
Page Rank (may have some biases, e.g. a bias toward Fortune 500 companies versus the rest)
URL form
fRank
Spam Score
Adult Content Score
Document at a Time Ranking
Assign document numbers in order of descending static score.
Should index be distributed over clusters?
Caching
Cache web pages for common queries.
Caching also needed for generating summaries describing the result.
Query Processing - the Need for Speed
Rank the documents
For each result, obtain: Title, URLs, Query biased summary etc.