Tuesday, October 30, 2007

Reading List

S.No. Title Source
1 Indexing Low Frequency Information for Question Answering http://riao.free.fr/papers/100.pdf
2 Splitting Complex Temporal Questions for Question Answering systems http://acl.ldc.upenn.edu/P/P04/P04-1072.pdf
3 Different Structures for Evaluating Answers to Complex Questions: Pyramids Won't Topple, and Neither Will Human Assessors http://www.umiacs.umd.edu/~jimmylin/publications/Dang_Lin_ACL2007.pdf
4 Answering Clinical Questions with Knowledge-Based and Statistical Techniques http://www.umiacs.umd.edu/~jimmylin/publications/Demner-Fushman_Lin_CL2007.pdf
5 Answering Complex, list and context questions with LCC's Question-Answering Server http://trec.nist.gov/pubs/trec10/papers/lcc-trec10.pdf
6 TREC 2006 QA Track Papers http://trec.nist.gov/pubs/trec15/t15_proceedings.tracks.html#qa

  • MIT QA

  • UMass at TREC ciQA 2006

  • TREC-2006 at Maryland: Blog, Enterprise, Legal and QA Tracks

  • Contextual information and assessor characteristics in complex question answering

  • Identifying Relationships Between Entities in Text for Complex Interactive Question Answering Task
7 Answering complex questions with random walk models http://portal.acm.org/citation.cfm?id=1148170.1148211
8 The role of information retrieval in answering complex questions http://portal.acm.org/citation.cfm?id=1273141
9 Answering English questions by computer: a survey http://portal.acm.org/citation.cfm?id=363732
10 Complex Answers: A Case Study Using a WWW Question Answering System http://citeseer.ist.psu.edu/560017.html
11 New Directions in Question Answering http://portal.acm.org/citation.cfm?id=1108994.1109001
12 Using Semantic Representations in Question Answering http://www.cs.colorado.edu/~martin/Papers/cuaq-icon-2002.pdf
13 Question Answering Based on Semantic Structures http://www.icsi.berkeley.edu/~snarayan/837.pdf
14 Finding Answers to Complex Questions http://web.syr.edu/~diekemar/ruimte/Papers/Chapter11-Diekema.pdf
15 Automatic Question Answering: Beyond the Factoid http://www.aclweb.org/anthology-new/N/N04/N04-1008.pdf
16 Answering complex questions and performing deep reasoning in advanced question answering systems http://www.public.asu.edu/~cbaral/cse571-s05/classnotes/Baral-kick-off-may04-v2.ppt
17 What do You Mean? Finding Answers to Complex Questions http://web.syr.edu/~diekemar/ruimte/Papers/SS803ADiekema.pdf
18 Long-Answer Question Answering and Rhetorical-Semantic Relations http://www1.cs.columbia.edu/nlp/theses/sasha_blair_goldensohn.pdf
19 Question Answering using Integrated Information Retrieval and Information Extraction http://www.aclweb.org/anthology-new/N/N07/N07-1067.pdf
20 Using Semantic Roles to Improve Question Answering http://www.aclweb.org/anthology-new/D/D07/D07-1002.pdf
21 Answering Questions Using Advanced Semantics and Probabilistic Inference http://www.cs.pitt.edu/~mrotaru/comp/nlp/BBN/Narayanan%20QA%20NAACL%20Work%202004.pdf
22 Structured Retrieval for Question Answering http://durazno.lti.cs.cmu.edu/wiki/moin.cgi/Javelin_Project/Publications?action=AttachFile&do=get&target=fp376-bilotti.pdf
23 Answering complex questions and performing deep reasoning in advanced QA systems: ARDA AQUAINT Program Phase 2 http://www.public.asu.edu/~cbaral/aquaint04-06/site-visit/sept-04/Baral-site-visit-sept13-04-v1.ppt
24 Automatic Detection of Causal Relations for Question Answering http://portal.acm.org/citation.cfm?id=1119322
25 Learning to Find Answers to Questions on the Web http://portal.acm.org/citation.cfm?id=990303&coll=GUIDE&dl=ACM&CFID=37968417&CFTOKEN=68839066
26 Bridging the lexical chasm: statistical approaches to answer-finding http://portal.acm.org/citation.cfm?id=345576&coll=GUIDE&dl=ACM&CFID=37968417&CFTOKEN=68839066
27 Syntactic and Semantic Decomposition Strategies for Question Answering from Multiple Resources http://groups.csail.mit.edu/infolab/publications/Katz-etal-AAAI05W5.pdf
28 Analysis of Semantic Classes: Toward Non-Factoid QA http://ftp.cs.toronto.edu/pub/gh/Niu-thesis.pdf
29 The role of information retrieval in answering complex questions http://portal.acm.org/citation.cfm?id=1273073.1273141
30 A Statistical Classification Approach to Question Answering using Web Data http://www.ewdw.com/PUBLICATIONS/cw2005.pdf
31 Retrieving answers from frequently asked questions pages on the web http://portal.acm.org/citation.cfm?id=1099571
32 Probabilistic model for definitional question answering http://portal.acm.org/citation.cfm?id=1148210
33 Strategies for Advanced Question Answering http://acl.ldc.upenn.edu/hlt-naacl2004/qa/pdf/harabagiu-strategies.pdf
34 Towards Answering Opinion Questions: http://acl.ldc.upenn.edu/acl2003/emnlp/pdfs/Yu.pdf
35 Answering Definition Questions Using Multiple Knowledge Sources http://acl.ldc.upenn.edu/N/N04/N04-1007.pdf
36 Knowledge-Based Question Answering http://research.microsoft.com/~cyl/download/papers/sci2002.pdf
37 Using Scenario Knowledge in Automatic Question Answering http://acl.ldc.upenn.edu/W/W06/W06-0705.pdf

Sunday, September 23, 2007

Probabilistic IR-DFR

Divergence From Randomness Model

The model is based on the hypothesis that informative words are characterised by an elite set of documents, in which these words occur to a relatively greater extent than in the rest of the documents.

On the other hand, there are words which do not possess an elite set, and their frequency across documents therefore follows a random distribution.


There are three main important components of a DFR model:

1) Informative content

This is an inverse measure of the probability of having, "by chance", a certain number of occurrences of a term within a document, according to the model of randomness being used.


2) Gain

Measures the probability of coming across another occurrence of the term given that 'tf' occurrences have already been seen. This probability tends to 1 (i.e. becomes a certainty) as 'tf' increases. This mimics the property of informative word distribution.


3) Term frequency normalisation

Takes into account the relationship between document length and within-document term frequency, so that shorter documents are not penalised relative to longer ones.


The DFR models are based on this simple idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d". In other words the term-weight is inversely related to the probability of term-frequency within the document d obtained by a model M of randomness:


weight(t|d) = -log ProbM(t ∈ d | Collection)


where the subscript M stands for the type of model of randomness employed to compute the probability. In order to choose the appropriate model M of randomness, we can use different urn models. There are many ways to choose M, each of which provides a basic DFR model.

If the model M is the binomial distribution, then the basic model is P and computes the value


-log ProbP(t ∈ d | Collection) = -log [ (TF choose tf) * p^tf * q^(TF-tf) ]


where:

  • TF is the term-frequency of the term t in the Collection

  • tf is the term-frequency of the term t in the document d

  • N is the number of documents in the Collection

  • p is 1/N and q=1-p

Gain

When a rare term does not occur in a document, it has almost zero probability of being informative for the document. Conversely, if a rare term has many occurrences in a document, it has a very high probability (almost a certainty) of being informative for the topic described by the document. If the term frequency in the document is high, the risk of the term not being informative is minimal. In such a case we get a high weight, but a minimal risk also has the negative effect of providing a small information gain. Therefore, instead of using the full weight, we tune or smooth the weight by considering only the portion of it that is the amount of information gained with the term:

gain(t|d) = Prisk * (-log ProbM(t ∈ d | Collection))

The more the term occurs in the elite set, the less its term frequency is due to randomness, and hence the smaller the probability Prisk is, that is:


Prisk = 1 - Prob(t ∈ d | d ∈ Elite set)




Two models for computing the information gain with a term within a document are the Laplace model L and the ratio of two Bernoulli processes B:


Prisk = 1/(tf+1) (Laplace model L)

Prisk = TF/{ df*(tf+1) } (ratio B of two binomial distributions)


where df is the number of documents containing the term.
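
To make these pieces concrete, here is a minimal Python sketch of a DFR-style term weight. It is only an illustration of the formulas above: it approximates the binomial model P by a Poisson distribution with mean TF/N (a common simplification when N is large and p = 1/N is small), and the function name and interface are illustrative, not taken from any particular DFR implementation.

    import math

    def dfr_weight(tf, TF, N, df, gain_model="L"):
        """Toy DFR weight = Prisk * (-log2 ProbM(t in d | Collection)).

        tf: term frequency in the document    TF: term frequency in the collection
        N:  number of documents               df: number of documents containing t
        """
        lam = TF / N  # expected within-document frequency under randomness (p = 1/N)
        # Informative content: binomial model P via its Poisson approximation,
        # Prob(tf) ~ exp(-lam) * lam^tf / tf!
        log_prob = tf * math.log(lam) - lam - math.lgamma(tf + 1)
        inf_content = -log_prob / math.log(2)  # convert to -log2

        # First normalisation (gain): Laplace model L or ratio B of two binomials
        prisk = 1.0 / (tf + 1) if gain_model == "L" else TF / (df * (tf + 1.0))

        return prisk * inf_content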






Probabilistic IR-BIR and BII

Probabilistic Models

Goal: To estimate the probability that a document is relevant to user queries.


Probabilistic Ranking Principle (PRP)

The order in which to present documents to the user is by decreasing estimated probability of relevance with respect to the information need, i.e. P(R=1|d,q).


In the simple case of the PRP we assume there is no retrieval cost that would differentially weight actions or errors.

If we just want to return a set of retrieval results, the Bayes Optimal Decision Rule is to use:

d is relevant iff P(R=1|d,q) > P(R=0|d,q)


PRP with retrieval cost

Let C1 be the cost of retrieving a relevant document and C0 the cost of retrieving a non-relevant document. The PRP then says that if, for a specific document d and for all documents d' not yet retrieved,

C1*P(R=1|d)+C0*P(R=0|d)<=C1*P(R=1|d')+C0*P(R=0|d')

then d is the next document to be retrieved.


Binary Independence Retrieval Model

Documents and queries are both represented as binary term vectors, i.e. a document is represented by ~x=(x1,...,xn), where xt=1 if term t is present in the document d and xt=0 otherwise. Similarly, the query q is represented by ~q.


Independence means that there is no association between terms.

Assume that the relevance of each document is independent of the relevance of other documents.


P(R=1|~x,~q)={P(~x|R=1,~q) * P(R=1|~q)}/P(~x|~q)

P(R=0|~x,~q)={P(~x|R=0,~q) * P(R=0|~q)}/P(~x|~q)


Ranking functions for query terms

As we are interested only in the ranking of documents, we can work with other quantities that are easier to compute, such as the odds:

O(R|~x,~q)=P(R=1|~x,~q)/P(R=0|~x,~q)

={ P(R=1|~q)/P(R=0|~q) } * { P(~x|R=1,~q)/P(~x|R=0,~q) }


The first component is a constant for a given query so there is no need to estimate it.


To estimate the second part of the equation, we make the Naive Bayes Conditional Independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word.


P(~x|R=1,~q)/P(~x|R=0,~q)=∏(t=1 to M) { P(xt|R=1,~q)/P(xt|R=0,~q) }


So,

O(R|~x,~q)=O(R|~q) * ∏(t=1 to M) P(xt|R=1,~q)/P(xt|R=0,~q)


Since xt=0 or 1, we can separate the two cases:

O(R|~x,~q)=O(R|~q) * ∏(t:xt=1) { P(xt=1|R=1,~q)/P(xt=1|R=0,~q) } * ∏(t:xt=0) { P(xt=0|R=1,~q)/P(xt=0|R=0,~q) }


Let,

pt=P(xt=1|R=1,~q) be the probability of a term occurring in a document relevant to a query.

ut=P(xt=1|R=0,~q) be the probability of a term occurring in a document non-relevant to the query.


We make another assumption:

Terms not occurring in the query are equally likely to occur in relevant and non-relevant documents, i.e. if qt=0 then pt=ut.


So,

O(R|~x,~q)=O(R|~q) * ∏(t:xt=qt=1) pt/ut * ∏(t:xt=0,qt=1) (1-pt)/(1-ut)

=O(R|~q) * ∏(t:xt=qt=1) { pt*(1-ut) }/{ ut*(1-pt) } * ∏(t:qt=1) (1-pt)/(1-ut)

The second product in the above equation is a constant for a given query, since it ranges over all query terms regardless of the document.


Retrieval Status Value (RSV) in this model:

RSVd=log ∏(t:xt=qt=1) { pt*(1-ut) }/{ ut*(1-pt) }

=Σ(t:xt=qt=1) log {(pt*(1-ut))/(ut*(1-pt))}


Let Ct=log {(pt*(1-ut))/(ut*(1-pt))}

=log pt/(1-pt) + log (1-ut)/ut


The Ct quantities function as term weights in this model, and the document score for a query is

RSVd=Σ(t:xt=qt=1) Ct


Let,

s=number of relevant documents containing the term t

S=total number of relevant documents

nt=number of documents containing the term t

N=total number of documents in the collection

pt=s/S

ut=(nt-s)/(N-S)


In practice we use ut=nt/N, assuming that relevant documents are a very small percentage of the collection.
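
As a minimal sketch of how these estimates turn into a document score, assuming relevance judgments (s, S) are available for each query term. The add-0.5 smoothing is a common convention to avoid zero probabilities (not something derived above), and the function names are illustrative.

    import math

    def term_weight_ct(s, S, n_t, N):
        """Ct = log { pt*(1-ut) / (ut*(1-pt)) }, with 0.5 smoothing against zeros."""
        p_t = (s + 0.5) / (S + 1.0)            # P(xt = 1 | R = 1)
        u_t = (n_t - s + 0.5) / (N - S + 1.0)  # P(xt = 1 | R = 0)
        return math.log(p_t * (1 - u_t) / (u_t * (1 - p_t)))

    def rsv(doc_terms, query_terms, stats, S, N):
        """RSVd: sum Ct over the terms present in both the document and the query.
        stats[t] = (s, n_t) for each query term t."""
        return sum(term_weight_ct(stats[t][0], S, stats[t][1], N)
                   for t in query_terms if t in doc_terms)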


Binary Independence Indexing Model (BII)

While the BIR model regards a single query with respect to a number of documents, the BII model observes one document in relation to a number of queries submitted to the system.


The BII model yields the same ranking for two different queries formulated with the same set of terms.


Let

~zk=(zk1,...,zkn)

where zki=1 if ti ∈ qkT and zki=0 otherwise

So,

P(R|qk,dm)=P(R|~zk,dm)


Using Bayes' theorem,

P(R|~zk,dm)=P(R|dm) * { P(~zk|R,dm)/P(~zk|dm) }


Assuming that the terms are independently distributed in all queries to which a document with representation dm is relevant:

P(~zk|R,dm)=∏(i=1 to n) P(zki|R,dm)


Assuming that the relevance of a document dm with respect to a query qk depends only on the terms from qkT and not on other terms, we get:


P(R|~zk,dm)={ ∏(i=1 to n) P(zki) / P(~zk) } * P(R|dm) * ∏(zki=1) { P(R|zki=1,dm)/P(R|dm) } * ∏(zki=0) { P(R|zki=0,dm)/P(R|dm) }


The first factor is constant for a given query.


Assuming P(R|ti,dm)=P(R|dm) for all ti not in dmT:

P(R|qk,dm)=ck * P(R|dm) * ∏(ti ∈ qkT ∩ dmT) P(R|ti,dm)/P(R|dm)

where ck is a constant for the query qk.

Wednesday, September 12, 2007

Performance Measures

Measures of Performance


Precision & Recall

Precision=|{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}|

Recall=|{relevant docs} ∩ {retrieved docs}| / |{relevant docs in collection}|


Confusion Matrix

True +ve (TP): Number of relevant docs that the system retrieved

False +ve (FP): Number of non-relevant docs that the system retrieved

True -ve (TN): Number of non-relevant docs that the system did not retrieve

False -ve (FN): Number of relevant docs that the system did not retrieve



               Relevant    Non-Relevant
Retrieved         TP            FP
Not Retrieved     FN            TN


Precision=TP/(TP+FP)

Recall=TP/(TP+FN)


F-Score

Fα=1/{ α*(1/P) + (1-α)*(1/R) }


Where,

α --> weighting factor

Higher α weights precision more heavily; lower α weights recall more heavily. With α=0.5, F is the harmonic mean of P and R (the F1 score).
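
As a small sketch, here are the three measures computed from the confusion-matrix counts above; the zero-denominator guards anticipate a problem noted later in this post.

    def precision_recall_f(tp, fp, fn, alpha=0.5):
        """P, R and F-alpha from confusion-matrix counts; alpha=0.5 gives F1."""
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        f = 1.0 / (alpha / p + (1 - alpha) / r) if (p and r) else 0.0
        return p, r, f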


User Oriented Measures

Coverage=relevant docs retrieved that were known to the user / relevant docs known to the user


Novelty=relevant docs retrieved that were previously unknown to the user / relevant docs retrieved


Problems with precision & recall

  1. Users generally do not really care about recall.

  2. P & R don't take into account the order of answers.

  3. They don't take into account the degree of relevance.

  4. What to do if the denominators are zero?

  5. Measuring recall involves identifying all the relevant docs in the corpus, which is difficult.


Average Precision

AP=Σ(k=1 to |S|) { rel(sk)*P@k }/R

P@k=Σ(i=1 to k) rel(si)/k

R=Σ(d ∈ D) rel(d)


Where,

D=Corpus of documents

S=returned list of documents (s1,s2,...,sn)

P@K=precision of s1,s2,...,sk

rel(d)=”true” relevance

0: non-relevant

1: Relevant

R=Relevant docs in corpus


Example:


Rank       1    2    3    4    5    6    7    8    9    10
Recall     0.2  0.2  0.4  0.4  0.4  0.6  0.6  0.6  0.8  1.0
Precision  1.0  0.5  0.67 0.5  0.4  0.5  0.43 0.38 0.44 0.5


Suppose the documents at ranks 1, 3, 6, 9 and 10 are the relevant ones.


AP=(1.0+0.67+0.5+0.44+0.5)/5=0.62

(Calculated by averaging the precision at the ranks where recall increases, i.e. at the ranks where relevant documents appear.)
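
A minimal sketch of the computation, reproducing the example above:

    def average_precision(relevance):
        """relevance: list of 0/1 relevance flags in rank order."""
        R = sum(relevance)
        score, hits = 0.0, 0
        for k, rel in enumerate(relevance, start=1):
            if rel:  # recall increases here, so P@k contributes
                hits += 1
                score += hits / k
        return score / R if R else 0.0

    # Relevant documents at ranks 1, 3, 6, 9 and 10, as in the example
    print(average_precision([1, 0, 1, 0, 0, 1, 0, 0, 1, 1]))  # ~0.62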


Mean Average Precision (MAP)

--> Take the AP value for each query

--> Calculate the mean of these scores (average over queries)


Mean Reciprocal Rank (MRR)

Suppose we have 3 sample queries for a system that tries to translate English words to their plurals. In each case, the system makes three guesses, with the first one being the one it thinks is most likely correct.


Query    Results                 Correct    Rank    Reciprocal Rank
cat      catten, cati, cats      cats       3       1/3
torus    torii, tori, toruses    tori       2       1/2
virus    viruses, virii, viri    viruses    1       1


MRR=(1/3+1/2+1)/3=0.61
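
The same calculation as a short sketch:

    def mean_reciprocal_rank(ranks):
        """ranks: rank of the first correct answer for each query."""
        return sum(1.0 / r for r in ranks) / len(ranks)

    print(mean_reciprocal_rank([3, 2, 1]))  # ~0.61, as in the table above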


Precision@n

-->Proportion of top n documents that are satisfactory

P@10=0.4 means that 4 of the top 10 documents were satisfactory.

A run of 50 queries would be measured according to mean P@n.


Success@n

-->The proportion of queries for which a correct answer was within the top n.

s@1=0.5 means that for half of the queries, the correct answer was at rank 1.


Relevance Measurement

Term Frequency (tf)

-->tf in a document is simply the number of times a given term appears in that document.

-->This count is usually normalised to prevent a bias towards longer documents which may have a higher tf regardless of the actual importance of that term in the document.


tfi=ni/∑k nk


Where,

ni=number of occurrences of the considered term in the document

The denominator ∑k nk is the total number of term occurrences in the document.


Inverse Document Frequency (IDF)

idfi=log2 ( |D| / |{d: ti ∈ d}| )

Where,

|D|=total number of documents in the corpus

|{d: ti ∈ d}|=number of documents in which the term ti appears


Then, tf-idf = tf * idf
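
A minimal sketch of tf-idf over tokenised documents, following the two formulas above (the function name and interface are illustrative):

    import math
    from collections import Counter

    def tf_idf(term, doc_tokens, corpus):
        """corpus: list of token lists; doc_tokens: one document from it."""
        counts = Counter(doc_tokens)
        tf = counts[term] / len(doc_tokens) if doc_tokens else 0.0  # normalised tf
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log2(len(corpus) / df) if df else 0.0
        return tf * idf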


BM25

BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document.

score(D,Q)=Σ(i=1 to n) IDF(qi) * { f(qi,D)*(k1+1) } / { f(qi,D) + k1*(1 - b + b*|D|/avgdl) }


where f(qi,D) is qi's term frequency in the document D, |D| is the length of the document D in words, and avgdl is the average document length in the text collection from which documents are drawn. k1 and b are free parameters.

The variable k1 is a positive tuning parameter that calibrates the document term frequency scaling. A k1 value of 0 corresponds to a binary model (no term frequency), and a large value corresponds to using raw term frequency. b is another tuning parameter (0 ≤ b ≤ 1) which determines the scaling by document length: b = 1 corresponds to fully scaling the term weight by the document length, while b = 0 corresponds to no length normalization.

IDF(qi) is the IDF (inverse document frequency) weight of the query term qi.

It is usually computed as:
IDF(qi)=log {(N-n(qi)+0.5)/(n(qi)+0.5)}

where N is the total number of documents in the collection, and n(qi) is the number of documents containing qi.

If the query is long, then we might also use similar weighting for the query terms. This is appropriate if the queries are paragraph-long information needs, but unnecessary for short queries.

score(D,Q)=Σ(i=1 to n) IDF(qi) * { f(qi,D)*(k1+1) } / { f(qi,D) + k1*(1 - b + b*|D|/avgdl) } * { (k3+1)*f(qi,Q) } / { k3 + f(qi,Q) }

with k3 being another positive tuning parameter that, this time, calibrates the term frequency scaling of the query.
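
A minimal sketch of the basic (document-side) BM25 formula from above; the defaults k1 = 1.2 and b = 0.75 are commonly used values, not ones given in this post:

    import math

    def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
        """doc and the corpus entries are token lists."""
        N = len(corpus)
        avgdl = sum(len(d) for d in corpus) / N
        score = 0.0
        for q in query_terms:
            n_q = sum(1 for d in corpus if q in d)         # n(qi)
            idf = math.log((N - n_q + 0.5) / (n_q + 0.5))  # IDF(qi)
            f = doc.count(q)                               # f(qi, D)
            score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        return score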

Normalized Discounted Cumulative Gain (NDCG)

-->Devised specifically for web search evaluation

-->rewards relevant documents in the top ranked results more heavily than those ranked lower


For a query q:

Let rq(j) be the relevance level of the document ranked j with respect to q.

The result list is inspected up to rank L.

NDCGq = Zq * Σ(j=1 to L) (2^rq(j) - 1)/log(1+j)

Zq is a normalization factor chosen so that a perfect ordering of the results for the query receives a score of one.

Suppose we try the same query on two different search engines and get results with the relevance grades given below. Both result lists contain the same documents, but in Result1 the most relevant document (grade 3) appears earlier in the ranking (at position 1, counting the first position as 1) than in Result2 (position 2). The discount factor 1/log(1+j) therefore rewards Result1 with a greater score.

Result1=3 1 1
Gain1=7 1 1
Discount1=1 0.63 0.5
Discounted gain1=7 0.63 0.5 (DCG1 ≈ 8.13)


Result2=1 3 1
Gain2=1 7 1
Discount2=1 0.63 0.5
Discounted gain2=1 4.41 0.5 (DCG2 ≈ 5.92)
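
A minimal sketch reproducing the example; it uses log base 2, which matches the discounts 1, 0.63, 0.5 above, and normalises by the ideal DCG, which is one common way to obtain Zq:

    import math

    def dcg(grades, L=None):
        """Sum (2^r - 1) / log2(1 + j) over the first L ranks, j counted from 1."""
        top = grades[:L] if L is not None else grades
        return sum((2 ** r - 1) / math.log2(1 + j)
                   for j, r in enumerate(top, start=1))

    def ndcg(grades, L=None):
        ideal = dcg(sorted(grades, reverse=True), L)  # Zq = 1 / ideal DCG
        return dcg(grades, L) / ideal if ideal else 0.0

    print(dcg([3, 1, 1]))  # ~8.13 (Result1)
    print(dcg([1, 3, 1]))  # ~5.92 (Result2)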

Friday, September 7, 2007

Machine Learning from Text Classification Applications-David D. Lewis

Machine Learning from Text Classification Applications

Dave Lewis

David D. Lewis Consulting, LLC

www.DavidDLewis.com


Text Classification

Assigning text to one of several predefined groups


Use

As a component in NLP systems

Improving information access


How to improve Information Access

Relevance Feedback (Interactive Search)

Organize search results into classes

Awareness control (Filtering porn, highlighting relevant text etc.)


Categories of Classifications

Binary (either this or that)

Multiclass (belongs to one of the classes)

Multilabel (may belong to more than one class)

Ordinal (Multiclass w/ ordered classes)

Hierarchical (class organized as trees)

Fuzzy (ordinal with numeric membership)


Types of Classifiers

Rule-Based

Numeric (weighted combination of attribute values)

Instance Based (Direct comparison of text with known labeled examples)

Automata/Pattern Matching (Based on pattern matching)

Hybrid (Weighted/voted, numeric w/ attributes defined by patterns etc.)


Why good classifiers go bad

Input data changes (format changes, word meanings drift, etc.)

Classes added, deleted, redefined

Organizational policies changed

Cost of errors change


How to deal with this

Make robust classifiers

Monitor classifier effectiveness

Periodically retrain

Monitor for novel inputs

from Winning TREC to Beating Google-David Hawking

from Winning TREC to Beating Google

by

David Hawking

CSIRO & Funnelback Pty. Ltd.


This talk presented an overview of issues related to web search and a live demonstration of the search engine (specifically, indexing and searching) that he and his colleagues have built.


They used the UK2006 collection, which has:

  • 80 million pages (about 0.4% of the web)

  • crawled by Università degli Studi di Milano

  • distributed by Yahoo! Research, Barcelona

  • intended for spam detection/rejection experiments

  • Available at: www.yr-bcn.es/webspam/datasets/


What is the difference between TREC ad hoc and searching the web?

  • In TREC ad hoc the task is to find all documents relevant to a topic, whereas people searching the web are typically looking for just a couple of results.

  • The searches in TREC ad hoc are informational. Web searches, however, can serve informational, transactional or navigational needs.

  • People doing web search rarely go beyond the first page of results, so it is fair to say that evaluation measures should focus on the first page.

  • TREC ad hoc uses Mean Average Precision (MAP). Web search is typically measured by Normalised Discounted Cumulative Gain (NDCG).


Crawler tasks:

Maintain a seed list of URLs

Check robots.txt for permission to crawl


Problems

What to do if servers are down?

Web pages generally have lots of errors

Resolving URLs

How to handle redirects?

How to crawl URLs that are dynamically generated by scripts?


Indexing

Requires maintaining an enormous vocabulary and inverted files.

But how do we do instant updates and deletions on index files that are usually compressed and huge?


Ranking

Static Scores

  • PageRank (may have biases, e.g. a bias toward Fortune 500 companies vs. the rest)

  • URL form

  • fRank

  • Spam Score

  • Adult Content Score

Document at a Time Ranking

Assign document numbers in order of descending static score.

Should the index be distributed over clusters?


Caching

Cache web pages for common queries.

Caching is also needed for generating the summaries describing each result.


Query Processing- the Need for Speed

Rank the documents

For each result, obtain: Title, URLs, Query biased summary etc.