Tuesday, October 30, 2007
Reading List
| S.No. | Title | Source |
| 1 | Indexing Low Frequency Information for Question Answering | http://riao.free.fr/papers/100.pdf |
| 2 | Splitting Complex Temporal Questions for Question Answering systems | http://acl.ldc.upenn.edu/P/P04/P04-1072.pdf |
| 3 | Different Structures for Evaluating Answers to Complex Questions: Pyramids Won't Topple, and Neither Will Human Assessors | http://www.umiacs.umd.edu/~jimmylin/publications/ |
| 4 | Answering Clinical Questions with Knowledge-Based and Statistical Techniques | http://www.umiacs.umd.edu/~jimmylin/publications/ |
| 5 | Answering Complex, list and context questions with LCC's Question-Answering Server | http://trec.nist.gov/pubs/trec10/papers/lcc-trec10.pdf |
| 6 | TREC 2006 QA Track Papers | http://trec.nist.gov/pubs/trec15/t15_proceedings.trac |
| MIT QA | ||
| UMass at TREC ciQA 2006 | ||
| TREC-2006 at Maryland: Blog, Enterprise, Legal and QA Tracks | ||
| Contextual information and assessor characteristics in complex question answering | ||
| Identifying Relationships Between Entities in Text for Complex Interactive Question Answering Task | ||
| 7 | Answering complex questions with random walk models | http://portal.acm.org/citation.cfm?id=1148170.11482 |
| 8 | The role of information retrieval in answering complex questions | http://portal.acm.org/citation.cfm?id=1273141&coll= |
| 9 | Answering English questions by computer: a survey | http://portal.acm.org/citation.cfm?id=363732&coll=G |
| 10 | Complex Answers: A Case Study Using a WWW Question Answering Syst | http://citeseer.ist.psu.edu/560017.html |
| 11 | New Directions in Question Answering | http://portal.acm.org/citation.cfm?id=1108994.11090 |
| 12 | USING SEMANTIC REPRESENTATIONS IN QUESTION ANSWERING | http://www.cs.colorado.edu/~martin/Papers/cuaq-ico |
| 13 | Question Answering Based on Semantic Structures | http://www.icsi.berkeley.edu/~snarayan/837.pdf |
| 14 | Finding Answers to Complex Questions | http://web.syr.edu/~diekemar/ruimte/Papers/Chapter |
| 15 | Automatic Question Answering: Beyond the Factoid | http://www.aclweb.org/anthology-new/N/N04/N04-100 |
| 16 | Answering complex questions and performing deep reasoning in advance question answering systems | http://www.public.asu.edu/~cbaral/cse571-s05/class |
| 17 | What do You Mean? Finding Answers to Complex Questions | http://web.syr.edu/~diekemar/ruimte/Papers/SS803A |
| 18 | Long-Answer Question Answering and Rhetorical-Semantic Relations | http://www1.cs.columbia.edu/nlp/theses/sasha_blair |
| 19 | Question Answering using Integrated Information Retrieval and Information Extraction | http://www.aclweb.org/anthology-new/N/N07/N07-106 |
| 20 | Using Semantic Roles to Improve Question Answering | http://www.aclweb.org/anthology-new/D/D07/D07-100 |
| 21 | Answering Questions Using Advanced Semantics and Probabilistic Inferenc | http://www.cs.pitt.edu/~mrotaru/comp/nlp/BBN/Nara |
| 22 | Structured Retrieval for Question Answering | http://durazno.lti.cs.cmu.edu/wiki/moin.cgi/Javelin_P |
| 23 | Answering complex questions and performing deep reasoning in advance QA systems: ARDA AQUAINT Program Phase 2 | http://www.public.asu.edu/~cbaral/aquaint04-06/site- |
| 24 | Automatic Detection of Causal Relations for Question Answering | http://portal.acm.org/citation.cfm?id=1119322 |
| 25 | Learning to Find Answers to Questions on the Web | http://portal.acm.org/citation.cfm?id=990303&coll=G |
| 26 | Bridging the lexical chasm: statistical approaches to answer-finding | http://portal.acm.org/citation.cfm?id=345576&coll=G |
| 27 | Syntactic and Semantic Decomposition Strategies for Question Answering from Multiple Resources | http://groups.csail.mit.edu/infolab/publications/Katz- |
| 28 | ANALYSIS OF SEMANTIC CLASSES: TOWARD NON-FACTOID QA | http://ftp.cs.toronto.edu/pub/gh/Niu-thesis.pdf |
| 29 | The role of information retrieval in answering complex questions | http://portal.acm.org/citation.cfm?id=1273073.12731 |
| 30 | A Statistical Classification Approach to Question Answering using Web Dat | http://www.ewdw.com/PUBLICATIONS/cw2005.pdf |
| 31 | Retrieving answers from frequently asked questions pages on the web | http://portal.acm.org/citation.cfm?id=1099571 |
| 32 | Probabilistic model for definitional question answering | http://portal.acm.org/citation.cfm?id=1148210 |
| 33 | Strategies for Advanced Question Answering | http://acl.ldc.upenn.edu/hlt-naacl2004/qa/pdf/harabag |
| 34 | Towards Answering Opinion Questions: | http://acl.ldc.upenn.edu/acl2003/emnlp/pdfs/Yu.pdf |
| 35 | Answering Definition Questions Using Multiple Knowledge Sources | http://acl.ldc.upenn.edu/N/N04/N04-1007.pdf |
| 36 | Knowledge-Based Question Answering | http://research.microsoft.com/~cyl/download/papers/ |
| 37 | Using Scenario Knowledge in Automatic Question Answering | http://acl.ldc.upenn.edu/W/W06/W06-0705.pdf |
Sunday, September 23, 2007
Probabilistic IR-DFR
Divergence From Randomness Model
The model is based on the hypothesis that informative words are characterised by an elite set of documents, in which these words occur to a relatively greater extent than in the rest of the collection.
Non-informative words, on the other hand, have no elite set, and their frequency across documents follows a random distribution.
There are three main components of a DFR model:
1) Informative content
This is an inverse measure of the probability of observing, purely by chance, a certain number of occurrences of a term within a document, according to the chosen model of randomness.
2) Gain
This measures the probability of encountering another occurrence of the term given that tf occurrences have already been seen. The probability tends to 1 (i.e. becomes a certainty) as tf increases, mimicking the distribution of informative words.
3) Term frequency normalisation
This takes into account the relationship between document length and within-document term frequency, so that shorter documents are not penalised relative to longer ones.
The DFR models are based on this simple idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d". In other words the term-weight is inversely related to the probability of term-frequency within the document d obtained by a model M of randomness:
weight(t|d) ∝ -log ProbM(t∊d|Collection)
where the subscript M stands for the type of model of randomness employed to compute the probability. In order to choose the appropriate model M of randomness, we can use different urn models. There are many ways to choose M, each of these provides a basic DFR model.
If the model M is the binomial distribution, then the basic model is P and computes the value
-log ProbP(t∊d|Collection) = -log [ C(TF, tf) * p^tf * q^(TF-tf) ]
where:
C(TF, tf) is the binomial coefficient "TF choose tf"
TF is the term frequency of the term t in the Collection
tf is the term frequency of the term t in the document d
N is the number of documents in the Collection
p = 1/N and q = 1 - p
Gain
When a rare term does not occur in a document, it has almost zero probability of being informative for that document. Conversely, if a rare term occurs many times in a document, it is almost certainly informative for the topic the document describes. If the within-document term frequency is high, the risk of the term not being informative is minimal; we then get a high weight, but a minimal risk also means a small information gain. Therefore, instead of using the full weight, we smooth it by keeping only the portion that corresponds to the information gained with the term:
gain(t|d) = Prisk * (-log ProbM(t∊d|Collection))
The more the term occurs in the elite set, the less its term frequency is due to randomness, and the smaller the probability Prisk, defined as:
Prisk = 1 - Prob(t∊d|d∊Elite set)
Two models for computing the information gain of a term within a document are the Laplace model L and the ratio B of two Bernoulli processes:
Prisk = 1/(tf+1)   (Laplace model L)
Prisk = TF/{ df*(tf+1) }   (ratio B of two Bernoulli processes)
where df is the number of documents containing the term.
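To make this concrete, here is a minimal Python sketch (with made-up counts) of a basic DFR weight combining the binomial randomness model P with the Laplace and Bernoulli-ratio first normalisations described above; the term frequency (document length) normalisation component is omitted.

```python
import math

def dfr_weights(tf, TF, N, df):
    """Basic DFR weights: binomial model P for the informative content,
    then the Laplace (L) and Bernoulli-ratio (B) gain normalisations.
    tf: term occurrences in the document, TF: occurrences in the whole
    collection, N: documents in the collection, df: documents containing
    the term. No document-length normalisation is applied."""
    p = 1.0 / N
    q = 1.0 - p
    # Inf1 = -log2 ProbP(tf occurrences by chance) under the binomial model
    log_prob = (math.log2(math.comb(TF, tf))
                + tf * math.log2(p) + (TF - tf) * math.log2(q))
    inf1 = -log_prob
    p_risk_L = 1.0 / (tf + 1)            # Laplace model L
    p_risk_B = TF / (df * (tf + 1))      # ratio B of two Bernoulli processes
    return p_risk_L * inf1, p_risk_B * inf1

# A term occurring 3 times in this document, 5 times in a 1000-document
# collection spread over 2 documents (hypothetical numbers).
print(dfr_weights(tf=3, TF=5, N=1000, df=2))
```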
Probabilistic IR-BIR and BII
Probabilistic Models
Goal: To estimate the probability that a document is relevant to user queries.
Probabilistic Ranking Principle (PRP)
The order in which to present documents to the user is to rank them by their estimated probability of relevance with respect to the information need, i.e. P(R=1|d,q).
In the simple case of the PRP we assume there is no retrieval cost that would differentially weigh actions or errors.
If we just want to return a set of retrieval results, the Bayes Optimal Decision Rule is to use:
d is relevant iff P(R=1|d,q) > P(R=0|d,q)
PRP with retrieval cost
Let C1 be the cost of retrieval of a relevant document and C0 be the cost of retrieval of a non-relevant document. Then PRP says that if for a specific document d and for all documents d' not yet retrieved
C1*P(R=1|d)+C0*P(R=0|d)<=C1*P(R=1|d')+C0*P(R=0|d')
then d is the next document to be retrieved.
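A minimal sketch of this costed decision rule, with made-up probabilities and costs; it assumes (as is conventional, though not stated above) that C1 is negative because retrieving a relevant document is a benefit rather than a cost.

```python
def next_document(candidates, c1=-1.0, c0=1.0):
    """candidates maps a not-yet-retrieved document id to P(R=1|d).
    Expected retrieval cost of d is C1*P(R=1|d) + C0*P(R=0|d);
    the PRP retrieves next the document that minimises it."""
    def expected_cost(p_rel):
        return c1 * p_rel + c0 * (1.0 - p_rel)
    return min(candidates, key=lambda d: expected_cost(candidates[d]))

print(next_document({"d1": 0.2, "d2": 0.7, "d3": 0.4}))   # -> d2
```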
Binary Independence Retrieval Model
Documents and queries are both represented as binary term vectors, i.e. a document is represented by ~x=(x1,...,xn), where xt=1 if term t is present in the document d and xt=0 otherwise. Similarly, the query q is represented by ~q.
Independence means that there is no association between terms.
Assume that the relevance of each document is independent of the relevance of other documents.
P(R=1|~x,~q)={P(~x|R=1,~q) * P(R=1|~q)}/P(~x|~q)
P(R=0|~x,~q)={P(~x|R=0,~q) * P(R=0|~q)}/P(~x|~q)
Ranking functions for query terms
As we are interested only in the ranking of documents, we use some other quantities that are easier to compute like odds.
O(R|~x,~q)=P(R=1|~x,~q)/P(R=0|~x,~q)
={ P(R=1|~q)/P(R=0|~q) } * { P(~x|R=1,~q)/P(~x|R=0,~q) }
The first component is a constant for a given query so there is no need to estimate it.
To estimate the second part of the equation, we make the Naive Bayes Conditional Independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word.
P(~x|R=1,~q)/P(~x|R=0,~q)=∏(t=1 to M) { P(xt|R=1,~q)/P(xt|R=0,~q) }
So,
O(R|~x,~q)=O(R|~q) *∏ (t=1 to M) P(xt|R=1,~q)/P(xt|R=0,~q)
Since xt=0 or 1, we can separate the two cases:
O(R|~x,~q)=O(R|~q) ∏(t:xt=1) { P(xt=1|R=1,~q)/P(xt=1|R=0,~q) } * ∏(t:xt=0) { P(xt=0|R=1,~q)/P(xt=0|R=0,~q) }
Let,
pt=P(xt=1|R=1,~q) be the probability of a term occurring in a document relevant to a query.
ut=P(xt=1|R=0,~q) be the probability of a term occurring in a document not-relevant to a query.
We make another assumption:
Terms not occurring in the query are equally likely to occur in relevant and non-relevant documents, i.e. if qt=0 then pt=ut.
So,
O(R|~x,~q)=O(R|~q) * ∏(t:xt=qt=1) pt/ut * ∏(t:xt=0,qt=1) (1-pt)/(1-ut)
=O(R|~q) * ∏(t:xt=qt=1) { pt*(1-ut) }/{ ut*(1-pt) } * ∏(t:qt=1) (1-pt)/(1-ut)
The second product in the last expression ranges over all query terms, so it is a constant for a given query and can be ignored for ranking.
Retrieval Status Value (RSV) in this model:
RSVd=log ∏(t:xt=qt=1) { pt*(1-ut) }/{ ut * (1-pt) }
=∑(t:xt=qt=1) log {(pt*(1-ut))/(ut*(1-pt))}
Let Ct=log {(pt*(1-ut))/(ut*(1-pt))}
=log pt/(1-pt) + log (1-ut)/ut
The Ct quantities function as term weights in this model and the document score for a query is
RSVd=∑(t:xt=qt=1) Ct
Let,
s=Number of relevant documents containing term t
S=Total number of relevant documents
nt=Number of documents containing term t
N=Total number of documents
pt=s/S
ut=(nt-s)/(N-S)
In practice ut ≈ nt/N, assuming that relevant documents are a very small percentage of the collection.
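A minimal sketch of how the Ct weights and the RSV could be computed, assuming hypothetical relevance judgements and a toy collection; the add-0.5 smoothing is an addition (not in the derivation above) to avoid zero or undefined weights when s = 0 or s = S.

```python
import math

def bir_term_weight(s, S, n_t, N, k=0.5):
    """Ct = log[ pt*(1-ut) / (ut*(1-pt)) ] with add-k smoothing."""
    p_t = (s + k) / (S + 2 * k)             # estimate of pt
    u_t = (n_t - s + k) / (N - S + 2 * k)   # estimate of ut
    return math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))

def rsv(doc_terms, query_terms, judgements, N, S):
    """Sum Ct over terms present in both the document and the query.
    judgements maps a term t to (s, nt)."""
    score = 0.0
    for t in query_terms & doc_terms:
        s, n_t = judgements.get(t, (0, 0))
        score += bir_term_weight(s, S, n_t, N)
    return score

# Hypothetical collection of 1000 docs, 10 of them judged relevant.
judgements = {"retrieval": (6, 80), "probabilistic": (4, 40)}
doc = {"probabilistic", "retrieval", "models"}
query = {"probabilistic", "retrieval"}
print(rsv(doc, query, judgements, N=1000, S=10))
```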
Binary Independence Indexing Model (BII)
Whereas the BIR model regards a single query with respect to a number of documents, the BII model observes one document in relation to a number of queries submitted.
The BII model yields the same ranking for two different queries formulated with the same set of terms.
Let
~Zk=(zk1,......,zkn)
where zki=1 if ti ∈ qk^T (the set of terms in query qk) and zki=0 otherwise
So,
P(R|qk,dm)=P(R|~Zk,dm)
Using Bayes' theorem,
P(R|~Zk,dm)=P(R|dm) * { P(~Zk|R,dm)/P(~Zk|dm) }
Assuming that the distribution of terms in all queries to which a document with representation dm is relevant is independent
P(~Zk|R,dm)=∏(i=1 to n) P(zki|R,dm)
Assuming that the relevance of a document dm w.r.t. a query qk depends only on the terms in qk^T and not on other terms, we get:
P(R|~Zk,dm)=∏(i=1 to n) P(zki)/P(~Zk) * P(R|dm) * ∏(i:zki=1) { P(R|zki=1,dm)/P(R|dm) } * ∏(i:zki=0) { P(R|zki=0,dm)/P(R|dm) }
The first term is constant for a given query.
Assuming P(R|ti,dm)=P(R|dm) for all ti not in dm^T (the set of terms in dm), we get:
P(R|qk,dm)=ck * P(R|dm) * ∏(ti ∈ qk^T ∩ dm^T) P(R|ti,dm)/P(R|dm)
where ck is the query-dependent constant from the first factor above.
Wednesday, September 12, 2007
Performance Measures
Measures of Performance
Precision & Recall
Precision = |{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}|
Recall = |{relevant docs} ∩ {retrieved docs}| / |{relevant docs in collection}|
Confusion Matrix
True +ve (TP): Number of relevant docs that system retrieved
False +ve (FP): Number of non-relevant docs that system retrieved
True -ve (TN): Number of non-relevant docs that the system did not retrieve
False -ve (FN): Number of relevant docs that the system did not retrieve
| | Relevant | Non-Relevant |
| Retrieved | TP | FP |
| Not-Retrieved | FN | TN |
Precision=TP/(TP+FP)
Recall=TP/(TP+FN)
F-Score
Fα=1/{(α*1/P)+(1-α)*1/R}
Where,
α-->Weighting factor
Higher α means that precision is more important and lower α means that recall is more important (α = 1/2 gives the balanced F1, the harmonic mean of P and R).
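A minimal sketch computing precision, recall and Fα from the confusion-matrix counts above (the counts in the example call are made up):

```python
def precision_recall_f(tp, fp, fn, alpha=0.5):
    """P, R and F_alpha = 1 / (alpha/P + (1-alpha)/R);
    alpha = 0.5 gives the familiar F1 = 2PR/(P+R)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 or r == 0.0:
        return p, r, 0.0
    f = 1.0 / (alpha / p + (1.0 - alpha) / r)
    return p, r, f

# 8 relevant retrieved, 12 non-relevant retrieved, 2 relevant missed
print(precision_recall_f(tp=8, fp=12, fn=2))   # P=0.4, R=0.8, F1≈0.53
```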
User Oriented Measures
Coverage = relevant docs retrieved that were already known to the user / relevant docs known to the user
Novelty = relevant docs retrieved that were previously unknown to the user / relevant docs retrieved
Problems with precision & recall
Users generally do not really care about recall.
P & R don't take into account the order of answers.
They don't take into account the degree of relevance.
What to do if the denominators are zero?
Measuring Recall involves identifying relevant docs in the corpus which is difficult.
Average Precision
AP=Summation(k=1 to |S|) {rel(sk)*P@k} / R
P@k=Summation(i=1 to k) rel(si) / k
R=Summation(d ∈ D) rel(d)
Where,
D=corpus of documents
S=returned list of documents (s1,s2,...,sn)
P@k=precision over s1,s2,...,sk
rel(d)="true" relevance
0: non-relevant
1: relevant
R=number of relevant docs in the corpus
Example:
| Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| Recall | 0.2 | 0.2 | 0.4 | 0.4 | 0.4 | 0.6 | 0.6 | 0.6 | 0.8 | 1.0 |
| Precision | 1.0 | 0.5 | 0.67 | 0.5 | 0.4 | 0.5 | 0.43 | 0.38 | 0.44 | 0.5 |
Suppose 1, 3, 6, 9 and 10 are relevant documents.
AP=(1.0+0.67+0.5+0.44+0.5)/5=0.62
(Calculated by averaging the precision values at the ranks where recall increases, i.e. where a relevant document appears.)
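The same computation as a short sketch, reproducing the example above:

```python
def average_precision(relevance):
    """relevance: 0/1 judgements for the ranked list, in rank order.
    AP = (1/R) * sum of precision at each rank where a relevant doc appears."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    ap, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            ap += hits / k          # P@k at this relevant rank
    return ap / total_relevant

# Ranks 1, 3, 6, 9 and 10 relevant, as in the table above
print(average_precision([1, 0, 1, 0, 0, 1, 0, 0, 1, 1]))   # ≈ 0.62
```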
Mean Average Precision (MAP)
-->Take the AP value for each query
-->Calculate the mean of these scores (average over queries)
Mean Reciprocal Rank (MRR)
Suppose we have 3 sample queries for a system that tries to translate English words to their plurals. In each case, the system makes three guesses, with the first one being the one it thinks is most likely correct.
| Query | Results | Correct | Rank | Reciprocal Rank |
| cat | Catten, cati, cats | cats | 3 | 1/3 |
| torus | Torii, tori, toruses | tori | 2 | 1/2 |
| virus | Viruses, virii, viri | viruses | 1 | 1 |
MRR=(1/3+1/2+1)/3=0.61
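As a sketch, MRR over the reciprocal ranks of the first correct answer per query:

```python
def mean_reciprocal_rank(first_correct_ranks):
    """first_correct_ranks: 1-based rank of the first correct answer for
    each query, or None if no correct answer was returned."""
    rr = [1.0 / r if r else 0.0 for r in first_correct_ranks]
    return sum(rr) / len(rr)

print(mean_reciprocal_rank([3, 2, 1]))   # the cat/torus/virus example: ≈ 0.61
```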
Precision@n
-->Proportion of top n documents that are satisfactory
P@10=0.4 means that 4 of the top 10 documents were satisfactory.
A run of 50 queries would be measured according to mean P@n.
Success@n
-->The proportion of queries for which a correct answer was within the top n.
s@1=0.5 means that for half of the queries, the correct answer was at rank 1.
Relevance Measurement
Term Frequency (tf)
-->tf in a document is simply the number of times a given term appears in that document.
-->This count is usually normalised to prevent a bias towards longer documents which may have a higher tf regardless of the actual importance of that term in the document.
tfi=ni/∑k nk
Where,
ni=number of occurrences of the considered term
the denominator is the total number of occurrences of all terms in the document
Inverse Document Frequency (IDF)
idfi=log2 ( |D| / |{d: ti∊d}| )
Where,
|D|=total number of documents in the corpus
|{d: ti∊d}|=number of documents in which the term ti appears
Then, tf-idf = tf × idf
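A minimal sketch of this tf-idf weighting on a toy corpus (the documents are made up):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf} dict per
    document, with tf = count/len(doc) and idf = log2(|D| / df)."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log2(N / df[t])
                        for t, c in counts.items()})
    return weights

docs = [["cats", "like", "milk"],
        ["dogs", "like", "bones"],
        ["cats", "and", "dogs"]]
print(tf_idf(docs)[0])   # "milk" gets the highest weight in the first doc
```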
BM25
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document.
score(D,Q)=Summation(i=1 to n) IDF(qi) * {(f(qi,D)*(k1+1))/(f(qi,D)+k1*(1-b+b*|D|/avgdl))}
where f(qi,D) is qi's term frequency in the document D, | D | is the length of the document D, and avgdl is the average document length in the text collection from which documents are drawn. k1 and b are free parameters.
The variable k1 is a positive tuning parameter that calibrates the document term frequency scaling. A k1 value of 0 corresponds to a binary model (no term frequency), and a large value corresponds to using raw term frequency. b is another tuning parameter (0 ≤ b ≤ 1) which determines the scaling by document length: b = 1 corresponds to fully scaling the term weight by the document length, while b = 0 corresponds to no length normalization.
IDF(qi) is the IDF (inverse document frequency) weight of the query term qi.
It is usually computed as:
IDF(qi)=log {(N-n(qi)+0.5)/(n(qi)+0.5)}
where N is the total number of documents in the collection, and n(qi) is the number of documents containing qi.
If the query is long, we might also use a similar weighting for query terms. This is appropriate if the queries are paragraph-long information needs, but unnecessary for short queries.
score(D,Q)=Summation(i=1 to n) IDF(qi) * {(f(qi,D)*(k1+1))/(f(qi,D)+k1*(1-b+b*|D|/avgdl))} * {(k3+1)f(qi,Q)/(k3+f(qi,Q))}
k3 being another positive tuning parameter that this time calibrates term frequency scaling of the query.
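A minimal sketch of the document-side BM25 score above on a toy collection; k1 and b are set to commonly used values, and the k3 query-side factor is omitted.

```python
import math

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """query and doc are token lists; docs is the whole collection,
    used for N, avgdl and the document frequencies n(qi)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for q in set(query):
        n_q = sum(1 for d in docs if q in d)        # n(qi)
        if n_q == 0:
            continue
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))
        f = doc.count(q)                            # f(qi, D)
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * (f * (k1 + 1)) / denom
    return score

docs = [["probabilistic", "retrieval", "models"],
        ["divergence", "from", "randomness"],
        ["language", "models", "for", "information", "retrieval"],
        ["bm25", "ranking", "function"]]
print(bm25_score(["probabilistic", "retrieval"], docs[0], docs))
```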
Normalized Discounted Cumulative Gain (NDCG)
-->Devised specifically for web search evaluation
-->rewards relevant documents in the top ranked results more heavily than those ranked lower
For a query q
Let the relevance level of the document ranked j with respect to q be rq(j).
The response list is inspected up to rank L.
NDCGq = Zq * Summation(j=1 to L) (2^rq(j) − 1)/log2(1+j)
Zq is a normalisation factor chosen so that a perfect ordering of the results for the query receives a score of one.
Suppose we run the same query on two different search engines and get the results below (the numbers are graded relevance levels). Both result lists contain the same documents, but in Result1 the most relevant document (grade 3) appears earlier (position 1, counting the first position as 1) than in Result2 (position 2). The discount factor 1/log2(1+j) therefore rewards Result1 with a higher score.
Result1 (relevance grades) = 3 1 1
Gain1 (2^r − 1) = 7 1 1
Discount (1/log2(1+j)) = 1 0.63 0.5
Discounted gains = 7 0.63 0.5, so DCG1 = 8.13
Result2 (relevance grades) = 1 3 1
Gain2 = 1 7 1
Discount = 1 0.63 0.5
Discounted gains = 1 4.41 0.5, so DCG2 = 5.91
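A short sketch of the DCG/NDCG computation, using the same conventions as the example (gain 2^r − 1, discount 1/log2(1+j), and Zq taken from the ideal ordering):

```python
import math

def dcg(relevances, L=None):
    """Sum of (2^r - 1) / log2(1 + j) over the first L ranks."""
    L = L or len(relevances)
    return sum((2 ** r - 1) / math.log2(1 + j)
               for j, r in enumerate(relevances[:L], start=1))

def ndcg(relevances, L=None):
    """Normalise by the DCG of the ideal (sorted) ordering, so that a
    perfect ranking scores 1."""
    ideal = dcg(sorted(relevances, reverse=True), L)
    return dcg(relevances, L) / ideal if ideal > 0 else 0.0

print(ndcg([3, 1, 1]))   # Result1: 1.0 (already ideally ordered)
print(ndcg([1, 3, 1]))   # Result2: ≈ 0.73
```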
Friday, September 7, 2007
Machine Learning from Text Classification Applications - David D. Lewis
Machine Learning from Text Classification Applications
Dave Lewis
David D. Lewis Consulting, LLC
Text Classification
Assigning text to one of several predefined groups
Use
As a component in NLP systems
Improving information access
How to improve Information Access
Relevance Feedback (Interactive Search)
Organize search results into classes
Awareness control (Filtering porn, highlighting relevant text etc.)
Categories of Classifications
Binary (either this or that)
Multiclass (belongs to one of the classes)
Multilabel (may belong to more than one class)
Ordinal (Multiclass w/ ordered classes)
Hierarchical (class organized as trees)
Fuzzy (ordinal with numeric membership)
Types of Classifiers
Rule-Based
Numeric (weighted combination of attribute values)
Instance Based (Direct comparison of text with known labeled examples)
Automata/Pattern Matching (Based on pattern matching)
Hybrid (Weighted/voted, numeric w/ attributes defined by patterns etc.)
Why good classifiers go bad
Input data changes (change in format, meaning of word changes etc.)
Classes added, deleted, redefined
Organizational policies changed
Cost of errors change
How to deal with this
Make robust classifier
Monitor classifier effectiveness
Periodically retrain
Monitor for novel inputs
From Winning TREC to Beating Google - David Hawking
From Winning TREC to Beating Google
by
David Hawking
CSIRO & Funnelback Pty. Ltd.
This talk presented an overview of issues related to web search and a live demonstration of the search engine (specifically indexing and searching) that he and his colleagues have built.
They used the UK2006 collection, which has:
80 million pages (about 0.4% of the web)
crawled by Università degli Studi di Milano
distributed by Yahoo! Research, Barcelona
intended for spam detection/rejection experiments
Available at: www.yr-bcn.es/webspam/datasets/
What is the difference between TREC ad hoc and searching the web?
In TREC ad hoc the task is to find all documents relevant to a topic, whereas people searching the web are usually just looking for a couple of results.
The searches in TREC ad hoc are informational. Web searches, however, can serve informational, transactional or navigational needs.
People doing web search hardly ever go beyond the first page, so it is fair to say that measures should take into account only the first page of results.
TREC ad hoc uses Mean Average Precision (MAP); web search is typically measured by Normalised Discounted Cumulative Gain (NDCG).
Crawler tasks:
Maintain a seed list of URLs
Look at robots.txt for permission to crawl
Problems
What to do if servers are down?
Web pages generally have lots of errors
Resolving URLs
How to handle redirects?
How to crawl URLs that are dynamically generated by a script?
Indexing
Requires maintaining an enormous vocabulary and inverted files.
But how do we do instant updates and deletions on index files that are usually compressed and huge?
Ranking
Static Scores
Page Rank (may have some biases, e.g. a bias toward Fortune 500 companies versus the rest)
URL form
fRank
Spam Score
Adult Content Score
Document at a Time Ranking
Assign document numbers in order of descending static score.
Should index be distributed over clusters?
Caching
Cache web pages for common queries.
Caching also needed for generating summaries describing the result.
Query Processing - the Need for Speed
Rank the documents
For each result, obtain: Title, URLs, Query biased summary etc.