Natural Language Processing for Information Retrieval
Maarten de Rijke
ISLA,
Basic NLP is commonly used in IR, e.g. tokenizing, stopword removal, and stemming. More advanced techniques are also common, e.g. phrase identification and named entity extraction. But advanced NLP can be problematic for IR, because IR is not about semantics or syntactic structure; it is about statistical properties of text. IR problems can, however, be reworked so that NLP becomes potentially useful, e.g. question answering, sentiment analysis, biomedical text analysis, etc.
Question Answering
Questions people ask typically belong to one of these categories:
- Factoids e.g. Where does moss grow?
- Definition(oid)s e.g. What is a rational number?
- Procedures e.g. How to speed up XP?
There are some that are difficult to categorize, e.g. How to understand women?
Anatomy of a question
Question type
Idiomatic categorization of questions:
TREC 2003: factoid, list, definition
Answer type
The class of object sought by the question:
Person (Who..?), Place (Where..?), Date (When..?), Number (How Many..?)
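The wh-word is a strong cue for the answer type, so a first-pass classifier can be a handful of patterns. A hedged sketch (the patterns and class labels here are illustrative assumptions, not a description of any particular TREC system):

```python
import re

# Map a question's wh-phrase to the class of object it seeks.
# Order matters: "how many" must be checked before any bare "how" rule.
ANSWER_TYPES = [
    (re.compile(r"^who\b", re.I), "Person"),
    (re.compile(r"^where\b", re.I), "Place"),
    (re.compile(r"^when\b", re.I), "Date"),
    (re.compile(r"^how many\b", re.I), "Number"),
]

def answer_type(question):
    for pattern, label in ANSWER_TYPES:
        if pattern.match(question):
            return label
    return "Unknown"

print(answer_type("Where does moss grow?"))           # → Place
print(answer_type("How many moons does Mars have?"))  # → Number
```

Real systems go beyond the wh-word (e.g. "What city..." is a Place question despite starting with "What"), but the pattern-matching skeleton is the same.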
Question focus
The property or entity that is being sought by the question:
In what state is the
What is the population of
Question topic
The object or event that the question is about:
What is the height of
Here, height is the focus and the object whose height is asked for is the topic.
Historically, the corpora used by QA approaches have moved from databases (1970s) to encyclopedias (1990s) to the web (2000-).
Evaluation Measures
MRR (Mean Reciprocal Rank) of the top N candidates returned by the system, N = 5, 3, 1
Precision@1
Confidence Weighted Score (CWS) = (1/N) * Σ_{i=1..N} (#correct up to rank i) / i
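The three measures above can be made concrete with small reference implementations. The input representation (one 0/1 judgement list per question, or one confidence-ordered run for CWS) is an assumption made for illustration:

```python
def mrr(ranked_judgements):
    # Mean Reciprocal Rank: average over questions of 1/rank of the
    # first correct answer (0 if no correct answer is returned).
    total = 0.0
    for judgements in ranked_judgements:
        for rank, correct in enumerate(judgements, start=1):
            if correct:
                total += 1.0 / rank
                break
    return total / len(ranked_judgements)

def precision_at_1(ranked_judgements):
    # Fraction of questions whose top-ranked answer is correct.
    return sum(j[0] for j in ranked_judgements) / len(ranked_judgements)

def cws(judgements):
    # Confidence Weighted Score over one run of N answers, ordered by
    # the system's confidence: (1/N) * sum_i (#correct up to i) / i.
    correct_so_far = 0
    total = 0.0
    for i, correct in enumerate(judgements, start=1):
        correct_so_far += correct
        total += correct_so_far / i
    return total / len(judgements)

runs = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
print(mrr(runs))             # (1/2 + 1 + 0) / 3 = 0.5
print(precision_at_1(runs))  # 1/3
print(cws([1, 0, 1]))        # (1/1 + 1/2 + 2/3) / 3 ≈ 0.722
```

CWS rewards systems that rank the answers they are most confident about first: an early correct answer is counted in every subsequent prefix.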
QA at TREC and CLEF
TREC-8
Answers can be 50 or 250 bytes long
Systems return up to 5 answers
Answers had to be justified (supply supporting documents)
Scored by MRR
TREC-10
No answer questions introduced (NIL questions)
TREC-11
Only 1 answer
Exact answer only
Scored by CWS
TREC-12
Definition and list questions
TREC-13
Scenario based
TREC-15
ciQA: complex, interactive QA
TREC-16
ciQA
Document collection consists of newspapers and blogs
CLEF 2006
Answer validation exercise, real-time exercise
WiQA (QA using Wikipedia)
CLEF 2007
Mixed corpus (newspapers and Wikipedia)
Notes:
QA could benefit from structured retrieval
Some interesting issues: mapping a question to a (set of) queries

