Tuesday, September 4, 2007

NLP for IR - Maarten de Rijke

Natural Language Processing for Information Retrieval

Maarten de Rijke

ISLA, University of Amsterdam

Basic NLP is commonly used in IR, e.g. tokenizing, stopword removal, and stemming. More advanced techniques are also common, e.g. phrase identification and named entity extraction. But advanced NLP can be problematic for IR, because IR is not about semantics or syntactic structure; it is about statistical properties of text. IR problems can be reworked so that NLP is potentially useful, e.g. question answering, sentiment analysis, biomedical text analysis, etc.
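The basic pipeline mentioned above (tokenize, stop, stem) can be sketched in a few lines. This is a minimal illustration, not a real IR indexer: the stopword list is a tiny sample and the stemmer is a crude suffix stripper, not a proper Porter stemmer.

```python
import re

# Tiny sample stopword list; real systems use a few hundred entries
STOPWORDS = {"the", "a", "an", "is", "are", "of", "for", "in", "and", "to"}

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # Crude suffix stripping for illustration only (NOT the Porter algorithm)
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in remove_stopwords(tokenize(text))]

print(preprocess("Stemming and stopping are used for indexing documents"))
```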

Question Answering

People ask questions that may belong to these categories:

  • Factoids e.g. Where does moss grow?
  • Definition(oid)s e.g. What is a rational number?
  • Procedures e.g. How to speed up XP?

There are some which are difficult to categorize, e.g. How to understand women?

Anatomy of a question

Question type

Idiomatic categorization of questions:

TREC 2003: factoid, list, definition

Answer type

The class of object sought by the question:

Person (Who...?), Place (Where...?), Date (When...?), Number (How many...?)

Question focus

The property or entity that is being sought by the question:

In what state is the Grand Canyon?

What is the population of Bulgaria?

Question topic

The object or event that the question is about:

What is the height of Mt. Everest?

Here, height is the focus and Mt. Everest is the topic.
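The question-word heuristics for answer types above can be sketched as a toy rule-based classifier. This is a hypothetical illustration of the idea, not a real QA component; the prefix-to-type table follows the examples in the notes.

```python
# Map question prefixes to the answer-type categories from the notes.
# "how many" must be checked before any single-word prefix would match.
ANSWER_TYPE_RULES = [
    ("how many", "Number"),
    ("who", "Person"),
    ("where", "Place"),
    ("when", "Date"),
]

def answer_type(question):
    """Return the class of object sought by the question (or 'Other')."""
    q = question.lower().strip()
    for prefix, atype in ANSWER_TYPE_RULES:
        if q.startswith(prefix):
            return atype
    return "Other"

print(answer_type("Who discovered penicillin?"))           # Person
print(answer_type("How many provinces does Canada have?")) # Number
print(answer_type("What is the height of Mt. Everest?"))   # Other
```

Real systems go far beyond prefix matching (e.g. using parse trees and named-entity taxonomies), but the decision being made is the same.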

Historically, QA approaches have seen the corpus move from databases (1970s) to encyclopedias (1990s) to the web (2000-).

Evaluation Measures

MRR (Mean Reciprocal Rank) of the top N candidates given by the system, N = 5, 3, 1

Precision@1

Confidence Weighted Score (CWS) = (1/N) * sum over i = 1..N of (#correct up to rank i) / i, where questions are ranked by the system's confidence
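The two measures above are straightforward to compute. A minimal sketch, assuming per-question relevance judgements are already available: for MRR, the input is the rank of the first correct answer per question (None if no correct answer appears in the top N); for CWS, the input is a correct/incorrect flag per question, ordered from the system's most confident answer to its least confident.

```python
def mrr(first_correct_ranks, n=5):
    """Mean Reciprocal Rank over the top-n candidates.

    first_correct_ranks: rank (1-based) of the first correct answer
    for each question, or None if none of the top n is correct.
    """
    scores = [1.0 / r for r in first_correct_ranks if r is not None and r <= n]
    return sum(scores) / len(first_correct_ranks)

def cws(judgements):
    """Confidence Weighted Score.

    judgements: True/False per question, sorted by the system's
    confidence, most confident first. CWS rewards putting correct
    answers ahead of incorrect ones.
    """
    n = len(judgements)
    total, correct_so_far = 0.0, 0
    for i, correct in enumerate(judgements, start=1):
        correct_so_far += bool(correct)
        total += correct_so_far / i
    return total / n

print(mrr([1, None, 2, 5]))            # (1 + 0 + 1/2 + 1/5) / 4 = 0.425
print(cws([True, True, False, True]))
```

Precision@1 is simply the fraction of questions whose rank-1 answer is correct, i.e. MRR with n=1.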

QA at TREC and CLEF


TREC-8

Answers can be 50 or 250 bytes long

Systems return up to 5 answers

Answers had to be justified (supply supporting docs)

Scored by MRR

TREC-10

No-answer questions introduced (NIL questions)

TREC-11

Only 1 answer

Exact answer only

Scored by CWS

TREC-12

Definition and list questions

TREC-13

Scenario based

TREC-15

ciQA: complex, interactive QA

TREC-16

ciQA

Doc collection consists of newspapers and blogs

CLEF 2006

Answer validation exercise, real-time exercise

WiQA (QA using Wikipedia)

CLEF 2007

Mixed corpus (Newspaper and Wikipedia)


Notes:

QA could benefit from structured retrieval

Some interesting issues: mapping a question to a (set of) queries