Tuesday, September 4, 2007

Structure/XML Retrieval-Mounia Lalmas

Structure/XML Retrieval

By Mounia Lalmas, Queen Mary, University of London

Documents can be considered structured according to (among others):

Linear order of words (e.g. sentences, paragraphs, etc.)

Hierarchy (e.g. a book's chapters, sections, etc.)

Links, cross references

Temporal and spatial relationships in multimedia documents

XML: eXtensible Mark-up Language

Meta-language adopted as a document format language by the W3C

Used to describe content and structure but not layout

XML documents are trees

Data-centric view:

– XML as exchange format for structured data

– Used for messaging between enterprise applications

– Mainly a recasting of relational data

Document-centric view:

– XML as format for representing the logical structure of documents

– Rich in text

XML retrieval allows users to retrieve the document parts relevant to an information need, rather than the entire document.
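To make the tree view concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The sample document and the helper name iter_elements are assumptions made for illustration; the point is only that every element, not just the whole document, gets an address and can be returned as a retrieval unit.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
SAMPLE = """<article>
  <title>World Cup final</title>
  <section><title>Match</title><p>Zidane headbutting Materazzi.</p></section>
  <section><title>Aftermath</title><p>Reactions in the press.</p></section>
</article>"""

def iter_elements(root):
    """Yield (path, element) for every element, in document order."""
    def walk(elem, path):
        yield path, elem
        counts = {}  # per-tag sibling counter, so paths look like XPath steps
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
    yield from walk(root, f"/{root.tag}[1]")

root = ET.fromstring(SAMPLE)
paths = [path for path, _ in iter_elements(root)]
for path in paths:
    print(path)
```

Each printed path (for example /article[1]/section[1]/p[1]) identifies one candidate answer that a system could return instead of the full article.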

Queries

Content-only (CO) queries

• Standard IR queries, but here we are retrieving document components

– “Zidane headbutting Materazzi”

Content-and-structure (CAS) queries

• Put constraints on which types of components are to be retrieved

– “Sections of an article in the Times about Zidane headbutting Materazzi”

XML query languages:

• Four “levels” of expressiveness

– Keyword search (CO Queries): e.g. “xml”

– Tag + Keyword search: e.g. book: xml

– Path Expression + Keyword search (CAS Queries): e.g. /book[./title about “xml db”]

– XQuery + Complex full-text search

for $b in /book

let score $s := $b ftcontains "xml" && "db" distance 5
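The first three levels can be imitated in a few lines of Python. This is a rough sketch under stated assumptions: the two-book collection and all function names are hypothetical, and the "about" predicate is approximated by strict Boolean containment of every query word, whereas a real system would rank elements by relevance.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-book collection, invented for illustration.
DOC = ET.fromstring(
    "<lib>"
    "<book><title>xml db systems</title><p>storage engines</p></book>"
    "<book><title>cooking</title><p>an xml recipe format</p></book>"
    "</lib>"
)

def tokens(elem):
    """Lowercased words from the whole subtree of an element."""
    return " ".join(elem.itertext()).lower().split()

def has_all(elem, words):
    """Boolean stand-in for 'about': every word occurs in the subtree."""
    found = tokens(elem)
    return all(w in found for w in words)

def keyword_search(root, words):
    """Level 1 (CO): any element may be returned."""
    return [e for e in root.iter() if has_all(e, words)]

def tag_keyword_search(root, tag, words):
    """Level 2: only elements with the given tag may be returned."""
    return [e for e in root.iter(tag) if has_all(e, words)]

def path_keyword_search(root, target, child, words):
    """Level 3 (CAS): return `target` elements with a matching `child`."""
    return [
        e for e in root.iter(target)
        if any(has_all(c, words) for c in e.findall(child))
    ]

co = keyword_search(DOC, ["xml"])
by_tag = tag_keyword_search(DOC, "book", ["xml"])
cas = path_keyword_search(DOC, "book", "title", ["xml", "db"])
```

Here cas plays the role of /book[./title about "xml db"]: the condition is checked on the title, but the book is what comes back. Note that the CO search returns nested elements (the library, a book and its title), which is the overlap problem discussed under the challenges.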

XML Retrieval Challenges:

  1. Term Statistics

How should tf and idf be calculated for XML elements?
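One possible answer, sketched here purely as an assumption and not as the approach advocated in the talk, is to treat every element as its own small document: tf is counted over the text of the element's subtree, and idf uses elements rather than documents as the counting unit. The sample document and function names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document, invented for illustration.
DOC = ET.fromstring(
    "<article>"
    "<sec><p>xml retrieval</p><p>xml trees</p></sec>"
    "<sec><p>ranking</p></sec>"
    "</article>"
)

def element_tf(elem):
    """Term frequencies over the full text of an element's subtree."""
    return Counter(" ".join(elem.itertext()).lower().split())

def element_idf(root, term):
    """idf with elements as the unit: log(N / ef), or 0.0 if the term is absent."""
    elements = list(root.iter())
    ef = sum(1 for e in elements if term in element_tf(e))
    return math.log(len(elements) / ef) if ef else 0.0

tf_article = element_tf(DOC)["xml"]           # whole article
tf_first_p = element_tf(DOC.find("sec/p"))["xml"]  # first paragraph only
idf_xml = element_idf(DOC, "xml")
```

The sketch also shows why the question is hard: because elements nest, a term in one paragraph is counted again in every ancestor, so element frequencies are inflated compared with flat document frequencies.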

  2. Relationship Statistics

Which sub-elements contribute most to the content of their parent element, and vice versa?

How to estimate relationship statistics (e.g. size, number of children, depth)?
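The raw quantities named above (size, number of children, depth) are easy to collect in one pass over the tree. The following is a minimal sketch; the sample document and the function name element_stats are assumptions, and size is measured in whitespace-separated words of the subtree, which is only one of several reasonable definitions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
DOC = ET.fromstring(
    "<article><title>XML retrieval</title>"
    "<sec><p>Elements are retrieval units.</p><p>Trees nest.</p></sec>"
    "</article>"
)

def element_stats(root):
    """Return (tag, size, children, depth) per element, in document order."""
    rows = []
    def walk(elem, depth):
        size = len(" ".join(elem.itertext()).split())  # words in the subtree
        rows.append((elem.tag, size, len(elem), depth))
        for child in elem:
            walk(child, depth + 1)
    walk(root, 0)
    return rows

rows = element_stats(DOC)
for row in rows:
    print(row)
```

Comparing a child's size with its parent's gives a crude contribution estimate: in this sample the sec element holds 6 of the article's 8 words.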

  3. Structure Statistics

Which element is a good retrieval unit?

How to estimate structure statistics (frequency, user studies, size, depth)?

  4. Overlapping Elements

Which element should be returned when more than one nested element is relevant (the parent or the child)?
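One simple strategy, shown here as an illustrative assumption rather than as the recommended solution, is greedy overlap removal: walk the ranked list from the top and drop any element that is an ancestor or descendant of an element already kept. The paths, scores and the function name remove_overlap are hypothetical.

```python
def remove_overlap(ranked):
    """Greedily keep the best-scoring element on each root-to-leaf path.

    `ranked` is a list of (path, score) pairs. Two paths overlap when they
    are equal or when one is an ancestor of the other.
    """
    def overlaps(a, b):
        return a == b or a.startswith(b + "/") or b.startswith(a + "/")

    kept = []
    for path, score in sorted(ranked, key=lambda r: r[1], reverse=True):
        if not any(overlaps(path, k) for k, _ in kept):
            kept.append((path, score))
    return kept

# Hypothetical ranked results, invented for illustration.
results = [
    ("/article[1]", 0.4),
    ("/article[1]/sec[1]", 0.9),
    ("/article[1]/sec[1]/p[2]", 0.7),
    ("/article[1]/sec[2]", 0.5),
]
focused = remove_overlap(results)
```

With these scores the first section wins over both its parent article and its own paragraph, while the second section survives because it lies on a different path.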

  5. Interpreting Structural Constraints

Problem of many DTDs, with DTDs/schemas not known in advance, etc.

Need to identify similar tags/elements
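A very simple way to treat similar tags alike is to map them to shared classes before checking a structural constraint. The sketch below assumes hand-written equivalence classes; the tag names and the helper names canonical and satisfies are hypothetical, and a real system would have to curate or learn such a mapping for its collections.

```python
# Hypothetical equivalence classes, invented for illustration only.
TAG_CLASSES = {
    "section": {"section", "sec", "sect"},
    "paragraph": {"paragraph", "p", "para"},
}

def canonical(tag):
    """Map a tag to its class name, or to itself if no class lists it."""
    tag = tag.lower()
    for name, members in TAG_CLASSES.items():
        if tag in members:
            return name
    return tag

def satisfies(query_tag, element_tag):
    """Loose structural match: both tags fall into the same class."""
    return canonical(query_tag) == canonical(element_tag)

print(satisfies("section", "sec"))   # same class
print(satisfies("section", "para"))  # different classes
```

A CAS query asking for sections could then also accept sec elements from a collection whose DTD uses a different name for the same logical unit.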

Evaluation of XML retrieval: INEX

Two types of topics:

• Content-only (CO) topics

ignore document structure (simulate users who do not have any knowledge of the document structure or who choose not to use such knowledge)

• Content-and-structure (CAS) topics

contain conditions referring both to content and structure of the sought elements (simulate users who do have some knowledge of the structure of the searched collection)

Relevance Assessment:

• Find the smallest component (→ specificity) that is highly relevant (→ exhaustivity)

specificity: extent to which a document component is focused on the information need, while being an informative unit.

exhaustivity: extent to which the information contained in a document component satisfies the information need.
