Structure/XML Retrieval
By
Mounia Lalmas
Queen Mary,
Documents can be considered structured according to (among others):
• Linear order of words (e.g. sentences, paragraphs)
• Hierarchical structure (e.g. a book's chapters and sections)
• Links, cross references
• Temporal and spatial relationships in multimedia documents
XML: eXtensible Mark-up Language
• Meta-language adopted as a document format language by the W3C
• Used to describe content and structure but not layout
• XML documents are trees
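To make the "XML documents are trees" point concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample document and the `walk` helper are invented for illustration and do not come from the lecture.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = """<book>
  <title>XML retrieval</title>
  <chapter>
    <title>Queries</title>
    <section>Content-only queries</section>
    <section>Content-and-structure queries</section>
  </chapter>
</book>"""

def walk(element, path=""):
    """Yield (path, own text) for every element, in document order."""
    here = f"{path}/{element.tag}"
    yield here, (element.text or "").strip()
    for child in element:
        yield from walk(child, here)

root = ET.fromstring(DOC)
paths = [p for p, _ in walk(root)]
for p, text in walk(root):
    print(p, "->", text)
```

Every element has exactly one parent, so each node is identified by its path from the root; this is the property that lets a retrieval system return a section or a chapter instead of the whole book.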
Data-centric view:
– XML as exchange format for structured data
– Used for messaging between enterprise applications
– Mainly a recasting of relational data
Document-centric view:
– XML as format for representing the logical structure of documents
– Rich in text
XML retrieval allows users to retrieve the document parts relevant to their information need rather than the entire document.
Queries
Content-only (CO) queries
• Standard IR queries, but here we are retrieving document components
– “Zidane headbutting Materazzi”
Content-and-structure (CAS) queries
• Put constraints on which types of components are to be retrieved
– “Sections of an article in the Times about Zidane headbutting Materazzi”
XML query languages:
• Four “levels” of expressiveness
– Keyword search (CO Queries): e.g. “xml”
– Tag + Keyword search: e.g. book: xml
– Path Expression + Keyword search (CAS Queries): e.g. /book[./title about “xml db”]
– XQuery + Complex full-text search
for $b in /book
let score $s := $b ftcontains “xml” && “db”
distance 5
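The first three levels of expressiveness can be sketched in Python over a small in-memory collection. This is only an illustration under stated assumptions: the collection and the function names (`keyword_search`, `tag_keyword_search`, `path_keyword_search`) are hypothetical, and matching is plain Boolean containment, whereas a real system would rank elements and treat a predicate such as "about" as a vague condition.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection, used only for illustration.
COLLECTION = ET.fromstring("""<library>
  <book><title>xml db systems</title><abstract>storage engines</abstract></book>
  <book><title>cooking</title><abstract>an xml recipe format</abstract></book>
  <article><title>xml retrieval</title></article>
</library>""")

def contains_all(element, keywords):
    """True if every keyword occurs in the element's text, descendants included."""
    words = " ".join(element.itertext()).lower().split()
    return all(k in words for k in keywords.lower().split())

def keyword_search(root, keywords):
    """Level 1 (CO): any element, at any depth, may be returned."""
    return [el for el in root.iter() if contains_all(el, keywords)]

def tag_keyword_search(root, tag, keywords):
    """Level 2: only elements with the given tag may be returned."""
    return [el for el in root.iter(tag) if contains_all(el, keywords)]

def path_keyword_search(root, target_tag, child_path, keywords):
    """Level 3 (CAS): return target elements whose child at child_path matches."""
    return [
        el for el in root.iter(target_tag)
        if any(contains_all(c, keywords) for c in el.findall(child_path))
    ]

co_hits = [el.tag for el in keyword_search(COLLECTION, "xml")]
tag_hits = tag_keyword_search(COLLECTION, "book", "xml")
cas_hits = path_keyword_search(COLLECTION, "book", "./title", "xml db")
print(co_hits)
print(len(tag_hits), [b.find("title").text for b in cas_hits])
```

Note how the content-only query returns nested elements (the library, a book and its title all match), while the path-based query narrows the answer to the single book whose title matches.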
XML Retrieval Challenges:
- Term Statistics
How should tf and idf be calculated for XML elements?
- Relationship Statistics
Which sub-elements contribute most to the content of their parent element, and vice versa?
How can relationship statistics be estimated (e.g. from size, number of children, depth)?
- Structure Statistics
Which element is a good retrieval unit?
How to estimate structure statistics (frequency, user studies, size, depth)?
- Overlapping Elements
When more than one nested element is relevant, which one should be returned: the parent or the child?
- Interpreting Structural Constraints
There are many DTDs, and the DTD/schema is often not known in advance.
Similar tags/elements therefore need to be identified.
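Two of the challenges above, term statistics and overlapping elements, can be illustrated with a small sketch. It makes several assumptions that are not from the lecture: every element is treated as its own retrieval unit, tf is normalised by element length, idf is computed over elements rather than documents, and overlap is resolved by keeping the highest-scoring element on each root-to-leaf path. The sample document and all function names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document, used only for illustration.
DOC = ET.fromstring("""<article>
  <sec><p>xml retrieval ranks elements</p><p>football news</p></sec>
  <sec><p>relational databases</p></sec>
</article>""")

def elements_with_paths(el, path="", index=1):
    """Yield (positional path, element) pairs, e.g. /article[1]/sec[2]."""
    here = f"{path}/{el.tag}[{index}]"
    yield here, el
    seen = Counter()
    for child in el:
        seen[child.tag] += 1
        yield from elements_with_paths(child, here, seen[child.tag])

def rank_elements(root, query):
    """Score each element with length-normalised tf times element-level idf."""
    units = [(p, Counter(" ".join(e.itertext()).lower().split()))
             for p, e in elements_with_paths(root)]
    n = len(units)
    ranked = []
    for path, tf in units:
        score = 0.0
        for term in query.lower().split():
            if tf[term]:
                # Element frequency: number of elements containing the term.
                ef = sum(1 for _, other in units if other[term] > 0)
                score += tf[term] / sum(tf.values()) * math.log(1 + n / ef)
        if score > 0:
            ranked.append((score, path))
    return sorted(ranked, reverse=True)

def remove_overlap(ranked):
    """Drop any element that is an ancestor or descendant of a better-ranked one."""
    kept = []
    for score, path in ranked:
        if not any(path.startswith(k + "/") or k.startswith(path + "/")
                   for _, k in kept):
            kept.append((score, path))
    return kept

ranked = rank_elements(DOC, "xml retrieval")
focused = remove_overlap(ranked)
print([p for _, p in ranked])
print([p for _, p in focused])
```

Because a parent's text includes the text of its children, the paragraph, its section and the whole article all match the query; the length normalisation assumed here favours the paragraph, and the overlap filter then suppresses its ancestors. Other choices, such as preferring the parent, are equally possible, which is exactly the open question raised in the list above.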
Evaluation of XML retrieval: INEX
Two types of topics:
• Content-only (CO) topics
ignore document structure (simulate users who do not have any knowledge of the document structure or who choose not to use such knowledge)
• Content-and-structure (CAS) topics
contain conditions referring both to content and structure of the sought elements (simulate users who do have some knowledge of the structure of the searched collection)
Relevance Assessment:
• find the smallest component (→ specificity) that is highly relevant (→ exhaustivity)
• specificity: extent to which a document component is focused on the information need, while being an informative unit.
• exhaustivity: extent to which the information contained in a document component satisfies the information need.