Friday, September 7, 2007

From Winning TREC to Beating Google - David Hawking

From Winning TREC to Beating Google

by

David Hawking

CSIRO & Funnelback Pty. Ltd.


This talk presented an overview of issues related to web search, along with a live demonstration of the search engine (specifically indexing and searching) that David Hawking and his colleagues have built.


They used the UK2006 collection, which has the following characteristics:

  • 80 million pages (0.4% of the web)

  • crawled by Università degli Studi di Milano

  • distributed by Yahoo! Research, Barcelona

  • intended for spam detection/rejection experiments

  • Available at: www.yr-bcn.es/webspam/datasets/


What is the difference between TREC ad hoc and searching the web?

  • In TREC ad hoc, the task is to find all documents relevant to a topic. When searching the web, people are usually looking for just a couple of results.

  • Searches in TREC ad hoc are informational. Web searches can serve informational, transactional or navigational needs.

  • Web searchers rarely go beyond the first page of results, so it may be fair to say that evaluation measures should focus on the first page.

  • TREC ad hoc uses Mean Average Precision (MAP). Web search is typically measured by Normalised Discounted Cumulative Gain (NDCG).
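To make the two measures concrete, here is a minimal sketch of average precision (MAP is its mean over all queries) and NDCG at a cutoff k. The function names are illustrative, and the sketch assumes the ideal ordering for NDCG is computed from the judged documents in the ranked list itself, which is a simplification.

```python
import math

def average_precision(ranked_rels, total_relevant):
    """Average precision for one query.

    ranked_rels: 0/1 relevance judgments in rank order.
    total_relevant: number of relevant documents known for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / total_relevant if total_relevant else 0.0

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top k results (log2 discount)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """DCG normalised by the DCG of the best possible ordering of the same gains."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

Because of the rank discount, NDCG with a small k rewards putting the best documents at the top of the first page, which fits the web search behaviour described above.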


Crawler tasks:

Maintain a seed list of URLs

Check robots.txt for permission to crawl
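As a rough illustration of the robots.txt check, the sketch below uses Python's standard urllib.robotparser. The robots.txt content, the crawler name "ExampleCrawler" and the example.com URLs are made up for the example; a real crawler would fetch the file from each host.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def make_parser(robots_text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def filter_seed_list(parser, user_agent, seed_urls):
    """Keep only the seed URLs that the robots.txt rules allow us to crawl."""
    return [url for url in seed_urls if parser.can_fetch(user_agent, url)]

seeds = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/page.html",
]
crawlable = filter_seed_list(make_parser(ROBOTS_TXT), "ExampleCrawler", seeds)
```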


Problems

What to do if servers are down?

Web pages generally contain lots of errors

Resolving URLs

How to handle redirects?

How to crawl URLs that are dynamically generated by a script?
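Of the problems above, URL resolution is the easiest to show in a few lines. This is a minimal sketch using the standard urllib.parse module; the helper name resolve and the example.com URLs are illustrative, and a production crawler would need further normalisation (case, default ports, session parameters and so on).

```python
from urllib.parse import urljoin, urldefrag

def resolve(base_url, href):
    """Turn a link found on a page into an absolute URL without its fragment."""
    absolute = urljoin(base_url, href)      # resolve relative paths against the page URL
    url, _fragment = urldefrag(absolute)    # "#section" parts point into the same page
    return url

page = "http://www.example.com/a/b.html"
links = [resolve(page, href) for href in ["../c.html#top", "d.html"]]
```

Dropping the fragment keeps the crawler from fetching the same page more than once under different anchors.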


Indexing

Requires maintaining an enormous vocabulary and inverted files.

But how do we do instant updates and deletions on index files that are usually compressed and huge?
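For readers unfamiliar with the term, an inverted file maps each vocabulary term to the list of documents containing it. The toy sketch below builds one in memory and shows one common workaround for deletions, a set of deleted document numbers filtered at query time. This is an assumption for illustration and not necessarily the approach discussed in the talk; real indexes are compressed and stored on disk.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the document numbers that contain it.

    docs: dict of document number -> text.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].lower().split())):
            index[term].append(doc_id)
    return index

def lookup(index, term, deleted=frozenset()):
    """Return the postings for a term, skipping documents marked as deleted."""
    return [doc_id for doc_id in index.get(term.lower(), []) if doc_id not in deleted]

docs = {1: "web search engine", 2: "search the web", 3: "adhoc retrieval"}
index = build_inverted_index(docs)
```

Marking a document as deleted avoids rewriting the large postings lists immediately; the lists can be rebuilt later in a batch.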


Ranking

Static Scores

  • PageRank (may have some biases, e.g. a bias toward Fortune 500 companies over other companies)

  • URL form

  • fRank

  • Spam Score

  • Adult Content Score

Document-at-a-Time Ranking

Assign document numbers in order of descending static score.
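One plausible reading of this numbering scheme is that it lets document-at-a-time processing stop early: if postings lists are sorted by document number, the first matches found are also the ones with the highest static scores. The sketch below illustrates that idea with a simple AND query; the function names and sample data are hypothetical.

```python
def assign_doc_numbers(static_scores):
    """Give document number 0 to the URL with the highest static score, and so on."""
    ordered = sorted(static_scores, key=lambda url: (-static_scores[url], url))
    return {url: number for number, url in enumerate(ordered)}

def first_k_matches(postings_lists, k):
    """Document-at-a-time AND over sorted postings lists, stopping after k matches."""
    if not postings_lists:
        return []
    positions = [0] * len(postings_lists)
    matches = []
    while len(matches) < k:
        # Stop as soon as any list is exhausted: no further document can match.
        if any(pos >= len(pl) for pos, pl in zip(positions, postings_lists)):
            break
        current = [pl[pos] for pos, pl in zip(positions, postings_lists)]
        target = max(current)
        if all(doc == target for doc in current):
            matches.append(target)
            positions = [pos + 1 for pos in positions]
        else:
            # Advance every list that is still behind the largest candidate.
            positions = [pos + 1 if doc < target else pos
                         for pos, doc in zip(positions, current)]
    return matches

numbers = assign_doc_numbers({"a.example": 0.2, "b.example": 0.9, "c.example": 0.5})
top = first_k_matches([[0, 2, 5, 7], [2, 3, 7, 9]], k=1)
```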

Should the index be distributed over clusters?


Caching

Cache web pages for common queries.

Caching is also needed for generating summaries describing the results.
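A minimal sketch of caching for repeated queries, using the standard functools.lru_cache. The search_backend function is a hypothetical stand-in for the real index lookup, and the cache size is an arbitrary choice for the example.

```python
from functools import lru_cache

backend_calls = []  # records which queries actually reached the backend

def search_backend(query):
    """Hypothetical stand-in for an expensive index lookup."""
    backend_calls.append(query)
    return ("result for " + query,)

@lru_cache(maxsize=1024)
def cached_search(query):
    """Serve repeated queries from memory; least recently used entries are evicted."""
    return search_backend(query)
```

Calling cached_search twice with the same query reaches the backend only once, which is the benefit for common queries.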


Query Processing: the Need for Speed

Rank the documents

For each result, obtain: title, URL, query-biased summary, etc.
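A query-biased summary is a snippet chosen because it contains the query terms. The sketch below is one naive way to produce one, picking the fixed-size window of words with the most query-term matches. The function name, window size and sample text are assumptions for illustration, not the method used by the engine in the talk.

```python
def query_biased_summary(text, query, window=5):
    """Return the window of consecutive words that contains the most query terms."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    best_start, best_hits = 0, -1
    for start in range(0, max(1, len(words) - window + 1)):
        # Count words in this window that match a query term, ignoring punctuation.
        hits = sum(1 for w in words[start:start + window]
                   if w.lower().strip(".,;:!?") in terms)
        if hits > best_hits:
            best_start, best_hits = start, hits
    return " ".join(words[best_start:best_start + window])

text = ("Funnelback is a search engine. "
        "TREC evaluates retrieval systems on shared test collections.")
summary = query_biased_summary(text, "TREC retrieval", window=4)
```

Scanning the full text of every result for each query is slow, which is one reason cached page text helps when generating summaries.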
