From Winning TREC to Beating Google
by
David Hawking
CSIRO & Funnelback Pty. Ltd.
This talk presented an overview of issues related to web search and a live demonstration of the search engine (specifically its indexing and searching) that David Hawking and his colleagues have built.
They used the UK2006 collection, which has the following characteristics:
80 million pages, about 0.4% of the web
crawled by the Università degli Studi di Milano
distributed by Yahoo! Research, Barcelona
intended for spam detection/rejection experiments
Available at: www.yr-bcn.es/webspam/datasets/
What is the difference between TREC adhoc and searching the web?
In TREC adhoc, the task is to find all documents relevant to a topic. When searching the web, people are usually looking for just a couple of results.
The searches in TREC adhoc are informational. Web searches, however, can serve informational, transactional or navigational needs.
People searching the web rarely go beyond the first page of results, so it is fair to say that evaluation measures should focus on the first page.
TREC adhoc uses Mean Average Precision (MAP). Web search is typically measured by Normalised Discounted Cumulative Gain (NDCG).
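To make the contrast between the two measures concrete, here is a minimal sketch of both in Python. It is not taken from the talk: the function names are my own, the relevance labels are made up, and the NDCG version assumes the ideal ordering can be derived from the same ranked list (a full evaluation would use all judged documents for the topic).

```python
import math

def average_precision(relevances, total_relevant):
    """Average precision for one ranked list of binary relevance labels.

    total_relevant is the number of relevant documents for the topic,
    which is why this measure rewards finding *all* of them.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0

def mean_average_precision(runs):
    """MAP: the mean of average precision over topics.

    runs is a list of (relevances, total_relevant) pairs, one per topic.
    """
    if not runs:
        return 0.0
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top k graded relevance values."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """DCG normalised by the DCG of the ideal (descending) ordering.

    Assumption: the ideal ordering is built from the gains in this list only.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

With a cutoff such as k = 10, NDCG only looks at the first page of results and discounts lower positions, which matches the observation that web searchers rarely look further.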
Crawler tasks:
Maintain a seed list of URLs
Check robots.txt for permission to crawl
Problems
What to do if servers are down?
Web pages generally contain many errors
Resolving URLs
How to handle redirects?
How to crawl URLs that are dynamically generated by a script?
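A few of the crawler tasks above (seed list, robots.txt check, URL resolution) can be sketched with the Python standard library. This is only an illustration under stated assumptions: the user agent name "examplebot" and the link_graph dictionary are hypothetical, the crawl is simulated offline rather than fetched over the network, and a single robots.txt is assumed to cover every URL (a real crawler would keep one per host and would also deal with server errors and redirects).

```python
from collections import deque
from urllib.parse import urljoin, urldefrag
from urllib.robotparser import RobotFileParser

def make_robots(robots_txt):
    """Build a robots.txt parser from the file's text (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def resolve(base_url, href):
    """Resolve a possibly relative link against its page and drop the fragment."""
    return urldefrag(urljoin(base_url, href)).url

def plan_crawl(seeds, robots, link_graph, user_agent="examplebot"):
    """Breadth-first crawl order over a simulated web.

    link_graph maps a page URL to the raw href values found on that page;
    it stands in for actually downloading and parsing pages.
    """
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid duplicates
    order = []
    while frontier:
        url = frontier.popleft()
        if not robots.can_fetch(user_agent, url):
            continue          # robots.txt forbids this URL
        order.append(url)
        for href in link_graph.get(url, []):
            target = resolve(url, href)
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order

robots = make_robots("User-agent: *\nDisallow: /private/\n")
links = {
    "http://example.org/": ["a.html", "/private/secret.html", "a.html#top"],
    "http://example.org/a.html": ["./"],
}
print(plan_crawl(["http://example.org/"], robots, links))
```

Note how "a.html" and "a.html#top" resolve to the same URL once the fragment is dropped, so the page is queued only once.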
Indexing
Requires maintaining an enormous vocabulary and inverted files.
But how do we support instant updates and deletions on index files that are usually huge and compressed?
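One common way to approach the update question, which I am naming myself rather than reporting from the talk, is to leave the big index untouched and keep a small in-memory "delta" index plus a set of deleted document numbers that is consulted at query time. The class and method names below are hypothetical, and plain Python lists stand in for compressed postings.

```python
from collections import defaultdict

class DeltaIndex:
    """A read-only main index with an in-memory overlay for updates."""

    def __init__(self, main_postings):
        # term -> sorted list of document numbers; treated as immutable,
        # standing in for the huge compressed index files on disk.
        self.main = main_postings
        self.delta = defaultdict(set)    # postings for newly added documents
        self.deleted_from_main = set()   # tombstones hiding main-index entries

    def add(self, doc_id, text):
        """Index a new document in the delta only."""
        for term in text.lower().split():
            self.delta[term].add(doc_id)

    def delete(self, doc_id):
        """Hide a document without rewriting the main index."""
        self.deleted_from_main.add(doc_id)
        # Linear in the delta vocabulary; acceptable for a small overlay.
        for docs in self.delta.values():
            docs.discard(doc_id)

    def update(self, doc_id, text):
        """An update is a delete followed by an add of the new text."""
        self.delete(doc_id)
        self.add(doc_id, text)

    def postings(self, term):
        """Merge main and delta postings, skipping tombstoned documents."""
        live_main = [d for d in self.main.get(term, ())
                     if d not in self.deleted_from_main]
        return sorted(set(live_main) | self.delta.get(term, set()))
```

In a scheme like this the delta would periodically be merged into a rebuilt main index, at which point the tombstones can be discarded.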
Ranking
Static Scores
PageRank (may carry some biases, e.g. a bias toward Fortune 500 companies over the rest)
URL form
fRank
Spam Score
Adult Content Score
Document at a Time Ranking
Assign document numbers in order of descending static score.
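Why number documents by descending static score? A rough sketch of the idea, under my own simplifying assumptions: if postings lists are sorted by document number, then a document-at-a-time scan meets the documents with the best static scores first, so it can stop as soon as it has enough matches. The code below ranks purely by static score with AND semantics; a real engine would combine this with query-dependent scores. All names are hypothetical.

```python
def assign_doc_numbers(static_scores):
    """Map each document to a number; 0 is the highest static score."""
    ordered = sorted(static_scores, key=static_scores.get, reverse=True)
    return {doc: number for number, doc in enumerate(ordered)}

def daat_top_k(postings_lists, k):
    """Document-at-a-time intersection with early termination.

    postings_lists holds one ascending list of document numbers per
    query term. Because low numbers mean high static scores, the first
    k documents found in every list are the top k by static score.
    """
    if not postings_lists or k <= 0:
        return []
    cursors = [0] * len(postings_lists)
    results = []
    while len(results) < k:
        current = []
        for plist, pos in zip(postings_lists, cursors):
            if pos >= len(plist):
                return results       # one list is exhausted: no more matches
            current.append(plist[pos])
        candidate = max(current)
        if all(doc == candidate for doc in current):
            results.append(candidate)            # every term matches
            cursors = [pos + 1 for pos in cursors]
        else:
            # Advance the lagging cursors up to the candidate document.
            for i, plist in enumerate(postings_lists):
                while cursors[i] < len(plist) and plist[cursors[i]] < candidate:
                    cursors[i] += 1
    return results

print(assign_doc_numbers({"a": 0.2, "b": 0.9, "c": 0.5}))  # {'b': 0, 'c': 1, 'a': 2}
print(daat_top_k([[0, 2, 3, 5, 8], [2, 5, 8, 9], [1, 2, 5, 8]], k=2))  # [2, 5]
```

The second call stops after finding documents 2 and 5 and never looks at the tail of the lists, which is the point of the numbering scheme.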
Should the index be distributed over clusters?
Caching
Cache web pages for common queries.
Caching is also needed for generating the summaries that describe each result.
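As an illustration of the caching idea, here is a minimal least-recently-used (LRU) cache for query results. The LRU policy, the class name and the search_fn callback are my assumptions; the talk summary does not say which eviction policy is used.

```python
from collections import OrderedDict

class QueryCache:
    """Keep results for the most recently used queries in memory."""

    def __init__(self, capacity, search_fn):
        self.capacity = capacity
        self._search_fn = search_fn      # hypothetical: runs the real search
        self._entries = OrderedDict()    # normalised query -> results
        self.hits = 0
        self.misses = 0

    def get(self, query):
        # Normalise case and whitespace so trivial variants share an entry.
        key = " ".join(query.lower().split())
        if key in self._entries:
            self._entries.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        results = self._search_fn(key)
        self._entries[key] = results
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return results
```

Because a small number of queries account for a large share of traffic, even a modest cache like this can save many trips to the index; the same structure could hold cached page text for summary generation.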
Query Processing: the Need for Speed
Rank the documents
For each result, obtain the title, URL, query-biased summary, etc.
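A query-biased summary is a snippet chosen because it contains the query terms, rather than a fixed description of the page. A very simple sketch, assuming sentence-level selection by term overlap (my own simplification, not the method shown in the talk):

```python
import re

def query_biased_summary(text, query, max_sentences=2):
    """Pick the sentences that share the most words with the query."""
    query_terms = set(query.lower().split())
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for position, sentence in enumerate(sentences):
        words = set(re.findall(r"\w+", sentence.lower()))
        scored.append((len(query_terms & words), position, sentence))
    # Best overlap first; earlier sentences win ties.
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    # Drop sentences with no query terms, but fall back to the first pick.
    best = [item for item in best if item[0] > 0] or best[:1]
    best.sort(key=lambda item: item[1])   # restore document order
    return " ... ".join(sentence for _, _, sentence in best)
```

Since this has to run for every result on the page, it needs fast access to the page text, which is where the caching mentioned earlier comes in.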


