Open Source, Distributed and Peer-to-Peer IR
Wray Buntine
National ICT
Helsinki Institute for Information Technology
Some Open Source systems
Technorati: http://technorati.com/
Built on Lucene backend for search
Facility to add Tags and authority
Wikia Search
“Open Search” based on distributed, social and semantic concepts.
Acquired Grub (http://www.grub.org/), a P2P crawler, from LookSmart. It currently uses Lucene.
Other systems mentioned in the talk:
- Creative Commons Search (http://search.creativecommons.org/)
- Social Bookmarks (http://del.icio.us/)
- Internet Archive
- WorldCat (http://www.worldcat.org/): Large collection of library content and services
Getting higher ratings in search
Methods used to push a client's website higher in the ratings include link farms, fake websites and flogs (fake blogs).
Some Academic Systems available under different license types
· Indri: U.Mass+CMU C++ system with language models under a BSD license, part of Lemur.
· Lucene: Java-based industrial system sometimes used in academia due to its popularity.
· MG4J: “Managing Gigabytes for Java” system from U.Milano under LGPL.
· Terrier: feature-laden Java-based system from U.Glasgow under MPL.
· Wumpus: scalable desktop-oriented system from U.Waterloo under GPL.
· Zettair: simple, fast C-based system from
Distributed and peer-to-peer architecture
Factors in distributed IR
· Modes: cluster-based, P2P, etc.
· Co-operating vs Competing nodes
· Homogeneous vs Heterogeneous nodes
· Centralized vs Decentralized vs No control
· Nature of content
Stages in a Search Process
- Crawling: by URL or domain
- Pre-processing: by document
- Indexing: a very large collection of (term,doc) elements
- Querying: by term or document or other combinations
- Result serving: by document
Note: The hard task is to distribute Indexing and Querying (the core of search)
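To make the (term,doc) idea concrete, here is a minimal single-machine sketch of indexing and querying. The function names (build_index, query_and) and the toy documents are made up for illustration, and the pre-processing is just lowercasing and whitespace splitting; a real system does far more at each stage.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) postings."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Toy pre-processing: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

def query_and(index, terms):
    """Return the ids of documents that contain every query term."""
    sets = [{doc_id for doc_id, _ in index.get(t, [])} for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

# Hypothetical toy collection.
docs = {
    1: "distributed search with peers",
    2: "peer to peer search",
    3: "distributed hash tables",
}
index = build_index(docs)
print(query_and(index, ["distributed", "search"]))
```

Everything here lives in one dictionary on one machine; the distribution problem is deciding which node holds which of these postings.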
Distributed Query Score computation
For a multi-term query, a problem occurs if the terms are held on different nodes: no single node can compute a document's full score. So we need to decide how to distribute the document-term entries.
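A rough sketch of the issue, assuming a layout where each node owns all postings for some terms: a document's score is only known once the partial scores from every node involved have been added up. The node names, weights and helper functions below are hypothetical, and the "network call" is simulated with a dictionary lookup.

```python
from collections import Counter

# Hypothetical layout: each node owns the postings of some terms.
# Postings map doc_id -> a precomputed per-term weight.
NODES = {
    "node_a": {"distributed": {1: 0.5, 3: 0.75}},
    "node_b": {"search": {1: 0.25, 2: 0.5}},
}

def node_for(term):
    """Find the node that holds the postings for a term (None if unknown)."""
    for name, postings in NODES.items():
        if term in postings:
            return name
    return None

def partial_scores(node, term):
    """Stand-in for a remote call returning one term's contribution per document."""
    return NODES[node][term]

def score(query_terms, k=10):
    """Sum per-term partial scores from the owning nodes, then rank documents."""
    totals = Counter()
    for term in query_terms:
        node = node_for(term)
        if node is None:
            continue
        for doc_id, weight in partial_scores(node, term).items():
            totals[doc_id] += weight
    # Highest score first; ties broken by document id.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

print(score(["distributed", "search"]))
```

Document 1 only reaches the top because both nodes contributed to it, which is why a multi-term query under this layout always costs at least one exchange per term-holding node.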
Federated Querying
Tasks
- Resource Selection: Selecting the nodes to distribute the queries to
- Result merging: Assembling results from different result sets
- Discovery: Sampling to estimate collection statistics for nodes
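The first two tasks can be sketched as follows. This is an illustration under simple assumptions, not a method from the talk: node_stats stands for term counts that discovery has already estimated by sampling, resource selection just sums those counts for the query terms, and result merging divides each node's scores by that node's best score (assuming positive scores) so that lists from different nodes become comparable. All names and numbers are made up.

```python
def select_resources(query_terms, node_stats, n=2):
    """Pick up to n nodes whose sampled statistics best match the query."""
    def estimate(stats):
        return sum(stats.get(t, 0) for t in query_terms)

    ranked = sorted(node_stats, key=lambda name: (-estimate(node_stats[name]), name))
    return [name for name in ranked[:n] if estimate(node_stats[name]) > 0]

def merge_results(result_sets, k=5):
    """Scale each node's scores by its own maximum, then merge into one ranking."""
    merged = []
    for node, results in result_sets.items():
        if not results:
            continue
        top = max(score for _, score in results)  # assumes positive scores
        merged.extend((doc, score / top, node) for doc, score in results)
    merged.sort(key=lambda r: (-r[1], r[0]))
    return merged[:k]

# Hypothetical sampled term counts per node.
node_stats = {
    "news": {"search": 40, "peer": 2},
    "library": {"search": 5},
    "blogs": {"peer": 30},
}
chosen = select_resources(["peer", "search"], node_stats)

# Hypothetical raw results; each node scores on its own scale.
result_sets = {
    "news": [("n1", 8.0), ("n2", 4.0)],
    "blogs": [("b1", 0.5), ("b2", 0.125)],
}
ranking = merge_results(result_sets)
print(chosen, [doc for doc, _, _ in ranking])
```

Without the scaling step, every result from "news" would outrank every result from "blogs" simply because the two nodes score on different scales.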
Distributed querying can employ either of these schemes
Document partitioning (by topic, language, genre etc.):
- Flooding of queries
- Selective routing
Tasks: partitioning, query routing and obtaining results
Employ hierarchical assembling of results
Term partitioning (for multi-term queries)
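As a sketch of the document-partitioned scheme with query flooding and hierarchical assembly: every partition computes its own top-k list, and the lists are merged a few at a time up a tree. The partitions, the term-count scoring and the fanout value are assumptions chosen to keep the example short; selective routing would replace the "ask every partition" step with a chosen subset.

```python
import heapq

# Hypothetical document partitions, each held by a different node.
PARTITIONS = {
    "p0": {"d1": "peer to peer search", "d2": "library catalogue"},
    "p1": {"d3": "distributed search engines", "d4": "search search search"},
    "p2": {"d5": "web crawling"},
    "p3": {"d6": "open source search"},
}

def sort_key(result):
    """Order results by descending score, then by document id."""
    return (-result[0], result[1])

def local_top_k(docs, terms, k):
    """Score one partition's documents by raw query-term counts (toy scoring)."""
    scored = []
    for doc_id, text in docs.items():
        words = text.split()
        score = sum(words.count(t) for t in terms)
        if score:
            scored.append((score, doc_id))
    return sorted(scored, key=sort_key)[:k]

def merge_top_k(lists, k):
    """Merge already-sorted result lists and keep the best k."""
    return list(heapq.merge(*lists, key=sort_key))[:k]

def flood(terms, k=3, fanout=2):
    """Send the query to every partition, then merge results level by level."""
    level = [local_top_k(docs, terms, k) for docs in PARTITIONS.values()]
    while len(level) > 1:
        level = [merge_top_k(level[i:i + fanout], k)
                 for i in range(0, len(level), fanout)]
    return level[0] if level else []

print(flood(["search"]))
```

Because each intermediate merge keeps only k results, no node in the tree ever handles more than fanout × k entries, whatever the number of partitions.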
P2P IR
In a P2P network there is no centralized control, but there can be super-peers, i.e. some nodes are more equal than others. P2P systems are self-organizing and failure-resilient, and they encourage resource sharing. Peers are organized into a network overlay, and distributed hash tables are used to route queries.
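The core idea behind routing with a distributed hash table can be sketched with a hash ring: peers and terms are hashed onto the same circular key space, and a term belongs to the first peer found clockwise from its position. This is a local simulation under that assumption, with a made-up Ring class and peer names; in a real overlay no peer holds the full ring, and a lookup is forwarded over several hops.

```python
import bisect
import hashlib

def ring_position(key, bits=32):
    """Hash a string onto a circular key space of size 2**bits."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

class Ring:
    """Toy hash ring: each key is owned by the next peer clockwise."""

    def __init__(self, peers):
        self._points = sorted((ring_position(p), p) for p in peers)
        self._keys = [pos for pos, _ in self._points]

    def lookup(self, term):
        """Return the peer responsible for the postings of `term`."""
        i = bisect.bisect_left(self._keys, ring_position(term))
        return self._points[i % len(self._points)][1]  # wrap around the ring

    def without(self, peer):
        """Return a new ring with one peer removed (e.g. after a failure)."""
        return Ring([p for _, p in self._points if p != peer])

# Hypothetical peer names and query terms.
peers = ["peer-a", "peer-b", "peer-c", "peer-d"]
ring = Ring(peers)
terms = ["distributed", "search", "lucene", "crawler", "index"]
owners = {t: ring.lookup(t) for t in terms}
print(owners)
```

When a peer leaves, only the terms it owned move to another peer; all other assignments stay put, which is one reason this structure suits self-organizing, failure-resilient networks.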


