Open-Source, Distributed and P2P IR-Wray Buntine ~ Blog on IR, NLP and Related Areas, by Shailesh Pandey

Tuesday, September 4, 2007

Open-Source, Distributed and P2P IR-Wray Buntine

Posted on 9:15 AM by shailesh

Open Source, Distributed and Peer-to-Peer IR

Wray Buntine

National ICT Australia

Helsinki Institute for Information Technology

Some Open Source systems

Technorati: http://technorati.com/

Built on a Lucene backend for search

Facility to add Tags and authority

Wikia Search

“Open Search” based on distributed, social and semantic concepts.

Acquired Grub (http://www.grub.org/), a P2P crawler, from LookSmart. It currently uses Lucene.

Other systems mentioned in the talk:

Creative Commons Search (http://search.creativecommons.org/)
Social Bookmarks (http://del.icio.us/)
Internet Archive
WorldCat (http://www.worldcat.org/): Large collection of library content and services

Getting higher ratings in search

Some of the methods used to get a client's website higher in the ratings are link farms, fake websites and flogs (fake blogs).

Some Academic Systems available under different license types

· Indri: C++ system from U.Mass and CMU with language models, under a BSD license, part of Lemur.

· Lucene: Java-based industrial system, sometimes used in academia due to its popularity.

· MG4J: “Managing Gigabytes for Java” system from U.Milano, under the LGPL.

· Terrier: feature-laden Java-based system from U.Glasgow, under the MPL.

· Wumpus: scalable desktop-oriented system from U.Waterloo, under the GPL.

· Zettair: simple, fast C-based system from RMIT University, under a BSD-style license.

Distributed and peer-to-peer architecture

Factors in distributed IR

· Modes: cluster based, P2P etc.

· Co-operating vs Competing nodes

· Homogeneous vs Heterogeneous nodes

· Centralized vs Decentralized vs No control

· Nature of content

Stages in a Search Process

Crawling: by URL or domain
Pre-processing: by document
Indexing: a very large collection of (term,doc) elements
Querying: by term or document or other combinations
Result serving: by document

Note: The hard task is to distribute Indexing and Querying (the core of search)
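To make the indexing and querying stages concrete, here is a minimal single-machine sketch of an index built from (term, doc) elements and a conjunctive query over it. The toy documents, the whitespace tokenizer and the names build_index and query_and are my own assumptions for illustration, not something shown in the talk.

```python
from collections import defaultdict

def build_index(docs):
    """Indexing: collect (term, doc) elements into term -> sorted doc ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for pre-processing.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def query_and(index, terms):
    """Querying: return the doc ids that contain every query term."""
    result = None
    for term in terms:
        ids = set(index.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])

# Toy collection (hypothetical).
docs = {
    1: "open source search engine",
    2: "peer to peer search",
    3: "open source crawler",
}
index = build_index(docs)
hits = query_and(index, ["open", "source"])
```

Everything here lives in one dictionary on one machine; the hard part noted above is what happens when that dictionary has to be split across nodes.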

Distributed Query Score computation

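For a multi-term query, a problem occurs if the terms are on different nodes, because a document's score has to combine contributions from every query term. So we need to choose carefully how to distribute the document-term entries.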
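The problem can be seen in a small sketch where the index is split by term: no single node can score a multi-term query alone, so each node returns partial scores and a coordinator adds them up. The node names, the weights and the simple "sum of term weights" scoring are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical term-partitioned index: each node owns the postings
# (doc id -> term weight) for a subset of the vocabulary.
NODES = {
    "node_a": {"open": {1: 1.0, 3: 2.0}},
    "node_b": {"source": {1: 1.5, 3: 1.0}, "search": {1: 1.0, 2: 2.0}},
}

def partial_scores(node_postings, terms):
    """Score documents using only the query terms this node holds."""
    scores = Counter()
    for term in terms:
        for doc_id, weight in node_postings.get(term, {}).items():
            scores[doc_id] += weight
    return scores

def score_query(nodes, terms):
    """Coordinator: sum the partial scores from every node, best first."""
    total = Counter()
    for node_postings in nodes.values():
        total.update(partial_scores(node_postings, terms))
    return total.most_common()

ranked = score_query(NODES, ["open", "source"])
```

Neither node sees the full score of document 1 or 3 for the query "open source"; only the coordinator does, which is the communication cost that the choice of distribution has to manage.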

Federated Querying

Tasks

Resource Selection: Selecting the nodes to distribute the queries to
Result merging: Assembling results from different result sets
Discovery: Sampling to estimate collection statistics for nodes
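A rough sketch of the first two tasks, assuming discovery has already produced per-node document frequencies: nodes are ranked by how often the query terms occur in them, and the result lists that come back are merged by score. The statistics, the node names and the assumption that scores from different nodes are directly comparable are all simplifications of mine.

```python
import heapq

# Hypothetical statistics obtained by sampling each node:
# node -> {term: estimated document frequency}.
NODE_STATS = {
    "news": {"election": 120, "search": 5},
    "tech": {"search": 300, "crawler": 80},
    "sport": {"election": 2},
}

def select_nodes(stats, terms, k=2):
    """Resource selection: keep the k nodes with the highest summed frequency."""
    def node_score(node):
        return sum(stats[node].get(term, 0) for term in terms)
    ranked = sorted(stats, key=node_score, reverse=True)
    # Nodes that contain none of the query terms are never selected.
    return [node for node in ranked[:k] if node_score(node) > 0]

def merge_results(result_lists, top_n=3):
    """Result merging: pool (score, doc) hits from all nodes, best first."""
    pooled = [hit for hits in result_lists for hit in hits]
    return heapq.nlargest(top_n, pooled)

chosen = select_nodes(NODE_STATS, ["search", "crawler"])
merged = merge_results(
    [[(0.9, "tech/1"), (0.4, "tech/7")], [(0.6, "news/3")]], top_n=2
)
```

In practice scores from heterogeneous nodes usually need normalizing before they can be merged like this; the sketch skips that step.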

Distributed querying can employ either of these schemes

Document partitioning (by topic, language, genre etc.):

Flooding of queries
Selective routing

Tasks: partitioning, query routing and obtaining results

Employ hierarchical assembling of results

Term partitioning (for multi-term queries)
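For the document-partitioned scheme, the difference between flooding and selective routing, and the hierarchical assembling of results, can be sketched as follows. The language-based shards, the scores and the function names are hypothetical; each shard is assumed to return its hits already sorted best first.

```python
import heapq
from itertools import islice

# Hypothetical document partitions keyed by language; each shard maps
# a term to a list of (score, doc_id) hits sorted best first.
SHARDS = {
    "en": {"search": [(0.9, "en-1"), (0.3, "en-4")]},
    "fi": {"search": [(0.7, "fi-2")]},
    "de": {"search": [(0.5, "de-9"), (0.2, "de-3")]},
}

def route(term, languages=None):
    """Flood every shard, or route selectively to the named shards only."""
    targets = SHARDS if languages is None else languages
    return [SHARDS[name].get(term, []) for name in targets]

def merge_pair(a, b, top_n):
    """Merge two best-first lists and keep only the top_n hits."""
    return list(islice(heapq.merge(a, b, reverse=True), top_n))

def assemble(result_lists, top_n=3):
    """Hierarchical assembly: merge result lists pairwise, level by level."""
    level = [hits[:top_n] for hits in result_lists] or [[]]
    while len(level) > 1:
        level = [
            merge_pair(level[i], level[i + 1] if i + 1 < len(level) else [], top_n)
            for i in range(0, len(level), 2)
        ]
    return level[0]

flooded = assemble(route("search"))
selective = assemble(route("search", ["fi", "de"]), top_n=2)
```

Because each merge keeps only top_n hits, no intermediate node in the tree ever has to hold more than 2 × top_n results, which is the point of assembling hierarchically rather than at one coordinator.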

P2P IR

In a P2P network there is no centralized control, but there can be super-peers, i.e. some nodes are more equal than others. P2P systems are self-organizing and failure-resilient, and they encourage resource sharing. They organize peers into a network overlay. Distributed hash tables are used to route queries.
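As a rough illustration of the hashing idea behind distributed hash tables, the sketch below places peers and query terms on the same hash ring and sends each term to the first peer clockwise from it. This is a simplified consistent-hashing view with global knowledge of all peers; real DHT overlays route over several hops with only partial knowledge. The peer names and the HashRing class are hypothetical.

```python
import bisect
import hashlib

def ring_position(key):
    """Hash a string to a position on a 32-bit ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

class HashRing:
    """Toy hash ring: every peer and every term gets a ring position."""

    def __init__(self, peers):
        self._ring = sorted((ring_position(peer), peer) for peer in peers)
        self._positions = [pos for pos, _ in self._ring]

    def peer_for(self, term):
        """Return the first peer clockwise from the term's position."""
        i = bisect.bisect_left(self._positions, ring_position(term))
        # Wrap around to the first peer when the term hashes past the last one.
        return self._ring[i % len(self._ring)][1]

terms = ["open", "source", "search", "crawler"]
ring = HashRing(["peer-a", "peer-b", "peer-c"])
owners = {term: ring.peer_for(term) for term in terms}

# If peer-c leaves, only the terms it owned move to another peer.
smaller = HashRing(["peer-a", "peer-b"])
```

The last two lines hint at the failure resilience mentioned above: when a peer drops out, terms owned by the remaining peers stay where they were.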

Categories: ESSIR-2007
