Tuesday, September 4, 2007

Open-Source, Distributed and P2P IR - Wray Buntine

Open Source, Distributed and Peer-to-Peer IR

Wray Buntine

National ICT Australia

Helsinki Institute for Information Technology

Some Open Source systems

Technorati: http://technorati.com/

Built on a Lucene backend for search

Facility to add Tags and authority

Wikia Search

“Open Search” based on distributed, social and semantic concepts.

Wikia acquired Grub (http://www.grub.org/), a P2P crawler, from LookSmart. It currently uses Lucene.

Other systems mentioned in the talk:

Getting higher ratings in search

Some of the methods used to push a client's website higher in the ratings are link farms, fake websites and flogs (fake blogs).

Some Academic Systems available on different license types

· Indri: C++ system from U.Mass and CMU with language models, under a BSD license; part of Lemur.

· Lucene: Java-based industrial system, sometimes used in academia due to its popularity.

· MG4J: “Managing Gigabytes for Java” system from U.Milano, under the LGPL.

· Terrier: feature-laden Java-based system from U.Glasgow, under the MPL.

· Wumpus: scalable desktop-oriented system from U.Waterloo, under the GPL.

· Zettair: simple, fast C-based system from RMIT University, under a BSD-style license.

Distributed and peer-to-peer architecture

Factors in distributed IR

· Modes: cluster-based, P2P, etc.

· Co-operating vs Competing nodes

· Homogeneous vs Heterogeneous nodes

· Centralized vs Decentralized vs No control

· Nature of content

Stages in a Search Process

  • Crawling: by URL or domain
  • Pre-processing: by document
  • Indexing: a very large collection of (term,doc) elements
  • Querying: by term or document or other combinations
  • Result serving: by document

Note: The hard task is to distribute Indexing and Querying (the core of search)
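
To make the indexing and querying stages concrete, here is a minimal single-machine sketch in Python. It is not from the talk: the whitespace tokenizer, the function names and the AND-only query semantics are all assumptions chosen to keep the example short. It shows the index as a collection of (term, doc) entries grouped by term.

```python
from collections import defaultdict

def preprocess(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_index(docs):
    """Build an inverted index: term -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def query(index, text):
    """Return the doc ids that contain every query term (AND semantics)."""
    postings = [set(index.get(t, ())) for t in preprocess(text)]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical toy collection, used only for illustration.
docs = {
    "d1": "peer to peer search",
    "d2": "distributed search engine",
    "d3": "peer review",
}
index = build_index(docs)
```

With this toy collection, query(index, "peer search") intersects the postings for both terms. Once the index no longer fits on one machine, that intersection is the step that becomes hard to distribute.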

Distributed Query Score computation

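For a multi-term query, a problem arises if the terms are held on different nodes, because no single node sees every entry needed to score a document. So how the document-term entries are distributed across nodes matters.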
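
A small sketch of the problem, assuming a hypothetical index where each node holds the postings for some of the terms and a document's score is simply the sum of its per-term weights. The node names, weights and additive scoring are illustrative assumptions, not details from the talk.

```python
from collections import defaultdict

# Hypothetical term-partitioned index: each node holds the postings
# (doc_id -> term weight) for the terms assigned to it.
NODE_POSTINGS = {
    "node_a": {"distributed": {"d1": 0.5, "d2": 0.25}},
    "node_b": {"search": {"d1": 0.25, "d3": 1.0}},
}

def partial_scores(node, query_terms):
    """Score contribution one node can compute from the terms it holds."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in NODE_POSTINGS[node].get(term, {}).items():
            scores[doc_id] += weight
    return dict(scores)

def score_query(query_terms):
    """Sum the partial scores from every node into full document scores."""
    total = defaultdict(float)
    for node in NODE_POSTINGS:
        for doc_id, s in partial_scores(node, query_terms).items():
            total[doc_id] += s
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
```

Neither node alone can rank d1 correctly for the query "distributed search": node_a sees only 0.5 and node_b only 0.25, so the partial scores have to be shipped somewhere and added.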

Federated Querying

Tasks

  • Resource Selection: Selecting the nodes to distribute the queries to
  • Result merging: Assembling results from different result sets
  • Discovery: Sampling to estimate collection statistics for nodes
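
The first two tasks can be sketched as follows. This is an illustrative toy and not a method named in the talk: the sampled statistics, the sum-of-counts selection heuristic and the max-normalisation used for merging are all assumptions.

```python
import heapq

# Hypothetical sampled statistics per node (the "discovery" step):
# term -> number of sampled documents containing that term.
NODE_SAMPLES = {
    "news": {"election": 40, "football": 10},
    "sport": {"football": 90, "election": 1},
    "recipes": {"pasta": 70},
}

def select_resources(query_terms, k=2):
    """Rank nodes by sampled term counts and keep the top k non-zero ones."""
    scored = [
        (sum(stats.get(t, 0) for t in query_terms), node)
        for node, stats in NODE_SAMPLES.items()
    ]
    best = heapq.nlargest(k, scored)
    return [node for score, node in best if score > 0]

def merge_results(result_sets, limit=3):
    """Merge per-node (doc_id, score) lists after max-normalising each list."""
    merged = []
    for node, results in result_sets.items():
        top = max((s for _, s in results), default=0.0)
        for doc_id, s in results:
            merged.append((s / top if top else 0.0, node, doc_id))
    # Best normalised score first; ties broken by node, then doc id.
    merged.sort(key=lambda r: (-r[0], r[1], r[2]))
    return [(node, doc_id) for _, node, doc_id in merged[:limit]]
```

Normalising before merging matters because heterogeneous nodes may score on different scales; raw scores from one node are not directly comparable with those from another.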

Distributed querying can employ either of these schemes:

Document partitioning (by topic, language, genre etc.):

  • Flooding of queries
  • Selective routing

Tasks: partitioning, query routing and obtaining results

Employ hierarchical assembling of results

Term partitioning (for multi-term queries)
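
As a rough illustration of document partitioning with query flooding and hierarchical assembly of results, the sketch below sends a single-term query to every shard and merges the ranked lists pairwise, tree-style. The shard layout (doc_id -> {term: score}) and the function names are assumptions made for the example.

```python
import heapq

def search_shard(shard, term, k):
    """Return the shard's top-k (score, doc_id) pairs for one term."""
    hits = [(score, doc) for doc, terms in shard.items()
            for t, score in terms.items() if t == term]
    return heapq.nlargest(k, hits)

def merge_pair(left, right, k):
    """Merge two ranked lists and keep only the k best entries."""
    return heapq.nlargest(k, left + right)

def flood_query(shards, term, k=2):
    """Send the query to every shard, then merge results pairwise in a tree."""
    level = [search_shard(s, term, k) for s in shards]
    while len(level) > 1:
        nxt = [merge_pair(level[i], level[i + 1], k)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # the odd list out is carried up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0] if level else []

# Hypothetical shards, each holding whole documents.
shards = [
    {"d1": {"ir": 3.0}},
    {"d2": {"ir": 5.0}, "d3": {"p2p": 1.0}},
    {"d4": {"ir": 4.0}},
]
```

Because each shard holds complete documents, every shard can score its own documents locally and only the top-k lists travel up the merge tree. Selective routing would replace the "every shard" step with a chosen subset.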

P2P IR

In a P2P network there is no centralized control, but there can be super-peers, i.e. some nodes are more equal than others. P2P systems are self-organizing and failure-resilient, and they encourage resource sharing. They organize peers into a network overlay and use distributed hash tables to route queries.
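
One way to picture routing over a distributed hash table is a hash ring, where each query term is sent to the first peer clockwise from the term's hash. The sketch below is a simplified, single-process assumption (SHA-1 positions on a 2**32 ring, hypothetical peer names) and not the design of any particular P2P system from the talk.

```python
import bisect
import hashlib

def ring_position(key):
    """Map a string to a position on a 2**32 hash ring."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16) % (2 ** 32)

class TermRing:
    """Toy hash ring that assigns each term to a responsible peer."""

    def __init__(self, peers):
        self._ring = sorted((ring_position(p), p) for p in peers)
        self._positions = [pos for pos, _ in self._ring]

    def peer_for(self, term):
        """Return the first peer at or clockwise from the term's position."""
        i = bisect.bisect_left(self._positions, ring_position(term))
        return self._ring[i % len(self._ring)][1]

    def route(self, query):
        """Map each lowercase query term to the peer that should answer it."""
        return {term: self.peer_for(term) for term in query.lower().split()}

# Hypothetical peer names, used only for illustration.
ring = TermRing(["peer-1", "peer-2", "peer-3"])
routes = ring.route("Distributed search")
```

The assignment depends only on the hashes, so any peer that knows the ring can compute where a term lives without asking a central coordinator.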
