Document Details

Geographical partition for distributed web crawling

Author(s): Exposto, José; Macedo, Joaquim; Pina, António Manuel Silva; Alves, Albano Agostinho Gomes; Amaro, José Carlos Rufino

Date: 2005

Persistent Identifier: http://hdl.handle.net/1822/6321

Source: RepositóriUM - Universidade do Minho

Subject(s): Web Mining; Parallel Crawling; Web Partitioning


Description
This paper evaluates scalable distributed crawling by means of a geographical partition of the Web. The approach relies on multiple distributed crawlers, each responsible for the pages belonging to one or more previously identified geographical zones. The work considers a distributed crawler in which pages are assigned for visiting according to the geographical scope of their content. For the initial assignment of a page to a partition, we use a simple heuristic that places the page in the same scope as the geographical location of its hosting web server. During download, if analysis of the page's content recommends a different geographical scope, the page is forwarded to the server located in that scope. A sample of Portuguese Web pages, extracted during 2005, was used to evaluate: a) page download communication times and b) the overhead of page exchange among servers. The evaluation results allow our approach to be compared with conventional hash partitioning strategies.
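The assignment scheme outlined in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the host-to-zone table, the zone labels, the function names, and the choice of a host-based MD5 hash as the "conventional" baseline are all assumptions made for illustration only.

```python
import hashlib
from typing import Dict, Optional
from urllib.parse import urlsplit

# Hypothetical table mapping a hosting web server to a geographical zone.
# A real crawler would derive this from some geolocation source.
HOST_ZONE: Dict[str, str] = {
    "a.example.pt": "north",
    "b.example.pt": "south",
}
DEFAULT_ZONE = "unknown"


def initial_zone(url: str) -> str:
    """Initial heuristic: a page takes the zone of the server hosting it."""
    host = urlsplit(url).hostname or ""
    return HOST_ZONE.get(host, DEFAULT_ZONE)


def final_zone(url: str, content_zone: Optional[str]) -> str:
    """Keep the initial zone unless content analysis recommends another.

    content_zone stands in for the result of analysing the downloaded page;
    None means the analysis gave no recommendation.
    """
    zone = initial_zone(url)
    if content_zone is not None and content_zone != zone:
        return content_zone  # the page would be forwarded to that zone
    return zone


def hash_partition(url: str, n_crawlers: int) -> int:
    """Assumed baseline: hash the host name to choose a crawler index."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_crawlers
```

Under the geographical scheme a page moves between crawlers only when its content points to another zone, whereas the hash baseline ignores location entirely and spreads hosts across crawlers by their hash value.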
Document Type: Conference paper
Language: English
