Document Details

Efficient partitioning strategies for distributed Web crawling

Author(s): Exposto, José; Macedo, Joaquim; Pina, António Manuel Silva; Alves, Albano Agostinho Gomes; Amaro, José Carlos Rufino

Date: 2007

Persistent Identifier: http://hdl.handle.net/1822/6634

Source: RepositóriUM - Universidade do Minho

Subject(s): Databases; Computer communications and networks


Description
This paper presents a multi-objective approach to Web space partitioning, aimed at improving distributed crawling efficiency. The investigation is supported by the construction of two different weighted graphs. The first models the topological communication infrastructure between crawlers and Web servers, and the second represents the number of links between servers' pages. The edge weights of the two graphs are, respectively, computed round-trip times (RTTs) and page-link counts between nodes. The two graphs are then combined, using a multi-objective partitioning algorithm, to support Web space partitioning and load allocation for an adaptable number of geographically distributed crawlers. Partitioning strategies were evaluated by varying the number of partitions (crawlers) to obtain figures of merit for: i) download time, ii) exchange time and iii) relocation time. The evaluation showed that our partitioning schemes outperform traditional hostname-hash-based counterparts in all evaluated metrics, achieving on average an 18% reduction in download time, a 78% reduction in exchange time and a 46% reduction in relocation time.
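The idea of combining an RTT graph and a link graph can be illustrated with a small Python sketch. This is not the paper's algorithm: the abstract names a multi-objective graph partitioner, whereas the code below swaps in a simple greedy heuristic with no load balancing, purely to show how the two edge weights might be traded off against a hostname-hash baseline. All function names, the `alpha` blending parameter, and the toy hosts and weights are assumptions made for illustration.

```python
import hashlib


def hash_partition(hosts, k):
    """Baseline: assign each host to a crawler by a stable hash of its hostname."""
    return {h: int(hashlib.md5(h.encode()).hexdigest(), 16) % k for h in hosts}


def cut_links(assign, links):
    """Link weight crossing partition boundaries (a rough proxy for exchange cost)."""
    return sum(w for (a, b), w in links.items() if assign[a] != assign[b])


def total_rtt(assign, rtt):
    """Sum of RTTs between each host and its crawler (a rough proxy for download cost)."""
    return sum(rtt[(assign[h], h)] for h in assign)


def greedy_partition(hosts, k, rtt, links, alpha=0.5):
    """Hypothetical greedy heuristic: place each host on the crawler that
    minimises a blend of normalised RTT and (negated) link affinity."""
    max_rtt = max(rtt.values()) or 1.0
    assign = {}
    for h in hosts:
        pull = [0.0] * k   # link weight from h to hosts already on each crawler
        degree = 0.0       # total link weight incident to h
        for (a, b), w in links.items():
            if h not in (a, b):
                continue
            degree += w
            other = b if a == h else a
            if other in assign:
                pull[assign[other]] += w
        degree = degree or 1.0

        def cost(c):
            return alpha * rtt[(c, h)] / max_rtt - (1 - alpha) * pull[c] / degree

        assign[h] = min(range(k), key=cost)
    return assign


# Toy data (assumed): two crawlers, four hosts, RTTs in milliseconds.
hosts = ["a.example", "b.example", "c.example", "d.example"]
rtt = {(0, "a.example"): 10, (1, "a.example"): 100,
       (0, "b.example"): 10, (1, "b.example"): 100,
       (0, "c.example"): 100, (1, "c.example"): 10,
       (0, "d.example"): 100, (1, "d.example"): 10}
links = {("a.example", "b.example"): 10,
         ("c.example", "d.example"): 10,
         ("b.example", "c.example"): 1}

greedy = greedy_partition(hosts, 2, rtt, links)
baseline = hash_partition(hosts, 2)
print("greedy:  ", greedy, cut_links(greedy, links), total_rtt(greedy, rtt))
print("baseline:", baseline, cut_links(baseline, links), total_rtt(baseline, rtt))
```

Setting `alpha` closer to 1 favours low RTT (download time), while values closer to 0 favour keeping heavily linked hosts together (exchange time). A real partitioner would also balance load across crawlers, which this sketch ignores.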
Document Type: Conference paper
Language: English
