The Difference Between a Spider and a Crawler

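  What prompted this post was the word "robots", which kept coming up in the previous one. It reminded me of "spider" and "crawler" (literally, a creature that crawls). Are the two the same thing, or different?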

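  I once read an article saying the two are different, or at least different strictly speaking. I just searched the web again, and most opinions hold that they are the same. I won't repeat the majority view here; it is easy to find online, and there is plenty of it. In this post I'll make the case that the two are different. Right or wrong, take it as one point of reference: let a hundred schools contend and a hundred flowers bloom.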

  On WebmasterWorld there was once a thread about exactly this: spider versus crawler. It opens with the following description:

  Search engines consist of five discrete software components:

  Spider : a robotic browser like program that downloads webpages.

  Crawler : a wandering spider that automatically follows links found on pages.

  Indexer : a blender like program that dissects webpages that are downloaded by spiders.

  The Database : a warehouse of the pages downloaded and processed.

  Search Engine Results Engine : digs search results out of the database.

  Summed up in one sentence: a spider and a crawler are not the same thing.

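  The thread also puts forward the view that there are five kinds of robots. Their names and functions, in order: spider, downloads web pages; crawler, follows the links found on a page and visits whatever is at the other end; indexer, indexes the downloaded pages; database, the warehouse of downloaded and processed pages; results engine, digs search results out of the database. Five kinds? I don't know whether this view is accurate, but to me at least it was new and distinctive.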
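  To make the spider/crawler split concrete for myself, here is a minimal Python sketch. It is only my reading of the quoted definitions, not how any real search engine is built. The in-memory FAKE_WEB dictionary, the example.test URLs and the function names spider and crawler are all made up for illustration, so nothing here touches the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a> home page',
    "http://example.test/a": '<a href="/b">B</a> page about spiders',
    "http://example.test/b": "page about crawlers",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def spider(url):
    """Spider: 'downloads' exactly one page and does nothing with its links."""
    return FAKE_WEB.get(url)


def crawler(start_url):
    """Crawler: a wandering spider that keeps following links it finds."""
    seen, queue, pages = set(), [start_url], {}
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = spider(url)  # the crawler delegates the download to the spider
        if html is None:
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        # Resolve relative links against the current page before queueing them.
        queue.extend(urljoin(url, link) for link in extractor.links)
    return pages
```

  Calling spider("http://example.test/") hands back a single page, while crawler("http://example.test/") ends up with all three pages because it keeps following the links. On this reading, the only difference between the two is the loop.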

  Someone else wrote:

  Let’s talk about how robots interpret your page for a bit. If I follow Brett’s historical topic, you have three different types of robots, a spider, crawler and indexer.

  First the Spider comes around and requests the URI. It reads server header information and other on page information. Then the Crawler follows all the links within that domain (those that are found and allowed). Then the Indexer reads the html while making heads and tails of it.

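  This poster holds that there are three kinds of robots: spider, crawler and indexer. First the spider arrives by requesting the URI, then reads the server's response headers and the page's head tags. Next, the crawler follows the links the spider found on the page, within the same domain, and visits whatever is at the other end. Finally, the indexer reads the HTML code.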
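  The three steps in that post can be sketched roughly as below. Again, this is just my own illustration under stated assumptions: the SITE dictionary, the ROBOTS_TXT rules, the "demo-bot" user agent and the run function are all hypothetical, and I use Python's standard urllib.robotparser to stand in for the "found and allowed" check.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical site held in memory: url -> (response headers, html).
HTML_TYPE = {"Content-Type": "text/html"}
SITE = {
    "http://example.test/": (
        HTML_TYPE,
        '<title>Home</title><a href="/a">a</a><a href="/private/x">x</a>'
        '<a href="http://other.test/">off-site</a>',
    ),
    "http://example.test/a": (HTML_TYPE, "<title>Spiders</title>spiders download pages"),
    "http://example.test/private/x": (HTML_TYPE, "secret"),
}
# Hypothetical robots.txt rules for the site above.
ROBOTS_TXT = ["User-agent: *", "Disallow: /private/"]


class PageParser(HTMLParser):
    """Collects link targets and lower-cased text words from one page."""

    def __init__(self):
        super().__init__()
        self.links, self.words = [], []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(href)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def run(start_url, agent="demo-bot"):
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT)
    domain = urlparse(start_url).netloc
    index, queue, seen = {}, [start_url], set()
    while queue:
        url = queue.pop(0)
        if url in seen or url not in SITE:
            continue
        seen.add(url)
        # Step 1, spider: request the URI and look at the header information.
        headers, html = SITE[url]
        if not headers.get("Content-Type", "").startswith("text/html"):
            continue
        page = PageParser()
        page.feed(html)
        # Step 2, crawler: follow links within the domain that are allowed.
        for href in page.links:
            target = urljoin(url, href)
            if urlparse(target).netloc == domain and robots.can_fetch(agent, target):
                queue.append(target)
        # Step 3, indexer: record which words appear on which page.
        for word in page.words:
            index.setdefault(word, set()).add(url)
    return index
```

  With this toy data, run("http://example.test/") never visits the off-site link or the page under /private/, so the word "secret" stays out of the index. A real robot would of course fetch robots.txt and the pages over HTTP instead of reading them from a dictionary.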

  What does everyone else make of this question? I hope this post helps get the discussion started.