What is the difference between the various crawling modes?

11/1/2012

Knowledge base, Indexing

When you manually start a crawl and indexing in SiteSeeker Admin, you can choose between three crawl modes: full, minimal and no crawl. The differences between the modes are described in detail below so that you can pick the best mode for your environment.

Full crawl

In full crawl mode, SiteSeeker always examines the URLs of all starting points and all URLs on the website which are linked to.

For each URL, SiteSeeker requests the page or document in question from the web server.
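The traversal described above can be pictured as a breadth-first walk over the site's links. The following is only a minimal sketch under stated assumptions, not SiteSeeker's actual implementation: the function name full_crawl and the in-memory links dictionary (standing in for fetching and parsing pages) are hypothetical.

```python
from collections import deque


def full_crawl(start_urls, links):
    """Return every URL reachable from the starting points, in visit order.

    `links` maps a URL to the URLs it links to. It is a hypothetical
    stand-in for requesting a page and extracting its links.
    """
    queue = deque(dict.fromkeys(start_urls))  # drop duplicate starting points
    seen = set(queue)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)  # in a real crawl: one request to the web server
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


site = {
    "/": ["/news", "/about"],
    "/news": ["/news/1", "/"],
    "/about": [],
    "/news/1": [],
    "/orphan": [],  # not linked from any page, so this sketch never reaches it
}
print(full_crawl(["/"], site))  # ['/', '/news', '/about', '/news/1']
```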

SiteSeeker caches the contents of all crawled pages and documents, so if a page has been crawled before, the web server can tell SiteSeeker that the cached copy is still current instead of sending the content again. This makes subsequent crawls quicker, provided the web server supports the HTTP If-Modified-Since header. Only in full crawl mode can you be sure that SiteSeeker retrieves all newly added and altered documents and that deleted documents are removed from the index.
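As an illustration of the HTTP mechanism involved, the sketch below builds an If-Modified-Since header and decides which content to use from the response status. In standard HTTP, a server that supports conditional requests answers 304 Not Modified when the page is unchanged. The function names and the way the cache is represented are assumptions made for this example and do not describe SiteSeeker's internals.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(last_crawled):
    """Build request headers asking for the page only if it has changed."""
    # HTTP dates are expressed in GMT; `last_crawled` must be timezone-aware UTC.
    return {"If-Modified-Since": format_datetime(last_crawled, usegmt=True)}


def resolve_content(status, response_body, cached_body):
    """Pick the content to index based on the HTTP status of the reply."""
    if status == 304:  # Not Modified: the cached copy is still current
        return cached_body
    if status == 200:  # changed, or the server ignored the header
        return response_body
    return None  # e.g. 404: nothing to index for this URL


last = datetime(2012, 11, 1, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_headers(last))
# {'If-Modified-Since': 'Thu, 01 Nov 2012 12:00:00 GMT'}
```

A server that does not support the header simply answers 200 with the full content every time, which is why crawls against such servers do not get quicker.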

Full crawl mode is useful in the following cases:

  • When you have changed a template that is included by many pages, for example the navigation.
  • When you have added documents that you want to be searchable and which are not linked to from any of the starting URLs.
  • When you have made changes in a document that is not a starting URL and you want SiteSeeker to index it immediately.

Minimal crawl

In minimal crawl mode, SiteSeeker crawls as few pages as possible. However, it normally always examines the URLs of the starting points (link depth 0) and the links found in them (link depth 1), as well as URLs that have not been successfully crawled before (e.g. dead links).
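The selection rule above can be sketched as follows. This is a simplified model under stated assumptions, not SiteSeeker's actual logic: the function name, the links dictionary and the previously_failed list are hypothetical.

```python
def minimal_crawl_targets(start_urls, links, previously_failed):
    """Return the URLs a minimal crawl examines in this simplified model."""
    targets = list(dict.fromkeys(start_urls))  # link depth 0
    for url in list(targets):  # iterate over a snapshot of the starting points
        for target in links.get(url, []):  # link depth 1
            if target not in targets:
                targets.append(target)
    for url in previously_failed:  # URLs never crawled successfully
        if url not in targets:
            targets.append(url)
    return targets


site = {
    "/": ["/news", "/about"],
    "/news": ["/news/1"],
    "/news/1": ["/news/1/comments"],
}
print(minimal_crawl_targets(["/"], site, ["/old-report.pdf"]))
# ['/', '/news', '/about', '/old-report.pdf']
```

In this example, /news/1 sits at link depth 2 and is therefore not examined, which illustrates why a change deeper in the site can go unnoticed in this mode.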

Minimal crawl mode is useful in the following cases:

  • When you have updated a page that is a starting URL.
  • When you have repaired a dead link.
  • When you have changed crawl settings, for example link depth, and you want to index more or fewer documents.
  • When you have changed the settings for allowed/excluded URLs.

You cannot assume that SiteSeeker will crawl all newly added and updated documents in minimal crawl mode.

In conclusion, SiteSeeker typically makes far fewer HTTP requests to the web server(s) in minimal crawl mode than in full crawl mode. In some cases the number of requests may still be high, for instance if there are many dead links or if the crawl settings have been changed.

Suggestion: If new or important pages are linked from specific URLs, you can set these URLs as starting points to make sure that the new and important pages are crawled in this mode.

No crawl mode

In no crawl mode, SiteSeeker does not examine any URLs on the web server.

No crawl mode is useful in the following cases:

  • If you have changed the crawl settings to crawl fewer pages, e.g. introduced a limit for link depth.
  • If you have added excluded paths.
  • If you have changed a setting for the interface and the setting states that the change requires indexing.

However, SiteSeeker still makes an HTTP request for the file robots.txt in no crawl mode.
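The use cases listed for the three modes can be summarized as a small lookup. This is a sketch only: the change names are hypothetical labels invented for this example, and the rule of picking the most thorough mode when several changes apply is an assumption, not documented SiteSeeker behavior.

```python
from enum import IntEnum


class CrawlMode(IntEnum):
    NO_CRAWL = 0
    MINIMAL = 1
    FULL = 2


# Hypothetical labels for the use cases listed under each mode.
SUGGESTED_MODE = {
    "shared_template_changed": CrawlMode.FULL,
    "document_added_not_linked_from_start_url": CrawlMode.FULL,
    "non_start_document_changed": CrawlMode.FULL,
    "start_url_updated": CrawlMode.MINIMAL,
    "dead_link_repaired": CrawlMode.MINIMAL,
    "crawl_settings_changed": CrawlMode.MINIMAL,
    "allowed_or_excluded_urls_changed": CrawlMode.MINIMAL,
    "crawl_scope_reduced": CrawlMode.NO_CRAWL,
    "excluded_paths_added": CrawlMode.NO_CRAWL,
    "interface_setting_requires_indexing": CrawlMode.NO_CRAWL,
}


def suggest_mode(changes):
    """Suggest a mode for a non-empty list of changes.

    Assumption: when several changes apply, the most thorough mode wins.
    """
    return max(SUGGESTED_MODE[change] for change in changes)


print(suggest_mode(["excluded_paths_added", "dead_link_repaired"]).name)
# MINIMAL
```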

Automatic indexing

SiteSeeker automatically indexes at least once every 24-hour period, unless an indexing is already running. The crawl mode is generally full during this indexing. It is possible to perform indexing more than once every 24 hours. Please contact Euroling Support if you need help changing the indexing interval.
Finally, note that the crawl modes that you can set for manual indexing do not apply to automatic indexing.

Do you need to make newly added pages searchable quickly? In that case, read more about notified indexing and how you can get SiteSeeker to update the index instantly whenever the content on the website changes.