Excluding information from crawling and indexing

8/30/2012

Indexing

SiteSeeker normally crawls every web page and document found on a website, starting with the home page of the website.

If the site links to objects that should not be crawled, they can be excluded.

This document describes in detail how to exclude pages that should not be crawled and indexed. It also shows how to exclude a page from indexing while its links are still crawled.

Use a robots.txt file and robots meta tags when:

  • Configuration should be general and valid for both SiteSeeker and global search engines, e.g. Google.

In order to exclude web pages and documents from crawling, you can use a robots.txt file placed in the root folder of the website. Additionally, you can use robots meta tags to limit indexing and in some cases crawling.

Directives in a robots.txt file can apply to all or to specific search engines.
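As a rough illustration of directives aimed at a specific crawler versus all crawlers, the sketch below evaluates a small robots.txt with Python's standard urllib.robotparser module. The crawler names ExampleBot and OtherBot and the paths /internal/ and /drafts/ are hypothetical and chosen only for this example; they are not SiteSeeker's actual user-agent or real paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one group for a named crawler and one
# group for all other crawlers. "ExampleBot" is a made-up name.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /internal/

User-agent: *
Disallow: /drafts/
"""


def is_allowed(user_agent: str, path: str) -> bool:
    """Return True if the given crawler may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("ExampleBot", "OtherBot"):
        for path in ("/internal/report.html", "/drafts/page.html"):
            print(agent, path, is_allowed(agent, path))
```

Note that a crawler matching a named group follows that group instead of the `*` group, so in this sketch ExampleBot is still allowed to fetch /drafts/. Rules meant for every crawler need to be repeated in each named group.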

A robots.txt file is recommended when many pages should be excluded: crawling can be very time-consuming, and restrictions in the robots.txt file typically shorten crawling time because excluded pages are never fetched. With robots meta tags, by contrast, SiteSeeker must first crawl the page in order to read the tag.

How to prevent search pages from being indexed using robots.txt

Example: The website contains three search pages that should not be indexed by either SiteSeeker or global search engines, in order to avoid distorted search statistics. See Statistical support for more information about how page lists and search engines can influence search statistics.
Solution: Create or update the robots.txt file in the root folder of the website and exclude the search pages from crawling for all user-agents. Note that the URL paths are case-sensitive.

Code example:

User-agent: *
Disallow: /en/Search/
Disallow: /sv/Sok/
Disallow: /en/Menu/Publications/List/
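One way to sanity-check rules like these before deploying them is to evaluate them locally. The following is a minimal sketch using Python's standard urllib.robotparser module. The test paths (for example /en/Search/results and /en/About/) are made up for illustration, and this parser is not guaranteed to behave exactly like SiteSeeker's crawler in every detail.

```python
from urllib.robotparser import RobotFileParser

# The rules from the robots.txt code example above.
RULES = """\
User-agent: *
Disallow: /en/Search/
Disallow: /sv/Sok/
Disallow: /en/Menu/Publications/List/
"""


def blocked(path: str, user_agent: str = "*") -> bool:
    """Return True if the rules forbid the crawler from fetching the path."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return not parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths; the lowercase variant shows case sensitivity.
    for path in ("/en/Search/results", "/en/search/results", "/en/About/"):
        print(path, "blocked" if blocked(path) else "allowed")
```

Because matching is a case-sensitive prefix comparison, /en/search/results is not covered by the rule for /en/Search/. If the site serves the same page under both spellings, each spelling needs its own Disallow line.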

Robots meta tags

In each individual HTML document, you can control whether the document should be indexed and whether the links in the document should be crawled. This is done by adding one of the following tags to the document's <HEAD> section:

<META name="ROBOTS" content="NOINDEX"> Do not index this page, but crawl its links

<META name="ROBOTS" content="NOFOLLOW"> Index this page, but do not crawl its links

<META name="ROBOTS" content="NOINDEX,NOFOLLOW"> Do not index this page and do not crawl its links

The meta tag <META name="ROBOTS" content="NOINDEX"> is well suited for documents containing only links and no real content, such as navigation pages or link pages. <META name="ROBOTS" content="NOINDEX,NOFOLLOW"> works well for search pages; also exclude them in robots.txt in order to avoid inaccurate search statistics and unnecessary load on the server.
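To make the effect of the three tag variants concrete, here is a minimal sketch of how a crawler might read the robots meta tag from a fetched page, using Python's standard html.parser module. The class and function names (RobotsMetaParser, robots_policy) are hypothetical, and the sketch says nothing about how SiteSeeker itself parses pages; it only handles the NOINDEX and NOFOLLOW values described above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, not values.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html: str) -> tuple:
    """Return (index_page, follow_links) for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)


if __name__ == "__main__":
    page = '<html><head><META name="ROBOTS" content="NOINDEX"></head></html>'
    index_page, follow_links = robots_policy(page)
    print("index:", index_page, "follow:", follow_links)
```

The sketch also shows why meta tags cannot save crawl time: the page has to be downloaded and parsed before the directive is known.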

Search engines, including SiteSeeker, generally take robots meta tags into account. Read more about the robots meta tag.