Excluding information from crawling and indexing

8/30/2012

Indexing

SiteSeeker normally crawls every web page and document found on a website, starting with the home page of the website.

If there are links to objects on the site that should not be crawled, they can be excluded – we show you how.

In this document, we describe in detail how to exclude pages that should not be crawled and indexed. We also show how to exclude pages from indexing while still letting SiteSeeker follow the links on them.

The settings for limiting crawl and indexing in SiteSeeker Admin are used when:

  • It is not possible to specify robots meta tags or make changes to robots.txt in a simple manner
  • You have several SiteSeeker clients indexing different sections of the same website, and you do not want to use robots meta tags or robots.txt since that would affect the other clients
  • The expressivity of robots.txt does not support the limitation you want to achieve
  • You want to influence how SiteSeeker indexes the site, but not how global search engines index it

Server load is typically lower with these settings in SiteSeeker Admin than with robots meta tags, and comparable to that of a robots.txt file.

Note that these settings do not affect global search engines and should therefore be complemented with a robots.txt file in order to exempt the search page from being indexed.
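As a sketch of such a complement, a robots.txt file placed in the website root can disallow the search page for all crawlers. The path /search/ below is a hypothetical example; substitute the actual path of your search page:

```text
# robots.txt in the website root
# /search/ is a hypothetical path -- replace it with your search page's path
User-agent: *
Disallow: /search/
```

Note that Disallow rules match URL path prefixes, so /search/ also covers URLs such as /search/results.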

The settings for limiting crawl and indexing are found under server settings > Indexing restrictions in SiteSeeker Admin.

Exclude paths

You can exclude paths and parts of URLs to prevent crawling or indexing. For example, if you exclude the path images/*, no objects in the images folder will be crawled; if you exclude images/b*, no objects in the images folder whose names start with the letter b will be crawled.

The specified paths are relative to the root folder of the website. If you want to exclude all pages in a folder named old, regardless of its position in the folder structure, you can set */old/* as an excluded path.
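The wildcard behavior described above can be approximated with Python's fnmatch module. This is an illustration of the pattern semantics, not SiteSeeker's actual matching code:

```python
import fnmatch

# "images/*" matches every object in the images folder
print(fnmatch.fnmatch("images/logo.png", "images/*"))       # True

# "images/b*" matches only objects whose names start with "b"
print(fnmatch.fnmatch("images/banner.jpg", "images/b*"))    # True
print(fnmatch.fnmatch("images/logo.png", "images/b*"))      # False

# "*/old/*" matches an "old" folder anywhere in the folder structure
print(fnmatch.fnmatch("archive/old/page.html", "*/old/*"))  # True
```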

Example: You can often find print friendly versions of web pages on websites. Normally, you would not want these print friendly versions of the web pages to be searchable.

Solution: The URLs of the print-friendly versions of the pages might look something like the examples below:

http://www.acme.com/dept/print.asp?id=10088
http://www.acme.com/dept/index.asp?id=10088&printMode=yes

By adding the expression *printMode=* in the field Exclude paths, you ensure that SiteSeeker excludes these pages.

In order to ensure that global search engines do not index the print-friendly versions, you need to use the robots.txt file or robots meta tags.
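For the example URLs above, one possibility is a wildcard Disallow rule in robots.txt. Note that wildcard (*) support in Disallow lines is an extension honored by major crawlers such as Googlebot and Bingbot, not part of the original robots.txt convention, so verify support for the crawlers that matter to you:

```text
# Keep print-friendly versions out of global search engines
# (wildcard syntax is a de facto extension, not universally supported)
User-agent: *
Disallow: /*printMode=
Disallow: /dept/print.asp
```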

Exclude paths but follow links

Pages that should not be searchable, but whose links SiteSeeker needs to follow, are specified in this section. A typical case is the home page of a website. This setting corresponds to the meta tag <META name="ROBOTS" content="NOINDEX">, with the difference that global search engines can still index the page. This is desirable, since the home page should be indexed by the global search engines.

Example: Pages containing lists, e.g. A to Z directories, FAQ lists or news listings, should be crawled, but the list pages themselves should not be displayed in the results list. Instead, you want to display only the target pages linked to from the listing page.

A-Z page can be exempted from being indexed

Solution: Enter the path to your listing page in the field Exclude paths, but follow links.

Settings in SiteSeeker Admin to exempt search paths but follow links

In order to ensure that global search engines do not index the listing page, you need to use the robots.txt file or robots meta tags.
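One way to do this while still letting global search engines follow the links on the listing page is the standard robots meta tag below, placed in the page's head element:

```html
<!-- On the listing page: keep it out of search results,
     but let crawlers follow its links to the target pages -->
<meta name="robots" content="noindex, follow">
```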

Exclude certain document types from index

The settings for document types can be found under server settings > Document types in SiteSeeker Admin.

SiteSeeker indexes web pages, text documents, PDF and RTF documents, as well as Microsoft Office documents in the standard configuration. Images are not indexed, and links are only followed from web pages. These settings can be altered in SiteSeeker Admin.

Set meta attributes required for indexing

The settings for meta elements can be found under server settings > Meta information in SiteSeeker Admin.

You can require that a certain meta attribute be present in order for SiteSeeker to index a web page. This can be used if you only want to index certain pages on a website. Note, however, that non-HTML documents and images will be indexed even if they lack this meta attribute.

Using this method, SiteSeeker will still crawl all pages, even though only pages with the specified attribute will be indexed. If there are many pages to be excluded, we recommend using the other methods described in this article as a complement, to make crawling more efficient.
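As an illustration, a page that should be indexed under such a requirement might carry a meta element like the one below. Both the attribute name and the value here are hypothetical; use whatever name and value you configure under Meta information in SiteSeeker Admin:

```html
<!-- Hypothetical example: "searchable"/"yes" stand in for the
     meta name and content you configure in SiteSeeker Admin -->
<meta name="searchable" content="yes">
```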

Index only certain pages or folders

You can easily set up SiteSeeker so that only certain parts of the website are indexed, using a combination of settings in SiteSeeker Admin.

Example: On the website example.siteseeker.se, only pages found in the folder new/ should be indexed. The home page for these pages is new/index.html.

Solution 1: Only index the pages in new/ that are linked to from other pages in new/:

Specify * under Indexing restrictions > Exclude paths
Specify new/* under Indexing restrictions > Allow paths
Specify new/index.html under Starting URLs
Do not check the box Crawl website root under Starting points

Solution 2: Only index pages found in new/, including pages that are linked to from other pages on the rest of the website:

Specify * under Indexing restrictions > Exclude paths, but follow links
Specify new/* under Indexing restrictions > Allow paths
Specify new/index.html under Starting URLs
Check the box Crawl website root under Starting points

Solution 1 is more efficient, since SiteSeeker only crawls the pages that should be indexed. With solution 2, you ensure that SiteSeeker finds all pages in new/, even those that are only linked to from pages outside new/. If the website is large and new/ only contains a small part of it, many web pages that will not be indexed will still be crawled and examined by SiteSeeker, which increases bandwidth consumption.

If the website does not have a folder structure

If the website does not have a folder structure you can set the desired starting URLs and then limit the link depth under Indexing.

If you can identify substrings of URLs that you want indexed or not indexed, you can add these, surrounded by asterisks (*), to the respective fields under Indexing restrictions. If a URL matches an expression both under excluded paths and under allowed paths, SiteSeeker treats the field with the longest matching expression as the "winner".
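The longest-match rule can be sketched in a few lines of Python. This is an illustrative model of the precedence behavior described above (using fnmatch-style wildcards), not SiteSeeker's actual implementation:

```python
import fnmatch

def is_indexed(url, excluded, allowed):
    """Sketch of the precedence rule: if a URL matches patterns in both
    the excluded and allowed lists, the longest matching pattern wins.
    A URL matching nothing is indexed by default."""
    best_len, indexed = -1, True
    candidates = [(p, False) for p in excluded] + [(p, True) for p in allowed]
    for pattern, allow in candidates:
        if fnmatch.fnmatch(url, pattern) and len(pattern) > best_len:
            best_len, indexed = len(pattern), allow
    return indexed

# "*" excludes everything, but the longer pattern "new/*" wins for that folder
print(is_indexed("new/index.html", excluded=["*"], allowed=["new/*"]))  # True
print(is_indexed("about.html", excluded=["*"], allowed=["new/*"]))      # False
```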