How does SiteSeeker use sitemaps?

5/7/2014

Knowledge base, Indexing

SiteSeeker supports sitemaps in familiar formats. Sitemaps can be located through "robots.txt", through the start point configuration in SiteSeeker Admin, or by filename.
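As an illustration of the "robots.txt" route, the sketch below collects the "Sitemap:" lines from a robots.txt file. It is a minimal example and not SiteSeeker's actual implementation; the function name sitemap_urls_from_robots and the example.com addresses are made up for the illustration.

```python
# Hypothetical helper, not part of SiteSeeker.
def sitemap_urls_from_robots(robots_txt):
    """Return the URLs named on "Sitemap:" lines of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split only at the first colon so the URL's own "://" survives.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


EXAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml
sitemap: http://www.example.com/news/sitemap.xml.gz  # news section
"""

if __name__ == "__main__":
    for url in sitemap_urls_from_robots(EXAMPLE_ROBOTS):
        print(url)
```

The field name is compared case-insensitively here, which is an assumption of the sketch.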

What is a sitemap?

A sitemap is a file that is used to locate information on a web site. It lists the document addresses (URLs) of a site using a familiar protocol.

Supported protocols

SiteSeeker follows the sitemap protocol described on sitemaps.org but also supports HTML "sitemaps".
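For readers unfamiliar with the sitemaps.org format, the sketch below shows a small XML sitemap and one way to pull the URLs out of it. It only illustrates the protocol and makes no claim about how SiteSeeker parses sitemaps; the function name extract_locs and the example.com addresses are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Made-up sitemap used only for this illustration.
EXAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2014-05-07</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
  </url>
</urlset>
"""


def extract_locs(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    tag = "{%s}loc" % SITEMAP_NS
    return [el.text.strip() for el in root.iter(tag) if el.text and el.text.strip()]


if __name__ == "__main__":
    for url in extract_locs(EXAMPLE_SITEMAP):
        print(url)
```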

How are sitemaps recognized?

Valid from version 6.10.286 - All start points are treated as sitemaps or sitemap indexes. To determine which start points are actual sitemaps, they are parsed and validated against the sitemap protocol. If a start point passes validation, all relevant information is extracted.

Pages that are not start points are recognized as sitemaps if their names contain "sitemap" followed by ".xml", ".xml.gz" or ".ashx", such as: "Sitemap.XML", "my_sitemap.xml.gz"...
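The naming rule for pages that are not start points can be sketched as a pattern match on the file name. This is only an approximation under stated assumptions: the match is taken to be case-insensitive (as the "Sitemap.XML" example suggests), it is applied to the last path segment of the URL, and the extension is expected at the end of the name. The names SITEMAP_NAME and looks_like_sitemap are hypothetical.

```python
import posixpath
import re
from urllib.parse import urlparse

# Assumed rule: "sitemap" directly followed by one of the three extensions,
# at the end of the file name, ignoring case.
SITEMAP_NAME = re.compile(r"sitemap\.(?:xml\.gz|xml|ashx)$", re.IGNORECASE)


def looks_like_sitemap(url):
    """Return True if the URL's file name matches the assumed naming rule."""
    filename = posixpath.basename(urlparse(url).path)
    return SITEMAP_NAME.search(filename) is not None


if __name__ == "__main__":
    for candidate in (
        "http://www.example.com/Sitemap.XML",
        "http://www.example.com/files/my_sitemap.xml.gz",
        "http://www.example.com/index.html",
    ):
        print(candidate, looks_like_sitemap(candidate))
```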

How are the sitemaps used?

If a sitemap is a sitemap index, its sitemap URLs are extracted and fed back into the crawler as new start points.

The URLs listed in a sitemap are fed to the crawler as page links and are treated like normal HTML links.
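The difference between a sitemap index and an ordinary sitemap can be sketched as follows: the root element of the document decides whether the extracted URLs become new start points or page links. This is a simplified illustration, not SiteSeeker's code; route_sitemap and the example documents are made up.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by ElementTree for sitemaps.org elements.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Made-up documents used only for this illustration.
EXAMPLE_INDEX = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<sitemap><loc>http://www.example.com/sitemap1.xml.gz</loc></sitemap>"
    b"</sitemapindex>"
)
EXAMPLE_URLSET = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>http://www.example.com/page.html</loc></url>"
    b"</urlset>"
)


def route_sitemap(xml_bytes):
    """Return (new_start_points, page_links) for one sitemap document."""
    root = ET.fromstring(xml_bytes)
    locs = [el.text.strip() for el in root.iter(NS + "loc")
            if el.text and el.text.strip()]
    if root.tag == NS + "sitemapindex":
        # URLs in an index point at further sitemaps: crawl them as start points.
        return locs, []
    if root.tag == NS + "urlset":
        # URLs in a plain sitemap are ordinary page links.
        return [], locs
    # Anything else is not a sitemap under this sketch.
    return [], []


if __name__ == "__main__":
    print(route_sitemap(EXAMPLE_INDEX))
    print(route_sitemap(EXAMPLE_URLSET))
```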

Sitemaps compressed with gzip are automatically decompressed.
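Automatic decompression could look roughly like the sketch below, which checks for the two-byte gzip signature before decompressing. How SiteSeeker actually detects compressed sitemaps is not stated here, so the detection method and the function name read_sitemap_body are assumptions.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def read_sitemap_body(body):
    """Return the sitemap bytes, decompressing them first if they are gzipped."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


if __name__ == "__main__":
    raw = b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>'
    compressed = gzip.compress(raw)
    print(read_sitemap_body(compressed) == raw)
    print(read_sitemap_body(raw) == raw)
```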

Valid from version 6.10.286 - Start point sitemaps are discarded after extraction.

Limitations

SiteSeeker does not enforce the limits suggested by the sitemap protocol. A sitemap can contain any number of URLs, absolute or relative, and is limited only by the configured maximum page size.
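To illustrate what accepting relative URLs can mean in practice, the sketch below resolves each listed URL against the address of the sitemap itself. Using the sitemap's own URL as the base is an assumption of this example, since the base is not specified here; resolve_locs is a hypothetical helper.

```python
from urllib.parse import urljoin


def resolve_locs(sitemap_url, locs):
    """Turn the URLs listed in a sitemap into absolute URLs.

    Assumption: relative URLs are resolved against the sitemap's own URL.
    Absolute URLs pass through unchanged.
    """
    return [urljoin(sitemap_url, loc.strip()) for loc in locs]


if __name__ == "__main__":
    base = "http://www.example.com/docs/sitemap.xml"
    for url in resolve_locs(base, ["page.html", "/top.html",
                                   "http://other.example.org/x"]):
        print(url)
```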

Currently, only the URLs in a sitemap are used; all other information, such as last-modified dates and priorities, is ignored.