Troubleshooting - why are some pages or documents not being indexed?

11/1/2012

Indexing, Installation, Search

If some pages or documents have not been indexed, use this troubleshooting guide to identify the problem.

How do you know if the page is indexed?

You can easily check whether a certain document has been indexed by searching with the operator url:. The search query url:www.abc.com/info/index.html (or url:info/index.html) returns a result for that particular document if it is indexed. If you do not get a hit, the document is not indexed.

You can also search for all documents that have a URL which begins with a certain character sequence. For example, the search query url:info/* will return all pages or documents that have a URL (after the domain name) that starts with info/.
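The two query forms above can be built from an ordinary web address with a few lines of Python. This is only a minimal sketch: the function names url_query and url_prefix_query are made up for illustration and are not part of SiteSeeker, and the sketch ignores query strings and fragments because the text above does not describe how the url: operator treats them.

```python
from urllib.parse import urlsplit


def url_query(address, include_host=True):
    """Build a url: query for one document (hypothetical helper)."""
    # urlsplit only recognises the host name when a scheme or "//" is present.
    if "//" not in address:
        address = "//" + address
    parts = urlsplit(address)
    # Query strings and fragments are deliberately dropped in this sketch.
    path = parts.path.lstrip("/")
    if include_host:
        return f"url:{parts.netloc}/{path}"
    return f"url:{path}"


def url_prefix_query(path_prefix):
    """Build a url: query that matches every URL starting with path_prefix."""
    return "url:" + path_prefix.lstrip("/") + "*"


if __name__ == "__main__":
    print(url_query("http://www.abc.com/info/index.html"))
    print(url_query("www.abc.com/info/index.html", include_host=False))
    print(url_prefix_query("/info/"))
```

Paste the resulting string into the search field exactly as you would a hand-written query.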

With the operator link: you can search for all indexed documents that link to a specific document or documents. Certain settings in your search interface may limit your ability to search for the URL. If that is the case, you can use the template search page or the URL inspector in SiteSeeker Admin.

If the page or document has been indexed, but you cannot find it with a normal query, learn more about this problem here: Why do I NOT get a search result on documents/webpages which contain the word I was searching for?

Troubleshooting crawling and indexing

If a web page that you want to be indexed is not displayed when querying, please check the following:

  • The maximum number of indexed pages for the web server or maximum link depth has been exceeded
    In the report Indexing overview, check whether the limit for the maximum number of indexed pages or the maximum link depth has been exceeded. If necessary, change the limits in Server settings » Indexing » Page Indexing safety limit and Maximum link depth. Set the safety limit for indexing to at least three times the expected number of pages/documents.
  • The page has been added or altered since the last crawl and indexing
    In the report Indexing overview check the time for the latest page load. Try to perform a full crawl in SiteSeeker Start » Start indexing. You can always perform a new crawl when something has been changed on the website and you want it to be included in the index.
  • The page is located on a server which is not indexed by SiteSeeker
    Make sure the server where the page is located is listed in SiteSeeker Admin. Note that www.example.com and department.example.com are regarded as two separate servers, even if the websites physically exist on the same server.
  • There are only JavaScript links to the page
    JavaScript links cannot normally be followed by SiteSeeker or by global search engines. Besides preventing web pages from being indexed by SiteSeeker, such links impair SEO and decrease usability. See the more detailed description of the problem with JavaScript links and how you can overcome it.
  • The page is only being linked to from pages containing the meta tag <META name="ROBOTS" content="NOFOLLOW">
    Such a tag in a page with links prevents SiteSeeker from following the links in the page; see the explanation of robots-meta tags. If you want the page to be indexed, you can either remove the meta tag from the linking page or create a link to the page in question from another page without the meta tag.
  • There is no link to the page in any of the indexed pages
    If the page should be indexed without being linked to, you can specify the page as a Starting point in Configure » Servers. It is however uncommon for pages to not have any links to them, since visitors normally find pages by following links on the website, which is exactly what SiteSeeker does.
  • It is forbidden to index the page according to the web server's robots.txt
    Under Indexing premisses in SiteSeeker Start » crawl and index, you can see which pages cannot be crawled according to the file robots.txt. If you cannot make changes in robots.txt, you can move or copy the page to a folder or website branch that is not marked as forbidden. It is also possible to configure SiteSeeker to ignore the robots.txt file in the server settings in SiteSeeker Admin. This should only be done as a last resort.
  • The page contains the meta tag <META name="ROBOTS" content="NOINDEX">
    This tag instructs all search engines, including SiteSeeker, not to index the page; see the description of robots-meta tags. Check if the page is listed in the report Non indexed pages. If the page should be indexed, you can remove the meta tag or make a copy of the page without the meta tag and create a link to the copy.
  • The page is not allowed to be indexed according to current settings for that web server in Admin
    See the report Indexing overview and, if necessary, change Excluded paths in Configure » Servers » Indexing restrictions. If the other pages in the excluded path should remain unindexed, you can move the page to a non-excluded path and then run a new indexing.
  • Links in web pages cannot be followed according to settings for document types for the web server in Admin
    In SiteSeeker Admin, open the server settings. In the section "Document types", make sure that the checkboxes for "Follow links" are checked for all relevant types (typically "Web pages").
  • The page is a duplicate page
    Check if the page has been reported as a duplicate. If the page is actually not a duplicate, make sure that the HTML code is valid. Invalid HTML code may affect duplicate detection. For more information, see the section The page has invalid HTML code below. If the page is classified as a duplicate even though the HTML code is valid, the page may have the same text content as another page, but with different images or JavaScript. To correct this, make a minor alteration in the text to make it unique and then run a new crawl.
  • The page is in a format that is not indexed
    SiteSeeker will index files in several different formats, e.g. Word, PDF, PowerPoint, Excel, etc. Make sure that the document has a file format that SiteSeeker can index. If it is, that particular format may be disabled in SiteSeeker. You can review and edit these formats for each server, see the server settings in SiteSeeker in the section "Document types".
  • Converting from a different format than HTML failed
    Check if the page is in the report Non indexed pages in Statistics/Status » Statistical indexing. Some PDF files are password protected and cannot be converted. If a document was created with an unknown application or with an early version of an application, try to re-save the document with an application that you know produces documents SiteSeeker can index.
  • The page contains invalid HTML code
    If the page contains invalid HTML code, SiteSeeker sometimes cannot interpret the content even though you can read it in your web browser (but perhaps not in all web browsers). Use W3C's HTML validator to check the page's HTML code. Common mistakes are missing </HEAD> or <BODY> tags, or a missing " in an attribute. Another example of incorrect HTML code that can affect SiteSeeker is irregular strings in JavaScript, especially strings that contain </script>.
  • The pages that are linking to the page have invalid HTML code
    If the pages that link to the page have invalid HTML code, SiteSeeker cannot always find the links even though they work in your web browser. Use W3C's HTML validator to check the linking page's HTML code.
  • The web server does not send the page with the correct content because it believes that SiteSeeker cannot read HTML
    No pages are usually indexed in this case. If it is impossible to correct this issue, try letting SiteSeeker present itself as a web browser when crawling the pages. You may change the user agent of SiteSeeker's crawler in the server settings, found in SiteSeeker Admin, under "Crawling". This is however not an optimal solution since it may make web server logs harder to use, and may impair web site statistics.
  • The web server redirects the home page to another web page
    No pages are usually indexed in this case. Try to set the page which the web browser is redirected to when you visit the web server's home page as a Starting URL in the server settings, found in SiteSeeker Admin. For example: if you browse to www.example.com/ and the web browser automatically loads www.example.com/en/start.html, you can specify en/start.html as a Starting URL.
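
Several items in the list above depend on robots.txt rules and robots meta tags. The following Python sketch shows one way to check both for a single page outside SiteSeeker, using only the standard library. It is an approximation under stated assumptions: the user agent name "ExampleCrawler" is a placeholder (the name SiteSeeker's crawler actually sends is not given here), the function names are hypothetical, and SiteSeeker's own interpretation of these rules may differ in detail.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, so <META> matches.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_meta_directives(html):
    """Return the set of robots meta directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def allowed_by_robots_txt(robots_txt, user_agent, url):
    """Return True if the robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = '<html><head><META name="ROBOTS" content="NOINDEX"></head></html>'
    rules = "User-agent: *\nDisallow: /private/\n"
    # "ExampleCrawler" is a placeholder user agent, not SiteSeeker's real one.
    print(robots_meta_directives(page))
    print(allowed_by_robots_txt(rules, "ExampleCrawler",
                                "http://www.example.com/private/page.html"))
```

A page whose directives include noindex will not be indexed, and links on a page whose directives include nofollow will not be followed. Run the meta tag check on the linking pages as well as on the missing page itself.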

Links in reports do not work

The SiteSeeker index reports often contain links to queries that illustrate different aspects of the indexed material and indicate possible problem areas. By default, these links point to the template search page. If this search page does not work (for example, because you use access control), you can set which URL is used for the search links in the reports. This setting can be found in Search pages » All search pages: Search.