How Monsido Scans Links on your Website

How links are counted and scanned in Monsido


This article explains how the Monsido scanner finds, follows, and counts the links on your website.



Links in Monsido

The Monsido scanner follows the links on your website to find all of the content that is present on your domain.

A link is a unique URL that points to an HTML page or to an asset such as an image, JavaScript file, CSS file, or document. All links are unique: any difference in a URL (such as a single letter change) creates a distinct link.

Examples of unique links are:

  • https://domain.tld/webaccessibility

  • https://domain.tld/web-accessibility

Or

  • https://domain.tld/pageId=23

  • https://domain.tld/pageId=24

Or

  • https://domain.tld/list?sort=price

  • https://domain.tld/list?sort=color

In each pair of links above, the two links shown are counted as two separate, unique links.
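As a rough illustration of this rule, the sketch below treats every URL as an exact string and counts the distinct strings. It is not Monsido's code; the function name count_unique_links is hypothetical and exists only for this example.

```python
def count_unique_links(urls):
    """Count distinct URLs by exact string comparison (illustrative only)."""
    return len(set(urls))


found = [
    "https://domain.tld/webaccessibility",
    "https://domain.tld/web-accessibility",
    "https://domain.tld/list?sort=price",
    "https://domain.tld/list?sort=color",
    "https://domain.tld/list?sort=price",  # repeat of an earlier link
]

unique_total = count_unique_links(found)
print(unique_total)
```

The repeated sort=price URL is counted once, so the five occurrences above amount to four unique links.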


How Monsido counts pages

Any unique link to any HTML page on the primary domain, subdomain (if added), or internal URL (link) is counted as one page.

Monsido only counts each unique link once, even if the link occurs multiple times and/or on multiple pages.

A flowchart of the order in which Monsido counts links on pages, as described below this image.

The structure in the image above is like this:

  • Primary Domain (page 1) contains three links:

    • Link to page 2

    • Link to page 3

    • Link to page 4

  • Page 2 contains one link:

    • Link to page 5

  • Page 3 contains one link:

    • Link to page 5

  • Page 4 contains one link:

    • Link to page 5

  • Page 5 contains one link:

    • Link to page 1

In this example, the Monsido scanner counts:

  • 5 unique pages (the primary domain URL plus 4 other pages)

  • 5 unique links

  • 3 occurrences of the link to page 5
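The same totals can be reproduced with a small sketch that models the example structure as a dictionary. The variable names are hypothetical and the code only mirrors the arithmetic of this example, not the scanner itself.

```python
from collections import Counter

# Link structure from the example: page number -> pages it links to.
links_on_page = {
    1: [2, 3, 4],
    2: [5],
    3: [5],
    4: [5],
    5: [1],
}

# Every page that appears in the structure is counted once.
pages = set(links_on_page)

# Count how many times each link target occurs across all pages.
occurrences = Counter(
    target for targets in links_on_page.values() for target in targets
)

# A link is unique per target, no matter how often it occurs.
unique_links = set(occurrences)

print(len(pages), len(unique_links), occurrences[5])
```

This prints 5 5 3: five pages, five unique links, and three occurrences of the link to page 5.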

External links: The Monsido scanner determines if an external link is broken, but it does not count external links as pages.


Exclusions and constraints

Monsido includes features that allow you to exclude certain links and URL paths from a scan. Use these to stop Monsido from following, scanning, and counting certain links and pages.

Link Excludes

Link excludes let you specify patterns that tell the crawler to ignore every URL that matches. A matching link is still recorded as present on the page, but Monsido does not check it or follow it.

For more information, see the User Guide article:

Path Constraints

Path constraints let you control which pages Monsido scans. With a regular expression, you can include content in the scan or exclude it.

Example 1:

If you only want to scan the news section of your website, https://domain.tld/news, add the constraint ^/news so that the crawler only scans content under that path. Important note: the start URL must itself match one of the constraints. You can do this in either of these ways:

  • Change the start URL to https://domain.tld/news

or

  • Add an extra constraint, ^/$, which matches the front page.
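To see what these two patterns match, here is a minimal sketch that uses Python's re module. It assumes that a constraint is tested against the path part of the URL and that a URL is in scope when any constraint matches; both are assumptions made for this illustration, and in_scope is a hypothetical helper, not part of Monsido.

```python
import re
from urllib.parse import urlsplit

# The two constraints from Example 1.
constraints = [re.compile(r"^/news"), re.compile(r"^/$")]


def in_scope(url):
    """Return True if the URL path matches at least one constraint."""
    path = urlsplit(url).path or "/"  # treat an empty path as the front page
    return any(pattern.search(path) for pattern in constraints)


for url in (
    "https://domain.tld/",
    "https://domain.tld/news/2024/story",
    "https://domain.tld/about",
    "https://domain.tld/newsletter",
):
    print(url, in_scope(url))
```

Note that ^/news is a prefix match, so under these assumptions it also matches /newsletter. A stricter pattern such as ^/news(/|$) would limit the match to the news section.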

Example 2:

Suppose you have a search results page and want to remove all of its results from the scan. The URL of a results page could look something like:

https://domain.tld/search/results?query=test

It is possible to create a negative constraint to do this. It could look something like: !search/results?
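The sketch below shows one way such a negative constraint could be evaluated. It assumes that the leading ! marks the constraint as negative, that the rest is a regular expression, and that it is tested against the path plus query string. These are assumptions for illustration; is_excluded is a hypothetical helper.

```python
import re
from urllib.parse import urlsplit


def is_excluded(url, negative_constraint):
    """Return True if the URL matches the pattern after the leading '!'."""
    if negative_constraint.startswith("!"):
        pattern = negative_constraint[1:]
    else:
        pattern = negative_constraint
    parts = urlsplit(url)
    # Test against the path and, when present, the query string.
    target = parts.path + ("?" + parts.query if parts.query else "")
    return re.search(pattern, target) is not None


constraint = "!search/results?"
print(is_excluded("https://domain.tld/search/results?query=test", constraint))
print(is_excluded("https://domain.tld/news", constraint))
```

If the pattern is read as a regular expression, the unescaped ? makes the final s optional instead of matching a literal question mark, so it would also match /search/result. Writing search/results\? would match the literal character.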

For more information, see the User Guide article:


Canonical links

Canonical links are a way to indicate the preferred version of a webpage when there are duplicate or similar versions under different URLs. Canonical links are mostly used to help search engines understand which version to index and display in the search results, which improves SEO.

You can read more about canonical links here:

Example uses of canonical links:

Example 1: Print version of a page

Consider this URL:

https://domain.tld/page_id=32

When a print version of this page is created, many CMSs add a print parameter to the URL that looks something like:

https://domain.tld/page_id=32?print=yes

In many real-world cases, the content of these two pages is effectively or exactly the same. In the example above, these URLs register as two separate pages for web crawlers or search engines.

A common way to address this is to add a canonical tag to the print page:

https://domain.tld/page_id=32?print=yes

The tag points to the primary page:

https://domain.tld/page_id=32

To do this, insert a tag into the head section of the print page's HTML. Here is an example:

<link rel="canonical" href="https://domain.tld/page_id=32">

This canonical tag tells web crawlers and search engines that:

i) these pages contain duplicate content and

ii) the URL without the print parameter is the main version of the page.
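For readers who want to check which canonical URL a page declares, here is a minimal sketch that uses only Python's standard html.parser module. The CanonicalFinder class and find_canonical function are hypothetical names for this example; this is not how Monsido reads the tag, only one simple way to do it.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attr.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


print_page = """<html><head>
<link rel="canonical" href="https://domain.tld/page_id=32">
</head><body>Printable content</body></html>"""

canonical_url = find_canonical(print_page)
print(canonical_url)
```

For the print page above, the function returns the URL of the primary page. A page without a canonical tag returns None.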

Example 2: Sortable lists

Another example is a page that displays a sortable list of items – like a news site with a list of articles or a store with a list of products.

Assume that https://domain.tld/list contains a list that you can sort by color, price, or size. The content on the page remains the same, but each sorted version of the page has a unique URL such as:

https://domain.tld/list?sort=colors
https://domain.tld/list?sort=price
https://domain.tld/list?sort=size

In this case, you could add a canonical link to the main list like this:

<link rel="canonical" href="https://domain.tld/list">

This tag indicates that the default sort version of the page should be considered the primary version.

Add this canonical tag to alert search engines or web crawlers (like Monsido) that each of these URLs links to pages that have the same content.
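The effect of the tag can be sketched as a grouping step: URLs that declare the same canonical URL are treated as one piece of content. The mapping and the collapse_by_canonical function below are hypothetical and only illustrate the idea. They do not describe how any particular crawler stores this information.

```python
# Hypothetical crawl result: URL -> canonical URL declared on that page.
declared_canonical = {
    "https://domain.tld/list": "https://domain.tld/list",
    "https://domain.tld/list?sort=colors": "https://domain.tld/list",
    "https://domain.tld/list?sort=price": "https://domain.tld/list",
    "https://domain.tld/list?sort=size": "https://domain.tld/list",
}


def collapse_by_canonical(declared):
    """Group URLs under the canonical URL they declare (or under themselves)."""
    groups = {}
    for url, canonical in declared.items():
        groups.setdefault(canonical or url, []).append(url)
    return groups


groups = collapse_by_canonical(declared_canonical)
print(len(declared_canonical), "URLs,", len(groups), "distinct page")
```

Here the four URLs collapse into a single group keyed by https://domain.tld/list.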

You can read more on canonical links here:

Monsido can use canonical tags to exclude URLs that point to identical content, see the User Guide articles:


How the Monsido Scan Works

Monsido’s crawler uses a dynamic discovery process, which means it actively explores and discovers web pages on your website. It does this by following links from one page to another systematically to find all pages on your domain.

Monsido does the scan with a breadth-first approach, which means it starts with the initial webpage and then systematically explores all of the links on a page before it moves on to the next level/depth of pages. The crawler can simultaneously scan 10 different pages of the same domain, and in most cases respects the depth priority of the links, depending on the response and processing latency of each page. When a sitemap is found, all pages in the sitemap are considered to be at depth level 0 (top).

Note: The crawler is capped at a depth of 100 links from the start page.
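A breadth-first crawl with a depth cap can be sketched as follows. This is a simplified, single-threaded model over an in-memory link structure. It leaves out the 10 simultaneous page scans and assumes the start page is at depth 0 alongside any sitemap pages. The crawl_order function and the sample site are hypothetical.

```python
from collections import deque

MAX_DEPTH = 100  # depth cap stated in this article


def crawl_order(start, links_on_page, sitemap_pages=(), max_depth=MAX_DEPTH):
    """Return {page: depth} in breadth-first discovery order."""
    depth = {}
    queue = deque()
    # The start page and all sitemap pages begin at depth 0.
    for page in (start, *sitemap_pages):
        if page not in depth:
            depth[page] = 0
            queue.append(page)
    while queue:
        page = queue.popleft()
        if depth[page] >= max_depth:
            continue  # do not follow links beyond the depth cap
        for target in links_on_page.get(page, []):
            if target not in depth:  # each page is visited once
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c"], "/c": ["/"]}
depths = crawl_order("/", site)
print(depths)
```

For the sample site this yields depth 0 for /, depth 1 for /a and /b, and depth 2 for /c. Passing sitemap_pages=["/c"] would place /c at depth 0 instead.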

Monsido’s crawler can also inspect robots.txt files. It detects sitemaps that are declared in the robots.txt file and automatically scans all of the links listed in the sitemap XML.
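To check what a crawler could discover this way, the sketch below reads the Sitemap entries from a robots.txt file and lists the URLs in a sitemap, using Python's standard urllib.robotparser (Python 3.8 or later for site_maps) and xml.etree.ElementTree. The sample file contents and the function names are hypothetical, and the sketch works on text that is already downloaded.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# XML namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemaps_in_robots(robots_txt):
    """Return the sitemap URLs declared in the robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None if there are none


def urls_in_sitemap(sitemap_xml):
    """Return every <loc> URL listed in a sitemap <urlset>."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


robots_txt = """User-agent: *
Disallow: /private/
Sitemap: https://domain.tld/sitemap.xml
"""

sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://domain.tld/</loc></url>
  <url><loc>https://domain.tld/news</loc></url>
</urlset>"""

print(sitemaps_in_robots(robots_txt))
print(urls_in_sitemap(sitemap_xml))
```

If the first list is empty, robots.txt does not declare a sitemap, and a crawler that relies on this mechanism would not find one there.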


Sitemaps

As an alternative to a start page, you can add a sitemap. Sitemaps can make the Monsido domain scan more effective, particularly for large or complex websites and for sites that contain a great deal of multimedia content.

A sitemap essentially serves as a roadmap for the Monsido crawler, which shows Monsido an organized structure of your website. This facilitates easier navigation and helps Monsido to discover URLs across your site. Making sure that every page is linked to at least one other page can be challenging for users with large websites. A sitemap can address this issue by guiding the Monsido crawler to new pages that might otherwise be overlooked.


Types of pages in a CMS

Many CMSs group pages into categories such as news pages, event pages, and other pages that are created by a CMS module.

Often, a page is a normal content page that users can set up themselves, whereas a news page created by a CMS module is categorized differently. These pages can be categorized either as a collection of pages or as a specific content type inside the CMS. The same happens with events and forms. All content has a unique URL and can be accessed by a user.


Troubleshooting

We have collected the answers to some common issues that you might encounter regarding broken links found by the Monsido scan.

For more information, see the User Guide article:


Additional Information

For more information on setting up Monsido scans, see the User Guide articles:

For definitions and explanations of acronyms and abbreviations used in the Monsido User Guide, see:

For further assistance, contact the Monsido support team at support@monsido.com or use the Monsido chat and help features inside the application.

Image of the toolbar with the Help Center buttons highlighted.
