Dynamic robots.txt files in Kentico
This quick guide will show you how to set up automatic publishing of appropriate robots.txt contents for your Kentico site.
The robots.txt file directs search engines (well, the ones who behave themselves anyway) on what they should and shouldn’t index within your site, as well as where to find your sitemap.
It’s extremely helpful for improving SEO on your live site, but also the opposite: stopping search engines entirely from listing your other environments, such as staging, test, and development servers (if they’re accessible to the public).
For this reason, it’s extremely important that you don’t accidentally put the wrong robots.txt file in the wrong environment! Luckily, Kentico gives us some very handy tools for not only managing the contents of your robots.txt file via the CMS, but also dynamically serving the correct file based on which environment you’re running.
I would recommend rolling something like this into your standard process for all Kentico sites.
Step 1: Create your dynamic robots.txt
- Create a new page, e.g. /System/Robots
- If you’re using a recent version of Kentico, there should already be a Robots.txt template set up with most of this stuff! If not, create a new template and continue...
- The only web part needed on this template is "Custom Response"
- Content type: text/plain
- Encoding: UTF-8
- Response code: 200
- Content: [put the contents of your robots.txt here]
Step 2: Get it working at /robots.txt
Note: for this step, you need to have extensionless (or any extension) URLs configured for your site. This involves a simple web.config change for IIS 7+, or configuring a custom 404 handler in IIS 6.
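For reference, the IIS 7+ change usually amounts to letting ASP.NET handle every request, including extensionless ones like /robots.txt. A minimal sketch (your Kentico version’s documentation may recommend a slightly different setup, so treat this as an example):

```xml
<configuration>
  <system.webServer>
    <!-- Lets ASP.NET handle all requests, including extensionless URLs
         such as /robots.txt (requires the IIS 7+ integrated pipeline) -->
    <modules runAllManagedModulesForAllRequests="true" />
  </system.webServer>
</configuration>
```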
There are potentially a couple of ways to get /robots.txt serving up the page you want, such as setting a custom URL path or using an ASP.NET Route Pattern. However, in recent versions, Kentico provides a special setting just for this case. In Kentico 8, open the Settings application, and under URLs and SEO > Search engine optimization (SEO), you’ll find the "Robots.txt path" setting. Simply set this to the Alias Path of your new robots.txt page, and you’re done.
Kentico’s documentation details the manual process for these first two steps here: https://docs.kentico.com/display/K8/Managing+robots.txt
At this point, you have a page that will return a robots.txt. But let’s go one step further and make it dynamic, so it serves different contents based on whether your site is live or not.
Step 3: Make it more dynamic!
Now for the magic part - making your robots.txt file automatically block search engines when your site is not live, but allow them when it is.
Go to the Design tab of your robots.txt page. Set the Visible property of your web part to a macro by clicking the arrow symbol next to the checkbox to open the macro editor, then enter the following macro:
```
Domain == CurrentSite.Domain
```
This means this response will only be used when you are looking at the site via its primary domain (configured in the Sites application). For this to work, you need the main domain for the site to be set to the live site domain, and any other environments (dev, staging etc) to be set as domain aliases.
Clone this web part (or copy/paste into the same zone in Kentico 8) so you have a second one, and name the two so you can differentiate clearly between them (as they have no visible content).
Now you can edit the Visible macro of the second web part like so:
```
Domain != CurrentSite.Domain
```
Notice the difference between the "==" and "!=" - this has the opposite effect. So this web part will show when the first one doesn’t, and hence represents the robots.txt that is used when you are NOT viewing the live site.
You can now edit the response for each robots.txt version separately.
What goes into your live robots.txt file is up to you. I’d recommend allowing most pages and resources through, to minimise the risk of a search engine deciding your website isn’t accessible or mobile friendly, for example. Search engines such as Google are pretty good at sorting through what they're crawling nowadays. You should definitely include a reference to your XML sitemap, which Kentico can also generate automatically for you.
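As a starting point, a permissive live robots.txt might look something like this (the sitemap URL is a placeholder; substitute the one Kentico generates for your site):

```
# Allow everything; point crawlers at the XML sitemap
User-agent: *
Disallow:
Sitemap: https://www.example.com/sitemap.xml
```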
For your blocking (dev) robots.txt file, something simple like this should do the trick:
```
User-agent: *
Disallow: /
```
Step 4: Test it!
You can test it manually by simply tacking /robots.txt onto the end of your website’s URL in each environment. I’d highly recommend using Google Webmaster Tools for a more thorough test, as it will also validate the contents of your robots.txt file and let you confirm that specific pages are being blocked or allowed.
Blocking search engines can be achieved in a number of ways. Another popular option is adding a meta tag to your site, such as this:
```html
<meta name="robots" content="noindex,nofollow">
```
If this is added to your master page, this will instruct any (well-behaved) robots to not add any page it crawls to its index, and also to not follow any links it finds in that page. So it should stop a lot of crawler traffic. If you were to add it using an HTML Head web part and use a macro similar to the ones described above, it would have a similar effect.
This option is very powerful, and very dangerous! In the past, I have seen a website accidentally deploy this meta tag through to production and be almost immediately wiped from Google’s index!
HTTP headers can also be added at an IIS level to all sites on a development server, or firewall authentication could be used to block public access to an entire server. These options only apply to specific environments however, and will not carry through with a site as it moves to a new environment, such as a UAT or Staging environment, that is still not live.
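If you do want a server-wide block at the IIS level, one option is the X-Robots-Tag HTTP header, which can be added to every response via web.config (or at the server level in IIS Manager). A sketch, assuming IIS 7+:

```xml
<configuration>
  <system.webServer>
    <httpProtocol>
      <customHeaders>
        <!-- Tells well-behaved crawlers not to index anything on this server -->
        <add name="X-Robots-Tag" value="noindex, nofollow" />
      </customHeaders>
    </httpProtocol>
  </system.webServer>
</configuration>
```

As with the meta tag, make sure this never ships to production.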
There are many ways to control SEO-related settings for sites across environments, and many ways to make mistakes! Hopefully this provides some guidance on how one extremely powerful SEO tool, the robots.txt file, can be dynamically controlled to give you more reliability and confidence around your site’s SEO.