I have enabled Google Search Console on this blog, mainly because the blog has no search of its own yet and I like to reference my previous posts related to the topic at hand. So far, I have been using Google search to quickly find where I wrote about this or that.
The problems started when I found out that some posts, especially ones that I had put a lot of energy into, were not showing up. Well, why should I care? I care about security and privacy, so why should I worry about what Google thinks? Google does not respect the privacy of its users, so its opinion should not matter.
Google Search Console
Google Search Console is a tool that helps with multiple things, like importing sitemaps, detecting problems such as content falling outside of the view on mobile devices, removing pages from the search results, and of course finding out why some page is not indexed. Sometimes a page is even indexed, but something else prevents it from being displayed in the search results. And there are apparently tons of possible reasons.
Before the Console shows anything useful, however, I have to prove ownership of the domain. This can be done in multiple ways, like making a readable file accessible from the web or setting a domain TXT record, and some of them are quite similar to the ways ACME checks domain ownership before issuing a TLS certificate. I have written something about it already as well.
Another way to enable the Console functionality is to turn on Google Analytics, for instance by inserting a line into the HTML source. I wanted to avoid precisely this option to preserve the privacy of anyone stumbling upon the site. After all, I started writing for myself, without the goal of including advertising, so the analytics were not needed for targeting anyway.
TXT record enables Analytics too
I have chosen the TXT record due to its simplicity, its non-intrusiveness from the perspective of code version control (no files changed whatsoever) and its instant coverage of the domain and its subdomains.
Unbeknown to me, this option turns on the Analytics anyway, with no apparent option to turn it off. You can check at any time whether it is still turned on via:
dig peterbabic.dev TXT | grep -i google
If the output shows the string google-site-verification, the Analytics are still turned on, so beware.
Indexed but unlisted pages
With the Console activated, I went probing for the pages missing from the search results. Most of them had no actual problem: they had already been crawled (visited by the Google bot) and were included in the index. For around 28 pages, which at the time of writing was about a third of all posts, the stated reason for not showing up was "page with redirect". I knew what a redirect was, but I had no idea what it meant in this context.
Was I somehow setting a permanent 301 redirect or a temporary 302 redirect somewhere in my application? And if so, where? Is it in the Nginx configuration, or is the Single Page App (SPA) router responsible? These were questions I did not know I needed answers to, and I would still be blindly ignorant of them had I not turned the Console on.
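To see which redirect, if any, the server itself sends for a given address, one option is to make a request that does not follow redirects and look at the status code and the Location header. Below is a minimal sketch using only the Python standard library. The function names describe and probe are made up for this example, the query string is ignored for brevity, and a redirect performed in the browser by JavaScript would not show up in such a check at all.

```python
import http.client
from urllib.parse import urlsplit

# Human-readable meaning of the two redirect codes discussed in the post.
REDIRECT_KINDS = {301: "permanent redirect", 302: "temporary redirect"}


def describe(status, location):
    """Turn a status code and a Location header into a one-line summary."""
    kind = REDIRECT_KINDS.get(status)
    if kind is None:
        return f"{status}: no 301/302 redirect"
    return f"{status}: {kind} to {location}"


def probe(url):
    """Send a single HEAD request without following redirects and summarize it."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        # http.client never follows redirects, so the first response is what we get.
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        return describe(response.status, response.getheader("Location"))
    finally:
        conn.close()
```

Calling probe on the same post address once with and once without the trailing slash (for instance on a placeholder like https://example.com/some-post) should make it visible which of the two variants answers with a redirect.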
Nginx or SPA router to blame
I spent the better part of the evening probing the Nginx configuration, as it was much shorter than the app's code, so a problem there could be ruled out much sooner. Unfortunately, this went nowhere. Or rather, it led to the conclusion that the problem is definitely not in the Nginx configuration. The problem had to lie in the page router.
The router in the SPA maps patterns in the URL to the different parts of the application. This means that even though the app is only a single page, the router still manages to change the URL after a link is clicked. This makes sure that when refreshing the page, users land where they were before the refresh. Without the router, the URL would not change and it would always be just the domain name, like https://example.com, meaning that every time a user refreshes, they land on the root page.
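The idea of mapping URL patterns to parts of an application can be sketched in a few lines. This is not the router used on this blog, only an illustration written in Python for brevity; the route table, the view names and the resolve function are all hypothetical.

```python
import re

# Hypothetical route table: URL pattern -> name of the view to render.
ROUTES = [
    (re.compile(r"^/$"), "home"),
    (re.compile(r"^/blog/(?P<slug>[^/]+)$"), "post"),
]


def resolve(path):
    """Return (view_name, params) for the first matching pattern."""
    for pattern, view in ROUTES:
        match = pattern.match(path)
        if match:
            return view, match.groupdict()
    # No pattern matched: fall back to a not-found view.
    return "not_found", {}
```

With this table, /blog/hello resolves to the post view with the slug hello, while /blog/hello/ matches nothing, which shows how sensitive such patterns can be to a single trailing slash.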
This routing is also essential for Search Engine Optimization (SEO). For a search engine to be able to crawl, index and list a page, the page has to have its own unique URL. The possibility of just clicking around to get to the desired content in the application is not sufficient for SEO; a unique URL is a necessary condition.
Trailing slash inconsistency with SPA
Since this routing is done in the application code, the redirects are buried there as well, and therein lies the problem. If the router exhibits inconsistent behavior, it is not good for SEO. The issue I have discovered is an inconsistency in the trailing slash.
The router was creating the URLs in a way that they lacked the trailing slash.
Also, all the links on the page were constructed in precisely the same way. But the full URLs are still being 301 redirected, most probably by the router, to the version with the trailing slash at the end.
These two addresses look the same but are in fact different from the perspective of the search engine. In the past, the convention was that a URL without the trailing slash represents a file, while one with the trailing slash represents a folder. This was very similar to the file browsers of the time.
That a URL represented a file was also supported by the fact that it showed an extension, for instance .html. Displaying extensions in the URL is mostly gone these days, and listing the contents of a folder directly is not part of an SEO strategy either, because such a listing only shows more folder and file names, which is not very useful content to index.
So in the past it made sense for the two addresses to display different content (the content of a file, even without the extension, in the first case, and the contents of a folder in the second). We now want the two addresses to represent the exact same content, and for the search engine to be aware of this, the redirects and the links being built have to be consistent.
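One consistent policy could look like the sketch below. It assumes the canonical form is the one with the trailing slash; the opposite choice works just as well, as long as the generated links and the redirects agree. The function names are hypothetical, and query strings and paths with file extensions are deliberately not handled.

```python
def with_trailing_slash(path):
    """Return the canonical form of a path: always ending with a slash."""
    return path if path.endswith("/") else path + "/"


def redirect_target(path):
    """Return the path to redirect to, or None if the path is already canonical."""
    canonical = with_trailing_slash(path)
    return None if canonical == path else canonical
```

If every link on the page is built with with_trailing_slash, the redirect from the other variant is only ever hit by outside links, so the crawler sees one address per post.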
I turned on the functionality that Google Search Console offers by proving ownership of the domain with a TXT record placed among the other DNS records. By doing this I inadvertently turned on Google Analytics, and I feel cheated, because it looks like I have no option to turn the Analytics off, or at least not with the DNS option, while still using the other Console features, mainly checking whether individual links are being indexed.
By checking the details of the links, I found an inconsistency in the trailing slash redirects that I did not know about before, and the router in the SPA is most probably responsible. With no easy fix in sight, I am keeping this issue in my backlog.
This is the 59th post of #100daystooffload.