Imagine you had a scanner that could probe the whole Internet for major content management systems (e.g., WordPress, Drupal, and Joomla) within minutes, grab the front page markup, store all received HTTP headers, take various measurements, and evaluate all of this on a per-site basis as well as across the content management systems. And imagine it was just waiting for your input.
What would you make it do?
Large-scale Analysis of the Internet
When analyzing the Internet as a whole, or a specific portion of it, there are lots of things one could do. The following sections list a few potentially interesting aspects of a single website or content management system that are worth evaluating.
Please, jump in if you have some idea yourself! 🙂
One of the first things that come to mind regarding the analysis of a major CMS in the wild is market share. The percentage of, say, Drupal websites is surely a good indicator for the popularity and general usage of the system.
However, there is more to it than just asking, for example, “How many websites are powered by WordPress?”, because the whole Internet might not be the most interesting target in all cases. Maybe you want to limit this to the Alexa top sites, quite possibly even for a specific country or category only? Or how about analyzing only websites of one or more top-level domains?
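To make the market share idea a bit more tangible, here is a minimal sketch of how a CMS could be guessed from the stored front page markup and HTTP headers. The fingerprints (generator meta tag, `X-Generator` header, `/wp-content/` paths) and the function names `detect_cms` and `market_share` are my own assumptions for illustration, not a complete or reliable signature set.

```python
import re
from collections import Counter

# Heuristic fingerprints; illustrative assumptions, not an exhaustive list.
# The pattern expects the name attribute to precede the content attribute.
GENERATOR_RE = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)
CMS_NAMES = ("WordPress", "Drupal", "Joomla")


def detect_cms(html, headers):
    """Guess the CMS from the generator meta tag, headers, and asset paths."""
    candidates = []
    match = GENERATOR_RE.search(html)
    if match:
        candidates.append(match.group(1))
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    candidates.append(lowered.get("x-generator", ""))
    for candidate in candidates:
        for cms in CMS_NAMES:
            if cms.lower() in candidate.lower():
                return cms
    # Fall back to a well-known asset path when no generator is exposed.
    if "/wp-content/" in html or "/wp-includes/" in html:
        return "WordPress"
    return None


def market_share(sites):
    """Return the share per CMS (in percent) over (html, headers) pairs."""
    counts = Counter(
        detect_cms(html, headers) or "unknown" for html, headers in sites
    )
    total = sum(counts.values())
    return {cms: 100.0 * count / total for cms, count in counts.items()}
```

Restricting the analysis to a top list or a single top-level domain would then simply mean filtering the `sites` input before calling `market_share`.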
Another very interesting aspect is the distribution of the individual CMS versions. What are, say, the five most used versions of Joomla? Are there versions to be found that shouldn’t even exist? Maybe fake versions? How many pre-release or even development versions are there in the wild? Alphas, betas, RCs, anything else?
What about versions that are not the latest release in their individual branch (e.g., running v1.2.3 even though v1.2.7 is the latest)? This is particularly interesting when considering automatic updates, which were introduced in WordPress 3.7, for example. If a website is not running the latest minor version of its branch, this could mean auto-updates are disabled or somehow broken.
Also, how many of all CMS-powered websites do not give away their installed version?
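The version questions above could be answered with something along these lines. This is only a sketch: the `latest_per_branch` mapping is hypothetical input that would have to come from each project's release history, and treating the first two version components as the branch is an assumption that fits WordPress-style numbering but not necessarily every CMS.

```python
from collections import Counter


def parse_version(version):
    """Turn '4.5.3' into (4, 5, 3); return None for non-numeric versions."""
    parts = version.split(".")
    if not all(part.isdigit() for part in parts):
        return None  # e.g. '4.7-beta1', or a faked version string
    return tuple(int(part) for part in parts)


def outdated_in_branch(installed, latest_per_branch):
    """True if a newer release exists in the installed version's branch."""
    parsed = parse_version(installed)
    if parsed is None:
        return False
    # Assumption: the first two components identify the branch (e.g. 4.5).
    latest = latest_per_branch.get(parsed[:2])
    return latest is not None and parsed < latest


def summarize(versions, latest_per_branch, top=5):
    """Summarize detected versions; None means 'version not exposed'."""
    known = [v for v in versions if v is not None]
    return {
        "top": Counter(known).most_common(top),
        "hidden": versions.count(None),
        "non_numeric": [v for v in known if parse_version(v) is None],
        "outdated": sum(outdated_in_branch(v, latest_per_branch) for v in known),
    }
```

The `non_numeric` bucket is where alphas, betas, RCs, and possibly fake versions would end up for closer inspection.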
More and more website owners (want to) offer their content in more than one language. Analyzing how many websites offer multiple language versions, and how, is certainly something really exciting—at least for me, as lead developer of MultilingualPress, an open source plugin for multilingual WordPress websites. 🙂
How can these other language versions be found? Are there multiple languages on every page? Hopefully wrapped in containers with the according lang attributes? Or does the website contain links to content in other languages (i.e., the <a> tags contain an according hreflang attribute)? Are there <link rel="alternate" hreflang="*" href="*" /> tags to other language versions of the current content? Maybe (also) Link HTTP headers, again, with appropriate hreflang values?
Also interesting is how and where the individual languages are managed. Completely separate domains? Subdomains? Subdirectories? Only separate URLs (in the same directory, if any)?
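As a rough sketch of how the language hints mentioned above could be collected from the stored markup and headers, consider the following. The class and function names are made up for this example, the Link header parsing is deliberately simplified, and the domain comparison naively assumes that the last two host labels form the registrable domain (which is wrong for suffixes such as co.uk).

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class HreflangParser(HTMLParser):
    """Collect lang attributes and hreflang hints from a page's markup."""

    def __init__(self):
        super().__init__()
        self.lang_attributes = set()  # lang="..." on any element
        self.alternates = {}          # <link rel="alternate" hreflang href>
        self.anchor_hreflangs = {}    # <a hreflang href>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("lang"):
            self.lang_attributes.add(attrs["lang"])
        hreflang, href = attrs.get("hreflang"), attrs.get("href")
        if not (hreflang and href):
            return
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "alternate" in rel:
            self.alternates[hreflang] = href
        elif tag == "a":
            self.anchor_hreflangs[hreflang] = href


LINK_RE = re.compile(r"<([^>]+)>([^,<]*)")


def hreflangs_from_link_header(value):
    """Simplified parsing of a Link header; returns {hreflang: url}."""
    result = {}
    for url, params in LINK_RE.findall(value):
        if not re.search(r'rel="?[^";]*\balternate\b', params, re.IGNORECASE):
            continue
        lang = re.search(r'hreflang="?([^";,\s]+)', params, re.IGNORECASE)
        if lang:
            result[lang.group(1)] = url
    return result


def url_layout(page_url, alternate_href):
    """Naively classify where an alternate language version lives."""
    page_host = urlsplit(page_url).hostname or ""
    other_host = urlsplit(urljoin(page_url, alternate_href)).hostname or ""
    if page_host == other_host:
        return "same-host"  # subdirectory, or just a separate URL

    def tail(host):
        # Assumption: last two labels = registrable domain (not always true).
        return ".".join(host.split(".")[-2:])

    return "subdomain" if tail(page_host) == tail(other_host) else "separate-domain"
```

Feeding each stored front page into `HreflangParser` and each stored `Link` header into `hreflangs_from_link_header` would give a first impression of how many sites announce other language versions, and by which mechanism.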
A CMS can be used to power virtually all sorts of websites. Wouldn’t it be cool to (try to) evaluate what different types of websites there are, and what their distribution looks like? Blogs, e-commerce sites, (product) landing pages, portals, forums, and whatnot.
A huge number of websites already integrate some sort of analytics, although every single one should. It wouldn’t surprise me to find out that Google Analytics is the most-used one. How many websites actually use it?
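A very simple way to approach the analytics question is to look for well-known script URLs in the front page markup. The signature list below is an assumption on my part and far from complete; tools loaded indirectly (for example through a tag manager) would not be attributed correctly.

```python
# Substrings of script URLs commonly associated with each tool.
# This mapping is an illustrative assumption, not a complete catalog.
ANALYTICS_SIGNATURES = {
    "Google Analytics": (
        "google-analytics.com/analytics.js",
        "google-analytics.com/ga.js",
        "googletagmanager.com/gtag/js",
    ),
    "Google Tag Manager": ("googletagmanager.com/gtm.js",),
    "Matomo/Piwik": ("piwik.js", "matomo.js"),
}


def detect_analytics(html):
    """Return the sorted names of analytics tools whose signature appears."""
    lowered = html.lower()
    return sorted(
        name
        for name, needles in ANALYTICS_SIGNATURES.items()
        if any(needle in lowered for needle in needles)
    )
```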
Search Engine Optimization
Everyone wants their website to rank, so they most likely apply search engine optimization techniques, both on-page and technical SEO. And some of these can be evaluated.
Does the website include Open Graph meta data? What about structured data markup?
Are static files such as images and videos served off a CDN?
Does the web server send caching (i.e., Expires) HTTP headers?
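The caching question can be checked directly against the stored HTTP headers. Here is a small sketch, assuming the headers are available as a plain dictionary; besides `Expires`, it also looks at `Cache-Control`, and the function name `caching_info` is just for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def caching_info(headers, now=None):
    """Extract basic caching hints from a dict of response headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    info = {"cache_control": None, "max_age": None, "expires_in": None}

    cache_control = lowered.get("cache-control")
    if cache_control:
        directives = [d.strip().lower() for d in cache_control.split(",")]
        info["cache_control"] = directives
        for directive in directives:
            if directive.startswith("max-age="):
                value = directive.split("=", 1)[1]
                if value.isdigit():
                    info["max_age"] = int(value)

    expires = lowered.get("expires")
    if expires:
        try:
            when = parsedate_to_datetime(expires)
        except (TypeError, ValueError):
            when = None  # invalid dates such as "0" or "-1" are common
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            # Seconds until expiry; negative means already expired.
            info["expires_in"] = (when - now).total_seconds()
    return info
```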
What is the most frequent document type declaration? HTML5? Or still an HTML 4.01 definition? Maybe even XHTML?
Are there any SVG images in use? As source in <img /> or <object> tags only? How many websites use inline <svg> elements?
How many flexbox websites are there in the wild?
Are there any websites using CSS variables?
How many websites use jQuery? In what versions?
How many websites include (untranspiled!) ES2015 (or even ES2016) code?
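Two of the markup questions above, the document type and the jQuery version, can be approximated from the front page alone. The following is a minimal sketch with regular expressions of my own; it assumes the doctype comes first in the document and that the jQuery version is visible in the script URL, which is not always the case.

```python
import re

DOCTYPE_RE = re.compile(
    r"\s*(?:<\?xml[^>]*\?>\s*)?<!DOCTYPE\s+([^>]*)>", re.IGNORECASE
)
SCRIPT_SRC_RE = re.compile(r'<script[^>]+src=["\']([^"\']+)["\']', re.IGNORECASE)
# Matches jquery.js, jquery.min.js, jquery-1.12.4.min.js, but not
# jquery-migrate.min.js or jquery-ui.min.js.
JQUERY_FILE_RE = re.compile(
    r"jquery(?:[-.]\d+(?:\.\d+)*)?(?:\.slim)?(?:\.min)?\.js", re.IGNORECASE
)
VERSION_RE = re.compile(r"\d+\.\d+(?:\.\d+)?")


def classify_doctype(html):
    """Roughly classify the document type declaration."""
    match = DOCTYPE_RE.match(html)
    if not match:
        return "none"
    declaration = " ".join(match.group(1).split()).lower()
    if declaration == "html":
        return "HTML5"
    if "xhtml" in declaration:
        return "XHTML"
    if "html 4.01" in declaration:
        return "HTML 4.01"
    return "other"


def jquery_versions(html):
    """Return the jQuery version per matching script tag, if visible."""
    versions = []
    for src in SCRIPT_SRC_RE.findall(html):
        if JQUERY_FILE_RE.search(src):
            version = VERSION_RE.search(src)
            versions.append(version.group(0) if version else "unknown")
    return versions
```

Questions about flexbox, CSS variables, or ES2015 code would require fetching and inspecting the linked stylesheets and scripts as well, which goes beyond the front page markup used here.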
Similar to the website’s markup, there are means, unfortunately fairly limited ones, to investigate a website’s server technology.
What web server software is running?
Does the website still use HTTP only, or is it HTTPS?
What PHP version is used?
Which database server is running, and in what versions?
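Some of these server questions can be answered from response headers, as long as the server chooses to send them. The sketch below assumes the final URL and the headers are available; the function names are hypothetical. Note that the database server is normally not visible in HTTP headers at all, so it is left out here.

```python
import re
from urllib.parse import urlsplit


def first_product(value):
    """Parse the first 'name/version' token, e.g. from a Server header."""
    match = re.match(r"\s*([^\s/()]+)(?:/([^\s()]+))?", value or "")
    if not match:
        return None
    return match.group(1), match.group(2)  # version may be None


def server_technology(final_url, headers):
    """Collect what the response reveals about the server stack."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # X-Powered-By is optional and freely configurable, so treat as a hint.
    php = re.search(r"PHP/([\d.]+)", lowered.get("x-powered-by", ""), re.IGNORECASE)
    return {
        "https": urlsplit(final_url).scheme == "https",
        "web_server": first_product(lowered.get("server")),
        "php_version": php.group(1) if php else None,
    }
```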
With a history of both the front page markup and the HTTP headers at hand, many of the analyses mentioned before could also be run for arbitrary or individually meaningful days in the past.
Each individual content management system’s version distribution can be analyzed, based on markup and HTTP headers only, of course. One could also check how quickly websites were updated after major and minor/patch releases of the CMS in use.
The website type can be analyzed as well—taking only the front page into account. And also analytics and SEO can be evaluated.
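Given such a history, the update speed could be estimated roughly like this. The snapshot data and the release date passed in are hypothetical inputs, versions are assumed to be already parsed into tuples, and the result is only an upper bound, since its precision depends on how often a site was scanned.

```python
from datetime import date


def days_until_update(snapshots, release_version, release_date):
    """Days from a release to the first snapshot showing it (or newer).

    snapshots: iterable of (snapshot_date, version_tuple_or_None).
    Returns None if no snapshot on or after the release date shows the
    release.
    """
    for snapshot_date, version in sorted(snapshots, key=lambda s: s[0]):
        if (
            snapshot_date >= release_date
            and version is not None
            and version >= release_version
        ):
            return (snapshot_date - release_date).days
    return None
```

For example, a site first seen with the new version on the day after the release date would yield `1`.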
Yes, there are limitations. And depending on what you want to do and what the server gives you, there might even be quite a few.
For several HTTP headers, the server can send (or not send, that is) whatever data it likes. Also, lots of the meta data included in the website’s markup cannot be fully trusted.
However, one can certainly do their very best with the data available, while keeping in mind that not everything is the absolute truth.
What is it that you always wanted to know about your favorite CMS, analysis-wise?
In your personal opinion, which are the three most interesting aspects when comparing all websites for each individual CMS, and what are your top three when comparing the content management systems with each other?
Is there anything you learned somewhere about a CMS that you just cannot believe?
Shoot! It really would mean so much to me.
Some of Your Suggestions
To keep a better overview of the things suggested either in the comments below or via other channels, such as social media and IM, I included the following list. I will update it whenever something relevant (and manageable 😉 ) is suggested.
Switching from One CMS to Another
For all websites proven to run Drupal, Joomla or WordPress now, check what they were running one and two years ago.
How large or complex (in terms of hierarchy) is the website?
What is the content all about?
Scan a website for accessibility practices, for example, the use of Accessible Rich Internet Applications (ARIA) attributes.
How often is a website updated, and when was the last time? This information could be gained from parsing RSS feeds, for example.
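For the update frequency suggestion, parsing an RSS 2.0 feed might look like the sketch below. It assumes the feed text has already been fetched and uses the `pubDate` of each item; Atom feeds use different element names and are not handled. Also note that the standard library XML parser is not hardened against malicious input, so a real scanner would want a safer parser.

```python
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime


def item_dates(rss_xml):
    """Return the sorted publication dates of all items in an RSS feed."""
    root = ET.fromstring(rss_xml)
    dates = []
    for item in root.iter("item"):
        text = (item.findtext("pubDate") or "").strip()
        try:
            when = parsedate_to_datetime(text)
        except (TypeError, ValueError):
            continue  # skip missing or malformed dates
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        dates.append(when)
    return sorted(dates)


def update_summary(rss_xml):
    """Summarize the last update and the mean gap between items in days."""
    dates = item_dates(rss_xml)
    if not dates:
        return None
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(dates, dates[1:])]
    return {
        "last_update": dates[-1],
        "items": len(dates),
        "mean_gap_days": sum(gaps) / len(gaps) if gaps else None,
    }
```

Since feeds usually list only the most recent items, the mean gap describes recent activity rather than the whole lifetime of a website.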