Imagine you had a scanner that could probe the whole Internet with regard to major content management systems (e.g., WordPress, Drupal, and Joomla) within minutes, grab the front page markup, store all received HTTP headers, take various measurements, and evaluate all of this on a per-site basis as well as across the content management systems—and it was just waiting for your input.
What would you make it do?
Large-scale Analysis of the Internet
When analyzing the Internet as a whole, or a specific portion of it, there are certainly lots of things one could do. The following sections include a few potentially interesting aspects of a single website or content management system worth evaluating.
Please jump in if you have an idea yourself! 🙂
Market Share
One of the first things that come to mind regarding the analysis of a major CMS in the wild is market share. The percentage of, say, Drupal websites is surely a good indicator of the popularity and general usage of the system.
However, there is more to it than just asking, for example, “How many websites are powered by WordPress?”, for the whole Internet might not even be the most interesting target in all cases. Maybe you want to limit this to the Alexa top sites, quite possibly even for a specific country or category only…? Or how about analyzing websites of one or more top level domains only?
Version Census
Another more than interesting aspect is the distribution of the individual CMS versions. What are the, say, five most used versions for Joomla? Are there versions to be found that shouldn’t even exist? Maybe fake versions? How many pre-release or even development versions are there in the wild? Alphas, betas, RCs, anything else?
What about versions that are not the latest released ones in their individual branch (e.g., running v1.2.3 even though v1.2.7 is the latest)? This is particularly interesting when considering automatic updates, which were introduced in WordPress 3.7, for example. If a website is not running the latest non-major version, this could mean auto-updates are disabled, or somehow broken.
Also, how many of all CMS-powered websites do not give away their installed version?
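To give an idea of what such a check could look like, here is a minimal sketch in Python (the language is just my choice for these examples, not necessarily what the scanner uses) that extracts whatever the meta generator tag gives away. A real evaluation would, of course, run against the stored markup instead of fetching every site again, and would need more patterns than this single one.

```python
# Minimal sketch: guess the CMS and its version from the meta generator tag.
# The regex is simplified on purpose and only covers the common attribute order.
import re
import requests

GENERATOR_RE = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)

def guess_cms_version(url):
    """Return e.g. 'WordPress 4.5.3' if the site exposes it, else None."""
    response = requests.get(url, timeout=10)
    match = GENERATOR_RE.search(response.text)
    return match.group(1) if match else None

print(guess_cms_version("https://example.com"))
```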
Multilingualism
More and more website owners (want to) offer their content in more than one language. Analyzing how many websites offer multiple language versions, and how, is certainly something really exciting—at least for me, as lead developer of MultilingualPress, an open source plugin for multilingual WordPress websites. 🙂
How can these other language versions be found? Are there multiple languages on every page, hopefully wrapped in containers with the corresponding lang attributes? Or does the website contain links to content in other languages (i.e., the <a> tags carry a corresponding lang attribute)?
Are there <link rel="alternate" hreflang="*" href="*" /> tags pointing to other language versions of the current content? Maybe (also) Link HTTP headers, again with an appropriate hreflang?
Also interesting is how and where the individual languages are managed. Completely separate domains? Subdomains? Subdirectories? Only separate URLs (in the same directory, if any)?
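For illustration, here is a rough sketch of how the hreflang part could be evaluated against already stored markup and headers; the regexes are deliberately simplified and only meant to show the idea.

```python
# Collect hreflang alternates from <link rel="alternate"> tags and from the
# Link HTTP header (if one was recorded). Simplified for illustration.
import re

LINK_TAG_RE = re.compile(r'<link[^>]+rel=["\']alternate["\'][^>]*>', re.IGNORECASE)
HREFLANG_RE = re.compile(r'hreflang=["\']([^"\']+)["\']', re.IGNORECASE)
HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)

def hreflang_alternates(html, link_header=""):
    """Return a mapping of language code -> URL found in markup or headers."""
    alternates = {}
    for tag in LINK_TAG_RE.findall(html):
        lang, href = HREFLANG_RE.search(tag), HREF_RE.search(tag)
        if lang and href:
            alternates[lang.group(1)] = href.group(1)
    # Example: Link: <https://example.com/de/>; rel="alternate"; hreflang="de"
    for part in link_header.split(","):
        lang, url = HREFLANG_RE.search(part), re.search(r"<([^>]+)>", part)
        if lang and url and 'rel="alternate"' in part:
            alternates[lang.group(1)] = url.group(1)
    return alternates
```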
Website Type
A CMS can be used to power virtually all sorts of websites. Wouldn’t it be cool to (try to) evaluate what different types of websites there are, and what their distribution looks like? Blogs, e-commerce sites, (product) landing pages, portals, forums, and whatnot.
Analytics
A huge number of websites already do—although every single one should—integrate some sort of analytics. It wouldn’t surprise me to find out Google Analytics is the most-used one. How many websites actually use it?
However, there are several other prominent analytics tools and services, such as Clicky and Piwik. Evaluating the usage of a predefined set of analytics tools is therefore something worth doing.
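A simple sketch of how such a predefined set could be tested against the stored front-page markup; the fingerprints below are illustrative assumptions, not an exhaustive (or authoritative) list.

```python
# Map analytics tools to markup fingerprints and report every match.
ANALYTICS_FINGERPRINTS = {
    "Google Analytics": ("google-analytics.com/analytics.js", "_gaq.push", "ga('create'"),
    "Piwik": ("piwik.js", "_paq.push"),
    "Clicky": ("static.getclicky.com",),
}

def detect_analytics(html):
    """Return the names of all analytics tools whose fingerprint appears."""
    return [
        name
        for name, needles in ANALYTICS_FINGERPRINTS.items()
        if any(needle in html for needle in needles)
    ]
```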
Search Engine Optimization
Everyone wants their website to rank, so most of them likely apply search engine optimization techniques, both on-page and technical SEO. And some of these can be evaluated.
Does the website include Open Graph meta data? What about structured data markup?
Are static files such as images and videos served off a CDN?
Does the web server send caching (i.e., Expires) HTTP headers?
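Here are two of these checks sketched out, assuming the front-page markup and the recorded response headers are available as a string and a dict; this is a rough approximation, not a full SEO audit.

```python
# Check for Open Graph meta data, JSON-LD structured data, and caching headers.
import re

OG_META_RE = re.compile(r'<meta[^>]+property=["\']og:[^"\']+["\']', re.IGNORECASE)

def seo_signals(html, headers):
    """headers: dict of already-recorded HTTP response headers."""
    return {
        "open_graph": bool(OG_META_RE.search(html)),
        "structured_data": "application/ld+json" in html,
        "caching_headers": any(h in headers for h in ("Expires", "Cache-Control")),
    }
```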
Website Technology
Analyzing both a website’s markup and its referenced assets (e.g., CSS and JavaScript files) can provide good insight into the technology used. The following sections include a few initial criteria.
HTML
What is the most frequent document type declaration? HTML5? Or still an HTML 4.01 definition? Maybe even XHTML?
Are there any SVG images in use? As source in <img /> or <object> tags only? How many websites use <svg> elements?
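A minimal sketch of both checks, run against the stored markup; the doctype categories are intentionally coarse.

```python
# Classify the document type declaration and check for SVG usage.
import re

def doctype(html):
    head = html[:512].lower()
    if "<!doctype html>" in head:
        return "HTML5"
    if "xhtml" in head:
        return "XHTML"
    if "html 4.01" in head:
        return "HTML 4.01"
    return "other/none"

def svg_usage(html):
    lower = html.lower()
    return {
        "inline_svg": "<svg" in lower,
        "svg_in_img_or_object": bool(re.search(r'<(img|object)[^>]+\.svg', lower)),
    }
```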
CSS
How many flexbox websites are there in the wild?
Are there any websites using CSS variables?
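This one is a bit trickier, since the interesting bits usually live in external stylesheets the scanner would have to fetch as well. Assuming the CSS text is available, a sketch could look like this:

```python
# Very rough feature checks on CSS text (inline <style> blocks or fetched files).
def css_features(css_text):
    compact = css_text.replace(" ", "").lower()
    return {
        "flexbox": "display:flex" in compact or "display:-webkit-flex" in compact,
        "custom_properties": "var(--" in compact,
    }
```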
JavaScript
How many websites use jQuery? In what versions?
How many websites include (untranspiled!) ES2015 (or even ES2016) code?
Are there any websites that do not include any JavaScript at all (on the front page)?
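A sketch of the jQuery question, guessing the version from referenced script URLs such as jquery-1.12.4.min.js; inline or renamed copies would obviously slip through.

```python
# Extract a jQuery version number from script URLs found in the markup.
import re

JQUERY_URL_RE = re.compile(r'jquery[/-]?(\d+\.\d+(?:\.\d+)?)[^"\']*\.js', re.IGNORECASE)

def jquery_version(html):
    match = JQUERY_URL_RE.search(html)
    return match.group(1) if match else None
```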
Server Technology
Similar to the website’s markup, there are, unfortunately, only fairly limited means to investigate a website’s server technology.
What web server software is running?
Does the website still use HTTP only, or is it HTTPS?
What PHP version is used?
Which database server is running, and in what versions?
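Most of this boils down to reading the recorded response headers, keeping in mind that a server can send (or omit) whatever it likes here. A sketch:

```python
# Read technology hints from already-recorded response headers.
def server_technology(headers):
    """headers: dict of recorded HTTP response headers."""
    return {
        "server": headers.get("Server"),            # e.g. "nginx/1.10.1"
        "powered_by": headers.get("X-Powered-By"),  # e.g. "PHP/5.6.24"
        # HTTPS support would come from the scheme the scan used (or a redirect
        # to https://), not from a header; the database server usually stays
        # completely invisible from the outside.
    }
```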
Looking Back
Having a history of both the front page markup and the HTTP headers at hand, one could run lots of the analyses mentioned before for arbitrary or individually meaningful days in the past.
Each individual content management system’s version distribution can be analyzed—based on markup and HTTP headers only, of course. One could also check websites for how quickly they were updated after major and minor/patch releases of the used CMS.
The website type can be analyzed as well—taking only the front page into account. And analytics and SEO can be evaluated, too.
Limitations
Yes, there are limitations. And depending on what you want to do and what the server gives you, there might even be quite a few.
For several HTTP headers, the server can send (or not send, that is) whatever data it likes. Also, lots of the meta data included in the website’s markup cannot be fully trusted.
However, one can certainly do their very best with the data available, with the necessary awareness that not everything is the absolute truth.
So?
What is it that you always wanted to know about your favorite CMS, analysis-wise?
In your personal opinion, which are the three most interesting aspects when comparing all websites for each individual CMS, and what are your top 3 when comparing the content management systems with each other?
Is there anything you learned somewhere about a CMS that you just cannot believe?
Shoot! It really would mean so much to me.
Some of Your Suggestions
To keep a better overview of the things suggested either in the comments below or via other channels, such as social media and IM, I included the following list. I will update it whenever something relevant (and manageable 😉) is suggested.
Switching from One to Another CMS
For all websites proven to run Drupal, Joomla or WordPress now, check what they were running one and two years ago.
Content Analysis
How large or complex (in terms of hierarchy) is the website?
What is the content all about?
Accessibility
Scan a website for accessibility features, for example, Accessible Rich Internet Applications (ARIA) attributes.
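As a very crude proxy, one could count ARIA roles and aria-* attributes in the stored markup; a sketch (a real accessibility audit would, of course, need far more than this):

```python
# Count ARIA roles and aria-* attributes as a rough accessibility indicator.
import re

def aria_usage(html):
    return {
        "roles": len(re.findall(r'\brole=["\']', html)),
        "aria_attributes": len(re.findall(r'\baria-[a-z]+=', html)),
    }
```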
Activeness
How often is a website updated, and when was the last time? This information could be gained from parsing RSS feeds, for example.
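A sketch of that idea, assuming the feed URL is known (e.g., /feed/ for WordPress); it only uses the standard library, and the date handling is simplified.

```python
# Fetch an RSS feed and return the publication dates of the most recent items.
from email.utils import parsedate_to_datetime
from urllib.request import urlopen
from xml.etree import ElementTree

def last_post_dates(feed_url, limit=10):
    tree = ElementTree.parse(urlopen(feed_url, timeout=10))
    dates = [
        parsedate_to_datetime(el.text)
        for el in tree.iter("pubDate")
        if el.text
    ]
    return sorted(dates, reverse=True)[:limit]
```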
Hi Thorsten,
I am not quite sure: do you also have/build up historical data? I guess so, because:
> One could also check websites for how quickly they had been updated after major and minor/patch releases of the used CMS.
In this case it would be interesting to see if/how often/to what the CMS of a website changes. Did the site have WordPress and is now using Joomla? Was the site custom-made and did it switch to WordPress, and stuff like this.
Hi David!
Not I personally, but yes, there are various scans available (e.g., the data on scans.io, which you can also query via Censys).
Yes, this is surely something worth investigating. Thanks. 🙂
Hi Thorsten,
So, you’ll build whatever scanner I can come up with, right? 😉
What I would find interesting is data that is more rarely found about the contextual use of the different CMSs.
* How large/complex is the site? (Crawl analyzing the hierarchical structure and volume of unique content)
* How much (estimated) traffic does the site get? (Alexa, SemRush, …)
* What topics is the site about? (Facebook DeepText, Wit.AI, …)
I would love to have that type of data for discussions about market share of the different CMSs, as I suspect that each CMS targets a subset of possible websites only.
Let me know when my data’s ready! 😉
Cheers,
Alain
Hey Alain!
Err, yeah, sure. This is what this all is about… (Not.) 😉
While all of this is, of course, more than just interesting, I fear this is not (easily) possible—for me.
The main problem is that crawling a complete site or even running text analyses on its contents is not something you want to do with, say, the Alexa top 10 million websites, or even the whole IPv4 address space. 😀 However, one could do this for a more restricted data set (i.e., IP addresses or domains). I have absolutely no prior knowledge in any of this, though.
Thanks a lot for the suggestions. 🙂
How is what you are attempting different than what already exists at https://builtwith.com/ ?
Hi Ryan!
It is correct that there are several tools, services, or databases that can help answer one or more of the questions or analysis criteria I mentioned in this post. BuiltWith is one of these, and it gives you, for example, detailed information about (some of) the technology of a specific website, as well as the market share of a tool (e.g., a specific content management system). There are also detailed reports and/or surveys about some of the things included in this post, for example, on W3Techs or Netcraft.
The idea here is not primarily to reinvent the wheel just because, but rather to do things myself. Things I can reproduce at any time. Things I have absolute control over. Things I can run multiple analyses on, which means I have the exact same websites for all kinds of inspections.
Also, I plan to evaluate a couple of criteria that I didn’t find anywhere before. And last but not least, I wrote this post to ask you for things you are interested in. I hope for several ideas that have not yet been analyzed (or not for the suggested set of websites, maybe), and so far I am pretty happy with the response (not only here, but also on other channels).
Does this make sense? 🙂
Cheers,
Thorsten
I would like to get info on the extensions used and their versions. Best would be if it also showed whether there is any vulnerability in them. So a security scanner would be great.
Hi there.
First of all, I don’t know if you have a specific CMS in mind, but for WordPress there is WPScan as well as the WPScan Vulnerability Database.
Scanning (or probing) for extensions is nothing you can easily do, most of the time. If a server is misconfigured, for example, if directory listing is enabled/possible, you might have an angle. Or if you search for the existence of one or a few specific extensions (e.g., probing for /wp-content/plugins/jetpack/css/jetpack.css), you may get the information you are interested in. This does not mean you can enumerate all activated (or even installed) extensions, though. And analyzing a website’s front-end markup will only exhibit extensions that actually leave a trace in the HTML. Extensions that are only used in the back end, or only on very few front-end pages, cannot be detected.
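For illustration, a minimal sketch of such a probe (the plugin paths are just examples, and a 200 response only suggests the files are present, not that the plugin is active):

```python
# Probe a site for a few well-known plugin assets via HEAD requests.
import requests

KNOWN_PLUGIN_ASSETS = {
    "Jetpack": "/wp-content/plugins/jetpack/css/jetpack.css",
    "Contact Form 7": "/wp-content/plugins/contact-form-7/includes/css/styles.css",
}

def probe_plugins(base_url):
    found = []
    for name, path in KNOWN_PLUGIN_ASSETS.items():
        response = requests.head(base_url.rstrip("/") + path, timeout=10)
        if response.status_code == 200:
            found.append(name)
    return found
```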
Also, like I mentioned to Alain (a few comments up), these sorts of expensive analyses cannot (easily) be done for huge data sets, such as the Alexa top 10 million websites.
Thank you for your comment, though! 🙂
Cheers,
Thorsten
Hey Thorsten,
How about parsing the RSS feeds to make a guess at how often and how recently the site was updated, to add the dimension of “active/inactive” to your data?
Alain
Wow, that is a brilliant idea! Thanks a lot!
I use this approach on https://wproll.com to rank sites by (1) how often the blog is updated and (2) how many of the posts mention “WordPress”.