Welcome to ‘djnz’s hackblog’
Paul Mobbs’ blog on ‘techno-Luddism’, documenting ideas for low-tech engineering and technology hacking to help people regain control over the tools in our lives.
© January 2021 Paul Mobbs. This paper is released under the Creative Commons Attribution Non-Commercial Share-Alike 4.0 International license.
‘djnz’s hackblog’ no.2, 7th January 2021
Last updated 2021-01-21
♳ Web bloat was driven by analytics and syndication
♴ The ecological catastrophe that is ‘HTTPS’
♵ Designing-out bloat from the web
♶ Creating my page template
A decade ago I wrote a lengthy piece on ‘web bloat’ – the trend for on-line resources to demand more-and-more data-heavy transmissions to send a web page or other files. At the time the FRAW site rated really well. A decade later, as web standards and guiding business models have changed, it’s time to revisit this idea and see if it’s possible to improve upon what went before. That begins with a very simple problem: How do you design your default template?
A little eye-opener for those who were not aware: While looking at this page, press Ctrl+U on the keyboard and see what you get. Perhaps scroll down a bit and see if you recognise anything from what you have read thus far.
Now go to another site, let’s say a story in The Independent news site (admittedly, one of the worst offenders), and do the same thing again.
Notice the difference? How much ‘web bloat’ a page contains is a design issue; and the control of bloat has a direct link to the ecological impact of a web resource.
Web bloat was driven by analytics and syndication
Behind every pretty web page is an orgy of formatting codes, but these days there’s much more than that. Most commercial pages are festooned with ‘widgets’ that do everything from adding lists of syndicated content to the page, to tracking exactly which page you are looking at, and possibly even which bit of the page you are looking at and for how long, so powerful have these tracking widgets become.
Today most pages are generated dynamically by the web server from a database, formatted according to a template for that resource each time you ask to download it, and in the process automatically inserting whatever widgets they are paid to include at that time. This is what allows pages to track individuals, because, depending upon the tracking data received from your browser, they can composite a page specifically for you alone.
That also burns a lot of power in data centres to generate that content, even before it’s sent out over the ‘Net to your device – essentially just warehousing huge amounts of live data on everyone on the off chance you’ll browse a page at that time.
In contrast the pages on the FRAW site are static; they are files stored on a hard drive. At most, server-side includes are used for the index pages to cut down on duplication and make page design simpler, but quite literally most pages just go straight from the hard drive onto the ‘Net – or from the server cache, for the most used files, reducing the time taken because they don’t need to be reloaded.
Here’s a good example of why FRAW does not dynamically ‘embed’ content:
Example 1 is pretty much the simplest form of web page to display a link to a YouTube video (uncoincidentally, the video is all about widgets and analytics – it’s worth a watch!).
Example 1: Linked YouTube video page

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <title>Test Page</title>
      <meta charset="utf-8" />
      <meta http-equiv="content-language" content="en-GB" />
      <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
      <meta http-equiv="content-style-type" content="text/css" />
    </head>
    <body><div>
      <p><a href="https://www.youtube.com/watch?v=6EHSlhnE6Ck" target="_top" title="YouTube link">YouTube link</a></p>
    </div></body>
    </html>
Note that the HTML code boxes in the page have been designed to display all the ‘invisible’ content in dark red, and only the parts the user can see in dark blue.
Let’s save the ‘whole page’ to a folder to capture both the HTML page and all the other components it loaded. Now run the du command against that folder:

    $ du -sb test_page
    4503    test_page
In other words, that entire page only requires 4,503 bytes of data to display it – and it only takes one file sent from the server to the browser to display that.
Example 2: Embedded YouTube video page

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <title>Test Page</title>
      <meta charset="utf-8" />
      <meta http-equiv="content-language" content="en-GB" />
      <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
      <meta http-equiv="content-style-type" content="text/css" />
    </head>
    <body><div>
      <iframe width="560" height="315" src="https://www.youtube.com/embed/6EHSlhnE6Ck" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen>
      </iframe>
    </div></body>
    </html>
Now look at example 2.
Instead of the <a> tag of a web link, we have the <iframe> tag that tells the browser to embed data from another source – much like web servers do all the time when they syndicate data from other sites.
Save the ‘whole page’ to a different folder and then run the du command:

    $ du -sb test_page-embed
    2307765 test_page-embed
In other words, the embedded version requires 2,307,765 bytes, over 500 times more than the linked page. What’s more, displaying that required a further 8 files to be loaded, which then creates tracking data not only with youtube.com, but also doubleclick.net (Google’s commercial data analytics and marketing arm), google.com, and gstatic.com (Google’s servers which send static content out on-line).
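For readers without the du command, the same measurement can be sketched in Python. This is a minimal illustration rather than the method used for the figures above: `folder_size` and `size_ratio` are hypothetical helper names, and the sketch adds up file sizes only, whereas `du -sb` also counts the size of the directory entries themselves, so the two totals can differ by a few kilobytes.

```python
import os


def folder_size(path):
    """Return (total_bytes, file_count) for every regular file under path."""
    total_bytes = 0
    file_count = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            # Skip symbolic links so nothing outside the folder is counted.
            if os.path.islink(full_path):
                continue
            total_bytes += os.path.getsize(full_path)
            file_count += 1
    return total_bytes, file_count


def size_ratio(larger_bytes, smaller_bytes):
    """How many times bigger one measurement is than another."""
    return larger_bytes / smaller_bytes


if __name__ == "__main__":
    # The two du totals quoted in the text, compared directly.
    print(round(size_ratio(2307765, 4503)))
```

Pointing `folder_size` at each saved-page folder in turn gives both the byte total and the number of files that had to be fetched.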
A simple rule: “Don’t embed content”. Once you embed you not only introduce bloated content beyond your control, but alongside that you will introduce widgets and other mechanisms that generate surveillance data for analytics companies.
The ecological catastrophe that is ‘HTTPS’
The FRAW site deliberately does not use the HTTPS protocol due to its heavier ecological footprint. HTTPS gives the presumption of ‘security’, and yes, it is important for those operations which need to be truly secure such as financial transactions. But it is not used in a way that logically segregates what is ‘secure’ from what is ‘irrelevant’.
The fact is if you want to be truly ‘secure’ in your web browsing – especially as HTTPS downloads all those widgets that will frisk your browser for every shred of information about you – there are far more secure ways of doing that than HTTPS alone.
The ecological flaw in HTTPS, though, is what happens when you route everything through HTTPS by default – irrespective of whether it needs security or not. That creates some very impactful side-effects from how the protocol shifts data around the web. At the same time you can’t justify this with the claim of increased privacy or security, when much of the content served over HTTPS embeds a myriad of surveillance widgets automatically.
When a page is requested over an unencrypted HTTP connection:
- The browser sends a request to the server for the specific page; and,
- The server replies with the page data, or an error code if the request is not valid.
If the downloaded page contains links to other content – like the embedded YouTube video above – then the browser makes separate requests for each of those extra files, which are also sent directly from whichever server they are located on.
HTTPS uses the TLS protocol to encrypt data in transit. What that means is:
- Your browser says “hello”;
- The server replies with its HTTPS certificate data and a public encryption key (if your browser gives you a warning about an ‘insecure site’, this bit doesn’t work);
- The browser checks the certificate and replies with its own key data;
- Both sides use that exchange to derive a shared encryption key for the session being carried out;
- The browser sends a request for the file required, encrypted with the session key;
- The server replies with the file data, encrypted with the same session key;
- The browser says, “thanks, bye”.
If the page contains more links for other resources sent via HTTPS, then it repeats the above for each one.
It’s secure(ish). Problem is each of those network connections involves transmission over the ‘Net – which burns energy and bandwidth.
Now think specifically about the nature of the file being transmitted. A request for a large file over HTTPS generates an encryption ‘overhead’ of a few hundred to a few thousand bytes; nothing compared to the size of the files. What if, however, the file you request is itself only a few hundred bytes or kilobytes?
A great many of the components of today’s complex web pages, with all those embedded widgets and other content, are essentially a stack of small files. Each small file has to be separately downloaded over HTTPS, adding a far greater ‘overhead’ to download pages full of widgets and embedded content compared to single large files.
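The effect of a fixed overhead on small files can be put into rough numbers. The sketch below is arithmetic only: the 1,000 byte overhead is an assumed, illustrative value taken from inside the "few hundred to a few thousand bytes" range mentioned above, not a measurement, and the file sizes and function name are hypothetical.

```python
def overhead_share(file_bytes, overhead_bytes):
    """Fraction of the total transfer that is overhead rather than content."""
    return overhead_bytes / (file_bytes + overhead_bytes)


# ASSUMPTION: 1,000 bytes of per-file overhead, an illustrative value only.
ASSUMED_OVERHEAD = 1000

if __name__ == "__main__":
    for label, size in [("tracking snippet", 500),
                        ("small script", 5_000),
                        ("large image", 2_000_000)]:
        share = overhead_share(size, ASSUMED_OVERHEAD)
        print(f"{label:>16}: {size:>9,} bytes, overhead {share:.1%}")
```

Under that assumption two-thirds of the traffic for a 500 byte snippet is overhead, while for a 2 megabyte file it is a small fraction of one percent.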
To put a comparison on that, earlier I suggested downloading a page from The Independent. Download the top index page, save the ‘whole page’, and for me it came to 93 files totalling over 13 megabytes. Most of those files it downloaded were just a few hundred to a few thousand bytes in length – mostly tracking widgets and code snippets. What’s more important is that those 93 files were distributed over 16 separate sites (technically it may be more as the ‘active’ scripting part of that is disabled in my browser to prevent it loading yet more advertising embeds and tracking widgets).
When the browser first contacts a site it gets the HTTPS certificate, but every file downloaded after that can reuse that information for each subsequent transfer from that site. When the browser downloads widgets from 16 separate sites over HTTPS it has to download 16 separate certificates and validate them before it even proceeds to download the data.
We need to accept that HTTPS is an ecologically imperfect solution to a problem largely created by the web’s contemporary drive for marketing-based surveillance – which most people find annoying in any case. If we are ever going to begin to tackle the ecological footprint of the web then we have to dismantle this absurd system. This is the reason why FRAW cannot justify using HTTPS.
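One way to estimate how many separate sites a page draws on is to list the host names in its resource-loading tags. The following is a rough sketch using only the Python standard library; the class and function names are hypothetical, and it only sees resources named in the static HTML, not anything that scripts load afterwards, so it will tend to undercount.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ResourceHostCounter(HTMLParser):
    """Collect the host names a page pulls extra resources from."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # 'src' loads a resource on any tag (script, img, iframe...);
        # 'href' only does so on <link>, not on an ordinary <a> link.
        url = attributes.get("src")
        if url is None and tag == "link":
            url = attributes.get("href")
        if not url:
            return
        host = urlparse(url).hostname
        if host:  # relative URLs have no host: same site, no new certificate
            self.hosts.add(host)


def external_hosts(html_text):
    """Return the set of distinct hosts that the page loads resources from."""
    parser = ResourceHostCounter()
    parser.feed(html_text)
    parser.close()
    return parser.hosts
```

Feeding it the HTML of Example 1 gives an empty set, because a plain link loads nothing; Example 2 gives one host before the embedded frame starts pulling in its own files.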
Designing-out bloat from the web
The point about multiple small files also has relevance for unencrypted HTTP connections; more small files, more impact (albeit not on a scale with HTTPS). What would be useful is finding methods to eliminate the need for small additional files to be downloaded at all.
Why have multiple files?
Modularity: It would be too complicated to maintain a monolithic file, so it is split into many small parts, each controlling or providing one small part of the overall content.
That’s a really strong design principle, with good arguments for its use. That doesn’t detract from the fact that multiple small files use up resources as they are merged to composite a page for transmission – or (more likely) loaded by the browser after transmission from multiple sites.
FRAW does not require adverts or widgets for monetisation. That takes away almost all of the need to adopt these practices. What is left is the need to create a simple design that is easy to maintain – which is the critical issue here.
Instead of web design let’s look at this from a different point of view: Entropy.
Life is a constant struggle against entropy, and in fact life is characterised at the simplest level as a force which can organise against the trend of entropy.
Web sites use large amounts of resources precisely because they need to create large amounts of ordered content tailored to innumerable permutations of data – and creating ‘order’ against the trend of entropy is what uses all that energy.
Let’s turn that idea on its head:
Instead of format-less content that can be randomly reorganised to be something else, why not just accept, like the hard copy manuscripts humans have maintained for millennia, that once content is formatted it will stay like that forever more? Provided the data is properly structured within the page document, there is no barrier to going back in later and retrieving and reformatting it if necessary.
Increasingly much of the formatting data which accompanies web content is there by virtue of the ‘Net’s over-riding business model: Each of the syndicating sites contributes its own formatting for each small section of content it loads, multiplying the load. If we accept, though, that once created the document is a fixed point in time, then there is no reason that extraneous content could not be filtered before the page was composited.
The final point in reducing the entropy of a page design is the variety of elements within it. Reduce the formatting with more ‘thoughtful’ design, which sought to minimise the need for additional content, and it is possible to shrink the formatting content accordingly.
OK, that’s a wish list. I can hear the screams of corporate web designers that, “it would make my job impossible”. But that’s the whole point here: Adapting to the inevitable ecological catastrophe which flows from a failure to change is not an option; hence redesigning the web for less systemic complexity and greater content simplicity, and with a more enduring ‘static’ quality for data includes, is the only viable alternative.
Creating my page template
The whole purpose of this article has been to ‘write something’, thereby allowing me to create a simple template which supports that design. If the thing I write about just happens to document the design principles for the template, well, it has a nice iterative synchronicity that also saves time. (ooops!, I used the ‘D-word’ – I know how the concept of ‘documentation’ scares some people!)
So let’s run through the whole piece again:
Static design with no dynamically embedded content? Check.
No HTTPS is a given of hosting on the FRAW site. Check.
Minimising the upload of additional image and style information… Ah!!
This is where things get technical.
It is possible to embed images as base64 encoded text (see ‘Embedding Images’ box).
Converting binary images to text strings requires a little practice with image editing, in order to reduce not simply the physical size of the image in pixels, but also the number of colours, or to convert a high colour image to ‘halftone’ dots or shades that produce the same effect in fewer colours.
E.g. when reducing to an indexed image on GIMP, select ‘Positioned’ or ‘Floyd-Steinberg’ options to reduce to halftones.
This allows the elimination of a number of images:
- The page icon image was reduced to 64 pixels square at 12 colours, then the 622 byte PNG converted to 833 bytes of text;
- The page background was reduced to a 20 pixel square 4-colour greyscale image, then the 242 byte PNG converted to 329 bytes of text;
- The Creative Commons logo was reduced to an 863 byte PNG, then into a 1,168 byte text string.
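The conversion described in the list above can be sketched in a few lines of Python. This is an illustration of the general technique, not the tool actually used for this page: the function names and the file name `icon.png` are hypothetical. The encoded text comes out roughly a third larger than the binary file, which is in line with the sizes listed.

```python
import base64
from html import escape


def to_data_uri(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a data: URI usable in src="" or CSS url()."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def img_tag(image_bytes, alt_text, mime_type="image/png"):
    """Build an <img> element that carries its own image data."""
    uri = to_data_uri(image_bytes, mime_type)
    return f'<img src="{uri}" alt="{escape(alt_text, quote=True)}" />'


if __name__ == "__main__":
    # HYPOTHETICAL file name: substitute any small PNG.
    with open("icon.png", "rb") as handle:
        print(img_tag(handle.read(), "page icon"))
```

The resulting tag can be pasted straight into the page, so the image arrives inside the HTML file instead of as a separate request.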
The challenge is the larger images used in the text – for example the diagram of the file components of a web page shown earlier.
Bigger image, longer text string: Due to the inherent overhead of converting binary image data to base64 text, the increasing size of the text string off-sets the benefit of undertaking this process. That is a good reason never to embed an image of more than 30 to 60 kilobytes – it’s pointless in terms of the equivalent impact.
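The growth of that penalty with image size follows from the encoding itself: base64 turns every 3 bytes of binary data into 4 characters of text. A small sketch of the arithmetic, with hypothetical function names, and ignoring any line breaks added to the encoded text or any compression applied later:

```python
import math


def base64_length(binary_bytes):
    """Characters needed to base64-encode binary_bytes (padding included)."""
    return 4 * math.ceil(binary_bytes / 3)


def embedding_penalty(binary_bytes):
    """Extra bytes sent when an image is embedded as text instead of binary."""
    return base64_length(binary_bytes) - binary_bytes


if __name__ == "__main__":
    for size in (242, 863, 30 * 1024, 60 * 1024):
        print(f"{size:>6} byte image: +{embedding_penalty(size):,} bytes as text")
```

For the small images the penalty is tens or hundreds of bytes; at 30 to 60 kilobytes it is 10 to 20 kilobytes. Whether that outweighs the cost of one extra request depends on the connection overhead, which this sketch does not model.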
So, extraneous images embedded as text, Check.
Scripting files – that’s easy, as on the FRAW site they’re not used unless absolutely necessary. But even if a script were required in the page, it could easily be added as a <script> tag in the page header section rather than being loaded as a file. Check.
Content formatting… again, quite possibly the hardest part of the whole exercise.
The format of web pages is what gives them character. Problem is, the history of content formatting means there is a lot of old baggage kicking around in the system. The FRAW site began in 1995, and there are still a few files kicking around from that era with their very basic HTML version 2 formatting from the late 1990s.
That problem of formatting was compounded with the introduction of style sheets (or ‘CSS’) shortly after. It allows infinite variation, and with that endlessly chained style sheets can easily consume more and more resources; not helped by the fact that as the standard has advanced its requirements were never consistently implemented by all browsers – meaning some pages have multiple style sheets, uploading a different one depending upon the browser or the type of screen used.
Let’s take a strategic decision: Forget all of that which has gone before.
I have a text document. What is a text document? It’s just lots of paragraphs.
In which case, do we need headings? A heading has an important logical function to the reader, but to the machine it’s irrelevant. In which case why not just use a style ‘class’ selector to define a heading and apply it to a paragraph? – thereby eliminating the need to define styles for the traditional <Hx> heading tags.
Likewise, why have specific kinds of <p> to format elements? Why not define a class, and then that could be applied to a <div>, a <pre>, or a <p> tag equally. Likewise rather than having different combinations of <em> or <span> etc., a class called “quotation” could be equally applied to a <p>, <em> or <span> without duplication.
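To make the idea concrete, here is a small Python sketch that assembles a page in this manner: every block is a <p>, and a class decides whether it reads as a heading, a quotation or plain text. It is not the template used for this page. The style rules, the ‘heading’ class name and the function names are all hypothetical; only the ‘quotation’ class is named in the text above.

```python
from html import escape

# HYPOTHETICAL class names and rules, for illustration only.
STYLE = """
.heading   { font-size: 1.4em; font-weight: bold; }
.quotation { font-style: italic; margin-left: 2em; }
"""


def paragraph(text, css_class=None):
    """Every block is a <p>; a class, not a different tag, sets its role."""
    attribute = f' class="{escape(css_class, quote=True)}"' if css_class else ""
    return f"<p{attribute}>{escape(text)}</p>"


def build_page(title, blocks):
    """Assemble one flat HTML file with the style sheet embedded in it.

    blocks is a list of (text, css_class) pairs; css_class may be None.
    """
    body = "\n".join(paragraph(text, css_class) for text, css_class in blocks)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n<head>\n'
        '<meta charset="utf-8" />\n'
        f"<title>{escape(title)}</title>\n"
        f"<style>{STYLE}</style>\n"
        "</head>\n"
        f"<body><div>\n{body}\n</div></body>\n</html>\n"
    )
```

Calling `build_page("Test Page", [("A heading", "heading"), ("Some text", None)])` returns a single string with no heading tags and no external style sheet to fetch.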
OK, if you’re not an HTML/CSS geek, what I’m saying sounds like gobbledygook – whereas for many HTML/CSS geeks it’ll just be heresy. The ability to endlessly refine and extend is self-justifying; arguing that we should return to something akin to web design circa 1998 would negate that tendency.
It comes back to the point about entropy: The more differentiation of styles, the greater the ‘information space’ which must be consumed to represent it; so by simplifying both the visual style of the document, and the way that is represented through mark-up code, it reduces the need to represent lots of different styles – and hence the ‘information space’ which must be encoded into data.
As I’ve been writing this in my HTML editor, I’ve been creating classes along these lines. Now I have reached the end, I seem to have created just the style template I need in order to represent the layout I want to present. Hence, style simplification and embedding, Check.
What I’m now left with is a single flat file that represents this entire document – both images and style information. This leads to one further implication: Compression.
If it is possible to reduce our complex patterns of web design to simple, single flat file formats with all style and graphical content embedded, that makes transmission and archiving so easy. Let’s see… the ~63,700 byte HTML file will reduce to ~29,350 bytes with Gzip and ~27,000 bytes with XZ – essentially halving the size of the entire document.
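The same comparison can be repeated on any single-file page with Python’s standard gzip and lzma modules. This is a minimal sketch with a hypothetical function name and file name; the exact byte counts will vary with compression settings and tool versions, so they need not match the figures quoted above.

```python
import gzip
import lzma


def compressed_sizes(data):
    """Return the byte length of data raw, gzip-compressed and xz-compressed."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "xz": len(lzma.compress(data, format=lzma.FORMAT_XZ)),
    }


if __name__ == "__main__":
    # HYPOTHETICAL file name: point this at any saved single-file page.
    with open("page.html", "rb") as handle:
        sizes = compressed_sizes(handle.read())
    for name, size in sizes.items():
        print(f"{name:>4}: {size:,} bytes ({size / sizes['raw']:.0%} of original)")
```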
Coming back to the entropy issue, as text with simple encoding, it’s arguable that we could preserve the information such files contain far more easily – certainly more easily than the more complex forms of binary or proprietary file format, which create issues about compatibility and portability.
I have terabytes of data archived from the last 35 years of computer use, and have already experienced problems trying to resurrect old data formats even from just 20 years ago. Hence I find the idea of a universally simple way of encoding knowledge, to give it the easiest portability and archive-ability, rather ‘elegant’.
Then it suddenly hits me. All I’ve done today, in trying to find a cutting-edge way of solving the ecological footprint issue of the contemporary web, is create a document which – style sheet interpretation excepted – would have been equally acceptable for my web site 25 years ago! Clearly, the idea that the ‘Net needs a speed limit, and that we need to focus on recapturing the simple, efficient means of encoding data from the early days of the web, must be correct.