
Symbolic Forest

A homage to loading screens.

Blog : Posts tagged with ‘performance’

We can rebuild it! We have the technology! (part three)

Introducing Pug

If you want to start reading this series of articles from the start, the first part is here. In the previous part we discussed how I adapted Wintersmith to my purposes, adding extra page generators for different types of archive page, and refactoring them to make sure that I wasn’t repeating the same logic in multiple places, which is always a good process to follow on any sort of coding project. This post is about the templating language that Wintersmith uses, Pug. When I say “that Wintersmith uses”, incidentally, you should always add a “by default” rider because, as we saw previously, adding support for something else generally wouldn’t be too hard to do.

In this case, though, I decided to stick with Pug because rather than being a general-purpose templating or macro language, it’s specifically tailored towards outputting HTML. If you’ve ever tried playing around with HTML itself, you’re probably aware that the structure of an HTML document (or the Document Object Model, as it’s known) has to form a tree of elements, but also that the developer is responsible for making sure that it actually is a valid tree by ending elements in the right order. In other words, when you write code that looks like this:

<section><article><h2>Post title</h2><p>Some <em>content</em> here.</p></article></section>

it’s the developer who is responsible for making sure that those </p>, </article> and </section> tags come in the right order; that the code ends its elements in reverse order to how they started. Because HTML doesn’t pay any attention to (most) white space, the closing tags have to be supplied explicitly. Pug, on the other hand, uses indentation to represent the tree structure of the document, meaning that you can’t accidentally create a document that doesn’t have a valid structure. It might be the wrong structure, if you mess up your indentation, but that’s a separate issue. In Pug, the above line of HTML would look like this:

section
  article
    h2 Post title
    p Some
      em content
      | here.

You specify the content of an element by putting it on the same line or indenting the following line; elements are automatically closed when you reach a line with the same or less indentation. Note that Pug also assumes that the first word on each line is a tag name, and that we can suppress this assumption with the | symbol, which marks the rest of the line as plain text. You can supply attributes in brackets, so an <a href="target"> ... </a> element becomes a(href="target") ..., and Pug also has CSS-selector-style shortcuts for the class and id attributes, because they’re so commonly used. The HTML code

<section class="mainContent"><article id="post-94">...</article></section>

becomes this in Pug:

section.mainContent
  article#post-94 ...
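To make the shorthand concrete, here is a minimal sketch of how a class or id shortcut maps onto an HTML opening tag. The function name expandShorthand is made up for illustration, and this is not how Pug itself parses templates; it only handles a tag name followed by .class and #id parts.

```javascript
// Hypothetical helper: expands a "tag.class#id" shorthand into an HTML
// opening tag. Illustration only; Pug's real parser does far more.
function expandShorthand(shorthand) {
  // The tag name is the leading run of letters and digits, if there is one.
  const tagMatch = shorthand.match(/^[a-z][a-z0-9]*/i);
  const tag = tagMatch ? tagMatch[0] : "div"; // Pug falls back to div when no tag is given

  // At most one id, and any number of classes.
  const idMatch = shorthand.match(/#([\w-]+)/);
  const classes = [...shorthand.matchAll(/\.([\w-]+)/g)].map((m) => m[1]);

  let html = `<${tag}`;
  if (classes.length > 0) html += ` class="${classes.join(" ")}"`;
  if (idMatch) html += ` id="${idMatch[1]}"`;
  return html + ">";
}

console.log(expandShorthand("section.mainContent")); // <section class="mainContent">
console.log(expandShorthand("article#post-94"));     // <article id="post-94">
```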

So far so good; and I immediately cracked on with looking at the pages of the old WordPress blog and converting the HTML of a typical page into Pug. Pug also supports inheritance and mixins (a bit like functions), so I could repeat the exercise of refactoring common code into a single location. The vast majority of the template code for each type of page sits in a single layout.pug file, which is inherited by the templates for specific types of page. It defines a mixin called post() which takes the data structure of a single post as its argument and formats it. The template for single posts is reduced to just this:

extends layout
block append vars
  - subHeader = '';
block append title
  | #{ ' : ' + page.title }
block content
  +post(page)

The block keyword is used to either append to or overwrite specific regions of the primary layout.pug template. The content part of the home page template is just as straightforward:

extends layout
block content
  each article in articles
    +post(article)

I’ve omitted the biggest part of the home page template, which inserts the “Newer posts” and “Older posts” links at the bottom of the page; you can see though that for the content block, the only difference is that we iterate over a range of articles—chosen by the page generator function—and call the mixin for each one.

The great thing about Pug, though, is that it lets you drop out into JavaScript at any point to run bits of code, and when doing that, you don’t just have access to the data for the page it’s working on; you can see the full data model of the entire site. This makes it easy to do things such as output the sidebar menus (I say sidebar; they’re at the bottom if you’re on mobile) with content that includes things like the number of posts in each month and each category. In the case of the tag cloud, it effectively has to put together a histogram of all of the tags on every post, which we can only do if we have sight of the entire model. It’s also really useful to be able to do little bits of data manipulation on the content before we output it, even if they’re only small cosmetic things. The mixin for each post contains the following JavaScript, to process the post’s categories:

- if (!Array.isArray(thePost.metadata.categories)) thePost.metadata.categories = [ thePost.metadata.categories ]
- thePost.metadata.categories = Array.from(new Set(thePost.metadata.categories))

The - at the start of each line tells Pug that this is JavaScript code to be run, rather than template content; all this code does is tidy up the post’s category data a little, firstly by making sure the categories are an array, and secondly by removing any duplicates.
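Outside a template, the same two steps can be sketched as a small standalone function. The name normaliseCategories and the category name "Geekery" are invented for this example; unlike the template lines, this version returns a new array rather than updating the post in place.

```javascript
// Hypothetical standalone version of the two template lines above.
function normaliseCategories(categories) {
  // Wrap a single value in an array so later code can always iterate over it.
  const asArray = Array.isArray(categories) ? categories : [categories];
  // A Set keeps one copy of each value, in first-seen order.
  return Array.from(new Set(asArray));
}

console.log(normaliseCategories("Artistic"));                          // [ 'Artistic' ]
console.log(normaliseCategories(["Artistic", "Geekery", "Artistic"])); // [ 'Artistic', 'Geekery' ]
```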

You can, however, get a bit carried away with the JavaScript you include in the template. My first complete design for the blog, it turned out, took somewhere between 90 minutes and 2 hours to build on my puny laptop; not really helpful if you just want to knock off a quick blog post and upload it. That’s because all of the code I had written to generate the tag cloud, the monthly menus and the category menus was in the template, so it was being recomputed over and over again for each page. If you assume that the time taken to generate all those menus is roughly proportional to the number of posts on the blog, O(n) in computer science terms (I haven’t really looked into it; it can’t be any better but it may indeed be worse), and that the number of pages also grows in proportion to the number of posts, then the time taken to generate the whole blog becomes O(n²), which translates as “this doesn’t really scale very well”. The garden blog with its sixtyish posts so far was no problem; for this blog (over 750 posts and counting) it wasn’t really workable.
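The reasoning above can be put into rough numbers. This sketch assumes one page per post and that building the menus looks at every post exactly once; both are simplifications, and the function name is made up for the example.

```javascript
// Counts how many times a post is examined when every page rebuilds the
// menus from scratch. Assumes one page per post (archive pages ignored).
function visitsWhenRebuiltPerPage(postCount) {
  let visits = 0;
  for (let page = 0; page < postCount; page++) {
    // Each page walks the whole list of posts to build its menus.
    for (let post = 0; post < postCount; post++) {
      visits++;
    }
  }
  return visits;
}

console.log(visitsWhenRebuiltPerPage(60));  // 3600
console.log(visitsWhenRebuiltPerPage(750)); // 562500
```

Under those assumptions, going from 60 posts to 750 is 12.5 times as many posts but about 156 times as much menu-building work.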

What’s the solution to this? Back to the Wintersmith code. All those menus are (at least with the present design) always going to contain the same data at any given time, so we only ever need to generate them once. So, I created another Wintersmith plugin, cacher.coffee. The JavaScript code I’d put into my layout templates was converted into CoffeeScript code, called from the plugin registration function. It doesn’t generate HTML itself; instead, it generates data structures containing all of the information in the menus. If you were to write it out as JSON it would look something like this:

"monthData": [
  { "url": "2020/10/", "name": "October 2020", "count": 4 },
  { "url": "2020/09/", "name": "September 2020", "count": 9 },
  ...
],
"categoryData": [
  { "name": "Artistic", "longer": "Posts categorised in Artistic", "count": 105 },
  ...
],
"tagData": [
  { "name": "archaeology", "count": 18, "fontSize": "0.83333" },
  { "name": "art", "count": 23, "fontSize": "0.97222" },
  ...
]

And so on; you get the idea. The template then just contains some very simple code that loops through these data structures and turns them into HTML in the appropriate way for each site. Doing this cut the build time down from up to two hours to around five minutes. It’s still not as quick to write a post here as it is with something like WordPress, but five minutes is a liveable amount of overhead as far as I am concerned.
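As a rough illustration of the build-once idea, here is a sketch in JavaScript rather than the plugin’s CoffeeScript. It is not the code from cacher.coffee: the function name buildMenuData and the shape of the post objects (a date string and a tags array) are assumptions made for the example, and the name and fontSize fields are left out because the source doesn’t say how they are derived.

```javascript
// Hypothetical post shape: { date: "2020-10-14", tags: ["art", "archaeology"] }
// Walks the list of posts once and returns the counts the menus need.
function buildMenuData(posts) {
  const tagCounts = new Map();
  const monthCounts = new Map();

  for (const post of posts) {
    for (const tag of post.tags) {
      tagCounts.set(tag, (tagCounts.get(tag) || 0) + 1);
    }
    const month = post.date.slice(0, 7); // e.g. "2020-10"
    monthCounts.set(month, (monthCounts.get(month) || 0) + 1);
  }

  // Tags in alphabetical order, as in a tag cloud.
  const tagData = [...tagCounts]
    .map(([name, count]) => ({ name, count }))
    .sort((a, b) => (a.name < b.name ? -1 : 1));

  // Months newest first, with the "2020/10/" style of URL.
  const monthData = [...monthCounts]
    .map(([month, count]) => ({ url: month.replace("-", "/") + "/", count }))
    .sort((a, b) => (a.url < b.url ? 1 : -1));

  return { tagData, monthData };
}

// Built once per site build, then handed to every page's template.
const menuData = buildMenuData([
  { date: "2020-10-14", tags: ["art"] },
  { date: "2020-10-02", tags: ["art", "archaeology"] },
  { date: "2020-09-30", tags: ["archaeology"] },
]);
console.log(JSON.stringify(menuData.monthData));
```

The point of the design is that the single pass over the posts happens once, and each page only pays for a loop over the finished menu data.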

The Plain People Of The Internet: So, you’re saying you got it all wrong the first time? Wouldn’t it all have been fine from the start if you’d done it that way to begin with?

Well, yes and no. It would have been cleaner code from the start, that’s for certain; the faster code also has a much better logical structure, as it keeps the code that generates the semantic content at arm’s length from the code that handles the visual appearance, using the data structure above as a contract between the two. Loose coupling between components is, from an architectural point of view, nearly always preferable to tight coupling. On the other hand, one of the basic principles of agile development (in all its many and glorious forms) is: don’t write more code than you need. For a small side project like this blog, the best course of action is nearly always to write the simplest thing that will work, be aware that you’re now owing some technical debt, and come back to improve it later. The difficult thing in the commercial world is nearly always making sure that that last part actually happens, but for a site like this all you need is self-discipline and time.

That just about covers, I think, how I learned enough Pug to put together the templates for this site. We still haven’t covered, though, the layout itself, and all the important ancillary stuff you should never gloss over such as the build-deploy process and how it’s actually hosted. We’ll make a start on that in the next post in this series.

Performance

In which things turn to treacle

I’ve noticed, over the past few months or so, that sometimes this site seems to load rather slowly. The slow periods didn’t seem to match any spikes in my own traffic, though, so I didn’t see that there was necessarily much I could do about it; moreover, as it wasn’t this site’s traffic that seemed to be causing the problem, I wasn’t under any obligation to do anything about it.

As I’ve mentioned before, a few months back I switched to Google Analytics for my statistics-tracking. Which is all well and good; it has a lot more features than I had available previously. Its only limitation is: it uses cookies and JavaScript to do its work. Because of that, it only logs visits by real people, using real browsers,* and not spiders, robots, RSS readers or nasty cracking attempts. Often, especially if you’re a marketing person, that’s exactly what you want. If you’re into the geekery, though, it can cover up exactly what’s going on, traffic-wise, at the server level.

Searching my logs, rather than looking at the Google statistics, showed that I was getting huge numbers of hits for very long URLs, consisting of valid paths joined together by lots of directories named ‘&’:

[Screenshot: logfile extract]

That’s a screenshot of a single request in the logfile – the whole thing being about 850 characters long. ‘%26’ is an encoded ‘&’ character. Because of the way WordPress works, these things are valid URLs, and requests for them were coming in at a pretty fast rate. Before long, the request rate was faster than the page generation time – and that’s when the problem really starts to build up, because from there things snowball until nobody gets served.
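For anyone wanting to spot this sort of request in their own logs, here is a minimal sketch. The function name and the example path are made up (the real requests were far longer), and it assumes you have already pulled the request path out of each log line.

```javascript
// Hypothetical check: counts path segments that decode to a bare "&".
function countAmpersandSegments(requestPath) {
  return requestPath.split("/").filter((segment) => {
    try {
      return decodeURIComponent(segment) === "&";
    } catch {
      return false; // malformed percent-escape, so not one of ours
    }
  }).length;
}

console.log(decodeURIComponent("%26")); // &

// Invented example path, much shorter than the real 850-character requests.
console.log(countAmpersandSegments("/2009/03/%26/category/geekery/%26/2008/11/")); // 2
```

A path with more than a couple of these segments is unlikely to have come from a real link on the site, so a count like this could be used as a simple filter.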

All these requests were coming from a single IP address, an ordinary consumer type of address in Italy.** Moreover, the user-agent was being disguised. Each hit was coming in from the same IP address, but with a different-but-plausible-looking user-agent string, so the hits looked like a normal, ordinary browser with a real person behind it.

The problem was solved fairly easily, to be honest; and the site was soon behaving itself again. It should still be behaving itself now. But if you came here yesterday afternoon and thought the site didn’t seem to be working very well, that’s why. I’m going to have to keep an eye on things, to see if it starts happening again.

* and only if they have JavaScript enabled, at that, although I know that covers 99% of the known world nowadays.

** which made me think to myself: “I know I’ve pissed people off … but none of them are Italian!”