Symbolic Forest

A homage to loading screens.

No more cookies!

Or, rather, no more analytics

Regular readers—or, at least, people who have looked at this site before the last month or two—might remember that it used to have a discreet cookie consent banner at the top of the page, asking if you consented to me planting a tracking cookie that I promised not to send to anyone else. It would pop up again about once a year, just to make sure you hadn’t changed your mind. If you clicked yes, you appeared on my Google Analytics dashboard. If you clicked no, you didn’t.

What you probably haven’t noticed is that it isn’t there any more. A few weeks ago now, I quietly stripped it out. This site now puts no cookies of any sort on your machine, necessary or otherwise, so there’s no need for me to ask to do it.

When I first started this site’s predecessor, twenty-something years ago, I found it quite fascinating looking at the statistics, and in particular, looking at what search terms had brought people to the site. If you look back in the archives, it used to be a common topic for posts: “look what someone was searching for and it led them to me!” What to do when you find a dead bat was one common one; another was the lyrics to the children’s hymn “Autumn Days When The Grass Is Jewelled”. It was, I thought (and I might not have been right about this), an interesting topic to read about, and it was certainly a useful piece of filler back in the days of 2005 when I was aiming to publish a post on this site every day, rather than every month. If you go back to the archives for 2005, there’s a lot of filler.

Now, though? Hopefully there’s not as much filler on the site as there was back then. But the logs have changed. Barely anything reaches this site through “organic search” any more; “organic search” is the industry term for “people entering a search phrase in their browser and hitting a link”. Whether this means Google has got better or worse at giving people search results I don’t know. Personally, for the searches I make, Google has got a lot worse for the sort of searches where I didn’t know what site I wanted to go to beforehand, but for the sort of lazy searches where I already know where I want to go, it’s got better. I suspect the first sort were generally the sort that brought people here. Anyway, nearly all the traffic to this site now comes from people who follow me on social media and follow the link when I tell them there’s a new blog post up.

Given that the analytics aren’t very interesting, I hadn’t looked at them for months. And, frankly, do I write this site in order to generate traffic to it? No, I don’t. I write this site to scratch an itch, to get things off my chest, because there’s something I want to say. I write this site in order to write this site, not to drive my income or to self-promote. I don’t really need a hit counter in order to do that. Moreover, I realised that in all honesty I couldn’t justify the cutesy “I’m only setting a cookie to satisfy my own innate curiosity” message I’d put in the consent banner, because although I was just doing that, I had no idea what Google were doing with the information that you’d been here. The less information they can gather on us, the better. It’s an uphill struggle, but it’s a small piece in the jigsaw.

So, no more cookies, no more consent banner and no more analytics, until I come up with the itch to write my own on-prem cookie-based analytics engine that I can promise does just give me the sort of stats that satisfy my own nosiness (which I’m not likely to do, because I have more than enough things ongoing to last me a lifetime already). This site is that little bit more indie, that little bit more IndieWeb, because I can promise I’m not doing anything at all to harvest your data and not sending any of it to any third parties. The next bit to protect you will be setting up an SSL certificate, which has been on the to-do list for some months now; for this site, given that you can’t send me any data, all SSL will really do is guarantee that I’m still me and haven’t been replaced, which isn’t likely to be anything you’re particularly worried about. It will come, though, probably more as a side-piece to some other aspect of improving the site’s infrastructure than anything else. This site is, and always has been, proudly independent, and I hope it always will be.

Know your limits!

Or remember that computers are still not boxes of infinite resource, whatever you might think

Sometimes, given that I often work with people who are twenty years or so younger than me, I feel old. I mean, the archives of this blog go back over twenty years now: these are serious, intelligent colleagues, and when I started writing my first blog posts they were likely still toddlers.

Sometimes, though, that has an advantage. I was thinking of this when debugging some code a colleague had written, which worked fine up to a point, but failed if its input file was more than, say, a few tens of megabytes. When the input reached that size, the whole thing crashed with OutOfMemoryException even on a computer with multiple gigabytes of memory, a hundred times more memory than the hundred-megabyte example file the client had sent.

When I was younger, you see, that would have seemed a ridiculous amount of data, unimaginable to fit in one file. Even when I had my first PC, the thought of a file too big to fit on even a superfloppy like a Zip disk was a little bit mindblowing, even though the PC seemed massive compared to what I’d experienced before.

Back when I was at school, I’d tried to teach myself how to code on an Amstrad CPC, a mid-1980s 8-bit machine with a 64k address space and a floppy disk drive of 180k capacity. It was a second-generation 8-bit home machine really, more powerful than a C64 or a Sinclair Spectrum despite sharing the same CPU as the latter. Unlike those, it had a fully-bitmapped screen with individual pixels all fully addressable; however, that took up 16k of the 64k address space, so the actual code on it had to be pretty damn tight to fit. The programmers’ Firmware Manual (what we’d now call the API reference documentation) is of course scanned and online; one of the reasons I was never very successful coding on the machine itself* was that in the 1980s and 90s copies of it were almost impossible to find once Amstrad’s print run was exhausted. On the CPC, every byte you used counted; a lot of software development houses ended up cross-assembling their code purely because for a large program it was difficult to fit the source code itself onto the machine.** That’s the background I came from, and it still makes me wary nowadays of wasting memory or resources. I’m the sort of developer who will pass an expected size parameter to the List<T> constructor if it’s known, to avoid unnecessary reallocations, and who doesn’t add ToList() automatically by reflex to the end of every LINQ operation, which is a good habit in any case, as long as you know when you do need to.
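To make the List<T> habit concrete, here is a minimal sketch of pre-sizing a list when the eventual count is known in advance. The class and method names (CapacityDemo, BuildSquares) are invented for this example; the only real API involved is the List<T>(int capacity) constructor.

```csharp
using System;
using System.Collections.Generic;

public static class CapacityDemo
{
    // Builds a list of the first `count` square numbers.
    // Passing `count` to the constructor allocates the backing array once,
    // instead of letting it grow (and be copied) repeatedly as items are added.
    public static List<int> BuildSquares(int count)
    {
        var squares = new List<int>(count);
        for (int i = 0; i < count; i++)
        {
            squares.Add(i * i);
        }
        return squares;
    }
}
```

For a handful of items the difference is negligible; it only starts to matter when the list is large or the code is called very often.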

Returning to the present: what had my team member done, then, that he was provoking a machine into running out of memory when in theory he had plenty to play with? Well, there were two problems at work.

Firstly, yes, we’re talking about someone who has never tried building code on a tiny tiny environment. The purpose of this particular code was to take an input zip file, open it, modify some of its content, recompress it, and send it off to an API elsewhere. Moreover, this had been done re-using existing internal code, some of which wanted to operate on a Stream and some of which, for whatever reason I don’t know, wanted to operate on a byte[]. We had ended up with code that received the data in a MemoryStream, unzipped it in memory, and copied the contents out into more MemoryStream objects. Each of those was being copied into a byte array which was being passed to a routine that immediately copied its input into a new MemoryStream, before deserializing…well, you get the idea. The whole thing ended up with many, many copies of the input data in memory, either in essentially its original format, or in a slightly modified form, and all of these copies were still in memory at the end of the process.

Secondly, there was another issue that was not quite so much the developer’s responsibility. This .NET code was being compiled in “Portable” form, and the server was, again for reasons best known to itself, deciding that it should run it with the 32-bit runtime. Therefore, although there should have been 16GB of memory on the server instance, we were working with a 2GB memory ceiling.

I did dig in and rewrite as much of the code as I thought I needed to. Some of the copying could be elided altogether; and as this wasn’t a time-critical piece of code, I changed a lot of the rest to use a temporary file instead of memory. The second issue had an easy, lazy fix: compile the thing as 64-bit only, so the server would have no choice of runtime. As a result I never did get to the bottom of why it was preferring the 32-bit runtime, but I had working, shippable, code at the end of the day, and that’s what mattered here.
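The temporary-file approach can be sketched roughly as follows. This is not the code that was shipped, just an illustration of the idea under some assumptions: the class and method names (TempFileSpooler, SpoolToTempFile) are hypothetical, and the input is assumed to arrive as a readable Stream.

```csharp
using System;
using System.IO;

public static class TempFileSpooler
{
    // Copies `input` into a temporary file and returns a stream positioned at
    // the start of that file. Stream.CopyTo works through a small buffer, so the
    // whole payload never has to be held in memory at once. The temporary file
    // is deleted automatically when the returned stream is disposed.
    public static FileStream SpoolToTempFile(Stream input)
    {
        string path = Path.GetTempFileName();
        var spool = new FileStream(
            path,
            FileMode.Create,
            FileAccess.ReadWrite,
            FileShare.None,
            81920,
            FileOptions.DeleteOnClose);
        try
        {
            input.CopyTo(spool);
            spool.Position = 0;
            return spool;
        }
        catch
        {
            // Don't leave the temporary file behind if the copy fails.
            spool.Dispose();
            throw;
        }
    }
}
```

The returned FileStream can be handed to anything that accepts a Stream, which avoids the round trips between MemoryStream and byte[] described above, at the cost of some disk I/O.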

What I couldn’t help thinking, though, was that the rewriting might not have been needed to begin with. A young developer who’s never worked on a genuinely small system has spent so much time never worrying about working anywhere near the boundaries of what their virtual servers can cope with that, when they do hit those boundaries, it comes as a nasty, sudden shock. They have no idea at all what to do, or even where to start: an OutOfMemoryException may as well be an act of the gods. Maybe when I’m helping train people up, I should give them all an Amstrad CPC emulator and see what the result is.

* My high point was successfully cloning Minesweeper, but with keyboard controls.

** Some software was shipped on 16k ROMs, to go along with third-party ROM socket boxes that attached to the expansion bus; this kept the assembler and editor code out of the main address space, but it could still be difficult to fit the source code and the assembler output in memory at the same time. The ROMs were scanned on boot and each declared named entrypoints which could then be accessed as BASIC commands. At least one game I can remember—The Bard’s Tale—crashed if too many ROMs were attached, because each ROM could reserve an area of RAM for its own bookkeeping, and the game found itself without enough memory available.

Going through things one by one

Or, a coding exercise

One of my flaws is that as soon as I’m familiar with something, I assume it must be common knowledge. I love tutoring and mentoring people, but I’m bad at pitching exactly where their level might be, and in working out what they might not have come across before. Particularly, in my career, software development is one of those skills where beyond a certain base level nearly all your knowledge is picked up through osmosis and experience, rather than through formal training. Sometimes, when I’m reviewing my team’s code I come across things that surprise me a little. That’s where this post comes from, really: a few months back I spotted something in a review and realised it wouldn’t work.

This post is about C#, so apologies to anyone with no interest in coding in general or C# in particular; I’ll try to explain this at a straightforward level, so that even if you don’t know the language you can work out what’s going on. First, though, I have to explain a few basics. That’s because there’s one particular thing in C# (in .NET, in fact) that you can’t do, that people learn very early on that you can’t do, and that you have to find workarounds for. This post is about a very similar situation, which doesn’t work for the same reason, but that isn’t necessarily immediately obvious even to an experienced coder. In order for you to understand that, I’m going to explain the well-known case first.

Since its first version over twenty years ago, C# has had the concept of “enumerables” and “enumerators”. An enumerable is essentially something that consists of a set of items, all of the same type, that you can process or handle one-by-one. An enumerator is a thing that lets you do this. In other words, you can go to an enumerable and say “can I have an enumerator, please”, and you should get an enumerator that’s linked to your enumerable. You can then keep saying to the enumerator: “can I have the next thing from the enumerable?” until the enumerator tells you there’s none left.

This is all expressed in the methods IEnumerable<T>.GetEnumerator()* and IEnumerator<T>.MoveNext(), not to mention the IEnumerator<T>.Current property, which nobody ever actually uses. In fact, the documentation explicitly recommends you don’t use them, because they have easier wrappers. For example, the foreach statement.

List<string> someWords = new List<string>() { "one", "two", "three" };
foreach (string word in someWords)
{
    Process(word);
}

Under the hood, this is equivalent** to:

List<string> someWords = new List<string>() { "one", "two", "three" };
IEnumerator<string> wordEnumerator = someWords.GetEnumerator();
while (wordEnumerator.MoveNext())
{
    string word = wordEnumerator.Current;
    Process(word);
}

The foreach statement is essentially using a hidden enumerator that the programmer doesn’t need to worry about.

The thing that developers generally learn very early on is that you can’t modify the contents of an enumerable whilst it’s being enumerated. Well, you can, but your enumerator will be rendered unusable. On your next call to the enumerator, it will throw an exception.

// This code won't work
List<string> someWords = new List<string>() { "one", "two", "three" };
foreach (string word in someWords)
{
    if (word.Contains('e'))
    {
        someWords.Remove(word);
    }
}

This makes sense, if you think about it: it’s reasonable for an enumerator to be able to expect that it’s working on solid ground, so to speak. If you try to jiggle the carpet underneath it, it falls over, because it might not know where to step next. If you want to do this using a foreach, you will need to do it some other way, such as by making a copy of the list.

List<string> someWords = new List<string>() { "one", "two", "three" };
List<string> copy = someWords.ToList();
foreach (string word in copy)
{
    if (word.Contains('e'))
    {
        someWords.Remove(word);
    }
}

So, one of my colleagues was in this situation, and came up with what seemed like a nice, clean way to handle this. They were going to use the LINQ API to both make the copy and do the filtering, in one go. LINQ is a very helpful API that gives you filtering, projection and aggregate methods on enumerables. It’s a “fluent API”, which means it’s designed for you to be able to chain calls together. In their code, they used the Where() method, which takes an enumerable and returns an enumerable containing the items from the first enumerable which matched a given condition.

// Can you see where the bug is?
List<string> someWords = new List<string>() { "one", "two", "three" };
IEnumerable<string> filteredWords = someWords.Where(w => w.Contains('e'));
foreach (string word in filteredWords)
{
    someWords.Remove(word);
}

This should work, right? We’re not iterating over the enumerable we’re modifying, we’re iterating over the new, filtered enumerable. So why does this crash with the same exception as the previous example?

The answer is that LINQ methods (strictly speaking, here, we’re using “LINQ-To-Objects”) don’t return the same type of thing as their parameter. They return an IEnumerable<T>, but they don’t guarantee exactly what implementation of IEnumerable<T> they might return. Moreover, in general, LINQ prefers “lazy evaluation”. This means that Where() doesn’t actually do the filtering when it’s called; that would be a very inefficient strategy on a large dataset, because you’d potentially be creating a second copy of the dataset in memory. Instead, it returns a wrapper object, which doesn’t actually evaluate its filter until something tries to enumerate it.

In other words, when the foreach loop iterates over filteredWords, filteredWords isn’t a list of words itself. It’s an object that, at that point, goes to its source data and thinks: “does that match? OK, pass it through.” And the next time: “does that match? No, next. Does that match? Yes, pass it through.” So the foreach loop is still, ultimately, triggering one or more enumerations of someWords each time we go around the loop, even though it doesn’t immediately appear to be used.
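The laziness is easy to observe directly. The following sketch (the class and method names are invented for illustration) counts how often the filter runs, and separately confirms that removing items inside the loop trips the underlying list’s enumerator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyWhereDemo
{
    // Reports how many times the predicate had run (a) straight after Where()
    // returned and (b) after the result had been enumerated once.
    public static (int AfterWhere, int AfterEnumeration) CountPredicateCalls()
    {
        var someWords = new List<string> { "one", "two", "three" };
        int calls = 0;

        IEnumerable<string> filteredWords = someWords.Where(w =>
        {
            calls++;
            return w.Contains('e');
        });
        int afterWhere = calls; // no filtering has happened yet

        foreach (string word in filteredWords)
        {
            // Enumerating is what actually runs the predicate.
        }

        return (afterWhere, calls);
    }

    // Returns true if modifying the list inside the loop throws.
    public static bool RemovingWhileEnumeratingThrows()
    {
        var someWords = new List<string> { "one", "two", "three" };
        IEnumerable<string> filteredWords = someWords.Where(w => w.Contains('e'));
        try
        {
            foreach (string word in filteredWords)
            {
                someWords.Remove(word);
            }
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }
}
```

CountPredicateCalls() should report zero calls after Where() and three after the loop, one per word in the source list.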

What’s the best way to fix this? Well, in this toy example, you really could just do this:

someWords = someWords.Where(w => !w.Contains('e')).ToList();

which gets rid of the loop completely. If you can’t do that for some reason (and I can’t remember why we couldn’t do that in the real-world code this is loosely based on) you can add a ToList() call onto the line creating filteredWords, forcing evaluation of the filter at that point. Or, you could avoid a foreach loop a different way by converting it to a for loop, which is a bit more flexible than a foreach and in this case would save memory slightly; the downside is a bit more typing and that your code becomes prone to subtle off-by-one errors if you don’t think it through thoroughly. There’s nearly always more than one way to do something like this, and they all have their own upsides and downsides.
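The two loop-based fixes might look something like this. It is a sketch against the toy example rather than the real-world code; the class and method names are made up, and the backwards direction of the for loop is one way of sidestepping the off-by-one trap, not the only one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RemovalFixes
{
    // Fix 1: ToList() forces the filter to run immediately, so the loop
    // iterates over an independent snapshot rather than over someWords itself.
    public static void RemoveUsingSnapshot(List<string> someWords)
    {
        List<string> filteredWords = someWords.Where(w => w.Contains('e')).ToList();
        foreach (string word in filteredWords)
        {
            someWords.Remove(word);
        }
    }

    // Fix 2: a for loop, with no enumerator involved at all. Walking backwards
    // means that removing an item never shifts the items still to be visited.
    public static void RemoveUsingForLoop(List<string> someWords)
    {
        for (int i = someWords.Count - 1; i >= 0; i--)
        {
            if (someWords[i].Contains('e'))
            {
                someWords.RemoveAt(i);
            }
        }
    }
}
```

The snapshot version allocates a second list; the for loop version doesn’t, which is the slight memory saving mentioned above.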

As I said at the start, I spotted the issue here straightaway just by reading the code, not by trying to run it. If I hadn’t spotted it inside somebody else’s code, I wouldn’t even have thought to write a blog post on something like this. There are always going to be people, though, who didn’t realise that the code would behave like this because they hadn’t really thought about how LINQ works; just as there are always developers who go the other way and slap a ToList() on the end of the LINQ chain because they don’t understand how LINQ works but have come across this problem before and know that ToList() fixed it. Hopefully, some of the people who read this post will now have learned something they didn’t know before; and if you didn’t, I hope at least you found it interesting.

* Note: for clarity I’m only going to use the generic interface in this post. There is also a non-generic interface, but as only the very first versions of C# didn’t support generics, we really don’t need to worry about that. If you write your own enumerable you’re still required to support the non-generic interface, but you can usually do so with one line of boilerplate (an explicit interface implementation, which takes no access modifier): IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

** In recent versions of C#, at any rate. In earlier versions, the equivalence was slightly different. The change was a subtle but potentially breaking one, causing a change of behaviour in cases where the loop variable was captured by a lambda expression.
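To illustrate that second footnote, here is a small sketch of the case that was affected: a lambda capturing the foreach loop variable. The names are hypothetical. From C# 5 onwards each trip around the loop gets its own fresh variable, so every lambda sees the word it was created with; under the older expansion the variable was shared across iterations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ForeachCaptureDemo
{
    // Creates one lambda per word, then calls them all after the loop has finished.
    public static List<string> CaptureEachWord()
    {
        var someWords = new List<string> { "one", "two", "three" };
        var getters = new List<Func<string>>();

        foreach (string word in someWords)
        {
            // The lambda captures the loop variable itself, not a copy of its value.
            getters.Add(() => word);
        }

        return getters.Select(getter => getter()).ToList();
    }
}
```

With a modern compiler this returns the three words in order.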

The Paper Archives (part three)

The title of this series is maybe not quite as suitable as it was

The previous post in this series is here.

Sometimes, sorting through the accumulated junk that fills my mother’s house, I come across things that I remember from my childhood. For example: alongside the stack of modern radio transceivers that my dad used to speak to random strangers over the airwaves, is the radio I remember being my Nanna’s kitchen radio, sitting on top of the fridge.

The old kitchen radio

It’s a big, clunky thing for a portable, its frame made of leather-covered plywood. I know it has valves (or tubes) inside, not transistors, because I remember my dad having to source spare valves for it and plug them in back when my Nanna still used it daily—he was the only person in the family who knew how to work out which of the valves had popped when it stopped working.

With only a vague idea how old it might be, I looked at the tuning dial to see if it would give me any clues.

The tuning dial

Clearly from before the Big BBC Renaming of the late 1960s. I’m not sure how much it can be trusted for dating, though, as Radio Athlone officially changed to Radio Éireann in the 1930s, but I was fairly sure the radio probably wasn’t quite that old. Of course, I should really have been looking at the bottom.

The makers' plate

And of course the internet can tell you exactly when a Murphy BU183M was first sold: 1956, a revision of the 1952 BU183, which had the same case. The rather more stylish B283 model came out the following year, so I suspect not that many of the BU183M were made.

I’m intrigued by the wide range of voltages it can run off: nowadays that sort of input voltage range is handled simply and automatically by power electronics, but in the 1950s you had to open your radio up and make sure the transformer was set correctly before you tried to plug it in, just in case you were about to blow yourself up otherwise. I suppose this is what radio shops were for, to do that for you, and potentially to hire out the large, chunky high-voltage batteries you might need if you didn’t have mains electricity. This radio is from the last years of the valve radio: low-voltage transistor sets were about to enter the marketplace and completely change how we listened to music. This beast—or the B283, which at least looks like an early transistor radio—needed a 90-volt battery to heat up the valves if you wanted to run them without mains power, not the sort of battery you can easily carry around in your handbag. The world has changed a lot in seventy years.

State of independence

Or, getting the web back to its roots

When I rewrote and “relaunched” this site, back in 2020, I very consciously chose to stay simple. I didn’t want to tie myself to one of the major “content platforms”, because over the years too many of them have closed down on barely more than a whim. I didn’t want a complex system that would be high-maintenance in return for more functionality. I didn’t want to have to moderate what other people might want to say in my space. More importantly, though, I did want a space more like the online spaces I inhabited 20 or so years ago; or at least, like the online spaces of my imagination, where people would create in their own little corner not worrying about influence or monetisation or that sort of thing. It’s possible that place never really existed, except in my mind, but it was something I always aspired towards, and it was a place where I met a whole load of other people who shared a similar outlook on why they were writing down so much stuff out there on the internet for other people to read. That was why, when I rewrote this site, I kept it simple, and produced a static site that could be hosted almost anywhere, with source code that can be put into any private Git hosting service. I didn’t even go for one of the mainstream static site generators; I chose a relatively simple and straightforward open-source one that works by gluing a number of other open-source tools together to output HTML. It’s about as plain and independent as you can get.

There is, nowadays, a movement towards making the web more independent, making it more like it used to be, or at least as some of us remember it. It’s called the IndieWeb movement. The basic idea behind the IndieWeb is exactly this: that when you, an individual, post something online, it should stay yours. It should belong to you, under your control, forever. Essentially, that’s one of the main things I’ve always been aiming for.

I’m clearly IndieWeb-adjacent, whatever that phrase I’ve just invented means. This site, though, is a long way from being IndieWeb-compliant. And the reason is: I’ve looked through their Getting Started pages, and, frankly, it takes effort. That might sound like me being lazy, and I’d be the first to agree that I am lazy, but it’s also because there are only so many hours in the day. The day job takes up a good chunk of them, of course, then there are The Children, there are my other coding projects, all my craft projects,* the various organisations I do volunteer work for, all the other ways I’m trying to improve myself, not to mention the attraction of just going out for a long walk for a few hours. Aside from the original setup and occasional tweaks, this site is largely something to exercise the side of my brain that isn’t involved in coding. Spending time setting up and creating my own personal h-card, and automating syndication, isn’t really something I want to do in my relaxation hours.

Hopefully, though, the idea behind IndieWeb will grow, and will flourish, and we can make the web something that isn’t driven by advertising revenue, or by monetising hate and bigotry. I’d like us to make the web a place where seeds have space to germinate and flower, where everyone controls their own output and can express themselves without the point being to increase shareholder value or to feed the ego of some not-as-bright-as-he-thinks entrepreneur. Maybe I’ll add more IndieWeb features to this site, one by one, as time goes by. Hopefully, whatever I do, I’ll just keep doing my own thing for as long as it makes me happy.

* I mean, I literally started two separate new ones yesterday.

The Paper Archives (part two)

More relics from the past

The previous post in this series is here.

Spending some more time going through the things The Parents should arguably have thrown out decades ago, I came across a leather bag, which seemed to have belonged to my father. Specifically, he seemed to have used it for going to college, in the 1970s. Him being him, he’d never properly cleaned it out, so it had accumulated all manner of things from all across the decade. There were “please explain your non-attendance” slips from 1972; an unread railway society magazine from 1977; and the most recent thing with a date on was an Open University exam paper from 1983. It was about relational database design, and to be honest some of the questions wouldn’t be out of place in a modern exam paper if you asked for the answers in SQL DDL rather than in CODASYL DDL, so I might come back to that and give it its own post. What he scored on the exam, I don’t know. There were coloured pencils, and an unopened packet of gum.

Juicy Fruit gum

It seems to be from before the invention of the Best Before date, but the RRP printed on the side is £0.04.

Slightly more expensive: a rather nice slide rule. Look, it has a Standard Deviation scale and all. Naturally, my dad being my dad, it was still in its case and with the original instruction book, which will be useful if I ever try to work out how to use it.

Slide rule

And finally (for today) I spotted what appeared to be a slip of paper at the bottom of the bag with “NEWTON’S METHOD” written on it in small capitals, in fountain-pen ink. Had he been cheating in his exams? Had he written a crib to the Newton-Raphson method down and slipped it into the bottom of the bag? I pulled it out and…I was wrong.

Paper tape

It was a rolled-up 8-bit paper tape! Presumably with his attempt at a program to numerically solve a particular class of equation using Newton’s method.

I don’t know what type of machine it would have been written for, but I could see that it was likely binary data or text in some unfamiliar encoding, as whichever way around you look at it a good proportion of the high bits would be set so it was unlikely to be ASCII. Assuming I’m holding the tape the right way round, this is a transcription of the first thirty-two bytes…

0A 8D 44 4E C5 A0 35 B8 0A 8D 22 30 A0 59 42 A0 47 4E C9 44 C9 56 C9 44 22 A0 D4 4E C9 D2 50 A0

That’s clearly not ASCII. In fact, I think I know what it might be: an 8080/Z80 binary. I recognise those repeated C9 bytes: that’s the opcode for the ret instruction, which has survived all the way through to the modern-day x64 instruction set. If I try to hand-disassemble those few bytes assuming it’s Z80 code we get:

ld a,(bc)
adc a,l
ld b,h
ld c,(hl)
push bc
and b
dec (hl)
cp b
ld a,(bc)
adc a,l

This isn’t the place to go into Z80 assembler syntax (that might be a topic for the future) other than to say that the destination operand comes first and brackets are a pointer dereference, so ld c,(hl) means “put the value from the memory location whose address is in register hl into register c”. As valid code it doesn’t look too promising to my eyes (I didn’t even realise dec (hl) was something you could do) but I’ve never been any sort of assembly language expert. The “code” clearly does start off making assumptions about the state of the registers, but on some operating systems that would make sense. This disassembly only takes us as far as the repeated 0A8D, though: maybe that’s some sort of marker separating segments of the file, and the actual code is yet to come. The disassembly continues…

ld (&a030),hl
ld e,c
ld b,d
and b
ld b,a
ld c,(hl)
ret
ld b,h
ret
ld d,(hl)
ret
ld b,h
ld (&d4a0),hl
ld c,(hl)
ret
jp nc,&a050

Well, that sort of makes some sort of sense. Most of the instructions that reference fixed addresses appear to point to a consistent place in the address space: the Z80 stores 16-bit operands low byte first, so two of the three land in the block starting around &a000, and only the second ld, at &d4a0, falls outside it. That also implies code and data are in the same address space, which means you’d expect that some of the binary wouldn’t make sense when disassembled. If this was some other arbitrary data, I’d expect references like that to be scattered around at random locations. As the label says this is an implementation of Newton’s method, we can probably assume that this is a college program that includes an implementation of some mathematical function, an implementation of its first derivative, and the Newton’s method code that calls the first two repeatedly to find a solution for the first. I wouldn’t expect it to be so sophisticated as to be able to operate on any arbitrary function, or to work out the derivative function itself.
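The hand-disassembly can be mechanised with a small lookup table. What follows is a deliberately tiny sketch, not a real Z80 disassembler: it only knows the opcodes that occur in the thirty-two transcribed bytes, the class name is invented, and anything unrecognised is emitted as a raw data byte. The one subtlety is that the Z80 stores 16-bit operands low byte first, so the bytes 22 30 A0 decode as ld (&a030),hl.

```csharp
using System;
using System.Collections.Generic;

public static class TinyZ80Disassembler
{
    // One-byte opcodes seen in the transcribed bytes; nothing else is handled.
    private static readonly Dictionary<byte, string> OneByte = new Dictionary<byte, string>
    {
        { 0x0A, "ld a,(bc)" }, { 0x8D, "adc a,l" }, { 0x44, "ld b,h" },
        { 0x4E, "ld c,(hl)" }, { 0xC5, "push bc" }, { 0xA0, "and b" },
        { 0x35, "dec (hl)" },  { 0xB8, "cp b" },    { 0x59, "ld e,c" },
        { 0x42, "ld b,d" },    { 0x47, "ld b,a" },  { 0x56, "ld d,(hl)" },
        { 0xC9, "ret" },
    };

    // Three-byte opcodes: the opcode is followed by a 16-bit operand stored
    // low byte first. {0} is replaced with the operand as four hex digits.
    private static readonly Dictionary<byte, string> ThreeByte = new Dictionary<byte, string>
    {
        { 0x22, "ld (&{0}),hl" },
        { 0xD2, "jp nc,&{0}" },
    };

    public static List<string> Disassemble(byte[] code)
    {
        var lines = new List<string>();
        int i = 0;
        while (i < code.Length)
        {
            byte op = code[i];
            if (OneByte.TryGetValue(op, out string mnemonic))
            {
                lines.Add(mnemonic);
                i += 1;
            }
            else if (ThreeByte.TryGetValue(op, out string format) && i + 2 < code.Length)
            {
                int operand = code[i + 1] | (code[i + 2] << 8);
                lines.Add(string.Format(format, operand.ToString("x4")));
                i += 3;
            }
            else
            {
                // Unknown opcode, or an operand cut off by the end of the data.
                lines.Add("db &" + op.ToString("x2"));
                i += 1;
            }
        }
        return lines;
    }
}
```

Feeding it the transcribed bytes gives one line per instruction; extending it to the whole tape would mean filling in the rest of the opcode table, including the prefixed multi-byte instructions this sketch ignores.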

If I could find jumps or calls pointing to the instructions after those ret opcodes, I’d be happier. Maybe, if I ever have too much time on my hands, I’ll try to decompile the whole thing.

The next post in this series is here

Teaching an image to think

Computers work in unexpected ways

Following on from yesterday’s post about log4j: another security article fascinated me in the last week, too. You might have already seen it, because it was widely shared on Twitter and computer people everywhere were amazed and aghast at its engineering and its possibilities. The log4j vulnerability is a relatively pedestrian one by comparison, using something that is an entirely documented and public feature of the library. This, on the other hand, is a completely different animal.

It’s a hack which lets you run code on a stranger’s iPhone just by sending them a message. They don’t have to click on anything, they don’t even have to open it, all their phone has to do is receive it and the hacker can take their phone over. At least, could: this security hole was fixed three months ago in iOS 14.8 and later. If you are running an older version of iOS on your phone or tablet, then, er, maybe don’t. The analysis of how this hack works, by Google Project Zero, has started to be published; and if you’re a programming nerd, it is beautiful and amazing and horrific in just the same way that a biological virus is.

In short, this hack relied on the fact that an iOS device, when it receives an animated GIF, tries to hack the GIF a little so it will always loop forever whatever the GIF itself actually says to do. It does this in an unhealthy way, though. When it opens the file to change it, it doesn’t matter if it’s not actually a GIF. The software will try to be clever and say “ah, looks like your file’s got the wrong name there, don’t worry, I still know how to open one of these” and do it. Even if it’s not a GIF and therefore doesn’t really need to.

Secondly, the hack relies on a bug in an open source PDF-reading library, in the part of the code used to open embedded images that are in an obscure and rather out-of-date format mostly used by fax machines. PDF is a big, complex and rambly format (believe me I know, I’ve been on-off trying to write a .NET PDF writing library for some years now) so it’s not surprising there are bugs and holes in PDF-reading software. What this hack does, though, is frankly brilliant. It uses the capabilities of the compression algorithm of this particular graphics format to implement an entire virtual CPU in the memory of the target device. It’s a small CPU but it is a Turing-complete one, which in technical terms means that if you ignore practical limits of time and memory, it’s just as powerful as any other computer. An entire virtual CPU…created by feeding a carefully-designed image into a buggy image decompression routine.*

Frankly, if you’re a software developer, this is genius. Evil genius, to be sure, but genius nonetheless. I’m somewhat in awe of it, in a dirty way. It’s a wonderful level of lateral thinking, to know that the bug is there to exploit and work out a way to reach it and trip it up to begin with; and then to build an entire virtual machine from the basic Boolean logic operations available inside a particular image format. As I said above, it’s beautiful, it’s amazing, and it’s horrific in the original sense of the word. It’s awe-inspiring. I might be good at my job, but I can only look upon this with amazement and envy.

* I assume the image itself looks like just so much white noise if you could actually view it, but you can’t have everything. It reminds me a little of Neal Stephenson’s early-90s novel Snow Crash, in which a carefully-designed image that looks like white noise can hack the viewer’s brain.

Some logical relief

In which we discuss a topical flaw

In many ways I lead a charmed life and hold a wide range of privileges in my hand. Not least, this week just gone, the fact that I’m a software developer who generally works with the .NET software stack. More specifically, I am not a software developer who works with Java. Java developers have not, generally speaking, been having a good week.

This is all because of a software vulnerability discovered just over a week ago in a Java library called “log4j”. To summarise, for non-experts: “log4j” is a logging library. No, not the let’s-clear-the-rainforests sort. “Logging” means your software writing diagnostic information as it goes along: records such as “user etoainshrdlu asked to see their bank balance at 9.10am from this address with that web browser”. You can see why…

Regular reader E Shrdlu (from Clacton) writes: Oi! You can’t go around giving my bank balance to people!

Hush now, I was just using you as an example! You can see why it’s useful to have this information stored away somewhere, and log4j is a software library that makes it really easy to do. Virtually all Java server-side code out there uses log4j somewhere inside it, to handle this sort of thing.

Unfortunately, log4j has a few handy features that were originally intended to be useful features, but aren’t necessarily a good idea to have running on an internet-facing server that does important work such as process your banking requests. Particularly, in this case, if you put a certain specialist type of URL into a log record, log4j will see it, try to download another program from it, and will then run that program in a certain well-defined way. Of course, you might say, there’s nothing wrong with that because all of the log record messages are just written by the bank’s own software developers, so everything’s perfectly safe. However, as I said above, one thing they may very well be logging is which browser you happen to be using, because that’s very useful diagnostic data if people start having problems. “Which browser you happen to be using”, though, is just a field that you send them, and if you know what you’re doing, you can change it to whatever you want to. Including a special type of URL which will…well, hopefully you get the picture. And now you’re running whatever programs you like on one of your bank’s internal servers. Ah. You can see now why Java developers have not been having a good week.
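The underlying mistake, a logger treating the data it is given as instructions, can be sketched with a toy in C. To be clear about the assumptions: this is not log4j's code or API, the `${cfg:NAME}` syntax and every function name are invented for the example, and all the toy does is substitute a value from a hard-coded table, where the real flaw went as far as fetching and running code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical internal settings the toy logger can look up by name */
static const char *toy_lookup (const char *name)
{
    if (strcmp (name, "db_password") == 0)
        return "hunter2";
    return NULL;
}

/*
 * Toy logger, NOT log4j: copies msg into out, but "helpfully" expands any
 * ${cfg:NAME} it finds, even if that text arrived from a stranger.
 */
void toy_format (const char *msg, char *out, size_t outlen)
{
    size_t used = 0;

    assert (msg != NULL && out != NULL && outlen > 0);
    while (*msg != '\0' && used + 1 < outlen)
    {
        const char *end;

        if (strncmp (msg, "${cfg:", 6) == 0 && (end = strchr (msg + 6, '}')) != NULL)
        {
            char name[64];
            size_t n = (size_t) (end - (msg + 6));

            if (n < sizeof name)
            {
                const char *value;

                memcpy (name, msg + 6, n);
                name[n] = '\0';
                value = toy_lookup (name);
                if (value != NULL)
                    used += (size_t) snprintf (out + used, outlen - used, "%s", value);
                if (used >= outlen)
                    used = outlen - 1;   /* the value was truncated to fit */
                msg = end + 1;
                continue;
            }
        }
        out[used++] = *msg++;            /* ordinary character: copy as-is */
    }
    out[used] = '\0';
}
```

If the application logs `browser=` followed by whatever the browser claims to be, then a visitor who sets that field to `${cfg:db_password}` gets `browser=hunter2` written out. The developers' own messages were never the problem; the trouble starts when outside input reaches a logger that interprets what it sees.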

The fix for this is straightforward, but rolling the fix out will have involved a huge proportion of the Java code running in the world being checked, double-checked, and redeployed when it’s known to be safe. Moreover, all of the developers doing this will have had several queries a day from their managers asking just how much they are exposed to this issue. I know: I’ve had several myself, even though my response is straightforwardly “we don’t run any Java code at all, so don’t worry.” I do tell them to tell the clients we have thoroughly and conscientiously audited our systems because from a client-relations point of view it does sound a bit more professional than “no, and our tech lead is very glad of her career choices”. But it still means plenty of messages for me to answer.

Incidentally, I don’t feel any sort of schadenfreude about this, in case you were wondering. I genuinely feel sorry for a lot of people I know, who will not have had a good week fixing this stuff. I’ve worked in big banks and other similar organisations, and I know a lot of former colleagues and current friends who will have spent the last week focusing on this above all else. It’s not nice when you are suddenly bowled by a risk like this; and moreover, it’s not as if Java is uniquely likely to suffer from this type of problem. There are nuances to this that I may come back to in a later post; but next time something like this happens, the person fixing it might well be me.

Code archaeology

When things become relevant again

One thing I have been doing over the past few weeks is: finally, finally, taking the hard drive out of my last desktop computer—last used about 8 years ago at a guess—and actually copying all the documents off it. It also had stuff preserved from pretty much every desktop machine I’d had before that, so there was a whole treasure-chest of photographs I hadn’t seen in years, things I’d written, and various incomplete coding projects.

Some of the photos will no doubt get posted on here over the coming weeks, but this post isn’t about those. Because, by pure coincidence, I was browsing my Twitter feed this morning and saw this tweet from @ireneista:

we were trying to help a friend get up to speed on how to make a Unix process into a daemon, which is something we found plenty of guides on in the 90s but it’s largely forgotten knowledge

Hang on a minute, I think. Haven’t I just been pulling old incomplete coding projects off my old hard disk and saving them into GitHub repositories instead? And don’t some of those have exactly that code in? A daemon, on Unix, is roughly the equivalent of a “Service” on Windows. It’s a program that runs all the time in the background on a computer, doing important work.* Many servers don’t even run anything else to speak of. On both Unix and Windows systems, there are special steps you have to take to properly “detach” your code and let it run in the background as part of the system, and if you don’t do all those steps properly you will either produce something that is liable to break and stop running when it’s not supposed to, or write something that fills up your system’s process table with so-called “zombie” entries for processes that have stopped running but still need some bookkeeping information kept about them.

Is this forgotten knowledge? Well, it’s certainly not something I would be able to do, off the top of my head, without a lot of recourse to documentation. For a start all the past projects I’m talking about were written in C, for Linux systems, and I haven’t touched the language nor the operating system much for a number of years now.

None of the projects I’m talking about ever approached completion or were properly tested, so there’s not that much point releasing their full source code to the world. However, clearly, the information about how to set up a daemon has disappeared out of circulation a bit. Moreover, that code was generally stuff that I pulled wholesale from Usenet FAQs myself, tidying it up and adding extra logging as I needed, so compared to the rest of the projects, it’s probably much more reliable. The tweet thread above links to some CIA documentation released by Wikileaks which is nice and explanatory, but doesn’t actually include some of the things I always did when starting up a daemon. You could, of course, argue they’re not always needed. So, here is some daemonisation code I have cobbled together by taking an average across the code I was writing about twenty-ish years ago and adding a bit of explanation. Hopefully this will be useful to somebody.

Bear in mind this isn’t real code: it depends on functions and variables that you can assume we’ve declared in headers, or in the parts of the code that have been omitted. As the old saying goes, I accept no responsibility if this code causes loss, damage, or demons flying out of your nose.**

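/*
 * Headers for the library calls used below.  The _() macro is the usual
 * gettext shorthand and is assumed to be defined along with the intltools setup.
 */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <syslog.h>
#include <unistd.h>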

int main(int argc, char **argv)
{
    /* 
     * First you'll want to read config and process command line args,
     * because it might be nice to include an argument to say "don't
     * run as a daemon!" if you fancy that.
     *
     * This code is also written to use GNU intltools, and the setup for that
     * goes here too.
     */

    /* Assume the daemonise variable was set by processing the config */
    if (daemonise)
    {
        /* First we fork to a new process and exit the original process */
        switch (fork ())
        {
        case -1:
            syslog (LOG_ERR, _("Forking hell, aborting."));
            exit (EXIT_FAILURE);
        default:
            exit (0);
        case 0:
            break;
        }

        /* Then we call setsid() to become the leader of a new session and process group,
         * making sure we are detached from any terminals */
        if (setsid () == -1)
        {
            syslog (LOG_ERR, _("setsid() failed, aborting."));
            exit (EXIT_FAILURE);
        }

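        /* Then we fork again, so that we are no longer a session leader and can't
         * accidentally pick up a controlling terminal later */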
        switch (fork ())
        {
        case -1:
            syslog (LOG_ERR, _("Forking hell x2, aborting."));
            exit (EXIT_FAILURE);
        default:
            exit (0);
        case 0:
            break;
        }

        /* Next, a bit of cleanup.  Change our CWD to / so we don't block any umounts, and 
         * redirect our standard streams to taste */
        umask (0022);
        if (chdir ("/"))
            syslog (LOG_WARNING, _("Cannot chdir to root directory"));
        freopen ("/dev/null", "w", stdout);
        freopen ("/dev/null", "r", stdin);
        freopen ("/dev/console", "w", stderr); /* This one in particular might not be what you want */

    }

    /*
     * Listen to some signals.  The second parameters are function pointers which
     * you'll have to imagine are defined elsewhere.  Reloading config on SIGHUP
     * is a common daemon behaviour you might want.  I can't remember why I thought
     * it important to ignore SIGPIPE, but the usual reason is so that writing to a
     * closed pipe or socket returns an error instead of killing the process.
     */
    signal (SIGPIPE, SIG_IGN);
    signal (SIGHUP, warm_restart);
    signal (SIGQUIT, graceful_shutdown);
    signal (SIGTERM, graceful_shutdown);

    /* And now we're done!  Let's go and run the rest of our code, whether or not we daemonised */
    run_the_daemon ();
    return EXIT_SUCCESS;
}

The above probably includes some horrible mistake somewhere along the way, but hopefully it’s not too inaccurate, and hopefully would work in the real world. If you try it—or have opinions about it—please do get in touch and let me know.
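The signal handlers are left to the imagination in the listing. One common shape for them, sketched here as an assumption rather than as what the original projects did, is to have each handler do nothing except set a flag, and let the main loop inside `run_the_daemon` check those flags at a convenient moment. The flag names and `should_stop` are hypothetical.

```c
#include <assert.h>
#include <signal.h>

/* Flags for the main loop to poll; sig_atomic_t is safe to write from a handler */
static volatile sig_atomic_t reload_requested = 0;
static volatile sig_atomic_t shutdown_requested = 0;

/* SIGHUP handler: just note that a config reload has been asked for */
void warm_restart (int signum)
{
    (void) signum;
    reload_requested = 1;
}

/* SIGQUIT and SIGTERM handler: note that we should tidy up and exit */
void graceful_shutdown (int signum)
{
    (void) signum;
    shutdown_requested = 1;
}

/* For the main loop: returns nonzero once it is time to stop */
int should_stop (void)
{
    return shutdown_requested != 0;
}
```

Keeping the handlers this small sidesteps the question of which library calls are safe to make from inside a signal handler, because the real work of reloading or shutting down happens back in ordinary code.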

* NB: this is a simplification for the benefit of the non-technical. Yes, I know I’m generalising and lots of daemons and services don’t run all the time. Please don’t write in with examples.

** “demons flying out of your nose” was a running joke in the comp.lang.c Usenet group, for something it would be considered entirely legitimate for a C compiler to do if you wrote code that was described in the C language standard as having “undefined behaviour”.

Milestones

Or, how and how not to learn languages

I passed a very minor milestone yesterday. Duolingo, the language-learning app, informed me that I had a “streak” of 1,000 days. In other words, for the past not-quite-three-years, most days, I have fired up the Duolingo app or website and done some sort of language lesson. I say “most days”: in theory the “streak” is supposed to mean I did it every single day, but in practice you can skip days here and there if you know what you’re doing. I’ve mostly been learning Welsh, with a smattering of Dutch, and occasionally revising my tourist-level German.

My Welsh isn’t, I have to admit, at any sort of level where I can actually hold a conversation. I barely dare say “Ga i pysgod a sglodion bach, plîs,” in the chip shop when visiting Welsh-speaking Wales, because although I can say that, I am wary that I wouldn’t be any use at comprehending the response if they need to ask, for example, exactly what type of fish I want. To be honest, I see this as a big drawback to the whole Duolingo-style learning experience, which seems essentially focused around rote learning of a small number of set phrases in the hope that a broader understanding of grammar and vocabulary will follow. I’ve been using Duolingo much longer than three years: I first used it to start revising my knowledge of German back in 2015. When I last visited Germany, though, I was slightly confused to find that after over a year of Duolingo, if anything, I felt less secure in my command of German, less confident in my ability to use it day-to-day. Exactly why I don’t know, but it helped me realise that I can’t just delegate that sort of learning to a question-and-answer app. If I want to progress with my Welsh, I know I’m going to have to find some sort of conversational class.

Passing the 1,000 days milestone made me start wondering if anyone has produced something along the same lines as Duolingo but for computer languages. In some ways it should be a less difficult problem than for natural language learning, because, after all, any nuances of meaning are less ambiguous. I lose track of the number of times Duolingo marks me down because I enter an English answer which means the same as the accepted answer but uses some other synonym or has a slightly different word order. With a coding language, if you have your requirements and the output meets them, your answer is definitely right. In theory it shouldn’t be too hard to create a Duolingo-alike thing but with this sort of question:

Given a List<Uri> called uris, return a list of the Uris whose hostnames end in .com in alphabetical order.

  1. uris.Select(u => u.Host).Where(h => h.EndsWith(".com")).OrderBy();
  2. uris.Where(u => u != null && u.Host.EndsWith(".com")).OrderBy(u => u.AbsoluteUri).ToList();
  3. uris.SelectMany(u => u.Where(Host.EndsWith(".com"))).ToList().Sort();

The answer, by the way, is 2. Please do write in if I’ve made any mistakes by being brave enough to write this off the top of my head; writing wrong-but-plausible-looking code is harder than you think. Moreover, I know the other two answers contain a host of errors and wouldn’t even compile, just as the wrong answers in Duolingo often contain major errors in grammar and vocabulary.

Clearly, you could do something like this, and you could memorise a whole set of “cheat sheets” of different coding fragments that fit various different circumstances. Would you, though, be able to write decent, efficient, and most importantly well-understood code this way? Would you understand exactly the difference between the OrderBy() call in the correct answer, and the Sort() call in answer three?* I suspect the answer to these questions is probably no.

Is that necessarily a bad thing, though? It’s possibly the level that junior developers often work at, and we accept that that’s just a necessary phase of their career. Most developers start their careers knowing a small range of things, and they start out by plugging those things together and then sorting the bugs out. As they learn and grow they learn more, they fit things together better, they start writing more original code and slowly they become fluent in writing efficient, clean and idiomatic code from scratch. It’s a good parallel to the learner of a natural language, learning how to put phrases together, learning the grammar for doing so and the idioms of casual conversation, until finally they are fluent.

I realise Duolingo is only an early low-level step in my language-learning. It’s never going to be the whole thing; I doubt it would even get you to GCSE level on its own. As a foundational step, though, it might be a very helpful one. One day maybe I’ll be fluent in Welsh or German just as it’s taken me a few years to become fully fluent in C#. I know, though, it’s going to take much more than Duolingo to get me there.

* The call in answer 2 is a LINQ method which does not modify its source but instead returns a new enumeration containing the sorted data. The call in answer 3 modifies the list in-place.
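The same distinction exists well outside C#. As a rough sketch in C, assuming plain hostname strings in place of `Uri` objects, and with function names invented for the example: the standard `qsort()` sorts in place, like `Sort()`, so to get the `OrderBy()` behaviour you filter into a fresh array and sort that instead, leaving the caller's data alone.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator for an array of C strings */
static int compare_strings (const void *a, const void *b)
{
    return strcmp (*(const char *const *) a, *(const char *const *) b);
}

/* Does s end with suffix? */
static int ends_with (const char *s, const char *suffix)
{
    size_t slen = strlen (s), suflen = strlen (suffix);

    return slen >= suflen && strcmp (s + slen - suflen, suffix) == 0;
}

/*
 * Return a newly allocated, sorted array of the hostnames ending in ".com".
 * The input array is left untouched; the caller frees the result.
 * Returns NULL if the allocation fails.
 */
const char **dot_com_sorted (const char *const *hosts, size_t n, size_t *out_n)
{
    const char **result;
    size_t kept = 0;

    assert (hosts != NULL && out_n != NULL);
    *out_n = 0;
    result = malloc ((n ? n : 1) * sizeof *result);
    if (result == NULL)
        return NULL;

    /* The Where() step: keep only the matching entries */
    for (size_t i = 0; i < n; i++)
        if (ends_with (hosts[i], ".com"))
            result[kept++] = hosts[i];

    /* qsort() sorts in place, but only our copy ever gets reordered */
    qsort (result, kept, sizeof *result, compare_strings);
    *out_n = kept;
    return result;
}
```

Calling `qsort()` directly on the caller's own array would be the in-place equivalent, and whether that matters depends entirely on who else is relying on the original order.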