Thinking Like a Programmer: Escaping Text

Published Thu Feb 15 2024

Spend enough time around a software engineer, and you might hear something about escaping text. This is not a wish to avoid text, or replace it with other media (images, animated GIFs, movies, etc.).

"Escaping text", in the context of software engineering, refers to the practice of switching between the literal and pragmatic modes of interpreting text. This can be a daunting idea to wrap your head around, so allow me to illustrate it progressively.

Let's start with strings

When a programmer writes code, he or she is usually making a list of statements in a text file. This text file gets transformed (either compiled or interpreted) into language that the computer itself can read and carry out.

Conventionally, when learning a new language, a programmer will be exposed to a program called Hello World. This program -- no matter the language it is programmed in -- displays the phrase "Hello World!" somewhere on the screen. (Confusingly, programmers refer to this as printing text to the screen.)

In Python, the "Hello World" program looks like this:

#!/usr/bin/python3

print("Hello World!")

As expected, running it produces the text "Hello World!" on screen.

The first important thing to notice, for our purposes here, is that the text file we used to define the program is different from the result of running that program. That, of course, is the whole fun of programming: you make the computer actually do things!

In particular, you'll notice that the text "/usr/bin/python3" and "print" do not appear in the program's output; only the phrase "Hello World!" appears in the output. This is because "Hello World!" is defined in the program as a string.

Strings are a special type of text that gets passed around by computer programs. Strings can be sent by programs to a variety of places -- to a spreadsheet on your hard drive, to your monitor, to your printer, to a text-to-speech synthesizer, and so on.

Strings generally contain text designed for a human to read, but not always. Website addresses, for instance (e.g. "https://www.artandlogic.com/") are usually treated as strings within computer code, even if these are meant to be interpretable by web browsers rather than people.

The important take-away here is that strings are not code. They are used by computer code, but they are pieces of text that are not transformed into computer code. If programs are airports, strings are luggage. Strings get shuttled around by the airport workers, and although they contain an assortment of objects inside, they are treated as self-contained units.

This analogy is actually quite useful, because it lets us make a further comparison: the quotes around the string "Hello World!" are the suitcase. The opening quotation mark says "what follows are the contents of my string", while the closing quotation mark says "now we're going back to the code". This is why quotes come in pairs. One opens the string, the other closes it.

This is what I meant earlier by switching between the literal and pragmatic modes of interpreting text. All the pieces of text in our program that make the computer do things are pragmatic. This is the text that defines what the airport looks like and how it operates. All of the pieces of text in our program that get passed around like luggage are literal. Quotation marks let us switch back-and-forth between designing the airport and describing the luggage.
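To see the luggage idea in action, here is a small Python sketch. The variable names (greeting, luggage) are my own, chosen for illustration. It shows that the quotation marks mark where the string starts and stops without being part of it, and that text inside a string is carried around rather than carried out.

```python
# The quotation marks are pragmatic: they delimit the string in the source
# code, but they are not part of the string's contents.
greeting = "Hello World!"

length = len(greeting)      # counts the 12 characters between the quotes
first_character = greeting[0]
last_character = greeting[-1]

# Text that looks like code is still just luggage while it sits in a string.
# Python stores these characters; it does not run them.
luggage = "print(1 + 1)"

print(greeting)
print(luggage)
```

Running this displays Hello World! followed by print(1 + 1). The second line is shown exactly as written, because Python never treats the contents of the suitcase as instructions.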

What does Charlie say?

I started this discussion using double quotes ("), because readers of English will recognize them from newspaper articles and books, where double quotes are often used to encapsulate what someone has said.

Avid fiction readers may be intuitively familiar with one of the tricky aspects of quotes, which is reporting on what one person says another person says. If Alice asks Bob what Charlie said, the writer needs to find a way to communicate to the reader that within Bob's response are Charlie's words.

"What did Charlie say?", asked Alice.
"He stared off into the distance and waxed poetic, 'If things were different they wouldn't be the same'", replied Bob.

Bob has just presented the reader with a suitcase within a suitcase. The writer has used two types of quotation marks to differentiate them: Bob's words (the outer suitcase) are encapsulated using double quotes ("). Charlie's words (the inner suitcase) are encapsulated using single quotes (').

This can quickly get messy. What would Bob's response be if Charlie were quoting Dennis?

English punctuation doesn't comfortably handle multiple-layered scenarios, and so a resourceful author might prefer to find an alternative way to describe the interaction, perhaps using more context or even a literary device like a flashback or a parenthetical remark.

Poetry and code

While literary authors have the luxury of being able to play with the conventions and form of the language (they have poetic license to bend the rules), that kind of freedom is much more restricted in programming. There are a lot of creative choices you can make when writing a computer program, but these don't extend to the syntax of the language.

In other words, computer code has to be unambiguous -- it can only mean one thing and one thing only. The rules of syntax for any given programming language are generally very strict.

(Even though we talk about computer languages being "interpreted", this is another term of art which -- unlike its everyday usage -- doesn't imply any choice on the part of the reader, i.e. the computer. My "Hello World" Python program above can be interpreted (i.e. run) millions of times, and it will function the exact same way every time.)

Programming, as a discipline, draws a lot of inspiration from mathematical notation. While math formulas don't report what people say, they do make heavy use of the "suitcase inside a suitcase" phenomenon, more generally called nesting. The inner suitcase from our analogy can be said to be nested within the outer suitcase.

Parentheses accomplish the same thing in math:

a = 12 × (b + 2.4 × (g + 2))

You can see here that we have three levels of nesting:

The entire right-hand-side of the equation provides the outermost suitcase,
The first open parenthesis declares the start of the first inner suitcase, and
The second open parenthesis declares the start of the innermost suitcase.

The nesting you see here is quite easy to follow, because it is accomplished using just one pair of symbols: the open and closed parentheses.
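As a rough illustration of why parentheses are easy to follow, here is a short Python sketch that counts how deeply they nest. The function name max_paren_depth is hypothetical, and the formula is written with * in place of × so that it can be typed as ordinary text.

```python
def max_paren_depth(expression):
    """Return the deepest level of parenthesis nesting in the expression."""
    depth = 0
    deepest = 0
    for character in expression:
        if character == "(":
            # An open parenthesis always starts a new inner suitcase.
            depth += 1
            deepest = max(deepest, depth)
        elif character == ")":
            # A closed parenthesis always ends the most recent one.
            depth -= 1
            if depth < 0:
                raise ValueError("closing parenthesis without an opening one")
    if depth != 0:
        raise ValueError("opening parenthesis without a closing one")
    return deepest


formula = "12 * (b + 2.4 * (g + 2))"
paren_levels = max_paren_depth(formula)

# Counting the whole right-hand side as the outermost suitcase adds one level.
total_levels = paren_levels + 1
print(total_levels)
```

Because the opening and closing symbols are different characters, the scanner never has to guess which is which. That is exactly the convenience that quotation marks lack.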

Nesting within text strings is a bit more complicated because of a key difference: parentheses in math formulas are always pragmatic, never literal.

Quotation marks in the context of programming, on the other hand, may be either literal or pragmatic.

Bringing it all together

Reviewing what we know so far, we've discussed three really important ideas:

1. Strings are text that computer programs treat as whole and distinct entities. They are like suitcases.
2. The start of a string is declared when a double quote (") is encountered within the computer code. The end of the string is declared when the exact same symbol (") is encountered again. (Other symbols such as single quotes, backticks or slashes are sometimes used, but for simplicity, let's stick with double quotes.)
3. Strings, like suitcases, and like math formulas, may contain sub-objects. This is called nesting.

Reflecting and meditating on these three ideas and their implications raises a difficulty: How can points #2 and #3 above co-exist?

For instance, what if I wanted to modify my Hello World program above so that I have a nested string? Instead of "Hello World", suppose I wanted my program to give me a piece of advice inspired by the TV show Arrested Development:

Just remember: "There's always money in the banana stand!"

I would have to replace the contents of the "Hello World" string with the above, to give me something like this:

#!/usr/bin/python3

print("Just remember: "There's always money in the banana stand!"")

However, pairing up the quotes from left to right, as dictated by principle #2 above, disappointingly results in two non-nested strings with some gibberish between them.

Scanning left to right, the first string (between the first two double quotes) is "Just remember: ". Of course, we meant for both of these to be opening quotation marks (which would be unambiguous if we were using parentheses), but because we're forced to use the same character for both opening and closing quotes, we have to rely on left-to-right scanning to pair them up and decide which one is an opening quote and which one is a closing quote.

The second string, not recognized until another pair of quotation marks is encountered, is empty, "".

Between the two strings is the text (not a string!) "There's always money in the banana stand!", which the Python interpreter is not going to know how to deal with. We meant to put one suitcase inside another, but instead we've created two suitcases, one of which is empty, and between them, an armful of loose items which is just going to gum up the works of the baggage carousel.
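The left-to-right pairing described above can be sketched in a few lines of Python. The pair_quotes function is a hypothetical, deliberately naive scanner that only knows about double quotes, as in our simplified rule #2. The real Python interpreter has more rules (it also treats the apostrophe in "There's" as a quotation mark), so the last part of the sketch simply asks Python whether it accepts the broken line at all.

```python
def pair_quotes(text):
    """Pair up double quotes from left to right, labelling each piece."""
    segments = []
    in_string = False
    buffer = ""
    for character in text:
        if character == '"':
            if in_string:
                # This quote closes the suitcase that is currently open.
                segments.append(("string", buffer))
            elif buffer:
                # Loose text sitting outside any suitcase.
                segments.append(("not a string", buffer))
            buffer = ""
            in_string = not in_string
        else:
            buffer += character
    if buffer:
        label = "unclosed string" if in_string else "not a string"
        segments.append((label, buffer))
    return segments


# The text we put between the parentheses of print(...), quotes included.
argument = '''"Just remember: "There's always money in the banana stand!""'''
segments = pair_quotes(argument)
for label, contents in segments:
    print(label, "->", repr(contents))

# Ask Python itself whether the full line is valid code, without running it.
broken_line = "print(" + argument + ")"
try:
    compile(broken_line, "<banana stand>", "exec")
    accepted = True
except SyntaxError:
    accepted = False
print("Accepted by Python:", accepted)
```

The scanner reports two strings, one of them empty, with loose text between them, and Python refuses the line with a syntax error.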

So, how can we have nesting (#3) when we must close one suitcase before creating a new one?

Escaping

Now, finally, we arrive at the concept of escaping.

When we use double quotes to mark the start and stop of a string, we are using quotes in their pragmatic sense. These quotes are telling Python something: while scanning text left to right, start then stop treating this text as a string.

What we need, in order to reconcile points #2 and #3 above, is a method of turning a pragmatic double quote into a literal one. This is known as escaping.

In our banana stand string, to accomplish nesting, we need to somehow tell Python to ignore the string-demarcating behaviour of the two inner double quotes, and to treat them like any other letter, number or character within a string. We need to switch those two quotes from pragmatic quotes to literal quotes.

We do this with an escape character. In Python, this is the backslash: \.

Our Python program, with the inner quotes escaped appropriately, now looks like this:

#!/usr/bin/python3

print("Just remember: \"There's always money in the banana stand!\"")

While somewhat uglier and less readable to humans, this is unambiguous to the Python interpreter: the inner two quotes are part of the string, and can safely be skipped over when looking for the second double quote that closes the suitcase.

Naturally, I don't want my final output to contain any ugly backslashes. Happily, it won't: Python removes the escape characters when it reads the string in the program, so they are already gone by the time the string gets displayed, sent as a text message, read aloud, or otherwise.
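A quick check, sketched in Python with variable names of my own choosing, confirms that the backslashes are gone from the string itself while the two inner quotes remain.

```python
advice = "Just remember: \"There's always money in the banana stand!\""

# In this one-character string, the backslash is itself escaped so that we
# can search for it inside advice.
contains_backslash = "\\" in advice

# The two inner double quotes survive as ordinary, literal characters.
literal_quote_count = advice.count('"')

print(advice)
print(contains_backslash)
print(literal_quote_count)
```

The first line of output is the advice exactly as we wanted it, with quotation marks and without backslashes.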

Now that we understand what strings are, how strings are delimited in a left-to-right way, what nesting is, and how escape characters allow strings to be nested within one another, there's a final distinction to master:

Are escape characters pragmatic or literal?

Escape characters are pragmatic. We can see this in two ways:

They actually tell Python to do something (in this case, "ignore the quotation marks that directly follow"), and
They could not possibly be literal because they are removed from the string.

So, what happens if I want my program to output something that contains a backslash?

The file you are looking for is found under C:\Windows\System32

The answer is logical. Since the escape character tells Python to ignore the following character (treat it as a literal), we can use the escape character on itself, simply by doubling it!

#!/usr/bin/python3

print("The file you are looking for is found under C:\\Windows\\System32")

When the backslashes are doubled, the first in each pair tells Python to treat the second one literally.
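Here is a short Python sketch (the variable names are mine) showing that each doubled backslash in the code becomes a single backslash in the string. It also shows Python's raw string prefix, r, which is not covered above: it is an alternative way of telling Python to treat backslashes inside a string literally.

```python
path = "C:\\Windows\\System32"

# Four backslashes were typed in the code, but only two end up in the string.
backslash_count = path.count("\\")
length = len(path)

# A raw string (note the r before the opening quote) leaves backslashes alone.
raw_path = r"C:\Windows\System32"
same_string = path == raw_path

print(path)
print(backslash_count, length, same_string)
```

Both spellings produce the same 19-character string, C:\Windows\System32.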

Closing thoughts: Why this is important to programmers

Escaping text is important to understand because it is central to the art and craft of programming, just as luggage is central to the experience of traveling.

More broadly, strings, suitcases, trunks, cans, envelopes, cardboard boxes and many other types of containers are designed to shuttle contents through a system. Their outsides are meant to be handled by the system, while their contents are meant to be kept separate from the system. Both the system and the contents of the container benefit from a sturdy and leak-proof divider between them.

Picture a malformed string as a suitcase missing one side, or a paint can with a leak, and it's easy to see the kind of havoc it could wreak making its way to where it needs to go.

An improperly-escaped string can be just as bad. In fact, Apple's installer for iTunes 2 (released in 2001) contained a string which was not properly escaped, and this caused a bug which erased several users' hard drives!

This is why escaping text matters. A programmer who doesn't craft strings carefully risks the introduction of bugs, errors, and even security vulnerabilities anywhere strings are used, which is to say almost anywhere.
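To make the risk a little more concrete, here is a minimal sketch using Python's standard shlex module, which splits and quotes text following the rules of a Unix command line. The path is hypothetical, and this only illustrates the general hazard; it is not a reconstruction of any particular installer.

```python
import shlex

# Hypothetical folder path; the space in "My Disk" is the troublemaker.
path = "/Volumes/My Disk/Applications"

# Unescaped, the space ends one argument and starts another, so the
# command would be handed two wrong paths instead of one right one.
unsafe_arguments = shlex.split("ls " + path)

# shlex.quote wraps the path so that the space is treated literally.
safe_arguments = shlex.split("ls " + shlex.quote(path))

print(unsafe_arguments)
print(safe_arguments)
```

In the first case the command receives /Volumes/My and Disk/Applications as two separate items. In the second it receives the single path we intended.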
