Ten things I didn’t know about HTML by Mathias Bynens
Hello, everyone. Welcome to the graveyard slot. I realize that you’re all probably a bit sleepy after having that delicious lunch, but the next 45 minutes will be my attempt at keeping you all awake. This is me, I’m a frontend web developer from Belgium. Are there any other Belgians in the room? Where you at? [cheering]
Oh, you’re spread out across the room, I like that. In my spare time I do all kinds of stuff on the web. I collaborate on various open source projects, one of them is the HTML5 Boilerplate, for which I’m a core developer. I also created some other things like jsPerf, which I’ll hopefully get to talk about a little bit in a few slides.
This talk is titled “Ten things I didn’t know about HTML,” but that’s mostly because I needed a short title. In reality there’s a lot more than just ten things that I would like to talk about. But we’ll just have to see how much time there is left. Also, I won’t be limiting myself to just HTML…
Hopefully I’ll be able to surprise some of you with some of these little fun facts today. First, I have a feeling that there will be more than enough talks that will focus on HTML5 and CSS3 and responsive web design. All these modern, new, fancy hip features.
That’s all super cool. But, for now, why don’t we just put on our hipster glasses and take a close look at HTML 3.2. Back in the day, when there was no such thing as CSS, HTML actually included some presentational elements. There were also a number of presentational attributes, which still allowed web developers to add some level of styling to their pages, without using actual CSS.
One of those presentational attributes is the bgcolor attribute. It’s fairly straightforward. If you set the bgcolor of an element that supports it to, for example, the value pink, you would get something like this wonderful slide. This is probably the most amazing slide that you will get to see at this year’s Fronteers Conference. Enjoy it while you can.
Luckily, you can also use other color notations, like the hexadecimal color notation which you’re probably used to from CSS. This would work just as well. This is a hexadecimal color notation for the CSS color pink. I’m probably not telling you anything new right now. What surprised me here is that you can actually omit the first character there, the hash symbol, and it will still work exactly the same way.
Browsers don’t really care about the hash symbol in this case. Now, they do care about that symbol in another case: if you’re using the shorthand notation, which consists of only three hexadecimal digits.
In this example if you want to use the color #f3f and you want it to map to #ff33ff, which is what I would expect, you really have to use the hash symbol. If you omit it, you end up with a completely different color. Instead of this pinkish type of color, you will end up with #0f030f, which is something that closely resembles the color black.
All of this is defined in the HTML5 spec. I know I told you that I wouldn’t talk about HTML5 just five minutes ago, but this is actually one of the coolest parts of HTML5: it’s the first version of HTML that describes a complete parsing algorithm that all browsers can use. This algorithm even defines how legacy content, or how invalid markup and deprecated elements and attributes like these, can still be parsed.
In a way this is still HTML5, even though it’s deprecated and invalid code. This is just a very small part of this parsing algorithm and that’s the algorithm for parsing a legacy color value. Basically that’s an algorithm that applies some rules to convert an attribute value into a color, which is the end result that we’re after.
One of those rules is that any characters that you add into the attribute value that aren’t a hexadecimal digit, those characters get turned into a zero. They get interpreted as a zero. Now that we know about this rule, if you’re easily bored like me, you can have some geeky fun with this by trying out different attribute values and seeing which colors they map to.
I did just that and I noticed that whenever you enter the value sick you end up with the color green, which makes sense, right? [laughter]
What really happens here is the C character is the only hexadecimal digit in this value, so everything else gets turned into a 0. We get #00C0. The rest of the algorithm converts that into a full hexadecimal color value, in this case, #00C000, which is the green that you’re seeing right now.
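The rules described above can be sketched in JavaScript. This is a simplified reading of the legacy color parsing rules, not the spec's full algorithm: the function name parseLegacyColor is made up for illustration, and named colors such as pink, the transparent keyword and the input-length cap are deliberately left out.

```javascript
// Simplified sketch of the legacy color value parsing rules.
// Assumptions: named colors ("pink"), "transparent" and the cap on
// input length are skipped; parseLegacyColor is a hypothetical name.
function parseLegacyColor(input) {
  if (input === '') return null; // an empty value is an error
  let value = input.trim();

  // Shorthand: '#' followed by exactly three hex digits (#f3f -> #ff33ff).
  const shorthand = /^#([0-9a-f])([0-9a-f])([0-9a-f])$/i.exec(value);
  if (shorthand) {
    return '#' + shorthand.slice(1).map((d) => d + d).join('').toLowerCase();
  }

  // Symbols outside the Basic Multilingual Plane become "00".
  value = value.replace(/[\u{10000}-\u{10FFFF}]/gu, '00');
  // A leading hash symbol is optional.
  if (value[0] === '#') value = value.slice(1);
  // Anything that is not a hexadecimal digit becomes "0".
  value = value.replace(/[^0-9a-f]/gi, '0');
  // Pad with zeros until the length is a non-zero multiple of three.
  while (value.length === 0 || value.length % 3 !== 0) value += '0';

  // Split into three equally long components: red, green, blue.
  const size = value.length / 3;
  let parts = [0, 1, 2].map((i) => value.slice(i * size, (i + 1) * size));
  // Keep at most the last eight digits of each component.
  if (size > 8) parts = parts.map((p) => p.slice(-8));
  // Drop leading zeros for as long as all three components start with one.
  while (parts[0].length > 2 && parts.every((p) => p[0] === '0')) {
    parts = parts.map((p) => p.slice(1));
  }
  // Only the first two digits of each component are used.
  parts = parts.map((p) => p.slice(0, 2).padStart(2, '0'));
  return '#' + parts.join('').toLowerCase();
}
```

With this sketch, parseLegacyColor('sick') gives '#00c000' and parseLegacyColor('f3f') gives '#0f030f', matching the values mentioned in the talk.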
There are some other examples as well that are appropriate in some weird and crazy way. For example, if you try the value MrT… [laughter]
…which is the actor who played B.A. Baracus in The A-Team, you end up with a full black background, as black as it can be. I thought that was fitting as well. Similarly, and this is actually a rule in the HTML5 specification, if you use chucknorris as a value, the browser will render the background using the blood of Chuck’s enemies. [laughter]
That’s a true story. There are some other things that you could try. For example, if you enter the value, crap, you end up with this type of color. I’m not sure if the projection really does it any justice, but on my screen it kind of looks like a mixture between yellow and brown.
Kinda gross. You could try some other stuff as well. For example, if you use the pile of poo symbol, you’ll notice that it will get interpreted as two consecutive zero characters, instead of just one. The end result, in this case, is again the full black color. I think the takeaway here is that, in HTML, crap is brown but poo is black. OK? [laughter]
I also noticed that if you use the value fronteers, it turns into this vibrant pink color, which is surprising, but weirdly appropriate, if you know the conference organizers personally. [laughter]
Anyhow… Sorry, Peter. Sorry. [applause]
But anyhow, there are probably some other colors that would be even more appropriate for something like Fronteers or for this conference. Let’s take a look at the Fronteers logo. Here it is. I don’t know the first thing about graphic design or color theory, but I think it’s safe to say that the primary color in this icon is yellow, right?
It would be cool if we were able to find an attribute value that would somehow be related to Fronteers and that would still map to a yellowish color. That’s my kind of fun, at least. The first thing I tried was using the hashtag for this conference (#fronteers12). Sadly, that didn’t turn out to work. Instead of the yellow that we expected, we get a red-like color.
Luckily, by simply tweaking the attribute value a little bit, we can get to a result that we want. If you use the value fronteers2012, or the value fronteersconf, you get almost exactly the same yellow color. How amazing is that? [applause]
That’s pretty cool, right? Yeah. I thought so. OK. I think we’ve seen enough colored slides for today. If you don’t mind, let’s just go back to the regular slide background before your eyes all start to bleed or something. All those slides that you’ve just seen, those were all, obviously, invalid HTML. You should never use the bgcolor attribute in any document that you create from now on.
Unless, of course, you’re doing it ironically. Now that we’re on the subject of valid and invalid HTML, let’s talk about validation for a little bit. For a long time, when I started out with web development, I always assumed that whenever an HTML validator told me that my markup was valid, that my work as a developer was done. It meant that I had created a document that was written according to web standards.
That’s all there is to it. Over the years, my stance on this changed quite a bit. The first thing I learned is that there are actually three separate layers of conformance criteria that a proper HTML validator should check for. The first one of these layers is the Document Type Definition, or the DTD. Basically, the DTD is a document that defines a list of elements that are supported in a language.
Then for each element, it also tells you which attributes that element supports, and where in the document the element is allowed. There is a DTD for HTML4. There is a DTD for XHTML1. You get the idea. It’s just a structured data file that defines all these different things. Let’s say we wanted to create an HTML validator of our own, using nothing but the DTD.
This validator would be able to detect the validation error in this code, because it would know that the paragraph element, the <p> element is not allowed in the <head>. Similarly, our DTD-based validator would be able to detect that the <kitchensink> element is no real thing. It doesn’t exist in HTML, so the DTD wouldn’t have an entry for that.
As you can see, using nothing but the DTD, we can already detect a number of typos and other common mistakes. But there is more to it than just the DTD. For example, the placeholder attribute may only be used for certain <input> types. The date input type turns out not to be one of those input types. This looks like it may be valid HTML, because you’re using the <input> element, which exists.
You’re using the type and the placeholder attributes, and both of them exist as far as the DTD is concerned. But, in reality, this is still invalid HTML. The DTD wouldn’t be able to detect it. Here’s another example. If you have a <table> and its first row only consists of a single column, but the second row consists of two columns, something’s not right.
This is actually invalid HTML, even though you’re not using any elements that you’ve made up, or attributes that don’t exist. Again, the DTD would not be able to detect this. It’s obvious that we need a second layer of conformance criteria. Those are the criteria that cannot be expressed in the DTD, but a computer or a computer program can still check for these things.
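As a small illustration of a second-layer check, here is a sketch of the placeholder rule in JavaScript. The function name placeholderAllowed is hypothetical, and the list of types reflects the input types for which the HTML specification allows the attribute, to the best of my knowledge. A DTD cannot express this kind of dependency between two attributes, but a program can.

```javascript
// Input types on which the placeholder attribute is allowed.
const PLACEHOLDER_TYPES = new Set([
  'text', 'search', 'url', 'tel', 'email', 'password', 'number',
]);

// Hypothetical helper: returns true if <input type="..."> may carry a
// placeholder attribute. A missing type is treated as "text".
function placeholderAllowed(type) {
  const normalized = type === undefined ? 'text' : String(type).toLowerCase();
  return PLACEHOLDER_TYPES.has(normalized);
}
```

Under these assumptions, placeholderAllowed('date') is false, which is the error from the example that the DTD alone would miss.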
Sadly there’s also a third layer of conformance criteria and those are the things that only a human being can really check. This includes stuff like checking if the alt attribute for images has been used correctly, or if the text content of a <time> element does really represent a time, or if the text content of a <blockquote> element is really actually a quote, as is required by the spec.
There’s no way that a computer program can ever validate that for you, because it has no concept of context in the web page. Only humans can check these things. I think the takeaway here is that, in my opinion, HTML validators are very useful tools, but I think they’re a tiny bit overrated.
We shouldn’t obsess about validation too much. I think it’s OK to have an HTML validation error in your document, as long as you know why the error is there and you have a good reason to have that invalid piece of code there in your page. A good example of that is the X-UA-Compatible <meta> tag that you see here.
You can use this one line of HTML to make sure that Internet Explorer will always use the latest available rendering mode that it has to render your document. This is a very welcome behavior and one that you probably want for all of the web pages that you create from now on, but if you include it in your HTML document, the document becomes invalid.
Should you care about what the validator tells you in this case? I don’t really think so. Finally, of course you should remember that automated validators can only check for two out of the three layers of conformance criteria. The rest is still up to you.
Just because you’re using an HTML validator doesn’t mean that you can stop thinking about your markup and the way you use different HTML elements and attributes. OK. Let’s talk about character references or character entities in HTML. This is probably nothing new, but if you want to include a special character in your HTML you can always use a character reference.
This usually starts with the ampersand character, followed by a number of other characters. For this reason, if you want to include the ampersand character itself, you can always escape that and use a character reference for that character. In this case if you want to use the ampersand character, we can escape it as &amp;. Note here that the semicolon marks the end of the character reference.
You’re all front-end web developers so I suppose that you know that scientific research indicates that the semicolon is the single most dangerous form of punctuation in the history of programming. Case in point, Twitter Bootstrap Issue #3057. Some of you may remember this, some of you may have even been there.
They said, “Sorry, but this is not an issue in our code; it’s an issue in the third-party minifier that you’re using.” Soon after that, the author of that minifier — Douglas Crockford — chimed in. He wrote, and I’m going to quote this, “This is insanely stupid code. I’m not going to dumb down my minifier for this case.” [laughter]
If we get back to our previous example, even in this case it turns out that we can omit the semicolon if we want. Look at me, I’m super cool, I omitted a semicolon. Now the only thing that really changed here is that this HTML code is invalid. Other than that you won’t see a difference; browsers will still render it exactly the same way as the code in our previous slide.
You know how I feel about HTML validators because I just told you five minutes ago. If the only thing keeping you from having a valid HTML document is a single semicolon, maybe you should just add it. Another reason why I wouldn’t recommend omitting semicolons just because you can get away with it is that it gets even more confusing.
Especially if you use it in attribute values as well. Here, we’re using exactly the same string value, foo&ampbar, as the text content of the paragraph and as the content of the title attribute for the same element. You’ll see that the text gets rendered as foo&bar, while the title gets rendered as foo&ampbar.
There’s a difference there. The reason for that is that attribute values have different parsing rules than simple text content. That’s another reason why I wouldn’t recommend using this “trick.” Now, what really surprised me is that in some cases, you don’t even need to escape the ampersand symbol at all.
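The difference can be sketched with a toy decoder in JavaScript. This is not how a browser's tokenizer is implemented: decodeAmp is a hypothetical helper that only knows the ampersand reference, and it encodes just the one rule that matters here, namely that inside an attribute value a reference without a semicolon is left untouched when the next character is an equals sign, a letter or a digit.

```javascript
// Toy decoder for "&amp" and "&amp;" only (hypothetical helper).
function decodeAmp(source, inAttribute) {
  // The lookahead captures the character that follows the reference, if any.
  return source.replace(/&amp(;?)(?=(.?))/g, (match, semicolon, next) => {
    // Attribute values: without a semicolon, and followed by "=" or an
    // ASCII letter or digit, the text is not treated as a reference.
    if (inAttribute && semicolon === '' && /[=0-9a-z]/i.test(next)) {
      return match;
    }
    return '&';
  });
}
```

With this sketch, decodeAmp('foo&ampbar', false) gives 'foo&bar', while decodeAmp('foo&ampbar', true) leaves the string as 'foo&ampbar'. With the semicolon in place, both contexts give 'foo&bar'.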
Like, for example, in this case, you can get away with not escaping it at all, because it’s followed by a space character. I’m going to spare you all the gory details and the exceptions to all these rules, but if you’re interested in that, you can always check out the URL at the bottom of this slide [http://mths.be/bdu]. But, for now, let’s just take a look at another example.
Let’s say we want to use the “greater than” symbol in our page. As you know, the greater than symbol has a special meaning in HTML, because we use it to close tags. If you want to use it in your text content, you can always, of course, escape it using a character reference. The character reference for this character is &gt;.
As you probably guessed, this is another one of those character references for which you can get away with omitting the semicolon. Browsers will still render it exactly the same way. The only difference is that the code became invalid now. Of course, here, the same applies as in the previous example. There’s still that difference between how attribute values are parsed and how the text content is parsed.
This is needlessly confusing and I wouldn’t recommend doing this. Just use the full form of the character reference, ending in the semicolon. However, what really was a surprise for me here is that, again, you can get away with simply not escaping the symbol.
You can simply use the greater-than symbol as the raw character in your source code. The reason for that is that unless the browser is currently parsing an open tag, there’s no ambiguity there. The browser knows that you just mean to use the raw character, so there’s no need to escape it.
Now that we’re on the subject of tags and elements and HTML, let’s take a look at a very simple HTML document. Now, I realize that this is using the HTML5 DOCTYPE, but what I’m about to say mostly applies to HTML4 or even HTML2, if you want.
You see we have a DOCTYPE there to trigger standards mode — that’s important. We then have the <html> element, which contains a <head> and a <body> element, and then the <head> element contains the <title> element, and then the <body> element contains all the contents that we want to display on our page.
Well, the first thing I learned here is that it turns out that you can simply omit the closing tags for the <html>, the <body> and the <head> elements. We can simply scratch those, omit them from your markup, and you will still end up with a valid HTML document that will render exactly the same way in all browsers.
That was the first surprise. But it gets even better. It turns out that you can even omit the starting tags for these elements, just like that. The end result is a very compact, but still valid HTML document that will still render exactly the same way on all browsers.
These elements are implied and there’s not really a need to include them in the markup. On that note, this is probably the most useless tattoo ever. If you’re going to get an HTML tattoo, at least pick some tags that aren’t implied or optional. It’s a waste of tattoo ink.
With that in mind we can say that this tweet is misinformed. It says “<html><head></head><body></body></html> is all the HTML you ever really need.” That’s just plain wrong — luckily it’s very easy to fix this tweet. There, I fixed it. These are actually the only elements that you never really need in the document. What you do need, that’s missing from this tweet is a DOCTYPE and in 99 percent of all cases, a <title> element. It’s a bit wrong.
Now let’s talk about CSS for a little bit. Let’s talk about font family names. For a long time whenever I wanted to use a font on my website using CSS, I would always quote the font family name if it contains spaces. The reason for that is, I think this was inspired by the warnings of the CSS validator of the W3C, but I’m not sure anymore.
I lived by this mantra that whenever there’s whitespace in the font family name, it must be quoted or else it wouldn’t work. Only recently I found out that that’s actually not really true. This mantra doesn’t really make sense. If you’re interested in all the exact rules, when the quotes are needed and when they’re not needed, you can always check out the URL in the middle of this slide [http://mths.be/bft].
In short, whenever the font family name is a space-separated set of CSS identifiers, then you don’t have to use the quotes at all. It turns out that that just happens to be the case in 90 percent of all font family names that are actually being used. I just totally made that number up by the way, but it’s something close to 90 percent, I’m sure.
In this case we can just do away with all the quotes. Let’s take a look at another font family name. Let’s say we have a font called 456bereastreet. Because this font family name starts with a digit, it’s not a valid CSS identifier. For that reason, this line of CSS code won’t actually work. It will silently be discarded.
There are a couple of things that we can do to fix this. The first thing that we could do is, we could simply escape the first digit. This makes the whole thing a valid identifier again, but this looks a little bit messy and confusing. The font family name is 456bereastreet, but now it says \34 56bereastreet and there’s this weird space in between the 4 and the 5, there’s this weird backslash at the beginning, and it’s really confusing.
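Both rules can be sketched in JavaScript. The helper names are made up, and the identifier pattern is a simplification that ignores escape sequences and some newer syntax, so treat it as an approximation rather than the exact grammar.

```javascript
// Simplified CSS identifier: an optional hyphen, then a letter, underscore
// or non-ASCII character, followed by letters, digits, hyphens, underscores
// or non-ASCII characters. Escapes are not handled in this sketch.
const IDENTIFIER = /^-?(?:[_a-z]|[^\x00-\x7F])(?:[_a-z0-9-]|[^\x00-\x7F])*$/i;

// Hypothetical helper: true if the font family name is a space-separated
// sequence of identifiers, so the quotes can be left out.
function canOmitQuotes(fontFamily) {
  const words = fontFamily.trim().split(/\s+/);
  return words.every((word) => IDENTIFIER.test(word));
}

// Hypothetical helper: escapes a leading digit as a hexadecimal code point
// followed by a space, which ends the escape sequence.
function escapeLeadingDigit(name) {
  return name.replace(/^[0-9]/, (digit) =>
    '\\' + digit.charCodeAt(0).toString(16) + ' ');
}
```

Here canOmitQuotes('456bereastreet') is false, and escapeLeadingDigit('456bereastreet') produces the \34 56bereastreet form described above, since 34 is the hexadecimal code point of the digit 4.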
A better solution in this case, I think is to simply use quotes. I’d say whenever you’re in doubt about something, always simply use quotes. It avoids a lot of problems in many different programming languages and it rarely even introduces new problems. Just play it safe and always use quotes. I think the same can be said for semicolons, by the way.
If you’re confused: I created a tool for this [http://mths.be/bjm] that can be used to play around with different font family names. You can enter any value that you want and it will tell you if it would make sense to wrap it in quotes or not. I realize no one ever needs a tool like this, but I still made it just for fun. I think it can be useful if you want to learn about this stuff.
OK, now let’s take a closer look at attribute values in both HTML and CSS. I hope I’m not telling you anything new here, but as you can see, in both HTML and CSS you can use quotes around your attribute values. Mind blown, right? In this case the attribute value is foo in the HTML document. Then we have some CSS that selects all the anchors in the document whose href attribute is set to the value foo.
In this case it will only select the first link in this document here and it would give it a hot pink background. In this case we can simply omit the quotes in both HTML and in CSS. It would still work exactly the same way, and it would still be valid HTML, and the CSS would be valid, too. There’s no issue there.
However, if we try this with another value, let’s say foo|bar for example, you’ll notice that some things start to go wrong in a very weird and unexpected way. In this example, the HTML is valid, but the CSS isn’t valid. The reason for that is that foo|bar is not a CSS identifier. There’s always an explanation, but you never expect these things to happen. It’s important to note here that there’s a difference between the rules for unquoted attribute values between HTML and CSS.
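A rough JavaScript sketch of the two sets of rules, with hypothetical function names. The HTML check follows the spec's list of characters that may not appear in an unquoted attribute value, and the CSS check reuses a simplified identifier pattern that ignores escape sequences.

```javascript
// HTML: an unquoted attribute value must be non-empty and must not contain
// whitespace, quotes, "=", "<", ">" or a backtick.
// Note: \s is slightly broader than the spec's ASCII whitespace.
function isValidUnquotedHtmlValue(value) {
  return value !== '' && !/[\s"'=<>`]/.test(value);
}

// CSS: an unquoted attribute value in a selector must be an identifier.
// Simplified pattern; escape sequences are not handled.
function isValidUnquotedCssValue(value) {
  return /^-?(?:[_a-z]|[^\x00-\x7F])(?:[_a-z0-9-]|[^\x00-\x7F])*$/i.test(value);
}
```

For foo both checks pass. For foo|bar the HTML check passes but the CSS check fails, because the pipe character cannot appear in an identifier.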
Again in this case, the solution is to simply just use the damn quotes. I would probably even add the quotes in HTML, just to be safe, just to be sure. However, if you’re that guy or girl who doesn’t want to use quotes unless you absolutely have to, well you can find out about all the different rules and exceptions if you simply check out the URL at the bottom of this slide [http://mths.be/bal].
I’ve also made a tool [http://mths.be/bjn] that can help you with that, if you’re [laughs] interested in that. It’s another one of those useless tools, I know, but I had fun making it. You can simply enter just about any attribute value that you want to use and it will tell you instantly if it’s a valid unquoted attribute value in HTML and in CSS, so there.
Audience member: [indecipherable]
Mathias: I’m sorry?
Audience member: There is no function for that.
No, there is no function for that, but my first thought would be to use the length property. All strings have a length property. For example, if we try this, if we create a string that contains the capital letter A, its length property will have the value 1. If we make a string with the capital letter B, again its length property will be 1.
In these cases, as you can see, the length property of the string just happens to reflect the number of characters in the string. It’s important to note that this is not always the case. As you can see here, I’ve included the Unicode escape sequences for these code points. For these symbols, in this case it’s very clear that there’s only one escape sequence there so the length of the string will be 1.
Basically, if you simply convert the string to an array first and then get the length of the array, instead of getting the length of the string directly, we end up with the correct result that we were looking for. As you can see here… If you use our brand new countSymbols function, for both the “normal”, the regular capital letter A, or for the mathematical bold capital letter A… We get 1 as a result in both cases, which is exactly what we want.
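The slide's code is not part of this transcript, so the following is only a sketch of what a countSymbols function could look like. It assumes Array.from, which splits a string by code point rather than by UTF-16 code unit. The original implementation may well have used a different conversion.

```javascript
// Counts code points instead of UTF-16 code units.
// Assumption: Array.from is available; it iterates a string by code point.
function countSymbols(string) {
  return Array.from(string).length;
}

const regularA = 'A';           // U+0041 LATIN CAPITAL LETTER A
const boldA = '\u{1D400}';      // U+1D400 MATHEMATICAL BOLD CAPITAL A

const lengths = [regularA.length, boldA.length];
const counts = [countSymbols(regularA), countSymbols(boldA)];
```

The length property reports 1 and 2 for these two strings, because the bold A is stored as a surrogate pair, while countSymbols reports 1 for both.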
I have five different variable declarations here. I would like to know which one of these is invalid. Any ideas? Who thinks it’s the first one? Raise your hand. Who thinks it’s the second one? Some people. Who thinks it’s the third one? Some more people.
Who thinks it’s the fourth one? Yeah, it does contain some characters like plus, minus and greater than symbols — there’s some weird stuff in there. Let’s just take a look… Yes, I believe these are all valid except for the last one. The reason for that is there’s a zero-width non-breaking space right there. [laughter]
There it is, see. [applause]
I can’t believe you didn’t see that, come on. Stay awake, you all. If you’re interested in the exact rules and which Unicode categories and characters are allowed, you can always check out the link at the bottom there [http://mths.be/ber], as usual.
Give it a minute. It’s literally over nine thousand characters long by the way. There we go, it’s actually 11 thousand, three hundred and something characters in total. I actually wrote a Python script to generate this for me — I didn’t manually write it out by hand or anything. I’m not that crazy.
If you didn’t manage to write that down in time, there’s always this other useless tool [http://mths.be/bjo] that I created [laughs], in which you can simply enter just about any string value that you want.
Well, the ECMAScript specification defines the following algorithm which defines whether a given value is truthy or falsy. As you can see, if you coerce undefined into a boolean, it becomes false. You could say that undefined is falsy.
Similarly null is another falsy value. If the original value is already a boolean, so it’s true or false, well in that case it’ll simply be the same as the input. true is truthy and false is falsy. That makes sense, right?
If it’s a number it depends on the value: if it’s plus or minus zero or if it’s the number NaN it will be falsy, and any other number value will be truthy. The same goes for strings: if it’s the empty string, that will be falsy, but every other string value is truthy.
Then finally, and this is what you should remember from this slide, is that all the other objects that aren’t listed here are all truthy. It doesn’t matter if it’s an empty array or an empty object literal — all those objects are supposed to be truthy according to the ECMAScript specification.
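The table can be checked directly in JavaScript by coercing a few values with Boolean. The two lists below are only a selection for illustration, not an exhaustive enumeration.

```javascript
// Values that coerce to false, as listed above.
const falsyValues = [undefined, null, false, 0, -0, NaN, ''];

// A selection of values that coerce to true, including empty objects
// and non-empty strings that merely look falsy.
const truthyValues = [true, 1, -1, Infinity, '0', 'false', [], {}, function () {}];

const falsyResults = falsyValues.map(Boolean);
const truthyResults = truthyValues.map(Boolean);
```

Every entry in falsyResults is false and every entry in truthyResults is true, including the empty array and the empty object literal.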
Of course there’s one exception to this rule in the DOM. Does anyone know which exception I’m talking about? There’s one object that lives in the DOM that is falsy, instead of truthy.
Yeah, it’s document.all. If you inspect the value, you’ll see that it’s an HTMLCollection object, so it is a real object that contains references to various elements that live in the DOM. If you coerce it into a boolean, you’ll see that it’s falsy.
Why has this been done? This is actually a willful violation of the ECMAScript specification for backwards compatibility. The reason this change has been made and it has been specified this way in the DOM specification and HTML5 spec, is that a lot of existing code on the web uses stuff like this.
As you can see it checks for document.all first, tries to use document.all if it’s available, and only if it’s not available it falls back to using document.getElementById. Most modern browsers implement both of these things. They implement document.all for backwards compatibility with pages that rely on it, and they implement document.getElementById because it’s standard and it’s the best way of getting an element based on its ID.
In modern browsers we would prefer to end up in the else fork, instead of the if fork. As long as we support document.all in our modern browser, we’ll never get there, unless of course we make document.all be a falsy object without changing its actual value. That’s the reason why this was changed.
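The slide itself is not reproduced in this transcript, but the legacy pattern being described looks roughly like the sketch below. The function name getElement is made up, and the document is passed in as a parameter named doc so that the example also runs outside a browser.

```javascript
// Legacy feature detection, as found in a lot of old code on the web.
// `doc` stands in for the global `document` object.
function getElement(doc, id) {
  if (doc.all) {
    // The "if" fork: the old, non-standard way.
    return doc.all[id];
  }
  // The "else" fork: the standard way.
  return doc.getElementById(id);
}

// Stand-in objects for illustration only.
const legacyDoc = {
  all: { menu: 'found via document.all' },
  getElementById: () => 'found via getElementById',
};
const standardDoc = {
  getElementById: () => 'found via getElementById',
};
```

With these stand-ins, getElement(legacyDoc, 'menu') takes the first fork and getElement(standardDoc, 'menu') takes the second. In a real browser, document.all exists but coerces to false, so the same code ends up in the standard fork.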
My indentation is a bit messed up. I blame Keynote for that. This pattern is probably the most popular one. It’s used in the popular SlickSpeed, TaskSpeed, SunSpider and Kraken benchmark suites. It’s being used a lot. Basically, what it does is it gets a timestamp.
Then it executes the code that you want to benchmark. Then it repeats it for a predefined number of iterations. Then, finally, you get another timestamp and compare it to the original timestamp. That gives you the difference.
Which is, essentially, a useless result. A slightly more future-proof way of doing things is the following. You simply keep on running the test code for at least a second and you keep track manually, in a variable, of how many times you were able to run the code in total. Then, after that, of course, you can easily calculate how long each run of the test code took. Or how many runs per second were achieved.
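The two patterns can be sketched side by side in JavaScript. These are minimal illustrations with made-up function names, not the code used by any of the benchmark suites mentioned, and they use Date.now() as the timer.

```javascript
// Pattern 1: run the test a fixed number of times and report the total time.
function benchFixedIterations(testFn, iterations) {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) {
    testFn();
  }
  return Date.now() - start; // total elapsed milliseconds
}

// Pattern 2: keep running for at least minTime milliseconds and count runs.
// minTime must be greater than zero so the elapsed time is never zero.
function benchFixedTime(testFn, minTime) {
  const start = Date.now();
  let runs = 0;
  let elapsed = 0;
  do {
    testFn();
    runs++;
    elapsed = Date.now() - start;
  } while (elapsed < minTime);
  return {
    runs,
    msPerRun: elapsed / runs,
    runsPerSecond: runs / (elapsed / 1000),
  };
}
```

For example, benchFixedTime(() => Math.sqrt(2), 1000) runs the test for at least a second and reports how many runs fit in that time. The second pattern adapts to the speed of the machine, whereas a fixed iteration count can finish too quickly to measure on fast hardware.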
This is a pattern that is being used in Dromaeo and the V8 Benchmarks Suite. Unfortunately, it’s still not that simple. There are still some other issues that are much harder to solve. For example, I know most of you won’t even care about this anymore. But, if you’re using Windows XP, you should probably know that the internal system clock only gets updated every 10 or 15 milliseconds.
Of course, if that one only updates once every 15 milliseconds, it’s very hard to get accurate results. For example, let’s say you have a test that takes only two milliseconds to run. This may show a result of either zero milliseconds, if the internal system clock didn’t update while the test was running. Or, if the timer did update, you would get a result of 15 milliseconds.
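One rough way to see this effect is to measure the smallest step a timer can report. The sketch below busy-waits until Date.now() changes and keeps the smallest difference observed. The function name is hypothetical and this is not how any particular library measures it.

```javascript
// Estimates the smallest increment that Date.now() can observe,
// in milliseconds, by waiting for the reported time to change.
function estimateTimerResolution(samples) {
  let smallest = Infinity;
  for (let i = 0; i < samples; i++) {
    const start = Date.now();
    let now = start;
    // Busy-wait until the clock reports a new value.
    while (now === start) {
      now = Date.now();
    }
    smallest = Math.min(smallest, now - start);
  }
  return smallest;
}
```

On a system where the clock only updates every 15 milliseconds, a call such as estimateTimerResolution(10) would be expected to return a value around 15, which means a 2 millisecond test cannot be timed reliably in a single run.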
Like Benchmark.js, for example. It’s able to detect the various different timers, each with their own resolution. For example, there is of course (new Date).getTime(), which is the most common timer, which is supposed to have a millisecond resolution. That’s not bad, but as you know, on Windows XP, you don’t nearly get that one-millisecond resolution.
There is also chrome.Interval, which is, of course, a Chrome-specific API that is only available if you start Chrome with a certain command-line flag. I doubt a lot of people have this flag enabled. But still, Benchmark.js is able to detect it and use this timer, if it turns out to be the best one that it could find. This timer offers a microsecond resolution.
Then once we get the results, Benchmark.js will perform statistical analysis of all these results, which is really important because if you want to run the same benchmark on the same machine twice in a row, you’re going to expect to have more or less the same results. However if you run the SunSpider test for example and you run it on the same machine twice in a row, it’ll probably tell you that your machine is faster or slower than itself. That’s because they’re not doing statistical analysis of the results and removing the outliers and stuff like that.
You get a neat table and an overview. Then finally when you run the test it will tell you which of the results was fastest and which of them was slowest. As you can see in the screenshot here, there are three different results that get highlighted as being the fastest, even though their numeric values are slightly different. The reason for that is that we consider the margin of error for these results.
If you do that, there’s no way of knowing which of those three green highlighted results is actually the fastest. Statistically, they’re all equally fast. For that reason, we will simply highlight all of them. OK. Those were a couple of things that I learned. I think that’s it for me. If you have any questions, feel free to ping me on Twitter [@mathias].
I’ll get back to you as soon as I can. Thank you for listening. [applause]