Fronteers — professional association for front-end developers

jsmpeg: Why a JavaScript Video Decoder Actually Makes Sense by Dominic Szablewski

Transcript

[APPLAUSE] (dominic szablewski) Thank you.

Hi, everyone.

My name is Dominic Szablewski.

And you can find me online under the name Phoboslab.

Phobos is actually one of the two Martian moons, Phobos and Deimos.

And that's an actual photo of the moon, you can see in the background right here.

I do lots of crazy experiments.

And you know, the Phobos Laboratory is originally the place where this happened.

So my experiments are crazy, but not as crazy as opening portals to hell-- yet.

So I do lots of different stuff.

Back in 2010, I made one of the first HTML5 games, and maybe the first HTML5 game that's actually fun to play.

[CHUCKLES] I later published the game engine behind this game and sold it.

And this game engine is one of the main reasons I can do all these crazy experiments right now, because it was quite successful.

Later, I published one of the first HTML5 games for the Nintendo Wii U.

They actually have a framework, called the Nintendo Web Framework, which runs straight on the console.

So you can do crazy stuff like this, with hundreds of bullets on the screen.

I started a project called Ejecta.

Ejecta is a Canvas implementation for iOS.

So think of it like a Canvas element, without the browser surrounding it.

You just get the JavaScript runtime and the Canvas implementation that is really, really fast, because it's implemented with OpenGL natively.

And it doesn't have to get through all the layers, like in a normal browser.

I was hoping that this project would become obsolete a number of times in the last few years, because the browsers on smartphones keep getting better and better.

But there's always a use case where this project still remains relevant, the newest one being Apple's tvOS for their Apple TV platform, which doesn't support web views at all.

So if you want to have some HTML5 animation or HTML5 games, Ejecta is probably the way to go.

I also do a bit of native stuff.

So earlier this year, I ported one of my favorite games to the Oculus Rift to play in virtual reality.

And I tried virtual reality, or more specifically, WebVR, for one of my own HTML5 games.

So you can play the virtual reality version in your browser, if you have the headset.

Both games are actually horrible to play.

Because of the fast movement, you will get sick in a few minutes.

So I don't recommend it, but it's still fun to look at.

[AUDIENCE CHUCKLES] And lastly, a few months ago, I reverse engineered one of my other favorite games from the '90s and extracted all the data formats, the textures, and the models, to be displayed in a browser.

So this actually runs with WebGL right now.

And it's interactive; you can zoom around the map.

That's the whole track-- the whole racetrack-- rendered at once in a browser.

And it also works on mobile phones with 60 frames per second.

If you think back to the '90s when you played this game, you had a render distance of maybe 10 meters before you got a blank screen.

So it's quite, quite crazy that you can render the whole track at once in a browser right now.

So today, I want to talk about something else, which is jsmpeg, as was announced already.

It's a JavaScript MPEG1 video decoder.

And as you can see in the example here, it's quite easy to use.

You just hand it over a Canvas element and the file name of your MPEG1 file, and it will start playing in your browser.
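
A minimal sketch of that usage, with a hypothetical canvas ID and file name; the option names follow the jsmpeg README of the time and may differ between versions:

    // Hedged sketch: hand jsmpeg a Canvas element and the URL of an MPEG1 file.
    var canvas = document.getElementById('video-canvas');   // hypothetical element ID
    var player = new jsmpeg('video.mpg', {
      canvas: canvas,     // render target
      autoplay: true,     // start playing immediately
      loop: true          // restart at the end
    });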

And it looks something like this.

This is actually rendered in a web page.

It's rendered with 30 frames per second.

And it's a 720p video, so it's quite a good resolution.

You still get a bunch of encoding artifacts, but it's not too bad.

And it also runs on a mobile phone with 30 frames per second.

So you can get this on your iPhone 5 and watch 720p video decoded in JavaScript in real time.

The whole library is only about 1,800 lines of code.

And this is readable code, so not garbled or minified.

And maybe a good third of the code is actually just code tables for the Huffman decoding.

So if you want to get into video decoding, it's a really nice place to start, and to look at, and to experiment with.

And to show you that this runs in a browser, I can just bring up the console.

And you get all the controls that you would expect.

So I can pause the video right now.

I can seek to a specific frame.

Here's a good one.

And I can-- I can single-step through the frames.
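
A hedged sketch of those console controls; the method names are from memory and may differ between jsmpeg versions:

    player.pause();             // stop decoding
    player.seekToFrame(1200);   // jump to a specific frame (1200 is an arbitrary example)
    player.nextFrame();         // decode and display exactly one more frame
    player.play();              // resume normal playback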

And you can pinpoint the exact moment when his heart breaks.

[AUDIENCE LAUGHTER] And he doesn't care anymore.

So again, this now runs really well in a mobile browser.

But it hasn't always been this way.

And I will explain how we got here.

But first, I want to dive into the MPEG1 video format a bit.

MPEG1 was developed in the 90s, so it's quite old.

It's more than 20 years old.

It was designed for 90s hardware, so really slow computers or discrete chips that were designed to decode this video format.

The fact that it's obsolete and designed for old hardware also means that it's simple, and it runs fast enough to do it in JavaScript.

I have looked at a lot of other video formats, and it's unbelievable how complicated they are.

There's so much stuff piled on top of each other.

And MPEG really is a nice foundation to understand video codecs.

It's interesting, because most modern video codecs, like H.264, actually work in much the same way as MPEG does.

But they add a lot of special casing and different stuff to make it even better than MPEG1, which also makes it way, way more complicated.

So let's dive in a bit.

Here's a single video frame.

This is actually a lower resolution version of the frame we just saw.

And it's also a lower bitrate version.

And this frame is a so-called "intra" frame, which means that the whole frame is encoded.

Intra frames in videos are typically followed by so-called P-frames, or "predicted" frames, which just encode the differences from the previous frame.

So you get one intra frame, and a bunch of predicted frames after that.

Modern codecs can also reference other frames in their predicted frames.

So a predicted frame might actually reference a frame that's ahead of it and encode the difference to that frame.

But in MPEG, it's quite simple: it only references the last decoded frame.

This also means that, if you want to seek to a specific place in the video, you can only seek to the nearest intra frame before it, because you have to have a complete picture.

You have to decode this picture completely.

And then you can decode all the predicted frames on top of it, until you reach the place where you actually want to go.

So you have to find a good balance between intra frames and predicted frames.

If you only have one intra frame at the very beginning of the video and have predicted frames on top of each other for the whole video length, you not only lose the ability to search through this video, but you also have a lot of decoding artifacts and rounding errors that pile on top of each other.

So typically, video files will have about one intra frame per second, so one fully encoded frame in the video file.
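
The seeking rule described above can be written down in a few lines. This is only an illustrative sketch; findPrecedingIntraFrame and decodeFrame are hypothetical helpers, not jsmpeg API:

    // Seeking means: go back to the nearest intra frame, then decode forward.
    function seekTo(targetFrame) {
      var start = findPrecedingIntraFrame(targetFrame); // nearest fully encoded frame
      for (var f = start; f <= targetFrame; f++) {
        decodeFrame(f); // predicted frames are applied on top of the last decoded frame
      }
    }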

So if we look at this frame a bit more closely, you can already see lots of the encoding artifacts.

That's why I chose a low bitrate, to highlight these.

You will see that the video frame is divided into 8 x 8 pixel chunks.

These chunks are called macroblocks.

And each macroblock is encoded separately.

So if we have a look at one of these macroblocks-- and we also have a look at just one channel for this macroblock.

So we just ignore the colors for now and go with a grayscale version.

So this macroblock is not encoded with the actual pixel values.

So in an 8 x 8 macroblock, you have, obviously, 64 pixels that have to be encoded.

But the values of these pixels are not stored independently of each other; rather, the macroblock is decomposed into the frequencies of this image.

So these are all the possible frequencies that could be in an 8 x 8 pixel image.

And if you combine these frequencies in a specific manner, you can actually perfectly reconstruct this macroblock that you see on the bottom of the screen.

So you take, for instance, in this example, 0.73 times the upper left frequency of the image, and then 0.24 times the next one, and so on.

And this is sort of like when you analyze an audio file-- you will know that if you are in Winamp, you have a frequency analyzer.

And it shows you the frequencies that are just currently playing.

It doesn't show the wave that is currently playing, but the frequencies.

And if you do this with a high enough resolution, you can perfectly reconstruct the audio file.

And in this case, this step is completely lossless-- so from these coefficients, you could perfectly recreate this 8 x 8 pixel image.
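
The "weighted sum of frequencies" idea is the inverse discrete cosine transform. Here is a naive, unoptimized sketch of it for one 8 x 8 block; a real decoder uses a much faster factored version:

    // Reconstruct 64 pixel values from 64 frequency coefficients (2D inverse DCT).
    function idct8x8(coeffs) {
      var out = new Float32Array(64);
      for (var y = 0; y < 8; y++) {
        for (var x = 0; x < 8; x++) {
          var sum = 0;
          for (var v = 0; v < 8; v++) {
            for (var u = 0; u < 8; u++) {
              var cu = (u === 0) ? Math.SQRT1_2 : 1;   // the constant terms get a smaller weight
              var cv = (v === 0) ? Math.SQRT1_2 : 1;
              sum += cu * cv * coeffs[v * 8 + u] *
                Math.cos(((2 * x + 1) * u * Math.PI) / 16) *
                Math.cos(((2 * y + 1) * v * Math.PI) / 16);
            }
          }
          out[y * 8 + x] = sum / 4;
        }
      }
      return out;
    }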

What MPEG does now is say that some of these frequencies are more important than others.

So the low frequencies that you see in the upper left corner are encoded with higher accuracy.

And the high frequencies that you see in the lower right corner are encoded with lower accuracy.

So many of these high frequencies either end up as zero or are dropped entirely.
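
In code, that weighting is just a quantization matrix: every coefficient gets divided by its own step size when encoding and multiplied back when decoding. A simplified sketch; the real MPEG1 rules add an extra scaling factor, rounding, and clamping:

    // Dequantize one block: bigger steps for high frequencies mean those
    // coefficients were stored coarsely and often come back as zero.
    function dequantize(coeffs, quantMatrix, quantizerScale) {
      var out = new Int32Array(64);
      for (var i = 0; i < 64; i++) {
        out[i] = coeffs[i] * quantMatrix[i] * quantizerScale;
      }
      return out;
    }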

The JPEG format actually works in much the same way with the encoding.

And you might have seen this.

This is the exact reason why JPEG is a terrible format for high-frequency, high-contrast content, like an image of text.

If you have text, you have very high contrast and very high frequencies.

And these are typically destroyed in the encoding process.

So this is one of the tricks that MPEG does to lessen the file size of the encoded image.

These coefficients are stored.

They are run length encoded, which means that, if there are several values that are the same after each other in this coefficient table, it will just say, this value comes five times.

And this saves a little bit of space.

And then it's also Huffman encoded, which is similar to ZIP compression.

So you have a table of the values that are seen most frequently in an image, and those get the shortest codes.
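
A hedged sketch of the run-length idea as described here; in the actual MPEG bitstream the runs are runs of zero coefficients followed by a single value, but the principle is the same:

    // Expand (count, value) pairs back into a flat coefficient list.
    function runLengthDecode(pairs) {
      var coeffs = [];
      pairs.forEach(function (p) {
        for (var i = 0; i < p.count; i++) {
          coeffs.push(p.value);
        }
      });
      return coeffs;
    }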

These encoded coefficients and compressed coefficients are stored in a macroblock.

And for predicted frames, only the difference between this macroblock and the corresponding macroblock from the previous frame is encoded.

But predicted frames also store a reference block address.

So they say, this macroblock references another macroblock from the last frame.

And it stores a motion vector that says, the last macroblock moved 4.5 pixels in this direction.

And so you not only encode the difference from the last macroblock, but the difference from a macroblock that was moved from the last frame.
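
A simplified sketch of that motion compensation step, using a whole-pixel motion vector; real MPEG also supports half-pixel vectors, which need interpolation:

    // A predicted block = a shifted block from the previous frame + the decoded difference.
    function predictBlock(prevFrame, frameWidth, destX, destY, motionX, motionY, diff) {
      var block = new Uint8ClampedArray(64);
      for (var y = 0; y < 8; y++) {
        for (var x = 0; x < 8; x++) {
          var srcX = destX + x + motionX;   // the motion vector shifts the reference block
          var srcY = destY + y + motionY;
          block[y * 8 + x] = prevFrame[srcY * frameWidth + srcX] + diff[y * 8 + x];
        }
      }
      return block;
    }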

This is also the main reason why encoders are so much slower than decoders, because you have to find these reference macroblocks.

You have to search through the image and see where the last macroblock was that was kind of similar.

So these macroblocks are then stored-- oh.

Hold on.

Sorry.

[LAUGHS] These macroblocks are then stored in so-called slices.

A slice typically runs for one row of macroblocks in the image, and then encodes some more properties of the slice.

So each slice can be stored a bit more efficiently.

And a bunch of these slices are finally stored into one picture.

And this picture also stores whether it is an intra frame or a predicted frame, and a bunch of other attributes for this picture.

So let's have a look at this in motion.

Again, this is the video we saw earlier.

And again, it's a lower resolution, lower bitrate version of this video, so we can see the encoding artifacts.

One thing I haven't talked about right now is the color space that MPEG stores in the video file.

So typically, in an ima-- [CLEARS THROAT] sorry-- in an image, you have it stored in RGB color format, so red, green, and blue.

Each get their own channel.

But MPEG actually stores all the information in three channels: one of which is for the lightness, or luma, and two of which are for the chroma values, the colors.

And this format is called YCbCr.

And you can see the luma channel is displayed in grayscale here.

And both color channels are displayed in, sort of, ugly, greenish-blue colors.

And if we have a closer look, you can see the color channel on the right.

And if we display it as grayscale, so that you have a bit more contrast, you can see another trick that the MPEG format uses, which is that it stores the color channels at a much lower resolution than the lightness channel.

So it looks quite ugly.

And if you look closely, you can also see some of the patterns that you saw in this-- in this wave-- in the frequency table.

So you see some of these frequencies, some of these checkerboard patterns encoded in the video file, because the resolution and the bitrate is so low.

But together, these color channels and lightness channel work out pretty well.

So you don't notice that the color information is in such a low resolution, because the lightness channel is in a higher resolution and is sort of displayed over the color information.

So here's what happens if, instead of decoding intra frames-- those frames that are fully encoded in the video file-- you just clear the screen.

I also have lowered the frame rate a bit, so that you can see what's going on.

And this green color might actually seem familiar to you.

It's what happens when all three channel values are zero.

And this is the color that gets displayed then.

And as I said, we ignore these intra frames completely and clear the screen instead.

So you see only the differences encoded on top of each other.

So you see those slight movements in the green color.

But you also see some complete macroblocks being displayed completely, because the encoder decided it was actually more efficient to store this macroblock completely with all color information, instead of storing the differences from the last frame.

So you can see those blocks popping up, if there's a lot of motion on the screen.

Let's go back to normal.

And here's what happens if you just ignore the intra frames, or 90% of the intra frames.

And that's an error you have probably also seen countless times in video codecs.

So the screen isn't cleared now when there's an intra frame; the intra frame is just completely ignored.

And you can see the movement is displayed on top of the old frames.

And sometimes, there are some quite funny results from that.

So this color conversion I talked about-- the YCbCr color space actually has to be converted into RGB to be displayed on a Canvas element or on a screen, because all our screens typically work with the RGB color space.

And this color space conversion from YCbCr to RGB was one of the main bottlenecks when I developed this decoder.

Just think about it.

If you have a 720p video, you have to convert almost one million pixels per frame-- sorry-- from YCbCr to RGB.

So this inner loop you see here, this actually has to run for close to a million times per frame.

That's 30 million times per second, which is quite a lot for JavaScript.

Works fine on desktop PCs, but it was a bit too slow to decode 720p video on mobile devices.

So I already did some checks here to try to speed it up.

The conversion between these color spaces is done with integers only, so no floating point math, which made it a bit faster, but still not fast enough.
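
For reference, a sketch of what such an integer-only inner loop can look like. The constants are 16.16 fixed-point approximations of the usual BT.601 formulas, and the sketch assumes the Cb and Cr planes have already been upsampled to full resolution; jsmpeg's actual loop also handles the chroma subsampling and may use different constants:

    // Convert one frame from YCbCr planes into an RGBA pixel array (e.g. ImageData.data,
    // a Uint8ClampedArray, which also takes care of clamping to 0..255).
    function convertFrame(y, cb, cr, rgba, pixelCount) {
      for (var i = 0, j = 0; i < pixelCount; i++, j += 4) {
        var Y = y[i], Cb = cb[i] - 128, Cr = cr[i] - 128;
        rgba[j]     = Y + ((91881 * Cr) >> 16);               // R = Y + 1.402 * Cr
        rgba[j + 1] = Y - ((22554 * Cb + 46802 * Cr) >> 16);  // G = Y - 0.344 * Cb - 0.714 * Cr
        rgba[j + 2] = Y + ((116132 * Cb) >> 16);              // B = Y + 1.772 * Cb
        rgba[j + 3] = 255;                                    // A: always fully opaque
      }
    }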

The next thing I tried was to use 32 bit writes for these arrays.

So instead of storing each color component separately, like you see in the upper example where the R, G, B, and A values are each stored with a separate instruction, I created a view of the RGBA array that is 32 bits wide, and just write one complete color value, with the R, G, B, and A values together, in one step.

So here's a bit of bit shift stuff going on to encode the RGBA into a 32-bit number and store it into the pixels array.
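
The trick looks roughly like this; note that the byte order of the packed value assumes a little-endian machine, which nearly all devices are:

    // Pack one RGB value (plus full alpha) into a 32-bit RGBA pixel array with a
    // single write instead of four separate byte writes.
    function writePixel32(rgba32, index, r, g, b) {
      rgba32[index] = (255 << 24) | (b << 16) | (g << 8) | r;   // A, B, G, R packed together
    }

    var rgba = new Uint8ClampedArray(1280 * 720 * 4);   // the byte view used for the Canvas
    var rgba32 = new Uint32Array(rgba.buffer);          // 32-bit view over the same buffer
    writePixel32(rgba32, 0, 255, 0, 0);                 // first pixel: opaque red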

And this actually was quite a bit faster than the old version, but it still wasn't fast enough.

So one guy on GitHub suggested that we use WebGL for the conversion.

And this actually worked out great.

So what you can see here is a very simple WebGL shader.

And we just hand over these decoder buffers for the Y, Cb, and Cr channels-- sorry.

These buffers are handed over as textures.

And for each pixel on the screen, the shader grabs the value from these buffers.

It does the conversion and sets the output on the screen.

And this is sort of the bread and butter of a GPU: just do something for each pixel independently.
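
A hedged sketch of what such a fragment shader can look like, written here as a JavaScript string; the uniform and varying names are made up, and the constants are the plain BT.601 ones, so the shader jsmpeg actually ships may differ:

    // Sample the Y, Cb, and Cr planes as textures and do the color conversion per pixel.
    var YCBCR_TO_RGB_FRAGMENT_SHADER = [
      'precision mediump float;',
      'uniform sampler2D YTexture;',
      'uniform sampler2D CBTexture;',
      'uniform sampler2D CRTexture;',
      'varying vec2 texCoord;',
      'void main() {',
      '  float y  = texture2D(YTexture,  texCoord).r;',
      '  float cb = texture2D(CBTexture, texCoord).r - 0.5;',
      '  float cr = texture2D(CRTexture, texCoord).r - 0.5;',
      '  gl_FragColor = vec4(',
      '    y + 1.402 * cr,',
      '    y - 0.344 * cb - 0.714 * cr,',
      '    y + 1.772 * cb,',
      '    1.0',
      '  );',
      '}'
    ].join('\n');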

And this is so fast that you can't even measure it, because, you know, it runs on the GPU and runs in another process.

But the only thing you can measure is the texture upload time that you need.

And on mobile devices, it's about one millisecond to upload these three texture values.

So with this change, the whole thing runs fast enough that you can actually decode these 720p videos on mobile devices.

So with this working, there's only one question remaining.

Why? And the answer is that simple.

I actually started with this out of interest in video compression, but it was useful as well.

If we have a look at the current situation with the HTML5 video element, you get these differences in supported codecs.

So let's start with the shitty browsers from companies that don't care about the web.

Internet Explorer and Safari don't decode WebM, because-- I don't know-- they hate the web and hate open source.

They just-- but they love H.264, because it's patented and it's-- you know-- [LAUGHING] I don't know.

I have no idea why they don't decode WebM.

Chrome can decode both formats.

Chromium, on the other hand, can only decode WebM, because for H.264, there is no open source solution that is not patented or patent-encumbered.

And same with Firefox.

But recent versions of Firefox actually try to use an H.264 codec, if it's installed on the system, under certain conditions.

I couldn't find documentation anywhere about what this actually means.

So I don't-- your best bet is to not use H.264, if you want to support Firefox.

But there's hope.

Microsoft Edge browser announced that they will support WebM in the future.

So maybe, in one year, we will have WebM decoding in Microsoft Edge.

And then it will only take three or four more years before Safari catches up.

[LAUGHTER] Same situation with live streaming for the video element.

There are currently two standards that are competing, HLS, which is HTTP live streaming, developed by Apple, and MPEG DASH.

Both formats are quite similar.

They provide a playlist of short chunks of video.

So you have, maybe, 20 video files, each five seconds.

And you have a playlist of these video files.

And this playlist can be appended dynamically.

And if you have a live stream, you just request this playlist continuously and get new segments.

This also means that you have a latency of at least 5 to 10 seconds from the live stream.

So it's not really "live," but-- it's good enough for broadcasting, but not for-- I don't know-- VNC, or screen sharing, or stuff like that.

And the situation is a bit better than with the codecs, because all this other stuff is just HTTP requests-- the playlist is requested over HTTP.

And the video files are served over HTTP.

So you can build clients for MPEG-DASH and HLS with JavaScript.

There's one hiccup for Firefox, which doesn't support HLS under certain conditions.

I haven't found out anything more about this.

In theory, you should be able to build something that works on all browsers, but I haven't seen a demo of this working yet.

If you know a demo, please get in touch.

So let's try something.

So I have a small Node.js script here that creates a WebSocket server.

And it creates an HTTP server.

And when this HTTP server receives data, it just broadcasts this data over the WebSocket connection.

So you send something over HTTP, and it gets broadcasted to all connected WebSocket clients.
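
The core of such a relay is only a few lines. This sketch is not the exact script from the demo; it assumes the 'ws' npm module and arbitrary port numbers:

    var http = require('http');
    var WebSocket = require('ws');

    // Port 8084: browsers connect here over WebSocket to receive the stream.
    var wss = new WebSocket.Server({ port: 8084 });

    // Port 8082: the encoder pushes the MPEG data here over a plain HTTP request.
    http.createServer(function (req, res) {
      req.on('data', function (chunk) {
        wss.clients.forEach(function (client) {
          if (client.readyState === WebSocket.OPEN) {
            client.send(chunk);   // broadcast every chunk to all connected viewers
          }
        });
      });
    }).listen(8082);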

And I have this index.html file here.

It just creates a WebSocket client and hands it over to jsmpeg.
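
Roughly, the page boils down to something like this; the ability to pass a WebSocket straight into the constructor is how jsmpeg worked at the time, though the exact API may have changed since:

    // Connect to the relay server and let jsmpeg decode whatever arrives.
    var canvas = document.getElementById('video-canvas');   // hypothetical element ID
    var client = new WebSocket('ws://localhost:8084/');     // the relay's WebSocket port
    var player = new jsmpeg(client, { canvas: canvas });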

So I can start the Node.js server here.

And it gets two ports: one port for accepting the HTTP connection, and one port for accepting WebSocket connections.

And let's start this.

And I have another tab here.

This tab runs FFmpeg, which is a media encoder and decoder.

It supports almost all video formats and even image formats out there.

And it supports a few output formats, so you can save it into a file, or stream it over HTTP.

And that's exactly what we're going to do here.

Let's just start this.

And you can see the stream connected on my local server.

And let's visit-- so that's a live-- live stream from my webcam through FFmpeg, and then through Node.js into the browser over WebSockets.

And it's decoded in jsmpeg.

And you can see the latency isn't too bad.

[APPLAUSE] So let's try something else.

What's my IP address actually? OK.

Everyone get out your phone, tablet, or laptop, and visit this address.

I have no idea if this is going to work.

You should actually be able to-- oh, one guy connected already.

Nobody? (audience) No.

[INTERPOSING VOICES] [INAUDIBLE] I tried this earlier with my-- with my phone.

And it took awhile to load.

I have no idea why.

(audience) [INAUDIBLE].

Let's give it a few more seconds.

Well, I'll leave this running and just get on with my talk.

And if anyone can connect, just scream.

And we will all be happy.

So one thing you could build with this is-- this is an iPhone app, called Instant Webcam, which basically does the same thing as I did with Node.js.

But it's all packaged into one single app.

And it serves the iPhone camera through HTTP.

So you can open any browser and connect directly to your iPhone.

And it serves a small web page and, over WebSocket, the MPEG stream.

And another thing I built-- I can only show you a video, because this is for Windows only.

It's a small app that you run on your Windows desktop.

And it serves the whole screen over WebSockets to any browser that's connected.

And this is actually fast enough to do some gaming with it.

I was able to bring down the latency to about 50 milliseconds, because everything is integrated into one app.

And there's no overhead for the stuff, so most of the time, I'm just waiting for the frame to be rendered by Windows before I can grab it and send it over the network.

So all the input from your browser is sent back over the network to your Windows desktop.

And this also works on a mobile phone.

So you can actually play games in your browser.

There's currently sound missing, but it works surprisingly well, for what it is.

So in conclusion, I only have to say that, always bet on JavaScript.

There are so many solutions for video decoding and video streaming out there, but they all are so complicated.

And frankly, I don't understand why.

So it was actually easier to write an MPEG1 decoder in JavaScript and the server for WebSockets and have this whole setup completely written by myself, than to deal with all the different vendors that try to make streaming work.

So thank you.

[APPLAUSE] So everything I talked about, you can find on this URL back here.

There are links to all the examples I had at the beginning and also for this jsmpeg stuff.

Whoa.

Come and join me on the comfy chairs, Dominic.

I need a puff of the head explosion bong after seeing that, to be honest.

That was extraordinary.

Thank you.

But extraordinary.

A bunch of questions, one from Yaris who hates open source.

[LAUGHTER] Why do you encode the alpha channel at all? Because it seems like 25% over-- 25% overhead for no reason.

Or have we misunderstood? Well, the alpha channel is not encoded in the MPEG video.

So in the file, there's no alpha channel.

But if you want to render anything on a Canvas element on your screen, you have to set the alpha component.

And in the example I showed, the alpha component was always full alpha, so 255, just to display anything.

You have to set this pixel value so that it get-- gets displayed on the Canvas element.

So it's not encoded in the video file.

The video just encodes these three channels, the one luma channel and two chroma channels.

Got ya.

Thank you.

Would you recommend using this in production? It depends.

Ha.

So I've seen a few examples where it's used in production.

There was, for instance, one Kickstarter project that produced a lamp that you can program and that responds to certain events.

So it was, sort of, like a cylinder that-- with lots of LEDs in it.

And you can program it to-- if you get an SMS, it will highlight the top half of the cylinder in red, or something.

So you can-- it's like a fountain of light.

You can completely program it.

And he had a demo on his web page, using jsmpeg, where you control the thing from the browser.

So you could control this light and try lots of different settings and immediately see the result, live-streamed, into your browser.

And this wasn't possible with HLS or the MPEG DASH streaming, because the latency was 5 or 10 seconds.

And here, you get 100 or 200 milliseconds.

And it's almost-- it's good enough for things like that where you turn on a switch, and you see the result.

Got ya.

And a question from somebody thinking very pragmatically.

Presumably, if you use jsmpeg, it can auto-play on iOS, which HTML5 video can't do.

Right. That's one of the-- one of the benefits you get with jsmpeg.

You have so much more control over what you do.

You can display single frames.

And you can single-frame step through the video.

You can seek to a specific frame.

And these are things that kind of work in the HTML5 video, but it depends where you have your intra frames set.

And it's really complicated.

And as the question said, you can't pre-load the video file and you can't auto-play this video file.

So one thing I actually use it for is as a replacement for GIF animations.

So lots of websites now have WebM videos.

And these run great on browsers that support it.

But if you use your smartphone, you can't see this video file.

And as a GIF, it would be, maybe, 20 or 30 megabytes.

So encoding it as MPEG1 and serving it through jsmpeg was a no brainer.

And it worked quite well.

Obviously, you shouldn't auto-play video, by the way.

Have you thought about writing an MPEG4 decoder? Or is that too darn tricky? It's actually very tricky.

Yes.

There's been an experiment, called Broadway.js, which tries to decode H.264 in JavaScript.

And there's some demos available, but I haven't been able to make this work consistently.

It's very flaky.

It tries to decode different stuff in different threads, and it barely works, if it works at all. And you have to download, maybe, one megabyte of JavaScript for this.

It's all compiled with Emscripten.

And it's-- yeah, it's very complicated to get working, which is why the MPEG1 format is so nice for this, because it's so simple.

And you end up with a decoder that's 30 kilobytes in size.

And you can modify the source code and see what's going on, which you-- it's quite impossible with a modern decoder, sadly.

I'm admiring the fact that the long description you gave of all the blocks and the encoding you describe as quite simple.

It didn't seem particularly simple to me, in my mind and several other people's minds, looking at the Twitter stream.

Well, yes.

It's quite simple, compared to what we have now in current codecs.

But this also surprised me when I dove into this: if you are used to clean APIs and modern web frameworks, you are surprised how-- I don't want to say how bad these file formats are.

But they are really cumbersome to work with.

And some of this is for legacy reasons, because they have to support some old hardware, and it really has to be set in stone what this format should do, so that the hardware can deal with the decoding.

But even the MPEG1 format is not a pretty format.

It's simpler than many current formats we have, but it's still not nice.

Presumably, as well, if you wrote an MP4-- an MPEG4 decoder, you would have to pay royalty to the MPEG LA, wouldn't you? Pardon? You would have to pay royalties to the MPEG LA, if you wrote a decoder.

I don't think you have to pay royalties for the decoder.

You have to pay royalties, if you encode and serve these video files, I think.

I don't know what the situation is exactly.

I'm probably in the grey area with jsmpeg as well, but nobody cares anymore after 20 years.

[LAUGHTER] Don't tell the bad guys.

Last question.

You seem to be doing most of that in vanilla JavaScript.

And I imagine that, whereas, it's fast-- I didn't see any latency-- it must be pretty heavy on the CPU cycles.

Did you think about using asm.js, or something like that?

I actually did.

asm.js would be nice to have, but I don't know if there's a good way to write asm.js without writing in a native language and then cross-compiling.

I would like to see a language that sort of directly compiles to asm.js, so that you have cleaner source code that is directly optimized for this kind of thing, instead of writing C, and then cross-compiling and ending up with a huge binary again.

And it would also mean that the source code wouldn't be as readable anymore.

You have this compile step in between.

And you can't just fiddle around with it.

You know, for this demo, it took me about five minutes to just throw in a condition to turn off intra frames and show you the effects that happen when no intra frames are decoded.

And this would be way more complicated if it used asm.js.

What I'm actually looking at is a thing called SIMD.js, which can use the single-instruction, multiple-data instruction sets from modern CPUs to work on several values at the same time.

And this would be great for jsmpeg to speed up some of the decoding steps.

Well, thank you, very much.

A crazed genius of a project and a crazed genius of a gentleman.

Dominic Szablewski, ladies and gentlemen.

[APPLAUSE]
