
A few words on main thread disk access for general audiences

I’m writing this in lieu of a traditional Firefox Front-end Performance Update, as I think this will be more useful in the long run than just a snapshot of what my team is doing.

I want to talk about main thread disk access (sometimes referred to more generally as “main thread IO”). Specifically, I’m going to argue that main thread disk access is lethal to program responsiveness. For some folks reading this, that might be an obvious argument not worth making, or one already made ad nauseam — if that’s you, this blog post is probably not for you. You can go ahead and skip most or all of it, if you’d like. Or just skim it. You never know — there might be something in here you didn’t know or hadn’t thought about!

For everybody else, scoot your chairs forward, grab a snack, and read on.

Disclaimer: I wouldn’t call myself a disk specialist. I don’t work for Western Digital or Seagate. I don’t design file systems. I have, however, been using and writing software for computers for a significant chunk of my life, and I seem to have accumulated a bunch of information about disks. Some of that information might be incorrect or imprecise. Please send me mail at mike dot d dot conley at gmail dot com if any of this strikes you as wildly inaccurate (though forgive me if I politely disregard pedantry), and then I can update the post.

The mechanical parts of a computer

If you grab a screwdriver and (carefully) open up a laptop or desktop computer, what do you see? Circuit boards, chips, wires and plugs. Lots of electrons flowing around in there, moving quickly and invisibly.

Notably, there aren’t many mechanical moving parts of a modern computer. Nothing to grease up, nowhere to pour lubricant. Opening up my desktop at home, the only moving parts I can really see are the cooling fans near the CPU and power supply (and if you’re like me, you’ll also notice that your cooling fans are caked with dust and in need of a cleaning).

There’s another moving part that’s harder to see: the hard drive. This might not be obvious, because most mechanical drives hide their moving parts inside of the disk enclosure.¹ (I’ve heard them referred to as magnetic drives, spinny drives, physical drives, platter drives and HDDs, and there are probably more terms.)

If you ever get the opportunity to open one of these enclosures (perhaps the disk has failed or is otherwise being replaced, and you’re just about to get rid of it) I encourage you to.

As you disassemble the drive, what you’ll probably notice are circular parts, layered on top of one another on a motor that spins them. In between those circles are little arms that can move back and forth. This next image shows one of those circles, and one of those little arms.

There are several of those circles stacked on top of one another, and several of those arms in between them. We’re only seeing the top one in this photo.

Does this remind you of anything? The circular parts remind me of CDs and DVDs, but the arms reaching across them remind me of vinyl players.

The comparison isn’t that outlandish. If you ignore some of the lower-level details, CDs, DVDs, vinyl players and hard drives all operate under the same basic principles:

The circular part has information encoded on it.
An arm of some kind is able to reach across the radius of the circular part.
Because the circular part is spinning, the arm is able to reach all parts of it.
The end of the arm is used to read the information encoded on it.

There’s some extra complexity for hard drives. Normally there’s more than one spinning platter and more than one arm, all stacked up, so it’s more like several vinyl players piled on top of one another.

Hard drives are also typically written to as well as read from, whereas CDs, DVDs and vinyls tend to be written to once, and then used as “read-only memory.” (Though, yes, there are exceptions there.)

Lastly, for hard drives, there’s a bit I’m skipping over involving caches, where parts of the information encoded on the spinning platters are temporarily held elsewhere for easier and faster access, but we’ll ignore that for now for simplicity, and because it wrecks my simile.²

So, in general, when you’re asking a computer to read a file off of your hard drive, it’s a bit like asking it to play a tune on a vinyl. It needs to find the right starting place to put the needle, then it needs to put the needle there and only then will the song play.

For hard drives, the act of moving the “arm” to find the right spot is called seeking.

Contiguous blocks of information and fragmentation

Have you ever had to defragment your hard drive? What does that even mean? I’m going to spend a few moments trying to explain that at a high-level. Again, if this is something you already understand, go ahead and skip this part.

Most functional hard drives allow you to do the following useful operations:

Write data to the drive
Read data from the drive
Remove data from the drive

That last one is interesting, because usually when you delete a file from your computer, the information isn’t actually erased from the disk. This is true even after emptying your Trash / Recycle Bin. Perhaps surprisingly, the files that you asked to be removed are still there, encoded on the circular platters as 1s and 0s. This is why it’s sometimes possible to recover deleted files even when it seems that all is lost.

Allow me to explain.

Just like there are different ways of organizing a sock drawer (at random, by colour, by type, by age, by amount of damage), there are different ways of organizing a hard drive. These “ways” are called file systems. There are lots of different file systems. If you’re using a modern version of Windows, you’re probably using a file system called NTFS. One of the things that a file system is responsible for is knowing where your files are on the spinning platters. The file system is also responsible for knowing where there’s free space on the spinning platters to write new data to.

When you delete a file, what tends to happen is that your file system marks those sectors of the platter as places where new information can be written to, but doesn’t immediately overwrite those sectors. That’s one reason why sometimes deleted files can be recovered.

Depending on your file system, there’s a natural consequence as you delete and write files of different sizes to the hard drive: fragmentation. This kinda sounds like the actual physical disk is falling apart, but that’s not what it means. Data fragmentation is probably a more precise way of thinking about it.

Imagine you have a sheet of white paper broken up into a grid of 5 boxes by 5 boxes (25 boxes in total), and a box of paints and paintbrushes.

Each square on the paper is white to start. Now, starting from the top-left, and going from left-to-right, top-to-bottom, use your paint to fill in 10 of those boxes with the colour red. Now use your paint to fill in the next 5 boxes with blue. Now do 3 more boxes with yellow.

So we’ve got our colour-filled boxes in neat, organized rows (red, then blue, then yellow), and we’ve got 18 of them filled, and 7 of them still white.

Now let’s say we don’t care about the colour blue. We’re okay to paint over those now with a new colour. We also want to fill in 10 boxes with the colour purple. Hm… there aren’t enough free white boxes to put in that many purple ones, but we have these 5 blue ones we can paint over. Let’s paint over them with purple, and then put the next 5 at the end in the white boxes.

So now 23 of the boxes are filled, we’ve got 2 left at the end that are white, but also, notice that the purple boxes aren’t all together — they’ve been broken apart into two sections. They’ve been fragmented.

This is an incredibly simplified model, but (I hope) it demonstrates what happens when you delete and write files to a hard drive. Gaps open up that can be written to, and bits and pieces of files end up being distributed across the platters as fragments.

This also occurs as files grow. If, for example, we decided to paint two more white boxes red, we’d need to paint the ones at the very end, breaking up the red boxes so that they’re fragmented.
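The paint-box model is small enough to run as code. What follows is a minimal sketch of that model only, not of how any real file system allocates space. The `write`, `delete` and `fragments` helpers are made up for illustration, and the sketch assumes new data always goes into the first free boxes it finds.

```python
from itertools import groupby

FREE = "."  # a white (unpainted) box

def write(grid, label, count):
    """Paint the first `count` free boxes with `label`, scanning in order."""
    free = [i for i, box in enumerate(grid) if box == FREE]
    if len(free) < count:
        raise ValueError("not enough free boxes")
    for i in free[:count]:
        grid[i] = label

def delete(grid, label):
    """Mark every box holding `label` as available for new paint."""
    for i, box in enumerate(grid):
        if box == label:
            grid[i] = FREE

def fragments(grid, label):
    """Count the contiguous runs of `label` in the grid."""
    return sum(1 for key, _ in groupby(grid) if key == label)

grid = [FREE] * 25          # the 5 x 5 sheet, flattened row by row
write(grid, "R", 10)        # red
write(grid, "B", 5)         # blue
write(grid, "Y", 3)         # yellow
delete(grid, "B")           # we no longer care about blue
write(grid, "P", 10)        # purple reuses the blue boxes, then spills over
layout_after_purple = "".join(grid)
purple_fragments = fragments(grid, "P")

write(grid, "R", 2)         # the red "file" grows by two boxes
red_fragments = fragments(grid, "R")

print(layout_after_purple, purple_fragments, red_fragments)
```

Purple ends up in two separate runs, and growing red by two boxes splits red into two runs as well, which matches the walk-through above.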

So going back to our vinyl player example for a second — the ideal scenario is that you start a song at the beginning and it plays straight through until the end, right? The more common case with disk drives, however, is you read bits and pieces of a song from different parts of the vinyl: you have to lift and move the arm each time until eventually you have heard the song from start to finish. That seeking of the arm adds overhead to the time it takes to listen to the song from beginning to end.

When your hard drive undergoes defragmentation, what your computer does is try to re-organize your disk so that files are in contiguous sectors on the platters. That’s a fancy way of saying that they’re all in a row on the platter, so they can be read in without the overhead of seeking around to assemble it as fragments.

Skipping that overhead can have huge benefits to your computer’s performance, because the disk is usually the slowest part of your computer.

I’ve skipped over and simplified a bunch of stuff here in the interests of brevity, but this is a great video that gives a crash course on file systems and storage. I encourage you to watch it.

On the relative input / output speeds of modern computing components

I mentioned in the disclaimer at the start of this post that I’m not a disk specialist or expert. Scott Davis is probably a better bet as one of those. His bio lists an impressive wealth of experience, and mentions that he’s “a recognized expert in virtualization, clustering, operating systems, cloud computing, file systems, storage, end user computing and cloud native applications.”

I don’t know Scott at all (if you’re reading this, Hi, Scott!), but let’s just agree for now that he probably knows more about disks than I do.

I’m picking Scott as an expert because of a particularly illustrative analogy that was posted to a blog for a company he used to work for. The analogy compares the speeds of different media that can be used to store information on a computer. Specifically, it compares the following:

RAM
The network with a decent connection
Flash drives
Magnetic hard drives — what we’ve been discussing up until now.

For these media, the post claims that input / output speed can be measured using the following units:

RAM is in nanoseconds
10GbE Network speed is in microseconds (~50 microseconds)
Flash speed is in microseconds (between 20-500+ microseconds)
Disk speed is in milliseconds

That all seems pretty fast. What’s the big deal? Well, it helps if we zoom in a little bit. The post does this by supposing that we pretend that RAM speed happens in minutes.

If that’s the case, then we’d have to measure network speed in weeks.

And if that’s the case, then we’d want to measure the speed of a Flash drive in months.

And if that’s the case, then we’d have to measure the speed of a magnetic spinny disk in decades.
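To check that the analogy holds together, here is a rough sketch of the arithmetic. It assumes that 1 nanosecond of real time is stretched to 1 minute, that RAM access takes about 1 nanosecond, and that a disk access takes about 10 milliseconds. Those specific figures are my assumptions for illustration; the post being quoted only gives the units, plus the network and Flash ranges listed above.

```python
MINUTES_PER_DAY = 60 * 24

def scaled_days(latency_ns):
    """Days in the analogy, if every real nanosecond lasts one minute."""
    return latency_ns / MINUTES_PER_DAY

latencies_ns = {
    "RAM (assumed 1 ns)": 1,
    "10GbE network (~50 microseconds)": 50_000,
    "Flash, fast end (20 microseconds)": 20_000,
    "Flash, slow end (500 microseconds)": 500_000,
    "Disk (assumed 10 ms)": 10_000_000,
}

for name, ns in latencies_ns.items():
    days = scaled_days(ns)
    print(f"{name}: {days:,.1f} days (about {days / 365:.2f} years)")
```

Under those assumptions the network lands at roughly five weeks, Flash ranges from about two weeks to almost a year, and a 10 millisecond disk access stretches to about 19 years.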

Update (May 23, 2019): My Uncle Mark, who also works in computing, sent me links that show similar visualizations of computing latency: this one has a really excellent infographic, and this one has more discussion. These articles highlight network latency as the worst offender, which is true especially when the quality of service is low, but I’m mostly writing this post for folks who hack on Firefox where the vast majority of networking occurs off of the main thread.

I wish I had some ACM paper, or something written by a computer science professor, that I could point you to in order to bolster the following claim. I don’t, not because one doesn’t exist, but because I’m too lazy to look for one. I hope you’ll forgive me for that, but I don’t think I’m saying anything super controversial when I say:

In the common case, for a personal computer, it’s best to assume that reading and writing to the disk is the slowest operation you can perform.

Sure, there are edge cases where other things in the system might be slower. And there is that disk cache that I breezed over earlier that might make reading or writing cheaper. And sometimes the operating system tries to do smart things to help you. For now, just let it go. I’m making a broad generalization that I think covers the common cases, and I’m talking about what’s best to assume.

Single and multi-threaded restaurants

When I try to describe threading and concurrency to someone, I inevitably fall back to the metaphor of cooks in a kitchen in a restaurant. This is a special restaurant where there’s only one seat, for a single customer — you, the user.

Single-threaded programs

Let’s imagine a restaurant that’s very, very small and simple. In this restaurant, the cook is also acting as the waiter / waitress / server. That means when you place your order, the server / cook goes into the kitchen and makes it for you. While they’re gone, you can’t really ask for anything else — the server / cook is busy making the thing you asked for last.

This is how most simple, single-threaded programs work—the user feeds in requests, maybe by clicking a button, or typing something in, maybe something else entirely—and then the program goes off and does it and returns some kind of result. Maybe at that point, the program just exits (“The restaurant is closed! Come back tomorrow!”), or maybe you can ask for something else. It’s really up to how the restaurant / program is designed that dictates this.

Suppose you’re very, very hungry, and you’ve just ordered a complex five-course meal for yourself at this restaurant. Blanching, your server / cook goes off to the kitchen. While they’re gone, nobody is refilling your water glass or giving you breadsticks. You’re pretty sure there’s activity going on in the kitchen and that the server / cook hasn’t had a heart attack back there, but you’re going to be waiting a looooong time since there’s only one person working in this place.

Maybe in some restaurants, the server / cook will dash out periodically to refill your water glass, give you some breadsticks, and update you on how things are going, but it sure would be nice if we gave this person some help back there, wouldn’t it?

Multi-threaded programs

Let’s imagine a slightly different restaurant. There are more cooks in the kitchen. The server is available to take your order (but is also able to cook in the kitchen if need be), and you make your request from the menu.

Now suppose again that you order a five-course meal. The server goes to the kitchen and tells the cooks what you just ordered. In this restaurant, suppose the kitchen staff are a really great team and don’t get in each other’s way³, so they divide up the order in a way that makes sense and get to work.

The server can come back and refill your water glass, feed you breadsticks, perhaps they can tell you an entertaining joke, perhaps they can take additional orders that won’t take as long. At any rate, in this restaurant, the interaction between the user and the server is frequent and rarely interrupted.

The waiter / waitress / server is the main thread

In these two examples, the waiter / waitress / server is what is usually called the main thread of execution, which is the part of the program that the user interacts with most directly. Moving expensive operations off of the main thread increases the responsiveness of the program.

Have you ever seen the mouse turn into an hourglass, or seen the “This program is not responding” message on Windows? Or the spinny colourful pinwheel on macOS? That’s how it generally manifests in common operating systems: the main thread went off to do something and never came back to give you your order or refill your water or breadsticks. The program seems “unresponsive”, “sluggish”, “frozen”. It’s “hanging”, or “stuck”. When I hear those words, my immediate assumption is that the main thread is busy doing something: either it’s taking a long time (it’s making you your massive five-course meal, maybe not as efficiently as it could), or it’s stuck (maybe they fell down a well!).

In either case, the general rule of thumb for improving program responsiveness is to keep the server filling the user’s water and breadsticks by offloading complex things on the menu to other cooks in the kitchen.
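Here is a minimal sketch of that rule of thumb using Python’s standard `concurrent.futures` module. The slow job is simulated with `time.sleep`, and the function names are made up to match the restaurant metaphor. It is not how Firefox structures its threads; it only shows the shape of handing work to a background thread while the main thread keeps doing small things.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cook_five_course_meal():
    """Stand-in for a slow operation such as a disk read (simulated)."""
    time.sleep(0.2)
    return "five-course meal"

def serve_customer():
    """The 'main thread': hand off the slow job and stay available."""
    refills = 0
    with ThreadPoolExecutor(max_workers=1) as kitchen:
        order = kitchen.submit(cook_five_course_meal)
        while not order.done():
            refills += 1        # refill water, hand out breadsticks
            time.sleep(0.01)    # stand-in for handling one small UI event
        meal = order.result()
    return meal, refills

meal, refills = serve_customer()
print(f"Served the {meal} after {refills} water refills")
```

The meal takes just as long to arrive as it would in the single-threaded restaurant. The difference is that the server got to do many small things for the customer in the meantime.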

Accessing the disk on the main thread

Recall that in the common case, for a personal computer, it’s best to assume that reading and writing to the disk is the slowest operation you can perform. In our restaurant example, reading or writing to the disk on the main thread is a bit like having your server hop onto their bike and ride out to the next town over to grab some groceries to help make what you ordered.

And sometimes, because of data fragmentation (not everything is all in one place), the server has to search amongst many many shelves all widely spaced apart to get everything.

And sometimes the grocery store is very busy because there are other restaurants out there that are grabbing supplies.

And sometimes there are police checks (anti-virus / anti-malware software) occurring for passengers along the road, where they all have to show their IDs before being allowed through.

It’s an incredibly slow operation. Hopefully by the time the server comes back, they don’t realize they have to go back out again to get more, but they might if they didn’t realize they were missing some more ingredients.⁴

Slow slow slow. And unresponsive. And a great way to lose a hungry customer.

For super small programs, where the kitchen is well stocked, or the ride to the grocery store doesn’t need to happen often, having a single thread and having it read or write is usually okay. I’ve certainly written my fair share of utility programs or scripts that do main thread disk access.

Firefox, the program I spend most of my time working on as my job, is not a small program. It’s a very, very, very large program. Using our restaurant model, it’s many large restaurants with many many cooks on staff. The restaurants communicate with each other and ship food and supplies back and forth using messenger bikes, to provide to you, the customer, the best meals possible.

But even with this large set of restaurants, there’s still only a single waiter / waitress / server / main thread of execution as the point of contact with the user.

Part of my job is to help organize the workflows of this restaurant so that they provide those meals as quickly as possible. Sending the server to the grocery store (main thread disk access) is part of the workflow that we absolutely need to strike from the list.

Start-up main-thread disk access

Going back to our analogy, imagine starting the program like opening the restaurant. The lights go on, the chairs come off of the tables, the kitchen gets warmed up, and prep begins.

While this is occurring, it’s all hands on deck — the server might be off in the kitchen helping to do prep, off getting cutlery organized, whatever it takes to get the restaurant open and ready to serve. Before the restaurant is open, there’s no point in having the server be idle, because the customer hasn’t been able to come in yet.

So if critical groceries and supplies needed to open the restaurant need to be gotten before the restaurant is open, it’s fine to send the server to the store. Somebody has to do it.

For Firefox, there are various things that need to take place before we can display any UI. At that point, it’s usually fine to do main-thread disk access, so long as all of the things being read or written are kept to an absolute minimum. Find how much you need to do, and reduce it as much as possible.

But as soon as UI is presented to the user, the restaurant is open. At that point, the server should stay off their bike and keep chatting with the customer, even if the kitchen hasn’t finished setting up and getting all of their supplies. So to stay responsive, don’t do disk access on the main thread of execution after you’ve started to show the user some kind of UI.

Disk contention

There’s one last complication I want to capture here with our restaurant example before I wrap up. I’ve been saying that it’s important to send anyone except the server to the grocery store for supplies. That’s true — but be careful of sending too many other people at the same time.

Moving disk access off of the main thread is good for responsiveness, full stop. However, it might do nothing to actually improve the overall time that it takes to complete some amount of work. Put another way: just because the server is refilling your glass and giving you breadsticks doesn’t mean that your five-course meal is going to show up any faster.

Also, disk operations on magnetic drives do not have a constant speed. Having the disk do many things at once within a single program or across multiple programs can slow the whole set of operations down due to the overhead of seeking and context switching, since the operating system will try to serve all disk requests at once, more or less.⁵

Disk contention and main thread disk access are things I think about a lot these days while my team and I work on improving Firefox start-up performance.

Some questions to ask yourself when touching disk

So it’s important to be thoughtful about disk access. Are you working on code that touches disk? Here are some things to think about:

Is UI visible, and responsiveness a goal?

If so, best to move the disk access off of the main-thread. That was the main thing I wanted to capture, and I hope I’ve convinced you of that point by now.

Does the access need to occur?

As programs age and grow and contributors come and go, sometimes it’s important to take a step back and ask, “Are the assumptions of this disk access still valid? Does this access need to happen at all?” The fastest code is the code that doesn’t run at all.

What else is happening during this disk access? Can disk access be prioritized more efficiently?

This is often trickier to answer as a program continues to run. Thankfully, tools like profilers can help capture recordings of things like disk access to gain evidence of simultaneous disk access.

Start-up is a special case though, since there’s usually a somewhat deterministic / reliably stable set of operations that occur in the same way in roughly the same order during start-up. For start-up, using a tool like a profiler, you can gain a picture of the sorts of things that tend to happen during that special window of time. If you notice a lot of disk activity occurring simultaneously across multiple threads, perhaps ponder if there’s a better way of ordering those operations so that the most important ones complete first.
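One way to picture “ordering those operations” is a single background worker that pulls disk jobs from a priority queue, so that jobs don’t compete with each other and the most important ones run first. The sketch below assumes that shape; the job names and the `run_io_jobs` helper are hypothetical, and this is not a description of what Firefox does.

```python
import queue
import threading

def run_io_jobs(jobs):
    """Run (priority, name, func) jobs one at a time on one background thread.

    Lower priority numbers run first. Returns names in completion order.
    """
    pending = queue.PriorityQueue()
    for order, (priority, name, func) in enumerate(jobs):
        # `order` breaks ties so the functions themselves are never compared.
        pending.put((priority, order, name, func))

    completed = []

    def worker():
        while True:
            try:
                _, _, name, func = pending.get_nowait()
            except queue.Empty:
                return
            func()                  # the (pretend) disk access happens here
            completed.append(name)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()  # for the demo only; a real main thread would not block here
    return completed

jobs = [
    (2, "load thumbnails", lambda: None),
    (0, "load session file", lambda: None),
    (1, "load preferences", lambda: None),
]
completion_order = run_io_jobs(jobs)
print(completion_order)
```

Running the jobs one at a time trades some parallelism for predictability. Whether that is a win depends on the disk and the workload, which is the sort of thing a profiler can help answer.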

Can we reduce how much we need to read or write?

There are lots of wonderful compression algorithms out there with a variety of performance characteristics that might be worth pondering. It might be worth considering compressing the data that you’re storing before writing it so that the disk has to write less and read less.

Of course, there’s compression and decompression overhead to consider here. Is it worth the CPU time to save the disk time? Is there some other CPU intensive task that is more critical that’s occurring?
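A quick experiment can make that trade-off concrete. This sketch uses Python’s standard `zlib` module on a made-up, fairly repetitive payload. Both the payload and the compression level are assumptions, and real data may compress much better or much worse.

```python
import json
import time
import zlib

# Hypothetical payload standing in for data a program might store on disk.
records = [
    {"id": i, "url": f"https://example.com/page/{i}", "visits": i % 7}
    for i in range(1000)
]
raw = json.dumps(records).encode("utf-8")

start = time.perf_counter()
compressed = zlib.compress(raw, level=6)
compress_seconds = time.perf_counter() - start

restored = zlib.decompress(compressed)  # what a later read would have to do

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
print(f"time spent compressing: {compress_seconds * 1000:.2f} ms")
```

The bytes saved are disk work avoided, and the time measured is CPU work added. Comparing the two on representative data and hardware is what decides whether compression is worth it.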

Can we organize the things that we want to read ahead of time so that they’re more likely to be read contiguously (without seeking the disk)?

If you know ahead of time the sorts of things that you’re going to be reading off of the disk, it’s generally a good strategy to store them in that read order. That way, in the best case scenario (the disk is defragmented), the read head can fly along the sectors and read everything in, in exactly the right order you want them. If the user has defragmented their disk, but the things you’re asking for are all out of order on the disk, you’re adding overhead to seek around to get what you want.

Supposing that the data on the disk is fragmented, I suspect having the files in order anyways is probably better than not, but I don’t think I know enough to prove it.
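As a rough sketch of storing things in read order, the code below packs several items back to back into a single file, in the order they are expected to be read, and keeps a small index of offsets. The item names and the `pack` and `read_in_order` helpers are hypothetical. Note the big assumption: being contiguous within a file only helps if the file system also placed that file contiguously on the disk, which the program doesn’t control.

```python
import os
import tempfile

def pack(path, items):
    """Write (name, bytes) items back to back in the given order.

    Returns an index mapping each name to its (offset, length).
    """
    index = {}
    with open(path, "wb") as out:
        for name, data in items:
            index[name] = (out.tell(), len(data))
            out.write(data)
    return index

def read_in_order(path, index, names):
    """Read items by name, recording the offset each read started from."""
    results, offsets = [], []
    with open(path, "rb") as f:
        for name in names:
            offset, length = index[name]
            f.seek(offset)
            results.append(f.read(length))
            offsets.append(offset)
    return results, offsets

# Hypothetical start-up data, listed in the order it will be needed.
startup_items = [("prefs", b"a" * 10), ("session", b"b" * 20), ("theme", b"c" * 5)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "startup.bin")
    index = pack(path, startup_items)
    data, offsets = read_in_order(path, index, [name for name, _ in startup_items])

print(offsets)
```

When the read order matches the pack order, the offsets only ever increase, so the reads sweep forward through the file instead of jumping back and forth.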

Flawed but useful

One of my mentors, Greg Wilson, likes to say that “all models are flawed, but some are useful”. I don’t think he coined it, but he uses it in the right places at the right times, and to me, that’s what counts.

The information in this post is not exhaustive — I glossed over and left out a lot. It’s flawed. Still, I hope it can be useful to you.

Thanks

Thanks to the following folks who read drafts of this and gave feedback:

Mandy Cheang
Emily Derr
Gijs Kruitbosch
Doug Thayer
Florian Quèze

¹ There are also newer forms of disks called Flash disks and SSDs. I’m not really going to cover those in this post.
² The other thing to keep in mind is that the disk cache can have its contents evicted at any time for reasons that are out of your control. If you time it right, you can maybe increase the probability of a file you want to read being in the cache, but don’t bet the farm on it.
³ When writing multi-threaded programs, this is much harder than it sounds! Mozilla actually developed a whole new programming language to make that easier to do correctly.
⁴ Keen readers might notice I’m leaving out a discussion on Paging. That’s because this blog post is getting quite long, and because it kinda breaks the analogy a bit: who sends groceries back to a grocery store?
⁵ I’ve never worked on an operating system, but I believe most modern operating systems try to do a bunch of smart things here to schedule disk requests in efficient ways.

Firefox Performance Update #10

Hey folks – another Performance Update coming at you! It’s been a few weeks since I posted one of these, mostly due to travel, holidays and the Mozilla SF All-Hands. However, we certainly haven’t been idle during that time. Much work has been done Performance-wise, and there’s a lot to tell. So strap in! But first…

This Performance Update is brought to you by: promiseDocumentFlushed

promiseDocumentFlushed is a utility that’s available for browser engineers in chrome documents on the window global. The goal of promiseDocumentFlushed is to help avoid synchronous layout flushes in our JavaScript code by scheduling work to only occur after the next “natural” layout flush occurs¹.

promiseDocumentFlushed takes a function and returns a Promise. The function it takes will run the next time a natural layout flush and paint has finished occurring. At this point, the DOM should not be “dirty”, and size and position queries should be very cheap to calculate. It is critically important for the callback to not modify the DOM. I’ve filed bugs to make modifying the DOM inside that callback enter some kind of failure state, but they haven’t been resolved yet.

The return value of the callback is what promiseDocumentFlushed’s returned Promise resolves with. Once the Promise resolves, it is then safe to modify the DOM.

This mechanism means that if, for some reason, you need to gather information about the size or position of things in the DOM, you can do it without forcing a synchronous layout flush – however, a paint will occur before that information is given to you. So be on the look-out for flicker, since that’s the trade-off here.

And now, here’s a list of the projects that the team has been working on lately:

ClientStorage (In-Progress by Doug Thayer)

The ClientStorage project should allow Firefox to communicate with the GPU more efficiently on macOS, which should hopefully reduce jank on the compositor thread². This is right on the verge of landing³, and we’re very excited to see how this impacts our macOS users!

Init WindowsJumpLists off-main-thread (Completed by Doug Thayer)

The JumpList is a Windows-only feature – essentially an application-specific context menu that opens when you right-click on the application in the task bar. Adding entries to this context menu involves talking to Windows, and unfortunately, the way we were originally doing this involved writing to the disk on the main thread. Thankfully, the API is thread-safe, so Doug was able to move the operation onto a background thread. This is good, because arewesmoothyet was reporting the Windows JumpList code as one of the primary causes of main-thread hangs caused by our front-end code.

Reduce painting while scrolling panels on macOS (Completed by Doug Thayer)

Matt Woodrow noticed that the recently added All Tabs list was performing quite poorly when scrolling it on macOS. After turning on paint-flashing for our browser UI, he noticed that we were re-painting the entire menu every time it scrolled. After some investigation, Matt realized that this was because our Graphics code was skipping some optimizations due to the rounded corners of the panels on macOS. We briefly considered removing the rounded corners on macOS, but then Doug found a more general fix, and now we only re-paint the minimum necessary to scroll the menu, and it’s much smoother!

Make the RemotePageManager lazy (In-Progress by Felipe Gomes)

The RemotePageManager is the way that the parent process communicates with a whitelist of privileged about: pages running in the content process. The RemotePageManager hooks itself in pretty early in a content process’s lifetime, but it’s really only necessary if and when one of those whitelisted about: pages loads. Felipe is working on using some of our new lazy script machinery to load RemotePageManager at the very last moment.

Overhauling about:performance (In-Progress by Florian Quèze)

Florian is working on improving about:performance, with the hopes of making it more useful for browser engineers and users for diagnosing performance problems in Firefox. Here’s a screenshot of what he has so far:

A screenshot of the nascent about:performance showing how much CPU tabs are consuming.

Apparently, mining cryptocurrency takes a lot of CPU!

Thanks to the work of Tarek Ziade, we now have a reliable mechanism for getting information on which tabs are consuming CPU cycles. For example, in the above screenshot, we can see that the coinhive tab that Firefox has open is consuming a bunch of CPU in some workers (mining cryptocurrency). Florian has also been clearing out some of the older code that was supporting about:performance, including the subprocess memory table. This table was useful for our browser engineers when developing and tuning the multi-process project, but we think we can replace it now with something more actionable and relevant to our users. In the meantime, since gathering the memory data causes jank on the main thread, he’s removed the table and the supporting infrastructure. The about:performance work hasn’t landed in the tree yet, but Florian is aiming to get it reviewed and landed (preffed off) soon.

Browser Adjustment Project (In-Progress by Gijs Kruitbosch)

This is a research project to find ways that Firefox can classify the hardware it’s running on, which should make it easier for the browser to make informed decisions on how to deal with things like CPU scheduling, thread and process priority, graphics and UI optimizations, and memory reclamation strategies. This project is still in its early days, but Gijs has already identified prior art and research that we can build upon, and is looking at lightweight ways we can assign grades to a user’s CPU, disk, and graphics hardware. Then the plan is to try hooking that up to the toolkit.cosmeticAnimations pref, to test disabling those animations on weaker hardware. He’s also exploring ways in which the user can override these measurements in the event that they want to bypass the defaults that we choose for each environment.

Avoiding spurious about:blank loads in the parent process (In-Progress by Gijs Kruitbosch)

When we open new browser windows, the initial browser tab inside them runs in the parent process and loads about:blank. Soon after, we do a process flip to load a page in the content process. However, that initial about:blank still has cost, and we think we can avoid it. There’s a test failure that Gijs is grappling with, but after much thorough detective work deep in the complex ball of code that supports our window opening infrastructure, he’s figured out a path forward. We expect this project to be wrapped up soon, which should hopefully make window opening cheaper and also produce less flicker.

Load Activity Stream scripts from ScriptPreloader (Completed by Jay Lim)

Jay has recently made it possible for Activity Stream to load its start-up scripts from the ScriptPreloader. From his local measurements on his MBP, this saves a sizeable chunk of time (around 20-30ms if I recall) on the time to load and render Activity Stream! This optimization is not available, however, unless the separate Activity Stream content process is enabled.

Enable the separate Activity Stream content process by default (In-Progress by Jay Lim)

This project not only ensures that Activity Stream content activity doesn’t impact other tabs (and vice versa), but also allows Firefox to take advantage of the ScriptPreloader to load Activity Stream faster. This does, however, mean an extra process flip when moving from about:home, about:newtab or about:welcome to a new page and back again. Because of this, Jay is having to modify some of our tests to accommodate that, as well as part of our Session Restore code to avoid unnecessary loading indicators when moving between processes.

Defer calculating Activity Stream state until idle (In-Progress by Jay Lim)

When Firefox starts up, one of the first things it prepares to do is show you the Activity Stream page, since that’s the default home and new tab page. Jay thinks we might be able to save the state of Activity Stream at shutdown, and load it again quickly during startup within the content process, and then defer the calculations necessary to produce a more recent state until after the parent process has become idle. We’re unsure yet what this will buy us in terms of start-up speed, but Jay is hacking together a prototype to see. I’m eager to find out!
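To make the idea concrete, here is a minimal sketch of "show the saved state right away, recompute once idle". It is not Jay's prototype. The names `createStateProvider`, `loadSnapshot`, `computeFreshState` and `whenIdle` are all hypothetical, and the idle scheduler is passed in so the behaviour is easy to follow.

```javascript
// Toy model: serve a cached snapshot immediately, recompute later when idle.
// `loadSnapshot`, `computeFreshState` and `whenIdle` are hypothetical stand-ins.
function createStateProvider({ loadSnapshot, computeFreshState, whenIdle }) {
  // Fast path: whatever was saved at the last shutdown (may be stale).
  let state = loadSnapshot();
  let fresh = false;

  // Slow path: postpone the expensive recomputation until the scheduler
  // reports that the process is idle.
  whenIdle(() => {
    state = computeFreshState();
    fresh = true;
  });

  return {
    getState: () => state,
    isFresh: () => fresh,
  };
}

// Usage with a fake idle scheduler that simply queues callbacks.
const idleQueue = [];
const provider = createStateProvider({
  loadSnapshot: () => ({ savedAt: "last shutdown" }),
  computeFreshState: () => ({ savedAt: "now" }),
  whenIdle: (callback) => idleQueue.push(callback),
});

console.log(provider.getState().savedAt); // the snapshot is usable right away
idleQueue.forEach((callback) => callback()); // simulate the process going idle
console.log(provider.getState().savedAt); // now reflects the recomputed state
```

The trade-off in this sketch is that the first render may show slightly stale data until the idle callback runs.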

Grab bag of Notable Performance Work

Luca Greco landed all of the infrastructure to move the WebExtension storage.local backend from a file in the profile directory to indexedDB. This should particularly help the performance of the browser when WebExtensions write small changes to large storage structures, since historically this would cause the entire JSON object for the structure to be recomputed and flushed to disk. This should also help with memory consumption. The infrastructure is disabled by default, and once this bug is fixed, it will be switched on.
Doug Thayer made our layerization logic smarter for pages that historically created many, many layers. This resulted in a nice win on our MotionMark score, and one user reported that it improved power usage as well.
Mark Banner made it so that moving many bookmarks in bulk isn’t nearly as expensive to complete. This cut the cost of dropping 300 bookmarks with async transactions from ~2s to ~400ms!
Kartikaya Gupta made it so that users of the Gecko Profiler can use <pid>:<thread filter> in the thread filter input to gather samples of particular subprocesses. This will be very handy as we scale up the number of content processes!
Hiroyuki Ikezoe made it so that we more often throttle computations for transform animations for out-of-view elements.
Gijs Kruitbosch made it so that our DevTools don’t cause synchronous layout flushes when resizing the Inspector pane.
Kris Maglione made it so that we more lazily load PluginContent.jsm, which should result in a content process start-up and memory win.
Anny Gakhokize made it so that instead of sending 8 synchronous IPC messages to retrieve supported clipboard data types, we only send 1 with all of the necessary information.
Marco Bonardo fixed a very important Places regression, where an entire table was being recalculated when deleting certain records.
Dave Townsend fixed an issue where we were requesting the favicon for new pages twice instead of once. This resulted in a 2%-3% win on our internal session restoration bench on 64-bit Linux!
PSPDFKit noted that Firefox is absolutely crushing it at WebAssembly performance.
Andrew Swan enabled the delayed background page start-up optimization for WebExtensions by default, and it should ride out in the Firefox 63 release!
Blake Kaplan got rid of the PBrowser::Msg_GetTabCount synchronous IPC message!
The Graphics team has enabled WebRender by default for a subset of our Nightly population to test it. If you’re in that group, please file bugs if you see them! Check about:studies to see if you’re in the testing group.
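As an illustration of the clipboard item in the list above, here is a rough sketch of why one batched message beats eight separate synchronous ones. `ToyChannel` and the list of flavors are made up for this example and are not Firefox IPC APIs; the only thing the sketch counts is round trips.

```javascript
// Toy channel that counts synchronous round trips. Nothing here is a real
// Firefox IPC API; the names and flavors are made up for illustration.
class ToyChannel {
  constructor(supportedTypes) {
    this.supportedTypes = new Set(supportedTypes);
    this.roundTrips = 0;
  }

  // One round trip per flavor: the "before" shape.
  hasType(type) {
    this.roundTrips += 1;
    return this.supportedTypes.has(type);
  }

  // One round trip for the whole list: the "after" shape.
  hasTypes(types) {
    this.roundTrips += 1;
    return types.map((type) => this.supportedTypes.has(type));
  }
}

const flavors = [
  "text/plain", "text/html", "image/png", "image/jpeg",
  "image/gif", "text/rtf", "text/uri-list", "application/json",
];

const oneByOne = new ToyChannel(["text/plain", "image/png"]);
const answersOneByOne = flavors.map((type) => oneByOne.hasType(type));

const batched = new ToyChannel(["text/plain", "image/png"]);
const answersBatched = batched.hasTypes(flavors);

// Same answers either way, but 8 round trips versus 1.
console.log(oneByOne.roundTrips, batched.roundTrips); // 8 1
```

Each synchronous round trip blocks the sender until the reply arrives, so the saving grows with the latency of the channel.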

Thank you Jay Lim!

As I draw this update to a close, I want to give a shout-out to my intern and colleague Jay Lim, whose internship is ending in a few short days. Jay took to performance work like a duck to water, and his energy, ideas and work were greatly appreciated! Thank you so much, Jay!

By “natural”, I mean a layout flush triggered by the refresh driver, and not by some JavaScript requesting size or position information on a dirty DOM ↩
And when it comes to smoothness and responsiveness, jank on the compositor thread is deadly ↩
it landed and bounced once due to a crash test failure, but Doug has just gotten a fix for it approved ↩

Firefox Performance Update #9

Hello, Internet! Here we are with yet another Firefox Performance Update for your consumption. Hold onto your hats – we’re going in!

But first a word from our sponsor: ScriptPreloader!

A lot of the Firefox front-end is written using JavaScript. With the possible exception of system add-ons that update outside of the normal release cycle, these scripts tend to be the same until you update.

About a year ago, Mozilla developer Kris Maglione had an idea: let’s try to optimize browser start time by noticing which scripts are being loaded during start-up, and then converting those scripts into a binary representation¹ that we can cache on disk. That way, next time we start up, we can just grab the cached binaries off of the disk, skip the parsing step and start executing the JavaScript right away.

Long-time Mozillians might know that we already do some aggressive caching to improve start time for things like XUL, XBL, manifests and other things that are read at start-up. I think we actually were already caching JavaScript files too – but I don’t think we were storing them pre-parsed. And the old caching stuff was definitely not caching scripts that were loading in content processes (since content processes didn’t exist when the old caching stuff was designed).

At any rate, my understanding is that the ScriptPreloader pays attention to script loads between main process start and the point where the first browser window fires the “browser-delayed-startup-finished” observer notification (after the window paints and does post-painting script loading). At that point, the ScriptPreloader examines the list of scripts that the parent and content processes have loaded, and² writes their pre-parsed bytecode representation to disk.

After that cache is written, the next time the main process or content processes start up, the cache is checked for the binary data. If it exists, this means that we can skip the parsing step. The ScriptPreloader goes one step further and starts to “decode”³ that binary format off of the main thread, even before those scripts are requested. Then, when the scripts are finally requested, they’re very much ready to execute right away.
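Here is a toy model of that flow, as I understand it: note what loads during the start-up window, keep the compiled form, and reuse it on the next run. This is not the real ScriptPreloader. `ToyPreloader`, the `compile` stand-in and the file names are hypothetical, the cache lives in memory instead of on disk, and the off-main-thread decoding step is left out.

```javascript
// Toy model of "note what loads during start-up, cache the compiled form,
// reuse it next time". `compile` is a hypothetical stand-in for parsing.
class ToyPreloader {
  constructor(compile, cache = new Map()) {
    this.compile = compile;
    this.cache = cache; // url -> compiled form saved by a previous run
    this.seenDuringStartup = new Map();
    this.startupFinished = false;
  }

  load(url, source) {
    // Cache hit: skip the compile step entirely.
    let compiled = this.cache.get(url);
    if (compiled === undefined) {
      compiled = this.compile(source);
    }
    // Only loads inside the start-up window are remembered for next time.
    if (!this.startupFinished) {
      this.seenDuringStartup.set(url, compiled);
    }
    return compiled;
  }

  // Called once the start-up window closes; returns what would be persisted.
  finishStartup() {
    this.startupFinished = true;
    return new Map(this.seenDuringStartup);
  }
}

let compileCalls = 0;
const compile = (source) => {
  compileCalls += 1;
  return { length: source.length }; // pretend this is bytecode
};

// First run: everything is compiled, then the cache is written.
const firstRun = new ToyPreloader(compile);
firstRun.load("startup-a.js", "let a = 1;");
firstRun.load("startup-b.js", "let b = 2;");
const savedCache = firstRun.finishStartup();
firstRun.load("late.js", "let c = 3;"); // after the window: not cached

// Second run: start-up scripts come from the cache, no compile needed.
const secondRun = new ToyPreloader(compile, savedCache);
secondRun.load("startup-a.js", "let a = 1;");
secondRun.load("startup-b.js", "let b = 2;");
console.log(compileCalls); // 3: two start-up compiles and one late load, all in the first run
```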

When the ScriptPreloader landed, we saw some really nice wins in our start-up performance!

I’m now working on a series of patches in this bug that will widen the window of time where we note scripts that we can cache. This will hopefully improve the speed of privileged scripts that run up until the idle point of the first browser window.

And now for some Performance Project updates!

Early first blank paint (led by Florian Quèze)

User Research has hired a contractor to perform a study to validate our hypothesis that the early first blank paint perceived performance optimization will make Firefox seem like it’s starting faster. More data to come out of that soon!

Faster content process start-up time (led by Felipe Gomes)

The patches that Felipe wrote a few weeks back have landed and have had a positive impact! The proof is in the pudding – let’s look at some graphs:

The cpstartup impact. Those two clusters are test runs “before” and “after” Felipe’s patches landed, respectively.

The above graph shows a nice drop in the cpstartup Talos test. The cpstartup test measures the time it takes to boot up the content process and have it be ready to show you web pages.

This is a screen capture of a Base Content JS improvement in the AreWeSlimYet test. This graph measures the amount of memory that content processes consume via JavaScript not long after starting up.

In the graph above, we can see that the patches also helped reduce the memory that content processes use by default, by making more scripts only load when they’re needed.

It’s always nice to see our work have an impact in our graphs. Great work, Felipe! Keep it up!

LRU cache for tab layers (led by Doug Thayer)

The patch to introduce the LRU cache landed last week, and was enabled for a few days so we could collect some data on its performance impact.

The good news is that it appears that this has had a significant and positive impact on tab switch times – tab switch times went down, and the number of Nightly instances reporting tab switch spinners went down by about 10%. Great work, Doug!

A number of bugs were filed against the original bug due to some glitchy edge-cases that we don’t handle well just yet.

We also detected a ~8% resident memory regression in our automated testing suites. This was expected (keeping layers around isn’t free!) and gave us a sense of how much memory we might consume were we to enable this by default.

The experiment is concluded for now, and we’re going to disable the cache for a bit while we think about ways to improve the implementation.
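For readers who haven't met the term, here is a minimal sketch of a least-recently-used (LRU) cache. It is not Doug's patch. The `LRUCache` class, the capacity of 2 and the string values standing in for tab layers are assumptions for illustration only.

```javascript
// Minimal LRU cache built on Map, which iterates keys in insertion order.
class LRUCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = new Map();
  }

  get(key) {
    if (!this.entries.has(key)) {
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    const value = this.entries.get(key);
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }

  has(key) {
    return this.entries.has(key);
  }
}

const layers = new LRUCache(2);
layers.set("tab-1", "layers for tab 1");
layers.set("tab-2", "layers for tab 2");
layers.get("tab-1"); // tab-1 is now the most recently used
layers.set("tab-3", "layers for tab 3"); // over capacity: evicts tab-2
console.log(layers.has("tab-1"), layers.has("tab-2"), layers.has("tab-3"));
// true false true
```

The capacity is the knob that trades memory for hit rate, which is the same tension as the resident memory regression mentioned above.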

ClientStorageTextureSource for macOS (led by Doug Thayer)

This project should allow us to be more efficient when uploading layers to the compositor on macOS. Doug has solved the crashing issues he was getting in automation (yay!), and is now attempting to figure out some Talos regressions on the MotionMark test suite. Deeper profiling is likely required to untangle what’s happening there.

Swapping DataURLs for Blobs in Activity Stream (led by Jay Lim)

Jay’s patch to swap out DataURLs for Blobs for Activity Stream images has passed a first round of review from Mardak! He’s now waiting for a second review from k88hudson, and then hopefully this can land and give us a bit of a memory win. Having done some analysis, we expect this to buy back quite a bit of memory that was being contained within those long DataURL strings.

Caching Activity Stream JS in the JS Bytecode Cache (led by Jay Lim)

After examining the JavaScript Bytecode Cache that’s used for Web Content, Jay has determined that it’s really not the right mechanism for caching the Activity Stream scripts.

However, that ScriptPreloader that I was talking about earlier sounds like a much more reasonable candidate. Jay is now doing a deep dive on the ScriptPreloader to see whether or not the Activity Stream scripts are already being cached – and if not, why not.

Tab warming (led by Mike Conley)

No news is good news here. Tab warming continues to ride and no new bugs have been filed. The work to reduce the number of paints when warming tabs has stalled a bit while I dealt with a rather strange cpstartup Talos regression. Ultimately, I think I can get rid of the second paint when warming by keeping the display ports of background tabs suppressed⁴, and then only triggering the display port unsuppression after a tab switch. This will happily take advantage of a painting mechanism that Doug Thayer put in as part of the LRU cache experiment.

Firefox’s Most Wanted: Performance Wins (led by YOU!)

Before we go into the grab-bag list of performance-related fixes – have you seen any patches landing that should positively impact Firefox’s performance? Let me know about it so I can include it in the list, and give appropriate shout-outs to all of the great work going on! That link again!

Grab-bag time

And now, without further ado, a list of performance work that took place in the tree:

(🌟 indicates a volunteer contributor)

Kris Maglione added helpers for generating QueryInterface functions on JS objects in native code to cut down on Native Code -> JS border crossings. Specifically, this should speed up situations where native code is calling QueryInterface on JS-implemented XPCOM components. This also means hand-rolling QueryInterface is no longer necessary, and is actively discouraged.
Jon Coppeard fixed a particularly bad Cycle Collector performance regression that was occurring when certain add-ons were installed. This fix was deemed important enough to ride along to the 60.0.1 builds (both release channel and ESR).
Gabriel Luong made it so that resizing the Inspector pane with the DevTools Toolbox open doesn’t cause layout flushes for every resize event. Instead, it throttles them via an idleCallback. This bug tracks the work to remove the layout flush entirely.
Ryan Hunt made it so that paint threads (see Off Main Thread Painting) no longer block Display List and Frame Layer building. This means we can get more done before offloading instructions to the paint threads, and this gives paint threads more time to finish up their work, increasing the probability that they’ll be free by the time the transaction to the paint thread needs to take place.
Felipe Gomes removed a bunch of unnecessary code for the old about:home that was still registering event listeners and consuming memory despite never being used anymore. This also shrank our installer size a bit, since it got rid of around 200kB worth of images and ~1600 lines of JS!
Marco Bonardo got rid of a sync layout flush that would occur when starting the browser or opening new windows with the Bookmarks Toolbar enabled.
Xidorn Quan identified a regression in Firefox 61+ where DOMSubtreeModified events were being nested and could result in infinite recursion on some sites, making the site freeze. The regressing changeset has been backed out in Firefox 61, and in Firefox 62+, DOMSubtreeModified and DOMAttrModified events will no longer fire when style attributes change.
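The idle-callback throttling mentioned in the DevTools item above can be sketched roughly like this. It is not the DevTools patch. `throttleToIdle` and `defaultSchedule` are hypothetical helpers, and the `setTimeout` fallback is only there so the sketch also runs where `requestIdleCallback` doesn't exist.

```javascript
// Coalesce bursts of events into one unit of work per idle period.
// `schedule` defaults to requestIdleCallback where it exists, with a
// setTimeout fallback; both choices are assumptions for this sketch.
function throttleToIdle(work, schedule = defaultSchedule) {
  let pending = false;
  let latestArgs = [];
  return (...args) => {
    latestArgs = args; // keep only the newest event's data
    if (pending) {
      return; // a callback is already queued
    }
    pending = true;
    schedule(() => {
      pending = false;
      work(...latestArgs);
    });
  };
}

function defaultSchedule(callback) {
  if (typeof requestIdleCallback === "function") {
    requestIdleCallback(callback);
  } else {
    setTimeout(callback, 0);
  }
}

// Usage with a fake scheduler so the behaviour is easy to follow.
const queued = [];
const widths = [];
const onResize = throttleToIdle(
  (width) => widths.push(width),
  (callback) => queued.push(callback)
);
onResize(300);
onResize(310);
onResize(320); // three events, but only one callback is queued
queued.shift()(); // simulate the idle period arriving
console.log(widths.length, widths[0]); // 1 320
```

Only the most recent width is processed, which is the point: the expensive work happens once per idle period instead of once per resize event.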

Thanks, folks!

XDR, I think? ↩
My understanding breaks down here a little ↩
I assume that’s a type of de-serialization ↩
This is an optimization that we do that shrinks the painted area to just the region that’s visible to the browser. We normally paint a bit outside the viewable area so that it’s ready when a user starts scrolling ↩

Firefox Performance Update #8

Howdy folks! Another Firefox Performance Update coming at you. Buckle up.

But first a word from our sponsor: Talos!

Talos is a framework that we use to measure various aspects of Firefox performance as part of our continuous integration pipeline.

There are a number of Talos “suites”, where each suite contains some number of tests. These tests, in turn, report some set of numbers that are then stored and graphable via our graph viewer here.

Here’s a full list of the Talos tests, including their purpose, the sorts of measurements they take, and who’s currently a good person to ask about them if you have questions.

A lot of work has been done to reduce the amount of noise in our Talos tests, but they’re still quite sensitive and noisy. This is why it’s often necessary to do 5-10 retriggers of Talos test runs in order to do meaningful comparisons.

Sometimes Talos detects regressions that aren’t actually real regressions¹, and that can be a pain. However, for the times where real regressions are caught, Talos usually lets us know much faster than Telemetry or user reports.

Did you know that you can get profiles from Try for Talos runs? This makes it much simpler to diagnose Talos regressions. Also, we now have Talos profiles being generated on our Nightly builds for added convenience!

And now for some Performance Project updates!

Early first blank paint (led by Florian Quèze)

No new bugs have been filed against the feature yet from our beta population, and we are seeing an unsurprising drop in the time-to-first-paint probe on that channel. User Research is in the process of getting a (very!) quick study launched to verify our assumption that users will perceive the first blank paint as the browser having started more quickly.

Faster content process start-up time (led by Felipe Gomes)

Felipe has some patches up for review to make our frame scripts as lazy as possible. To support that, he’s added some neat infrastructure using Proxy and Reflect to make it possible to create an object that can be registered as an event handler or observer, and only load the associated script when the events / observer notifications actually fire.
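To give a flavour of the technique, here is a minimal sketch of a lazy handler built with `Proxy` and `Reflect`. It is not Felipe's infrastructure. `lazyHandler` and the `loadHandler` callback are hypothetical names, and the "script" being loaded is just an object created on demand.

```javascript
// Returns an object that can be registered like the real handler, but only
// calls `loadHandler` (a hypothetical loader) the first time one of the
// listed methods is actually invoked.
function lazyHandler(methodNames, loadHandler) {
  let real = null;
  const methods = new Set(methodNames);
  return new Proxy({}, {
    get(target, property, receiver) {
      if (!methods.has(property)) {
        return Reflect.get(target, property, receiver);
      }
      // Hand back a wrapper; the real object is loaded when it is called.
      return (...args) => {
        if (real === null) {
          real = loadHandler();
        }
        return Reflect.apply(real[property], real, args);
      };
    },
  });
}

let loads = 0;
const handler = lazyHandler(["handleEvent", "observe"], () => {
  loads += 1; // stands in for loading and evaluating a script
  return {
    seen: [],
    handleEvent(event) { this.seen.push(event.type); return this.seen.length; },
    observe(subject, topic) { this.seen.push(topic); return this.seen.length; },
  };
});

console.log(loads); // 0: nothing is loaded at registration time
handler.handleEvent({ type: "click" });
handler.observe(null, "some-topic");
console.log(loads); // 1: loaded once, on first use
```

In this sketch the cost of loading moves from start-up to the first event, and handlers whose events never fire never pay it at all.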

We’re excited to see how this work impacts our memory and content process start-up graphs!

LRU cache for tab layers (led by Doug Thayer)

The patch to introduce the LRU cache landed and bounced a few times. There appears to be an invalidation bug with the approach that needs to be ironed out first. dthayer has a plan to address this (forcing re-paints when switching to a tab that’s already rendered in the background), and is just waiting for review.

ClientStorageTextureSource for macOS (led by Doug Thayer)

Doug is working on finishing a project that should allow us to be more efficient when uploading things to the compositor on macOS (by handing memory over to the GPU rather than copying it). He’s currently dealing with strange crashes that he can only reproduce on Try. Somehow, Doug seems to always run into the weird bugs that only appear in automation, and the whole team is crossing our fingers for him on this one.

Swapping DataURLs for Blobs in Activity Stream (led by Jay Lim)

Our new intern Jay Lim is diving right into performance work, and already has his first patch up. This patch makes it so that Activity Stream no longer uses DataURLs to serialize images down to the content process, and instead uses Blobs and Blob URLs. This should allow the underlying infrastructure to make better use of memory, as well as avoiding the cost of converting images to and from DataURLs.
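A quick way to see part of the cost of DataURLs is to compare the raw bytes of an image with the same bytes base64-encoded into a data: URL. This is only a sketch. The 300-byte buffer is made-up stand-in data and not a real image, and it assumes an environment where `btoa` and `Blob` are available as globals.

```javascript
// Compare the size of raw bytes with the same bytes inside a data: URL.
// The 300-byte buffer is made-up stand-in data, not a real image.
const bytes = new Uint8Array(300).fill(0xab);

// Build a binary string so btoa can base64-encode it.
let binary = "";
for (const byte of bytes) {
  binary += String.fromCharCode(byte);
}

// Base64 turns every 3 bytes into 4 ASCII characters.
const base64 = btoa(binary);
const dataURL = "data:image/png;base64," + base64;

// A Blob holds the bytes as-is; no text encoding is involved.
const blob = new Blob([bytes], { type: "image/png" });

console.log(bytes.length, base64.length, blob.size); // 300 400 300
```

So the encoded payload is about a third larger than the bytes it represents, before counting the work of encoding it on one side and decoding it on the other.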

Caching Activity Stream JS in the JS Bytecode Cache (led by Jay Lim)

This project is still in the research phase. Jay is trying to determine if it’s possible to stash the parsed Activity Stream JS code in the JS bytecode cache that we normally use for webpages. We’re still evaluating how much this would save us on page load, and we’re also still evaluating the cost of modifying the underlying infrastructure to allow this. Stay tuned for updates.

AwesomeBar improvements (led by Gijs Kruitbosch)

Gijs has started this work by making it much cheaper to display long URLs in the AwesomeBar. This is particularly useful for DataURLs that might happen to be in your browsing history for some reason!

This is a long-pull effort, so expect this work to be spread out over a bunch of bugs.

Tab warming (led by Mike Conley)

I’ve been focusing on determining why warming tabs seems to result in two consecutive paints. My findings are here, and I suspect that in the warming case, the second paint is avoidable. I suspect that this, coupled with dthayer’s work on ClientStorageTextureSource, will greatly improve tab warming’s performance on macOS, and allow us to ship on that OS.

Firefox’s Most Wanted: Performance Wins (led by YOU!)

Grab-bag time

And now, without further ado, a list of performance work that took place in the tree:

(🌟 indicates a volunteer contributor)

Emilio Cobos Álvarez made it so that we do less work dealing with fonts in Stylo code. This appears to impact YouTube pages, so that’s a nice win!
Marco Bonardo made it so that our Windows Jump Lists code no longer uses synchronous Places APIs to determine if browser history is empty.
Andrew Swan made it so that background pages for WebExtensions can have their loading deferred. It’s quite a clever optimization and has resulted in nice improvements on a number of our start-up performance benchmarks! Great work!
Andrew McCreight made it so that we can do less work when cycle collecting sometimes! This appears to fix some multi-second hangs that some of our users have been seeing on popular sites like Google Inbox.
Matt Woodrow made it so that we schedule paints in a more intelligent way, especially when painting is recovering from falling behind vsync.

Thanks, folks!

Sometimes, for example, the test is just measuring the wrong thing. ↩

Firefox Performance Update #7

G’day folks, just another Firefox Performance Update coming down the pike¹ for you, so strap in.

But first a word from our sponsor: health.graphics!

This performance update is brought to you by The Quantum Dashboards at health.graphics! The first step to changing something is to measure it over time, and once you have those measurements, it’s usually a good idea to find some kind of visual representation for that measurement so that you can track your progress.

Contrary to the domain name, health.graphics measures much more than just the health of our graphics layer. The dashboards at health.graphics show visualizations for a bunch of measurements that we care about – from crash rates, to platform feature state, to raw performance numbers, these dashboards help us make sure that we’re not back-sliding on things that truly matter to us and our users.

And now for some Performance Project updates!

Early first blank paint (led by Florian Quèze)

Florian sent out an Intent to Ship for this perceived performance optimization for Firefox 61. The beta channel will transition to 61 in a bit over a week, and we’ll use that cycle to confirm that the feature is ready to ship out to release.

Faster content process start-up time (led by Felipe Gomes)

After some research and examination of how our content processes initialize themselves, the first few bugs have been filed. This bug, for example, is for introducing infrastructure to make the privileged JavaScript loaded in the content processes more lazy. Another bug was filed to shine some light on the “dark matter” that exists at the start of many content processes in our profiler tools.

Get ContentPrefService init off of the main thread (led by Doug Thayer)

After much heroic effort, this has landed! If you’re curious about the shutdown leak that was preventing this from landing earlier, here’s the patch that fixed it. Spoiler alert: it was the spellchecker, of all things.

This project is done, and will be removed from the updates from here forward.

Blocklist async-ification (led by Gijs Kruitbosch)

As of a few days ago, all public blocklist API calls are asynchronous! This was a monumental effort from Gijs, and should result in faster start-up times for some of our users (especially ones with slower magnetic disks).

There are some very minor internal mechanisms that can still cause the blocklist to be loaded synchronously, but hitting these should be super rare. In the meantime, now that the async-ification is complete, we have an eye towards migrating the back-end to indexedDB.

As the async-ification is wrapped, we’ll be removing this section from the updates from here forward.

LRU cache for tab layers (led by Doug Thayer)

The patches to introduce the LRU cache have been written and are just waiting for the 62 cycle to begin on Nightly in order to land. No doubt there’ll be some interesting edge-cases to hammer out, but we’re very excited to see how this improves tab switching times for our users!

Tab warming (led by Mike Conley)

After sending out an Intent to Ship, the prefs to allow Tab warming to ride to release on Windows and Linux were flipped. If all goes well on Beta, Windows and Linux Desktop users should see some nice tab switching performance improvements in Firefox 61!

While investigating the behaviour that’s preventing us from shipping tab warming on macOS, a new bug was filed to try to reduce the number of paints that are occurring during tab switches.

Firefox’s Most Wanted: Performance Wins (led by YOU!)

Grab-bag time

And now, without further ado, a list of performance work that took place in the tree:

(🌟 indicates a volunteer contributor)

Daniel Holbert made it so that we’re more efficient when calculating the sizes of flex containers.
Mike de Boer made it so that when you restore a session with several windows, the most recently used window appears first and stays on top during the restore – this means users have to spend less time getting back to what they were doing.
Kris Maglione flipped the pref that loads WebExtension background scripts in their own separate process for macOS! This was previously only supported on Windows. This is great for stability, security and performance! Kris also optimized some Add-on Manager code that runs on every start-up, which should gain us a nice start-up performance win.
Marco Bonardo made it so that we don’t send unnecessary Download API updates when adding session history entries.
Matt Woodrow got rid of an unneeded hash table and all associated look-ups in our retained displaylist code.
Bas Schouten added some SIMD optimizations to our 2D geometry code. Bas says that, depending on the circumstances, this should buy us a 2x-4x performance boost for some frequently used geometric calculations.
Hiroyuki Ikezoe made it so that we now skip running calculations for animations that should not have changed yet.
Mike Conley made it so that we can avoid some main-thread IO when initializing the Gecko Media Plug-ins backend.
Gijs Kruitbosch made it so that we do less synchronous layout flushing when resizing browser windows.
Andrew McCreight fixed a rather nasty ghost window leak, which potentially means less cycle and garbage collecting for some users.

Thanks, folks!

I used to think it was pipe. It’s pike. ↩