Improving Firefox Startup Time With The about:home Startup Cache

Don’t bury the lede

We’re working on a thing to make Firefox start faster! It appears to work! Here’s a video showing off a before (left) and after (right):

Improving Firefox Startup Time With The about:home Startup Cache

For the past year or so, the Firefox Desktop Front-End Performance team has been concentrating on making improvements to browser startup performance.

The launching of an application like Firefox is quite complex. Meticulous profiling of Firefox startup in various conditions has, thankfully, helped reveal a number of opportunities where we can make improvements. We’ve been evaluating and addressing these opportunities, and several have made it into the past few Firefox releases.

This blog post is about one of those improvements that is currently in the later stages of development. I’m going to describe the improvement, and how we went about integrating it.

In a default installation of Firefox, the first (and only) tab that loads is about:home¹.

The about:home page is actually the same thing that appears when you open a new tab (about:newtab). The fact that they have different addresses allows us to treat their loading differently.

Your about:home might look slightly different from the above — depending on your locale, it may or may not include the Pocket stories.

Do not be fooled by what appears to be a very simple page of images and text. This page is actually quite sophisticated under the hood. It is designed to be customized by the user in the following ways:

Users can

Collapse or expand sections
Remove sections entirely
Reorganize the order of their Top Sites by dragging and dropping
Pin and unpin Top Sites to their positions
Add their own custom Top Sites with custom thumbnails
Add or remove search engines from their Top Sites
Change the number of rows in the Top Sites and Recommended by Pocket sections
Choose to have the Highlights composed of any of the following:
- Visited pages
- Recent bookmarks
- Recent downloads
- Pages recently saved to Pocket

The user can customize these things at any time, and any open copies of the page are expected to reflect those customizations immediately.

There are further complexities beyond user customization. The page is also designed to be easy for our design and engineering teams to experiment with reorganizing the layout and composition of the page so that they can test variations on its layout in the wild.

The about:home page also has special privileges not afforded to normal websites. It can

Save and remove bookmarks
Add pages to Pocket
Cause the URL bar to be focused and selected
Show thumbnails for pages that the user has visited
Access both high and normal resolution favicons
Render information about the user’s recent activity (recent page visits, downloads, saves to Pocket, etc.)

So while at first glance, this appears to be a static page of just images and text, rest assured that the page can do much more.

Like the Firefox Developer Tools UI, about:home is written with the help of the React and Redux libraries. This has allowed the about:home development team to create sophisticated, reusable, and composable components that could be easily tested using modern JavaScript testing methods.

Unsurprisingly, this complexity and customizability comes at a cost. The page needs to request a state object from the parent process in order to have the Redux store populated and to have React render it. Essentially, the page is dynamically rendering itself after the markup of the page loads.

Startup is a critical time for an application. The user has expressed a need for their browser, and we have an obligation to serve the user as quickly and efficiently as possible. The user’s time is a resource that we should not squander. Similarly, because so much needs to occur during startup,² disk reads, disk writes, and CPU time are also considered precious resources. They should only be used if there’s no other choice.

In this case, we believed that the CPU time and disk accesses spent constructing the state object and dynamically rendering the about:home page was competing with all of the other CPU and disk access happening during startup, and this was slowing us down from presenting about:home to the user in a timely way.

Generally speaking, in my mind there are four broad approaches to performance problems once a bottleneck has been identified.

You can widen the bottleneck (make the operations more efficient)
You can divide the bottleneck (split the work into smaller slices that can be done over a longer period of time with rests in between)
You can move the bottleneck (defer work until later when it seems that there is less competition for resources, or move it to a different thread)
You can remove the bottleneck (don’t do the work)

We started by trying to apply the last two approaches, wondering what startup performance would be like if the page did not render itself dynamically, but was instead a static page generated periodically and pulled off of the disk at startup.

Prototype when possible

The first step to improving something is finding a way to measure it. Thankfully, we already have a number of logged measurements for startup. One of those measurements gives us the time from process start to rendering the Top Sites section of about:home. This is not a perfect measurement—ideally, we’d measure to the point that the page finally “settles” and stops changing³—but for this project, this measurement served our purposes.

Before investing a bunch of time into a potential improvement, it’s usually a good idea to try to see if what you’re gaining is worth the development time. It’s not always possible to build a prototype for performance improvements, but in this case it was.

The team quickly threw together a static copy of about:home and hacked together a patch to load that document during startup, rather than dynamically rendering the page. We then tested that page on our reference hardware. As of this writing, it’s been about five months since that test was done, but according to this comment, the prototype yielded what appears to be an almost 20% win on time from process start to about:home painting Top Sites.

So, with that information, we thought we had a real improvement opportunity here. We decided to proceed with the idea, and began a long arduous search for “the right way to do it.”

Pre-production

As I mentioned earlier, about:home is complex. The infrastructure that powers it is complex. Coupled with the fact that no one on the Firefox Front-End Performance team had spent much time studying React and Redux meant that we had a lot of learning to do.

The first step was to get some React and Redux fundamentals under our belt. This meant building some small toy applications and getting familiar with the framework idioms and how things are organized.

With that grounding, the next step was to start reading the code — starting from the entrypoint into the code that powers about:home when the browser starts. This was an intense period of study that branched into many different directions. Part of the complexity was because much of the code is asynchronous and launched work on different threads, which introduced some non-determinism. While it is generally good for responsiveness to move work off of the main thread, it can lead to some complex reading and interpretation of the code when more than two threads are involved.

A tool we used during this analysis was the Firefox Profiler, to get a realistic sense of the order of executions during startup. These profiles helped to inform much of our reading of the code.

This analysis helped us solidify our mental model of how about:home loads. With that model in place, it was much easier to propose practical approaches for introducing a static about:home document into the ecosystem of pre-existing code. The Firefox Front-End Performance team documented our findings and recommendations and then presented them to the team that originally built the about:home system to ensure that we were all on the same page and that we hadn’t missed anything critical. They were already aware that we were investigating potential performance improvements, and had very useful feedback for us, as well as historical product decision context that clarified our understanding.

Critically, we presented our recommendation for loading a static about:home page at startup and ensured that there were no upcoming plans for about:home that would break our mental model or render the recommendation no longer valid. Thankfully, it sounded like we were aligned and fine to proceed with our plan.

So what was the plan? We knew that since about:home is quite dynamic and can change over time⁴ we needed a startup cache for about:home that could be periodically updated during the course of a browsing session. We would then load from that cache at startup. Clearly, I’m glossing over some details here, but that was the general plan.

As usual, no plan survives breakfast, and as we started to architect our solution, we identified things we would need to change along the way.

Development

We knew that the process that loads about:home would need to be able to read from the about:home startup cache. We also knew that about:home can potentially contain information about what pages the user has visited, and that about:home can do privileged things that normal web pages cannot. It seemed that this project would be a good opportunity to finish a project that was started (and mothballed) a year or so earlier: creating a special privileged content process for about:home. We would load about:home in that process, and add assertions to ensure that privileged actions from about:home could only happen from that content process type⁵

So getting the “privileged about content process”⁶ fixed up and ready for shipping was the first step.

This also paved the way for solving the next step, which was to enable the moz-page-thumb:// protocol for the “privileged about content process.” The moz-page-thumb:// protocol is used to show the screenshot thumbnails for pages that the user has visited in the past. The previous implementation was using Blob URLs to send those thumbnails down to the page, and those Blob URLs exist only during runtime and would not work properly after a restart.

The next step was figuring out how to build the document that would be stored in the cache. Thankfully, ReactDOMServer has the ability to render a React application to a string. This is normally used for server-side rendering of React-powered applications. This feature also allows the React library to passively attach to the server-side page without causing the DOM to be modified. With some small modifications, we were able to build a simple mechanism in a Web Worker to produce this cached document string off of the main thread. Keeping this work off of the main thread would help maintain responsiveness.

With those pieces of foundational work out of the way, it was time to figure out the cache storage mechanism. Firefox already has a startupcache module that it uses for static resources like markup and JavaScript, but that cache is not designed to be written to periodically at runtime. We would need something different.

We had originally supposed that we would need to give the privileged about content process special access to a file on the filesystem to read from and to write to (since our sandbox prevents content processes from accessing disks directly). Initial experiments along this line worried us — we didn’t like the idea of poking holes in the sandbox if we didn’t need to. Also, adding yet another read from the filesystem during startup seemed counter to our purposes.

We evaluated IndexedDB as a storage mechanism, but the DOM team talked us out of it. The performance characteristics of IndexedDB, especially during startup, were unlikely to work for us.

Finally, after some consulting, we were directed to the HTTP cache. The HTTP cache’s job is to cache pages that the user visits (when appropriate) and to offer those caches to the user instead of hitting the network when retrieving the resource within the expiration time⁷. Functionally speaking, this seemed like a storage mechanism perfectly suited to our purposes.

After consulting with the Necko team and building a few proof-of-concepts, we figured out how to tie the whole system together. Importantly, we figured out how to get the initial about:home load to pull a document out from the HTTP cache rather than reading it from the application resource package.

We also figured out the cache writing mechanism. The cached document that would periodically get built inside of the privileged about content process inside of a Worker off of the main thread, would then send that data back up to the parent to stream into the cache.

At this point, we felt we had all of the pieces that we needed. Construction on each component began.

Construction was remarkably smooth thanks to our initial research and consulting with the relevant teams. We also took the opportunity to carefully document each component.

Testing

One of the more gratifying parts of implementation was when we modified one of our startup tests to use the new caching mechanism.

In this graph, the Y axis is the geometric mean time to render the about:home Top Sites over 20 restarts of the browser, in milliseconds. Lower is better. The dots along the top are without the cache. The dots along the bottom are with the cache enabled. According to our measurements, we improved the rendering time from process start to Top Sites by just over 20%! We beat our prototype!

Noticeable differences

But the real proof will be if there’s actually a noticeable visual change. Here’s that screen recording again from one of our reference devices⁸.

The screen on the left is with the cache disabled, and on the right with the cache enabled. Looks to me like we made a noticeable dent!

Try it out!

We haven’t yet enabled the about:home startup cache in Nightly by default, but we hope to do so soon. In the meantime, Nightly users can try it out right now by going to about:preferences#experimental and toggling it on. If you find problems and have a Bugzilla account, here’s a form for submitting bugs to the right place.

You can tell if the about:home you’re looking at is from the cache by opening up the DevTools Inspector and looking for a  comment just above the <body> tag.

Caveat emptor

There are a few cases where the cache isn’t used or is invalidated.

The first case is if you’ve configured something other than about:home as your home page (where the cache isn’t used). In this case, the cache won’t be read from, and the code to create the cache won’t ever run. If the user ever resets about:home to be their home page, then the caching code will start working for them.

The second case is if you’ve configured Firefox to restore your previous session by default. In this case, it’s unlikely that the first tab you’ll see is about:home, so the cache won’t be read from, and the code to create the cache won’t ever run. As before, if the user switches to not loading their previous session by default, then the cache will start working for them.

Another case is when the Firefox build identifier doesn’t match the build identifier from when the cache was created. This is also how the other startupcache module for static resources works. This ensures that when an update is applied, we don’t accidentally load old assets from the cache. So the first time you launch Firefox after you apply an update will not pull the about:home document from the cache, even if one exists (and will throw the cache out if it does). For Nightly users that generally receive updated builds twice a day, this makes the cache somewhat useless. Beta and Release users update much less frequently, so we expect to see a greater impact there.

The last case is in the event that your disk was in a situation such that reading the dynamic code from the asset bundle was faster than reading from the cache. If by the time the about:home document attempts to load and the cache isn’t ready, we fall back to loading it the old way. We don’t expect this to happen too often, but it’s theoretically possible, so we handle the case.

Future work

The next major step is to get the about:home startup cache turned on by default on Nightly and get it tested by a broader audience. At that point, hopefully we’ll get a better sense of its behaviour out in the wild via bug reports and Telemetry. Then our improvement will either ride the release train, or we might turn it on for subsets of the Beta or Release populations to measure its impact on more realistic restart scenarios. Once we’re confident that it’s helping more than hindering, we’ll turn it on by default for everyone.

After that, I think it would be worth seeing if we can load from the cache more often. Perhaps we could load about:newtab from there as well, for example.

One step at a time!

Thanks to

Florian Quèze and Doug Thayer, both of whom independently approached me with the idea of creating a static about:home
Jay Lim and Ursula Sarracini, both of whom wrote some of the groundwork code that was needed for this feature (namely, the infrastructure for the privileged about content process, and the first version of the moz-page-thumbs support)
Gijs Kruitbosch, Kate Hudson, Ed Lee, Scott Downe, and Gavin Suntop for all of the consulting and reviews
Honza Bambas and Andrew Sutherland for storage consultations
Markus Stange, Andrew Creskey, Eric Smyth, Asif Youssuff, Dan Mosedale, Emily Derr for their generous feedback on this post

This is only true if the user hasn’t just restarted after applying an update, and if they haven’t set a custom home page or configured Firefox to restore their previous session on start. ↩
You can think of startup like a traveling circus coming to town. You have to get the trucks and trailers parked, get the tents set up, hook up power, then lighting and sound … it’s a big, complex operation, and we haven’t even shot a clown out of a cannon yet. ↩
We’re working on something like that ↩
As the user browses, bookmarks and downloads things, their Highlights and Top Sites sections might change. If Pocket is enabled, new stories will also be downloaded periodically. ↩
It’s vitally important that content processes have limited abilities. That way, if they’re ever compromised by a bad actor, there are limits to what damage they can do. The assertions mentioned in this case mean that if a compromised content process tries to “pretend” to be the privileged about content process by sending one of its messages, that the parent process will terminate that content process immediately. ↩
Naming is hard. ↩
This has changed slightly in the past few years with a feature called Race Cache With Network, which races the disk cache with the network instead of relying on the disk entirely. ↩
This device is an Acer Aspire E-15 E5-575-33BM ↩

A few words on main thread disk access for general audiences

I’m writing this in lieu of a traditional Firefox Front-end Performance Update, as I think this will be more useful in the long run than just a snapshot of what my team is doing.

I want to talk about main thread disk access (sometimes referred to more generally as “main thread IO”). Specifically, I’m going to argue that main thread disk access is lethal to program responsiveness. For some folks reading this, that might be an obvious argument not worth making, or one already made ad nauseam — if that’s you, this blog post is probably not for you. You can go ahead and skip most or all of it, if you’d like. Or just skim it. You never know — there might be something in here you didn’t know or hadn’t thought about!

For everybody else, scoot your chairs forward, grab a snack, and read on.

Disclaimer: I wouldn’t call myself a disk specialist. I don’t work for Western Digital or Seagate. I don’t design file systems. I have, however, been using and writing software for computers for a significant chunk of my life, and I seem to have accumulated a bunch of information about disks. Some of that information might be incorrect or imprecise. Please send me mail at mike dot d dot conley at gmail dot com if any of this strikes you as wildly inaccurate (though forgive me if I politely disregard pedantry), and then I can update the post.

The mechanical parts of a computer

If you grab a screwdriver and (carefully) open up a laptop or desktop computer, what do you see? Circuit boards, chips, wires and plugs. Lots of electrons flowing around in there, moving quickly and invisibly.

Notably, there aren’t many mechanical moving parts of a modern computer. Nothing to grease up, nowhere to pour lubricant. Opening up my desktop at home, the only moving parts I can really see are the cooling fans near the CPU and power supply (and if you’re like me, you’ll also notice that your cooling fans are caked with dust and in need of a cleaning).

There’s another moving part that’s harder to see — the hard drive. This might not be obvious, because most mechanical drives (I’ve heard them sometimes referred to as magnetic drives, spinny drives, physical drives, platter drives and HDDs. There are probably more terms.) hide their moving parts inside of the disk enclosure.¹

If you ever get the opportunity to open one of these enclosures (perhaps the disk has failed or is otherwise being replaced, and you’re just about to get rid of it) I encourage you to.

As you disassemble the drive, what you’ll probably notice are circular parts, layered on top of one another on a motor that spins them. In between those circles are little arms that can move back and forth. This next image shows one of those circles, and one of those little arms.

There are several of those circles stacked on top of one another, and several of those arms in between them. We’re only seeing the top one in this photo.

Does this remind you of anything? The circular parts remind me of CDs and DVDs, but the arms reaching across them remind me of vinyl players.

The comparison isn’t that outlandish. If you ignore some of the lower-level details, CDs, DVDs, vinyl players and hard drives all operate under the same basic principles:

The circular part has information encoded on it.
An arm of some kind is able to reach across the radius of the circular part.
Because the circular part is spinning, the arm is able to reach all parts of it.
The end of the arm is used to read the information encoded on it.

There’s some extra complexity for hard drives. Normally there’s more than one spinning platter and one arm, all stacked up, so it’s more like several vinyl players piled on top of one another.

Hard drives are also typically written to as well as read from, whereas CDs, DVDs and vinyls tend to be written to once, and then used as “read-only memory.” (Though, yes, there are exceptions there.)

Lastly, for hard drives, there’s a bit I’m skipping over involving caches, where parts of the information encoded on the spinning platters are temporarily held elsewhere for easier and faster access, but we’ll ignore that for now for simplicity, and because it wrecks my simile.²

So, in general, when you’re asking a computer to read a file off of your hard drive, it’s a bit like asking it to play a tune on a vinyl. It needs to find the right starting place to put the needle, then it needs to put the needle there and only then will the song play.

For hard drives, the act of moving the “arm” to find the right spot is called seeking.

Contiguous blocks of information and fragmentation

Have you ever had to defragment your hard drive? What does that even mean? I’m going to spend a few moments trying to explain that at a high-level. Again, if this is something you already understand, go ahead and skip this part.

Most functional hard drives allow you to do the following useful operations:

Write data to the drive
Read data from the drive
Remove data from the drive

That last one is interesting, because usually when you delete a file from your computer, the information isn’t actually erased from the disk. This is true even after emptying your Trash / Recycling Bin — perhaps surprisingly, the files that you asked to be removed are still there encoded on the circular platters as 1’s and 0’s. This is why it’s sometimes possible to recover deleted files even when it seems that all is lost.

Allow me to explain.

Just like there are different ways of organizing a sock drawer (at random, by colour, by type, by age, by amount of damage), there are ways of organizing a hard drive. These “ways” are called file systems. There are lots of different file systems. If you’re using a modern version of Windows, you’re probably using a file system called NTFS. One of the things that a file system is responsible for is knowing where your files are on the spinning platters. This file system is also responsible for knowing where there’s free space on the spinning platters to write new data to.

When you delete a file, what tends to happen is that your file system marks those sectors of the platter as places where new information can be written to, but doesn’t immediately overwrite those sectors. That’s one reason why sometimes deleted files can be recovered.

Depending on your file system, there’s a natural consequence as you delete and write files of different sizes to the hard drive: fragmentation. This kinda sounds like the actual physical disk is falling apart, but that’s not what it means. Data fragmentation is probably a more precise way of thinking about it.

Imagine you have a sheet of white paper broken up into a grid of 5 boxes by 5 boxes (25 boxes in total), and a box of paints and paintbrushes.

Each square on the paper is white to start. Now, starting from the top-left, and going from left-to-right, top-to-bottom, use your paint to fill in 10 of those boxes with the colour red. Now use your paint to fill in the next 5 boxes with blue. Now do 3 more boxes with yellow.

So we’ve got our colour-filled boxes in neat, organized rows (red, then blue, then yellow), and we’ve got 18 of them filled, and 7 of them still white.

Now let’s say we don’t care about the colour blue. We’re okay to paint over those now with a new colour. We also want to fill in 10 boxes with the colour purple. Hm… there aren’t enough free white boxes to put in that many purple ones, but we have these 5 blue ones we can paint over. Let’s paint over them with purple, and then put the next 5 at the end in the white boxes.

So now 23 of the boxes are filled, we’ve got 2 left at the end that are white, but also, notice that the purple boxes aren’t all together — they’ve been broken apart into two sections. They’ve been fragmented.

This is an incredibly simplified model, but (I hope) it demonstrates what happens when you delete and write files to a hard drive. Gaps open up that can be written to, and bits and pieces of files end up being distributed across the platters as fragments.

This also occurs as files grow. If, for example, we decided to paint two more white boxes red, we’d need to paint the ones at the very end, breaking up the red boxes so that they’re fragmented.

So going back to our vinyl player example for a second — the ideal scenario is that you start a song at the beginning and it plays straight through until the end, right? The more common case with disk drives, however, is you read bits and pieces of a song from different parts of the vinyl: you have to lift and move the arm each time until eventually you have heard the song from start to finish. That seeking of the arm adds overhead to the time it takes to listen to the song from beginning to end.

When your hard drive undergoes defragmentation, what your computer does is try to re-organize your disk so that files are in contiguous sectors on the platters. That’s a fancy way of saying that they’re all in a row on the platter, so they can be read in without the overhead of seeking around to assemble it as fragments.

Skipping that overhead can have huge benefits to your computer’s performance, because the disk is usually the slowest part of your computer.

I’ve skipped over and simplified a bunch of stuff here in the interests of brevity, but this is a great video that gives a crash course on file systems and storage. I encourage you to watch it.

On the relative input / output speeds of modern computing components

I mentioned in the disclaimer at the start of this post that I’m not a disk specialist or expert. Scott Davis is probably a better bet as one of those. His bio lists an impressive wealth of experience, and mentions that he’s “a recognized expert in virtualization, clustering, operating systems, cloud computing, file systems, storage, end user computing and cloud native applications.”

I don’t know Scott at all (if you’re reading this, Hi, Scott!), but let’s just agree for now that he probably knows more about disks than I do.

I’m picking Scott as an expert because of a particularly illustrative analogy that was posted to a blog for a company he used to work for. The analogy compares the speeds of different media that can be used to store information on a computer. Specifically, it compares the following:

RAM
The network with a decent connection
Flash drives
Magnetic hard drives — what we’ve been discussing up until now.

For these media, the post claims that input / output speed can be measured using the following units:

RAM is in nanoseconds
10GbE Network speed is in microseconds (~50 microseconds)
Flash speed is in microseconds (between 20-500+ microseconds)
Disk speed is in milliseconds

That all seems pretty fast. What’s the big deal? Well, it helps if we zoom in a little bit. The post does this by supposing that we pretend that RAM speed happens in minutes.

If that’s the case, then we’d have to measure network speed in weeks.

And if that’s the case, then we’d want to measure the speed of a Flash drive in months.

And if that’s the case, then we’d have to measure the speed of a magnetic spinny disk in decades.

Update (May 23, 2019): My Uncle Mark, who also works in computing, sent me links that show similar visualizations of computing latency: this one has a really excellent infographic, and this one has more discussion. These articles highlight network latency as the worst offender, which is true especially when the quality of service is low, but I’m mostly writing this post for folks who hack on Firefox where the vast majority of networking occurs off of the main thread.

I wish I had some ACM paper, or something written by a computer science professor that I could point to you to bolster the following claim. I don’t, not because one doesn’t exist, but because I’m too lazy to look for one. I hope you’ll forgive me for that, but I don’t think I’m saying anything super controversial when I say:

In the common case, for a personal computer, it’s best to assume that reading and writing to the disk is the slowest operation you can perform.

Sure, there are edge cases where other things in the system might be slower. And there is that disk cache that I breezed over earlier that might make reading or writing cheaper. And sometimes the operating system tries to do smart things to help you. For now, just let it go. I’m making a broad generalization that I think covers the common cases, and I’m talking about what’s best to assume.

Single and multi-threaded restaurants

When I try to describe threading and concurrency to someone, I inevitably fall back to the metaphor of cooks in a kitchen in a restaurant. This is a special restaurant where there’s only one seat, for a single customer — you, the user.

Single-threaded programs

Let’s imagine a restaurant that’s very, very small and simple. In this restaurant, the cook is also acting as the waiter / waitress / server. That means when you place your order, the server / cook goes into the kitchen and makes it for you. While they’re gone, you can’t really ask for anything else — the server / cook is busy making the thing you asked for last.

This is how most simple, single-threaded programs work—the user feeds in requests, maybe by clicking a button, or typing something in, maybe something else entirely—and then the program goes off and does it and returns some kind of result. Maybe at that point, the program just exits (“The restaurant is closed! Come back tomorrow!”), or maybe you can ask for something else. It’s really up to how the restaurant / program is designed that dictates this.

Suppose you’re very, very hungry, and you’ve just ordered a complex five-course meal for yourself at this restaurant. Blanching, your server / cook goes off to the kitchen. While they’re gone, nobody is refilling your water glass or giving you breadsticks. You’re pretty sure there’s activity going in the kitchen and that the server / cook hasn’t had a heart attack back there, but you’re going to be waiting a looooong time since there’s only one person working in this place.

Maybe in some restaurants, the server / cook will dash out periodically to refill your water glass, give you some breadsticks, and update you on how things are going, but it sure would be nice if we gave this person some help back there, wouldn’t it?

Multi-threaded programs

Let’s imagine a slightly different restaurant. There are more cooks in the kitchen. The server is available to take your order (but is also able to cook in the kitchen if need be), and you make your request from the menu.

Now suppose again that you order a five-course meal. The server goes to the kitchen and tells the cooks what you just ordered. In this restaurant, suppose the kitchen staff are a really great team and don’t get in each other’s way³, so they divide up the order in a way that makes sense and get to work.

The server can come back and refill your water glass, feed you breadsticks, perhaps they can tell you an entertaining joke, perhaps they can take additional orders that won’t take as long. At any rate, in this restaurant, the interaction between the user and the server is frequent and rarely interrupted.

The waiter / waitress / server is the main thread

In these two examples, the waiter / waitress / server is what is usually called the main thread of execution, which is the part of the program that the user interacts with most directly. By moving expensive operations off of the main thread, the responsiveness of the program increases.

Have you ever seen the mouse turn into an hourglass, seen the “This program is not responding” message on Windows? Or the spinny colourful pinwheel on macOS? In those cases, the main thread is off doing something and never came back to give you your order or refill your water or breadsticks — that’s how it generally manifests in common operating systems. The program seems “unresponsive”, “sluggish”, “frozen”. It’s “hanging”, or “stuck”. When I hear those words, my immediate assumption is that the main thread is busy doing something — either it’s taking a long time (it’s making you your massive five course meal, maybe not as efficiently as it could), or it’s stuck (maybe they fell down a well!).

In either case, the general rule of thumb to improving program responsiveness is to keep the server filling the user’s water and breadsticks by offloading complex things on the menu to other cooks in the kitchen.

Accessing the disk on the main thread

Recall that in the common case, for a personal computer, it’s best to assume that reading and writing to the disk is the slowest operation you can perform. In our restaurant example, reading or writing to the disk on the main thread is a bit like having your server hop onto their bike and ride out to the next town over to grab some groceries to help make what you ordered.

And sometimes, because of data fragmentation (not everything is all in one place), the server has to search amongst many many shelves all widely spaced apart to get everything.

And sometimes the grocery store is very busy because there are other restaurants out there that are grabbing supplies.

And sometimes there are police checks (anti-virus / anti-malware software) occurring for passengers along the road, where they all have to show their IDs before being allowed through.

It’s an incredibly slow operation. Hopefully by the time the server comes back, they don’t realize they have to go back out again to get more, but they might if they didn’t realize they were missing some more ingredients.⁴

Slow slow slow. And unresponsive. And a great way to lose a hungry customer.

For super small programs, where the kitchen is well stocked, or the ride to the grocery store doesn’t need to happen often, having a single-thread and having it read or write is usually okay. I’ve certainly written my fair share of utility programs or scripts that do main thread disk access.

Firefox, the program I spend most of my time working on as my job, is not a small program. It’s a very, very, very large program. Using our restaurant model, it’s many large restaurants with many many cooks on staff. The restaurants communicate with each other and ship food and supplies back and forth using messenger bikes, to provide to you, the customer, the best meals possible.

But even with this large set of restaurants, there’s still only a single waiter / waitress / server / main thread of execution as the point of contact with the user.

Part of my job is to help organize the workflows of this restaurant so that they provide those meals as quickly as possible. Sending the server to the grocery store (main thread disk access) is part of the workflow that we absolutely need to strike from the list.

Start-up main-thread disk access

Going back to our analogy, imagine starting the program like opening the restaurant. The lights go on, the chairs come off of the tables, the kitchen gets warmed up, and prep begins.

While this is occurring, it’s all hands on deck — the server might be off in the kitchen helping to do prep, off getting cutlery organized, whatever it takes to get the restaurant open and ready to serve. Before the restaurant is open, there’s no point in having the server be idle, because the customer hasn’t been able to come in yet.

So if critical groceries and supplies needed to open the restaurant need to be gotten before the restaurant is open, it’s fine to send the server to the store. Somebody has to do it.

For Firefox, there are various things that need to take place before we can display any UI. At that point, it’s usually fine to do main-thread disk access, so long as all of the things being read or written are kept to an absolute minimum. Find how much you need to do, and reduce it as much as possible.

But as soon as UI is presented to the user, the restaurant is open. At that point, the server should stay off their bike and keep chatting with the customer, even if the kitchen hasn’t finished setting up and getting all of their supplies. So to stay responsive, don’t do disk access on the main thread of execution after you’ve started to show the user some kind of UI.

Disk contention

There’s one last complication I want to capture here with our restaurant example before I wrap up. I’ve been saying that it’s important to send anyone except the server to the grocery store for supplies. That’s true — but be careful of sending too many other people at the same time.

Moving disk access off of the main thread is good for responsiveness, full stop. However, it might do nothing to actually improve the overall time that it takes to complete some amount of work. Put it another way: just because the server is refilling your glass and giving you breadsticks doesn’t mean that your five-course meal is going to show up any faster.

Also, disk operations on magnetic drives do not have a constant speed. Having the disk do many things at once within a single program or across multiple programs can slow the whole set of operations down due to the overhead of seeking and context switching, since the operating system will try to serve all disk requests at once, more or less.⁵

Disk contention and main thread disk access is something I think a lot about these days while my team and I work on improving Firefox start-up performance.

Some questions to ask yourself when touching disk

So it’s important to be thoughtful about disk access. Are you working on code that touches disk? Here are some things to think about:

Is UI visible, and responsiveness a goal?

If so, best to move the disk access off of the main-thread. That was the main thing I wanted to capture, and I hope I’ve convinced you of that point by now.

Does the access need to occur?

As programs age and grow and contributors come and go, sometimes it’s important to take a step back and ask, “Are the assumptions of this disk access still valid? Does this access need to happen at all?” The fastest code is the code that doesn’t run at all.

What else is happening during this disk access? Can disk access be prioritized more efficiently?

This is often trickier to answer as a program continues to run. Thankfully, tools like profilers can help capture recordings of things like disk access to gain evidence of simultaneous disk access.

Start-up is a special case though, since there’s usually a somewhat deterministic / reliably stable set of operations that occur in the same way in roughly the same order during start-up. For start-up, using a tool like a profiler, you can gain a picture of the sorts of things that tend to happen during that special window of time. If you notice a lot of disk activity occurring simultaneously across multiple threads, perhaps ponder if there’s a better way of ordering those operations so that the most important ones complete first.

Can we reduce how much we need to read or write?

There are lots of wonderful compression algorithms out there with a variety of performance characteristics that might be worth pondering. It might be worth considering compressing the data that you’re storing before writing it so that the disk has to write less and read less.

Of course, there’s compression and decompression overhead to consider here. Is it worth the CPU time to save the disk time? Is there some other CPU intensive task that is more critical that’s occurring?

Can we organize the things that we want to read ahead of time so that they’re more likely to be read contiguously (without seeking the disk)?

If you know ahead of time the sorts of things that you’re going to be reading off of the disk, it’s generally a good strategy to store them in that read order. That way, in the best case scenario (the disk is defragmented), the read head can fly along the sectors and read everything in, in exactly the right order you want them. If the user has defragmented their disk, but the things you’re asking for are all out of order on the disk, you’re adding overhead to seek around to get what you want.

Supposing that the data on the disk is fragmented, I suspect having the files in order anyways is probably better than not, but I don’t think I know enough to prove it.

Flawed but useful

One of my mentors, Greg Wilson, likes to say that “all models are flawed, but some are useful”. I don’t think he coined it, but he uses it in the right places at the right times, and to me, that’s what counts.

The information in this post is not exhaustive — I glossed over and left out a lot. It’s flawed. Still, I hope it can be useful to you.

Thanks

Thanks to the following folks who read drafts of this and gave feedback:

Mandy Cheang
Emily Derr
Gijs Kruitbosch
Doug Thayer
Florian Quèze

There are also newer forms of disks called Flash disks and SSDs. I’m not really going to cover those in this post. ↩
The other thing to keep in mind is that the disk cache can have its contents evicted at any time for reasons that are out of your control. If you time it right, you can maybe increase the probability of a file you want to read being in the cache, but don’t bet the farm on it. ↩
When writing multi-threaded programs, this is much harder than it sounds! Mozilla actually developed a whole new programming language to make that easier to do correctly. ↩
Keen readers might notice I’m leaving out a discussion on Paging. That’s because this blog post is getting quite long, and because it kinda breaks the analogy a bit — who sends groceries back to a grocery store? ↩
I’ve never worked on an operating system, but I believe most modern operating systems try to do a bunch of smart things here to schedule disk requests in efficient ways. ↩

Firefox Front-End Performance Update #17

Hello, folks. I wanted to give a quick update on what the Firefox Front-end Performance team is up to, so let’s get into it.

The name of the game continues to be start-up performance. We made some really solid in-roads last quarter, and this year we want to continue to apply pressure. Specifically, we want to focus on reducing IO (specifically, main-thread IO) during browser start-up.

Reducing main thread IO during start-up

There are lots of ways to reduce IO – in the best case, we can avoid start-up IO altogether by not doing something (or deferring it until much later). In other cases, when the browser might be servicing events on the main thread, we can move IO onto another thread. We can also re-organize, pack or compress files differently so that they’re read off of the disk more efficiently.

If you want to change something, the first step is measuring it. Thankfully, my colleague Florian has written a rather brilliant test that lets us take accounting of how much IO is going on during start-up. The test is deterministic enough that he’s been able to write a whitelist for the various ways we touch the disk on the main thread during start-up, and that whitelist means we’ve made it much more difficult for new IO to be introduced on that thread.

That whitelist has been processed by the team, and have been turned into bugs, bucketed by the start-up phase where the IO is occurring. The next step is to estimate the effort and potential payoff of fixing those bugs, and then try to whittle down the whitelist.

And that’s effectively where we’re at. We’re at the point now where we’ve got a big list of work in front of us, and we have the fun task of burning that list down!

Being better at loading DLLs on Windows

While investigating the warm-up service for Windows, Doug Thayer noticed that we were loading DLLs during start-up oddly. Specifically, using a tool called RAMMap, he noticed that we were loading DLLs using “read ahead” (eagerly reading the entirety of the DLL into memory) into a region of memory marked as not-executable. This means that anytime we actually wanted to call a library function within that DLL, we needed to load it again into an executable region of memory.

Doug also noticed that we were unnecessarily doing ReadAhead for the same libraries in the content process. This wasn’t necessary, because by the time the content process wanted to load these libraries, the parent process would have already done it and it’d still be “warm” in the system file cache.

We’re not sure why we were doing this ReadAhead-into-unexecutable-memory work – it’s existence in the Firefox source code goes back many many years, and the information we’ve been able to gather about the change is pretty scant at best, even with version control. Our top hypothesis is that this was a performance optimization that made more sense in the Windows XP era, but has since stopped making sense as Windows has evolved.

UPDATE: Ehsan pointed us to this bug where the change likely first landed. It’s a long and wind-y bug, but it seems as if this was indeed a performance optimization, and efforts were put in to side-step effects from Prefetch. I suspect that later changes to how Prefetch and SuperFetch work ultimately negated this optimization.

Doug hacked together a quick prototype to try loading DLLs in a more sensible way, and the he was able to capture quite an improvement in start-up time on our reference hardware:

This graph measures various start-up metrics. The scatter of datapoints on the left show the “control” build, and they tighten up on the right with the “test” build. Lower is better.

At this point, we all got pretty excited. The next step was to confirm Doug’s findings, so I took his control and test builds, and tested them independently on the reference hardware using frame recording. There was a smaller¹, but still detectable improvement in the test build. At this point, we decided it was worth pursuing.

Doug put together a patch, got it reviewed and landed, and we immediately saw an impact in our internal benchmarks.

We’re also seeing the impact reflected in Telemetry. The first Nightly build with Doug Thayer’s patch went out on April 14th, and we’re starting to see a nice dip in some of our graphs here:

This graph measures the time at which the browser window reports that it has first painted. April 14th is the second last date on the X axis, and the Y axis is time. The top-most line is plotting the 95th percentile, and there’s a nice dip appearing around April 14th.

There are other graphs that I’d normally show for improvements like this, except that we started tracking an unrelated regression on April 16th which kind of muddies the visualization. Bad timing, I guess!

We expect this improvement to have the greatest impact on weaker hardware with slower disks, but we’ll be avoiding some unnecessary work for all Windows users, and that gets a thumbs-up in my books.

If all goes well, this fix should roll out in Firefox 68, which reaches our release audience on July 9th!

My test machine has SuperFetch disabled to help reduce noise and inconsistency with start-up tests, and we suspect SuperFetch is able to optimize start-up better in the test build ↩

Firefox Front-End Performance Update #16

With Firefox 67 only a few short weeks away, I thought it might be interesting to take a step back and talk about some of the work that the Firefox Front-end Performance team is shipping to users in that particular release.

To be clear, this is not an exhaustive list of the great performance work that’s gone into Firefox 67 – but I picked a few things that the front-end team has been focused on to talk about.

Stop loading things we don’t need right away

The fastest code is the code that doesn’t run at all. Sometimes, as the browser evolves, we realize that there are components that don’t need to be loaded right away during start-up, and can instead of deferred until sometime after start-up. Sometimes, that means we can wait until the very last moment to initialize some component – that’s called loading something lazily.

Here’s a list of things that either got deferred until sometime after start-up, or made lazy:

FormAutofillContent and FormValidationChild

These are modules that support, you guessed it, Form Autofill – that part of the browser that helps you fill in web forms, and makes sure forms are passing validation. We were loading these modules too early, and now we load them only when there are forms to auto-fill or validate on a page.

The hidden window

The Hidden Window is a mysterious chunk of code that manages the state of the global menu bar on macOS when there are no open windows. The Hidden Window is also sometimes used as a singleton DOM window where various operations can take place. On Linux and Windows, it turns out we were creating this Hidden Window far early than needs be, and now it’s quite a bit lazier.

Page style

Page Style is a menu you can find under View in the main menu bar, and it’s used to switch between alternative style sheets on a page. It’s a pretty rarely used feature from what we can tell, but we were scanning pages for their alternative stylesheets far earlier than we needed to. We were also scanning pages that we know don’t have alternative stylesheets, like the about:home / about:newtab page. Now we only scan normal web pages, and we do so only after we service the idle event queue.

Cache invalidation

The Startup Cache is an important part of Firefox startup performance. It’s primary job is to cache computations that occur during each startup so that they only have to happen every once in a while. For example, the mark-up of the browser UI often doesn’t change from startup to startup, so we can cache a version of the mark-up that’s faster to read from disk, and only invalidate that during upgrades.

We were invalidating the whole startup cache every time a WebExtension was installed or uninstalled. This used to be necessary for old-style XUL add-ons (since those could cause changes to the UI that would need to go into the cache), but with those add-ons no longer available, we were able to remove the invalidation. This means faster startups more often.

Don’t touch the disk

The disk is almost always the slowest part of the system. Reading and writing to the disk can take a long time, especially on spinning magnetic drives. The less we can read and write, the better. And if we’re going to read, best to do it off of the main thread so that the UI remains responsive.

Old XUL icons code

We were reading from the disk on the main thread to search for window-specific icons to display in the window titlebar.

Firefox doesn’t use window-specific icons, so we made it so that we skip these checks. This means less disk activity, which is great for responsiveness and start-up!

Hitting every directory on the way down

We noticed that when we were checking that a directory exists on Windows (to write a file to it), we were using the CreateDirectoryW Windows API. This API checks each folder on the way down to the last one to see if they exist. That’s a lot of disk IO! We can avoid this if we assume that the parent directories exist, and only fall into the slow path if we fail to write our file. This means that we hit the faster path with less IO more often, which is good for responsiveness and start-up time.

Enjoy your Faster Fox!

Firefox 67 is slated to ship with these improvements on May 14th – just a little over a month away. Enjoy!

Firefox Front-End Performance Update #15

Firefox 66 has been released, Firefox 67 is out on the beta channel, and Firefox 68 is cooking for the folks on the Nightly channel! These trains don’t stop!

With that, let’s take a quick peek at what the Firefox Front-end Performance team has been doing these past few weeks…

Volunteer Contributor Highlight: Nikki!

I first wanted to call out some great work from Nikki, who’s a volunteer contributor. Nikki fixed a bug where we’d stall the parent process horribly if ever hovering a link with a really really long URL (like a base64 encoded Data URL). Stalling the parent process is the worst, because it makes everything else seem slow as a result.

Thank you for your work, Nikki!

Document Splitting Foundations for WebRender (In-Progress by Doug Thayer)

An impressive set of patches were recently queued to land, which should bring document splitting to WebRender, but in a disabled state. The gfx.webrender.split-render-roots pref is what controls it, but I don’t think we can reap the full benefits of document splitting until we get retained display lists enabled in the parent process for the UI. I believe, at that point, we can start enabling document splitting, which means that updating the browser UI area will not involve sending updates to the content area for WebRender.

In other WebRender news, it looks like it should be enabled by default for some of our users on the release channel in Firefox 67, due to be released in mid-May!

Warm-up Service (In-Progress by Doug Thayer)

Doug has written the bits of code that tie a Firefox preference to an HKLM registry key, which can be read by the warm-up service at start-up. The next step is to add a mode to the Firefox executable that loads its core DLLs and then exits, and then have the warm-up service call into that mode if enabled.

Once this is done, we should be in a state where we can user test this feature.

Startup Cache Telemetry (In-Progress by Doug Thayer)

Two things of note here:

With the probes having now uplifted to Beta, data will slowly trickle in these next few days that will show us how the Firefox startup cache is behaving in the wild for users that aren’t receiving two updates a day (like our Nightly users). This important, because oftentimes, those updates cause some or all of the startup cache to be invalidated. We’re eager to see how the startup caches are behaving in the wild on Beta.
One of the tests that was landed for the startup cache Telemetry appears to have caught an issue with how the QuantumBar code works with it – this is useful, because up until now, we’ve had very little automated testing to ensure that the startup cache is working as expected.

Smoother Tab Animations (Paused by Felipe Gomes)

UX, Product and Engineering have been having discussions about how the new tab animations work, and one thing has been decided upon: we want our User Research team to run some studies to see how tab animations are perceived before we fully commit to changing one of the fundamental interactions in the browser. So, at this time, Felipe is pausing his efforts here until User Research comes back with some information on guidance.

Browser Adjustment Project (Concluded by Gijs Kruitbosch)

We originally set out to see whether or not we could do something for users running weaker hardware to improve their browsing experience. Our initial hypothesis was that by lowering the frame rate of the browser on weaker hardware, we could improve the overall page load time.

This hypothesis was bolstered by measurements done in late 2018, where it appeared that by manually lowering the frame rate on a weaker reference laptop, we could improve our internal page load benchmarks by a significant degree. This measurement was reproduced by Denis Palmeiro on Vicky Chin’s team, and so Gijs started implementing a runtime detection mechanism to do that lowering of the frame rate for machines with 2 or fewer cores where each core’s clockspeed was 1.8Ghz or slower¹.

However, since then, we’ve been unable to reproduce the same positive effect on page load time. Neither has Denis. We suspect that recent work on the RefreshDriver, which changes how often the RefreshDriver runs during the page load window, is effectively getting the same kind of win².

We did one final experiment to see whether or not lowering the frame rate would improve battery life, and it appeared to, but not to a very high degree. We might revisit that route were we tasked with trying to improve power usage in Firefox.

So, to reduce code complexity, Gijs landed patches to remove the low-end hardware switches and frame rate lowering code today. This experiment and project is now concluded. It’s not a satisfying end with a slum dunk perf win, but you can’t win them all.

Better about:newtab Preloading (Completed by Gijs Kruitbosch)

The patch to preload about:newtab in an idle callback has landed and stuck! This means that we don’t preload about:newtab immediately after opening a new tab (which is good for responsiveness right around the time when you’re likely to want to do something), and also means that we have the possibility of preloading the first new tab in new windows! Great job, Gijs!

Experiments with the Process Priority Manager (In-Progress by Mike Conley)

I had a meeting today with Saptarshi, one of our illustrious Data Scientists, to talk about the upcoming experiment. One of the things he led me to conclude was that this experiment is going to have a lot of confounds, and it will be difficult to conclude things from.

Part of the reason for that is because there are often times when a background tab won’t actually have its content process priority lowered. The potential reasons for this are:

The tab is running in a content process which is also hosting a tab that is running in the foreground of either the same or some other browser window.
The tab is playing audio or video.

Because of this, we can’t actually do things like measure how page load is being impacted by this feature because we don’t have a great sense of how many tabs have their content process priorities lowered. That’s just not a thing we collect with Telemetry. It’s theoretically possible, either due to how many windows or videos or tabs our Beta users have open, that very few of them will ever actually have their content process priorities lowered, and then the data we’d draw from Telemetry would be useless.

I’m working with Saptarshi now to try to find ways of either altering the process priority manager or adding new probes to reduce the number of potential confounds.

Grab bag of other performance improvements

A member of the Wikimedia Performance team noticed that Firefox 66 dramatically improved it’s page load time on their Facebook benchmark. This is likely due to the hard work of Randell Jesup, who made setTimeout lower priority during page load.
Nathan Froyd made it so that we do less main-thread IO on Windows when checking for the existence of filesystem directories

These criteria for what makes “weak hardware” was mostly plucked from the air, but we had to start somewhere. ↩
But for all users, not just users on weaker hardware. ↩