Tag Archives: firefox

Improving Firefox Startup Time With The about:home Startup Cache

Don’t bury the lede

We’re working on a thing to make Firefox start faster! It appears to work! Here’s a video showing off a before (left) and after (right):

Improving Firefox Startup Time With The about:home Startup Cache

For the past year or so, the Firefox Desktop Front-End Performance team has been concentrating on making improvements to browser startup performance.

The launching of an application like Firefox is quite complex. Meticulous profiling of Firefox startup in various conditions has, thankfully, helped reveal a number of opportunities where we can make improvements. We’ve been evaluating and addressing these opportunities, and several have made it into the past few Firefox releases.

This blog post is about one of those improvements that is currently in the later stages of development. I’m going to describe the improvement, and how we went about integrating it.

In a default installation of Firefox, the first (and only) tab that loads is about:home¹.

The about:home page is actually the same thing that appears when you open a new tab (about:newtab). The fact that they have different addresses allows us to treat their loading differently.

Your about:home might look slightly different from the above — depending on your locale, it may or may not include the Pocket stories.

Do not be fooled by what appears to be a very simple page of images and text. This page is actually quite sophisticated under the hood. It is designed to be customized by the user in the following ways:

Users can

Collapse or expand sections
Remove sections entirely
Reorganize the order of their Top Sites by dragging and dropping
Pin and unpin Top Sites to their positions
Add their own custom Top Sites with custom thumbnails
Add or remove search engines from their Top Sites
Change the number of rows in the Top Sites and Recommended by Pocket sections
Choose to have the Highlights composed of any of the following:
- Visited pages
- Recent bookmarks
- Recent downloads
- Pages recently saved to Pocket

The user can customize these things at any time, and any open copies of the page are expected to reflect those customizations immediately.

There are further complexities beyond user customization. The page is also designed to be easy for our design and engineering teams to experiment with reorganizing the layout and composition of the page so that they can test variations on its layout in the wild.

The about:home page also has special privileges not afforded to normal websites. It can

Save and remove bookmarks
Add pages to Pocket
Cause the URL bar to be focused and selected
Show thumbnails for pages that the user has visited
Access both high and normal resolution favicons
Render information about the user’s recent activity (recent page visits, downloads, saves to Pocket, etc.)

So while at first glance, this appears to be a static page of just images and text, rest assured that the page can do much more.

Like the Firefox Developer Tools UI, about:home is written with the help of the React and Redux libraries. This has allowed the about:home development team to create sophisticated, reusable, and composable components that could be easily tested using modern JavaScript testing methods.

Unsurprisingly, this complexity and customizability comes at a cost. The page needs to request a state object from the parent process in order to have the Redux store populated and to have React render it. Essentially, the page is dynamically rendering itself after the markup of the page loads.

Startup is a critical time for an application. The user has expressed a need for their browser, and we have an obligation to serve the user as quickly and efficiently as possible. The user’s time is a resource that we should not squander. Similarly, because so much needs to occur during startup,² disk reads, disk writes, and CPU time are also considered precious resources. They should only be used if there’s no other choice.

In this case, we believed that the CPU time and disk accesses spent constructing the state object and dynamically rendering the about:home page was competing with all of the other CPU and disk access happening during startup, and this was slowing us down from presenting about:home to the user in a timely way.

Generally speaking, in my mind there are four broad approaches to performance problems once a bottleneck has been identified.

You can widen the bottleneck (make the operations more efficient)
You can divide the bottleneck (split the work into smaller slices that can be done over a longer period of time with rests in between)
You can move the bottleneck (defer work until later when it seems that there is less competition for resources, or move it to a different thread)
You can remove the bottleneck (don’t do the work)

We started by trying to apply the last two approaches, wondering what startup performance would be like if the page did not render itself dynamically, but was instead a static page generated periodically and pulled off of the disk at startup.

Prototype when possible

The first step to improving something is finding a way to measure it. Thankfully, we already have a number of logged measurements for startup. One of those measurements gives us the time from process start to rendering the Top Sites section of about:home. This is not a perfect measurement—ideally, we’d measure to the point that the page finally “settles” and stops changing³—but for this project, this measurement served our purposes.

Before investing a bunch of time into a potential improvement, it’s usually a good idea to try to see if what you’re gaining is worth the development time. It’s not always possible to build a prototype for performance improvements, but in this case it was.

The team quickly threw together a static copy of about:home and hacked together a patch to load that document during startup, rather than dynamically rendering the page. We then tested that page on our reference hardware. As of this writing, it’s been about five months since that test was done, but according to this comment, the prototype yielded what appears to be an almost 20% win on time from process start to about:home painting Top Sites.

So, with that information, we thought we had a real improvement opportunity here. We decided to proceed with the idea, and began a long arduous search for “the right way to do it.”

Pre-production

As I mentioned earlier, about:home is complex. The infrastructure that powers it is complex. Coupled with the fact that no one on the Firefox Front-End Performance team had spent much time studying React and Redux meant that we had a lot of learning to do.

The first step was to get some React and Redux fundamentals under our belt. This meant building some small toy applications and getting familiar with the framework idioms and how things are organized.

With that grounding, the next step was to start reading the code — starting from the entrypoint into the code that powers about:home when the browser starts. This was an intense period of study that branched into many different directions. Part of the complexity was because much of the code is asynchronous and launched work on different threads, which introduced some non-determinism. While it is generally good for responsiveness to move work off of the main thread, it can lead to some complex reading and interpretation of the code when more than two threads are involved.

A tool we used during this analysis was the Firefox Profiler, to get a realistic sense of the order of executions during startup. These profiles helped to inform much of our reading of the code.

This analysis helped us solidify our mental model of how about:home loads. With that model in place, it was much easier to propose practical approaches for introducing a static about:home document into the ecosystem of pre-existing code. The Firefox Front-End Performance team documented our findings and recommendations and then presented them to the team that originally built the about:home system to ensure that we were all on the same page and that we hadn’t missed anything critical. They were already aware that we were investigating potential performance improvements, and had very useful feedback for us, as well as historical product decision context that clarified our understanding.

Critically, we presented our recommendation for loading a static about:home page at startup and ensured that there were no upcoming plans for about:home that would break our mental model or render the recommendation no longer valid. Thankfully, it sounded like we were aligned and fine to proceed with our plan.

So what was the plan? We knew that since about:home is quite dynamic and can change over time⁴ we needed a startup cache for about:home that could be periodically updated during the course of a browsing session. We would then load from that cache at startup. Clearly, I’m glossing over some details here, but that was the general plan.

As usual, no plan survives breakfast, and as we started to architect our solution, we identified things we would need to change along the way.

Development

We knew that the process that loads about:home would need to be able to read from the about:home startup cache. We also knew that about:home can potentially contain information about what pages the user has visited, and that about:home can do privileged things that normal web pages cannot. It seemed that this project would be a good opportunity to finish a project that was started (and mothballed) a year or so earlier: creating a special privileged content process for about:home. We would load about:home in that process, and add assertions to ensure that privileged actions from about:home could only happen from that content process type⁵

So getting the “privileged about content process”⁶ fixed up and ready for shipping was the first step.

This also paved the way for solving the next step, which was to enable the moz-page-thumb:// protocol for the “privileged about content process.” The moz-page-thumb:// protocol is used to show the screenshot thumbnails for pages that the user has visited in the past. The previous implementation was using Blob URLs to send those thumbnails down to the page, and those Blob URLs exist only during runtime and would not work properly after a restart.

The next step was figuring out how to build the document that would be stored in the cache. Thankfully, ReactDOMServer has the ability to render a React application to a string. This is normally used for server-side rendering of React-powered applications. This feature also allows the React library to passively attach to the server-side page without causing the DOM to be modified. With some small modifications, we were able to build a simple mechanism in a Web Worker to produce this cached document string off of the main thread. Keeping this work off of the main thread would help maintain responsiveness.

With those pieces of foundational work out of the way, it was time to figure out the cache storage mechanism. Firefox already has a startupcache module that it uses for static resources like markup and JavaScript, but that cache is not designed to be written to periodically at runtime. We would need something different.

We had originally supposed that we would need to give the privileged about content process special access to a file on the filesystem to read from and to write to (since our sandbox prevents content processes from accessing disks directly). Initial experiments along this line worried us — we didn’t like the idea of poking holes in the sandbox if we didn’t need to. Also, adding yet another read from the filesystem during startup seemed counter to our purposes.

We evaluated IndexedDB as a storage mechanism, but the DOM team talked us out of it. The performance characteristics of IndexedDB, especially during startup, were unlikely to work for us.

Finally, after some consulting, we were directed to the HTTP cache. The HTTP cache’s job is to cache pages that the user visits (when appropriate) and to offer those caches to the user instead of hitting the network when retrieving the resource within the expiration time⁷. Functionally speaking, this seemed like a storage mechanism perfectly suited to our purposes.

After consulting with the Necko team and building a few proof-of-concepts, we figured out how to tie the whole system together. Importantly, we figured out how to get the initial about:home load to pull a document out from the HTTP cache rather than reading it from the application resource package.

We also figured out the cache writing mechanism. The cached document that would periodically get built inside of the privileged about content process inside of a Worker off of the main thread, would then send that data back up to the parent to stream into the cache.

At this point, we felt we had all of the pieces that we needed. Construction on each component began.

Construction was remarkably smooth thanks to our initial research and consulting with the relevant teams. We also took the opportunity to carefully document each component.

Testing

One of the more gratifying parts of implementation was when we modified one of our startup tests to use the new caching mechanism.

In this graph, the Y axis is the geometric mean time to render the about:home Top Sites over 20 restarts of the browser, in milliseconds. Lower is better. The dots along the top are without the cache. The dots along the bottom are with the cache enabled. According to our measurements, we improved the rendering time from process start to Top Sites by just over 20%! We beat our prototype!

Noticeable differences

But the real proof will be if there’s actually a noticeable visual change. Here’s that screen recording again from one of our reference devices⁸.

The screen on the left is with the cache disabled, and on the right with the cache enabled. Looks to me like we made a noticeable dent!

Try it out!

We haven’t yet enabled the about:home startup cache in Nightly by default, but we hope to do so soon. In the meantime, Nightly users can try it out right now by going to about:preferences#experimental and toggling it on. If you find problems and have a Bugzilla account, here’s a form for submitting bugs to the right place.

You can tell if the about:home you’re looking at is from the cache by opening up the DevTools Inspector and looking for a  comment just above the <body> tag.

Caveat emptor

There are a few cases where the cache isn’t used or is invalidated.

The first case is if you’ve configured something other than about:home as your home page (where the cache isn’t used). In this case, the cache won’t be read from, and the code to create the cache won’t ever run. If the user ever resets about:home to be their home page, then the caching code will start working for them.

The second case is if you’ve configured Firefox to restore your previous session by default. In this case, it’s unlikely that the first tab you’ll see is about:home, so the cache won’t be read from, and the code to create the cache won’t ever run. As before, if the user switches to not loading their previous session by default, then the cache will start working for them.

Another case is when the Firefox build identifier doesn’t match the build identifier from when the cache was created. This is also how the other startupcache module for static resources works. This ensures that when an update is applied, we don’t accidentally load old assets from the cache. So the first time you launch Firefox after you apply an update will not pull the about:home document from the cache, even if one exists (and will throw the cache out if it does). For Nightly users that generally receive updated builds twice a day, this makes the cache somewhat useless. Beta and Release users update much less frequently, so we expect to see a greater impact there.

The last case is in the event that your disk was in a situation such that reading the dynamic code from the asset bundle was faster than reading from the cache. If by the time the about:home document attempts to load and the cache isn’t ready, we fall back to loading it the old way. We don’t expect this to happen too often, but it’s theoretically possible, so we handle the case.

Future work

The next major step is to get the about:home startup cache turned on by default on Nightly and get it tested by a broader audience. At that point, hopefully we’ll get a better sense of its behaviour out in the wild via bug reports and Telemetry. Then our improvement will either ride the release train, or we might turn it on for subsets of the Beta or Release populations to measure its impact on more realistic restart scenarios. Once we’re confident that it’s helping more than hindering, we’ll turn it on by default for everyone.

After that, I think it would be worth seeing if we can load from the cache more often. Perhaps we could load about:newtab from there as well, for example.

One step at a time!

Thanks to

Florian Quèze and Doug Thayer, both of whom independently approached me with the idea of creating a static about:home
Jay Lim and Ursula Sarracini, both of whom wrote some of the groundwork code that was needed for this feature (namely, the infrastructure for the privileged about content process, and the first version of the moz-page-thumbs support)
Gijs Kruitbosch, Kate Hudson, Ed Lee, Scott Downe, and Gavin Suntop for all of the consulting and reviews
Honza Bambas and Andrew Sutherland for storage consultations
Markus Stange, Andrew Creskey, Eric Smyth, Asif Youssuff, Dan Mosedale, Emily Derr for their generous feedback on this post

This is only true if the user hasn’t just restarted after applying an update, and if they haven’t set a custom home page or configured Firefox to restore their previous session on start. ↩
You can think of startup like a traveling circus coming to town. You have to get the trucks and trailers parked, get the tents set up, hook up power, then lighting and sound … it’s a big, complex operation, and we haven’t even shot a clown out of a cannon yet. ↩
We’re working on something like that ↩
As the user browses, bookmarks and downloads things, their Highlights and Top Sites sections might change. If Pocket is enabled, new stories will also be downloaded periodically. ↩
It’s vitally important that content processes have limited abilities. That way, if they’re ever compromised by a bad actor, there are limits to what damage they can do. The assertions mentioned in this case mean that if a compromised content process tries to “pretend” to be the privileged about content process by sending one of its messages, that the parent process will terminate that content process immediately. ↩
Naming is hard. ↩
This has changed slightly in the past few years with a feature called Race Cache With Network, which races the disk cache with the network instead of relying on the disk entirely. ↩
This device is an Acer Aspire E-15 E5-575-33BM ↩

Firefox Front-End Performance Update #12

Well, here I am again – apologizing about a late update. Lots of stuff has been going on performance-wise in the Firefox code-base, and I’ll just be covering a small section of it here.

You might also notice that I changed the title of the blog series from “Firefox Performance Update” to “Firefox Front-end Performance Update”, to reflect that the things the Firefox Front-end Performance team is doing to keep Firefox speedy (though I’ll still add a grab-bag of other performance related work at the end).

So what are we waiting for? What’s been going on?

Migrate consumers to the new Places Observer system (Paused by Doug Thayer)

Doug was working on this later in 2018, and successfully ported a good chunk of our bookmarks code to use the new batched Places Observer system. There’s still a long-tail of other call sites that need to be updated to the new system, but Doug has shifted focus from this to other things in the meantime.

Document Splitting (In-Progress by Doug Thayer)

With WebRender becoming an ever-closer reality to our general user population, Doug has been focusing on “Document Splitting”, which makes WebRender more efficient by splitting updates that occur in the browser UI from updates that occur in the content area.

This has been a pretty long-haul task, but Doug has been plugging away, and landed a significant chunk of the infrastructure for this. At this time, Doug is working with kats to make Document Splitting integrate nicely with Async-Pan-Zooming (APZ).

The current plan is for Document Splitting to land disabled by default, since it’s blocked by parent-process retained display lists (which still have a few bugs to shake out).

Warm-up Service (In-Progress by Doug Thayer)

Doug is investigating the practicalities of having a service run during Windows start-up to preload various files that Firefox will need when started.

Doug’s prototype shows that this can save us something like 1 second of net start-up time, at least on the reference hardware.

We’re still researching this at multiple levels, and haven’t yet determined if this is a thing that we’d eventually want to ship. Stay tuned.

Smoother Tab Animations (In-Progress by Felipe Gomes)

After much ado, simplification, and review back-and-forth, the initial set of new tab animations have landed in Nightly. You can enable them by setting browser.tabs.newanimations to true in about:config and then restarting the browser. These new animations run entirely on the compositor, instead of painting at each refresh driver tick, so they should be smoother than the current animations that we ship.

There are still some cases that need new animations, and Felipe is waiting on UX for those.

Overhauling about:performance (V1 Completed by Florian Quèze)

The new about:performance shipped late last year, and now shows both energy as well as memory usage of your tabs and add-ons.

The current iteration allows you to close the tabs that are hogging your resources. Current plans should allow users to pause JavaScript execution in busy background tabs as well.

Browser Adjustment Project (In-Progress by Gijs Kruitbosch)

Gijs has landed some patches in Nightly (which have recently uplifted to Beta, and are only enabled on early Betas), which lowers the default frame rate of Firefox from 60fps to 30fps on devices that are considered “low-end”¹.

This has been on Nightly for a while, but as our Nightly population tends to skew to more powerful hardware, we expect not a lot of users have experienced the impact there.

At least one user has noticed the lowered frame rate on Beta, and this has highlighted that our CPU sampling code doesn’t take dynamic changes to clock speed into account.

While the lowered frame rate seemed to have a positive impact on page load time in the lab on our “low-end” reference hardware, we’re having a much harder time measuring any appreciable improvement in CI. We have scheduled an experiment to see if improvements are detectable via our Telemetry system on Beta.

We need to be prepared that this particular adjustment will either not have the desired page load improvement, or will result in a poorer quality of experience that is not worth any page load improvement. If that’s the case, we still have a few ideas to try, including:

Lowering the refresh driver tick, rather than the global frame rate. This would mean things like scrolling and videos would still render at 60fps, but painting the UI and web content would occur at a lower frequency.
Use the hardware vsync again (switching to 30fps turns hardware vsync off), but just paint every other time. This is to test whether or not software vsync results in worse page load times than hardware vsync.

Avoiding spurious about:blank loads in the parent process (Completed by Gijs Kruitbosch)

Gijs short-circuited a bunch of places where we were needlessly creating about:blank documents that we were just going to throw away (see this bug and dependencies). There are still a long tail of cases where we still do this in some cases, but they’re not the common cases, and we’ve decided to apply effort for other initiatives in the meantime.

Experiments with the Process Priority Manager (In-Progress by Mike Conley)

This was originally Doug Thayer’s project, but I’ve taken it on while Doug focuses on the epic mountain that is WebRender Document Splitting.

If you recall, the goal of this project is to lower the process priority for tabs that are only sitting in the background. This means that if you have tabs in the background that are attempting to use system resources (running JavaScript for example), those tabs will have less priority at the operating system level than tabs that are in the foreground. This should make it harder for background tabs to cause foreground tabs to be starved of processing resources.

After clearing a few final blockers, we enabled the Process Priority Manager by default last week. We also filed a bug to keep background tabs at a higher priority if they’re playing audio and video, and the fix for that just landed in Nightly today.

So if you’re on Windows on Nightly, and you’re curious about this, you can observe the behaviour by opening up the Windows Task Manager, switching to the “Details” tab, and watching the “Base priority” reading on your firefox.exe processes as you switch tabs.

Cheaper tabs in titlebar (Completed by Mike Conley)

After an epic round of review (thanks, Dao!), the patches to move our tabs-in-titlebar logic out of JS and into CSS landed late last year.

Along with simplifying our code, and hammering out at least one pretty nasty layout bug, this also had the benefit of reducing the number of synchronous reflows caused when opening new windows to zero.

This project is done!

Enable the separate Activity Stream content process by default (In-Progress by Mike Conley

There’s one known bug remaining that’s preventing us from letting the privileged content process from being enabled by default.

Thankfully, the cause is understood, and a fix is being worked on. Unfortunately, this is one of those bugs where the proper solution involves refactoring a bit of old crufty stuff, so it’s taking longer than I’d like.

Still, if all goes well, this bug should be closed out soon, and we can see about letting the privileged content process ride the trains.

Grab bag of notable performance work

This is an informal list of things that I’ve seen land in the tree lately that I believe will have a positive performance impact for our users. Have you seen something that you’d like to nominate for a future list? Submit the bug here!

Also, keep in mind that some of these landed months ago and already shipped to release. That’s what I get for taking so long to write a blog post.

Back in October, Gabriel Luong landed a set of patches that greatly improved the time to open DevTools panels. Web developers rejoice! Check out these public dashboards for more information on DevTools performance.
Sotaro Ikeda made it so that, with WebRender enabled, we do less frame invalidation with slow animations. This is great for energy usage!
Joel Maher made it so that Talos jobs can be retriggered from Treeherder so that they produce profiles! This means no re-pushing if you need to diagnose a Talos regression. That’s great news!
Gabriele Svelto made it so that the minidump analyzer, which runs after a crash has occurred to produce stack dumps for crash pings, runs with much lower process priority. This means that the CPU is being hogged less after a crash, when you’re likely to want to get right back to what you were doing.
Paolo Amadini got rid of a whole layer of DOM between the primary UI and the <browser> elements for each tab, resulting in some nice perf wins! A pleasant side-effect from the ongoing de-XBL work!
Mathieu Leplatre moved a bunch of client-side off of the main-thread to a Worker. This should help keep the main thread free and clear to process user events and keep the browser responsive.
Dan Holbert made layout calculations faster in certain situations, which resulted in some nice wins!
Jed Davis made it so that we preallocate the next content process off of the main thread. This is great for responsiveness, and helps set the stage for more off-main-thread process launchings.
Makoto Kato made spellchecking in web content asynchronous! This means we get to drop another sync IPC message, which is good for everybody.
Olli Pettay made it so that we don’t tick the refresh driver for a page as often when we’re still loading it. That should have a nice impact on page load performance!
Randall Jesup made it so that setTimeout’s run with very low priority during page load. This also should have a nice impact on page load performance!

For now, “low-end” means a machine with 2 or fewer cores, and a clock speed of 1.8Ghz or slower ↩

Photon Engineering Newsletter #14

Just like jaws did last week, I’m taking over for dolske this week to talk about stuff going on with Photon Engineering. So sit back, strap in, and absorb Photon Engineering Newsletter #14!

If you’ve got the release calendar at hand, you’ll note that Nightly 57 merges to Beta on September 20th. Given that there’s usually a soft-freeze before the merge, this means that there are less than 4 weeks remaining for Photon development. That’s right – in less than a month’s time, folks on the Beta channel who might not be following Nightly development are going to get their first Photon experience. That’ll be pretty exciting!

So with the clock winding down, the Photon team has started to shift more towards polish and bug-fixing. At this point, all of the major changes should have landed, and now we need to buff the code to a sparkling sheen.

The first thing you may have noticed is that, after a solid run of dogefox, the icon has shifted again:

We now return you to your regularly scheduled programming

The second big change are our new 60fps¹ loading throbbers in the tabs, coming straight to you from the Photon Animations team!

I think it’s fair to say that Photon Animations are giving Firefox a turbo boost!

Other recent changes

Menus and structure

A “Bookmarking Tools” subview has been added to the Library button so you can easily get to the bookmarks toolbar, sidebar, and bookmarks menu button
- How convenient!
You might have noticed that the downloads button will only appear when downloads exist and isn’t movable anymore. This is something we’re still tinkering with, so stay tuned.
We made the sync animation prettier! Check it out!

Animations

Did we mention the new tab loading throbber?

Preferences

All MVP work is completed! The team is now fixing polish bugs. Outstanding!

Visual redesign

The styles for the sidebar have been updated on WIndows!
- The bookmarks sidebar finally gets some attention!
New icons have landed for a number of Firefox functions. Can you find them all?
The team also landed a slew of polish and bug fixes. Here’s the full list!

Onboarding

The Firefox 57 tours have been enabled!
The PageAction UITour highlight style has been updated to be more consistent with the rest of Photon
Nifty!
A bunch of polish and bugfixes for the onboarding tour have landed!

Performance

The performance team quickly diagnosed a regression in the FX_NEW_WINDOW_MS Telemetry probe and the regressing patch has been backed out.

The screen capturing software I used here is only capturing at 30fps, so it’s really not doing it justice. This tweet might capture it better. ↩

A printing story and a PSA: outparams over the IPC layer might not behave like you’d expect

Here’s a story about a printing bug on OS X, and a lesson about how our IPC layer works.

Last week, :ehsan came up to my desk and said “Mike… printing is broken on OS X with e10s enabled. Did you know this?”

It was sad to hear. Nobody, as far as I knew, had been touching printing code (besides bobowen, but he hadn’t landed his patches yet), which meant that some mystery change had landed and it was my responsibility (as the one who had implemented printing on OS X for e10s) to fix it.

The first step was to confirm it. Yep, it looked like I couldn’t print on my Nightly. Crap¹. So we get the bug filed, and then I fired up my local build and lldb, and tried to trace out where the problem was occurring.

Drilling down to the regressing changeset

The problem was that my local build printed just fine. Eyebrow raised, I tried using my default profile with my local build to see if something about my profile was causing printing to break. Printing continued to work with my local build².

Scratching my head, I used mozregression to bisect the problem down to a single changeset.

Here’s the bug it spat out:

Bug 1209930 – Update Mac clang to match the version in use everywhere else

The hair stood up on the back of my neck. Printing got broken by a compiler upgrade? This had bad news written all over it.

Many try builds

So that, I guess, explained why I couldn’t reproduce the problem locally. I needed to build with a particular compiler in order to experience the bug.

The bad news was that I found out that the version of clang that we’d upgraded to didn’t work on my version of OS X (10.10.5). It was supposed to work on machines running OS X 10.7, but I didn’t have access to any³.

I was able to get debug builds off of try, but I had no luck convincing lldb to let me step through it, despite having the symbols and the right revision checked out⁴.

So this meant a lot of try builds. The good news is that I was able to add a bunch of logging to try, and just kinda walk away while it built. A few hours later, I’d have my build, and I’d get my results. I wasn’t blocked waiting on it to build, so I could work on other things.

What did the logging reveal?

Our printing code sometimes shows this progress dialog when printing starts. It’s put up so that while we’re laying out the page for printing, the user knows that stuff is happening. We show this progress dialog on Linux and Windows, but not on OS X (I guess OS X users don’t expect such a progress window… this was a decision made way back in the day).

The printing engine in the content process sends up a message to the parent saying, “I’m all set to print, let’s show the progress dialog!”. On OS X, the parent responds, “Uh, nope” (which is a return code of NS_ERROR_NOT_IMPLEMENTED), and the child is supposed to go, “Oh okay, I’ll just print right away”.

But here’s the kicker – along with message to show the progress dialog, the child sends an outparam⁵. That outparam is sent because for some platforms that show the progress dialog, we want to wait until the dialog closes before starting the actual print. On other platforms, we may just want to start printing right away when laying out the page finishes without waiting for the dialog to close.

That outparam is a bool initialized to a value of false in the print engine, and in the non-e10s case where we return “Sorry, no progress dialog” with NS_ERROR_NOT_IMPLEMENTED, that bool stays false, and when it stays false, it means that we start printing right away.

My logging showed that for some reason, that value was getting flipped to true despite not being touched by (seemingly) anything in the IPC sender / receiver nor the print progress dialog backend (which still just returns NS_ERROR_NOT_IMPLEMENTED when asked to open the printing dialog).

What the hell?

Here’s the PSA

Ehsan helped me dig into what was going on, and we figured it out. Here’s the big lesson / PSA here:

outparams over IPC, when untouched on the receiver side, get filled with values from uninitialized memory when returned to the sender.

I’ll give you an example. Here’s some sorta-pseudocode:

#include <stdio.h>

void someFunc(bool* aOutParam) {
}

int main(int argc, char** argv) {
  bool myBool = false;
  someFunc(&myBool);
  if (myBool) {
    printf("Wait, WHAT?");
    return 1;
  }
  return 0;
}

We should never enter that “Wait, WHAT?” conditional block because myBool remains false throughout. It’s initialized to false, and even though we pass a pointer to it to someFunc, someFunc never touches it, so it stays false, and so we skip the conditional and return 0 as expected.

If, however, you did the same thing, but over IPC… well, now you’re in for a fun treat.

On the receiver side of the IPC message, here’s a chunk of the generated code that calls into RecvShowPrintProgress on the parent side:

            bool notifyOnOpen;
            bool success;
            int32_t id__ = mId;
            if \⁶, (&(success)))))) {
                mozilla::ipc::ProtocolErrorBreakpoint("Handler for ShowProgress returned error code");
                return MsgProcessingError;
            }

            reply__ = new PPrinting::Reply_ShowProgress(id__);

            Write(notifyOnOpen, reply__);
            Write(success, reply__);
            (reply__)->set_sync();
            (reply__)->set_reply();

That notifyOnOpen bool is the one that is the outparam in the child. Notice that it’s not being initialized to any value! That means its current value could be, well, anything. When we call into RecvShowProgress, we pass a pointer to that anything value. If RecvShowProgress doesn’t do anything with that pointer (like in the OS X case), well… that random value is what gets written to the IPC message that gets sent back to the child.

And somehow a clang upgrade made it far more likely that this value would evaluate to TRUE as a boolean instead of FALSE.

And so the child assumed that it’d have to wait for a dialog to close before starting to print – a dialog that would never open, because we don’t show it on OS X.

So the patch I eventually wrote initializes the outparam to false before calling into the OS X widget code that just returns NS_ERROR_NOT_IMPLEMENTED, and that fixed the bug.

It’s a small patch, but in my mind, it’s a big lesson, at least for me. I had assumed that outparams would work the same over IPC as they do in the same process, and that’s simply not the case.

I’ve filed bug 1220168 to try to make this sort of thing easier to spot in the future.

Our automated testing story for printing is pretty bad. We test some of the built-in UI for printing, but as for actually sending stuff to printer drivers… zero automated tests, which explains why this had gone uncaught for so long. ↩
If you’re wondering, I wasn’t printing out reams of paper to test this. XCode comes with a Printer Simulator, which is like a PDF printer, except that I can get a better sense of the communications being sent to the printer ↩
Okay, there’s one in the office, but apparently trying to build on it is a slow nightmare ↩
Happy to hear and share how to do this if someone will show me ↩
Read this about passing multiple values back from a method call if you’re curious about what outparams are ↩
!(RecvShowProgress(mozilla::Move(browser), mozilla::Move(printProgressDialog), mozilla::Move(isForPrinting), (&(notifyOnOpen ↩

The Joy of Coding – Ep’s 23 – 29

Wow! I’ve been a way from this blog for too long. I also haven’t posted any new episodes for The Joy of Coding. I also haven’t been keeping up with my Things I’ve Learned posts.

Time to get back in the saddle. First thing’s first, here are 6 episodes of The Joy of Coding that have aired. Unfortunately, I haven’t put together summaries for any of them, but I’ve put their agendas near the videos so that might give some clues.

Here we go!

Episode 23

Agenda

Episode 24

Agenda

Episode 25

Agenda

Episode 26

Agenda

Episode 27

Agenda

Episode 28

Agenda

Episode 29

Agenda