Category Archives: Computer Science

A printing story and a PSA: outparams over the IPC layer might not behave like you’d expect

Here’s a story about a printing bug on OS X, and a lesson about how our IPC layer works.

Last week, :ehsan came up to my desk and said “Mike… printing is broken on OS X with e10s enabled. Did you know this?”

It was sad to hear. Nobody, as far as I knew, had been touching printing code (besides bobowen, but he hadn’t landed his patches yet), which meant that some mystery change had landed and it was my responsibility (as the one who had implemented printing on OS X for e10s) to fix it.

The first step was to confirm it. Yep, it looked like I couldn’t print on my Nightly. Crap1. So we get the bug filed, and then I fired up my local build and lldb, and tried to trace out where the problem was occurring.

Drilling down to the regressing changeset

The problem was that my local build printed just fine. Eyebrow raised, I tried using my default profile with my local build to see if something about my profile was causing printing to break. Printing continued to work with my local build2.

Scratching my head, I used mozregression to bisect the problem down to a single changeset.

Here’s the bug it spat out:

Bug 1209930 Update Mac clang to match the version in use everywhere else

The hair stood up on the back of my neck. Printing got broken by a compiler upgrade? This had bad news written all over it.

Many try builds

So that, I guess, explained why I couldn’t reproduce the problem locally. I needed to build with a particular compiler in order to experience the bug.

The bad news was that I found out that the version of clang that we’d upgraded to didn’t work on my version of OS X (10.10.5). It was supposed to work on machines running OS X 10.7, but I didn’t have access to any3.

I was able to get debug builds off of try, but I had no luck convincing lldb to let me step through it, despite having the symbols and the right revision checked out4.

So this meant a lot of try builds. The good news is that I was able to add a bunch of logging to try, and just kinda walk away while it built. A few hours later, I’d have my build, and I’d get my results. I wasn’t blocked waiting on it to build, so I could work on other things.

What did the logging reveal?

Our printing code sometimes shows this progress dialog when printing starts. It’s put up so that while we’re laying out the page for printing, the user knows that stuff is happening. We show this progress dialog on Linux and Windows, but not on OS X (I guess OS X users don’t expect such a progress window… this was a decision made way back in the day).

The printing engine in the content process sends up a message to the parent saying, “I’m all set to print, let’s show the progress dialog!”. On OS X, the parent responds, “Uh, nope” (which is a return code of NS_ERROR_NOT_IMPLEMENTED), and the child is supposed to go, “Oh okay, I’ll just print right away”.

But here’s the kicker – along with message to show the progress dialog, the child sends an outparam5. That outparam is sent because for some platforms that show the progress dialog, we want to wait until the dialog closes before starting the actual print. On other platforms, we may just want to start printing right away when laying out the page finishes without waiting for the dialog to close.

That outparam is a bool initialized to a value of false in the print engine, and in the non-e10s case where we return “Sorry, no progress dialog” with NS_ERROR_NOT_IMPLEMENTED, that bool stays false, and when it stays false, it means that we start printing right away.

My logging showed that for some reason, that value was getting flipped to true despite not being touched by (seemingly) anything in the IPC sender / receiver nor the print progress dialog backend (which still just returns NS_ERROR_NOT_IMPLEMENTED when asked to open the printing dialog).

What the hell?

Here’s the PSA

Ehsan helped me dig into what was going on, and we figured it out. Here’s the big lesson / PSA here:

outparams over IPC, when untouched on the receiver side, get filled with values from uninitialized memory when returned to the sender.

I’ll give you an example. Here’s some sorta-pseudocode:

#include <stdio.h>

void someFunc(bool* aOutParam) {

int main(int argc, char** argv) {
  bool myBool = false;
  if (myBool) {
    printf("Wait, WHAT?");
    return 1;
  return 0;

We should never enter that “Wait, WHAT?” conditional block because myBool remains false throughout. It’s initialized to false, and even though we pass a pointer to it to someFunc, someFunc never touches it, so it stays false, and so we skip the conditional and return 0 as expected.

If, however, you did the same thing, but over IPC… well, now you’re in for a fun treat.

On the receiver side of the IPC message, here’s a chunk of the generated code that calls into RecvShowPrintProgress on the parent side:

            bool notifyOnOpen;
            bool success;
            int32_t id__ = mId;
            if \((!(RecvShowProgress(mozilla::Move(browser), mozilla::Move(printProgressDialog), mozilla::Move(isForPrinting), (&(notifyOnOpen)), (&(success)))))) {
                mozilla::ipc::ProtocolErrorBreakpoint("Handler for ShowProgress returned error code");
                return MsgProcessingError;

            reply__ = new PPrinting::Reply_ShowProgress(id__);

            Write(notifyOnOpen, reply__);
            Write(success, reply__);

That notifyOnOpen bool is the one that is the outparam in the child. Notice that it’s not being initialized to any value! That means its current value could be, well, anything. When we call into RecvShowProgress, we pass a pointer to that anything value. If RecvShowProgress doesn’t do anything with that pointer (like in the OS X case), well… that random value is what gets written to the IPC message that gets sent back to the child.

And somehow a clang upgrade made it far more likely that this value would evaluate to TRUE as a boolean instead of FALSE.

And so the child assumed that it’d have to wait for a dialog to close before starting to print – a dialog that would never open, because we don’t show it on OS X.

So the patch I eventually wrote initializes the outparam to false before calling into the OS X widget code that just returns NS_ERROR_NOT_IMPLEMENTED, and that fixed the bug.

It’s a small patch, but in my mind, it’s a big lesson, at least for me. I had assumed that outparams would work the same over IPC as they do in the same process, and that’s simply not the case.

I’ve filed bug 1220168 to try to make this sort of thing easier to spot in the future.

2 people like this post.

  1. Our automated testing story for printing is pretty bad. We test some of the built-in UI for printing, but as for actually sending stuff to printer drivers… zero automated tests, which explains why this had gone uncaught for so long. 

  2. If you’re wondering, I wasn’t printing out reams of paper to test this. XCode comes with a Printer Simulator, which is like a PDF printer, except that I can get a better sense of the communications being sent to the printer 

  3. Okay, there’s one in the office, but apparently trying to build on it is a slow nightmare 

  4. Happy to hear and share how to do this if someone will show me 

  5. Read this about passing multiple values back from a method call if you’re curious about what outparams are 

The Joy of Coding (Ep. 20): Reviewin’ and Mystery Solvin’

After a two week hiatus, we’re back with Episode 20!

In this episode, I start off by demonstrating my new green screen1, and then dive right into reviewing some code to make the Lightweight Theme web installer work with e10s.

After that, I start investigating a mystery that my intern ran into a few days back, where for some reason, preloaded about:newtab pages were behaving really strangely when they were loaded in the content process. Strangely, as in, the pages wouldn’t do simple things, like reload when the user pressed the Reload button.

Something strange was afoot.

Do we solve the mystery? Do we figure out what’s going on? Do we find a solution? Tune in and find out!

Episode agenda.


Bug 653065 – Make the lightweight theme web installer ready for e10s
Bug 1181601 – Unable to receive messages from preloaded, remote newtab pageNotes
@mrrrgn hacks together a WebSocket server implementation in Go. To techno!

2 people like this post.

  1. Although throughout the video, the lag between the audio and the video gets worse and worse – sorry about that. I’ll see what I can do to fix that for next time. 

Things I’ve Learned This Week (May 25 – May 29, 2015)

MozReview will now create individual attachments for child commits

Up until recently, anytime you pushed a patch series to MozReview, a single attachment would be created on the bug associated with the push.

That single attachment would link to the “parent” or “root” review request, which contains the folded diff of all commits.

We noticed a lot of MozReview users were (rightfully) confused about this mapping from Bugzilla to MozReview. It was not at all obvious that Ship It on the parent review request would cause the attachment on Bugzilla to be r+’d. Consequently, reviewers used a number of workarounds, including, but not limited to:

  1. Manually setting the r+ or r- flags in Bugzilla for the MozReview attachments
  2. Marking Ship It on the child review requests, and letting the reviewee take care of setting the reviewer flags in the commit message
  3. Just writing “r+” in a MozReview comment

Anyhow, this model wasn’t great, and caused a lot of confusion.

So it’s changed! Now, when you push to MozReview, there’s one attachment created for every commit in the push. That means that when different reviewers are set for different commits, that’s reflected in the Bugzilla attachments, and when those reviewers mark “Ship It” on a child commit, that’s also reflected in an r+ on the associated Bugzilla attachment!

I think this makes quite a bit more sense. Hopefully you do too!

See gps’s blog post for the nitty gritty details, and some other cool MozReview announcements!

1 person likes this post.

Electrolysis and the Big Tab Spinner of Doom

Have you been using Firefox Nightly and seen this big annoying spinner?

Big Tab Spinner of Doom in an e10s tab

Aw, crap. You again.

I hate that thing. I hate it.

Me, internally, when I see the spinner.

And while we’re working on making the spinner itself less ugly, I’d like to eliminate, or at least reduce its presence to the absolute minimum.

How do I do that? Well, first, know your enemy.

What does it even mean?

That big spinner means that the graphics part of Gecko hasn’t given us a frame yet to paint for this browser tab. That means we have nothing yet to show for the tab you’ve selected.

In the single-process Firefox that we ship today, this graphics operation of preparing a frame is something that Firefox will block on, so the tab will just not switch until the frame is ready. In fact, I’m pretty sure the whole browser will become unresponsive until the frame is ready.

With Electrolysis / multi-process Firefox, things are a bit different. The main browser process tells the content process, “Hey, I want to show the content associated with the tab that the user just selected”, and the content process computes what should be shown, and when the frame is ready, the parent process hears about it and the switch is complete. During that waiting time, the rest of the browser is still responsive – we do not block on it.

So there’s this window of time where the tab switch has been requested, and when the frame is ready.

During that window of time, we keep showing the currently selected tab. If, however, 300ms passes, and we still haven’t gotten a frame to paint, that’s when we show the big spinner.

So that’s what the big spinner means – we waited 300ms, and we still have no frame to draw to the screen.

How bad is it?

I suspect it varies. I see the spinner a lot less on my Windows machine than on my MacBook, so I suspect that performance is somehow worse on OS X than on Windows. But that’s purely subjective. We’ve recently landed some Telemetry probes to try to get a better sense of how often the spinner is showing up, and how laggy our tab switching really is. Hopefully we’ll get some useful data out of that, and as we work to improve tab switch times, we’ll see improvement in our Telemetry numbers as well.

Where is the badness coming from?

This is still unclear. And I don’t think it’s a single thing – many things might be causing this problem. Anything that blocks up the main thread of the content process, like slow JavaScript running on a web-site, can cause the spinner.

I also seem to see the spinner when I have “many” tabs open (~30), and have a build going on in the background (so my machine is under heavy load).

Maybe we’re just doing things inefficiently in the multi-process case. I recently landed profile markers for the Gecko Profiler for async tab switching, to help figure out what’s going on when I experience slow tab switch. Maybe there are optimizations we can make there.

One thing I’ve noticed is that there’s this function in the graphics layer, “ClientTiledLayerBuffer::ValidateTile”, that takes much, much longer in the content process than in the single-process case. I’ve filed a bug on that, and I’ll ask folks from the Graphics Team this week.

How you can help

UPDATE (June 1, 2015): Getting profiles from Windows is currently broken because the symbol server appears to be busted. Any profiles from Windows machines will be useless until this bug is fixed. Alternatively, set profiler.symbolicationUrl to in about:config.

If you’d like to help me find more potential causes, Profiles are very useful! NOTE – I don’t mean “user profiles”, as in, your bookmarks / customizations / history, etc, in the profile folder. I don’t mean this thing. I mean a performance profile.

A performance profile is a read-out of everything that Firefox / Gecko is doing over a particular span of time. When the profiler is running, Firefox / Gecko will record where the process is in the stack every 1ms or so. It’ll also record information about how long since it’s serviced the event loop, which helps us find jank.

To help, grab the Gecko Profiler add-on, make sure it’s enabled, and then dump a profile when you see the big spinner of doom. The interesting part will be between two markers, “AsyncTabSwitch:Start” and “AsyncTabSwitch:Finish”. There are also markers for when the parent process displays the spinner – “AsyncTabSwitch:SpinnerShown” and “AsyncTabSwitch:SpinnerHidden”. The interesting stuff, I believe, will be in the “Content” section of the profile between those markers. Here are more comprehensive instructions on using the Gecko Profiler add-on.

And here’s a video of me demonstrating how to use the profiler, and how to attach a profile to the bug where we’re working on improving tab switch times:

And here’s the link I refer you to in the video for getting the add-on.

So hopefully we’ll get some useful data, and we can drive instances of this spinner into the ground.

I’d really like that.

11 people like this post.

The Joy of Coding (Ep. 12): Making “Save Page As” Work

After giving some updates on the last bug we were working on together, I started a new bug: Bug 1128050 – [e10s] Save page as… doesn’t always load from cache. The problem here is that if the user were to reach a page via a POST request, attempting to save that page from the Save Page item in the menu would result in silent failure1.

Luckily, the last bug we were working on was related to this – we had a lot of context about cache keys swapped in already.

The other important thing to realize is that fixing this bug is a bandage fix, or a wallpaper fix. I don’t think those are official terms, but it’s what I use. Basically, we’re fixing a thing with the minimum required effort because something else is going to fix it properly down the line. So we just need to do what we can to get the feature to limp along until such time as the proper fix lands.

My proposed solution was to serialize an nsISHEntry on the content process side, deserialize it on the parent side, and pass it off to nsIWebBrowserPersist.

So did it work? Watch the episode and find out!

I also want to briefly apologize for some construction noise during the video – I think it occurs somewhere halfway through minute 20 of the video. It doesn’t last long, I promise!

Episode Agenda


Bug 1128050 – [e10s] Save page as… doesn’t always load from cache – Notes

1 person likes this post.

  1. Well, it’d show something in the Browser Console, but for a typical user, I think that’s still a silent failure.