Australis Performance Post-mortem Summary

Over the last few months, I’ve been talking about all of the work we put into making Australis feel fast when it shipped in Firefox 29.

I talked about where we started with our performance work, and how we grappled with the ts_paint and tpaint performance (“talos”) tests. After that, I talked a bit about the excellent tools we have (and ones we developed ourselves) to make finding our performance bottlenecks easier.

After a brief delay, I rounded out the series by talking about our tab animation performance work, and the customization transition performance work.

I think over the course of working on these things, I’ve learned quite a bit about performance work in general. If I had to distill it down to a few tidbits, it’d be:

Measure first to get a baseline, then try to improve. (Alternatively, “you can’t improve what you can’t measure”)
Finding the solutions to performance problems is usually the easy part. The hard part is finding and isolating the problems to begin with.
While performance work can be a bit of a grind, users do feel and appreciate the efforts. It’s totally worth it.

So that’s it on the series. Enjoy your zippy Firefox!

Australis Performance Post-mortem Part 5: The Customize Mode Transition

Another new thing that came out with Firefox 29 is a sexy new customization interface. We wanted to make UI customization something that anybody would feel comfortable doing, instead of something only a few mighty power users might do.

The new customization mode is accessible by pressing the Menu Button (☰), and then clicking Customize.

Give it a shot now if you’ve never done it.

BAM, did you see that? What you just saw was a full-on mode switch in the browser chrome to indicate to you that you’ve put the browser into a different state – specifically, a state where much of the browser UI is malleable. You can drag and drop things in your toolbars and the menu, enable toolbars, etc.

Going in and out of customization mode is not something that most people will do frequently, but we still invested a bunch of time trying to make it smooth. I still think there is more we can do on that front, and I’ll get into that at the end of this post.

Anyhow, as with any performance-related project, we started with a way of measuring. In this case, our trusty performance team re-purposed the TART test that we used for tab animation, and pointed it at the customize mode transition. We called this new test CART (customization animation regression test, natch).

This is slightly different from the tab animation stuff, because we didn’t really have a baseline measurement to compare against – there was no old customization mode transition to try to match or beat. So we just had to do our best to do all of our processing under 16ms in order to draw the transition at 60fps.
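For the curious, the 16ms figure falls straight out of the frame rate. Here's a minimal sketch of that arithmetic; the function names and the sample durations are made up for illustration, and none of this is code from CART itself.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


def frames_over_budget(frame_times_ms, fps=60.0):
    """Return the frame durations that exceed the per-frame budget."""
    budget = frame_budget_ms(fps)
    return [t for t in frame_times_ms if t > budget]


# Hypothetical per-frame durations, in milliseconds.
sample = [12.0, 15.5, 16.0, 21.3, 33.4]

print(round(frame_budget_ms(60), 2))
print(frames_over_budget(sample))
```

At 60fps the budget works out to a bit under 16.7ms, so "under 16ms" leaves a small amount of headroom.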

That means that instead of comparing two sets of data over time, we’ll only be looking at a single series of points, and how they change over time.

So let’s see where we started, and where we finished, and how we got there.

CART results – pre-optimizations

So in an effort to get this blog post done, I’ve cut some corners and combined the data from a number of different platforms into a single graph. I apologize if this makes it difficult to interpret for some of you – but I’ll do my best to explain what the graph means.

I chose a representative sample of the platforms that we measured – we’ve got OS X (10.6, 10.8), Ubuntu 12.04 (32-bit), Windows XP, Windows 7, and Windows 8.

The X-axis is obviously plotting the time (we started gathering CART data right at the end of January 2014). The Y-axis plots the “final CART value”. CART, like TART, measures a number of things, and we “boil those numbers down” into a single value that we can plot. Each of the subtests is measured in milliseconds, so I guess you could say the Y-axis is also in milliseconds – but as it’s an aggregate, that’s not too meaningful. It’s not the greatest for detecting small shifts in the subtests (we use Datazilla for that), but it allows me to illustrate the changes over time in a pretty simple way.
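To show why a single aggregate hides small subtest shifts, here's a rough sketch. I'm assuming a plain arithmetic mean purely for illustration (this post doesn't cover how the real number is computed), and the subtest names and values are hypothetical.

```python
from statistics import mean


def summarize(subtests):
    """Collapse per-subtest timings (ms) into one plottable value.

    Assumption: a plain mean, used here only as a stand-in.
    """
    return mean(list(subtests.values()))


def biggest_mover(before, after):
    """Name the subtest whose value changed the most between two runs."""
    return max(before, key=lambda name: abs(after[name] - before[name]))


# Hypothetical subtest results, in milliseconds.
before = {"enter": 40.0, "exit": 50.0, "icons": 60.0}
after = {"enter": 40.0, "exit": 50.0, "icons": 57.0}

print(summarize(before), summarize(after))
print(biggest_mover(before, after))
```

A 3ms shift in one subtest only moves the summary by 1ms here, which is why you want a per-subtest view for the small stuff.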

So it looks like my sampled platforms are all floating around the 40’s and 50’s. Is that good or bad? Well, it’s neither really. It’s just our starting point, and we wanted to try to improve on that.

And so we set out trying to move those dots downwards (for this data, without going into grand detail, lower milliseconds = faster and smoother animation).

As with TART, we did a lot of profiling, and used a lot of the tools I mentioned in this post. Here are the bugs that seemed to move the needle the most.

The Rogue’s Gallery

As before, this is not a complete list, but captures some of the more interesting bugs we worked on.

Bug 932963 – Break customize mode transition into several phases

This change allowed us to break the transition in and out of customization mode into some phases:

1. Not in customization mode (default)
2. Entering customization mode
3. Entered customization mode
4. Exiting customization mode

Phases 2, 3 and 4 have attributes on the main browser window that we can target with CSS selectors. We were then able to “lighten” the CSS during phases 2 and 4, in order to optimize the frames during the transition. For example, we don’t display the semi-transparent grid texture on the main window unless we’re in phase 3.
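If it helps to picture the phases, here's a toy model of them as a little state machine. The attribute names are placeholders I picked for the sketch (they may not match what's on the browser window), and the real thing is driven by the front-end code and CSS, not Python.

```python
from enum import Enum


class Phase(Enum):
    NOT_CUSTOMIZING = 1
    ENTERING = 2
    ENTERED = 3
    EXITING = 4


# The phases only ever advance in this order, then wrap around.
NEXT_PHASE = {
    Phase.NOT_CUSTOMIZING: Phase.ENTERING,
    Phase.ENTERING: Phase.ENTERED,
    Phase.ENTERED: Phase.EXITING,
    Phase.EXITING: Phase.NOT_CUSTOMIZING,
}

# Hypothetical attribute names that CSS selectors could target.
PHASE_ATTRIBUTE = {
    Phase.NOT_CUSTOMIZING: None,
    Phase.ENTERING: "customize-entering",
    Phase.ENTERED: "customize-entered",
    Phase.EXITING: "customize-exiting",
}


def advance(phase):
    """Move to the phase that follows the given one."""
    return NEXT_PHASE[phase]


def shows_grid_texture(phase):
    """The expensive grid texture is only drawn once we have fully entered."""
    return phase is Phase.ENTERED


phase = Phase.NOT_CUSTOMIZING
for _ in range(4):
    phase = advance(phase)
    print(phase.name, PHASE_ATTRIBUTE[phase], shows_grid_texture(phase))
```

The point is that phases 2 and 4 are distinct states, so the heavy styling can be held back until the animation is over.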

Bug 972485 – Find out why we’re doing a bunch of synchronous file reading at the start of the customize mode transition

This was a rather surprising one – using the Gecko Profiler, it looked like we were doing sync file IO as the blank about:customizing document was loading.

For context, we have a page registered at about:customizing that’s a blank document. When we detect that about:customizing has been loaded or switched to, we enter customize mode.

So this sync file IO was causing some jank at the start of the customization mode transition.

Strangely, it turned out that the XHTML file we were loading for the blank about:customizing document was synchronously loading a bunch of MathML localization stuff as it loaded.

We switched the document from XHTML to XUL, and the sync IO load went away. We filed a follow-up bug to investigate why exactly we were doing sync file IO on loading XHTML files, because that’s a bad thing to do.

Anyhow, this was a small but significant win for almost all platforms.

Bug 975552 – Preload about:customizing like we do with about:newtab

I think this was the biggest win we achieved during the performance work. That browser that we load the blank about:customizing page in is not free – a bunch of stuff gets instantiated in order for a working browser to be created, and all of that expense is wasted on a blank document. That expense also causes enough main thread thinking that it reduces the smoothness of the customization entering transition.

So the solution was a hack that takes after the same strategy we use for about:newtab. Essentially, we preload the about:customizing browser and document in the background at what seems like a logical time (right after the user has opened the menu). This allows us to front-load all of the expense in creating the browser, and we end up with something much smoother.
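The general shape of that trick looks something like the sketch below. Everything in it (the class names, the method names, the stand-in Browser) is hypothetical and only models the idea of paying the setup cost early; it is not the code that landed.

```python
class Browser:
    """Stand-in for an expensive-to-create browser element."""

    def __init__(self, url):
        self.url = url


class CustomizePreloader:
    """Builds the about:customizing browser before it is needed."""

    URL = "about:customizing"

    def __init__(self, factory=Browser):
        self._factory = factory
        self._preloaded = None

    def on_menu_opened(self):
        """Pay the creation cost early, at most once."""
        if self._preloaded is None:
            self._preloaded = self._factory(self.URL)

    def take_browser(self):
        """Hand over the preloaded browser, or build one on demand."""
        browser = self._preloaded
        self._preloaded = None
        if browser is None:
            browser = self._factory(self.URL)
        return browser


preloader = CustomizePreloader()
preloader.on_menu_opened()          # user opens the menu
browser = preloader.take_browser()  # user clicks Customize
print(browser.url)
```

If the user never opens the menu first, take_browser still works; it just pays the cost at the worst possible moment, which is exactly what we were trying to avoid.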

Here’s a before video and an after video from my Windows 7 machine, for reference.

This was a pretty bodacious hack, and in the future, I’d like to remove it completely, and find some way of sidestepping all of the browser internal loading for the about:customizing document (or, even better, find a way of not using a browser element at all!). That’s filed here.

Bug 977796 – Disable subpixel anti-aliasing during customize mode transition.

A profile on Windows 8 showed that we were spending an inordinate amount of time rendering text during the customize mode transition, and this has to do with subpixel anti-aliasing in the menu that animates in.

So I landed a patch that temporarily disables subpixel anti-aliasing on that element during the customize mode transition, and that bought us a huge win for Windows 8 (about 20%). Not much of a win for any of the other platforms.

So where did we end up?

CART results – post-optimizations

Those numbers are indeed lower!

Now, before you lose your mind about that epic cross-platform win around February 20th, I investigated that changeset range and didn’t find anything we worked on directly in there.

There are some platform patches in there (specifically in Graphics) that might account for this win, but I’m pretty certain it’s because of bug 974621, which updated our test runners to use a different version of Talos that included the patch in bug 967186. 967186 altered the CART test to be more accurate in its measurements, so actually a lot of our initial data was erroneous. That blows a bit because it adds some unnecessary noise to our graph, but it’s also good because accuracy is a thing we definitely want.

Future work

I still think there are more wins to be made in the customize mode transition. Finding a way of getting rid of the about:customizing preloading hack and replacing it with something smarter (like a thin browser that doesn’t need to instantiate much of its backend, or no browser at all) is probably the first step. More profiling might find more wins. I’m learning more and more about platform as I work on Electrolysis, so maybe I’ll come back to this problem with some new skills and information, and I can get it performing reliably on all platforms as it really should: 60fps, smooth as silk.

Australis Performance Post-mortem Part 4: On Tab Animation

Whoops, forgot that I had a blog post series going on here where I talk about the stuff we did to make Australis blazing fast. In that time, we’ve shipped the thing (Firefox 29 represent!), so folks are actually feeling the results of this performance work, which is pretty excellent.

I ended the last post on an ominous note – something about how we were clear to land Australis on Nightly, or so we thought.

This next bit is all about timing.

There is a Performance Team at Mozilla, who are charged with making our products crazy-fast and crazy-smooth. These folks are geniuses at wringing out every last millisecond possible from a computation. It’s what they do all day long.

A pretty basic principle that I’ve learned over the years is that if you want to change something, you first have to develop a system for measuring the thing you’re trying to change. That way, you can determine if what you’re doing is actually changing things in the way you want them to.

If you’ve been reading these Australis Performance Post-mortem posts, you’ll know that we have some performance tests like this, and they’re called Talos tests.

Just about as we were finishing off the last of the tpaint and ts_paint regressions, the front-end team was suddenly made aware of a new Talos test that was being developed by the Performance team. This test was called TART, and stands for Tab Animation Regression Test. The purpose of this test is to exercise various tab animation scenarios and measure the time it takes to paint each frame and to proceed through the entire animation of a tab open and a tab close.
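To give a feel for what "time per frame" and "time for the whole animation" mean as numbers, here's a small sketch that derives both from a list of frame timestamps. The timestamps are invented and the function names are mine; TART's actual bookkeeping is more involved.

```python
def frame_intervals(timestamps_ms):
    """Return the gaps between consecutive frame timestamps."""
    return [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]


def summarize_animation(timestamps_ms):
    """Summarize one animation run from its frame timestamps (ms)."""
    intervals = frame_intervals(timestamps_ms)
    return {
        "total_ms": timestamps_ms[-1] - timestamps_ms[0],
        "average_interval_ms": sum(intervals) / len(intervals),
        "worst_interval_ms": max(intervals),
    }


# Hypothetical timestamps for one tab-open animation; one frame was dropped.
stamps = [0.0, 16.0, 32.0, 64.0, 80.0]
print(summarize_animation(stamps))
```

The average alone can look fine while one long gap (the 32ms one here) is what the user actually perceives as a stutter.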

The good news was that this new test was almost ready for running on our Nightly builds!

The bad news was that the UX branch, which Australis was still on at the time, was regressing this test. And since we cannot land if we regress performance like this… it meant we couldn’t land.

Bad news indeed.

Or was it? At the time, the lot of us front-end engineers were groaning because we’d just slogged through a ton of other performance regressions. Investigating and fixing performance regressions is exhausting work, and we weren’t too jazzed that another regression had just shown up.

But thinking back, I’m somewhat glad this happened. The test showed that Australis was regressing tab animation performance, and tabs are opened every day by almost every Firefox user. Regressing tab performance is simply not a thing one does lightly. And this test caught us before we landed something that regressed those tabs mightily!

That was a good thing. We wouldn’t have known otherwise until people started complaining that their tabs were feeling sluggish when we released it (since most of us run pretty beefy development machines).

And so began the long process of investigating and fixing the TART regressions.

So how bad were things?

Let’s take a look at the UX branch in comparison with mozilla-central at the time that we heard about the TART regressions.

Here are the regressing platforms. I’ll start with Windows XP:

Where we started with TART on Windows XP

Forgive me, I couldn’t get the Graph Server to swap the colours of these two datasets, so my original silent pattern of “red is the regressor” has to be dropped. I could probably spend some time trying to swap the colours through various tricky methods, but I honestly don’t think anybody reading this will care too much.

So here we can see the TART scores for Windows XP, and the UX branch (green) is floating steadily over mozilla-central (red). Higher scores are bad. So here’s the regression.

Now let’s see OS X 10.6.

Where we started with TART on OS X 10.6

Same problem here – the UX nodes (tan) are clearly riding higher than mozilla-central. This was pretty similar to OS X 10.7 too, so I didn’t include the graph.

On OS X 10.8, things were a little bit better, but not too much:

Where we started with TART on OS X 10.8

Here, the regression was still easily visible, but not as large in magnitude.

Ubuntu was in the same boat as OS X 10.6/10.7 and Windows XP:

Where we started with TART on Ubuntu

But what about Windows 7 and Windows 8? Well, interesting story – believe it or not, on those platforms, UX seemed to perform better than mozilla-central:

Windows 7 (the blue nodes are mozilla-central, the tan nodes are UX)

Where we started with TART on Windows 7

Windows 8 (the green nodes are mozilla-central, the red nodes are UX)

Where we started with TART on Windows 8

So what the hell was going on?

Well, we eventually figured it out. I’ll lay it out in the next few paragraphs. The following is my “rogue’s gallery” of regressions. This list does not include many false starts and red herrings that we followed during the months working on these regressions. Think of this as “getting to the good parts”.

Backfilling

The problem with having a new test, and having mozilla-central better than UX, is not knowing where UX got worse; there were no historical measurements that we could look at to see where the regression got introduced.

MattN, smart guy that he is, got us a few talos loaner machines, and wrote some scripts to download the Nightlies for both mozilla-central and UX going back to the point where UX split off. Then, he was able to run TART on these builds, and supply the results to his own custom graph server.

So basically, we were able to backfill our missing TART data, and that helped us find a few points of regression.
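Once you have scores for both branches on the same dates, spotting the points of regression is conceptually simple. The sketch below is not MattN's tooling, just an illustration: the dates, scores, and the 5% threshold are all made up, and higher scores are worse.

```python
def regression_points(central, ux, threshold=1.05):
    """Return the dates where UX first becomes clearly worse than central.

    Both arguments map an ISO build date to a score (higher is worse).
    A date counts when UX exceeds central by more than the threshold
    ratio and the previous shared date did not.
    """
    points = []
    was_regressed = False
    for date in sorted(set(central) & set(ux)):
        regressed = ux[date] > central[date] * threshold
        if regressed and not was_regressed:
            points.append(date)
        was_regressed = regressed
    return points


# Hypothetical backfilled scores for four consecutive Nightlies.
central = {"2013-06-01": 10.0, "2013-06-02": 10.1,
           "2013-06-03": 10.0, "2013-06-04": 10.0}
ux = {"2013-06-01": 10.1, "2013-06-02": 10.2,
      "2013-06-03": 11.5, "2013-06-04": 11.6}

print(regression_points(central, ux))
```

Each date that comes back narrows the search down to the changesets that landed on UX between that Nightly and the one before it.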

With that data, now it was time to focus in on each platform, and figure out what we could do with it.

Windows XP

We started with XP, since on the regressing platforms, that’s where most of our users are.

Here’s what we found and fixed:

Bug 916946 – Stop animating the back-button when enabling or disabling it.

During some of the TART tests, we start with a single tab, open a new tab, and then close the new tab, and repeat. That first tab has some history, so the back button is enabled. The new tab has no history, so the back button is disabled.

Apparently, we had some CSS that was causing us to animate the back-button when we were flipping back and forth from the enabled / disabled states. That CSS got introduced in the patch that bound the back/forward/stop/reload buttons to the URL bar. It seemed to affect Windows only. Fixing that CSS gave us our first big win on Windows XP, and gave us more of a lead with Windows 7 and Windows 8!

Bug 907544 – Pass the D3DSurface9 down into Cairo so that it can release the DC and LockRect to get at the bits

I don’t really remember how this one went down (and I don’t want to really spend the time swapping it back in by reading the bug), but from my notes it looks like the Graphics team identified this possible performance bottleneck when I showed them some profiles I gathered when running TART.

The good news on this one was that it definitely gave the UX branch a win on Windows XP. The bad news is that it gave the same win to mozilla-central. This meant that while overall performance got better on Windows XP, we still had the same regression preventing us from landing.

Bug 919541 – Consider not animating the opacity for Australis tabs

Jeff Muizelaar helped me figure this out while we were using paint flashing to analyze paint activity while opening and closing tabs. When we slowed down the transitions, we noticed that the closing tabs were causing paints even though they weren’t visible. Closing tabs aren’t visible because with Australis, we don’t show the tab shape around tabs when they’re not selected – and closing a tab automatically unselects it for something else.

For some reason, our layout and graphics code still wanted to paint this transition even though the element was not visible. We quickly nipped that in the bud, and got ourselves a nice win on tab close measurements for all platforms!

Bug 921038 – Move selected tab curve clip-paths into SVG-as-an-image so it is cached.

This was the final nail in the coffin for the TART regression on Windows XP. Before this bug, we were drawing the linear-gradients in the tab shape using CSS, and the clipping for the curve background colour was being pulled off using clip-path and an SVG curve defined in the browser.xul document.

In this bug, we moved from clipping a background to create the curve, to simply drawing a filled curve using SVG, and putting the linear-gradient for the texture in the “stroke” image (the image that overlays the border on the tab curve).

That by itself was not enough to win back the regression – but thankfully, Seth Fowler had been working on SVG caching, and with that cache backend, our patch here knocked the XP and Ubuntu regressions out! It also took out a chunk from OS X. Things were looking good!

OS X

Bug 924415 – Find out why setting chromemargin to 0,-1,-1,-1 is so expensive for TART on UX branch on OS X.

I don’t think I’ll ever forget this bug.

I had gotten my hands on a Mac Mini that (after some hardware modifications) matched the specs of our 10.6 Talos test machines. That would prove to be super useful, as I was easily able to reproduce the regression on that machine, and we could debug and investigate locally, without having to remote in to some loaner device.

With this machine, it didn’t take us long to identify the drawing of the tabs in the titlebar as the main culprit in the OS X regression. But the “why” eluded us for weeks.

It was clear I wasn’t going to be able to solve it on my own, so Jeff Muizelaar from the Graphics team joined in to help me.

We looked at OpenGL profiles, we looked at apitraces, we looked at profiles using the Gecko Profiler, and we looked at profiles from Instruments – the profiler that comes included with Xcode.

It seemed like the performance bottleneck was coming from the operating system, but we needed to prove it.

Jeff and I dug and dug. I remember going home one day, feeling pretty deflated by another day of getting nowhere with this bug, when as I was walking into my apartment, I got a phone call.

It was Jeff. He told me he’d found something rather interesting – when the titlebar of the browser overlapped the titlebar of another window, he was able to reproduce the regression. When it did not overlap, the regression went away.

Talos tests open a small window before they open any test browser windows. That little test window stays in the background, and is (from my understanding) a dispatch point for making talos tests occur. That little window has a titlebar, and when we opened new browser windows, the titlebars would overlap.

Jeff suggested I try modifying the TART test to move the browser approximately 22px (the height of a standard OS X titlebar) so that they would no longer overlap. I set that up, triggered a bunch of test runs, and went to bed.

I wasn’t able to sleep. Around 4AM, I got out of bed to look at the results – SUCCESS! The regression had gone away! Jeff was right!

I slept like a baby the rest of the night.

We closed this bug as a WONTFIX due to it being way outside our control.

Comment 31 and onward in that bug are the ones that describe our findings.
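The overlap condition Jeff found is just rectangle intersection on the two titlebars. Here's a minimal sketch of the check and of the 22px nudge; the window sizes and positions are invented, and the Rect type and helper names are mine, not anything from Talos or TART.

```python
from dataclasses import dataclass

TITLEBAR_HEIGHT_PX = 22  # approximate height of a standard OS X titlebar


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    width: int
    height: int


def overlaps(a, b):
    """True when the two rectangles share any area."""
    return (a.x < b.x + b.width and b.x < a.x + a.width and
            a.y < b.y + b.height and b.y < a.y + a.height)


def titlebar(window):
    """The strip along the top of a window."""
    return Rect(window.x, window.y, window.width, TITLEBAR_HEIGHT_PX)


def moved_down(window, dy):
    """Return a copy of the window shifted down by dy pixels."""
    return Rect(window.x, window.y + dy, window.width, window.height)


# Hypothetical geometry: a small harness window and a browser window.
harness = Rect(0, 0, 300, 200)
browser = Rect(0, 0, 1024, 768)

print(overlaps(titlebar(harness), titlebar(browser)))
nudged = moved_down(browser, TITLEBAR_HEIGHT_PX)
print(overlaps(titlebar(harness), titlebar(nudged)))
```

With both windows at the same origin the titlebars overlap; shift the browser down by one titlebar height and they only touch along an edge, which doesn't count as overlapping.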

Eat it TART, your tears are delicious

Those were the big regressions we fixed for TART. It was a long haul, but we got there – and in the end, it means faster and smoother tab animation for our users, which means a better experience – and it’s totally worth it.

I’m particularly proud of the work we did here, and I’m also really happy with the cross-team support and collaboration we had – from Performance, to Layout, to Graphics, to Front-end – it was textbook teamwork.

Here’s an e-mail I wrote about us beating TART.

Where did we end up?

After the TART regression was fixed, we were set to land on mozilla-central! We didn’t just land a more beautiful browser, we also landed a more performant one.

Noice.

Stay tuned for Part 5 where I talk about CART.

Electrolysis Code Spelunking: How links open new windows in Firefox

Hey. I’ve started hacking on Electrolysis bugs. I’m normally a front-end engineer working on Firefox desktop, but I’ve been temporarily loaned out to help get Electrolysis ready to be enabled by default on Nightly.

I’m working on bug 989501. Basically, when you click on a link that targets “_blank” or uses window.open, we open a new tab instead. That’s no good – assuming the user’s profile is set to allow it, we should open the link in a new window.

In order to fix this, I need a clearer picture on what happens in the Firefox platform when we click on one of these links.

This isn’t really a tutorial – I’m not going to go out of my way to explain much here. Think of this more as a public posting of my notes during my exploration.

So, here goes.

(Note that the code in this post was current as of revision 400a31da59a9 of mozilla-central, so if you’re reading this in the future, it’s possible that some stuff has greatly changed).

I know for a fact that once the link is clicked, we eventually call mozilla::dom::TabChild::ProvideWindow. I know this because of conversations I’ve had with smaug, billm and jdm in and out of Bugzilla, IRC, and meatspace.

Because I know this, I can hook up gdb to see how I get to that call. I have some notes here on how to hook up gdb to the content process of an e10s window.

Once that’s hooked up, I set a breakpoint on mozilla::dom::TabChild::ProvideWindow, and click on a link somewhere with target="_blank".

I hit my breakpoint, and I get a backtrace. Ready for it? Here we go:

#0  mozilla::dom::TabChild::ProvideWindow (this=0x109afb400, aParent=0x10b098820, aChromeFlags=4094, aCalledFromJS=false, aPositionSpecified=false, aSizeSpecified=false, aURI=0xffe, aName=@0x0, aFeatures=@0x0, aWindowIsNew=0x10b098820, aReturn=0x7fff5fbfb648) at TabChild.cpp:1201
#1  0x00000001018682e4 in nsWindowWatcher::OpenWindowInternal (this=0x10b05b540, aParent=0x10b098820, aUrl=<value temporarily unavailable, due to optimizations>, aName=<value temporarily unavailable, due to optimizations>, aFeatures=<value temporarily unavailable, due to optimizations>, aCalledFromJS=false, aDialog=<value temporarily unavailable, due to optimizations>, aNavigate=<value temporarily unavailable, due to optimizations>, _retval=<value temporarily unavailable, due to optimizations>) at nsWindowWatcher.cpp:601
#2  0x0000000101869544 in non-virtual thunk to nsWindowWatcher::OpenWindow2(nsIDOMWindow*, char const*, char const*, char const*, bool, bool, bool, nsISupports*, nsIDOMWindow**) () at nsWindowWatcher.cpp:417
#3  0x0000000100e5dc63 in nsGlobalWindow::OpenInternal (this=0x10b098800, aUrl=@0x7fff5fbfbf90, aName=@0x7fff5fbfc038, aOptions=@0x103d77320, aDialog=false, aContentModal=false, aCalleePrincipal=<value temporarily unavailable, due to optimizations>, aJSCallerContext=<value temporarily unavailable, due to optimizations>, aReturn=<value temporarily unavailable, due to optimizations>) at /Users/mikeconley/Projects/mozilla-central/dom/base/nsGlobalWindow.cpp:11498
#4  0x0000000100e5e3a4 in non-virtual thunk to nsGlobalWindow::OpenNoNavigate(nsAString_internal const&, nsAString_internal const&, nsAString_internal const&, nsIDOMWindow**) () at /Users/mikeconley/Projects/mozilla-central/dom/base/nsGlobalWindow.cpp:7463
#5  0x000000010184d99d in nsDocShell::InternalLoad (this=<value temporarily unavailable, due to optimizations>, aURI=0x113eed200, aReferrer=0x1134c0fe0, aOwner=0x114a69070, aFlags=0, aWindowTarget=0x10b098820, aLoadType=<value temporarily unavailable, due to optimizations>, aSHEntry=<value temporarily unavailable, due to optimizations>, aSourceDocShell=<value temporarily unavailable, due to optimizations>, aDocShell=<value temporarily unavailable, due to optimizations>, aRequest=<value temporarily unavailable, due to optimizations>) at /Users/mikeconley/Projects/mozilla-central/docshell/base/nsDocShell.cpp:9079
#6  0x0000000101855758 in nsDocShell::OnLinkClickSync (this=0x10b075000, aContent=0x112865eb0, aURI=0x113eed3c0, aTargetSpec=<value temporarily unavailable, due to optimizations>, aFileName=@0x106f27f10, aPostDataStream=0x0, aDocShell=<value temporarily unavailable, due to optimizations>, aRequest=<value temporarily unavailable, due to optimizations>) at /Users/mikeconley/Projects/mozilla-central/docshell/base/nsDocShell.cpp:12699
#7  0x0000000101857f85 in mozilla::Maybe<mozilla::AutoCxPusher>::~Maybe () at /Users/mikeconley/Projects/mozilla-central/obj-x86_64-apple-darwin12.5.0/dist/include/nsCxPusher.h:12499
#8  0x0000000101857f85 in nsCxPusher::~nsCxPusher () at /Users/mikeconley/Projects/mozilla-central/docshell/base/nsDocShell.cpp:41
#9  0x0000000101857f85 in nsCxPusher::~nsCxPusher () at /Users/mikeconley/Projects/mozilla-central/obj-x86_64-apple-darwin12.5.0/dist/include/nsCxPusher.h:66
#10 0x0000000101857f85 in OnLinkClickEvent::Run (this=<value temporarily unavailable, due to optimizations>) at /Users/mikeconley/Projects/mozilla-central/docshell/base/nsDocShell.cpp:12502
#11 0x0000000100084f60 in nsThread::ProcessNextEvent (this=0x106f245e0, mayWait=false, result=0x7fff5fbfc947) at nsThread.cpp:715
#12 0x0000000100023241 in NS_ProcessPendingEvents (thread=<value temporarily unavailable, due to optimizations>, timeout=20) at nsThreadUtils.cpp:210
#13 0x0000000100d41c47 in nsBaseAppShell::NativeEventCallback (this=0x1096e8660) at nsBaseAppShell.cpp:98
#14 0x0000000100cfdba1 in nsAppShell::ProcessGeckoEvents (aInfo=0x1096e8660) at nsAppShell.mm:388
#15 0x00007fff86adeb31 in __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ ()
#16 0x00007fff86ade455 in __CFRunLoopDoSources0 ()
#17 0x00007fff86b017f5 in __CFRunLoopRun ()
#18 0x00007fff86b010e2 in CFRunLoopRunSpecific ()
#19 0x00007fff8ad65eb4 in RunCurrentEventLoopInMode ()
#20 0x00007fff8ad65c52 in ReceiveNextEventCommon ()
#21 0x00007fff8ad65ae3 in BlockUntilNextEventMatchingListInMode ()
#22 0x00007fff8cce1533 in _DPSNextEvent ()
#23 0x00007fff8cce0df2 in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] ()
#24 0x0000000100cfd266 in -[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] (self=0x106f801a0, _cmd=<value temporarily unavailable, due to optimizations>, mask=18446744073709551615, expiration=0x422d63c37f00000d, mode=0x7fff7205e1c0, flag=1 '\001') at nsAppShell.mm:165
#25 0x00007fff8ccd81a3 in -[NSApplication run] ()
#26 0x0000000100cfe32b in nsAppShell::Run (this=<value temporarily unavailable, due to optimizations>) at nsAppShell.mm:746
#27 0x000000010199b3dc in XRE_RunAppShell () at /Users/mikeconley/Projects/mozilla-central/toolkit/xre/nsEmbedFunctions.cpp:679
#28 0x00000001002a0dae in MessageLoop::AutoRunState::~AutoRunState () at message_loop.cc:229
#29 0x00000001002a0dae in MessageLoop::AutoRunState::~AutoRunState () at /Users/mikeconley/Projects/mozilla-central/ipc/chromium/src/base/message_loop.h:197
#30 0x00000001002a0dae in MessageLoop::Run (this=0x0) at message_loop.cc:503
#31 0x000000010199b0cd in XRE_InitChildProcess (aArgc=<value temporarily unavailable, due to optimizations>, aArgv=<value temporarily unavailable, due to optimizations>, aProcess=<value temporarily unavailable, due to optimizations>) at /Users/mikeconley/Projects/mozilla-central/toolkit/xre/nsEmbedFunctions.cpp:516
#32 0x0000000100000f1d in main (argc=<value temporarily unavailable, due to optimizations>, argv=0x7fff5fbff4d8) at /Users/mikeconley/Projects/mozilla-central/ipc/app/MozillaRuntimeMain.cpp:149

Oh my. Well, the good news is, we can chop off a good chunk of the lower half because that’s all message / event loop stuff. That’s going to be in every single backtrace ever, pretty much, so I can just ignore it. Here’s the more important stuff:

#0  mozilla::dom::TabChild::ProvideWindow (this=0x109afb400, aParent=0x10b098820, aChromeFlags=4094, aCalledFromJS=false, aPositionSpecified=false, aSizeSpecified=false, aURI=0xffe, aName=@0x0, aFeatures=@0x0, aWindowIsNew=0x10b098820, aReturn=0x7fff5fbfb648) at TabChild.cpp:1201
#1  0x00000001018682e4 in nsWindowWatcher::OpenWindowInternal (this=0x10b05b540, aParent=0x10b098820, aUrl=<value temporarily unavailable, due to optimizations>, aName=<value temporarily unavailable, due to optimizations>, aFeatures=<value temporarily unavailable, due to optimizations>, aCalledFromJS=false, aDialog=<value temporarily unavailable, due to optimizations>, aNavigate=<value temporarily unavailable, due to optimizations>, _retval=<value temporarily unavailable, due to optimizations>) at nsWindowWatcher.cpp:601
#2  0x0000000101869544 in non-virtual thunk to nsWindowWatcher::OpenWindow2(nsIDOMWindow*, char const*, char const*, char const*, bool, bool, bool, nsISupports*, nsIDOMWindow**) () at nsWindowWatcher.cpp:417
#3  0x0000000100e5dc63 in nsGlobalWindow::OpenInternal (this=0x10b098800, aUrl=@0x7fff5fbfbf90, aName=@0x7fff5fbfc038, aOptions=@0x103d77320, aDialog=false, aContentModal=false, aCalleePrincipal=<value temporarily unavailable, due to optimizations>, aJSCallerContext=<value temporarily unavailable, due to optimizations>, aReturn=<value temporarily unavailable, due to optimizations>) at /Users/mikeconley/Projects/mozilla-central/dom/base/nsGlobalWindow.cpp:11498
#4  0x0000000100e5e3a4 in non-virtual thunk to nsGlobalWindow::OpenNoNavigate(nsAString_internal const&, nsAString_internal const&, nsAString_internal const&, nsIDOMWindow**) () at /Users/mikeconley/Projects/mozilla-central/dom/base/nsGlobalWindow.cpp:7463
#5  0x000000010184d99d in nsDocShell::InternalLoad (this=<value temporarily unavailable, due to optimizations>, aURI=0x113eed200, aReferrer=0x1134c0fe0, aOwner=0x114a69070, aFlags=0, aWindowTarget=0x10b098820, aLoadType=<value temporarily unavailable, due to optimizations>, aSHEntry=<value temporarily unavailable, due to optimizations>, aSourceDocShell=<value temporarily unavailable, due to optimizations>, aDocShell=<value temporarily unavailable, due to optimizations>, aRequest=<value temporarily unavailable, due to optimizations>) at /Users/mikeconley/Projects/mozilla-central/docshell/base/nsDocShell.cpp:9079
#6  0x0000000101855758 in nsDocShell::OnLinkClickSync (this=0x10b075000, aContent=0x112865eb0, aURI=0x113eed3c0, aTargetSpec=<value temporarily unavailable, due to optimizations>, aFileName=@0x106f27f10, aPostDataStream=0x0, aDocShell=<value temporarily unavailable, due to optimizations>, aRequest=<value temporarily unavailable, due to optimizations>) at /Users/mikeconley/Projects/mozilla-central/docshell/base/nsDocShell.cpp:12699
#7  0x0000000101857f85 in mozilla::Maybe<mozilla::AutoCxPusher>::~Maybe () at /Users/mikeconley/Projects/mozilla-central/obj-x86_64-apple-darwin12.5.0/dist/include/nsCxPusher.h:12499
#8  0x0000000101857f85 in nsCxPusher::~nsCxPusher () at /Users/mikeconley/Projects/mozilla-central/docshell/base/nsDocShell.cpp:41
#9  0x0000000101857f85 in nsCxPusher::~nsCxPusher () at /Users/mikeconley/Projects/mozilla-central/obj-x86_64-apple-darwin12.5.0/dist/include/nsCxPusher.h:66
#10 0x0000000101857f85 in OnLinkClickEvent::Run (this=<value temporarily unavailable, due to optimizations>) at /Users/mikeconley/Projects/mozilla-central/docshell/base/nsDocShell.cpp:12502

That’s a bit more manageable.

So we start inside something called a docshell. I’ve heard that term bandied about a lot, and I can’t say I’ve ever been too sure what it means, or what a docshell does, or why I should care.

I found some documents that make things a little bit clearer.

Basically, my understanding is that a docshell is the thing that takes incoming stuff from some URI (this could be web content, or it might be a XUL document that’s loading the browser UI…) and connects it to the things that make stuff show up on your screen.

So, pretty important.

It seems to be a place where some utility methods and functions go as well, so it’s kind of this abstract thing that seems to have multiple purposes.

But the most important thing for the purposes of this post is this: every time you load a document, you have a docshell taking care of it. All of these docshells are structured in a tree which is rooted with a docshell owner. This will come into play later.

So one thing that a docshell does is notice when a link was clicked inside of its content. That’s nsDocShell.cpp’s OnLinkClickEvent::Run, and that eventually makes its way over to nsDocShell::OnLinkClickSync.

After some initial checks and balances to ensure that this thing really is a link we want to travel to, we get sent off to nsDocShell::InternalLoad.

Inside there, there’s some more checking… there’s a policy check to make sure we’re allowed to open a link. Lots of security going on. Eventually I see this:

if (aWindowTarget && *aWindowTarget)

That’s good. aWindowTarget maps to the target="_blank" attribute in the anchor. So we’ll be entering this block.

    if (aWindowTarget && *aWindowTarget) {
        // Locate the target DocShell.
        nsCOMPtr<nsIDocShellTreeItem> targetItem;
        rv = FindItemWithName(aWindowTarget, nullptr, this,
                              getter_AddRefs(targetItem));

So now we’re looking for the right docshell to load this new document in. That makes sense – if you have a link where target="foo", subsequent links from the same origin targeted at "foo" will open in the same window or tab or what have you. So we’re checking to see if we’ve opened something with the name inside aWindowTarget already.

So now we’re in nsDocShell::FindItemWithName, and I see this:

        else if (name.LowerCaseEqualsLiteral("_blank"))
        {
            // Just return null.  Caller must handle creating a new window with
            // a blank name himself.
            return NS_OK;
        }

Ah hah, so target="_blank", as we already knew, is special-cased – and this is where it happens. There’s no existing docshell for _blank because we know we’re going to be opening a new window (or tab if the user has preffed it that way). So we don’t return a pre-existing docshell.
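To make that special case concrete, here’s a minimal sketch of a name lookup that never matches "_blank". This is not the real nsDocShell code: ToyDocShell, ToyDocShellTree and its methods are made-up names for illustration, and the registry is just a map.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a docshell; not the real nsDocShell.
struct ToyDocShell {
    std::string name;
};

// Hypothetical registry of named targets, standing in for the docshell tree.
class ToyDocShellTree {
public:
    std::shared_ptr<ToyDocShell> add(const std::string& name) {
        auto shell = std::make_shared<ToyDocShell>(ToyDocShell{name});
        mByName[name] = shell;
        return shell;
    }

    // Returns the existing item called `name`, or nullptr when the caller is
    // expected to create something new. "_blank" (in any case) never matches.
    std::shared_ptr<ToyDocShell> findItemWithName(const std::string& name) const {
        std::string lowered = name;
        std::transform(lowered.begin(), lowered.end(), lowered.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (lowered == "_blank") {
            return nullptr;  // special-cased: always a fresh window or tab
        }
        auto it = mByName.find(name);
        return it == mByName.end() ? nullptr : it->second;
    }

private:
    std::map<std::string, std::shared_ptr<ToyDocShell>> mByName;
};
```

With this toy, looking up "foo" after adding it hands back the same item, while "_blank" comes back null every time, which is the signal for the caller to go and open something new.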

So we’re back in nsDocShell::InternalLoad.

        rv = FindItemWithName(aWindowTarget, nullptr, this,
                              getter_AddRefs(targetItem));
        NS_ENSURE_SUCCESS(rv, rv);

        targetDocShell = do_QueryInterface(targetItem);
        // If the targetDocShell doesn't exist, then this is a new docShell
        // and we should consider this a TYPE_DOCUMENT load
        isNewDocShell = !targetDocShell;

Ok, so now targetItem is nullptr, targetDocShell is also nullptr, and so isNewDocShell is true.

There seems to be more policy checking going on in InternalLoad after this… but eventually, I see this:

    if (aWindowTarget && *aWindowTarget) {
        // We've already done our owner-inheriting.  Mask out that bit, so we
        // don't try inheriting an owner from the target window if we came up
        // with a null owner above.
        aFlags = aFlags & ~INTERNAL_LOAD_FLAGS_INHERIT_OWNER;
        
        bool isNewWindow = false;
        if (!targetDocShell) {
            // If the docshell's document is sandboxed, only open a new window
            // if the document's SANDBOXED_AUXILLARY_NAVIGATION flag is not set.
            // (i.e. if allow-popups is specified)
            NS_ENSURE_TRUE(mContentViewer, NS_ERROR_FAILURE);
            nsIDocument* doc = mContentViewer->GetDocument();
            uint32_t sandboxFlags = 0;

            if (doc) {
                sandboxFlags = doc->GetSandboxFlags();
                if (sandboxFlags & SANDBOXED_AUXILIARY_NAVIGATION) {
                    return NS_ERROR_DOM_INVALID_ACCESS_ERR;
                }
            }

            nsCOMPtr<nsPIDOMWindow> win =
                do_GetInterface(GetAsSupports(this));
            NS_ENSURE_TRUE(win, NS_ERROR_NOT_AVAILABLE);

            nsDependentString name(aWindowTarget);
            nsCOMPtr<nsIDOMWindow> newWin;
            nsAutoCString spec;
            if (aURI)
                aURI->GetSpec(spec);
            rv = win->OpenNoNavigate(NS_ConvertUTF8toUTF16(spec),
                                     name,          // window name
                                     EmptyString(), // Features
                                     getter_AddRefs(newWin));

So we check again to see if we’re targeted at something, and check if we’ve found a target docshell for it. We hadn’t, so we do some security checks, and then … what the hell is nsPIDOMWindow? I’m used to things being called nsIBlahBlah, but now nsPIBlahBlah… what does the P mean?

It took some asking around, but I eventually found out that the P is supposed to be for Private – as in, this is a private XPIDL interface, and non-core embedders should stay away from it.

Ok, and we also see do_GetInterface. This is not the same as QueryInterface, believe it or not. The difference is subtle, but basically it’s this: QueryInterface says “you implement X, but I think you also implement Y. If you do, please return a pointer to yourself that makes you seem like a Y.” GetInterface is different – GetInterface says “I know you know about something that implements Y. It might be you, or more likely, it’s something you’re holding a reference to. Can I get a reference to that please?”. And if successful, it returns it. Here’s more documentation about GetInterface.

It’s a subtle but important difference.
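Here’s a rough sketch of that distinction in plain C++, with no XPCOM involved. All of the types are hypothetical stand-ins (IToyDocShell, IToyWindow, ToyDocShell), and dynamic_cast is only playing the role of the real interface lookup.

```cpp
#include <cassert>
#include <memory>

// Hypothetical interfaces standing in for XPCOM ones.
struct IToySupports { virtual ~IToySupports() = default; };
struct IToyDocShell : IToySupports {};
struct IToyWindow : IToySupports {};

struct ToyWindow : IToyWindow {};

// A docshell that *is* an IToyDocShell and *holds* an IToyWindow.
class ToyDocShell : public IToyDocShell {
public:
    ToyDocShell() : mWindow(std::make_shared<ToyWindow>()) {}

    // QueryInterface-style: "are you, yourself, also a T?"
    template <typename T>
    T* queryInterface() {
        return dynamic_cast<T*>(this);
    }

    // GetInterface-style: "do you know about something that is a T?"
    // Tries this object first, then the window it holds a reference to.
    template <typename T>
    T* getInterface() {
        if (T* self = dynamic_cast<T*>(this)) {
            return self;
        }
        return dynamic_cast<T*>(mWindow.get());
    }

private:
    std::shared_ptr<ToyWindow> mWindow;
};
```

In this toy, shell.queryInterface<IToyWindow>() comes back null because the docshell isn’t a window, while shell.getInterface<IToyWindow>() hands back the window the docshell is holding on to.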

So this docshell knows about a window, and we’ve now got a handle on that window using the private interface nsPIDOMWindow. Neat.

So eventually, we call OpenNoNavigate on that nsPIDOMWindow. That method is pretty much like nsIDOMWindow::Open, except that OpenNoNavigate doesn’t send the window anywhere – it just returns it so that the caller can send it to a URI.

Through the magic of do_GetInterface, nsDocShell::GetInterface, EnsureScriptEnvironment, and NS_NewScriptGlobalObject, I know that the nsPIDOMWindow is being implemented by nsGlobalWindow, and that’s where I should go to find the OpenNoNavigate implementation.

So off we go!

nsGlobalWindow::OpenNoNavigate just seems to forward the call, after some argument setting, to nsGlobalWindow::OpenInternal, like this:

  return OpenInternal(aUrl, aName, aOptions,
                      false,          // aDialog
                      false,          // aContentModal
                      true,           // aCalledNoScript
                      false,          // aDoJSFixups
                      false,          // aNavigate
                      nullptr, nullptr,  // No args
                      GetPrincipal(),    // aCalleePrincipal
                      nullptr,           // aJSCallerContext
                      _retval);

Having a glance around at the rest of the nsGlobalWindow::Open[foo] methods, it looks like they all call into OpenInternal. It’s the big-mamma opening method.
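The shape of that funnel looks roughly like the sketch below. It is a toy, not nsGlobalWindow: ToyGlobalWindow, ToyOpenRequest and the particular flag choices for open and openDialog are assumptions made for illustration. Only the "no navigate" variant passing false for navigation mirrors the call shown above.

```cpp
#include <cassert>
#include <string>

// Hypothetical record of what an "open" call asked for.
struct ToyOpenRequest {
    std::string url;
    bool dialog;
    bool navigate;
};

class ToyGlobalWindow {
public:
    // Thin public entry points: each one only picks the flags.
    ToyOpenRequest open(const std::string& url) {
        return openInternal(url, /* dialog = */ false, /* navigate = */ true);
    }

    ToyOpenRequest openDialog(const std::string& url) {
        return openInternal(url, /* dialog = */ true, /* navigate = */ true);
    }

    // Like open(), but leaves navigation to the caller.
    ToyOpenRequest openNoNavigate(const std::string& url) {
        return openInternal(url, /* dialog = */ false, /* navigate = */ false);
    }

private:
    // The one place where shared work (popup checks and so on) would live.
    ToyOpenRequest openInternal(const std::string& url, bool dialog, bool navigate) {
        return ToyOpenRequest{url, dialog, navigate};
    }
};
```

The nice property of this layout is that any check added to openInternal applies to every way of opening a window at once.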

This method does a few things, including making sure that we’re not being abused by web content that’s trying to spam the user with popups.

Eventually, we get to this:

      rv = pwwatch->OpenWindow2(this, url.get(), name_ptr, options_ptr,
                                /* aCalledFromScript = */ false,
                                aDialog, aNavigate, aExtraArgument,
                                getter_AddRefs(domReturn));

and return the domReturn pointer back after a few more checks to our caller. Remember that the caller is going to take this new window, and navigate it to some URI.

Ok, so, pwwatch. What is that? Well, that appears to be a private interface to nsWindowWatcher, which gives us access to the OpenWindow2 method.

OpenWindow2 preps some arguments, much like nsGlobalWindow::OpenNoNavigate did, and then forwards the call over to nsWindowWatcher::OpenWindowInternal.

And now we’re almost done – we’re almost at the point where we’re actually going to open a window!

Some key things need to happen though. First, we do this:

nsCOMPtr<nsIDocShellTreeOwner>  parentTreeOwner;  // from the parent window, if any
...
GetWindowTreeOwner(aParent, getter_AddRefs(parentTreeOwner));

So what that does is it tries to get the docshell owner of the docshell that’s attempting to open the window (and that’d be the docshell that we clicked the link in).

After a few more things, we check to see if there’s an existing window with that target name which we can re-use:

  // try to find an extant window with the given name
  nsCOMPtr<nsIDOMWindow> foundWindow = SafeGetWindowByName(name, aParent);
  GetWindowTreeItem(foundWindow, getter_AddRefs(newDocShellItem));

And if so, that window’s docshell tree item gets assigned to newDocShellItem.

After some more security stuff, we check to see if newDocShellItem exists. Because name is nullptr (since we had target="_blank", and nsDocShell::FindItemWithName returned nullptr), newDocShellItem is null.

Because it doesn’t exist, we know we’re opening a brand new window!

More security things seem to happen, and then we get to the part that I’m starting to focus on:

      nsCOMPtr<nsIWindowProvider> provider = do_GetInterface(parentTreeOwner);
      if (provider) {
        NS_ASSERTION(aParent, "We've _got_ to have a parent here!");

        nsCOMPtr<nsIDOMWindow> newWindow;
        rv = provider->ProvideWindow(aParent, chromeFlags, aCalledFromJS,
                                     sizeSpec.PositionSpecified(),
                                     sizeSpec.SizeSpecified(),
                                     uriToLoad, name, features, &windowIsNew,
                                     getter_AddRefs(newWindow));

We ask the parentTreeOwner to get us something that it knows about that implements nsIWindowProvider. In the Electrolysis / content process case, that’d be TabChild. In the normal, non-Electrolysis case, that’s nsContentTreeOwner.

The nsIWindowProvider is the thing that we’ll use to get a new window from! So we call ProvideWindow on it, and it gives us back a pointer to a new nsIDOMWindow, assigned to newWindow.
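As a rough picture of that arrangement, here’s a sketch where a tree owner hands out one of two providers depending on which kind of process it lives in. Everything here is hypothetical (ToyWindowProvider, ToyRemoteProvider, ToyLocalProvider, ToyTreeOwner), and the returned strings are placeholders rather than real windows. In particular, the "local" behaviour is only a stand-in for whatever nsContentTreeOwner really does.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for nsIWindowProvider: anything that can hand back
// a place to load a document.
struct ToyWindowProvider {
    virtual ~ToyWindowProvider() = default;
    virtual std::string provideWindow(bool* windowIsNew) = 0;
};

// Content-process flavour: the parent process has to create the tab.
struct ToyRemoteProvider : ToyWindowProvider {
    std::string provideWindow(bool* windowIsNew) override {
        assert(windowIsNew);
        *windowIsNew = true;
        return "tab created by the parent process";
    }
};

// Single-process flavour (placeholder behaviour only).
struct ToyLocalProvider : ToyWindowProvider {
    std::string provideWindow(bool* windowIsNew) override {
        assert(windowIsNew);
        *windowIsNew = true;
        return "window created in this process";
    }
};

// Hypothetical tree owner: callers ask it for a provider without needing to
// know which concrete type they get back.
class ToyTreeOwner {
public:
    explicit ToyTreeOwner(bool inContentProcess) {
        if (inContentProcess) {
            mProvider.reset(new ToyRemoteProvider());
        } else {
            mProvider.reset(new ToyLocalProvider());
        }
    }

    ToyWindowProvider* getProvider() const { return mProvider.get(); }

private:
    std::unique_ptr<ToyWindowProvider> mProvider;
};
```

The caller’s code is identical in both cases: get the provider, call provideWindow, and use whatever comes back.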

Here’s TabChild::ProvideWindow:

NS_IMETHODIMP
TabChild::ProvideWindow(nsIDOMWindow* aParent, uint32_t aChromeFlags,
                        bool aCalledFromJS,
                        bool aPositionSpecified, bool aSizeSpecified,
                        nsIURI* aURI, const nsAString& aName,
                        const nsACString& aFeatures, bool* aWindowIsNew,
                        nsIDOMWindow** aReturn)
{
    *aReturn = nullptr;

    // If aParent is inside an <iframe mozbrowser> or <iframe mozapp> and this
    // isn't a request to open a modal-type window, we're going to create a new
    // <iframe mozbrowser/mozapp> and return its window here.
    nsCOMPtr<nsIDocShell> docshell = do_GetInterface(aParent);
    if (docshell && docshell->GetIsInBrowserOrApp() &&
        !(aChromeFlags & (nsIWebBrowserChrome::CHROME_MODAL |
                          nsIWebBrowserChrome::CHROME_OPENAS_DIALOG |
                          nsIWebBrowserChrome::CHROME_OPENAS_CHROME))) {

      // Note that BrowserFrameProvideWindow may return NS_ERROR_ABORT if the
      // open window call was canceled.  It's important that we pass this error
      // code back to our caller.
      return BrowserFrameProvideWindow(aParent, aURI, aName, aFeatures,
                                       aWindowIsNew, aReturn);
    }

    // Otherwise, create a new top-level window.
    PBrowserChild* newChild;
    if (!CallCreateWindow(&newChild)) {
        return NS_ERROR_NOT_AVAILABLE;
    }

    *aWindowIsNew = true;
    nsCOMPtr<nsIDOMWindow> win =
        do_GetInterface(static_cast<TabChild*>(newChild)->WebNavigation());
    win.forget(aReturn);
    return NS_OK;
}

The docshell->GetIsInBrowserOrApp() is basically asking “are we b2g?”, to which the answer is “no”, so we skip that block, and go right for CallCreateWindow.

CallCreateWindow is using the IPC library to communicate with TabParent in the UI process, which has a corresponding function called AnswerCreateWindow. Here it is:

bool
TabParent::AnswerCreateWindow(PBrowserParent** retval)
{
    if (!mBrowserDOMWindow) {
        return false;
    }

    // Only non-app, non-browser processes may call CreateWindow.
    if (IsBrowserOrApp()) {
        return false;
    }

    // Get a new rendering area from the browserDOMWin.  We don't want
    // to be starting any loads here, so get it with a null URI.
    nsCOMPtr<nsIFrameLoaderOwner> frameLoaderOwner;
    mBrowserDOMWindow->OpenURIInFrame(nullptr, nullptr,
                                      nsIBrowserDOMWindow::OPEN_NEWTAB,
                                      nsIBrowserDOMWindow::OPEN_NEW,
                                      getter_AddRefs(frameLoaderOwner));
    if (!frameLoaderOwner) {
        return false;
    }

    nsRefPtr<nsFrameLoader> frameLoader = frameLoaderOwner->GetFrameLoader();
    if (!frameLoader) {
        return false;
    }

    *retval = frameLoader->GetRemoteBrowser();
    return true;
}

So after some checks, we call mBrowserDOMWindow’s OpenURIInFrame, with (among other things), nsIBrowserDOMWindow::OPEN_NEWTAB. So that’s why we’ve got a new tab opening instead of a new window.
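The Call/Answer pairing can be pictured with the toy below. There is no real IPC here: ToyTabChild and ToyTabParent are hypothetical, the "message" is a direct function call within one process, and the integer tab ID is a made-up substitute for the PBrowser actor that the real code passes back.

```cpp
#include <cassert>

// Hypothetical parent-side actor. The answer method mirrors the shape of
// TabParent::AnswerCreateWindow: bool for success, result via out-param.
class ToyTabParent {
public:
    explicit ToyTabParent(bool hasBrowserWindow)
        : mHasBrowserWindow(hasBrowserWindow), mNextTabId(1) {}

    bool answerCreateWindow(int* newTabId) {
        if (!mHasBrowserWindow) {
            return false;  // nothing to open a tab in
        }
        *newTabId = mNextTabId++;
        return true;
    }

private:
    bool mHasBrowserWindow;
    int mNextTabId;
};

// Hypothetical child-side actor. In real IPC the call would be serialized
// and sent to another process; here it is just forwarded directly.
class ToyTabChild {
public:
    explicit ToyTabChild(ToyTabParent* parent) : mParent(parent) {}

    bool callCreateWindow(int* newTabId) {
        assert(newTabId);
        return mParent && mParent->answerCreateWindow(newTabId);
    }

private:
    ToyTabParent* mParent;
};
```

As in the real code, the child only learns two things from the call: whether it succeeded, and (if so) a handle to the new thing the parent created.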

mBrowserDOMWindow is a reference to this thing implemented in browser.js:

function nsBrowserAccess() { }

nsBrowserAccess.prototype = {
  QueryInterface: XPCOMUtils.generateQI([Ci.nsIBrowserDOMWindow, Ci.nsISupports]),

  _openURIInNewTab: function(aURI, aOpener, aIsExternal) {
    let win, needToFocusWin;

    // try the current window.  if we're in a popup, fall back on the most recent browser window
    if (window.toolbar.visible)
      win = window;
    else {
      let isPrivate = PrivateBrowsingUtils.isWindowPrivate(aOpener || window);
      win = RecentWindow.getMostRecentBrowserWindow({private: isPrivate});
      needToFocusWin = true;
    }

    if (!win) {
      // we couldn't find a suitable window, a new one needs to be opened.
      return null;
    }

    if (aIsExternal && (!aURI || aURI.spec == "about:blank")) {
      win.BrowserOpenTab(); // this also focuses the location bar
      win.focus();
      return win.gBrowser.selectedBrowser;
    }

    let loadInBackground = gPrefService.getBoolPref("browser.tabs.loadDivertedInBackground");
    let referrer = aOpener ? makeURI(aOpener.location.href) : null;

    let tab = win.gBrowser.loadOneTab(aURI ? aURI.spec : "about:blank", {
                                      referrerURI: referrer,
                                      fromExternal: aIsExternal,
                                      inBackground: loadInBackground});
    let browser = win.gBrowser.getBrowserForTab(tab);

    if (needToFocusWin || (!loadInBackground && aIsExternal))
      win.focus();

    return browser;
  },

  openURI: function (aURI, aOpener, aWhere, aContext) {
    ... (removed for brevity)
  },

  openURIInFrame: function browser_openURIInFrame(aURI, aOpener, aWhere, aContext) {
    if (aWhere != Ci.nsIBrowserDOMWindow.OPEN_NEWTAB) {
      dump("Error: openURIInFrame can only open in new tabs");
      return null;
    }

    var isExternal = (aContext == Ci.nsIBrowserDOMWindow.OPEN_EXTERNAL);
    let browser = this._openURIInNewTab(aURI, aOpener, isExternal);
    if (browser)
      return browser.QueryInterface(Ci.nsIFrameLoaderOwner);

    return null;
  },

  isTabContentWindow: function (aWindow) {
    return gBrowser.browsers.some(function (browser) browser.contentWindow == aWindow);
  },

  get contentWindow() {
    return gBrowser.contentWindow;
  }
}

So nsBrowserAccess’s openURIInFrame only supports opening things in new tabs, and then it just calls _openURIInNewTab on itself, which does the job of returning the tab’s remote browser after the tab is opened.

I might follow this up with a post about how nsContentTreeOwner opens a window in the non-Electrolysis case, and how we might abstract some of that out for re-use here. We’ll see.

And that’s about it. Hopefully this is useful to future spelunkers.

Electrolysis: Debugging Child Processes of Content for Make Benefit Glorious Browser of Firefox

Here’s how I’m currently debugging Electrolysis stuff on OS X using gdb. It involves multiple terminal windows. I live with that.

# In Terminal Window 1, I execute my Firefox build with MOZ_DEBUG_CHILD_PROCESS=1.
# That environment variable makes it so that the parent process spits out the child
# process ID as soon as it forks out. I also use my e10s profile so as to not muck up
# my default profile.

MOZ_DEBUG_CHILD_PROCESS=1 ./mach run -P e10s

# So, now my Firefox is spawned up and ready to go. I have
# browser.tabs.remote.autostart set to "true" in my about:config, which means I'm
# using out-of-process tabs by default. That means that right away, I see the
# child process ID dumped into the console. Maybe you get the same thing if
# browser.tabs.remote.autostart is false. I haven't checked.

CHILDCHILDCHILDCHILD
  debug me @ 45326

# ^-- so, this is what comes out in Terminal Window 1.

So, the next step is to open another terminal window. This one will connect to the parent process.

# Maybe there are smarter ways to find the firefox process ID, but this is what I
# use in my new Terminal Window 2.
ps aux | grep firefox

# And this is what I get back:

mikeconley     45391  17.2  5.3  3985032 883932   ??  S     2:39pm   1:58.71 /Applications/FirefoxAurora.app/Contents/MacOS/firefox
mikeconley     45322   0.0  0.4  3135172  69748 s000  S+    2:36pm   0:06.48 /Users/mikeconley/Projects/mozilla-central/obj-x86_64-apple-darwin12.5.0/dist/Nightly.app/Contents/MacOS/firefox -no-remote -foreground -P e10s
mikeconley     45430   0.0  0.0  2432768    612 s002  R+    2:44pm   0:00.00 grep firefox
mikeconley     44878   0.0  0.0        0      0 s000  Z    11:46am   0:00.00 (firefox)

# That second one is what I want to attach to. I can tell, because the executable
# path lies within my local build's objdir. The first row is my main Firefox I just
# use for work browsing. I definitely don't want to attach to that. The third line
# is just me looking for the process with grep. Not sure what that last one is,
# though the Z in its state column means it's a zombie (exited, not yet reaped).

# I use sudo to attach to the parent because otherwise, OS X complains about permissions
# for process attachment. I attach to the parent like this:

sudo gdb firefox 45322

# And now I have a gdb for the parent process. Easy peasy.

And finally, to debug the child, I open yet another terminal window.

# That process ID that I got from Terminal Window 1 comes into play now.

sudo gdb firefox 45326

# Boom - attached to child process now.

Setting breakpoints for things like TabChild::foo or TabParent::bar can be done like this:

# In Terminal Window 3, attached to the child:

b mozilla::dom::TabChild::foo

# In Terminal Window 2, attached to the parent:

b mozilla::dom::TabParent::bar

And now we’re cookin’.

A Blog by Mike Conley

The personal blog of a Toronto based software mechanic, musician, sound designer, and theatre enthusiast.