Australis Performance Post-mortem Summary

Over the last few months, I’ve been talking about all of the work we put into making Australis feel fast when it shipped in Firefox 29.

I talked about where we started with our performance work, and how we grappled with the ts_paint and tpaint performance (“talos”) tests. After that, I talked a bit about the excellent tools we have (and ones we developed ourselves) to make finding our performance bottlenecks easier.

After a brief delay, I rounded out the series by talking about our tab animation performance work, and the customization transition performance work.

I think over the course of working on these things, I’ve learned quite a bit about performance work in general. If I had to distill it down to a few tidbits, it’d be:

  • Measure first to get a baseline, then try to improve. (Alternatively, “you can’t improve what you can’t measure”)
  • Finding the solutions to performance problems is usually the easy part. The hard part is finding and isolating the problems to begin with.
  • While performance work can be a bit of a grind, users do feel and appreciate the efforts. It’s totally worth it.

So that’s it on the series. Enjoy your zippy Firefox!

Australis Performance Post-mortem Part 5: The Customize Mode Transition

Another new thing that came out with Firefox 29 is a sexy new customization interface. We wanted to make UI customization something that anybody would feel comfortable doing, instead of something only a few mighty power users might do.

The new customization mode is accessible by pressing the Menu Button (☰), and then clicking Customize.

Give it a shot now if you’ve never done it.

BAM, did you see that? What you just saw was a full-on mode switch in the browser chrome to indicate to you that you’ve put the browser into a different state – specifically, a state where much of the browser UI is malleable. You can drag and drop things in your toolbars and the menu, enable toolbars, etc.

Going in and out of customization mode is not something that most people will do frequently, but we still invested a bunch of time trying to make it smooth. I still think there is more we can do on that front, and I’ll get into that at the end of this post.

Anyhow, as with any performance-related project, we started with a way of measuring. In this case, our trusty performance team re-purposed the TART test that we used for tab animation, and pointed it at the customize mode transition. We called this new test CART (customization animation regression test, natch).

This is slightly different from the tab animation stuff, because we didn’t really have a baseline measurement to compare against – there was no old customization mode transition to try to match or beat. So we just had to do our best to do all of our processing under 16ms in order to draw the transition at 60fps.
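To put a number on that budget: at 60 frames per second, each frame gets 1000 / 60, or roughly 16.7ms. The snippet below is only a sketch of that arithmetic. The function names and the sample frame intervals are made up for illustration, and this is not how CART itself is implemented.

```python
def frame_budget_ms(fps: float = 60.0) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def missed_frames(intervals_ms, fps: float = 60.0):
    """Return the frame intervals that took longer than the per-frame budget."""
    budget = frame_budget_ms(fps)
    return [t for t in intervals_ms if t > budget]


# Hypothetical frame intervals (ms) recorded during one transition.
intervals = [15.9, 16.2, 33.4, 16.6, 50.1]
print(round(frame_budget_ms(), 2))
print(missed_frames(intervals))
```

Any interval over the budget means at least one dropped frame, which is what shows up as jank.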

That means that instead of comparing two sets of data over time, we’ll only be looking at a single series of points, and how they change over time.

So let’s see where we started, and where we finished, and how we got there.

CART results - pre-optimizations

So in an effort to get this blog post done, I’ve cut some corners and combined the data from a number of different platforms into a single graph. I apologize if this makes it difficult to interpret for some of you – but I’ll do my best to explain what the graph means.

I chose a representative sample of the platforms that we measured – we’ve got OS X (10.6, 10.8), Ubuntu 12.04 (32-bit), Windows XP, Windows 7, and Windows 8.

The X-axis is obviously plotting the time (we started gathering CART data right at the end of January 2014). The Y-axis plots the “final CART value”. CART, like TART, measures a number of things, and we “boil that number down” into a single value that we can plot. Each of the subtests is measured in milliseconds, so I guess you could say the Y-axis is also in milliseconds – but as it’s an aggregate, that’s not too meaningful. It’s not the greatest for detecting small shifts in the subtests (we use Datazilla for that), but it allows me to illustrate the changes over time in a pretty simple way.
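I'm not going to reproduce the exact formula Talos uses to produce that final value. Purely as an illustration, the sketch below assumes a geometric mean over the subtests. The subtest names and timings are hypothetical.

```python
import math


def summarize(subtests_ms):
    """Collapse per-subtest timings (ms) into one number via a geometric mean.

    The geometric mean is an assumption made for this sketch, not
    necessarily what Talos does.
    """
    values = list(subtests_ms.values())
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical subtest names and timings.
run = {"customize-enter": 40.0, "customize-exit": 10.0}
print(summarize(run))
```

Whatever the real formula is, any aggregate like this dilutes a small shift in one subtest, which is why the per-subtest view is the better place to hunt for those.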

So it looks like my sampled platforms are all floating around the 40’s and 50’s. Is that good or bad? Well, it’s neither really. It’s just our starting point, and we wanted to try to improve on that.

And so we set out trying to move those dots downwards (for this data, without going into grand detail, lower milliseconds = faster and smoother animation).

As with TART, we did a lot of profiling, and used a lot of the tools I mentioned in this post. Here are the bugs that seemed to move the needle the most.

The Rogue’s Gallery

As before, this is not a complete list, but captures some of the more interesting bugs we worked on.

Bug 932963 – Break customize mode transition into several phases

This change allowed us to break the transition in and out of customization mode into some phases:

  1. Not in customization mode (default)
  2. Entering customization mode
  3. Entered customization mode
  4. Exiting customization mode

Phases 2, 3 and 4 have attributes on the main browser window that we can target with CSS selectors. We were then able to “lighten” the CSS during phases 2 and 4, in order to optimize the frames during the transition. For example, we don’t display the semi-transparent grid texture on the main window unless we’re in phase 3.
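The phases form a simple cycle, and it can help to see them as a tiny state machine. This is a sketch in Python rather than the actual front-end code, and the attribute name is a hypothetical stand-in for whatever the real window attributes are called.

```python
from enum import Enum


class Phase(Enum):
    DEFAULT = "default"    # 1. Not in customization mode
    ENTERING = "entering"  # 2. Entering customization mode
    ENTERED = "entered"    # 3. Entered customization mode
    EXITING = "exiting"    # 4. Exiting customization mode


# Each phase leads to exactly one next phase.
NEXT = {
    Phase.DEFAULT: Phase.ENTERING,
    Phase.ENTERING: Phase.ENTERED,
    Phase.ENTERED: Phase.EXITING,
    Phase.EXITING: Phase.DEFAULT,
}


def window_attributes(phase):
    """Attributes a stylesheet could select on for the given phase."""
    if phase is Phase.DEFAULT:
        return {}
    return {"customize-phase": phase.value}  # hypothetical attribute name


def show_grid_texture(phase):
    """The grid texture is only drawn once we are fully in phase 3."""
    return phase is Phase.ENTERED


phase = Phase.DEFAULT
for _ in range(4):
    phase = NEXT[phase]
    print(phase.name, window_attributes(phase), show_grid_texture(phase))
```

The point of the split is that the two transitional phases get their own selectors, so anything expensive can be switched off for just those frames.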

Bug 972485 – Find out why we’re doing a bunch of synchronous file reading at the start of the customize mode transition

This was a rather surprising one – using the Gecko Profiler, it looked like we were doing sync file IO as the blank about:customizing document was loading.

For context, we have a page registered at about:customizing that’s a blank document. When we detect that about:customizing has been loaded or switched to, we enter customize mode.

So this sync file IO was causing some jank at the start of the customization mode transition.

Strangely, it turned out that the XHTML file we were loading for the blank about:customizing document was synchronously loading a bunch of MathML localization stuff as it loaded.

We switched the document from XHTML to XUL, and the sync IO load went away. We filed a follow-up bug to investigate why exactly we were doing sync file IO on loading XHTML files, because that’s a bad thing to do.

Anyhow, this was a small but significant win for almost all platforms.

Bug 975552 – Preload about:customizing like we do with about:newtab

I think this was the biggest win we achieved during the performance work. That browser that we load the blank about:customizing page in is not free – a bunch of stuff gets instantiated in order for a working browser to be created, and all of that expense is wasted on a blank document. That expense also causes enough main thread thinking that it reduces the smoothness of the customization entering transition.

So the solution was a hack that takes after the same strategy we use for about:newtab. Essentially, we preload the about:customizing browser and document in the background at what seems like a logical time (right after the user has opened the menu). This allows us to front-load all of the expense in creating the browser, and we end up with something much smoother.
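The general shape of that trick looks something like the sketch below. It is not the Firefox code. The class and the create_browser function are hypothetical, and the expensive browser setup is replaced by a stand-in.

```python
class PreloadedBrowser:
    """Create an expensive object ahead of time so it is ready when needed."""

    def __init__(self, create):
        self._create = create
        self._browser = None

    def preload(self):
        """Call at a quiet moment, e.g. right after the menu opens."""
        if self._browser is None:
            self._browser = self._create()

    def take(self):
        """Hand over the preloaded browser, or build one now as a fallback."""
        browser = self._browser if self._browser is not None else self._create()
        self._browser = None
        return browser


created = []


def create_browser():
    # Hypothetical stand-in for the costly browser + blank document setup.
    created.append("browser")
    return {"url": "about:customizing"}


preloader = PreloadedBrowser(create_browser)
preloader.preload()       # user opens the menu: pay the cost here
browser = preloader.take()  # user clicks Customize: nothing left to build
print(len(created), browser)
```

The cost doesn't go away. It just moves to a moment when nothing is animating.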

Here’s a before video and an after video from my Windows 7 machine, for reference.

This was a pretty bodacious hack, and in the future, I’d like to remove it completely, and find some way of sidestepping all of the browser internal loading for the about:customizing document (or, even better, find a way of not using a browser element at all!). That’s filed here.

Bug 977796 – Disable subpixel anti-aliasing during customize mode transition.

A profile on Windows 8 showed that we were spending an inordinate amount of time rendering text during the customize mode transition, and this has to do with subpixel anti-aliasing in the menu that animates in.

So I landed a patch that temporarily disables subpixel anti-aliasing on that element during the customize mode transition, and that bought us a huge win for Windows 8 (about 20%). Not much of a win for any of the other platforms.

So where did we end up?

CART results - post-optimizations

Those numbers are indeed lower!

Now, before you lose your mind about that epic cross-platform win around February 20th, I investigated that changeset range and didn’t find anything we worked on directly in there.

There are some platform patches in there (specifically in Graphics) that might account for this win, but I’m pretty certain it’s because of bug 974621, which updated our test runners to use a different version of Talos that included the patch in bug 967186. 967186 altered the CART test to be more accurate in its measurements, so actually a lot of our initial data was erroneous. That blows a bit because it adds some unnecessary noise to our graph, but it’s also good because accuracy is a thing we definitely want.

Future work

I still think there are more wins to be made in the customize mode transition. Finding a way of getting rid of the about:customizing preloading hack and replacing it with something smarter (like a thin browser that doesn’t need to instantiate much of its backend, or no browser at all) is probably the first step. More profiling might find more wins. I’m learning more and more about platform as I work on Electrolysis, so maybe I’ll come back to this problem with some new skills and information, and I can get it performing reliably on all platforms as it really should: 60fps, smooth as silk.

Australis Performance Post-mortem Part 4: On Tab Animation

Whoops, forgot that I had a blog post series going on here where I talk about the stuff we did to make Australis blazing fast. In that time, we’ve shipped the thing (Firefox 29 represent!), so folks are actually feeling the results of this performance work, which is pretty excellent.

I ended the last post on an ominous note – something about how we were clear to land Australis on Nightly, or so we thought.

This next bit is all about timing.

There is a Performance Team at Mozilla, who are charged with making our products crazy-fast and crazy-smooth. These folks are geniuses at wringing out every last millisecond possible from a computation. It’s what they do all day long.

A pretty basic principle that I’ve learned over the years is that if you want to change something, you first have to develop a system for measuring the thing you’re trying to change. That way, you can determine if what you’re doing is actually changing things in the way you want them to.

If you’ve been reading these Australis Performance Post-mortem posts, you’ll know that we have some performance tests like this, and they’re called Talos tests.

Just about as we were finishing off the last of the tpaint and ts_paint regressions, the front-end team was suddenly made aware of a new Talos test that was being developed by the Performance team. This test was called TART, and stands for Tab Animation Regression Test. The purpose of this test is to exercise various tab animation scenarios and measure the time it takes to paint each frame and to proceed through the entire animation of a tab open and a tab close.

The good news was that this new test was almost ready for running on our Nightly builds!

The bad news was that the UX branch, which Australis was still on at the time, was regressing this test. And since we cannot land if we regress performance like this… it meant we couldn’t land.

Bad news indeed.

Or was it? At the time, the lot of us front-end engineers were groaning because we’d just slogged through a ton of other performance regressions. Investigating and fixing performance regressions is exhausting work, and we weren’t too jazzed that another regression had just shown up.

But thinking back, I’m somewhat glad this happened. The test showed that Australis was regressing tab animation performance, and tabs are opened every day by almost every Firefox user. Regressing tab performance is simply not a thing one does lightly. And this test caught us before we landed something that regressed those tabs mightily!

That was a good thing. We wouldn’t have known otherwise until people started complaining that their tabs were feeling sluggish when we released it (since most of us run pretty beefy development machines).

And so began the long process of investigating and fixing the TART regressions.

So how bad were things?

Let’s take a look at the UX branch in comparison with mozilla-central at the time that we heard about the TART regressions.

Here are the regressing platforms. I’ll start with Windows XP:

Where we started with TART on Windows XP

Forgive me, I couldn’t get the Graph Server to swap the colours of these two datasets, so my original silent pattern of “red is the regressor” has to be dropped. I could probably spend some time trying to swap the colours through various tricky methods, but I honestly don’t think anybody reading this will care too much.

So here we can see the TART scores for Windows XP, and the UX branch (green) is floating steadily over mozilla-central (red). Higher scores are bad. So here’s the regression.

Now let’s see OS X 10.6.

Where we started with TART on OS X 10.6

Same problem here – the UX nodes (tan) are clearly riding higher than mozilla-central. This was pretty similar to OS X 10.7 too, so I didn’t include the graph.

On OS X 10.8, things were a little bit better, but not too much:

Where we started with TART on OS X 10.8

Here, the regression was still easily visible, but not as large in magnitude.

Ubuntu was in the same boat as OS X 10.6/10.7 and Windows XP:

Where we started with TART on Ubuntu

But what about Windows 7 and Windows 8? Well, interesting story – believe it or not, on those platforms, UX seemed to perform better than mozilla-central:

Windows 7 (the blue nodes are mozilla-central, the tan nodes are UX)

Where we started with TART on Windows 7

Windows 8 (the green nodes are mozilla-central, the red nodes are UX)

Where we started with TART on Windows 8

So what the hell was going on?

Well, we eventually figured it out. I’ll lay it out in the next few paragraphs. The following is my “rogue’s gallery” of regressions. This list does not include many false starts and red herrings that we followed during the months working on these regressions. Think of this as “getting to the good parts”.

Backfilling

The problem with having a new test, and having mozilla-central better than UX, is not knowing where UX got worse; there were no historical measurements that we could look at to see where the regression got introduced.

MattN, smart guy that he is, got us a few talos loaner machines, and wrote some scripts to download the Nightlies for both mozilla-central and UX going back to the point where UX split off. Then, he was able to run TART on these builds, and supply the results to his own custom graph server.

So basically, we were able to backfill our missing TART data, and that helped us find a few points of regression.
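Once you have backfilled scores for both branches, spotting the points of regression is mostly a matter of watching the gap between them. Here's a minimal sketch of that idea. It is not MattN's script. The function name, the threshold, the dates and the scores are all hypothetical.

```python
def regression_points(central, ux, threshold=1.0):
    """Return build dates where the UX-minus-central gap grew by more than
    `threshold` ms compared with the previous build."""
    dates = sorted(set(central) & set(ux))  # ISO dates sort chronologically
    points = []
    prev_gap = None
    for date in dates:
        gap = ux[date] - central[date]
        if prev_gap is not None and gap - prev_gap > threshold:
            points.append(date)
        prev_gap = gap
    return points


# Hypothetical TART scores (ms) per Nightly build date; lower is better.
central = {"2013-06-01": 10.0, "2013-06-02": 10.1,
           "2013-06-03": 10.0, "2013-06-04": 10.2}
ux = {"2013-06-01": 10.1, "2013-06-02": 10.0,
      "2013-06-03": 12.5, "2013-06-04": 12.6}
print(regression_points(central, ux))
```

Comparing the gap rather than the raw UX score means a change that slows both branches equally doesn't get flagged as a UX regression.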

With that data, now it was time to focus in on each platform, and figure out what we could do with it.

Windows XP

We started with XP, since on the regressing platforms, that’s where most of our users are.

Here’s what we found and fixed:

Bug 916946 – Stop animating the back-button when enabling or disabling it.

During some of the TART tests, we start with a single tab, open a new tab, and then close the new tab, and repeat. That first tab has some history, so the back button is enabled. The new tab has no history, so the back button is disabled.

Apparently, we had some CSS that was causing us to animate the back-button when we were flipping back and forth from the enabled / disabled states. That CSS got introduced in the patch that bound the back/forward/stop/reload buttons to the URL bar. It seemed to affect Windows only. Fixing that CSS gave us our first big win on Windows XP, and gave us more of a lead with Windows 7 and Windows 8!

Bug 907544 - Pass the D3DSurface9 down into Cairo so that it can release the DC and LockRect to get at the bits

I don’t really remember how this one went down (and I don’t want to really spend the time swapping it back in by reading the bug), but from my notes it looks like the Graphics team identified this possible performance bottleneck when I showed them some profiles I gathered when running TART.

The good news on this one was that it definitely gave the UX branch a win on Windows XP. The bad news is that it gave the same win to mozilla-central. This meant that while overall performance got better on Windows XP, we still had the same regression preventing us from landing.

Bug 919541 - Consider not animating the opacity for Australis tabs

Jeff Muizelaar helped me figure this out while we were using paint flashing to analyze paint activity while opening and closing tabs. When we slowed down the transitions, we noticed that the closing tabs were causing paints even though they weren’t visible. Closing tabs aren’t visible because with Australis, we don’t show the tab shape around tabs when they’re not selected – and closing a tab automatically unselects it for something else.

For some reason, our layout and graphics code still wanted to paint this transition even though the element was not visible. We quickly nipped that in the bud, and got ourselves a nice win on tab close measurements for all platforms!

Bug 921038 – Move selected tab curve clip-paths into SVG-as-an-image so it is cached.

This was the final nail in the coffin for the TART regression on Windows XP. Before this bug, we were drawing the linear-gradients in the tab shape using CSS, and the clipping for the curve background colour was being pulled off using clip-path and an SVG curve defined in the browser.xul document.

In this bug, we moved from clipping a background to create the curve, to simply drawing a filled curve using SVG, and putting the linear-gradient for the texture in the “stroke” image (the image that overlays the border on the tab curve).

That by itself was not enough to win back the regression – but thankfully, Seth Fowler had been working on SVG caching, and with that cache backend, our patch here knocked the XP and Ubuntu regressions out! It also took out a chunk from OS X. Things were looking good!

OS X

Bug 924415 - Find out why setting chromemargin to 0,-1,-1,-1 is so expensive for TART on UX branch on OS X.

I don’t think I’ll ever forget this bug.

I had gotten my hands on a Mac Mini that (after some hardware modifications) matched the specs of our 10.6 Talos test machines. That would prove to be super useful, as I was easily able to reproduce the regression on that machine, and we could debug and investigate locally, without having to remote in to some loaner device.

With this machine, it didn’t take us long to identify the drawing of the tabs in the titlebar as the main culprit in the OS X regression. But the “why” eluded us for weeks.

It was clear I wasn’t going to be able to solve it on my own, so Jeff Muizelaar from the Graphics team joined in to help me.

We looked at OpenGL profiles, we looked at apitraces, we looked at profiles using the Gecko Profiler, and we looked at profiles from Instruments – the profiler that comes included with Xcode.

It seemed like the performance bottleneck was coming from the operating system, but we needed to prove it.

Jeff and I dug and dug. I remember going home one day, feeling pretty deflated by another day of getting nowhere with this bug, when as I was walking into my apartment, I got a phone call.

It was Jeff. He told me he’d found something rather interesting – when the titlebar of the browser overlapped the titlebar of another window, he was able to reproduce the regression. When it did not overlap, the regression went away.

Talos tests open a small window before they open any test browser windows. That little test window stays in the background, and is (from my understanding) a dispatch point for making talos tests occur. That little window has a titlebar, and when we opened new browser windows, the titlebars would overlap.

Jeff suggested I try modifying the TART test to move the browser approximately 22px (the height of a standard OS X titlebar) so that they would no longer overlap. I set that up, triggered a bunch of test runs, and went to bed.
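The geometry of that workaround is simple enough to sketch. This is not the actual TART change. The Window class and the window sizes are hypothetical, and only the 22px titlebar height comes from the story above.

```python
from dataclasses import dataclass

TITLEBAR_HEIGHT = 22  # px, height of a standard OS X titlebar


@dataclass
class Window:
    x: int
    y: int
    width: int

    def titlebar(self):
        """Titlebar rectangle as (left, top, right, bottom)."""
        return (self.x, self.y, self.x + self.width, self.y + TITLEBAR_HEIGHT)


def titlebars_overlap(a, b):
    """True if the two windows' titlebar rectangles intersect."""
    ax1, ay1, ax2, ay2 = a.titlebar()
    bx1, by1, bx2, by2 = b.titlebar()
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2


harness = Window(x=0, y=0, width=300)    # the small talos window
browser = Window(x=0, y=0, width=1024)   # browser opened on top of it
moved = Window(x=0, y=TITLEBAR_HEIGHT, width=1024)

print(titlebars_overlap(harness, browser))
print(titlebars_overlap(harness, moved))
```

Shifting the browser window down by one titlebar height is the smallest move that stops the two titlebars from overlapping.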

I wasn’t able to sleep. Around 4AM, I got out of bed to look at the results – SUCCESS! The regression had gone away! Jeff was right!

I slept like a baby the rest of the night.

We closed this bug as a WONTFIX due to it being way outside our control.

Comment 31 and onward in that bug are the ones that describe our findings.

Eat it TART, your tears are delicious

Those were the big regressions we fixed for TART. It was a long haul, but we got there – and in the end, it means faster and smoother tab animation for our users, which means a better experience – and it’s totally worth it.

I’m particularly proud of the work we did here, and I’m also really happy with the cross-team support and collaboration we had – from Performance, to Layout, to Graphics, to Front-end – it was textbook teamwork.

Here’s an e-mail I wrote about us beating TART.

Where did we end up?

After the TART regression was fixed, we were set to land on mozilla-central! We didn’t just land a more beautiful browser, we also landed a more performant one.

Noice.

Stay tuned for Part 5 where I talk about CART.

Australis Performance Post-mortem Part 3: As Good As Our Tools

While working on the ts_paint and tpaint regressions, we didn’t just stab blindly at the source code. We had some excellent tools to help us along the way. We also MacGyver‘d a few of those tools to do things that they weren’t exactly designed to do out of the box. And in some cases, we built new tools from scratch when the existing ones couldn’t cut it.

I just thought I’d write about those.

MattN’s Spreadsheet

I already talked about this one in my earlier post, but I think it deserves a second mention. MattN has mad spreadsheet skills. Also, it turns out you can script spreadsheets on Google Docs to do some pretty magical things – like pull down a bunch of talos data, and graph it for you.

I think this spreadsheet was amazingly useful in getting a high-level view of all of the performance regressions. It also proved very, very useful in the next set of performance challenges that came along – but more on those later.

MattN’s got a blog post up about his spreadsheet that you should check out.

The Gecko Profiler

This is a must-have for Gecko hackers who are dealing with some kind of performance problem. The next time I hit something performance related, this is the first tool I’m going to reach for. We used a number of tools in this performance work, but I’m pretty sure this was the most powerful one in our arsenal.

Very simply, Gecko ships with a built-in sampling profiler, and there’s an add-on you can install to easily dump, view and share these profiles. That last bit is huge – you click a button, it uploads, and bam – you have a link you can send to someone over IRC to have them look at your profile. It’s sheer gold.

We also built some tools on top of this profiler, which I’ll go into in a few paragraphs.

You can read up on the Gecko Profiler here at the official documentation.

Homebrew Profiler

At one point, jaws built a very simple profiler for the CustomizableUI component, to give us a sense of how many times we were entering and exiting certain functions, and how much time we were spending in them.
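I don't have jaws' code, so the following is not it. It's a minimal sketch, in Python rather than the JavaScript the real thing would have been written in, of the same idea: count how often a function is entered and how long we spend in it. The profiled decorator and the build_area function are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function call counts and accumulated wall-clock time.
stats = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})


def profiled(fn):
    """Record the number of calls and total time spent in `fn`."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            entry = stats[fn.__name__]
            entry["calls"] += 1
            entry["total_ms"] += (time.perf_counter() - start) * 1000.0
    return wrapper


@profiled
def build_area(n):
    # Hypothetical stand-in for a function we suspect is slow.
    return sum(range(n))


for _ in range(3):
    build_area(1000)

print(dict(stats))
```

It's crude next to a sampling profiler, but it answers "how many times, and for how long?" for exactly the functions you wrap.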

Why did we build this? To be honest, it’s been too long and I can’t quite remember. We certainly knew about the Gecko Profiler at this point, so I imagine there was some deficiency with the profiler that we were dealing with.

My hypothesis is that this was when we were dealing strictly with the ts_paint / tpaint regression on Windows XP. Take a look at the graphs in my last post again. Notice how UX (red) and mozilla-central (green) converge at around July 1st on Ubuntu? And how OS X finally converges on tpaint around August 1st?

I haven’t included the Windows 7 and 8 platform graphs, but I’m reasonably certain that at this point, Windows XP was the last regressing platform on these tests.

And I know for a fact that we were having difficulty using the Gecko Profiler on Windows XP, due to this bug.

Basically, on Windows XP, the call tree wasn’t interleaving the JavaScript and native-code calls properly, so we couldn’t trust the order of the tree, making the profile really useless. This was a serious problem, and we weren’t sure how to work around it at the time.

And so I imagine that this is what prompted jaws to write the homebrew profiler. And it worked – we were able to find sections of CustomizableUI that were causing unnecessary reflow, or taking too long doing things that could be shortcutted.

I don’t know where jaws’ homebrew profiler is – I don’t have the patch on my machine, and somehow I doubt he does either. It was a tool of necessity, and I think we moved past it once we sorted out the Windows XP stack interleaving thing.

And how did we do that, exactly?

Using the Gecko Profiler on Windows XP

jaws’ profiler got us some good data, but it was limited in scope, since it only paid attention to CustomizableUI. Thankfully, at some point, Vladan from the Perf team figured out what was going wrong with the Gecko Profiler on Windows XP, and gave us a workaround that lets us get proper profiles again. I have since updated the Gecko Profiler MDN documentation to point to that workaround.

Reflow Profiles

This is where we start getting into some really neat stuff. So while we were hacking on ts_paint and tpaint, Markus Stange from the layout team wrote a patch for Gecko to take “reflow profiles”. This is a pretty big deal – instead of telling us what code is slow, a reflow profile tells us what things take a long time to lay out and paint. And, even better, it breaks it down by DOM id!

This was hugely powerful, and I really hope something like this can be built into the Gecko Profiler.

Markus’ patch can be found in this bug, but it’ll probably require de-bitrotting. If and when you apply it, you need to run Firefox with an environment variable MOZ_REFLOW_PROFILE_FILE pointing at the file you’d like the profile written out to.
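If you're scripting this, setting the variable comes down to something like the sketch below. The function name and both paths are placeholders, and it assumes a build with the patch applied. Only the MOZ_REFLOW_PROFILE_FILE name comes from the patch.

```python
import os


def reflow_profile_launch(firefox_path, output_path, base_env=None):
    """Build the command and environment for a reflow-profiling run.

    Returns (cmd, env) without launching anything.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MOZ_REFLOW_PROFILE_FILE"] = output_path
    return [firefox_path], env


# Placeholder paths; point these at your patched build and output file.
cmd, env = reflow_profile_launch(
    "/path/to/patched/firefox", "/tmp/reflow-profile.txt", base_env={}
)
print(cmd, env)
# To actually launch the browser: subprocess.Popen(cmd, env=env)
```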

Once you have that profile, you can view it on Markus’ special fork of the Gecko Profiler viewer.

This is what a reflow profile looks like:

[Screenshot of a reflow profile]

I haven’t linked to one I’ve shared because reflow profiles tend to be very large – too large to upload. If you’d like to muck about with a real reflow profile, you can download one of the reflow profiles attached to this bug and upload it to Markus’ Gecko Profiler viewer.

These reflow profiles were priceless throughout all of the Australis performance work. I cannot stress that enough. They were a way for us to focus on just a facet of the work that Gecko does – layout and painting – and determine whether or not our regressions lay there. If they did, that meant that we had to find a more efficient way to paint or lay out. And if the regressions didn’t show up in the reflow profiles, that was useful too – it meant we could eliminate graphics and layout from our pool of suspects.

Comparison Profiles

Profiles are great, but you know what’s even better? Comparison profiles. This is some more Markus Stange wizardry.

Here’s the idea – we know that ts_paint and tpaint have regressed on the UX branch. We can take profiles of both the UX and mozilla-central. What if we can somehow use both profiles and find out what UX is doing that’s uniquely different and uniquely slow?

Sound valuable? You’re damn right it is.

The idea goes like this – we take the “before” profile (mozilla-central), and weight all of its samples by -1. Then, we add the samples from the “after” profile (UX).

The stuff that is positive in the resulting profile is an indicator that UX is slower in that code path. The stuff that is negative means that UX is faster.
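Stripped right down, the weighting trick looks like the sketch below. This is not Markus' script. It treats a profile as a flat list of call stacks with one entry per sample, which is a big simplification of the real profile format, and the stacks themselves are made up.

```python
from collections import Counter


def comparison_profile(before_samples, after_samples):
    """Weight every 'before' sample by -1, then add the 'after' samples.

    Each sample is a call stack, represented here as a tuple of frame names.
    Stacks whose weights cancel out exactly are dropped.
    """
    diff = Counter()
    for stack in before_samples:
        diff[stack] -= 1
    for stack in after_samples:
        diff[stack] += 1
    return {stack: weight for stack, weight in diff.items() if weight != 0}


# Hypothetical samples: mozilla-central is "before", UX is "after".
before = ([("main", "paint")] * 3
          + [("main", "layout")] * 2
          + [("main", "gc")] * 2)
after = ([("main", "paint")] * 3
         + [("main", "layout")] * 5
         + [("main", "gc")] * 1)

for stack, weight in comparison_profile(before, after).items():
    print(weight, " > ".join(stack))
```

In this toy data, the layout stack comes out positive (UX spends more samples there) and the gc stack comes out negative (UX spends fewer).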

How did we do this? Via these scripts. There’s a script in this repository called create_comparison_profile.py that does all of the work in generating the final comparison profile.

Here’s a comparison profile to look at, with mozilla-central as “before” and UX as “after”.

Now I know what you’re thinking – Mike – the root of that comparison profile is a negative number, so doesn’t that mean that UX is faster than mozilla-central?

That would seem logical based on what I’ve already told you, except that talos consistently returns the opposite opinion. And here’s where I expose some ignorance on my part – I’m simply not sure why that root node is negative when we know that UX is slower. I never got a satisfying answer to that question. I’ll update this post if I find out.

What I do know is that drilling into the high positive numbers of these comparison profiles yielded very valuable results. It allowed us to quickly determine what was uniquely slow about UX.

And in performance work, knowing is more than half the battle – knowing what’s slow is most of the battle. Fixing it is often the easy part – it’s the finding that’s hard.

Oh, and I should also point out that these scripts were able to generate comparison profiles for reflow profiles as well. Outstanding!

Profiles from Talos

Profiling locally is all well and good, but in the end, if we don’t clear the regressions on the talos hardware that run the tests, we’re still not good enough. So that means gathering profiles on the talos hardware.

So how do we do that?

Talos is not currently baked into the mozilla-central tree. Instead, there’s a file called testing/talos/talos.json that knows about a talos repository and a revision in that repository. The talos machines then pull talos from that repository, check out that revision, and execute the talos suites on the build of Firefox they’ve been given.

We were able to use this configuration to our advantage. Markus cloned the talos repository, and modified the talos tests to be able to dump out both SPS and reflow profiles into the logs of the test runs. He then pushed those changes to his user repository for talos, and then simply modified the testing/talos/talos.json file to point to his repo and the right revision.
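The edit to talos.json amounts to swapping two values. In the sketch below, the key names ("global", "talos_repo", "talos_revision") are assumptions made for illustration, so check them against the real file, and the repository URLs and revisions are placeholders.

```python
import json


def point_talos_at(config, repo_url, revision):
    """Return a copy of the config that points at a different talos repo.

    The key names used here are assumed, not taken from the real file.
    """
    updated = json.loads(json.dumps(config))  # cheap deep copy
    updated["global"]["talos_repo"] = repo_url
    updated["global"]["talos_revision"] = revision
    return updated


# Placeholder config in the assumed shape.
original = {
    "global": {
        "talos_repo": "https://example.org/talos",
        "talos_revision": "aaaaaaaaaaaa",
    }
}
patched = point_talos_at(
    original, "https://example.org/users/someone/talos", "bbbbbbbbbbbb"
)
print(json.dumps(patched, indent=2))
```

Because the test machines just follow whatever the file says, a push to Try with this one change is enough to run a customized talos.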

The upshot being that Try would happily clone Markus’ talos, and we’d get profiles in the test logs on talos hardware! Brilliant!

Extracting and symbolicating those profiles would be handled by more of Markus’ scripts – see get_profiles.py.

Now we were cooking with gas – reflow and SPS profiles from the test hardware. Could it get better?

Actually, yes.

Getting the Good Stuff

When the talos tests run, the stuff we really care about is the stuff being timed. We care about how long it takes to paint the window, but not how long it takes to tear down the window. Unfortunately, things like tearing down the window get recorded in the SPS and reflow profiles, and that adds noise.

Wouldn’t it be wonderful to get samples just from the stuff we’re interested in? Just to get samples only when the talos test has its stopwatch ticking?

It’s actually easier than it sounds. As I mentioned, Markus had cloned the talos tests, and he was able to modify tpaint and ts_paint to his liking. He made it so that just as these tests started their stopwatches (waiting for the window to paint), an SPS profile marker was added to the sample taken at that point. A profile marker simply allows us to decorate a sample with a string. When the stopwatch stopped (the window has finished painting), we added another marker to the profile.

With that done, the extraction scripts simply had to exclude all samples that didn’t occur between those two markers.
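
The filtering step can be sketched roughly like this. This is not Markus' actual extraction script; the sample format (a list of dicts with an optional "markers" list) and the marker strings are hypothetical stand-ins for whatever the real profiles contained.

```python
def samples_between_markers(samples, start_marker, stop_marker):
    """Keep only the samples taken while the test's stopwatch was running."""
    kept = []
    recording = False
    for sample in samples:
        markers = sample.get("markers", [])
        if start_marker in markers:
            recording = True   # the stopwatch just started
        if recording:
            kept.append(sample)
        if stop_marker in markers:
            recording = False  # the stopwatch just stopped
    return kept


# Hypothetical profile: start-up and teardown noise around the timed paint.
profile = [
    {"time": 0, "frame": "startup"},
    {"time": 1, "frame": "layout", "markers": ["stopwatch-start"]},
    {"time": 2, "frame": "paint"},
    {"time": 3, "frame": "paint", "markers": ["stopwatch-stop"]},
    {"time": 4, "frame": "teardown"},
]
timed = samples_between_markers(profile, "stopwatch-start", "stopwatch-stop")
```

The samples carrying the markers are kept, so the result covers the whole timed window and nothing outside it.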

The end result? Super concentrated profiles. It’s just the stuff we care about. Markus made it work for reflow profiles too – it was really quite brilliant.

And I think that pretty much covers it.

Lessons

  • If you don’t have the tools you need, go get them.
  • If the tools you need don’t exist, build them, or find someone who can. That someone might be Markus Stange.
  • If the tools you need are broken, fix them, or find someone who can.

So with these amazing tools we were eventually able to grind down our ts_paint and tpaint regressions into dust.

And we celebrated! We were very happy to clear those regressions. We were all clear to land!

Or so we thought. Stay tuned for Part 4.


Australis Performance Post-mortem Part 2: ts_paint and tpaint

Continued from Part 1.

So we’d just gotten Talos data in, and it looked like we were regressing on ts_paint and tpaint right across the board.

Speaking just for myself, up until this point, Talos had been a black box. I vaguely knew that Talos tests were run, and I vaguely understood that they measured certain performance things, but I didn’t know what those things were nor where to look at the results.

Luckily, I was working with some pretty seasoned veterans. MattN whipped up an amazing spreadsheet that dynamically pulled in the Talos test data for each platform so that we could get a high-level view of all of the regressions. This would turn out to be hugely useful.

Here’s a link to a read-only version of that spreadsheet in all of its majesty. Or, if that link is somehow broken in the future, here’s a screenshot:

Numbers!


So now we had a high-level view of the regressions. The next step was determining what to do about it.

I should also mention that these regressions, at this point, were the only big things blocking us from landing on mozilla-central. So naturally, a good chunk of us focused our attention on this performance stuff. We quickly organized a daily standup meeting time where we could all get together and give reports on what we were doing to grind down the performance issues, and what results we were getting from our efforts.

That chunk of the team, however, didn’t initially include me. I believe Gijs, Unfocused, mikedeboer and I kept hacking on customization and widget bugs while jaws and MattN dug at performance. As time went on though, a few more of us eventually joined MattN and jaws in their performance work.

The good news in all of this is that ts_paint and tpaint are related – both measure the time it takes from issuing the command to open a browser window to actually painting it on the screen. ts_paint is concerned with the very first Firefox window from a cold-start, and tpaint is concerned with new windows from an already-running Firefox. It was quite possible that there was some overlap in what was making us slow on these two tests, which was somewhat encouraging.

The following bugs are just a subset of the bugs we filed and landed to improve our ts_paint and tpaint performance. Looking back, I’m pretty sure these are the ones that made the most difference, but the full list can be found as dependencies of these bugs.

Bug 890105 – TabsInTitleBar._update should group measurements and style changes to avoid unnecessary reflows

After a bit of examination, MattN dealt the first blow when he filed Bug 890105. The cross-platform code that figures out how best to place the tabs in the titlebar (while taking into account things like the system font size) is run before the window first paints, and it was being inefficient.

By inefficient, I mean it was causing more reflows than necessary. Here’s some information on reflows. The MDN page states that the article is obsolete, but the page still does a pretty good job of explaining what a reflow is.

The code would take a measurement of something on the page (causing a reflow), update that thing’s size (causing a reflow), and then repeat the process. MattN found we could cluster the measurements into a single pass, and then do all of the changes one after another. This reduced the number of reflows, which helped speed up both ts_paint and tpaint.
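
Here's a toy model of why grouping helps. It is a deliberately simplified simulation, not Gecko's layout engine and not the TabsInTitleBar code: the FakeLayout class and its rule (reading a size flushes any pending style change, and each flush counts as one reflow) are assumptions for illustration.

```python
class FakeLayout:
    """Toy layout engine: a measurement flushes pending style changes."""

    def __init__(self, widths):
        self.widths = dict(widths)
        self.dirty = False
        self.reflows = 0

    def measure(self, name):
        if self.dirty:
            # A pending change forces layout to run before we can answer.
            self.reflows += 1
            self.dirty = False
        return self.widths[name]

    def set_width(self, name, width):
        self.widths[name] = width
        self.dirty = True


def interleaved(layout, names):
    # Measure, change, measure, change... every read after a write reflows.
    for name in names:
        layout.set_width(name, layout.measure(name) + 10)


def batched(layout, names):
    # Take all measurements in one pass, then apply all the changes.
    measured = {name: layout.measure(name) for name in names}
    for name, width in measured.items():
        layout.set_width(name, width + 10)


names = ["a", "b", "c"]
slow = FakeLayout({n: 100 for n in names})
interleaved(slow, names)
slow.measure("a")  # final flush, standing in for the first paint

fast = FakeLayout({n: 100 for n in names})
batched(fast, names)
fast.measure("a")  # final flush, standing in for the first paint
```

Both versions end up with the same sizes, but the interleaved one pays for a reflow per element while the batched one pays for a single reflow at the end.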

And boom, we saw our first win for both ts_paint and tpaint!

Bug 892532 – Add an optional fast-path to CustomizableUI.isWidgetRemovable

jaws found the next big win using a home-brewed profiler. The home-brewed profiler simply counted the number of times we entered and exited various functions in the CustomizableUI code, and recorded the time it took from entering to exiting.
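
A profiler like that doesn't need to be fancy. The sketch below is not jaws' profiler (which lived in the CustomizableUI JavaScript); it just shows the same idea in Python, with a hypothetical is_widget_removable function standing in for the code being measured.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function totals: how many times it ran and how long it took overall.
stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def profiled(func):
    """Record each entry to and exit from func, and the time in between."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = stats[func.__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper


@profiled
def is_widget_removable(widget_id):
    # Hypothetical stand-in for the real function being profiled.
    return not widget_id.startswith("fixed-")


for widget_id in ["fixed-back", "bookmarks", "history"]:
    is_widget_removable(widget_id)
```

Sorting stats by call count or total time afterwards points at the functions that are called suspiciously often during start-up.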

I can’t really recall why we didn’t use the SPS profiler at this point. We certainly knew about it, but something tells me that at this point, we were having a hard time getting useful data from it.

Anyhow, with the home-brew profiler, jaws determined that we had the opportunity to fast-path a section of our code. Basically, we had a function that takes the ID of a widget, looks for and retrieves the widget, and returns whether or not that widget can be removed from its current location. There were some places that called this function during window start-up, and those places already had the widget that was to be found. jaws figured we could fast-path the function by being able to pass the widget itself rather than the ID, and skip the look-up.
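
In sketch form, the fast path looks something like this. The Widget and WidgetRegistry classes are hypothetical and only loosely modelled on the description above; the real CustomizableUI code is JavaScript and looks different.

```python
class Widget:
    def __init__(self, widget_id, removable=True):
        self.id = widget_id
        self.removable = removable


class WidgetRegistry:
    def __init__(self, widgets):
        self._by_id = {widget.id: widget for widget in widgets}
        self.lookups = 0  # counts how often we pay for the look-up

    def find(self, widget_id):
        self.lookups += 1
        return self._by_id[widget_id]

    def is_widget_removable(self, widget_or_id):
        # Fast path: a caller that already holds the widget skips the look-up.
        if isinstance(widget_or_id, Widget):
            widget = widget_or_id
        else:
            widget = self.find(widget_or_id)
        return widget.removable


back = Widget("back-button", removable=False)
registry = WidgetRegistry([back, Widget("bookmarks")])

by_id = registry.is_widget_removable("back-button")  # slow path, one look-up
by_widget = registry.is_widget_removable(back)       # fast path, no look-up
```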

Bug 891104 – Skip calling onOverflow during startup if there wasn’t any overflowed content before the toolbar is fully initialized

It was MattN’s turn again – this time, he found that the overflow toolbar code for the nav-bar (this is the stuff that handles putting widgets into the overflow panel if the window gets too small) was running the overflow handler as soon as the nav-bar was initialized, regardless of whether anything was overflowed. This was causing a reflow because a measurement was being taken on the overflowable toolbar to see if items needed to be moved into the overflow panel.

Originally, the automatic call of the overflow handler was to account for the case where the nav-bar is overflowed from the very beginning – but jaws made it smarter by attaching an overflow handler before the CSS attribute that made the toolbar overflowable was applied. That meant that the nav-bar would only call the overflow handler if it really needed to, as opposed to every time.

Bug 898126 – Cache client hit test values

Around this time, a few more people started to get involved in Australis performance work. Gijs and mstange got a bug filed to investigate if there was a way to make start-up faster on Windows XP and 7. Here’s some context from mstange in that bug in comment 9:

It turns out that Windows XP sends about 200 WM_NCHITTEST events per second when we open a new window. All these events have the same position – possibly the current mouse position. And all the ClientMarginHitTestPoint optimizations we’ve been playing with only make a difference because that function is called so often during the test – one invocation is unnoticeably quick, but it starts to add up if we call it so many times.

This patch makes sure that we only send one hittest event per second if the position doesn’t change, and returns a cached value otherwise.

After some fiddling about with cache invalidation times, the patch landed, and we saw a nice win on Windows XP and 7!
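
The caching idea from that comment can be sketched as follows. This is not the actual Windows widget patch (that was C++); the CachedHitTest class, its one-second max_age default and the explicit now argument are assumptions chosen to keep the example small and testable.

```python
class CachedHitTest:
    """Reuse the last hit test result while the point stays the same."""

    def __init__(self, compute, max_age=1.0):
        self.compute = compute   # the comparatively expensive hit test
        self.max_age = max_age   # seconds before the cached value expires
        self._point = None
        self._result = None
        self._stamp = None

    def hit_test(self, point, now):
        fresh = (
            self._stamp is not None
            and point == self._point
            and now - self._stamp < self.max_age
        )
        if not fresh:
            self._result = self.compute(point)
            self._point = point
            self._stamp = now
        return self._result


calls = []


def expensive_hit_test(point):
    # Hypothetical stand-in for the real hit test.
    calls.append(point)
    return "client" if point[1] > 30 else "caption"


cache = CachedHitTest(expensive_hit_test)
# 200 events in under a second, all at the same position.
results = [cache.hit_test((10, 10), now=i / 200.0) for i in range(200)]
```

All 200 events get an answer, but the expensive function only runs once. A new position or an expired cache entry triggers a fresh computation.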

Bug 906075 – Only send toolbars through buildArea if they’re not in their default state

It was around now that I started to get involved with performance work. One of my first successful bugs was to only run a toolbar through CustomizableUI’s buildArea function if the toolbar was not starting in a default state. The buildArea function’s job is to populate a customizable area with only the things that the user has moved into the area, and remove the things that the user has taken out. That involves cycling through the nodes in the area to see if they belong, and that takes time. I wrote a patch that cached a “dirty” state on a toolbar to indicate that it’d been customized in the past, and if we didn’t see that value, we didn’t run the toolbar through the function. Easy as pie, and we saw a little win on both ts_paint and tpaint on all platforms.
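
Roughly, the shape of that patch is shown below. The function names, the "customized" key and the saved-state dict are hypothetical; the real buildArea is part of CustomizableUI and does considerably more work than this stand-in.

```python
build_calls = 0  # how many times we paid for the slow path


def build_area(default_nodes, placements):
    """Slow path: make the area match what the user saved."""
    global build_calls
    build_calls += 1
    # Toy model: the saved placements fully determine the area's contents.
    return list(placements)


def init_toolbar(default_nodes, saved_state):
    # Fast path: a toolbar that was never customized is already correct.
    if not saved_state.get("customized"):
        return list(default_nodes)
    return build_area(default_nodes, saved_state["placements"])


defaults = ["back", "urlbar", "menu"]
untouched = init_toolbar(defaults, {})
calls_after_default = build_calls
customized = init_toolbar(
    defaults, {"customized": True, "placements": ["urlbar", "bookmarks"]}
)
```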

Bug 905695 – Skip checking for tab overflows if there is only one tab open

This was another case where we had an unnecessary reflow during start-up. And, like bug 891104, it involved an overflow event handler running when it really didn’t need to. jaws writes:

If only one tab is opened and we show the left/right arrows, we are actually removing quite a bit of space that could have been used to show the tab. Scrolling the tabbox in this state is also quite useless, since all the user can do is scroll to see the other parts of the *only* tab.

If we make this change, we can skip a synchronous reflow for new windows that only have one tab.

Which means we could skip a reflow for all new windows. Are you starting to notice a pattern? Sections of our code had been designed to operate the same way, regardless of whether or not they were in the default, common case. We were finding ways of detecting the default case, and fast-pathing it.

Chalk up another win!

Bug 907787 – Australis: toolbar overflow button should be hidden by default

Yet another example where we could fast-path the default case. The overflow button in the nav-bar is only supposed to be displayed if there are too many items in the nav-bar, resulting in some getting put into the overflow panel, which anchors on the overflow button.

If nothing is being overflowed and the panel is empty, the button should not be displayed.

We were, however, displaying the button by default, and then hiding it when we determined that nothing was overflowed. Bug 907787 inverted that logic, and hid the button by default, and only showed it when things got overflowed (which was not the default case).

We were getting really close to performance parity with mozilla-central…

Bug 908326 – default the navbar to overflowable to avoid needless reflowing

Once again, an example of us not greasing the default-path. Our overflowable toolbar code applies an overflowable attribute to the nav-bar in order to apply some CSS styles to give the toolbar its overflowing properties. Adding that attribute dynamically means a reflow.

Instead, we just added the attribute to the node’s definition in browser.xul, and dropped that unnecessary reflow like a hot brick.

So how far had we come?

Let’s take a look at the graphs, shall we? Remember, in these graphs, the red points represent UX, and the green represent mozilla-central. Up is bad, and down is good. Our goal was to sink the red dots down into the noise of the green dots, which would give us performance parity.

ts_paint

Windows XP - ts_paint improvements


Ubuntu - ts_paint improvements


OSX 10.6 ts_paint improvements


You might be wondering what that big jump is for ts_paint for OSX 10.6 at the end of the graph. This thread explains.

tpaint

Windows XP - tpaint improvements


Ubuntu - tpaint improvements


OSX 10.6 tpaint improvements


Looking good.

The big lessons

I think the big lesson here is to identify the common, default case, and optimize it as best you can. By definition, this is the path that’s going to be hit the most, so you can special-case it, and build in fast paths for it. Your users will thank you.

Close the feedback loop as much as you can. To test our theories, we’d push our patches to try and use compare-talos to compare our tpaint and ts_paint numbers to baseline pushes to see if we were making improvements. This requires several hours for the try builds to complete. This is super slow. Release Engineering was awesome and lent us some Windows XP talos slaves for us to experiment on, and that helped us close the feedback loop a lot. Don’t be afraid to ask Release Engineering for talos slaves.

Also note that while it’s easy for me to rattle off bug numbers and explain where we were being slow, all of that investigation and progress occurred over several months. Performance work can be really slow. The bottleneck is not making the slow code faster – the bottleneck is identifying where the slow code is. Profiling is the key here. If you’re not using some kind of profiler while doing performance work, you’re seriously impeding yourself. If you don’t have a profiler, build a simple one. If you don’t know how to build a simple one, find someone who can.

I mentioned Gecko’s built-in SPS profiler a few paragraphs back. The SPS profiler was instrumental (pun intended) in getting our performance back up to snuff. We also built a number of tools alongside the SPS profiler to help us in our analyses.

Read up about those tools we built in Part 3…
