
Research Experiment: A Recap

Before I start diving into results, I’m just going to recap my experiment so we’re all up to speed.

I’ll try to keep it short, sweet, and punchy – but remember, this is a couple of months of work right here.

Ready? Here we go.

What I was looking for

A quick refresher on what code review is

Code review is like the software industry equivalent of a taste test. A developer makes a change to a piece of software, puts that change up for review, and a few reviewers take a look at that change to make sure it’s up to snuff. If some issues are found during the course of the review, the developer can go back and make revisions. Once the reviewers give it the thumbs up, the change is put into the software.

That’s an oversimplified description of code review, but it’ll do for now.

So what?

What’s important is to know that it works. Jason Cohen showed that code review reduces the number of defects that enter the final software product. That’s great!

But there are some other cool advantages to doing code review as well.

It helps to train up new hires. They can lurk during reviews to see how more experienced developers look at the code. They get to see what’s happening in other parts of the software. They get their code reviewed, which means direct, applicable feedback. All good things.
It helps to clean and homogenize the code. Since the code will be seen by their peers, developers are generally compelled to not put up “embarrassing” code (or, if they do, to at least try to explain why they did). Code review is a great way to compel developers to keep their code readable and consistent.
It helps to spread knowledge and good practices around the team. New hires aren’t the only ones to benefit from code reviews. There’s always something you can learn from another developer, and code review is where that will happen. And I believe this is true not just for those who receive the reviews, but also for those who perform the reviews.

That last one is important. Code review sounds like an excellent teaching tool.

So why isn’t code review part of the standard undergraduate computer science education? Greg and I hypothesized that code review isn’t taught because we don’t know how to teach it.

I’ll quote myself:

What if peer code review isn’t taught in undergraduate courses because we just don’t know how to teach it? We don’t know how to fit it in to a curriculum that’s already packed to the brim. We don’t know how to get students to take it seriously. We don’t know if there’s pedagogical value, let alone how to show such value to the students.

The idea

Inspired by work by Joordens and Pare, Greg and I developed an approach to teaching code review that integrates itself nicely into the current curriculum.

Here’s the basic idea:

Suppose we have a computer programming class. Also suppose that after each assignment, each student is randomly presented with anonymized assignment submissions from some of their peers. Students will then be asked to anonymously peer grade these assignment submissions.

Now, before you go howling your head off about the inadequacy / incompetence of student markers, or the PeerScholar debacle, read this next paragraph, because there’s a twist.

The assignment submissions will still be marked by TAs as usual. The grades that a student receives from her peers will not directly affect her mark. Instead, she is graded on how well she graded her peers. The peer reviews that she completes will be compared with the grades that the TAs delivered. The closer her grades are to the TAs’ grades, the better the mark she gets on her “peer grading” component (which is distinct from the mark she receives for her programming assignment).
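To make the scoring idea concrete, here is a minimal sketch in Python. The post doesn't say how "closeness" to the TA would be measured, so the mean absolute difference used here, the linear scaling, and the function name `peer_grading_score` are all assumptions for illustration only.

```python
def peer_grading_score(peer_marks, ta_marks, max_mark):
    """Score a student's grading by how close it is to the TA's grading.

    peer_marks and ta_marks map a submission ID to a mark out of max_mark.
    Returns a value from 0.0 (as far off as possible) to 1.0 (identical).
    The distance measure is an assumption, not the experiment's method.
    """
    if not peer_marks:
        raise ValueError("the student did not grade any submissions")
    # Mean absolute difference between the student's mark and the TA's mark.
    total_diff = sum(abs(mark - ta_marks[sub_id])
                     for sub_id, mark in peer_marks.items())
    mean_diff = total_diff / len(peer_marks)
    # Scale so a perfect match scores 1.0 and the worst case scores 0.0.
    return 1.0 - mean_diff / max_mark


ta = {"sub-1": 8, "sub-2": 5, "sub-3": 10}
student = {"sub-1": 7, "sub-2": 5, "sub-3": 8}
print(peer_grading_score(student, ta, max_mark=10))
```

A real scheme would probably compare marks per rubric criterion rather than per submission, but the shape of the calculation would be similar.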

Now, granted, the idea still needs some fleshing out, but already, we’ve got some questions that need answering:

Joordens and Pare showed that for short written assignments, you need about 5 peer reviews to predict the mark that the TA will give. Is this also true for computer programming assignments?
Grading students based on how much their peer grading matches TA grading assumes that the TA is an infallible point of reference. How often do TAs disagree amongst themselves?
Would peer grading like this actually make students better programmers? Is there a significant difference in the quality of their programming after they perform the grading?
What would students think of peer grading computer programming assignments? How would they feel about it?

So those were my questions.

How I went about looking for the answers

Here’s the design of the experiment in a nutshell:

Writing phase

I have a treatment group, and a control group. Both groups are composed of undergraduate students. After writing a short pre-experiment questionnaire, participants in both groups will have half an hour to work on a short programming assignment. The treatment group will then have another half an hour to peer grade some submissions for the assignment they just wrote. The submissions that they mark will be mocked up by me, and will be the same for each participant in the treatment group. The control group will not perform any grading – instead, they will do an unrelated vocabulary exercise for the same amount of time. Then, participants in either group will have another half an hour to work on the second short programming assignment. Participants in my treatment group will write a short post-experiment questionnaire to get their impressions on their peer grading experience. Then the participants are released.

Here’s a picture to help you visualize what you just read.

So now I’ve got two piles of submissions – one for each assignment, 60 submissions in total. I add my mock-ups to each pile. That means 35 submissions in each pile, and 70 submissions in total.

Marking phase

I assign ID numbers to each submission, shuffle them up, and hand them off to some graduate-level TAs that I hired. The TAs will grade each assignment using the same marking rubric that the treatment group used to peer grade. They will not know if they are grading a treatment group submission, a control group submission, or a mock-up.
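The blinding step can be sketched in a few lines of Python. This is only an illustration of the idea: the tuple layout, the function name `blind_and_shuffle`, and the sequential ID scheme are assumptions, not a record of how the submissions were actually handled.

```python
import random


def blind_and_shuffle(submissions, seed=None):
    """Give each submission an opaque ID and return them in random order.

    submissions is a list of (author, group, text) tuples. Returns the
    (blind_id, text) pairs handed to the graders, plus a private key that
    maps each blind_id back to (author, group).
    """
    rng = random.Random(seed)
    ids = list(range(1, len(submissions) + 1))
    rng.shuffle(ids)  # the ID itself gives no hint about author or group
    key = {}
    for_graders = []
    for blind_id, (author, group, text) in zip(ids, submissions):
        key[blind_id] = (author, group)
        for_graders.append((blind_id, text))
    rng.shuffle(for_graders)  # the grading order is random as well
    return for_graders, key


pile = [
    ("student-01", "treatment", "def f(): ..."),
    ("student-02", "control", "def g(): ..."),
    ("mock-up-1", "mock-up", "def h(): ..."),
]
for_graders, key = blind_and_shuffle(pile)
```

Only `for_graders` goes to the TAs; the `key` stays with the experimenter until the marking is done.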

Choosing phase

After the grading is completed, I remove the mock-ups, and pair up submissions in both piles based on who wrote them. So now I’ve got 30 pairs of submissions: one for each student. I then ask my graders to look at each pair, knowing that they’re both written by the same student, and to choose which one they think is better coded, and to rate and describe the difference (if any) between the two. This is an attempt to catch possible improvements in the treatment group’s code that might not be captured in the marking rubric.

So that’s what I did

So everything you’ve just read is what I’ve just finished doing.

Once the submissions are marked, I’ll analyze the marks for the following:

Comparing the two groups, is there any significant improvement in the marks from the first assignment to the second in the treatment group?
    If there was an improvement, on which criteria? And how much of an improvement?
How did the students do at grading my mock-ups? How similar were their peer grades to what the TAs gave?
How much did my two graders agree with one another?
During the choosing phase, did my graders tend to choose the second assignment over the first assignment more often for the treatment group?
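Two of the questions above boil down to simple tallies, sketched here in Python. The data shapes and the function names `agreement` and `second_chosen_rate` are assumptions made for illustration, and the sketch deliberately stops short of any significance testing, which the real analysis would need.

```python
from collections import Counter


def agreement(marks_a, marks_b):
    """Return (exact agreement rate, mean absolute difference) for two graders.

    marks_a and marks_b map a submission ID to the mark each grader gave.
    """
    shared = sorted(set(marks_a) & set(marks_b))
    if not shared:
        raise ValueError("the graders have no submissions in common")
    exact = sum(1 for s in shared if marks_a[s] == marks_b[s])
    total_diff = sum(abs(marks_a[s] - marks_b[s]) for s in shared)
    return exact / len(shared), total_diff / len(shared)


def second_chosen_rate(choices):
    """Fraction of pairs, per group, where the second assignment was chosen.

    choices is a list of (group, picked) tuples, where picked is
    "first", "second" or "same".
    """
    totals = Counter(group for group, _ in choices)
    seconds = Counter(group for group, picked in choices if picked == "second")
    return {group: seconds[group] / totals[group] for group in totals}
```

For example, `second_chosen_rate([("treatment", "second"), ("control", "first")])` gives a rate for each group that can then be compared.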

And I’ll also analyze the post-experiment questionnaire to get student feedback on their grading experience.

Ok, so that’s where I’m at. Stay tuned for results.

Welp, I did it.

I have successfully run my experiment on 30 participants.

Both of my graders have finished marking.

I’ve begun data analysis. Details soon.

[Image: a gorilla high-fiving a shark in front of an explosion. Nice. Credit: Dr. McNinja (http://www.drmcninja.com/)]

Look down. Now back up again. The Defects are now Issues.

From my software development experience, there are a few different words for the generic notion of a “problem”.

Wikipedia, for example, defines a bug (or defect) as follows:

A software bug is the common term used to describe an error, flaw, mistake, failure, or fault in a computer program or system that produces an incorrect or unexpected result, or causes it to behave in unintended ways. Most bugs arise from mistakes and errors made by people in either a program’s source code or its design, and a few are caused by compilers producing incorrect code.

An issue goes a level higher – a bug is an issue, but an issue might not be a bug. Wikipedia says:

In computing, the term issue is a unit of work to accomplish an improvement in a system. An issue could be a bug, a requested feature, task, missing documentation, and so forth. The word “issue” is popularly misused in lieu of “problem.” This usage is probably related.

Where am I going with all of this?

Well, remember when I said I was going to add defect reporting/tracking capabilities to Review Board? I asked for some feedback on my UI mockups on the developer mailing list, and an interesting conversation on terminology erupted.

Anyhow, the long and the short of it is – we’re going to be calling “problems still existing within a review request revision” issues. And this is distinct from the sort of thing that might show up in an issue tracker.

Maybe down the line, we’ll have a way for administrators to set their own word for it. From the thread, it sounds like everybody and their brother has their own favourite terminology. “Issue” will have to do for now.

Thanks again to everyone on the list who contributed to the conversation.

Filing Defects in Review Board

In my last post, I talked about an extension for Review Board that would allow users to register “defects”, “TODOs” or “problems” with code that’s up for review.

After chatting with the lead RB devs for a bit, we’ve decided to scrap the extension.

[audible gasp, booing, hissing]

Instead, we’re just going to put it in the core of Review Board.

[thundering applause]

Defects

Why is this useful? I’ve got a few reasons for you:

It’ll be easier for reviewees to keep track of things left to fix, and similarly, it’ll be harder for reviewees to accidentally skip over fixing a defect that a reviewer has found
My statistics extension will be able to calculate useful things like defect detection rate, and defect density
Maybe it’s just me, but checking things off as “fixed” or “completed” is really satisfying
Who knows, down the line, I might code up an extension that lets you turn finding/closing defects into a game

However, since we’re adding this to the core of Review Board, we have to keep it simple. One of Review Board’s biggest strengths is its total lack of clutter. No bells. No whistles. Just the things you need to get the job done. Let the extensions bring the bells and whistles.

So that means creating a bare-bones defect-tracking mechanism and UI, and leaving it open for extension. Because who knows, maybe there are some people out there who want to customize what kind of defects they’re filing.

I’ve come up with a design that I think is pretty simple and clean. And it doesn’t rock the boat – if you’re not interested in filing defects, your Review Board experience stays the same.

Filing a Defect

I propose adding a simple checkbox to the comment dialog to indicate that this comment files a defect, like so:

No bells. No whistles. Just a simple little checkbox.

While I’m in there, I’ll try to toss in some hooks so that extension developers can add more fields – for example, the classification or the priority of the defect. By default, however, it’s just a bare-bones little checkbox.

So far, so good. You’ve filed a defect. Maybe this is what it’ll look like in the in-line comment viewer:

[Screenshot: the inline comment viewer showing that a defect report has been filed. Caption: “A defect has been reported!”]

Two Choices

A reviewer can file defect reports, and the reviewee is able to act on them.

Let’s say I’m the reviewee. I’ve just gotten a review, and I’ve got my editor / IDE with my patch waiting in the background. I see a few defect reports have been filed. For the ones I completely agree with, I fix them in my editor, and then go back to Review Board and mark them as Fixed.

[Screenshot: the defect report marked as fixed. Caption: “All fixed!”]

It’s also possible that I might not agree with one or more of the defect reports. In this case, I’ll reply to the comment to argue my case. I might also mark the defect report as Pass, which means, “I’ve seen it, but I think I’ll pass on that”.

[Screenshot: the defect report marked as “pass”. Caption: “I think I’ll pass on that, thanks.”]
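The workflow so far amounts to a comment with an optional defect flag and three possible states. Here is a minimal Python sketch of that model. The class and field names (`Comment`, `DefectStatus`, `files_defect`) are hypothetical and are not taken from Review Board's code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DefectStatus(Enum):
    OPEN = "open"    # filed by the reviewer, not yet acted on
    FIXED = "fixed"  # the reviewee fixed it in the patch
    PASS = "pass"    # the reviewee has seen it and is passing on it


@dataclass
class Comment:
    text: str
    files_defect: bool = False  # the checkbox in the comment dialog
    status: Optional[DefectStatus] = None

    def __post_init__(self):
        # A comment that files a defect starts out open.
        if self.files_defect and self.status is None:
            self.status = DefectStatus.OPEN

    def resolve(self, status):
        """Mark the defect as FIXED or PASS on behalf of the reviewee."""
        if not self.files_defect:
            raise ValueError("this comment does not file a defect")
        if status not in (DefectStatus.FIXED, DefectStatus.PASS):
            raise ValueError("a defect can only be resolved as FIXED or PASS")
        self.status = status


def open_defects(comments):
    """Return the defects the reviewee still has to deal with."""
    return [c for c in comments if c.status is DefectStatus.OPEN]
```

A plain comment never gets a status, which matches the goal that nothing changes for people who aren't interested in filing defects.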

These comments and defect reports are also visible in the review request details page:

[Screenshot: the review request detail page with a defect report filed. Caption: “A defect has been filed.”]

[Screenshot: the review request detail page with the defect marked as fixed. Caption: “All fixed up.”]

[Screenshot: the review request detail page with the defect report passed on. Caption: “It’s all good - just pass this defect report.”]

Thoughts?

What do you think? Am I on the right track? Am I missing a case? Does “pass” make sense? Will this be useful? I’d love to hear your thoughts.

Review Board Statistics Extensions: Karma, Stopwatch, and FixIt

I just spent the long weekend in Ottawa and Québec City with my parents and my girlfriend Em.

During the long drive back to Toronto from Québec City, I had plenty of time to think about my GSoC project, and where I want to go with it once GSoC is done.

Here’s what I came up with.

Detach Reviewing Time from Statistics

I think it’s a safe assumption that my reviewing-time extension isn’t going to be the only one to generate useful statistical data.

So why not give extension developers an easy mechanism to display statistical data for their extension?

First, I’m going to extract the reviewing-time recording portion of the extension. Then, RB-Stats (or whatever I end up calling it) will introduce its own set of hooks for other extensions to register with. This way, if users want some stats, there will be one place to go to get them. And if an extension developer wants to make some statistics available, a lot of the hard work will already be done for them.

And if an extension has the capability of combining its data with another extension’s data to create a new statistic, we’ll let RB-Stats manage all of that business.
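As a rough picture of what such a registration mechanism could look like, here is a small Python sketch. Everything in it is hypothetical: `StatsRegistry`, `register`, and `register_derived` are invented for illustration and are not part of Review Board's extension framework.

```python
class StatsRegistry:
    """One place where extensions register the statistics they can provide."""

    def __init__(self):
        self._providers = {}  # stat name -> function(user) -> value
        self._derived = {}    # stat name -> (dependency names, combine function)

    def register(self, name, compute):
        """Register a statistic computed by a single extension."""
        self._providers[name] = compute

    def register_derived(self, name, depends_on, combine):
        """Register a statistic built from other extensions' statistics."""
        self._derived[name] = (tuple(depends_on), combine)

    def collect(self, user):
        """Compute every available statistic for the given user."""
        stats = {name: compute(user)
                 for name, compute in self._providers.items()}
        for name, (deps, combine) in self._derived.items():
            # Skip derived stats whose source extensions aren't registered.
            if all(dep in stats for dep in deps):
                stats[name] = combine(*(stats[dep] for dep in deps))
        return stats


registry = StatsRegistry()
registry.register("stopwatch.hours_reviewing", lambda user: 2.0)
registry.register("fixit.defects_found", lambda user: 6)
registry.register_derived(
    "defects_per_hour",
    ["fixit.defects_found", "stopwatch.hours_reviewing"],
    lambda defects, hours: defects / hours,
)
print(registry.collect("mike"))
```

The derived statistic only shows up when both of its source extensions have registered, so a missing extension degrades gracefully instead of raising an error.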

Stopwatch

The reviewing-time feature of RB-Stats will become an extension on its own, and register its data with RB-Stats. Once RB-Stats and Stopwatch are done, we should be feature equivalent with my demo.

Review Karma

I kind of breezed past this in my demo, but I’m interested in displaying “review karma”. Review karma is the reviews/review-requests ratio.

But I’m not sure karma is the right word. It suggests that a low ratio (many review requests, few reviews) is a bad thing. I’m not so sure that’s true.
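The ratio itself is a one-liner, sketched below in Python. The function name and the decision to return `None` for someone who has never posted a review request are assumptions; the post doesn't say how that case should be handled.

```python
def review_karma(reviews_given, review_requests_posted):
    """Return the reviews / review-requests ratio for one user.

    Returns None when the user has posted no review requests, since the
    ratio is undefined in that case (an assumption, not a decided rule).
    """
    if review_requests_posted == 0:
        return None
    return reviews_given / review_requests_posted


print(review_karma(reviews_given=6, review_requests_posted=3))
print(review_karma(reviews_given=1, review_requests_posted=4))
```

The second call is the "low ratio" case described above: many review requests, few reviews.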

Still, I wonder what the impact of displaying review karma will be. Not just in the RB-Stats statistics view, but next to user names? Will there be an impact on review activity when we display this “reputation” value?

FixIt

This is a big one.

Most code review tools allow reviewers to register “defects”, “todos” or “problems” with the code up for review. This makes it easier for reviewees to keep track of things to fix, and things that have already been taken care of. It’s also useful in that it helps generate interesting statistics like defect density and defect detection rate (assuming Stopwatch is installed and enabled).
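For a sense of what those two statistics involve, here is a small Python sketch. The post doesn't define them, so the definitions used here are assumptions based on common usage: defect density as defects per thousand lines reviewed, and defect detection rate as defects found per hour of reviewing (which is why a timer like Stopwatch would be needed).

```python
def defect_density(defects_found, lines_reviewed):
    """Defects per 1000 lines of code reviewed (assumed definition)."""
    if lines_reviewed <= 0:
        raise ValueError("lines_reviewed must be positive")
    return defects_found / (lines_reviewed / 1000)


def defect_detection_rate(defects_found, review_seconds):
    """Defects found per hour of reviewing time (assumed definition)."""
    if review_seconds <= 0:
        raise ValueError("review_seconds must be positive")
    return defects_found / (review_seconds / 3600)


# 3 defects in a 1500-line change, found in 2 hours of reviewing.
print(defect_density(3, 1500))
print(defect_detection_rate(3, 2 * 3600))
```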

I’m going to tackle this extension as soon as RB-Stats, Stopwatch and Karma are done. At this point, I’m quite confident that the current extension framework can more or less handle this.

Got any more ideas for me? Or maybe an extension wish-list? Let me know.

A Blog by Mike Conley

The personal blog of a Toronto-based software mechanic, musician, sound designer, and theatre enthusiast.