A major part of my Master’s degree requirements was my research paper. If you heard me lament over the past year or so about my “thesis”, I was referring to this research paper.
Anyhow, after lots of hard work, my research paper was finally signed off by my supervisor, Dr. Greg Wilson, and second reader Dr. Yuri Takhteyev. A huge thanks to both of them!
Here’s the abstract, followed by a download link for the PDF. Enjoy!
Abstract
Peer code review is commonly used in the software development industry to identify and fix problems during the development process. An additional benefit is that it seems to help spread knowledge and expertise around the team conducting the review. So is it possible to leverage peer code review as a learning tool? Our experiment results show that peer code review seems to cause a performance boost in students. They also show that the average total peer mark generated by students seems to be similar to the total mark that a graduate-level teaching assistant might give. We found that students agree that peer code review teaches them something – however, we also found they do not enjoy grading their peers’ work. We are encouraged by these results, and feel that they are a strong motive for further research in this area.
While the benefits of code review are proven, documented, numerous and awesome, it doesn’t change the fact that most people, in general, don’t like doing it.
I guess code review just isn’t really all that fun.
So a few months ago, I broadcast the idea of turning code review into a game. It was my way of trying to mix things up – “let’s add points, and have reviewers/developers competing to be the best participant in the code review process”.
Well, if there’s one thing that my supervisor Greg has taught me, it’s how I shouldn’t rush headlong into something before all of the facts are in. So before I decide to do something like game-ifize code review, I should take a look at some prior work in the area…
In particular, check out the following slide-show. Flip through it if you have the time. If you don’t have the time, scroll down, where I get to the salient point with respect to game-ificating code review.
Sebastian seems to be saying that adding points to apps and trying to incite competition does not make something a game. If it did, then this should be countless hours of fun.
Without play, there is no game. Points do not equal a game. It’s not nearly that simple.
Free Pizza and Pop
I’m going to divert for a second here.
Last week, a company set up a couple of booths in the lobby of the Bahen Centre where I work. They were there to recruit university students to work for their company – either as interns, or full-timers.
They were also handing out free pizza and pop.
Needless to say, I wanted a few slices – but I figured it would be polite if I engaged them in conversation before waltzing off with some of the free food and drink they’d brought.
So I sparked up a conversation with one of the recruiters, and he told me about the company. I’m going to call this recruiter Vlad.
I ended up gently steering the conversation towards code review, and I asked my inevitable question:
“So, do you guys do code review?”
I felt like a dentist asking a patient if he’s been flossing. Vlad waffled a bit, but the general impression was:
“Not as much as we should. We don’t have a prescribed workflow. It’d be hard to persuade all of the teams to do it.”
And then we started talking about code review in general. It turns out that Vlad had worked in a few companies where they’d done code review, and he always felt a little short-changed. He said something along the lines of:
“I never felt compelled to do reviews. They just sort of happened…and I did it, and it felt like…unrecognized effort. I mean, what’s the incentive? Do you know what I mean? There’s incentive for the software, but I’m talking incentive for me. And some people did really lousy reviews…but my reviews were treated the same as theirs. I didn’t get recognized, and didn’t get rewarded if I did a good review. So it was hard for me to do them. I want to be recognized for my good reviews, for my good contributions.”
I wish I’d had a tape-recorder running so I could have gotten Vlad’s exact words. But that’s what I remember him saying.
Feedback and Recognition
Maybe instead of trying to game-ulize code review, I can instead hear what Vlad is saying and work off of that.
With the code review that Vlad participated in, all of the feedback went to the code author, and none went to the reviewers. And the reviewers are the ones who are doing all of the heavy lifting! As a reviewer, Vlad also wants feedback, and recognition for code review done well.
There’s a company in Toronto that specializes in feedback like this. They’re one of the major players in the Toronto start-up scene, and have built a pretty sweet suite of tools to facilitate quick and easy feedback/recognition.
The company is called Rypple. And maybe that’s the name of the application, too. (checks website) Yeah, it’s both.
So Rypple has this feature called Kudos that lets people publicly acknowledge the good work of their team.
Normally, I don’t pimp companies. And it upsets me when people comment on my blog, and their sub-text is to try to sell their product or service. However, I think this video is relevant, so I’m posting their demo video so you can see how Kudos work:
So Rypple’s idea is to have a feed that the team subscribes to, and publicly display things like Kudos. The badges for the Kudos are also limited in how many you can give per week, so they’re a valuable commodity that can’t just be handed out all over the place. Cool idea.
So there’s one approach – use a service like Rypple to give your reviewers better feedback and recognition.
Or maybe we could build an extension for Review Board that does something similar, and more oriented around code review.
It’s not oriented like a game, like I had originally envisioned. But somehow, I think this idea has more meaning and traction than just “adding points”.
More on this idea in a few days. But please, comment if you have any thoughts or ideas to add.
If you’ve read about my experiment, you’ll know that there were two Python programming assignments that my participants worked on, and a rubric for each assignment.
There were also 5 mock-up submissions for each assignment that I had my participants grade. I developed these mock-ups, after a few consultations with some of our undergraduate instructors, in order to get a sense of the kind of code that undergraduate programmers tend to submit.
I’ve decided to post these materials to this blog, in case somebody wants to give them a once over. Just thought I’d open my science up a little bit.
Before I start diving into results, I’m just going to recap my experiment so we’re all up to speed.
I’ll try to keep it short, sweet, and punchy – but remember, this is a couple of months of work right here.
Ready? Here we go.
What I was looking for
A quick refresher on what code review is
Code review is like the software industry equivalent of a taste test. A developer makes a change to a piece of software, puts that change up for review, and a few reviewers take a look at that change to make sure it’s up to snuff. If some issues are found during the course of the review, the developer can go back and make revisions. Once the reviewers give it the thumbs up, the change is put into the software.
That’s an oversimplified description of code review, but it’ll do for now.
So what?
What’s important is to know that it works. Jason Cohen showed that code review reduces the number of defects that enter the final software product. That’s great!
But there are some other cool advantages to doing code review as well.
It helps to train up new hires. They can lurk during reviews to see how more experienced developers look at the code. They get to see what’s happening in other parts of the software. They get their code reviewed, which means direct, applicable feedback. All good things.
It helps to clean and homogenize the code. Since the code will be seen by their peers, developers are generally compelled to not put up “embarrassing” code (or, if they do, to at least try to explain why they did). Code review is a great way to compel developers to keep their code readable and consistent.
It helps to spread knowledge and good practices around the team. New hires aren’t the only ones to benefit from code reviews. There’s always something you can learn from another developer, and code review is where that will happen. And I believe this is true not just for those who receive the reviews, but also for those who perform the reviews.
That last one is important. Code review sounds like an excellent teaching tool.
So why isn’t code review part of the standard undergraduate computer science education? Greg and I hypothesized that code review isn’t taught because we don’t know how to teach it.
I’ll quote myself:
What if peer code review isn’t taught in undergraduate courses because we just don’t know how to teach it? We don’t know how to fit it in to a curriculum that’s already packed to the brim. We don’t know how to get students to take it seriously. We don’t know if there’s pedagogical value, let alone how to show such value to the students.
The idea
Inspired by work by Joordens and Pare, Greg and I developed an approach to teaching code review that integrates itself nicely into the current curriculum.
Here’s the basic idea:
Suppose we have a computer programming class. Also suppose that after each assignment, each student is randomly presented with anonymized assignment submissions from some of their peers. Students will then be asked to anonymously peer grade these assignment submissions.
Now, before you go howling your head off about the inadequacy / incompetence of student markers, or the PeerScholar debacle, read this next paragraph, because there’s a twist.
The assignment submissions will still be marked by TAs as usual. The grades that a student receives from their peers will not directly affect their mark. Instead, the student is graded based on how well they graded their peers. The peer reviews that a student completes will be compared with the grades that the TAs delivered. The closer a student’s grades are to the TAs’ grades, the better the mark they get on their “peer grading” component (which is distinct from the mark they receive for their programming assignment).
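To make “the closer a student is to the TA” a bit more concrete, here is a minimal Python sketch. The idea above doesn’t pin down how closeness is measured, so the mean absolute difference used here, the function name `peer_grading_score` and the submission IDs are all my own assumptions for illustration.

```python
def peer_grading_score(peer_marks, ta_marks, max_mark):
    """Score a student's grading by how close it is to the TA's marks.

    peer_marks and ta_marks map a submission ID to the total mark given.
    Returns a value between 0.0 (as far off as possible) and 1.0 (identical).
    Closeness is measured as mean absolute difference, which is an assumption.
    """
    if not peer_marks:
        raise ValueError("the student did not grade any submissions")
    total_error = 0
    for submission_id, peer_mark in peer_marks.items():
        total_error += abs(peer_mark - ta_marks[submission_id])
    mean_error = total_error / len(peer_marks)
    return 1.0 - mean_error / max_mark


# Hypothetical marks out of 10 for three submissions.
ta = {"sub-01": 8, "sub-02": 5, "sub-03": 10}
student = {"sub-01": 7, "sub-02": 5, "sub-03": 8}
score = peer_grading_score(student, ta, max_mark=10)
```

A student who matches the TA exactly gets 1.0, and the score drops as their marks drift away. Other measures (per-criterion differences, or rank correlation) might turn out to be fairer.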
Now, granted, the idea still needs some fleshing out, but already, we’ve got some questions that need answering:
Joordens and Pare showed that for short written assignments, you need about 5 peer reviews to predict the mark that the TA will give. Is this also true for computer programming assignments?
Grading students based on how much their peer grading matches TA grading assumes that the TA is an infallible point of reference. How often do TAs disagree amongst themselves?
Would peer grading like this actually make students better programmers? Is there a significant difference in the quality of their programming after they perform the grading?
What would students think of peer grading computer programming assignments? How would they feel about it?
So those were my questions.
How I went about looking for the answers
Here’s the design of the experiment in a nutshell:
Writing phase
I have a treatment group, and a control group. Both groups are composed of undergraduate students. After writing a short pre-experiment questionnaire, participants in both groups will have half an hour to work on a short programming assignment. The treatment group will then have another half an hour to peer grade some submissions for the assignment they just wrote. The submissions that they mark will be mocked up by me, and will be the same for each participant in the treatment group. The control group will not perform any grading – instead, they will do an unrelated vocabulary exercise for the same amount of time. Then, participants in either group will have another half an hour to work on the second short programming assignment. Participants in my treatment group will write a short post-experiment questionnaire to get their impressions on their peer grading experience. Then the participants are released.
Here’s a picture to help you visualize what you just read.
So now I’ve got two piles of submissions – one for each assignment, 60 submissions in total. I add my mock-ups to each pile. That means 35 submissions in each pile, and 70 submissions in total.
Marking phase
I assign ID numbers to each submission, shuffle them up, and hand them off to some graduate level TAs that I hired. The TAs will grade each assignment using the same marking rubric that the treatment group used to peer grade. They will not know if they are grading a treatment group submission, a control group submission, or a mock-up.
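I did this step by hand, but the blinding is easy to picture in code. Here is a rough Python sketch of it; the function name `blind_submissions`, the ID format and the tiny example pile are made up for illustration.

```python
import random


def blind_submissions(submissions, seed=None):
    """Shuffle submissions and give each one an opaque ID.

    submissions is a list of (source, code) tuples, where source is
    "treatment", "control" or "mock-up". Returns (for_graders, key):
    for_graders is a list of (id, code) with no source information, and
    key maps each ID back to its source for use after grading is done.
    """
    rng = random.Random(seed)
    shuffled = list(submissions)  # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    for_graders = []
    key = {}
    for number, (source, code) in enumerate(shuffled, start=1):
        submission_id = f"S{number:03d}"
        for_graders.append((submission_id, code))
        key[submission_id] = source
    return for_graders, key


# A tiny hypothetical pile; the real piles had 35 submissions each.
pile = [("treatment", "code A"), ("control", "code B"), ("mock-up", "code C")]
for_graders, key = blind_submissions(pile, seed=1)
```

The graders only ever see `for_graders`. The `key` stays with the experimenter until the marks are in.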
Choosing phase
After the grading is completed, I remove the mock-ups, and pair up the submissions across the two piles based on who wrote them. So now I’ve got 30 pairs of submissions: one for each student. I then ask my graders to look at each pair, knowing that they’re both written by the same student, and to choose which one they think is better coded, and to rate and describe the difference (if any) between the two. This is an attempt to catch possible improvements in the treatment group’s code that might not be captured in the marking rubric.
So that’s what I did
So everything you’ve just read is what I’ve just finished doing.
Once the submissions are marked, I’ll analyze the marks for the following:
Compared with the control group, does the treatment group show any significant improvement in marks from the first assignment to the second?
If there was an improvement, on which criteria? And how much of an improvement?
How did the students do at grading my mock-ups? How similar were their peer grades to what the TAs gave?
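For the first of those questions, the basic comparison might look something like the Python sketch below. The marks in it are placeholders I invented so the example runs; they are not results from the experiment. The sketch also stops at a difference of means, so it says nothing about whether the difference is significant.

```python
from statistics import mean


def mean_improvement(marks):
    """Average change in mark from the first assignment to the second.

    marks is a list of (first_assignment_mark, second_assignment_mark)
    tuples, one per student.
    """
    return mean(second - first for first, second in marks)


# Placeholder marks, invented for illustration only; not experiment data.
treatment = [(10, 13), (12, 14), (8, 9)]
control = [(11, 11), (9, 10), (12, 11)]

# A positive value means the treatment group improved more than the control.
difference = mean_improvement(treatment) - mean_improvement(control)
```

A real analysis would follow this up with a proper statistical test on the per-student improvements.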
I just spent the long weekend in Ottawa and Québec City with my parents and my girlfriend Em.
During the long drive back to Toronto from Québec City, I had plenty of time to think about my GSoC project, and where I want to go with it once GSoC is done.
Here’s what I came up with.
Detach Reviewing Time from Statistics
I think it’s a safe assumption that my reviewing-time extension isn’t going to be the only one to generate useful statistical data.
So why not give extension developers an easy mechanism to display statistical data for their extension?
First, I’m going to extract the reviewing-time recording portion of the extension. Then, RB-Stats (or whatever I end up calling it) will introduce its own set of hooks for other extensions to register with. This way, if users want some stats, there will be one place to go to get them. And if an extension developer wants to make some statistics available, a lot of the hard work will already be done for them.
And if an extension has the capability of combining its data with another extension’s data to create a new statistic, we’ll let RB-Stats manage all of that business.
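Here is a rough Python sketch of what that registration might look like. None of this is Review Board’s actual extension API: the `StatsRegistry` class, its methods and the statistic names are all hypothetical, and are only meant to show the shape of the idea.

```python
class StatsRegistry:
    """Hypothetical central place where extensions register statistics."""

    def __init__(self):
        self._providers = {}

    def register(self, name, provider):
        """Register a zero-argument callable that returns a statistic."""
        if name in self._providers:
            raise ValueError(f"statistic {name!r} is already registered")
        self._providers[name] = provider

    def register_derived(self, name, inputs, combine):
        """Register a statistic computed from other registered statistics."""
        def provider():
            return combine(*(self._providers[i]() for i in inputs))
        self.register(name, provider)

    def collect(self):
        """Call every provider and return a {name: value} dictionary."""
        return {name: provider() for name, provider in self._providers.items()}


registry = StatsRegistry()
# Two extensions each contribute one (made-up) statistic...
registry.register("review_seconds", lambda: 5400)
registry.register("reviews_completed", lambda: 3)
# ...and a third statistic is built by combining them.
registry.register_derived(
    "seconds_per_review",
    ["review_seconds", "reviews_completed"],
    lambda seconds, reviews: seconds / reviews,
)
stats = registry.collect()
```

The point of the design is that the statistics view only needs to call `collect()`, and never has to know which extension produced which number.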
Stopwatch
The reviewing-time feature of RB-Stats will become an extension on its own, and register its data with RB-Stats. Once RB-Stats and Stopwatch are done, we should be feature equivalent with my demo.
Review Karma
I kind of breezed past this in my demo, but I’m interested in displaying “review karma”. Review karma is the reviews/review-requests ratio.
But I’m not sure karma is the right word. It suggests that a low ratio (many review requests, few reviews) is a bad thing. I’m not so sure that’s true.
Still, I wonder what the impact of displaying review karma will be. Not just in the RB-Stats statistics view, but next to user names? Will there be an impact on review activity when we display this “reputation” value?
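For the curious, the ratio itself is about as simple as it gets. Here is a Python sketch; the function name `review_karma` and the choice to return `None` for somebody who has never posted a review request are my own assumptions.

```python
def review_karma(reviews_given, review_requests_posted):
    """Return the reviews / review-requests ratio.

    Returns None when the user has posted no review requests, since the
    ratio is undefined in that case.
    """
    if review_requests_posted == 0:
        return None
    return reviews_given / review_requests_posted


# Someone who has done 6 reviews and posted 3 review requests.
karma = review_karma(6, 3)
```

How to display that undefined case is one more thing to figure out before putting the number next to user names.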
FixIt
This is a big one.
Most code review tools allow reviewers to register “defects”, “todos” or “problems” with the code up for review. This makes it easier for reviewees to keep track of things to fix, and things that have already been taken care of. It’s also useful in that it helps generate interesting statistics like defect density and defect detection rate (assuming Stopwatch is installed and enabled).
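As a sketch of how those two statistics could be computed, here is some Python. I’m assuming the common definitions (defects per thousand lines reviewed, and defects found per hour of reviewing), and the `ReviewRecord` class and its field names are hypothetical, not anything in Review Board.

```python
from dataclasses import dataclass


@dataclass
class ReviewRecord:
    defects_found: int     # would come from something like FixIt
    lines_reviewed: int    # size of the change under review
    review_seconds: float  # would come from something like Stopwatch


def defect_density(record):
    """Defects per thousand lines reviewed (an assumed definition)."""
    if record.lines_reviewed <= 0:
        raise ValueError("lines_reviewed must be positive")
    return record.defects_found * 1000 / record.lines_reviewed


def defect_detection_rate(record):
    """Defects found per hour of reviewing (an assumed definition)."""
    if record.review_seconds <= 0:
        raise ValueError("review_seconds must be positive")
    return record.defects_found * 3600 / record.review_seconds


# Made-up numbers: 3 defects found in a 200-line change over half an hour.
record = ReviewRecord(defects_found=3, lines_reviewed=200, review_seconds=1800)
density = defect_density(record)
rate = defect_detection_rate(record)
```

The detection rate is the one that needs the reviewing time, which is why it depends on Stopwatch being installed and enabled.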
I’m going to tackle this extension as soon as RB-Stats, Stopwatch and Karma are done. At this point, I’m quite confident that the current extension framework can more or less handle this.
Got any more ideas for me? Or maybe an extension wish-list? Let me know.