Tag Archives: peer review

Some Preliminary Results

But first, a confession…

Sometimes I play a little fast and loose with my English. If there’s anything that my Natural Language Processing course taught me last year, it’s that I really don’t have a firm grasp on the formal rules of grammar.

The reason I mention this is because of the word “peer”. The plural of peer is peers. And the plural possessive of peer is peers’. With the apostrophe.

I didn’t know that a half hour ago. Emily told me, and she’s a titan when it comes to the English language.

The graphs below were created a few days ago, before I knew this rule. So they use peer’s instead of peers’. I dun goofed. And I’m too lazy to change them (and I don’t want to use OpenOffice Draw more than I have to).

I just wanted to let you Internet people know that I’ve realized this, since their are so many lot of grammer nazi’s out they’re on the webz.

Now, with that out of the way, where were we?

The Post-Experiment Questionnaire

If you read my experiment recap, then you know that my treatment group wrote a questionnaire after they had finished all of their assignment writing.

The questionnaire was used to get an impression of how participants felt about their peer reviewing experience.

A note on the peer reviewing experience

Just to remind you, my participants were marking mock-ups that I created for an assignment that they had just written. There were 5 mock-ups per assignment, so 10 mock-ups in total. Some of my mock-ups were very concise. Others were intentionally horrible and hard to read. Some were extremely vigilant in their documentation. Others were laconic. I tried to capture a nice spectrum of first-year work. None of my participants knew that I had mocked the assignments up.

Anyhow, back to the questionnaire…

The questionnaire made the following statements, and asked students to agree on a scale from 1 to 5, where 1 was Strongly Disagree and 5 was Strongly Agree:

1. It is unusual for me to see code written by my peers.
2. Seeing my peer’s code taught me things I didn’t already know.
3. Because I saw and graded my peer’s work, I believe I know more about the quality of my own work.
4. I am interested in knowing how my peers graded me.
5. I would have written the code for my first assignment differently if I had seen the rubric beforehand.
6. During this experiment, I enjoyed seeing other student’s assignments.
7. I enjoyed grading my peer’s work.
8. I found grading my peer’s work difficult.
9. I’m confident that the grading I did was fair.
10. Because I knew that my peers would be seeing and grading my code for the first assignment, I coded it differently than I would have normally.

For questions 2, 5, 7, 8, and 10, participants were asked to expand with a written comment if they answered 3 or above.

Of the 30 participants in my study, 15 were in my treatment group, and therefore only 15 people filled out this questionnaire.

The graphs are histograms – that means that the higher the bar is, the more participants answered the question that way.

So, without further ado, here are the results…

While there’s more weight on the positive side, opinion seems pretty split on this one. It might really depend on what kind of social / working group you have in your programming classes.

It might also depend on how adherent students are to the rules, since sharing code with your peers is a bit of a no-no according to the UofT Computer Science rules of conduct. Most programming courses have something like the following on their syllabus:

Never look at another student’s assignment solution, whether it is on paper or on the computer screen. Never show another student your assignment solution. This applies to all drafts of a solution and to incomplete solutions.

Of course, this only applies before an assignment is due. Once the due date has passed, it’s OK to look at one another’s code…but how many students do that?

Anyhow, looking at the graph, I don’t think we got too much out of that one. Let’s move on.

Well, that’s a nice strong signal. Clearly, there’s more weight on the positive side. So my participants seem to understand that grading the code is teaching them something. That’s good.

And now for an interesting question: is there any relationship between the amount of programming experience of the participant, and how they answered this question? Good question. Before the experiment began, all participants filled out a brief questionnaire. The questionnaire asked them to provide, in months, how much time they’ve spent in either a programming intensive course, or a programming job. So that’s my fuzzy measure for programming experience.

The result was surprising.

For participants who answered 5 (strongly agreed that they learned things they didn’t already know):

Number of participants: 7
Maximum number of months: 36
Minimum number of months: 4
Average number of months: 16

For participants who answered 4:

Number of participants: 1
Number of months: 16

For participants who answered 3:

Number of participants: 4
Maximum number of months: 16
Minimum number of months: 8
Average number of months: 13

For participants who answered 2:

Number of participants: 1
Number of months: 5

For participants who answered 1 (strongly disagreed that they learned things they didn’t already know):

Number of participants: 1
Number of months: 16

So there’s no evidence here that participants with more experience felt they learned less from the peer grading.
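For the curious, a breakdown like the one above boils down to grouping responses by answer and summarizing each group. Here is a minimal sketch, assuming each response is an (answer, months) pair. The helper name `summarize_by_answer` is hypothetical, and the sample responses are made up purely for illustration; they are not the study’s data.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_answer(responses):
    """Group months of experience by Likert answer and summarize each group.

    `responses` is an iterable of (answer, months) pairs, where answer is
    an integer from 1 (Strongly Disagree) to 5 (Strongly Agree).
    """
    groups = defaultdict(list)
    for answer, months in responses:
        groups[answer].append(months)

    summary = {}
    for answer, months_list in groups.items():
        summary[answer] = {
            "count": len(months_list),
            "max": max(months_list),
            "min": min(months_list),
            "mean": mean(months_list),
        }
    return summary


# Made-up placeholder responses, NOT the real study data.
example_responses = [(5, 4), (5, 36), (5, 8), (3, 8), (3, 16), (1, 16)]
example_summary = summarize_by_answer(example_responses)
```

With only a handful of participants per answer, a summary like this is descriptive at best, which is why the conclusion above is phrased as "no evidence" rather than anything stronger.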

This was one of those questions where participants were asked to expand if they answered 3 or above. Here are some juicy morsels:

If you answered 3 or greater to the question above, what did you learn?

I learned some tricks and shortcuts of coding that make the solution more elegant and sometimes shorter.

…it showed me how hard some code are to read since I do not know what is in the programmer’s head.

I learned how different their coding style are compared to mine, as well as their reasoning to the assignment.

l learned about how other people think differently on same question and their programming styles can be different very much.

one of the codes I marked is very elegant and clear. It uses very different path from others. I really enjoyed that code. I think good codes from peers help us learn more.

I didn’t know about the random.shuffle method. I also didn’t know that it would have been better to use Exceptions which I don’t really know.

The different design or thinking towards the same question’s solution, and other ways to interpret a matter.

Other people can have very convoluted solutions…

Different ways of solving a problem

A few Python shortcuts, especially involving string manipulation. As well, I learned how to efficiently shuffle a list.

algorithm (ways of thinking), different ways of doing the same thing

Sometimes a few little tricks or styles that I had forgotten about. Also just a few different ways to go about solving the problem.

So what conclusions can I draw from this?

It looks like, regardless of experience, students seem to think peer grading teaches them something – even if it’s just a different design, or an approach to a problem.

Another clear signal in the “strongly agree” camp. This one is kind of a no-brainer though – seeing work by others certainly gives us a sense of how our own work rates in comparison. We do this kind of comparison all the time.

Anyhow, my participants seem to agree with that.

Again, a lot of agreement there. Students are curious to know what their peers think of their work. They care what their peers think. This is good. This is important.

Hm. More of a mixed reaction here. There’s more weight on the “strongly agree” side, but not a whole lot more.

This is interesting though. If I find that my treatment group does perform better on their second assignment, is it possible that their improvement isn’t from the grading, but rather from their intense study of the rubric?

So, depending on whether or not there’s an improvement, my critics could say I might have a wee case of confounding factor syndrome, here.

And I would agree with them. However, I would also point out that if there was an improvement in the treatment group, it wouldn’t matter what the actual source of the learning was – the peer grading (along with the rubric) caused an improvement. And that’s fine. That’s an OK result.

Of course, this is all theoretical until I find out if there was an improvement in the treatment group grades. Stay tuned for that.

Anyhow, this was another one of those questions where I asked for elaboration for answers 3 and up. Here’s what the participants had to say:

If you answered 3 or greater to the question above, what would you have done differently?

I would have checked for exceptions (and know what exceptions to check). I would have put more comments and docstrings into my code. I would have named my variables more reasonably.

I would’ve wrote out documentation. (ie. docstrings) Though I found that internal commenting wasn’t necessary.

i’ll add more comments to my code and maybe some more exceptions.

Added comments and docstrings.

Code’s design, style, clearness, readability and docstrings.

Made more effort to write useful docstrings and comments

I would’ve included things that I wouldn’t have included if I was coding for myself (such as comments and docstrings).

Added more documentation (I forget what it’s called but it’s when you surround the comments with “” ”’ “”)

Written more docstrings and comments (even though I think the code was simple enough and the method names self-explanatory enough that the code didn’t need more than one or two terse docstrings).

I forgot about docstrings and commenting my code

So it sounds like my assignment specification wasn’t clear enough about how documentation would be evaluated. There’s also some indication that participants thought that documentation wasn’t necessary if the code is simple enough. With respect to docstrings, I’d have to disagree, since docstrings are overwhelmingly useful for generating and compiling documentation. Those are just my own personal feelings on the matter, though.
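As a small illustration of why docstrings matter to documentation tooling, here is a sketch. The `shuffle_deck` function is a hypothetical example and not part of the actual assignment; `inspect.getdoc` is a standard-library call that returns a function’s docstring with its indentation cleaned up.

```python
import inspect
import random


def shuffle_deck(cards):
    """Return a new list containing the given cards in random order.

    The original list is left unmodified.
    """
    shuffled = list(cards)
    random.shuffle(shuffled)
    return shuffled


# The docstring travels with the function object, so a tool can read it
# without parsing the source file.
doc_text = inspect.getdoc(shuffle_deck)
first_line = doc_text.splitlines()[0]
```

Calling `help(shuffle_deck)` in an interactive session shows the same text, even for code that looks simple enough not to need comments.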

Note: this is not to be confused with “I enjoyed grading my peers’ work”, which is the next question.

Mostly agreement here. So that’s interesting – participants enjoyed the simple act of seeing and reading code written by their peers.

It looks like, in general, students don’t really enjoy grading their peers’ code. Clearly, it’s not a universal opinion – you can see there’s some disagreement in the graph. Still, the trend seems to go towards the “strongly disagree” camp.

That’s a very useful finding. There’s nothing worse than sweating your butt off to design and construct a new task for students, only to find out that they hate doing it. We may have caught this early.

And I don’t actually find this that surprising: code review isn’t exactly a pleasurable experience. The benefits are certainly nice, but code review is a bit like flossing… it just seems to slow the morning routine down, regardless of the benefits.

Here’s what some participants had to say about their answers:

If you answered 3 or greater to the question above, why did you enjoy grading your peer’s work?

Because I like to compare my thoughts and other people’s thoughts.

well, some of the codes are really hard to read. But I did learn something from the grading. And letting students grade the codes is more fair.

I got to see where I went wrong and saw more creative/efficient solutions which will give me ideas for future assignments. But otherwise it was really boring.

So that I can learn from my peer’s thinking which gives me more diversity of coding and problem-solving.

Sometimes you see other student’s styles of coding/commenting/documenting and it helps you write better code. Sometimes you learn things that you didn’t know before. Sometimes it’s funny to see how other people code.

It was interesting to see their ideas, although sometimes painful to see their style.

not so much the grading part, but analyzing/looking at the different ways of coding the same thing

It gave me a rare prospective to see how other people with a similar educational background write their code.

Makes you think more critically about the overall presentation of your code. You ask yourself : “What would someone think of my code if they were doing this? Would I get a good mark?”

This one is more or less split right down the middle, with a little more weight on the agree side.

Again, participants who answered 3 or above were asked to elaborate. Here are some comments:

If you answered 3 or greater to the question above, what about grading your peer’s work was difficult?

The hardest part was trying to trace through messy code in order to figure out if it actually works.

Emotionally, I know what the student is doing but I have to give bad marks for comments or style which makes me feel bad. Sometimes it is hard to distinguish the mark whether it is 3 or 4. The time was critical (did not have time to finish all papers) which might result in giving the wrong mark. I kept comparing marks and papers so I could get almost the fairest result between all students. It is hard to mark visually, i.e. not testing the code. Some codes are hard to read which make it hard for marking and I can assume it is wrong but it actually works.

Giving bad marks are hard! Reading bad code is painful! It wasn’t fun! 🙁

It just became really tedious trying to understand people’s code.

To test and verify their code is hard sometimes as their method of solving a problem might be complicated. I need to think very carefully and test their code progressively.

The rubric felt a little too strict. Sometimes a peer’s code had small difficulties that could easily be overcome, but would be labeled as very poor. Also, the rubric wasn’t clear enough, especially on the error handling portions and style. There could be many ways of coding for example the __str__ functions (using concatenation versus using format eg. ‘ %s’ % string as opposed to using + str(string) +)

I just found it hard to read other’s code because I already have a set idea of how to solve the problems. I did not see how the solutions of my peers would’ve improved my own solutions, so I did not find value in this.

Reading through each line of code and trying to figure out what it does

Reading through convoluted, circuitous code to determine correctness.

Not every case is clear-cut, and sometimes it’s hard to decide which score to give.

Being harsh and honest. I guess it’s good not to ever meet the people who wrote the codes (unlike TAs) because they aren’t there to defend themselves. Saves some headaches 🙂

Ok, more or less full agreement here. At least, no disagreement. But also no full agreement. It’s sort of a lethargic “meh” with a flaccid thumbs up.

The conclusion? My participants felt that, more or less, their grading was probably fair. I guess.

Now this one…

This one is tricky, because I might have to toss it out. Each one of my participants was told flat out that other participants in the study may or may not see their code. This is true, since the graders are also participants in the study.

However, I did not outright tell them that other participants would be grading their code for the first assignment. So I think this question may have come as a surprise to them.

That was an oversight on my part. I screwed up. I’m human.

The two lone participants who answered 3 or above wrote:

If you answered 3 or greater to the question above, what did you do differently?

Making the docstring comments more clear, simplifying my design as possible, writing in a better style.

Added a bit more comments to explain my code in case peers don’t understand.

Anyhow, so those are my initial findings. If you have any questions about my data, or ideas on how I could analyze it, please let me know. I’m all ears.

Research Experiment: A Recap

Before I start diving into results, I’m just going to recap my experiment so we’re all up to speed.

I’ll try to keep it short, sweet, and punchy – but remember, this is a couple of months of work right here.

Ready? Here we go.

What I was looking for

A quick refresher on what code review is

Code review is like the software industry equivalent of a taste test. A developer makes a change to a piece of software, puts that change up for review, and a few reviewers take a look at that change to make sure it’s up to snuff. If some issues are found during the course of the review, the developer can go back and make revisions. Once the reviewers give it the thumbs up, the change is put into the software.

That’s an oversimplified description of code review, but it’ll do for now.

So what?

What’s important is to know that it works. Jason Cohen showed that code review reduces the number of defects that enter the final software product. That’s great!

But there are some other cool advantages to doing code review as well.

It helps to train up new hires. They can lurk during reviews to see how more experienced developers look at the code. They get to see what’s happening in other parts of the software. They get their code reviewed, which means direct, applicable feedback. All good things.
It helps to clean and homogenize the code. Since the code will be seen by their peers, developers are generally compelled to not put up “embarrassing” code (or, if they do, to at least try to explain why they did). Code review is a great way to compel developers to keep their code readable and consistent.
It helps to spread knowledge and good practices around the team. New hires aren’t the only ones to benefit from code reviews. There’s always something you can learn from another developer, and code review is where that will happen. And I believe this is true not just for those who receive the reviews, but also for those who perform the reviews.

That last one is important. Code review sounds like an excellent teaching tool.

So why isn’t code review part of the standard undergraduate computer science education? Greg and I hypothesized that code review isn’t taught because we don’t know how to teach it.

I’ll quote myself:

What if peer code review isn’t taught in undergraduate courses because we just don’t know how to teach it? We don’t know how to fit it in to a curriculum that’s already packed to the brim. We don’t know how to get students to take it seriously. We don’t know if there’s pedagogical value, let alone how to show such value to the students.

The idea

Inspired by work by Joordens and Pare, Greg and I developed an approach to teaching code review that integrates itself nicely into the current curriculum.

Here’s the basic idea:

Suppose we have a computer programming class. Also suppose that after each assignment, each student is randomly presented with anonymized assignment submissions from some of their peers. Students will then be asked to anonymously peer grade these assignment submissions.

Now, before you go howling your head off about the inadequacy / incompetence of student markers, or the peerScholar debacle, read this next paragraph, because there’s a twist.

The assignment submissions will still be marked by TAs as usual. The grades that a student receives from her peers will not directly affect her mark. Instead, the student is graded based on how well she graded her peers. The peer reviews that a student completes will be compared with the grades that the TAs delivered. The closer a student is to the TA, the better the mark she gets on her “peer grading” component (which is distinct from the mark she receives for her programming assignment).
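To make "the closer a student is to the TA, the better the mark" concrete, here is one possible scoring rule as a sketch. The formula (one minus the mean absolute gap, scaled by the maximum mark) is an assumption chosen for illustration; the idea as described does not commit to any particular formula, and `peer_grading_score` is a hypothetical name.

```python
def peer_grading_score(peer_marks, ta_marks, max_mark=10):
    """Score a student's grading by how close it is to the TA's marks.

    Both arguments map a submission ID to a mark out of `max_mark`.
    Returns 1.0 when the student matched the TA on every submission;
    the score drops as the average disagreement grows.
    """
    if not peer_marks:
        raise ValueError("the student has not graded any submissions")

    total_gap = 0
    for submission_id, peer_mark in peer_marks.items():
        total_gap += abs(peer_mark - ta_marks[submission_id])

    mean_gap = total_gap / len(peer_marks)
    return 1.0 - mean_gap / max_mark


# Illustrative marks only.
ta = {"s1": 7, "s2": 4}
careful_student = {"s1": 8, "s2": 4}
all_tens_student = {"s1": 10, "s2": 10}

careful_score = peer_grading_score(careful_student, ta)
all_tens_score = peer_grading_score(all_tens_student, ta)
```

Under this rule, a student who hands out 10/10 to everyone scores worse than one who tracks the TA closely, and neither student’s marks touch the author’s assignment grade.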

Now, granted, the idea still needs some fleshing out, but already, we’ve got some questions that need answering:

Joordens and Pare showed that for short written assignments, you need about 5 peer reviews to predict the mark that the TA will give. Is this also true for computer programming assignments?
Grading students based on how much their peer grading matches TA grading assumes that the TA is an infallible point of reference. How often do TAs disagree amongst themselves?
Would peer grading like this actually make students better programmers? Is there a significant difference in the quality of their programming after they perform the grading?
What would students think of peer grading computer programming assignments? How would they feel about it?

So those were my questions.

How I went about looking for the answers

Here’s the design of the experiment in a nutshell:

Writing phase

I have a treatment group, and a control group. Both groups are composed of undergraduate students. After writing a short pre-experiment questionnaire, participants in both groups will have half an hour to work on a short programming assignment. The treatment group will then have another half an hour to peer grade some submissions for the assignment they just wrote. The submissions that they mark will be mocked up by me, and will be the same for each participant in the treatment group. The control group will not perform any grading – instead, they will do an unrelated vocabulary exercise for the same amount of time. Then, participants in either group will have another half an hour to work on the second short programming assignment. Participants in my treatment group will write a short post-experiment questionnaire to get their impressions on their peer grading experience. Then the participants are released.

Here’s a picture to help you visualize what you just read.

So now I’ve got two piles of submissions – one for each assignment, 60 submissions in total. I add my mock-ups to each pile. That means 35 submissions in each pile, and 70 submissions in total.

Marking phase

I assign ID numbers to each submission, shuffle them up, and hand them off to some graduate-level TAs that I hired. The TAs will grade each assignment using the same marking rubric that the treatment group used to peer grade. They will not know if they are grading a treatment group submission, a control group submission, or a mock-up.
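The blinding step can be sketched in a few lines. This is an illustration under assumptions, not the procedure actually used: `blind_submissions` is a hypothetical helper, and it shuffles before handing out ID numbers so that the ID order carries no information about the source.

```python
import random


def blind_submissions(submissions, seed=None):
    """Shuffle submissions and hand out opaque ID numbers.

    `submissions` is a list of (source, text) pairs, where source is a
    label such as "treatment", "control", or "mock-up".
    Returns (for_graders, key): `for_graders` maps ID -> text only, and
    `key` maps ID -> source so the experimenter can un-blind later.
    """
    rng = random.Random(seed)
    shuffled = list(submissions)
    rng.shuffle(shuffled)

    for_graders = {}
    key = {}
    for id_number, (source, text) in enumerate(shuffled, start=1):
        for_graders[id_number] = text
        key[id_number] = source
    return for_graders, key
```

Only `for_graders` goes to the TAs; the `key` stays with the experimenter until the marking is finished.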

Choosing phase

After the grading is completed, I remove the mock-ups, and pair up submissions in both piles based on who wrote it. So now I’ve got 30 pairs of submissions: one for each student. I then ask my graders to look at each pair, knowing that they’re both written by the same student, and to choose which one they think is better coded, and to rate and describe the difference (if any) between the two. This is an attempt to catch possible improvements in the treatment group’s code that might not be captured in the marking rubric.

So that’s what I did

So everything you’ve just read is what I’ve just finished doing.

Once the submissions are marked, I’ll analyze the marks for the following:

Comparing the two groups, is there any significant improvement in the marks from the first assignment to the second in the treatment group?
- If there was an improvement, on which criteria? And how much of an improvement?
How did the students do at grading my mock-ups? How similar were their peer grades to what the TAs gave?
How much did my two graders agree with one another?
During the choosing phase, did my graders tend to choose the second assignment over the first assignment more often for the treatment group?

And I’ll also analyze the post-experiment questionnaire to get student feedback on their grading experience.

Ok, so that’s where I’m at. Stay tuned for results.

Lessons from peerScholar: An Approach to Teaching Code Review

We Don’t Know How To Teach Code Review

If you go to my very first blog post about code review, you’ll discover what my original research question was:

Code reviews. They can help make our software better. But how come I didn’t learn about them, or perform them in my undergrad courses? Why aren’t they taught as part of the software engineering lifecycle right from the get-go? I learn about version control, but why not peer code review? Has it been tried in the academic setting? If so, why hasn’t it succeeded and become part of the general CS curriculum? If it hasn’t been tried, why not? What’s the hold up? What’s the problem?

I have mulled the question for months, and read several papers that discuss different models for introducing code review into the classroom.

But I’m no teacher. I really don’t know what it’s like to run a university level course. Thankfully, two course instructors from our department gave their input on the difficulty of introducing peer code review in the classroom. Here’s the first:

The problem is that is completely un-assessable. You can’t get the students to hand in reports from their inspection, and grade them on it, because they quickly realise it’s easier to fake their reports than it is to do a real code inspection. And the assignment never gets them to understand and internalize the real reasons for doing code inspection – here they just do it to jump through an artificial hoop set by the course instructor.

What we really need to do is to assess code quality, and let them figure out for themselves how the various tools we show them (e.g. test-case first, code inspection, etc) will help them achieve that quality. Better still, we give them ways of measuring directly how the various tools they use affect code quality for each assignment. But I haven’t thought enough yet about how to achieve this.

So, I’ve long since dropped the idea of a specific marked assignment on code inspections, but still teach inspection in all of my SE courses. I need to find a way to teach it so that the students themselves understand why it’s so useful.

(From Steve Easterbrook, commenting on this post)

And here’s the second:

1. How many different tasks can we ask students to do on a 3-week assignment? I think students should learn to use an IDE, a debugger, version control, and a ticket system. We have been successful in getting students to use version control because that’s the only way they can submit an assignment. We have had mixed success getting students to use IDE’s and debuggers, partly because it is hard to assign marks for their use. We have been even less successful in convincing students to use tickets because a 3-week assignment isn’t big enough or long enough to make tickets essential.

2. If the focus of my course is teaching operating systems, how much time (and grades) should I devote to software development tools and practices that aren’t centered on operating systems?

(From Karen Reid, commenting on this post)

All of this swirls around a possible answer that Greg Wilson and I have been approaching since September:

What if peer code review isn’t taught in undergraduate courses because we just don’t know how to teach it? We don’t know how to fit it in to a curriculum that’s already packed to the brim. We don’t know how to get students to take it seriously. We don’t know if there’s pedagogical value, let alone how to show such value to the students.

If that’s really the problem… Greg and I may have come up with a possible solution.

But First, Some Background

In 2008, Steve Joordens and Dwayne Pare published Peering into Large Lectures: Examining Peer and Expert Mark Agreement Using peerScholar, an Online Peer Assessment Tool.

It’s a good read, but in the interests of brevity, I’ll break it down for you:

Joordens and Pare are both at the University of Toronto Scarborough, in the Psych Department
Psych classes (especially for the first year) are large. For large classes, it is generally difficult to introduce writing assignments simply due to the sheer volume of writing that would need to be marked by the TAs. Alternatives (like multiple-choice tests) are often used to counteract this.
But writing is important.
The idea: what if we let students grade one another? There’s research showing the benefits of peer evaluation for writing assignments. So let’s see what kind of grades peers give to one another.
A tool is built (peerScholar), and an experiment is run: after submitting their writing assignments, show students 5 submissions from other students, and have them grade the work (with specific grading instructions from the instructor). Then, compare the grades that the students gave with grades from the TAs.
A significant positive correlation was found between averaged TA marks and averaged peer marks. Further statistical analysis shows that there is no significant difference between the agreement levels of TA and peer markers.
To ensure repeatability, a second experiment is run – similar to the first. Except, this time, students who receive the marks from their peers are able to “mark the marker” and flag any marks that seem suspicious (a 1/10, for example, if all the other students and the TA gave something closer to a 7/10).
It looks good – numbers were closer this time.
Conclusion: the average grade given by a set of peer markers was similar to the grade given by the TAs in terms of overall level and rank ordering of assignments.
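The comparison of averaged TA marks with averaged peer marks can be sketched as follows. Using Pearson’s r is an assumption on my part, since the paper’s exact statistics aren’t reproduced here, and the marks below are made up for illustration rather than taken from the paper.

```python
from math import sqrt
from statistics import mean


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / (spread_x * spread_y)


def average_marks(marks_by_submission):
    """Average the list of marks each submission received."""
    return {sid: mean(marks) for sid, marks in marks_by_submission.items()}


# Made-up marks for three submissions, NOT data from the paper.
peer_marks = {"a": [6, 7, 8, 7, 7], "b": [3, 4, 5, 4, 4], "c": [9, 9, 8, 10, 9]}
ta_marks = {"a": [7, 7], "b": [5, 4], "c": [9, 8]}

peer_avg = average_marks(peer_marks)
ta_avg = average_marks(ta_marks)
ids = sorted(peer_avg)
r = pearson_correlation([peer_avg[i] for i in ids], [ta_avg[i] for i in ids])
```

A value of `r` near 1 means the peers and the TAs rank the submissions in much the same order, which is the kind of agreement the conclusion above describes.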

This is a very interesting result. Why can’t we apply it to courses in a computer science department? What if students started marking each other’s code?

What they’d be doing would be called code review.

The Idea

Let’s modify Joordens and Pare’s model a little bit.

Let’s say I’m teaching an undergraduate computer science course where students tend to do quite a bit of coding. Traditionally, source code written by students would be collected through some mechanism or another, be marked by TAs, and then be returned to students after a few weeks.

What if, after all of the submissions have been collected, each student had to anonymously grade 5 submissions, chosen randomly from the system (with the only stipulation that students cannot grade their own work)?

But here’s the twist:

Instead of just calculating a mark for students based on the peer reviews that they get, how about we mark the students based on the reviews that they give – specifically, based on how close they are to generating the same marks that the TAs give?

So now a student’s mark will be partially based on how well they are able to review code.
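The random hand-out of submissions described above (5 per student, never your own) could look something like this. It is a minimal sketch with a hypothetical `assign_reviews` helper, not how MarkUs or peerScholar actually does it.

```python
import random


def assign_reviews(student_ids, reviews_per_student=5, seed=None):
    """Randomly pick submissions for each student to grade, never their own.

    Returns a dict mapping each reviewer to a list of authors whose
    submissions they will grade.
    """
    if len(student_ids) <= reviews_per_student:
        raise ValueError("need more students than reviews per student")

    rng = random.Random(seed)
    assignments = {}
    for reviewer in student_ids:
        # Everyone except the reviewer is a candidate author.
        others = [s for s in student_ids if s != reviewer]
        assignments[reviewer] = rng.sample(others, reviews_per_student)
    return assignments
```

One thing this simple version does not do is balance the load: some submissions may be reviewed more often than others. A real system would probably want every submission to receive roughly the same number of reviews.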

Questions / Answers (or Concerns / Freebies)

I can think of a few initial concerns with this idea.

Q: What if the TA makes a huge mistake, or makes an oversight? They’re not infallible. How can students possibly make the same mistake / give the same mark?

A: I agree that TAs are not infallible. Nobody is. However, if a TA gives a submission a 3/10, and the rest of the students give 9/10’s, this is useful information. It either means that the TA missed something, or might signal that the students in general have not learned something crucial. In either case, this sort of problem can be easily detected, and sorted out via human intervention.
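Detecting that kind of disagreement is easy to automate. Here is a sketch, assuming we compare each TA mark against the median peer mark and flag large gaps for a human to look at. The median, the threshold of 3 points, and the name `flag_disagreements` are all assumptions made for illustration.

```python
from statistics import median


def flag_disagreements(ta_marks, peer_marks, threshold=3):
    """Return IDs of submissions where the TA and the peers clearly disagree.

    `ta_marks` maps submission ID -> the TA's mark.
    `peer_marks` maps submission ID -> list of marks given by peers.
    A submission is flagged when the TA's mark is at least `threshold`
    points away from the median peer mark.
    """
    flagged = []
    for submission_id, ta_mark in ta_marks.items():
        peers = peer_marks.get(submission_id, [])
        if not peers:
            continue  # nothing to compare against
        if abs(ta_mark - median(peers)) >= threshold:
            flagged.append(submission_id)
    return flagged


# Illustrative marks only: the TA gave "a" a 3/10, the peers gave about 9/10.
example_ta = {"a": 3, "b": 7}
example_peers = {"a": [9, 9, 8, 9, 10], "b": [7, 6, 8, 7, 7]}
needs_a_second_look = flag_disagreements(example_ta, example_peers)
```

The flag only says that someone should take a second look; it cannot tell whether the TA missed something or the students did.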

Q: What if students game the system by just giving their peers all 10/10’s, or try to screw each other by just giving 0/10’s?

A: Remember, students are being marked on their ability to review. If the TAs gave a more appropriate mark, and a student starts behaving as above, they’re going to get a poor reviewing mark. No harm done to the reviewee.

Q: I’m already swamped. How can I cram a system like this into my course?

A: I’m one of the developers on MarkUs, a tool that is being used to grade source code for students at the University of Toronto and the University of Waterloo. It would not be impossible to adapt MarkUs to follow this model. Through MarkUs, a lot of this idea can be automated. Besides some possible human intervention for edge cases, I don’t see there being a whole lot of course-admin overhead to get this sort of thing going. But it does mean a little bit more work for students who have to review the code.

Q: This is nice in theory, but is there any real pedagogical value in this? And if so, how can I show it to my students?

A: First off, as a recent undergraduate student at UofT, I must say how rare it is to be given the opportunity to read another student’s code. It just doesn’t happen much. I would have found it interesting – I’d be able to see the techniques that my peers employed to solve the same problems that I was trying to solve. It would give me a good informal measuring stick to see how I rank in the class – and students always want to know how they rank in the class.

Would they learn anything from it though?

That’s a good question. Would students learn anything from this, and realize the benefits? Remember – that’s what Steve Easterbrook says was the major stumbling block to introducing peer review…we have to show them that it’s useful.

The Questions

How good are students at grading their peers? How close do they get to the grades that a TA would give?
- By study year
- By their perceived programming ability
- By their perceived programming experience
- By their programming confidence
What happens to students’ ability to review their peers as they perform each review? Do they get better after each one? And is there a point where their accuracy gets poorer from fatigue?
How many student reviewers are needed to approximate the grade that a TA would give?
How long do students generally take to peer review code? (bonus)
How long do graduate students generally take to mark an assignment? (bonus)
Do the students actually learn anything from the process?
How do the students feel about being graded on their ability to review?
- Do they think that this process is fair?
- Do they think that they’re learning anything useful?
- Do they feel like it is worth their time?
- Do they enjoy reading other students’ code?
- If it was introduced into their classes, how would they feel?

Lots of questions. Luckily, it just so happens that I’m a scientist.

The Experiment

First, I mock up (or procure) 10 submissions for a programming assignment that our undergraduates might write.

I then get/convince some graduate students to grade those 10 submissions to the best of their ability, using MarkUs. These marks are recorded.

I then take a cross-section of our undergraduate student body, and (after a brief survey to determine their opinions of their coding experience/confidence), I get the students to peer review and grade those 10 submissions. They will be told that their goal is to try to give the same type of marks that a graduate student TA might give.

After the grades are recorded, I take the submission that they reviewed first, and get them to grade it again. Do they get closer to the TA’s mark than their first attempt?

Students are then given a second survey (probably Likert scales) to assess their opinions on the process. Would it be fair if their ability to grade was part of their mark? Did they get anything useful out of this? Did they feel that it was worth their time? Did they enjoy reading other students’ code? How would they feel if it was part of their class? …

The final survey will (hopefully) knock out the last series of questions in my list. Timing information recorded during marking will help answer the bonus questions. Analysis of the marks that the students give in relation to the marks that the TAs give will hopefully help answer the rest.
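As a sketch of what part of that analysis might look like, here's one way to get at the "how many student reviewers are needed" question for a single submission: average the marks from every possible group of k students, and see how far that average lands from the TA's mark. The function name and the numbers are made up for illustration, and the real analysis may well end up looking different.

```python
from itertools import combinations
from statistics import mean

def error_by_group_size(student_marks, ta_mark):
    """Map group size k to the mean absolute error between the TA mark
    and the average mark of each possible group of k student reviewers."""
    errors = {}
    for k in range(1, len(student_marks) + 1):
        group_errors = [
            abs(mean(group) - ta_mark)
            for group in combinations(student_marks, k)
        ]
        errors[k] = mean(group_errors)
    return errors

# Made-up marks from four students for one submission the TA gave 8/10.
marks = [6, 9, 7, 10]
errors = error_by_group_size(marks, ta_mark=8)
for k, err in errors.items():
    print(k, round(err, 2))
```

With these particular numbers the error shrinks as the group grows, but that's a property of the data I picked, not a result. Whether real student marks behave that way is exactly what the experiment is supposed to find out.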

What Am I Missing?

Am I missing anything here? Is there a gaping hole in my thinking somewhere? Would this be a good, interesting experiment to run? For those who teach…if my results are encouraging, would you ever try implementing this in your classroom?

And if this was introduced into the classroom…what would happen to student learning? What would happen to marks? How would instructors like it?

So, what do you think? I’m all ears.

MarkUs, Squad, How’s / Refactor My Code, Belated Happy Holidays, and Oh Yeah – I’m Not Dead

Belated happy holidays! My last post was over a month ago, and so my blog has a nice layer of web-dust on it right now. Well, here I am to ease your mind. I’m still alive!

But that almost wasn’t true.

I won’t bore you with the details – I’ll just give you the facts, and let you fill in the blanks.

My girlfriend Em, her sister Cassie, and I were up in Collingwood on New Year’s Day, enjoying a relaxing day at a Norwegian spa (the outdoor baths were amazing – how awesome is it to be in a boiling hot tub, while simultaneously, your hair is so frozen that it’s snapping off in your hands?)
The roads that night were treacherous. Snowy, un-plowed, and dark. I had borrowed my Mom’s car for the trip, and we took it realllllly slow.
After a tortoise-paced two-hour ride back to Em’s place in Newmarket, and then another two-hour drive from Newmarket to my home in Grimsby the next day, I was getting pretty sick of winter driving. On top of that, the brakes seemed to be acting funny. I found myself sliding a lot, and there didn’t seem to be a lot of resistance when I put my foot down.
The next day, my Mom takes the car to go to work. She doesn’t even leave the driveway. The brakes hadn’t been acting funny: the brakes hadn’t been acting at all. Turns out we had a leaky brake-line for the entire trip…
Guts of the story: I think we drove home from Collingwood with about 35% brake power in one of the worst snow storms I’ve ever driven in.

Breakfast tasted especially good for us that morning.

Anyhow, now where was I? Oh yeah…

MarkUs

MarkUs 0.6 got kicked out a week or so ago. The MarkUs Team kicked the crap out of a bunch of tickets over the holidays, and I think we ended up with a pretty solid release. MarkUs is being used again at UofT this semester, and Byron Weber Becker is also piloting it at UWaterloo. I’ll cautiously say that things seem to be going well for this release. Great job, MarkUs Team!

I’m TAing the students working on MarkUs for Greg’s UCOSP course again. We had a fantastic code-sprint this past weekend! The new team members have already started working on tickets and submitting code to review. I think we’re on our way into another highly productive semester.

A Few More Web-Based Code Review Tools

Remember that big list of code review tools I put up a while back? I’ve got a few more to add:

How’s My Code

This is a pretty dead-simple code review tool that came about during a Rails Rumble a few months back. It has that “big friendly buttons and round corners” web 2.0 thingy going on. I haven’t gone so far as to actually try it out, but I did watch this web-cast:

Not bad if you just want to get your code out there, and get your team commenting on your changes…

A few things caught my attention:

It’s a web service, so you don’t install it…you sign up for it
It currently only supports Git. 🙁
There doesn’t seem to be any support for contextual per-line commenting…I think it’s just file by file commenting. I’d love it if I could comment on a single line of code…

Still, if I was working on a project hosted on a Git repo, and I needed a dead-simple code review service, and I needed it quickly, I could probably do a lot worse than this.

Click here to check out How’s My Code

Squad

Remember that time when I wrote about how it might be neat if somebody created a code review tool on top of Google Wave? (or Bespin for that matter – though I didn’t mention it, and should have)

Looks like somebody else was thinking the same thing. And a few months earlier. I guess it’s not easy to be super cutting-edge.

Anyhow, looks like something Wave-ish (yet simpler, more streamlined) has been developed. Check out Squad.

I just tried this thing out for free (with ads, features locked, etc), and it was pretty cool. I could see something like this being very useful for showing new MarkUs team members how to do things. Actually, I just used it to show a new member of the MarkUs team how to use Shoulda. Pretty useful. It sure beats coding through IRC and Pastie.org.

A few things to keep in mind:

Super simple to get going – open up a session, and send someone a generated link, and you’re both coding in no time
One person codes at a time…so while one person edits, the screen is locked for everyone else
Ads on the left are a little annoying
Sports syntax highlighting for a number of languages – though I noticed that Ruby wasn’t one of them. :/

I can see this becoming second nature, like Pastie.org.

Who knows – I might find more reasons to use Squad as the semester rolls on and MarkUs picks up speed. I’ll keep you posted.

If you missed the link I put in above, click here to check out Squad

Refactor My Code

This service crowd-sources code review requests, so don’t expect to get deep architectural feedback, because it’ll probably come from strangers who don’t/barely know your code base.

The idea is – slap a piece of code that you’d like refactored up on the site, and then others swoop in with brilliant suggestions (assuming of course, you asked your question properly…check this out…what the…?)

This is the sort of thing that CS instructors probably wouldn’t want their students using too much…it’d then become solve-my-CS-programming-assignment.com.

Still, I think it counts as peer code review. And it’s way different than anything else I’ve been looking at. Nice.

Click here to check out Refactor My Code

Anyhow, I just thought I’d mention those.

The Achilles’ Heel of Light-Weight Code Review

So I had my weekly meeting with my supervisor, and fellow students Zuzel and Jon Pipitone. Something interesting popped up, and I thought I’d share it.

If it wasn’t already clear, I dig code review. I think it really helps get a team of developers more in tune with what’s going on in their repository, and is an easy way to weed out mistakes and bad design in small chunks of code.

But there’s a fly in the soup.

This semester, the MarkUs project has been using ReviewBoard to review all commits to the repository. We’ve caught quite a few things, and we’ve been learning a lot.

Now for that fly:

A developer recently found a typo in our code base. An I18n variable was misspelled in one of our Views as I18n.t(:no_students_wihtou_a_group_message). One of our developers saw this, fixed the typo, and submitted the diff for review.

I was one of the reviewers. And I gave it the green light. It made sense – I18n.t(:no_students_wihtou_a_group_message) is clearly wrong, and I18n.t(:no_students_without_a_group_message) is clearly what was meant here.

So the review got the “Ship It”, and the code was committed. What we didn’t catch, however, was that the locale string was actually named “no_student_without_a_group_message” in the translation file, not “no_students_without_a_group_message”. So the fix didn’t work.

This is important: the diff looked good, but the bug remained because we didn’t have more information on the context of the bug. We had no information about I18n.t(:no_students_without_a_group_message) besides the fact that I18n.t(:no_students_wihtou_a_group_message) looked wrong.

Which brings me back to the conversation we had yesterday: while it seems plausible that code review helps catch defects in small code blocks, does the global defect count on the application actually decrease? Since ReviewBoard doesn’t have any static analysis tools to check what our diffs are doing, isn’t it plausible that while our diffs look good, we’re not preventing ourselves from adding new bugs into the code base?

So, the question is: does light-weight code review actually decrease the defect count across an application as a whole?

If not, can we augment these code review tools so that they’re more sensitive to the context of the diffs that are being reviewed?
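For the I18n typo above, one context-aware check might look something like this. It's a sketch in Python, not a feature of ReviewBoard or Rails: the regex, the function name, and the idea of handing it a set of locale keys are all my own assumptions. It pulls every key used through I18n.t(:key) out of a view and reports the ones that the translation file doesn't define.

```python
import re

# Matches calls like I18n.t(:some_key) and captures the key name.
# Assumption: keys are passed as plain Ruby symbols, as in the example above.
I18N_CALL = re.compile(r"I18n\.t\(\s*:(\w+)")

def missing_locale_keys(view_source, locale_keys):
    """Return the keys used via I18n.t(:key) in the view source that are
    not present in the given collection of locale keys."""
    used = I18N_CALL.findall(view_source)
    return sorted(set(used) - set(locale_keys))

# The "fixed" view line, checked against the key that is really in the locale.
view = "<%= I18n.t(:no_students_without_a_group_message) %>"
locale_keys = {"no_student_without_a_group_message"}
missing = missing_locale_keys(view, locale_keys)
print(missing)
```

Run against the diff we approved, a check like this would have reported no_students_without_a_group_message as missing, which is exactly the piece of context the reviewers didn't have. It only understands symbol-style keys, so string keys or scoped lookups would need more work.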