
Some More Results: Did the Graders Agree? – Part 2

(Click here to read the first part of the story)

I’m just going to come right out and say it:  I’m no stats buff.

Actually, maybe that’s giving myself too much credit.  I barely scraped through my compulsory statistics course.  In my defense, the teaching was abysmal, and the class average was in the sewer the entire time.

So, unfortunately, I don’t have the statistical chops that a real scientist should.

But, today, I learned a new trick.

Pearson’s Correlation Coefficient

Joordens and Pare gave me the idea while I was reviewing their paper for the Related Work section of my thesis.  They used it to inspect mark agreement between their expert markers.

In my last post on Grader agreement, I was looking at mark agreement at the equivalence level.  Pearson’s correlation coefficient should (I think) let me inspect mark agreement at the “shape” level.

And by shape level, I mean this:  if Grader 1 gives a high mark for a participant, then Grader 2 gives a high mark.  If Grader 1 gives a low mark for the next participant, then Grader 2 gives a low mark.  These high and low marks might not be equal, but the basic shape of the thing is there.
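To make that shape idea concrete, here’s a tiny sketch with made-up marks – two graders who never give the same number, but whose highs and lows line up perfectly:

```python
# Made-up marks from two graders for six submissions.  Grader 2 is
# always exactly one mark lower, so no pair matches, but the "shape"
# of the marking is identical.
grader1 = [9, 4, 7, 2, 8, 5]
grader2 = [8, 3, 6, 1, 7, 4]

def pearson_r(xs, ys):
    """Pearson's correlation coefficient, computed from the definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print(round(pearson_r(grader1, grader2), 6))  # 1.0 – identical shape, yet no two marks agree
```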

And this page, with its useful table, tells me how to determine whether the correlation coefficient I find is significant.  Awesome.

At least, that’s my interpretation of Pearson’s correlation coefficient.  Maybe I’ve got it wrong.  Please let me know if I do.

Anyhow, it can’t hurt to look at some more tables.  Let’s do that.

About these tables…

Like my previous post on graders, I’ve organized my data into two tables – one for each assignment.

Each table has a row for each of that assignment’s criteria.

Each table has two columns – the first is strictly to list the assignment criteria.  The second column gives the Pearson correlation coefficient for each criterion.  The correlation is measured between the marks that my two Graders gave on that criterion across all 30 submissions for that assignment.

I hope that makes sense.

Anyways, here goes…


Decks and Cards Grader Correlation Table

[table id=8 /]

Flights and Passengers Grader Correlation Table

[table id=9 /]

What does this tell us?

Well, first off, remember that for each assignment, for each criterion, there were 30 submissions.

So N = 30.

In order to determine whether the correlation coefficients are significant, we look at this table, and find N – 2 down the left-hand side:

28:  .306 (p < .10)    .361 (p < .05)    .423 (p < .02)    .463 (p < .01)

Those 4 values on the right are the critical values that we want to pass for significance.
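That lookup is simple enough to sketch in a few lines of Python – the cutoffs below are just the df = 28 row from that table, and the helper function is my own, not from any stats library:

```python
# Critical values of Pearson's r for df = N - 2 = 28, paired with the
# significance level reached when |r| meets the cutoff.
CRITICAL_VALUES = [(0.463, "p < 0.01"), (0.423, "p < 0.02"),
                   (0.361, "p < 0.05"), (0.306, "p < 0.10")]

def significance(r):
    """Return the strongest significance level that |r| reaches."""
    for cutoff, level in CRITICAL_VALUES:
        if abs(r) >= cutoff:
            return level
    return "not significant"

print(significance(0.89))  # p < 0.01
print(significance(0.33))  # p < 0.10
```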

Good news!  All of the correlation coefficients meet or exceed the smallest critical value, .306, so every criterion shows a significant correlation at some level.  So now, I’ll show you their significance by level:

p < 0.10

  • Design of __str__ in Decks and Cards assignment

p < 0.05

  • Design of deal method in Decks and Cards assignment

p < 0.02

  • Design of heaviest_passenger method in Flights and Passengers

p < 0.01

Decks and Cards
  • Design of Deck constructor
  • Style
  • Internal Comments
  • __str__ method correctness
  • deal method correctness
  • Deck constructor correctness
  • Docstrings
  • shuffle method correctness
  • Design of shuffle method
  • Design of cut method
  • cut method correctness
  • Error checking
Flights and Passengers
  • Design of __str__ method
  • Design of lightest_passenger method
  • Style
  • Design of Flight constructor
  • Internal comments
  • Design of add_passenger method
  • __str__ method correctness
  • Error checking
  • heaviest_passenger method correctness
  • Docstrings
  • lightest_passenger method correctness
  • Flight constructor correctness
  • add_passenger method correctness


Correlation of Mark Totals

Joordens and Pare ran their correlation statistics on assignments that were marked on a scale from 1 to 10.  I can do the same type of analysis by simply running Pearson’s on each participant’s total mark from each Grader.

Drum roll, please…

Decks and Cards

r(28) = 0.89, p < 0.01

Flights and Passengers

r(28) = 0.92, p < 0.01


Summary / Conclusion

I already showed before that my two Graders rarely agreed mark for mark, and that one Grader tended to give higher marks than the other.

The analysis with Pearson’s correlation coefficient seems to suggest that, while there isn’t one-to-one agreement, there is certainly a significant correlation – with the majority of the criteria having a correlation with p < 0.01!

The total marks also show a very strong, significant, positive correlation.

Ok, so that’s the conclusion here:  the Graders’ marks do not match, but show moderate to high positive correlation to a significant degree.

How’s My Stats?

Did I screw up somewhere?  Am I making fallacious claims?  Let me know – post a comment!

Some More Results: Did the Graders Agree?

My experiment makes a little bit of an assumption – and it’s the same assumption most teachers probably make before they hand back work.  We assume that the work has been graded correctly and objectively.

The rubric that I provided to my graders was supposed to help sort out all of this objectivity business.  It was supposed to boil down all of the subjectivity into a nice, discrete, quantitative value.

But I’m a careful guy, and I like back-ups.  That’s why I had 2 graders do my grading.  Both graders worked in isolation on the same submissions, with the same rubric.

So, did it work?  How did the grades match up?  Did my graders tend to agree?

Sounds like it’s time for some data analysis!

About these tables…

I’m about to show you two tables of data – one table for each assignment.  Each row of a table maps to a single criterion on that assignment’s rubric.

The columns are concerned with the graders’ marks for each criterion.  The first two columns, Grader 1 – Average and Grader 2 – Average, simply show the average mark each grader gave for each criterion.

Number of Agreements shows the number of times the marks between both graders matched for that criterion.  Similarly, Number of Disagreements shows how many times they didn’t match.  Agreement Percentage just converts those two values into a single percentage for agreement.

Average Disagreement Magnitude takes every instance where there was a disagreement, and averages the magnitude of the disagreement (a reminder:  the magnitude here is the absolute value of the difference).
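Here’s a small sketch of how those columns are computed, using made-up marks for a single criterion (the real numbers are in the tables):

```python
# Made-up marks for one criterion, given by both graders to five submissions.
grader1 = [4, 3, 2, 4, 1]
grader2 = [4, 4, 3, 4, 3]

pairs = list(zip(grader1, grader2))
agreements = sum(1 for a, b in pairs if a == b)
disagreements = len(pairs) - agreements
agreement_pct = 100.0 * agreements / len(pairs)

# Average disagreement magnitude: the absolute difference between the
# two marks, averaged over the disagreements only.
magnitudes = [abs(a - b) for a, b in pairs if a != b]
avg_magnitude = sum(magnitudes) / len(magnitudes)

print(agreements, disagreements, agreement_pct, round(avg_magnitude, 2))
# 2 3 40.0 1.33
```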

Finally, I should point out that these tables can be sorted by clicking on the headers.  This will probably make your interpretation of the data a bit easier.

So, if we’re clear on that, then let’s take a look at those tables…

Flights and Passengers Grader Comparison

[table id=6 /]

Decks and Cards Grader Comparison

[table id=7 /]

Findings and Analysis

It is very rare for the graders to fully agree

It only happened once, on the “add_passenger” correctness criterion of the Flights and Passengers assignment.  If you sort the tables by Number of Agreements (or Number of Disagreements), you’ll see what I mean.

Grader 2 tended to give higher marks than Grader 1

In fact, there are only a handful of cases (4, by my count), where this isn’t true:

  1. The add_passenger correctness criterion on Flights and Passengers
  2. The internal comments criterion on Flights and Passengers
  3. The error checking criterion on Decks and Cards
  4. The internal comments criterion on Decks and Cards

The graders tended to disagree more often on design and style

Sort the tables by Number of Disagreements descending, and take a look down the left-hand side.

There are 14 criteria in total for each assignment.  If you’ve sorted the tables like I’ve asked, the top 7 criteria of each assignment are:

Flights and Passengers
  1. Style
  2. Design of __str__ method
  3. Design of heaviest_passenger method
  4. Design of lightest_passenger method
  5. Docstrings
  6. Correctness of __str__ method
  7. Design of Flight constructor
Decks and Cards
  1. Correctness of deal method
  2. Style
  3. Design of cut method
  4. Design of __str__ method
  5. Docstrings
  6. Design of deal method
  7. Correctness of __str__ method

Of those 14, 9 have to do with design or style.  It’s also worth noting that Docstrings and the correctness of the __str__ methods are in there too.

There was slightly more disagreement in Decks and Cards than in Flights and Passengers

Total number of disagreements for Flights and Passengers:  136 (avg:  9.71 per criterion)

Total number of disagreements for Decks and Cards:  161 (avg:  11.5 per criterion)


Being Hands-off

From the very beginning, when I contacted / hired my Graders, I was very hands-off.  Each Grader was given the assignment specifications and rubrics ahead of time to look over, and then a single meeting to ask questions.  After that, I just handed them manila envelopes filled with submissions for them to mark.

Having spoken with some of the undergraduate instructors here in the department, I know that this isn’t usually how grading is done.

Usually, the instructor will have a big grading meeting with their TAs.  They’ll all work through a few submissions, and the TAs will be free to ask for a marking opinion from the instructor.

By being hands-off, I didn’t give my Graders the same level of guidance that they may have been used to.  I did, however, tell them that they were free to e-mail me or come up to me if they had any questions during their marking.

The hands-off thing was a conscious choice by Greg and myself.  We didn’t want me to bias the marking results, since I would know which submissions would be from the treatment group, and which ones would be from control.

Anyhow, the results from above have driven me to conclude that if you just hand your graders the assignments and the rubrics, and say “go”, you run the risk of seeing dramatic differences in grading from each Grader.  From a student’s perspective, this means that it’s possible to be marked by “the good Grader”, or “the bad Grader”.

I’m not sure if a marking-meeting like I described would mitigate this difference in grading.  I hypothesize that it would, but that’s an experiment for another day.

Questionable Calls

If you sort the Decks and Cards table by Number of Disagreements, you’ll find that the criterion that my Graders disagreed most on was the correctness of the “deal” method.  Out of 30 submissions, both Graders disagreed on that particular criterion 21 times (70%).

It’s a little strange to see that criterion all the way at the top there.  As I mentioned earlier, most of the disagreements tended to be concerning design and style.

So what happened?

Well, let’s take a look at some examples.

Example #1

The following is the deal method from participant #013:

def deal(self, num_to_deal):
  i = 0
  while i < num_to_deal:
    print self.deck.pop(0)
    i += 1

Grader 1 gave this method a 1 for correctness, where Grader 2 gave this method a 4.

That’s a big disagreement.  And remember, a 1 on this criterion means:

Barely meets assignment specifications. Severe problems throughout.

I think I might have to go with Grader 2 on this one.  Personally, I wouldn’t use a while-loop here – but that falls under the design criterion, and shouldn’t impact the correctness of the method.  I’ve tried the code out.  It works to spec.  It deals from the top of the deck, just like it’s supposed to.  Sure, there are some edge cases missed here (what if the Deck is empty?  What if we’re asked to deal more than the number of cards left?  What if we’re asked to deal a negative number of cards?)… but the method seems to deliver the basics.

Not sure what Grader 1 saw here.  Hmph.
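For what it’s worth, here’s a sketch of a deal that handles those edge cases – my own code in Python 3, with an assumed self.cards list, not anything a participant wrote:

```python
class Deck:
    def __init__(self, cards):
        self.cards = list(cards)  # index 0 is the top of the deck

    def deal(self, num_to_deal):
        # Guard the edge cases mentioned above before dealing anything.
        if num_to_deal < 0:
            raise ValueError("cannot deal a negative number of cards")
        if num_to_deal > len(self.cards):
            raise ValueError("cannot deal more cards than the deck holds")
        for _ in range(num_to_deal):
            print(self.cards.pop(0))  # deal from the top

deck = Deck(["Q of Hearts", "A of Spades", "7 of Clubs"])
deck.deal(1)  # prints "Q of Hearts"
```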

Example #2

The following is the deal method from participant #023:

def deal(self, num_to_deal):
 res = []
 for i in range(0, num_to_deal):
  res.append(self.cards[i])

Grader 1 gave this method a 0 for correctness.  Grader 2 gave it a 3.

I see two major problems with this method.  The first one is that it doesn’t print out the cards that are being dealt off:  instead, it stores them in a list.  Secondly, that list is just tossed out once the method exits, and nothing is returned.

A “0” for correctness simply means Unimplemented, which isn’t exactly true:  this method has been implemented, and has the right interface.

But it doesn’t conform to the specification whatsoever.  I would give this a 1.

So, in this case, I’d side more (but not agree) with Grader 1.

Example #3

This is the deal method from participant #025:

def deal(self, num_to_deal):
    num_cards_in_deck = len(self.cards)
    num_to_deal = int(num_to_deal)
    if num_to_deal > num_cards_in_deck:
        print "Cannot deal more than " + num_cards_in_deck + " cards\n"
    i = 0
    while i < num_to_deal:
        print str(self.cards[i])
        i += 1
    self.cards = self.cards[num_to_deal:]
    print "Error using deal\n"

Grader 1 also gave this method a 1 for correctness, where Grader 2 gave a 4.

The method is pretty awkward from a design perspective, but it seems to behave as it should – it deals the provided number of cards off of the top of the deck and prints them out.

It also catches some edge-cases:  num_to_deal is converted to an int, and we check to ensure that num_to_deal is less than or equal to the number of cards left in the deck.

Again, I’ll have to side more with Grader 2 here.

Example #4

This is the deal method from participant #030:

def deal(self, num_to_deal):
  i = 0
  while i <= num_to_deal:
    print self.cards[0]
    del self.cards[0]

Grader 1 gave this a 1.  Grader 2 gave this a 4.

Well, right off the bat, there’s a major problem:  this while-loop never exits.  The while-loop is waiting for the value i to become greater than num_to_deal… but it never can, because i is initialized to 0, and never incremented.

So this method doesn’t even come close to satisfying the spec.  The description for a “1” on this criterion is:

Barely meets assignment specifications. Severe problems throughout.

I’d have to side with Grader 1 on this one.  The only thing this method delivers in accordance with the spec is the right interface.  That’s about it.
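Just to show how close it was, here’s the minimal repair – my own sketch, in Python 3:  add the missing increment, and use < rather than <= so that exactly num_to_deal cards are dealt.  The other spec problems (like dealing from an empty deck) are left alone.

```python
class Deck:
    def __init__(self, cards):
        self.cards = list(cards)

    def deal(self, num_to_deal):
        i = 0
        while i < num_to_deal:   # was <=, which would deal one card too many
            print(self.cards[0])
            del self.cards[0]
            i += 1               # the missing increment
```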

Dealing from the Bottom of the Deck

I received an e-mail from Grader 2 about the deal method.  I’ve paraphrased it here:

If the students create the list of cards in a typical way (for suit in CARD_SUITS: for rank in CARD_RANKS: …), and then print using something like:

for card in self.cards:
    print str(card) + “\n”

then for deal, if they pick the cards to deal using pop() somehow, like:

for i in range(num_to_deal):
    print str(self.cards.pop())

aren’t they dealing from the bottom?

My answer was “yes, they are, and that’s a correctness problem”.  In my assignment specification, I was intentionally vague about the internal collection of the cards – I let the participant figure that all out.  All that mattered was that the model made sense, and followed the rules.

So if I print my deck, and it prints:

Q of Hearts
A of Spades
7 of Clubs

Then deal(1) should print:

Q of Hearts

regardless of the internal organization.
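In other words, the spec pins down observable behaviour, not storage.  Here’s a quick sketch (hypothetical class names, my own code) of two decks with opposite internal orders, both dealing the same top card:

```python
class TopFirstDeck:
    """Stores cards with the top of the deck at index 0."""
    def __init__(self, cards_top_to_bottom):
        self.cards = list(cards_top_to_bottom)
    def deal_one(self):
        return self.cards.pop(0)

class BottomFirstDeck:
    """Stores cards with the top of the deck at the end of the list."""
    def __init__(self, cards_top_to_bottom):
        self.cards = list(reversed(cards_top_to_bottom))
    def deal_one(self):
        return self.cards.pop()

deck_order = ["Q of Hearts", "A of Spades", "7 of Clubs"]
print(TopFirstDeck(deck_order).deal_one())     # Q of Hearts
print(BottomFirstDeck(deck_order).deal_one())  # Q of Hearts
```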

Anyhow, only Grader 2 asked for clarification on this, and I thought this might be the reason for all of the disagreement on the deal method.

Looking at all of the disagreements on the deal methods, it looks like 7 out of the 21 can be accounted for by students unintentionally dealing from the bottom of the deck, with only Grader 2 catching it.

Subtracting the “dealing from the bottom” disagreements from the total leaves us with 14, which puts it more in line with some of the other correctness criteria.

So I’d have to say that, yes, the “dealing from the bottom” problem is what made the Graders disagree so much on this criterion:  only 1 Grader realized that it was a problem while they were marking.  Again, I think this was symptomatic of my hands-off approach to this part of the experiment.

In Summary

My graders disagreed.  A lot.  And a good chunk of those disagreements were about style and design.  Some of these disagreements might be attributable to my hands-off approach to the grading portion of the experiment.  Some of them seem to be questionable calls from the Graders themselves.

Part of my experiment was interested in determining how closely peer grades from students can approximate grades from TAs.  Since my TAs have trouble agreeing amongst themselves, I’m not sure how that part of the analysis is going to play out.

I hope the rest of my experiment is unaffected by their disagreement.

Stay tuned.

See anything?

Do my numbers make no sense?  Have I contradicted myself?  Have I missed something critical?  Are there unanswered questions here that I might be able to answer?  I’d love to know.  Please comment!

Research Experiment: A Recap

Before I start diving into results, I’m just going to recap my experiment so we’re all up to speed.

I’ll try to keep it short, sweet, and punchy – but remember, this is a couple of months of work right here.

Ready?  Here we go.

What I was looking for

A quick refresher on what code review is

Code review is like the software industry equivalent of a taste test.  A developer makes a change to a piece of software, puts that change up for review, and a few reviewers take a look at that change to make sure it’s up to snuff.  If some issues are found during the course of the review, the developer can go back and make revisions.  Once the reviewers give it the thumbs up, the change is put into the software.

That’s an oversimplified description of code review,  but it’ll do for now.

So what?

What’s important is to know that it works. Jason Cohen showed that code review reduces the number of defects that enter the final software product. That’s great!

But there are some other cool advantages to doing code review as well.

  1. It helps to train up new hires.  They can lurk during reviews to see how more experienced developers look at the code.  They get to see what’s happening in other parts of the software.  They get their code reviewed, which means direct, applicable feedback.  All good things.
  2. It helps to clean and homogenize the code.  Since the code will be seen by their peers, developers are generally compelled to not put up “embarrassing” code (or, if they do, to at least try to explain why they did).  Code review is a great way to compel developers to keep their code readable and consistent.
  3. It helps to spread knowledge and good practices around the team.  New hires aren’t the only ones to benefit from code reviews.  There’s always something you can learn from another developer, and code review is where that will happen.  And I believe this is true not just for those who receive the reviews, but also for those who perform the reviews.

That last one is important.  Code review sounds like an excellent teaching tool.

So why isn’t code review part of the standard undergraduate computer science education?  Greg and I hypothesized that the reason that code review isn’t taught is because we don’t know how to teach it.

I’ll quote myself:

What if peer code review isn’t taught in undergraduate courses because we just don’t know how to teach it?  We don’t know how to fit it in to a curriculum that’s already packed to the brim.  We don’t know how to get students to take it seriously.  We don’t know if there’s pedagogical value, let alone how to show such value to the students.

The idea

Inspired by work by Joordens and Pare, Greg and I developed an approach to teaching code review that integrates itself nicely into the current curriculum.

Here’s the basic idea:

Suppose we have a computer programming class.  Also suppose that after each assignment, each student is randomly presented with anonymized assignment submissions from some of their peers.  Students will then be asked to anonymously peer grade these assignment submissions.

Now, before you go howling your head off about the inadequacy / incompetence of student markers, or the PeerScholar debacle, read this next paragraph, because there’s a twist.

The assignment submissions will still be marked by TAs as usual.  The grades that a student receives from her peers will not directly affect her mark.  Instead, the student is graded on how well she graded her peers:  the peer reviews that a student completes will be compared with the grades that the TAs delivered.  The closer a student is to the TA, the better the mark she gets on her “peer grading” component (which is distinct from the mark she receives for the programming assignment itself).
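As a toy illustration of that idea – this is my own sketch, not the scheme we actually implemented; the max_mark and the linear rescaling are assumptions:

```python
def peer_grading_score(student_marks, ta_marks, max_mark=4):
    """Score a student's peer grading by its closeness to the TA's marks.

    Perfect agreement earns 100; being off by max_mark on every
    criterion earns 0.  Both inputs are per-criterion mark lists.
    """
    diffs = [abs(s - t) for s, t in zip(student_marks, ta_marks)]
    avg_diff = sum(diffs) / len(diffs)
    return 100.0 * (1 - avg_diff / max_mark)

print(peer_grading_score([4, 3, 2], [4, 3, 2]))  # 100.0 – matched the TA exactly
print(peer_grading_score([4, 0, 2], [4, 4, 2]))  # lower – one big disagreement
```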

Now, granted, the idea still needs some fleshing out, but already, we’ve got some questions that need answering:

  1. Joordens and Pare showed that for short written assignments, you need about 5 peer reviews to predict the mark that the TA will give.  Is this also true for computer programming assignments?
  2. Grading students based on how much their peer grading matches TA grading assumes that the TA is an infallible point of reference.  How often do TAs disagree amongst themselves?
  3. Would peer grading like this actually make students better programmers?  Is there a significant difference in the quality of their programming after they perform the grading?
  4. What would students think of peer grading computer programming assignments?  How would they feel about it?

So those were my questions.

How I went about looking for the answers

Here’s the design of the experiment in a nutshell:

Writing phase

I have a treatment group and a control group, both composed of undergraduate students.  Here’s how a session goes:

  1. Participants in both groups write a short pre-experiment questionnaire.
  2. Both groups have half an hour to work on a short programming assignment.
  3. The treatment group then has another half an hour to peer grade some submissions for the assignment they just wrote.  The submissions that they mark are mocked up by me, and are the same for each participant in the treatment group.  The control group does not perform any grading – instead, they do an unrelated vocabulary exercise for the same amount of time.
  4. Participants in both groups then have another half an hour to work on the second short programming assignment.
  5. Participants in the treatment group write a short post-experiment questionnaire to capture their impressions of the peer grading experience.
  6. The participants are released.

Here’s a picture to help you visualize what you just read.

Tasks for each group in my experiment.

So now I’ve got two piles of submissions – one for each assignment, 60 submissions in total.  I add my mock-ups to each pile – five per pile – which brings it to 35 submissions in each pile, and 70 submissions in total.

Marking phase

I assign ID numbers to each submission, shuffle them up, and hand them off to some graduate-level TAs that I hired.  The TAs will grade each assignment using the same marking rubric that the treatment group used to peer grade.  They will not know whether they are grading a treatment group submission, a control group submission, or a mock-up.

Choosing phase

After the grading is completed, I remove the mock-ups, and pair up submissions in both piles based on who wrote it.  So now I’ve got 30 pairs of submissions:  one for each student.  I then ask my graders to look at each pair, knowing that they’re both written by the same student, and to choose which one they think is better coded, and to rate and describe the difference (if any) between the two.  This is an attempt to catch possible improvements in the treatment group’s code that might not be captured in the marking rubric.

So that’s what I did

So everything you’ve just read is what I’ve just finished doing.

Once the submissions are marked, I’ll analyze the marks for the following:

  1. Comparing the two groups, is there any significant improvement in the marks from the first assignment to the second in the treatment group?
    1. If there was an improvement, on which criteria?  And how much of an improvement?
  2. How did the students do at grading my mock-ups?  How similar were their peer grades to what the TAs gave?
  3. How much did my two graders agree with one another?
  4. During the choosing phase, did my graders tend to choose the second assignment over the first assignment more often for the treatment group?

And I’ll also analyze the post-experiment questionnaire to get student feedback on their grading experience.

Ok, so that’s where I’m at.  Stay tuned for results.