Some More Results: Did the Graders Agree? – Part 2

(Click here to read the first part of the story)

I’m just going to come right out and say it:  I’m no stats buff.

Actually, maybe that’s giving myself too much credit.  I barely scraped through my compulsory statistics course.  In my defense, the teaching was abysmal, and the class average was in the sewer the entire time.

So, unfortunately, I don’t have the statistical chops that a real scientist should.

But, today, I learned a new trick.

Pearson’s Correlation Co-efficient

Joordens and Pare gave me the idea while I was reviewing their paper for the Related Work section of my thesis.  They used it to inspect mark agreement between their expert markers.

In my last post on Grader agreement, I was looking at mark agreement at the equivalence level.  Pearson’s Correlation Co-efficient should (I think) let me inspect mark agreement at the “shape” level.

And by shape level, I mean this:  if Grader 1 gives a high mark for a participant, then Grader 2 gives a high mark.  If Grader 1 gives a low mark for the next participant, then Grader 2 gives a low mark.  These high and low marks might not be equal, but the basic shape of the thing is there.

And this page, with its useful table, tells me how I can tell if the correlation co-efficient that I find is significant.  Awesome.

At least, that’s my interpretation of Pearson’s Correlation Co-efficient.  Maybe I’ve got it wrong.  Please let me know if I do.
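To make the “shape” idea concrete, here’s a minimal sketch in Python.  The marks are made up for illustration (they are not data from my study), and `pearson_r` is just a name I picked for a plain implementation of the textbook formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation co-efficient for two equal-length lists of marks."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of each grader's deviations from their own mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up marks for five participants (NOT data from the study)
grader_1 = [1, 3, 2, 4, 0]
grader_2 = [2, 4, 3, 4, 1]  # usually one mark higher than Grader 1

# Equivalence-level agreement: how often the two marks are identical
exact_matches = sum(a == b for a, b in zip(grader_1, grader_2))

# Shape-level agreement: do the marks rise and fall together?
r = pearson_r(grader_1, grader_2)
print(exact_matches, round(r, 2))
```

In this toy example the two Graders give the same mark only once out of five times, yet the co-efficient comes out at about 0.97, because the highs and lows line up.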

Anyhow, it can’t hurt to look at some more tables.  Let’s do that.

About these tables…

Like my previous post on graders, I’ve organized my data into two tables – one for each assignment.

Each table has a row for each of that assignment’s criteria.

Each table has four columns.  The first is strictly to list the assignment criteria.  The second and third give the average mark that each Grader gave on that criterion.  The fourth gives the Pearson Correlation Co-efficient for each criterion.  The correlation measurement is between the marks that my two Graders gave on that criterion across all 30 submissions for that assignment.

I hope that makes sense.

Anyways, here goes…

Da-ta!

Decks and Cards Grader Correlation Table

Criterion                       Grader 1 – Average   Grader 2 – Average   Pearson's Correlation Co-efficient
Deck Constructor                3.37                 3.57                 0.65
Design of Deck Constructor      3                    3.77                 0.5
__str__                         2.63                 3.4                  0.57
Design of __str__               2.33                 3.67                 0.36
deal                            2.27                 3.03                 0.57
Design of deal                  2.53                 3.7                  0.4
shuffle                         3.23                 3.53                 0.77
Design of shuffle               3                    3.47                 0.78
cut                             2.67                 2.97                 0.88
Design of cut                   2.17                 2.9                  0.78
Error Checking                  1.07                 0.93                 0.95
Style                           2.9                  3.63                 0.52
Docstrings                      1.87                 2.03                 0.7
Internal Comments               1.1                  0.83                 0.56

Flights and Passengers Grader Correlation Table

Criterion                       Grader 1 – Average   Grader 2 – Average   Pearson's Correlation Co-efficient
Flight Constructor              3.67                 3.73                 0.97
Design of Flight Constructor    3.43                 3.93                 0.72
__str__                         3.03                 3.37                 0.8
Design of __str__               2.4                  3.4                  0.57
add_passenger                   3.9                  3.9                  1
Design of add_passenger         3.53                 3.87                 0.77
heaviest_passenger              3                    3.27                 0.87
Design of heaviest_passenger    2.17                 3.1                  0.46
lightest_passenger              2.83                 3.03                 0.9
Design of lightest_passenger    2                    2.83                 0.64
Error Checking                  1.4                  1.73                 0.85
Style                           2.8                  3.53                 0.68
Docstrings                      1.47                 1.9                  0.87
Internal Comments               0.73                 0.67                 0.76

What does this tell us?

Well, first off, remember that for each assignment, for each criterion, there were 30 submissions.

So N = 30.

In order to determine if the correlation co-efficients are significant, we look at this table, and find N – 2 down the left hand side:

28                       .306    .361    .423    .463

Those 4 values on the right are the critical values that we want to pass for significance at p < 0.10, p < 0.05, p < 0.02 and p < 0.01, respectively.

Good news!  All of the correlation co-efficients exceed .306, the smallest critical value, and most of them exceed .463, the largest.  So now, I’ll show you their significance by level:

p < 0.10

  • Design of __str__ in Decks and Cards assignment

p < 0.05

  • Design of deal method in Decks and Cards assignment

p < 0.02

  • Design of heaviest_passenger method in Flights and Passengers

p < 0.01

Decks and Cards
  • Design of Deck constructor
  • Style
  • Internal Comments
  • __str__ method correctness
  • deal method correctness
  • Deck constructor correctness
  • Docstrings
  • shuffle method correctness
  • Design of shuffle method
  • Design of cut method
  • cut method correctness
  • Error checking
Flights and Passengers
  • Design of __str__ method
  • Design of lightest_passenger method
  • Style
  • Design of Flight constructor
  • Internal comments
  • Design of add_passenger method
  • __str__ method correctness
  • Error checking
  • heaviest_passenger method correctness
  • Docstrings
  • lightest_passenger method correctness
  • Flight constructor correctness
  • add_passenger method correctness
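
The sorting above boils down to a simple lookup, which can be sketched like this.  I’m assuming that the four critical values in the row for 28 line up with p < 0.10, 0.05, 0.02 and 0.01 in that order, and that a co-efficient counts as significant when it meets or exceeds the critical value.  The names `CRITICAL_VALUES_DF_28` and `significance_level` are mine.

```python
# Critical values of r for N - 2 = 28, copied from the row quoted above,
# paired with significance levels and ordered from strictest to loosest.
CRITICAL_VALUES_DF_28 = [
    (0.01, 0.463),
    (0.02, 0.423),
    (0.05, 0.361),
    (0.10, 0.306),
]

def significance_level(r):
    """Return the strictest level whose critical value |r| reaches, else None."""
    for level, critical in CRITICAL_VALUES_DF_28:
        if abs(r) >= critical:
            return level
    return None

# A few co-efficients taken from the tables above
examples = {
    "Design of __str__ (Decks and Cards)": 0.36,
    "Design of deal": 0.4,
    "Design of heaviest_passenger": 0.46,
    "Design of Deck Constructor": 0.5,
}
for criterion, r in examples.items():
    print(f"{criterion}: r = {r}, p < {significance_level(r)}")
```

A co-efficient below .306 would come back as `None`, meaning it does not reach significance at any of the four levels.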

Wow!

Correlation of Mark Totals

Joordens and Pare ran their correlation statistics on assignments that were marked on a scale from 1 to 10.  I can do the same type of analysis by simply running Pearson’s on the totals for each participant by each Grader.

Drum roll, please…

Decks and Cards

r(28) = 0.89, p < 0.01

Flights and Passengers

r(28) = 0.92, p < 0.01

Awesome!

Summary / Conclusion

I already showed before that my two Graders rarely agreed mark for mark, and that one Grader tended to give higher marks than the other.

The analysis with Pearson’s correlation co-efficient seems to suggest that, while there isn’t one-to-one agreement, there is certainly a significant correlation – with the majority of the criteria having a correlation with p < 0.01!

The total marks also show a very strong, significant, positive correlation.

Ok, so that’s the conclusion here:  the Graders’ marks do not match, but show moderate to high positive correlation to a significant degree.

How’s My Stats?

Did I screw up somewhere?  Am I making fallacious claims?  Let me know – post a comment!

2 thoughts on “Some More Results: Did the Graders Agree? – Part 2”

  1. Jonathan Lung

    Interesting results. Wasn’t looking for stats errors and didn’t notice any. Your conclusion is that “the Graders marks do not match, but show moderate to high positive correlation to a significant degree”; so, assuming your statistics to be correct, even though you’ve found moderate to high correlation (and statistically significant), do you think there is sufficient correlation to say that it would be “fair” to let either grader’s mark stand?

    On a different note, you mentioned that Steve and Dwayne found five peer reviews were required to predict the TA’s grade; have you compared the granularity and degree of objectivity of your marking rubric and theirs? It seems like a necessary step if you want to see if the results for computer science are consistent with theirs. While I don’t have a copy of Steve’s rubric, I would guess that your rubric is much more comprehensive and detailed since you’re dealing with things on a method-by-method basis. On the other hand, who knows… if you had *less* granularity, you’re dealing with an average of averages and might get even better correlation. Anyway, that sounds like a different experiment entirely.

  2. Mike

    @Jonathan:

    Sorry it took so long to get back to you on this.

    > do you think there is sufficient correlation to say that it would be “fair” to let either grader’s mark stand?

    Good question. I think, for the most part, after adjustments (instructors *do* tend to adjust grades to compensate for “harder” and “easier” TAs), it would be fair to let the grader’s mark stand. If a set of students complained, I might be compelled to take a closer look.

    But then again, I’ve never been a course instructor. 😉

    > have you compared the granularity and degree of objectivity of your marking rubric and theirs?

    Joordens and Pare’s experiment was slightly different in that they didn’t really use a rubric. Instead, the assignments were marked out of 10.

    The advantage of this design is that if there is some interesting correlation, we *might* see different behaviour in the design and correctness criteria…and if not, no loss.

    Thanks for posting,

    -Mike
