{"id":1676,"date":"2010-08-24T23:38:45","date_gmt":"2010-08-25T04:38:45","guid":{"rendered":"http:\/\/mikeconley.ca\/blog\/?p=1676"},"modified":"2023-12-20T16:25:15","modified_gmt":"2023-12-20T21:25:15","slug":"some-more-results-did-the-graders-agree","status":"publish","type":"post","link":"https:\/\/mikeconley.ca\/blog\/2010\/08\/24\/some-more-results-did-the-graders-agree\/","title":{"rendered":"Some More Results:  Did the Graders Agree?"},"content":{"rendered":"<p><a href=\"http:\/\/mikeconley.ca\/blog\/2010\/08\/11\/research-experiment-a-recap\/\">My experiment<\/a> makes a little bit of an assumption &#8211; and it&#8217;s the same assumption most teachers probably make before they hand back work.\u00a0 We assume that <em>the work has been graded correctly and objectively.<\/em><\/p>\n<p>The rubric that I provided to my graders was supposed to help sort out all of this objectivity business.\u00a0 It was supposed to boil down all of the subjectivity into a nice, discrete, quantitative value.<\/p>\n<p>But I&#8217;m a careful guy, and I like back-ups.\u00a0 That&#8217;s why I had 2 graders do my grading.\u00a0 Both graders worked in isolation on the same submissions, with the same rubric.<\/p>\n<p>So, did it work?\u00a0 How did the grades match up?\u00a0 Did my graders tend to agree?<\/p>\n<p>Sounds like it&#8217;s time for some data analysis!<\/p>\n<h3>About these tables&#8230;<\/h3>\n<p>I&#8217;m about to show you two tables of data &#8211; one table for <a href=\"http:\/\/mikeconley.ca\/blog\/2010\/08\/20\/my-experiment-apparatus-the-assignments-rubrics-and-mock-ups\/\">each assignment<\/a>.\u00a0 The rows of the tables map to a single criterion on that <a href=\"http:\/\/mikeconley.ca\/blog\/2010\/08\/20\/my-experiment-apparatus-the-assignments-rubrics-and-mock-ups\/\">assignment&#8217;s rubric<\/a>.<\/p>\n<p>The columns are concerned with the graders&#8217; marks for each criterion.\u00a0 The first two columns, <strong>Grader 1 &#8211; Average<\/strong> and <strong>Grader 2 
&#8211; Average<\/strong>, simply show the average mark given for each criterion for each grader.<\/p>\n<p><strong>Number of Agreements<\/strong> shows the number of times the marks between both graders matched for that criterion.\u00a0 Similarly, <strong>Number of Disagreements<\/strong> shows how many times they didn&#8217;t match.\u00a0 <strong>Agreement Percentage<\/strong> just converts those two values into a single percentage for agreement.<\/p>\n<p><strong>Average Disagreement Magnitude<\/strong> takes every instance where there was a disagreement, and averages the magnitude of the disagreement (a reminder:\u00a0 the magnitude here is the absolute value of the difference).<\/p>\n<p>Finally, I should point out that these tables can be sorted by clicking on the headers.\u00a0 This will probably make your interpretation of the data a bit easier.<\/p>\n<p>So, if we&#8217;re clear on that, then let&#8217;s take a look at those tables&#8230;<\/p>\n<h3>Flights and Passengers Grader Comparison<\/h3>\n<p>[table id=6 \/]<\/p>\n<h3>Decks and Cards Grader Comparison<\/h3>\n<p>[table id=7 \/]<\/p>\n<h3>Findings and Analysis<\/h3>\n<h4>It is very rare for the graders to fully agree<\/h4>\n<p>It only happened once, on the &#8220;add_passenger&#8221; correctness criterion of the Flights and Passengers assignment. 
\u00a0If you sort the tables by &#8220;Number of Agreements&#8221; (or Number of Disagreements), you&#8217;ll see what I mean.<\/p>\n<h4>Grader 2 tended to give higher marks than Grader 1<\/h4>\n<p>In fact, there are only a handful of cases (4, by my count) where this isn&#8217;t true:<\/p>\n<ol>\n<li>The add_passenger correctness criterion on Flights and Passengers<\/li>\n<li>The internal comments criterion on Flights and Passengers<\/li>\n<li>The error checking criterion on Decks and Cards<\/li>\n<li>The internal comments criterion on Decks and Cards<\/li>\n<\/ol>\n<h4>The graders tended to disagree more often on design and style<\/h4>\n<p>Sort the tables by Number of Disagreements descending, and take a look down the left-hand side.<\/p>\n<p>There are 14 criteria in total for each assignment. \u00a0If you&#8217;ve sorted the tables like I&#8217;ve asked, the top 7 criteria of each assignment are:<\/p>\n<h5>Flights and Passengers<\/h5>\n<ol>\n<li>Style<\/li>\n<li>Design of __str__ method<\/li>\n<li>Design of heaviest_passenger method<\/li>\n<li>Design of lightest_passenger method<\/li>\n<li>Docstrings<\/li>\n<li>Correctness of __str__ method<\/li>\n<li>Design of Flight constructor<\/li>\n<\/ol>\n<h5>Decks and Cards<\/h5>\n<ol>\n<li>Correctness of deal method<\/li>\n<li>Style<\/li>\n<li>Design of cut method<\/li>\n<li>Design of __str__ method<\/li>\n<li>Docstrings<\/li>\n<li>Design of deal method<\/li>\n<li>__str__<\/li>\n<\/ol>\n<p>Of those 14, <strong>9<\/strong> have to do with design or style. 
\u00a0It&#8217;s also worth noting that Docstrings and the correctness of the __str__ methods are in there too.<\/p>\n<h4>There was slightly more disagreement in Decks and Cards than in Flights and Passengers<\/h4>\n<p>Total number of disagreements for Flights and Passengers: \u00a0136 (avg: \u00a09.71 per criterion)<\/p>\n<p>Total number of disagreements for Decks and Cards: \u00a0161 (avg: \u00a011.5 per criterion)<\/p>\n<h3>Discussion<\/h3>\n<h4>Being Hands-off<\/h4>\n<p>From the very beginning, when I contacted \/ hired my Graders, I was very hands-off.\u00a0 Each Grader was given the assignment specifications and rubrics ahead of time to look over, and then a single meeting to ask questions.\u00a0 After that, I just handed them manila envelopes filled with submissions for them to mark.<\/p>\n<p>Having spoken with some of the undergraduate instructors here in the department, I know that this isn&#8217;t usually how grading is done.<\/p>\n<p>Usually, the instructor will have a big grading meeting with their TAs.\u00a0 They&#8217;ll all work through a few submissions, and the TAs will be free to ask for a marking opinion from the instructor.<\/p>\n<p>By being hands-off, I didn&#8217;t give my Graders the same level of guidance that they may have been used to.\u00a0 I did, however, tell them that they were free to e-mail me or come up to me if they had any questions during their marking.<\/p>\n<p>The hands-off thing was a conscious choice by Greg and myself.\u00a0 We didn&#8217;t want me to bias the marking results, since I would know which submissions would be from the treatment group, and which ones would be from control.<\/p>\n<p>Anyhow, the results from above have driven me to conclude that if you just hand your graders the assignments and the rubrics, and say &#8220;go&#8221;, you run the risk of seeing dramatic differences in grading from each Grader.\u00a0 From a student&#8217;s perspective, this means that it&#8217;s possible to be marked by &#8220;the 
good Grader&#8221;, or &#8220;the bad Grader&#8221;.<\/p>\n<p>I&#8217;m not sure if a marking meeting like the one I described would mitigate this difference in grading.\u00a0 I hypothesize that it would, but that&#8217;s an experiment for another day.<\/p>\n<h4>Questionable Calls<\/h4>\n<p>If you sort the Decks and Cards table by Number of Disagreements, you&#8217;ll find that the criterion that my Graders disagreed most on was the correctness of the &#8220;deal&#8221; method.\u00a0 Out of 30 submissions, the Graders disagreed on that particular criterion 21 times (70%).<\/p>\n<p>It&#8217;s a little strange to see that criterion all the way at the top there.\u00a0 As I mentioned earlier, most of the disagreements tended to be concerning design and style.<\/p>\n<p>So what happened?<\/p>\n<p>Well, let&#8217;s take a look at some examples.<\/p>\n<h5>Example #1<\/h5>\n<p>The following is the deal method from participant #013:<\/p>\n<pre>def deal(self, num_to_deal):\r\n  i = 0\r\n  while i &lt; num_to_deal:\r\n    print self.deck.pop(0)\r\n    i += 1\r\n<\/pre>\n<p>Grader 1 gave this method a 1 for correctness, where Grader 2 gave this method a 4.<\/p>\n<p>That&#8217;s a big disagreement.\u00a0 And remember, a 1 on this criterion means:<\/p>\n<blockquote><p>Barely meets assignment specifications. 
Severe problems throughout.<\/p><\/blockquote>\n<p>I think I might have to go with Grader 2 on this one.\u00a0 Personally, I wouldn&#8217;t use a while-loop here &#8211; but that falls under the design criterion, and shouldn&#8217;t impact the correctness of the method.\u00a0 I&#8217;ve tried the code out.\u00a0 It works to spec.\u00a0 It deals from the top of the deck, just like it&#8217;s supposed to.\u00a0 Sure, there are some edge cases missed here (what if the Deck is empty?\u00a0 What if we&#8217;re asked to deal more than the number of cards left?\u00a0 What if we&#8217;re asked to deal a negative number of cards?\u00a0 etc.)&#8230; but the method seems to deliver the basics.<\/p>\n<p>Not sure what Grader 1 saw here.\u00a0 Hmph.<\/p>\n<h5>Example #2<\/h5>\n<p>The following is the deal method from participant #023:<\/p>\n<pre>def deal(self, num_to_deal):\r\n res = []\r\n for i in range(0, num_to_deal):\r\n   res.append(self.cards.pop(0))\r\n<\/pre>\n<p>Grader 1 gave this method a 0 for correctness.\u00a0 Grader 2 gave it a 3.<\/p>\n<p>I see two major problems with this method.\u00a0 The first one is that it doesn&#8217;t print out the cards that are being dealt off:\u00a0 instead, it stores them in a list.\u00a0 Secondly, that list is just tossed out once the method exits, and nothing is returned.<\/p>\n<p>A &#8220;0&#8221; for correctness simply means Unimplemented, which isn&#8217;t exactly true:\u00a0 this method has been implemented, and has the right interface.<\/p>\n<p>But it doesn&#8217;t conform to the specification whatsoever.\u00a0 I would give this a 1.<\/p>\n<p>So, in this case, I&#8217;d side more (but not agree) with Grader 1.<\/p>\n<h5>Example #3<\/h5>\n<p>This is the deal method from participant #025:<\/p>\n<pre>def deal(self, num_to_deal):\r\n    num_cards_in_deck = len(self.cards)\r\n    try:\r\n        num_to_deal = int(num_to_deal)\r\n        if num_to_deal &gt; num_cards_in_deck:\r\n            print \"Cannot deal more than \" + 
num_cards_in_deck + \" cards\\n\"\r\n        i = 0\r\n        while i &lt; num_to_deal:\r\n            print str(self.cards[i])\r\n            i += 1\r\n        self.cards = self.cards[num_to_deal:]\r\n    except:\r\n        print \"Error using deal\\n\"\r\n<\/pre>\n<p>Grader 1 also gave this method a 1 for correctness, where Grader 2 gave a 4.<\/p>\n<p>The method is pretty awkward from a design perspective, but it seems to behave as it should &#8211; it deals the provided number of cards off of the top of the deck and prints them out.<\/p>\n<p>It also catches some edge-cases:\u00a0 num_to_deal is converted to an int, and we check to ensure that num_to_deal is less than or equal to the number of cards left in the deck.<\/p>\n<p>Again, I&#8217;ll have to side more with Grader 2 here.<\/p>\n<h5>Example #4<\/h5>\n<p>This is the deal method from participant #030:<\/p>\n<pre>def deal(self, num_to_deal):\r\n  ''''''\r\n  i = 0\r\n  while i &lt;= num_to_deal:\r\n    print self.cards[0]\r\n    del self.cards[0]\r\n<\/pre>\n<p>Grader 1 gave this a 1.\u00a0 Grader 2 gave this a 4.<\/p>\n<p>Well, right off the bat, there&#8217;s a major problem:\u00a0 this while-loop never exits.\u00a0 The while-loop is waiting for the value i to become greater than num_to_deal&#8230;but it never can, because i is initialized to 0, and never incremented.\u00a0 In practice, the loop just keeps dealing until the deck is empty and self.cards[0] raises an IndexError.\u00a0 (And even if i were incremented, the &lt;= comparison would deal one card too many.)<\/p>\n<p>So this method doesn&#8217;t even come close to satisfying the spec.\u00a0 The description for a &#8220;1&#8221; on this criterion is:<\/p>\n<blockquote><p>Barely meets assignment specifications. 
Severe problems throughout.<\/p><\/blockquote>\n<p>I&#8217;d have to side with Grader 1 on this one.\u00a0 The only thing this method delivers in accordance with the spec is the right interface.\u00a0 That&#8217;s about it.<\/p>\n<h4>Dealing from the Bottom of the Deck<\/h4>\n<p>I received an e-mail from Grader 2 about the deal method.\u00a0 I&#8217;ve paraphrased it here:<\/p>\n<blockquote><p>If the students create the list of cards in a typical way, for suit in  CARD_SUITS; for rank in CARD_RANKS, and then print using something like:<br \/>\nfor card in self.cards<br \/>\nprint str(card) +\u00a0 &#8220;\\n&#8221;<br \/>\nThen for deal, if they pick the cards to deal using pop() somehow, like:<br \/>\nfor i in range(num_to_deal):<br \/>\nprint str(self.cards.pop())<\/p>\n<p>Aren&#8217;t they dealing from the bottom?<\/p><\/blockquote>\n<p>My answer was &#8220;yes, they are, and that&#8217;s a correctness problem&#8221;.\u00a0 In my assignment specification, I was intentionally vague about the internal collection of the cards &#8211; I let the participant figure that all out.\u00a0 All that mattered was that the model made sense, and followed the rules.<\/p>\n<p>So if I print my deck, and it prints:<\/p>\n<pre>Q of Hearts\r\nA of Spades\r\n7 of Clubs<\/pre>\n<p>Then deal(1) should print:<\/p>\n<pre>Q of Hearts\r\n<\/pre>\n<p>regardless of the internal organization.<\/p>\n<p>Anyhow, only Grader 2 asked for clarification on this, and I thought this might be the reason for all of the disagreement on the deal method.<\/p>\n<p>Looking at all of the disagreements on the deal methods, it looks like 7 out of the 21 can be accounted for because students were unintentionally dealing from the bottom of the deck, and only Grader 2 caught it.<\/p>\n<p>Subtracting the &#8220;dealing from the bottom&#8221; disagreements from the total leaves us with 14, which puts it more in line with some of the other correctness criteria.<\/p>\n<p>So I&#8217;d have to say that, yes, the 
&#8220;dealing from the bottom&#8221; problem is what made the Graders disagree so much on this criterion:\u00a0 only 1 Grader realized that it was a problem while they were marking.\u00a0 Again, I think this was symptomatic of my hands-off approach to this part of the experiment.<\/p>\n<h3>In Summary<\/h3>\n<p>My graders disagreed.\u00a0 A lot.\u00a0 And a good chunk of those disagreements were about style and design.\u00a0 Some of these disagreements might be attributable to my hands-off approach to the grading portion of the experiment.\u00a0 Some of them seem to be questionable calls from the Graders themselves.<\/p>\n<p>Part of my experiment was aimed at determining how closely peer grades from students can approximate grades from TAs.\u00a0 Since my TAs have trouble agreeing amongst themselves, I&#8217;m not sure how that part of the analysis is going to play out.<\/p>\n<p>I hope the rest of my experiment is unaffected by their disagreement.<\/p>\n<p>Stay tuned.<\/p>\n<h3>See anything?<\/h3>\n<p>Do my numbers make no sense?\u00a0 Have I contradicted myself?\u00a0 Have I missed something critical?\u00a0 Are there unanswered questions here that I might be able to answer?\u00a0 I&#8217;d love to know.\u00a0 Please comment!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>My experiment makes a little bit of an assumption &#8211; and it&#8217;s the same assumption most teachers probably make before they hand back work.\u00a0 We assume that the work has been graded correctly and objectively. 
The rubric that I provided to my graders was supposed to help sort out all of this objectivity business.\u00a0 It [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[454,626],"tags":[831,828,829,830,824,827,823,802,803,816,825,826],"class_list":["post-1676","post","type-post","status-publish","format-standard","hentry","category-code-reviews","category-research-computer-science-technology","tag-comparison","tag-criteria","tag-criterion","tag-data-analysis","tag-disagreement","tag-fairness-in-grading","tag-graders","tag-grading","tag-marking","tag-rubrics","tag-tas","tag-teaching-assistants"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/prmTy-r2","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/mikeconley.ca\/blog\/wp-json\/wp\/v2\/posts\/1676","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mikeconley.ca\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mikeconley.ca\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mikeconley.ca\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/mikeconley.ca\/blog\/wp-json\/wp\/v2\/comments?post=1676"}],"version-history":[{"count":45,"href":"https:\/\/mikeconley.ca\/blog\/wp-json\/wp\/v2\/posts\/1676\/revisions"}],"pr
edecessor-version":[{"id":3157,"href":"https:\/\/mikeconley.ca\/blog\/wp-json\/wp\/v2\/posts\/1676\/revisions\/3157"}],"wp:attachment":[{"href":"https:\/\/mikeconley.ca\/blog\/wp-json\/wp\/v2\/media?parent=1676"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mikeconley.ca\/blog\/wp-json\/wp\/v2\/categories?post=1676"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mikeconley.ca\/blog\/wp-json\/wp\/v2\/tags?post=1676"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}