Archive for the ‘Technology’ Category.

Starting Work on Mozilla Thunderbird

With the winter holidays drawing to a close, I’m really looking forward to starting the next chapter of my life – namely, my new job at Mozilla Messaging working on the Thunderbird e-mail client.

In just a little under a week, I’ll be knee-deep in a code-base larger than any I’ve ever worked on before.  And I’ll be working with some of the best software developers in the world.

I’m pretty stoked.

So, what exactly will I be doing for Thunderbird?  What project will I be starting my work with?  I’m so glad you asked…

Thunderbird + Unity = Badass

Ok, that’s not technically the code-name for the project, but I think it more or less conveys my feelings about the whole thing.

So here’s the story in a nutshell:

Ubuntu Linux is one of several operating systems that Thunderbird runs on (the other big ones being Mac OSX and the various flavours of Microsoft Windows).  I use Ubuntu as my primary operating system – I’m comfortable with it, and I like it.

In the coming months, there will be a tectonic shift of sorts in Ubuntu.  The graphical user interface that most Ubuntu users are used to (the GNOME Shell) will no longer be the default.  Instead, Canonical, the makers of Ubuntu, have created their own user interface to run on top of GNOME.  That interface is called Unity, and will be made default in the Natty Narwhal release (due to come out on or around April 28th of this year).

Just to make sure we’re clear on this:  Ubuntu is not dropping GNOME.  The GNOME Shell is the icing on the whole GNOME Stack.  Canonical has just decided to put their own icing on the cake.

So, anyhow, my job is to make Thunderbird work nicely with Unity in time for the April 28th release.

And by “work nicely”, I mean the following:

The Global Menu Bar

If you’ve never used Mac OSX, it’s likely that you don’t know what a global menu bar is.  Here’s the idea:  in Windows and Ubuntu, each window tends to have its own menu bar (File, Edit, etc…).

In Mac OSX, and the upcoming Unity shell, instead of having these individual menu bars, we have a single, overarching menu bar. This menu bar changes itself every time you switch application focus.

Here’s some guy demonstrating the global menu in Ubuntu Linux:

Currently, Thunderbird doesn’t “play nice” with Unity’s global menu bar, and just displays the menu within the Thunderbird window as it always has.

My job is to get Thunderbird to use the global menu bar properly.  Click here to read more about Ubuntu Unity’s global menu bar.

The Messaging Menu

Ubuntu Unity also sports a shiny new messaging menu.  The messaging menu aggregates all sorts of message-related information – and that includes e-mail messages, chat messages, social networking messages, etc.  It tosses all of these into a nice, clean, simple notification interface, like this:

Ubuntu Unity Messaging Menu

It’s up to messaging application developers to leverage this feature in Unity, and that’s where I come in.  I’ll be getting Thunderbird to work nicely with this messaging menu.  Click here to read more about Ubuntu Unity’s messaging menu.

The Task List

Ubuntu Unity also sports a new application launcher.  The launcher is a panel that stretches down the left-hand side of the screen, and allows users to quickly find and execute their applications.  It also lets users know which applications are already open.  In a way, it is very similar to the Mac OSX dock.

Here is a Canonical designer demonstrating the new launcher:

Unity Launcher Introduction from Canonical Design on Vimeo.

Right-clicking on an item in the launcher brings up a context-menu for the selected application.  For Thunderbird, we’ll probably want the context menu to allow users to do some common operations, such as fetching mail, and composing a new message.  We’ll probably also want to display the number of unread messages.  So that’s what I’m going to be looking into there.

I’m looking forward to tackling these problems!  I’ll keep you posted on my progress.


The Wisdom of Peers: A Motive for Exploring Peer Code Review in the Classroom

A major part of my Master’s degree requirements was my research paper.  If you heard me lament over the past year or so about my “thesis”, I was referring to this research paper.

Anyhow, after lots of hard work, my research paper was finally signed off by my supervisor, Dr. Greg Wilson, and second reader Dr. Yuri Takhteyev.  A huge thanks to both of them!

Here’s the abstract, followed by a download link for the PDF.  Enjoy!

Abstract

Peer code review is commonly used in the software development industry to identify and fix problems during the development process. An additional benefit is that it seems to help spread knowledge and expertise around the team conducting the review. So is it possible to leverage peer code review as a learning tool? Our experiment results show that peer code review seems to cause a performance boost in students. They also show that the average total peer mark generated by students seems to be similar to the total mark that a graduate-level teaching assistant might give. We found that students agree that peer code review teaches them something – however, we also found they do not enjoy grading their peers’ work. We are encouraged by these results, and feel that they are a strong motive for further research in this area.

Click here to download my research paper


That’s all, folks! or Becoming Randall Stevens

Once again, I’ve let a month’s worth of dust gather on my blog.  But I have a good reason for being so busy!

Several good reasons, actually.

And here they are:

UCOSP has wrapped for the semester

This semester, I was a teaching assistant for the UCOSP (Undergraduate Capstone Open-source Projects) course.  I helped out with two projects:  MarkUs and Review Board.

This semester, we saw some outstanding work for both projects.  Lots of great students, lots of good code, lots of leaps forward.

I’m looking forward to helping out next semester with UCOSP.

I won’t be doing it as a paid teaching assistant though.  Why?  Well…

I’ve finished school

My research paper was signed off by my two readers, and I just wrote my last final exam a few nights ago.  Unofficial grades have been posted, and I’ve passed what I needed to pass.

So that’s that – I’m a Master of Computer Sciences, I guess.  Awesome!

I got a job!

I’ve been hired by Mozilla Messaging to work on the Thunderbird project!  I’m 100% psyched about this opportunity, and look forward to peeling into the code.  An added bonus:  since Thunderbird is an open-source project, I’m absolutely free to discuss the code and the various things I’m doing with it.  No NDAs for me!  So stay tuned – I’ll have lots to say about Thunderbird and the Mozilla Framework code.  Just give me some time to wade through it.

Zihuatanejo

It’s been a pretty long road.  I’ve been in school, in one form or another, for over two decades.  It’s strange that it’s over.  I’m extremely excited about my next adventures, but I think I’m going to miss school.

Still, I can’t help but be a bit dramatic…

In 1966, Andy Dufresne escaped from Shawshank prison. All they found of him was a muddy set of prison clothes, a bar of soap, and an old rock hammer, damn near worn down to the nub. I remember thinking it would take a man six hundred years to tunnel through the wall with it. Old Andy did it in less than twenty. Oh, Andy loved geology. I imagine it appealed to his meticulous nature. An ice age here, million years of mountain building there. Geology is the study of pressure and time. That’s all it takes really, pressure, and time. …Andy crawled to freedom through five hundred yards of shit smelling foulness I can’t even imagine, or maybe I just don’t want to. Five hundred yards… that’s the length of five football fields, just shy of half a mile…

Andy Dufresne – who crawled through a river of shit and came out clean on the other side.

P.S.:  Here are some celebration rituals, if so inclined.


Stallin’…

I know, I know.  I left you all hanging at the edge of your seat with my last blog post, and I still haven’t posted my idea for recognizing good code review.

I’m bogged down with school work, and I’m aiming to have the first draft of my research paper done next week.  So that’s taking 100% of my resources.

Just be patient.  I’ll post my idea soon.


Recognizing Good Code Review

While the benefits of code review are proven, documented, numerous and awesome, it doesn’t change the fact that most people, in general, don’t like doing it.

I guess code review just isn’t really all that fun.

So a few months ago, I broadcast the idea of turning code review into a game. It was my way of trying to mix things up – “let’s add points, and have reviewers/developers competing to be the best participant in the code review process”.

Well, if there’s one thing that my supervisor Greg has taught me, it’s that I shouldn’t rush headlong into something before all of the facts are in.  So before I decide to do something like game-ifize code review, I should take a look at some prior work in the area…

Enter this guy:  Sebastian Deterding.

In particular, check out the following slide-show.  Flip through it if you have the time.  If you don’t have the time, scroll down, where I get to the salient point with respect to game-ificating code review.

Here’s the slide-show. Be sure to read the narrative at the bottom.

The Salient Point

Sebastian seems to be saying that adding points to apps and trying to incite competition does not make something a game.  If it did, then this should be countless hours of fun.

Without play, there is no game. Points do not equal a game.  It’s not nearly that simple.

Free Pizza and Pop

I’m going to divert for a second here.

Last week, a company set themselves up a couple of booths in the lobby of the Bahen Center where I work.  They were there to recruit university students to work for their company – either as interns, or full-timers.

They were also handing out free pizza and pop.

Needless to say, I wanted a few slices – but I figured it would be polite if I engaged them in conversation before waltzing off with some of the free food and drink they’d brought.

So I sparked up a conversation with one of the recruiters, and he told me about the company.  I’m going to call this recruiter Vlad.

I ended up gently steering the conversation towards code review, and I asked my inevitable question:

“So, do you guys do code review?”

I felt like a dentist asking a patient if he’s been flossing.  Vlad waffled a bit, but the general impression was:

“Not as much as we should.  We don’t have a prescribed workflow. It’d be hard to persuade all of the teams to do it.”

And then we started talking about code review in general.  It turns out that Vlad had worked in a few companies where they’d done code review, and he always felt a little short-changed.  He said something along the lines of:

“I never felt compelled to do reviews.  They just sort of happened…and I did it, and it felt like…unrecognized effort.  I mean, what’s the incentive?  Do you know what I mean?  There’s incentive for the software, but I’m talking incentive for me.  And some people did really lousy reviews…but my reviews were treated the same as theirs.  I didn’t get recognized, and didn’t get rewarded if I did a good review.  So it was hard for me to do them.  I want to be recognized for my good reviews, for my good contributions.”

I wish I’d had a tape-recorder running so I could have gotten Vlad’s exact words.  But that’s what I remember him saying.

Feedback and Recognition

Maybe instead of trying to game-ulize code review, I can instead hear what Vlad is saying and work off of that.

With the code review that Vlad participated in, all of the feedback went to the code author, and none went to the reviewers.  And the reviewers are the ones who are doing all of the heavy lifting!  As a reviewer, Vlad also wants feedback, and recognition for code review done well.

There’s a company in Toronto that specializes in feedback like this.  They’re one of the major players in the Toronto start-up scene, and have built a pretty sweet suite of tools to facilitate quick and easy feedback/recognition.

The company is called Rypple.  And maybe that’s the name of the application, too.  (checks website) Yeah, it’s both.

So Rypple has this feature called Kudos that lets people publicly acknowledge the good work of their team.

Normally, I don’t pimp companies.  And it upsets me when people comment on my blog, and their sub-text is to try to sell their product or service.  However, I think this video is relevant, so I’m posting their demo video so you can see how Kudos work:

Click here if you can’t see the video.

The Idea

So Rypple’s idea is to have a feed that the team subscribes to, and publicly display things like Kudos.  The badges for the Kudos are also limited in how many you can give per week, so they’re a valuable commodity that can’t just be handed out all over the place.  Cool idea.

So there’s one approach – use a service like Rypple to give your reviewers better feedback and recognition.

Or maybe we could build an extension for Review Board that does something similar, and more oriented around code review.

It’s not oriented like a game, like I had originally envisioned.  But somehow, I think this idea has more meaning and traction than just “adding points”.

More on this idea in a few days.  But please, comment if you have any thoughts or ideas to add.


Review Board Issue Tracking: A Sneak Peek

So I wrote my (hopefully) last mid-term ever last night, and in celebration, I thought I’d put together a little video showing off the issue tracking feature I’m hoping to put into Review Board.

It’s still in its very early stages.  The code hasn’t been reviewed.  I’m still really really open to suggestions and feedback on this.  So please, comment here, or on the reviewboard-dev list.

So here it is – enjoy!

(Click here if you can’t see the video)


Starting My Thesis

So I’ve been given the go-ahead to start writing my thesis.  I was going to post up some more exciting numbers/findings from my experiment, but that’ll have to wait – the thesis beckons.

I’ve started writing it, and holy smokes, it’s hard.  It’s hard because I have to zoom out from my current perspective, and start right from scratch, explaining where every single decision came from.

And I have to do it in a formal, academic tone – without awesome photos.

Plan of Attack

I think I’m going to go with Alecia on this one, and start with my outline.  That’s what I always did for any of my Drama classes where I had to write a big essay:  start with the outline, and treat it like the skeleton…then slowly put more flesh on the skeleton.  Keep fleshing it out, throw on some skin, some clothes, a lick of varnish, and bam:  it’s all done.

Anyhow, that’s my plan of attack.  So I need an outline.  Let me show you what I have.

Tentative Outline

  1. Intro
    1. Title Page
    2. Abstract
    3. Acknowledgments
    4. Table of Contents
    5. List of Tables (where applicable)
    6. List of Plates (where applicable)
    7. List of Figures
    8. List of Appendices (where applicable)
  2. The Meat
    1. Background
      1. Code Review
          1. What it is, how it is commonly used in industry
          2. Proven to be effective (Jason Cohen study)
          3. Helps to spread learning in a development team
        1. If code review is so good at spreading learning, why isn’t it part of the pedagogy in the undergrad curriculum?
            1. How do we teach it?
            2. The curriculum is already packed – how do we fit it in?
            3. Joordens and Paré’s peerScholar approach
          1. The idea:
              1. Have students evaluate one another after assignments, and give them a code review grade based on agreement with the TA grades.
          2. Unanswered questions:
            1. Would students actually benefit from this idea?
            2. What is the relationship between the marks given by TAs, and the marks given by student evaluators?
            3. How would students feel about grading one another?
          3. The experiment
            1. Terminology
              1. Assignment specification
              2. Submission
              3. Subject
              4. Grader
              5. Peer Grader
              6. Marking
              7. Marking Rubric
              8. Peer Average
              9. Agreement
            2. Design
              1. Single-blind, with two groups (control and treatment)
                1. In both groups, subjects would:
                2. fill out brief questionnaire
                3. work on two programming assignments
                4. have a maximum of half an hour to complete each assignment
                5. perform another activity during the time between assignments, dependent on their particular group:
                  1. treatment group would perform some grading
                  2. control group would work on a vocabulary exercise
              2. Subjects in the treatment group would then fill out a post-experiment questionnaire to get their feedback on their marking experience
              3. Counter-balancing?
              4. Graders would mark shuffled submissions
              5. Graders would choose their preferred submission
            3. Instruments
              1. Pre-experiment Questionnaire
              2. Assignment Specifications
                1. Flights and Passengers
                2. Decks and Cards
              3. Assignment Rubrics
              4. Mock-ups
              5. Vocabulary Exercise
              6. Post-experiment Questionnaire
              7. Working Environment
                1. IDE
                2. Count-down widget
                3. Screen capture
            4. Subjects
              1. Undergraduates with 4+ months of Python programming experience
              2. Months as a unit of experience
              3. The two graders
            5. Assignment Sessions
              1. Greeting, informed consent, withdrawal rights
              2. Pre-experiment questionnaire
              3. First Assignment Rules
                1. 30 minutes maximum – finish early, let me know
                2. full access to Internet
                3. work may or may not be seen by other participants in the study
                4. may ask for clarification
              4. First Assignment begins
                1. Timer widget starts
                2. Screen capture begins
                3. Subject left alone
              5. Marking / vocabulary phase
                1. Treatment group
                  1. Would be given 5 submissions (secretly mock-ups), given 5 rubrics, asked to fill out as much as possible
                  2. 30 minute time limit
                2. Control group
                  1. Given links to 5 vocabulary exercises found online
                  2. Asked to complete as much as possible, and to self-report results on a sheet of paper
                  3. 30 minute time limit
              6. Second Assignment Rules
                1. Same as first, but repeated for emphasis
              7. Second Assignment begins
                1. Timer widget starts
                2. Screen capture begins
                3. Subject left alone
              8. Control group subjects released
              9. Treatment group subjects fill out post-experiment questionnaire
            6. Grading
              1. Initial meeting, and then hand-off of submissions / rubrics
              2. Hands-off approach
            7. Choosing Phase
              1. Submissions for each assignment were paired by the subject that wrote them
              2. Mock-ups not included
              3. Graders were asked to choose which one they preferred, and give a rating of the difference
          4. Analysis
            1. Pearson’s Correlation Co-efficient as a measure of agreement
            2. Fisher’s z-score
          5. Results
            1. On grader vs. grader agreement
            2. On grader vs. peer average agreement
            3. On treatment vs. control
              1. Difference in average
              2. Grader preference
            4. On student opinion wrt peer grading
          6. Discussion
          7. Threats to validity
            1. The 30 minute time limit
            2. A rigid rubric
          8. Future work
          9. Conclusion

        That’s the current structure of it.  I’m meeting my supervisor tomorrow and getting feedback, so this might change.  Stay tuned.


        Some More Results: Did the Graders Agree? – Part 2

        (Click here to read the first part of the story)

        I’m just going to come right out and say it:  I’m no stats buff.

        Actually, maybe that’s giving myself too much credit.  I barely scraped through my compulsory statistics course.  In my defense, the teaching was abysmal, and the class average was in the sewer the entire time.

        So, unfortunately, I don’t have the statistical chops that a real scientist should.

        But, today, I learned a new trick.

        Pearson’s Correlation Co-efficient

        Joordens and Paré gave me the idea while I was reviewing their paper for the Related Work section of my thesis.  They used it in order to inspect mark agreement between their expert markers.

        In my last post on Grader agreement, I was looking at mark agreement at the equivalence level.  Pearson’s Correlation Co-efficient should (I think) let me inspect mark agreement at the “shape” level.

        And by shape level, I mean this:  if Grader 1 gives a high mark for a participant, then Grader 2 gives a high mark.  If Grader 1 gives a low mark for the next participant, then Grader 2 gives a low mark.  These high and low marks might not be equal, but the basic shape of the thing is there.

        And this page, with its useful table, tells me how I can tell if the correlation co-efficient that I find is significant.  Awesome.

        At least, that’s my interpretation of Pearson’s Correlation Co-efficient.  Maybe I’ve got it wrong.  Please let me know if I do.
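Here's a minimal sketch in Python of that "shape" idea.  The function name `pearson_r` and the marks are invented for illustration (they are not data from the experiment), and the calculation is the textbook formula written out by hand rather than anything the Graders or I actually ran.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation co-efficient for two equal-length lists of marks."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two graders' marks move together around their own means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each grader's marks spread around their own mean.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


# Hypothetical marks for five submissions on one criterion.
grader_1 = [1, 3, 2, 4, 0]
# Hypothetical second grader who is always exactly one mark more generous.
grader_2 = [2, 4, 3, 5, 1]

exact_matches = sum(a == b for a, b in zip(grader_1, grader_2))
r = pearson_r(grader_1, grader_2)

print("marks that match exactly:", exact_matches)
print("Pearson's r:", round(r, 2))
```

In this made-up case the two graders never agree mark for mark, yet the co-efficient comes out at its maximum, because the highs and lows line up perfectly.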

        Anyhow, it can’t hurt to look at some more tables.  Let’s do that.

        About these tables…

        Like my previous post on graders, I’ve organized my data into two tables – one for each assignment.

        Each table has a row for each of that assignment’s criteria.

        Each table has four columns.  The first simply lists the assignment criteria.  The next two repeat the average mark that each Grader gave on that criterion.  The last gives the Pearson Correlation Co-efficient for each criterion, measured between the marks that my two Graders gave on that criterion across all 30 submissions for that assignment.

        I hope that makes sense.

        Anyways, here goes…

        Da-ta!

        Decks and Cards Grader Correlation Table

        Criterion                      Grader 1 – Average   Grader 2 – Average   Pearson's Correlation Co-efficient
        Deck Constructor               3.37                 3.57                 0.65
        Design of Deck Constructor     3                    3.77                 0.5
        __str__                        2.63                 3.4                  0.57
        Design of __str__              2.33                 3.67                 0.36
        deal                           2.27                 3.03                 0.57
        Design of deal                 2.53                 3.7                  0.4
        shuffle                        3.23                 3.53                 0.77
        Design of shuffle              3                    3.47                 0.78
        cut                            2.67                 2.97                 0.88
        Design of cut                  2.17                 2.9                  0.78
        Error Checking                 1.07                 0.93                 0.95
        Style                          2.9                  3.63                 0.52
        Docstrings                     1.87                 2.03                 0.7
        Internal Comments              1.1                  0.83                 0.56

        Flights and Passengers Grader Correlation Table

        Criterion                      Grader 1 – Average   Grader 2 – Average   Pearson's Correlation Co-efficient
        Flight Constructor             3.67                 3.73                 0.97
        Design of Flight Constructor   3.43                 3.93                 0.72
        __str__                        3.03                 3.37                 0.8
        Design of __str__              2.4                  3.4                  0.57
        add_passenger                  3.9                  3.9                  1
        Design of add_passenger        3.53                 3.87                 0.77
        heaviest_passenger             3                    3.27                 0.87
        Design of heaviest_passenger   2.17                 3.1                  0.46
        lightest_passenger             2.83                 3.03                 0.9
        Design of lightest_passenger   2                    2.83                 0.64
        Error Checking                 1.4                  1.73                 0.85
        Style                          2.8                  3.53                 0.68
        Docstrings                     1.47                 1.9                  0.87
        Internal Comments              0.73                 0.67                 0.76

        What does this tell us?

        Well, first off, remember that for each assignment, for each criterion, there were 30 submissions.

        So N = 30.

        In order to determine if the correlation co-efficients are significant, we look at this table, and find N – 2 down the left hand side:

        28                       .306    .361    .423    .463

        Those 4 values on the right are the critical values that we want to pass for significance.
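If you're wondering where critical values like these come from, they can be derived from the critical values of Student's t distribution using r = t / sqrt(t² + df).  Here's a minimal sketch of that conversion.  The four t values are copied from a standard two-tailed t table for 28 degrees of freedom (they are an input I'm supplying, not something taken from the page linked above), and the name `critical_r` is made up for illustration.

```python
from math import sqrt

# Two-tailed critical t values for 28 degrees of freedom, keyed by
# significance level.  Assumption: copied from a standard t table.
T_CRITICAL_DF28 = {0.10: 1.701, 0.05: 2.048, 0.02: 2.467, 0.01: 2.763}


def critical_r(t_critical, df):
    """Smallest |r| that reaches significance for a given critical t and df."""
    return t_critical / sqrt(t_critical ** 2 + df)


n = 30          # submissions per assignment
df = n - 2      # degrees of freedom for Pearson's r

critical_values = {
    level: round(critical_r(t, df), 3) for level, t in T_CRITICAL_DF28.items()
}

for level, value in critical_values.items():
    print(f"p < {level:.2f}: |r| must be at least {value:.3f}")
```

Rounded to three places, the results line up with the .306, .361, .423 and .463 in the row for 28.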

        Good news!  Every one of the correlation co-efficients clears at least the lowest critical value of .306, and most of them clear the highest, .463.  So now, I’ll show you their significance by level:

        p < 0.10

        • Design of __str__ in Decks and Cards assignment

        p < 0.05

        • Design of deal method in Decks and Cards assignment

        p < 0.02

        • Design of heaviest_passenger method in Flights and Passengers

        p < 0.01

        Decks and Cards
        • Design of Deck constructor
        • Style
        • Internal Comments
        • __str__ method correctness
        • deal method correctness
        • Deck constructor correctness
        • Docstrings
        • shuffle method correctness
        • Design of shuffle method
        • Design of cut method
        • cut method correctness
        • Error checking
        Flights and Passengers
        • Design of __str__ method
        • Design of lightest_passenger method
        • Style
        • Design of Flight constructor
        • Internal comments
        • Design of add_passenger method
        • __str__ method correctness
        • Error checking
        • heaviest_passenger method correctness
        • Docstrings
        • lightest_passenger method correctness
        • Flight constructor correctness
        • add_passenger method correctness
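The sorting into levels above can be sketched in a few lines of Python.  The critical values are the four from the row for 28, and the four co-efficients are copied from the correlation tables in this post.  The function name `significance_level` is hypothetical.  Keep in mind that the co-efficients are rounded to two decimal places, so a value sitting right next to a cut-off (like 0.36 against .361) could land on either side if more digits were available.

```python
# (significance level, critical value) pairs for N - 2 = 28, strictest first.
CRITICAL_VALUES = [(0.01, 0.463), (0.02, 0.423), (0.05, 0.361), (0.10, 0.306)]


def significance_level(r):
    """Return the strictest level whose critical value |r| reaches, else None."""
    for level, critical in CRITICAL_VALUES:
        if abs(r) >= critical:
            return level
    return None


# Co-efficients copied from the two correlation tables.
examples = {
    "Design of __str__ (Decks and Cards)": 0.36,
    "Design of deal (Decks and Cards)": 0.40,
    "Design of heaviest_passenger (Flights and Passengers)": 0.46,
    "Error Checking (Decks and Cards)": 0.95,
}

levels = {name: significance_level(r) for name, r in examples.items()}

for name, level in levels.items():
    print(f"{name}: p < {level}")
```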

        Wow!

        Correlation of Mark Totals

        Joordens and Paré ran their correlation statistics on assignments that were marked on a scale from 1 to 10.  I can do the same type of analysis by simply running Pearson’s on the totals for each participant by each Grader.

        Drum roll, please…

        Decks and Cards

        r(28) = 0.89, p < 0.01

        Flights and Passengers

        r(28) = 0.92, p < 0.01

        Awesome!

        Summary / Conclusion

        I already showed before that my two Graders rarely agreed mark for mark, and that one Grader tended to give higher marks than the other.

        The analysis with Pearson’s correlation co-efficient seems to suggest that, while there isn’t one-to-one agreement, there is certainly a significant correlation – with the majority of the criteria having a correlation with p < 0.01!

        The total marks also show a very strong, significant, positive correlation.

        Ok, so that’s the conclusion here:  the Graders’ marks do not match, but show moderate to high positive correlation to a significant degree.

        How’s My Stats?

        Did I screw up somewhere?  Am I making fallacious claims?  Let me know – post a comment!


        The $100 Best Buy Gift Card Draw

        It’s finally time.

        As promised, one of my participants is going to win a $100 Best Buy gift card, courtesy of the Department of Computer Science.

        Here’s the draw:  (click here if you can’t see the video)

        Congratulations to the winner!


        Some More Results: Did the Graders Agree?

        My experiment makes a little bit of an assumption – and it’s the same assumption most teachers probably make before they hand back work.  We assume that the work has been graded correctly and objectively.

        The rubric that I provided to my graders was supposed to help sort out all of this objectivity business.  It was supposed to boil down all of the subjectivity into a nice, discrete, quantitative value.

        But I’m a careful guy, and I like back-ups.  That’s why I had 2 graders do my grading.  Both graders worked in isolation on the same submissions, with the same rubric.

        So, did it work?  How did the grades match up?  Did my graders tend to agree?

        Sounds like it’s time for some data analysis!

        About these tables…

        I’m about to show you two tables of data – one table for each assignment.  Each row of a table maps to a single criterion on that assignment’s rubric.

        The columns are concerned with the graders’ marks for each criterion.  The first columns, Grader 1 – Average and Grader 2 – Average, simply show the average mark each grader gave for each criterion.

        Number of Agreements shows the number of times the marks between both graders matched for that criterion.  Similarly, Number of Disagreements shows how many times they didn’t match.  Agreement Percentage just converts those two values into a single percentage for agreement.

        Average Disagreement Magnitude takes every instance where there was a disagreement, and averages the magnitude of the disagreement (a reminder:  the magnitude here is the absolute value of the difference).
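To make those column definitions concrete, here's a minimal sketch in Python of how the numbers for a single row could be computed.  The function name `compare_graders` and the marks are made up for illustration.  They are not the code or the data from the experiment.

```python
def compare_graders(marks_1, marks_2):
    """Summarise how two graders' marks compare on a single criterion."""
    # Absolute difference between the two marks for each submission.
    differences = [abs(a - b) for a, b in zip(marks_1, marks_2)]
    disagreements = [d for d in differences if d != 0]
    agreements = len(differences) - len(disagreements)
    return {
        "grader_1_average": sum(marks_1) / len(marks_1),
        "grader_2_average": sum(marks_2) / len(marks_2),
        "agreements": agreements,
        "disagreements": len(disagreements),
        "agreement_percentage": 100 * agreements / len(differences),
        # Averaged over the disagreements only; 0 if the graders never disagree.
        "average_disagreement_magnitude": (
            sum(disagreements) / len(disagreements) if disagreements else 0
        ),
    }


# Hypothetical marks for one criterion across five submissions.
grader_1 = [4, 3, 2, 4, 1]
grader_2 = [4, 4, 2, 4, 3]

summary = compare_graders(grader_1, grader_2)
for column, value in summary.items():
    print(column, value)
```

Note that the magnitude is averaged over the disagreements only, so a criterion with very few disagreements can still show a large average magnitude.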

        Finally, I should point out that these tables can be sorted by clicking on the headers.  This will probably make your interpretation of the data a bit easier.

        So, if we’re clear on that, then let’s take a look at those tables…

        Flights and Passengers Grader Comparison

        Criterion                      Grader 1 – Average   Grader 2 – Average   Number of Agreements   Number of Disagreements   Agreement Percentage   Average Disagreement Magnitude
        Flight Constructor             3.67                 3.73                 26                     4                         86.67                  1
        Design of Flight Constructor   3.43                 3.93                 22                     8                         73.33                  1.88
        __str__                        3.03                 3.37                 21                     9                         70                     1.56
        Design of __str__              2.4                  3.4                  10                     20                        33.33                  1.6
        add_passenger                  3.9                  3.9                  30                     0                         100                    0
        Design of add_passenger        3.53                 3.87                 23                     7                         76.67                  1.43
        heaviest_passenger             3                    3.27                 26                     4                         86.67                  2
        Design of heaviest_passenger   2.17                 3.1                  11                     19                        36.67                  1.68
        lightest_passenger             2.83                 3.03                 28                     2                         93.33                  3
        Design of lightest_passenger   2                    2.83                 12                     18                        40                     1.61
        Error Checking                 1.4                  1.73                 27                     3                         90                     3.33
        Style                          2.8                  3.53                 9                      21                        30                     1.24
        Docstrings                     1.47                 1.9                  16                     14                        53.33                  1.36
        Internal Comments              0.73                 0.67                 23                     7                         76.67                  1.71

        Decks and Cards Grader Comparison

        Criterion                      Grader 1 – Average   Grader 2 – Average   Number of Agreements   Number of Disagreements   Agreement Percentage   Average Disagreement Magnitude
        Deck Constructor               3.37                 3.57                 22                     8                         73.33                  1.75
        Design of Deck Constructor     3                    3.77                 18                     12                        60                     1.92
        __str__                        2.63                 3.4                  17                     13                        56.67                  2.08
        Design of __str__              2.33                 3.67                 14                     16                        46.67                  2.5
        deal                           2.27                 3.03                 10                     20                        33.33                  1.65
        Design of deal                 2.53                 3.7                  17                     13                        56.67                  2.69
        shuffle                        3.23                 3.53                 27                     3                         90                     3
        Design of shuffle              3                    3.47                 23                     7                         76.67                  2
        cut                            2.67                 2.97                 23                     7                         76.67                  1.57
        Design of cut                  2.17                 2.9                  14                     16                        46.67                  1.5
        Error Checking                 1.07                 0.93                 27                     3                         90                     1.33
        Style                          2.9                  3.63                 14                     16                        46.67                  1.63
        Docstrings                     1.87                 2.03                 15                     15                        50                     1.67
        Internal Comments              1.1                  0.83                 18                     12                        60                     2

        Findings and Analysis

        It is very rare for the graders to fully agree

        It only happened once, on the “add_passenger” correctness criterion of the Flights and Passengers assignment.  If you sort the tables by “Number of Agreements” (or Number of Disagreements), you’ll see what I mean.

        Grader 2 tended to give higher marks than Grader 1

        In fact, there are only a handful of cases (4, by my count), where this isn’t true:

        1. The add_passenger correctness criterion on Flights and Passengers
        2. The internal comments criterion on Flights and Passengers
        3. The error checking criterion on Decks and Cards
        4. The internal comments criterion on Decks and Cards

        The graders tended to disagree more often on design and style

        Sort the tables by Number of Disagreements descending, and take a look down the left-hand side.

        There are 14 criteria in total for each assignment.  If you’ve sorted the tables like I’ve asked, the top 7 criteria of each assignment are:

        Flights and Passengers
        1. Style
        2. Design of __str__ method
        3. Design of heaviest_passenger method
        4. Design of lightest_passenger method
        5. Docstrings
        6. Correctness of __str__ method
        7. Design of Flight constructor
        Decks and Cards
        1. Correctness of deal method
        2. Style
        3. Design of cut method
        4. Design of __str__ method
        5. Docstrings
        6. Design of deal method
        7. __str__

        Of those 14, 9 have to do with design or style.  It’s also worth noting that Docstrings and the correctness of the __str__ methods are in there too.
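If you'd rather not sort by hand, here is a rough Python 3 sketch of the same ranking for Decks and Cards. The disagreement counts are copied from the table above; the helper name `most_disputed` is just something I made up for illustration.

```python
# (criterion, number of disagreements) pairs from the Decks and Cards table.
decks_and_cards = [
    ("Deck Constructor", 8),
    ("Design of Deck Constructor", 12),
    ("__str__", 13),
    ("Design of __str__", 16),
    ("deal", 20),
    ("Design of deal", 13),
    ("shuffle", 3),
    ("Design of shuffle", 7),
    ("cut", 7),
    ("Design of cut", 16),
    ("Error Checking", 3),
    ("Style", 16),
    ("Docstrings", 15),
    ("Internal Comments", 12),
]


def most_disputed(rows, how_many=7):
    """Return the rows with the most disagreements, largest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:how_many]


top_seven = most_disputed(decks_and_cards)
for criterion, disagreements in top_seven:
    print(criterion, disagreements)
```

Several criteria are tied (three at 16, two at 13), so the order within a tie may differ from the list above, but the same seven criteria come out on top.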

        There was slightly more disagreement in Decks and Cards than in Flights and Passengers

        Total number of disagreements for Flights and Passengers:  136 (avg:  9.71 per criterion)

        Total number of disagreements for Decks and Cards:  161 (avg:  11.5 per criterion)
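The arithmetic behind those averages is simple enough to check in a few lines of Python 3. The totals and the 14 criteria come from this post; the disagreement rate is an extra figure I've derived here, assuming every criterion was marked on all 30 submissions (which is what the agreement and disagreement counts in the tables add up to).

```python
CRITERIA_PER_ASSIGNMENT = 14
SUBMISSIONS_PER_CRITERION = 30  # agreements + disagreements in each table row

total_disagreements = {
    "Flights and Passengers": 136,
    "Decks and Cards": 161,
}

# Average number of disagreements per criterion.
average_per_criterion = {
    name: round(total / CRITERIA_PER_ASSIGNMENT, 2)
    for name, total in total_disagreements.items()
}

# Share of all (criterion, submission) pairs on which the graders differed.
disagreement_rate = {
    name: round(
        100 * total / (CRITERIA_PER_ASSIGNMENT * SUBMISSIONS_PER_CRITERION), 2
    )
    for name, total in total_disagreements.items()
}

print(average_per_criterion)
print(disagreement_rate)
```

Put another way, the Graders differed on roughly a third of all the marks they gave for Flights and Passengers, and a bit more than that for Decks and Cards.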

        Discussion

        Being Hands-off

        From the very beginning, when I contacted / hired my Graders, I was very hands-off.  Each Grader was given the assignment specifications and rubrics ahead of time to look over, and then a single meeting to ask questions.  After that, I just handed them manila envelopes filled with submissions for them to mark.

        Having spoken with some of the undergraduate instructors here in the department, I know that this isn’t usually how grading is done.

        Usually, the instructor will have a big grading meeting with their TAs.  They’ll all work through a few submissions, and the TAs will be free to ask for a marking opinion from the instructor.

        By being hands-off, I didn’t give my Graders the same level of guidance that they may have been used to.  I did, however, tell them that they were free to e-mail me or come up to me if they had any questions during their marking.

        The hands-off thing was a conscious choice by Greg and myself.  We didn’t want me to bias the marking results, since I would know which submissions would be from the treatment group, and which ones would be from control.

        Anyhow, the results from above have driven me to conclude that if you just hand your graders the assignments and the rubrics, and say “go”, you run the risk of seeing dramatic differences in grading from each Grader.  From a student’s perspective, this means that it’s possible to be marked by “the good Grader”, or “the bad Grader”.

        I’m not sure if a marking-meeting like I described would mitigate this difference in grading.  I hypothesize that it would, but that’s an experiment for another day.

        Questionable Calls

        If you sort the Decks and Cards table by Number of Disagreements, you’ll find that the criterion that my Graders disagreed most on was the correctness of the “deal” method.  Out of 30 submissions, the Graders disagreed on that particular criterion 20 times (about 67%).

        It’s a little strange to see that criterion all the way at the top there.  As I mentioned earlier, most of the disagreements tended to be concerning design and style.

        So what happened?

        Well, let’s take a look at some examples.

        Example #1

        The following is the deal method from participant #013:

        def deal(self, num_to_deal):
          i = 0
          while i < num_to_deal:
            print self.deck.pop(0)
            i += 1
        

        Grader 1 gave this method a 1 for correctness, where Grader 2 gave this method a 4.

        That’s a big disagreement.  And remember, a 1 on this criterion means:

        Barely meets assignment specifications. Severe problems throughout.

        I think I might have to go with Grader 2 on this one.  Personally, I wouldn’t use a while-loop here – but that falls under the design criterion, and shouldn’t impact the correctness of the method.  I’ve tried the code out.  It works to spec.  It deals from the top of the deck, just like it’s supposed to.  Sure, there are some edge cases missed here (what if the Deck is empty?  What if we’re asked to deal more than the number of cards left?  What if we’re asked to deal a negative number of cards?  etc)… but the method seems to deliver the basics.

        Not sure what Grader 1 saw here.  Hmph.
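For what it's worth, here is a minimal Python 3 sketch of what guarding against those missed edge cases might look like. This is not from any participant, and the choices here are assumptions of mine: the `Deck` class is a hypothetical stand-in, index 0 is treated as the top of the deck, and bad requests raise `ValueError` (the assignment spec may well have wanted something else).

```python
class Deck:
    """Hypothetical stand-in for the assignment's Deck; index 0 is the top."""

    def __init__(self, cards):
        self.cards = list(cards)

    def deal(self, num_to_deal):
        """Print num_to_deal cards from the top of the deck and remove them."""
        # Assumption: bad requests raise ValueError; the real spec may differ.
        if not isinstance(num_to_deal, int) or num_to_deal < 0:
            raise ValueError("num_to_deal must be a non-negative integer")
        if num_to_deal > len(self.cards):
            raise ValueError(
                "cannot deal %d cards; only %d left"
                % (num_to_deal, len(self.cards))
            )
        # Split the deck once instead of popping card by card.
        dealt, self.cards = self.cards[:num_to_deal], self.cards[num_to_deal:]
        for card in dealt:
            print(card)


deck = Deck(["Q of Hearts", "A of Spades", "7 of Clubs"])
deck.deal(1)  # prints the top card and leaves two cards in the deck
```

An empty deck falls out of the same checks: dealing 0 cards does nothing, and dealing 1 or more raises the "only 0 left" error.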

        Example #2

        The following is the deal method from participant #023:

        def deal(self, num_to_deal):
         res = []
         for i in range(0, num_to_deal):
           res.append(self.cards.pop(0))
        

        Grader 1 gave this method a 0 for correctness.  Grader 2 gave it a 3.

        I see two major problems with this method.  The first one is that it doesn’t print out the cards that are being dealt off:  instead, it stores them in a list.  Secondly, that list is just tossed out once the method exits, and nothing is returned.

        A “0” for correctness simply means Unimplemented, which isn’t exactly true:  this method has been implemented, and has the right interface.

        But it doesn’t conform to the specification whatsoever.  I would give this a 1.

        So, in this case, I’d side more (but not agree) with Grader 1.

        Example #3

        This is the deal method from participant #025:

        def deal(self, num_to_deal):
            num_cards_in_deck = len(self.cards)
            try:
                num_to_deal = int(num_to_deal)
                if num_to_deal > num_cards_in_deck:
                    print "Cannot deal more than " + num_cards_in_deck + " cards\n"
                i = 0
                while i < num_to_deal:
                    print str(self.cards[i])
                    i += 1
                self.cards = self.cards[num_to_deal:]
            except:
                print "Error using deal\n"
        

        Grader 1 also gave this method a 1 for correctness, where Grader 2 gave a 4.

        The method is pretty awkward from a design perspective, but it seems to behave as it should – it deals the provided number of cards off of the top of the deck and prints them out.

        It also catches some edge-cases:  num_to_deal is converted to an int, and we check to ensure that num_to_deal is less than or equal to the number of cards left in the deck.

        Again, I’ll have to side more with Grader 2 here.
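One wrinkle worth seeing in action is what happens when this method is asked for more cards than are left. Below is a Python 3 port of the method (only the print syntax is changed), wrapped in a hypothetical minimal `Deck` class and a made-up `captured_deal` helper that collects whatever gets printed.

```python
import contextlib
import io


class Deck:
    """Hypothetical minimal Deck; only the deal method comes from the post."""

    def __init__(self, cards):
        self.cards = list(cards)

    def deal(self, num_to_deal):
        # Python 3 port of participant #025's method; only print syntax changed.
        num_cards_in_deck = len(self.cards)
        try:
            num_to_deal = int(num_to_deal)
            if num_to_deal > num_cards_in_deck:
                # str + int raises TypeError, which the bare except swallows.
                print("Cannot deal more than " + num_cards_in_deck + " cards\n")
            i = 0
            while i < num_to_deal:
                print(str(self.cards[i]))
                i += 1
            self.cards = self.cards[num_to_deal:]
        except:
            print("Error using deal\n")


def captured_deal(deck, num_to_deal):
    """Run deck.deal and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        deck.deal(num_to_deal)
    return buffer.getvalue()


deck = Deck(["Q of Hearts", "A of Spades", "7 of Clubs"])
normal_output = captured_deal(deck, 2)    # deals the top two cards
too_many_output = captured_deal(deck, 5)  # only one card is left
```

The normal case deals from the top as expected. In the too-many case, building the "Cannot deal more than" message fails because a string is being concatenated with an integer, so the generic "Error using deal" message is printed instead and the deck is left untouched.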

        Example #4

        This is the deal method from participant #030:

        def deal(self, num_to_deal):
          ''''''
          i = 0
          while i <= num_to_deal:
            print self.cards[0]
            del self.cards[0]
        

        Grader 1 gave this a 1.  Grader 2 gave this a 4.

        Well, right off the bat, there’s a major problem:  this while-loop never exits on its own.  The while-loop is waiting for the value i to become greater than num_to_deal…but it never can, because i is initialized to 0, and never incremented.

        So this method doesn’t even come close to satisfying the spec.  The description for a “1” on this criterion is:

        Barely meets assignment specifications. Severe problems throughout.

        I’d have to side with Grader 1 on this one.  The only thing this method delivers in accordance with the spec is the right interface.  That’s about it.
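To see concretely what that loop does, here is a Python 3 port of the method (only the print syntax is changed) inside a hypothetical minimal `Deck` class. I'm assuming `self.cards` is a plain Python list, as it appears to be in the submission.

```python
class Deck:
    """Hypothetical minimal Deck; only the deal method comes from the post."""

    def __init__(self, cards):
        self.cards = list(cards)

    def deal(self, num_to_deal):
        # Python 3 port of participant #030's method; logic unchanged.
        i = 0
        while i <= num_to_deal:
            print(self.cards[0])
            del self.cards[0]


deck = Deck(["Q of Hearts", "A of Spades", "7 of Clubs"])
try:
    deck.deal(1)
    outcome = "returned normally"
except IndexError:
    # Indexing an empty list is what finally stops the loop.
    outcome = "IndexError"

print(outcome, deck.cards)
```

So for any non-negative num_to_deal, the method deals the entire deck, no matter how many cards were asked for, and then crashes with an IndexError once there is nothing left to print.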

        Dealing from the Bottom of the Deck

        I received an e-mail from Grader 2 about the deal method.  I’ve paraphrased it here:

        If the students create the list of cards in a typical way (for suit in CARD_SUITS, for rank in CARD_RANKS), and then print using something like:

        for card in self.cards:
          print str(card) + "\n"

        Then for deal, if they pick the cards to deal using pop() somehow, like:

        for i in range(num_to_deal):
          print str(self.cards.pop())

        Aren’t they dealing from the bottom?

        My answer was “yes, they are, and that’s a correctness problem”.  In my assignment specification, I was intentionally vague about the internal collection of the cards – I let the participant figure that all out.  All that mattered was that the model made sense, and followed the rules.

        So if I print my deck, and it prints:

        Q of Hearts
        A of Spades
        7 of Clubs

        Then deal(1) should print:

        Q of Hearts
        

        regardless of the internal organization.
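The difference comes down to which end of the list pop takes from. Here is a small Python 3 sketch, assuming the deck is stored as a list that prints from index 0 downwards; the two helper names are made up for illustration.

```python
def deal_from_top(cards, num_to_deal):
    """Remove and return cards from the front of the list (index 0 = top)."""
    return [cards.pop(0) for _ in range(num_to_deal)]


def deal_from_bottom(cards, num_to_deal):
    """Remove and return cards from the end of the list, as bare pop() does."""
    return [cards.pop() for _ in range(num_to_deal)]


printed_order = ["Q of Hearts", "A of Spades", "7 of Clubs"]

# Work on copies so both helpers start from the same deck.
from_top = deal_from_top(list(printed_order), 1)
from_bottom = deal_from_bottom(list(printed_order), 1)

print(from_top)     # ['Q of Hearts']
print(from_bottom)  # ['7 of Clubs']
```

A submission that stores its cards the other way around (top of the deck at the end of the list) could use a bare pop() correctly, as long as printing the deck is reversed to match.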

        Anyhow, only Grader 2 asked for clarification on this, and I thought this might be the reason for all of the disagreement on the deal method.

        Looking at all of the disagreements on the deal methods, it looks like 7 out of the 20 can be accounted for because students were unintentionally dealing from the bottom of the deck, and only Grader 2 caught it.

        Subtracting the “dealing from the bottom” disagreements from the total leaves us with 13, which puts it more in line with some of the other correctness criteria.

        So I’d have to say that, yes, the “dealing from the bottom” problem is what made the Graders disagree so much on this criterion:  only 1 Grader realized that it was a problem while they were marking.  Again, I think this was symptomatic of my hands-off approach to this part of the experiment.

        In Summary

        My graders disagreed.  A lot.  And a good chunk of those disagreements were about style and design.  Some of these disagreements might be attributable to my hands-off approach to the grading portion of the experiment.  Some of them seem to be questionable calls from the Graders themselves.

        Part of my experiment was meant to determine how closely peer grades from students can approximate grades from TAs.  Since my TAs have trouble agreeing amongst themselves, I’m not sure how that part of the analysis is going to play out.

        I hope the rest of my experiment is unaffected by their disagreement.

        Stay tuned.

        See anything?

        Do my numbers make no sense?  Have I contradicted myself?  Have I missed something critical?  Are there unanswered questions here that I might be able to answer?  I’d love to know.  Please comment!
