Tag Archives: wc

wordCount.xpi – Part 1

So, if you recall, I was asked to write a Firefox extension that would do word counting on websites.

Originally, when I started this project, I set a goal for myself:  I copied the text from Project Gutenberg’s First Folio version of Shakespeare’s Hamlet into OpenOffice Writer, recorded the word/line/character count statistics, and set that as my projected goal for my first iteration of my extension.

But there’s a problem with this approach:  I’m supposed to be copying the behaviour of Unix’s wc, not OpenOffice Writer’s word count.  Normally, this wouldn’t be a problem – a word count is a word count, a line count is a line count, and Writer should pump out the same numbers as wc.

Not so.

In my last post, I wrote:

According to OpenOffice Writer, this text has 32230 words, 173543 characters, and 4257 lines.

However, upon passing the same text (saved in the textfile “count.txt”) through wc, I got the following output:

5302 32230 178845 count.txt

Writer and wc agree on the number of words, but disagree on the number of lines – 5302 (wc) vs 4257 (Writer).  It’s a disagreement of about a thousand lines.

Brutal.

Anyhow, I’m going to focus on wc’s approach to line counting – simply returning the number of newline characters in the file.

And guess what…it works.  For Hamlet, my extension pumps out:

Document statistics:

Word Count:  32230
Line Count:  5302
Character Count:  178845
Character Count (no spaces):  142368

Nice.

Hamlet’s just the simple case though.  There are plenty of other cases to consider, but this is a start.

Anyhow, download here.

In this version, I’m using Mozilla’s TreeWalker implementation to stitch together the page text.  So far it seems to be working alright, but if it somehow ends up falling through, I might end up using something like Andrew Trusty’s code with the jQuery library to do the text stitching.

So there it is.  Maybe I’ll keep working on this, pretty it up a bit, etc.  However, work starts on Monday, and that’ll probably take up most of my technical attention.

We’ll see though.

For my next trick…

It didn’t take long for another Firefox extension idea to come along.

Prof. Greg Wilson recently sent me an email, saying the following:

I’d like a Firefox plugin that does ‘wc’, i.e., counts characters, words, and lines on the current web page, and displays the results in the status bar.

Cool, I thought.  No problem.  That doesn’t sound too hard.

But I’ve been mulling and chewing this around in my head, and it’s actually a harder problem than it first sounds.

wc“, short for word-count, is a small, simple, yet extraordinarily useful Unix utility that reads in some file, and spits out the number of words, characters, and lines for that file.

So what’s the problem?  What’s so hard about coding something like this for web pages?

Well, for starters, users of this proposed extension are probably only interested in the visible, readable text on a web page.  That means filtering out all of the HTML tags, all of the JavaScript, etc.  Also, many modern web pages make use of IFRAME’s, hidden DIV’s, etc.  Not to mention, most browsers do automatic word-wrapping, which could throw off the “line” counting.  How should I treat these cases?

I certainly don’t think this is an impossible task, just harder than it first sounded.

So here’s what I’m going to do:

First, I’m going to take care of the base case.  I’m going to take care of the case where users are viewing a page of all text, with almost zero HTML.

My test page will be an “etext” copy of Shakespeare’s Hamlet (first folio), hosted by Project Gutenberg.

According to OpenOffice Writer, this text has 32230 words, 173543 characters, and 4257 lines.

So that’s my target.  I’m going to create an extension that sits as a button on the status bar.  When the button is clicked, an alert will pop up with the statistics.  If all goes well, the numbers will match.

Sure, it’s not the most elegant interface, but it’ll do for now.

I’ll post more as it comes.