It didn’t take long for another Firefox extension idea to come along.
Prof. Greg Wilson recently sent me an email, saying the following:
“I’d like a Firefox plugin that does ‘wc’, i.e., counts characters, words, and lines on the current web page, and displays the results in the status bar.”
Cool, I thought. No problem. That doesn’t sound too hard.
But I’ve been mulling this over, and it’s actually a harder problem than it first sounds.
“wc”, short for “word count”, is a small, simple, yet extraordinarily useful Unix utility that reads in a file and prints the number of lines, words, and characters in that file.
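On plain text, the counting itself is the easy part. Here is a minimal sketch of it in JavaScript. The function name wcCounts is made up for illustration, and it assumes the usual wc conventions: lines are counted as newline characters, and words are runs of non-whitespace characters. Note that wc counts bytes by default, whereas this sketch counts JavaScript string units, so the character totals can differ on non-ASCII text.

```javascript
// Hypothetical helper (not part of any extension API): wc-style counts for a string.
function wcCounts(text) {
  // wc counts newline characters, so a last line with no trailing "\n" is not counted.
  var lines = (text.match(/\n/g) || []).length;
  // A word is a maximal run of non-whitespace characters.
  var words = (text.match(/\S+/g) || []).length;
  // Assumption: String length (UTF-16 code units) stands in for wc's byte count.
  var chars = text.length;
  return { lines: lines, words: words, chars: chars };
}
```

For example, wcCounts("a b\nc\n") gives 2 lines, 3 words, and 6 characters.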
So what’s the problem? What’s so hard about coding something like this for web pages?
Well, for starters, users of this proposed extension are probably only interested in the visible, readable text on a web page. That means filtering out all of the HTML tags, all of the JavaScript, etc. Also, many modern web pages make use of IFRAMEs, hidden DIVs, etc. Not to mention, most browsers do automatic word-wrapping, which could throw off the “line” counting. How should I treat these cases?
I certainly don’t think this is an impossible task, just harder than it first sounded.
So here’s what I’m going to do:
First, I’m going to take care of the base case, where users are viewing a page that is all text, with almost zero HTML.
My test page will be an “etext” copy of Shakespeare’s Hamlet (first folio), hosted by Project Gutenberg.
According to OpenOffice Writer, this text has 32230 words, 173543 characters, and 4257 lines.
So that’s my target. I’m going to create an extension that sits as a button on the status bar. When the button is clicked, an alert will pop up with the statistics. If all goes well, the numbers will match.
Sure, it’s not the most elegant interface, but it’ll do for now.
I’ll post more as it comes.
How about letting the browser render the document, then walking the DOM tree to stitch together the visible text?
@Greg That was what I was thinking too. In fact, Mozilla has provided a TreeWalker implementation that is proving quite useful.
My favourite part is the “whatToShow” parameter in the construction of a TreeWalker (https://developer.mozilla.org/en/DOM/treeWalker.whatToShow). NodeFilter.SHOW_TEXT sounds very useful.
I think I’m also wrong in putting up numbers from OpenOffice Writer. If I’m really trying to clone wc for Firefox, I might as well get numbers from wc to compare with.
I’m doing a lot of DOM text parsing in my research. This snippet extracted from my code should help you get started.
http://www.friendpaste.com/29sHyVu1J3iljnGks30yY3
@Andrew Interesting – thanks for the code snippet!
I hadn’t considered the possibility of importing jQuery into my extension, but it certainly seems possible ( http://gluei.com/blog/view/using-jquery-inside-your-firefox-extension ). I might go this route.
However, I’m pretty interested in Mozilla’s TreeWalker implementation – I’ve actually got it spitting out page text:
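// Visit only the text nodes under <body>.
var walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT, null, false);
var page_text = '';
// nextNode() returns null once the walker runs out of text nodes.
while (walker.nextNode()) {
  page_text += walker.currentNode.textContent;
}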
That’s what I’m playing around with right now. However, if it ends up being too much trouble, I’ll probably just go the jQuery route.
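One wrinkle with the loop above is that SHOW_TEXT also picks up text nodes inside script and style elements in the body. A rough sketch of one way around that is to pass a node filter as the third argument. The names SKIPPED_TAGS, isReadableTextNode, and collectPageText are invented for this example, and the tag list is an assumption rather than a complete one.

```javascript
// Assumed, incomplete list of tags whose text is not readable page content.
var SKIPPED_TAGS = { SCRIPT: true, STYLE: true, NOSCRIPT: true };

// Hypothetical predicate: true unless the text node's parent is a skipped tag.
function isReadableTextNode(node) {
  var parent = node.parentNode;
  // nodeName can be lowercase in XHTML documents, so normalise the case first.
  return !(parent && SKIPPED_TAGS[String(parent.nodeName).toUpperCase()]);
}

// Concatenates the text nodes under root that pass the predicate above.
function collectPageText(doc, root) {
  var filter = {
    acceptNode: function (node) {
      return isReadableTextNode(node)
        ? NodeFilter.FILTER_ACCEPT
        : NodeFilter.FILTER_REJECT;
    }
  };
  var walker = doc.createTreeWalker(root, NodeFilter.SHOW_TEXT, filter, false);
  var text = '';
  while (walker.nextNode()) {
    text += walker.currentNode.textContent;
  }
  return text;
}
```

Calling collectPageText(document, document.body) would then stand in for the loop above. This still does not deal with hidden DIVs or IFRAMEs; those would probably need a computed-style check and a separate walk per frame document.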
Cheers!
-Mike