January 2009 – Miles's Thoughs

A couple of posts ago, I said that my next post would be about some of the technologies I’ve had to learn to make progress on my vocabulary builder app. (BTW, I don’t really have a good name for it, so if you’ve got some suggestions, please share.)

I’ll go through them in the order that I learned them. Some of it is pretty technical, but I think there’s a bunch in there that others can understand as well.

To recap a bit, my idea is to look at the content of a webpage and annotate it with a personalized vocabulary list. As a test case, I’ve been using a focus.de article which has 600 words total, ~300 of which are unique. Some of them are variations of the same word (plurals, conjugations, etc.). More on that later.

Before I get overly technical, I’ll list the buzzwords:

1) javascript (the main language of the web.)

2) DOM (the interface to the structure of a webpage.)

3) mysql (to store the master vocabulary list)

4) php (to enable external programs to ask for a definition)

5) firefox XUL (variant of html for things like menu items)

6) firefox infrastructure (read/write files, extension sidebars)

7) more html than I knew before.

My first inclination was to just have it be another webpage on my site. You go to the page, enter a web address, and hit go. It turns out there are a couple of problems with this. The first is security. If you load a page from one domain, scripts on that page are not allowed to access data from another. This is enforced by all of the main browsers. There is an exception to this, frames, but that doesn’t really help. While frames allow content from different domains to be displayed on the same page, they are compartmentalized. A script in one frame is not allowed to interact with a frame that comes from another domain.

A second thing I thought of was to artificially have all of the content come from my site. While the web browser is restricted to one domain per page, the server doesn’t have the same constraint; I can do whatever I want on my server. You could tell my server via a form what content you’re interested in, it would ask the other server, and the annotated page would be delivered in one piece to your browser.

I passed on this option because of bandwidth, performance, and copyright constraints. Some websites may only want their stuff to come from them. RSS aggregators like Google Reader get around this by serving only the text in the site-provided feed. The problem there is that this text is often pretty useless. If you want to read the actual article, you’ll have to go to the website. This also doesn’t work with the movie subtitle idea I’m thinking about. Bandwidth and performance are issues because it would mean every annotated web access would be two jumps. You tell my site what you want to see and wait. My site would tell the other site what’s needed and wait.

So I’m taking the Firefox extension idea.

Most website stuff is done in javascript, a language I did not speak, but I’m glad I’ve learned it. The basic syntax is like C. Functions are first class data, which is something I haven’t had available to me since college. It turns out perl has that as well, but until recently, I only spoke an old dialect. Javascript also has closures. It doesn’t have classes, but the combination of function data and associative objects gives something similar. It’s as much a “real programming language” as it gets.

Now that I’ve decided where the program would sit, the next question becomes “where do I get my definitions?” An initial thought I had was to use one of the many translation sites out there. Google Translate is one possibility and it will probably be a part of my final solution. The problem is that it translates and I want definitions. There are other sites that will give a translation, but those don’t really work either, because the result is not just a definition, but an entire webpage. If I need to ask for 300 words, this won’t work.

So I poked around and found that there are a number of dictionaries available for download, some proprietary and some not. I’m currently using one from Babylon.com, but I’ll probably need to find something more freeware. They have good quality though, so that’s where I am now. Because they use a proprietary format, I had to poke around a bit to be able to convert it to something I can load into a database.

Now that I have a dictionary, I have to be able to access it, and some sort of SQL database seemed appropriate. mysql is one that is commonly available on web servers, including the one that I use for my page (1and1.com). It’s also where this blog is stored. I hadn’t used mysql in a while, so I needed some refreshing. I still need to find a way to deal with variations on spelling, but it’s working well so far. Hopefully, magazine editors do a good job running spellcheck.
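To make the lookup step concrete, here is a minimal sketch. It assumes a single table with one row per word; the table name `dictionary`, the column names, and the sample rows are made up for illustration, and Python's built-in SQLite stands in for mysql so the example runs without a server.

```python
import sqlite3

# In-memory SQLite stands in for the mysql server.
# Table name, column names, and sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dictionary (word TEXT PRIMARY KEY, definition TEXT)")
conn.executemany(
    "INSERT INTO dictionary (word, definition) VALUES (?, ?)",
    [("freund", "friend"), ("hund", "dog"), ("grün", "green")],
)


def lookup(word):
    """Return the stored definition for word, or None if it is not in the table."""
    row = conn.execute(
        "SELECT definition FROM dictionary WHERE word = ?", (word.lower(),)
    ).fetchone()
    return row[0] if row else None
```

The query is parameterized (the `?` placeholder) so that whatever word comes in from a web page is never pasted directly into the SQL text.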

While experimenting with these, I’ve found that I need a web server installed on my pc at home. Copying files back and forth between it and 1and1.com was just too slow. This meant installing Apache webserver, the PHP and mysql extensions for it, and mysql itself.

Ok, now I have a dictionary as well as the ability to ask my server for a definition. Now I need something to translate. If you go to any page, you’ll find lots of unrelated stuff on it. In addition to the article itself, there are menus, advertisements, links to other articles, all of which may contain vocabulary that the user doesn’t know. Lots of clutter. How do I know what’s part of the main text and what’s not? As it turns out, html components are annotated with ids and classes that are used for formatting purposes. The text can be identified using these. Every site does it a little differently, of course. I don’t have a good general solution to this yet. I’m hoping I can look for the blocks with the largest amount of text and find commonalities in their html structure. I’ll probably have to embed some knowledge into the system.
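The “largest block of text” idea could be sketched roughly like this. It is only an illustration under simple assumptions: the set of container tags, the class name `BlockTextCounter`, and the function `largest_text_block` are all hypothetical, and text is credited to the innermost enclosing container only.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical choice of tags that are treated as "blocks".
CONTAINERS = {"div", "article", "section", "td"}
IGNORED = {"script", "style"}


class BlockTextCounter(HTMLParser):
    """Count text characters per container, labeled by id, class, or tag name."""

    def __init__(self):
        super().__init__()
        self.stack = []          # labels of the currently open containers
        self.ignored_depth = 0   # > 0 while inside <script> or <style>
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED:
            self.ignored_depth += 1
        elif tag in CONTAINERS:
            attrs = dict(attrs)
            self.stack.append(attrs.get("id") or attrs.get("class") or tag)

    def handle_endtag(self, tag):
        if tag in IGNORED and self.ignored_depth:
            self.ignored_depth -= 1
        elif tag in CONTAINERS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Credit the text to the innermost open container.
        if self.stack and not self.ignored_depth:
            self.counts[self.stack[-1]] += len(data.strip())


def largest_text_block(html):
    """Return the label of the container holding the most text, or None."""
    parser = BlockTextCounter()
    parser.feed(html)
    parser.close()
    return parser.counts.most_common(1)[0][0] if parser.counts else None
```

On a real page this would only be a starting point. An article split across many sibling blocks, for example, would need the counts to be summed up to a common parent.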

Now I have the text. I make a list of all of the words in it. Time to get the definitions. For many of the words, it’s simple. Dog is in there. So is green. “Words” like 20 are easy to translate (i.e., I don’t). What about “dogs”? Including all plurals would make the dictionary larger. Even if we’re ok with that, the dictionary file has what it has; I don’t have control of it. Ok, what about words like “walked” or “walks”? What about the word “Obama”?
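Making the word list might look something like the sketch below. The function name `unique_words` is hypothetical, and lowercasing everything is a simplifying assumption (German capitalizes nouns, so a real version may want to keep the case).

```python
import re


def unique_words(text):
    """Return the distinct lowercase words in text, skipping tokens with digits."""
    tokens = re.findall(r"\w+", text.lower())
    return sorted({t for t in tokens if not any(ch.isdigit() for ch in t)})
```

For example, `unique_words("Der Hund und der Hund, 20 Hunde!")` gives `['der', 'hund', 'hunde', 'und']`: the repeated words collapse, the number is dropped, and “hunde” is still there as a separate entry that needs handling.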

German, with fewer irregularities, makes all this a bit easier than English, but there’s still work to be done (work I haven’t done yet). For the 300 words, about 1/3 of them are not in the dictionary as-is. Going by gut feeling, the “undefined” words break down like this:

1) 20%-30% are plurals and basic conjugations. Freund means friend. Freunden means friends. If I get Freunden and find that it’s not in the dictionary, I can lop off the ‘en’ and try again. There are some simple conjugation rules as well. German is pretty regular. Hebrew more so. I think this means it’ll work for the languages I’m trying to improve.

2) 10% are names and numbers. For these, I think I can ask Google Translate for its translation. If the translation is identical to the input word, I leave it off the list.
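Both rules above can be sketched in a few lines. This is an illustration, not the extension's actual code: the suffix list is a made-up starting point, the dictionary is a plain Python dict, and the translation is passed in as a string rather than fetched from any real translation service.

```python
# Hypothetical, deliberately short suffix list; longer suffixes are tried first.
SUFFIXES = ("en", "e", "n", "s")


def lookup_with_fallback(word, dictionary):
    """Try the word as-is, then with each suffix lopped off.

    Returns (base_form, definition), or None if nothing matches.
    """
    if word in dictionary:
        return word, dictionary[word]
    for suffix in SUFFIXES:
        # Keep at least three letters so short words are not cut down to nothing.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            if stem in dictionary:
                return stem, dictionary[stem]
    return None


def worth_listing(word, translation):
    """Names and numbers come back unchanged; those stay off the vocabulary list."""
    return translation.strip().lower() != word.strip().lower()
```

With a dictionary containing only `{"Freund": "friend"}`, looking up “Freunden” falls back to “Freund”, while “Obama” finds nothing and would go on to the translation check.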

So that’s where I am now. I have something wiggling and I think it’ll be helpful for me. Pretty soon I’ll look for a beta-tester AKA guinea pig. Probably my sister, who’s learning Swedish (I like to call it Svenskish).

Couple more weeks of work here and there, but even if it turns out to be a total flop, I’ve learned a lot.
