Automatic page classification using artificial neural networks (classifyit.herokuapp.com)
Being able to categorise is a very human characteristic. We tend to look at people, objects, ideas, opinions, reactions, etc. and immediately put them in what we think is their right bucket: person A is an introvert, idea B is outrageous, webpage C is technical, and so on.
Why we do it is not the subject I am interested in. However, how we do it is a very interesting point.
One key way of learning, in my opinion, is learning by example. When we are presented with a new subject, we tend to think of examples that would fit in that specific context. Only after a reasonable number of examples are we good at understanding a given topic or recognising its patterns.
That is why, for example, we look at a webpage and immediately know what its main topic is: it's a scientific page, an e-commerce page, or a technical blog. We have seen a lot of different pages, and therefore our filters are very accurate.
What about computers?
Could computers learn by example as well and perform classification? Can we trust computers to perform decent classification?
Well, to a certain extent, I believe so. Therefore, I would like to explore the idea of automatic classification of webpages based on algorithms that learn by example.
One example of such an algorithm is the backpropagation algorithm, which trains artificial neural networks (ANNs) in order to infer a function from observations. That function is then used to classify input data.
An ANN's ability to learn by example is very useful when the output is rather subjective or not very well defined. If we provide good sample data, the resulting classification function will produce reasonably good results.
Idea
My intention is to create a simple program that, given the URL of a webpage, will answer this simple question:
Is this a tech site?
I would like to know, given a URL, if the page is about technology or not.
Implementation
In order to do that, I followed the very simple algorithm described below. First, some pre-processing is needed:
- I started by defining what I considered to be a basic set of words that are usually present in technology pages (disclaimer: please note that this is just a very basic example and therefore doesn't aim to accurately represent the content of a technical site);
- I also defined a couple of datasets that could be representative of a technical page in terms of counts of the words defined in step 1, as well as a dataset representative of a non-technical webpage, again using word counts (a minimal sketch follows).
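To make this more concrete, the pre-processing could look roughly like the sketch below. The word list, the counts and the constant names (`TECH_WORDS`, `TECH_SAMPLES`, `NON_TECH_SAMPLES`) are purely illustrative, not the ones the app actually uses:

```ruby
# A (deliberately small) set of words assumed to be common on technology pages.
TECH_WORDS = %w[ruby javascript server database algorithm compiler api framework]

# Each sample pairs the counts of those words on an example page with the
# expected output [P(technical), P(not technical)].
TECH_SAMPLES = [
  [[12, 8, 5, 3, 2, 1, 6, 4], [1, 0]],
  [[7, 10, 2, 6, 1, 3, 9, 5], [1, 0]]
]

NON_TECH_SAMPLES = [
  [[1, 0, 0, 0, 0, 0, 0, 0], [0, 1]],
  [[0, 1, 0, 0, 1, 0, 0, 0], [0, 1]]
]
```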
After defining the datasets described above, the sinatra (a ruby DSL) application will, on startup, do the following:
- Train an ANN using the ruby gem ai4r. To achieve that, I add the datasets defined in step 2 in order to provide good positive and negative examples, so that the resulting function can perform classification in an acceptable way (a training sketch follows this list);
- Once the network is trained, the application is ready to receive input URLs.
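Training with ai4r then boils down to something like the following sketch, reusing the hypothetical datasets above; the layer sizes and the number of passes over the data are arbitrary choices for illustration:

```ruby
require 'ai4r'

# One input neuron per word, one small hidden layer, and two outputs:
# [P(technical), P(not technical)].
NET = Ai4r::NeuralNetwork::Backpropagation.new([TECH_WORDS.size, 4, 2])

# Backpropagation learns by repeatedly seeing positive and negative examples.
500.times do
  (TECH_SAMPLES + NON_TECH_SAMPLES).each do |counts, expected|
    NET.train(counts, expected)
  end
end
```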
Upon request, the application will:
- Fetch the entire page content;
- Count the frequency of the words in the page content;
- Discard the words that are not part of the word set defined in the very first step above;
- Run the classification function resulting from the training mentioned above and output the result (a sketch of this flow follows the list).
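A rough sketch of that request flow, assuming the word list and network defined above; the route name and the use of open-uri and nokogiri to fetch and strip the page are my own choices, not necessarily what the deployed app does:

```ruby
require 'sinatra'
require 'open-uri'
require 'nokogiri'

get '/classify' do
  # Fetch the page and keep only its visible text.
  html = URI.open(params[:url]).read
  text = Nokogiri::HTML(html).text.downcase

  # Count how often each word from the predefined set occurs;
  # every other word is implicitly discarded.
  counts = TECH_WORDS.map { |word| text.scan(/\b#{Regexp.escape(word)}\b/).size }

  # The trained network returns [P(technical), P(not technical)];
  # the decision function (see "What not to do" below) turns this into an answer.
  NET.eval(counts).inspect
end
```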
Hopefully, if the assumptions are correct, one will get reasonable results. I tried with a few websites and, in general, the algorithm is capable of classifying technical webpages, with a few exceptions.
What not to do
Along the way, I experienced some problems. For example, one has to be careful with the words one chooses to use. Initially, I was considering words such as technology, iphone and method, which are present on many websites regardless of their nature. News websites such as bbc.co.uk and guardian.co.uk, for instance, were false positives when using such words.
It is also important to have good test data, otherwise the results won't be accurate (or will only be accurate in relation to the testing data, and meaningless for what we are actually trying to classify).
Finally, it is also very important to have a good decision function. The ANN classification will return two results:
- the probability of the page being technical;
- and the probability of the page not being technical.
In my case, I consider a site to be technical if the probability of being technical is greater than the probability of not being technical, or if the probability of being technical is greater than 90%.
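In code, that decision rule amounts to something like this:

```ruby
# Returns true if the page should be classified as technical,
# given the two outputs of the network.
def technical?(p_tech, p_non_tech)
  p_tech > p_non_tech || p_tech > 0.9
end
```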
Minor changes in both the training data and the decision function will have a big impact on the results, so it is very important to get those two factors right.
Conclusion
Even with very lightweight examples and classification rules, the algorithm seems to classify websites with reasonable accuracy.
I tested with sites such as wired and ycombinator, and blogs such as github, heroku and daring fireball. In terms of negative tests, I tried news sites such as bbc and guardian, among others.
My gut feeling is that with proper training data and better keywords, one could expect much better results and therefore achieve a better classification score.
Working with ANNs was a good experience and I will keep them in mind for future classification problems.
If you would like to try it online, visit this site. The source code is also available here.
Replacing rails views with ember.js (dash-it-app.herokuapp.com)
In the past year or so, there have been a lot of discussions about client-side javascript model-view-controller (MVC) frameworks. In this category, both backbone.js and ember.js have been particularly in focus. Both aim to provide structure and architecture to web applications and to allow one to build rich user interfaces.
The idea of building rich and structured web applications is very appealing. Furthermore, combining this with ruby on rails, which also provides structure as well as an asset pipeline that processes, minifies and compresses javascript, makes for the perfect combination.
I then decided to try out ember.js with ruby on rails.
After seeing a number of tutorials and examples, I decided to build an app which would, to some extent, replace rails views with ember.js and take advantage of the fact that it is not necessary to download a new page every time a user interacts with the server. I built a simple todo list that allows one to create projects and a list of todos per project. The projects page is almost entirely done using ember.js.
You can either try it online (http://dash-it-app.herokuapp.com) or download the code and try it locally (http://github.com/carvil/dash-it).
Disclaimer: this is my first experience with client-side MVC frameworks and therefore I am not an expert on the matter. Thus, I kindly ask you to take it with a grain of salt.
Practicalities
On the practical side, there are a few things that made it easier for me to use ember.js with rails.
- Create structure: in general, ember.js examples have some structure; however, you will still find controllers, models and views in the same file. I find it hard to understand the code if one just puts everything in the same file. One of the great things about rails is the structure one finds in the code. Therefore, I created a directory in `assets/javascripts` which contains four directories: `controllers`, `models`, `templates` and `views`.
- Use coffeescript: I also find it much easier to write coffeescript instead of javascript. The syntax is much cleaner, and rails automatically converts the coffee files to javascript. I personally find the coffeescript version of a class much more readable than the equivalent javascript.
Pain points
- Nested resources: I also decided to use ember-rest instead of ember-data for RESTful resources, due to its simplicity. Since every decision comes with a price, ember-rest doesn't have support for nested resources, and because my `todos` depend on `projects`, I had to write the dependencies explicitly in ember.js.
- Debugging ember.js: ember.js objects are very complex. It takes quite a long time to understand Ember.Object and what properties you can use. For more details, I recommend reading Understanding Ember.Object.
- MVC on top of MVC: at first, it seems a bit awkward to use an MVC framework (ember.js) on top of another MVC framework (rails). However, since rails exposes the resources in a nice way, it is easy to treat rails as an API and have ember.js handle the view part; once one sees rails that way, everything becomes easier.
- Rails helpers: using pure ember views means one won't be able to use rails helpers in the views, which means writing more code. One example is the checkbox I added to each todo item. In rails, you only need one line of code to create a checkbox (see the example after this list); in ember, because I needed it to perform custom actions on click, it takes a view with around 20 lines of code.
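For comparison, the rails side really is a one-liner, something along these lines (the `done` attribute and the element name are just illustrative, not necessarily what the app uses):

```erb
<%# Renders a checkbox for a single todo, pre-checked if it is already done. %>
<%= check_box_tag "todo_#{todo.id}", "1", todo.done %>
```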
The good part
- No more page loads: one doesn't need to reload the page when a user executes an action. This means a better user experience and a quicker UI.
- Consistent user interface: since ember.js keeps the data consistent and uses the underlying API, it is possible to see changes made by a user in one browser being propagated in real time to the same page opened by other users.
Conclusion
Ember.js has proven to be a very interesting framework. One can build great applications with it, and the integration with rails makes it even easier. However, since it is still in its early days, there are not yet a lot of documented examples of how to use it or of which best practices to follow. It is a trial-and-error process, but one that is worth the time and effort.
Moving my blog to GitHub or why Jekyll is awesome
Over the last few weeks I have been thinking about why I stopped blogging and telling myself that I should do it again one of these days. I have had many ideas since then; however, for a number of reasons, I have kept putting that decision off.
Hopefully, that day has arrived.
I think one of the reasons (and perhaps the most important one) that made me stop blogging was that most blogging frameworks are just either too complicated or too polluted. One usually has to go through a number of pages full of menus, dropdown selections, plugins, etc. in order to write a blog post. When one finally arrives at the right page (not before having to look for the magic “new post” button hidden somewhere in the complicated UI), one realises that the editing window occupies a third of the screen, with the other two thirds being used for totally useless components. They are nothing but distractions from the actual purpose, which is to write a blog post.
This is just wrong. When I want to write, I want to concentrate on the actual post. I don’t want to see menu bars, fancy buttons, gorgeous styles, etc. I want to write. That’s it. Everything else comes later. Everything else is a distraction.
That’s why I think Jekyll is a good alternative. Jekyll simplifies the blogging experience because I can:
- Choose what editor to use. In my case, I use MacVim. I am comfortable with it and it’s simple.
- Use Markdown. I don’t have to worry about styling. I don’t have to click buttons to add links or lists, or to write html. I am used to writing markdown pages (READMEs and wikis) and therefore I don’t even have to think about its syntax (a minimal post file is sketched after this list).
- Use GitHub. Being able to push blog posts to the repository is just a wonderful experience.
- Run Jekyll locally to test changes. After I’m done with the blog post, I can see a preview by running `jekyll --server` on my local machine.
- Change the theme with one command. Jekyll provides an easy way of changing themes, adding blog posts, pages, etc.
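For reference, a new post is just a Markdown file named along the lines of `_posts/YYYY-MM-DD-my-new-post.markdown`, starting with a small YAML front matter block (the title below is a placeholder):

```markdown
---
layout: post
title: "My new post"
---

The post itself, written in plain Markdown.
```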
Thus, I really hope that this change will make me a more active blogger!