Feed Spider - Part 1
I’d like to make a directory of blogs that users can search or browse to find blogs they’re interested in. I don’t think such a thing exists right now, at least not one that works well. The only ones I know about are part of a subscription service, like Feedly.
It’s understandable why there really isn’t such a thing. There isn’t much money in blogs these days. Marketing dollars have moved on to a highly privacy-invasive model that doesn’t work well with decentralized blogs. Servers and search engines are costly to run. I think that’s why there isn’t a quality RSS search engine or directory outside of paid services.
I’d like to discuss approaching it differently. What if we created a fairly compact database that could be bundled in an application? It wouldn’t be able to cover every blog or every subject, but it could hold a lot. It could be enough to help people find content so that they spend more time reading blogs and enjoying their RSS reader.
The Problem
To build our small database we need to do a couple of things: find blogs and categorize them.
A hand-curated directory of blogs is too labor intensive. I don’t think there is any chance that enough volunteers would show up to create a meaningful directory. We’ll have to write some software to make this happen.
The Approach
Fortunately, one of the most studied areas in Machine Learning is Text Classification. This means there are both Open Source solutions and free vendor-supplied solutions. Apple’s Create ML is a good free solution that includes Text Classification. fastText is an Open Source project that focuses on Machine Learning for text.
I think fastText is the correct choice for a couple of reasons. The first is that it supports multi-label classification, while Create ML only supports single-label classification. The second is that fastText is cross-platform and will run on commodity Linux machines, while Create ML requires a Mac.
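To make that concrete, here’s a minimal sketch of multi-label training with fastText’s Python bindings. The file names, category labels, and hyperparameters are placeholders I made up; fastText expects each training line to start with one or more __label__ tags, and the loss='ova' (one-vs-all) setting is what enables independent multi-label predictions.

```python
import fasttext

# Hypothetical training file where each line looks like:
#   __label__technology __label__programming Rust is a systems language ...
model = fasttext.train_supervised(
    input="wikipedia_train.txt",  # placeholder file name
    epoch=25,
    wordNgrams=2,
    loss="ova",  # one-vs-all loss for independent multi-label predictions
)

# k=-1 asks for every label; the threshold keeps only the confident ones.
labels, probs = model.predict(
    "a blog post about machine learning", k=-1, threshold=0.5
)
print(labels, probs)

model.save_model("blog_classifier.bin")
```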
To train our Text Classification model, we will need some input data. I think Wikipedia will be a good source for this. Wikipedia has good article content and categories associated with those articles. To process Wikipedia articles, though, you have to parse them: they are written in wikitext, a unique markup format that isn’t easy to extract plain text from. Fortunately there is mwparserfromhell that we can use to parse the articles. With that, we should be able to get the input data we need to train our model.
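Here’s a rough sketch of what that parsing could look like for a single article. The wikitext snippet is a made-up example, and fetching real article source (say, from a Wikipedia XML dump) is a separate step not shown here.

```python
import mwparserfromhell

# A made-up wikitext snippet standing in for a real article's source.
wikitext = """
'''fastText''' is a library for efficient text classification.

[[Category:Machine learning]]
[[Category:Free software]]
"""

wikicode = mwparserfromhell.parse(wikitext)

# Categories are ordinary wikilinks whose target starts with "Category:".
category_links = [
    link for link in wikicode.filter_wikilinks()
    if str(link.title).startswith("Category:")
]
categories = [str(link.title)[len("Category:"):] for link in category_links]

# Remove the category links so they don't leak into the plain text,
# then strip the remaining markup.
for link in category_links:
    wikicode.remove(link)
plain_text = wikicode.strip_code().strip()

print(categories)   # ['Machine learning', 'Free software']
print(plain_text)   # fastText is a library for efficient text classification.
```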
Assuming we’ve found a way to classify blogs, we still need to find them. Scrapy is an Open Source web crawling framework that can be customized. I’m going to assume for now that, since it is Open Source, it can be extended to crawl for RSS feeds.
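As a gut check on that assumption, here’s a hypothetical spider. The seed URL is a placeholder and a real crawler would need politeness settings and better scoping, but the core idea is just looking for the <link> tags that blogs use to advertise their feeds.

```python
import scrapy


class FeedSpider(scrapy.Spider):
    """Hypothetical spider that records RSS/Atom feeds it finds while crawling."""

    name = "feedspider"
    start_urls = ["https://example.com/"]  # placeholder seed URL
    custom_settings = {"DEPTH_LIMIT": 2}   # keep the sketch from wandering forever

    def parse(self, response):
        # Blogs usually advertise feeds in the page head with <link> tags.
        feed_hrefs = response.css(
            'link[type="application/rss+xml"]::attr(href), '
            'link[type="application/atom+xml"]::attr(href)'
        ).getall()
        for href in feed_hrefs:
            yield {"page": response.url, "feed": response.urljoin(href)}

        # Follow outgoing links to keep discovering new sites.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```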
What now?
All the components that I’ve found thus far are written in Python or have Python bindings. Everything I’ve discussed is stuff a Data Scientist would do. I’m not a Python developer or a Data Scientist, so I’ve got a lot to learn and a lot of hard work ahead.
That hasn’t stopped me from trying to figure out how all of this will come together. In Part 2, I’ll discuss the architecture and high level design in more detail.