May 22, 2020

Feed Spider - Update 5

My first run at classifying blogs ended predictably bad. Not horribly bad, I guess. If you squint really hard, you could see that some of the categories kind of make some sense. They just were generally not useful due to category vagueness. The categories that were found were things like “culture” or “humanities” which could be almost anything. Things are going to have to get more specific and more accurate.

One of the things I noticed when I was validating the categories and their relationships that I extracted was that some were missing. It turns out that Wikipedia will use templates sometimes for categories. A Wikipedia template is server-side include, if you know what that is. Basically is a way to put one page inside another page. I didn’t have a way to include template contents in a page while I was parsing it and was missing categories because of it.

I’ve started reading fastText Quick Start Guide and am about a ¼ of the way through it. I haven’t learned much about NLP, but I have gotten more tools to play with now. One of these is another Wikipedia extraction utility, WikiExtractor and it handles templates!

Something I always do when looking at a new Github project is check out open issues and pull requests. It tells you a lot about how well maintained the project is. One open pull request for WikiExtractor is “Extract page categories”. I’m glad I saw that pull request, because I didn’t know that it didn’t extract categories. Also, I now had the code to extract those categories. I grabbed the pull request version and got to work.

I did a couple test runs and realized that although I was getting full template expanded articles with categories, I wasn’t getting any category pages. The category pages are how I build the relationships between categories. WikiExtractor is about 3000 lines of Python, all in one file. After a couple hours of reading code, I was familiar enough with the program to modify it to only extract category pages and bypass article pages. I’ll extract the article pages later.

I wrote a new Category Extractor that took input from WikiExtractor and reloaded my categories database. Success! I now had the missing categories. Before, I had about 1 million categories. Now I have 1.8 million. Due to this change and fixing some other bugs, my category relationship count went up from 550,000 entries to 3.1 million. This is a lot higher fidelity information than I had loaded before.

The larger database makes a problem I had earlier even harder now. How to roll up categories into their higher level categories. This was a poor performer before and now that I will be extracting articles again and assigning them categories, I’m going to have to make it go faster. It ran so slow that I only had 68,000 articles to train my model with and I want to use a lot more than that next time.

That’s the next thing to work on. In the meantime, I’m running WikiExtractor against the full Wikipedia dump to give me template expanded articles. This is running much slower than when I just extracted the category pages and may take a couple days to complete. My poor laptop. If I have to extract those articles more than once using WikiExtractor, I’m going to set up a large Amazon On-Demand instance to run it on. Hopefully, it won’t come to that.