May 20, 2020

Feed Spider - Update 4

I put a test harness around the prediction engine for fastText. The test harness downloads and cleans an RSS feed and asks for the most likely classification. Here are some results from One Foot Tsunami:

[Screenshot: classification results, May 20, 2020 at 9:21:22 PM]

Each row is an article title from the feed followed by the classification derived from the article content. I’m both encouraged and strangely disappointed at the same time. Things seem to be working, but clearly I need to do some work on what my categories are.
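A harness like the one described above could be sketched roughly as follows. This is not the actual Feed Spider code: the function names `clean` and `classify_feed` are made up for illustration, the feed is assumed to be RSS 2.0 already downloaded into a string, and `predict` stands in for whatever wraps the fastText model. Collapsing the article to a single line matters because fastText expects one line of text per prediction.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape
from typing import Callable, Iterator, Tuple

TAG_RE = re.compile(r"<[^>]+>")

def clean(text: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace to one line."""
    no_tags = TAG_RE.sub(" ", text)
    return " ".join(unescape(no_tags).split())

def classify_feed(
    rss_xml: str, predict: Callable[[str], str]
) -> Iterator[Tuple[str, str]]:
    """Yield (title, label) for each <item> in an RSS 2.0 document.

    `predict` is a hypothetical callable that takes cleaned article text
    and returns the most likely classification.
    """
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        title = clean(item.findtext("title", default=""))
        body = clean(item.findtext("description", default=""))
        yield title, predict(body)
```

Passing the classifier in as a callable keeps the download-and-clean step testable without a trained model on hand.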

Initially, I tried combining all the articles in the feed and running that through the prediction engine. It always gave “chronology” back as the classification. Individual articles seem to give better results. I’ll probably end up classifying by article and taking the most common classifications as the feed’s.
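The per-article approach amounts to a simple vote. Here is a minimal sketch, assuming each article has already been given one label; the function name `feed_classifications` and the `top_n` cutoff are illustrative choices, not something settled in the project.

```python
from collections import Counter
from typing import Iterable, List

def feed_classifications(article_labels: Iterable[str], top_n: int = 3) -> List[str]:
    """Return the top_n most common article labels, most frequent first."""
    counts = Counter(article_labels)
    return [label for label, _ in counts.most_common(top_n)]

# Hypothetical labels for six articles from one feed.
labels = ["technology", "humor", "technology", "chronology", "humor", "technology"]
top_two = feed_classifications(labels, top_n=2)
```

A catch-all label that shows up on only a few articles gets outvoted this way, which is one reason per-article classification may hold up better than classifying the combined text.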

I think “chronology” might be the default classification in the model. I see it come up a lot. Looking at the Wikipedia page for Category:Chronology makes me think that anything with a date in it will roll up to it. It looks like there will be troublemaker categories, like “chronology”, that I have to delete from the database. I’ve already eliminated the ones with the word “by” in them. These were things like “Birds by state”, which would clearly be better described by another classification.
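The pruning rules so far fit in a few lines. This sketch assumes the category names are available as plain strings; `prune_categories` and the blocklist are hypothetical names, and the real cleanup may well happen as a database delete instead.

```python
import re
from typing import Iterable, List

# Match "by" only as a whole word, so "Birds by state" is caught
# but "Byzantine Empire" is not.
BY_WORD = re.compile(r"\bby\b", re.IGNORECASE)

def prune_categories(categories: Iterable[str], blocklist: Iterable[str]) -> List[str]:
    """Drop blocklisted categories and any category containing the word 'by'."""
    blocked = {name.lower() for name in blocklist}
    return [
        c for c in categories
        if c.lower() not in blocked and not BY_WORD.search(c)
    ]

kept = prune_categories(
    ["Birds by state", "Chronology", "Byzantine Empire", "Birds"],
    blocklist=["chronology"],
)
```

Matching on the whole word keeps legitimate categories that merely start with or contain the letters "by".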

I think I’ll probably fall into a cycle of tweaking the categories and then running the rest of the flow to see how much the predictions improve. That means making that slow category roll-up query run faster. I think I have my work cut out for me tomorrow.