May 24, 2020

Feed Spider - Update 6

It took around 40 hours to expand and extract all of Wikipedia using WikiExtractor. In the end, I ended up with 5.6 million articles extracted. Wikipedia has 6 million articles, so WikiExtractor tossed out 400k of those. Possibly due to template recursion errors. That was something that WikiExtractor would occasionally complain about as it was working.

My next step was to fix that slow query that is used to roll up categories. I had no idea what I was going to do about it given the complexity of the query and amount of data that it was processing. Still, I thought I better do my due diligence and run an EXPLAIN against the query to tune it as much as I could.

I was surprised to see that the query was doing a full sequential scan of the relationship table. I thought that I had indexed its columns, but hadn’t. I only needed to have an index for one side of the relationship table, so I added it. I reran the query and it now consistently came back within 10’s of milliseconds as opposed to multiple seconds. This was a massive improvement.

Another change I made was that I went down another level in categories from the main content category. This netted about 10,000 categories that we would roll up into, versus the hundreds we had before. My hope was that this level would provide more useful categories for blogs.

I had to rewrite the Article Extractor now that it wasn’t going to be processing raw Wikipedia data any longer. Now it would be reading the JSON files generated by WikiExtractor. This would be much faster, especially since I got the roll up query fixed. Last time I ran the Article Extractor, it took all night long to extract only 68,000 records. This time I ran it and processed 5.6 million records in less than 2 hours. 💥

I was excited at this point and ran that output through the Article Cleaner to prepare it for training by fastText. That process is quick and only takes about ½ hour to run. Now for fastText training. I ran it with the same parameters as last time, just this time with a much, much larger dataset. fastText helpfully provides an ETA for completion. It was 4 hours, so I went to relax and have dinner.

After the model was built, I validated it and this time it only came out with 60% accuracy. That was a disappointment considering that it was 80% last time. Forging ahead, I ran the new model against a couple blogs. Testing against technology blogs gave varying and disappointing results.

Screen Shot 2020 05 24 at 8 52 44 PM

The results for One Foot Tsunami are now more specific and more accurate. They still aren’t very useful. I decided I would try a simpler blog, a recipe blog, to see if that would improve results. This is the results for “Serious Eats: Recipes”.

Screen Shot 2020 05 24 at 8 50 55 PM

At least it picked categories with “food” in the name a couple times. Still the accuracy is off and the categories not helpful. I need something that people would be looking for when trying to find a cooking or recipes blog.

I’m feeling pretty discouraged at this point. I think a part of me thought that throwing huge amounts of data at the problem would net much better results than I got. I have learned some things lately that I can try to improve the quality of the data. I’m not out of options and am far from giving up.

I think the next thing I will try though, is going down one more level in categories. Maybe the categories will get more useful. Maybe the accuracy will increase. Maybe it will get worse. I won’t know until I try.