Feed Spider - Update 7
I made two changes in my latest run. I probably should make only one change at a time so I can narrow down what is helping and what is setting me back, but I went ahead anyway. First, I moved down one more level in the targeted categories, which gave me many more categories to choose from. Second, I only selected the categories most closely related to each article's category. I estimate the net effect to be more categories overall, but fewer category labels per article.
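For context, fastText's supervised training data is just plain text with one document per line and labels marked with the `__label__` prefix, so "fewer category labels per article" means fewer of those prefixes on each line. A minimal sketch of how an article gets formatted into a training line; the category names and article text here are made up for illustration:

```python
# Hypothetical sketch: formatting one article as a fastText supervised training line.
# The category names and article text are invented for illustration only.
def to_fasttext_line(categories, article_text):
    # fastText expects each label prefixed with __label__, followed by the
    # document text, all on a single line.
    labels = " ".join("__label__" + c.replace(" ", "_") for c in categories)
    return labels + " " + article_text.replace("\n", " ")

print(to_fasttext_line(["Pizza", "Finance"], "A look at arbitrage in pizza delivery pricing"))
# -> "__label__Pizza __label__Finance A look at arbitrage in pizza delivery pricing"
```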
I ran all the processes needed up to the supervised training, started that, and took a 4 hour break while it ran. I should have stuck around and checked the ETA before taking my break. When I got back, there were still 11 hours of training remaining.
I wouldn't consider it a big deal to go do something else for 11 hours, except that the supervised training was using 100% of every CPU on my work computer. I'm impressed with how responsive macOS stays under that kind of load, and if all I wanted to do was light work, I could have just let it run. What I really wanted to do was fix bugs in NetNewsWire while it ran, and compile times are too frustratingly long while the rest of the computer is maxed out.
I figured it was finally time to spin up an on-demand Amazon instance. After some superficial research, I decided to give Amazon's new ARM CPUs a go. They have the best price-to-performance ratio, and since everything I'm doing is open source, I can compile it for ARM just fine.
The first machine I picked out was a 16 CPU instance. I got everything set up and started the supervised training. It was going to take 10 hours. Not good enough, so I detached the volume associated with the instance so that I wouldn't have to set up and compile everything again. I attached the volume to a 64 CPU instance and tried again. Still 10 hours to run. I checked and found it was only getting 1200% CPU utilization.
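The volume shuffle itself can be scripted. A rough sketch with boto3; the volume ID, instance IDs, region, and device name are all placeholders, and the volume needs to be unmounted (or the instance stopped) before detaching:

```python
import boto3

# Hypothetical sketch of moving an EBS volume between instances with boto3.
# All IDs, the region, and the device name below are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

volume_id = "vol-0123456789abcdef0"   # volume holding the compiled tools and data
old_instance = "i-0aaaaaaaaaaaaaaaa"  # the 16 CPU instance
new_instance = "i-0bbbbbbbbbbbbbbbb"  # the 64 CPU instance

# Detach from the smaller instance and wait until the volume is free.
ec2.detach_volume(VolumeId=volume_id, InstanceId=old_instance)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

# Attach to the bigger instance and wait until it is in use.
ec2.attach_volume(VolumeId=volume_id, InstanceId=new_instance, Device="/dev/sdf")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
```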
I'd assumed fastText was scaling itself based on the available CPUs, since it was sized perfectly for my MacBook: I have 12 logical processors and it was maxing them all out. It turns out you have to pass a command-line parameter to fastText to set the number of threads it uses. The default is 12, which just happened to match my MacBook exactly.
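The thread count is the `-thread` flag on the fastText command line, or the `thread` argument in the Python bindings. A minimal sketch with placeholder file names, leaving every other hyperparameter at its default:

```python
import os
import fasttext

# Minimal sketch: input/output paths are placeholders and all other
# hyperparameters are left at their defaults. fastText uses 12 threads
# unless you set this explicitly.
model = fasttext.train_supervised(
    input="wikipedia_train.txt",   # __label__-prefixed training lines
    thread=os.cpu_count(),         # use every logical processor on the machine
)
model.save_model("category_model.bin")
```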
I restarted with 64 threads this time, expecting great things. Instead, I got NaN exceptions from fastText. Rather than dig through the fastText code to find the problem, I took a shot in the dark and tried 48 threads. That worked and had an ETA of 3 hours. A 48 CPU instance is the next step down in Amazon's new ARM lineup, so that is where I've settled for my on-demand instance.
As a sidebar, I would like to point out that this is a pretty good deal for me. The 48 CPU instance costs $1.85 per hour to run. I'm not sure how to compare apples-to-apples with a workstation, but a 40-thread CPU workstation from Dell runs around $5k. Since I'm primarily an Apple platforms developer, I wouldn't have any use for it besides machine learning; it would mostly sit idle and depreciate in value. I would have to do more than 2,500 hours of processing to come out ahead by buying a workstation. That's assuming a $5k workstation is as fast as the 48 CPU on-demand instance, which I doubt it is.
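The break-even figure is just the workstation price divided by the hourly rate:

```python
# Rough break-even arithmetic: hours of rented compute equal to the workstation's price.
workstation_price = 5000.00   # dollars, roughly
hourly_rate = 1.85            # dollars per hour for the 48 CPU instance

print(workstation_price / hourly_rate)   # ~2,700 hours
```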
After 3 hours, the model came out 70% accurate against the validation file. That’s pretty good, but what about in the real world? Pretty shitty still. Here is One Foot Tsunami again.
The new model simply doesn't find any suggestions a lot of the time. See the "Pizza Arbitrage" entry above. The categories it does find kind of make sense? The categories are still trash for categorizing a blog, though.
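For reference, the validation number and the per-post suggestions come from the standard fastText test and predict calls. A sketch, again with placeholder file names and a made-up headline:

```python
import fasttext

# Sketch with placeholder paths: load the trained model, score it against the
# validation file, and ask for category suggestions for a single post title.
model = fasttext.load_model("category_model.bin")

# test() returns (number of samples, precision@1, recall@1); precision@1 is the
# "70% accurate against the validation file" style number.
n, precision, recall = model.test("wikipedia_valid.txt")
print(n, precision, recall)

# predict() returns the top-k labels and their probabilities for one piece of text.
labels, probs = model.predict("Pizza Arbitrage", k=5)
print(labels, probs)
```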
One of my assumptions starting this project was that Wikipedia's categories would be useful for supervised training. I really don't know if that is the case. How things are categorized on Wikipedia is chaotic and subjective, which you can tell just from browsing the categories yourself on the web. My hope that machine learning would find useful patterns in them is mostly gone.
It is time for a change in direction. Up to this point, I had hoped to get by with a superficial understanding of Natural Language Processing and data science. I can see now that won't be enough. I'm going to have to dig in and understand the tools I'm playing with better, as well as think about finding a new way to get a trained model that can categorize blogs.