Feed Spider development is finally underway. One of the challenging things about starting a new project is getting the development environment set up. More than once, I’ve seen this lead to analysis paralysis on projects. There is a strong urge to plan and set everything up correctly so the project gets off to a good start.
I’m not completely immune to this, even though I know it is a danger. I spent about a day working with Docker, which I have next to no knowledge of, to come up with a portable dev environment. My thinking was that with Docker I could set up developers with a full environment, including dependencies, with little effort. In my head, Docker would handle the PostgreSQL database, Python, Python dependencies, compiling fastText, etc…
I realized that this was beyond my knowledge of Docker and Docker Compose and that I would have to get an education in both before I even got started. Struggling with this felt too much like I was getting paralyzed and working on the wrong things. So I put it to the side. It is something I can add later. In the meantime, I got PostgreSQL installed, fastText compiled, and Python up and running with all my dependencies.
I’m new to Python, but am picking it up quickly. I was able to write the Category Extractor that scans all of Wikipedia for the categories assigned to the articles. It also understands the relationship between those categories so that they can be rolled up. Python has some very useful libraries for parsing Wikipedia and tutorials on how to do it. It all came together a lot faster than I thought it would.
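To give a rough idea of what category extraction involves, here is a minimal sketch. It is not the actual Category Extractor code. It assumes the raw wikitext of each page is already in hand, and it uses only a regular expression from the standard library rather than any particular Wikipedia parsing library. The names `extract_categories` and `extract_relationships` are made up for illustration. The relationship part relies on the fact that a category page is itself tagged with its parent categories.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]] in wikitext.
# Links written as [[:Category:Name]] are plain links, not memberships, and are skipped.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def extract_categories(wikitext):
    """Return the category names a page is tagged with, in order, without duplicates."""
    names = []
    for match in CATEGORY_LINK.finditer(wikitext):
        name = match.group(1).strip().replace("_", " ")
        if name and name not in names:
            names.append(name)
    return names


def extract_relationships(page_title, wikitext):
    """For a page in the Category namespace, return (child, parent) pairs."""
    prefix = "Category:"
    if not page_title.startswith(prefix):
        return []  # ordinary articles do not define category relationships
    child = page_title[len(prefix):]
    return [(child, parent) for parent in extract_categories(wikitext)]


sample = (
    "Albert Einstein was a physicist.\n"
    "[[Category:Theoretical physicists|Einstein, Albert]]\n"
    "[[Category:1879 births]]\n"
)
print(extract_categories(sample))
```

The (child, parent) pairs are what would allow categories to be rolled up later, since each pair is one edge in the category graph.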
The actual processing time to extract the categories and load them into a database was faster than I thought it would be too. Compressed, all the articles in Wikipedia are about 16GB. Uncompressed, they are supposed to be over 50GB. There are 60 compressed files that you can download. I processed each file separately, running 8 at a time in parallel. It took less than an hour to go through all of Wikipedia on my 2018 MacBook Pro.
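Running a fixed number of files at a time can be sketched with a worker pool. This is an assumption about how such a pipeline might look, not a copy of the real one: the `dumps/*.bz2` path is hypothetical, the function names are made up, and the worker here only counts category links instead of loading anything into PostgreSQL. Each file is decompressed as a stream, so the full uncompressed data never has to sit on disk or in memory.

```python
import bz2
import glob
import re
from concurrent.futures import ProcessPoolExecutor

CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)")


def count_category_links(path):
    """Stream one bz2-compressed dump file and count the category links in it."""
    total = 0
    # Text mode decompresses lazily, one line at a time.
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            total += len(CATEGORY_LINK.findall(line))
    return path, total


def process_all(paths, workers=8, executor_cls=ProcessPoolExecutor):
    """Run count_category_links over every file, at most `workers` at a time."""
    with executor_cls(max_workers=workers) as pool:
        return dict(pool.map(count_category_links, paths))


if __name__ == "__main__":
    # Hypothetical location of the downloaded dump files.
    results = process_all(sorted(glob.glob("dumps/*.bz2")))
    for path, total in results.items():
        print(path, total)
```

Separate processes are used rather than threads because decompression and parsing are CPU-bound. The executor class is a parameter only so the same function can be tried with a thread pool in a quick test.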
I ended up finding over 1 million categories and over 550,000 category relationships. That sounds about right considering there are 6 million articles and articles have multiple categories.
Now I just have to make some sense of all that data. The next couple of days will be important to see if that is possible. If I can’t figure out how to roll up and narrow down those categories, I’ll have to find another way to train my text classification model.