Blog

My Uncle Terry

Last February, Nicole and I were wintering in Arizona. We were down there living out of the van, like we normally try to do in the winter. We were fairly close to Yuma, AZ, where my Uncle Terry had a winter home, so we went to visit him and stayed just across the border in California on some BLM land.

IMG 0482

It was pretty much just desert there, but that didn’t matter. We were going to be spending our days with Terry and his partner Irene. They took us to see the Yuma Territorial Prison and went out to eat at all their favorite spots. We had a blast visiting with them and catching up.

IMG 0477

Uncle Terry died today from complications of COVID-19. He was older, but otherwise in good health until the virus. When I last saw him, he was just as sharp mentally as I’d ever seen him. He was taken before his time was up.

I’m going to miss that man. We didn’t spend enough time together over the years and I regret that. Terry was a great story teller and always so much fun to be around. There was always an abundance of laughter when Terry got on telling stories. He also loved cars and collected them, so he and I never ran out of things to talk about.

My family is heartbroken. We’re heartbroken and also angry. It didn’t have to be like this.

IMG 0473

Privately owned firetrucks

I was reading a blog post about firetrucks and it reminded me of a #vanlife post I never got around to writing.

In late January 2019, we were traveling around the U.S. and ended up on Padre Island, just off the coast of Texas. You can drive to the island over a bridge and camp for free almost anywhere you want along the beach. Even in the winter it is warm there, and quite beautiful if you like the ocean.

Van and Ocean

As beautiful as it is, there isn’t much to do except watch the ocean. So a couple times a day, I would walk up and down the shoreline and sometimes see some wildlife.

Bird on a dune

People were well spread out along the beach. On my walks I got to meet lots of interesting people with all different kinds of mobile living arrangements. By far the most interesting was the firetruck conversion.

Firetruck

I’d walked by this contraption several times before I caught the owner outside showing it off to someone and could discuss it with him. He was a former boat builder who had a fascination with firetrucks.

The rig is part firetruck, part boat, and part tow truck. You can clearly see the boat part of the vehicle that has been added to the firetruck. There is also a ramp (extended in the picture) that can load and carry a car.

I’ve seen a lot of unique builds while traveling and this one is definitely the most unique.

Feed Spider - Update 8

I’ve been doing a lot of reading and a lot of soul searching lately. Having to dig deep into Machine Learning wasn’t on my 2020 list of things to do. I’d really planned to spend this year improving my Apple platforms developer skills. Learning Python and a bunch of new concepts is a real detour for me.

To better understand if Machine Learning was something I wanted to go ahead with, I did some research on how much education you need to get into it. It turns out there is quite a bit to it, but there is also a good deal of overlap with my existing skillset. For example, my business programming career left me with lots of skill in manipulating and cleansing large amounts of data programmatically. That and being able to program at all are good starting points.

The other big prerequisite is math. Specifically linear algebra, statistics, and probability. I used to know how to do that stuff, but that was 30 years ago. The good news is that Khan Academy has courses that I can take as refreshers. All this Machine Learning stuff is within my reach.

I’ve decided to go ahead and get proficient with Machine Learning. It is a skill not too far out of my grasp. Besides, I need something to do.

Coursera has a Machine Learning course that they are giving away during the pandemic. Coursera looks like a great way to get your credit card charged for classes you haven’t taken or signed up for. At least that is what the online reviews say. Needless to say, they won’t be getting a credit card number from me. I am going to take the free class. I’m dubious as to how good it will be, but as long as it isn’t giving me outright misinformation, I think I’ll be ok.

After that, I’ll be taking the Khan Academy classes to get my math back up to par. In the meantime, I’m following the Towards Data Science blog. They put out lots of good material, and the more I read of it the more I’m beginning to understand.

All of this will take some time. If I make any progress towards Feed Spider, I’ll blog about it. Don’t expect much for a while though. 🤓

Feed Spider - Update 7

I made two changes in my latest run. I probably should only make one change at a time to be able to narrow down what is helping and what is setting me back. Still, I went ahead and moved down one more level for the targeted categories, which gave me a lot more categories that could come up. My second change was that I only selected the categories that were closest in relationship to the article’s category. The net effect, I estimate, is more categories overall but fewer category labels per article.

I ran the processes needed up to the supervised training, then started it and took a 4-hour break while it ran. I should have stuck around and checked the ETA for completion before taking my break. When I got back, there were still 11 hours of training remaining.

I wouldn’t consider it a big deal just to go and do something else for 11 hours, except that the supervised training was using 100% of all my CPUs on my work computer. I’m impressed with how responsive macOS stays while under that kind of load. If all I wanted to do was some light work, I could just let it run. What I really wanted to do was fix bugs on NetNewsWire while it ran. Compile times are just too frustratingly long while the rest of the computer is maxed out.

I figured it was finally time to spin up an on-demand Amazon instance. After doing some superficial research, I decided to give Amazon’s new ARM CPUs a go. They have the best price-to-performance ratio, and since everything I’m doing is open source, I can compile it for ARM just fine.

The first machine I picked out was a 16 CPU instance. I got everything set up and started the supervised training. It was going to take 10 hours. Not good enough, so I detached the volume associated with the instance so that I wouldn’t have to set up and compile everything again. I attached the volume to a 64 CPU instance and tried again. 10 hours to run. I checked and was only getting 1200% CPU utilization.

I’d assumed that fastText was scaling itself based on available CPUs since it was sized perfectly for my MacBook. I have 12 logical processors and it was maxing them out. It turns out that you have to pass a command-line parameter to fastText to set the number of threads for it to use. 12 is the default, which by coincidence matched my MacBook perfectly.
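
For reference, the thread count is the -thread flag on the fastText CLI, or the equivalent keyword argument in the Python bindings. A minimal sketch, with placeholder file names:

```python
# CLI form:  ./fasttext supervised -input articles.train -output model -thread 48
import fasttext

model = fasttext.train_supervised(
    input="articles.train",  # placeholder path to the labeled training file
    thread=48,               # defaults to 12 if omitted
)
```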

I restarted again using 64 threads this time, expecting great things. Instead I got NaN exceptions from fastText. Rather than dig through the fastText code to find the problem, I took a shot in the dark and started with 48 threads. That worked and had an ETA of 3 hours. A 48 CPU instance is the next step down for Amazon’s new ARM CPU instances, so that is where I’ve settled in for my on demand instance.

As a sidebar, I would like to point out that this is a pretty good deal for me. The 48 CPU instance is $1.85 per hour to run. I’m not sure how to compare apples-to-apples with a workstation, but getting to a 40-thread workstation at Dell is around $5k. Since I’m primarily an Apple platforms developer, I wouldn’t have any use for it besides doing machine learning. It would mostly be sitting idle and depreciating in value. I would have to do more than 2,500 hours of processing to come out ahead by buying a workstation. That’s assuming that a $5k workstation is as fast as the 48 CPU on-demand instance, which I doubt.

After 3 hours, the model came out 70% accurate against the validation file. That’s pretty good, but what about in the real world? Pretty shitty still. Here is One Foot Tsunami again.

Screen Shot 2020 05 26 at 10 31 25 AM

The new model simply doesn’t find any suggestions a lot of the time. See “Pizza Arbitrage” above. The categories that it does find kind of make sense? They are trash for categorizing a blog though.

One of my assumptions when starting this project was that Wikipedia’s categories would be useful for doing supervised training. I really don’t know if that is the case. How things are categorized in Wikipedia is chaotic and subjective. You can tell that just from browsing them yourself on the web. My hopes that machine learning would find useful patterns in them are mostly gone.

It is time for a change in direction. Up until this point, I had hoped to get by with a superficial understanding of Natural Language Processing and data science. I can see now that won’t be enough. I’m going to have to dig in and understand the tools I’m playing with better, as well as think about finding a new way to get a trained model that categorizes blogs.

Feed Spider - Update 6

It took around 40 hours to expand and extract all of Wikipedia using WikiExtractor. I ended up with 5.6 million articles extracted. Wikipedia has 6 million articles, so WikiExtractor tossed out about 400k of them, possibly due to template recursion errors, which it would occasionally complain about as it worked.

My next step was to fix that slow query that is used to roll up categories. I had no idea what I was going to do about it given the complexity of the query and amount of data that it was processing. Still, I thought I better do my due diligence and run an EXPLAIN against the query to tune it as much as I could.

I was surprised to see that the query was doing a full sequential scan of the relationship table. I thought that I had indexed its columns, but hadn’t. I only needed an index on one side of the relationship table, so I added it. I reran the query and it now consistently came back within tens of milliseconds as opposed to multiple seconds. This was a massive improvement.
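
The fix itself is basically a one-liner. Something like this, though the table and column names here are stand-ins and not my actual schema:

```python
import psycopg2

conn = psycopg2.connect(dbname="feedspider")  # placeholder connection info
with conn, conn.cursor() as cur:
    # Index the side of the relationship table the roll-up query joins on
    # (which side, and what it is named, are guesses for illustration).
    cur.execute(
        "CREATE INDEX IF NOT EXISTS idx_category_relationship_child "
        "ON category_relationship (child_category_id)"
    )
    # Rerunning EXPLAIN on the roll-up query should now show an index scan
    # instead of a full sequential scan of the relationship table.
```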

Another change I made was that I went down another level in categories from the main content category. This netted about 10,000 categories that we would roll up into, versus the hundreds we had before. My hope was that this level would provide more useful categories for blogs.

I had to rewrite the Article Extractor now that it wasn’t going to be processing raw Wikipedia data any longer. Now it would be reading the JSON files generated by WikiExtractor. This would be much faster, especially since I got the roll-up query fixed. Last time I ran the Article Extractor, it took all night long to extract only 68,000 records. This time I ran it and processed 5.6 million records in less than 2 hours. 💥
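
Reading that output is pleasantly boring. With the --json flag, WikiExtractor writes one JSON object per line with the article’s title and text, so the new extractor boils down to something like this sketch (the paths and the roll-up step are placeholders):

```python
import json
from pathlib import Path

def read_extracted_articles(extract_dir):
    """Yield (title, text) for every article WikiExtractor wrote out."""
    for path in sorted(Path(extract_dir).rglob("wiki_*")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)          # keys assume the --json output format
                yield doc["title"], doc["text"]

for title, text in read_extracted_articles("extracted"):
    # look up the article's categories and roll them up here (placeholder)
    pass
```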

I was excited at this point and ran that output through the Article Cleaner to prepare it for training by fastText. That process is quick and only takes about ½ hour to run. Now for fastText training. I ran it with the same parameters as last time, just this time with a much, much larger dataset. fastText helpfully provides an ETA for completion. It was 4 hours, so I went to relax and have dinner.

After the model was built, I validated it and this time it only came out with 60% accuracy. That was a disappointment considering that it was 80% last time. Forging ahead, I ran the new model against a couple blogs. Testing against technology blogs gave varying and disappointing results.

Screen Shot 2020 05 24 at 8 52 44 PM

The results for One Foot Tsunami are now more specific and more accurate. They still aren’t very useful. I decided I would try a simpler blog, a recipe blog, to see if that would improve results. Here are the results for “Serious Eats: Recipes”.

Screen Shot 2020 05 24 at 8 50 55 PM

At least it picked categories with “food” in the name a couple times. Still, the accuracy is off and the categories aren’t helpful. I need something that people would be looking for when trying to find a cooking or recipes blog.

I’m feeling pretty discouraged at this point. I think a part of me thought that throwing huge amounts of data at the problem would net much better results than I got. I have learned some things lately that I can try to improve the quality of the data. I’m not out of options and am far from giving up.

I think the next thing I will try though, is going down one more level in categories. Maybe the categories will get more useful. Maybe the accuracy will increase. Maybe it will get worse. I won’t know until I try.

CloudKit Extended Pauses

I’ve got something strange that happens with NetNewsWire’s CloudKit integration. I consider the code stable at this point. I’ve been running it for weeks across 3 different devices and they never go out of sync.

My problem is that the CloudKit operations seem to pause for extended periods of time. This could be an hour, but then it will just break loose and start working again. Restarting the app also clears the problem up. I’d suspect a deadlock of some kind, but it will start back up again without intervention.

What is strange about this is that it only happens on macOS; it never happens on iOS. It seems to be worse if my system is under load or if I’ve left NNW running for an extended period of time. It happens for both fetch and modify operations. I’m at a loss as to whether this is a test environment issue, something with my machine, or a coding problem. Has anyone ever seen anything like this before?

Feed Spider - Update 5

My first run at classifying blogs ended predictably badly. Not horribly bad, I guess. If you squint really hard, you could see that some of the categories kind of make some sense. They just weren’t generally useful because the categories were too vague. The categories that were found were things like “culture” or “humanities”, which could be almost anything. Things are going to have to get more specific and more accurate.

One of the things I noticed when I was validating the categories and their relationships that I extracted was that some were missing. It turns out that Wikipedia sometimes uses templates for categories. A Wikipedia template is a server-side include, if you know what that is. Basically, it is a way to put one page inside another page. I didn’t have a way to include template contents in a page while I was parsing it, and I was missing categories because of it.

I’ve started reading the fastText Quick Start Guide and am about a quarter of the way through it. I haven’t learned much about NLP, but I have gotten more tools to play with now. One of these is another Wikipedia extraction utility, WikiExtractor, and it handles templates!

Something I always do when looking at a new GitHub project is check out the open issues and pull requests. It tells you a lot about how well maintained the project is. One open pull request for WikiExtractor is “Extract page categories”. I’m glad I saw that pull request, because I didn’t know that it didn’t extract categories. Also, I now had the code to extract those categories. I grabbed the pull request version and got to work.

I did a couple of test runs and realized that although I was getting fully template-expanded articles with categories, I wasn’t getting any category pages. The category pages are how I build the relationships between categories. WikiExtractor is about 3,000 lines of Python, all in one file. After a couple hours of reading code, I was familiar enough with the program to modify it to only extract category pages and bypass article pages. I’ll extract the article pages later.

I wrote a new Category Extractor that took input from WikiExtractor and reloaded my categories database. Success! I now had the missing categories. Before, I had about 1 million categories. Now I have 1.8 million. Due to this change and fixing some other bugs, my category relationship count went up from 550,000 entries to 3.1 million. This is a lot higher fidelity information than I had loaded before.

The larger database makes a problem I had earlier even harder: how to roll up categories into their higher-level categories. This performed poorly before, and now that I will be extracting articles again and assigning them categories, I’m going to have to make it go faster. It ran so slowly that I only had 68,000 articles to train my model with, and I want to use a lot more than that next time.

That’s the next thing to work on. In the meantime, I’m running WikiExtractor against the full Wikipedia dump to give me template expanded articles. This is running much slower than when I just extracted the category pages and may take a couple days to complete. My poor laptop. If I have to extract those articles more than once using WikiExtractor, I’m going to set up a large Amazon On-Demand instance to run it on. Hopefully, it won’t come to that.

Feed Spider - Update 4

I put a test harness around the prediction engine for fastText. The test harness downloads and cleans an RSS feed and asks for the most likely classification. Here are some results from One Foot Tsunami:

Screen Shot 2020 05 20 at 9 21 22 PM

Each row is an article title from the feed followed by the classification derived from the article content. I’m both encouraged and strangely disappointed at the same time. Things seem to be working, but clearly I need to do some work on what my categories are.

Initially, I tried combining all the articles in the feed and running that through the prediction engine. It always gave “chronology” back as the classification. Individual articles seem to give better results. I’ll probably end up classifying by article and taking the most common classifications as the feed’s.
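
Here’s a rough sketch of that per-article approach, using feedparser to pull the entries and the trained model to vote. The feed URL, file name, and cleanup here are placeholders, not the real test harness:

```python
from collections import Counter

import fasttext
import feedparser

model = fasttext.load_model("model.bin")  # placeholder path to the trained model

def classify_feed(url, top_n=3):
    """Predict a label per article, then let the most common labels stand for the feed."""
    feed = feedparser.parse(url)
    votes = Counter()
    for entry in feed.entries:
        text = f"{entry.get('title', '')} {entry.get('summary', '')}"
        text = " ".join(text.lower().split())  # predict() rejects newlines; lowercase to match training
        if not text:
            continue
        labels, _probs = model.predict(text)   # top-1 label for this article
        votes.update(labels)
    return votes.most_common(top_n)

print(classify_feed("https://example.com/feed.xml"))
```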

I think “chronology” might be the default classification in the model. I see it come up a lot. Looking at the Wikipedia page for Category:Chronology has me thinking anything with a date in it will roll up to it. It looks like there will be troublemaker categories that I have to delete from the database, like “chronology”. I’ve already eliminated the ones with the word “by” in them. These were things like “Birds by state”, which would clearly be better described by another classification.

I think I’ll probably fall into a cycle of tweaking the categories and then running the rest of the flow to see how well the predictions improve. That means making that slow category roll-up query run faster. I think I have my work cut out for me tomorrow.

Feed Spider - Update 3

Yesterday, I had just gotten the categories and category relationships loaded into the relational database and identified the categories I want to use for blogs. The next step was rolling up all the hundreds of thousands of categories into those roughly 1300 categories.

I came up with a query that I thought would work. This isn’t easy because Wikipedia’s categories aren’t strictly hierarchical. They kind of are, but it is really more of a graph than a hierarchy. What I mean by this is that a specific article can have multiple top level categories and there are many paths to get there. You can see this if you click on one of the categories at the bottom of a Wikipedia article. It will take you to a page about that category, and at the bottom of that page are more categories that this one belongs to. Since there is more than one, the path to the top isn’t obvious, and there are many of them.

One pitfall to walking a graph like this is getting stuck in a loop. For example category A points to category B, which points to category C, which points to category A. Another problem is getting back to too many top level categories. Finally, you have to deal with the sheer number of paths that can be taken. Say the category we are looking for has 5 parent categories and those have 5 each and those have 5 each. That’s 125 paths to search after only going up 3 levels.

In the end, I put in code to limit recursion, capped the results at 5 categories, and only searched 4 levels upwards. The query still takes about 700ms to run, which is very slow. That is not good.

Screen Shot 2020 05 20 at 4 26 50 PM
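
The real roll-up lives in that SQL query, but the same walk sketched in Python makes the limits easier to see. The names and data structures here are made up for illustration:

```python
from collections import deque

def roll_up(category, parents_of, target_categories, max_levels=4, max_results=5):
    """Walk upward through the category graph toward the target categories,
    guarding against loops and capping both the depth and the result count."""
    results = []
    seen = {category}
    frontier = deque([(category, 0)])
    while frontier and len(results) < max_results:
        current, depth = frontier.popleft()
        if current in target_categories:
            results.append(current)
            continue                      # stop climbing once a target is reached
        if depth >= max_levels:
            continue                      # never search more than 4 levels up
        for parent in parents_of.get(current, ()):
            if parent not in seen:        # loop protection: A -> B -> C -> A
                seen.add(parent)
                frontier.append((parent, depth + 1))
    return results
```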

When building a complex system, it is important to address architectural risk as early as possible. I’m sure you have heard stories of projects getting cancelled after months or even years of development. Lots of times this is because a critical piece of the architecture proved unviable late in the project. Tragically, a lot of work that relied on that piece has usually been done before discovering that it won’t work.

The biggest piece of architectural risk in this system is the machine learning part. We want to get to that as quickly as possible so that we don’t end up doing work that may have to be thrown away. So I decided to move on to building the Article Extractor instead of optimizing the query or even trying to make it more accurate. We can come back to it later, after the risk has been addressed.

The Article Extractor, being very similar to the Category Extractor, didn’t take long to code and test. Its job is to read an article from the Wikipedia dumps, assign the roll-up categories, and write it out to a file. Since it relied on the slow query for part of its logic, I knew it would be slow. So I fired up 9 instances of the Article Extractor and let them run for about 14 hours.

When I finally checked the output of the Article Extractor it had produced only 68,000 records. That isn’t very much, but should be enough for us to move on to the next step. We can go back and generate more data later if this doesn’t prove sufficient.

The next step is to prepare fastText to do some blog classifications by training the model. I don’t know much of anything about fastText yet. I’ve bought a book on it, but haven’t read it. To keep things moving along, I adapted their tutorial to work with the data produced thus far.

I wrote the Article Cleaner (see the flowchart) as a shell script. It combines the multiple output files from the Article Extractor processes, runs them through some data normalization routines, and splits the result 80/20 into two separate files. The bigger file is used to train the model and the smaller one to validate it.
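
The real cleaner is a shell script, but the gist translates to a few lines of Python. The normalization here is just a lowercase and strip-punctuation pass, and the file names are placeholders:

```python
import glob
import random
import re
import string

random.seed(42)  # make the 80/20 split reproducible

def normalize(line):
    # keep the __label__ tags fastText expects, lowercase the text, drop punctuation
    tokens = line.split()
    labels = [t for t in tokens if t.startswith("__label__")]
    text = " ".join(t for t in tokens if not t.startswith("__label__"))
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text.lower())
    return " ".join(labels + text.split())

with open("articles.train", "w") as train, open("articles.valid", "w") as valid:
    for path in glob.glob("extracted_articles_*.txt"):  # one output file per extractor process
        with open(path) as f:
            for line in f:
                out = train if random.random() < 0.8 else valid
                out.write(normalize(line) + "\n")
```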

Supervised training came next. I fed fastText some parameters that I don’t fully understand, but that come from the tutorial, and validated the model. I was quite shocked when the whole thing ran in under 3 minutes. The fastText developers weren’t kidding when they named the project.

Screen Shot 2020 05 20 at 5 11 21 PM

The numbers are hard to read, but what they are saying is that we are 88% accurate at predicting the correct categories for the Wikipedia articles fed to it for verification. In the tutorial, they only get their model up to 60% accurate, so I’m calling this good for now. Almost assuredly, our larger input dataset made us more accurate than the tutorial. Eventually, I’ll do some reading and hopefully get that number even higher.

Now, it is almost time for the rubber to hit the road. The next step is to begin feeding blogs to the prediction engine and see what comes out. At this point, I’m not too concerned about the machine learning part working. I’m mostly concerned that the categories we’ve selected won’t work well for blogs. I guess I’m about to find out.

Feed Spider - Update 2

Wow, the Wikipedia category data I loaded was bad. Really bad. I guess that should be expected considering that was the first run of the Category Extractor. Still, I expected better.

I’ve decided to use the same category classifications as Wikipedia does for determining the main topics. The “Main topic classifications” page lists out the top level categories. From there on down, there are subcategory after subcategory of classifications.

Screen Shot 2020 05 19 at 5 00 12 AM

For example, if you drill down into “Academic disciplines”, you get a listing of its categories.

Screen Shot 2020 05 19 at 5 07 58 AM

I should have been able to query my database after loading it and see that under “Main topic classifications” were all of its subcategories. About half were missing. When dealing with 16GB of compressed text, where do you even start? I could see that I was missing the “Mathematics” subcategory, but had no idea where in that pile of 16GB to look.

Eventually, I wrote a program to extract the “Mathematics” category page that had the relationships in it that I was looking for. Then I was able to test with a single page instead of millions to find some bugs. As was typical for me, my logic was sound, but I had made a mistake in the details. I’d gotten in a hurry and made a cut and paste error when assigning a field to the database and was inserting the wrong one. A little desk checking might have saved me half a day or so of debugging.

I fixed my bug and started up the process again. Since I’d added another constraint on the database to improve the quality of the category relationships, the process was running slower than it had before. It was now taking about 2 or 3 hours to run. I never stick around to time it, so I fired it off and went to bed.

This morning I got up early and checked the data. It looks good now!

Screen Shot 2020 05 19 at 5 00 19 AM

I’m able to recreate the category relationships in the Wikipedia pages. Nothing is missing! The query in the screenshot above grabs all the subcategories for “Main topic classifications” and their subtopics. This yields 1,351 topics. I think that is a reasonable number of labels to train the model with. At least it is a good starting point to see how things will shake out.

I’m envisioning a way for users to select “Academic disciplines”, then choose “Biblical studies” from the resulting list, and then see a listing of blogs that fall into that category. They should also be able to search for “biblical” and get the same thing. Possibly we could even get to the point where searching for “bible” turns up the correct blogs using word vectors.

Now that I can go down from “Main topic classifications” to the categories I want identified, I have to go the other way. If a page has the category “Novelistic portrayals of Jesus”, I need to be able to roll that up to “Biblical studies”. After that gets figured out, I can begin extracting articles for the training model.

Postico

I’ve been analyzing the database I created for Feed Spider that models the Wikipedia categories in a relational database. I’m using PostgreSQL for the database and wanted something nicer than the command line to execute my queries.

I searched around for a bit and settled on Postico. It’s a PostgreSQL front end written by some indie devs. It’s what Brent Simmons would call a Mac-assed Mac app.

Screen Shot 2020 05 18 at 5 56 35 PM

I’ve only been using it a couple days, but it has made my life much nicer while working with both DDL (data definition language) and DML (data manipulation language). It was an easy decision to buy it. It’s only $40 if you get it from their site or $50 through the Mac App Store.

Some days I can’t help but step back and marvel at where we are technologically in the developer world. I used to work at companies that paid big money for Oracle or DB2 with their shitty Java-based developer frontends. Now I can have a high-powered database for free and a real AppKit front end for less than the price of a night out with my wife. Good times.

Feed Spider - Update 1

I’ve been working to get Feed Spider development started and it has started. One of the challenging things about starting a new project is getting the development environment set up. More than once, I’ve seen this lead to analysis-paralysis on projects. There is a strong urge to get things planned and set up correctly to get the project off to a good start.

I’m not completely immune to this, even if I know it is a danger. I spent about a day working with Docker, which I have next to no knowledge of, to come up with a portable dev environment. My thinking was that with Docker I could set up developers with a full environment, including dependencies, with little effort. In my head, Docker would handle the PostgreSQL database, Python, Python dependencies, compiling fastText, etc…

I realized that this was beyond my knowledge of Docker and Docker Compose and that I would have to get an education in these before I even got started. Struggling with this felt too much like I was getting paralyzed and was working on the wrong things. So I put that to the side. It is something I can add later. In the meantime, I got PostgreSQL installed, fastText compiled, and Python up and going with all my dependencies.

I’m new to Python, but I am picking it up quickly. I was able to write the Category Extractor that scans all of Wikipedia for the categories assigned to the articles. It also understands the relationships between those categories so that they can be rolled up. Python has some very useful libraries for parsing Wikipedia, and there are tutorials on how to do it. It all came together a lot faster than I thought it would.
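
The heart of the category scanning is small. Here’s roughly what it looks like using mwparserfromhell (which I mention in Part 1), with made-up names; the real extractor also records the relationships and writes everything to PostgreSQL:

```python
import mwparserfromhell

def categories_for(wikitext):
    """Return the categories assigned in one article's raw wikitext."""
    wikicode = mwparserfromhell.parse(wikitext)
    cats = []
    for link in wikicode.filter_wikilinks():
        title = str(link.title)
        if title.startswith("Category:"):
            cats.append(title[len("Category:"):].strip())
    return cats

print(categories_for("Some text. [[Category:Birds]] [[Category:Ornithology]]"))
# ['Birds', 'Ornithology']
```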

The actual processing time to extract the categories and load them into a database was faster than I thought it would be, too. Compressed, all the articles in Wikipedia are about 16GB. Uncompressed, they are supposed to be over 50GB. There are 60 compressed files that you can download. I processed them individually, 8 at a time. It took less than an hour to go through all of Wikipedia on my 2018 MacBook Pro.
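
Processing them 8 at a time is just a worker pool over the file list. A sketch, with the file pattern and the per-file work as placeholders:

```python
import glob
from multiprocessing import Pool

def extract_categories(path):
    # parse one compressed dump file and load its categories into PostgreSQL
    # (the real Category Extractor work would go here)
    print(f"processing {path}")

if __name__ == "__main__":
    dump_files = sorted(glob.glob("enwiki-*-pages-articles*.bz2"))  # the 60 downloadable pieces
    with Pool(processes=8) as pool:
        pool.map(extract_categories, dump_files)
```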

I ended up finding over 1 million categories and over 550,000 category relationships. That sounds about right considering there are 6 million articles and articles have multiple categories.

Now I just have to make some sense of all that data. The next couple days will be important to see if that is possible. If I can’t figure out how to roll up and narrow down those categories, I’ll have to figure out another way to train my Text Classification model.

Feed Spider - Part 2

In the first post about Feed Spider we discussed the motivation behind creating a feed directory. We also discussed some software components that can be used to create Feed Spider. Now we’re going to try to tie that all together.

The architecture and design of Feed Spider are at the inception phase. Writing this blog post is as much an exercise in helping me better understand the problem space as it is a way of communicating what I am trying to do. As they say, no plan survives first contact with the enemy. This plan is no different, and I expect it to be iterated on and refined as more gets learned and implemented.

Feedback and criticisms are welcome. Changing a naive approach is easier the sooner it is caught.

Flow Chart

Pictures always help and sometimes the old ways are best.

FeedSpiderFlowChart

Categories

The first problem that we run into with categorizing RSS feeds is what categories to put them into. What are those categories? Wikipedia supplies us with categories associated with an article. The problem is that these categories are hierarchical. Categories can have categories. There are also multiple categories assigned to an article. There are probably thousands of categories in Wikipedia. We will need to roll up the category hierarchy and reduce the number of categories used.

To do so, we will extract the categories used in Wikipedia and load them into a relational database using a new process called Category Extractor. We can then do some data analysis using SQL to get a rough idea of what the top level categories are and how many articles are under them. Once we understand the data better, we should be able to come up with criteria for tagging high value categories. A process called Category Valuator will be run against the database to identify and tag the high value categories we want to extract articles for.

Training Data

The Article Extractor process will scan Wikipedia for articles that have the categories that we are looking for. Initially we will pull 100 articles per category and increase that number as needed for the training file. An additional 10% of the records will be pulled into a separate file used to test the model. The output records will be formatted for use by fastText. Each record will have one or more categories (or labels) associated with the article text.
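
In fastText’s supervised format, that means each record is a single line with the labels prefixed to the text. A made-up example:

```python
# One training record per line, in fastText's supervised format (invented data):
record = "__label__Ornithology __label__Shorebirds the snowy plover is a small shorebird found along coasts"
```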

Training models work better on clean data. For example, capitalization and punctuation can degrade fastText results. We will preprocess the data in an Article Cleaner process.

The clean data will be passed into fastText to train the classifier model. The test file will be used to validate the training model. Training parameters will be tweaked here to improve accuracy while maintaining reasonable performance.

RSS Feeds

Eventually we will seed our RSS Web Crawler process with the Alexa Top 1 Million Domains. Initially, however, we will probably test by crawling a blog hosting site like Blogger. We should filter RSS feeds by downloading them and checking the last posted date and content length. This will favor blogs that post full article content over summary blogs. This is necessary so that we have enough content to make a category (label) prediction about the feed.
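
That filter could be as simple as this sketch; the thresholds and the use of feedparser are placeholders for whatever the crawler ends up doing:

```python
import time

import feedparser

def keep_feed(url, max_age_days=90, min_chars=1000):
    """Keep feeds that posted recently and carry substantial article content."""
    feed = feedparser.parse(url)
    if not feed.entries:
        return False
    latest = feed.entries[0]
    published = latest.get("published_parsed") or latest.get("updated_parsed")
    if published and (time.time() - time.mktime(published)) > max_age_days * 86400:
        return False
    # favor feeds that include full content rather than short summaries
    body = latest.content[0].value if latest.get("content") else latest.get("summary", "")
    return len(body) >= min_chars
```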

Feed Database

Our Labeled Feed Database will be generated by the RSS Prediction Processor. This process will call out to fastText to get a prediction of which categories match the RSS feed. It will also extract RSS feed metadata. This information will be combined to generate an output database of labeled (categorized) feed information.

The database should allow browsing by category (label). It should also allow full-text searching of feed title and/or category.

Search Results

The Labeled Feed Database should be able to be embedded in a client application. An RSS Reader is a prime example of where this could be used. A user interface should allow users to search the database, find feeds, and subscribe to them.

Summary

This high level overview should give you an idea of what we will build and how much work is involved. There is room for improvement. For example, the Labeled Feed Database is just a text search / browsing database. There might be something we could do with Machine Learning to better match search criteria with the labeled feeds.

Now on to implementing, iterating, and learning.

Feed Spider - Part 1

I’d like to make a directory of blogs that users can search or browse to find blogs that they are interested in. I don’t think there really is such a thing right now, at least not one that works well. The only ones I know about are part of a subscription service, like Feedly.

It’s understandable why there really isn’t such a thing. There isn’t a lot of money in blogs these days. Marketing dollars have moved on to a highly privacy-invasive model that doesn’t work well with decentralized blogs. Servers and search engines are costly to run. I think that’s why there isn’t a quality RSS search engine or directory outside of paid services.

I’d like to discuss approaching it differently. What if we created a fairly compact database that could be bundled in an application? It wouldn’t be able to cover every blog or every subject, but it could hold a lot. It could be enough to help people find content so that they spend more time reading blogs and enjoying their RSS reader.

The Problem

To build our small database we need to do a couple things. We need to find blogs and we need to categorize blogs.

A hand curated directory of blogs is too labor intensive. I don’t think there is a chance that enough volunteers would show up to create a meaningful directory. We’ll have to write some software to make this happen.

The Approach

Fortunately one of the most studied areas in Machine Learning is Text Classification. This means that there are open source solutions and free vendor supplied solutions. Apple’s Create ML is a good free solution that includes Text Classification. fastText is an Open Source project that focuses on text Machine Learning.

I think fastText is the correct choice for a couple of reasons. The first is that it supports multi-label classification, while Create ML only supports single-label classification. The second is that fastText is cross-platform and will run on commodity Linux machines, while Create ML requires a Mac to run.

To train our Text Classification model, we will need some input data. I think Wikipedia will be a good source for this. Wikipedia has good article content and categories associated with those articles. To process Wikipedia articles, you have to parse them. They are in a unique format that isn’t easy to extract. Fortunately, there is mwparserfromhell, which we can use to parse the articles. We should now be able to get the input data we need to train our model.

Assuming we’ve found a way to classify blogs, now we need to find them. Scrapy is an Open Source web spider that can be customized. I’m going to assume for now that since it is Open Source, it can be extended to crawl for RSS feeds.

What now?

All the components that I’ve found thus far are written in Python or have Python bindings. Everything I’ve discussed is stuff a Data Scientist would do. I’m not a Python developer or a Data Scientist, so I’ve got a lot to learn and a lot of hard work ahead.

That hasn’t stopped me from trying to figure out how all of this will come together. In Part 2, I’ll discuss the architecture and high level design in more detail.

I found an old photo from a couple years ago. It’s of my wife, Nicole, sitting along the bank of a river in Northern California. Just sitting along the river, listening to music, and having a few beers. It will be a long time before we’re able to do camping like this again.

CloudKit Impressions from a NetNewsWire Developer

I just got done implementing iCloud support in NetNewsWire. We are still doing preliminary testing on it and aren’t ready for public testing. I don’t know which release it will be in, unfortunately. That depends on how initial testing goes and then public testing.

I thought I’d write up some of my initial impressions of CloudKit. This isn’t a tutorial, although you might find some of the information useful if you are looking to develop with CloudKit.

Education Process

The process for learning any major technology from Apple seems to be about the same, and I found CloudKit no different. I read a couple blog posts on CloudKit when I got started. Then I watched some old WWDC videos on it. Then I searched GitHub for open source projects using it and read their code. Then I implemented it while relying on the API docs.

I found reading the code from an actual project to be the most helpful. A blog post and a couple of WWDC videos only get you to the point where you are dangerous to yourself and others. CloudKit has some advanced error handling that needs to be implemented for it to work at all. It is hard to pick that up from only a few sources.

Implementing basic sync

The hardest part about implementing the basic syncing process was leaving my relational database design knowledge behind. A CloudKit record is not a table, even if superficially they look the same.

This becomes very obvious when you look at how you can get only the items that have changed in a database since the last time you checked. CloudKit has a feature that will return only the record changes that have been made, which saves greatly on processing. You don’t necessarily get those changes in an order that you can rely on, so I don’t recommend managing complex relationships between records. I ended up doing a couple of things that went against all my training to get it to perform well.

Once you figure out how to model your CloudKit data and understand the APIs, things fall together fairly quickly. We have other RESTful services that do syncing in NetNewsWire, and CloudKit is the simplest implementation we have.

One area where CloudKit outshines our RESTful service implementations is that it gets notifications when the data changes. This keeps our data more up to date. In the RESTful services, we sync which feeds you are subscribed to every so often via polling. This happens, at shortest, around every 15 minutes. Realtime updates to your subscription information aren’t necessary, but it is fun to add a feed on your phone and watch it appear in realtime on the desktop.

Advanced syncing failure

One thing I wanted to do was provide a centralized repository that knew which feeds had been updated and when. I planned to have a system that would use the various NetNewsWire clients to update this data and notify the clients. My theory was that checking one site for updated feeds would be faster than testing all the sites to see if their feeds had updated.

I ended up giving up on this task. I think it would have been possible to implement in CloudKit, but it would not have been faster than checking all the sites for their feed updates. You see, we can send out hundreds of requests to see if a feed has been updated, all at the same time. Typically they return a 304 status code that says they weren’t updated, and they don’t return any data at all. This is very fast, and all the site checks happen at the same time. This is how the “On My Device” account works, and it is very fast.

The reason I couldn’t get CloudKit to work faster than checking all the sites individually comes down to one thing: there is no such thing as a “JOIN” between CloudKit records. If I could have connected data from more than one record per query, I could have done some data-driven logic.

What I wanted to do was have one record type that contained information about all the feeds that NetNewsWire iCloud accounts were subscribed to. This would contain information about whether the feed had been updated. I needed to join this to another record type that had an individual’s feeds so that I could restrict the number of feeds that a user was checking for updates.

I could have implemented something that didn’t use the “JOIN” concept, but it would have required lots of CloudKit calls. It would have required me to pass the data I wanted to JOIN in every call. It would have been unnecessarily complex and wouldn’t have performed better than just checking the site.

Conclusion

I think that CloudKit is amazing for what it is intended to do: syncing data between devices. I think it has the potential to do more, and I’ll be watching to see if Apple extends its capabilities in the future. There may be more yet that we do with CloudKit in NetNewsWire.

The new iPadOS cursor is amazing. I’m so impressed by Apple on this one. They successfully reinvented a concept none of us thought even needed updating. I hope they bring some of these ideas to the macOS cursor.