Online News Popularity Analysis
The growth of the internet and technology has contributed to the popularity of online news articles and blogs. While print news is not completely dead yet, a growing number of people prefer to search the web for the day’s happenings, as online news is free, immediate, and convenient. In fact, in 2018 Pew Research noted that just over 50% of Americans get their news from some online format. According to Forbes, the number had increased to 55% in 2019.
For our term project in BANA-273, Machine Learning for Analytics, our team used data on Mashable articles from a two-year period to build several machine learning models that predict an article’s popularity from a set of its features, such as the number of words in the article, the day of the week it was published, and the average sentiment polarity of its content. To run these models, we discretized the continuous target variable “shares” into two categories: “High” popularity for articles with a number of shares greater than or equal to the median number of shares across all articles, and “Low” popularity for articles with fewer shares than that median.
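The median split described above can be sketched in a few lines of Python. This is a minimal illustration rather than our project code: the function name `label_popularity` and the share counts are hypothetical, chosen only to show the "greater than or equal to the median" rule.

```python
from statistics import median

def label_popularity(shares):
    """Label each article "High" if its shares are >= the median of all shares, else "Low"."""
    cutoff = median(shares)
    return ["High" if s >= cutoff else "Low" for s in shares]

# Hypothetical share counts, for illustration only (not taken from the dataset)
example_shares = [593, 711, 1500, 1200, 505, 8600, 1400]
labels = label_popularity(example_shares)
print(labels)
```

Because ties with the median go to “High”, the two classes come out roughly balanced, with “High” never the smaller of the two.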
To predict the popularity of a given article, we built three machine learning models using the Naïve Bayes, Random Forest, and K-Nearest Neighbors (KNN) algorithms. Overall, the accuracies of all three models were low, and the data processing methods we tried did little to improve them. In fact, some data processing methods actually reduced accuracy for certain models. However, we found that a Random Forest model with supervised discretization (performed without “looking” at the test set) gave us the best result, with an overall accuracy of 64.91%.
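To make the idea of supervised discretization without “looking” at the test set concrete, here is a simplified sketch. It is not the WEKA filter we ran: it picks a single entropy-minimizing cut point for one feature, and all names and numbers (`best_cut_point`, `discretize`, the word counts) are hypothetical. The point it illustrates is that the cut point is learned from the training data only and then reused unchanged on the test data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    """Return the threshold that minimizes the weighted class entropy of the
    two resulting bins, or None if all values are identical."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_cut, best_score = (lo + hi) / 2, score
    return best_cut

def discretize(values, cut):
    """Map each numeric value to a bin using an already-learned cut point."""
    return ["low" if v <= cut else "high" for v in values]

# Hypothetical word counts and popularity labels, for illustration only
train_words = [120, 250, 300, 800, 950, 1100]
train_labels = ["Low", "Low", "Low", "High", "High", "High"]
test_words = [200, 1000]

cut = best_cut_point(train_words, train_labels)  # uses training data only
train_bins = discretize(train_words, cut)
test_bins = discretize(test_words, cut)          # test labels are never consulted
print(cut, train_bins, test_bins)
```

Learning the cut point from the full dataset instead would let information about the test labels leak into the features, which tends to inflate the reported accuracy.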
Our team used Python for the majority of the models we built, and we used WEKA to run the models with supervised discretization. For more information on our techniques as well as our code, please check out the GitHub link above.