One of the more annoying aspects of the modern Internet is crap comments. Things have improved in recent years, but for a while the typical comments on YouTube music videos were among the most idiotic examples of human "thought" and behavior I've ever seen…
A common solution to the problem is to have readers rate comments. Then comments that are highly-rated by readers get ranked near the top of the list, and comments that are panned by readers get ranked near the bottom of the list. This mechanism is used to good effect on general-purpose sites like Reddit, and specialized-community sites like Less Wrong.
Obviously this mechanism is very similar to the one used on Slashdot and Digg and other such sites, for collaborative rating of news items, web pages, and so forth.
There are many refinements of the methodology. For instance, if an individual tends to make highly-rated comments, one can have the rating algorithm give extra weight to their ratings of others' comments.
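To make that refinement concrete, here is a rough Python sketch. Everything in it is my own illustration, not any site's actual algorithm: the function names, the +1/-1 rating scale, and the choice to measure a rater's "reputation" as the mean rating their own comments have received are all assumptions.

```python
from collections import defaultdict

def rater_reputation(ratings, authors):
    """Mean rating received by each author's texts (an assumed reputation measure)."""
    received = defaultdict(list)
    for (rater, text), value in ratings.items():
        received[authors[text]].append(value)
    return {author: sum(v) / len(v) for author, v in received.items()}

def weighted_score(text, ratings, reputation, base_weight=1.0):
    """Weighted mean of the ratings on `text`; higher-reputation raters count more."""
    num = den = 0.0
    for (rater, rated_text), value in ratings.items():
        if rated_text != text:
            continue
        # Negative reputations are clipped to zero so every weight stays positive.
        weight = base_weight + max(reputation.get(rater, 0.0), 0.0)
        num += weight * value
        den += weight
    return num / den if den else 0.0

# Hypothetical data: who wrote which comment, and (rater, comment) -> rating.
authors = {"c1": "alice", "c2": "bob", "c3": "carol"}
ratings = {
    ("bob", "c1"): 1, ("carol", "c1"): 1,
    ("alice", "c2"): -1, ("carol", "c2"): -1,
    ("alice", "c3"): 1, ("bob", "c3"): -1,
}
reputation = rater_reputation(ratings, authors)
score_c3 = weighted_score("c3", ratings, reputation)
```

In this toy data a plain average would score `c3` at zero (one up-vote, one down-vote), but the up-vote comes from the well-regarded `alice`, so the weighted score comes out positive.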
Such algorithms are interesting and effective, but have some shortcomings as well, one of which is a tendency toward "dictatorship of the majority." For instance, if a piece of content is loved by a certain 20% of readers but hated by the other 80%, it will get badly down-voted.
I started wondering recently whether this problem could be solved in an interesting way via an appropriate application of basic graph theory and machine learning.
That is, suppose one is given a pool of texts (e.g. comments on some topic), a set of ratings for each text, and information on the ratings made by each rater across a variety of texts.
Then, one can analyze this data to discover *clusters of raters* and *networks of raters*.
A cluster of raters is a set of folks who tend to rate things roughly the same way. Clusters might be defined in a context-specific way -- e.g. one could have a set of raters who form a cluster in the context of music video comments, determined via only looking at music video comments and ignoring all other texts.
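One simple way to find such clusters, sketched below in Python, is to compare each pair of raters on the texts they have both rated and then group together raters who are similar enough. The cosine similarity, the 0.5 threshold, the connected-components grouping, and all the names are assumptions of mine; a real system would probably use a proper clustering algorithm instead.

```python
from itertools import combinations
from math import sqrt

def similarity(a, b):
    """Cosine similarity of two raters over the texts both have rated."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = sqrt(sum(a[t] ** 2 for t in shared))
    norm_b = sqrt(sum(b[t] ** 2 for t in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rater_clusters(by_rater, threshold=0.5):
    """Group raters into connected components of the 'similar enough' graph."""
    parent = {r: r for r in by_rater}

    def find(r):
        # Union-find lookup with path halving.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for r1, r2 in combinations(by_rater, 2):
        if similarity(by_rater[r1], by_rater[r2]) >= threshold:
            parent[find(r1)] = find(r2)

    groups = {}
    for r in by_rater:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())

# Hypothetical data: rater -> {text: rating}, using +1 / -1 ratings.
by_rater = {
    "ann": {"c1": 1, "c2": -1, "c3": 1},
    "ben": {"c1": 1, "c2": -1},
    "cat": {"c1": -1, "c2": 1, "c3": -1},
    "dan": {"c2": 1, "c3": -1},
}
clusters = rater_clusters(by_rater)
```

To get context-specific clusters, one would just restrict `by_rater` to the ratings on texts from a single context (say, music video comments) before calling `rater_clusters`.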
A network of raters is a set of folks who tend to rate each others' texts highly, or who tend to write texts that are replies to each others' texts.
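As a rough illustration of the network idea, the Python sketch below builds a directed graph with an edge from one person to another whenever the first rates the second's text highly or replies to it, and then treats groups connected by reciprocated edges as networks. Requiring reciprocity, and all of the names and data shapes, are my own assumptions; other definitions of "network" would work too.

```python
from collections import defaultdict

def endorsement_edges(ratings, authors, replies=(), min_rating=1):
    """Directed edges u -> v: u rated v's text highly, or u replied to v's text."""
    edges = set()
    for (rater, text), value in ratings.items():
        author = authors[text]
        if value >= min_rating and rater != author:
            edges.add((rater, author))
    for replier, text in replies:
        if replier != authors[text]:
            edges.add((replier, authors[text]))
    return edges

def rater_networks(edges):
    """Connected groups in the undirected graph of reciprocated edges."""
    mutual = defaultdict(set)
    for u, v in edges:
        if (v, u) in edges:
            mutual[u].add(v)
            mutual[v].add(u)
    seen, networks = set(), []
    for start in list(mutual):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over reciprocated links
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(mutual[node] - group)
        seen |= group
        networks.append(group)
    return networks

# Hypothetical data: text -> author, (rater, text) -> rating, (replier, text replied to).
authors = {"t1": "ann", "t2": "ben", "t3": "cat", "t4": "dan"}
ratings = {("ann", "t2"): 1, ("ben", "t1"): 1, ("cat", "t1"): 1, ("dan", "t3"): -1}
replies = [("cat", "t4"), ("dan", "t3")]
networks = rater_networks(endorsement_edges(ratings, authors, replies))
```

Here `ann` and `ben` form a network because they up-vote each other, and `cat` and `dan` form one because they reply to each other, even though `dan` down-voted `cat`.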
Given information on the clusters and networks of raters present in a community, one can then rank texts using this information. One can rank a text highly if some reasonably definite cluster or network of raters tends to rank it highly.
This method would remove the "dictatorship of the majority" problem, and result in texts being highly rated if any "meaningful subgroup" of people liked them.
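Here is a minimal Python sketch of that ranking rule, applied to the 20%/80% situation described earlier. Scoring a text by the best mean rating it gets from any subgroup, and requiring at least `min_votes` ratings from a subgroup before it counts, are assumptions I made for illustration; the subgroups themselves are just given by hand here rather than discovered.

```python
def mean_rating(text, raters, by_rater):
    """Mean rating of `text` among `raters`, plus how many of them rated it."""
    votes = [by_rater[r][text] for r in raters if text in by_rater.get(r, {})]
    return (sum(votes) / len(votes) if votes else 0.0), len(votes)

def subgroup_score(text, subgroups, by_rater, min_votes=2):
    """Best mean rating the text gets from any subgroup with enough votes."""
    best = None
    for group in subgroups:
        mean, count = mean_rating(text, group, by_rater)
        if count >= min_votes and (best is None or mean > best):
            best = mean
    return best if best is not None else 0.0

# Hypothetical data: ten raters, two of whom (20%) love the "niche" comment.
by_rater = {f"r{i}": {"niche": 1 if i < 2 else -1} for i in range(10)}
fans = {"r0", "r1"}
others = set(by_rater) - fans

overall, _ = mean_rating("niche", by_rater, by_rater)      # plain majority average
by_subgroup = subgroup_score("niche", [fans, others], by_rater)
```

The plain average over all ten raters is negative, so the comment would sink under ordinary voting, while the subgroup score is as high as it can be because one coherent group of raters unanimously likes it.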
Novel methods of browsing content also pop to mind here. For instance: instead of just a ranked list of texts, one could show a set of tabs, each giving a ranked list of texts according to some meaningful subgroup.
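The data behind such a tabbed view could be as simple as one ranked list per subgroup. The Python sketch below assumes the subgroups have already been found and labeled; the function name, the labels, and the use of a plain within-group mean are all hypothetical.

```python
def ranked_tabs(texts, subgroups, by_rater):
    """Map each subgroup label to the texts, ordered best-rated first by that group."""
    tabs = {}
    for label, raters in subgroups.items():
        def group_mean(text):
            # Mean rating of `text` among this subgroup's raters (0.0 if unrated).
            votes = [by_rater[r][text] for r in raters if text in by_rater[r]]
            return sum(votes) / len(votes) if votes else 0.0
        tabs[label] = sorted(texts, key=group_mean, reverse=True)
    return tabs

# Hypothetical data: rater -> {text: rating}, and hand-labeled subgroups.
by_rater = {
    "ann": {"a": 1, "b": -1},
    "ben": {"a": 1, "b": -1},
    "cat": {"a": -1, "b": 1},
}
subgroups = {"group 1": ["ann", "ben"], "group 2": ["cat"]}
tabs = ranked_tabs(["a", "b"], subgroups, by_rater)
```

Each tab orders the same texts differently, so a reader can pick the subgroup whose taste is closest to their own.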
Similar ideas could also be applied to the results of a search engine. In this case, the role of "ratings of text X" would be played by links from other websites to site X. The PageRank formula gives highest rank to sites that are linked to by other sites (with highest weight given to links from other sites with high PageRank, using a recursive algorithm). Other graph centrality formulas work similarly.
As an alternative to this approach, one could give high rank to a site if there is some meaningful subgroup of other sites that links to it (where a meaningful subgroup is defined as a cluster of sites that link to similar pages, or a cluster of sites with similar content according to natural language analysis, or a network of richly inter-linking sites).
Instead of a single list of search results, one could give a set of tabs of results, each tab listing the results ranked according to a certain (automatically discovered) meaningful subgroup.
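To show the contrast in code, here is a rough Python sketch of both approaches on a toy link graph. The first function is a textbook-style power-iteration version of PageRank, not what any search engine actually runs; the damping factor of 0.85 and the handling of pages with no outgoing links are conventional choices I am assuming. The second function, `subgroup_link_score`, is a hypothetical stand-in for the subgroup idea: it scores a site by the largest fraction of any one cluster's members that link to it.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            # Pages with no outgoing links spread their rank evenly over all pages.
            targets = links.get(page) or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

def subgroup_link_score(site, clusters, links):
    """Largest fraction of any one cluster's sites that link to `site`."""
    return max(
        sum(site in links.get(member, ()) for member in cluster) / len(cluster)
        for cluster in clusters
    )

# Hypothetical link graph: a, b, d all link to c; the small cluster e, g links to f.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"], "e": ["f"], "g": ["f"]}
clusters = [{"a", "b", "d"}, {"e", "g"}]  # assumed to be discovered separately

rank = pagerank(links)
top_page = max(rank, key=rank.get)
niche_score = subgroup_link_score("f", clusters, links)
```

In this toy graph PageRank puts `c` on top and `f` well below it, while the subgroup score rates `f` just as highly as `c`, because every member of one small cluster links to it.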
There are many ways to tune and extend this kind of methodology. After writing the above, a moment's Googling found a couple of papers on related topics, such as:
But it doesn't seem that anyone has rolled out these sorts of ideas into the Web at large, which is unfortunate….
But the Web is famously fast-advancing, so there's reason to be optimistic about the future. Some sort of technology like I've described here, deployed on a mass scale, is going to be important for the development of the Internet and its associated human community into an increasingly powerful "global brain" …