Monday, August 2, 2010

Statistical Ratings Explained

Our rating system relies heavily on large-scale statistical analysis of user reviews to rate and compare products and services. How does this work exactly?

Looking for user reviews

For each product we evaluate, we look for major sites with large numbers of user reviews. The key is to find a large enough sample of reviews that we can reliably rate a product and detail its pros and cons, particularly with regard to quality and reliability. We love to find sites with many reviews of the products we analyze! But we don't stop there - we keep on searching, looking for even individual reviews on retail and review sites, in order to aggregate as many reviews as we can, so as to come up with the most statistically valid rating we can get.

Using multiple sources for user reviews
We like finding reviews from many different sites, because any single site introduces a bias concern through the self-selection of its users. For instance, Newegg is frequently used by IT professionals, and has an excellent user review system. But is the nature of its audience skewing the results in a way that does not automatically apply to the general public? On the other hand, Staples is mostly used by home and small business users with less technical expertise, and also has a user review system with large amounts of data. Will the nature of its users skew the results of their evaluations? Whenever we can - i.e. whenever the user reviews can be found - we incorporate as many sources of user reviews as possible, so as to remove, as much as possible, sources of review bias.

Combining reviews from multiple sources 
User review systems vary widely in their definitions across the many web sites on which they are found. Some ask users to score secondary criteria, which may skew the numerical result of the final user rating. Some use 5-star ratings, while others are binary "recommend/do not recommend" systems, or combine both approaches. Some sources may use non-numerical ratings, such as icons ("Thumbs up/Thumbs down") or simply text, while others have no global rating at all, only ratings of multiple criteria. We want the broadest possible range of sources for our user reviews. To make it possible to use all of them, we classify each review as either positive or negative, and use this binary criterion as the fundamental unit of data for our statistical analysis.

Converting to positive/negative reviews
Our purpose when evaluating a product is to come up with the most valid evaluation of the product's quality. For that purpose, we classify every review as a positive or negative review, and enter it into our statistical analysis for the product. For instance, for sites with a 5-star review system, we take 5- and 4-star reviews and classify them as positive reviews. All others are entered as negative reviews. If, as often happens, a site's rating system defaults to a rating which many reviewers accidentally forget to change (typically a 3-star rating), and if the review text is clearly a recommendation of the product and the rating is clearly a user error, we will correct the rating to make it a positive review. Whenever the source review does not clearly indicate a positive recommendation of the product, we classify the review as negative. Our rating system is therefore slightly biased towards negative reviews. A good review score is thus all the more likely to reflect a genuinely positive product experience.
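The classification rule above can be sketched in a few lines. This is a minimal illustration, assuming a 5-star source; the `recommends` flag stands in for the editorial judgment that a default 3-star rating was clearly a user error, and is not part of the site's actual tooling.

```python
def classify_review(stars, recommends=False):
    """Collapse a 5-star review into the binary positive/negative unit.

    4- and 5-star reviews count as positive; everything else is negative,
    unless a 3-star rating is clearly an unchanged default on a review
    whose text recommends the product (modeled by `recommends`).
    """
    if stars >= 4:
        return "positive"
    if stars == 3 and recommends:  # likely an accidental default rating
        return "positive"
    return "negative"
```

Note the deliberate asymmetry: any ambiguous review falls through to "negative", which is the slight negative bias the paragraph describes.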

Statistical analysis
The size of the sample of user reviews (how many opinions we collected) has a major impact on the statistical validity of its average rating. When quoting the value of an average rating, we systematically quote how likely the measure is to be true to reality, expressed as a margin of error at a given confidence level. Unless otherwise specified, we use a 95% confidence level.

Statistical margin of error and confidence level explained
Let's say that we have obtained 59 user reviews, of which 96% were positive. When we use statistical software to analyze these numbers, we find that, if it were possible for us to collect feedback from every single product user, there is a 95% chance that the "true" rating across all users is within 5% of the average we measured across the 59 reviews. So, for this example, the margin of error is 5% at a 95% confidence level. If we decrease the confidence level to 90%, the margin of error decreases to about 4%. The smaller the margin of error, the more likely it is that the average rating is close to the truth.
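The example above can be reproduced with the standard normal approximation for a proportion (the Wald interval). This is a sketch of one common way to compute such a margin of error; the post does not name the exact method its statistical software uses, and at very high ratings with small samples other intervals (e.g. Wilson) are more accurate.

```python
import math

# z-scores for common confidence levels under the normal approximation
Z = {0.95: 1.96, 0.90: 1.645}

def margin_of_error(p, n, confidence=0.95):
    """Margin of error for a proportion p observed over n reviews."""
    return Z[confidence] * math.sqrt(p * (1 - p) / n)

# 59 reviews, 96% positive
print(round(margin_of_error(0.96, 59, 0.95), 3))  # 0.05  (5% at 95% confidence)
print(round(margin_of_error(0.96, 59, 0.90), 3))  # 0.042 (~4% at 90% confidence)
```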

Impact of larger sample size on margin of error
Everything else being equal, the more reviews we gather, the narrower the margin of error. For instance, in the previous example, if we had gotten the same 96% positive average rating from only 15 users, the margin of error would be 10%. If we had the same average rating from 164 users, the margin of error would be 3%.
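Both figures follow from the same normal-approximation formula, which can also be inverted to estimate how many reviews are needed to reach a target margin of error. A minimal sketch, assuming the Wald interval at a 95% confidence level:

```python
import math

Z95 = 1.96  # z-score for a 95% confidence level

def margin_of_error(p, n):
    """Margin of error for proportion p over n reviews."""
    return Z95 * math.sqrt(p * (1 - p) / n)

def required_sample(p, target_moe):
    """Invert the formula: reviews needed to reach a target margin of error."""
    return math.ceil(Z95**2 * p * (1 - p) / target_moe**2)

# Same 96% positive rating, growing sample:
for n in (15, 59, 164):
    print(n, round(margin_of_error(0.96, n), 2))  # 15 0.1 / 59 0.05 / 164 0.03
```

`required_sample(0.99, 0.05)` gives 16; `required_sample(0.75, 0.05)` gives 289 under a strict ceiling (the raw value is 288.1, which the post rounds to 288).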

Impact of average rating on margin of error
Surprisingly, the average rating makes a big difference to the margin of error, everything else being equal. The more users agree in their rating of a product's quality, whether that rating is negative or positive, the fewer reviews we need to reach a given margin of error. We like a margin of error of 5% at 95% confidence. If the average rating is 99% (or 1%), we only need 16 reviews to get to a margin of error of 5%. If the average rating is 75% (or 25%), we need 288 reviews for the same margin of error!

Predictive quality rating
For many of us, it is difficult to combine the concept of margin of error with that of average rating. What do we do if two products have the same average rating but different margins of error? The answer is easy: pick the one with the smaller margin of error. But what if the higher-rated product also has the larger margin of error? To facilitate consumer choices between such options, we have developed an exclusive statistical predictive quality rating, which we use in cases where the statistical validity of the result is particularly important. We lower the average rating of a product until its negative margin of error (i.e. the lower bound of the confidence interval) is within 5% of the resulting number.

Predictive quality rating impact on the rating value
The predictive quality rating is always lower than the true average rating, in order to compensate for a smaller than ideal sample size. For instance, if a product's average rating is 89% with a 7% margin of error, its predictive quality rating is 87%, meaning that it is 95% likely that the "true" rating for the product is no more than 5% lower than the predictive rating. As a result, the predictive quality rating gives a worst-case picture of what the rating "truly" is. But if you need to make a purchase decision that you want to be sure not to regret, wouldn't you rather be sure that the downside is the same for all your options?
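The adjustment described above amounts to taking the lower bound of the confidence interval and allowing a fixed 5-point buffer above it. A minimal sketch of that rule, working in percentage points (the 5-point buffer is the figure the post uses; the function name is illustrative):

```python
def predictive_quality_rating(avg_rating, moe, buffer=5):
    """Lower the rating until the interval's lower bound is within
    `buffer` points of it. All arguments are in percentage points."""
    if moe <= buffer:
        return avg_rating  # lower bound already within the buffer
    return (avg_rating - moe) + buffer

print(predictive_quality_rating(89, 7))  # 87, matching the example above
```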

Custom predictive quality ratings
In cases where specific failure modes impact the user dramatically, far beyond the value of the product, we may create a custom predictive quality rating adapted to that product, which downgrades the rating further in proportion to the frequency of such failure modes. Examples might be products that endanger life or health, or that may cause significant consequences to the user's lifestyle, such as compromising banking data or losing personal data. For instance, if we rate banking services and find that one failure mode involves compromised banking data resulting in financial losses, we might lower the global rating by 5% for every 1% of reviews reporting compromised data, because the failure mode's impact on the user is so high.
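As a sketch of that penalty scheme, using the 5-points-per-1% figure from the banking example (the function and parameter names are hypothetical, not the site's actual formula):

```python
def custom_predictive_rating(base_rating, severe_pct, penalty_per_pct=5):
    """Downgrade a rating by `penalty_per_pct` points for every 1% of
    reviews reporting a severe failure mode (e.g. compromised banking
    data), flooring at zero. All values are in percentage points."""
    return max(0.0, base_rating - penalty_per_pct * severe_pct)

# A 90% rating with 1% of reviews reporting compromised data drops to 85%.
print(custom_predictive_rating(90, 1))  # 85.0
```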

Statistical validity rating
We occasionally display a statistical validity rating to compare how many reviews we gathered and how valid the ratings we produced are. We measure statistical validity on a scale of 0-100%, 100% being best. We use what we feel is the best measurement of quality for the product under evaluation as our expectation of the response distribution. We plug this quality measurement into our statistical packages as the response distribution (i.e. the uncorrected average rating), then calculate the statistical margin of error based on the actual number of reviews we collected. As discussed above, for ratings over 50%, the higher the rating, the higher the statistical validity will be for the same number of reviews. The statistical validity rating is equal to 100% minus the statistical margin of error. It will be 99% in the optimal case, when our best measurement of quality matches the response distribution of the user reviews and we have enough user reviews that the margin of error is 1% or less. It will be less than that in any other case. The higher the statistical validity rating, the "truer" the quality ratings, and the more statistically valid the user reviews are.
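A minimal sketch of that definition, again assuming the normal approximation at 95% confidence (the post does not specify its exact statistical package):

```python
import math

def statistical_validity(expected_p, n, z=1.96):
    """Validity rating = 100% minus the margin of error (in percentage
    points), using the expected response distribution `expected_p`."""
    moe_pct = z * math.sqrt(expected_p * (1 - expected_p) / n) * 100
    return max(0.0, 100.0 - moe_pct)

# 59 reviews with an expected 96% positive distribution: 5% margin of
# error, hence a validity rating of 95%.
print(round(statistical_validity(0.96, 59), 1))  # 95.0
```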
