
An Analysis of Wine Quality

This report examines the specific factors that influenced consumers’ judgements of the quality of a wine. It gives an overview of the methods used to conduct the analysis, the results of the analysis and their interpretation. Finally, it recommends which factors should be considered significant in influencing consumers’ opinions of wine quality.

 

The principal objective was to determine which factors were of most importance, and which of least importance, when predicting the wine’s quality. To investigate this, thirty-eight tasters were asked to give their opinions of the new wine by giving six different ratings after tasting it. Multiple linear regression analysis was used to develop a model for predicting the average wine quality rating from the ratings of five other factors: Clarity, Aroma, Body, Flavour and Oakiness. For the analysis, the ratings of the five factors were treated as the explanatory variables, that is, the variables used to explain or predict the response variable, which here is the wine quality rating. The multiple regression analysis indicated the comparative contribution of each of the five explanatory variables to the total variation in wine quality ratings. The findings provided evidence of which factors should be considered of most and least importance when making importing decisions. The aim was to give a parsimonious model that could help the importer to predict the average quality rating of a wine, given that the ratings of the five factors were known.

 

Using the statistical software package IBM SPSS Statistics 21, six variables were created:

1) Quality, which was the rating given for quality of the wine;

2) X1_Clarity, the rating given for the wine’s clarity;

3) X2_Aroma, the rating tasters gave for the Aroma (Smell) of the wine;

4) X3_Body, the rating given for body which generally refers to the sense of alcohol in the wine and the sense of feeling in the mouth;

5) X4_Flavour, the rating given for the Flavour of the wine;

6) X5_Oakiness, the rating given for how strongly tasters perceived the effect of oak in the wine.

Ratings for all these factors were taken from all 38 tasters. After the data were entered, a multiple linear regression (as opposed to a simple linear regression, which has only a single explanatory variable) was run in the software, which provided output that could be interpreted and reported. A regression was fitted for every possible model, from the null model, which included no explanatory variables, to the full model, which included all five. The results of interest from each fit were the model summary table, the ANOVA table and the coefficients table.

The model summary table gave values for R, R², adjusted R² and the standard error of the estimate, which are measures of how well a model fits the data. R measures the strength of the relationship between the response variable and the explanatory variables. Squaring R gives the proportion of the variation in the response that is explained by the explanatory variables; this is called the coefficient of determination. However, because the full model had five explanatory variables, R² alone did not show which of them contributed most to this value. The coefficients table, explained below, was used to find each explanatory variable’s level of significance.

The ANOVA (Analysis of Variance) table shows how much of the variation in the response variable (the wine quality rating) is attributable to the explanatory variables. It separates the total variability into two parts: regression, the variance that can be explained by the explanatory variables, and residual, the variance that is not explained by them, also known as the error. The table also shows the sum of squares, degrees of freedom, mean square, F-ratio and p-value; the residual mean square is abbreviated MSR in this report. The F-ratio indicates whether the model provides a good overall fit to the data, that is, how significantly the explanatory variables together predict the response variable. The higher the ratio, the more significant the model.

The final table of interest, the coefficients table, gives the unstandardized coefficient for each explanatory variable. An unstandardized coefficient indicates how much the response variable (the wine quality rating) changes for a one-unit change in one explanatory variable when the other explanatory variables in the model are held constant. The first coefficient in the table is the constant, denoted b0; this is the predicted wine quality rating when all explanatory variables are zero. The unstandardized coefficients form a regression equation in which b0 is the y-intercept and the remaining coefficients are the weights of the explanatory variables. The equation is then used to predict the response variable, in this case the wine quality rating, from the explanatory variables, the factor ratings.

The results from all fits were then used to summarise the MSR and R² values in two simple tables against the number of explanatory variables in the model. The tables were used to create scatter graphs, which gave visual clues to the optimal number of factors to consider. The tables were also used to find the best models suggested by backward elimination and forward selection. The goal of these methods was a model that minimised the number of factors (explanatory variables) to be considered, thereby giving the wine importer a bigger pool of wines to choose from. At the same time, the model needed enough factors for the wine quality rating (the response variable) to be predicted as accurately as possible.
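As a rough illustration of the quantities described above (the coefficients, R², the residual mean square and the F-ratio), the sketch below fits a multiple linear regression by least squares with NumPy. It is not the SPSS procedure used in this report, and the data are synthetic stand-ins generated only for the example: the function name `fit_ols`, the random ratings and the coefficients used to simulate them are all assumptions.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp with basic ANOVA quantities."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])      # intercept column first
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    ss_total = float(np.sum((y - y.mean()) ** 2))  # total variation in the response
    ss_resid = float(np.sum((y - fitted) ** 2))    # unexplained (error) part
    ss_reg = ss_total - ss_resid                   # explained (regression) part
    ms_reg = ss_reg / p
    ms_resid = ss_resid / (n - p - 1)
    return {
        "coef": coef,                              # coef[0] is the constant b0
        "r_squared": ss_reg / ss_total,
        "ms_resid": ms_resid,
        "f_ratio": ms_reg / ms_resid,
    }

# Synthetic stand-in data: 38 "tasters", 5 "factor ratings" (NOT the report's data).
rng = np.random.default_rng(0)
X = rng.uniform(1, 8, size=(38, 5))
y = 5.0 + 0.5 * X[:, 1] + 1.2 * X[:, 3] + rng.normal(0, 1, size=38)

result = fit_ols(X, y)
print("coefficients:", result["coef"].round(3))
print("R squared:", round(result["r_squared"], 3))
print("residual mean square:", round(result["ms_resid"], 3))
print("F ratio:", round(result["f_ratio"], 2))
```

Calling `fit_ols` on a subset of columns, for example `fit_ols(X[:, [1, 3]], y)`, gives the same quantities for a smaller model, which is how an all-possible-models comparison could be assembled.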

 

Multiple regression analysis was performed on all possible models (combinations of explanatory variables), using the methods above. The mean square error (MSR) and the coefficient of determination (R²) were obtained from each fit and summarised by model, as shown in the table below. Note: the explanatory variables X1, X2, X3, X4 and X5 correspond to the Clarity, Aroma, Body, Flavour and Oakiness factors respectively.

| Variables included | MSR | R² |
|---|---|---|
| None | 4.183459 | 0 |
| X1 | 4.296194 | 0.001 |
| X2 | 2.152917 | 0.499 |
| X3 | 3.005167 | 0.301 |
| X4 | 1.615917 | 0.624 |
| X5 | 4.290167 | 0.002 |
| X1, X2 | 2.213543 | 0.499 |
| X1, X3 | 2.900086 | 0.344 |
| X1, X4 | 1.621286 | 0.633 |
| X1, X5 | 4.406457 | 0.004 |
| X2, X3 | 2.051629 | 0.536 |
| X2, X4 | 1.508257 | 0.659 |
| X2, X5 | 2.053114 | 0.536 |
| X3, X4 | 1.651229 | 0.627 |
| X3, X5 | 3.013914 | 0.319 |
| X4, X5 | 1.498743 | 0.661 |
| X1, X2, X3 | 2.089265 | 0.541 |
| X1, X2, X4 | 1.534382 | 0.663 |
| X1, X2, X5 | 2.111765 | 0.536 |
| X1, X3, X4 | 1.634882 | 0.641 |
| X1, X3, X5 | 2.823412 | 0.38 |
| X1, X4, X5 | 1.456294 | 0.68 |
| X2, X3, X4 | 1.550265 | 0.6659 |
| X2, X3, X5 | 1.927412 | 0.577 |
| X2, X4, X5 | 1.348441 | 0.704 |
| X3, X4, X5 | 1.527059 | 0.665 |
| X1, X2, X3, X4 | 1.569303 | 0.665 |
| X1, X2, X3, X5 | 1.921788 | 0.59 |
| X1, X3, X4, X5 | 1.439030 | 0.693 |
| X1, X2, X4, X5 | 1.337485 | 0.715 |
| X2, X3, X4, X5 | 1.385091 | 0.705 |
| X1, X2, X3, X4, X5 | 1.271824 | 0.721 |
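The table above can also be searched programmatically for the best model of each size. The sketch below copies the MSR and R² values from the table and, for each number of explanatory variables, picks the model with the lowest MSR; the names `models` and `best_by_size` are illustrative only.

```python
# (variables, MSR, R squared) for every model, copied from the table above.
models = [
    ((), 4.183459, 0.0),
    (("X1",), 4.296194, 0.001),
    (("X2",), 2.152917, 0.499),
    (("X3",), 3.005167, 0.301),
    (("X4",), 1.615917, 0.624),
    (("X5",), 4.290167, 0.002),
    (("X1", "X2"), 2.213543, 0.499),
    (("X1", "X3"), 2.900086, 0.344),
    (("X1", "X4"), 1.621286, 0.633),
    (("X1", "X5"), 4.406457, 0.004),
    (("X2", "X3"), 2.051629, 0.536),
    (("X2", "X4"), 1.508257, 0.659),
    (("X2", "X5"), 2.053114, 0.536),
    (("X3", "X4"), 1.651229, 0.627),
    (("X3", "X5"), 3.013914, 0.319),
    (("X4", "X5"), 1.498743, 0.661),
    (("X1", "X2", "X3"), 2.089265, 0.541),
    (("X1", "X2", "X4"), 1.534382, 0.663),
    (("X1", "X2", "X5"), 2.111765, 0.536),
    (("X1", "X3", "X4"), 1.634882, 0.641),
    (("X1", "X3", "X5"), 2.823412, 0.38),
    (("X1", "X4", "X5"), 1.456294, 0.68),
    (("X2", "X3", "X4"), 1.550265, 0.6659),
    (("X2", "X3", "X5"), 1.927412, 0.577),
    (("X2", "X4", "X5"), 1.348441, 0.704),
    (("X3", "X4", "X5"), 1.527059, 0.665),
    (("X1", "X2", "X3", "X4"), 1.569303, 0.665),
    (("X1", "X2", "X3", "X5"), 1.921788, 0.59),
    (("X1", "X3", "X4", "X5"), 1.439030, 0.693),
    (("X1", "X2", "X4", "X5"), 1.337485, 0.715),
    (("X2", "X3", "X4", "X5"), 1.385091, 0.705),
    (("X1", "X2", "X3", "X4", "X5"), 1.271824, 0.721),
]

def best_by_size(rows):
    """Return, for each model size, the row with the lowest MSR."""
    best = {}
    for variables, msr, r2 in rows:
        size = len(variables)
        if size not in best or msr < best[size][1]:
            best[size] = (variables, msr, r2)
    return best

best = best_by_size(models)
for size in sorted(best):
    variables, msr, r2 = best[size]
    print(size, ", ".join(variables) or "None", msr, r2)
```

Plotting the lowest MSR against model size is one way to see where adding further variables stops reducing the error by much.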

 

To spot trends between the number of explanatory variables and the mean square error value in each model, the data was represented visually using a scatter diagram as shown below. The x-axis gives the number of explanatory variables and the y-axis gives the MSR values.

[Figure: scatter diagram of MSR (y-axis) against the number of explanatory variables (x-axis)]
 

 

The first insight from the plot was that increasing the number of explanatory variables generally decreased the MSR. The plot showed that the lowest MSR value for three explanatory variables was almost the same as when four or five explanatory variables were used in a model. This indicated that we could ignore at most two explanatory variables without a substantial increase in the MSR value. Inspection of the table showed that the lowest MSR for a three-variable model was 1.348441, obtained when variables X2, X4 and X5 were included. This suggested that a suitable parsimonious model (the simplest plausible model, with the fewest possible variables) might be the one containing X2, X4 and X5.

A scatter diagram was also used to show the relation between the number of explanatory variables and the coefficient of determination. Again, the number of explanatory variables was plotted along the x-axis, but now against the R² values along the y-axis.

[Figure: scatter diagram of R² (y-axis) against the number of explanatory variables (x-axis)]

The trend suggested that an increase in the number of explanatory variables tended to lead to an increase in R², signifying that a larger proportion of the variation in wine quality ratings could be explained if more explanatory factors were included. The plot also showed that the best models with three, four or five variables produced R² values that were almost the same. Thus, the plot implied that a suitable parsimonious model would again include three explanatory variables, which confirmed the previous suggestion. The highest R² value with three explanatory variables was 0.704, for the model including X2, X4 and X5, as before. The SPSS regression output for these three explanatory variables gave the tables:

 

 

 

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .839 (a) | .704 | .678 | 1.1612 |

a. Predictors: (Constant), X5_Oakiness, X4_Flavour, X2_Aroma

 

ANOVA (a)

| Model | Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 108.941 | 3 | 36.314 | 26.930 | .000 (b) |
| | Residual | 45.847 | 34 | 1.348 | | |
| | Total | 154.788 | 37 | | | |

a. Dependent Variable: Quality
b. Predictors: (Constant), X5_Oakiness, X4_Flavour, X2_Aroma

 

 

Coefficients (a)

| Model | Term | Unstandardized B | Unstandardized Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | 6.461 | 1.333 | | 4.846 | .000 |
| | X2_Aroma | .576 | .260 | .306 | 2.214 | .034 |
| | X4_Flavour | 1.203 | .274 | .604 | 4.392 | .000 |
| | X5_Oakiness | -.600 | .264 | -.216 | -2.269 | .030 |

a. Dependent Variable: Quality
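The summary statistics in these tables are linked by simple arithmetic, which can be checked from the ANOVA sums of squares alone. The sketch below recomputes the mean squares, F-ratio, R, R², adjusted R² and standard error of the estimate from the reported sums of squares, with n = 38 tasters and k = 3 explanatory variables; the variable names are illustrative.

```python
import math

n = 38                    # number of tasters
k = 3                     # explanatory variables in the model (X2, X4, X5)
ss_regression = 108.941   # from the ANOVA table
ss_residual = 45.847      # from the ANOVA table

ss_total = ss_regression + ss_residual
ms_regression = ss_regression / k              # regression mean square
ms_residual = ss_residual / (n - k - 1)        # residual mean square (MSR in this report)
f_ratio = ms_regression / ms_residual

r_squared = ss_regression / ss_total
multiple_r = math.sqrt(r_squared)
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
std_error = math.sqrt(ms_residual)             # standard error of the estimate

print(f"MS regression = {ms_regression:.3f}, MS residual = {ms_residual:.3f}")
print(f"F = {f_ratio:.3f}")
print(f"R = {multiple_r:.3f}, R squared = {r_squared:.3f}, adjusted = {adj_r_squared:.3f}")
print(f"Std. error of the estimate = {std_error:.4f}")
```

The recomputed values agree with the SPSS output to the precision shown in the tables, and the residual mean square matches the MSR listed for the X2, X4, X5 model.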

 

 

For this model, the model summary gave an R value of 0.839, which indicated a good level of prediction of the response variable. The R² value was 0.704, which meant that the explanatory variables X2, X4 and X5 explained 70.4% of the variability of the response variable. In practical terms, this showed that the importer could substantially influence the quality rating of the wine imported by focusing on the ratings for Aroma, Flavour and Oakiness. The ANOVA table above also gives the F-statistic, F(3, 34) = 26.93 with p < 0.0005, which showed that the regression model fitted the data significantly better than a model with no explanatory variables.

Finally, the coefficients table gave the regression equation for the model. The equation to predict the response variable (y) was:

y = 6.461 + 0.576X2 + 1.203X4 - 0.600X5

The explanatory variable with the largest influence on the wine quality rating was therefore the Flavour rating, X4: an increase of 1.203 in the quality rating was predicted for each increase of 1 in the wine’s Flavour rating, with the other ratings held constant. On the other hand, an increase of 1 in the wine’s Oakiness rating was predicted to decrease the quality rating by 0.6.
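The regression equation for this model can be written as a small function. The coefficients come from the coefficients table; the function name `predict_quality` and the example ratings are hypothetical and are not taken from the tasting data.

```python
def predict_quality(aroma, flavour, oakiness):
    """Predicted quality rating from the fitted model y = 6.461 + 0.576*X2 + 1.203*X4 - 0.600*X5."""
    return 6.461 + 0.576 * aroma + 1.203 * flavour - 0.600 * oakiness

# Hypothetical ratings, chosen only to illustrate the effect of each coefficient.
base = predict_quality(aroma=5.0, flavour=5.0, oakiness=4.0)
more_flavour = predict_quality(aroma=5.0, flavour=6.0, oakiness=4.0)
more_oak = predict_quality(aroma=5.0, flavour=5.0, oakiness=5.0)

print("baseline prediction:", round(base, 3))
print("change for +1 Flavour:", round(more_flavour - base, 3))
print("change for +1 Oakiness:", round(more_oak - base, 3))
```

Holding the other two ratings fixed while changing one mirrors how the unstandardized coefficients are interpreted in the text.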

Another approach used to select a suitable parsimonious model was backward elimination. The technique started with all explanatory variables in the model and then removed variables one at a time, where removing them improved the model. Each candidate for deletion was tested using its F-statistic against a comparison criterion of 4: if the F-statistic was less than 4, the explanatory variable was eliminated from the model, and the process was repeated. This technique eliminated X3 from the full model and gave the prediction equation: y = 4.972 + 1.802X1 + 0.527X2 + 1.267X4 - 0.657X5
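A minimal sketch of backward elimination with an F-to-remove criterion of 4 is given below. It is an illustration of the idea rather than the SPSS implementation, and it runs on synthetic stand-in data: the function names `sse` and `backward_elimination`, the random ratings and the simulated coefficients are all assumptions.

```python
import numpy as np

def sse(X, y, cols):
    """Residual sum of squares for a least-squares fit on the given columns plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def backward_elimination(X, y, f_to_remove=4.0):
    """Start from the full model and drop the weakest variable while its partial F is below the criterion."""
    n = len(y)
    selected = list(range(X.shape[1]))
    while selected:
        sse_full = sse(X, y, selected)
        ms_resid = sse_full / (n - len(selected) - 1)
        # Partial F for each variable: increase in SSE when it is dropped, over the residual mean square.
        partial_f = {
            c: (sse(X, y, [s for s in selected if s != c]) - sse_full) / ms_resid
            for c in selected
        }
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] >= f_to_remove:
            break                      # every remaining variable passes the criterion
        selected.remove(weakest)
    return selected

# Synthetic stand-in data (NOT the report's data): only columns 1 and 3 affect y.
rng = np.random.default_rng(1)
X = rng.uniform(1, 8, size=(38, 5))
y = 5.0 + 0.6 * X[:, 1] + 1.2 * X[:, 3] + rng.normal(0, 1, size=38)

kept = backward_elimination(X, y)
print("columns kept:", kept)
```

Columns are indexed from 0 here, so columns 1 and 3 play the roles of X2 and X4 in the simulated data only.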

 

An alternative to backward elimination was forward selection, which works in the opposite direction. This involved starting with a regression model that included no explanatory variables, then sequentially adding variables whose F-statistic was larger than the comparison criterion of 4. This technique selected just X4 (the Flavour rating) for inclusion in the model and gave the prediction equation: y = 4.941 + 1.572X4
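Forward selection with an F-to-enter criterion of 4 can be sketched in the same way. Again, this is an illustration under stated assumptions and not the SPSS implementation: the function names `sse` and `forward_selection` and the synthetic data are hypothetical.

```python
import numpy as np

def sse(X, y, cols):
    """Residual sum of squares for a least-squares fit on the given columns plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def forward_selection(X, y, f_to_enter=4.0):
    """Start from the null model and add the strongest variable while its F exceeds the criterion."""
    n = len(y)
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining:
        sse_current = sse(X, y, selected)
        best_col, best_f = None, f_to_enter
        for c in remaining:
            sse_new = sse(X, y, selected + [c])
            # Residual df of the enlarged model: n - (len(selected) + 1) - 1.
            ms_resid = sse_new / (n - len(selected) - 2)
            f_stat = (sse_current - sse_new) / ms_resid
            if f_stat > best_f:
                best_col, best_f = c, f_stat
        if best_col is None:
            break                      # no candidate exceeds the criterion
        selected.append(best_col)
        remaining.remove(best_col)
    return selected

# Synthetic stand-in data (NOT the report's data): only columns 1 and 3 affect y.
rng = np.random.default_rng(1)
X = rng.uniform(1, 8, size=(38, 5))
y = 5.0 + 0.6 * X[:, 1] + 1.2 * X[:, 3] + rng.normal(0, 1, size=38)

chosen = forward_selection(X, y)
print("columns added, in order:", chosen)
```

The order of `chosen` reflects the order in which variables entered the model, so the first entry is the single best predictor in the simulated data.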

 

The results above provided three contrasting models. The first included three explanatory variables, the second four and the third only one.

The second model, suggested by backward elimination, predicts that the wine quality rating was significantly influenced by four of the explanatory factors (the Clarity, Aroma, Flavour and Oakiness ratings). It suggests that, to increase the overall wine quality rating, the wine importer should import a wine with high ratings for Clarity, Aroma and Flavour but a lower rating for Oakiness. However, finding a wine that meets all these requirements can be difficult and costly. Therefore, a simpler model may be more appropriate.

The third model, suggested by forward selection, predicts that the wine quality rating was significantly influenced only by the Flavour rating. This suggests that, to increase wine quality ratings, the wine importer need only find wines with high Flavour ratings. However, this model may be too simplistic and may not capture all the factors that affect the overall wine quality rating.

The first model, suggested by inspection of the scatter plots, predicts that the wine quality rating was significantly influenced by three of the explanatory factors: Aroma, Flavour and Oakiness. The regression equation for this model shows that, to import the highest-quality wine (as rated by consumers), the importer should select wine with high ratings for Aroma and Flavour but a lower rating for Oakiness. In this model, the factor with the largest effect on the wine’s quality rating is Flavour, so it should be given the most weight when selecting a wine. If the importer wanted to find a wine with a lower average quality rating (for example, in order to lower the selling price), the opposite applies: lower Aroma and Flavour ratings and a higher Oakiness rating.