
Some stats and SPSS pointers

http://privatewww.essex.ac.uk/~scholp/statsquibs.htm


These relate to some specific technical matters I am sometimes asked about but which are not covered in detail in my courses, or which may have passed you by unnoticed. Note, I am not trying to present all my course material here (you have to take the courses for that), just deal with some frequently asked questions and things people frequently get confused over or get wrong. Also, these are not all readily understandable unless you have already taken stats courses!

- How do I round figures down to make them shorter, e.g. 3.852? And how many decimal places should I report?
- How do I generate random numbers to help when sampling from a list, or when dividing subjects randomly into groups? Use the facility at http://www.randomizer.org/form.htm
- I have the proficiency scores (or the like) for 30 subjects, and want to divide the cases into groups based on this. Or I need categories of word stimuli of three different frequencies. How do I do it?
- Can I get phonetic symbols like [] shown on the scales of SPSS graphs?
- How do I combine columns of figures I have entered in SPSS, when I want averages for each person of the figures in the columns (e.g. the scores for separate items in a test)?
- What is item analysis? And what does it mean if the F in an ANOVA result is labelled F1 or F2, where there has been an analysis by items as well as by subjects?
- How do I eliminate extreme response times in psycholinguistic data, or response times where the response was wrong?
- What does the standard deviation really mean?
- When I do a histogram of some scores (interval scale data) I am supposed to look at the distribution shape (the pattern of the heaps on the graph), but how do I interpret the shape I see?
- How should one treat rating scale responses? As ordered categories or as interval scores?
- If my data is not normally distributed, so not suited to t tests and ANOVA, what can I do? What are the transformations I can use?
- What really are Likert and Guttman scales, and how should they be constructed? They both are ways of measuring things via a set of agree-disagree items. Often we use sets of items of this type that other researchers made, but I wonder if anyone actually selected and rated the items in the approved way in the first place?
- What does it mean when SPSS gives you a figure with an E on the end? e.g. 7.012E-02
- What are degrees of freedom (df) and how do I report them, if needed?
- What are residuals and what do they tell me?
- If in a pilot trial of a few subjects I don't get the significant result I want, how can I estimate how many subjects I would need to probably get a sig result?
- How do I do follow-up post hoc paired comparisons and planned comparison tests for any kind of main effect or interaction in ANOVA where more than two groups or conditions were initially compared? SPSS doesn't do all the possibilities, or hides some away.
- How do I do post hoc paired comparisons after a Kruskal-Wallis test?
- What is Bonferroni adjustment and how can I do it?
- What is eta squared and how does SPSS calculate it?
- Esp. for ACQUISITION people and SOCIOLINGUISTS: twenty people in two groups are each measured for the number of times they use the third person s out of all the occasions when they had an opportunity to, in compositions, recorded speech etc. (often called potential occurrences or loci). How do you summarise % scores like this? Group % scores for frequency of use of things, or individual % scores?
- Esp. for PSYCHOLINGUISTS and people doing repeated measures EXPERIMENTS: what on earth is a Latin square, and how do I use it (or some other method of organising conditions, different types of stimuli etc.) in an experimental design?
- What are those tests of prerequisites for ANOVA/GLM, such as those of Levene, Mauchly etc., all in aid of?
- If I have a lot of missing scores, can I fill them in somehow?
- Can I check on whether people are responding by random guessing or with bias, and adjust scores to take account of that?
- My subjects all gave several responses to a set of different stimuli, and I have entered the data in SPSS with each response as a row, so there are several rows for each subject. How do I turn that into the more usable SPSS layout with one row per subject?
- Subjects have been categorised in a parallel way in several different columns. E.g. they answered a set of questions each of which had the possible responses: me, my teacher, my classmates (i.e. although coded for SPSS as 1, 2, 3, the responses cannot be considered as degrees of anything on an interval number scale). How do I get SPSS to add up, for each person across the items, totals of how many times each category was chosen?
- If you are into word association tests, there are a few descriptive stats that one can use there that one does not find used anywhere else much: the group overlap coefficient, the within groups overlap coefficient, and the index of commonality.

Degrees of freedom

Sometimes journals expect you to report these df figures along with other statistics. They are the figures you see quoted in brackets, often subscript, after t, F, chi squared etc. E.g. instead of t = 2.34 one sees perhaps t(28) = 2.34. They can usually easily be got from SPSS output where they are not obvious: look for df. Broadly they reflect the number of categories in any category variables in the design, and the number of cases in each group. The exception is designs where only category variables are involved (e.g. where you would use chi squared): in that instance the df just reflects the number of categories. Since you will have told the reader the numbers of categories and cases involved anyway, I don't personally see the point of mentioning df. But in case you need to, they mainly turn out to be one less than the numbers you started with, though it can get more complicated. The df numbers are written subscript, or in brackets, after the statistic t, F or whatever (not after the p). So in a t test comparing two groups, 108 subjects altogether, the df will be 1, 106 (though for a t test only the second, error df is conventionally reported, e.g. t(106) = ...). The first figure is one less than the number of EV categories (2 - 1 = 1). The second is the number of cases less one for each group involved (N - 2 = 108 - 2 = 106). In an ANOVA comparing four groups with 108 subjects altogether, the df would be 3, 104. In a t test comparing the same group in two conditions, the error df for 108 cases will be 107. The df can be more tricky for more complicated designs and interactions. In the output of ANOVA you will generally see the first df figure you need in line with the main effect or interaction of interest, and the second one listed as 'within groups' or 'error' below it. In a chi squared test with three categories on each scale, the df is 4 because (3-1) x (3-1) = 4. In a chi squared test with two categories on each scale, the df is 1 because (2-1) x (2-1) = 1.

Why are these figures called 'degrees of freedom', and why are they important? It is basically because what is important in statistics is not so much the numbers of anything but the numbers of choices or separate pieces of information involved. Typically there is always one fewer choice than there are people etc. If I have ten assignments to hand back to my class of ten students, I have to make a choice who to give each one to for the first nine, but for the tenth one there is no choice, as there is only one assignment left and one person left to give it to. I have no 'freedom' left on the last one. Here's the statistician's analogue of that. 100 people answer a yes-no question and 38 say 'yes' and 62 say 'no'. We want to know if that differs significantly from 50:50, i.e. are they showing a real preference? There are two categories (yes and no), so we use the binomial test. It might seem that we have two figures to handle in the test and two comparisons to make: we have to check if the observed figure (O) of 38 differs from the expected figure (E) of 50, and if the O of 62 differs from the E of 50. But in fact, of course, the test need only do one of those. The data has only one degree of freedom. Once the test establishes whether 38 differs significantly from 50 for one category, the answer for the other category, whether 62 does so as well, is fixed. Hence if one calculates statistics by hand one always finds that in the formulae one has to use the df figures rather than the full numbers of cases or categories.
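Since the df figures above all come from simple counting, a few lines of code are enough to reproduce them. This is only an illustrative sketch (the function names are my own, not anything in SPSS):

def df_between_groups(n_groups, n_cases):
    # One-way independent-groups comparison: effect df and error df.
    return n_groups - 1, n_cases - n_groups

def df_chi_squared(rows, cols):
    # Chi squared test of association between two category variables.
    return (rows - 1) * (cols - 1)

print(df_between_groups(2, 108))  # t test, two groups, 108 cases in all -> (1, 106)
print(df_between_groups(4, 108))  # one-way ANOVA, four groups           -> (3, 104)
print(df_chi_squared(3, 3))       # 3 x 3 table -> 4
print(df_chi_squared(2, 2))       # 2 x 2 table -> 1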

Residuals

These are simply the differences between observed figures (O) and some kind of predicted/expected figure (E). But they mean different things in different analyses.

Category data: for significant differences/relationships we want them big, because the E figures represent what is expected under the null hypothesis of NO difference/relationship. In analyses where just frequencies in categories are involved (e.g. analysed using chi squared or the binomial test), the residuals are the differences between O and E frequencies. The bigger they are, the more likely it is that there is a significant difference involved. In the Labov analysis in class we looked at the table of O and E values to see where the biggest O-E differences were (for which r use in which store). In fact chi squared itself is calculated by essentially adding up the residuals for each cell in the table (with a bit more maths to it). In the binomial test where, say, 20 people are divided 4 saying 'yes' and 16 saying 'no' to a question, we want to know if that differs significantly from a 50-50 split, which would be 10 'yes' and 10 'no' in this instance. So we are concerned with the size of the residual, in this instance 6. The bigger the better, if we want to show a clear preference.

Interval data: for significant relationships we want them small, because the E figures represent what is expected under the hypothesis of a perfect linear relationship. This is the other place where you often find residuals being talked about: in data where all the variables are (treated as) interval (analysed using Pearson r, or regression). Here they are the differences between the observed scores and the scores predicted by the best fitting line on a scatterplot showing the EV-DV relationship. Here obviously the smaller the residuals, the more likely the relationship is significant. Obviously one can find a best fitting line to any data where cases are scored on two or more interval variables, but if most of the observations fall miles away from the line, that does not show a real relationship. Pearson r and regression statistics in effect reflect whether the residuals are generally large or small; when examining scatterplots, if we look at cases (subjects) that are way off the line, we are looking at cases with exceptionally large residuals.

Eta squared

This is the measure of relationship that you can get in ANOVAs and the like. A bit like a correlation coefficient, it tells you on a scale 0-1 how much EV-DV relationship there is. Really it is more analogous to r squared and can be thought of as a % on a scale 0-100. It is a useful addition to just being told if a relationship or difference is significant: many significant differences/relationships are in fact quite small in terms of the SIZE of the difference/relationship. SPSS does not calculate eta quite how the books suggest, or even how the SPSS help itself seems to suggest. In fact every eta squared is calculated so that it is a proportion out of a different total, and some of the variance that goes into the calculation of one of them may also go into the calculation of another, so none of them can be added sensibly to each other. So every effect (main or interaction) is out of its own 100%, representing the maximum variance that it could account for, but not all the variation in DV scores. This applies even where the effects are of the same type and a sensible calculation could be made of the % of variance of the same type accounted for (e.g. two between subjects main effects: in principle one could calculate what % of the relevant variance they account for together). In fact this is not done. So the SPSS etas can be compared with each other ('this one is accounting for more of the total it could account for than that one is') but not really added.
Or if you like, the total % if there are three factors with three main effects, three two-way interactions, and one three-way interaction, is not 700% but less than that... but hard to calculate exactly what. (In fact you can see how SPSS calculates the etas: in the sum of squares column it is simply the sum of squares for the effect of interest divided by the SS of that effect plus the relevant error SS for that effect. Clearly then it is not calculating the proportion of all the SS in the entire analysis accounted for by that effect, just the proportion of the SS relevant to that effect. And also the error SS get re-used in different calculations.)
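If you want to reproduce the SPSS figure yourself from the ANOVA table, the calculation just described (usually labelled partial eta squared) takes one line. The sums of squares below are invented purely for illustration:

def partial_eta_squared(ss_effect, ss_error):
    # SS for the effect of interest divided by that SS plus the relevant error SS.
    return ss_effect / (ss_effect + ss_error)

ss_group = 42.0         # sum of squares for the effect of interest (invented)
ss_error_group = 158.0  # error SS relevant to that effect (invented)
print(partial_eta_squared(ss_group, ss_error_group))  # 0.21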

Post hoc tests of paired comparisons after ANOVA

Wherever a main effect or interaction involves a comparison of more than two means, post hoc tests can be relevant, as the basic significance value given by the ANOVA does not say which pair or pairs is/are sig different. If the main or interaction effect from ANOVA comes out significant, that just means that there is a sig difference SOMEWHERE among the means, but not necessarily between every pair. Especially this arises where one or more of the EVs has three or more levels (i.e. groups or conditions), though it can also arise, say, where you have two two-value EVs and the interaction is significant. You need a post hoc test to identify where the differences are exactly, or just judge it by eye from a graph or table of means. This situation arises in various ways in ANOVAs, some of which SPSS deals with straightforwardly, others not.

One might think the solution is just to do loads of familiar t tests comparing the means in pairs as required, to see which pairs are sig different. Indeed one sees this done in some published work, and in moderation probably you can get away with this. However, statisticians don't like that. The statistical issue underlying all this is that, when you do paired comparisons like this, the same means are getting reused several times in different comparisons. If you have three groups and compare them in pairs, then the mean for group 1 gets used in the comparison both with group 2 and with group 3. Now the more times a mean gets compared with others in repeated statistical tests, the more chances it has to come out as significantly different just by chance, not reflecting a real population difference. Remember that if a difference between two means is significant (at the .05 level) that actually MEANS that one would not get a result this different more than 5% of the time, or one in twenty times, by chance, due to the vagaries of random sampling, in similar sized samples from a population where there really was no difference. But another way of looking at that is to say that if you use the same data in twenty comparisons, then one of the results might be that one-in-twenty result that looks significant but is actually from a population where there is no difference. The more tests you do, the more chance of getting a result that looks sig but is not really. Some adjustment has to be made to compensate for this. Like other activities in life involving pairs, your tests for multiple paired comparisons should not be unprotected!

Post hoc tests and the like cope with this better than t tests. It is not appropriate to do multiple t tests, at least not without a Bonferroni adjustment of the sig level (though that is a solution that is seen as rather overcompensating for the problem). Better is to use a post hoc test designed for such comparisons (e.g. Tukey, Scheffe, etc.). However, as the SPSS dialog box for post hoc tests shows, there is a myriad of options: nobody is certain which is the best, and none are perfect. As a consequence you can sometimes get the anomalous result that the ANOVA says there is a sig difference somewhere, but the paired post hoc test does not find any pair significantly different.

The term 'post hoc' is used where you just want to consider all pairs of means that are possible to compare, following an overall analysis including all the means, which is the appropriate starting point. SPSS however limits this term to comparisons between cases in different groups, though statisticians use the term generally for follow-up comparisons of pairs of repeated measures conditions as well. The term 'planned comparison' (= contrasts in SPSS) is used where you planned specific paired comparisons, not all the possible ones, such as the comparison of three groups of learners with an NS group, but not with each other. The general rule is that for k means there are k(k-1)/2 paired comparisons possible. E.g. if four groups, then 4 x 3 / 2 comparisons, i.e. 6. However, SPSS output usually gives you the pairs twice over, so it looks like even more.

1. An EV with three or more independent groups being compared. E.g. the % correct scores for third singular s of three groups of learners are compared. The basic ANOVA result says whether there is a significant relationship between the EV and the DV (a difference somewhere among the groups) but not exactly where. If the overall result is sig, then to see which pairs of groups are sig different you need to do post hoc tests. Whether you do the ANOVA via Compare Means... Oneway ANOVA or via General Linear Model... Univariate, you get many, many ways of doing the post hoc test offered under the Post Hoc option. Tukey HSD is a common safe bet. Basic post hoc tests compare every pair of means. But suppose your groups were two of learners and one of native speakers, and you plan to compare the two learner groups with the NS group (which may be thought of as a control group) but not with each other. These are often called planned comparisons, and you would do better not to use the post hoc tests, which compare every pair and so are weaker (less likely to identify sig differences). You get this sort of limited comparison in Analyze... General Linear Model... Univariate: enter your DV as usual and the three-languages variable as a fixed factor. This does a oneway ANOVA exactly like you get with Compare Means... Oneway, except that it gives you some extra options. If you click Contrasts, click the contrast option to get Simple, and then click first or last depending on whether the control group is numbered 1 or 3, then (don't forget) click Change, then Continue, then OK, you get an output that just does those limited paired comparisons.

2. An EV with three or more repeated measures conditions being compared. E.g. you compare the same people's fluency speaking to the teacher, to peers and to parents, and you want to compare each pair of those conditions afterwards. In General Linear Model... Repeated Measures you have to use not what is labelled Post Hoc but rather Options: click the variables into Display Means, tick Compare main effects, and below that choose Bonferroni. This in effect uses t tests with a simple Bonferroni adjustment for multiple comparisons to compare the pairs of means. Not ideal, because overcautious: i.e. likely to lead to you missing a difference that is actually sig. SPSS should really make Tukey etc. available in repeated measures as well as independent groups comparisons. Alternatively you can do your own Tukey test as described below.
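If you happen to be working outside SPSS, the equivalent of case 1 above (all pairwise comparisons after a oneway independent-groups ANOVA) can be run with the Tukey HSD routine in the Python statsmodels library; a minimal sketch with invented scores and group labels:

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = np.array([55, 60, 58, 62, 70, 72, 68, 75, 80, 85, 83, 88], dtype=float)  # invented scores
groups = np.array(["beginner"] * 4 + ["intermediate"] * 4 + ["native"] * 4)       # invented group labels

result = pairwise_tukeyhsd(endog=scores, groups=groups, alpha=0.05)
print(result)  # one line per pair: mean difference, adjusted p, reject yes/no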

Once again you can alternatively choose limited planned comparisons via the Contrasts option as above.

3. Interaction in a two-way ANOVA with both EVs as groupings. Where there are two EVs that are groupings, the interaction always involves at least 4 subgroups. Even if both variables are just two groups, like male-female and upper class-middle class, the interaction has four groups involved and, if the interaction is sig, you might want to know which pairs of those are producing that result, beyond just guessing from a suitable graph. SPSS does not deal with post hoc tests for interactions, but in some instances you can do it yourself fairly simply with a calculator. For instance you can do a Tukey test for pairwise differences when you get a sig interaction in a two-way ANOVA with two independent groups factors, where all groups have the same number of subjects in. Calculate

T = q x sqrt(error mean square / number of people in each group)

The error mean square (or error variance) is in the original ANOVA table in the output. q is found from the table of the Tukey statistic (ask me for it, or see a serious stats textbook which has it in the back; I can't include it here for copyright reasons). Read off the column for the number of means being compared pairwise, and the row for the df of the error variance/mean square (from the ANOVA table). Then calculate T, and any pair of means differing by more than T is sig different. If the groups are different sizes, or you wish to save effort, do t tests with Bonferroni adjustment.
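If you have Python handy rather than a printed table of the Tukey statistic, the critical q can be obtained from SciPy's studentized range distribution (assumed available in recent SciPy versions); a minimal sketch of the hand calculation just described, with all the numbers invented for illustration:

import math
from scipy.stats import studentized_range  # assumed available (recent SciPy)

k = 4             # number of means being compared pairwise (invented)
df_error = 36     # df of the error mean square, from the ANOVA table (invented)
ms_error = 12.5   # error mean square, from the ANOVA table (invented)
n_per_group = 10  # cases per subgroup, assumed equal

q = studentized_range.ppf(0.95, k, df_error)  # critical q at the .05 level
T = q * math.sqrt(ms_error / n_per_group)     # pairs of means differing by more than T are sig different
print(round(q, 2), round(T, 2))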

4. Interaction in a two-way ANOVA with both EVs as repeated measures. As for 3. OR treat it as a oneway repeated measures situation: enter all the repeated measures columns as if there were just one factor, not two, and follow 2 above. That in effect does the post hoc for the interaction.

5. Mixed independent groups and repeated measures ANOVAs. As usual, if the result in ANOVA is significant, and more than two means are being compared, one needs follow-up tests to see which pairs of means are significantly different (or be happy just to judge it visually from a graph). Each main effect involving 3 or more levels can be dealt with as above, but the interactions are more of a problem. Take five repeated measures conditions and two groups. One can get the main effect multiple comparisons done by SPSS with suitable adjustments as described in (2) above (i.e. comparing results on the five conditions with each other in pairs, for the whole sample of subjects, lumping both groups together). In fact if one wants all of them there are 10 comparisons, because there are five conditions, so (5 x 4) / 2 paired comparisons. In the interaction, since there are 10 means involved for all 5 conditions and two groups, there are (10 x 9) / 2 comparisons potentially, which makes 45. One can do some of the interaction paired comparisons by splitting the file and getting SPSS to use the Bonferroni option again. Those are the comparisons of each condition with each other condition within each group separately: 10 comparisons in each group = 20 in all. That leaves 25 comparisons that you could not do with any post hoc procedure in SPSS as far as I know: the comparisons between each of the 5 means for one group and the five for the other. Ordinary t tests do not have any built-in adjustment for multiple comparisons like post hoc tests do. However, a simple adjustment by hand is to use the t test but require stricter sig levels. In fact this is really making the Bonferroni adjustment oneself.

The account immediately above assumed that there was no a priori reason to be interested in any of those 25 pairs more than any other: it was a DIY post hoc solution. However, it could be that, for theoretical reasons or whatever, you were not interested in comparing every pair of means, only certain ones. In particular:

- the comparisons of all 5 conditions within each group, done OK with split file and Bonferroni adjustment: 20 comparisons
- the comparison of each group with the other on each condition separately. That is in fact only 5 comparisons out of the 25 possible other ones (i.e. you have no interest in comparisons like that between the lower group on condition A and the higher group on condition C, or between the lower group on A and the higher group on B, etc.).

You want to claim, in this instance, that these were what are called 'planned comparisons', not the usual post hoc 'try everything' type. Then you could reduce the required sig value of the t test for this part by dividing by 5, not 25, in the Bonferroni adjustment.

In general, then, where there is no post hoc test available in SPSS, the simple but crude solution is to use ordinary pair-comparison statistical tests, but divide the target sig level by the number of potential comparisons you COULD make, or PLANNED to make, to compensate for making multiple comparisons, as in the sketch below. However, this is cruder than using post hoc tests, which take care of this better: you are more likely to miss sig differences (a so-called Type II error).
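A minimal sketch of that DIY Bonferroni adjustment: run the ordinary t tests, then only accept a comparison as significant if its p value is below .05 divided by the number of comparisons planned (the p values here are invented):

p_values = [0.030, 0.004, 0.200, 0.011, 0.049]  # e.g. the 5 planned comparisons (invented p values)
alpha_adjusted = 0.05 / len(p_values)           # .05 / 5 = .01

for p in p_values:
    print(p, "significant" if p < alpha_adjusted else "not significant")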

You don't get a sig result and you want to know how big a sample you would need to get one

If you have gathered data, especially in a pilot study, and not got a significant result, you may want to know how big a sample you would need to make the result significant. Remember, if you choose a big enough sample, even a very small difference or relationship may be significant. So if you have the possibility of increasing the size of the sample (i.e. there are more subjects or cases available), and are desperate to get a significant result, it would be useful to know how many subjects would be ideal. Some books give formulae to calculate how big a sample you need, but they don't necessarily straightforwardly fit the situations you have. The following is my best suggestion for an easy way to get an estimate of required sample size using SPSS facilities.

Basically you create imaginary larger samples simply by using your subjects more than once. Suppose you have 20 subjects and p = .231 for whatever test you are interested in. You get SPSS to think that you have three times as many subjects, simply by getting each subject counted three times, and run the test again. Say then p = .09. Then you get SPSS to think you have four times as many subjects, including each of your twenty four times, and see again what happens. By trial and error you get to the point where p = .05, and that gives an estimate of the minimum number of subjects you need to get a significant result.

To get SPSS to count a subject more than once you weight the data, similar to how you are familiar with doing elsewhere. At Transform... Compute you nominate a new target variable which you might call incr (since it will tell SPSS how many times to increase your sample size). You then enter in the Numeric Expression space whatever you want the weighting to be. You could start with a weighting like 2. Click OK and you will find a new column called incr with 2 repeated all the way down. If you now go to Data... Weight Cases and weight the data by that column, then SPSS sees your data as having twice as many cases, counting each one twice. Now do your analysis again and see if it is significant. Go on altering the weighting figure in the incr column via Transform... Compute repeatedly and redoing the analysis until you get a sig difference or relationship. Note that you can enter partial weightings like 3.5 as well. When by trial and error you achieve a weighting that gives a significant result, multiply it by your original sample size to see how many subjects you would need. E.g. if your sample from two groups was 20 in all but you only get a sig difference with a weighting of 3.8, then you need at least 20 x 3.8 subjects (= 76), in similar proportions in the two groups as before, to have a chance of getting a sig difference.

Cautions. You have to make sure the new bigger sample IS from the same population as the old one. In the case of comparisons of groups, of course, several populations may be involved. Even then, any method of estimating the required sample size is only approximate, because even truly random samples can vary a lot. Also, with an increase in sample size the actual difference or relationship you are interested in may not actually get any bigger. It is just more likely to be significant. I.e. you may end up showing that there is indeed a non-zero difference or relationship in the population (which is what 'significant' means), but not that it is a very large one.
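The same trial-and-error idea can be mimicked outside SPSS by simply repeating each case k times and re-running the test; a rough sketch with invented pilot scores (with the same caveat: this estimates the n needed for significance, it does not make the effect any bigger):

import numpy as np
from scipy.stats import ttest_ind

group_a = np.array([52, 48, 55, 60, 47, 51, 58, 49, 53, 50], dtype=float)  # invented pilot data
group_b = np.array([54, 52, 57, 61, 49, 53, 60, 51, 55, 52], dtype=float)  # invented pilot data

for k in range(1, 21):                                     # try weightings 1, 2, 3, ...
    t, p = ttest_ind(np.tile(group_a, k), np.tile(group_b, k))
    if p < 0.05:
        print("weighting", k, "-> roughly", k * (len(group_a) + len(group_b)), "cases needed")
        break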

Group % scores

Twenty people in two groups are each measured for the number of times they use the third person s out of all the occasions or loci when they had an opportunity to (often called potential occurrences). Very many linguistic features are measured this way in acquisition and sociolinguistic research. In the former it is often a matter of how often the correct form (in NS terms) is used, as against some erroneous form or omission, on occasions where there was an opportunity to use it; in the latter it is often a matter of how often one variant is used, out of two or more that make up a sociolinguistic variable. In all these situations there are two ways of summarising and graphing the data: 1) the group way and 2) the individual way.

Either 1) you add up all the potential occurrences for each group, and all the occurrences of the form of interest, and express the second as a percent of the first for each group. Or 2) you calculate a % score for each person using their individual frequency of the form of interest and their individual number of potential occurrences. Then for each group you can calculate the average (mean) % score for that group from the individual scores of its members. However, you have to be aware that this can be a bit misleading for cases whose number of potential occurrences is small: getting one out of one right is 100%, as much as getting 20 right out of 20 possible occasions! It is common to require at least 5 potential occurrences, and otherwise treat a case as missing data. It is easy to show that the group figures may not come out the same! Here we imagine figures for a group of two people and see what happens:

Method 1

                     Frequency of        Number of potential    % occurrence of
                     form of interest    occurrences            form of interest
Person 1             4                   16                     25%
Person 2             8                   10                     80%
Total / Group %      12                  26                     (12/26) x 100 = 46.2%

Method 2

                     Frequency of        Number of potential    % occurrence of
                     form of interest    occurrences            form of interest
Person 1             4                   16                     25%
Person 2             8                   10                     80%
Mean % for group                                                (25 + 80) / 2 = 52.5%
In fact the two methods will come out the same only when all subjects had the same number of potential occurrences (e.g. in a test or list reading task). Many SLA and sociolinguistic studies use method 1. That is fine, if you wish, for the purposes of giving descriptive statistics and making graphs, provided you make it clear what you are doing and are aware of the difference from the other method. BUT for any inferential statistics you should use method 2, entering the data in SPSS in the form of one row per person, with a % score for each person. Then, to compare two groups, for example, you use the independent groups t test on the two sets of scores.
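The two methods in the table can be checked in a couple of lines; a sketch in Python/pandas using the same invented figures as above:

import pandas as pd

d = pd.DataFrame({"occurrences": [4, 8], "potential": [16, 10]})  # persons 1 and 2 from the table

method1 = 100 * d["occurrences"].sum() / d["potential"].sum()  # group total way: 46.2
method2 = (100 * d["occurrences"] / d["potential"]).mean()     # mean of individual % scores: 52.5
print(round(method1, 1), round(method2, 1))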

If you were to attempt inferential statistics on the total figures of method 1, you would have to use the numbers of individual occurrences regardless of people. I.e. if the example above were for one group, you would represent that group with the proportions 12 and 14 (i.e. 12 occurrences of the form of interest versus 14 non-occurrences, making up the total of 26 potential occurrences) and compare those with the overall proportions for the other group being compared with. The test for that is chi squared, and you do see this used even in some published work for data like this. However, there are at least two major problems with this which would lead statisticians mostly to regard it as a misuse of chi squared.

- Like for all significance tests, the basic observations (cases) which enter into the test have to be independent of each other. Now in method 2 the cases are the people, and there is no problem in seeing scores from different people as being independent of each other. However, in method 1 the 26 occurrences in the example are the cases, and clearly while some of those are independent of each other (being from different people), some are likely not (being from the same person).

- There is also an expectation that the populations sampled are homogeneous. From what we have just said, that is clearly not the case in method 1: the 26 observations representing one group in the example are a mixture. It cannot be said that each observation is from one population; it is from a mixture of a population of people and the populations of occurrences of each separate person.

The only instances where chi squared and method 1 might be defensible would be where the numbers of potential occurrences are very small, amounting to little more than one or two per person included. OR where all the potential and actual occurrences come from just one person per group, though that still does not deal with the independence problem. OR where you feel able to argue that responses from the same person are as independent as if they were from different people. There is a tradition of phoneticians making this tacit assumption for things like VOT, on the belief that such things are beyond the person's ability to control.

Rounding interval scores

Just checking: do we know how to round figures on interval scales? The mean of a set of scores may come out as 6.3597, but often we want to express this in shorter form, such as 6.36 or 6.4. Quoting long strings of numbers after the decimal point can look as if you are just trying to impress with loads of numbers. Or it may be you are trying to make up for sloppy METHOD by being super-detailed in the figures quoted in RESULTS. Best not to do that, since one's measurement is unlikely to be so accurate that more than two decimal places are relevant (except perhaps where a computer has measured something for you, like response time). Generally use three or two decimal places for sig/p values, and two or one for everything else. Keep it intelligible and round numbers where necessary. But where do you round up, and where down?

Task: round the following figures to two decimal places: 3.852 0.679 18.505 1.006 7.597 20.955 0.602

SPSS often rounds figures on screen (e.g. in the data grid) even though it is holding longer versions in its memory. You can select for each column how many decimal places it shows in the Data View window.

Answer to the task above: 3.85 0.68 18.51 1.01 7.60 20.96 0.60
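If you ever do the rounding in code rather than by eye, note that Python's built-in round() rounds exact halves towards the nearest even digit; the decimal module reproduces the conventional 'round half up' used in the answers above:

from decimal import Decimal, ROUND_HALF_UP

for x in ["3.852", "0.679", "18.505", "1.006", "7.597", "20.955", "0.602"]:
    print(Decimal(x).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))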

Decoding interval scores expressed in E notation in SPSS output

Sometimes SPSS produces numbers like 7.012E-02. This is not 7.012: it is 0.07012. The E with a minus sign signals the number of places the decimal point has to be moved to the left. So 1.369E-03 = 0.001369, etc. The E is a shorthand so as not to write a load of noughts. Always convert any such figures into the familiar form if you report them in your work. Correspondingly 7.012E+02 would indicate 701.2.

Combining columns of scores for separate items in a test etc. to give a total or average score

Where a test or other instrument produces scores for separate items which then need to be added up to give a total score for a variable, one could of course add them up off computer and just enter the totals. However, to check on internal reliability, or to do an analysis by items in addition, or to filter response times and exclude some, you will need the scores for every item in a separate column, so will have to enter the data in full. To then add columns, use Transform... Compute in SPSS to create a new column that totals the separate ones. You enter the title of the new summary column top left in the dialog box, and click the column names to be added into the top right space, with + between them. That creates a new column of totals. However, anyone with a score missing in any column will be missed out and their total will come out as missing.

If there are missing values in some columns, marked in SPSS by a '.', where subjects failed to respond or have unanalysable data, you will probably want each person's total really to be the average score over all the items they answered, not the total (unless you have some reason to count missing as the same as wrong and so score it 0). You can get this by, in Transform... Compute, inserting in the Numeric Expression box the function MEAN(numexpr,numexpr,...) from the functions list, and putting the relevant column labels in the brackets separated by commas. I.e. if you have a set of three items whose scores are in columns item1, item2, item3, then you would enter MEAN(item1, item2, item3) in the Numeric Expression box. SPSS then generates a new column with the average score of each case on the three items or, if they answered fewer, over the ones they answered. Similarly, if you want to just add, not average, a set of columns, using whatever scores are available, then to avoid the people with missing values getting recorded with a zero total, use SUM(numexpr,numexpr,...) in the same way as described for means above.
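Outside SPSS, the same 'use whatever items were answered' behaviour is the default in pandas, which skips missing values when averaging or summing across columns; a minimal sketch with invented item scores:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "item1": [1, 0, 1],
    "item2": [1, np.nan, 0],   # the second person did not answer item 2
    "item3": [0, 1, 1],
})
items = ["item1", "item2", "item3"]
df["mean_score"] = df[items].mean(axis=1)  # average over the items actually answered
df["total"] = df[items].sum(axis=1)        # sum of whatever scores are available
print(df)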

Cutting an interval scale into ordered categories

A common example is deriving a grouping of subjects from something you measured about them originally on a numerical scale: an explanatory variable such as their age, English proficiency, extraversion etc. This is often done casually without due thought, and often in peculiar idiosyncratic ways by novice researchers, but above all it needs careful thought about why it is done, and how.

Before you do this at all, you need to ask if it is necessary at all. Just because some other researcher had a high prof group and a low prof one does not mean you necessarily have to have groups. When you derive such groupings from scores originally recorded on a continuous interval scale, obviously you lose some information. One person may be a bit better than another on the original scores, but once you decide they both belong in the high prof group, or whatever, they are treated as identical in any further tests. This may or may not help produce the result you want. Certainly how you divide subjects into groups, if you do, can drastically affect the result!

Reasons for cutting

There are a number of reasons: some statistical, some related more to research methods, design and hypotheses.

a. A few statistical techniques require interval scales to be reduced to a binary grouping. Implicational scaling (scalogram analysis) is one method of statistical analysis used in acquisition research that requires this: subjects simply have to be categorised as having acquired or mastered each feature of interest or not. So also varbrule analysis requires groupings of people who use or don't use some form of sociolinguistic interest.

b. If the true interval nature of a scale is in doubt, that could be a reason to reduce it to categories (though reduction to rank order would lose less information).

c. If you retain the original scores and look at relationships with other (dependent) variables, then you are typically into the statistics of correlation, and maybe multiple regression. If you form groups, then you can identify a mean for each group on the other variables of interest and compare those means with t tests, ANOVA etc. Both methods will show relationships between EVs and DVs, but the second will be better (or at least easier in SPSS) for dealing with: i. nonlinear relations, e.g. where high and low proficiency subjects perform similarly on some other variable of interest, compared with intermediate subjects; ii. interactions between different EVs, e.g. where you want to see the combined effects of gender and prof on something (do high prof females differ from high prof males in the same way as low prof females differ from low prof males?); iii. designs involving repeated measures.

d. The goal of the research may be exploratory: precisely to discover useful categories of subjects.

e. You may wish to identify extreme groups of subjects for comparison.

E.g. you want to compare bilinguals who are English dominant with those who are Welsh dominant. You do not want more or less balanced bilinguals. So you measure the bilingual dominance of a sample and will reject the middle scorers, keeping two extreme groups.

f. You need categories to form the IV in an experiment. E.g. you want words of three levels of frequency to be the stimuli for three conditions in an experiment. Or maybe you want extreme stimuli: just frequent and rare. Either way you need groups of words, as it is difficult to use an interval-scored variable directly as the EV in a repeated measures design.

Means of cutting

OK, so you still want to make groups: there are many ways of doing it, and to some extent they match the reasons above. The principles apply to any interval-scored variable that is to be turned into a grouping. The issue is where to cut the original interval scale so as to obtain two or more groups of cases.

Cutting at a priori scale values. That is, cutting at predecided score values on the scale, which would be the same whatever sample you gathered. These values may or may not have some absolute meaning of the criterion-referenced type. Cf. Reason a above. Such a point could be:

- One used arbitrarily by previous researchers. Not necessarily a good way to do it if it has no sound basis, other than that it then enables you later to compare your results directly with those of other researchers.

- The pass mark used in a particular institution for some English exam, or a succession of such marks, e.g. corresponding to what are called grades A, B, C, D in some institution. Again such points may be fairly arbitrary, but perhaps meaningful for your research in allowing you to contextualise it.

- Grades with some universal absolute meaning associated with them, maybe in a professional published test you have used. E.g. you divide subjects into those who got grade 6 or better in the IELTS test, and those who scored worse, given the widespread use of this value as a criterion for entry to UK universities. Ranges of scores of the Jacobs instrument for assessing EFL written compositions, and many international language tests, have proficiency definitions associated with them. A different example of this type is to divide a five point rating scale of the type strongly agree - agree - neutral - disagree - strongly disagree into just two categories: those who showed some agreement (i.e. the top two choices) versus the rest who disagreed or were indifferent. This uses a division point with some clear meaning of its own (but why then did one not ask the question in the first place just as a two-choice item?).

- The score on a variable scored as % correct which is conventionally regarded as indicative that someone has acquired a feature. Acquisition researchers vary in what they think this score is, but 80% or higher correct use of, say, third person s would be regarded by many as enough to put a subject in the group they would say has acquired the feature. Others argue that only 100% correct indicates true acquisition; others that any correct use greater than 0% indicates acquisition has occurred. Again others use other scores, like number of occurrences of a structure in 5 hours of observation (Bloom and Lahey 1978, 328), 5 or more indicating acquisition.

- The score on a variable scored as % use of one alternative which is conventionally regarded as indicative that someone is a clear user of that alternative. Labov in his famous department store study divided subjects into three groups: those using no [r] sounds in the words 'fourth floor' said twice, those using them on all four possible occasions (categorical users), and those in between (i.e. variable users).

- Scores defined by how some other relevant group of people performed on the same test or measure. E.g. for learners you might make use of the mean score of native speakers doing the test (a criterion group), or perhaps the score which only 15% of NS do better than (the 85th percentile). Alternatively one might rely on the mean score that large numbers of learners of the same sort as one's own testees gained in other research (a reference group). The latter is not often available in language research; it is more a feature of standardised NS tests like the British Picture Vocabulary Scale and so on.

Cutting the score scale into halves or equal lengths. That is only easy if the scale has fixed ends, such as a % score scale, or a test marked out of 40. E.g. you make four groups: those who scored between 0 and 10, 11-20, etc. (being careful not to label them with the overlapping 0-10, 10-20, 20-30). This is often not very meaningful unless the scale has some absolute meaning, so that half-marks actually means half knowledge of something beyond the test items, and it produces unequal sized groups. Also it may not even be possible to quite achieve equal lengths with ease (0-10 actually covers one more point than 11-20!). However, it is a system that can be used with the same cutting scores on any sample, like the above but unlike those below. Mitchelmore (1981) suggests that the scale should not be cut into lengths that are too short, so as to avoid misclassification: lengths should not be shorter than 1.635 x SD of scores x (1 - reliability). Possibly useful for Reason b above.

Cutting so as to achieve equal numbers of subjects/cases in each group. Technically this uses the median and quartiles. I.e. suppose you had scored 30 people and want two groups: you simply put them in rank order on the basis of their scores, and the top 15 (those above the median score) become the high prof group, those below the low prof one. The cutting score obviously will differ for different samples and has no real meaning, but generally it is better for later comparisons if groups have more or less the same numbers of subjects in. Often used for Reasons b, c and f above.

Cutting at the mean, and points related to it. E.g. you divide into those who scored above the mean (average) and those below. Or four groups: those scoring more than one SD above the mean, those more than one SD below the mean, those between the mean and one SD above, and those between the mean and one SD below it. To get three groups you might use the mean plus or minus half the SD as cutting points. The mean, like the median, is entirely relative to a particular sample of course. The problem with dividing at the mean is that usually many cases score near the mean, so cases very close to each other will get put in different groups. If the original scoring is not perfectly reliable, that in turn means that some cases may be misclassified.

Cutting into natural groups using low points in the distribution shape. This is a simple form of cluster analysis and simply looks to see if the subjects in the sample seem to have grouped themselves (cf. Reason d above, and also maybe b and c). I.e. looking at a histogram of scores, are there two or more heaps with a low point on the scale where few scored? If so, make the cutting score the middle of the low point(s). This of course decides both where to cut and, unlike most methods, how many groups to identify.

It may vary from sample to sample, but it does reflect the nature of a particular sample better than some of the above methods. It will not work if the histogram is simply one heap (e.g. with the normal distribution shape), though sometimes rescaling the histogram with finer divisions may reveal what an initial SPSS histogram may conceal. As an example, the scores of 217 subjects on a College English exam in Pakistan are graphed below, and it is fairly clear that there are two groups in the sample, those scoring above 58 or so and those below. By comparison the median score, above and below which are equal numbers of cases, is 50 for this data, and it appears to rather arbitrarily divide people within one of the groups that they seem to naturally form.
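Several of the cutting methods just described are one-liners in code; a hedged sketch in Python/pandas with an invented score column (the cut points and labels are arbitrary examples, not recommendations):

import numpy as np
import pandas as pd

scores = pd.Series(np.random.default_rng(1).normal(50, 12, 200).clip(0, 100))  # invented scores out of 100

a_priori = pd.cut(scores, bins=[0, 40, 60, 100], labels=["low", "mid", "high"])  # fixed cut points
median_split = np.where(scores > scores.median(), "high", "low")                 # two equal-sized halves
quartile_groups = pd.qcut(scores, 4, labels=["q1", "q2", "q3", "q4"])            # four equal-sized groups
mean_sd = pd.cut(scores,
                 bins=[-np.inf, scores.mean() - scores.std(),
                       scores.mean() + scores.std(), np.inf],
                 labels=["low", "middle", "high"])                               # mean plus or minus one SD
print(a_priori.value_counts().sort_index())
print(quartile_groups.value_counts().sort_index())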

With all the above methods, but especially the third, researchers may choose to use extreme groups only. Often where a researcher wants to get clear differences between groups later, he/she will help this along a bit by, say, using the top third and the bottom third of subjects and missing out the middle third in any later comparisons. Reasons e and f above.

However you cut, you have to be careful how you speak. Very often you will call the groups you make the 'high proficiency group' and the 'low proficiency group', or the like. But unless your original test that produced the scores was a criterion-referenced one, deciding some absolute level of prof for each taker, with international equivalence, then this can be misleading. Very often the proficiency test used was a cloze test you knocked up yourself, or the like. It may well distinguish students with higher proficiency from those with lower, in the sample of students you are using. But that does not mean there is any equivalence with what were called 'high prof' students by some other researcher who used a different test with a different sample in another country. It could be that all his students, high or low prof, are no better than the worst of your low prof group, and so on. Only if some standard published test such as FCE or TOEFL was used by all could you match up across studies and see if there was any real comparability between so-called high prof students in different studies. In fact close examination shows that many variables used in research have no absolute definitions of scale points, and most of the above ways of dividing cases into groups only distinguish in a relative way between who/what has more of something or less, not exactly how much.

The size of the standard deviation

One is quite used to having SPSS calculate the SD along with the mean (= average) of a set of scores (i.e. for any interval scale). We are also used to the idea that the SD measures spread of scores around the mean. If all cases scored the same, the SD would be 0. The bigger the SD, the more spread the scores of different cases are: the more subjects are disagreeing with each other in their scores. And the more that happens within groups, the harder it usually is to show any convincing differences between groups. Similar concepts to SD are what statisticians call variance and error. These measures are slightly different but all, roughly, are averages of the differences between each case's score and the mean. If all cases score the same, which will also be the mean score, then their differences from the mean are 0, so SD = 0.

Sometimes SPSS fails to perform a procedure because of a problem of 'zero variance'. That means it found that one of your groups, on one of the variables measured, had an SD of 0: all cases scored the same. This makes certain statistical procedures impossible: they involve variables and cannot work if everyone scores the same, as then you have not a variable but a constant. You cannot answer the question 'what is the relationship between age and reading ability?' if you have obtained data from a sample who are actually all of the same age!

So we know what an SD of 0 means, but what about big SDs? There is often no simple maximum value that the SD can have. But there are some guides to help assess the size of an SD:

- It may often be more of interest whether different groups or conditions show similar or different variation (SD) than how great the SD actually is. In general you assess the size of an SD for each sample group separately.

- If your scores are on a scale with both ends logically fixed (e.g. a test scored out of 40), then the maximum possible SD, if cases were maximally varied in scores, is half the scale length (well, actually it will be a shade above that for small numbers of cases, but that is a useful rule of thumb). So you can assess the size of an SD you get in relation to that. An SD would usually be regarded as big if it was even as much as half the maximum (i.e. a quarter of the scale length). On a scale of % correct scores, half the scale is 50. Note that on a five point rating scale running 1-5, half the scale length is 2. On such scales of course the mean is also limited: it cannot be a figure outside the end points of the scale. That places further limits on the size of the SD: the nearer the mean is to the limit of the scale, the smaller the maximum possible SD.

- If your scores are on a scale with one or both ends virtually open, then the SD (and the mean) could be indefinitely large. In language research many scales are fixed at one end on zero, but open at the other. E.g. word frequency: words cannot occur less than 0 times, but there is no clear upper limit to how often they can be observed. So also sentence length: sentences cannot be shorter than one word, but they can be indefinitely long.

Response times in milliseconds have a hazier lower limit: there is an indefinite upper limit to how long anyone can take to respond to a stimulus and, although technically there is a lower limit of zero, nobody can really respond in zero milliseconds, so there is an indeterminate lower limit to fast responses. With these scales it is harder to say what is a big SD, but one can use some yardsticks:

- One can use the maximum and minimum scores that occur in one's data as indications of the effective limits of the scale, and as above treat an SD larger than a quarter of the distance between them as large. For a scale fixed at one end, one could use the distance between the bottom limit and the highest observed score.

- With scales fixed at the bottom end, but open at the high end, the distribution is often positively skewed, i.e. scores are heaped near the bottom limit and tail off to the right. In that situation the SD can be, and often has to be, greater than the mean, though if the distribution has a perfect Poisson shape, the mean = the square of the SD.

- If the mean is some way above the bottom limit, and that limit is 0, and the distribution is more symmetrical, then people sometimes assess an SD in relation to the mean: if the SD is as much as or more than half the mean, that indicates very substantial variation among the scores of a group.

Always look at the distribution shape on a histogram as well as the mean and SD. The shape may reveal more than anything else.

How to treat rating scale responses

An old problem is how to handle responses to items recorded on scales such as 'strongly agree - agree - neutral - disagree - strongly disagree' or 'always - often - sometimes - never'. These are rating scales (not usually called multiple choice). They are clearly ordered choices, and there is uncertainty whether they are really best thought of, and treated statistically, as:

- Ordered categories: so you present the results in bar charts, report the % of people who responded in each category on the scale, and use ordered category statistics to analyse relationships with other variables. OR

- Interval scores: so you assign a score number to each point on the scale and present the results as a histogram, report the mean and SD of the scores of a group, and use t tests, Pearson correlation or whatever when comparing groups or looking for relationships. The numbering could be e.g. strongly disagree = 0, disagree = 1, and so on; or if you prefer, strongly disagree = -2, disagree = -1, neutral = 0, etc.

Generally it is far easier for any statistical handling to treat the data the interval score way, as the stats for interval scores are better known and more versatile in what they can do. The results are usually easier to absorb as well. Suppose two groups are asked how far they agree that a CALL activity is easy to understand; group B is of a higher English level than A. Is it easier to derive some meaning from being told this:

In group A the response was: strongly agree 43.3%, agree 20%, neutral 13.3%, disagree 13.3%, strongly disagree 10%. In group B it was: strongly agree 30%, agree 30%, neutral 10%, disagree 30%, strongly disagree 0%. The difference between the two groups is not significant (Kolmogorov-Smirnov Z = 0.365, p = 0.999).

The mean agreement response (on a scale from -2 for strong disagreement to +2 for strong agreement) was 0.73 in group A and 0.6 in group B. Variation was similar in the two groups, and moderately high (SDs 1.41, 1.26). The difference between the groups is not significant (t = 0.265, p = 0.793).

I know which I find easier to follow! So I advise going for the second interpretation wherever possible, but making sure that when you use such scales, the way they are used in the data gathering itself justifies this interpretation. In particular:
- Make sure the words used for the points of the scales do suggest more or less equal intervals between one point and the next, otherwise the interval interpretation is invalid.
- Accompany the wording with figures in the version presented to respondents, so they are encouraged to think of the scale as a numerical one, with equal intervals between the numbers.
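To make the two treatments concrete, here is a minimal sketch in Python (outside SPSS, using numpy and scipy) of the same sort of comparison, assuming the responses have already been coded -2 to +2 as above. The group data here are invented purely for illustration, not taken from the example reported above.

import numpy as np
from scipy import stats

# Invented responses coded -2 (strongly disagree) to +2 (strongly agree)
group_a = np.array([2, 2, 2, 1, 1, 0, 0, -1, -1, -2])
group_b = np.array([2, 2, 1, 1, 0, -1, -1, -1, 2, 1])

# Treatment 1: ordered categories - report the percentage choosing each point
for label, g in [("A", group_a), ("B", group_b)]:
    values, counts = np.unique(g, return_counts=True)
    pct = dict(zip(values.tolist(), (100 * counts / g.size).round(1).tolist()))
    print(f"Group {label} percentages by scale point:", pct)

# Treatment 2: interval scores - report mean and SD and compare with a t test
print("Group A mean/SD:", group_a.mean(), group_a.std(ddof=1))
print("Group B mean/SD:", group_b.mean(), group_b.std(ddof=1))
t, p = stats.ttest_ind(group_a, group_b)
print(f"t = {t:.3f}, p = {p:.3f}")

Either summary could then be reported; as argued above, the second is usually the easier one for readers to digest.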

Tests of prerequisites for parametric statistical tests

These tests of prerequisites are only of interest to check if the data is suitable for using some OTHER test that you are REALLY interested in, because it relates to your actual research questions or hypotheses. Tests of prerequisites generally apply where ANOVA/GLM is used, though researchers rarely report having made these checks and we cannot tell if the checks were performed or not! You generally want them all to be nonsignificant, as that is what shows the data is straightforwardly suitable for parametric significance tests like ANOVA/GLM. If a prerequisite test is failed then there may be alternatives within the parametric tests you can use to compensate, or weaker nonparametric tests you can use instead of straightforward ANOVA etc., or possible transformations of the data one could do... but often one has to just admit the data is not perfect for the procedure and carry on and use ANOVA anyway. Their functions are as follows:

Any parametric significance tests (t tests, ANOVAs etc.) all assume that the populations that the groups are from have distributions of scores that are normal in shape (i.e. that bell-shaped distribution you see in all the books). Check with the K-S test (though on small samples everything passes this test!!).

t test for 2 independent groups, and all ANOVAs involving comparisons of 2 or more groups (with or without also repeated measures): the groups each need to have a similar spread of scores within them round their respective means (= homogeneity of variance). Check with Levene's test, which (roughly) decides whether the SDs of the groups could be from one population of SDs, so are similar, or not. The t test for 2 independent groups has alternative versions depending on whether this prerequisite Levene test is passed (nonsig) or not, but ANOVAs don't: they all assume the prerequisite test of equal variances is passed.

All ANOVAs involving comparisons of 3 or more repeated measures (with or without independent group comparisons as well).

Here again the spreads of the scores in each condition need ideally to be similar. Strictly it is the covariation between each pair of conditions that needs to be similar (= test of sphericity). Check with Mauchly's test (which SPSS automatically gives you even where you only have two repeated measures, though it applies vacuously there and need not be looked at). The check, roughly speaking, looks at the correlation between the scores in each condition and those in each other condition, in pairs, and sees whether the correlations could all be from a population with one correlation or not. The data would likely not pass if people who did better on condition A also did so on B but were the worst on C, and so on... If it is passed (nonsig) then you use the 'sphericity assumed' results in the ANOVA table, otherwise the ones below those (Greenhouse-Geisser).

ANOVAs with a mixture of repeated measure comparisons and independent groups. Here there is an extra requirement, that the pattern of covariance between conditions in each group separately should also be similar across the groups. Check with Box's M test.

Missing values

Missing values are where cases have scores or categorisations completely missing for some reason, where most cases did provide data. E.g. they gave no response, were uncooperative, or their response was unanalysable, etc. (Where subjects have taken a multi-item test or the like to produce their scores, they may miss some items but still get a score for the test as a whole. That is a different issue: you have to decide there whether a missed-out item counts as wrong, or whether you allow people to miss items and, as overall test score, give them the average score for the set of items they did answer.) Missing values are usually entered in SPSS by a . in the space where a figure should be, unless you have assigned an actual number that you enter as indicating missing values, and declared it in Variable View > Missing.

If you have missing values there may be problems:
- You may have very few cases left that you can use in the required statistical analyses: especially in repeated measures and multivariate designs, if a case has data missing on one variable/condition included in an analysis, it gets left out totally (i.e. listwise).
- The missing values may not be random: certain kinds of subject may be more prone to produce them, so using the data without them, or with too few of them, will lead to a biased result. E.g. young versus older testees; lower versus middle class informants.

If you leave missing values in place, SPSS usually gives the choice (in Options for a given test) for you to treat them listwise or pairwise/test-by-test. This really applies to multiple analyses of the same data, as within one analysis it usually has to be listwise, meaning that the number of cases used is the maximum number that has a complete set of data across all the relevant columns. E.g. if in Correlation you want correlations done between every pair of variables in 5 columns, that is ten pairs, so ten analyses. The listwise option would get you correlations using just the cases with full data across all 5 columns, so the same number of cases would be used in each analysis. Pairwise would, for each analysis, use the maximum cases with data on both the relevant columns, so use more of the data, but different numbers of cases might well be used to calculate different correlations.

If you want to fill in missing values, the main principle is that it should not be done in some way that will clearly directly influence the result you are interested in. I.e. you should not fill in the missing values following a principle that will obviously make the difference or relationship which is the focus of your actual research more marked.

Broadly there are two ways of filling in missings in any column in SPSS (where a column represents a variable, or a condition in repeated measures data):
A) You fill in with the mean of the scores in the column itself (or, if it is in categories, the mode, which is the most popular category in that column).
B) You fill in by predicting a score from the general correlation of that column with others in the data: the EM and regression methods.

Imagine data as follows:

C1  C2
3   5
5   7
7   9
4   .
6   8

If the research question concerns whether there is a relationship between the two variables in C1 and C2 (correlational design), then you do NOT use method B, which would use the correlation that exists already in the data to fill in missing values. I.e. here, given the perfect positive correlation between the two sets of scores, method B would fill in the missing value as 6, predicting it from C1. But that will obviously enhance the perfection of the correlation which it is your aim to discover! So the mean of the second column (method A) would be a better fill-in value: 7.25.

If on the other hand this was data from the same subjects on the same DV scored in two conditions in C1 and C2 (repeated measures design), and the research interest is in the difference between the means of the scores in each column (do they score significantly higher on condition 2?), the better way to fill in the missing values would be method B. Method A would simply enhance the level of the mean of C2, and strengthen its distance from the mean of C1.

For these reasons, when you run correlation-type statistics like Regression and Factor analysis, SPSS under Options offers you the choice to fill in missing values with the means (method A) as it operates. The data in the Data view does not get visibly altered: it is just that you find all the cases have been used instead of those with missings being left out. Similarly in Regression with optimal scaling, which works on associations between categories rather than interval scores, there is the choice to use Mode imputation, which fills in the missings with the most popular category in the relevant column.

In situations where method B is suitable, you have to use Analyze > Missing Value Analysis to actually fill in the missing values in the data in Data view beforehand. Basic instructions: at the first box, enter all the columns relevant to the analysis you will be doing, either as quantitative (i.e. interval) or categorical (categories/nominal). Only the former are actually used in the estimation of missing scores, though (SPSS does not seem to provide a way of filling in missing category data by method B).

Tick EM and, if there are some quantitative columns that you don't want used as a basis for predicting values of missings, click the Variables button and make your selection. Otherwise all the quantitative columns you declared in the first box are used to predict any missings in each other. Click the EM button and tick Save completed data; and under File name enter a file for it to be stored in. Then Save > Continue > OK. The procedure will produce various output, but mainly you are interested in the new stored file of data. If you call it up, you will find the missings all filled in.

In data for independent groups analysis (e.g. t tests, ANOVA), with missings in the DV column, if you have other columns of dependent variable data not being used in the same analysis, you could use them to fill in the missings by method B. Otherwise you can only use method A, i.e. use the mean for the DV column (NOT the mean of each group) to fill them in.

Getting phonetic symbols displayed in SPSS graphs

First ensure you have the fonts of your choice (e.g. SILManuscriptIPA etc.) installed in Windows in the usual way. If they are available to you in Word in the usual way via Insert > Symbol, then they will be available in SPSS. If not, get a copy of the font file (ending .ttf) and put it in the Fonts subdirectory of the Windows folder on your PC.

Now, having made a graph in SPSS, click the graph you have created to make it appear in the Chart editing window. Then click the part you want to put special symbols in, such as the bottom scale, so it comes up outlined. Next click Format...Text, select the required font from the menu and the size you want, and click Apply, Close. Now when you click the scale of the graph and choose to change the Labels, you can type the symbols you want. However, you don't initially see them when you type them in the dialog box. You have to know that in the SIL font shift-t gets you the symbol for the th sound of thick, though it will look as if you have just got T. Anyway, you have to type all the labels in the new font; you cannot mix symbols from different fonts, I think. So retype the labels using Change, and Continue. The symbols you want will appear on the graph itself.

I have not found a way to get symbols that are coded outside the range of the font that is covered by the keyboard keys, with and without shift. To know what symbols you can get from which key, with and without shift, you may have to study the table of symbols for your font in advance through a program such as Word, which displays it through the Insert..Symbol option.
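Going back to the missing-values discussion above, the logic of methods A and B can also be sketched outside SPSS. Here is a minimal Python illustration using the tiny C1/C2 example dataset from that section; the use of pandas/numpy and the linear-fit shortcut for method B are my own choices for illustration, not anything SPSS itself produces.

import numpy as np
import pandas as pd

# The small example dataset from above: one missing value in C2
df = pd.DataFrame({"C1": [3, 5, 7, 4, 6],
                   "C2": [5, 7, 9, np.nan, 8]})

# Method A: fill the missing C2 value with the mean of the observed C2 scores
fill_a = df["C2"].fillna(df["C2"].mean())        # the missing value becomes 7.25

# Method B: predict the missing C2 value from its linear relation with C1
observed = df.dropna()
slope, intercept = np.polyfit(observed["C1"], observed["C2"], 1)
predicted = intercept + slope * df["C1"]
fill_b = df["C2"].fillna(predicted)              # the missing value becomes 6.0

print(fill_a.tolist())
print(fill_b.tolist())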

Item Analysis

This term is found used in two distinct senses. Both involve data where variables or experimental conditions are measured using sets of items for each in some way.

A) The usual traditional sense, found especially in the pedagogical testing literature. Here it applies in the situation where a set of items is used to measure what is regarded as one single variable/construct. The set of items is usually thought of as a multi-item test of one thing
(e.g. reading ability, or vocabulary size). However, item analysis may also be applied to, say, a set of Gardner-type statements for respondents to agree or not with, where a distinct attitude or orientation is measured by an inventory of five such statements, rather than just one. It can also apply separately to each set of items designed collectively to measure a single condition in an experiment. Item analysis in all these instances is the activity of checking whether there are some items in the set that in some way do not seem to belong there, illuminating how and, if possible, why they are odd, and maybe removing them or replacing them with better items when the test is used again. It is closely tied to internal reliability checking, often done these days with the use of the Cronbach alpha coefficient or Rasch analysis. Removing items that are odd improves reliability. This sort of item analysis is often done in pilot studies, as it represents a way of refining the quality of instruments for use in a main study. There are several statistical criteria for deciding what items are odd in a set that is supposed to be all measuring one thing. See further my Reliability handouts. Where items are supposed to attract similar levels of response (e.g. be of similar difficulty) then the classical IA approach involving alpha is appropriate; where items are supposed to be graded, and form an implicational scale, then an approach using IRT/Rasch is better. Where response times are involved, other criteria may be used to exclude responses for specific people on specific items (i.e. instances rather than whole items).

B) The sense in which it is found used in some psycholinguistic literature. Here it denotes a second kind of analysis of data, beyond the usual default one. In an item analysis, instead of the subjects (usually people) being treated as the cases, the items are treated as the cases. Hence it is really analysis with items as cases, rather than item analysis, and is typically part of the analysis of the results of a main study. This applies only when a study has several conditions, each represented by a set of items, but this is very common in psycholinguistic studies, where subjects' performance in different conditions is often measured by their responses to sets of stimuli in a repeated measures design. For example, a repeated measures variable word frequency might be constituted as three sets of ten words, of three different frequency levels, making 30 items for people to respond to in some way; a variable early vs late attachment could be instantiated as two sets of sentences, of two structure types, one in which a relative clause has to be parsed with an early noun phrase, the other with a late occurring one. Often such data arises also in areas such as SLA, applied linguistics and even sociolinguistic research as well as psycholinguistics, but item analysis in this sense is only routine in the latter, where it is regarded as a further confirmation of results obtained by the usual subject analysis, i.e. analysis with subjects as cases. Where, as often, ANOVA (see my handouts) is used to analyse the results, the F values for the subjects as cases analysis are reported as F1, and those for the items as cases analysis as F2.

Statisticians generally regard analysis with subjects as cases as the sounder basis, due especially to the independence requirement. Cases have to be regarded as providing independent observations if the assumptions of inferential statistical tests (e.g. ANOVA) are to be met.
While it is generally not difficult to assume that responses from different people are independent of each other, it is not so certain if responses to different items are so independent, when the same people respond to all of them. One has to assume that in psycholinguistic experiments people are unable to make their responses to one item reflect their response to another. This is often assumed by phoneticians and psycholinguists.
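As an illustration of sense (A), internal consistency is often summarised with the Cronbach alpha coefficient mentioned above (SPSS produces it via Analyze > Scale > Reliability Analysis). Here is a minimal sketch of the standard alpha formula in Python; the five-item score matrix is invented purely for illustration.

import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a cases-by-items array of scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented scores for 6 people on a 5-item test (1 = correct, 0 = wrong)
test = [[1, 1, 1, 1, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 1, 0, 0, 0],
        [1, 0, 1, 1, 0]]
print(round(cronbach_alpha(test), 3))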

Imaginary dataset to illuminate both the above. Suppose we have two groups of ten people (G1 and G2), and each respond in two conditions (C1 and C2), where 5 items are used to obtain responses for each condition. As laid out for a customary subjects as cases analysis in SPSS this would appear as 11 columns and 20 rows thus. Of course, the items would often not have been presented to subjects in an experiment in sets, but intermixed with each other and maybe with additional distracter/filler items that are not scored at all.

Columns (11): Group | C1 item1 | C1 item2 | C1 item3 | C1 item4 | C1 item5 | C2 item1 | C2 item2 | C2 item3 | C2 item4 | C2 item5
Rows 1-10: labelled 1 in the Group column, one row per G1 subject; each row holds that person's scores on C1 items 1-5 and C2 items 1-5.
Rows 11-20: labelled 2 in the Group column, one row per G2 subject; likewise, their scores on the same ten items.

To do item analysis (A) above in SPSS, you would split the file by Group and use Analyze > Scale > Reliability Analysis (Alpha) on each set of five items separately (or, for Rasch analysis, you need other software). Four analyses. That means that the internal consistency is always assessed within a collection of scores which is from a set of items that supposedly measures one thing, and which comes from a homogeneous group of subjects.

After any adjustment of the data to improve reliability based on the above, you then typically move on to the actual analysis of results with subjects as cases. You first produce two extra columns which contain the averages of each five-item set of scores for each person. Use Transform > Compute. These Mean C1 and Mean C2 columns each now summarise the performance of subjects in one condition. Those two columns, together with the Group column, are then used in a mixed two way ANOVA to see if there is a sig difference between groups or between conditions, or a significant interaction effect. That is your subjects as cases F1 ANOVA.

For item analysis (B), you need to make the items into the rows. You can do this with Data > Transpose in SPSS. If you start from the data as displayed above and include all the columns, you end up with 11 rows, which were previously the columns. There are columns now for each of the 20 subjects. You can now use Transform > Compute to get two new columns calculated which represent the mean scores for each group of subjects on each item. Then delete the row that contains the grouping numbers.

Add a column of 5 1s and 5 2s to record which items (now rows) relate to condition C1 and which to C2. So the data should end up much as below. Finally use the column that records whether an item belongs to C1 or C2, and the two columns of group mean scores for each item. Again do a mixed two way ANOVA to see if there is a sig difference between groups or between conditions, or a significant interaction effect. That is your items as cases F2 ANOVA. Note that what was a repeated measures factor in the F1 subject analysis, condition, becomes a between groups factor in the F2 item analysis. The grouping of subjects, which was a between groups factor in F1, becomes a repeated measures factor in F2.

Columns (23): G1 subj1 ... G1 subj10 | G2 subj1 ... G2 subj10 | Condition | Group 1 mean | Group 2 mean
Rows 1-5: the five C1 items, labelled 1 in the Condition column; each row holds every subject's score on that item, plus the mean score of the 10 G1 subjects and the mean score of the 10 G2 subjects on it.
Rows 6-10: the five C2 items, labelled 2 in the Condition column; likewise, the individual scores and the two group mean scores on each C2 item.
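The reshaping and averaging just described (Transform > Compute for the F1 condition means, Data > Transpose plus Compute for the F2 item means) can also be sketched outside SPSS. Here is a minimal pandas illustration with invented scores; the column names (with underscores in place of spaces) are my own, and it only prepares the two data layouts; the two ANOVAs themselves would still be run separately.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Invented subjects-as-cases layout: 20 subjects, Group 1/2, ten item columns
items = [f"C1_item{i}" for i in range(1, 6)] + [f"C2_item{i}" for i in range(1, 6)]
data = pd.DataFrame(rng.integers(1, 8, size=(20, 10)), columns=items)
data.insert(0, "Group", [1] * 10 + [2] * 10)

# F1 preparation: mean of each five-item set per subject (Transform > Compute)
data["Mean_C1"] = data[items[:5]].mean(axis=1)
data["Mean_C2"] = data[items[5:]].mean(axis=1)

# F2 preparation: transpose so items are rows (Data > Transpose), then
# compute each group's mean score on every item
by_item = data[items].T                              # rows = items, columns = subjects
by_item["Condition"] = [1] * 5 + [2] * 5
by_item["Group1_mean"] = data.loc[data["Group"] == 1, items].mean().values
by_item["Group2_mean"] = data.loc[data["Group"] == 2, items].mean().values

print(data[["Group", "Mean_C1", "Mean_C2"]].head())
print(by_item[["Condition", "Group1_mean", "Group2_mean"]])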

Note, the above account of items-as-cases analysis assumed that the sets of items used to represent the two conditions were not themselves matched or repeated in any way. I.e. C1 items 1-5 might have been five nouns as stimuli in some response time experiment, and C2 items 1-5 five verbs, with no special connection between individual verbs in one set and individual nouns in the other. If however the items are themselves matched in pairs or repeated in different forms etc. across conditions, the items as cases analysis should be different. E.g. if C1 items were five verbs in the past tense and C2 five verbs in the bare infinitive form, the researcher might choose to use the same five verbs in both conditions (randomised with suitable distracters interspersed when they are actually presented to subjects). Then the items are individually matched and the items-as-cases analysis should be done with the items as repeated measures.

I.e. in the data grid above for SPSS, the 5 rows for C2 responses would need to be not below the 5 rows for C1 but side by side, with the matched items in the same row, to allow repeated measures comparison of items as well as subjects.

Checking for guessing or response bias when using certain data-gathering instruments with closed responses

Checking for guessing

Any instrument where the subjects are given choices to pick from for an answer is potentially open to guessing, in the sense of picking one option at random, without thought. For example, the respondent may randomly pick one of the choices because:
- they can't be bothered to think about the question/item and just want to finish quickly
- they don't actually have any relevant knowledge to make a correct choice
- they can't understand the question (language too hard, too long, pragmatically odd etc.)
etc.
Clearly the results will not then be a true measure of whatever the researcher intended to measure, and could even vary if the subjects responded to the same items again on another occasion. I.e. not valid or even reliable. This affects multiple choice items, yes/no or agree/disagree items in questionnaires and tests, rating scales and so forth. Clearly it cannot affect instruments which have open response in some form, i.e. with no alternatives supplied.

One cannot statistically tell definitely if guessing has taken place or not, but one can check if the responses are like those one would get from someone who was guessing, or not. Obviously it is quite possible to get a real result, where people have paid attention and answered sensibly, which happens to be similar to the guessing one. Only the researcher can judge the interpretation. You need to calculate what the result would be, on average, for someone who was randomly guessing, and use the appropriate one sample test (see my LG475-SP handout) to check if the observed result differs significantly from the one you would get by random guessing. For example:

1) 30 subjects have to answer yes or no to a question about whether they use the keyword method of vocab learning or not. Random guessing would yield a frequency of 30/2 = 15 yes responses. Use the 50% binomial test.

2) 30 subjects have to pick one of four reasons they are offered for why they are learning English. Random guess frequency of each choice being picked would be 30/4 = 7.5. Use the chi squared one sample fit test.
3) 30 subjects have to judge 20 words for whether they exist or not in English. Thus each person gets a score out of 20 for how many they say exist. The average random guess score would be 20/2 = 10. Use the one sample t test.

4) 30 subjects listen to a short talk and are offered 5 test items afterwards. Each item consists of four sentences, one of which occurred in the talk, while the others are similar but did not. In each item subjects have to pick the sentence that they had heard. Thus they can get a score of max 5 correct. The average random guess score would be 5/4 = 1.25. Use the one sample t test.

One protection against guessing is to include a don't know option and encourage respondents to use it. However, often in tests you do not want to allow this: you want to force a response. If blind guessing has been encouraged, or appears to have been used a lot, then some researchers adjust all cases' scores for guessing (relevant in multi-item test examples 3 and 4 above) as follows:

Adjusted score = raw score - (maximum possible score - raw score) / (number of alternatives offered on each item - 1)
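A minimal Python sketch of both the chance-level check and this adjustment, using invented raw scores for example 4 above (5 items, 4 alternatives per item, so a chance score of 1.25); scipy is assumed for the one sample t test.

from scipy import stats

def adjust_for_guessing(raw, max_score, n_alternatives):
    """Correction-for-guessing formula given above."""
    return raw - (max_score - raw) / (n_alternatives - 1)

# Invented raw scores out of 5 for a group of subjects (example 4)
raw_scores = [3, 4, 2, 5, 1, 3, 2, 4, 3, 2]

# One sample t test of the observed scores against the chance score 1.25
t, p = stats.ttest_1samp(raw_scores, 1.25)
print(f"t = {t:.2f}, p = {p:.4f}")

# Adjusted scores: full marks stay at 5, a chance-level score would become 0
print([round(adjust_for_guessing(r, 5, 4), 2) for r in raw_scores])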

On the new scale someone who scores full marks still gets the same full mark / max possible score, but someone who scores at the guess rate gets 0. So in example 4 someone with a raw score of 3 in fact receives an adjusted score of 2.33. In example 3 someone with a raw score of 10 scores 0.

Checking for bias

In those same multiple choice instruments (and indeed others) people may answer with bias. That is, although they are not randomly picking options, they still do not always answer truthfully (whether consciously or not). So again the measurement is not valid, though it may be reliable, in that subjects may choose the same response to the same item on any occasion. Response bias may be affected by a number of things, associated with the subjects, the measurer or the instrument itself, including:

Researcher effect. The researcher may without realizing it convey the idea that he expects attitudes to be favorable, answers to be yes etc., and the subjects may respond to this.

Subject confidence. For individual personality reasons, or maybe due to cultural factors, subjects may be cautious and choose the midpoint on bipolar rating scales (e.g. neither agree nor disagree) even when they have an opinion. Or they may be overconfident and characteristically say yes.

Subject wish to be cooperative. For individual personality reasons, or maybe due to cultural factors or young age, subjects may interpret being cooperative as saying yes.
Instrument factors. If an instrument presents a lot of items with the same response choices (e.g. all yes/no, or all an agree-disagree scale), and if the ones responded to first elicit a similar response choice, then this can form the basis for a set, and other items may be automatically answered by selecting the same option.

Cost or benefit perceived by subject. In a vocab test where a list of words has to be indicated as known or not known, the testees may see it as a benefit to themselves to get as high a known score as possible, so will tend to overdo that choice, and the tester will want to check for this. When deciding if learners pass or fail an English test for air pilots, the examiners may feel that there is a big risk in passing someone who is not really up to the standard, so a benefit in erring on the side of failing too many candidates.

A way to check for bias is to include additional items where you know in advance what the answer should be for subjects like these. Then if you don't get the expected answer on those, you can see that subjects may be exhibiting bias in general. A form of control, or construct validation. In a special instance, this evidence may be used to adjust scores for bias.

Take the case of the vocab test mentioned above where subjects have to say if words they see exist or are known. Several of the factors mentioned above might favour yes bias. One way to counter this is to focus the testees' attention on no rather than yes, by making the task to mark which words they do not know / do not exist, rather than to mark those that do. However, we can also check and adjust for yes bias as follows. Though the test is of claimed knowledge of real words, it is possible to intermingle randomly in the test items some non-existent words to be judged. We know that the subjects cannot know them, because they could never have met them. Hence their response should be no for all of these. If we get yes responses for some of the unreal words, we have evidence of yes bias and can quantify it. One could do a similar trick in grammaticality judgment tests, by including sentences with structures impossible in the languages under consideration, along with those of interest to us.

Stimuli are either real words / true items (the focus of the test) or non-existing words / false items (used as controls / yes bias checks):
- Response yes (known / exists) to a real word: true positive (hit); to a non-existing word: false positive (false alarm).
- Response no (not known / does not exist) to a real word: false negative (miss); to a non-existing word: true negative (correct rejection).

Some researchers simply exclude any cases who give two or more false positive responses. If you need to adjust the scores for this rather than just exclude people, this is more complex (see me).

Eliminating response times

In response time experiments it is common to filter the data by eliminating (a) extreme response times, and/or (b) response times where the response was in fact wrong.

a) Suppose subjects respond to 50 stimuli, representing three conditions (i.e. ten stimuli of each of three types of interest, with 20 distractors). Maybe they have to judge the existence or not of the word they see as fast as possible. Within each set of ten, for each person, it is common practice to eliminate responses where the value is way above or below the mean response time for that person in that condition. The argument is that if the time is excessively long, the subjects were not giving the spontaneous intuitive responses the psycholinguist wants, but referring to other types of knowledge such as explicitly learnt rules (i.e. thinking too hard); if the times are very short, maybe they were not thinking at all but just pressing a key at random to get on with the task as fast as possible. Commonly a distance of two standard deviations above and below the mean is taken. Anything outside that for a person on an item within any condition is regarded as inadmissible and treated as missing. The mean score for a condition for a person is then calculated using the remaining responses.

To get SPSS to do this, we first assume that the data is entered as usual with a column for the response times to each stimulus and a row for each person. Imagine columns labelled st1, st2 etc., with the first ten columns representing response times for one condition / stimulus type. Suppose you are working on filtering st1, turning any extreme values into missings. The result can be achieved by getting SPSS to create new columns in turn for each stimulus, via the Transform > Compute facility. At the dialog box enter the name of the new column, e.g. st1f, top right as the target variable. Next enter the original column, st1, as the numeric expression. Next click on If and opt for Include if case satisfies condition. Then write the condition so that the scores you want to keep pass the condition. E.g.
st1 < (MEAN(st1,st2,st3,st4,st5,st6,st7,st8,st9,st10) + (2*(SD(st1,st2,st3,st4,st5,st6,st7,st8,st9,st10)))) AND st1 > (MEAN(st1,st2,st3,st4,st5,st6,st7,st8,st9,st10) -(2*(SD(st1,st2,st3,st4,st5,st6,st7,st8,st9,st10))))

The new column st1f will have missings where the data was extreme. Alter the statement to do st2, and so on in turn the same way.

b) The same sort of thing can be done to get response time data turned into missing values where the responses were wrong. If there is a separate set of columns sta1, sta2 etc. recording accuracy of response as 1 or 0 for each stimulus, then write the If condition for st1 simply as sta1 = 1.

After doing either of the above you will need to combine the columns for the relevant sets of items to create summary scores for cases for each condition (e.g. st1f through st10f). That usually has to be done by getting the mean for each person over the non-missing items that they have scores for.
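The same two-SD trimming and accuracy filtering can be sketched outside SPSS. Here is a minimal pandas illustration using the column names st1...st10 and sta1...sta10 from the description above; the response time and accuracy values are invented, and it then takes each person's condition mean over the surviving responses, mirroring the Compute logic.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_subjects, n_items = 20, 10

# Invented response times (ms) and accuracy (1 = correct) for one condition
rt = pd.DataFrame(rng.normal(700, 150, size=(n_subjects, n_items)),
                  columns=[f"st{i}" for i in range(1, n_items + 1)])
acc = pd.DataFrame(rng.integers(0, 2, size=(n_subjects, n_items)),
                   columns=[f"sta{i}" for i in range(1, n_items + 1)])

# Per-person mean and SD over the ten items of this condition
m = rt.mean(axis=1)
sd = rt.std(axis=1, ddof=1)

# Keep a response only if it is within 2 SDs of that person's mean...
within = rt.sub(m, axis=0).abs().le(2 * sd, axis=0)
# ...and only if the corresponding response was correct
correct = acc.to_numpy() == 1
filtered = rt.where(within & correct)   # inadmissible values become NaN (missing)

# Condition mean per person over the remaining (non-missing) responses
condition_mean = filtered.mean(axis=1, skipna=True)
print(condition_mean.head())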
