A Research Process for Educators
Posted: Tuesday, October 13, 2009
by Jon Gresham
civilsociety.Seedwiki
A General Process of Statistical Quantitative Analysis
(I will appreciate help to rework this content to make it clearer)
Favorite statistical book: Discovering Statistics by Andy Field
What are the main Research Questions to be answered? (What will you answer?)
What are the most important dependent variables that you want to investigate?
What are the primary hypotheses? (What questions will help you answer the main question?)
What are the main alternative hypotheses? (What are other questions?)
These steps can be done in as much or little detail as needed or of interest.
(Adapted from Marcolin's Ph.D. Statistics Course, University of Calgary)
The simplified steps in analysis of new data:
1. Select data that is within normal range and of an appropriate measurement scale.
Missing or incorrect data is either corrected or set aside to work on later.
2. Look for unimportant data patterns that can be left out of first studies of relationships
3. Look for the most important correlations (these may suggest, but do not prove, cause and effect)
4. Measure how much effect variables have on each other
Analysis Procedures in Detail
From simple to complex
(Not necessarily always to be done in exactly this order)
1. Review raw data. Note missing, duplicate, or obviously "out of normal range" items.
Check spelling, obviously incorrect entries.
I sort data on one variable at a time for write-in responses to look for patterns in responses,
and then simplify types of responses into fewer categories that can be quantified.
For example, "clergy," "mullahs," "ayatollahs" might all be simplified into the single category of "clergy."
**As you examine entries, save a new copy of the data after each change, especially in SPSS.**
-Add labels to make clear what answers are given, and then standardize answers appropriate to those labels.
-Create a "Code Book" giving all possible responses for each question. I usually try to set up the code book even before I start collecting data; this allows me to understand what are the critical issues to focus on during the survey and analysis processes. I create a response matrix to show what types of responses come to which items.
-Recode data to match flow of analysis and comparisons. For example, Arabic survey responses have right to left answers; English surveys use left to right. These need to match in the analysis. If some scales are 1-5 and others are 1-10, these may need to match, especially when you want to compare them directly with each other.
Nonlinear Principal Component Analysis (NLPCA) can help standardize the units or scale of measurement.
-New variables can be created for composites or means or other combinations of original variables.
For example, if I have three questions about the benefits of return migration, I will look at each one individually but can also create a new variable "average migration benefit" to give fewer variables (as indices or dimensions) to compare with non-migration questions, and it could represent the average response to the original three questions. Dealing with one representative question is much easier initially than dealing with three different questions. You can go back later to look deeper at each individual question. This allows you to later drop out of the analysis those variables that do not seem to relate to the main question that you want to answer. I use the 0.1 or 0.5 level of component loading, thus keeping only those variables that deal with 99% or 95% of the differences in the answers.
NLPCA can facilitate this, as can factor analysis, by reducing a large number of variables into fewer common variables that can represent fairly accurately the responses to the individual variables.
Cluster Analysis is used to group items, such as respondents who scored above the average and those who scored below it. k-means clustering assigns each response to the group whose mean response is closest to it. (http://www.statsoft.com/textbook/stcluan.html#k)
Logistic Regression allows you to predict the probability of getting one of two possible responses based on the answers to other variables. It can be run twice on the clustered responses (the dependent outcome variables, which are what you really want to measure): once with the original variables as predictors and once with the index/dimension variables (what you think might be causing the outcomes). Comparing the two models shows which set of independent variables is more accurate in predicting change in the dependent variables. (http://udel.edu/~mcdonald/statlogistic.html)
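The recoding and composite-variable ideas in this step can be sketched in Python. This is only an illustration under assumed names: the category map, the question labels q1 to q3, and the sample answers are hypothetical, not taken from a real survey.

```python
from statistics import mean

# Hypothetical map from write-in answers to simplified categories.
CATEGORY_MAP = {"clergy": "clergy", "mullahs": "clergy", "ayatollahs": "clergy"}

def simplify(answer):
    """Normalize a write-in answer and map it to a broader category."""
    key = answer.strip().lower()
    return CATEGORY_MAP.get(key, key)

def rescale(value, old_min, old_max, new_min, new_max):
    """Linearly map a score from one scale (e.g. 1-10) onto another (e.g. 1-5)."""
    return new_min + (value - old_min) * (new_max - new_min) / (old_max - old_min)

def composite(row, items):
    """Average several related questions into one index variable."""
    return mean(row[item] for item in items)

# Hypothetical respondent: three migration-benefit questions on a 1-5 scale.
respondent = {"q1": 4, "q2": 5, "q3": 3}
respondent["avg_migration_benefit"] = composite(respondent, ["q1", "q2", "q3"])

print(simplify(" Mullahs "))
print(rescale(10, 1, 10, 1, 5))
print(respondent["avg_migration_benefit"])
```

Keeping the original q1 to q3 values alongside the new index makes it possible to go back later and look at each question individually.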
Other important first steps.
-Decide what to do with missing data: leave it empty, replace it with null or zero, impute it (estimate an appropriate value, such as the average response or the most common (modal) response), or code it as 999 to mark that it needs further attention later.
- Run simple tests first to eliminate at the beginning those variables not likely to provide much value for the time available to invest. The whole process is to see which factors cause important consistent changes in the dependent variables.
-Run Frequencies to understand the Number of responses, Range, Mean (average), median, variance, percentiles, etc. This will raise your awareness of which questions may NOT be of interest in the first exploration.
-Recode obvious answers so that all scale data answers will in general reflect the same direction: 1 = a lowest level of agreement and 5 = the highest level of agreement (according to whatever scale is suitable for your data).
-Run a cross-tab, a correlation matrix, or chi-square statistics to get a first glimpse of relationships between responses to questions. The initial significance estimates from these tests may not hold up during later, higher-level comparisons, but it is good to see what might be hidden in the data.
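A minimal sketch of these first steps, assuming missing answers were coded as 999 and the items use a 1-5 scale. The function names and the sample answers are hypothetical.

```python
from collections import Counter
from statistics import mean, median

MISSING = 999  # code used to flag answers that need attention later

def valid(values):
    """Drop missing-data codes before computing any statistic."""
    return [v for v in values if v != MISSING]

def reverse_code(value, low=1, high=5):
    """Flip a scale item so that a high score always means high agreement."""
    return low + high - value

def frequencies(values):
    """Return basic descriptive statistics for one question."""
    clean = valid(values)
    return {
        "n": len(clean),
        "missing": len(values) - len(clean),
        "range": (min(clean), max(clean)),
        "mean": mean(clean),
        "median": median(clean),
        "counts": dict(Counter(clean)),
    }

# Hypothetical answers to one 1-5 agreement question.
answers = [1, 2, 2, 5, 999, 4, 2, 999]
summary = frequencies(answers)
print(summary)
print([reverse_code(v) for v in valid(answers)])
```

Filtering out the 999 code before any calculation matters: left in place, it would be treated as a real score and distort the mean and range.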
2. Review Tests of Assumptions on the data. Test for normality, homoscedasticity, linearity, and correlated errors, and consider data transformations. These give an indication of the overall distribution of responses.
That is, if nearly everyone answered the same way, then the responses are skewed and may not be treated as if they came from a "normal" range of respondents who might give a variety of answers to any one question.
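One quick, informal check on the shape of the responses is to compute skewness and excess kurtosis, both of which are close to zero for normally distributed data. The sketch below uses only the Python standard library; the two sample lists are hypothetical, and this is a screening aid rather than a formal normality test.

```python
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: 0 for a perfectly symmetric distribution."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("all answers are identical; skewness is undefined")
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

def excess_kurtosis(values):
    """Population kurtosis minus 3: 0 for a normal distribution."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("all answers are identical; kurtosis is undefined")
    return sum((v - m) ** 4 for v in values) / (len(values) * s ** 4) - 3

# Hypothetical responses on a 1-5 scale.
spread_out = [1, 2, 3, 4, 5]
piled_up = [5, 5, 5, 5, 4, 5, 5, 1]  # nearly everyone answered the same way

print(skewness(spread_out), excess_kurtosis(spread_out))
print(skewness(piled_up), excess_kurtosis(piled_up))
```

A strongly negative or positive skewness, as in the second list, is a signal to consider a transformation or a non-parametric test.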
3. Hypothesis testing. Define which hypotheses are to be tested, and how.
Test assumptions by:
a. Standard Deviation, Standard Error; add a Z test of the mean if it is of interest.
b. One-tailed and two-tailed tests on the data curves
c. Alpha (α): the probability of rejecting H0 when H0 is actually true. How much chance is there of reporting a difference between groups that is not real?
d. Power (1 - β): β is the probability of not rejecting H0 when H0 is actually false. How much chance is there of detecting a difference that really exists?
e. Define how to reach conclusions and how to report which findings.
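The relation between alpha, beta, and power can be made concrete with a small calculation. The sketch below assumes the simplest case, a two-sided one-sample Z test with a known standard deviation; the function name and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sd, n, alpha=0.05):
    """Power of a two-sided one-sample Z test.

    effect: true difference between the population mean and the H0 mean
    sd:     known population standard deviation
    n:      sample size
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    shift = effect * sqrt(n) / sd               # effect in standard-error units
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)

alpha = 0.05
power = z_test_power(effect=0.5, sd=1.0, n=30, alpha=alpha)
beta = 1 - power
print(round(power, 3), round(beta, 3))
```

When the true effect is zero, the function returns alpha itself, which is a useful sanity check: with no real difference, the only way to reject H0 is by mistake.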
4. T-Tests are used to compare groups (sets) of data
a. Paired tests (the same respondents measured twice, such as before and after a training course)
b. Unpaired tests (independent groups, such as males and females, children and adults, literate and illiterate)
c. Compare means of paired data within one group
d. Compare Paired and Unpaired t-tests
e. Compare Parametric versus Non-parametric data
Other useful ways to compare groups:
Wilcoxon Rank Sum W Test (Mann-Whitney U Test) (it is like an unpaired t-test)
Chi-square (χ²)
Fisher Exact Test (2 x 2)
Paired Sign Test
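The difference between paired and unpaired t-tests shows up directly in how the t statistic is computed. The sketch below is a minimal illustration with hypothetical scores; it uses Welch's unequal-variance form for the unpaired case (my choice, not something the course notes specify) and returns only the statistic, since turning it into a p-value needs a t table or a statistics package.

```python
from math import sqrt
from statistics import mean, stdev, variance

def paired_t(before, after):
    """t statistic for paired data: the same respondents measured twice."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def welch_t(group_a, group_b):
    """t statistic for two independent groups (Welch's unequal-variance form)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical 1-5 scores for five respondents, before and after training.
before = [2, 3, 3, 4, 2]
after = [3, 4, 3, 5, 4]

print(paired_t(before, after))  # treats the lists as matched pairs
print(welch_t(after, before))   # treats the lists as unrelated groups
```

On the same numbers the paired statistic is larger, because pairing removes the variation between respondents and looks only at each person's change.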
5. F-Test (such as Levene)
1-way Analysis of Variance (ANOVA) to see which variables are most likely to give reliable comparisons
Post-hoc tests are then added to do further analyses on the results
Then n-way/multivariate analyses of variance
Then n-Way Analysis of covariance.
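The F statistic behind a 1-way ANOVA compares the variation between group means with the variation inside the groups. A minimal sketch with hypothetical groups follows; it returns the statistic and its degrees of freedom, and leaves the p-value to an F table or a statistics package.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical scores for three groups of respondents.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f_stat, df_between, df_within)
```

A large F only says that at least one group differs; the post-hoc tests are what identify which groups differ from which.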
6. Chi-square test statistic
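As a minimal sketch, the Pearson chi-square statistic for a cross-tab can be computed from observed and expected counts as below. The table values are hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a cross-tab."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical 2 x 2 cross-tab: rows are two groups, columns are yes/no answers.
table = [[10, 20], [20, 10]]
stat, dof = chi_square(table)
print(stat, dof)
```

With 1 degree of freedom the critical value at the 0.05 level is 3.841, so a statistic above that suggests the two questions are related. For small expected counts in a 2 x 2 table, the Fisher Exact Test is the safer choice.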
7. Power
8. Use the Sign Test or a Wilcoxon Pairs Test or a Wilcoxon Signed Rank Test
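The Sign Test is simple enough to compute by hand: count how many paired differences are positive and how many are negative, drop the ties, and ask how likely such a split is if positive and negative changes were equally probable. A minimal sketch with hypothetical before and after scores:

```python
from math import comb

def sign_test(before, after):
    """Return (positives, negatives, two-sided exact p-value) for paired data."""
    diffs = [a - b for a, b in zip(after, before) if a != b]  # ties are dropped
    n = len(diffs)
    positives = sum(1 for d in diffs if d > 0)
    negatives = n - positives
    k = min(positives, negatives)
    # Binomial tail probability with p = 0.5, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return positives, negatives, min(1.0, 2 * tail)

# Hypothetical 1-5 scores for six respondents, measured twice.
before = [2, 3, 3, 4, 2, 3]
after = [3, 4, 3, 5, 4, 2]
result = sign_test(before, after)
print(result)
```

Because it uses only the direction of each change, the Sign Test makes no assumption about the distribution of the scores, at the cost of less power than a paired t-test or a Wilcoxon Signed Rank Test.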
9. Factor Analysis. This is a big leap in seeing relations between individual and groups of items because it shows patterns of correlation. Comparisons can be narrowed to only "Main Effects" to focus only on the few priority variables.
10. Correlation matrix, again, using groups of variables to look for patterns.
11. 1-Way Analysis of Variance (ANOVA) to look at only one variable at a time to see how it relates to any other variable. That is, for a given variable, such as level of income of businessmen in Iraq, see which other variables change as the level of income changes. Post-hoc tests can be set to automatically check other characteristics (mentioned above) of only those variables which show a significant relationship to other variables. This step is one of the most important because it allows the researcher to focus energy on variables that seem to have a significant impact on other variables. If a variable has no relation to any other key dependent variable, then you do not want to waste time on it at the beginning.
12. n-Way ANOVA and ANCOVA. Look at how clusters of variables affect the dependent variables. For example, do men who wear blue shirts and have high income and live in the center of Basra city tend to have more trust in the government officials than do men who wear white shirts and have low income and live in the suburbs of Basra?
Again, this sequence of work is not essential to follow exactly.
Take a bit of time to think about the great contribution you can make to the science of teaching and training. You, too, can be a scientist in the classroom.
Jon Gresham, Ph.D.