Sunday, January 25, 2015

Data cleaning meaning 'proofreading'

Data cleaning means the ‘proofreading’ of data to eliminate errors and inconsistencies in coding (the techniques used to organize raw data), according to Frankfort-Nachmias & Nachmias (2008). Data cleaning, also known as data cleansing, is an integral part of data processing and should take place before the collected data are analyzed. With the development of efficient software, computers now perform most data cleaning (Frankfort-Nachmias & Nachmias, 2008). Missing or poor-quality data can take many forms, such as response bias, careless responses, or no response at all. According to Meade & Craig (2012), internet surveys, especially in cases of ‘obligatory participation’, can yield data whose quality is a concern. They report that ‘careless responses’ can be reduced by using identified rather than anonymous responses.
Missing data is a common concern in multivariate studies, as reported by Little (1988), and raises the question of whether the data are ‘missing completely at random’ (MCAR) or whether the missingness is related to some of the variables. Little (1988) suggests that comparing the means of each variable between groups of cases, grouped by which values are missing, can help assess whether the data are MCAR. Schafer & Olsen (1998) also highlight the possibility of missing data in a multivariate study. They add that ‘new computational algorithms and software’ have enabled researchers to create proper imputations for multivariate studies. Their study describes the ‘multiple imputation’ technique, which replaces each missing value with m > 1 plausible values and combines the estimates from the resulting completed data sets. Bourque & Clark (1992) make the interesting point that data preparation is more of an art when compared to the science of hypothesis testing.
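The idea of comparing means between groups can be sketched in a few lines of Python. This is only an informal check, not Little's formal test statistic: it splits cases by whether one variable is missing and compares the mean of another variable across the two groups. The function name, the variable names (age, income) and the survey rows are all hypothetical and chosen purely for illustration.

```python
from statistics import mean

def compare_means_by_missingness(rows, target, other):
    """Split rows by whether `target` is missing, then compare the mean of `other`."""
    observed = [r[other] for r in rows
                if r[target] is not None and r[other] is not None]
    missing = [r[other] for r in rows
               if r[target] is None and r[other] is not None]
    if not observed or not missing:
        return None  # one group is empty, so there is nothing to compare
    return {
        "mean_when_observed": mean(observed),
        "mean_when_missing": mean(missing),
        "difference": mean(missing) - mean(observed),
    }

# Hypothetical survey rows; None marks a missing answer.
rows = [
    {"age": 25, "income": 30000},
    {"age": 31, "income": 42000},
    {"age": 47, "income": None},
    {"age": 52, "income": None},
    {"age": 29, "income": 38000},
    {"age": 58, "income": None},
]

result = compare_means_by_missingness(rows, target="income", other="age")
print(result)
```

A large difference in mean age between respondents who did and did not report income would suggest that the missing values are related to age, and so are probably not MCAR. A small difference is consistent with MCAR but does not prove it.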
One example of data cleaning is dealing with outliers: values or data points considered to be far outside the norm for a variable, which some researchers define as values that deviate so much that they arouse suspicion (Osborne & Overbay, 2012). Outliers can have adverse effects on data analysis. They can be handled either by eliminating them, if they are judged to be errors in the data, or by going back and examining the original responses (Osborne & Overbay, 2012).
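As a minimal sketch of how suspicious values might be flagged for review, the Python below uses the common interquartile-range rule (values more than 1.5 times the IQR beyond the quartiles). That rule is an assumption made here for illustration; it is not the method of Osborne & Overbay, and the function name and sample values are hypothetical. The code only flags values, since deciding whether to drop them still requires looking at the original responses.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical responses to one survey item; 40 may be a typing error for 4.
responses = [3, 4, 4, 5, 5, 6, 6, 40]

flagged = flag_outliers(responses)
print(flagged)
```

Each flagged value can then be checked against the original questionnaire and removed only if it turns out to be an error.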
Hypothesis testing is directly connected to data analysis (Bourque & Clark, 1992). In the rush to test a hypothesis, researchers often do an incomplete job of data analysis and then have to reprocess the data repeatedly to bring it into usable form. In our research study, the second hypothesis states that individuals who grow up in a cross-cultural home with immigrant parents experience less success in life than their peers. To illustrate this relationship
References
Bourque, L. B., & Clark, V. (1992). Processing data: The survey example (No. 85). Sage.
Frankfort-Nachmias, C., & Nachmias, D. (2008). Research methods in the social sciences (7th ed.). New York: Worth.
Little, R. J. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83(404), 1198-1202.
Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17(3), 437.
Osborne, J. W., & Overbay, A. (2012). Best practices in data cleaning. Sage.
Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data problems: A data analyst's perspective. Multivariate Behavioral Research, 33(4), 545-571.

