Porras, J and English, N (2004)

Data-Driven Approaches to Identifying Interviewer Falsification: The Case of Health Surveys

Proceedings of the American Statistical Association, Survey Research Methods Section, 4223-4228.

ISSN/ISBN: Not available at this time. DOI: Not available at this time.

Abstract: INTRODUCTION: Using data from a large-scale health study conducted by NORC, we looked to develop three data-driven approaches that aim to identify fabricated interviewer data. Along with being easy to implement and relatively inexpensive, we believe these methods also benefit from their non-intuitive underpinnings (to the interviewers), which make them difficult for the average interviewer to outsmart. The three approaches are briefly described below.

1. Goodness-of-fit to Benford. The leading digits of a random collection of distributions can frequently be approximated by a Benford distribution. Although the leading digits of our dataset did not conform to a Benford distribution, they did form a latent distribution to which the falsified data did not conform.

2. Lack of Variance. It is theorized that interviewers falsifying data tend to center their data around their "intuitive mean". By ranking the relative variances of interviewers' means, we demonstrate that the interviewer who falsified his data produced some of the lowest relative variances.

3. Unlikely Combinations. We hypothesize that falsifiers will occasionally outsmart themselves by recording item combinations that rarely occur in real data. One example would be heavy smokers who also get considerable quantities of vigorous exercise. If such items are present, it may be an indicator of falsification. The question, then, is what combinations to search for and how to determine if they are legitimate or not.

For the purposes of conducting our analysis, we obtained two sources of falsified cases. The first source of data came from one of the project interviewers. A portion of his data could not be validated, and it was then decided to remove all of his completed questionnaire data from the final dataset. For the second source of falsified data, we instructed five of our interviewers to generate falsified questionnaire data, producing a total of 50 falsified interviews. These two sets of falsified data provided the opportunity for us to test the three methods which we now present.
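As an illustration of the first approach only, the following is a minimal sketch of a leading-digit goodness-of-fit check. It is not the authors' implementation: the function names (`benford_expected`, `leading_digit`, `chi_square_leading_digits`) and the use of a Pearson chi-square statistic are assumptions made for this example. Because the abstract reports that the study data followed a latent leading-digit distribution rather than Benford's, the sketch accepts an optional reference distribution, which could be estimated from validated interviews.

```python
import math
from collections import Counter


def benford_expected():
    """Benford's law: P(d) = log10(1 + 1/d) for leading digits d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Return the first nonzero digit of a number, or None for zero."""
    # Strip leading zeros and the decimal point from the text form,
    # so 0.00123 -> "123" and 120 -> "120".
    s = str(abs(x)).lstrip("0.")
    return int(s[0]) if s else None


def chi_square_leading_digits(values, reference=None):
    """Pearson chi-square statistic of observed leading digits
    against a reference distribution (Benford by default).

    `reference` is a hypothetical parameter: a dict mapping digits 1..9
    to probabilities, e.g. estimated from validated interviews.
    """
    ref = reference or benford_expected()
    digits = [d for d in map(leading_digit, values) if d is not None]
    n = len(digits)
    if n == 0:
        raise ValueError("no nonzero values to test")
    counts = Counter(digits)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in ref.items())
```

A larger statistic for one interviewer's responses, relative to the others, would flag that interviewer for closer review; it would not by itself establish falsification.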

Reference Type: Conference Paper

Subject Area(s): Medical Sciences