Missing Data:
When people don’t answer survey questions
Elsewhere I’ve called missing data “the silent killer of valid inference.” It is a silent killer, and a problem that is easy to overlook, because, well . . . it’s missing, not there.
Missing data is a problem in all forms of research—from interviews to experiments—but it is an especially tricky and often overlooked issue in survey research. Kathy Godfrey recently posted (in an e-mail discussion) a very clear and concise description of the problem and how to handle it. Her comments are so helpful that, with her permission, I’m re-posting them here.
What you need to do when writing, conducting, and analyzing a survey is make sure you get the maximum amount of information from the people--probably a large number of people--who do not respond the way you had hoped. Kathy’s “four flavors” of non-response are actually four important variables, and a researcher can learn a lot by coding and analyzing them (a sketch of one way to do that follows her message).
Original Message:
Sent: 04-10-2013
From: Katherine Godfrey
Subject: Multiple imputation of "I don't know/ I don't remember"
Depending on the situation, there are at least four "flavors" of non-response, any one of which might be what's behind someone failing to answer a question (and this ignores the people who meant to respond, but simply goofed):
1. No Answer (deliberately not answering, and telling you so)
2. Does Not Apply (the question does not apply to respondent, and thus can't be answered)
3. Don't Know/Don't Remember (would answer if could, but can't)
4. No Preference/Don't Care (this is actually a real answer)
Here's an example to hopefully clarify:
Imagine a pollster asking people on the street, "Who will you vote for in the Senatorial election next week, Smith or Jones?" The following answers are all possible:
1. "None of your business!" (No Answer)
2. "I don't live in this state." (Does Not Apply)
3. "I haven't decided yet; I'm still sorting through the candidates' stands on the issues." (Don't Know)
4. "They're both the same; I may just toss a coin in the voting booth--or stay home." (No Preference)
Not to mention the people that just walk past the pollster, leaving him to wonder if they're deliberately ignoring him or simply didn't hear him.
The second category (Does Not Apply) can turn up as "structural zeroes" in frequency analysis contexts. The difference between "Don't Know" and "No Preference" is subtle, but I think it's real. The former says that there is (or was, or will be) an answer, but the respondent can't give it now. The latter says that there is a known answer, and the answer is not to have a particular feeling/opinion.
The "don't know/don't remember" answer is actually more informative than a missing (non-response) answer, since a non-response could be in any of these non-response categories. If possible, and if the sample size supports it, it could be used as a third category beyond a yes/no binary. (For example, perhaps people who say they don't remember ever driving drunk are definitely more likely to have had an auto accident in the last year than those who say "no," but also substantially less likely to have done so than those who say "yes.")
I first learned about these types of non-response from my mother, who worked in survey research. She was adamant that her surveys should include Don't Know and DNA (does not apply) options for questions, along with NA (no answer), to remove at least some of the reasons for people to feel that they had to leave a question blank because they couldn't or wouldn't answer. Partial information is better than the dreaded DNR (Did Not Reply), which tells you nothing.
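To make the coding point concrete, here is a minimal sketch, in Python with pandas, of how you might record Kathy's four flavors as distinct values rather than collapsing them all into one generic blank. The column values and category codes here are my own hypothetical choices, not anything from Kathy's message:

import pandas as pd

# Hypothetical poll responses: each non-response flavor gets its own code
# instead of a single blank/NaN that throws the distinctions away.
responses = pd.Series([
    "Smith", "Jones",
    "NO_ANSWER",        # refused, and said so
    "DOES_NOT_APPLY",   # e.g., doesn't live in the state
    "DONT_KNOW",        # an answer exists, but the respondent can't give it yet
    "NO_PREFERENCE",    # a real answer: indifference
    None,               # walked past the pollster: true non-response, reason unknown
    "Smith", "DONT_KNOW",
], dtype="object")

# Tabulate, keeping every flavor visible (dropna=False keeps the true blanks).
print(responses.value_counts(dropna=False))

# You can still collapse to substantive answers for analysis when needed,
# without having destroyed the information at the coding stage.
substantive = responses[responses.isin(["Smith", "Jones"])]
print(substantive.value_counts())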
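Kathy's "structural zeroes" remark can also be shown in a few lines. In a cross-tabulation of residency against candidate preference, the non-resident rows can never contain a Smith or Jones answer: those cells are zero by construction, not by sampling chance, which is why frequency models (log-linear models, for instance) treat them differently from ordinary sampling zeroes. The data below are fabricated purely to illustrate the table shape:

import pandas as pd

# Toy illustration only: made-up records, not real survey results.
df = pd.DataFrame({
    "lives_in_state": [True, True, True, False, False, True],
    "preference": ["Smith", "Jones", "Smith",
                   "DOES_NOT_APPLY", "DOES_NOT_APPLY", "Jones"],
})

# The (False, "Smith") and (False, "Jones") cells are structural zeroes:
# out-of-state passers-by cannot have a preference in this race.
print(pd.crosstab(df["lives_in_state"], df["preference"]))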
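Finally, a sketch of her drunk-driving point: keeping "don't remember" as a third category, instead of folding it into missing, lets you check whether it sits between "no" and "yes" on some outcome. Again, the records are fabricated solely to show the tabulation, not real results:

import pandas as pd

# Toy illustration only: made-up records, not real survey results.
df = pd.DataFrame({
    "ever_drove_drunk": ["no", "no", "dont_remember", "yes", "dont_remember",
                         "yes", "no", "dont_remember", "yes", "no"],
    "accident_last_year": [0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
})

# Three-level tabulation: "dont_remember" is kept as its own level,
# so any gradient between "no" and "yes" stays visible.
rate_by_answer = df.groupby("ever_drove_drunk")["accident_last_year"].mean()
print(rate_by_answer)

# Recoding dont_remember to missing would discard exactly this signal.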
By the same token, I cringe whenever I see computer-administered Likert-scale survey questions that not only have no option for not answering (or indicating non-applicability), but have an even number of response options, thus not even allowing for "no preference."