So say you’re looking to measure depression. It’s very important that the people in your research are depressed, rather than anxious or sad or grieving, but you don’t have time or money to pay a psychiatrist to spend hours interviewing each person who requests participation in your study.
Being an enterprising sort, you decide to create a scale—a questionnaire that you can distribute to everyone who responds to your generic Participate In Interesting Research About Stuff flier. Participants who score in the Depressed Zone will then get an interview with a psychiatrist, thus decreasing the total number of hours of her time that you pay for. You understand Likert scales and careful item selection, and you run a few pilot tests, and in the end, you have this:
We now take a brief detour to explain reverse scoring. Some of the items (psychspeak for the individual questions/statements) are scored backwards. Answering ‘Strongly Disagree’ to items #1 and #2 would be in direct conflict with strong disagreement with item #3. So to score this test, we don’t just count up the number of answers in each category—we reverse the coding system for some items. People who agree that they are equal to others, have a number of good qualities, and disagree that they’re a failure all go in the Probably Not Depressed basket. People who don’t agree with the first two statements and agree with the third go in the Probably Depressed, Seek Help basket.
This is a common technique to force the participant to read each question, and it gives additional information to researchers. In a questionnaire without reverse coding, when someone has Strongly Agreed with every statement, they could have actually agreed strongly with each component (suggesting they’re severely depressed). But they could also be one of those jerks who just answered every question the same. Reverse coding controls for jerks.
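A minimal sketch of how reverse scoring works in practice. Everything here is illustrative: the 4-point coding (1 = Strongly Disagree through 4 = Strongly Agree), the three-item scale, and which item is reverse-keyed are assumptions for the example, not the published scale’s actual scoring.

```python
# Illustrative reverse scoring on a 4-point Likert scale.
# Assumes responses are coded 1 (Strongly Disagree) .. 4 (Strongly Agree);
# the items and keying are made up for this sketch.

SCALE_MAX = 4  # highest response code on the scale

def reverse(score):
    """Flip a response so 1 <-> 4 and 2 <-> 3."""
    return SCALE_MAX + 1 - score

def total_score(responses, reverse_keyed):
    """Sum the responses, flipping the reverse-keyed items first."""
    return sum(
        reverse(r) if i in reverse_keyed else r
        for i, r in enumerate(responses)
    )

# A jerk who answers 'Strongly Agree' to all three items no longer
# maxes out the score: with item index 2 reverse-keyed, 4+4+4 becomes
# 4 + 4 + reverse(4) = 4 + 4 + 1 = 9 instead of 12.
jerk = [4, 4, 4]
print(total_score(jerk, reverse_keyed={2}))  # prints 9
```

A straight-line response pattern now lands in the middle of the range instead of at an extreme, which is exactly the tell researchers are looking for.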
But enough about this picky detail of psychometric design. You have a measure for depression!
Except, this isn’t a measure of depression. It’s a measure of self-esteem. Rosenberg’s Self-Esteem Scale, in fact.
Which brings me to the other picky detail of psychometric design I want to talk about: face validity. I’ve usually heard face validity explained as the answer to this question:
On the face of it, does this scale measure the actual thing we’re trying to measure?
Having low self-esteem does correlate with depression, sure. We might even go as far as to say that lack of self-esteem is often a component of being depressed. But measuring self-esteem is not the same as measuring depression. Okay, but mental health is fairly fuzzy in terms of definitions. Let’s get more concrete.
Having low socioeconomic status (SES) tracks very closely with poor diet. If you find a group of people with very low income, they’re almost definitely going to have poor nutrition. But if you spend ten minutes asking adults about their monthly income, you have not collected data on their nutritional intake. You’ve got priors and you can speculate with some amount of surety, but the thing you have measured is still monetary. Likewise, if you write a paper about the relationship between income and method of transportation, and your data for income sounds like “eats more than two fruits per day,” the reviewers will giggle and write sarcastic notes when they return your study.
Returning to Rosenberg’s Self-Esteem Scale, what we have is something that seems to be face invalid. If you have some amount of psychopathology (psychspeak for mental illness) training and you’re asked whether the scale above seems to measure depression, you’d probably say no, or not quite.
And of course, you could do some amount of empirical testing of the scale, too. You could see if it correlated with other measures of depression: with Beck’s Depression Inventory, or the HAM-D. You could see if it was uncorrelated with things that aren’t depression. You don’t want your measure of depression to be correlated with being sad (a brief state, where depression is more like a trait). But in the end, it’s possible that both of these could be true, and you’d still have a measure that answers in terms of self-esteem, rather than depressiveness. That’s where face validity comes in. Is it all those things: uncorrelated with unrelated concepts, correlated with related concepts and measures, and does it sound like it’ll measure the thing we want?
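That empirical check is just a pair of correlations. Here’s a sketch of what it might look like: the scale names refer to real instruments, but every number below is invented purely to illustrate the comparison, not actual study data.

```python
# Toy convergent/discriminant validity check.
# All scores below are fabricated for illustration only.
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation coefficient, computed by hand."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical participants' scores on our new scale and two comparisons
new_scale = [10, 14, 18, 22, 30, 8, 25, 16]
beck_bdi  = [12, 15, 20, 24, 33, 9, 27, 17]  # should correlate (convergent)
sadness   = [3, 30, 12, 7, 22, 15, 5, 28]    # should not (discriminant)

print(f"vs. BDI:     r = {pearson(new_scale, beck_bdi):.2f}")  # high
print(f"vs. sadness: r = {pearson(new_scale, sadness):.2f}")   # near zero
```

A high correlation with the established depression measure and a near-zero one with transient sadness is the pattern you’d want—but as the next paragraph argues, passing both checks still doesn’t settle the face validity question.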
And…I promised myself I would make this post more than just geeking out about psychometrics, so here we are. I think the issue of face validity is what a bunch of non-specialist arguments over psychology boil down to. Some people look at an IQ test and say, no, that’s not measuring intelligence, intelligence clearly includes all these other components! To which pro-IQ test people fold their arms and glare back with no, that stuff is outside the concept-space of intelligence, this is just measuring intelligence. Or, “that’s not measuring for ADHD, it’ll catch any hyper kid!” vs. “This is about ADHD, those kids you think are ‘just hyper’ have pathological attentional issues!” And at some further point, everyone’s stomping around arguing about which is the map and which is the territory and I start to have sympathy for Szasz.
Related: Streetlight Psychology
Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.