Showing posts with label Methods.

Margin of Error – Mostly Error

February 14, 2016
Posted by Jay Livingston

It’s the sort of social “science” I’d expect from Fox, not Vox. But today, Valentine’s Day, Vox (here) posted this map purporting to show the average amount people in each state spent on Valentine’s Day.



“What’s with North Dakota spending $108 on average, but South Dakota spending just $36?” asks Vox. The answer is almost surely: Error.

The sample size was 3,121. If they sampled each state in its proportion of the US population, the sample in each Dakota would be about n = 8 (I originally wrote n = 80 – see the update below). The source of the data, Finder, does not report any margins of error or standard deviations, so we can’t know. Possibly, a couple of guys in North Dakota who’d saved their oil-boom money and spent it on chocolates are responsible for that average. Idaho, Nevada, and Kansas – the only other states over the $100 mark – are also small-n. So are the states at the other end, the supposedly low-spending states (SD, WY, VT, NH, ME, etc.). So we can’t trust these numbers.
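Back-of-the-envelope, the allocation arithmetic looks like this (a quick Python sketch; the population figures are rough 2015 estimates I’m supplying, not numbers from the Finder survey):

    # How many respondents a proportionally allocated national sample of 3,121
    # would put in each Dakota. Population figures are approximate 2015 estimates.
    total_sample = 3121
    us_pop = 321_000_000              # approximate US population
    state_pops = {
        "North Dakota": 757_000,      # approximate
        "South Dakota": 858_000,      # approximate
    }

    for state, pop in state_pops.items():
        n_state = total_sample * pop / us_pop
        print(f"{state}: expected n of about {n_state:.0f}")
    # North Dakota: expected n of about 7
    # South Dakota: expected n of about 8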

The sample in the states with large populations (NY, CA, TX, etc.) might have been as high as 300-400, possibly enough to make legitimate comparisons, but the differences among them are small – less than $20.

My consultant on this matter, Dan Cassino (he does a lot of serious polling), confirmed my own suspicions. “The study is complete bullshit.”

UPDATE February 24, 2016: Andrew Gelman (here) downloaded the data and did a far more thorough analysis, estimating the variation for each state. His graph of the states shows that even between the state with the highest mean and the state with the lowest, the uncertainty is too great to allow for any conclusions: “Soooo . . . we got nuthin’.”

Andrew explains why it’s worthwhile to do a serious analysis even on frivolous data like this Valentine-spending survey. He also corrects my order-of-magnitude overestimation of the North Dakota sample size. 

Too Good to Be True

January 26, 2016
Posted by Jay Livingston


Some findings that turn up in social science research look too good to be true, as when a small change in inputs brings a large change in outcomes. Usually the good news comes in the form of anecdotal evidence, but systematic research too can yield wildly optimistic results.

Anecdotal evidence?  Everyone knows to be suspicious, even journalists. A Lexis-Nexis search returns about 300 news articles just in this month where someone was careful to specify that claims were based on “anecdotal evidence” and not systematic research.

Everywhere else, the anecdotal-systematic scale of credibility is reversed. As Stalin said, “The death of a million Russian soldiers – that is a statistic. The death of one Russian soldier – that is a tragedy.” He didn’t bother to add the obvious corollary: a tragedy is far more compelling and persuasive than is a statistic.

Yet here is journalist Heather Havrilesky in the paper of record reviewing Presence, a new book by social scientist Amy Cuddy:

This detailed rehashing of academic research . . . has the unintended effect of transforming her Ph.D. into something of a red flag.

Yes, you read that correctly. Systematic research supporting an idea is a bright red warning sign.

Amy Cuddy, for those who are not among the millions who have seen her TED talk, is the social psychologist (Ph.D. Princeton) at the Harvard Business School who claims that standing in the Wonder Woman “power pose” for just two minutes a day will transform the self-doubting and timid into the confident, assertive, and powerful. Power posing even changes levels of hormones like cortisol and testosterone.


Havrilesky continues.

While Cuddy’s research seems to back up her claims about the effects of power posing, even more convincing are the personal stories sent to the author by some of the 28 million people who have viewed her TED talk. Cuddy scatters their stories throughout the book. . . .

Systematic research is OK for what it is, Havrilesky is saying, but the clincher is the anecdotal evidence. Either way, the results fall into the category of “Amazing But True.”

Havrilesky was unwittingly closer to the truth with that “seems” in the first clause. “Cuddy’s research seems to back up her claims . . . ” Perhaps, but research done by people other than Cuddy and her colleagues does not.  As Andrew Gelman and Kaiser Fung detail in Slate, the power-pose studies have not had a Wonder Woman-like resilience in the lab. Other researchers trying to replicate Cuddy’s experiments could not get similarly positive results.

But outside the tiny world of replication studies, Cuddy’s findings have had a remarkable staying power considering how fragile* the power-pose effect was. The problem is not just that the Times reviewer takes anecdotal evidence as more valid. It’s that she is unaware that contradictory research was available. Nor is she unique in this ignorance. It pervades reporting even in serious places like the Times. “Gee whiz science,” as Gelman and Fung call it, has a seemingly irresistible attraction, much like anecdotal evidence. Journalists and the public want to believe it; scientists want to examine it further.

Our point here is not to slam Cuddy and her collaborators. . . . And we are not really criticizing the New York Times or CBS News, either. . . . Rather, we want to highlight the yawning gap between the news media, science celebrities, and publicists on one side, and the general scientific community on the other. To one group, power posing is a scientifically established fact and an inspiring story to boot. To the other, it’s just one more amusing example of scientific overreach.

I admire Gelman and Fung’s magnanimous view. But I do think that those in the popular press who report about science should do a little skeptical fact-checking when the results seem too good to be true, for too often these results are in fact too good to be true.

---------------------
* “Fragile” is the word used by Joe Simmons and Uri Simonsohn in their review and replication of Cuddy’s experiments (here).

B is for Beauty Bias

January 6, 2016
Posted by Jay Livingston

The headlines make it pretty clear.
Attractive Students Get Higher Grades, Researchers Say

That’s from Newsweek. Slate copied Scott Jaschik’s piece, “Graded on Looks,” at Inside Higher Ed and gave it the title “Better-Looking Female Students Get Better Grades.”

But how much higher, how much better?

For female students, an increase of one standard deviation in attractiveness was associated with a 0.024 increase in grade (on a 4.0 scale).

The story is based on a paper by Rey Hernández-Julián and Christina Peters presented at the American Economic Association meetings. 

You can read the IHE article for the methodology. I assume it’s solid. But for me the problem is that I don’t know if the difference is a lot or if it’s a mere speck of dust – statistically significant dust, but a speck nevertheless. It’s like the 2007 Price-Wolfers research on fouls in the NBA. White refs were more likely to call fouls on Black players than on Whites. Andrew Gelman (here), who is to statistics what Steph Curry is to the 3-pointer, liked the paper, so I have reservations about my reservations. But the degree of bias it found came to this: if an all-Black NBA team played a very hypothetical all-White NBA team in a game refereed by Whites, the refs’ unconscious bias would result in one extra foul called against the all-Blacks. 

I have the same problem with this beauty-bias paper. Imagine a really good-looking girl, one whose beauty is 2½ standard deviations above the mean – the beauty equivalent of an IQ of 137. Her average-looking counterpart with similar performance in the course gets a 3.00 – a B. But the stunningly attractive girl winds up with a 3.06 – a B.

The more serious bias reported in the paper is the bias against unattractive girls.

[For] the least attractive third of women, the average course grade was 0.067 grade points below those earned by others.

It’s still not enough to lower a grade from B to B-, but perhaps the bias is greater against girls who are in the lower end of that lower third. The report doesn’t say.
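Here is the same arithmetic in a few lines, using only the coefficients reported above (a quick Python sketch; the 0.33-point gap between letter grades is my own assumption about a typical 4.0 plus/minus scale):

    # Putting the reported coefficients on a transcript scale.
    effect_per_sd = 0.024           # grade points per SD of attractiveness (women)
    penalty_bottom_third = 0.067    # gap for the least attractive third of women

    beauty_bonus = 2.5 * effect_per_sd   # a student 2.5 SDs above the mean
    print(f"2.5-SD beauty bonus: {beauty_bonus:.2f} grade points")        # 0.06
    print(f"Bottom-third penalty: {penalty_bottom_third} grade points")   # 0.067
    # Both fall well short of the roughly 0.33 grade points that separate
    # a B (3.00) from a B- (2.67) on a typical 4.0 scale.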

Both these papers, basketball and beauty, get at something dear to the liberal heart – bias based on physical characteristics that the person has little power to change. And like the Implicit Association Test, they reveal that the evil may lurk even in the hearts and minds of those who think they are without bias. But if one foul in a game or one-sixth of a + or – on your letter grade is all we had to worry about, I’d feel pretty good about the amount of bias in our world.

[Personal aside: the research I’d like to see would reverse the variables. Does a girl’s academic performance in the course affect her beauty score? Ask the instructor on day one to rate each student on physical attractiveness. Then ask him to rate them again at the end of the term. My guess is that the good students will become better looking.]

Men Are From Mars, Survey Respondents Are From Neptune

November 22, 2015
Posted by Jay Livingston

Survey researchers have long been aware that people don’t always answer honestly. In face-to-face interviews especially, people may mask their true opinion with the socially desirable response. Anonymous questionnaires have the same problem, though perhaps to a lesser degree. But self-administered surveys, especially the online variety, have the additional problem of people who either don’t take it seriously or treat it with outright contempt. Worse, as Shane Frederick (Yale, management) discovered, the proportion of “the random and the perverse” varies from item to item.

On open-ended questions, when someone answers “asdf” or “your mama,” as they did on an online survey Frederick conducted, it’s pretty clear that they are making what my professor in my methods class called “the ‘fuck you’ response.”

But what about multiple-choice items?
Is 8+4 less than 3? YES / NO
11% Yes.
Maybe 11% of Americans online really can’t do the math.  Or maybe all 11% were blowing off the survey. But then what about this item?

Were you born on the planet Neptune? YES / NO
17% Yes
     
Now the ranks of the perverse have grown by at least six percentage points, probably more. Non-responders, the IRB, and now the random and the perverse – I tell ya, survey researchers don’t get no respect.

----------------
Big hat tip to Andrew Gelman. I took everything in this post from his blog (here), where commenters try seriously to deal with the problem created by these kinds of responses.

Evidence vs. Bullshit – Mobster Edition

September 21, 2015
Posted by Jay Livingston

Maria Konnikova is a regular guest on Mike Pesca’s podcast “The Gist.”  Her segment is called “Is That Bullshit.” She addresses pressing topics like
  • Compression sleeves – is that bullshit?
  • Are there different kinds of female orgasm?
  • Are artificial sweeteners bad for your health?
  • Does anger management work?
We can imagine all kinds of reasons why compression sleeves might work or why diet soda might be unhealthful, but if you want to know if it’s bullshit, you need good evidence. Which is what Konnikova researches and reports on.

Good evidence is also the gist of my class early in the semester. I ask students whether more deaths are caused each year by fires or by drownings. Then I ask them why they chose their answer. They come up with good reasons. Fires can happen anywhere – we spend most of our time in buildings, not so much on water. Fires happen all year round; drownings are mostly in the summer. A fire may kill many people, but group drownings are rare. The news reports a lot about fires, rarely about drownings. And so on.

The point is that for a good answer to the question, you need more than just persuasive reasoning. You need someone to count up the dead bodies. You need the relevant evidence.

“Why Do We Admire Mobsters?” asks Maria Konnikova recently in the New Yorker (here).  She has some answers:
  • Prohibition: “Because Prohibition was hugely unpopular, the men who stood up to it [i.e., mobsters] were heralded as heroes, not criminals.” Even after Repeal, “that initial positive image stuck.”
  • In-group/ out-group: For Americans, Italian (and Irish) mobsters are “similar enough for sympathy, yet different enough for a false sense of safety. . .  Members of the Chinese and Russian mob have been hard to romanticize.”
  • Distance: “Ultimately the mob myth depends on psychological distance. . .  As painful events recede into the past, our perceptions soften. . . . Psychological distance allows us to romanticize and feel nostalgia for almost anything.”
  • Ideals: “We enjoy contemplating the general principles by which they are supposed to have lived: omertà, standing up to unfair authority, protecting your own.”
These are plausible reasons, but are they bullshit? Konnikova offers no systematic evidence for anything she says. Do we really admire mobsters? We don’t know. Besides, it would be better to ask: how many of us admire them, and to what degree? Either way, I doubt that we have good survey data on approval ratings for John Gotti. All we know is that mobster movies often sell a lot of tickets. Yet the relation between our actual lives (admiration, desires, behavior) and what we like to watch on screen is fuzzy and inconsistent.

It’s fun to speculate about movies and mobsters,* but without evidence all we have is at best speculation, at worst bullshit.

UPDATE:
In a message to me, Maria Konnikova says that there is evidence, including surveys, but that the New Yorker edited that material out of the final version of her article.

----------
* Nine years ago, in what is still one of my favorite posts on this blog, I speculated on the appeal of mafia movies (here). I had the good sense to acknowledge that I was speculating and to point out that our preferences in fantasyland had a complicated relation to our preferences in real life.

Cartwheeling to Conclusions

September 7, 2015
Posted by Jay Livingston

This post was going to be about kids – what the heck is wrong with these kids today – their narcissism and sense of entitlement, and how that’s all because their wealthy parents and schools are so overprotective and doting, giving them trophies for merely showing up and telling them they’re so great all the time.

I’m skeptical about that view – both its accuracy and its underlying values (as I said in this post about “Frances Ha”). But yesterday in Central Park there was this young dad with a $7500 camera.


I was reminded of something from a photo class I once took at Montclair. We were talking about cameras – this was decades ago, long before digital -  and the instructor Klaus Schnitzer said dismissively: “Most Hasselblads are bought by doctors who take snapshots of their kids on weekends.”


Now here was this guy with his very expensive camera taking videos of his 9-year-old daughter doing cartwheels. And not just filming her. He interviewed her, for godssake – asked her a couple of questions as she was standing there (notice the mike attached to the camera) as though she were some great gymnast. This is going to be one narcissistic kid, I thought, if she wasn’t already. I imagined her parents in a few years giving her one of those $50,000 bat mitzvahs – a big stage show with her as the star. My Super Sweet Thirteen.

Maybe it was also because the dad reminded me of the Rick Moranis character in the movie “Parenthood,” the father who is over-invested in the idea of his daughter’s being brilliant. 


(The guy looked a little like Moranis. I’ve blurred his face in the photos here, but trust me on this one. My wife thought so too.)

But here’s where the story takes a sharp turn away from the millennial clichés. My wife, who had been a working photographer, went over to ask him about his camera. It turns out that he works for “20/20,” and ABC had asked him to try out this Canon C-100. It was ABC’s camera, not his, and as much as he was indulging his daughter, she was indulging him – agreeing to do the cartwheels and mock interview for purposes of his work.

OK, it wasn’t exactly the second-generation kid working in her immigrant parents’ vegetable store, but it wasn’t the narcissism-generating scenario that I had imagined. 

The point is that my wife was a much better social psychologist than I was. If you want to find out what people are doing, don’t just look at them from a distance or number-crunch their responses on survey items. Talk with them.

Margin of Error Error

August 3, 2015
Posted by Jay Livingston

The margin of error is getting more attention than usual in the news. That’s not saying much since it’s usually a tiny footnote, like those rapidly muttered disclaimers in TV ads (“Offer not good mumble mumble more than four hours mumble mumble and Canada.”) Recent headlines proclaim, “Trump leads Bush. . .” A paragraph or two in, the story will report that in the recent poll Trump got 18% and Bush 15%.  That difference is well within the margin of error, but you have to listen closely to hear that. Most people don’t want to know about uncertainty and ambiguity.

What’s bringing uncertainty out of the closet now is the upcoming Republican presidential debate.  The Fox-CNN-GOP axis has decided to split the field of presidential candidates in two based on their showing in the polls. The top ten will be in the main event. All other candidates – currently Jindal, Santorum, Fiorina, et al. – will be relegated to the children’s table, i.e., a second debate a month later and at the very unprime hour of 5 p.m.

But is Rick Perry’s 4% in a recent poll (419 likely GOP voters) really in a different class than Bobby Jindal’s 2%? The margin of error that CNN announced in that survey was a confidence interval of +/- 5 points.  Here’s the box score.


Jindal might argue that with a margin of error of 5 points, his 2% might actually be as high as 7%, which would put him in the top tier. 

He might argue that, but he shouldn’t.  Downplaying the margin of error makes a poll result seem more precise than it really is, but using that one-interval-fits-all number of five points understates the precision.  That’s because the margin of error depends on the percent that a candidate gets.  The confidence interval is larger for proportions near 50%, smaller for proportions at the extreme. 

Just in case you haven’t taken the basic statistics course, here is the formula: margin of error = 1.96 × √( p̂ (1 − p̂) / n ).
The p̂ (pronounced “p-hat”) is the proportion of the sample who preferred each candidate, and n is the sample size. For the candidate who polled 50%, the numerator of the fraction under the square root sign will be 0.5 (1-0.5) = .25.  That's much larger than the numerator for the 2% candidate:  0.02 (1-0.02) = .0196.*

Multiplying by the 1.96, the 50% candidate’s margin of error with a sample of 419 is +/- 4.8. That’s the figure that CNN reported. But plug in Jindal’s 2%, and  the result is much less: +/- 1.3.  So there’s a less than one in twenty chance that Jindal’s true proportion of support is more than 3.3%. 
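A few lines of code make the contrast concrete (a sketch reproducing the arithmetic above; n = 419 and the two percentages come from the CNN poll):

    from math import sqrt

    def margin_of_error(p_hat, n, z=1.96):
        # 95% margin of error for a sample proportion (normal approximation)
        return z * sqrt(p_hat * (1 - p_hat) / n)

    n = 419   # likely GOP voters in the CNN poll
    for name, p_hat in [("candidate at 50%", 0.50), ("Jindal at 2%", 0.02)]:
        print(f"{name}: +/- {margin_of_error(p_hat, n) * 100:.1f} points")
    # candidate at 50%: +/- 4.8 points
    # Jindal at 2%: +/- 1.3 points

(As the footnote below says, the normal approximation itself gets shaky for percentages this close to zero; something like the Agresti-Coull interval is the safer choice down there.)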

Polls usually report their margin of error based on the 50% maximum. The media reporting the results then use the one-margin-fits-all assumption – even NPR. Here is their story from May 29 with the headline “The Math Problem Behind Ranking The Top 10 GOP Candidates.”

There’s a big problem with winnowing down the field this way: the lowest-rated people included in the debate might not deserve to be there.

The latest GOP presidential poll, from Quinnipiac, shows just how messy polling can be in a field this big. We’ve put together a chart showing how the candidates stack up against each other among Republican and Republican-leaning voters — and how much their margins of error overlap.





The NPR writer, Danielle Kurtzleben, does mention that “margins might be a little smaller at the low end of the spectrum,” but she creates a graph that ignores that reality.

The misinterpretation of presidential polls is nothing new.  But this time, that ignorance will determine whether a candidate plays to a larger or smaller TV audience.**

-------------------
* There are slightly different formulas for calculating the margin of error for very low percentages.  The Agresti-Coull formula  gives a confidence interval even if there are zero Yes responses. (HT: Andrew Gelman)

** Department of Irony: Some of these GOP politicians might complain about polls determining candidates’ ability to reach the widest audience. But none of them objects to having that ability determined by money from wealthy individuals and corporations.

David Brooks – The Great Resource

April 29, 2015
Posted by Jay Livingston

What would I do without David Brooks?

One of the exercises I always assign asks students to find an opinion piece – an op-ed, a letter to the editor – and to reduce its central point to a testable hypothesis about the relation between variables. What are the variables, how would you operationalize them, what would be their categories or values, what would be your units of analysis, and what information would you use to decide which category each unit goes in?

To save them the trouble of sifting through the media, I have a stockpile of articles that I’ve collected over the years – articles that make all sorts of assertions but without any evidence.  Most of them are by David Brooks. (OK, not most, but his oeuvre is well represented.)

Yesterday’s column (here) is an excellent example. His point is very simple: We should consider personal morality when choosing our political leaders. People with bad morals will also be bad leaders.

Voting for someone with bad private morals is like setting off on a battleship with awesome guns and a rotting hull. There’s a good chance you’re going to sink before the voyage is over.

People who are dishonest, unkind and inconsiderate have trouble attracting and retaining good people to their team. They tend to have sleazy friends. They may be personally canny, but they are almost always surrounded by sycophants and second-raters who kick up scandal and undermine the leader’s effectiveness. . .

But, historically, most effective leaders — like, say, George Washington, Theodore Roosevelt and Winston Churchill — had a dual consciousness. They had an earnest, inner moral voice capable of radical self-awareness, rectitude and great compassion. They also had a pragmatic, canny outer voice. . . .

Those three – Washington, TR, and Churchill – constitute the entirety of Brooks’s evidence for his basic proposition: “If candidates don’t acquire a moral compass outside of politics, they’re not going to get it in the White House, and they won’t be effective there.”

The comments from readers mentioned other leaders, mostly presidents. But how do you measure a politician’s effectiveness? And how do you measure a politician’s morality? More important, how do you measure them separately? Was Bush II moral? It’s very tempting to those on the left to see the failures of his presidency as not just bad decisions but as sins, violations of morality. Was Bill Clinton effective? Those who dwell on his moral failings probably don’t think so. Presumably, political scientists have some way of measuring effectiveness. Or do they? But does anyone have a standard measure of morality?

So Brooks gets a pass on this one. It’s not that he’s wrong, it’s that it would be impossible to get systematic evidence that might help settle the question.

Still, Brooks, in this column as in so many others, provides useful material for an exercise in methodology. If David Brooks didn’t exist, I would have to create him.

Good Time Charts, Bad Time Charts

April 21, 2015
Posted by Jay Livingston

How do you graph data to show changes over time? You might make “years” your x-axis and plot each year’s number. But you’re not the Washington Post. If you’re the Post Wonkblog (here), first you proclaim that the whole story fits in a single chart.
Here is that single chart.


The data points are years, but they seem to be in no logical order, and they overlap so much that you can’t tell which year is where.  Even for a point we can easily identify, 1987, it’s not clear what we are supposed to get.  In that year, the average income of the lower 90% of earners was about $33,000, and the average for the top 1% was about $500,000. But how was that different from 1980 or 1990? Go ahead, find those years. I’ll wait.

Here’s the same data,* different graph.


The point is clearer: beginning in 1980 or thereabouts, the 1% started pulling away from the lower 90%. 

A graph showing the percentage change shows the historical trends still more clearly.


From the mid-40s to 1980, incomes for the lower 90% were growing more rapidly than were incomes for the 1%. This period is what some now call “the great compression,” when income inequality decreased. Since 1980, income growth for the 90% has leveled off while incomes for the 1% have risen dramatically.

(The Post acknowledges that it got its material from Quoctrung Bui at NPR. But the NPR page (here) has two graphs, and the one that is similar to the one in the Post has a time-series animation that shows the year-to-year changes.)
-------------------------------
* The data set, available here , comes from the Paris School of Economics. Presumably, it contains the data that Thomas Piketty has been working with.

Is That Evidence in Your Pocket, or Are You Just Writing an Op-Ed?

February 25, 2015
Posted by Jay Livingston

Nobody looks to USA Today op-eds for methodologically scrupulous research. But even by these standards, James Alan Fox’s opinion piece this morning (here) was a bit clumsy. Fox was arguing against the idea that allowing guns on campus would reduce sexual assaults.

You have to admit, the gunlovers are getting kind of cute with this proposal. They are ostensibly paying attention to a liberal issue – the victimization of women – but their policy proposal is one they know liberals will hate. Next thing you know, the “guns everywhere” folks will be proposing concealed carry as a way to reduce economic inequality. After all, aren’t guns the great equalizer?

What makes the guns-on-campus debate so frustrating is that there’s not much relevant evidence. The trouble with Fox’s op-ed is that he pretends there is.

However compelling the deterrence argument, the evidence suggests otherwise. According to victimization figures routinely collected by the Bureau of Justice Statistics, the sexual assault victimization rate for college women is considerably lower (by more than one-third) than that among their non-college counterparts of the same age range. Thus, prohibiting college women from carrying guns on campus does not put them at greater risk.

You can’t legitimately compare college women on college campuses with non-college women in all the variety of non-college settings. There are just too many other relevant variables. The relevant comparison would be between colleges that allow guns and those that don’t, and there are very few of the former. Yet even if more campuses begin to allow concealed carry, comparisons with gun-free campuses will be plagued by all the methodological problems that leave the “more guns, less crime” studies open to debate.

The rest of Fox’s op-ed about what might happen is speculation, some of it reasonable and some of it not. “Would an aroused and inebriated brute then use his ‘just in case of emergency’ gun to intimidate some non-consenting woman into bed? Submit or you’re dead?” 

But the “pure speculation” label also applies to the arguments that an armed student body will be a polite and non-sexually-assaultive student body.  Well, as long as we’re speculating, here’s my guess, based on what we know from off-campus data: the difference between gun-heavy campuses and unarmed campuses will turn up more in the numbers of accidents and suicides than in the number of sex crimes committed or deterred, and all these numbers will be small.

Data in the Streets

November 2, 2014
Posted by Jay Livingston

I confess, I have little memory for books or articles on methods. I may have learned their content, but the specific documents and authors faded rapidly to gray.  And then there’s Unobtrusive Measures. It must have been Robert Rosenthal who assigned it. He was, after all, the man who forced social scientists to realize that they were unwittingly affecting the responses of their subjects and respondents, whether those subjects were people or lab rats.  The beauty of unobtrusive measures is that they eliminate that possibility. 


Now that states have started to legalize marijuana, one of the questions they must deal with is how to tax it. Untaxed, weed would be incredibly cheap. “A legal joint would cost (before tax) about what a tea-bag costs” (Mark Kleiman, here). Presumably, states want to tax weed so that the price is high enough to discourage abuse and raise money but not so high that it creates a black market in untaxed weed.

The same problem already occurs with cigarettes.


The above graph, from a study commissioned by the Tax Foundation, shows that as taxes increase, so does smuggling. (The Tax Foundation does not show the correlation coefficient, but it looks like it might be as high as 0.6, though without that dot in the upper right, surely New York, it might be more like 0.5.)
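That sensitivity of r to one extreme point is easy to demonstrate (a sketch with made-up illustrative numbers, not the Tax Foundation’s data):

    import numpy as np

    # Hypothetical (tax per pack, % of consumption smuggled) points, loosely
    # shaped like the scatterplot: a noisy cluster plus one high-tax outlier.
    tax      = np.array([0.2, 0.6, 1.0, 1.4, 1.8, 2.2, 2.6, 3.0, 4.4])
    smuggled = np.array([ 10,  -8,  25,   2,  30,   5,  20,  15,  60])

    r_all = np.corrcoef(tax, smuggled)[0, 1]                  # roughly 0.7
    r_trimmed = np.corrcoef(tax[:-1], smuggled[:-1])[0, 1]    # roughly 0.3
    print(f"r with the outlier: {r_all:.2f}, without it: {r_trimmed:.2f}")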

In a high-tax area like New York City, many of the cigarettes sold are smuggled in from other states. But how much is “many cigarettes,” and how can you find out? Most studies of smuggled and counterfeit cigarettes get their estimates by comparing sales figures with smoking rates. The trouble with that method is that rates of smoking come from surveys, and people may not accurately report how much they smoke.

That’s why I liked this study by Klaus von Lampe and colleagues.* They selected a sample of South Bronx census tracts and walked around, eyes down, scanning the sidewalks for discarded cigarette packs to see whether the pack had the proper tax stamps.


All in all, they picked up 497; of those, 329 still had the cellophane wrapper that the stamp would be on.  If there was a tax stamp, they sent it to the state to determine if it was counterfeit.

In the end, they estimate that only 20% of the cigarettes were fully legit with state and city taxes paid. About two-fifths had no tax stamp, another 15% had counterfeit stamps, and 18% had out-of-state stamps.

Unobtrusive measures solve one methodological problem, but they are not perfect. The trouble  here, and in many other cases, is the limited range.  Extending this research to the entire city let alone the fifty states would be a huge and costly undertaking.

----------------
* Hat tip to Peter Moskos, who mentioned it on his Cop in the Hood blog.

Whose Bad Guess is More Bad? Difficult Comparisons

October 29, 2014
Jay Livingston

How to compare percentages that are very different? 

A recent Guardian/Ipsos poll asked people in fourteen wealthy nations to estimate certain demographics. What percent of the population of your country are immigrants? Muslim? Christian?

People overestimated the number of immigrants and Muslims, and underestimated the number of Christians. But the size of the error varied.  Here is the chart on immigration that the Guardian published (here).


Italy, the US, Belgium, and France are way off. The average guess was 18-23 percentage points higher than the true percentage.  People in Japan, South Korea, Sweden, and Australia were off by only 7-8 percentage points.

But is that a fair comparison? The underlying question is this: which country is better at estimating these demographics? Japan and South Korea have only 2% immigrants. People estimated that it was 10%, a difference of eight percentage points. But looked at another way, their estimate was five times the actual number. The US estimate was only about 2½ times the true number.

The Guardian ranks Hungary, Poland, and Canada together since they all have errors of 14 points. But I would say that Canada’s 35% vs. 21% is a better estimate than Hungary’s 16% vs. 2%.  Yet I do not know a statistic or statistical technique that factors in this difference and allows us to compare countries with very few immigrants and those with far more immigrants.* 
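To see the two yardsticks side by side, here is a quick computation for the pairs quoted in this post (a sketch; the actual/guess figures are the rounded ones mentioned above):

    # (actual %, average guess %) for immigrant share, as cited above.
    countries = {
        "Japan / South Korea": (2, 10),
        "Hungary": (2, 16),
        "Canada": (21, 35),
    }

    for country, (actual, guess) in countries.items():
        gap = guess - actual      # percentage-point error
        ratio = guess / actual    # how many times the true figure
        print(f"{country}: +{gap} points, {ratio:.1f}x the actual share")
    # Japan / South Korea: +8 points, 5.0x the actual share
    # Hungary: +14 points, 8.0x the actual share
    # Canada: +14 points, 1.7x the actual share

Neither column alone settles which country guessed “better” – which is exactly the problem.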

My brother suggested that the Guardian’s readers could get a better picture of the differences if the chart ordered the countries by the immigrant percentage rather than by the percentage-point gap.


This makes clearer that the 7-point overestimate in Sweden and Australia is a much different sort of error than the 8-point overestimate in South Korea and Japan. But I’m still uncertain as to the best way to make these comparisons.


-------------------------------
* Saying that I know of no such statistic is not saying much. Perhaps others who are more familiar with statistics will know how to solve this problem.

Naming Variables

July 21, 2014
Posted by Jay Livingston

Variable labels – not the sort of problem that should excite much debate. Still, it’s important to identify your variables as what they really are. If I’m comparing, say, New Yorkers with Clevelanders, should I call my independent variable “Sophistication” (Gothamites, as we all know, are more sophisticated)? Or should it be “City” (or “City of residence”)? “Sophistication” would be sexier, “City” would be more accurate.

Dan Ariely does experiments about cheating.  In a recent experiment, he compared East Germans and West Germans and found that East Germans cheated more. 

we found evidence that East Germans who were exposed to socialism cheat more than West Germans who were exposed to capitalism.

Yes, East Germany was a socialist state. But it was also dominated by another nation (the USSR, which appropriated much of East Germany’s wealth) and had a totalitarian government that ruled by fear and mistrust.  For Ariely to write up his results and call his independent variable “Socialism/Capitalism,” he must either ignore all those other aspects of East Germany or else assume that they are inherent in socialism.*

The title of the paper is worth noting: “The (True) Legacy of Two Really Existing Economic Systems.” (You can find it here.)

The paper has been well received among mainstream conservatives (e.g., The Economist), who, rather than looking carefully at the variables, are glad to conflate socialism with totalitarian evils.

Mark Kleiman at the Reality Based Community makes an analogy with Chile under socialist Allende and capitalist Pinochet.

Imagine that the results had come out the other way: say, showing that Chileans became less honest while Pinochet was having his minions gouge out their opponents’ eyeballs and Milton Friedman was gushing about the “miracle of Chile”? How do you think the paper would read, and what do you think the Economist, Marginal Revolution, and AEI would have had to say about its methods?

--------------------
* A couple of commas might have made it clearer that other East-West differences might have been at work. Ariely should have written, “we found evidence that East Germans, who were exposed to socialism, cheat more than West Germans, who were exposed to capitalism.”

Replication and Bullshit

July 9, 2014
Posted by Jay Livingston

A bet is a tax on bullshit, says Marginal Revolution’s Alex Tabarrok (here).  So is replication.

Here’s one of my favorite examples of both – the cold-open scene from “The Hustler” (1961). Charlie is proposing replication. Without it, he considers the effect to be random variation.



It’s a great three minutes of film, but to spare you the time, here’s the relevant exchange.

CHARLIE
    You ought to take up crap shooting. Talk about luck!

         EDDIE
    Luck! Whaddya mean, luck?

         CHARLIE
    You know what I mean. You couldn't make that shot again in a million years.

       EDDIE
    I couldn’t, huh? Okay. Go ahead. Set ’em up the way they were before.

         CHARLIE
    Why?

         EDDIE
    Go ahead. Set ’em up the way they were before. Bet ya twenty bucks. Make that shot just the way I made it before.

         CHARLIE
    Nobody can make that shot and you know it. Not even a lucky lush.


After some by-play and betting and a deliberate miss, Eddie (aka Fast Eddie) replicates the effect, and we segue to the opening credits* confident that the results are indeed not random variation but a true indicator of Eddie’s skill.

But now Jason Mitchell, a psychologist at Harvard, has published a long throw-down against replication. (The essay is here.) Psychologists shouldn’t try to replicate others’ experiments, he says. And if they do replicate and find no effect, the results shouldn’t be published.  Experiments are delicate mechanisms, and you have to do everything just right. The failure to replicate results means only that someone messed up.

Because experiments can be undermined by a vast number of practical mistakes, the likeliest explanation for any failed replication will always be that the replicator bungled something along the way.  Unless direct replications are conducted by flawless experimenters, nothing interesting can be learned from them.


L. J. Zigerell, in a comment at Scatterplot, thinks that Mitchell may have gotten it switched around. Zigerell begins by quoting Mitchell:

“When an experiment succeeds, we can celebrate that the phenomenon survived these all-too-frequent shortcomings.”

But, actually, when an experiment succeeds, we can only wallow in uncertainty about whether a phenomenon exists, or whether a phenomenon appears to exist only because a researcher invented the data, because the research report revealed a non-representative selection of results, because the research design biased results away from the null, or because the researcher performed the experiment in a context in which the effect size for some reason appeared much larger than the true effect size.

It would probably be more accurate to say that replication is not so much a tax on bullshit as a tax on those other factors Zigerell mentions. But he left out one other possibility: that the experimenter hadn’t taken all the relevant variables into account.  The best-known of these unincluded variables is the experimenter himself or herself, even in this post-Rosenthal world. But Zigerell’s comment reminded me of my own experience in an experimental psych lab. A full description is here, but in brief, here’s what happened. The experimenters claimed that a monkey watching the face of another monkey on a small black-and-white TV monitor could read the other monkey’s facial expressions.  Their publications made no mention of something that should have been clear to anyone in the lab: that the monkey was responding to the shrieks and pounding of the other monkey – auditory signals that could be clearly heard even though the monkeys were in different rooms.

Imagine another researcher trying to replicate the experiment. She puts the monkeys in rooms where they cannot hear each other, and what they have is a failure to communicate. Should a journal publish her results? Should she have even tried to replicate in the first place?  In response, here are Mitchell’s general principles:


    •    Failed replications do not provide meaningful information if they closely follow original methodology.
    •    Replication efforts appear to reflect strong prior expectations that published findings are not reliable, and as such, do not constitute scientific output.
    •    The field of social psychology can be improved, but not by the publication of negative findings.
    •    Authors and editors of failed replications are publicly impugning the scientific integrity of their colleagues.


Mitchell makes research sound like a zero-sum game, with “mean-spirited” replicators out to win some easy money from a “a lucky lush.” But often, the attempt to replicate is not motivated by skepticism and envy. Just the opposite. You hear about some finding, and you want to see where the underlying idea might lead.** So as a first step, to see if you’ve got it right, you try to imitate the original research. And if you fail to get similar results, you usually question your own methods.

My guess is that the arrogance Mitchell attributes to the replicators is more common among those who have gotten positive findings.  How often do they reflect on their experiments and wonder if it might have been luck or some other element not in their model?

----
* Those credits can be seen here – with the correct aspect ratio and a saxophone on the soundtrack that has to be Phil Woods. 

** (Update, July 10) DrugMonkey, a bio-medical research scientist, says something similar:
Trying to replicate another paper's effects is a compliment! Failing to do so is not an attack on the authors’ “integrity.” It is how science advances.  

Tide and Time

June 4, 2014
Posted by Jay Livingston

Survey questions, even those that seem simple and straightforward, can be tricky and yield incorrect answers.  Social desirability can skew the answers to questions about what you would do – “Would you vote for a woman for president. . . ?” – and even factual questions about what you did do.  “Don’t ask, ‘How many books did you read last year?’” said the professor in my undergraduate methods course. “Ask ‘Did you read a book last week?’” There’s no shame in having been too busy to read a book in a seven-day period. Besides, people’s recall will be more accurate.  Or will it? Is even a week’s time enough to distort memory?

Leif Nelson (Berkeley, Business School) asked shoppers, “Did you buy laundry detergent the last time you went to the store?” Forty-two percent said yes.



Nelson doesn’t question the 42% figure. He’s interested in something else:  the “false consensus effect” – the tendency to think that others are more like us than they really are.

So he asks, “What percentage of shoppers do you think will buy laundry detergent?” and he also asks, “Did you buy laundry detergent?” Sure enough, those who said they bought detergent give higher estimates of detergent buying by others. (Nelson’s blog post, with other interesting findings, is here.)

But did 42% of those shoppers really buy detergent last time they were in the store? Andrew Gelman is “stunned” and skeptical. So am I.

The average family does 7-8 washes a week. Let’s round that up to 10.  They typically do serious shopping once a week with a few other quick express-lane trips during the week.  This 50 oz. jug of Tide will do 32 loads – three weeks of washing.



That means only 33% of customers should have said yes.  And that 33% is a very high estimate since most families buy in bulk, especially with items like detergent. Tide also comes in 100-oz. and 150-oz. jugs.

If you prefer powder, how about this 10-lb. box of Cheer? It’s good for 120 loads. 

A family should need to buy this one in only one out of 12 trips. Even at double the average washing, that’s six weeks of detergent. The true proportion of shoppers buying detergent should be well below 20%.
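Here is that back-of-the-envelope arithmetic in one place (a sketch; ten loads a week and one serious shopping trip a week are the rough assumptions used above):

    # Rough assumptions from the post.
    loads_per_week = 10
    trips_per_week = 1    # one "serious" shopping trip a week

    for product, loads in [("50-oz. Tide, 32 loads", 32), ("10-lb. Cheer, 120 loads", 120)]:
        weeks_per_purchase = loads / loads_per_week
        share_of_trips = 1 / (weeks_per_purchase * trips_per_week)
        print(f"{product}: bought on about {share_of_trips:.0%} of weekly trips")
    # 50-oz. Tide, 32 loads: bought on about 31% of weekly trips
    # 10-lb. Cheer, 120 loads: bought on about 8% of weekly trips

Either way, nowhere near the 42% of shoppers who said yes.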

Why then do people think they buy detergent so much more frequently?  I’m puzzled.  Maybe if washing clothes is part of the daily routine, something you’re always doing, buying detergent seems like part of the weekly shopping trip. Still, if we can’t rely on people’s answers about whether they bought detergent, what does that mean for other seemingly innocuous survey questions?

Sell It! – American (Psychology) Hustle

May 23, 2014
Posted by Jay Livingston

The Rangers crushed the Canadiens convincingly in game one: 7-2. The question was whether that result could be replicated . . . three more times.

Replication is hard (as the Rangers and their fans discovered in overtime at the Garden last night). That’s true in social science too. The difference is that the results of the Rangers’ failure to replicate were published.

Social psychologists are now paying more attention to the replication question. In the Reproducibility Project, Brian Nosek and others have set about trying to replicate most of the studies published in three top journals since 2008.  The first round of results was encouraging – of thirteen attempts, ten were consistent with the original findings. In one case, an “anchoring” study by Daniel Kahneman, the effect was stronger than in the original.

What failed to replicate? Mostly, experiments involving “priming,” where subliminal cues affect people’s ideas or behavior. In the best known and now most controversial of these, participants were primed by words suggesting old age (wrinkles, bingo, alone, Florida). They were then surreptitiously timed as they walked down the hall. In the original study by John Bargh (the priming primus inter pares), participants who were primed walked more slowly than did the controls.*

Many people have tried to replicate this study, and the results are mixed. One problem might be a “Rosenthal” effect, where the experimenters unintentionally and unknowingly influence the participants’ behavior so that it conforms with their expectations. Double-blind experiments, where the experimenters don’t know which participants have been primed, do not produce significant differences. (More here.)

I had a different explanation:  some guys can prime; some can’t. 

Maybe John Bargh and his assistants are really good at priming. Somehow, when they give participants those words mixed in among others, the subjects get a strong but still subliminal mental image of wrinkled retirees in Miami. But other psychologists at other labs haven’t got the same touch. Unfortunately, the researchers did not use an independent measure of how effective the priming was, so we can’t know.

I was delighted to see that Daniel Kahneman (quoted here ) had the same idea.

The conduct of subtle experiments has much in common with the direction of a theatre performance . . . you must tweak the situation just so, to make the manipulation strong enough to work, but not salient enough to attract even a little attention . . . .Bargh has a knack that not all of us have.

Many social psychology experiments involve a manipulation that the participant must be unaware of. If the person catches on to the priming (“Hey, all these sentences have words with a geezer theme,”), it blows the con. Some experiments require more blatant deceptions (think Milgram), and not all psychologists are good deceivers.

What reminded me of this was Elliot Aronson’s memoir Not by Chance Alone. Aronson is one of the godfathers of social psychology experiments, and one of the most famous experiments he had a hand in is the one-dollar-twenty-dollar lie, more widely known as Festinger and Carlsmith, 1959.  Carlsmith was J. Merrill Carlsmith.  The name seems like something from central casting, and so did the man –  a polite  WASP who prepped at Andover, etc.

In the experiment, the subject was given a boring task to do – taking spools out of a rack and then putting them back, again and again, while Carlsmith as experimenter stood there with a stopwatch pretending to time him.  The next step was to convince the subject to help the experimenter.

[Merrill] would explain that he was testing the hypothesis that people work faster if they are told in advance that the task is incredibly interesting than if they are told nothing and informed, “You were in the control condition. That is why you were told nothing.”

At this point Merrill would say that the guy who was supposed to give the ecstatic description to the next subject had just phoned in to say he couldn’t make it. Merrill would beg the “control” subject to do him a favor and play the role, offering him a dollar (or twenty dollars) to do it. Once the subject agreed, Merrill was to give him the money and a sheet listing the main things to say praising the experiment and leave him alone for a few minutes to prepare.

But Carlsmith could not do a credible job. Subjects immediately became suspicious.

It was crystal clear why the subjects weren’t buying it: He wasn’t selling it. Leon [Festinger] said to me, “Train him.”

Sell it.  If you’ve seen “American Hustle,” you might remember the scene where Irving Rosenfeld (Christian Bale) is trying to show the FBI agent disguised as an Arab prince how to give a gift to the politician they are setting up. (The relevant part starts at 0:12 and ends at about 0:38)



Here is the script:


Aronson had to do something similar, and he had the qualifications. As a teenager, he had worked at a Fascination booth on the boardwalk in Revere, Massachusetts, reeling off a spiel to draw strollers in to try their luck.

Walk right in, sit in, get a seat, get a ball. Play poker for a nickel. . . You get five rubber balls. You roll them nice and easy . . . Any three of a kind or better poker hand, and you are a winner. So walk in, sit in, play poker for a nickel. Five cents. Hey! There's three jacks on table number 27. Payoff that lucky winner!

Twenty years later, Aronson still had the knack, and he could impart it to others. Like Kahneman, he thinks of the experiment as theater.

I gave Merrill a crash course in acting. “You don’t simply say that the assistant hasn’t shown up,” I said. “You fidget, you sweat, you pace up and down, you wring your hands, you convey to the subject that you are in real trouble here. And then, you act as if you just now got an idea. You look at the subject, and you brighten up. ‘You! You can do this for me. I can even pay you.’”

The deception worked, and the experiment worked.  When asked to say how interesting the task was, the $1 subjects gave it higher ratings than did the $20 subjects.  Less pay for lying, more attitude shift. The experiment is now part of the cognitive dissonance canon. Surely, others have tried to replicate it.  I just don’t know what the results have been.

--------------------

* An earlier post on Bargh and replication is here.

Know Your Sample

April 22, 2014
Posted by Jay Livingston


Tim Huelskamp is a Congressman representing the Kansas first district. He’s a conservative Republican, and a pugnacious one (or is that a redundancy?). Civility, at least in his tweets, is not his long suit. He refers to “King Obama” and invariably refers to the Affordable Care Act as “ObamaScare.” Pretty clever, huh?

He’s also not a very careful reader.  Either that or he does not understand the first thing about sampling. Tonight he tweeted.


Since polls also show that Americans support gay marriage, I clicked on the link.  The report is brief in the extreme. It gives data on only two questions and has this introduction.


The outrage might come from liberals. More likely it will come from people who think that members of the US Congress ought to be able to read.

Or maybe in Huelskamp’s view, only Republicans count as Americans.

Wonks Nix Pic Survey

February 18, 2014
Posted by Jay Livingston

“How could we get evidence for this?” I often ask students. And the answer, almost always is, “Do a survey.” The word survey has magical power; anything designated by that name wears a cloak of infallibility.

“Survey just means asking a bunch of people a bunch of questions,” I’ll say. “Whether it has any value depends on how good the bunch of people is and how good the questions are.”  My hope is that a few examples of bad sampling and bad questions will demystify it.

For example, Variety



Here’s the lede:
Despite its Biblical inspiration, Paramount’s upcoming “Noah” may face some rough seas with religious audiences, according to a new survey by Faith Driven Consumers.
The data to confirm that idea:
The religious organization found in a survey that 98% of its supporters were not “satisfied” with Hollywood’s take on religious stories such as “Noah,” which focuses on Biblical figure Noah.
The sample:
Faith Driven Consumers surveyed its supporters over several days and based the results on a collected 5,000+ responses.
And (I’m saving the best till last) here’s the crucial survey question:
As a Faith Driven Consumer, are you satisfied with a Biblically themed movie – designed to appeal to you – which replaces the Bible’s core message with one created by Hollywood?
As if the part about “replacing the Bible’s core message” weren’t enough, the item reminds the respondent of her or his identity as a Faith Driven Consumer. It does make you wonder about that 2% who either were fine with the Hollywood* message or didn’t know.

You can’t really fault Faith Driven Consumer too much for this shoddy “research.” They’re not in business to find the sociological facts. What’s appalling is that Variety accepts it at face value and without comment.

----------------------
* The director of “Noah” is Darren Aronofsky; the script is credited to him and Ari Handel.  For the Faith Driven Consumer, “Hollywood” may carry connotations in addition to that of industry and location – perhaps something similar to “New York sense of humor” in this clip from “The West Wing” (the whole six minutes is worth watching, but you’ll get the idea if you push the pointer to 2:20 or so and watch for the next 45 seconds). Or look at this L.A. Times column by Joel Stein.

(HT: @BrendanNyhan retweeted by Gabriel Rossman)

What Never? No, Never.

January 31, 2014
Posted by Jay Livingston

A survey question is only as good as its choices. Sometimes an important choice has been left off the menu.

I was Gallup polled once, long ago. I’ve always felt that they didn’t get my real opinion.
“What’d they ask?” said my brother when I mentioned it to him.
“You know, they asked whether I approved of the way the President was doing his job.”  Nixon - this was in 1969.
“What’d you say?”
“I said I disapproved of his entire existential being.”

I was exaggerating my opinion, and I didn’t actually say that to the pollster.  But even if I had, my opinion would have been coded as “disapprove.” 

For many years the American National Election Study (ANES) has asked
How much of the time do you think you can trust the government in Washington to do what is right – just about always, most of the time or only some of the time?
The trouble with these choices is that they exclude the truly disaffected. The worst you can say about the federal government is that it can be trusted “only some of the time.”  A few ornery souls say they don’t trust the federal government at all. But because that view is a write-in candidate, it usually gets only one or two percent of the vote.

This year the ANES included “Never” in the options read to respondents.  Putting “No-way, no-how” right there on the ballot makes a big difference. And as you’d expect, there were party differences:


Over half of Republicans say that the federal government can NEVER be trusted.

The graph appears in this Monkey Cage post by Marc Hetherington and Thomas Rudolph. Of course, some of those “never” Republicans don’t really mean “never ever.”  If a Republican becomes president, they’ll become more trusting, and the “never-trust” Democrat tide will rise.  Here’s the Hetherington-Rudolph graph tracking changes in the percent of people who do trust Washington during different administrations.


This one seems to show three things:
  1. Trust took a dive in the 1960s and 70s and never really recovered.
  2. Republican trust is much more volatile, with greater fluctuations depending on which party is in the White House.
  3. Republicans really, really hate President Obama.

Get a Spouse (sha-na-na-na. . . )

January 11, 2014
Posted by Jay Livingston

A bumper sticker I used to occasionally see said, “I fight poverty. I work.”

In this fiftieth anniversary of the War on Poverty, we should remember the difference between individual solutions to individual problems and societal or governmental solutions to social problems.  Yes, you’re less likely to be poor if you have a job. But exhorting the unemployed to go out and get a job is unlikely to have much effect on overall rates of poverty. 

The same can be said of marriage. In a recent speech, Sen. Marco Rubio offered the conservative approach to poverty.  The Rubio bumper sticker would say, “I fight poverty. I have a spouse.”  Here’s what he said:
 the greatest tool to lift people, to lift children and families from poverty, is one that decreases the probability of child poverty by 82 percent. But it isn't a government program. It's called marriage.
His evidence was drawn from a Heritage Foundation paper by Robert Rector.  Rector used Census data showing that poverty rates among single-parent families were much higher than among two-parent families – 37.1% vs. 6.8%.  “Being raised in a married family reduced child’s probability of living in poverty by about 82 percent.”

As Philip Cohen (here) pointed out, the same logic applies even more so to employment.
The median weekly earnings of full-time, year-round workers is $771 per week, which is $40,000 per year more than people with no jobs earn.
Philip apparently thought that this analogy would make the fallacy of the Rubio-Rector claim obvious, for he didn’t bother to spell it out. The point is that singling out marriage or employment as a cause ignores all the reasons why people don’t have jobs or spouses. It also implies that a job is a job and a spouse is a spouse, and that there is no difference between those of the middle class and those of the poor.  (Philip should have spelled out the obvious. These logical problems did not bother PolitiFact, which rated Rubio’s claim as “mostly true.”)


According to Rubio, Rector, and PolitiFact, if all poor women with children got married, the child-poverty rate in the US would decrease by 82%.  Or at the individual level, if a poor single woman got married, her children would be nearly certain (93.2% likely) to be un-poor.

To illustrate the society-wide impact of marriage on poverty, Rubio-Rector look at the increase in out-of-wedlock births.  Here is a graph from Rector’s article.



The rate rises from about 7% in 1959 to 40-41% today.  If Rubio is right, rates of child poverty should have risen steadily right along with this increase (almost invariably referred to as “the alarming increase”) in out-of-wedlock births.  The graph below shows poverty rates for families with children under 18.



Both show a large decrease in poverty in the first decade or so of the War on Poverty – between 1959 and 1974, the rate for all families was cut in half.  Since then the rate has remained between 9% and 12%.  The line for unmarried mothers shows something else that Rubio and Rector ignore: the effects of forces that individuals have no power over, things like the overall economy.  In the good years of the 1990s, the chance that a single mother would be below the poverty line fell from nearly half (47%) to one-third.  Her marital status did not change, but her chances of being in poverty did.  The number of families in poverty fell from 6.7 million to 5.1 million – despite the increase in population and despite the increase in percentage of children born out of wedlock. There were more single mothers, but fewer of them were in poverty.

Addendum, January 12:  The title of this post refers to the classic oldie “Get a Job” (Silhouettes, 1957). The final lines of that song could, with only some slight editing, apply to Sen. Rubio and his colleagues:

In the Senate and the House
I hear the right-wing mouths,
Preachin’ and a cryin’
Tell me that I’m lyin’
’Bout a spouse
That I never could find.
(Sha-na-na-na, sha-na-na-na-na.)