Showing posts with label Methods. Show all posts

Algorithms and False Positives

September 13, 2017
Posted by Jay Livingston

Can face-recognition software tell if you’re gay?

Here’s the headline from The Guardian a week ago.

Yilun Wang and Michal Kosinski at Stanford’s School of Business have written an article showing that artificial intelligence – machines that can learn from their experiences – can develop algorithms to distinguish the gay from the straight. Kosinski goes farther. According to Business Insider,
He predicts that self-learning algorithms with human characteristics will also be able to identify:
  • a person’s political beliefs
  • whether they have high IQs
  • whether they are predisposed to criminal behaviour
When I read that last line, something clicked. I remembered that a while ago I had blogged about an Israeli company, Faception, that claimed its face recognition software could pick out the faces of terrorists, professional poker players, and other types. It all reminded me of Cesare Lombroso, the Italian criminologist. Nearly 150 years ago, Lombroso claimed that criminals could be distinguished by the shape of their skulls, ears, noses, chins, etc. (That blog post, complete with pictures from Lombroso’s book, is here.) So I was not surprised to learn that Kosinski had worked with Faception.

For a thorough (3000 word) critique of the Wang-Kosinski paper, see Greggor Mattson’s post at Scatterplot. The part I want to emphasize here is the problem of False Positives.

Wang-Kosinski tested their algorithm by showing a series of paired pictures from a dating site. In each pair, one person was gay, the other straight. The task was to guess which was which. The machine’s accuracy was roughly 80% – much better than guessing randomly and better than the guesses made by actual humans, who got about 60% right. (These are the numbers for photos of men only. The machine and humans were not as good at spotting lesbians. In my hypothetical example that follows, assume that all the photos are of men.)

But does that mean that the face-recognition algorithm can spot the gay person? The trouble with Wang-Kosinski’s gaydar test was that it created a world where half the population was gay. For each trial, people or machine saw one gay person and one straight.

Let’s suppose that the machine had an accuracy rate of 90%. Let’s also present the machine with a 50-50 world. Looking at the 50 gays, the machine will guess correctly on 45. These are “True Positives.” It identified them as gay, and they were gay. But it will also classify 5 of the gay people as not-gay. These are the False Negatives.

It will have the same ratio of true and false for the not-gay population. It will correctly identify 45 of the not-gays (True Negatives), but it will guess incorrectly that 5 of these straight people are gay (False Positive).

It looks pretty good. But how well will this work in the real world, where the gay-straight ratio is nowhere near 50-50? Just what that ratio is depends on definitions. But to make the math easier, I’m going to use 5% as my estimate. In a sample of 1000, only 50 will be gay. The other 950 will be straight.

Again, let’s give the machine an accuracy rate of 90%. For the 50 gays, it will again have 45 True Positives and 5 False Negatives. But what about the 950 not-gays? It will be correct 90% of the time and identify 855 of them as not-gay (True Negatives). But it will also guess incorrectly that 10% are gay. That’s 95 False Positives.

The number of False Positives is more than double the number of True Positives. The overall accuracy may be 90%, but when it comes to picking out gays, the machine is wrong far more often than it’s right.

The rarer the thing that you’re trying to predict, the greater the ratio of False Positives to True Positives. And those False Positives can have bad consequences. In medicine, a false positive diagnosis can lead to unnecessary treatment that is physically and psychologically damaging. As for politics and policy, think of the consequences if the government goes full Lombroso and uses algorithms for predicting “predisposition to criminal behavior.”
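The arithmetic above is easy to check with a short Python sketch. The function names are made up for this illustration, and the sketch assumes, as the hypothetical example does, that the machine is right 90% of the time on gay and straight faces alike. The 1% case is an extra row added here only to show the trend.

```python
def screen(population, prevalence, accuracy):
    """Count outcomes for a test that is right `accuracy` of the time
    on both the people who have the trait and the people who don't."""
    have_trait = population * prevalence
    lack_trait = population - have_trait
    true_pos = have_trait * accuracy
    false_neg = have_trait - true_pos
    true_neg = lack_trait * accuracy
    false_pos = lack_trait - true_neg
    return true_pos, false_neg, true_neg, false_pos


def share_of_positives_that_are_right(population, prevalence, accuracy):
    """Of everyone the test flags, what fraction really has the trait?"""
    true_pos, _, _, false_pos = screen(population, prevalence, accuracy)
    return true_pos / (true_pos + false_pos)


# The 50-50 world, the 5% world from the text, and a rarer 1% case.
for population, prevalence in ((100, 0.50), (1000, 0.05), (1000, 0.01)):
    tp, fn, tn, fp = screen(population, prevalence, 0.90)
    right = share_of_positives_that_are_right(population, prevalence, 0.90)
    print(f"prevalence {prevalence:.0%}: TP={tp:.0f} FN={fn:.0f} "
          f"TN={tn:.0f} FP={fp:.0f} -> {right:.0%} of positive calls correct")
```

In the 5% world the sketch reproduces the 45 True Positives and 95 False Positives, so only about a third of the machine's "gay" calls are right even though its overall accuracy is still 90%.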

Somewhat Likely to Mess Up on the Likert Scale

May 27, 2017
Posted by Jay Livingston

Ipsos called last night, and I blew it. The interviewer, a very nice-sounding man in Toronto, didn’t have to tell me what Ipsos was, though he did, sticking with his script. I’d regularly seen their numbers cited. (The latest “Reuters/Ipsos” poll shows Trump’s approve/disapprove at 37%/57%.)

The interviewer wanted to speak with someone in the household older than 18. No problem; I’m your man. After all, when I vote, I am a mere one among millions. The Ipsos sample, I figured, was only 1,000.  My voice would be heard.

He said at the start that the survey was about energy. Maybe he even said it was sponsored by some energy group. I wish I could remember.

After a few questions about whether I intended to vote in local elections and how often I got news from various sources (newspapers, TV, Internet), he asked how well-informed I was about energy issues. Again, I can’t remember the exact phrasing, but my Likert choices ranged from Very Well Informed to Not At All Informed.

I thought about people who are really up on this sort of thing – a guy I know who writes an oil industry newsletter, bloggers who post about fracking and earthquakes or the history of the cost of solar energy.  I feel so ignorant compared with them when I read about these things. So I went for the next-to-least informed choice. I think it was “not so well informed.”

“That concludes the interview. Thank you.”
“Wait a minute,” I said. “I don’t get to say what I think about energy companies? Don’t you want to know what bastards I think they are?”
“I’m sorry, we have to go with the first response.”
“I was being falsely modest.”
He laughed.
“The Koch brothers, Rex Tillerson, climate change, Massey Coal . . .”
He laughed again, but he wouldn’t budge. They run a tight ship at Ipsos.

Next time they ask, whatever the topic, I’m a freakin’ expert.

Imagine There’s a $5 Discount. It’s Easy If You Try. . . .

June 21, 2016
Posted by Jay Livingston

Reading Robert H. Frank’s new book Luck and Success, I came across this allusion to the famous Kahneman and Tversky finding about “framing.”

It is common . . . for someone to be willing to drive across town to save $10 on a $20 clock radio, but unwilling to do so to save $10 on a $1,000 television set.

Is it common? Do we really have data on crosstown driving to save $10? The research that I assume Frank is alluding to is a 1981 study by Daniel Kahneman and Amos Tversky. (pdf here ) Here are the two scenarios that Kahneman and Tversky presented to their subjects.

A.  Imagine that you are about to purchase a jacket for $125 and a calculator for $15. The calculator salesman informs you that the calculator you wish to buy is on sale for $10 at the other branch of the store, located 20 minutes drive away. Would you make the trip to the other store?

B. Imagine that you are about to purchase a calculator for $125 and a jacket for $15. The calculator salesman informs you that the calculator you wish to buy is on sale for $120 at the other branch of the store, located 20 minutes drive away. Would you make the trip to the other store?

The two are really the same: would you drive 20 minutes to save $5 on a calculator? But when the discount was on a $15 calculator, 68% of the subjects said they would make the 20-minute trip. When the $5 savings applied to the $125 calculator, only 29% said they’d make the trip.

The study is famous even outside behavioral economics, and rightly so. It points up one of the many ways that we are not perfectly rational when we think about money. But whenever I read about this result, I wonder: how many of those people actually did drive to the other store? The answer of course is none. There was no actual store, no $125 calculator, no $15 jacket. The subjects were asked to “imagine.” They were thinking about an abstract calculator and an abstract 20-minute drive, not real ones.*

But if they really did want a jacket and a calculator, would 60 of the 90 people really have driven the 20 minutes to save $5 on a $15 calculator? One of the things we have long known in social research is that what people say they would do is not always what they actually will do. And even if these subjects were accurate about what they would do, their thinking might be including real-world factors beyond just the two in the Kahneman-Tversky abstract scenario (20 minutes, $5). Maybe they were thinking that they might be over by that other mall later in the week, or that if they didn’t buy the $15 calculator right now, they could always come back to this same store and get it.

It’s surprising that social scientists who cite this study take the “would do” response at face value, surprising because another well-known topic in behavioral economics is the discrepancy between what people say they will do and what they actually do. People say that they will start exercising regularly, or save more of their income, or start that diet on Monday. Then Monday comes, and everyone else at the table is having dessert, and well, you know how it is.

In the absence of data on behavior, I prefer to think that these results tell us not so much what people will do. They tell us what people think a rational person in that situation would do. What’s interesting then is that their ideas about abstract economic rationality are themselves not so rational.

* I had the same reaction to another Kahneman study, the one involving “Linda,” an imaginary bank teller. (My post about that one, nearly four years ago, is here ). What I said of the Linda problem might also apply to the jacket-and-calculator problem: “It’s like some clever riddle or a joke – something with little relevance outside its own small universe. You’re never going to be having a real drink in a real bar and see, walking in through the door an Irishman, a rabbi, and a panda.”

The Face That Launched a Thousand False Positives

May 27, 2016
Posted by Jay Livingston

What bothered the woman sitting next to him wasn’t just that the guy was writing in what might have been Arabic (it turned out to be math). But he also looked like a terrorist. (WaPo story here.)

We know what terrorists look like. And now an Israeli company, Faception, has combined big data with facial recognition software to come up with this.

According to their Website:

Faception can analyze faces from video streams, cameras, or . . . databases. We match an individual with various personality traits or types such as an Extrovert, a person with High IQ, Professional Poker Player or a Terrorist.

My first thought was, “Oh my god, Lombroso.”

If you’ve taken Crim 101, you might remember that Lombroso, often called “the father of criminology,” had the idea that criminals were atavisms, throwbacks to earlier stages of human evolution, with different skull shapes and facial features. A careful examination of a person’s head and face could diagnose criminality – even the specific type of lawbreaking the criminal favored. Here is an illustration from an 1876 edition of his book. Can you spot the poisoner, the Neapolitan thief, the Piedmont forger?

Criminology textbooks still mention Lombroso, though rarely as a source of enlightenment. For example, one book concludes the section on Lombroso, “At this point, you may be asking: If Lombroso, with his ideas about criminal ears and jaws, is the ‘father of criminology,’ what can we expect of subsequent generations of criminologists?”

Apparently there’s just something irresistible in the idea that people’s looks reveal their character. Some people really do look like criminals, and some people look like cops.* Some look like a terrorist or a soccer mom or a priest. That’s why Hollywood still pays casting directors. After all, we know that faces show emotion, and most of us know at a glance whether the person we’re looking at is feeling happy, angry, puzzled, hurt, etc. So it’s only logical that a face will reveal more permanent characteristics. As Faception puts it, “According to social and life science research, our personality is determined by our DNA reflected in our face.” It’s not quite true, but it sounds plausible.

The problem with this technique is not the theory or science behind it, and probably not even its ability to pick out terrorists, brand promoters, bingo players, or any of their other dramatis personae in the Faception cast of characters. The problem is false positives. Even when a test is highly accurate, if the thing it’s testing for is rare, a positive identification is likely to be wrong. Mammograms, for example, have an accuracy rate as high as 90%. Each year, about 37 million women in the US are given mammograms. The number who have breast cancer is about 180,000. The 10% error rate means that of the 37 million women tested, 3.7 million will get results that are false positives. It also means that for the woman who does test positive, the likelihood that the diagnosis is wrong is 95%.**
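The mammogram numbers can be run through a few lines of Python. The first function follows the shortcut in the footnote; the second is an assumption of this sketch, not something the post or its source spells out: it applies the 10% error rate separately to women with and without cancer. Both function names are invented for the illustration.

```python
TESTED = 37_000_000      # women screened per year (figure from the post)
WITH_CANCER = 180_000    # women who have breast cancer (figure from the post)
ERROR_RATE = 0.10        # 1 minus the 90% accuracy cited in the post


def chance_positive_is_right_simple():
    """The footnote's shortcut: every real case counts as a true positive,
    and 10% of everyone tested counts as a false positive."""
    false_pos = TESTED * ERROR_RATE
    return WITH_CANCER / (WITH_CANCER + false_pos)


def chance_positive_is_right_stricter():
    """Assumption of this sketch: the 10% error rate applies separately to
    women with cancer (missed cases) and without it (false alarms)."""
    true_pos = WITH_CANCER * (1 - ERROR_RATE)
    false_pos = (TESTED - WITH_CANCER) * ERROR_RATE
    return true_pos / (true_pos + false_pos)


print(f"shortcut version: {chance_positive_is_right_simple():.1%}")
print(f"stricter version: {chance_positive_is_right_stricter():.1%}")
```

Either way the answer lands between 4% and 5%, which is where the "wrong about 95% of the time" figure comes from.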

Think of these screening tests as stereotypes. The problem with stereotypes is not that they are wrong; without some grain of truth, they wouldn’t exist. The problem is that they have many grains of untruth – false positives. We have been taught to be wary of stereotypes not just because they denigrate an entire class of people but because in making decisions about individuals, those stereotypes yield a lot of false positives.  

Faception does provide some data on the accuracy of its screening. But poker champions and terrorists are rarer even than breast cancer. So even if the test can pick out the true terrorist waiting to board the plane, it’s also going to pick out a lot of bearded Italian economists jotting integral signs and Greek letters on their notepads.

(h/t Cathy O’Neil at

* Some people look like cops. My favorite example is the opening of Richard Price’s novel Lush Life – four undercover cops, though the cover they are under is not especially effective.

The Quality of Life Task Force: four sweatshirts in a bogus taxi set up on the corner of Clinton Street alongside the Williamsburg Bridge off-ramp to profile the incoming salmon run; their mantra: Dope, guns, overtime; their motto: Everyone’s got something to lose. 
At the corner of Houston and Chrystie, a cherry-red Denali pulls up alongside them, three overdressed women in the backseat, the driver alone up front and wearing sunglasses.
The passenger-side window glides down. “Officers, where the Howard Johnson hotel at around here ...”
“Straight ahead three blocks on the far corner,” Lugo offers.
“Thank you.” [. . .]
The window glides back up and he shoots east on Houston.
“Did he call us officers?”
“It’s that stupid flattop of yours.”
“It’s that fuckin’ tractor hat of yours.”

It wasn’t the haircut or the hat. They just looked like cops.

** The probability that the diagnosis is correct is 5% – the 180,000 true positives divided by the 3.7 million false positives plus the 180,000 true positives – roughly 180,000 / 3,900,000. (I took this example from Howard Wainer’s recent book, Truth and Truthiness.)

Show, Don’t Tell

March 23, 2016
Posted by Jay Livingston

Can the mood of a piece of writing be graphed?

For his final project in Andrew Gelman’s course on statistical communication and graphics, Lucas Estevem created a “Text Sentiment Visualizer.” Gelman discusses it on his blog, putting the Visualizer through its paces with the opening of Moby Dick.

Without reading too carefully, I thought that the picture – about equally positive and negative – seemed about right. Sure things ended badly, but Ishmael himself seemed like a fairly positive fellow. So I went to the Visualizer (here)  and pasted in the text of one of my blogposts. It came out mostly negative. I tried another. Ditto. And another. The results were not surprising when I thought about what I write here, but they were sobering nevertheless. Gotta be more positive.

But how did the Visualizer know? What was its formula for sussing out the sentiment in a sentence? Could the Visualizer itself be a glum creature, tilted towards the dark side, seeing negativity where others might see neutrality? I tried other novel openings. Kafka’s Metamorphosis was entirely in the red, and Holden Caulfield looked to be at about 90%. But Augie March, not exactly a brooding or nasty type, scored about 75% negative. Joyce’s Ulysses came in at about 50-50.

To get a somewhat better idea of the scoring, I looked more closely at page one of The Great Gatsby. The Visualizer scored the third paragraph heavily negative – 17 out of 21 lines. But many of those lines had words that I thought would be scored as positive.

Did the Visualizer think that extraordinary gift, gorgeous, and successful were not such a good thing?

Feeling slightly more positive about my own negative scores, I tried Dr. Seuss. He too skewed negative.

What about A Tale of Two Cities? Surely the best of times would balance out the worst of times, and that famous opening paragraph would finish in a draw. But a line-by-line analysis came out almost all negative.

Only best, hope, and Heaven made it to the blue side.

I’m not sure what the moral of the story is except that, as I said in a recent post, content analysis is a bitch.

Gelman is more on the positive side about the Visualizer. It’s “far from perfect,” but it’s a step in the right direction – i.e., towards visual presentation – and we can play around with it, as I’ve done here, to see how it works and how it might be improved. Or as Gelman concludes, “Visualization. It’s not just about showing off. It’s a tool for discovering and learning about anomalies.”

Race and Tweets

March 20, 2016
Posted by Jay Livingston

Nigger* is a racially charged word. And if you sort cities or states according to how frequently words like nigger turn up from them on Twitter, you’ll find large differences. In some states these words appear forty times more often than in others. But do those frequencies tell us about the local climate of race relations? The answer seems to be: it depends on who is tweeting.

In the previous post, I wondered whether the frequency of tweets with words like bitch, cunt, etc. tells us about general levels of misogyny in a state or city. Abodo, the Website that mapped the geography of sexist tweets, also had charts and maps showing both racially charged tweets (with words like “nigger”) and more neutral, politically correct, tweets (“African Americans” or “Black people”). Here are the maps of the two different linguistic choices.

West Virginia certainly looks like the poster state for racism – highest in “anti-Black” tweets, and among the lowest in “neutral or tolerant” tweets. West Virginia is 95% White, so it’s clear that we’re looking at how White people there talk about Blacks. That guy who sang about the Mountaineer State being almost heaven – I’m pretty sure he wasn’t a Black dude. Nevada too is heavily White (75% White, 9% Black), but there, tweets with polite terms well outnumber those with slurs. Probably, Nevada is a less racist place than West Virginia.

But what about states with more Blacks? Maryland, about 30% Black, is in the upper range for neutral race-tweets, but it’s far from the bottom on “anti-Black” tweets. The same is true for Georgia and Louisiana, both about 30% Black. These states score high on both kinds – what we might call, with a hat-tip to Chris Rock, “nigger tweets” and “Black people tweets.” (If you are not familiar with Rock’s “Niggers and Black People,” watch it here.) If he had released this 8-minute stand-up routine as a series of tweets, and if Chris Rock were a state instead of a person, that state would be at the top in both categories – “anti-Black” and “neutral and tolerant.” How can a state or city be both?

The answer of course is that the meaning of nigger depends on who is using it.  When White people are tweeting about Blacks, then the choice of words probably tells us about racism. But when most of the people tweeting are Black, it’s harder to know. Here, for example, are Abodo’s top ten cities for “anti-Black tweets.”

Blacks make up a large percent of the population in most of these cities.  The top four – Baltimore, Atlanta, and New Orleans – are over 50% Black. It’s highly unlikely that it’s the Whites there who are flooding Twitter with tweets teeming with “nigger, coon, dindu, jungle bunny, monkey, or spear chucker” – the words included in Abodo’s anti-Black tag.** If the tag had included niggas, the “anti-Black” count in these cities would have been even higher.

All this tells us is that Black people tweet about things concerning Black people. And since hip-hop has been around for more than thirty years, it shouldn’t surprise anyone that Blacks use these words with no slur intended. When I searched Twitter yesterday for nigger, the tweets I saw on the first page were all from Black people, and some of those tweets, rather than using the word nigger were talking about the use of it.  (Needless to say, if you search for niggas, you can scroll through many, many screens trying to find a tweet with a White profile picture.)

For some reason, Abodo refused to draw this obvious conclusion. They do say in another section of the article that  “anti-Hispanic slurs have largely not been reclaimed by Hispanic and Latino people in the way that the N-word is commonly used in black communities.” So they know what’s going on. But in the section on Blacks, they say nothing, tacitly implying that these “anti-Black” tweets announce an anti-Black atmosphere. But that’s true only if the area is mostly White. When those tweets are coming from Blacks, it’s much more complicated.


*Abodo backs away from using the actual word. They substitute the usual euphemism – “the N-word.” As I have said elsewhere in this blog, if you can’t say the word you’re talking about when you’re talking about it as a word, then the terrorists have won. In this view, I differ from another Jay (Smooth) whose views I respect. A third Jay (Z) has no problems with using the word. A lot.

** I confess, porch monkey and dindu were new to me, but then, I don’t get out much, at least not in the right circles. Abodo ignored most of the terms in the old SNL sketch with Richard Pryor and Chevy Chase.  (The available videos, last time I checked, are of low quality, but like Chris Rock’s routine, it is an important document that everyone interested in race and media should be familiar with. A link along with a partial transcript is in this earlier post.)

Content Analysis Is a Bitch

March 18, 2016
Posted by Jay Livingston

Can Twitter tell us about the climate of intolerance? Do the words in all those tweets reveal something about levels of racism and sexism? Maybe. But the language of intolerance – “hate speech” – can be tricky to read.

Abodo is a website for people seeking apartments – Zillow for renters – and it recently posted an article, “America’s Most P.C. and Prejudiced Places” (here), with maps and graphs of data from Twitter. Here, for example, are the cities with the highest rates of misogynistic tweets.

Unfortunately, Abodo does not say which words are in its formula for “derogatory language against women.” But Abodo does recognize that bitch might be a problem because “it is commonly used as profanity but not always with sexist intent.” Just to see what those uses might be, I searched for “bitch” on Twitter, but the results, if not overtly sexist, all referred to a female as a bitch.

Maybe it was New Orleans. I tried again adding “NOLA” as a search term and found one non-sexist bitch.

When Abodo ran their much larger database of tweets but excluded the word bitch from its misogyny algorithm, New Orleans dropped from first place to fourth, and Baton Rouge disappeared from the top ten. Several Northeast and Western cities now made the cut.

This tells us what we might have known if we’d been following Jack Grieve’s Twitter research (here) – that bitch is especially popular in the South.

The Twitter map of cunt is just the opposite. It appears far more frequently in tweets from the Northeast than from the South.

The bitch factor changes the estimated sexism of states as well as cities. Here are two maps, one with and one without bitch in its sexism screen.

With bitch out of the equation, Louisiana looks much less nasty, and the other Southeast states also shade more towards the less sexist green. The Northeast and West, especially Nevada, now look more misogynistic. A few states remain nice no matter how you score the tweets – Montana, Wyoming, Vermont – but they are among the least populous states so even with Twitter data, sample size might be a problem. Also note that bitch accounts for most of what Abodo calls sexist language. Without bitch, the rates range from 26 to 133 per 100,000 tweets. Add bitch to the formula and the range moves to 74 to 894 per 100,000.  That means that at least two-thirds of all the “derogatory language against women” on Twitter is the word bitch.
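The share of "sexist" tweets that comes from this one word can be roughed out from the two ranges. The sketch below leans on an assumption the post does not confirm: that the low end of each range belongs to the same place, and likewise the high end. The function name is invented for the illustration.

```python
def share_from_word(rate_with, rate_without):
    """Fraction of the combined rate that disappears when the word is
    dropped from the count. Rates are per 100,000 tweets."""
    return (rate_with - rate_without) / rate_with


# Range endpoints quoted in the post (per 100,000 tweets).
low_end = share_from_word(rate_with=74, rate_without=26)
high_end = share_from_word(rate_with=894, rate_without=133)

print(f"low end of the range:  {low_end:.0%}")
print(f"high end of the range: {high_end:.0%}")
```

Under that assumption the word accounts for roughly two-thirds of the flagged tweets at the low end and about 85% at the high end.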

There’s a further problem in using these tweets as an index of sexism. Apparently a lot of these bitch tweets are coming from women (if my small sample of tweets is at all representative). Does that mean that the word has lost some of its misogyny? Or, as I’m sure some will argue, do these tweets mean that women have become “self-hating”? This same question is raised, in spades, by the use of nigger. Abodo has data on that too, but I will leave it for another post.

Which Percentages, Which Bars

March 13, 2016
Posted by Jay Livingston

Whence Trumpismo, as ABC calls it (here), as though he were a Latin American dictator? Where does Trump get his support? Who are the voters that prefer Trump to the other candidates?

The latest ABC/WaPo poll, out today, has some answers. But it also has some bafflingly screwed-up ways of setting out the results. For example, the ABC write-up (by Chad Kiewiet De Jonge) says what many have been saying: “Trump’s support stems from economic discontent, particularly among working-class whites.” Appropriately, the poll asked people how they were doing economically – were they Struggling, Comfortable, or Moving Up?

That’s pretty clear: economic circumstances are the independent variable, candidate preference is the dependent variable. You compare these groups and find that Strugglers are far more likely to support Trump than are the folks who are better off. Instead, we get a chart percentaged the other way.

Instead of comparing people of different economic circumstances, it compares the supporters of the different candidates. And it doesn’t even do that correctly. If you want to compare Trump backers with Cruz and Rubio/Kasich backers, the candidates should be the columns. (The poll merged Rubio and Kasich supporters for purposes of sample size.) Here’s the same data. Which chart is easier to interpret?

This makes the comparison a bit easier.  The margin of error is about 5 points. So Trump supporters might be somewhat more likely to see their economic circumstances as a struggle.

There’s a similar problem with their analysis of authoritarianism. “It’s also been argued that people who are predisposed to value order, obedience and respect for traditional authority tend to be strongly attracted to Trump.” But instead of comparing the very authoritarian with the less so, ABC/WaPo again compares the supporters of Trump, Cruz, and Rubio/Kasich.

Instead of telling us who authoritarians prefer, this analysis tells us which candidate’s backers have a higher proportion of authoritarians. And again, even for that, it makes the answer hard to see. Same data, different chart.

Cruz supporters, not Trumpistas, are the most authoritarian, probably because of that old time religion, the kind that emphasizes respect for one’s elders. (For more on Cruz supporters and uncompassionate Christian conservatism, see this post.)

The poll has worthwhile data, and it gets the other charts right. The pdf lists Abt SRBI and Langer Research as having done the survey and analysis. To their credit, they present a regression model of the variables that is far more sophisticated than what the popular press usually reports. But come on guys, percentage on the independent variable.
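To make the two ways of percentaging concrete, here is a small Python sketch. The counts are invented purely for illustration and are not the ABC/WaPo numbers; the function names are also made up. Rows are the independent variable (economic circumstances) and columns are the dependent variable (candidate preference).

```python
# HYPOTHETICAL counts, invented for this sketch. Not poll data.
counts = {
    "Struggling":  {"Trump": 60, "Cruz": 30, "Rubio/Kasich": 30},
    "Comfortable": {"Trump": 90, "Cruz": 80, "Rubio/Kasich": 100},
    "Moving Up":   {"Trump": 20, "Cruz": 20, "Rubio/Kasich": 30},
}


def percent_within_rows(table):
    """Percentage on the independent variable: within each economic
    group, what share backs each candidate? Each row sums to 100."""
    result = {}
    for group, row in table.items():
        total = sum(row.values())
        result[group] = {cand: 100 * n / total for cand, n in row.items()}
    return result


def percent_within_columns(table):
    """The other direction: within each candidate's backers, what share
    falls in each economic group? Each column sums to 100."""
    totals = {}
    for row in table.values():
        for cand, n in row.items():
            totals[cand] = totals.get(cand, 0) + n
    return {
        group: {cand: 100 * n / totals[cand] for cand, n in row.items()}
        for group, row in table.items()
    }


by_group = percent_within_rows(counts)
by_candidate = percent_within_columns(counts)
print(f"Strugglers backing Trump:         {by_group['Struggling']['Trump']:.0f}%")
print(f"Trump backers who are Struggling: {by_candidate['Struggling']['Trump']:.0f}%")
```

The same cell of the table yields two different percentages that answer two different questions. Only the first one says who Strugglers prefer.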

Margin of Error – Mostly Error

February 14, 2016
Posted by Jay Livingston

It’s the sort of social “science” I’d expect from Fox, not Vox. But today, Valentine’s Day, Vox (here) posted this map purporting to show the average amount people in each state spent on Valentine’s Day.

“What’s with North Dakota spending $108 on average, but South Dakota spending just $36?” asks Vox. The answer is almost surely: Error.

The sample size was 3,121. If they sampled each state in its proportion of the US population, the sample in each Dakota would be about n = 8 (not n = 80, as I first wrote). The source of the data, Finder, does not report any margins of error or standard deviations, so we can’t know. Possibly, a couple of guys in North Dakota who’d saved their oil-boom money and spent it on chocolates are responsible for that average. Idaho, Nevada, and Kansas – the only other states over the $100 mark – are also small-n. So are the states at the other end, the supposedly low-spending states (SD, WY, VT, NH, ME, etc.). So we can’t trust these numbers.

The sample in the states with large populations (NY, CA, TX, etc.) might have been as high as 300-400, possibly enough to make legitimate comparisons, but the differences among them are small – less than $20.
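A rough check on those state sample sizes, sketched in Python. The population figures are approximate mid-2010s values supplied for this sketch, not numbers from Finder or Vox, and the $100 standard deviation is purely hypothetical, since the source reports none. It is there only to show how fast the uncertainty grows as n shrinks.

```python
import math

SAMPLE_SIZE = 3121            # from the post
US_POPULATION = 321_000_000   # approximate; an assumption of this sketch

# Approximate state populations; assumptions of this sketch.
STATE_POPULATION = {
    "North Dakota": 757_000,
    "South Dakota": 858_000,
    "New York": 19_800_000,
    "Texas": 27_500_000,
    "California": 39_100_000,
}

HYPOTHETICAL_SD = 100.0       # made-up spread in dollars, for illustration


def expected_n(state):
    """Respondents a state would get if sampled in proportion to population."""
    return SAMPLE_SIZE * STATE_POPULATION[state] / US_POPULATION


def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


for state in STATE_POPULATION:
    n = expected_n(state)
    se = standard_error(HYPOTHETICAL_SD, n)
    print(f"{state:13s} n = {n:4.0f}   standard error of the mean = ${se:.0f}")
```

With proportional sampling, each Dakota gets only a handful of respondents, so one or two big spenders could move the state average by tens of dollars.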

My consultant on this matter, Dan Cassino (he does a lot of serious polling), confirmed my own suspicions. “The study is complete bullshit.”

UPDATE February 24, 2016: Andrew Gelman (here) downloaded the data and did a far more thorough analysis, estimating the variation for each state. His graph of the states shows that even between the state with the highest mean and the state with the lowest, the uncertainty is too great to allow for any conclusions: “Soooo . . . we got nuthin’.”

Andrew explains why it’s worthwhile to do a serious analysis even on frivolous data like this Valentine-spending survey. He also corrects my order-of-magnitude overestimation of the North Dakota sample size. 

Too Good to Be True

January 26, 2016
Posted by Jay Livingston

Some findings that turn up in social science research look too good to be true, as when a small change in inputs brings a large change in outcomes. Usually the good news comes in the form of anecdotal evidence, but systematic research too can yield wildly optimistic results.

Anecdotal evidence?  Everyone knows to be suspicious, even journalists. A Lexis-Nexis search returns about 300 news articles just in this month where someone was careful to specify that claims were based on “anecdotal evidence” and not systematic research.

Everywhere else, the anecdotal-systematic scale of credibility is reversed. As Stalin said, “The death of a million Russian soldiers – that is a statistic. The death of one Russian soldier – that is a tragedy.” He didn’t bother to add the obvious corollary: a tragedy is far more compelling and persuasive than is a statistic.

Yet here is journalist Heather Havrilesky in the paper of record reviewing Presence, a new book by social scientist Amy Cuddy:

This detailed rehashing of academic research . . . has the unintended effect of transforming her Ph.D. into something of a red flag.

Yes, you read that correctly. Systematic research supporting an idea is a bright red warning sign.

Amy Cuddy, for those who are not among the millions who have seen her TED talk, is the social psychologist (Ph.D. Princeton) at the Harvard Business School who claims that standing in the Wonder Woman “power pose” for just two minutes a day will transform the self-doubting and timid into the confident, assertive, and powerful. Power posing even changes levels of hormones like cortisol and testosterone.

Havrilesky continues.

While Cuddy’s research seems to back up her claims about the effects of power posing, even more convincing are the personal stories sent to the author by some of the 28 million people who have viewed her TED talk. Cuddy scatters their stories throughout the book. . . .

Systematic research is OK for what it is, Havrilesky is saying, but the clincher is the anecdotal evidence. Either way, the results fall into the category of “Amazing But True.”

Havrilesky was unwittingly closer to the truth with that “seems” in the first clause. “Cuddy’s research seems to back up her claims . . . ” Perhaps, but research done by people other than Cuddy and her colleagues does not.  As Andrew Gelman and Kaiser Fung detail in Slate, the power-pose studies have not had a Wonder Woman-like resilience in the lab. Other researchers trying to replicate Cuddy’s experiments could not get similarly positive results.

But outside the tiny world of replication studies, Cuddy’s findings have had a remarkable staying power considering how fragile* the power-pose effect was. The problem is not just that the Times reviewer takes anecdotal evidence as more valid. It’s that she is unaware that contradictory research was available. Nor is she unique in this ignorance. It pervades reporting even in serious places like the Times. “Gee whiz science,” as Gelman and Fung call it, has a seemingly irresistible attraction, much like anecdotal evidence. Journalists and the public want to believe it; scientists want to examine it further.

Our point here is not to slam Cuddy and her collaborators. . . . And we are not really criticizing the New York Times or CBS News, either. . . . Rather, we want to highlight the yawning gap between the news media, science celebrities, and publicists on one side, and the general scientific community on the other. To one group, power posing is a scientifically established fact and an inspiring story to boot. To the other, it’s just one more amusing example of scientific overreach.

I admire Gelman and Fung’s magnanimous view. But I do think that those in the popular press who report about science should do a little skeptical fact-checking when the results seem too good to be true, for too often these results are in fact too good to be true.

* “Fragile” is the word used by Joe Simmons and Uri Simonsohn in their review and replication of Cuddy’s experiments (here).

B is for Beauty Bias

January 6, 2016
Posted by Jay Livingston

The headlines make it pretty clear.
Attractive Students Get Higher Grades, Researchers Say

That’s from Newsweek. Slate copied Scott Jaschik’s piece, “Graded on Looks,” at Inside Higher Ed and gave it the title, “Better-Looking Female Students Get Better Grades.”

But how much higher, how much better?

For female students, an increase of one standard deviation in attractiveness was associated with a 0.024 increase in grade (on a 4.0 scale).

The story is based on a paper by Rey Hernández-Julián and Christina Peters presented at the American Economic Association meetings. 

You can read the IHE article for the methodology. I assume it’s solid. But for me the problem is that I don’t know if the difference is a lot or if it’s a mere speck of dust – statistically significant dust, but a speck nevertheless. It’s like the 2007 Price-Wolfers research on fouls in the NBA. White refs were more likely to call fouls on Black players than on Whites. Andrew Gelman (here), who is to statistics what Steph Curry is to the 3-pointer, liked the paper, so I have reservations about my reservations. But the degree of bias it found came to this: if an all-Black NBA team played a very hypothetical all-White NBA team in a game refereed by Whites, the refs’ unconscious bias would result in one extra foul called against the all-Blacks. 

I have the same problem with this beauty-bias paper. Imagine a really good-looking girl, one whose beauty is 2½ standard deviations above the mean – the beauty equivalent of an IQ of 137. Her average-looking counterpart with similar performance in the course gets a 3.00 – a B. But the stunningly attractive girl winds up with a 3.06 – a B.
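For anyone who wants to check that arithmetic, here is a minimal sketch. It assumes the effect is linear, so that 2½ standard deviations simply means 2½ times the 0.024 figure, and it assumes the conventional IQ scale with a mean of 100 and a standard deviation of 15. The function names are mine, not the paper’s.

```python
# Figure quoted in the post: 0.024 grade points per standard deviation of attractiveness.
GRADE_PER_SD = 0.024
BASELINE_GRADE = 3.00  # the average-looking student's grade (a B)
SD_ABOVE_MEAN = 2.5    # the hypothetical very attractive student


def predicted_grade(baseline, sd_above_mean, effect_per_sd=GRADE_PER_SD):
    """Assumes a linear effect: baseline plus effect size times standardized attractiveness."""
    return baseline + effect_per_sd * sd_above_mean


def iq_equivalent(sd_above_mean, mean=100, sd=15):
    """Assumes the conventional IQ scale (mean 100, SD 15)."""
    return mean + sd * sd_above_mean


grade = predicted_grade(BASELINE_GRADE, SD_ABOVE_MEAN)
print(f"Predicted grade: {grade:.2f}")                      # 3.06
print(f"IQ equivalent: {iq_equivalent(SD_ABOVE_MEAN)}")     # 137.5
```

The whole difference is six hundredths of a grade point, which is why both students end up with a B.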

The more serious bias reported in the paper is the bias against unattractive girls.

The least attractive third of women, the average course grade was 0.067 grade points below those earned by others.

It’s still not enough to lower a grade from B to B-, but perhaps the bias is greater against girls who are in the lower end of that lower third. The report doesn’t say.

Both these papers, basketball and beauty, get at something dear to the liberal heart – bias based on physical characteristics that the person has little power to change. And like the Implicit Association Test, they reveal that the evil may lurk even in the hearts and minds of those who think they are without bias. But if one foul in a game or one-sixth of a + or - appended to your letter grade on your GPA is all we had to worry about, I’d feel pretty good about the amount of bias in our world.

[Personal aside: the research I’d like to see would reverse the variables. Does a girl’s academic performance in the course affect her beauty score? Ask the instructor on day one to rate each student on physical attractiveness. Then ask him to rate them again at the end of the term. My guess is that the good students will become better looking.]

Men Are From Mars, Survey Respondents Are From Neptune

November 22, 2015
Posted by Jay Livingston

Survey researchers have long been aware that people don’t always answer honestly. In face-to-face interviews especially, people may mask their true opinion with the socially desirable response. Anonymous questionnaires have the same problem, though perhaps to a lesser degree. But self-administered surveys, especially the online variety, have the additional problem of people who either don’t take it seriously or treat it with outright contempt. Worse, as Shane Frederick (Yale, management) discovered, the proportion of “the random and the perverse” varies from item to item.

On open-ended questions, when someone answers “asdf” or “your mama,” as they did on an online survey Frederick conducted, it’s pretty clear that they are making what my professor in my methods class called “the ‘fuck you’ response.”

But what about multiple-choice items?
Is 8+4 less than 3? YES / NO
11% Yes.
Maybe 11% of Americans online really can’t do the math.  Or maybe all 11% were blowing off the survey. But then what about this item?

Were you born on the planet Neptune? YES / NO
17% Yes
Now the ranks of the perverse have grown by at least six percentage points, probably more. Non-responders, the IRB, and now the random and the perverse – I tell ya, survey researchers don’t get no respect.

Big hat tip to Andrew Gelman. I took everything in this post from his blog (here), where commenters try seriously to deal with the problem created by these kinds of responses.

Evidence vs. Bullshit – Mobster Edition

September 21, 2015
Posted by Jay Livingston

Maria Konnikova is a regular guest on Mike Pesca’s podcast “The Gist.”  Her segment is called “Is That Bullshit.” She addresses pressing topics like
  • Compression sleeves – is that bullshit?
  • Are there different kinds of female orgasm?
  • Are artificial sweeteners bad for your health?
  • Does anger management work?
We can imagine all kinds of reasons why compression sleeves might work or why diet soda might be unhealthful, but if you want to know if it’s bullshit, you need good evidence. Which is what Konnikova researches and reports on.

Good evidence is also the gist of my class early in the semester. I ask students whether more deaths are caused each year by fires or by drownings. Then I ask them why they chose their answer. They come up with good reasons. Fires can happen anywhere – we spend most of our time in buildings, not so much on water. Fires happen all year round; drownings are mostly in the summer. A fire may kill many people, but group drownings are rare. The news reports a lot about fires, rarely about drownings. And so on.

The point is that for a good answer to the question, you need more than just persuasive reasoning. You need someone to count up the dead bodies. You need the relevant evidence.

“Why Do We Admire Mobsters?” asks Maria Konnikova recently in the New Yorker (here).  She has some answers:
  • Prohibition: “Because Prohibition was hugely unpopular, the men who stood up to it [i.e., mobsters] were heralded as heroes, not criminals.” Even after Repeal, “that initial positive image stuck.”
  • In-group/ out-group: For Americans, Italian (and Irish) mobsters are “similar enough for sympathy, yet different enough for a false sense of safety. . .  Members of the Chinese and Russian mob have been hard to romanticize.”
  • Distance: “Ultimately the mob myth depends on psychological distance. . .  As painful events recede into the past, our perceptions soften. . . . Psychological distance allows us to romanticize and feel nostalgia for almost anything.”
  • Ideals: “We enjoy contemplating the general principles by which they are supposed to have lived: omertà, standing up to unfair authority, protecting your own.”
These are plausible reasons, but are they bullshit? Konnikova offers no systematic evidence for anything she says. Do we really admire mobsters? We don’t know. Besides, it would be better to ask: how many of us admire them, and to what degree? Either way, I doubt that we have good survey data on approval ratings for John Gotti. All we know is that mobster movies often sell a lot of tickets. Yet the relation between our actual lives (admiration, desires, behavior) and what we like to watch on screen is fuzzy and inconsistent.

It’s fun to speculate about movies and mobsters,* but without evidence all we have is at best speculation, at worst bullshit.

In a message to me, Maria Konnikova says that there is evidence, including surveys, but that the New Yorker edited that material out of the final version of her article.

* Nine years ago, in what is still one of my favorite posts on this blog, I speculated on the appeal of mafia movies (here). I had the good sense to acknowledge that I was speculating and to point out that our preferences in fantasyland had a complicated relation to our preferences in real life.

Cartwheeling to Conclusions

September 7, 2015
Posted by Jay Livingston

This post was going to be about kids – what the heck is wrong with these kids today – their narcissism and sense of entitlement and how that’s all because their wealthy parents and schools are so overprotective and doting, giving them trophies for merely showing up and telling them they’re so great all the time.

I’m skeptical about that view – both its accuracy and its underlying values (as I said in this post about “Frances Ha”). But yesterday in Central Park there was this young dad with a $7500 camera.

I was reminded of something from a photo class I once took at Montclair. We were talking about cameras – this was decades ago, long before digital -  and the instructor Klaus Schnitzer said dismissively: “Most Hasselblads are bought by doctors who take snapshots of their kids on weekends.”

Now here was this guy with his very expensive camera taking videos of his 9-year-old daughter doing cartwheels. And not just filming her. He interviewed her, for godssake - asked her a couple of questions as she was standing there (notice the mike attached to the camera) as though she were some great gymnast. This is going to be one narcissistic kid, I thought, if she wasn’t already. I imagined her parents in a few years giving her one of those $50,000 bat mitzvahs – a big stage show with her as the star. My Super Sweet Thirteen.

Maybe it was also because the dad reminded me of the Rick Moranis character in the movie “Parenthood,” the father who is over-invested in the idea of his daughter’s being brilliant. 

(The guy looked a little like Moranis. I’ve blurred his face in the photos here, but trust me on this one. My wife thought so too.)

But here’s where the story takes a sharp turn away from the millennial clichés. My wife, who had been a working photographer, went over to ask him about his camera. It turns out that he works for “20/20,” and ABC had asked him to try out this Canon C-100. It was ABC’s camera not his, and as much as he was indulging his daughter, she was indulging him – agreeing to do the cartwheels and mock interview for purposes of his work.

OK, it wasn’t exactly the second-generation kid working in her immigrant parents’ vegetable store, but it wasn’t the narcissism-generating scenario that I had imagined. 

The point is that my wife was a much better social psychologist than I was. If you want to find out what people are doing, don’t just look at them from a distance or number-crunch their responses on survey items. Talk with them.

Margin of Error Error

August 3, 2015
Posted by Jay Livingston

The margin of error is getting more attention than usual in the news. That’s not saying much since it’s usually a tiny footnote, like those rapidly muttered disclaimers in TV ads (“Offer not good mumble mumble more than four hours mumble mumble and Canada.”) Recent headlines proclaim, “Trump leads Bush. . .” A paragraph or two in, the story will report that in the recent poll Trump got 18% and Bush 15%.  That difference is well within the margin of error, but you have to listen closely to hear that. Most people don’t want to know about uncertainty and ambiguity.

What’s bringing uncertainty out of the closet now is the upcoming Republican presidential debate.  The Fox-CNN-GOP axis has decided to split the field of presidential candidates in two based on their showing in the polls. The top ten will be in the main event. All other candidates – currently Jindal, Santorum, Fiorina, et al. – will be relegated to the children’s table, i.e., a second debate a month later and at the very unprime hour of 5 p.m.

But is Rick Perry’s 4% in a recent poll (419 likely GOP voters) really in a different class than Bobby Jindal’s 2%? The margin of error that CNN announced in that survey was a confidence interval of +/- 5.  Here’s the box score.

Jindal might argue that with a margin of error of 5 points, his 2% might actually be as high as 7%, which would put him in the top tier. 

He might argue that, but he shouldn’t.  Downplaying the margin of error makes a poll result seem more precise than it really is, but using that one-interval-fits-all number of five points understates the precision.  That’s because the margin of error depends on the percent that a candidate gets.  The confidence interval is larger for proportions near 50%, smaller for proportions at the extreme. 

Just in case you haven’t taken the basic statistics course, here is the formula.

$$\text{margin of error} = 1.96 \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}$$

The $\hat{p}$ (pronounced “pee hat”) is the proportion of the sample who preferred each candidate, and $n$ is the sample size. For the candidate who polled 50%, the numerator of the fraction under the square root sign will be 0.5 (1-0.5) = .25.  That's much larger than the numerator for the 2% candidate:  0.02 (1-0.02) = .0196.*

Multiplying by 1.96, the 50% candidate’s margin of error with a sample of 419 is +/- 4.8. That’s the figure that CNN reported. But plug in Jindal’s 2%, and the result is much less: +/- 1.3.  So there’s a less than one in twenty chance that Jindal’s true proportion of support is more than 3.3%.
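Here is the same calculation as a short sketch, for readers who would rather run it than take my word for it. It assumes the standard normal-approximation formula described above, 1.96 times the square root of p-hat times (1 minus p-hat) divided by the sample size. The function name is mine.

```python
import math

Z_95 = 1.96      # multiplier for a 95% confidence interval, as in the post
SAMPLE_SIZE = 419  # likely GOP voters in the CNN poll


def margin_of_error(p_hat, n, z=Z_95):
    """Normal-approximation margin of error, returned as a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


for label, p_hat in [("50% candidate", 0.50), ("Jindal at 2%", 0.02)]:
    moe_points = 100 * margin_of_error(p_hat, SAMPLE_SIZE)
    print(f"{label}: +/- {moe_points:.1f} points")
# 50% candidate: +/- 4.8 points
# Jindal at 2%: +/- 1.3 points
```

Same sample, same formula, but the interval for the 2% candidate is less than a third as wide as the one reported for the whole poll.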

Polls usually report their margin of error based on the 50% maximum. The media reporting the results then use the one-margin-fits-all assumption – even NPR. Here is their story from May 29 with the headline “The Math Problem Behind Ranking The Top 10 GOP Candidates.”

There’s a big problem with winnowing down the field this way: the lowest-rated people included in the debate might not deserve to be there.

The latest GOP presidential poll, from Quinnipiac, shows just how messy polling can be in a field this big. We’ve put together a chart showing how the candidates stack up against each other among Republican and Republican-leaning voters — and how much their margins of error overlap.

The NPR writer, Danielle Kurtzleben, does mention that “margins might be a little smaller at the low end of the spectrum,” but she creates a graph that ignores that reality.

The misinterpretation of presidential polls is nothing new.  But this time, that ignorance will determine whether a candidate plays to a larger or smaller TV audience.**

* There are slightly different formulas for calculating the margin of error for very low percentages.  The Agresti-Coull formula gives a confidence interval even if there are zero Yes responses. (HT: Andrew Gelman)
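A minimal sketch of the Agresti-Coull interval as it is usually presented: add z-squared pseudo-observations to the sample, half of them Yes, and then apply the ordinary formula to the adjusted proportion. Clipping the result to the 0-to-1 range is a common convention that I am assuming here, and the function name is mine.

```python
import math


def agresti_coull_interval(yes_count, n, z=1.96):
    """Agresti-Coull confidence interval for a proportion.

    Adds z**2 pseudo-observations (half of them Yes) before applying
    the usual normal-approximation formula. Bounds are clipped to [0, 1],
    which is an assumption of this sketch.
    """
    n_adj = n + z ** 2
    p_adj = (yes_count + z ** 2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# A candidate with zero supporters among 419 respondents still gets an interval.
low, high = agresti_coull_interval(0, 419)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

With zero Yes responses out of 419, the ordinary formula gives an interval of zero width, while this one runs from zero to roughly 1%.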

** Department of Irony: Some of these GOP politicians might complain about polls determining candidates’ ability to reach the widest audience. But none of them objects to having that ability determined by money from wealthy individuals and corporations.

David Brooks – The Great Resource

April 29, 2015
Posted by Jay Livingston

What would I do without David Brooks?

One of the exercises I always assign asks students to find an opinion piece – an op-ed, a letter to the editor – and to reduce its central point to a testable hypothesis about the relation between variables. What are the variables, how would you operationalize them, what would be their categories or values, what would be your units of analysis, and what information would you use to decide which category each unit goes in?

To save them the trouble of sifting through the media, I have a stockpile of articles that I’ve collected over the years – articles that make all sorts of assertions but without any evidence.  Most of them are by David Brooks. (OK, not most, but his oeuvre is well represented.)

Yesterday’s column (here) is an excellent example. His point is very simple: We should consider personal morality when choosing our political leaders. People with bad morals will also be bad leaders.

Voting for someone with bad private morals is like setting off on a battleship with awesome guns and a rotting hull. There’s a good chance you’re going to sink before the voyage is over.

People who are dishonest, unkind and inconsiderate have trouble attracting and retaining good people to their team. They tend to have sleazy friends. They may be personally canny, but they are almost always surrounded by sycophants and second-raters who kick up scandal and undermine the leader’s effectiveness. . .

But, historically, most effective leaders — like, say, George Washington, Theodore Roosevelt and Winston Churchill — had a dual consciousness. They had an earnest, inner moral voice capable of radical self-awareness, rectitude and great compassion. They also had a pragmatic, canny outer voice. . . .

Those three – Washington, TR, and Churchill – constitute the entirety of Brooks’s evidence for his basic proposition: “If candidates don’t acquire a moral compass outside of politics, they’re not going to get it in the White House, and they won’t be effective there.”

The comments from readers mentioned other leaders, mostly presidents. But how do you measure a politician’s effectiveness? And how do you measure a politician’s morality? More important, how do you measure them separately? Was Bush II moral? It’s very tempting to those on the left to see the failures of his presidency as not just bad decisions but as sins, violations of morality. Was Bill Clinton effective? Those who dwell on his moral failings probably don’t think so. Presumably, political scientists have some way of measuring effectiveness. Or do they? But does anyone have a standard measure of morality?

So Brooks gets a pass on this one. It’s not that he’s wrong, it’s that it would be impossible to get systematic evidence that might help settle the question.

Still, Brooks, in this column as in so many others, provides useful material for an exercise in methodology. If David Brooks didn’t exist, I would have to create him.

Good Time Charts, Bad Time Charts

April 21, 2015
Posted by Jay Livingston

How do you graph data to show changes over time? You might make “years” your x-axis and plot each year’s number. But you’re not the Washington Post. If you’re the Post Wonkblog (here), first you proclaim:
Here is that single chart.

The data points are years, but they seem to be in no logical order, and they overlap so much that you can’t tell which year is where.  Even for a point we can easily identify, 1987, it’s not clear what we are supposed to get.  In that year, the average income of the lower 90% of earners was about $33,000, and the average for the top 1% was about $500,000. But how was that different from 1980 or 1990? Go ahead, find those years. I’ll wait.

Here’s the same data,* different graph.

The point is clearer: beginning in 1980 or thereabouts, the 1% started pulling away from the lower 90%. 

A graph showing the percentage change shows the historical trends still more clearly.

From the mid-40s to 1980, incomes for the lower 90% were growing more rapidly than were incomes for the 1%. This period is what some now call “the great compression,” when income inequality decreased. Since 1980, income growth for the 90% has leveled off while incomes for the 1% have risen dramatically.

(The Post acknowledges that it got its material from Quoctrong Bui at NPR. But the NPR page (here) has two graphs, and the one that is similar to the one in the Post has a time-series animation that shows the year to year changes.)

* The data set, available here, comes from the Paris School of Economics. Presumably, it contains the data that Thomas Piketty has been working with.

Is That Evidence in Your Pocket, or Are You Just Writing an Op-Ed?

February 25, 2015
Posted by Jay Livingston

Nobody looks to USA Today op-eds for methodologically scrupulous research. Even so, James Alan Fox’s opinion piece this morning (here) was a bit clumsy. Fox was arguing against the idea that allowing guns on campus would reduce sexual assaults.

You have to admit, the gunlovers’ proposal is kind of cute. Conservatives are ostensibly paying attention to a liberal issue – the victimization of women – but their policy proposal is one they know liberals will hate. Next thing you know, the “guns everywhere” folks will be proposing concealed carry as a way to reduce economic inequality. After all, aren’t guns the great equalizer?

What makes the guns-on-campus debate so frustrating is that there’s not much relevant evidence. The trouble with Fox’s op-ed is that he pretends there is.

However compelling the deterrence argument, the evidence suggests otherwise. According to victimization figures routinely collected by the Bureau of Justice Statistics, the sexual assault victimization rate for college women is considerably lower (by more than one-third) than that among their non-college counterparts of the same age range. Thus, prohibiting college women from carrying guns on campus does not put them at greater risk.

You can’t legitimately compare college women on college campuses with non-college women in all the variety of non-college settings. There are just too many other relevant variables. Even if more campuses allow concealed carry, comparisons with gun-free campuses will be plagued by all the methodological problems that leave the “more guns, less crime” studies open to debate.

The rest of Fox’s op-ed about what might happen is speculation, some of it reasonable and some of it not. “Would an aroused and inebriated brute then use his ‘just in case of emergency’ gun to intimidate some non-consenting woman into bed? Submit or you’re dead?” 

But also pure speculation are the arguments that an armed student body will be a polite and non-sexually-assaultive student body.  Well, as long as we’re speculating, here’s my guess, based on what we know from off-campus data: the difference between gun-heavy campuses and unarmed campuses will turn up more in the numbers of accidents and suicides than in the number of sex crimes committed or deterred, and all these numbers will be small.

Data in the Streets

November 2, 2014
Posted by Jay Livingston

I confess, I have little memory for books or articles on methods. I may have learned their content, but the specific documents and authors faded rapidly to gray.  And then there’s Unobtrusive Measures. It must have been Robert Rosenthal who assigned it. He was, after all, the man who forced social scientists to realize that they were unwittingly affecting the responses of their subjects and respondents, whether those subjects were people or lab rats.  The beauty of unobtrusive measures is that they eliminate that possibility. 

Now that states have started to legalize marijuana, one of the questions they must deal with is how to tax it. Untaxed, weed would be incredibly cheap. “A legal joint would cost (before tax) about what a tea-bag costs” (Mark Kleiman, here). Presumably, states want to tax weed so that the price is high enough to discourage abuse and raise money but not so high that it creates a black market in untaxed weed.

The same problem already occurs with cigarettes.

The above graph, from a study commissioned by the Tax Foundation, shows that as taxes increase, so does smuggling. (The Tax Foundation does not show the correlation coefficient, but it looks like it might be as high as 0.6, though without that dot in the upper right, surely New York, it might be more like 0.5.)

In a high-tax area like New York City, many of the cigarettes sold are smuggled in from other states. But how much is “many cigarettes,” and how can you find out? Most studies of smuggled and counterfeit cigarettes get their estimates by comparing sales figures with smoking rates. The trouble with that method is that rates of smoking come from surveys, and people may not accurately report how much they smoke.

That’s why I liked this study by Klaus von Lampe and colleagues.* They selected a sample of South Bronx census tracts and walked around, eyes down, scanning the sidewalks for discarded cigarette packs to see whether the pack had the proper tax stamps.

All in all, they picked up 497; of those, 329 still had the cellophane wrapper that the stamp would be on.  If there was a tax stamp, they sent it to the state to determine if it was counterfeit.

In the end, they estimate that only 20% of the cigarettes were fully legit with state and city taxes paid. About two-fifths had no tax stamp, another 15% had counterfeit stamps, and 18% had out-of-state stamps.

Unobtrusive measures solve one methodological problem, but they are not perfect. The trouble here, and in many other cases, is the limited range.  Extending this research to the entire city, let alone the fifty states, would be a huge and costly undertaking.

* Hat tip to Peter Moskos, who mentioned it on his Cop in the Hood blog.

Whose Bad Guess is More Bad? Difficult Comparisons

October 29, 2014
Posted by Jay Livingston

How to compare percentages that are very different? 

A recent Guardian/Ipsos poll asked people in fourteen wealthy nations to estimate certain demographics. What percent of the population of your country are immigrants? Muslim? Christian?

People overestimated the number of immigrants and Muslims, and underestimated the number of Christians. But the size of the error varied.  Here is the chart on immigration that the Guardian published (here).

Italy, the US, Belgium, and France are way off. The average guess was 18-23 percentage points higher than the true percentage.  People in Japan, South Korea, Sweden, and Australia were off by only 7-8 percentage points.

But is that a fair comparison? The underlying question is this: which country is better at estimating these demographics? Japan and South Korea have only 2% immigrants. People estimated that it was 10%, a difference of eight percentage points. But looked at another way, their estimate was five times the actual number. The US estimate was only 2½ times higher than the true number.

The Guardian ranks Hungary, Poland, and Canada together since they all have errors of 14 points. But I would say that Canada’s 35% vs. 21% is a better estimate than Hungary’s 16% vs. 2%.  Yet I do not know a statistic or statistical technique that factors in this difference and allows us to compare countries with very few immigrants and those with far more immigrants.* 
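To make the two ways of measuring error concrete, here is a small sketch using only the figures quoted above. It computes the percentage-point gap (the Guardian’s ranking) alongside the ratio of guess to actual. I am not claiming the ratio is the right statistic. It is simply the “looked at another way” comparison, written out.

```python
# (actual %, average guess %) as quoted in the post
countries = {
    "Japan / South Korea": (2, 10),
    "Hungary": (2, 16),
    "Canada": (21, 35),
}


def point_gap(actual, guess):
    """Overestimate in percentage points (the measure the Guardian ranks by)."""
    return guess - actual


def guess_ratio(actual, guess):
    """Guess as a multiple of the true percentage."""
    return guess / actual


for name, (actual, guess) in countries.items():
    gap = point_gap(actual, guess)
    ratio = guess_ratio(actual, guess)
    print(f"{name}: gap = {gap} points, guess = {ratio:.1f}x actual")
# Japan / South Korea: gap = 8 points, guess = 5.0x actual
# Hungary: gap = 14 points, guess = 8.0x actual
# Canada: gap = 14 points, guess = 1.7x actual
```

Hungary and Canada tie on the first measure and are far apart on the second, which is the whole difficulty.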

My brother suggested that the Guardian’s readers could get a better picture of the differences if the chart ordered the countries by the immigrant percentage rather than by the percentage-point gap.

This makes clearer that the 7-point overestimate in Sweden and Australia is a much different sort of error than the 8-point overestimate in South Korea and Japan. But I’m still uncertain as to the best way to make these comparisons.

* Saying that I know of no such statistic is not saying much. Perhaps others who are more familiar with statistics will know how to solve this problem.