Does banning the box increase hiring discrimination?
“Our results support the concern that BTB [Ban the Box] policies encourage racial discrimination: the black-white gap in callbacks grew dramatically at companies that removed the box after the policy went into effect. Before BTB, white applicants to employers with the box received 7% more callbacks than similar black applicants, but BTB increased this gap to 43%. We believe that the best interpretation of these results is that employers are relying on exaggerated impressions of real-world racial differences in felony conviction rates.”
- These results bolster longstanding concerns about perverse consequences arising from ban the box legislation. (Similar studies include this one from 2006, and this one from 2016.) A 2008 paper provides a theoretical accompaniment to these worries, arguing that a privacy tradeoff is required to ensure race is not being used as a proxy for criminal history: “By increasing the availability of information about individuals, we can reduce decisionmakers’ reliance on information about groups.… reducing privacy protections will reduce the prevalence of statistical discrimination.” Link.
- In a three part series from 2016, Noah Zatz at On Labor took on the perverse consequences argument and its policy implications, levelling three broad criticisms: “it places blame in the wrong place, it relies upon the wrong definition of racial equality, and it ignores cumulative effects.” Link to part one.
- A 2017 study of ban the box that focussed on the public sector—where anti-discrimination enforcement is more robust—found an increase in the probability of hiring for individuals with convictions and “no evidence of statistical discrimination against young low-skilled minority males.” Link.
- California’s Fair Chance Act went into effect January 1, 2018, joining a growing list of fair hiring regulations in many other states and counties by extending ban the box reforms to the private sector. The law provides that employers can only conduct criminal background checks after a conditional offer of employment has been made. More on the bill can be found here.
- Two posts on the California case, again by Zatz at On Labor, discuss several rich policy design questions raised by the “bright line” standards included in this legislation, and how they may interact with the prima facie standard of disparate impact discrimination: “Advocates fear, however, that bright lines would validate the exclusion of people on the wrong side of the line, despite individualized circumstances that vindicate them. But of course, the opposite could also be the case.” Link.
- Tangentially related, Ben Casselman reports in the New York Times that a tightening labor market may be encouraging some employers to hire beyond the box—without legislative guidance. Link.
An excerpt from Virginia Eubanks’s Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor
“Geographical isolation might be an important factor in child maltreatment, for example, but it won’t be represented in the data set because most families accessing public services in Allegheny County live in dense urban neighborhoods. A family living in relative isolation in a well-off suburb is much less likely to be reported to a child abuse or neglect hotline than one living in crowded housing conditions. Wealthier caregivers use private insurance or pay out of pocket for mental health or addiction treatment, so they are not included in the county’s database.
“Imagine the furor if Allegheny County proposed including monthly reports from nannies, babysitters, private therapists, Alcoholics Anonymous, and luxury rehabilitation centers to predict child abuse among middle-class families. ‘We really hope to get private insurance data. We’d love to have it,’ says Erin Dalton, director of Allegheny County’s Office of Data Analysis, Research and Evaluation. But, as she herself admits, getting private data is likely impossible. The professional middle class would not stand for such intrusive data gathering.”
- Danah Boyd of Data & Society writes about the book here: “Eubanks eschews the term ‘ethnography’ because she argues that this book is immersive journalism, not ethnography. Yet, from my perspective as a scholar and a reader, this is the best ethnography I’ve read in years. ‘Automating Inequality’ does exactly what a good ethnography should do — it offers a compelling account of the cultural logics surrounding a particular dynamic, and invites the reader to truly grok what’s at stake through the eyes of a diverse array of relevant people.” ht Will
- A few weeks ago, the New York Times Magazine published a long story also about Allegheny County’s predictive analytics in child protective services. Link. Virginia Eubanks tweeted a critical response to that story from the National Coalition for Child Protection Reform, which explicitly compares the two stories: “Everything the Times Magazine story got wrong, this story, from Wired, got right. The author looked at the same predictive analytics model in the same county that was the subject of the Times story.” Link.
- In New York, City Council member James Vacca has been leading the establishment of a task force for algorithmic transparency. “The task force will be the first city-led initiative of its kind in the country, and it is likely to have a significant impact, nationally and internationally, when it reports its findings, in late 2019. There is no doubt, however, that the final law represents a scaling back of Vacca’s early ambitions.” Link. ht Michael
Considerations on a fund for all Americans
“With the current consumption-based approach to social and economic policy, there will always be a disconnect between the macroeconomic health of the U.S. economy and the economic fortunes of the typical American family. That’s because technology-induced productivity growth often results in a windfall for the few at the top and little or no increased income rewards for those at the bottom. By contrast, if everyone were an investor, national productivity gains could instead be distributed in the form of dividends. When productivity went up, we could actually work less and take more time off when our kids were born or our parents were ailing, for instance. Such a work-deemphasizing approach would represent nothing short of a whole new economic policy–one better fit to a post-industrial knowledge economy and a fragile global ecosystem threatened by our consumerist culture.”
In a 2009 piece, DALTON CONLEY sketches out what a fund would look like. Full piece here.
A brief tour through the rest of the recent history of the idea:
- A 2013 piece from Miles Kimball: “Markets today are so hungry for assets as safe as US Treasurys, and so frightened of risk (pdf), that a US sovereign wealth fund would be paid handsomely to provide safe assets and shoulder some of the risk. But those financial returns are a bonus over and above the primary aim: fostering full economic recovery.” Link. Kimball and Roger Farmer round up more of their ideas in a 2014 blog post here.
- We’ve previously shared this Matt Bruenig op-ed from November 2017, which prompted many responses, such as this from Noah Smith.
- Norway’s sovereign wealth fund, the largest state-owned investment fund, hit $1 trillion in 2017. Link. The fund has made news in recent months for increased activity in corporate governance, in line with its ethical guidelines: voting no on CEO compensation proposals, banning investment in companies involved in nuclear arms production, and proposing to divest from oil and gas companies. (Link, link, link.) Relatedly, BlackRock CEO Larry Fink’s letter to CEOs encourages/threatens companies to work with a view to sustainability and responsiveness to all their stakeholders, “including shareholders, employees, customers, and the communities in which they operate.” Link.
- David Bollier on system change and managing shared wealth in the post-capitalist commons. Link. ht Michael
- Carbon Brief has posted an excellent in-depth series on how climate modeling actually works, which addresses many of the complexities of earth science and answers questions like “How do scientists validate climate models?” and “What is the process for improving models?” Link.
- Relatedly: “‘Our study indicates that if emissions follow a commonly used business-as-usual scenario, there is a 93 per cent chance that global warming will exceed 4C by the end of this century,’ said Dr Ken Caldeira, an atmospheric scientist at the Carnegie Institution for Science, who co-authored the new study.” Link.
- A takedown of a widely discussed paper from last fall that claimed neural networks could predict sexual orientation based on physiognomy: “Much of the ensuing scrutiny has focused on ethics, implicitly assuming that the science is valid. We will focus on the science.” Link. Jack Clark summarizes: “Rather than developing a neural network that can infer your fundamental sexual proclivity by looking at you, the researchers have instead built a snazzy classifier that conditions the chance on whether you are gay or straight based on whether you’re wearing makeup or glasses or not.” Link. ht Margarita
- On quantitative vs. qualitative approaches in sociology: “For example, many publications on social movements tend to be case based and involve discourse analysis. Similarly, questions of identity tend to be studied using qualitative, ethnographic methods. On the other hand, topics related to employment, income and education (presumably dealing with inequality) employ more frequently a quantitative methodology. It might be interesting to see these topics studied from the other method: identity from a quantitative perspective and employment, income and education from a qualitative perspective.” Link. ht Bobby
- Scrutinizing improvements to Google Maps you may not have noticed. Link.
- Wired’s February issue has a Free Speech theme. Two of interest: Zeynep Tufekci’s article: “Here’s how this golden age of speech actually works: In the 21st century, the capacity to spread ideas and reach an audience is no longer limited by access to expensive, centralized broadcasting infrastructure. It’s limited instead by one’s ability to garner and distribute attention.” Link. And Doug Bock Clark on Megan Squire, a North Carolina computer scientist who created a database of alt-right members. Link.
- A report from Brookings: “The looming student loan default crisis is worse than we thought.” Link. ht Sihya
- From the St. Louis Fed: The Political Economy of Education. “Much of the framing around wealth disparity, including the use of alternative financial service products, focuses on the poor financial choices and decisionmaking on the part of largely Black, Latino, and poor borrowers, which is often tied to a culture of poverty thesis regarding an undervaluing and low acquisition of education. This framing is wrong—the directional emphasis is wrong. It is more likely that meager economic circumstance—not poor decisionmaking or deficient knowledge—constrains choice itself and leaves borrowers with little to no other option but to use predatory and abusive alternative financial services.” Link. ht Will
- A tech ethnographer describes the workings of Uber/Lyft driver forums: “With many app screenshots of their work proliferating across forums, driver-to-driver comparisons spread across a disaggregated workforce from diverse cities, fueling a pervasive sense of disparity and suspicions of unfairness. At an individual level, some of the Uber and Lyft drivers I interview shrug off pay discrepancies; others are disturbed by them. But the group dynamics of online forums build off of a common sense of the inequities that affect all drivers.” Link.
Jay comments on Automating Inequality:
The NCCPR piece is good as criticism of the NYT’s editorial decisions. It rightly points out myriad problems with their reporting. It doesn’t do as well, though, on critiquing the actual program. The Wired piece is better, but still seems misdirected.
Many of the criticisms seem to be comparing the algorithmic system with an ideal system. An ideal system wouldn’t be biased; the algorithm is biased. An ideal system would have a better false positive/false negative tradeoff; the algorithm has a mediocre one. An ideal system would understand that there’s a substantial difference between a child living in poverty and a child living in neglect; the algorithm assimilates these. These are entirely fair worries: they establish that there’s room for improvement in the algorithm, and suggest some places this could happen.
They don’t, however, constitute arguments for ditching the algorithm. What’s relevant for that decision is whether the algorithm does better than the relevant alternative – the previously-existing screening system (or a modification thereof). And there seems to be very little in the way of argumentation for the claim that it doesn’t. The blog post rightly points out that there are systematic prejudices and injustices in the way foster care and childhood services are administered. The Wired piece points out that most of the racial disproportionality “actually arises from referral bias, not screening bias.” That’s true of the previous system. It’s true of the new system. This isn’t a problem with the algorithm per se; it’s a problem with the way experts think about these decisions. Better ways of thinking would likely lead to better systems, human or algorithmic.
Algorithmic analyses aren’t a silver bullet against systematic bias. People who think they are, are wrong, plain and simple. But automatic decision systems do have an advantage for those fighting such bias: when they’re open (as in the Allegheny County system), they allow for examination of where bias enters the system.
The analytics failures the blog post worries about are false positives: the algorithm wrongly decides that a child is at high risk. But this worry is incomplete. Any classification algorithm has to deal with a tradeoff between false positives (a child that isn’t at risk is labeled at-risk) and false negatives (a child that is at risk is labeled safe). The characteristics of this tradeoff vary from algorithm to algorithm (this is usually described by the ROC curve; see this video for a brief explanation). A good classifier will require fewer false positives for a given false negative rate, but for any real-world classifier, there is some trade-off.
So there are really two big decisions any classifier designer has to make. First, she’s got to choose the algorithm (and hence the tradeoff curve). Second, she’s got to choose a target level: how many false negatives are we willing to accept? Once she does these things, the number of false positives is determined – she doesn’t have any more choice in the matter. It doesn’t make a lot of sense to criticize getting lots of false positives without taking these facts into account. If you’re going to criticize the accuracy of the algorithm, you need to either criticize the choice of algorithm, or the target cutoff.
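To make this concrete, here is a minimal sketch in Python. The scores and labels are made up for illustration (they are not AFST data), and the function names are hypothetical. For a fixed set of risk scores, capping the false negative rate picks out a threshold, and the number of false positives then follows with no further choice involved.

```python
def confusion_at(scores, labels, threshold):
    """Return (false_positives, false_negatives) when score >= threshold is flagged at-risk."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


def threshold_for_target_fnr(scores, labels, max_fnr):
    """Return the highest threshold whose false negative rate is at most max_fnr."""
    positives = sum(labels)
    for t in sorted(set(scores), reverse=True):
        _, fn = confusion_at(scores, labels, t)
        if fn / positives <= max_fnr:
            return t
    return min(scores)  # flag everyone if no stricter threshold qualifies


# Illustrative data only: 1 = truly at risk, 0 = not at risk.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for max_fnr in (0.5, 0.25, 0.0):
    t = threshold_for_target_fnr(scores, labels, max_fnr)
    fp, fn = confusion_at(scores, labels, t)
    print(f"max FNR {max_fnr}: threshold {t}, false positives {fp}, false negatives {fn}")
```

On this toy data, tightening the false negative cap from 0.5 to 0.25 to 0 raises the false positive count from 0 to 1 to 2. A better scoring model would shift those numbers, but not remove the tradeoff.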
A) First, we might worry that the designer chose a bad classifier: that better choices exist, which would result in fewer false positives at the target level of false negatives. One question you could ask is whether the new algorithm has a worse trade-off than the previous classification method (since humans sitting in offices using whatever process they use is a decision method, too!). We should hope that the algorithm at least improves on that; I didn’t notice any contrary claims in the articles.
Could the designers of this system have built an algorithm that improved even more on the status quo? Certainly so – with caveats. What the team did, briefly, was to first choose a class of models (probit/boosted probit), do preliminary tests to see which input features mattered for classification, and then select from that class the best model given the training data they had access to. The team reported their methods here (pdf); the actual mean AUC for the classifier is around 0.7604, which for most problems would be considered a decent-but-not-spectacular tradeoff profile.
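For readers unfamiliar with the metric: AUC is the area under the ROC curve, and it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, so 0.5 is coin-flipping and 1.0 is perfect ranking. The sketch below computes it that way on made-up data. It is only an illustration of the definition (the function name is hypothetical), not the evaluation procedure the team used.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Illustrative data only: 1 = truly at risk, 0 = not at risk.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(auc(scores, labels))
```

Here 13 of the 16 positive-negative pairs are ranked correctly, giving an AUC of 0.8125.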
They could have improved on this if they had had better data, but they didn’t. Among the class of models, given the selection of input features (which I’ll not get into, because none of the critics focus on it), the classifier they chose was the best they could have chosen. If they had chosen a different model with these features, they would have gotten worse results: there would be more mistakes.
The Wired article notes that “by relying on data that is only collected on families using public resources, the AFST unfairly targets low-income families for child welfare scrutiny.” This is, in a way, misleading. The input data certainly overrepresents the poor. But if the input data accurately represents the differences between poor children who are abused and poor children who are not, that doesn’t entail that the classifier will be biased against the poor. Rather, it means that the classifier will perform better when trying to decide whether a poor child is at risk than it will when trying to decide whether a middle- or upper-class child is at risk. On the other hand, if the input data fails to accurately represent the facts, by wrongly labeling past cases, this is, as above, an issue that has very little to do with the algorithm design. The data used were past decisions by case workers about whether to remove a child from his or her home. If the agency has been making those decisions badly (remember, these are the actual placement decisions, not the pre-screening decisions the algorithm is used for), then the problem is with the agency’s standards. I don’t think anyone would suggest that algorithm developers should be making that call.
They could also have improved the classifier by using a different class of models. A different class could even alleviate some of the differential accuracy worries noted above. If, for example, the predictors of neglect are markedly different for poor children and well-off children, the conditional dependencies can be much more easily captured with a decision tree or neural net than with a straightforward probit, especially given the effects of strong regularization. Interestingly, they chose not to use such methods because “they have the weakness that they tend to be ‘black box’ in the sense that it is more difficult to understand why a family received a high score” (p. 14). This seems to me like a valid place for criticism; rather than weighing the value of interpretability against the importance of various outcomes, they simply took it as a hard constraint. But the critics haven’t touched this point.
B) Second, we might worry that the algorithm designer has chosen the wrong target value for false negatives. Where to set this target is a value judgment (ahem). In most scientific investigations, it’s considered worse to believe a false claim than to miss out on a true claim: the tradeoff is weighted so that we get few false positives, at the expense of many false negatives. In medical trials, the situation is often more relaxed: we’re willing to countenance a moderate number of treatments that don’t work if doing so means we only very rarely miss out on cures. In cases like child welfare, ex ante we would probably think that the tradeoff should be strongly weighted toward avoiding false negatives: we want to save as many at-risk children as we can, even if that means we mistakenly intervene occasionally.
There are reasons to think this ex ante intuition is wrong-headed, and some are brought up in the linked articles. If, as a matter of fact, mistakenly intervening is very harmful, then maybe we should be willing to let some at-risk children slip through the net in order to reduce the number of mistaken interventions. If we can quantify the harms, then we can solve for the optimal cutoff using straightforward expected-utility optimization procedures. But doing this quantification is, unavoidably, a question of values: just how much worse is it to allow a child to remain in an abusive home than it is to take a child out of a perfectly fine home? This is a decision of ethics and public policy. The algorithm designers clearly shouldn’t be making this decision without the input of the child welfare advocates, but there’s no indication that they did (and the Pittsburgh algorithm was explicitly developed with input from community leaders).
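The optimization step can be sketched in a few lines of Python. Everything here is an assumption for illustration: the data are made up, the function names are hypothetical, and the cost weights are placeholders for exactly the value judgment described above. Total cost on a labeled sample stands in for expected disutility. The sketch shows only that once the two harms are given numbers, the cutoff follows mechanically.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Weighted count of mistakes when score >= threshold is flagged at-risk."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return cost_fp * fp + cost_fn * fn


def best_threshold(scores, labels, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    # float("inf") represents flagging no one at all.
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))


# Illustrative data only: 1 = truly at risk, 0 = not at risk.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

# Hypothetical weights: a missed at-risk child counted as 5x worse than a mistaken intervention.
print(best_threshold(scores, labels, cost_fp=1, cost_fn=5))
# The reverse weighting: a mistaken intervention counted as 5x worse.
print(best_threshold(scores, labels, cost_fp=5, cost_fn=1))
```

With the first weighting the cheapest cutoff on this toy data is 0.4 (no at-risk child missed, two mistaken flags); with the second it moves up to 0.8 (no mistaken flags, two at-risk children missed). The code is indifferent between the two; the weights carry all of the ethics.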
Each week we highlight research from a graduate student, postdoc, or early-career professor. Send us recommendations: email@example.com