Who is watching the artificial intelligence?
Katrina Borthwick (February 7, 2023)
No, I'm not talking Skynet here, or robots overrunning the earth. I'm talking about the more subtle tools in the background that tend to do a bunch of boring administration. You probably haven't thought about them much. But they might be responsible for what your doctor decided to prescribe you on your last visit, that job you never got shortlisted for, your acceptance into a programme of study, or the reason your bag was searched at the airport.
We are now using artificial intelligence (AI) to varying degrees to automate decisions that most individuals would consider important, such as security profiling, job selection, university acceptance or health risk prediction. The concept isn't new: the science of psychology has been studying human traits and how to predict human behaviour for over 100 years now, and there is a significant body of peer-reviewed research to draw on. What is new is the technology behind it. That technology is a double-edged sword. It gets stuff done that we don't have the time for, but it can also give us a false sense of security.
AI can take massive datasets and perform complex analyses, to the point that it can be easy to lose sight of how the data was derived in the first place. Worse still, the development of AI technology that predicts human behaviour is rarely driven by people with expertise in psychology – the developers are more often from an information technology discipline, or statisticians. Interdisciplinary collaboration, or even a shared technical language, often just isn't there. Added to that, some of the people involved may not be in a recognised discipline at all. For example, they might just be someone who is really good at developing snazzy-looking apps.
When we use tools to predict how humans are going to behave in these sorts of settings, there are going to be serious real-world consequences that people care about. Understandably, then, the increasing use of AI-based decision tools has been matched by an increasing wave of concern about the unfairness and bias that these tools may introduce.
There have been calls to get some checks and balances in place around such tools, including the suggestion that they be subjected to some sort of interdisciplinary audit or standards. By interdisciplinary, I mean that we would test the validity of the tool from the perspectives of not one but three sciences – psychology, statistics and computer science. In other words, before we release AI into the wild for use on actual human beings, we check that it does what it says on the box, is fair, unbiased and valid, and isn't going to cause stuff-ups and harm. Which raises the question: how do we do all that?
That's where Richard Landers and Tara Behrend come in. Both are Associate Professors in Psychology (profiles below), and this year they published a paper investigating the specific challenges around bias and fairness that come with bringing these domains together, and what the crucial components of an audit might need to be.
So, the question they needed to answer was: how do you determine whether an AI is fair and free from bias? And that leads straight to another question: how do we even define what “fair” and “bias” actually mean?
Anyone who has had an argument with a three-year-old knows that there are many interpretations of fair, and a lot of cross foot-stomping along the way. Apparently, it can get worse when you try to sort it out amongst learned scientists across multiple disciplines. You can't send anyone to the naughty step if they don't agree.
Landers and Behrend nail it down to three lenses we can use to understand the different ways people look at fairness and bias. The first is individual attitudes: does a decision made by an AI violate a person's own principles of fairness? The second is the legal, ethical and moral viewpoint, which is more formalised, for example through the courts or religious doctrine. The third comes from the different areas of technical expertise, each bringing huge depth of knowledge in a particular field, but likely to lack any shared terminology, and sometimes respect, between disciplines. There's no one-size-fits-all answer there, but at least there is a useful framework to apply to whatever context might be needed.
The researchers suggest any audit of an AI focus on three areas: the underlying model, how people perceive and understand it, and the big-picture elements, which they call meta-components.
The underlying model can be tested by looking at the quality of the data going in (including the research design), how the model was derived, the processes and algorithms that generate the output data, and the predictive quality of those outputs. Generally, these are the sorts of things we would be looking for as skeptics when trying to determine whether a scientific study has any validity.
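To make that last point a bit more concrete, here is a minimal sketch (not from the paper) of one narrow kind of output check an auditor might run: comparing selection rates across groups using the conventional "four-fifths" rule of thumb. The group names, score threshold and data below are entirely made up for illustration.

```python
# Minimal sketch of a group-level output check an auditor might run.
# Everything here (group names, threshold, data) is hypothetical.
import numpy as np

rng = np.random.default_rng(seed=42)

# Pretend audit data: model scores, the decisions they drive, and group labels.
scores = rng.uniform(0, 1, size=500)
decisions = scores > 0.6                      # e.g. "shortlist" if score > 0.6
groups = rng.choice(["group_a", "group_b"], size=500)

# Selection rate for each group, then the ratio of the lowest to the highest.
rates = {g: decisions[groups == g].mean() for g in np.unique(groups)}
impact_ratio = min(rates.values()) / max(rates.values())

print("Selection rates:", {g: round(float(r), 2) for g, r in rates.items()})
print("Adverse impact ratio:", round(float(impact_ratio), 2))

# A ratio well below ~0.8 (the "four-fifths" rule of thumb) is one conventional
# red flag worth investigating – a starting point, not a verdict on fairness.
```

Of course, this is only one small slice of what Landers and Behrend have in mind – their framework also covers the design and data behind the model, and the human side of things, not just the numbers coming out.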
In terms of perceptions, they suggest the focus be on whether the messaging matches what the model is supposed to be doing (is that IQ test really predicting performance?), the reactions of those being assessed to the outcomes, and what outside observers are saying, including through consultation with regulatory groups and community organisations. If an AI is producing daft outcomes, there is going to be a bit of noise about it.
The big-picture elements they suggest be included are the cultural context, including participation of the affected community in the design (ask anyone involved in user testing: if you don't involve the people affected in the design of a tool, you can count on them doing entirely unexpected things with it); respect, including adherence to general ethical standards; and finally the research design, that is, whether the study supports the claims being made.
Overall, the researchers conclude that, for the design and audit of AI systems to be done well, we need to take a truly multidisciplinary approach, involving expertise from psychology, statistical modelling, and AI and machine learning from end to end. This would need to be built on a platform of solid collaboration on the algorithms, and a shared technical language from the outset. And as the number of professionals styling themselves as AI auditors grows, it is also apparent that there would need to be some sort of certification or professionalisation of AI auditors.
Honestly, I think they summarise the problem and the solution best when they say “We need each other, at every step, to develop the best AIs possible; at this point, prediction systems built within disciplinary silos are incomplete at best and destructive at worst.”
Reference
Landers, R. N., & Behrend, T. S. (2023). Auditing the AI auditors: A framework for evaluating fairness and bias in high stakes AI predictive models. American Psychologist, 78(1), 36–49. https://doi.org/10.1037/amp0000972
About the authors
Richard N. Landers - Associate Professor, Department of Psychology, University of Minnesota, Twin Cities. Landers's research concerns the use of innovative technologies in assessment, employee selection, adult learning, and research methods. Recent topics have included big data, game-based learning, game-based assessment, gamification, unproctored Internet-based testing, mobile devices, virtual reality, and online social media.
Tara S. Behrend - Associate Professor, Department of Psychological Sciences, Purdue University. She is an internationally recognized expert in workplace technology use, having published extensively on topics related to big data, surveillance, privacy, and learning. Her work in the area of STEM careers and education has been widely influential as well.