Why ‘good’ AI systems aren’t actually good for anyone
“The technologies built by a few techies from Silicon Valley run everything and none of us know how they work.” - David Middleback, Re-inventing Education for the Digital Age
A few months ago I was struck with the genius idea of embarking on a PhD program. My project aimed at figuring out how we can make some of those Silicon Valley AI technologies ‘better’ for everyone affected by them.
It sounded simple enough, right?
Well, it turns out that this simple description I was using to explain to friends and family what I was trying to achieve had a major flaw: what does ‘better’ even mean in the first place? So I went back and asked 6 different people what they thought it means to make a system, any system, ‘better’. I found out that the word ‘better’ meant something different to each person.
The two people I asked who had backgrounds in Computer Science spoke about aspects like speed, efficiency, reliability, etc. The two who didn’t have tech-based backgrounds spoke about improving the user experience: making it easier to use, requiring fewer steps to get things done, and so on. Someone who works in business said that ‘better’ meant that the system could allow her to do multiple things instead of having to switch between different apps to get things done. The last person I asked was a Medical Doctor, and they believed that ‘better’ meant having more up-to-date information available.
After going through this mini-experiment, it became really clear to me that a major issue with Machine Learning (ML) systems is that there’s no unified, usable set of metrics that can define what a ‘good’ system looks like.
If you can’t define what’s good, how can you tell what’s better?
There’s a massive divide between the theories describing what qualities AI systems should have and the actual practical and technical work happening in industry today (Stray, 2021). There are more than a dozen guidelines and rulebooks from big organizations across the globe describing the human values we should embed in a ‘good’ system. They’re all very similar, and you could probably come up with a lot of them yourself (check out an example here).
But in reality, human-like values and qualities such as ‘improving people’s lives’ are so difficult to translate into measurable metrics that it becomes nearly impossible to know whether a system is actually sticking to them. After all, how could you possibly tell whether or not your AI is enriching people’s lives?
But why is this important? Think of it this way: no effort we make to keep AI systems in check will make any difference if we don’t have an effective set of standards and metrics that people can evaluate those systems against. So let’s take a step back and talk a bit more about metrics.
What’s the Issue with Metrics?
A metric is simply a way of measuring something. The thing being measured can be physical (such as an object’s size) or theoretical/conceptual (such as user engagement), but the metric itself must be quantifiable and is usually numerical. Think of length or height for the first example, and the number of clicks or likes for the second: both are numeric and measurable.
The trouble is that the metrics that are easiest to measure are usually not very informative (e.g. the number of times an item was clicked, liked, or shared). The fact that an article was shared many times doesn’t mean it had educational value (we all share memes all day every day), doesn’t mean people liked it (I personally often share things out of anger or frustration), and doesn’t mean it impacted people’s well-being positively (it could be sad, scary, or, even worse, fake news).
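To make this concrete, here’s a minimal sketch of how such an ‘easy’ metric gets computed. The event log, item names, and fields are hypothetical (not any platform’s real schema); the point is simply that the log records that an item was shared, never why.

```python
from collections import Counter

# Hypothetical interaction log (invented items and fields, purely for illustration).
events = [
    {"item": "meme_42",     "action": "share"},  # shared for a laugh
    {"item": "fake_news_7", "action": "share"},  # shared out of outrage
    {"item": "fake_news_7", "action": "share"},
    {"item": "longread_3",  "action": "click"},  # genuinely educational, but only clicked once
]

# The easy-to-measure metric: count shares per item.
shares = Counter(e["item"] for e in events if e["action"] == "share")
print(shares.most_common())
# -> [('fake_news_7', 2), ('meme_42', 1)]
# The most-shared item 'wins', yet nothing in this number tells us about
# educational value, enjoyment, or impact on well-being.
```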
The difficult part is that we want these easily measured metrics to stand in for unobservable theoretical concepts and goals: promoting well-being, giving users more control, preventing fake information, ensuring diversity, and so on (Jacobs & Wallach, 2021; Gabriel, 2020). Choosing generic metrics driven by business-oriented goals (like increasing user engagement or connecting more people) not only unintentionally harms many users, but can also harm businesses’ performance and people’s perception of their products and brand.
Users Trapped in Filter Bubbles
Let’s put aside the bigger messes and scandals that make the news when machine learning models turn out to be biased (CW: think of Google Photos labelling images of Black people as gorillas), and look at something closer to home. Imagine you’re on TikTok or Instagram Reels, swiping up for hours and thinking ‘this content is so good, this thing is so clever and knows me so well’. What’s the harm in that, right?
ML recommender systems choose what appears on our social media feeds. In the name of increasing user engagement, these systems profile people to work out what kind of content each user will engage with the most, and then start showing them little else (known as forming a Filter Bubble around them). While we experience this as useful and entertaining, it can prevent us from evolving our interests, learning new things, or hearing viewpoints different from our own, keeping us trapped in the bubble.
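As a rough illustration of the mechanism (not any real platform’s algorithm; the topics, scores, and update rule are all made up), here’s what happens when a ranker only ever optimises predicted engagement: whatever the profile already favours gets reinforced until little else surfaces.

```python
# Hypothetical user profile: estimated probability of engaging with each topic.
profile = {"cats": 0.30, "politics": 0.25, "science": 0.25, "cooking": 0.20}

def recommend(profile, n=5):
    """Greedy engagement-maximising ranker: always serve the top-scoring topic."""
    return [max(profile, key=profile.get)] * n

for step in range(3):
    feed = recommend(profile)
    for topic in feed:
        # Each impression of the winning topic nudges its score up,
        # while every other topic quietly decays: the bubble tightens.
        profile[topic] = min(1.0, profile[topic] + 0.05)
        for other in profile:
            if other != topic:
                profile[other] = max(0.0, profile[other] - 0.01)
    print(step, feed[0], {k: round(v, 2) for k, v in profile.items()})

# After a few rounds 'cats' dominates the profile, so the feed shows cats
# and almost nothing else, even though the user never asked for that.
```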
In general, more people have recently become aware of the negative effects of social media, but we still don’t know how deep they run. For example, one study asked participants to deactivate their Facebook accounts for four weeks. The researchers found that the deactivation made people’s political views significantly less polarized and improved their “subjective well-being”. Many participants also chose to keep Facebook deactivated after the study was over (Allcott et al., 2020).
It’s Bad for Business Too
By now it should be clear that typical business metrics are not measuring the true effect of AI systems on the people using them and being affected by them. The lesser-known fact is that this is actually counterproductive for the businesses as well.
Empathy makes for great design and can be felt by users, and this applies to designing ML models too. It can inspire developers to create systems that are more transparent and explainable, where every decision is justified and the right parties can be held accountable. This shift in mindset can ultimately help users understand how systems work, instead of just imagining them as black boxes. According to design engineer and psychologist Don Norman (2013), this deeper understanding tends to make users much more tolerant of system errors and mistakes, and it increases their trust in the system. All of this feeds directly back into customer satisfaction and brand perception. So really, everyone wins.
How to Get There
Hopefully by now I’ve convinced you that companies need to do something about the metrics they’re using to evaluate their machine learning systems, and that this change has the potential to ultimately benefit everyone involved. So how can businesses set about doing this? Below I’ve put together a list of recommendations on how to get started, based on the viewpoints of prominent academics in the field: Adji Bousso Dieng, Jonathan Stray, and Rachel Thomas.
Always involve domain experts and users when setting evaluation metrics, to ensure that they’re actually effective
Collect diverse feedback and deconstruct hierarchical structures. As Stray et al. (2020) very nicely put it: the bus driver is as important as the mechanical engineer who designed the bus!
Make use of qualitative evaluation methods like user surveys and interviews
Evaluate metrics based on what they’re actually measuring, whether they’re meaningful and useful, and how they relate to external factors, instead of just focusing on optimising the metrics you already have in place (Stray, 2021)
When creating principles and guidelines, turn theoretical values into actionable steps. For example, instead of saying ‘the AI should be safe’, try saying something like ‘the AI will be safe by adhering to safety standards X and giving workers training before they use it’ (here's a great example of this in practice).
Introduce what Stray (2021) refers to as ‘Power User Features’, which give users control over what they are exposed to (like the “see this less often” feature on Facebook feeds); there’s a small sketch of this idea right after this list
And most importantly, be prepared to evolve and expand your metrics as new information from user research comes in. In this day and age, at least some of a company’s goals should be motivated by ethics and the desire to have a truly positive impact on users, and this can only come from extensive user research every step of the way. There’s no way to anticipate all the ways that users will experience the system AND all the issues that might arise from those experiences AND what metrics can be used to measure them. The release of your system isn’t the end of the journey, it’s just the beginning.
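To illustrate the ‘Power User Features’ recommendation above, here’s a minimal sketch of how an explicit “see this less often” signal could be folded into ranking as a user-controlled penalty that overrides the engagement model. The data structures, topic names, and the 0.5 penalty are all hypothetical, not any platform’s real API.

```python
from dataclasses import dataclass, field

@dataclass
class UserControls:
    """Explicit preferences the user sets, kept separate from inferred engagement."""
    see_less: set = field(default_factory=set)  # topics the user asked to see less of

def rank_feed(candidates, predicted_engagement, controls, penalty=0.5):
    """Rank candidate posts, down-weighting topics the user opted out of.

    candidates: list of (post_id, topic) pairs
    predicted_engagement: dict mapping post_id -> model score in [0, 1]
    """
    def score(item):
        post_id, topic = item
        s = predicted_engagement[post_id]
        if topic in controls.see_less:
            s *= penalty  # the user's explicit control overrides the engagement model
        return s
    return sorted(candidates, key=score, reverse=True)

controls = UserControls(see_less={"politics"})
candidates = [("p1", "politics"), ("p2", "cooking"), ("p3", "science")]
scores = {"p1": 0.9, "p2": 0.6, "p3": 0.5}
print(rank_feed(candidates, scores, controls))
# -> [('p2', 'cooking'), ('p3', 'science'), ('p1', 'politics')]
```

The design choice that matters here is that the user’s explicit preference lives outside the learned engagement score, so retraining or ‘improving’ the model can never silently undo it.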
References:
Allcott, H., Braghieri, L., Eichmeyer, S., & Gentzkow, M. (2020). The Welfare Effects of Social Media. American Economic Review, 110(3), 629-676. https://doi.org/10.1257/aer.20190658
Gabriel, I. (2020). Artificial Intelligence, Values and Alignment. Minds and Machines, 30, 411-437.
Jacobs, A. Z. & Wallach, H. (2021). Measurement and Fairness. Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21). https://doi.org/10.1145/3442188.3445901
Norman, D. (2013). The Design of Everyday Things. Basic Books. New York, New York.
Stray, J. (2020). Beyond Engagement: Aligning algorithmic recommendations with prosocial goals. Partnership on AI. https://www.partnershiponai.org/beyond-engagement-aligning-algorithmic-recommendations-with-prosocial-goals/
Stray, J., Adler, S., & Hadfield-Menell, D. (2021). What are you Optimizing For? Aligning recommender systems with human values. ICML 2020 Participatory Approaches to Machine Learning workshop. arXiv:2107.10939
Thomas, R. (2020). Metrics are Tricky If You Care About People. PowerPoint presentation at the University of San Francisco. https://docs.google.com/presentation/d/1P3_X9sfDiQXUPsZTjJLluL2hH-pjOMxlBj74Djz3r5U/edit#slide=id.g94a1cb7436_1_89
Thomas, R. & Uminsky, D. (2020). Reliance on Metrics is a Fundamental Challenge for AI. EDSC (Ethics of Data Science Conference). arXiv:2002.08512