Recentering Validity Considerations through
Early-Stage Deliberations Around AI and Policy Design
Abstract.
AI-based decision-making tools are rapidly spreading across a range of real-world, complex domains like healthcare, criminal justice, and child welfare. A growing body of research has called for increased scrutiny around the validity of AI system designs. However, in real-world settings, it is often not possible to fully address questions around the validity of an AI tool without also considering the design of associated organizational and public policies. Yet, considerations around how an AI tool may interface with policy are often only discussed retrospectively, after the tool is designed or deployed. In this short position paper, we discuss opportunities to promote multi-stakeholder deliberations around the design of AI-based technologies and associated policies, at the earliest stages of a new project.
1. Motivation
Organizations are rapidly adopting AI-based decision-making tools to augment human expert decisions in high-stakes settings like child maltreatment screening, criminal justice, and healthcare (De-Arteaga et al., 2021; Holstein and Aleven, 2021; Yang et al., 2016). Research and development efforts around these tools have aimed to help overcome resource constraints and limitations of human decision-making, such as inconsistencies and cognitive biases (Chouldechova et al., 2018; Kahneman et al., 2021; Levy et al., 2021). However, the in-situ use of AI-based decision-making tools has been met with significant contention (De-Arteaga et al., 2020; Green and Chen, 2019; Holstein and Aleven, 2021; Levy et al., 2021; Holten Møller et al., 2020). A growing body of research and media reporting has surfaced ways in which AI-based decision-making tools fail to produce value in practice, despite showing promising evaluation results prior to deployment (Yang et al., 2019; Kawakami et al., 2022). To address these concerns, research and policymaking efforts have increasingly focused on improving downstream properties of AI models, such as fairness, interpretability, or predictive accuracy. These efforts often begin with the assumption that the AI tool actually “works” and that its design is fundamentally sound, apart from such concerns (Coston et al., 2022; Raji et al., 2022).
However, field studies of AI-based decision-making tools actually used in organizations today are beginning to surface fundamental challenges around the validity of these tools (e.g., whether the model does what it purports to do). In complex, real-world decision-making contexts, models are typically trained to predict an imperfect proxy for human decision-makers’ actual decision-making goals. For example, in child welfare, prior research describes how frontline workers are required to make day-to-day decisions with an AI tool that predicts outcomes misaligned with their actual decision-making objectives, professional training, and legal constraints (Kawakami et al., 2022). While child welfare workers weigh the immediate safety risks and harms to a child when deciding whether to screen in a referral for investigation, the dominant AI tool design in this domain relies on long-term predictions of a child’s placement out of the home. In healthcare, clinicians may make decisions about resource allocation for high-risk patients by assessing each patient’s immediate medical needs, while an AI tool may predict longer-term healthcare costs (Kerr, 1975). In these decision-making contexts, underlying model validity and value alignment challenges have far-reaching downstream impacts on broader organizational culture and community welfare (Brown et al., 2019; Cheng et al., 2021). While expert decisions in these settings are guided by considerations around existing legal systems, developers’ current processes for designing models often fail to meaningfully involve legal experts, policymakers, or decision-makers with direct domain expertise. Instead, considerations around how an AI tool may interface with policy are often discussed retrospectively, after the tool is designed or deployed (Jackson et al., 2014; Yang et al., 2023).
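To make this proxy-outcome mismatch concrete, the sketch below shows one simple check that could be run at the earliest design stage: comparing the proxy label a model would be trained on against expert judgments of the construct the decision is actually about. The data, column names, and proxy definition here are purely hypothetical assumptions for illustration; this is not a description of any deployed tool or its evaluation.

```python
# Illustrative sketch only: hypothetical data and column names, not from any real deployment.
import pandas as pd

# Hypothetical referral-level data: each row is one screening decision.
referrals = pd.DataFrame({
    # Proxy outcome a model might be trained to predict
    # (e.g., out-of-home placement within some follow-up window).
    "proxy_placement_outcome": [0, 1, 0, 0, 1, 1, 0, 1],
    # Expert-coded judgment of the construct of interest
    # (immediate safety risk to the child at referral time).
    "expert_immediate_risk":   [1, 1, 0, 1, 0, 1, 0, 0],
})

# Simple agreement check: how often does the proxy label match expert judgments of the
# construct the screening decision is actually about? Low agreement is an early red flag
# that the proposed objective function may lack construct validity for this decision.
agreement = (referrals["proxy_placement_outcome"] == referrals["expert_immediate_risk"]).mean()
print(f"Proxy vs. construct agreement: {agreement:.0%}")
```

Even a coarse check like this, run on a small expert-annotated sample before any model is built, can surface the kind of misalignment described above while there is still room to change the prediction target.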
This status quo design process, which scatters policy and design considerations across time and space, presents several challenges to ensuring the design of sufficiently valid AI tools in real-world social decision contexts. In many cases, it is not possible to fully address questions about the validity of an AI tool without also considering the design of both organizational and public policies that shape how the system will be used. For example, evaluating whether a design proposal for an AI-based risk assessment tool captures an appropriate notion of “risk” requires understanding legal definitions of “risk” and the relevant policies governing how frontline decision-makers currently make decisions. Without considering interactions between technology design, law, and policy when designing an AI tool’s objective function, the resulting tool may lack validity. In this case, early-stage conversations around policy and design could proactively prompt new evaluations that assess the face or construct validity of a proposed AI tool design in the context of proposed or existing organizational policies. Beyond this specific example, there is a broader missed opportunity for communities of stakeholders (e.g., policymakers, frontline workers, leadership, developers) to proactively exchange and synthesize knowledge around validity, so that design decisions are both informed by, and inform, policy.
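As one illustration of how such early-stage deliberations could feed concrete validity evaluations, the sketch below records a hypothetical design proposal’s stated construct, proxy target, governing policies, and consulted stakeholders, and flags coarse mismatches for the group to revisit. All field names, checks, and example content are assumptions made for illustration, not an existing process or tool.

```python
# Minimal sketch of recording early-stage deliberation outputs and flagging validity concerns.
# Everything here (fields, checks, example values) is hypothetical and illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DesignProposal:
    construct: str      # what decision-makers are legally/professionally asked to assess
    proxy_target: str   # what the proposed model would actually be trained to predict
    policy_constraints: List[str] = field(default_factory=list)   # statutes, regulations, agency policies
    stakeholders_consulted: List[str] = field(default_factory=list)

def validity_flags(p: DesignProposal) -> List[str]:
    """Return coarse red flags to bring back to deliberation, not a verdict on validity."""
    flags = []
    if p.construct.strip().lower() != p.proxy_target.strip().lower():
        flags.append("Proxy target differs from the construct named in policy and practice.")
    if not p.policy_constraints:
        flags.append("No governing policies or legal definitions recorded for review.")
    if "frontline workers" not in (s.lower() for s in p.stakeholders_consulted):
        flags.append("Frontline decision-makers have not yet been consulted.")
    return flags

proposal = DesignProposal(
    construct="immediate safety risk to the child at referral",
    proxy_target="out-of-home placement within two years",
    policy_constraints=["state statute definition of maltreatment"],
    stakeholders_consulted=["developers", "agency leadership"],
)
print("\n".join(validity_flags(proposal)))
```

The point of such a record is not to automate the judgment itself, but to give a multi-stakeholder group a shared, inspectable artifact to deliberate over before an objective function is fixed.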
2. Centering Validity in Early-Stage AI and Policy Design Deliberations
Properly addressing these challenges requires turning our attention to the earliest stages of model development and adoption: How can we refocus policy and research efforts around validity considerations, by promoting early-stage deliberations around how to design AI-based technologies and associated policies? Today, we lack effective, practical processes for proactively engaging policymakers, developers, and other stakeholders in fundamental questions around the design and governance of AI systems (e.g., whether a deployed AI system will actually do what it purports to do). In this position paper, we propose supporting early-stage, multi-stakeholder deliberations around the validity of proposed AI tools as a step towards designing better policies and technologies together.
A growing chorus of research has called for increased developer attention to AI validity concerns (sometimes discussed via related concepts such as “AI functionality”) as an essential first step towards ensuring the safety of AI deployments (Coston et al., 2022; Raji et al., 2022; Wang et al., 2022). However, much of this work still lives at the theoretical level, geared towards academic researchers. Grounding these validity considerations in real-world design and policymaking settings requires a diverse pool of expertise, ranging from an understanding of existing organizational processes, needs, and constraints around designing AI tools, to tacit knowledge of the opportunities and boundaries for informing policy. Relatedly, it is critical that such early-stage deliberations promote knowledge-sharing and synthesis across a wide range of relevant stakeholders, including policymakers, frontline workers, community members, developers, and organizational leaders. Through this piece, we invite researchers, designers, and practitioners to explore anticipated challenges and opportunities in operationalizing these early-stage, multi-stakeholder deliberations for policy and design.
3. Open Questions
We invite the human-AI interaction, science and technology studies, machine learning, and other relevant communities to advance knowledge around the following open questions:
Shifting power imbalances in and through collaborative design. Effective early-stage deliberations around policy and design require engaging stakeholders (e.g., frontline decision-makers and community members) who are often left out of the model design process under the status quo. In other words, implementing an effective deliberation process may also require shifting institutional power imbalances across stakeholders of the AI tool. At the same time, a deliberation process may itself be structured to intentionally help shift power imbalances, for example, through the use of accessible and shared language or stakeholder-specific questions. How can we best shift power imbalances across stakeholders who vary in position and background, in the process of collaboratively designing policy and technology? What forms of imbalance cannot be nudged through deliberation, and how might other forms of support play a role?
Supporting evaluative and generative discourse. Deliberations about the validity of AI tools may need to be both evaluative and generative, to promote sufficient organizational buy-in and ensure resulting ideas produce more benefits than harms in practice. However, there may be tensions between different stakeholders and their (perceived) stances towards technical innovation versus evaluation. For example, there may be a (mis)conception that policies and laws constrain technical innovation, hindering effective conversation. How can we promote shared goals for collaborative policy and design discussions, while also valuing and leveraging differences in perspectives towards the role of AI innovation in a given context?
Connecting to local policymaking organizations. While early-stage deliberations may help identify and design new policies to complement technology design, they do not ensure that such policies are actually implemented post-deliberation. How can we better connect organizations with local policymaking groups, so that the outputs of designing policy and technology together are carried into practice?
Overcoming incentive structures and pressures. In practice, incentive structures, social pressures, or infrastructural barriers may hinder different stakeholders’ desire or ability to engage in substantive discourse around designing more responsible technologies and policies. What role might higher-level forces (e.g., regulation) play in ensuring that process-oriented solutions like supporting early-stage deliberations have sufficient teeth in practice?
Exploring other design opportunities for recentering validity. Supporting early-stage deliberations is one possible solution for recentering validity considerations in policy and design discourse. However, it may not be the best solution. What might other opportunities for refocusing validity considerations in design and policy look like?
Acknowledgements.
This work was generously funded by the CMU Block Center for Technology and Society, Award No. 53680.1.5007718.

References
- Brown et al. (2019) Anna Brown, Alexandra Chouldechova, Emily Putnam-Hornstein, Andrew Tobin, and Rhema Vaithianathan. 2019. Toward algorithmic accountability in public services: A qualitative study of affected community perspectives on algorithmic decision-making in child welfare services. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–12.
- Cheng et al. (2021) Hao-Fei Cheng, Logan Stapleton, Ruiqi Wang, Paige Bullock, Alexandra Chouldechova, Zhiwei Steven Wu, and Haiyi Zhu. 2021. Soliciting stakeholders’ fairness notions in child maltreatment predictive systems. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–17.
- Chouldechova et al. (2018) Alexandra Chouldechova, Emily Putnam-Hornstein, Suzanne Dworak-Peck, Diana Benavides-Prado, Oleksandr Fialko, Rhema Vaithianathan, Sorelle A Friedler, and Christo Wilson. 2018. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. Proceedings of Machine Learning Research 81 (2018), 1–15. http://proceedings.mlr.press/v81/chouldechova18a.html
- Coston et al. (2022) Amanda Coston, Anna Kawakami, Haiyi Zhu, Ken Holstein, and Hoda Heidari. 2022. A Validity Perspective on Evaluating the Justified Use of Data-driven Decision-making Algorithms. arXiv preprint arXiv:2206.14983 (2022).
- De-Arteaga et al. (2021) Maria De-Arteaga, Artur Dubrawski, and Alexandra Chouldechova. 2021. Leveraging expert consistency to improve algorithmic decision support. arXiv preprint arXiv:2101.09648 (2021), 1–33.
- De-Arteaga et al. (2020) Maria De-Arteaga, Riccardo Fogliato, and Alexandra Chouldechova. 2020. A case for humans-in-the-loop: Decisions in the presence of erroneous algorithmic scores. arXiv (2020), 1–12.
- Green and Chen (2019) Ben Green and Yiling Chen. 2019. The principles and limits of algorithm-in-the-loop decision making. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019). https://doi.org/10.1145/3359152
- Holstein and Aleven (2021) Kenneth Holstein and Vincent Aleven. 2021. Designing for human-AI complementarity in K-12 education. arXiv preprint arXiv:2104.01266 (2021).
- Holten Møller et al. (2020) Naja Holten Møller, Irina Shklovski, and Thomas T. Hildebrandt. 2020. Shifting concepts of value: Designing algorithmic decision-support systems for public services. NordiCHI (2020), 1–12. https://doi.org/10.1145/3419249.3420149
- Jackson et al. (2014) Steven J Jackson, Tarleton Gillespie, and Sandy Payette. 2014. The policy knot: Re-integrating policy, practice and design in CSCW studies of social computing. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing. 588–602.
- Kahneman et al. (2021) Daniel Kahneman, Olivier Sibony, and Cass R Sunstein. 2021. Noise: A flaw in human judgment. Little, Brown.
- Kawakami et al. (2022) Anna Kawakami, Venkatesh Sivaraman, Hao-Fei Cheng, Logan Stapleton, Yanghuidi Cheng, Diana Qing, Adam Perer, Zhiwei Steven Wu, Haiyi Zhu, and Kenneth Holstein. 2022. Improving Human-AI Partnerships in Child Welfare: Understanding Worker Practices, Challenges, and Desires for Algorithmic Decision Support. In CHI Conference on Human Factors in Computing Systems. 1–18.
- Kerr (1975) Steven Kerr. 1975. On the folly of rewarding A, while hoping for B. Academy of Management Journal 18, 4 (1975), 769–783.
- Levy et al. (2021) Karen Levy, Kyla E Chasalow, and Sarah Riley. 2021. Algorithms and Decision-Making in the Public Sector. Annual Review of Law and Social Science 17 (2021), 1–38.
- Raji et al. (2022) Inioluwa Deborah Raji, I Elizabeth Kumar, Aaron Horowitz, and Andrew Selbst. 2022. The fallacy of AI functionality. In 2022 ACM Conference on Fairness, Accountability, and Transparency. 959–972.
- Wang et al. (2022) Angelina Wang, Sayash Kapoor, Solon Barocas, and Arvind Narayanan. 2022. Against Predictive Optimization: On the Legitimacy of Decision-Making Algorithms that Optimize Predictive Accuracy. Available at SSRN (2022).
- Yang et al. (2019) Qian Yang, Aaron Steinfeld, and John Zimmerman. 2019. Unremarkable AI: Fitting intelligent decision support into critical, clinical decision-making processes. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3290605.3300468
- Yang et al. (2023) Qian Yang, Richmond Wong, Thomas Gilbert, Margaret Hagan, Steven Jackson, Sabine Junginger, and John Zimmerman. 2023. Designing Technology and Policy Simultaneously: Towards A Research Agenda and New Practice. (2023).
- Yang et al. (2016) Qian Yang, John Zimmerman, Aaron Steinfeld, Lisa Carey, and James F Antaki. 2016. Investigating the heart pump implant decision process: opportunities for decision support tools to help. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. 4477–4488.