Trust in/distrust of public sector data repositories

Posted by JN

My eye was caught by an ad for a PhD internship in the Social Media Collective, an interesting group of scholars in Microsoft Research’s NYC lab. What’s significant is the background they cite for the project.

Microsoft Research NYC is looking for an advanced PhD student to conduct an original research project on a topic under the rubric of “(dis)trust in public-sector data infrastructures.” MSR internships provide PhD students with an opportunity to work on an independent research project that advances their intellectual development while collaborating with a multi-disciplinary group of scholars. Interns typically relish the networks that they build through this program. This internship will be mentored by danah boyd; the intern will be part of both the NYC lab’s cohort and a member of the Social Media Collective. Applicants for this internship should be interested in conducting original research related to how trust in public-sector data infrastructures is formed and/or destroyed.

Substantive Context: In the United States, federal data infrastructures are under attack. Political interference has threatened the legitimacy of federal agencies and the data infrastructures they protect. Climate science relies on data collected by NOAA, the Department of Energy, NASA, and the Department of Agriculture. Yet, anti-science political rhetoric has restricted funding, undermined hiring, and pushed for the erasure of critical sources of data. And then there was Sharpie-gate. In the midst of a pandemic, policymakers in government and leaders in industry need to trust public health data to make informed decisions. Yet, the CDC has faced such severe attacks on its data infrastructure and organization that non-governmental groups have formed to create shadow sources of data. The census is democracy’s data infrastructure, yet it too has been plagued by political interference.

Data has long been a source of political power and state legitimacy, as well as a tool to argue for specific policies and defend core values. Yet, the history of public-sector data infrastructures is fraught, in no small part because state data has long been used to oppress, colonize, and control. Numbers have politics and politics has numbers.  Anti-colonial and anti-racist movements have long challenged what data the state collects, about whom, and for what purposes. Decades of public policy debates about privacy and power have shaped public-sector data infrastructures. Amidst these efforts to ensure that data is used to ensure equity — and not abuse — there have been a range of adversarial forces who have invested in polluting data for political, financial, or ideological purposes.

The legitimacy of public-sector data infrastructures is socially constructed. It is not driven by either the quality or quantity of data, but how the data — and the institution that uses its credibility to guarantee the data —  is perceived. When data are manipulated or political interests contort the appearance of data, data infrastructures are at risk. As with any type of infrastructure, data infrastructures must be maintained as sociotechnical systems. Data infrastructures are rendered visible when they break, but the cracks in the system should be negotiated long before the system has collapsed.

I suspect that, at the moment, this problem is mostly confined to the US. But the stresses of the pandemic and of alt-right disruption may mean that it reaches Europe (and elsewhere) soon.