Hall 02 · Studio Journal

What are the CARE Principles? Starting from the afternoon that silenced me

In the last essay I hinted that one talk shook my conviction. This is the rest of that story. When 'open' meets the language and cultural data of Indigenous communities, we need a parallel framework: the CARE Principles.

Mei-Shin Wu-Urbanek · 14 min read
[Image: still life on a dark walnut table, side-lit from a window: a leather portfolio tied with cord, a vintage reel-to-reel field recorder, a leather notebook open to a blank page with a fountain pen resting on it, a small ochre ceramic cup, and, in the background, a hand-bound book partly wrapped in a textile.]

The afternoon that silenced me

I was still a doctoral student, and I was the one on stage.

I was talking confidently about the importance of language documentation and language revitalization. Every week, endangered languages disappear forever. Before the last fluent speaker leaves us, we have to record the grammar, the vocabulary, the pronunciation, and the narrative forms, and place them in public repositories, so that future scholars can keep building and so that the descendants of the community can one day return to learn the language of their ancestors.

I believed this story. I said it with conviction.

Then a senior researcher raised his hand. His tone was gentle, with no edge to it:

“Have you thought about how the community feels?”

I paused, but kept answering. Of course, I said. I get informed consent, I anonymize, I go back to the community before publication.

“That’s not what I mean,” he said.

“The community whose language you want to help preserve may genuinely want that help. But there are many other communities in the world that may not care about any of this. Some of them may even prefer outsiders to stop interrupting their lives.”

“And another question: when you call for language documentation, what is it for? In the end, what does the community actually get out of it?”

I went silent.

What do they get? A book? A dictionary? Will the people of the community actually open those?

I noticed something. The academic righteousness I had built up was, in part, treating the community as ‘people to be served.’ But what they actually cared about might not be a dictionary at all. It might be how to live a slightly better life, how to find a job in a world that keeps shifting under their feet.

I thought about myself. I come from a Southern Min-speaking family. I understand the language, but I speak it poorly. Southern Min is not endangered in Taiwan, but it is also not an official language, and growing up I did not pay much attention to how much of it I knew. As long as I could say a few sentences to my grandmother, that was enough.

If even someone with my background treats their own mother tongue with this ‘good enough’ attitude, what about children whose home language really is endangered? Their parents may not love the language any less. They may simply be under so much pressure that they have to push their children toward whatever language can offer more security, more chance of further education, more chance of a job. That is a choice forced out by living conditions.

These are not things a dictionary can solve.

The questions wouldn’t go away

Every time I drafted a project proposal and wrote the line “data will be made openly available following FAIR standards,” that question would come knocking again. Open to whom? Once it is open, who actually uses it? Who, in the end, really benefits?

Then I would talk myself round. Come on, I told myself, this is already published material. I am just digitizing it. I am just using the standard processing conventions our research community already uses.

That was the loop. Half talking myself round, half wishing I had even more paper-based corpora to work with. I spent years inside that “just a little more, just a little more” way of thinking.

Then AI arrived. Things got harder.

In the past, when we worried about open data, the main worry was personal-identifier leakage. We would strip names, strip addresses, change exact birth dates into age bands. After anonymization, the data went public and everyone could relax.
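That older routine can be sketched in a few lines. This is a minimal illustration, not the code of any real project: the field names (name, address, birth_date) and the ten-year band width are assumptions made up for the example.

```python
from datetime import date

# Hypothetical field names, for illustration only.
DIRECT_IDENTIFIERS = {"name", "address"}


def age_band(birth_date: date, on: date, width: int = 10) -> str:
    """Replace an exact birth date with a coarse age band such as '60-69'."""
    # Subtract one year if the birthday has not yet happened in the year of `on`.
    had_birthday = (on.month, on.day) >= (birth_date.month, birth_date.day)
    age = on.year - birth_date.year - (0 if had_birthday else 1)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymize(record: dict, on: date) -> dict:
    """Drop direct identifiers and coarsen the birth date into an age band."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "birth_date"
    }
    if "birth_date" in record:
        cleaned["age_band"] = age_band(record["birth_date"], on)
    return cleaned
```

Note that a routine like this only touches metadata fields. It does nothing to the recording itself.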

After AI, the flows of data changed.

A recording of an endangered language placed in a public repository was supposed to be downloaded by fellow linguists for further study. In practice it may also be scraped, without our knowing, by some transnational company and fed into its large language model (LLM) or speech model as training data. Oral stories an elder spent a lifetime preserving may, that way, become part of a commercial AI system.

Even worse, anonymization is no longer a cure-all. Voice itself is a biometric. When only a handful of fluent speakers remain in a community, their pronunciation, intonation, pauses, even their coughs are recognizable. Anyone familiar with that community can identify them as soon as the recording goes online. Anonymization may still do something for textual data; on audio data, it is often just a label that makes the researcher feel safer.

So the senior researcher’s question, after years of chewing it over, sharpened into this:

The community gives time, opens up recordings, lets ancestors’ stories out into the world, and what do they get in return? Their voices end up powering what, exactly? Do they themselves even know?

That is where this essay’s framework begins. It is called the CARE Principles.

Where did the CARE Principles come from?

The CARE Principles are not a top-down standard handed down by some Western science committee. The story of how they came to be is itself bound up with the problem they exist to address.

Their formation can be traced back to 2018, around International Data Week and the Research Data Alliance meetings, where a series of workshops was held on Indigenous Data Sovereignty. Out of that, the RDA International Indigenous Data Sovereignty Interest Group and the Global Indigenous Data Alliance (GIDA) drafted the core text of the CARE Principles. In 2020, Stephanie Russo Carroll and colleagues published “The CARE Principles for Indigenous Data Governance” in Data Science Journal, setting out the theoretical grounding of CARE and its relationship to FAIR.

The participants were Indigenous data sovereignty researchers from around the world: Native peoples of the Americas, Māori, Sámi, Aboriginal Australians, First Nations, and others. They shared an observation:

The FAIR Principles say important things, but they mainly ask one kind of question. How can data be found, accessed, made interoperable, and reused? They do not adequately address an earlier question: under conditions of historical harm, colonial experience, and power asymmetry, who has the right to decide whether the data should be shared at all, how it should be shared, and how the benefits of that sharing flow back to the community?

For many scientific datasets that do not directly involve the rights or cultural context of a specific community, FAIR is a reasonable starting point. Making data findable, accessible, interoperable, and reusable is good for research.

But for Indigenous languages, traditional knowledge, cultural practices, land memory, and family oral histories, the matter cannot begin from the data alone. These are not “things a researcher collected.” They are “things a community has accumulated over generations and temporarily entrusted outward.” The researcher is someone who was allowed in for a while, asked to record, and should never assume the role of owner.

CARE does not replace FAIR. It runs alongside it. FAIR asks: “Can this data, technically, be found, retrieved, connected, and reused?” CARE pushes the question one step earlier: “Should this data be used in that way at all? And if it can be, who decides?”

CARE, letter by letter

CARE is four letters: C, A, R, E. Each one carries one principle. Let me put them in my own words.

C, Collective Benefit

The question: what concrete good does the existence of this data do for the community itself?

This is not the abstract promise that “someday it will be useful.” It needs to be a visible, verifiable benefit. For example: can the data feed back into the community’s own language teaching? Can it give children one more textbook? Can it open new career options for young people in the community (language teachers, corpus stewards, cultural guides)?

If the answers are all “no, but it looks great on the researcher’s CV,” the project design has a problem.

A, Authority to Control

The question: who decides how the data is used, by whom, and when?

CARE’s answer: the community itself. Not the funding foundation, not the researcher’s university, not the cloud service provider.

In practice this means many things. Rights to recordings should not default to the researcher or the research institution; they should be worked out in advance with representative community organizations. Database terms of use should include a way for the community to withdraw or restrict a given record. Commercial use should require separate consent, not be settled by one consent form signed once and treated as covering a lifetime.

R, Responsibility

The question: to whom are the people who receive this data responsible?

The standard academic answer is “to the scientific community,” “to best practices.” CARE pulls the answer back: to the community that provided the data.

The shift sounds small but in practice changes a lot. If a research result might harm the community (for example, by accidentally exposing ritual knowledge that should not have been exposed), your responsibility is not first to explain it to the journal but first to report back to the community. It means you have to keep going back, not vanish once the research is done. It means that after the data is used, you return with the findings, not just by mailing them a paper but by reporting in a form they can take in.

E, Ethics

The question: whose ethics does this work rest on?

The standard academic answer is often the IRB, the Institutional Review Board. IRBs matter. But the core vocabulary of modern research-ethics review came largely out of medical research ethics, individual informed consent, and individual risk protection. It tends to treat the individual as the primary unit of analysis.

Indigenous community ethics, by contrast, is often collective. Whether a particular story can be heard by outsiders is not one person’s call. It is a collective decision. Whether a taboo can be studied is also a collective decision. A checkbox on an IRB form for “individual signed consent” is not necessarily enough at this level.

CARE’s E asks the researcher to think: within the community’s own ethical framework, is this work appropriate? Not just “do I have a consent form,” but “within their ethics, is this even something that can be done.”

CARE and FAIR are not in opposition; they sit alongside each other

I want to be clear. The CARE Principles are not trying to push Open Science back. They are not against openness or against sharing.

The two frameworks are meant to be read side by side.

For many scientific datasets, FAIR is an important foundation. Making data findable, accessible, interoperable, and reusable makes research more transparent, and saves later researchers from reinventing the wheel.

But the moment data has a close relationship with a specific group of people, for example Indigenous languages, traditional medicine, religious practice, family oral history, land memory, or community governance data, you first need to ask CARE’s question: should this data enter the FAIR pipeline at all? And if it can, under what conditions?

My own way of holding it:

Put simply, FAIR mainly handles how data can be found and reused. CARE moves the question earlier: who has the right to decide whether the data can be used in that way, and whether that use sends benefits back to the community.

The order matters. If you flip it, you can end up with a project that is technically beautiful and ethically unstable.

So what does CARE look like in my own work?

I am currently part of a project on an endangered Indigenous language, building a text-to-speech (TTS) system. The engineering details I will save for another essay. Structurally, there are several decisions that come directly out of CARE.

Open code, closed corpus. Our training pipeline, model architecture, and evaluation methods are all open-sourced on GitHub. Our experience is something other low-resource language teams can draw on. But the underlying recordings and annotations used for training are not placed in any public repository. They stay in storage that the community has agreed to, governed by the community’s representative organization.

Withdrawal cannot live only on a consent form. If a person who provided recordings later asks for them no longer to be used, we need to be able to trace and remove that recording from the raw audio, the annotations, the training lists, the dataset versions, and downstream training runs. That does not mean “withdrawal” is ever technically perfect in machine learning. Model checkpoints, backups, derived data, and released versions all complicate the picture. But withdrawal cannot be cancelled in advance by the phrase “once data is open, you cannot take it back.”
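What "being able to trace" requires can be sketched as a small lineage index. This is an illustration under assumptions, not the project's actual tooling: the Artifact class, its fields, and the artifact kinds are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """Anything derived from recordings: an annotation file, a training list,
    a dataset version, a model checkpoint. All names here are hypothetical."""
    name: str
    kind: str  # e.g. "annotation", "train_list", "dataset", "checkpoint"
    recordings: set = field(default_factory=set)  # recording IDs used directly
    parents: list = field(default_factory=list)   # names of artifacts it was built from


def affected_by(recording_id: str, artifacts: list) -> list:
    """Return the names of every artifact that used the recording,
    either directly or through something it was built from."""
    by_name = {a.name: a for a in artifacts}
    memo = {}

    def touches(name: str) -> bool:
        if name not in memo:
            memo[name] = False  # provisional value guards against cycles
            artifact = by_name[name]
            memo[name] = recording_id in artifact.recordings or any(
                touches(parent) for parent in artifact.parents if parent in by_name
            )
        return memo[name]

    return [a.name for a in artifacts if touches(a.name)]
```

A list like this is only the starting point. Annotation files and training lists can be edited and dataset versions rebuilt, but a checkpoint that has already been trained cannot be edited in the same way; it can only be flagged for retraining or retirement, which is part of why withdrawal is never technically perfect.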

Commercial use requires separate consent. Our license states clearly: the model may only be used in “educational and cultural uses that the community has agreed to.” If a company wants to use it in a commercial product, that requires fresh negotiation with the community. It is not a single researcher’s call.
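The same rule can be written as a routing check. This is a sketch under assumptions: the purpose labels and the function name are invented for the example, and a real agreement would define its own terms in its own words.

```python
# Hypothetical purpose labels; a real agreement would define its own terms.
COMMUNITY_AGREED_PURPOSES = {"education", "culture"}

COVERED = "covered by the existing agreement"
NEGOTIATE = "needs fresh negotiation with the community"


def route_request(purpose: str, commercial: bool) -> str:
    """Decide how a request to use the model should be routed.

    Anything commercial, and anything outside the agreed purposes,
    goes back to the community rather than being decided here.
    """
    if commercial:
        return NEGOTIATE
    if purpose in COMMUNITY_AGREED_PURPOSES:
        return COVERED
    return NEGOTIATE
```

The design choice worth noticing is the default. The function never approves a new kind of use by itself; the only thing it can say yes to is what the community has already agreed to.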

How contributions are credited has to be decided with the community. Some community elders are willing to provide stories and recordings but do not want their names on the paper. Conventional academic publishing does not always handle this well, because the field tends to believe that “naming is what makes responsibility possible.” In a small community, however, being named can carry its own pressure or risk.

So we cannot treat “anonymity” as simply removing a name. The more responsible move is to agree in advance with the community, the journal, and the publishing party on a fitting form of credit. It might be anonymous acknowledgment, collective community attribution, the community organization standing as rights-holder, or, for non-paper outcomes, a separate arrangement for how any revenue is shared. These arrangements need contracts and publishing norms to support them. They cannot be set unilaterally by the researcher.

None of this is a perfect answer. But these are the kinds of things you can actually start doing under the logic of CARE.

A note to close

Academic work is genuinely interesting. While doing research, we tend to insulate ourselves. What we face is data, methods, analysis. But the moment we step back from that research posture, the work is actually very close to people.

For all the explanation above, you do not really need to memorize the four CARE letters one by one. There is a ready-made line in the Chinese-speaking world: “do not do unto others what you would not want done to you.”

If you would not casually post your own photo, do not do that to your interviewees. If you would not hand your own audio recordings to a stranger without thinking, do not casually upload your interviewees’ recordings either.

But CARE is also not only “think harder on the community’s behalf.” Sometimes researchers who think they are being careful and responsible end up dragging the community through more meetings, more forms, more rounds of confirmation. What the community actually needs may not be a researcher coming back over and over to verify “how is this word pronounced,” “how should that be spelt,” but less disturbance, clearer commitments, more practical feedback, and a real right to say no.

In the end, we are all small cogs working through our own lives. Respect the other side and act conservatively. That is the best fieldwork stance.

I am grateful for the questions that left me speechless that afternoon. They helped me grow up. They gave me a deeper understanding of research ethics.

If one day you set out to do research connected to a community (not only Indigenous communities, but any minoritized or vulnerable group), I hope the CARE Principles are useful to you. They will not give you a standard answer. They will give you a better list of questions.

The next essay will work through a concrete case: in the Rukai text-to-speech project, how we used CARE’s logic to make every technical choice. That is the third entry in this series, and the place where FAIR and CARE walk back into real work together.

Further reading

  • Carroll, S. R., et al. (2020). The CARE Principles for Indigenous Data Governance. Data Science Journal 19: 43.
  • Global Indigenous Data Alliance (GIDA) official site: gida-global.org
  • Research Data Alliance (RDA) International Indigenous Data Sovereignty Interest Group: a cross-national community of Indigenous data sovereignty practitioners, with regular workshops and documents to consult.
  • Wilkinson, M. D., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018. (Cited in the previous essay.)

