A sentence you’d want to verify
Can you imagine a day when you don’t need to be a university researcher to publish a scientific paper?
Picture this. You are scrolling through Facebook, a news app, or a link a friend forwarded. You land on a sentence: “People who smoke have a higher chance of developing depression than people who don’t.” [1]
You pause. That can’t be right, can it? Or maybe it’s too neat. Either way, you don’t want to take it on trust. You want to check.
In the past, your options were limited. You could go to a library or search Google Scholar, but you would usually hit a wall: thirty or forty dollars to read a single article. For most people, the story ended there.
Today, there is another option: you can write to the author directly and ask them to share the raw data of their study.
This is not a science-fiction scenario. It is the world we live in now. You do not need a master’s degree or a PhD to get at the data behind a study.
This shift has a name. It is called Open Science.
Where Open Science came from
Thirty years ago, reading a scientific paper meant one of two things: subscribing to a print journal, or going through a university library. Journals were expensive. A single subscription could cost thousands of dollars a year. Scholars gave their papers to journals for free, journals charged readers to access them, and the original author (whose salary was usually paid by taxpayers) typically got nothing.
This subscription model frustrated researchers themselves. I still remember messages from colleagues at other universities or institutes: “Does your library subscribe to journal X? Can you grab this article for me, please?” For a long time, that informal network of favors was how a lot of academic research actually got done.
One of the first to break with this model was a physicist. In 1991, Paul Ginsparg, at Los Alamos National Laboratory, set up a small electronic archive that later became known as arXiv (pronounced “archive”). The idea was simple: physicists could upload unpublished papers (preprints) directly, and anyone could read them for free.
At first only physicists used it. Then mathematicians joined. Then computer scientists, biologists, chemists, linguists. Today arXiv holds over two million papers, and “anyone” really means anyone. You, me, a school-age child, a retired engineer. If you have an internet connection, you can read them.
That was just the start.
In the early 2000s, open-access journals began to appear: PLOS, BioMed Central, and others. They inverted the traditional model. Instead of readers paying, the author (or their institution or grant) covered a processing fee. After that, the article sat on the internet, free for anyone to read.
In the 2010s, science went through a “replication crisis.” Across psychology, medicine, and the social sciences, many results that had been treated as settled turned out not to replicate when other groups tried the same methods. The problem was not necessarily fraud. Often, the original studies had simply not shared enough of their data or methods for anyone to verify them.
That crisis pushed the next step: it is not enough for papers to be open. Data and code should be open as well.
What “open” actually means, concretely
Open Science is not one thing. It is three related things.
1. Open Access: papers anyone can read
This is the most intuitive layer. Go back to that paper on smoking and depression: if it is open access, you do not have to pay the journal. Search for the title and you will usually find the author’s copy on their university page, on ResearchGate, or on a preprint server like arXiv.
For research funded by public money, many funders now require open access. The EU’s Horizon Europe programme, the US NIH, and UK Research and Innovation all have such rules. You have, after all, already paid for the research once through taxes.
2. Open Data: raw data anyone can check
For much of the history of science, “a study” has meant a black box. You see the conclusion (“smokers are 30% more likely to develop depression”), but you do not see the underlying data. How was that 30% calculated? How many people were in the sample? How long were they followed? Which confounders were controlled for?
Open Data is what opens that box. After publication, the original dataset (anonymized as needed) is deposited in a public repository: Zenodo, OSF (Open Science Framework), Dryad. Anyone can download it and run their own analysis on it.
This is why “write the author and ask for the data” works as a tactic now. In many cases you do not even need to write. The data is already a click away.
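To make the question “how was that 30% calculated?” concrete, here is a worked example with invented numbers (they are purely illustrative and not taken from any study). Suppose a dataset followed 1,000 smokers and 1,000 non-smokers, and 130 of the smokers and 100 of the non-smokers developed depression. The relative risk (RR) would then be:

```latex
\mathrm{RR}
  = \frac{130 / 1000}{100 / 1000}
  = \frac{0.13}{0.10}
  = 1.3
```

A relative risk of 1.3 is what a headline would report as “30% more likely.” With the raw counts in hand, a reader can redo this division, and can also see what the headline leaves out: the size of the sample, and the absolute difference of three percentage points.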
3. Open Source: tools and code anyone can run
The third layer is code. Most modern scientific analysis runs through programs: statistical models, machine learning pipelines, simulations. If you only release data and keep the code private, no one can fully reproduce the work.
Who knows what the program did, exactly? Which rows got dropped, which were kept? Maybe a hundred data points were released, but only thirty of them were used for the correlation that the paper actually reports. Until the code is public, none of this can be checked.
Open Source closes that gap: analysis code goes on GitHub or a similar public platform, and anyone can download it, inspect it, modify it, rerun it.
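The row-dropping scenario described above can be made concrete with a minimal sketch. The data points and the exclusion rule below are invented for illustration; they only show how a silent filter inside an analysis script can change the reported correlation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented (x, y) pairs standing in for a released dataset.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5), (7, 1), (8, 2)]

# Hypothetical exclusion rule that an analysis script might apply silently.
kept = [(x, y) for x, y in rows if x <= 6]

r_all = pearson(*zip(*rows))
r_kept = pearson(*zip(*kept))
print(f"all {len(rows)} rows:  r = {r_all:.2f}")
print(f"kept {len(kept)} rows: r = {r_kept:.2f}")
```

With these made-up numbers, dropping the last two rows moves the correlation from roughly 0.09 to roughly 0.83. Neither figure is “wrong” by itself; the point is that a reader cannot tell which one a paper reports unless the filtering step is visible in public code.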
FAIR Principles: making “open” actually usable
Making data, papers, and code “public” is one thing. Making them genuinely findable, comprehensible, and usable is another.
Imagine someone uploads their raw research data to their personal homepage as data_final_2.csv, with no documentation. Technically: open. In practice: nobody can find it, and even if they do, they cannot make sense of it or connect it to their own analytical workflow.
In 2016, a group of scientists (Wilkinson et al.) proposed a concrete standard for making “open” actually work, called the FAIR Principles. FAIR is an acronym; each letter is a baseline requirement:
- Findable: Data should have a unique identifier (like a book’s ISBN), be discoverable by search engines, and come with clear titles and descriptions.
- Accessible: Once found, it should actually be downloadable, without paywalls, without special permissions. If access is restricted (for example, for privacy), the restriction should be clearly stated.
- Interoperable: The data format should be open and standard, so it can be opened by ordinary software and combined with other datasets. No proprietary, expensive-software-only formats.
- Reusable: Data should come with full metadata: how it was collected, what its limits are, how to cite it. Other people should be able not just to use it, but to use it correctly.
In short: findable, accessible, interoperable, reusable.
These four principles are now adopted across the EU, the US NIH, and major science funders. “Your data must comply with FAIR” is now a hard requirement on many grants.
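As a rough illustration of what the four letters ask for in practice, here is a toy checklist run against a metadata record. The record, its field names, and the mapping from fields to principles are all hypothetical; they follow no particular repository’s schema, and real FAIR assessments are considerably more involved.

```python
# Hypothetical metadata record; the field names are illustrative and do not
# follow any particular repository's schema.
record = {
    "identifier": "doi:10.0000/example",  # made-up identifier
    "title": "Smoking and mood survey, anonymized",
    "description": "",
    "access": "open",
    "file_format": "csv",
    "license": "CC-BY-4.0",
    "collection_method": "",
}

# One rough check per FAIR letter, each mapped to the fields it looks at.
CHECKS = {
    "Findable": ["identifier", "title", "description"],
    "Accessible": ["access"],
    "Interoperable": ["file_format"],
    "Reusable": ["license", "collection_method"],
}

def missing_fields(record, checks=CHECKS):
    """Return {principle: [fields that are absent or empty]} for a record."""
    report = {}
    for principle, fields in checks.items():
        gaps = [name for name in fields if not record.get(name)]
        if gaps:
            report[principle] = gaps
    return report

report = missing_fields(record)
for principle, gaps in report.items():
    print(f"{principle}: missing {', '.join(gaps)}")
```

For this sample record the sketch flags an empty description under Findable and an empty collection method under Reusable. A file like data_final_2.csv with no documentation would fail nearly every check.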
My first-hand experience of the shift
When I was a PhD student, I was taught from day one: keep your data clean, keep your records meticulous. My institution actively encouraged Open Science compliance in publication, so the code I wrote, the data I processed, the papers I drafted, all of it was public. Even before papers were formally published, I would put my own typeset preprint on a platform like OSF.
Having lived through the “knowledge locked behind a paywall” era and then the “research data is openly accessible” era, I am genuinely grateful to every researcher who has answered the call of Open Science. Finding sources has become enormously easier. It is not as fast as doing research with AI today, but at least I no longer have to scan books in the library page by page and type articles into my computer character by character. My gratitude to the Open Science movement is real, and that is why I follow these norms gladly.
Why this matters to you, even if you’re not a scientist
You might still be thinking: this all sounds like inside baseball. What does it have to do with me, an ordinary reader?
Three things, concretely.
You can verify things yourself. Back to that opening sentence. In a FAIR-compliant world, you can: find the paper (F), download it for free (A), open the dataset in Excel or in free software like R (I), follow the author’s documentation, and run the calculation yourself (R). If your number does not match the paper’s, that fact alone is worth knowing.
Citizen science becomes possible. “Doing research” used to mean having a PhD. Today, more and more scientific contribution comes from “amateurs”: retired teachers tracking birds, middle-schoolers logging butterfly migration, hobbyists photographing the sky with their phones. Open Science lets those contributions enter the real scientific pipeline, because contributors can reach the actual tools without first entering an academic system, and without selling a house to buy expensive equipment. (I have heard of people selling their house to do human DNA sequencing work. To that level of commitment, I really do bow.)
Science becomes more accountable. When data and code are public, results can be checked from the outside. Errors get caught faster. Fraud gets harder to maintain. For the questions that matter to all of us (climate, drug safety, education policy), that external accountability is meaningful for the whole society, not just the academic guild.
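The first of these three points, checking a number yourself, might look like the following minimal sketch. The dataset, the column names, and the “reported” figure are all invented; a real check would read the file the authors deposited and follow their documentation. The statistic recomputed here is a relative risk, meaning the risk in one group divided by the risk in the other.

```python
import csv
import io

# Invented stand-in for a downloaded dataset; a real file would be read from
# disk instead. The column names are hypothetical.
RAW_CSV = """participant,smoker,depressed
p01,yes,yes
p02,yes,yes
p03,yes,yes
p04,yes,no
p05,yes,no
p06,no,yes
p07,no,yes
p08,no,no
p09,no,no
p10,no,no
"""

def relative_risk(rows):
    """Risk of depression among smokers divided by risk among non-smokers."""
    def risk(group):
        members = [r for r in rows if r["smoker"] == group]
        cases = sum(1 for r in members if r["depressed"] == "yes")
        return cases / len(members)
    return risk("yes") / risk("no")

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
recomputed = relative_risk(rows)
REPORTED = 1.3  # hypothetical figure quoted in the paper

print(f"recomputed: {recomputed:.2f}, reported: {REPORTED:.2f}")
if abs(recomputed - REPORTED) > 0.05:
    print("Mismatch: worth checking exclusions, definitions, or the sample.")
```

In this toy sample the recomputed value is 1.5 against a reported 1.3, so the mismatch message prints. A gap like that does not prove the paper wrong; it is a prompt to look at which rows were excluded and how the outcome was defined.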
But “open” is not the whole answer
Before this essay ends, I want to plant a seed.
Open Science is a welcome shift, but it is not a panacea. When we say “data should be open,” we are quietly assuming something: that an institution (the university, the grant, the publisher) can unilaterally decide who counts as the owner of the data.
For most scientific data, that assumption holds: climate data, chemical reactions, training corpora for large language models. In most cases, no one is harmed by making the data public, especially because data about human behavior usually goes through serious anonymization before release.
Take my own past work in bioinformatics. I could easily download large datasets of human DNA sequences. I would know that group A had been diagnosed with a certain cancer and group B had not. But I would not know their names. I would not know where they lived.
There is, however, a category of data where the picture looks very different: data tied to Indigenous communities, including their language, their culture, and their traditional knowledge.
When a recording of an endangered language is “released openly,” who decides how it gets used? When a community’s traditional knowledge ends up in an open repository, can any company commercialize it? In these cases, “open” can stop being empowerment and become another form of extraction.
This is why in 2020, four years after the FAIR Principles were proposed, a group of Indigenous data-sovereignty scholars articulated a parallel framework, called the CARE Principles.
That is the subject of the next essay.
And, circling back to where we started: does smoking really raise the chance of depression? That, I’ll leave to the reader to verify, by exercising the rights Open Science has now given you!
For those who want to read further
- arXiv (arxiv.org): open preprint repository for physics, mathematics, computer science, linguistics, and more.
- bioRxiv and medRxiv: the biomedical equivalents.
- OSF (Open Science Framework) (osf.io): integrated platform for data, pre-registration, and preprints.
- Plan S (coalition-s.org): open-access initiative led by European and other research funders.
- Wilkinson, M. D., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018.
Next: When “open” meets Indigenous data, what the CARE Principles are, and why they sit alongside FAIR rather than replacing it.
Footnotes
[1] I made this example up at first, just something that sounded plausibly debatable. After writing the rest of the essay, I went and looked it up myself. It turns out to be true: contemporary psychiatric research using Mendelian randomization shows a causal link, in both directions, between smoking and depression. An accidental live demonstration of the very thing this essay is about.