Prepared for ISA 2019 Sapphire Panel on Progress & Communication across Methodological Divides
To me, responsible scholarship means scholarship that is transparent and reproducible at every stage. A piece of scholarship, an article, or a book chapter is much more than its final print version. In my experience, a piece of research starts with a research idea, an outline, maybe some preliminary data analysis, or a pilot study, then there is the first draft, the second draft, the third draft, the nth draft, the draft that was rejected from journal X, the draft that was submitted to journal Y, the revised draft submitted to journal Y, the second revised draft submitted to journal Y, and finally the printed copy-edited version published in journal Y.
At each stage of manuscript development, the author makes decisions, some explicit, some implicit, that affect the final print version. Seemingly innocuous decisions in early stages may define the final output in ways that are not obvious early in the research process. And this is fine, but it is also our job as researchers and educators to keep detailed records of these decisions, explain how they were made, and share both with the broader community.
Research transparency is key to ensuring integrity and validity of every aspect of research, from the theoretical argument to the data collection and analysis. Back when there were only a handful of broadly available datasets, such as the Correlates of War data, the idea of being able to reproduce a data analysis was perhaps not that interesting. But given the plethora of data, methodological tools, and software available today, data analyses can be extraordinarily complicated, and following and reproducing every step of our own, let alone someone else’s, research process is not a trivial task. Reproducing a given data analysis typically involves a long pipeline of merging, cleaning, and recoding, each associated with a myriad of trivial and non-trivial decisions.
Even many seemingly trivial decisions create forking paths. Just like a person lost in the woods, researchers often have to choose between two or more paths without knowing where each may ultimately lead. Sometimes one path may seem to be obviously better, so we make the decision and never go back and explore the other paths. Of course, what is obvious to one researcher may not be so obvious or even correct in the eyes of others. The problem is that researchers often provide insufficient information when describing the steps to reproduce the exact path they have taken, especially when the decisions seem obvious to the researcher. “Obvious” is the enemy of transparency and reproducibility.
A lack of research transparency creates the illusion that producing and publishing research is quick and easy. As a discipline, we have created incentives to make research appear effortless. A folk image of a research genius is that they do not need to go through multiple drafts and polish their work. It is only natural that so many graduate students write their course papers the night before the deadline. Our students, and everyone else, see the final output, a published article, but not the years of hard work, revisions, and multiple drafts. And when their own article, written the night before the deadline, is not immediately appreciated, they get discouraged, question their career choice, and wallow.
So what are some best practices for ensuring transparency and reproducibility of your research? In my opinion, it is time that social science researchers embrace the concept of online repository hosting: rather than saving all our drafts and code in private directories on our computers, we post them online for the whole world to see. Let the whole world see the first, the second, and every other draft of your research AND your code. Let them see how you addressed or didn’t address the reviewer comments. Let them learn from you or learn from your mistakes. Let them comment on your work while it is still in progress. Let them find the typos before you submit it to a journal, and let the critics and the skeptics get it off their chest before they become your reviewers. Give them access to your data and your code, so that they can find out for themselves whether some coding decision makes a difference (maybe they will ask for fewer robustness checks as a result).
At a minimum, this must be done at the time of publication, although I personally see no reason to not make my work publicly available from beginning to end, and many of my colleagues have been doing the same long before me. The platform itself does not matter (Github, Dataverse, or even a personal website are nice options). The key is that anybody—a graduate student who wants to learn from your workflow, a coauthor or a colleague that wants to better understand your research or code—has access to all of your files and decisions from any stage in the research process.
Given increasingly strict journal replication policies, following this practice places very few additional demands on the researcher (in fact, it may help speed your article through the journal replication stage). Of course, uploading a myriad of files to an online repository is useless without logical organization and documentation, but uncommented scripts and unorganized drafts are equally useless to the author herself a month later. Good record-keeping and organization practices are paramount as long as anyone, whether the author or someone else, would like to go back to the same project in the future.
Documentation is hard, but it is also key to producing transparent and reproducible research. As researchers, we often equate “a job well done” with seeing our research in print, but the final print copy of our article or book is only the tip of the iceberg. Publishing is only one stage of the research process: the final stage, but hardly the most important one. The important stages, the countless paths chosen and not chosen, can rarely be gleaned from the final output. Good record-keeping ensures that we preserve the information on these important steps, so that others can understand and build upon our research.
Finally, it is our job as a scholarly community to reward responsible research practices, something at which we have failed even compared to industry (employers from data analytics companies routinely look up applicants’ Github pages, whereas members of most academic search committees think Github must be a dating app). Maybe our search committees should take a tip from industry employers and start visiting their applicants’ Github pages. Journals could allow authors to correct and update data and code mistakes in real time, even post-publication. Perhaps we could even adapt our journal review practices in a way that explicitly gives reviewers access to the whole package, including the data and the code, rather than only a strictly page-limited draft.
THIS TALK WAS INSPIRED BY YIHUI XIE’S BLOG POSTS ON RESEARCH REPRODUCIBILITY AND BY A STATISTICS BLOG BY RAFA IRIZARRY, ROGER PENG, AND JEFF LEEK.