Stephen C. Johnson and Brian W. Kernighan once proposed that those hoping to write good code should "first make it work, then make it right, and, finally, make it fast".
Unfortunately, putting this maxim into practice while working with data isn't always easy. Indeed, it's often not obvious how to tell whether code written to analyze data is correct.
As an example, imagine that I have a data set that I know contains a small amount of corrupt data generated by some buggy logging code. Is my data analysis correct if it includes this corrupt data? Or do I need to remove the corrupt data before analyzing the rest? And if I need to exclude the corrupt data, how do I know which entries are corrupt and which are not? If you're reviewing code I've written to perform this kind of analysis, how can you tell whether the approach I've taken is correct?
In general, I would argue that the correctness of data analysis code depends not only on the code itself, but also on the data that will be fed into the code. And the correctness of the data often can't be determined from the data alone, but also depends upon domain-specific details about the data collection process (e.g., knowledge that the logging code generates a small amount of corrupt data).
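To make this concrete, here is a minimal sketch in Python. The data and the sentinel value are hypothetical: we assume the buggy logger writes failed measurements as `-1.0`. The same summary code then gives two different answers, and only domain knowledge about the logging process tells us which one is correct.

```python
# Hypothetical log of measurements; assume (domain knowledge) that the
# buggy logger records failed measurements as the sentinel value -1.0.
records = [12.0, 15.0, -1.0, 14.0, -1.0, 13.0]

# A "naive" analysis that treats every entry as a real measurement.
naive_mean = sum(records) / len(records)

# An analysis that encodes the domain-specific fact: -1.0 entries are
# artifacts of the logger, not real data, so they must be excluded.
clean = [x for x in records if x != -1.0]
clean_mean = sum(clean) / len(clean)

print(naive_mean)  # pulled down by the corrupt entries
print(clean_mean)  # 13.5 -- correct only if the sentinel assumption holds
```

Nothing in either snippet is wrong as code; deciding which analysis is correct requires information that lives outside the program entirely.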
How can we overcome these problems? Can those working in data analysis produce code that works as reliably as traditional software engineers have learned to write? What ideas can we import from the broader programming community? And what new ideas do we need to explore to make up for the ways in which traditional software engineering practices are insufficient to address the challenges introduced by systems whose correctness depends on more than just the code we write?
In this talk, I'll describe my views on these questions, drawing on my experience developing data tools for the Julia language and analyzing data in academic and industrial settings.