August 04, 2021

From July 26 to August 1 I participated in the 2021 Citadel-Correlation One PhD Datathon. This was the fourth data science case competition I have taken part in, and the first I have won. Below I share some of the background of the competition, and what my thought process was. My submission can be accessed on Github. I hope that this will be helpful to individuals preparing to participate in these type of competitions.

Background

The case competition had ~30 competitors from various universities around the world, each competing individually. On the first day of the competition we were given a very open-ended problem statement to ‘‘use climate-related data in order to discover and analyze patterns related to the causes and impacts of global warming’’, with a focus on coming up with a solution that was actionable (instead of just coming up with a very fancy/accurate model of some kind). We were given a number of climate datasets related to topics like land temperatures, emissions, energy, etc. From this we had one week to come up with a ~15 page report which we had to submit alongside our supporting code. We were to assume the judges were fairly knowledgeable about data science, but to not assume they were experts in modern/complicated methods.

1 - Overall Strategy

I figured a submission that tackled a good question, plus was able to bring some level of rigour, would be considered as strong. One of the tips given to us was to not learn any new tools but leverage our existing skills. Listening to this, I tried to arrive upon a research question where I could use my time series knowledge (i.e. my PhD research area).

2 - Coming Up With a Good Question

After we were given the research question I was a bit nervous given I knew very little about climate change, and therefore it was not obvious direction I would take. The first day I spent on the case involved reading publications about climate change. I started with a ‘wide-funnel’ reading some work from the UN’s Intergovernmental Panel on Climate Change (IPCC). I then tried to find journal articles discussing specific recommendations on how to mitigate climate change. From this reading I noticed Least Developed Countries (LDCs) were more at risk from climate change. I had heard of the concept of climate refugees before, and thought that if I could focus my report on this it would yield some actionable recommendations, and the impact it would have would be clear because of the human-element of climate refugees.

Next, I had to determine what type of things the climate might do to create climate refugees. The obvious answer was ‘the climate is warming’, and also obvious to me since Toronto was experiencing an incredibly rainy summer, was that climate change could lead to more flooding. This gave me two easy lines of analysis to work with: which LDCs will get hotter, and which LDCs will see more rain. I knew that if I could leverage my time series knowledge to come up with reasonable forecasts of these then it would feed through to reasonable conclusions about which LDCs were at risk.

3 - Bringing in Some Rigour

As mentioned before, my view was that a strong submission would have both a good question, and some rigour. My question seemed reasonable, so I just needed a statistical method which was non-trivial. One paper I had read several months ago (partially out of interest, partially related to my research) dealt with detecting trends in nonstationary time series. In the applications section of the paper it had used a temperature time series, and it appeared it could be implemented in a reasonable amount of time (much of my research over the past year has been implementing others’ time series methods in R so I had a good sense of this). At this point I knew I had a reasonable question, and that if this statistical method worked for heat, it would yield interesting conclusions. Also, given that using this method would tick-the-box of rigour, if I had to settle for something slightly less rigorous for my analysis of rain, the report would still be strong.

4 - Getting the Data

The dataset we were given did not have rain so I sought out a rain dataset. I figured given how active of a research area climatology is there had to be some sort of global precipitation dataset. After extensive Googling I found a dataset for rain, and right next to it in the FTP directory was an even better dataset of temperatures than we were given (it had maximum temperatures instead of average temperatures)! It was in some sort of weird file format which I had never seen before, and just as I was giving up hope I found a post on StackExchange which explained how to use it.

5 - Pray it Works

At this point I was more than halfway through the competition with no results. I had the dataset, and coded up the trend detection method I wanted to use for heat. I even started wondering if it did not work if I would have to drop out of the competition. Thankfully the method worked in the sense that using a reasonable value for the key hyperparameter it selectively rejected the null hypothesis for some countries and not for others.

6 - Crafting Good Arguments

Once my method for heat data worked, I was able to find a fairly easy to implement quantile regression method for my rain data. At this point my focus turned towards communicating what these statistical methods were saying in a straightforward manner. Since my question was identifying which countries were at risk of climate change I had to come up with some measures of what constituted at risk. Since time was running short I came up with some fairly ad-hoc cutoffs, while acknowledging this could be an area of future analysis.

7 - Visualize

What I realized as I was writing the report was that it was very difficult to communicate the trend detection method since it was complicated, and also the results since they were very nuanced. I knew the judges only have a finite attention span, so I sought to explain as much as I could using well-captioned charts (a previous professor I had said that a chart+caption should be able to be understood without looking at the text so I tried to do that for every chart).

Conclusion

Hopefully for those who got this far they found some value in hearing about my thought process. If I was to give just one tip it would be to find a research question which is impactful and can be answered well using the tools you know.

Thank you to Citadel and Correlation One for the great event.