When Open Source and Open Science go hand in hand

A community of miners

By opening GHTorrent to the public and unlocking a wealth of new information, researchers have been able to study a multitude of topics and gain entirely new insights into how the open source community works. The data set shows which forms of cooperation work well, but also shows how ‘offline’ prejudices affect the online world.

One study used GHTorrent to see how different nationalities interact when collaborating on open source software. Gousios: “It’s actually a shame, but you can clearly see that people from countries with political tensions work less productively together because, for example, they unnecessarily scrutinize each other’s contributions.” Another study found that contributions from female developers are less well received than contributions from men: they are scrutinized more heavily or even rejected outright. This was not the case when gender could not be inferred.

Still, this opens up a way forward for Gousios: “You have to know where your problems are if you want to solve them. I also helped Microsoft use GHTorrent, in fact to the point where they now use their own variant of it.”

Seeing all this relevant data and the blossoming of a small community emphasized the importance of the motto of Diomidis Spinellis, Gousios’s mentor, with whom he won the 10-year most influential paper award for GHTorrent at the MSR 2022 conference: if you want to do science, you have to be completely open about it. Even from the very beginning.

Connect the data

Diomidis was much more than just a spiritual mentor to Gousios. “His design, especially in handling MySQL (software that manages the relationships between data elements), is fundamental to GHTorrent, and has been largely unchanged for 10 years.” Whatever happens on GitHub, whether someone changes code or two users post in a discussion, it’s called an “event”. The first step in the program is to collect all the different events. “But that’s only half the battle, and arguably the easiest part,” says Gousios. “The trick is to make that data useful, by giving it context and relating it to other data. For example, if an event has recorded the rejection of someone’s coding contribution, that information only becomes useful if you know the context, so you can understand why this contribution was rejected.” What sets GHTorrent apart is its ability to continuously connect “raw data” with meaningful links, providing, in a sense, additional data. Providing meaningful links to each of these events across 83 million collaborative developers is no small feat.
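To make the idea of “giving events context” concrete, here is a minimal sketch in Python. It is not GHTorrent’s code: the event types, field names and sample data are invented for illustration and do not follow GitHub’s or GHTorrent’s actual schema. It groups raw events by the contribution they belong to and, for each rejected contribution, collects the comments that preceded the rejection.

```python
from collections import defaultdict

# Hypothetical raw events. The field names are illustrative only and are
# not taken from GitHub's API or from GHTorrent's database schema.
raw_events = [
    {"id": 1, "type": "pull_request_opened", "pr": 42, "actor": "alice"},
    {"id": 2, "type": "comment", "pr": 42, "actor": "bob",
     "text": "Tests are failing."},
    {"id": 3, "type": "pull_request_closed", "pr": 42, "actor": "bob",
     "merged": False},
]


def link_events(events):
    """Group events by the pull request they refer to."""
    by_pr = defaultdict(list)
    for event in events:
        by_pr[event["pr"]].append(event)
    return by_pr


def explain_rejections(events):
    """Map each rejected pull request to the comments made before it was closed."""
    context = {}
    for pr, history in link_events(events).items():
        history.sort(key=lambda e: e["id"])  # assume ids increase over time
        for i, event in enumerate(history):
            if event["type"] == "pull_request_closed" and not event["merged"]:
                context[pr] = [
                    e["text"] for e in history[:i] if e["type"] == "comment"
                ]
    return context


rejections = explain_rejections(raw_events)
print(rejections)
```

On its own, the closing event only says that a contribution was rejected. Linked to the earlier comment, it also hints at why.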

Gousios: “In the beginning, people using GHTorrent were a bit unaware of the amount of data they were getting. I was too, to be honest. Even personal data, such as email addresses or full names, was included.” In fact, it took a good four years, until 2016, before problems arose. “A user complained that we shared their private information. This was already public information, mind you, and we just connected the dots. But it pushed us to change how GHTorrent works: it still collects this data, but no longer shares it with users.”

The number of users also started to create another problem, especially between 2012 and 2014, when GitHub started to grow massively. Gousios: “It grew almost exponentially. I guess it was a network effect: as more developers became active on GitHub, more of their developer friends wanted to participate. It was great for the value of the dataset, but it also meant that I had a platform that grew exponentially and I had to scale up all the time. It put a lot of pressure on my skills and I had to quickly become an expert in technologies like MySQL and MongoDB. I had to start using distributed systems or restart certain processes so that GHTorrent could catch errors. And when one of those errors inevitably happened and I had to restart the platform, I had to prevent it from downloading all the data again, which required additional coding.”
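One common way to avoid downloading everything again after a restart is to keep a checkpoint of the last event that was stored. The Python sketch below illustrates that general technique only. The article does not say how GHTorrent solved the problem, and the file name, function names and the assumption that events arrive with increasing numeric ids are all hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the id of the last stored event, or 0 if there is no checkpoint."""
    if not os.path.exists(path):
        return 0
    with open(path) as fh:
        return json.load(fh)["last_event_id"]


def save_checkpoint(path, last_event_id):
    """Write the checkpoint to a temporary file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump({"last_event_id": last_event_id}, fh)
    os.replace(tmp_path, path)  # a crash never leaves a half-written file


def collect(events, checkpoint_path, store):
    """Store only events newer than the checkpoint (ids assumed increasing)."""
    last = load_checkpoint(checkpoint_path)
    for event in events:
        if event["id"] <= last:
            continue  # already stored before the restart
        store.append(event)
        last = event["id"]
        save_checkpoint(checkpoint_path, last)
    return last


# Simulated feed and storage; a real collector would talk to an API and a database.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
stored = []
feed = [{"id": i} for i in range(1, 6)]

collect(feed[:3], checkpoint_path, stored)  # first run stops after event 3
collect(feed, checkpoint_path, stored)      # after a restart, only 4 and 5 are added
```

Saving the checkpoint after every event keeps the sketch simple. A collector handling real volumes would more likely save it in batches.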

In other words, without necessarily asking for it, Gousios became a “One Man Service Reliability Engineer”, which is a specific type of developer. “This has cost me a huge amount of time, but I’ve also learned a huge amount.” Gousios explains that he was able to use many of the experiences he gained during those two years of growth in his teaching. In 2016, he took over responsibility for the Big Data Processing course and revised the material to include state-of-the-art technologies.

“I have always felt very well supported by my colleagues here at TU Delft, and dealing with those difficulties can be quite stressful at times.” With a small smile, he adds, “But in retrospect, I feel like I definitely learned something from it: I can’t be an expert in everything, so it’s best to leave areas of expertise to the experts.”

“When I came to Delft, GHTorrent wasn’t finished yet, but I still had all the time and freedom to finish it properly, and then even maintain it.” Even when Gousios worked for a while at Nijmegen University, he was still allowed to host the program on TU Delft’s hardware. Microsoft then took over hosting and financial responsibility, but from 2020 GHTorrent came ‘home’ to be hosted at TU Delft. “Throughout GHTorrent’s journey, TU Delft has been instrumental in its success, for which I am very grateful.”
