Measuring open source success: developing analysis for actionable insights

Project Aspen plans to enable quantitative open source community health analysis for all.

Organizations are increasingly adopting open source software development models and open source aspects of organizational culture. As a result, interest in how open source communities succeed is reaching an all-time high.

Until recent years, measuring the success of open source communities was haphazard and anecdotal. Ask someone what makes one community more successful than another, and you will likely get observations such as, “The software is great, so the community is too,” or “The people in this community just mesh well.” The problem with these evaluations is not that they are necessarily wrong, but that they don’t provide information that others can use to reproduce successful results. What works for one community is not necessarily going to work for another.

Research universities, businesses, and other organizations interested in determining what makes open source projects successful have begun to collaborate on finding ways to measure aspects of community in a quantitative and data-driven way. One of the more prominent efforts is CHAOSS, a Linux Foundation project focused on creating metrics, metrics models, and software to better understand open source community health on a global scale. Unhealthy projects hurt both their communities and the organizations relying on those projects, so identifying measures of robustness isn’t just an interesting project; it’s critical to the open source ecosystem.

From the Red Hat Open Source Program Office (OSPO) perspective, CHAOSS was a very good answer to a pressing set of questions. First, how should community health be defined? Second, as metrics begin to take shape, how can we transition from reacting to one-off requests for data-based information about a given community to creating an entire process pipeline, literally and theoretically, for this work? The development of Project Aspen is the culmination of this pipeline, which will ultimately bring community data analysis to Red Hat and anyone else who can use it in the open source community.

Collecting data

In 2017, Harish Pillay of Red Hat’s OSPO created Prospector, which was a huge inspiration for what would become Project Aspen. Prospector aimed to present information from core data sources in a graphical dashboard, giving users thresholds to gauge whether they should do additional analysis. This resonated with CHAOSS’ goal to better understand the health of open source communities, and Prospector was donated to CHAOSS and archived in July 2017. From a technical and theoretical standpoint, Project Aspen builds on the trail that Prospector first blazed.

Aspen is backed by a database generated from the Augur Project, a CHAOSS-based project that collects, organizes, and validates the completeness of open source software trace data. With this database, we can store all types of data points around the Git-based repositories from which we collect data, such as pull requests, reviews, and contributors. The data is already collected and cleaned, which, from a data science perspective, is where the most significant time drains occur. The continued data collection allows us to respond quickly when questions arise. Over time, we will grow our pipeline to collect data from many other avenues in addition to Git-based repositories, such as Stack Overflow and Reddit.

Because Augur regularly collects data on our selected repositories, the data is refreshed and cleaned within a week. With all the data collection and most preprocessing already completed, we are much better equipped to answer the analysis questions we receive and generate our own questions too. No matter where the questions come from, the same analysis process is necessary.

For every visualization or analysis, community leaders should consider these questions:

What perspective are you looking to gain or give?
What question can you directly answer from the data available to you?
What assumptions are you making, and what biases may you hold?
Who can you work with to get feedback and a different perspective?

Everyone’s individual experiences and expertise impact the lens through which they look at a problem. Some people have experience in code review, while others’ expertise lies in community management. How can we start comparing community aspects apples to apples instead of apples to oranges? Quantifying what people in different roles in open source are looking at when examining a project community can address this problem.

Community metrics empower all members to communicate in a common domain and share their unique expertise. Different perspectives lead to further insights, and Project Aspen uses data to make those insights more accessible to the entire community through data visualizations.

Assumptions vs. analysis

Analysis is a tool for narrative building, not an oracle. Data analysis can help take the ambiguity and bias out of inferences we make, but interpreting data is not simple. A bar chart showing an increase in commits over time is not, by itself, a positive indicator of community health. Nor is a stable or decreasing number always a negative sign. What any chart gives you is more information and areas to explore.

For instance, you could build from a commits-over-time visualization, creating a graph that plots the “depth” of a commit, perhaps defined as the number of line changes. Or you could dive into the specific work of your community to see what these trends actually represent.
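
As a minimal sketch of that idea, the Python below groups commits by month and reports both the commit count and a median "depth," assuming depth means lines added plus lines deleted. The commit records and the `monthly_commit_depth` helper are hypothetical and only illustrate the calculation; they are not taken from Augur or 8Knot.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Hypothetical commit records: (commit date, lines added, lines deleted).
commits = [
    (date(2023, 1, 3), 120, 30),
    (date(2023, 1, 17), 4, 1),
    (date(2023, 2, 2), 2, 2),
    (date(2023, 2, 9), 1, 0),
    (date(2023, 2, 20), 3, 1),
]


def monthly_commit_depth(records):
    """Return {(year, month): (commit count, median depth)}.

    Depth is assumed here to be lines added plus lines deleted.
    """
    depths = defaultdict(list)
    for day, added, deleted in records:
        depths[(day.year, day.month)].append(added + deleted)
    return {
        month: (len(values), median(values))
        for month, values in sorted(depths.items())
    }


summary = monthly_commit_depth(commits)
for (year, month), (count, depth) in summary.items():
    print(f"{year}-{month:02d}: {count} commits, median depth {depth}")
```

In this made-up sample, the number of commits rises from January to February while the median depth falls sharply, which is the kind of detail a commits-over-time chart alone would hide.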

Comparing an issues-over-time graph (Figure 1) to an issues staleness graph (Figure 2) is a great illustration of why perspective matters. These visualizations reflect the same data but reveal completely different insights. From the issue staleness graph, we can see not only how many issues are open, but how many have been open for various time intervals.

**Figure 1.** Issues over time from 8Knot community data

**Figure 2.** Issue staleness from 8Knot community data

Figure 1 shows that over many months there is relative consistency in how many issues are opened and closed. On the other hand, Figure 2 highlights the growing number of issues that have been open for over 30 days. The same data populates each graph, but a fuller picture can only come from seeing both. By adding the perspective of the growth in issue staleness, communities can clearly see that there is a growing backlog of issues and take steps to understand what it means for their community. At that point, they will be well-equipped to devise a strategy and prioritize actions based on both good data and thoughtful analysis.
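
One way to picture the staleness view is as a snapshot calculation: for a given date, count the issues still open and group them by how long they have been open. The sketch below is an illustration only. The issue records, the `staleness_buckets` helper, and the 7-day boundary are assumptions; only the 30-day threshold comes from the discussion above.

```python
from datetime import date

# Hypothetical issue records: (opened, closed); closed is None if still open.
issues = [
    (date(2023, 1, 2), date(2023, 1, 5)),
    (date(2023, 1, 10), None),
    (date(2023, 2, 20), None),
    (date(2023, 3, 1), date(2023, 3, 8)),
    (date(2023, 3, 6), None),
]


def staleness_buckets(records, as_of):
    """Count issues open on `as_of`, grouped by how long they have been open."""
    buckets = {"under 7 days": 0, "7 to 30 days": 0, "over 30 days": 0}
    for opened, closed in records:
        if opened > as_of:
            continue  # not yet opened on the snapshot date
        if closed is not None and closed <= as_of:
            continue  # already closed on the snapshot date
        age = (as_of - opened).days
        if age < 7:
            buckets["under 7 days"] += 1
        elif age <= 30:
            buckets["7 to 30 days"] += 1
        else:
            buckets["over 30 days"] += 1
    return buckets


snapshot = staleness_buckets(issues, as_of=date(2023, 3, 10))
print(snapshot)
```

Running the same function for a series of snapshot dates would show whether the "over 30 days" group is growing, even when the monthly counts of opened and closed issues look steady.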

Using data wisely

Including multiple points of view also provides much-needed insight and helps guard against false positives and gamification. Economists have a saying: “When a measure becomes a target, it ceases to be a good measure.” In other words, measures used to reward performance create an incentive to manipulate measurement. As people learn which measures bring attention, money, or power, open source communities run the risk of encouraging actions taken just to play the system. Using multiple perspectives to define success will keep your metrics meaningful, so they have genuine value in maintaining your community.

To that end, Project Aspen is an exciting tool for building your own knowledge and making better decisions about communities. Whether you want to understand where your community is most vulnerable or the seasonality of activity within the community, having quality data to inform your analysis is essential. To see some of the work being done around community data analysis, please check out our GitHub organization or the demo 8Knot app instance.

Red Hat Research Quarterly

February 2023

Cali Dolfi
