Deliberate Software

With deliberate practice come consistent results

Language Safety Score Mark 2

I want to make a model that predicts bugs.

I previously wrote a table for scoring language safety: Programming Language Safety Score, but it was extremely time-consuming to score new languages or make modifications.

Simplify, Simplify

After being told I was overfitting the data, I’ve attempted to clean things up by simply checking whether each category is enforced, possible, or impossible. I score each as 1 (the language enforces it), 0 (possible, but you have to remember to do it), or -1 (impossible). When the magnitudes of the new model are compared with the previous model, they come out very similar: the shape of the curve stays essentially the same, which I was told indicates that the character-count weighting was a variable that didn’t matter.
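A minimal sketch of this scoring scheme in Python (the category names here are made up for illustration, not the actual checks from the table below):

```python
# Simplified scoring: each safety category is rated 1 (enforced by the
# language), 0 (possible but manual), or -1 (impossible).
SCORES = {"enforced": 1, "possible": 0, "impossible": -1}

def safety_total(checks):
    """Sum the per-category values into a single raw score."""
    return sum(SCORES[level] for level in checks.values())

def magnitude(total, max_total):
    """Normalize a raw score to a percentage of the maximum possible."""
    return 100.0 * total / max_total

# Example: a made-up language that enforces two checks, allows one,
# and makes one impossible. Category names are hypothetical.
example = {
    "null_safety": "enforced",
    "immutability": "enforced",
    "bounds_checking": "possible",
    "memory_safety": "impossible",
}
total = safety_total(example)  # 1 + 1 + 0 - 1 = 1
```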

[Figure: normalized scores of the new model plotted against the previous model]

The code I used to generate the plot and normalize the scores can be found here: scorePlot.R

Safety Definitions

A definition of the safety checks is as follows:

[Table: each safety check’s name and description]

The new scores are shown here, with a lot more languages added in:

[Interactive table: per-language safety check values, totals, and normalized magnitude percentages]

So, What’s the Point?

To see how this model corresponds with real-world data, I used the GitHub API to query for the number of bugs logged in repositories with more than 15 forks that were created between 2011 and 2015. Commits were counted by summing the commit counts of all contributors.
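A sketch of the kind of queries involved; the exact parameters here are assumptions, but the search qualifiers and the per-contributor `contributions` field are part of the GitHub v3 API:

```python
# Sketch: build a repository search query matching the sampling
# criteria (forks > 15, created 2011-2015), and sum per-contributor
# commit counts as returned by /repos/{owner}/{repo}/contributors.
# The exact qualifiers used for the post are not shown, so treat
# these values as illustrative.
from urllib.parse import urlencode

def repo_search_url(min_forks=15, start="2011-01-01", end="2015-12-31"):
    """Build a GitHub search URL for repositories matching the criteria."""
    q = f"forks:>{min_forks} created:{start}..{end}"
    return "https://api.github.com/search/repositories?" + urlencode({"q": q})

def total_commits(contributors):
    """Sum per-contributor commit counts ('contributions' field)."""
    return sum(c["contributions"] for c in contributors)

# Example payload shaped like the contributors endpoint's response:
sample = [
    {"login": "alice", "contributions": 120},
    {"login": "bob", "contributions": 35},
]
```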

I decided to rely on the count of commits as a standard for a unit of work. My assumption was that across fifty different projects, the commit sizes would average out. Once the unit of work was decided on, I wanted to find the ratio of bugs per commit for each language.

I collected the ratio of bugs logged per commit for each repository. After grouping by primary language, I removed the top and bottom 25% of repositories by bug/commit ratio (trimming, a common statistical technique for reducing the influence of outliers). I then summed the bugs and commits of the remaining repositories in each language group, giving a total average bug/commit ratio per language. Here is that data, sorted by safety score.

[Table: per-language totals of bugs, commits, repositories, and bug/commit ratio]
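The trimming and pooling step above can be sketched like this (field names are illustrative):

```python
# Sketch: within one language group, drop the repositories in the top
# and bottom quartile by bug/commit ratio, then pool the remainder
# into a single ratio for the language.
def trimmed_ratio(repos):
    """Pool bugs and commits after dropping the top and bottom 25%
    of repositories by bug/commit ratio."""
    ranked = sorted(repos, key=lambda r: r["bugs"] / r["commits"])
    k = len(ranked) // 4                       # quartile size
    kept = ranked[k:len(ranked) - k] if k else ranked
    bugs = sum(r["bugs"] for r in kept)
    commits = sum(r["commits"] for r in kept)
    return bugs / commits

# Made-up example: the outliers at each end are dropped, leaving the
# middle two repositories: (10 + 12) / (100 + 100) = 0.11
repos = [
    {"bugs": 1,  "commits": 100},   # low outlier
    {"bugs": 10, "commits": 100},
    {"bugs": 12, "commits": 100},
    {"bugs": 90, "commits": 100},   # high outlier
]
```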

Here are the languages sorted by safety score with bug/commit ratios:

[Figures: languages sorted by safety score; bug/commit ratios per language]

I took the magnitudes of the safety scores and the bug/commit ratios, inverted the safety scores, and overlaid both onto a single graph.

[Figure: inverted safety score magnitudes overlaid with bug/commit ratios]

It is immediately obvious that Ruby, Python, PHP, and Clojure all strongly buck the trend, but otherwise the languages follow a fairly consistent downward slope in bugs. Computing the correlation between the two series gives a coefficient of 0.55.
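For reference, the correlation here is a plain Pearson coefficient. A sketch with made-up numbers (the real data gives 0.55), showing why the safety magnitudes are inverted first so that "less safe" and "more bugs" point the same way:

```python
# Pearson correlation between two equal-length series, from the
# standard definition: covariance over the product of standard
# deviations. The data below is illustrative, not the post's data.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Inverting the safety magnitudes flips the series so a positive
# coefficient means "less safe correlates with more bugs".
inverted_safety = [100 - m for m in (80.0, 60.0, 40.0, 20.0)]
bug_ratios = [0.02, 0.05, 0.06, 0.09]
r = pearson(inverted_safety, bug_ratios)
```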

What About Unit Tests?

Thinking that Ruby, Clojure, PHP, and Python might not correlate well due to some other factor, I collected data on how many tests each repository had. I counted the number of files containing “test” or “spec”, which gave the following, sorted by tests per commit:

[Table: per-language totals of tests, commits, repositories, and tests/commit ratio]
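The counting heuristic above can be sketched like this, assuming the match is against file paths (the post doesn’t specify whether names or contents were matched):

```python
# Sketch: count files whose paths contain "test" or "spec",
# case-insensitively. The example paths are made up.
def count_test_files(paths):
    """Count paths matching the "test"/"spec" heuristic."""
    return sum(1 for p in paths
               if "test" in p.lower() or "spec" in p.lower())

paths = [
    "src/parser.py",
    "tests/test_parser.py",
    "spec/parser_spec.rb",
    "README.md",
]
# Two of the four paths match the heuristic.
```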

PHP, Python, and Ruby all have a higher than average number of tests, but Clojure does not. Additionally, Go, Scala, and Java also have a higher than average number of tests, yet they score roughly average in bugs/commit.

Conclusion

In conclusion, the current safety model I have proposed seems to account for a moderate reduction in bugs per commit across the sampled languages, but it is clearly not the only factor. In particular, it cannot account for the significantly lower than expected bug counts in Ruby and Clojure.

Special Thanks

Special thanks to (in alphabetical order): Patrick Boe (Haskell, Sniff Test), Kyle Burton (General Advice), Nils Creque (Listening Board), Max Haley (Python, Ruby, Teaching me how to math), Daniel Miladinov (Java, Scala, Morale Support), Keith O’Brien (Ruby and JS), Chris Salch (CoffeeScript and JS), and Tim Visher (Clojure).

Additional thanks to the posters on /r/rust, including /u/notriddle, /u/killercup, and /u/diegobernardes who put together the Rust score.

Complaints Department

Did I mess up something about a language here, or am I missing a safety check? I’ll happily take pull requests for new languages: blog source. Just pick an existing language, edit the name and values, and “copy to clipboard” to build your own language data structure. Send it to me in a PR and I’ll include it along with a thanks on the page.

[Interactive widget: select a language to view and edit its per-check scores]