04 Oct 2017

Introduction to Statistics With Gonum

This is the start of a new series (hopefully with more posts than the interpreter one) about using Gonum to apply statistics.

This first post is really just a copy-paste of this one:

https://mubaris.com/2017-09-09/introduction-to-statistics-using-numpy

but using Go and Gonum instead of Python and numpy.

Go & Gonum

Gonum is “a set of packages designed to make writing numeric and scientific algorithms productive, performant and scalable.”

Before being able to use Gonum, we need to install Go. We can download and install the Go toolchain for a variety of platforms and operating systems from golang.org/dl.

Once that has been done, installing Gonum and all its dependencies can be done with:

$> go get gonum.org/v1/gonum/...

If you had a previous installation of Gonum, you can re-install it and update it to the latest one like so:

$> go get -u gonum.org/v1/gonum/...

Gonum and statistics

Gonum provides many statistical functions. Let’s use it to calculate the mean, median, standard deviation and variance of a small dataset.

// file: stats.go

package main

import (
	"fmt"
	"math"
	"sort"

	"gonum.org/v1/gonum/stat"
)

func main() {
	xs := []float64{
		32.32, 56.98, 21.52, 44.32,
		55.63, 13.75, 43.47, 43.34,
		12.34,
	}

	fmt.Printf("data: %v\n", xs)

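	// stat.Quantile (used below for the median) requires its input
	// to be sorted in increasing order.
	sort.Float64s(xs)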
	fmt.Printf("data: %v (sorted)\n", xs)

	// computes the weighted mean of the dataset.
	// we don't have any weights (ie: all weights are 1)
	// so we just pass a nil slice.
	mean := stat.Mean(xs, nil)

	// computes the median of the dataset.
	// here as well, we pass a nil slice as weights.
	median := stat.Quantile(0.5, stat.Empirical, xs, nil)

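	// computes the unbiased sample variance of the dataset
	// (ie: the divisor is len(xs)-1), then its square root.
	variance := stat.Variance(xs, nil)
	stddev := math.Sqrt(variance)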

	fmt.Printf("mean=     %v\n", mean)
	fmt.Printf("median=   %v\n", median)
	fmt.Printf("variance= %v\n", variance)
	fmt.Printf("std-dev=  %v\n", stddev)
}

The program above performs some rather basic statistical operations on our dataset:

$> go run stats.go
data: [32.32 56.98 21.52 44.32 55.63 13.75 43.47 43.34 12.34]
data: [12.34 13.75 21.52 32.32 43.34 43.47 44.32 55.63 56.98] (sorted)
mean=     35.96333333333334
median=   43.34
variance= 285.306875
std-dev=  16.891029423927957

The astute reader will no doubt notice that the variance value displayed here differs from the one obtained with numpy.var:

>>> import numpy as np
>>> xs=[32.32, 56.98, 21.52, 44.32, 55.63, 13.75, 43.47, 43.34, 12.34]
>>> xs.sort()
>>> np.mean(xs)
35.963333333333338
>>> np.median(xs)
43.340000000000003
>>> np.var(xs)
253.60611111111109
>>> np.std(xs)
15.925015262507948

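This is because numpy.var uses len(xs) as the divisor by default (the population variance), while stat.Variance from gonum/stat computes the unbiased sample variance (i.e. the divisor is len(xs)-1). Passing ddof=1 to numpy gives the same numbers as Gonum: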

>>> np.var(xs, ddof=1)
285.30687499999999
>>> np.std(xs, ddof=1)
16.891029423927957

With these rather blunt tools, we can analyse some real-world data. We will use a dataset pertaining to the salary of European developers, all 1147 of them :). We have this dataset in a file named salary.txt, with one value per line.

// file: stats-salary.go

package main

import (
	"bufio"
	"fmt"
	"log"
	"math"
	"os"
	"sort"

	"gonum.org/v1/gonum/stat"
)

func main() {
	f, err := os.Open("salary.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var xs []float64
	scan := bufio.NewScanner(f)
	for scan.Scan() {
		var v float64
		txt := scan.Text()
		_, err = fmt.Sscanf(txt, "%f", &v)
		if err != nil {
			log.Fatalf(
				"could not convert to float64 %q: %v",
				txt, err,
			)
		}
		xs = append(xs, v)
	}

	// make sure scanning the file and extracting values
	// went fine, without any error.
	if err = scan.Err(); err != nil {
		log.Fatalf("error scanning file: %v", err)
	}

	fmt.Printf("data sample size: %v\n", len(xs))

	// stat.Quantile requires its input to be sorted in increasing order.
	sort.Float64s(xs)

	mean := stat.Mean(xs, nil)
	median := stat.Quantile(0.5, stat.Empirical, xs, nil)
	variance := stat.Variance(xs, nil)
	stddev := math.Sqrt(variance)

	fmt.Printf("mean=     %v\n", mean)
	fmt.Printf("median=   %v\n", median)
	fmt.Printf("variance= %v\n", variance)
	fmt.Printf("std-dev=  %v\n", stddev)
}

And here is the output:

$> go run ./stats-salary.go
data sample size: 1147
mean=     55894.53879686138
median=   48000
variance= 3.0464263289031615e+09
std-dev=  55194.44110508921
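
A word of caution about the median. Both programs get it from stat.Quantile with stat.Empirical, and both datasets have an odd number of samples (9 and 1147), so the result is simply the middle value. As far as I understand the empirical quantile, it always returns one of the samples, so for an even number of samples it would not average the two middle values the way numpy.median does. If that matters, the averaging median is easy to write by hand. The sketch below uses only the standard library; median is a hypothetical helper, not a Gonum function.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// median returns the middle value of xs, averaging the two middle
// values when len(xs) is even (the convention used by numpy.median).
// NOTE: hypothetical helper for illustration, not a Gonum function.
func median(xs []float64) float64 {
	n := len(xs)
	if n == 0 {
		return math.NaN()
	}

	// work on a sorted copy so the caller's slice is left untouched.
	s := make([]float64, n)
	copy(s, xs)
	sort.Float64s(s)

	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

func main() {
	odd := []float64{
		32.32, 56.98, 21.52, 44.32,
		55.63, 13.75, 43.47, 43.34,
		12.34,
	}
	even := []float64{1, 2, 3, 4} // made-up values, just to show the even case

	fmt.Println(median(odd))  // prints 43.34
	fmt.Println(median(even)) // prints 2.5
}
```

For the odd-sized dataset this gives the same 43.34 as the first program.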

The data file is available as salary.txt, together with a much more detailed version, salary.csv.