31 Jul 2018, 10:22

Go-HEP Manifesto

Hello again.

I am starting today an article for arXiv about Go and Go-HEP. I thought structuring my thoughts a bit (in the form of a blog post) would help fluidify the process.

(HEP) Software is painful

In my introduction talk(s) about Go and Go-HEP, such as here, I usually talk about software being painful. HENP software is no exception. It is painful.

As a C++/Python developer and former software architect of one of the four LHC experiments, I can tell you from vivid experience that software is painful to develop. One has to tame deep and complex software stacks with huge dependency lists. Each dependency comes with its own way to be configured, built and installed. Each dependency comes with its own dependencies. When you start working with one of these software stacks, installing them on your own machine is no walk in the park, even for experienced developers. These software stacks are real snowflakes: they need their unique cocktail of dependencies, with the right version, compiler toolchain and OS, tightly integrated on usually a single development platform.

Granted, the de facto standardization on CMake and docker did help with some of these aspects, allowing projects to cleanly encapsulate the list of dependencies in a reproducible way, in a container. Alas, this renders code easier to deploy but less portable: everything is linux/amd64 plus some arbitrary Linux distribution.

In HENP, with C++ being now the lingua franca for everything that is related with framework or infrastructure, we get unwiedly compilation times and thus a very unpleasant edit-compile-run development cycle. Because C++ is a very complex language to learn, read and write - each new revision more complex than the previous one - it is becoming harder to bring new people on board with existing C++ projects that have accumulated a lot of technical debt over the years: there are many layers of accumulated cruft, different styles, different ways to do things, etc…

Also, HENP projects heavily rely on shared libraries: not because of security, not because they are faster at runtime (they are not), but because as C++ is so slow to compile, it is more convenient to not recompile everything into a static binary. And thus, we have to devise sophisticated deployment scenarii to deal with all these shared libraries, properly configuring $LD_LIBRARY_PATH, $DYLD_LIBRARY_PATH or -rpath, adding yet another moving piece in the machinery. We did not have to do that in the FORTRAN days: we were building static binaries.

From a user perspective, HENP software is also - even more so - painful. One needs to deal with:

  • overly complicated Object Oriented systems,
  • overly complicated inheritance hierarchies,
  • overly complicated meta-template programming,

and, of course, dependencies. It’s 2018 and there are still no simple way to handle dependencies, nor a standard one that would work across operating systems, experiments or analysis groups, when one lives in a C++ world. Finally, there is no standard way to retrieve documentation - and here we are just talking about APIs - nor a system that works across projects and across dependencies.

All of these issues might explain why many physicists are migrating to Python. The ecosystem is much more integrated and standardized with regard to installation procedures, serving, fetching and describing dependencies and documentation tools. Python is also simpler to learn, teach, write and read than C++. But it is also slower.

Most physicists and analysts are willing to pay that price, trading reduced runtime efficiency for a wealth of scientific, turn-key pure-Python tools and libraries. Other physicists strike a different compromise and are willing to trade the relatively seamless installation procedures of pure-Python software with some runtime efficiency by wrapping C/C++ libraries.

To summarize, Python and C++ are no panacea when you take into account the vast diversity of programming skills in HENP, the distributed nature of scientific code development in HENP, the many different teams’ sizes and the constraints coming from the development of scientific analyses (agility, fast edit-compile-run cycles, reproducibility, deployment, portability, …) To add insult to injury, these languages are rather ill equiped to cope with distributed programming and parallel programming: either because of a technical limitation (CPython’s Global Interpreter Lock) or because the current toolbox is too low-level or error-prone.

Are we really left with either:

  • a language that is relatively fast to develop with, but slow at runtime, or
  • a language that is painful to develop with but fast at runtime ?

nogo

Mending software with Go

Of course, I think Go can greatly help with the general situation of software in HENP. It is not a magic wand, you still have to think and apply work. But it is a definitive, positive improvement.

go-logo

Go was created to tackle all the challenges that C++ and Python couldn’t overcome. Go was designed for “programming in the large”. Go was designed to strive at scales: software development at Google-like scale but also at 2-3 people scale.

But, most importantly, Go wasn’t designed to be a good programming language, it was designed for software engineering:

  Software engineering is what happens to programming 
  when you add time and other programmers.

Go is a simple language - not a simplistic language - so one can easily learn most of it in a couple of days and be proficient with it in a few weeks.

Go has builtin tools for concurrency (the famed goroutines and channels) and that is what made me try it initially. But I stayed with Go for everything else, ie the tooling that enables:

  • code refactoring with gorename and eg,
  • code maintenance with goimports, gofmt and go fix,
  • code discoverability and completion with gocode,
  • local documentation (go doc) and across projects (godoc.org),
  • integrated, simple, build system (go build) that handles dependencies (go get), without messing around with CMakeList.txt, Makefile, setup.py nor pom.xml build files: all the needed information is in the source files,
  • easiest cross-compiling toolchain to date.

And all these tools are usable from every single editor or IDE.

Go compiles optimized code really quickly. So much so that the go run foo.go command, that compiles a complete program and executes it on the fly, feels like running python foo.py - but with builtin concurrency and better runtime performances (CPU and memory.) Go produces static binaries that usually do not even require libc. One can take a binary compiled for linux/amd64, copy it on a Centos-7 machine or on a Debian-8 one, and it will happily perform the requested task.

As a Gedankexperiment, take a standard centos7 docker image from docker-hub and imagine having to build your entire experiment software stack, from the exact gcc version down to the last wagon of your train analysis.

  • How much time would it take?
  • How much effort of tracking dependencies and ensuring internal consistency would it take?
  • How much effort would it be to deploy the binary results on another machine? on another non-Linux machine?

Now consider this script:

#!/bin/bash

yum install -y git mercurial curl

mkdir /build
cd /build

## install the Go toolchain
curl -O -L https://golang.org/dl/go1.10.3.linux-amd64.tar.gz
tar zxf go1.10.3.linux-amd64.tar.gz
export GOROOT=`pwd`/go
export GOPATH=/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH

## install Go-HEP and its dependencies
go get -v go-hep.org/x/hep/...

Running this script inside said container yields:

$> time ./install.sh
[...]
go-hep.org/x/hep/xrootd/cmd/xrd-ls
go-hep.org/x/hep/xrootd/server
go-hep.org/x/hep/xrootd/cmd/xrd-srv

real  2m30.389s
user  1m09.034s
sys   0m14.015s

In less than 3 minutes, we have built a container with (almost) all the tools to perform a HENP analysis. The bulk of these 3 minutes is spent cloning repositories.

Building root-dump, a program to display the contents of a ROOT file for, say, Windows, can easily performed in one single command:

$> GOOS=windows \
   go build go-hep.org/x/hep/rootio/cmd/root-dump
$> file root-dump.exe 
root-dump.exe: PE32+ executable (console) x86-64 (stripped to external PDB), for MS Windows

## now, for windows-32b
$> GOARCH=386 GOOS=windows \
   go build go-hep.org/x/hep/rootio/cmd/root-dump
$> file root-dump.exe 
root-dump.exe: PE32 executable (console) Intel 80386 (stripped to external PDB), for MS Windows

Fun fact: Go-HEP was supporting Windows users wanting to read ROOT-6 files before ROOT itself (ROOT-6 support for Windows landed with 6.14/00.)

Go & Science

Most of the needed scientific tools are available in Go at gonum.org:

  • plots,
  • network graphs,
  • integration,
  • statistical analysis,
  • linear algebra,
  • optimization,
  • numerical differentiation,
  • probability functions (univariate and multivariate),
  • discrete Fourier transforms

Gonum is almost at feature parity with the numpy/scipy stack. Gonum is still missing some tools, like ODE or more interpolation tools, but the chasm is closing.

Right now, in a HENP context, it is not possible to perform an analysis in Go and insert it in an already existing C++/Python pipeline. At least not easily: while reading is possible, Go-HEP is still missing the ability to write ROOT files. This restriction should be lifted before the end of 2018.

That said, Go can already be quite useful and usable, now, in science and HENP, for data acquisition, monitoring, cloud computing, control frameworks and some physics analyses. Indeed, Go-HEP provides HEP-oriented tools such as histograms and n-tuples, Lorentz vectors, fitting, interoperability with HepMC and other Monte-Carlo programs (HepPDT, LHEF, SLHA), a toolkit for a fast detector simulation à la Delphes and libraries to interact with ROOT and XRootD.

I think building the missing scientific libraries in Go is a better investment than trying to fix the C++/Python languages and ecosystems.

Go is a better trade-off for software engineering and for science:

with-go


PS: There’s a nice discussion about this post on the Go-HEP forum.

11 Oct 2017, 16:20

Simple Monte Carlo with Gonum and Go-HEP

Today, we’ll investigate the Monte Carlo method. Wikipedia, the ultimate source of truth in the (known) universe has this to say about Monte Carlo:

Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. (…) Monte Carlo methods are mainly used in three distinct problem classes: optimization, numerical integration, and generating draws from a probability distribution.

In other words, the Monte Carlo method is a numerical technique using random numbers:

  • Monte Carlo integration to estimate the value of an integral:
    • take the function value at random points
    • the area (or volume) times the average function value estimates the integral
  • Monte Carlo simulation to predict an expected measurement.
    • an experimental measurement is split into a sequence of random processes
    • use random numbers to decide which processes happen
    • tabulate the values to estimate the expected probability density function (PDF) for the experiment.

Before being able to write a High Energy Physics detector simulation (like Geant4, Delphes or fads), we need to know how to generate random numbers, in Go.

Generating random numbers

The Go standard library provides the building blocks for implementing Monte Carlo techniques, via the math/rand package.

math/rand exposes the rand.Rand type, a source of (pseudo) random numbers. With rand.Rand, one can:

  • generate random numbers following a flat, uniform distribution between [0, 1) with Float32() or Float64();
  • generate random numbers following a standard normal distribution (of mean 0 and standard deviation 1) with NormFloat64();
  • and generate random numbers following an exponential distribution with ExpFloat64.

If you need other distributions, have a look at Gonum’s gonum/stat/distuv.

math/rand exposes convenience functions (Float32, Float64, ExpFloat64, …) that share a global rand.Rand value, the “default” source of (pseudo) random numbers. These convenience functions are safe to be used from multiple goroutines concurrently, but this may generate lock contention. It’s probably a good idea in your libraries to not rely on these convenience functions and instead provide a way to use local rand.Rand values, especially if you want to be able to change the seed of these rand.Rand values.

Let’s see how we can generate random numbers with "math/rand":

func main() {
	const seed = 12345
	src := rand.NewSource(seed)
	rnd := rand.New(src)

	const N = 10
	for i := 0; i < N; i++ {
		r := rnd.Float64() // r is in [0.0, 1.0)
		fmt.Printf("%v\n", r)
	}
}

Running this program gives:

$> go run ./mc-0.go
0.8487305991992138
0.6451080292174168
0.7382079884862905
0.31522206779732853
0.057001989921077224
0.9672449323010088
0.6139541710075446
0.01505990819189991
0.13361969083044145
0.5118319569473198

OK. Does this seem flat to you? Not sure…

Let’s modify our program to better visualize the random data. We’ll use a histogram and the go-hep.org/x/hep/hbook and go-hep.org/x/hep/hplot packages to (respectively) create histograms and display them.

Note: hplot is a package built on top of the gonum.org/v1/plot package, but with a few HEP-oriented customization. You can use gonum.org/v1/plot directly if you so choose or prefer.

func main() {
	const seed = 12345
	src := rand.NewSource(seed)
	rnd := rand.New(src)

	const N = 10000

	huni := hbook.NewH1D(100, 0, 1.0)

	for i := 0; i < N; i++ {
		r := rnd.Float64() // r is in [0.0, 1.0)
		huni.Fill(r, 1)
	}

	plot(huni, "uniform.png")
}

We’ve increased the number of random numbers to generate to get a better idea of how the random number generator behaves, and disabled the printing of the values.

We first create a 1-dimensional histogram huni with 100 bins from 0 to 1. Then we fill it with the value r and an associated weight (here, the weight is just 1.)

Finally, we just plot (or rather, save) the histogram into the file "uniform.png" with the plot(...) function:

func plot(h *hbook.H1D, fname string) {
	p := hplot.New()
	hh := hplot.NewH1D(h)
	hh.Color = color.NRGBA{0, 0, 255, 255}
	p.Add(hh, hplot.NewGrid())

	err := p.Save(10*vg.Centimeter, -1, fname)
	if err != nil {
		log.Fatal(err)
	}
}

Running the code creates a uniform.png file:

$> go run ./mc-1.go

plot-uniform

Indeed, that looks rather flat.

So far, so good. Let’s add a new distribution: the standard normal distribution.

func main() {
	const seed = 12345
	src := rand.NewSource(seed)
	rnd := rand.New(src)

	const N = 10000

	huni := hbook.NewH1D(100, 0, 1.0)
	hgauss := hbook.NewH1D(100, -5, 5)

	for i := 0; i < N; i++ {
		r := rnd.Float64() // r is in [0.0, 1.0)
		huni.Fill(r, 1)

		g := rnd.NormFloat64()
		hgauss.Fill(g, 1)
	}

	plot(huni, "uniform.png")
	plot(hgauss, "norm.png")
}

Running the code creates the following new plot:

$> go run ./mc-2.go

plot-norm

Note that this has slightly changed the previous "uniform.png" plot: we are sharing the source of random numbers between the 2 histograms. The sequence of random numbers is exactly the same than before (modulo the fact that now we generate -at least- twice the number than previously) but they are not associated to the same histograms.

OK, this does generate a gaussian. But what if we want to generate a gaussian with a mean other than 0 and/or a standard deviation other than 1 ?

The math/rand.NormFloat64 documentation kindly tells us how to achieve this:

“To produce a different normal distribution, callers can adjust the output using: sample = NormFloat64() * desiredStdDev + desiredMean

Let’s try to generate a gaussian of mean 10 and standard deviation 2. We’ll have to change a bit the definition of our histogram:

func main() {
	const seed = 12345
	src := rand.NewSource(seed)
	rnd := rand.New(src)

	const (
		N      = 10000
		mean   = 10.0
		stddev = 5.0
	)

	huni := hbook.NewH1D(100, 0, 1.0)
	hgauss := hbook.NewH1D(100, -10, 30)

	for i := 0; i < N; i++ {
		r := rnd.Float64() // r is in [0.0, 1.0)
		huni.Fill(r, 1)

		g := mean + stddev*rnd.NormFloat64()
		hgauss.Fill(g, 1)
	}

	plot(huni, "uniform.png")
	plot(hgauss, "gauss.png")

	fmt.Printf("gauss: mean=    %v\n", hgauss.XMean())
	fmt.Printf("gauss: std-dev= %v +/- %v\n", hgauss.XStdDev(), hgauss.XStdErr())
}

Running the program gives:

$> go run mc-3.go
gauss: mean=    10.105225624460644
gauss: std-dev= 5.048629091912316 +/- 0.05048629091912316

plot-gauss

OK enough for today. Next time, we’ll play a bit with math.Pi and Monte Carlo.

Note: all the code is go get-able via:

$> go get github.com/sbinet/blog/static/code/2017-10-11/...

10 Oct 2017, 11:07

L3 LEP data

Still working our way through this tutorial based on C++ and MINUIT:

http://www.desy.de/~rosem/flc_statistics/data/04_parameters_estimation-C.pdf

Now, we tackle the L3 LEP data. L3 was an experiment at the Large Electron Positron collider, at CERN, near Geneva. Until 2000, it recorded the decay products of e+e- collisions at center of mass energies up to 208 GeV.

An example is the muon pair production:

$$ e^+ e^- \rightarrow \mu^+\mu^-$$

Both muons are mainly detected and reconstructed from the tracking system. From the measurements, the curvature, charge and momentum are determined.

The file L3.dat contains recorded muon pair events. Every line is an event, a recorded collision of a \(e^+e^-\) pair producing a \(\mu^+\mu^-\) pair.

The first three columns contain the momentum components \(p_x\), \(p_y\) and \(p_z\) of the \(\mu^+\). The other three columns contain the momentum components for the \(mu^-\). Units are in \(GeV/c\).

Forward-Backward Asymmetry

An important parameter that constrains the Standard Model (the theoretical framework that models our current understanding of Physics) is the forward-backward asymmetry A:

$$ A = (N_F - N_B) / (N_F + N_B) $$

where:

  • \(N_F\) are the events in which the \(\mu^-\) flies forwards (\(\cos \theta_{\mu^-} > 0\));
  • \(N_B\) are the events in which the \(\mu^-\) flies backwards.

Given the L3.dat dataset, we would like to estimate the value of \(A\) and determine its statistical error.

In a simple counting experiment, we can write the statistical error as:

$$ \sigma_A = \sqrt{ \frac{(1-A^2)}{N} }$$

where \(N = N_F + N_B\).

So, as a first step, we can simply count the forward events.

First estimation

Let’s look at that data:

$> head -n 5 L3.dat
     4.584    -9.763   -18.508   -24.171    50.464    95.865
    62.570   184.448  -175.983   -28.392   -83.491    70.656
     7.387   101.650    13.531    -7.853  -108.002   -14.472
    43.672    56.083   -77.367   -40.893   -52.481    72.804
   -36.620   -60.832   -46.156    35.863    59.591    -3.220

First we need to assemble a bit of code to read in that file:

func read(fname string) []float64 {
	f, err := os.Open(fname)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var costh []float64
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var pp, pm [3]float64
		_, err = fmt.Sscanf(
			sc.Text(), "%f %f %f %f %f %f",
			&pp[0], &pp[1], &pp[2],
			&pm[0], &pm[1], &pm[2],
		)
		if err != nil {
			log.Fatal(err)
		}
		v := pm[0]*pm[0] + pm[1]*pm[1] + pm[2]*pm[2]
		costh = append(
			costh,
			pm[2]/math.Sqrt(v),
		)
	}

	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}

	plot(costh)
	return costh
}

Now, with the \(\cos \theta_{\mu^{-}}\) calculation out of the way, we can actually compute the asymmetry and its associated statistical error:

func asym(costh []float64) (float64, float64) {
	n := float64(len(costh))
	nf := 0
	for _, v := range costh {
		if v > 0 {
			nf++
		}
	}

	a := 2*float64(nf)/n - 1
	sig := math.Sqrt((1 - a*a) / n)
	return a, sig
}

Running the code gives:

events: 346
A = 0.231 +/- 0.052

plot

OK. Let’s try to use gonum/optimize and a log-likelihood.

Estimation with gonum/optimize

To let optimize.Minimize loose on our dataset, we need the angular distribution:

$$ f(\theta, A) = \frac{3}{8} (1 + \cos^2 \theta) + A \cos\theta $$

we just need to feed that through the log-likelihood procedure:

func main() {
	costh := read("L3.dat")
	log.Printf("events: %d\n", len(costh))
	a, sig := asym(costh)
	log.Printf("A= %5.3f +/- %5.3f\n", a, sig)

	fcn := func(x []float64) float64 {
		lnL := 0.0
		A := x[0]
		const k = 3.0 / 8.0
		for _, v := range costh {
			lnL += math.Log(k*(1+v*v) + A*v)
		}
		return -lnL
	}

	grad := func(grad, x []float64) {
		fd.Gradient(grad, fcn, x, nil)
	}

	hess := func(h mat.MutableSymmetric, x []float64) {
		fd.Hessian(h.(*mat.SymDense), fcn, x, nil)
	}

	p := optimize.Problem{
		Func: fcn,
		Grad: grad,
		Hess: hess,
	}

	var meth = &optimize.Newton{}
	var p0 = []float64{0} // initial value for A

	res, err := optimize.Minimize(p, p0, nil, meth)
	if err != nil {
		log.Fatal(err)
	}
	display(res, p)
}

which, when run, gives:

$> go build -o fba fba.go
$> ./fba
events: 346
A= 0.231 +/- 0.052

results: &optimize.Result{Location:optimize.Location{X:[]float64{0.21964440563754575}, F:230.8167059887097, Gradient:[]float64{0}, Hessian:(*mat.SymDense)(0xc42004b440)}, Stats:optimize.Stats{MajorIterations:4, FuncEvaluations:5, GradEvaluations:5, HessEvaluations:5, Runtime:584120}, Status:4}

minimal function value:  230.817
number of parameters: 1
grad=[0]
hess=[429.2102851867676]
errs= [0.04826862636461834]
par-000: 2.196444e-01 +/- 4.826863e-02

ie: the same answer than the MINUIT-based code there:

Results of MINUIT minimisation
-------------------------------------

 Minimal function value:               230.817  
 Estimated difference to true minimum:   9.044e-11 
 Number of parameters:           1     
 Error definition (Fmin + Delta):         0.500  
 Exact covariance matrix.

   Parameter     Value       Error    positive    negative    L_BND    U_BND
 0          A  2.196e-01  4.827e-02 +4.767e-02 -4.879e-02  0.0e+00  0.0e+00

Covariance matrix: 
  2.330e-03

Correlation matrix: 
  1.000

09 Oct 2017, 11:11

Introduction to Minimization with Gonum

Switching gears a bit with regard to last week, let’s investigate how to perform minimization with Gonum.

In High Energy Physics, there is a program to calculate numerically:

  • a function minimum of \(F(a)\) of parameters \(a_i\) (with up to 50 parameters),
  • the covariance matrix of these parameters
  • the (asymmetric or parabolic) errors of the parameters from \(F_{min}+\Delta\) for arbitrary \(\Delta\)
  • the contours of parameter pairs \(a_i, a_j\).

This program is called MINUIT and was originally written by Fred JAMES in FORTRAN. MINUIT has been since then rewritten in C++ and is available through ROOT.

Let’s see what Gonum and its gonum/optimize package have to offer.

Physics example

Let’s consider a radioactive source. n measurements are taken, under the same conditions. The physicist measured and counted the number of decays in a given constant time interval:

rs := []float64{1, 1, 5, 4, 2, 0, 3, 2, 4, 1, 2, 1, 1, 0, 1, 1, 2, 1}

What is the mean number of decays ?

A naive approach could be to just use the (weighted) arithmetic mean:

mean := stat.Mean(rs, nil)
merr := math.Sqrt(mean / float64(len(rs)))
fmt.Printf("mean=  %v\n", mean)
fmt.Printf("µ-err= %v\n", merr)
// Output:
// mean=  1.7777777777777777
// µ-err= 0.31426968052735443 

Let’s plot the data:

func plot(rs []float64) {
	mean := stat.Mean(rs, nil)

	// hbook is go-hep.org/x/hep/hbook.
	// here we create a 1-dim histogram with 10 bins,
	// from 0 to 10.
	h := hbook.NewH1D(10, 0, 10)
	for _, x := range rs {
		h.Fill(x, 1)
	}
	h.Scale(1 / h.Integral())

	// hplot is a convenience package built on top
	// of gonum.org/v1/plot.
	// hplot is go-hep.org/x/hep/hplot.
	p := hplot.New()
	p.X.Label.Text = "r"
	hh := hplot.NewH1D(h)
	hh.FillColor = color.RGBA{255, 0, 0, 255}
	p.Add(hh)

	fct := hplot.NewFunction(distuv.Poisson{Lambda: mean}.Prob)
	fct.Color = color.RGBA{0, 0, 255, 255}
	p.Add(fct)
	p.Add(hplot.NewGrid())

	err := p.Save(10*vg.Centimeter, -1, "plot.png")
	if err != nil {
		log.Fatal(err)
	}
}

which gives:

plot

Ok, let’s try to estimate µ using a log-likelihood minimization.

With MINUIT

From the plot above and from first principles, we can assume a Poisson distribution. The Poisson probability is:

$$P(n|\mu) = \frac{\mu^{n}}{n!} e^{-\mu}$$

This therefore leads to a log-likelihood of:

$$\ln L(\mu) = n \ln(\mu) - \ln(n!) - \mu$$

which is the quantity we’ll try to optimize.

In C++, this would look like:

#include <math.h>
#include <cstdio>
#include "TMinuit.h"

#define NDATA 18
int r[NDATA]    = {1,1,  5, 4,2,0,3,2, 4,1,2,1,1,0,1,1,2,1};
int rfac[NDATA] = {1,1,120,24,2,1,6,2,24,1,2,1,1,1,1,1,2,1};

void fcn(int &npar, double *gin, double &f, double *par, int iflag) {
	int i;
	double mu, lnL;

	mu = par[0];
	lnL = 0.0;
	for (i=0; i<NDATA; i++) {
		lnL += r[i]*std::log(mu) - mu - std::log((double)rfac[i]);
	}
	f = -lnL;
}

int main(int argc, char **argv) {

	double arglist[10];
	int ierflg = 0;

	double start = 1.0; // initial value for mu
	double step = 0.1;
	double l_bnd = 0.1;
	double u_bnd = 10.;

	TMinuit minuit(1); // 1==number of parameters
	minuit.SetFCN(fcn);
	minuit.mnparm(
		0, "Poisson mu",
		start, step,
		l_bnd, u_bnd, ierflg
	);

	// set a 1-sigma error for the log-likelihood
	arglist[0] = 0.5;
	minuit.mnexcm("SET ERR",arglist,1,ierflg);

	// search for minimum.
	// computes covariance matrix and computes parabolic
	// errors for all parameters.
	minuit.mnexcm("MIGRAD",arglist,0,ierflg);

	// calculates exact, asymmetric errors for all
	// variable parameters.
	minuit.mnexcm("MINOS",arglist,0,ierflg);

	// set a 2-sigma error for the log-likelihood
	arglist[0] = 2.0;
	minuit.mnexcm("SET ERR",arglist,1,ierflg);

	// calculates exact, asymmetric errors for all
	// variable parameters.
	minuit.mnexcm("MINOS",arglist,0,ierflg);

	results(&minuit);
	return 0;
}

As this isn’t a blog post about how to use MINUIT, we won’t go too much into details.

Compiling the above program with:

$> c++ -o radio `root-config --libs --cflags` -lMinuit radio.cc

and then running it, gives:

$> ./radio
[...]
Results of MINUIT minimisation
-------------------------------------

 Minimal function value:                29.296  
 Estimated difference to true minimum:   2.590e-09 
 Number of parameters:           1     
 Error definition (Fmin + Delta):         2.000  
 Exact covariance matrix.

   Parameter     Value       Error    positive    negative    L_BND    U_BND
 0 Poisson mu  1.778e+00  6.285e-01 +7.047e-01 -5.567e-01  1.0e-01  1.0e+01

Covariance matrix: 
  3.951e-01

Correlation matrix: 
  1.000

So the mean of the Poisson distribution is estimated to 1.778 +/- 0.629.

With gonum/optimize

func main() {
	rs := []float64{1, 1, 5, 4, 2, 0, 3, 2, 4, 1, 2, 1, 1, 0, 1, 1, 2, 1}
	rfac := []float64{1, 1, 120, 24, 2, 1, 6, 2, 24, 1, 2, 1, 1, 1, 1, 1, 2, 1}

	mean := stat.Mean(rs, nil)
	merr := math.Sqrt(mean / float64(len(rs)))

	fmt.Printf("mean=%v\n", mean)
	fmt.Printf("merr=%v\n", merr)

	fcn := func(x []float64) float64 {
		mu := x[0]
		lnl := 0.0
		for i := range rs {
			lnl += rs[i]*math.Log(mu) - mu - math.Log(rfac[i])
		}
		return -lnl
	}

	grad := func(grad, x []float64) {
		fd.Gradient(grad, fcn, x, nil)
	}

	hess := func(h mat.MutableSymmetric, x []float64) {
		fd.Hessian(h.(*mat.SymDense), fcn, x, nil)
	}

	p := optimize.Problem{
		Func: fcn,
		Grad: grad,
		Hess: hess,
	}

	var meth = &optimize.Newton{}
	var p0 = []float64{1} // initial value for mu

	res, err := optimize.Minimize(p, p0, nil, meth)
	if err != nil {
		log.Fatal(err)
	}

	display(res, p)
	plot(rs)
}

Compiling and running this program gives:

$> go build -o radio radio.go
$> ./radio
mean=1.7777777777777777
merr=0.31426968052735443

results: &optimize.Result{Location:optimize.Location{X:[]float64{1.7777777839915905}, F:29.296294958031794, Gradient:[]float64{1.7763568394002505e-07}, Hessian:(*mat.SymDense)(0xc42022a000)}, Stats:optimize.Stats{MajorIterations:6, FuncEvaluations:9, GradEvaluations:7, HessEvaluations:7, Runtime:191657}, Status:4}

minimal function value:   29.296
number of parameters: 1
grad=[1.7763568394002505e-07]
hess=[10.123388051986694]
errs= [0.3142947001265352]

par-000: 1.777778e+00 +/- 6.285894e-01

Same result. Yeah!

gonum/optimize doesn’t try to automatically numerically compute the first- and second-derivative of an objective function (MINUIT does.) But using gonum/diff/fd, it’s rather easy to provide it to gonum/optimize.

gonum/optimize.Result only exposes the following informations (through gonum/optimize.Location):

// Location represents a location in the optimization procedure.
type Location struct {
	X        []float64
	F        float64
	Gradient []float64
	Hessian  *mat.SymDense
}

where X is the parameter(s) estimation and F the value of the objective function at X.

So we have to do some additional work to extract the error estimations on the parameters. This is done by inverting the Hessian to get the covariance matrix. The error on the i-th parameter is then:

erri := math.Sqrt(errmat.At(i,i)).

And voila.

Exercize for the reader: build a MINUIT-like interface on top of gonum/optimize that provides all the error analysis for free.

Next time, we’ll analyse a LEP data sample and use gonum/optimize to estimate a physics quantity.

NB: the material and orignal data for this blog post has been extracted from: http://www.desy.de/~rosem/flc_statistics/data/04_parameters_estimation-C.pdf.

04 Oct 2017, 11:17

Introduction to Statistics With Gonum

Starting a bit of a new series (hopefully with more posts than with the interpreter ones) about using Gonum to apply statistics.

This first post is really just a copy-paste of this one:

https://mubaris.com/2017-09-09/introduction-to-statistics-using-numpy

but using Go and Gonum instead of Python and numpy.

Go & Gonum

Gonum is “a set of packages designed to make writing numeric and scientific algorithms productive, performant and scalable.”

Before being able to use Gonum, we need to install Go. We can download and install the Go toolchain for a variety of platforms and operating systems from golang.org/dl.

Once that has been done, installing Gonum and all its dependencies can be done with:

$> go get gonum.org/v1/gonum/...

If you had a previous installation of Gonum, you can re-install it and update it to the latest one like so:

$> go get -u gonum.org/v1/gonum/...

Gonum and statistics

Gonum provides many statistical functions. Let’s use it to calculate the mean, median, standard deviation and variance of a small dataset.

// file: stats.go

package main

import (
	"fmt"
	"math"
	"sort"

	"gonum.org/v1/gonum/stat"
)

func main() {
	xs := []float64{
		32.32, 56.98, 21.52, 44.32,
		55.63, 13.75, 43.47, 43.34,
		12.34,
	}

	fmt.Printf("data: %v\n", xs)

	sort.Float64s(xs)
	fmt.Printf("data: %v (sorted)\n", xs)

	// computes the weighted mean of the dataset.
	// we don't have any weights (ie: all weights are 1)
	// so we just pass a nil slice.
	mean := stat.Mean(xs, nil)

	// computes the median of the dataset.
	// here as well, we pass a nil slice as weights.
	median := stat.Quantile(0.5, stat.Empirical, xs, nil)

	variance := stat.Variance(xs, nil)
	stddev := math.Sqrt(variance)

	fmt.Printf("mean=     %v\n", mean)
	fmt.Printf("median=   %v\n", median)
	fmt.Printf("variance= %v\n", variance)
	fmt.Printf("std-dev=  %v\n", stddev)
}

The program above performs some rather basic statistical operations on our dataset:

$> go run stats.go
data: [32.32 56.98 21.52 44.32 55.63 13.75 43.47 43.34 12.34]
data: [12.34 13.75 21.52 32.32 43.34 43.47 44.32 55.63 56.98] (sorted)
mean=     35.96333333333334
median=   43.34
variance= 285.306875
std-dev=  16.891029423927957

The astute reader will no doubt notice that the variance value displayed here differs from the one obtained with numpy.var:

>>> xs=[32.32, 56.98, 21.52, 44.32, 55.63, 13.75, 43.47, 43.34, 12.34]
>>> xs.sort()
>>> np.mean(xs)
35.963333333333338
>>> np.median(xs)
43.340000000000003
>>> np.var(xs)
253.60611111111109
>>> np.std(xs)
15.925015262507948

This is because numpy.var uses len(xs) as the divisor while gonum/stats uses the unbiased sample variance (ie: the divisor is len(xs)-1):

>>> np.var(xs, ddof=1)
285.30687499999999
>>> np.std(x, ddof=1)
16.891029423927957

With this quite blunt tool, we can analyse some real data from real life. We will use a dataset pertaining to the salary of European developers, all 1147 of them :). We have this dataset in a file named salary.txt.

// file: stats-salary.go

package main

import (
	"bufio"
	"fmt"
	"log"
	"math"
	"os"
	"sort"

	"gonum.org/v1/gonum/stat"
)

func main() {
	f, err := os.Open("salary.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var xs []float64
	scan := bufio.NewScanner(f)
	for scan.Scan() {
		var v float64
		txt := scan.Text()
		_, err = fmt.Sscanf(txt, "%f", &v)
		if err != nil {
			log.Fatalf(
				"could not convert to float64 %q: %v",
				txt, err,
			)
		}
		xs = append(xs, v)
	}

	// make sure scanning the file and extracting values
	// went fine, without any error.
	if err = scan.Err(); err != nil {
		log.Fatalf("error scanning file: %v", err)
	}

	fmt.Printf("data sample size: %v\n", len(xs))

	sort.Float64s(xs)

	mean := stat.Mean(xs, nil)
	median := stat.Quantile(0.5, stat.Empirical, xs, nil)
	variance := stat.Variance(xs, nil)
	stddev := math.Sqrt(variance)

	fmt.Printf("mean=     %v\n", mean)
	fmt.Printf("median=   %v\n", median)
	fmt.Printf("variance= %v\n", variance)
	fmt.Printf("std-dev=  %v\n", stddev)
}

And here is the output:

$> go run ./stats-salary.go
data sample size: 1147
mean=     55894.53879686138
median=   48000
variance= 3.0464263289031615e+09
std-dev=  55194.44110508921

The data file can be obtained from here: salary.txt together with a much more detailed one there: salary.csv.

12 Sep 2016, 14:35

Introduction to the pygo virtual machine

In the last episode, I have showed a rather important limitation of the tiny-interp interpreter:

def cond():
	x = 3
	if x < 5:
		return "yes"
	else:
		return "no"

Control flow and function calls were not handled, as a result tiny-interp could not interpret the above code fragment.

In the following, I’ll ditch tiny-interp and switch to the “real” pygo interpreter.

Real Python bytecode

People having read the AOSA article know that the structure of the bytecode of the tiny-interp interpreter instruction set is in fact very similar to the one of the real python bytecode.

Indeed, if one defines the above cond() function in a python3 prompt and enters:

### bytecode as raw bytes
>>> print(cond.__code__.co_code)
b'd\x01\x00}\x00\x00|\x00\x00d\x02\x00k\x00\x00r\x16\x00d\x03
\x00Sd\x04\x00Sd\x00\x00S'

### bytecode as numbers
>>> print(list(cond.__code__.co_code))
[100, 1, 0, 125, 0, 0, 124, 0, 0, 100, 2, 0, 107,
0, 0, 114, 22, 0, 100, 3, 0, 83, 100, 4, 0, 83,
100, 0, 0, 83]

This doesn’t look very human friendly. Luckily, there is the dis module that can ingest low-level bytecode and prints it in a more human-readable way:

>>> import dis
>>> dis.dis(cond)
  2           0 LOAD_CONST               1 (3)
              3 STORE_FAST               0 (x)

  3           6 LOAD_FAST                0 (x)
              9 LOAD_CONST               2 (5)
             12 COMPARE_OP               0 (<)
             15 POP_JUMP_IF_FALSE       22

  4          18 LOAD_CONST               3 ('yes')
             21 RETURN_VALUE

  6     >>   22 LOAD_CONST               4 ('no')
             25 RETURN_VALUE
             26 LOAD_CONST               0 (None)
             29 RETURN_VALUE

Have a look at the official dis module documentation for more informations. In a nutshell, the LOAD_CONST is the same than our toy OpLoadValue and LOAD_FAST is the same than our toy OpLoadName.

Simply inspecting this little bytecode snippet shows how conditions and branch-y code might be handled. The instruction POP_JUMP_IF_FALSE implements the if x < 5 statement from the cond() function. If the condition is false (i.e.: x is greater or equal than 5), the interpreter is instructed to jump to position 22 in the bytecode stream, i.e. the return "no" body of the false branch. Loops are handled pretty much the same way:

>>> def loop():
...     x = 1
...     while x < 5:
...             x = x + 1
...     return x
... 
>>> dis.dis(loop)
  2           0 LOAD_CONST               1 (1)
              3 STORE_FAST               0 (x)

  3           6 SETUP_LOOP              26 (to 35)
        >>    9 LOAD_FAST                0 (x)
             12 LOAD_CONST               2 (5)
             15 COMPARE_OP               0 (<)
             18 POP_JUMP_IF_FALSE       34

  4          21 LOAD_FAST                0 (x)
             24 LOAD_CONST               1 (1)
             27 BINARY_ADD
             28 STORE_FAST               0 (x)
             31 JUMP_ABSOLUTE            9
        >>   34 POP_BLOCK

  5     >>   35 LOAD_FAST                0 (x)
             38 RETURN_VALUE

The above bytecode dump should be rather self-explanatory. Except perhaps for the RETURN_VALUE instruction: where does the instruction return to?

To answer this, a new concept must be introduced: the Frame.

Frames

As the AOSA article puts it:

A frame is a collection of information[s] and context for a chunk of code.

Whenever a function is called, a new Frame is created, carrying a data stack (the local variables we have played with so far) and a block stack (to handle control flow such as loops and exceptions.)

The RETURN_VALUE instructs the interpreter to pass a value between Frames, from the callee’s data stack back to the caller’s data stack.

I’ll show the pygo implementation of a Frame in a moment.

Pygo components

Still following the blueprints of AOSA and byterun, pygo is built on the following types:

  • a VM (virtual machine) which manages the high-level structures (call stack of frames, mapping of instructions to operations, etc…). The VM is a slightly more complex version of the previous Interpreter type from tiny-interp,

  • a Frame: every Frame value contains a code value and manages some state (such as the global and local namespaces, a pointer to the calling Frame and the last bytecode instruction executed),

  • a Function to model real Python functions: this is to correctly handle the creation and destruction of Frames,

  • a Block to handle Python block management on to which control flow and loops are mapped.

Virtual machine

Each value of a pygo.VM must store the call stack, the Python exception state and the return values as they flow between frames:

type VM struct {
	frames Frames    // call stack of Frames
	fp     *Frame    // pointer to current Frame
	ret    Value     // return value
	exc    Exception // last exception
}

A pygo.VM value can run bytecode with the RunCode method:

func (vm *VM) RunCode(code Code, globals, locals map[string]Value) (Value, error) {
	frame := vm.makeFrame(code, globals, locals, vm.fp)
	return vm.runFrame(frame)
}

09 Sep 2016, 10:37

A tiny python-like interpreter

Last episode saw me slowly building up towards setting the case for a pygo interpreter: a python interpreter in Go.

Still following the Python interpreter written in Python blueprints, let me first do (yet another!) little detour: let me build a tiny (python-like) interpreter.

A Tiny Interpreter

This tiny interpreter will understand three instructions:

  • LOAD_VALUE
  • ADD_TWO_VALUES
  • PRINT_ANSWER

As stated before, my interpreter doesn’t care about lexing, parsing nor compiling. It has just, somehow, got the instructions from somewhere.

So, let’s say I want to interpret:

7 + 5

The instruction set to interpret it would look like:

code := Code{
	Prog: []Instruction{
		OpLoadValue, 0, // load first number
		OpLoadValue, 1, // load second number
		OpAdd,
		OpPrint,
	},
	Numbers: []int{7, 5},
}

var interp Interpreter
interp.Run(code)

The astute reader will probably notice I have slightly departed from AOSA’s python code. In the book, each instruction is actually a 2-tuple (Opcode, Value). Here, an instruction is just a stream of “integers”, being (implicitly) either an Opcode or an operand.


The CPython interpreter is a stack machine. Its instruction set reflects that implementation detail and thus, our tiny interpreter implementation will have to cater for this aspect too:

type Interpreter struct {
	stack stack
}

type stack struct {
	stk []int
}

Now, the interpreter has to actually run the code, iterating over each instructions, pushing/popping values to/from the stack, according to the current instruction. That’s done in the Run(code Code) method:

func (interp *Interpreter) Run(code Code) {
	prog := code.Prog
	for pc := 0; pc < len(prog); pc++ {
		op := prog[pc].(Opcode)
		switch op {
		case OpLoadValue:
			pc++
			val := code.Numbers[prog[pc].(int)]
			interp.stack.push(val)
		case OpAdd:
			lhs := interp.stack.pop()
			rhs := interp.stack.pop()
			sum := lhs + rhs
			interp.stack.push(sum)
		case OpPrint:
			val := interp.stack.pop()
			fmt.Println(val)
		}
	}
}

And, yes, sure enough, running this:

func main() {
	code := Code{
		Prog: []Instruction{
			OpLoadValue, 0, // load first number
			OpLoadValue, 1, // load second number
			OpAdd,
			OpPrint,
		},
		Numbers: []int{7, 5},
	}

	var interp Interpreter
	interp.Run(code)
}

outputs:

$> go run ./cmd/tiny-interpreter/main.go
12

The full code is here: github.com/sbinet/pygo/cmd/tiny-interp.

Variables

The AOSA article sharply notices that, even though this tiny-interp interpreter is quite limited, its overall architecture and modus operandi are quite comparable to how the real python interpreter works.

Save for variables. tiny-interp doesn’t do variables. Let’s fix that.

Consider this code fragment:

a = 1
b = 2
print(a+b)

tiny-interp needs to be modified so that:

  • values can be associated to names (variables), and
  • new Opcodes need to be added to describe these associations.

Under these new considerations, the above code fragment would be compiled down to the following program:

func main() {
	code := Code{
		Prog: []Instruction{
			OpLoadValue, 0,
			OpStoreName, 0,
			OpLoadValue, 1,
			OpStoreName, 1,
			OpLoadName, 0,
			OpLoadName, 1,
			OpAdd,
			OpPrint,
		},
		Numbers: []int{1, 2},
		Names:   []string{"a", "b"},
	}

	interp := New()
	interp.Run(code)
}

The new opcodes OpStoreName and OpLoadName respectively store the current value on the stack with some variable name (the index into the Names slice) and load the value (push it on the stack) associated with the current variable.

The Interpreter now looks like:

type Interpreter struct {
	stack stack
	env   map[string]int
}

where env is the association of variable names with their current value.

The Run method is then modified to handle OpLoadName and OpStoreName:

 func (interp *Interpreter) Run(code Code) {
@@ -63,6 +79,16 @@ func (interp *Interpreter) Run(code Code) {
                case OpPrint:
                        val := interp.stack.pop()
                        fmt.Println(val)
+               case OpLoadName:
+                       pc++
+                       name := code.Names[prog[pc].(int)]
+                       val := interp.env[name]
+                       interp.stack.push(val)
+               case OpStoreName:
+                       pc++
+                       name := code.Names[prog[pc].(int)]
+                       val := interp.stack.pop()
+                       interp.env[name] = val
                }
        }
 }

At this point, tiny-interp correctly handles variables:

$> tiny-interp
3

which is indeed the expected result.

The complete code is here: github.com/sbinet/pygo/cmd/tiny-interp

Control flow && function calls

tiny-interp is already quite great. I think. But there is at least one glaring defect. Consider:

def cond():
	x = 3
	if x < 5:
		return "yes"
	else:
		return "no"

tiny-interp doesn’t handle conditionals. It’s also completely ignorant about loops and can’t actually call (nor define) functions. In a nutshell, there is no control flow in tiny-interp. Yet.

To properly implement function calls, though, tiny-interp will need to grow a new concept: activation records, also known as Frames.

Stay tuned…


Update: correctly associate Frames with function calls. Thanks to /u/munificent.

07 Sep 2016, 10:37

Starting a Go interpreter

In this series of posts, I’ll try to explain how one can write an interpreter in Go and for Go. If, like me, you lack a bit in terms of interpreters know-how, you should be in for a treat.

Introduction

Go is starting to get traction in the science and data science communities. And, why not? Go is fast to compile and run, is statically typed and thus presents a nice “edit/compile/run” development cycle. Moreover, a program written in Go is easily deployable and cross-compilable on a variety of machines and operating systems.

Go is also starting to have the foundation libraries for scientific work:

And the data science community is bootstrapping itself around the gopherds community (slack channel: #data-science).

For data science, a central tool and workflow is the Jupyter and its notebook. The Jupyter notebook provides a nice “REPL”-based workflow and the ability to share algorithms, plots and results. The REPL (Read-Eval-Print-Loop) allows people to engage fast exploratory work of someone’s data, quickly iterating over various algorithms or different ways to interpret data. For this kind of work, an interactive interpreter is paramount.

But Go is compiled and even if the compilation is lightning fast, a true interpreter is needed to integrate well with a REPL-based workflow.

The go-interpreter project (also available on Slack: #go-interpreter) is starting to work on that: implement a Go interpreter, in Go and for Go. The first step is to design a bit this beast: here.

Before going there, let’s do a little detour: writing a (toy) interpreter in Go for Python. Why? you ask… Well, there is a very nice article in the AOSA series: A Python interpreter written in Python. I will use it as a guide to gain a bit of knowledge in writing interpreters.

PyGo: A (toy) Python interpreter

In the following, I’ll show how one can write a toy Python interpreter in Go. But first, let me define exactly what pygo will do. pygo won’t lex, parse nor compile Python code.

No. pygo will take directly the already compiled bytecode, produced with a python3 program, and then interpret the bytecode instructions:

shell> python3 -m compileall -l my-file.py
shell> pygo ./__pycache__/my-file.cpython-35.pyc

pygo will be a simple bytecode interpreter, with a main loop fetching bytecode instructions and then executing them. In pseudo Go code:

func run(instructions []instruction) {
	for _, instruction := range instructions {
		switch inst := instruction.(type) {
			case opADD:
				// perform a+b
			case opPRINT:
				// print values
			// ...
		}
	}
}

pygo will export a few types to implement such an interpreter:

  • a virtual machine pygo.VM that will hold the call stack of frames and manage the execution of instructions inside the context of these frames,
  • a pygo.Frame type to hold informations about the stack (globals, locals, functions’ code, …),
  • a pygo.Block type to handle the control flow (if, else, return, continue, etc…),
  • a pygo.Instruction type to model opcodes (ADD, LOAD_FAST, PRINT, …) and their arguments (if any).

Ok. That’s enough for today. Stay tuned…

In the meantime, I recommend reading the AOSA article.