31 Jul 2018, 10:22

Go-HEP Manifesto

Hello again.

I am starting today an article for arXiv about Go and Go-HEP. I thought structuring my thoughts a bit (in the form of a blog post) would help smooth the writing process.

(HEP) Software is painful

In my introduction talk(s) about Go and Go-HEP, such as here, I usually talk about software being painful. HENP software is no exception. It is painful.

As a C++/Python developer and former software architect of one of the four LHC experiments, I can tell you from vivid experience that software is painful to develop. One has to tame deep and complex software stacks with huge dependency lists. Each dependency comes with its own way to be configured, built and installed. Each dependency comes with its own dependencies. When you start working with one of these software stacks, installing them on your own machine is no walk in the park, even for experienced developers. These software stacks are real snowflakes: they need their unique cocktail of dependencies, with the right version, compiler toolchain and OS, tightly integrated on usually a single development platform.

Granted, the de facto standardization on CMake and docker did help with some of these aspects, allowing projects to cleanly encapsulate the list of dependencies in a reproducible way, in a container. Alas, this renders code easier to deploy but less portable: everything is linux/amd64 plus some arbitrary Linux distribution.

In HENP, with C++ being now the lingua franca for everything related to frameworks or infrastructure, we get unwieldy compilation times and thus a very unpleasant edit-compile-run development cycle. Because C++ is a very complex language to learn, read and write - each new revision more complex than the previous one - it is becoming harder to bring new people on board with existing C++ projects that have accumulated a lot of technical debt over the years: there are many layers of accumulated cruft, different styles, different ways to do things, etc…

Also, HENP projects heavily rely on shared libraries: not because of security, not because they are faster at runtime (they are not), but because, as C++ is so slow to compile, it is more convenient not to recompile everything into a static binary. And thus, we have to devise sophisticated deployment scenarios to deal with all these shared libraries, properly configuring $LD_LIBRARY_PATH, $DYLD_LIBRARY_PATH or -rpath, adding yet another moving piece in the machinery. We did not have to do that in the FORTRAN days: we were building static binaries.

From a user perspective, HENP software is also - even more so - painful. One needs to deal with:

  • overly complicated Object Oriented systems,
  • overly complicated inheritance hierarchies,
  • overly complicated meta-template programming,

and, of course, dependencies. It’s 2018 and there is still no simple way to handle dependencies, nor a standard one that works across operating systems, experiments or analysis groups, when one lives in a C++ world. Finally, there is no standard way to retrieve documentation - and here we are just talking about APIs - nor a system that works across projects and across dependencies.

All of these issues might explain why many physicists are migrating to Python. The ecosystem is much more integrated and standardized with regard to installation procedures, serving, fetching and describing dependencies and documentation tools. Python is also simpler to learn, teach, write and read than C++. But it is also slower.

Most physicists and analysts are willing to pay that price, trading reduced runtime efficiency for a wealth of scientific, turn-key pure-Python tools and libraries. Other physicists strike a different compromise: they trade the relatively seamless installation procedures of pure-Python software for some runtime efficiency by wrapping C/C++ libraries.

To summarize, Python and C++ are no panacea when you take into account the vast diversity of programming skills in HENP, the distributed nature of scientific code development in HENP, the many different team sizes and the constraints coming from the development of scientific analyses (agility, fast edit-compile-run cycles, reproducibility, deployment, portability, …). To add insult to injury, these languages are rather ill-equipped to cope with distributed programming and parallel programming: either because of a technical limitation (CPython’s Global Interpreter Lock) or because the current toolbox is too low-level or error-prone.

Are we really left with either:

  • a language that is relatively fast to develop with, but slow at runtime, or
  • a language that is painful to develop with, but fast at runtime?

nogo

Mending software with Go

Of course, I think Go can greatly help with the general situation of software in HENP. It is not a magic wand: you still have to think and put in the work. But it is a definite, positive improvement.

go-logo

Go was created to tackle the challenges that C++ and Python couldn’t overcome. Go was designed for “programming in the large”. Go was designed to thrive at scale: software development at Google scale, but also at the 2-3 people scale.

But, most importantly, Go wasn’t designed to be a good programming language, it was designed for software engineering:

  Software engineering is what happens to programming 
  when you add time and other programmers.

Go is a simple language - not a simplistic language - so one can easily learn most of it in a couple of days and be proficient with it in a few weeks.

Go has builtin tools for concurrency (the famed goroutines and channels) and that is what made me try it initially. But I stayed with Go for everything else, i.e. the tooling that enables:

  • code refactoring with gorename and eg,
  • code maintenance with goimports, gofmt and go fix,
  • code discoverability and completion with gocode,
  • local documentation (go doc) and across projects (godoc.org),
  • integrated, simple build system (go build) that handles dependencies (go get), without messing around with CMakeLists.txt, Makefile, setup.py or pom.xml build files: all the needed information is in the source files,
  • easiest cross-compiling toolchain to date.

And all these tools are usable from every single editor or IDE.

Go compiles optimized code really quickly. So much so that the go run foo.go command, which compiles a complete program and executes it on the fly, feels like running python foo.py - but with builtin concurrency and better runtime performance (CPU and memory). Go produces static binaries that usually do not even require libc. One can take a binary compiled for linux/amd64, copy it on a CentOS-7 machine or on a Debian-8 one, and it will happily perform the requested task.

As a Gedankenexperiment, take a standard centos7 docker image from Docker Hub and imagine having to build your entire experiment software stack, from the exact gcc version down to the last wagon of your train analysis.

  • How much time would it take?
  • How much effort of tracking dependencies and ensuring internal consistency would it take?
  • How much effort would it take to deploy the binary results on another machine? On another non-Linux machine?

Now consider this script:

#!/bin/bash

yum install -y git mercurial curl

mkdir /build
cd /build

## install the Go toolchain
curl -O -L https://golang.org/dl/go1.10.3.linux-amd64.tar.gz
tar zxf go1.10.3.linux-amd64.tar.gz
export GOROOT=`pwd`/go
export GOPATH=/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH

## install Go-HEP and its dependencies
go get -v go-hep.org/x/hep/...

Running this script inside said container yields:

$> time ./install.sh
[...]
go-hep.org/x/hep/xrootd/cmd/xrd-ls
go-hep.org/x/hep/xrootd/server
go-hep.org/x/hep/xrootd/cmd/xrd-srv

real  2m30.389s
user  1m09.034s
sys   0m14.015s

In less than 3 minutes, we have built a container with (almost) all the tools to perform a HENP analysis. The bulk of these 3 minutes is spent cloning repositories.

Building root-dump, a program to display the contents of a ROOT file, for, say, Windows can easily be performed in one single command:

$> GOOS=windows \
   go build go-hep.org/x/hep/rootio/cmd/root-dump
$> file root-dump.exe 
root-dump.exe: PE32+ executable (console) x86-64 (stripped to external PDB), for MS Windows

## now, for windows-32b
$> GOARCH=386 GOOS=windows \
   go build go-hep.org/x/hep/rootio/cmd/root-dump
$> file root-dump.exe 
root-dump.exe: PE32 executable (console) Intel 80386 (stripped to external PDB), for MS Windows

Fun fact: Go-HEP was supporting Windows users wanting to read ROOT-6 files before ROOT itself (ROOT-6 support for Windows landed with 6.14/00.)

Go & Science

Most of the needed scientific tools are available in Go at gonum.org:

  • plots,
  • network graphs,
  • integration,
  • statistical analysis,
  • linear algebra,
  • optimization,
  • numerical differentiation,
  • probability functions (univariate and multivariate),
  • discrete Fourier transforms.

Gonum is almost at feature parity with the numpy/scipy stack. Gonum is still missing some tools, such as ODE solvers or more interpolation routines, but the gap is closing.

Right now, in a HENP context, it is not possible to perform an analysis in Go and insert it in an already existing C++/Python pipeline. At least not easily: while reading is possible, Go-HEP is still missing the ability to write ROOT files. This restriction should be lifted before the end of 2018.

That said, Go can already be quite useful and usable, now, in science and HENP, for data acquisition, monitoring, cloud computing, control frameworks and some physics analyses. Indeed, Go-HEP provides HEP-oriented tools such as histograms and n-tuples, Lorentz vectors, fitting, interoperability with HepMC and other Monte-Carlo programs (HepPDT, LHEF, SLHA), a toolkit for a fast detector simulation à la Delphes and libraries to interact with ROOT and XRootD.

I think building the missing scientific libraries in Go is a better investment than trying to fix the C++/Python languages and ecosystems.

Go is a better trade-off for software engineering and for science:

with-go


PS: There’s a nice discussion about this post on the Go-HEP forum.

11 Oct 2017, 16:20

Simple Monte Carlo with Gonum and Go-HEP

Today, we’ll investigate the Monte Carlo method. Wikipedia, the ultimate source of truth in the (known) universe has this to say about Monte Carlo:

Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. (…) Monte Carlo methods are mainly used in three distinct problem classes: optimization, numerical integration, and generating draws from a probability distribution.

In other words, the Monte Carlo method is a numerical technique using random numbers:

  • Monte Carlo integration to estimate the value of an integral:
    • take the function value at random points
    • the area (or volume) times the average function value estimates the integral
  • Monte Carlo simulation to predict an expected measurement.
    • an experimental measurement is split into a sequence of random processes
    • use random numbers to decide which processes happen
    • tabulate the values to estimate the expected probability density function (PDF) for the experiment.

Before being able to write a High Energy Physics detector simulation (like Geant4, Delphes or fads), we need to know how to generate random numbers, in Go.

Generating random numbers

The Go standard library provides the building blocks for implementing Monte Carlo techniques, via the math/rand package.

math/rand exposes the rand.Rand type, a source of (pseudo) random numbers. With rand.Rand, one can:

  • generate random numbers following a flat, uniform distribution in [0, 1) with Float32() or Float64();
  • generate random numbers following a standard normal distribution (of mean 0 and standard deviation 1) with NormFloat64();
  • and generate random numbers following an exponential distribution with ExpFloat64().

If you need other distributions, have a look at Gonum’s gonum/stat/distuv.

math/rand exposes convenience functions (Float32, Float64, ExpFloat64, …) that share a global rand.Rand value, the “default” source of (pseudo) random numbers. These convenience functions are safe to be used from multiple goroutines concurrently, but this may generate lock contention. It’s probably a good idea in your libraries to not rely on these convenience functions and instead provide a way to use local rand.Rand values, especially if you want to be able to change the seed of these rand.Rand values.
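A minimal sketch of that advice - the Sampler type below is hypothetical, not from any Go-HEP package - gives each value its own local *rand.Rand, so callers control the seed and avoid contention on the global source:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Sampler draws values from its own, local source of random numbers,
// instead of relying on math/rand's shared global source.
type Sampler struct {
	rnd *rand.Rand
}

// NewSampler creates a Sampler seeded with the given seed.
func NewSampler(seed int64) *Sampler {
	return &Sampler{rnd: rand.New(rand.NewSource(seed))}
}

// Uniform returns a random number in [0, 1).
func (s *Sampler) Uniform() float64 { return s.rnd.Float64() }

func main() {
	// two samplers with the same seed yield the same sequence:
	// reproducibility comes for free.
	s1 := NewSampler(12345)
	s2 := NewSampler(12345)
	fmt.Println(s1.Uniform() == s2.Uniform()) // prints true
}
```

Note that an individual rand.Rand value is not safe for concurrent use: with this design, each goroutine simply gets its own Sampler.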

Let’s see how we can generate random numbers with "math/rand":

func main() {
	const seed = 12345
	src := rand.NewSource(seed)
	rnd := rand.New(src)

	const N = 10
	for i := 0; i < N; i++ {
		r := rnd.Float64() // r is in [0.0, 1.0)
		fmt.Printf("%v\n", r)
	}
}

Running this program gives:

$> go run ./mc-0.go
0.8487305991992138
0.6451080292174168
0.7382079884862905
0.31522206779732853
0.057001989921077224
0.9672449323010088
0.6139541710075446
0.01505990819189991
0.13361969083044145
0.5118319569473198

OK. Does this seem flat to you? Not sure…

Let’s modify our program to better visualize the random data. We’ll use a histogram and the go-hep.org/x/hep/hbook and go-hep.org/x/hep/hplot packages to (respectively) create histograms and display them.

Note: hplot is a package built on top of the gonum.org/v1/plot package, but with a few HEP-oriented customizations. You can use gonum.org/v1/plot directly if you so choose or prefer.

func main() {
	const seed = 12345
	src := rand.NewSource(seed)
	rnd := rand.New(src)

	const N = 10000

	huni := hbook.NewH1D(100, 0, 1.0)

	for i := 0; i < N; i++ {
		r := rnd.Float64() // r is in [0.0, 1.0)
		huni.Fill(r, 1)
	}

	plot(huni, "uniform.png")
}

We’ve increased the number of random numbers to generate to get a better idea of how the random number generator behaves, and disabled the printing of the values.

We first create a 1-dimensional histogram huni with 100 bins from 0 to 1. Then we fill it with the value r and an associated weight (here, the weight is just 1.)

Finally, we just plot (or rather, save) the histogram into the file "uniform.png" with the plot(...) function:

func plot(h *hbook.H1D, fname string) {
	p := hplot.New()
	hh := hplot.NewH1D(h)
	hh.Color = color.NRGBA{0, 0, 255, 255}
	p.Add(hh, hplot.NewGrid())

	err := p.Save(10*vg.Centimeter, -1, fname)
	if err != nil {
		log.Fatal(err)
	}
}

Running the code creates a uniform.png file:

$> go run ./mc-1.go

plot-uniform

Indeed, that looks rather flat.

So far, so good. Let’s add a new distribution: the standard normal distribution.

func main() {
	const seed = 12345
	src := rand.NewSource(seed)
	rnd := rand.New(src)

	const N = 10000

	huni := hbook.NewH1D(100, 0, 1.0)
	hgauss := hbook.NewH1D(100, -5, 5)

	for i := 0; i < N; i++ {
		r := rnd.Float64() // r is in [0.0, 1.0)
		huni.Fill(r, 1)

		g := rnd.NormFloat64()
		hgauss.Fill(g, 1)
	}

	plot(huni, "uniform.png")
	plot(hgauss, "norm.png")
}

Running the code creates the following new plot:

$> go run ./mc-2.go

plot-norm

Note that this has slightly changed the previous "uniform.png" plot: we are sharing the source of random numbers between the 2 histograms. The sequence of random numbers is exactly the same as before (modulo the fact that we now generate - at least - twice as many as previously) but they are not associated with the same histograms.

OK, this does generate a Gaussian. But what if we want to generate a Gaussian with a mean other than 0 and/or a standard deviation other than 1?

The math/rand.NormFloat64 documentation kindly tells us how to achieve this:

“To produce a different normal distribution, callers can adjust the output using: sample = NormFloat64() * desiredStdDev + desiredMean”

Let’s try to generate a Gaussian of mean 10 and standard deviation 5. We’ll have to change the definition of our histogram a bit:

func main() {
	const seed = 12345
	src := rand.NewSource(seed)
	rnd := rand.New(src)

	const (
		N      = 10000
		mean   = 10.0
		stddev = 5.0
	)

	huni := hbook.NewH1D(100, 0, 1.0)
	hgauss := hbook.NewH1D(100, -10, 30)

	for i := 0; i < N; i++ {
		r := rnd.Float64() // r is in [0.0, 1.0)
		huni.Fill(r, 1)

		g := mean + stddev*rnd.NormFloat64()
		hgauss.Fill(g, 1)
	}

	plot(huni, "uniform.png")
	plot(hgauss, "gauss.png")

	fmt.Printf("gauss: mean=    %v\n", hgauss.XMean())
	fmt.Printf("gauss: std-dev= %v +/- %v\n", hgauss.XStdDev(), hgauss.XStdErr())
}

Running the program gives:

$> go run mc-3.go
gauss: mean=    10.105225624460644
gauss: std-dev= 5.048629091912316 +/- 0.05048629091912316

plot-gauss

OK, enough for today. Next time, we’ll play a bit with math.Pi and Monte Carlo.

Note: all the code is go get-able via:

$> go get github.com/sbinet/blog/static/code/2017-10-11/...