# Web Appendix for numerical language comparison 2020

Alvaro Aguirre

Jon Danielsson

20 August 2020

Updated 26 August 2020

We compared four numerical languages, Julia, R, Python and Matlab, in a Vox blog post.

The *Execution Speed* section of the blog post includes three experiments to evaluate the speed of the four languages. The versions used are Julia 1.5.0, R 4.0.2, Python 3.7.6 and Matlab R2020a. All code ran on a late-model MacBook Pro with an Intel i9 and 64 GB of RAM.

- GARCH log-likelihood
- Loading large dataset
- Calculating annual mean and sd by year by PERMNO
- Julia improvements
- Python matrix multiplication
- Conclusions

## Experiment 1: GARCH log-likelihood

The first experiment involved comparing the running times for the calculation of GARCH log-likelihood. This is an iterative and dynamic process that captures a large class of numerical problems encountered in practice. Because the values at any given time t depend on values at t-1, this loop is *not vectorisable*. The GARCH(1,1) specification is

$$\sigma_t^2= \omega + \alpha y^2_{t-1}+ \beta \sigma_{t-1}^2$$

with its log-likelihood given by

$$\ell=-\frac{T-1}{2} \log(2\pi)-\frac{1}{2}\sum_{t=2}^T \left[ \log \left(\omega+\alpha y_{t-1}^2+\beta\sigma_{t-1}^2\right)+\frac{y_t^2}{\omega+\alpha y_{t-1}^2+\beta\sigma_{t-1}^2} \right]$$

For the purposes of our comparison, we are only concerned with the iterative calculation involved for the second term. Hence, the constant term on the left is ignored.

We coded up the log-likelihood calculation in the four languages and in C, keeping the code as similar as possible between languages. We included the just-in-time compilers *Numba* for Python, and the C++ integration *Rcpp* for R.
We then simulated a sample of size 10,000, loaded that in, and timed the calculation of the log-likelihood 100 times (wall time), picking the lowest in each case.

The code can be downloaded here.
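The simulation and timing procedure might look like the following Python sketch. The parameter values and the random seed are illustrative assumptions, not the article's; the likelihood function is the base-Python version shown below.

```python
import time
import numpy as np

# Illustrative GARCH(1,1) parameters (assumed; the article does not list them)
omega, alpha, beta = 0.05, 0.1, 0.85
N = 10_000

# Simulate a sample of size N from the GARCH(1,1) recursion
rng = np.random.default_rng(0)
y = np.empty(N)
h = omega / (1 - alpha - beta)          # start at the unconditional variance
for t in range(N):
    y[t] = np.sqrt(h) * rng.standard_normal()
    h = omega + alpha*y[t]**2 + beta*h
y2 = y**2

def likelihood(hh, o, a, b, y2, N):
    lik = 0.0
    h = hh
    for i in range(1, N):
        h = o + a*y2[i-1] + b*h
        lik += np.log(h) + y2[i]/h
    return lik

# Time the calculation 100 times and keep the best wall-clock time
times = []
for _ in range(100):
    t0 = time.perf_counter()
    lik = likelihood(1.0, omega, alpha, beta, y2, N)
    times.append(time.perf_counter() - t0)
print(f"lik = {lik:.2f}, best of 100: {min(times):.4f} s")
```

Taking the minimum of many repetitions, rather than the mean, reduces the influence of background processes on the measurement.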

### C

The fastest calculation is likely to be in C (or FORTRAN):

```
// Compiled with: gcc -march=native -ffast-math -Ofast run.c -lm
#include <math.h>

double likelihood(double o, double a, double b, double h, double *y2, int N){
  double lik = 0;
  for (int j = 1; j < N; j++){
    h = o + a*y2[j-1] + b*h;
    lik += log(h) + y2[j]/h;
  }
  return(lik);
}
```

### R

R can be used independently, which is likely to be slow

```
likelihood = function(o, a, b, y2, h, N){
  lik = 0
  for (i in 2:N){
    h = o + a*y2[i-1] + b*h
    lik = lik + log(h) + y2[i]/h
  }
  return(lik)
}
```

but using *Rcpp* will make it much faster by integrating R with C++ and compiling the likelihood function

```
require(Rcpp)
a = "double likelihood(double o, double a, double b, double h, NumericVector y2, int N){
  double lik = 0;
  for (int j = 1; j < N; j++){
    h = o + a*y2[j-1] + b*h;
    lik += log(h) + y2[j]/h;
  }
  return(lik);
}"
cppFunction(a)
```

### Python

Python is likely to be quite slow,

```
import numpy as np

def likelihood(hh, o, a, b, y2, N):
    lik = 0.0
    h = hh
    for i in range(1, N):
        h = o + a*y2[i-1] + b*h
        lik += np.log(h) + y2[i]/h
    return(lik)
```

but will be significantly sped up by using *Numba*.

```
import numpy as np
from numba import jit

@jit
def likelihood(hh, o, a, b, y2, N):
    lik = 0.0
    h = hh
    for i in range(1, N):
        h = o + a*y2[i-1] + b*h
        lik += np.log(h) + y2[i]/h
    return(lik)
```

### MATLAB

```
function lik = likelihood(o,a,b,h,y2,N)
  lik = 0;
  for i = 2:N
    h = o + a*y2(i-1) + b*h;
    lik = lik + log(h) + y2(i)/h;
  end
end
```

### Julia

Julia was designed for speed so we expected it to be fast:

```
function likelihood(o, a, b, h, y2, N)
    local lik = 0.0
    for i in 2:N
        h = o + a*y2[i-1] + b*h
        lik += log(h) + y2[i]/h
    end
    return(lik)
end
```

### Results

All runtime results are presented relative to the fastest (C).

If we only look at the base versions of the four languages, Julia comes out on top. This is unsurprising since it is a compiled language while the other three are interpreted. Base Python is very slow for this type of calculation, almost three hundred times slower than our fastest alternative, C.

However, when we speed up R and Python using *Rcpp* and *Numba* respectively, their performance increases significantly, beating Julia. In the case of *Rcpp* this involved compiling a C++ function in R, whereas with *Numba* we only have to import the package and apply the JIT decorator without modifying our Python code, which is convenient.

## Experiment 2: Loading a large dataset

Our second experiment consisted of loading a very large CSV dataset, CRSP. The file is almost 8 GB uncompressed, and over 1 GB compressed in gzip format. The code for this experiment and the next one can be downloaded here.

### R

R can read both compressed and uncompressed files using the *fread* function from the *data.table* package:

```
require(data.table)
uncompressed <- fread("crsp_daily.csv")
compressed <- fread("crsp_daily.gz")
```

### Python

Python's *pandas* is a convenient option for importing CSV files into DataFrames. We used it to read the uncompressed file, and in combination with the *gzip* package to read the compressed file:

```
import pandas as pd
import gzip
uncompressed = pd.read_csv("crsp_daily.csv")
f = gzip.open("crsp_daily.gz")
compressed = pd.read_csv(f)
```
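As an aside, pandas can also decompress gzip itself via the `compression` argument of `read_csv` (and will infer it from a `.gz` filename), so the explicit `gzip.open` is optional. A self-contained sketch using an in-memory buffer with made-up data in place of the CRSP file:

```python
import gzip
import io
import pandas as pd

# Write a small gzipped CSV to memory, then let pandas decompress it directly
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"PERMNO,RET\n1,0.01\n2,-0.02\n")
buf.seek(0)

df = pd.read_csv(buf, compression="gzip")
print(df.shape)  # (2, 2)
```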

### MATLAB

To date, MATLAB can only read uncompressed files:

```
uncompressed = readtable('crsp_daily.csv');
```

### Julia

We used Julia's CSV and GZip packages to read both types of files:

```
using CSV
using GZip
uncompressed() = CSV.read("crsp_daily.csv");
compressed() = GZip.open("crsp_daily.gz","r") do io
    CSV.read(io)
end
```

### Results

All runtime results are presented relative to the fastest, which was R's uncompressed reading time.

R's *fread* function was the fastest to load both the uncompressed and compressed file. In all applicable cases, reading the uncompressed file was faster than reading the compressed file, although this difference was much more significant in the case of Julia. Matlab came out at the bottom, with an uncompressed loading time over five times slower than R, and the inability to load the compressed file.

## Experiment 3: Calculating annual mean and standard deviation by year and firm

Our last experiment involved performing group calculations on the large dataset imported in experiment 2. We wanted to evaluate the speed in computing the annual mean and standard deviation by year and firm.

### R

Using R's *data.table* package:

```
require(data.table)
R <- data[,list(length(RET), mean(RET), sd(RET)), keyby = list(y, PERMNO)]
```

### Python

Python's *pandas* allows groupby operations:

```
import pandas as pd
R = data.groupby(['PERMNO', 'year'])['RET'].agg(['mean', 'std', 'count'])
```
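As a quick sanity check of what this produces, here is the same `groupby` on a toy DataFrame (the values are made up, not CRSP data):

```python
import pandas as pd

# Toy stand-in for the CRSP data: two firms (PERMNO) over two years
data = pd.DataFrame({
    "PERMNO": [1, 1, 1, 2, 2, 2],
    "year":   [2019, 2019, 2020, 2019, 2020, 2020],
    "RET":    [0.01, -0.02, 0.03, 0.00, 0.05, -0.01],
})

# Mean, standard deviation and observation count per firm-year
R = data.groupby(["PERMNO", "year"])["RET"].agg(["mean", "std", "count"])
print(R)
```

The result is indexed by the (PERMNO, year) pairs, with one column per statistic.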

### MATLAB

MATLAB can perform group by operations on table objects using the *grpstats* function from the Statistics and Machine Learning Toolbox:

```
statarray = grpstats(data, {'PERMNO', 'year'}, {'mean', 'std'}, 'DataVars', 'RET');
```

### Julia

Julia's DataFrames package allows for this type of computation:

```
using Statistics
using DataFrames
R = by(data, [:year, :PERMNO]) do data
    DataFrame(m = mean(data.RET), s = std(data.RET), c = length(data.RET))
end
```

### Results

All runtime results are presented relative to the fastest (Julia).

Julia, being a compiled language, was the fastest of the four, closely followed by R and Python. Matlab was by far the slowest.

## Julia improvements

After the article was published, Bogumił Kamiński, developer of Julia's DataFrames package, suggested we take a different approach for the Julia calculation by using `combine()` instead. Doing so required a bug fix, implemented in the v0.21.7 release of DataFrames on 25 August 2020.

We were able to do the calculation with:

```
using Statistics
using DataFrames
R = combine(groupby(data, [:year, :PERMNO]),
    :RET => mean => :m, :RET => std => :s, nrow => :c)
```

This change almost halved the processing time, even though Julia was already the fastest language for this task, as shown in the plot below.

## Python matrix multiplication

In the article we mention that if you want to multiply two matrices, X and Y, using Python, you would have to do:

```
import numpy as np
np.matmul(X,Y)
```

From Python 3.5 onwards, this can also be done with the `@` operator:

```
import numpy as np
X @ Y
```
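A quick check that the two spellings agree on a small example:

```python
import numpy as np

X = np.array([[1., 2.], [3., 4.]])
Y = np.array([[5., 6.], [7., 8.]])

# For 2-D arrays, the @ operator is equivalent to np.matmul
print(X @ Y)  # [[19. 22.]
              #  [43. 50.]]
```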

## Conclusion

When comparing the four languages without compiler packages like *Numba* or *Rcpp*, Julia proved itself superior in computations such as the GARCH log-likelihood and the group-by statistics, while R's *fread* function was unparalleled in loading large files.

When we introduce compiler packages, the picture changes.

Python's *Numba* package allows for efficient just-in-time compiling with minimal additional code — implementing *Numba* made the calculation here over 200 times faster. However, *Numba* can only be used in relatively simple cases and cannot be considered a general solution to the slow numerical speed of Python. We could have used *Cython*, but it is much more involved than *Numba*.

By coding the likelihood function in C++ and compiling it with R's *Rcpp*, we obtained the fastest GARCH log-likelihood calculation relative to C. We could, of course, do the same for the other three languages.