numericals | dense-numericals: write fast code fast

Github Watch Star Issue

In case of any inaccuracies, ambiguities or suggestions, please create an issue here.

numericals / dense-numericals brings to you the performance of proto-NumPy with the debugging and development abilities of Common Lisp.

Even before that

Why two names: numericals and dense-numericals?

These are the names of the two main asdf systems provided as a part of this project.

  • numericals is designed to work with cl:array so that interfacing with the rest of the lisp ecosystem is trivial.
  • dense-numericals is designed to work with dense-arrays:array which provides a numpy-like array object for common lisp. dense-arrays itself is fairly extensible, and trivial extensions are provided for static-vectors as well as cl-cuda*.

*Currently dense-numericals only works with cl:vector as the backend storage. Extension to static-vectors or cl-cuda will be provided in the future if at least one user expresses their needs (through an email or an issue).

And while, in theory, this choice can be provided through a configuration variable, but it will only add to the usage overhead.

About common lisp

Common Lisp is great for prototyping:

  • REPL-based programming means that you can write your program one function at a time. Type, structure, function, and class redefinitions are handled more gracefully compared to most other languages.
  • Global variables in common lisp have dynamic scope, while local variables have default lexical scope. This means you can use global variables locally :). See this example.
  • Its condition system builds upon this and goes beyond traditional exception handling, allowing one to provide restarts for resuming a "crashed" program.

But it is also great in the longer run:

  • Optional static typing with implementations like SBCL means you do not have to worry yourself with type declarations during prototyping. But as your code and types stabilize, you can add the type declarations to aid documentation as well as speed and safety, while also inlining the small functions whose call overhead exceeds the work they do.
  • A 1994 ANSI standard coupled with several portability libraries means that it is possible to write code that works without changes even after decades.
  • Common Lisp also allows for the deployment of binaries, and there are tools to distribute the shared libraries that the binary depends upon.

Indeed, there are limitations:

  • If you want to write fast code, then redefinitions, dynamic scoping, and condition systems can get in the way. So, to some extent, it is a tradeoff that you can work around as your program stabilizes.
  • The type system is much limited than ML like systems and does not provide true parametric types. But see this example. Coalton is an effort to work around this. It provides a ML like programming system built on top of Common Lisp.

About numericals

Like Julia, Common Lisp (SBCL) already offers a solution to the purported two-language problem. However, lisp code built using generic functions is not helpful for solving this. specialized-function and static-dispatch help here, but generic functions have other disadvantages:

  • inability to dispatch on specialized array types
  • inability to dispatch on optional or keyword arguments

In addition, while Common Lisp's ANSI standard offers compiler macros that allows one to control what specialized form a particular call site compiles to (see this or this), compiler macros are of limited use without access to the type information of the arguments. Unfortunately, this type information is not available trivially. To access it, one requires a somewhat sophisticated machinery of CLTL2 in the form of cl-environments, cl-form-types, and polymorphic-functions. In particular, the latter overcomes both the disadvantages of generic functions, and allows dispatching over specialized array types, as well as optional or keyword arguments. And coupled with the former two, it is also able to dispatch statically to a reasonable extent if the call site is compiled with (optimize speed) with safety<speed and debug<speed.

Put together, this allows compiling the following code:

(disassemble
 (lambda (x)
   (declare (optimize speed)
            (type (simple-array single-float 1) x))
   (nu:sin! x)))

into code containing no high level function calls but only:

...
; BFA:       4D8B55F0         MOV R10, [R13-16]               ; thread.alien-linkage-table-base
; BFE:       41FF92082E0000   CALL [R10+11784]                ; &BMAS_ssin
; C05:       488BE3           MOV RSP, RBX
...

and the following slight variant (using double-float instead of single-float)

(disassemble
 (lambda (x)
   (declare (optimize speed)
            (type (simple-array double-float 1) x))
   (nu:sin! x)))

into the following:

...
; 672:       4D8B55F0         MOV R10, [R13-16]               ; thread.alien-linkage-table-base
; 676:       41FF92C82E0000   CALL [R10+11976]                ; &BMAS_dsin
; 67D:       488BE3           MOV RSP, RBX
...

What this aggressive inlining means is that the performance is minimally impacted with increasing loop lengths. Below, notice that the overall number of sin calculated are the same in each of the three cases, although the loop lengths are different.

IMPL> (let ((a (nu:rand 100000 :type 'single-float))
            (b (nu:rand 100000 :type 'single-float)))
        (declare (optimize speed)
                 (type (simple-array single-float) a b))
        (time (loop repeat 1000 do
          (nu:sin a :out b :broadcast nil))))
Evaluation took:
  0.252 seconds of real time
  0.252051 seconds of total run time (0.252051 user, 0.000000 system)
  100.00% CPU
  556,048,756 processor cycles
  32,752 bytes consed

NIL
IMPL> (let ((a (nu:rand 1000 :type 'single-float))
            (b (nu:rand 1000 :type 'single-float)))
        (declare (optimize speed)
                 (type (simple-array single-float) a b))
        (time (loop repeat 100000 do
          (nu:sin a :out b :broadcast nil))))
Evaluation took:
  0.256 seconds of real time
  0.258519 seconds of total run time (0.258331 user, 0.000188 system)
  101.17% CPU
  570,228,290 processor cycles
  3,176,944 bytes consed

NIL
IMPL> (let ((a (nu:rand 10 :type 'single-float))
            (b (nu:rand 10 :type 'single-float)))
        (declare (optimize speed)
                 (type (simple-array single-float 1) a b))
        (time (loop repeat 10000000 do
          (nu:sin a :out b :broadcast nil))))
Evaluation took:
  1.820 seconds of real time
  1.822983 seconds of total run time (1.822983 user, 0.000000 system)
  [ Real times consist of 0.040 seconds GC time, and 1.780 seconds non-GC time. ]
  [ Run times consist of 0.038 seconds GC time, and 1.785 seconds non-GC time. ]
  100.16% CPU
  4,022,371,122 processor cycles
  319,987,040 bytes consed

NIL

Without inlining, the performance can be impacted severely. For instance, below, the lack of (optimize speed) declaration prevents inlining. Note that the below can be optimized further by avoiding type checks if a (safety 0) declaration is added.

IMPL> (let ((a (nu:rand 100000 :type 'single-float))
            (b (nu:rand 100000 :type 'single-float)))
        (declare (type (simple-array single-float) a b))
        (time (loop repeat 1000 do
          (nu:sin a :out b :broadcast nil))))
Evaluation took:
  0.251 seconds of real time
  0.251034 seconds of total run time (0.251034 user, 0.000000 system)
  100.00% CPU
  554,269,126 processor cycles
  97,792 bytes consed

NIL
IMPL> (let ((a (nu:rand 1000 :type 'single-float))
            (b (nu:rand 1000 :type 'single-float)))
        (declare (type (simple-array single-float) a b))
        (time (loop repeat 100000 do
          (nu:sin a :out b :broadcast nil))))
Evaluation took:
  0.308 seconds of real time
  0.308766 seconds of total run time (0.308766 user, 0.000000 system)
  [ Run times consist of 0.013 seconds GC time, and 0.296 seconds non-GC time. ]
  100.32% CPU
  679,509,546 processor cycles
  9,583,360 bytes consed

NIL
IMPL> (let ((a (nu:rand 10 :type 'single-float))
            (b (nu:rand 10 :type 'single-float)))
        (declare (type (simple-array single-float 1) a b))
        (time (loop repeat 10000000 do
          (nu:sin a :out b :broadcast nil))))
Evaluation took:
  7.144 seconds of real time
  7.145966 seconds of total run time (7.142114 user, 0.003852 system)
  [ Real times consist of 0.076 seconds GC time, and 7.068 seconds non-GC time. ]
  [ Run times consist of 0.078 seconds GC time, and 7.068 seconds non-GC time. ]
  100.03% CPU
  15,773,699,348 processor cycles
  959,986,352 bytes consed

NIL

This may or may not matter for most use cases. But when it does matter, a naive lisp implementation using generic functions stops being as useful. Then, one is either forced to reimplement the lisp code to suit their needs or move to a different language altogether which handles this.

Thus, numericals and dense-numericals intend to provide a solution to this separation between "write code fast" and "write fast code".

PS: The above BMAS_ssin foreign function is built over SLEEF and uses SIMD to compute the sine. The equivalent SBCL non-SIMD equivalent is about 30 times slower:

IMPL> (let ((a (nu:rand 100000 :type 'single-float :max 1.0))
            (b (nu:rand 100000 :type 'single-float)))
        (declare (type (simple-array single-float 1) a b)
                 (optimize speed))
        (time (loop repeat 1000 do
          (loop for i below 100000 do
            (setf (row-major-aref b i) (sin (row-major-aref a i)))))))
Evaluation took:
  5.576 seconds of real time
  5.579197 seconds of total run time (5.575260 user, 0.003937 system)
  100.05% CPU
  12,320,119,856 processor cycles
  0 bytes consed

NIL

Some overhead of the computation indeed stems from the conversion of single-float to double-float and back again; however, even if we stick to double-float, there is a difference of about 10x. But that's also the point, if an application wants performance over accuracy, they should not be forced to look beyond lisp.

What numericals/dense-numericals do not intend to provide?

numericals / dense-numericals do not intend to offer a one-stop solution to the numerical computation ecosystem in common lisp. Even planning to do so is delusionary and should be considered a severe case of NIH syndrome. To that extent, we widely adopt:

This is further coupled with multithreading using lparallel and the C-library builtin OpenMP.

As further testament to the profoundly found elsewhere attitude adopted here, we have

  • numericals working with cl:array that can be used in conjunction with other libraries
  • two systems numericals/magicl and dense-numericals/magicl are provided that are essentially wrappers around magicl

Thus, we intend to provide facilities for writing fast code fast, but at the same time, we want to promote the use of existing facilities wherever appropriate.

Where next?

Acknowledgements