2015-02-09

Concatenating numbers

A colleague asked how to efficiently concatenate numbers. I was supposed to write a literature review, so I jumped to the opportunity to procrastinate.

The basic idea is simple. Given two numbers a and b (b ≠ 0), e.g. 42 and 3, produce a number whose high digits form a and whose low digits form b, 423 in the example.

a ⧺ b = ab

42 ⧺ 3 = 423

The obvious solution is string concatenation. That is probably going to be slow. Can we do it with math? Yes we can:

a ⧺ b = a * 10^log₁₀(b) + b

As it turns out, exponentation is slow. Another colleague suggested using shifts instead. The first colleague did not have anything against binary, so lets try that. Interpreting a, b and the result in binary (<< is left shift and ^ is xor):

a ⧺ b = a * 2^log₂(b) + b = (a<<log₂(b))^b

For example 10 ⧺ 1010 = 101010. Note that we can not recover the decimal representation from the result: 10₂ = 2₁₀, and 1010₂ = 10₁₀, but

10₂ ⧺ 1010₂ = 101010₂ = 42₁₀ ≠ 210₁₀ = 2₁₀ ⧺ 10₁₀.

Anyways, binary was fine for the use case, shifting should be way faster than exponentiation, and I happened to know that there is an instruction in x86 that implements log₂ (fast, hopefully).

C is good for bit twiddling

Let’s start with the obvious thing, base 10 with exponentation and logarithm.

#include<stdlib.h>
#include<math.h>

#define count 10000000

unsigned numbers[count*2];
unsigned result[count];

int main(void) {
  int i;
  for (i = 0; i < count*2; i++) {
    numbers[i] = rand();
  }
  for (i = 0; i < count; i++) {
    result[i] = numbers[i]*pow(10, log10(numbers[i+1])) + numbers[i+1];
  }
}

What does this do? We allocate two arrays, one with input numbers, one for results. We initialize the input array with random numbers. Then the actually interesting thing happens. We fill the output array with numbers concatenated from two elements of the input numbers array, using the a * 10^log₁₀(b) + b method.

We can compile and run the program like this:

stefan@stefan-work:~/src/concat-numbers$ clang -O3 -lm base10.c
stefan@stefan-work:~/src/concat-numbers$ time ./a.out
real    0m1.034s
user    0m1.020s
sys     0m0.013s

I ran the thing a bunch of times, it takes about one second every time. We can do better base 10. According to the internet, pow and log10 use floating point math, which requires conversion and is slow. On stackoverflow, people suggest to avoid using the floating point math operations and use a loop instead. In concatenate we repeatedly multiply 10 by 10 until we get a one with enough zeroes, then use the same math as before.

#include<stdlib.h>

#define count 10000000

unsigned numbers[count*2];
unsigned result[count];

/* http://stackoverflow.com/a/12700533 */

unsigned concatenate(unsigned a, unsigned b) {
    unsigned pow = 10;
    while(b >= pow)
        pow *= 10;
    return a * pow + b;
}

int main(void) {
  int i;
  for (i = 0; i < count*2; i++) {
    numbers[i] = rand();
  }
  for (i = 0; i < count; i++) {
    result[i] = concatenate(numbers[i], numbers[i+1]);
  }
}

We compile and run again:

stefan@stefan-work:~/src/concat-numbers$ clang -O3 -lm stackoverflow.c
stefan@stefan-work:~/src/concat-numbers$ time ./a.out
real    0m0.275s
user    0m0.253s
sys     0m0.020s

That’s a 4 times improvement over the naive math solution, base 10, no bit manipulation or assembly tricks required. Pretty good.

Let’s do the same thing, but interpret numbers in binary, and use shift and xor like this (a<<log₂(b))^b.

#include<stdlib.h>
#include<stdint.h>

static inline uint32_t log2(const uint32_t x) {
  uint32_t y;
  asm ( "\tbsr %1, %0\n"
      : "=r"(y)
      : "r" (x)
  );
  return y;
}

#define count 10000000

unsigned numbers[count*2];
unsigned result[count];

int main(void) {
  int i;
  for (i = 0; i < count*2; i++) {
    numbers[i] = rand();
  }
  for (i = 0; i < count; i++) {
    result[i] = (numbers[i] << log2(numbers[i+1])) ^ numbers[i+1];
  }
}

BSR is the instruction that will give us the position of the most significant (left-most) set bit, which implements log₂. Our C compiler does not emit that instruction by default (at least I don’t know how to tell it to do so), so we use inline assembly. I do not know how to do these things, I got that from some stackoverflow post that I can’t find anymore.

We can compile and run this as before:

stefan@stefan-work:~/src/concat-numbers$ clang -O3 -lm base2.c
stefan@stefan-work:~/src/concat-numbers$ time ./a.out
real    0m0.185s
user    0m0.153s
sys     0m0.030s

That’s about 5 times faster than the naive base 10 version.

So is Common Lisp

Common Lisp used to be my favourite language and it tends to do well with bittwiddling code, if you know what you are doing, which I don’t. I enjoy reading about this kind of stuff on Paul Khuong’s blog though. So, let’s try it in Lisp.

(defun base2-no-hints (a b)
  (logxor (ash a (integer-length b)) b))

logxor is xor, ash is arithmetic left shift, integer-length is “the number of bits needed to represent [the argument] in binary two’s-complement format”. The code that is generated for that is not very good, because a and b could be anything. Let’s add some type hints:

(defun base2-lg-shift-xor (a b)
  (declare (type (unsigned-byte 32) a b))
  (declare (optimize (safety 0) (speed 3)))
  (the (unsigned-byte 32)
       (logxor (ash a (integer-length b)) b)))

We can disassemble this function using (disassemble #'base2-lg-shift-xor):

; disassembly for BASE2-LG-SHIFT-XOR
; Size: 32 bytes. Origin: #x1004E84F6F
; 6F:       488BCF           MOV RCX, RDI                     ; no-arg-parsing entry point
; 72:       4883C901         OR RCX, 1
; 76:       480FBDC9         BSR RCX, RCX
; 7A:       48D1FA           SAR RDX, 1
; 7D:       48D3E2           SHL RDX, CL
; 80:       48D1FF           SAR RDI, 1
; 83:       4831FA           XOR RDX, RDI
; 86:       48D1E2           SHL RDX, 1
; 89:       488BE5           MOV RSP, RBP
; 8C:       F8               CLC
; 8D:       5D               POP RBP
; 8E:       C3               RET

That’s pretty cool. We see BSR, which implements integer-length followed by a SHL and an XOR. There are some superfluous shifts though. I think those are for dealing with the tag bits that SBCL uses to distinguish boxed from unboxed data. I’m pretty sure Paul Khuong would know how to get rid of them, but this is where my knowledge of SBCL internals ends.

If we disassemble the binary that resulted from our C code, we can see how far we could potentially get.

stefan@stefan-work:~/src/concat-numbers$ clang -O3 -lm base2.c
stefan@stefan-work:~/src/concat-numbers$ gdb -batch -ex 'file a.out' -ex 'disassemble main'
Dump of assembler code for function main:
   0x00000000004005a0 <+0>:     push   %rbx
   0x00000000004005a1 <+1>:     xor    %ebx,%ebx
   0x00000000004005a3 <+3>:     data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
   0x00000000004005b0 <+16>:    callq  0x400490 <rand@plt>
   0x00000000004005b5 <+21>:    mov    %eax,0x2c26420(%rbx)
   0x00000000004005bb <+27>:    add    $0x4,%rbx
   0x00000000004005bf <+31>:    cmp    $0x4c4b400,%rbx
   0x00000000004005c6 <+38>:    jne    0x4005b0 <main+16>
   0x00000000004005c8 <+40>:    mov    0x2825e52(%rip),%edx        # 0x2c26420 <numbers>
   0x00000000004005ce <+46>:    mov    $0x4,%eax
   0x00000000004005d3 <+51>:    data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
   0x00000000004005e0 <+64>:    mov    0x2c26420(%rax),%esi
   0x00000000004005e6 <+70>:    bsr    %esi,%ecx
   0x00000000004005e9 <+73>:    shl    %cl,%edx
   0x00000000004005eb <+75>:    xor    %esi,%edx
   0x00000000004005ed <+77>:    mov    %edx,0x600a1c(%rax)
   0x00000000004005f3 <+83>:    add    $0x4,%rax
   0x00000000004005f7 <+87>:    cmp    $0x2625a04,%rax
   0x00000000004005fd <+93>:    mov    %esi,%edx
   0x00000000004005ff <+95>:    jne    0x4005e0 <main+64>
   0x0000000000400601 <+97>:    xor    %eax,%eax
   0x0000000000400603 <+99>:    pop    %rbx
   0x0000000000400604 <+100>:   retq
End of assembler dump.

This is a lot longer than the Lisp disassembly, because it does a lot more stuff. The interesting bits are from the line that says <+70> where we see our BSR followed immediately by SHL, and XOR, just as expected. The rest is about the two loops and random numbers.

One more note: whether we use + or xor is not that important. In the end, it’s just the difference between XOR and ADD. I would need some actual benchmarking to see a runtime difference, if there is one at all.

Conclusions

Bit fiddling is great for procrastination. My benchmarking methodology sucks. My C and Common Lisp do too.

Don’t use floating point operations to concatenate base 10 numbers, use loops. Concatenating binary numbers can be done in three instructions. If you use C and inline assembly (or possibly some compiler intrinsic). Common Lisp has a standard library function to do it and SBCL’s compiler is clever enough to use the correct instruction if you give it enough type hints (no more than C needs).

In practice, my colleague will not actually use any of it, because, it turns out, he has to recover the components. Of course that is impossible, we can not decide whether 423 resulted from 4 ⧺ 23 or 42 ⧺ 3.

The code is on github. It might be slightly different than the blog post version though. Oh, and the machine I used to run this stuff on:

stefan@stefan-work:~/src/concat-numbers master$ cat /proc/cpuinfo | grep 'model name' | head -n1
model name      : Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz

stefan@stefan-work:~/src/concat-numbers master$ clang --version
clang version 3.5.1 (tags/RELEASE_351/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix