Concatenating numbers
A colleague asked how to efficiently concatenate numbers. I was supposed to write a literature review, so I jumped to the opportunity to procrastinate.
The basic idea is simple. Given two numbers a and b (b ≠ 0), e.g. 42 and 3, produce a number whose high digits form a and whose low digits form b, 423 in the example.
a ⧺ b = ab
42 ⧺ 3 = 423
The obvious solution is string concatenation. That is probably going to be slow. Can we do it with math? Yes we can:
a ⧺ b = a * 10log10(b) + b
As it turns out, exponentation is slow. Another colleague suggested using shifts instead. The first colleague did not have anything against binary, so lets try that. Interpreting a, b and the result in binary (<< is left shift and ^ is xor):
a ⧺ b = a * 2log2(b) + b = (a<<log2(b))^b
For example 10 ⧺ 1010 = 101010. Note that we can not recover the decimal representation from the result: 102 = 210, and 10102 = 1010, but
102 ⧺ 10102 = 1010102 = 4210 ≠ 21010 = 210 ⧺ 1010.
Anyways, binary was fine for the use case, shifting should be way faster than exponentiation, and I happened to know that there is an instruction in x86 that implements log2 (fast, hopefully).
C is good for bit twiddling
Let’s start with the obvious thing, base 10 with exponentation and logarithm.
#include<stdlib.h>
#include<math.h>
#define count 10000000
unsigned numbers[count*2];
unsigned result[count];
int main(void) {
int i;
for (i = 0; i < count*2; i++) {
[i] = rand();
numbers}
for (i = 0; i < count; i++) {
[i] = numbers[i]*pow(10, log10(numbers[i+1])) + numbers[i+1];
result}
}
What does this do? We allocate two arrays, one with input numbers, one for results. We initialize the input array with random numbers. Then the actually interesting thing happens. We fill the output array with numbers concatenated from two elements of the input numbers array, using the a * 10log10(b) + b method.
We can compile and run the program like this:
stefan@stefan-work:~/src/concat-numbers$ clang -O3 -lm base10.c
stefan@stefan-work:~/src/concat-numbers$ time ./a.out
real 0m1.034s
user 0m1.020s
sys 0m0.013s
I ran the thing a bunch of times, it takes about one second every time. We can do better base 10. According to the internet, pow
and log10
use floating point math, which requires conversion and is slow. On stackoverflow, people suggest to avoid using the floating point math operations and use a loop instead. In concatenate
we repeatedly multiply 10 by 10 until we get a one with enough zeroes, then use the same math as before.
#include<stdlib.h>
#define count 10000000
unsigned numbers[count*2];
unsigned result[count];
/* http://stackoverflow.com/a/12700533 */
unsigned concatenate(unsigned a, unsigned b) {
unsigned pow = 10;
while(b >= pow)
*= 10;
pow return a * pow + b;
}
int main(void) {
int i;
for (i = 0; i < count*2; i++) {
[i] = rand();
numbers}
for (i = 0; i < count; i++) {
[i] = concatenate(numbers[i], numbers[i+1]);
result}
}
We compile and run again:
stefan@stefan-work:~/src/concat-numbers$ clang -O3 -lm stackoverflow.c
stefan@stefan-work:~/src/concat-numbers$ time ./a.out
real 0m0.275s
user 0m0.253s
sys 0m0.020s
That’s a 4 times improvement over the naive math solution, base 10, no bit manipulation or assembly tricks required. Pretty good.
Let’s do the same thing, but interpret numbers in binary, and use shift and xor like this (a<<log2(b))^b.
#include<stdlib.h>
#include<stdint.h>
static inline uint32_t log2(const uint32_t x) {
uint32_t y;
( "\tbsr %1, %0\n"
asm : "=r"(y)
: "r" (x)
);
return y;
}
#define count 10000000
unsigned numbers[count*2];
unsigned result[count];
int main(void) {
int i;
for (i = 0; i < count*2; i++) {
[i] = rand();
numbers}
for (i = 0; i < count; i++) {
[i] = (numbers[i] << log2(numbers[i+1])) ^ numbers[i+1];
result}
}
BSR is the instruction that will give us the position of the most significant (left-most) set bit, which implements log2. Our C compiler does not emit that instruction by default (at least I don’t know how to tell it to do so), so we use inline assembly. I do not know how to do these things, I got that from some stackoverflow post that I can’t find anymore.
We can compile and run this as before:
stefan@stefan-work:~/src/concat-numbers$ clang -O3 -lm base2.c
stefan@stefan-work:~/src/concat-numbers$ time ./a.out
real 0m0.185s
user 0m0.153s
sys 0m0.030s
That’s about 5 times faster than the naive base 10 version.
So is Common Lisp
Common Lisp used to be my favourite language and it tends to do well with bittwiddling code, if you know what you are doing, which I don’t. I enjoy reading about this kind of stuff on Paul Khuong’s blog though. So, let’s try it in Lisp.
defun base2-no-hints (a b)
(logxor (ash a (integer-length b)) b)) (
logxor
is xor, ash
is arithmetic left shift, integer-length
is “the number of bits needed to represent [the argument] in binary two’s-complement format”. The code that is generated for that is not very good, because a
and b
could be anything. Let’s add some type hints:
defun base2-lg-shift-xor (a b)
(declare (type (unsigned-byte 32) a b))
(declare (optimize (safety 0) (speed 3)))
(the (unsigned-byte 32)
(logxor (ash a (integer-length b)) b))) (
We can disassemble this function using (disassemble #'base2-lg-shift-xor)
:
; disassembly for BASE2-LG-SHIFT-XOR
; Size: 32 bytes. Origin: #x1004E84F6F
; 6F: 488BCF MOV RCX, RDI ; no-arg-parsing entry point
; 72: 4883C901 OR RCX, 1
; 76: 480FBDC9 BSR RCX, RCX
; 7A: 48D1FA SAR RDX, 1
; 7D: 48D3E2 SHL RDX, CL
; 80: 48D1FF SAR RDI, 1
; 83: 4831FA XOR RDX, RDI
; 86: 48D1E2 SHL RDX, 1
; 89: 488BE5 MOV RSP, RBP
; 8C: F8 CLC
; 8D: 5D POP RBP
; 8E: C3 RET
That’s pretty cool. We see BSR, which implements integer-length
followed by a SHL and an XOR. There are some superfluous shifts though. I think those are for dealing with the tag bits that SBCL uses to distinguish boxed from unboxed data. I’m pretty sure Paul Khuong would know how to get rid of them, but this is where my knowledge of SBCL internals ends.
If we disassemble the binary that resulted from our C code, we can see how far we could potentially get.
stefan@stefan-work:~/src/concat-numbers$ clang -O3 -lm base2.c
stefan@stefan-work:~/src/concat-numbers$ gdb -batch -ex 'file a.out' -ex 'disassemble main'
Dump of assembler code for function main:
0x00000000004005a0 <+0>: push %rbx
0x00000000004005a1 <+1>: xor %ebx,%ebx
0x00000000004005a3 <+3>: data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
0x00000000004005b0 <+16>: callq 0x400490 <rand@plt>
0x00000000004005b5 <+21>: mov %eax,0x2c26420(%rbx)
0x00000000004005bb <+27>: add $0x4,%rbx
0x00000000004005bf <+31>: cmp $0x4c4b400,%rbx
0x00000000004005c6 <+38>: jne 0x4005b0 <main+16>
0x00000000004005c8 <+40>: mov 0x2825e52(%rip),%edx # 0x2c26420 <numbers>
0x00000000004005ce <+46>: mov $0x4,%eax
0x00000000004005d3 <+51>: data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
0x00000000004005e0 <+64>: mov 0x2c26420(%rax),%esi
0x00000000004005e6 <+70>: bsr %esi,%ecx
0x00000000004005e9 <+73>: shl %cl,%edx
0x00000000004005eb <+75>: xor %esi,%edx
0x00000000004005ed <+77>: mov %edx,0x600a1c(%rax)
0x00000000004005f3 <+83>: add $0x4,%rax
0x00000000004005f7 <+87>: cmp $0x2625a04,%rax
0x00000000004005fd <+93>: mov %esi,%edx
0x00000000004005ff <+95>: jne 0x4005e0 <main+64>
0x0000000000400601 <+97>: xor %eax,%eax
0x0000000000400603 <+99>: pop %rbx
0x0000000000400604 <+100>: retq
End of assembler dump.
This is a lot longer than the Lisp disassembly, because it does a lot more stuff. The interesting bits are from the line that says <+70>
where we see our BSR followed immediately by SHL, and XOR, just as expected. The rest is about the two loops and random numbers.
One more note: whether we use + or xor is not that important. In the end, it’s just the difference between XOR and ADD. I would need some actual benchmarking to see a runtime difference, if there is one at all.
Conclusions
Bit fiddling is great for procrastination. My benchmarking methodology sucks. My C and Common Lisp do too.
Don’t use floating point operations to concatenate base 10 numbers, use loops. Concatenating binary numbers can be done in three instructions. If you use C and inline assembly (or possibly some compiler intrinsic). Common Lisp has a standard library function to do it and SBCL’s compiler is clever enough to use the correct instruction if you give it enough type hints (no more than C needs).
In practice, my colleague will not actually use any of it, because, it turns out, he has to recover the components. Of course that is impossible, we can not decide whether 423 resulted from 4 ⧺ 23 or 42 ⧺ 3.
The code is on github. It might be slightly different than the blog post version though. Oh, and the machine I used to run this stuff on:
stefan@stefan-work:~/src/concat-numbers master$ cat /proc/cpuinfo | grep 'model name' | head -n1
model name : Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
stefan@stefan-work:~/src/concat-numbers master$ clang --version
clang version 3.5.1 (tags/RELEASE_351/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix