the all-thing | 2010-09-09 18:23:33 -0400 ========================================== Copying objects in memory ------------------------- Date: January 15, 2009 3:39pm Author: William Morgan Labels: vm, optimization URL: http://all-thing.net/old11.txt One of the fundamental questions in VM design for OO languages is how you represent your object handles internally. Regardless of what the language itself exposes in terms of fundamental types, boxing, and the like, the VM still has to shuffle objects around on the stack frame, pass them between methods/functions, etc. The traditional way to do this is the "tagged union", where you use a two-element struct consisting a type field and a union of values for each possible type. One of these types is probably an object pointer; the other types let you represent unboxed fundamental types like ints and floats. This is the approach used by Rubinius, by Lua, and probably many more. The Neko VM instead uses 31-bit pointers for everything except ints [1], and fixes the lowest bit as the integer bit. If this bit is on, the value represents a 31-bit integer; if it's off, the value is a pointer. Of course this means that Neko objects can only be at even addresses in memory. (I'm not sure what happens on 64-bit machines; either ints stay at 31 bits or they grow to 63-bit longs; the pointers certainly grow to 64-bit). The result is that Neko object handles are the size of pointers, hence small, but Neko loses the ability to handle unboxed floats. All float operations will require lots of heap allocation and dereferencing. On the other hand, Lua object handles are much larger, but Lua can do float arithmetic on the stack. (The VM stack, not the system stack.) The Neko folks claim that their representation is better, because it's smaller, and faster when you're copying things around. But what value do you really get by sacrificing floats? And what about when we take into account different architectures? Comparing sizes is easy. On a 32-bit machine, Lua objects take up 12 bytes: a double is 8 bytes, and the tag grows the struct to 12. So Lua object handles are three times the size of Neko handles. On a 64-bit machine, Lua objects take up 16 bytes, and Neko objects take up 8. Note that Lua handles are now only twice as big as Neko handles. Comparing speed is a little more interesting. How much is lost, exactly by copying around those extra 8 bytes, for each architecture? I did some simple experiments where I copied objects of various sizes around 10 million times, picking a random start and end point for each within a block of allocated memory on the heap, and measuring how long everything took. On my 32-bit machine, taking 10m random numbers took 2.874 seconds; copying 12-byte objects from one location to the other each time took an additional 91ms. Copying 4-byte objects took only an extra 77ms. That works out to a 15.3% slowdown for Lua. On my 64-bit machine, taking 10m random numbers took 517ms; copying 16-byte objects each time took an additional 1.85 seconds; copying 8-byte objects took an additional 1.81 seconds. That works out to a 2.2% slowdown for tagged unions. So personally, the 32-bit case is _maybe_ arguable, but the 64-bit case doesn't seem that compelling. Copying object handles around is one thing of very many that the VM spends its time on, so the overall slowdown is going to be much less than 2.2%. I don't know that sacrificing float performance, and half of your integer space, is really worth it. If you want to run these experiments for yourself, the code is here [2]. Please let me know if I'm doing something wrong! [1] http://nekovm.org/lua#runtime_representation [2] http://gist.github.com/47645 Replies -------- Daniel Stutzbach, on January 15, 2009 5:23pm: ["| \n", "| I think you mean 31-*bit*.\n", "| \n", "| For integer operations, I wonder how much speed they lose by having to\n", "| bit-shift everything back and forth.\n", "| \n"] _why, on January 15, 2009 7:17pm: ["| Now wait a minute. There are other things to account for. For instance, how\n", "| quick is type checking for the two strategies? How quick is retrieval of\n", "| string pointers? How many steps to mark the GC fields?\n", "| \n", "| I also wonder if there could be some scaling up for Neko's strategy in 64-bit\n", "| mode to allow doubles to be kept as immediate values.\n", "| \n", "| I might be totally off, but I actually think that bignum and float operations\n", "| could be quite fast if I kept open space on the stack for local operations on\n", "| them. You'd have to package them as pointers once you left the routine, but\n", "| they could be unpacked into registers for the duration of a scope. Intense\n", "| math is usually pretty localized anyway.\n", "| \n"] William, on January 15, 2009 7:48pm: ["| Daniel: thanks, fixed.\n", "| \n", "| _why: yes, copying is only one small part of the picture. I don't have a good\n", "| sense of how those other factors measure up, or even whether they're even\n", "| significant compared to things like method dispatch, opcode dispatch, etc. (I\n", "| would love to find out!) All I'm saying is that \"copying is faster\" doesn't do\n", "| it for me.\n", "| \n", "| Doing bignum and float operations in registers is a good idea and I agree with\n", "| your intuition that most intensive math is local, at least in the world where\n", "| + isn't a method dispatch. Maybe even regular old int math could be moved\n", "| there.\n", "| \n", "| One of the things that's got me excited about register VMs is that you can\n", "| play fun games like this.\n", "| \n"] Nicolas Cannasse, on April 2, 2009 8:18pm: ["| \n", "| The main issue with increasing value size is not that much about copying.\n", "| \n", "| The two current bottlenecks in CPU speed are mispredicted jumps and cache\n", "| misses. Of course, the less the data takes memory, the more it can fit into\n", "| CPU cache, and thus reduce cache misses.\n", "| \n", "| Of course, when talking about performances, it all depends what kind of\n", "| application you're talking about. Neko is very good for symbolic processing\n", "| and data manipulation due to its low memory footprint, while Lua will be\n", "| faster for numerical floating point calculus.\n", "| \n", "| Nicolas\n", "| \n"] This delicious text version served up by Whisper .