Raspberry Pi vs SPARCstation 20: Fight!

A couple weeks back, I tweeted the following:

Turns out a Raspberry Pi now is about 6 times as fast as a SPARCstation 20 was 20 years ago. And a Pi 2 is more like 15 times as fast.

I was actually a little low, too, since I was going from memory: the real numbers are more like 7 times as fast for the Pi, and 16 to 41 times as fast for the Pi 2.

Here’s how I came up with that.

The BYTE UNIX Benchmark

The standard benchmark for UNIX systems back in the day was the BYTE UNIX Benchmark, a set of benchmarks originally developed at a university and fleshed out substantially by BYTE Magazine so they could evaluate the new servers and workstations that were coming to market.

Even though BYTE itself is no more (RIP), the benchmark lives on: it was picked up on Google Code, where it got additional portability and enhancement work, and these days the most up-to-date version is on GitHub.

What’s useful about this benchmark is that its results are indexed against a reference system, and UNIX itself hasn’t changed all that much, so it’s still moderately useful as a way to compare systems with each other.

I’d recently made an offhand comment that the Raspberry Pi, despite feeling “underpowered” by today’s standards, was actually extremely powerful — and that it put a decent workstation from the mid-1990s to shame, the kind of system we tended to be jealous of as college students.

What was a SPARCstation 20?

Back then, Sun was the biggest UNIX workstation vendor, primarily because both their hardware and baseline operating system were good and they offered a ton of flexibility in their product line.

In 1994, Sun introduced a new lineup of SPARCstation systems with dramatically improved performance over their previous models (the original SPARCstation was so iconic that it defined the “pizza box” form factor for desktop workstations), and the SPARCstation 20 was one of the flagships of that lineup.

Here are some specs for the Sun SPARCstation 20 model 61, which shipped in June 1994:

  • One 60 MHz SuperSPARC CPU
  • 1 MB of cache
  • 32 MB RAM (expandable to 512 MB)
  • 20 MB/second SCSI-2
  • 1152 by 900 8-bit graphics

In 1994, this was quite a substantial system, and it cost $16,195 in its minimum configuration. (That’s $25,580 today!) And if you used one, it felt like it: This thing was wicked fast.

This was also the last system against which the BYTE benchmark was re-indexed: this SPARCstation 20 model 61 is defined to have a score of 10.0.

The Benchmarks

Actually running the benchmarks under Raspbian Jessie on my Raspberry Pi and Raspberry Pi 2 was trivial, literally just a matter of cloning the git repository and running the script.

Here are the results. Note that the Raspberry Pi 2 has two sets of results, because the BYTE UNIX Benchmark runs once to get “single-CPU” performance numbers and another time to get “multi-CPU” numbers. Its single-CPU numbers are really more like “single process” numbers, however, since the other three cores aren’t actually disabled while the benchmark is run.

| System Benchmarks Index Values | SS20-61 Result | SS20-61 Index | RPi Result | RPi Index | RPi2x1 Result | RPi2x1 Index | RPi2x4 Result | RPi2x4 Index |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Dhrystone 2 using register variables | 116700.0 | 10.0 | 1647374.0 | 141.2 | 3000237.2 | 257.1 | 11948737.7 | 1023.9 |
| Double-Precision Whetstone | 55.0 | 10.0 | 239.6 | 43.6 | 435.3 | 79.1 | 1729.8 | 314.5 |
| Execl Throughput | 43.0 | 10.0 | 167.7 | 39.0 | 321.5 | 74.8 | 1210.6 | 281.5 |
| File Copy 1024 bufsize 2000 maxblocks | 3960.0 | 10.0 | 30363.8 | 76.7 | 70026.8 | 176.8 | 110940.6 | 280.2 |
| File Copy 256 bufsize 500 maxblocks | 1655.0 | 10.0 | 9473.6 | 57.2 | 20353.5 | 123.0 | 31384.0 | 189.6 |
| File Copy 4096 bufsize 8000 maxblocks | 5800.0 | 10.0 | 76219.4 | 131.4 | 186926.9 | 322.3 | 296346.9 | 510.9 |
| Pipe Throughput | 12440.0 | 10.0 | 118393.6 | 95.2 | 181562.5 | 146.0 | 713070.2 | 573.2 |
| Pipe-based Context Switching | 4000.0 | 10.0 | 14539.1 | 36.3 | 33809.8 | 84.5 | 126241.1 | 315.6 |
| Process Creation | 126.0 | 10.0 | 434.6 | 34.5 | 1190.8 | 94.5 | 2572.9 | 204.2 |
| Shell Scripts (1 concurrent) | 42.4 | 10.0 | 354.5 | 83.6 | 1087.0 | 256.4 | 2395.0 | 564.9 |
| Shell Scripts (8 concurrent) | 6.0 | 10.0 | 44.9 | 74.8 | 301.0 | 501.7 | 317.0 | 528.3 |
| System Call Overhead | 15000.0 | 10.0 | 276169.1 | 184.1 | 399939.7 | 266.6 | 1545514.4 | 1030.3 |
| System Benchmarks Index Score | | 10.0 | | 71.9 | | 165.6 | | 417.4 |

What does this tell us?

A lot can happen in 20 years. Even when it comes to things like I/O throughput, where the Raspberry Pi really falls down compared to other systems — because it attaches to everything via USB — it’s still way faster than a mid-1990s Sun that we all thought was extremely fast.

In particular, according to the indexes, a Raspberry Pi is about seven times as fast as a baseline SPARCstation 20 model 61 (an overall index score of 71.9 versus the SS20’s 10.0), and it has substantially more RAM and storage, too. The Raspberry Pi 2 is sixteen times as fast at single-threaded tasks (165.6), and on tasks where all four cores can be put to use it is forty-one times as fast (417.4).

Ideally, this would also mean that even a Raspberry Pi Zero should feel exceptionally fast. However, our software’s appetite has grown even faster than our hardware’s capability, and comparing how systems like these actually feel in use demonstrates that well.

What’s next?

Well, I just got a DragonBoard 410c, which is a quad-core 64-bit ARM board using a Qualcomm CPU, and which doesn’t have any of the major design issues of the Raspberry Pi…

Erlang on LLVM? or: Outsource your JIT!

Has anyone been working on using [LLVM][1] to do just-in-time code generation for the [Erlang][2] virtual machine?

Depending on the design and structure of the Erlang virtual machine, it doesn’t seem like it would be all that tough a project. And it could provide a nice performance boost for projects that are starting to use Erlang, like [CouchDB][3] and [ejabberd][4].

For an example of what I’m talking about, there’s a project called [VMKit][5] that has implemented the Java and .NET virtual machines atop LLVM with reasonable performance. Essentially, if you have a virtual machine, rather than skipping either just-in-time or static code generation entirely, or trying to do it all yourself for some specific platform on which you want to run, take a look at what you can do with LLVM and see if you can leverage its code generation instead.
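To make the idea a bit more concrete, here’s a rough sketch, using the LLVM-C API, of what handing code generation to LLVM looks like: you describe the computation as IR, and LLVM’s JIT does the instruction selection, register allocation, and optimization. (This is purely illustrative and has nothing to do with Erlang’s actual internals; the initialization calls have also shifted between LLVM releases, so treat the setup as approximate.)

```c
/* Sketch: build IR for a trivial function and let LLVM's JIT compile and
 * run it, instead of emitting machine code yourself. */
#include <stdio.h>
#include <stdint.h>
#include <llvm-c/Core.h>
#include <llvm-c/Analysis.h>
#include <llvm-c/ExecutionEngine.h>
#include <llvm-c/Target.h>

int main(void)
{
    /* Build a module containing: i32 @add(i32 %a, i32 %b) { ret %a + %b } */
    LLVMModuleRef module = LLVMModuleCreateWithName("demo");
    LLVMTypeRef params[] = { LLVMInt32Type(), LLVMInt32Type() };
    LLVMTypeRef fnType = LLVMFunctionType(LLVMInt32Type(), params, 2, 0);
    LLVMValueRef fn = LLVMAddFunction(module, "add", fnType);

    LLVMBuilderRef builder = LLVMCreateBuilder();
    LLVMPositionBuilderAtEnd(builder, LLVMAppendBasicBlock(fn, "entry"));
    LLVMValueRef sum = LLVMBuildAdd(builder, LLVMGetParam(fn, 0),
                                    LLVMGetParam(fn, 1), "sum");
    LLVMBuildRet(builder, sum);
    LLVMVerifyModule(module, LLVMAbortProcessAction, NULL);

    /* Hand the module to LLVM's JIT; it does all the code generation. */
    LLVMLinkInMCJIT();
    LLVMInitializeNativeTarget();
    LLVMInitializeNativeAsmPrinter();

    char *error = NULL;
    LLVMExecutionEngineRef engine;
    if (LLVMCreateExecutionEngineForModule(&engine, module, &error) != 0) {
        fprintf(stderr, "could not create execution engine: %s\n", error);
        return 1;
    }

    /* Call the freshly JIT-compiled native code. */
    int (*add)(int, int) =
        (int (*)(int, int))(uintptr_t)LLVMGetFunctionAddress(engine, "add");
    printf("add(2, 3) = %d\n", add(2, 3));

    LLVMDisposeBuilder(builder);
    LLVMDisposeExecutionEngine(engine);
    return 0;
}
```

A real Erlang JIT would of course be translating BEAM bytecode into IR rather than hand-building an `add` function, but the division of labor is the same: the VM describes the computation, and LLVM handles turning it into machine code.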

[1]: http://llvm.org/
[2]: http://erlang.org/
[3]: http://couchdb.org/
[4]: http://www.ejabberd.im/
[5]: http://vmkit.llvm.org/

LLVM terminology

I thought the proper terminology was worth pointing out, since I’ve seen — and heard — some misuses lately.

* **[LLVM][1]** is the Low-Level Virtual Machine and the project surrounding it.

* **[LLVM-GCC][2]** is a compiler that uses GCC for its front-end and LLVM for its back-end.

* **[Clang][3]** is the C language family front-end that is part of the LLVM project. It’s a parser, semantic analyzer, and code generator — in other words, a compiler front-end that uses LLVM for its back-end.

* **[The Clang Static Analyzer][4]** is what people have been trying out lately to find subtle bugs in their own and others’ projects. It’s a great tool.

I just thought this was important to mention, because people have been saying “LLVM” when they mean “LLVM-GCC” (the compiler included in Xcode 3.1), and “Clang” when they mean “the Clang Static Analyzer” (the tool they’ve been using to find bugs in their projects).

[1]: http://llvm.org/
[2]: http://llvm.org/docs/CommandGuide/html/llvmgcc.html
[3]: http://clang.llvm.org/
[4]: http://clang.llvm.org/StaticAnalysis.html

peterb hits it out of the park

peterb of Tea Leaves, in “Game Developer To World: Please Revolve Around Me!”, summarizes the position taken by Tim Sweeney of Epic during an interview thusly:

  1. People aren’t buying expensive enough PCs.
  2. Even the expensive PCs aren’t good enough to run his games.
  3. People who buy cheaper machines with Intel integrated graphics are giving their money to Blizzard instead of Epic.
  4. This aggression cannot stand. The solution is that everyone except us should change what they’re doing and buy machines with more expensive graphics hardware.

This problem is endemic in the game industry.

The most recent example: I was going to buy my girlfriend The Sims 2 for Valentine’s Day to play on her MacBook. Oops! Her MacBook has the dread Intel integrated graphics and therefore can’t run it! Or, indeed, any of the other games ported to the Mac using the same technology! Thanks a bunch, it’s not like anybody has a MacBook! (Except, of course, everybody these days.)

But wait, what are the actual system requirements for The Sims 2 on Windows? An 800 MHz CPU and a T&L-capable video card, or a 2 GHz CPU and a non-T&L-capable video card. Her MacBook definitely meets those criteria, and MacBook owners are a huge portion of the Mac customer base for a game like that. I wonder if the PowerPC build would run acceptably under Rosetta: the original Sims ran fine on an iBook DV a half-decade ago, after all, and it’s not like The Sims 2 is new.

I also heard a lot of commentary around the time the iMac G5 debuted about its “terrible” GeForce FX 5200 video chipset. After all, it meant that a lot of games people were working on for then-high-end machines wouldn’t run! Except, uh, why wouldn’t they run? Because developers didn’t actually design for what users were buying! They were designing for some ideal system that very few people had, and beyond that they had the gall to complain that they weren’t selling many games. Hmm. I wonder why. If you limit your market to people with beefy dual-G5 systems with high-end video cards who are allowed to install games on them, maybe that’s not such a big market… On the other hand, if you design for the iMac G4, and the iMac G5 comes out, chances are you’ll be able to sell to a lot more people…

Game developers should be targeting the systems people are using rather than systems with every feature under the sun. No wonder casual games do so well — nobody else is willing to serve that vast majority of the market!

So if you’re writing a game, or thinking about writing a game, or any other performance-sensitive application, look at what the bulk of the users you want to target are currently using and design for that. Neither you nor your users are likely to be disappointed with the results.

Designing for Core Data performance

On the comp.sys.mac.programmer.help newsgroup, Florian Zschocke asked about improving the performance of his Core Data application. Here’s an adapted version of my reply to his post.

Core Data applications should scale quite well to large data sets when using an SQLite persistent store. That said, there are a couple implementation tactics that are critical to performance for pretty much any application using a technology like Core Data:

  1. Maintain a well-normalized data model.
  2. Don’t fetch or keep around more data than you need to.

Implementing these tactics will make it much easier both to create well-performing Core Data applications in the first place and to optimize the performance of applications already in progress.

Maintaining a normalized data model is critical for not fetching more data than you need from a persistent store, because for data consistency Core Data will fetch all of the attributes of an instance at once. For example, consider a Person entity that can have a binary data attribute containing a picture. Even if you’re just displaying a table of Person instances by name, Core Data will still fetch the picture because it’s an attribute of Person. Thus for performance in a situation like this, you’d normalize your data so that you have a separate entity, Picture, to represent the picture for a Person on the other side of a relationship. That way the image data will only be retrieved from the persistent store if the relationship is actually traversed; until it’s traversed, it will just be represented by a fault.
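Here’s a rough sketch of what that buys you. The entity and attribute names are just for illustration, and `context` is assumed to be an existing managed object context: fetching and displaying Person instances never touches the image data until you actually traverse the relationship.

```objc
#import <CoreData/CoreData.h>

// Hypothetical normalized model: Person has a "name" attribute and a to-one
// "picture" relationship to a Picture entity whose "data" attribute holds
// the image bytes.
static void listPeople(NSManagedObjectContext *context)
{
    NSFetchRequest *request = [[[NSFetchRequest alloc] init] autorelease];
    [request setEntity:[NSEntityDescription entityForName:@"Person"
                                   inManagedObjectContext:context]];

    NSError *error = nil;
    NSArray *people = [context executeFetchRequest:request error:&error];

    // Displaying names touches only Person attributes; each person's picture
    // is still represented by a fault, so no image data is read from the store.
    NSEnumerator *enumerator = [people objectEnumerator];
    NSManagedObject *person;
    while ((person = [enumerator nextObject]) != nil) {
        NSLog(@"%@", [person valueForKey:@"name"]);
    }

    // Only traversing the relationship fires the fault and loads the bytes.
    NSData *imageData = [[people lastObject] valueForKeyPath:@"picture.data"];
    NSLog(@"last person's picture is %u bytes", (unsigned)[imageData length]);
}
```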

Similarly, if you have lots of to-many relationships and need to display summary information about them, de-normalizing your data model slightly and caching the summary information in the main entity can help.

For example, say your app works with Authors and Books. Author.books is a to-many relationship to Book instances and Book.authors is a to-many relationship to Author instances. You may want to show a table of Authors that includes the number of Books related to the Author. However, binding to books.@count for that column value will cause the relationship fault to fire for every Author displayed, which can generate a lot more traffic to the persistent store than you want.

One strategy would be to de-normalize your data model slightly so Author also contains a booksCount attribute, and to keep that attribute up to date whenever the Author.books relationship changes. This way you can avoid firing the Author.books relationship fault just because you want to display the number of Books an Author is related to, by binding the column value to booksCount instead of books.@count.
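One way to do that maintenance is sketched below, using the hypothetical Author entity from above; updating the count in custom relationship accessors would work just as well. The idea is to recompute the cached count in -willSave, using primitive accessors so you don’t kick off another round of change processing.

```objc
#import <CoreData/CoreData.h>

// Hypothetical Author subclass with a cached "booksCount" attribute
// alongside the "books" to-many relationship.
@interface Author : NSManagedObject
@end

@implementation Author

- (void)willSave
{
    // Firing the books fault here, at save time, is fine; the point is to
    // avoid firing it just to display the count in a table.
    NSNumber *count =
        [NSNumber numberWithUnsignedInt:[[self valueForKey:@"books"] count]];

    // Only write the value if it actually changed, and use the primitive
    // accessor so this doesn't trigger willSave again.
    if (![count isEqual:[self primitiveValueForKey:@"booksCount"]]) {
        [self setPrimitiveValue:count forKey:@"booksCount"];
    }
    [super willSave];
}

@end
```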

Another thing to be careful of is entity inheritance. It’s an implementation detail, but inheritance in Core Data is single-table. Thus if you have every entity in your application inheriting from one abstract entity, it will all wind up in a single table, potentially increasing the amount of time fetches take because they have to scan more data.

Retaining or copying the arrays containing fetch results will keep those results (and their associated row cache entries) in memory for as long as you retain the arrays or copies of them, because the arrays and any copies will be retaining the result objects from the fetch. And as long as the result objects are in memory, they’ll also be registered with a managed object context.

If you want to prune your in-memory object graph, you can use -[NSManagedObjectContext refreshObject:mergeChanges:] to effectively turn an object back into a fault, which can also prune its relationship faults. A more extreme measure would be to use -[NSManagedObjectContext reset] to return a context to a clean state with no changes or registered objects. Finally, you can of course just ensure that any managed objects that don’t have changes are properly released, following normal Cocoa memory management rules: So long as your managed object context isn’t set to retain registered objects, and you aren’t retaining objects that you’ve fetched, they’ll be released normally like any other autoreleased objects.
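As a small sketch of that pruning approach, assuming `results` is an array of objects you fetched earlier from `context` and no longer need fully materialized:

```objc
#import <CoreData/CoreData.h>

// Turn fetched objects that have no unsaved changes back into faults,
// releasing their property values and any relationship faults they were
// keeping alive.
static void pruneFetchedObjects(NSArray *results, NSManagedObjectContext *context)
{
    NSEnumerator *enumerator = [results objectEnumerator];
    NSManagedObject *object;
    while ((object = [enumerator nextObject]) != nil) {
        if (![object isInserted] && ![object isUpdated] && ![object isDeleted]) {
            [context refreshObject:object mergeChanges:NO];
        }
    }

    // The heavier hammer: discard everything the context knows about.
    // [context reset];
}
```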