Medium data and small data

March 04, 2025

Two related threads, both views on limits to the useful size of data.

Medium data

I have recently been working with WebAssembly and Win32, which both use 32-bit pointers and are limited to 4gb of memory. Meanwhile, modern computers usually use 64-bit pointers, which let you address more than 4gb of memory. (They also have other important uses, including pointer tricks around memory mapping like ASLR.)

But these larger pointers also cost memory. Fancy VMs like v8 employ "pointer compression" to try to reduce the memory cost. As always, you end up trading off CPU and memory. Is Memory64 actually worth using? talks about this tradeoff in the context of WebAssembly's relatively recent extension to support 64-bit pointers. The costs of the i386 to x86-64 upgrade dives into this in the context of x86, where it's difficult to disentangle the separate instruction set changes that accompanied x86-64 from the increased pointer size.
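To make the tradeoff concrete, here's a minimal sketch of the core idea behind pointer compression, in C rather than anything v8 actually does: store 32-bit offsets from a single heap base instead of full 64-bit pointers, halving the cost of each pointer field at the price of a little arithmetic on every access.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch only: store 32-bit offsets from a heap base rather than full
 * 64-bit pointers. Real pointer compression (as in v8) adds tagging,
 * dedicated base registers, etc.; this just shows the size/CPU trade. */

static char *heap_base;            /* base of a <=4gb arena */

typedef uint32_t compressed_ptr;   /* byte offset from heap_base */

static compressed_ptr compress(void *p) {
    return (compressed_ptr)((char *)p - heap_base);
}

static void *decompress(compressed_ptr c) {
    return heap_base + c;          /* one extra add per dereference */
}

struct node {
    compressed_ptr next;           /* 4 bytes instead of 8 */
    uint32_t value;
};                                 /* 8 bytes total instead of 16 */

int main(void) {
    heap_base = malloc(1 << 20);   /* stand-in for the whole <4gb heap */
    struct node *a = (struct node *)heap_base;
    struct node *b = a + 1;
    *a = (struct node){ .next = compress(b), .value = 1 };
    *b = (struct node){ .next = 0, .value = 2 };

    struct node *n = decompress(a->next);   /* follow a compressed pointer */
    printf("sizeof(struct node) = %zu, a->next->value = %u\n",
           sizeof *a, n->value);
    free(heap_base);
    return 0;
}
```

Real VMs layer much more on top of this, but the memory-vs-CPU trade is already visible here: every pointer field shrinks by half, and every dereference gains an add.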

I provide these links as background for an interesting observation I heard about why a 4gb limit turns out to be "big enough" much of the time: to the extent a program needs to work with data sets larger than 4gb, the data is large enough that you end up working with it differently anyway.

In other words, the simple small data that programs typically work with, like command-line flags or configuration files, comfortably fits within a 4gb limit. In contrast, the kind of bigger data that crosses the 4gb limit will fundamentally have different access patterns, typically smaller views onto the larger space, because it is too large to traverse quickly.

Imagine a program that generates 4gb of data or loads it from disk or the network. Even if the program grinds through data at a hundred megabytes per second, it still takes over 30 seconds to work through 4gb. That is probably enough time to require a different user interface, such as a progress bar. Such a program will likely work with the data in a streaming fashion, paging smaller blocks in and out via some file API that manages the >4gb whole.
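Here's a minimal sketch of that streaming pattern, using plain stdio rather than any particular file API; the 1mb chunk size and the progress callback are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the streaming pattern: process input of arbitrary size
 * through a fixed 1mb buffer, so resident memory stays small no matter
 * how big the data is. The progress callback is hypothetical, standing
 * in for whatever UI the long load time demands (e.g. a progress bar). */

enum { CHUNK = 1 << 20 };                    /* 1mb working window */

static int64_t stream_sum(FILE *f, void (*progress)(int64_t bytes_done)) {
    static unsigned char buf[CHUNK];
    int64_t total = 0, done = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < n; i++)
            total += buf[i];                 /* stand-in for real work */
        done += (int64_t)n;
        if (progress)
            progress(done);
    }
    return total;
}

static void show_progress(int64_t done) {
    fprintf(stderr, "\r%lld mb processed", (long long)(done >> 20));
}

int main(void) {
    int64_t sum = stream_sum(stdin, show_progress);
    fprintf(stderr, "\nchecksum: %lld\n", (long long)sum);
    return 0;
}
```

However large the input grows, resident memory stays at one buffer's worth; only the elapsed time (and the progress bar) scales with the data.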

A standard example of an application that makes use of a ton of memory is a database. But a database is expressly designed around managing data at that larger size, using indexing data structures so that a given query knows exactly which subsets of the larger data to access. Otherwise, large queries like table scans work with the data in a streaming fashion, as in the previous paragraph.
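As a toy illustration of the indexing idea (the record layout and names here are invented, not any real database's): a small sorted index of (key, file offset) pairs lets a point query seek directly to a single record instead of scanning a table that may be far larger than 4gb.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy index: a sorted array of (key, byte offset) pairs kept in memory,
 * while the records themselves live in a file that could be far larger
 * than the address space. The 64-byte record layout is made up. */

struct index_entry {
    uint64_t key;
    uint64_t offset;                    /* where the record starts on disk */
};

struct record {
    uint64_t key;
    char payload[56];
};

static int cmp_entry(const void *a, const void *b) {
    const struct index_entry *x = a, *y = b;
    return (x->key > y->key) - (x->key < y->key);
}

/* Binary-search the index, then read exactly one record from the file. */
static int lookup(FILE *data, const struct index_entry *idx, size_t n,
                  uint64_t key, struct record *out) {
    struct index_entry probe = { key, 0 };
    const struct index_entry *hit =
        bsearch(&probe, idx, n, sizeof *idx, cmp_entry);
    if (!hit)
        return 0;
    fseek(data, (long)hit->offset, SEEK_SET);
    return fread(out, sizeof *out, 1, data) == 1;
}

int main(void) {
    FILE *data = tmpfile();             /* stand-in for a huge table file */
    struct index_entry idx[3];
    for (uint64_t i = 0; i < 3; i++) {
        struct record r = { .key = i * 10 };
        snprintf(r.payload, sizeof r.payload, "row %llu", (unsigned long long)i);
        idx[i] = (struct index_entry){ .key = r.key,
                                       .offset = (uint64_t)ftell(data) };
        fwrite(&r, sizeof r, 1, data);
    }
    qsort(idx, 3, sizeof idx[0], cmp_entry);   /* keep the index sorted */

    struct record found;
    if (lookup(data, idx, 3, 20, &found))
        printf("key 20 -> %s\n", found.payload);
    fclose(data);
    return 0;
}
```

The index itself is tiny compared to the data it describes, so it fits comfortably in a 32-bit address space even when the table does not.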

Another canonical "needs a lot of memory" application is an image editor, which operates on a lot of pixels. I worked on one of those! To make the software grind through pixels fast, you go to some effort to avoid traversing all your pixels individually. To get the pixels onto the screen quickly, you instead load data piecewise into the GPU and let the GPU handle the rest. Or you write specialized "shader" GPU code. Both are again specialized APIs that work with <4gb subsets of the data. AI applications use GPUs in a similar way for the same reason.

How about audio and video? These involve large amounts of data, but also have a time dimension, where most operations only work with a portion of the data near a particular timestamp.

All in all, a 32-bit address space for the program's normal data structures, coupled with some alternative mechanism for poking at the larger data indirectly, has surprisingly ended up working out almost as well as a larger flat address space.

A lack of imagination

It's interesting to contrast this observation with a joke from the DOS era: "640kb ought to be enough for anybody". 640kb was a memory limit back then, and the phrase was thrown around ironically to remark that reasonable programs actually needed more. According to Gates (to whom the quote was apocryphally, falsely attributed), 640kb was already understood to be a painful limit at the time.

In contrast, we've been easily fitting most programs in 4gb for the last 30 years — from the 1990s through today, where browsers still limit web pages to 4gb.

Why has this arbitrary threshold held up? You could argue it's a lack of imagination: maybe we have sized our programs to the limits we had? But 64-bit has been around for quite a while, long enough that there ought to be better examples of different kinds of programs that truly make use of it.

One answer is to observe that many of the limits above are related to human limits. Humans won't wait 30s for a program to load its data; humans can't listen to 4 minutes of audio simultaneously; humans don't write >4gb of configuration files... or, to put it together, humans don't consume 4gb of data in one gulp.

And that observation links to an interesting related phenomenon, which I sometimes call:

Small data

Data visualization is the problem of presenting data in a form your brain can ingest. What if you have a lot of data? We use the tools of statistics to summarize collections of data into smaller representations that capture some of their essence.

Imagine building a stock chart, a simple line chart of a value over time. Even though you have data for every second of the day, when charting the data over a larger timeline it's not useful to draw all of it. Instead we summarize, with either an average price per time window or something that reduces each time window's data to a few numbers, like a violin plot or OHLC bars.
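A minimal sketch of that summarization step, with a made-up bucket size: reduce each window of raw prices to four numbers (open, high, low, close), so the chart only ever sees a handful of values per time bucket.

```c
#include <stdio.h>

/* Sketch of the summarization step: collapse each window of raw price
 * samples into one open/high/low/close bucket before charting. The
 * window size here (4 samples) is arbitrary; a real chart might use
 * one bucket per pixel column or per minute. */

struct ohlc {
    double open, high, low, close;
};

/* Summarize n prices into buckets of `window` samples each; returns the
 * number of buckets written to out. */
static size_t downsample_ohlc(const double *prices, size_t n,
                              size_t window, struct ohlc *out) {
    size_t buckets = 0;
    for (size_t i = 0; i < n; i += window) {
        size_t end = i + window < n ? i + window : n;
        struct ohlc b = { prices[i], prices[i], prices[i], prices[end - 1] };
        for (size_t j = i + 1; j < end; j++) {
            if (prices[j] > b.high) b.high = prices[j];
            if (prices[j] < b.low)  b.low  = prices[j];
        }
        out[buckets++] = b;
    }
    return buckets;
}

int main(void) {
    double prices[] = { 10, 12, 9, 11, 13, 14, 12, 15 };  /* raw samples */
    struct ohlc buckets[2];
    size_t n = downsample_ohlc(prices, 8, 4, buckets);
    for (size_t i = 0; i < n; i++)
        printf("O=%.0f H=%.0f L=%.0f C=%.0f\n",
               buckets[i].open, buckets[i].high, buckets[i].low,
               buckets[i].close);
    return 0;
}
```

However many raw samples exist, what reaches the renderer is a handful of numbers per bucket.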

Why do we do this? Visually, there's only so much you can usefully see. Even with a higher resolution display, shoving more detail into smaller pixels does not convey usefully more information. In fact, a chart is often better when it has fewer visual marks that still tell the same story. Much like the 4gb limit, the bandwidth of information into your brain sets an upper bound on the amount of useful data that can go into a chart.

From this you can derive an interesting sort of principle about data visualization software: your data visualization does not need to be especially fast at handling a lot of marks, because you won't ever need to display many of them. For example, it's fine for libraries like d3 to use relatively slow DOM traversal for rendering.

This is kind of a subtle point, so let me be clear: it's of course useful and important to be able to work with large data sets, and data visualization software will often provide critical APIs for processing them. The point is that once you are at the level of putting actual objects on the screen, you should have already reduced the data mathematically to a relatively small, easy set, so you don't really need speedy handling of screen objects.

Retorts

But wait, you say, this is a dumb argument — it really is more convenient to just have 64-bit pointers everywhere and not worry about any limits. And wouldn't a charting API where you can hand a million points to a scatter plot be more useful than one where you can't?

I think both of those are preferable, in the absence of performance constraints. I prefer them the same way I'd prefer, in a hypothetical future where performance is completely free, a database with no need for indexes, or big AI matmuls that run without specialized GPU code or hardware. But at least in today's software, even 64-bit pointers are costly enough that implementers will go to great lengths to reduce their cost.

So don't take this post as advocating that these lower limits are somehow just or correct. Rather, I aimed only to observe why it is that the memory limits from the 90s, or the visualization limits from the pen and paper days, have not been as constraining as you might first expect.