Sunday, July 13, 2014


I've had to take a bit of a break from project work recently due to other commitments but hopefully I'll be able to get back to at least tinkering with things soon.  As a bit of a preamble to that I've been looking at a bit of a fringe task for procedurally generated worlds, namely the procedural generation of names.

My plan for the near future is to focus  more on generating a plausible political and logistical infrastructure for a planet instead of purely working on the visual terrain as I have done many times before so I'm going to need to be able to name things be they continents, countries, regions, cities, airports, premiers or whatever.  This is a completely new area to me and so sounded like an interesting diversion.

There are many ways to procedurally construct names ranging from simply producing random sequences of letters to utilising complex linguistic rules but the one I chose to experiment with was the use of Markov Chains (, an algorithm that decides what comes next in a sequence by looking only at the preceding X elements and using some probability tables to decide what's most likely to follow.  In my case the sequence is the letters of the name being generated so by looking at the last so many characters of the name so far I can choose a plausible next character and so on.

So where do these probability tables come from?  Well to make names that look sensible I thought it best to reference the real world as a starting point, so to generate names for countries I use a set of 275 real world country names as my exemplars - something that's not too hard to come by these days thanks to the internet.  These real names are analysed to store the probability that any given sequence of characters is followed by another, for example it may be that 23% of the time the letter 'a' is followed by the letter 't', or that 15% of the time the sequence "tr" is followed by the letter "s" in the real world names.

The resultant behaviour of the algorithm is heavily dependent upon the size of the exemplar pool but even more significantly upon the number of characters you look at when deciding what might come next.  For example running the same algorithm with the same exemplars but different length groups produced the following sets of names:

It's immediately obvious that the longer the character sequence examined the more rigidly the generated names stick to the exemplars while the shorter the character sequence the more variety the generated names exhibit.  The extreme example here is table (1d) where only the last character emitted is used to decide what comes next resulting in a virtually random sequence of characters that bears little resemblance to the exemplars or even anything vaguely pronounceable!

Of the variations I think the three character sequence version (1b) is most promising but while the longer sequence length prevents too much randomness in the output at times there simply isn't enough randomness meaning that the generated names are quite similar or even identical to the real world exemplars.  There is also nothing preventing the same sequence being generated more than once such as "falklands uk".

To address the first of these issues I decided to run a check on each generated name to see if it was different enough to the exemplars to be accepted.  This is done by comparing the generated name to each exemplar and calculating the Levenshtein Distance between the two strings (, essentially the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

By rejecting any generated names that produced a Levenshtein Distance to any exemplar that was too low we can help ensure that we don’t accidentally end up with something too similar to a real world place name.  Running the same test again but now with this added exemplar distance check rejecting any generated names with an exemplar distance score of less than two we get a better selection of plausible but not actually real world names:

Note here that the distance check is performed on each separate word of the generated and exemplar names separately to avoid generating names where one word is unique but the other lifted straight from an exemplar - certain key words such as "the", "of" or "island" are exempted from this check however as they are perfectly valid in the output.

As an experiment, running the test one final time but this time with the distance rejection limit raised from less than two to less than five produced these country names:

Comparing these to the previous test's output the biggest change here is with the longer character sequence generated names (3b and to a lesser extent 3c) as the distance check has rejected anything even remotely similar to one of the exemplars causing far more random and less recognisable names.  The shorter sequence names at the bottom have been less affected as they were pretty random to start with.  Note there is no four character sequence output here as the system was unable to generate any names using that most restrictive of criteria that also passed the distance check - the names were by definition most similar to the exemplars therefore least likely to have an acceptably high distance score.

So far these have all been country names but how does the system work with a different set of exemplars?  To find out I used it to try to generate plausible city names instead - first using a full set of 26914 Earth city names with a three character sequence length and a distance threshold of three:

As you can see there is a lot of variety in the names reflecting the huge real world variety of city names depending on culture and language, as one of my main goals at the moment is to produce a procedural world that exhibits more heterogeneity than it typically found by global application of procedural systems however this one-size-fits-all name style is not really what I'm after.

By limiting the set of exemplars to those real world cities from a particular country however more localised results can be achieved.  Out of interest I ran the system over just those real world city names from the United Kingdom, United States, France and Japan in turn to see how closely the output would reflect their dramatically contrasting morphology:

I'm no language expert but to my layman eyes the output from each while varied amongst it's peers is noticeably different from those in the other tables which I think is good enough for my needs - I just want each of my virtual countries to have an identity reflected in it's nomenclature.

These tables have been generated from just a few runs and with specific parameters, but it's important to bear in mind that when used for real the exemplar pool, sequence length and the distance threshold can all be varied each time a name is needed providing a steerable but near limitless potential set of generated names.

Anyway, I’m happy enough with it for now so I’m going to move on to something else...comments as always are welcome

Sunday, September 15, 2013

The end of an era...

It's with great sadness that I have to announce the company I have greatly enjoyed working for over the last eighteen years has had to close:

I think it impossible to overstate just what a great group of tremendously skilled people we had there producing a huge variety of projects on many diverse platforms.  The games industry is an unforgiving place however and a whole raft of factors conspired against us over the last year or two making it impossible to continue.

I wish everyone from there the very best of luck finding something new and exciting to work on.

Thursday, July 11, 2013

Damn, Voxel Data is BIG

I left off the last post with a little question about how many voxels I might need to represent a planet.  There were a few guesses of  4.7810528e+209.8782084e+17 and 10^21 which at first glance seem absurdly large but in fact they were pretty close to the mark.  You see the thing about voxel data is it's big.  Very big.

Apart from the inherent inefficiencies of using GPUs to render voxels with raycasting when they would generally be much happier doing what they were designed for and be rasterising triangles, it's the amount of memory voxel data sets consume that prohibit their use in many scenarios and while that's pretty much accepted wisdom and hardly newsworthy - I thought it might be interesting to just take a moment to put out there some cold hard numbers.

So, how many voxels and therefore of course bytes of storage do you need to render a planet?  The answer as you would expect depends on the fidelity of your representation and your method of storage, but these are my numbers what what they're worth:

To encompass a maximum data set of 25,000 Km cubed I am using 22 levels of detail each offering eight times the number of voxels of the previous one (i.e. twice the number along each axis).  The lowest detail level uses a 19^3 grid of voxel bricks each of which is 14^3 voxels in size so the entire lowest detail level provides 266^3 voxels.  Dividing the 25,000 Km region by this equates to each voxel encompassing a cubic region of space 93.98 Km on a side (830,185 Km^3).  That's pretty big!

So obviously the lowest level of detail is pretty coarse but even at that low detail level there are 266*266*266 voxels making up the space which equates to 18,821,096 individual voxels.  So that's nearly Twenty Million voxels for just one very crude detail level!

The next level down gives us 532^3 voxels each 46.99 Km on a side so we end up with over 150 Million of them there - things are scaling up pretty quickly.  Continuing this down to the maximum detail 22nd level gives us the following results:
Voxel counts, sizes and storage requirements at each level of detail
As you can see the numbers get silly very quickly, and this is storing just a single distance field value per voxel in half precision floating point (i.e. two bytes per voxel) without any thought to storing normals or occlusion for lighting or any kind of texturing information

So to put it another way, storing the full 25,000 Km sized cube of space at a maximum detail level of 4.5 cm per voxel would take 173,593,970,549,359,000,000,000,000 (173 million billion billion) voxels taking a mind boggling 294 Zetabytes of storage just for the distance field!  Putting that astronomical number into real world terms, if you burnt it all to dual layer DVD disks and stacked them on top of each other the pile would be nearly five times as high as the distance from the Earth to the edge of the Solar System!

I know the cost of hard drive storage continues to decrease but I'm pretty sure I can neither afford that sort of storage nor fit it in my PC!

Let's think about it a bit more though, firstly 25,000 Km is the maximum size my system can store - let's for the sake of argument say that I'm only interested in storing something Earth sized for now. It's not a perfect sphere but The Earth has a generally agreed approximate radius of 6371 Km which when plugged in to my formula reduces the numbers quite substantially:
Same voxel statistics but for an Earth sized cube of space
Which while better is still pretty extreme at a cool 39 Zetabytes - but let's keep going anyway. These numbers are for the entire cubic space but planets are conventionally round so with the volume of a sphere being (4/3)*PI*(Radius^3) our Earth sized planet has a volume of 1,083,206,916,845 KM^3, just 52.4% of the entire cube volume.

Before proceeding further lets define the planet a bit more accurately though; while they are basically spheres a perfectly smooth spherical surface isn't very interesting so we need some amount of surface detail. Conversely though we don't typically want to be able to travel to the planet's core so data below a certain depth underground probably isn't needed. Combine these two and you end up with a spherical shell of some determined thickness that defines the area of interest for rendering. My chosen thickness is currently 18Km (about 59,000 feet) which is approximately twice the height of Everest; I am anticipating that having roughly 10 Km available for the tallest mountains and 8 Km available for the deepest caves or underwater abysses ought to be sufficient

Making this assumption that we are only interested in a hollow shell of data is a great optimisation both of storage and rendering performance because instead of tracing the ray through the entire cubic data set we can first intersect it with the outer sphere of the planet's data shell and start tracing from there removing the need to store a great many of the distance samples. Assuming there aren't any holes in the shell you can also ignore any space inside the shell as you're bound to hit something solid before getting there.

Some basic calculations shows that for an Earth sized planet an 18 Km thick shell takes up just 0.23% of the planet's bounding cube and has a volume of about 9 billion Km^3 requiring something like 1.017E+23 voxels taking 172 ZB at two bytes per voxel.  Damn, that's still pretty large.

This estimation is pretty crude as it doesn't take account of the topology of the actual terrain which will allow many more bricks to be discarded and there are basic storage optimisations such as compressing the brick data but no matter which way you look at it you simply can't store brick data for a whole planet at this kind of detail.

There is also a major downside to all these assumptions of course - we are restricted to sensible planet shapes with sensible features which is a shame as being able to create crazy shaped planets you can travel through the core of sounds like a lot of fun.  Even though my efforts at the moment are on conventional planet shapes therefore I'm taking care to make sure there are no artificial restrictions in place that would preclude more radical geology. These limitations are simply optimisations to make the data set more manageable and easier to generate (i.e. faster) while I develop the system, pushing the boundaries of what's possible is a large part of this project's raison d'ĂȘtre.

Although all these crazy numbers make this project sound like an exercise in futility, remember that these are for the highest 4.5cm per voxel detail level.  Each level lower takes only 1/8th the amount and of course you only need high detail data for the immediate vicinity of the viewpoint; the key therefore is to have a system that can generate brick data on demand for any given position in the clipmap hierarchy.  Combine this with a multi-tier memory and disk caching strategy and you get something usable.

Remember also that one of the benefits of clipmaps is that the memory overhead for rendering is fixed regardless of detail level.  I support ten renderable levels from the set of 22 so it makes no difference whether I'm rendering levels 0 to 9 or level 12 to 21 the overhead is the same.

Finally, you're maybe wondering why bother to store the bricks at all when there are some pretty cool projects out there showing what can be achieved with noise based functions directly on the GPU such as iq's Volcanic.  The main reason I'm sticking with them is that I want to be able to add all sorts of voxely infrastructure onto my planets such as roads, cities and bridges which are hard to do dynamically in real-time but also because I want to experiment more with creating truly heterogeneous terrains rather than the fairly homogeneous looking constructs real time noise tends to favour.  That's the goal anyway.

I realise this has been a pretty dry number-heavy post so +1 if you're still with me, hopefully the next one will be a bit more visual.

Wednesday, July 3, 2013

Interplay of Light

This is just a quick shout-out for the blog of a colleague here at Blitz Games Studios:

"Interplay of Light" is the work of talented graphics programmer Kostas Anagnostou and is a recommended addition to the feed list of anyone interested in the field.