April 28, 2010

The first Entrelacs prototype is emerging

I've eventually decided to build up the first prototype of the Entrelacs system. After a few weeks of work, I think I finished these last days writing the first and main software component, that is the arrows space.

The arrows space is a kind of data base which allows to store and fetch information in the form of arrow structures, in compliance with the arrow paradigm.

As I expected, the resulting piece of code is very short. But it isn't as simple as I wished.

Of course, as predicted by the arrow theory, the mass memory stands as a self-referred open-addressed hash table using all the available storage space.

But I had to better integrate data strings in my plans.

A data string unambiguously defines a tag arrow. I decided a long time ago that such data would be stored among the space cells (hash-table buckets). However, the point I missed is that such a data sequence can't be mixed together with data already present in expected locations by following the open addressing way. One must link all string slices each other with a kind of "next" field.

But I didn't want to use a full memory pointer to link related cells. It would have double the storage need. So I tried to be smart and decided to introduce a tertiary hash.

As a reminder, the "double addressing" approach consists in computing 2 hash codes. The primary hash gives the default cell where to store an arrow while the secondary hash gives a probing offset which is used to look for a vacant cell when the primary cell is already used. Using these two hashes theoretically avoid data clustering

I decided to introduce a tertiary hash so to compute a new offset: the sequence offset, which separate cells involved into data string storage.

So upon conflict, one only needs for a few administrative bits per cell to build an offset "factor". This factor stands as a replacement for a "pointer". If there is no unwanted data between a cell and the next cell in the sequence, this factor is set to one. But when conflicts occur, this factor allows to jump over all unwanted cells up to the next cell included in the wanted sequence. Of course, this value range is very short but one considers that many consecutive busy cells is rare.

I found this solution so handy I eventually decided to use it for arrow back-references list storage (even if such lists can be safely polluted by unrelated data).

However, I had to made concessions.
  • BLOB arrows raw data are not stored into the arrows space. They are stored as host files instead. As long as Entrelacs won't be an autonomous operating system, I guess this is a reasonable concession.
  • Incoming children arrows and outgoing children arrows are put in the same back-references list. It means one can't distinguish them without fetching them and examining their ends. This is a bit disappointing news, but I can't have more than one coalesced linked list per arrow.
  • For the same reason, tag arrows are not back-referred. Otherwise, the back-references list would have been put beside the char string in the same coalesced linked list. Tag arrows only have a reference counters used for garbage collector.
At first, I intended to reuse the factor bits to build up a coalesced list of all arrows sharing the same primary offset by defining factor*probe-offset jumps. It would have lead to a whole coalesced hash table enhanced by a double hashing trick.

But for the first prototype, I had only 8 admin bits per cell. Combined with 24 bits for a memory address, it just fits in a 32 bits cell.

Now, I use these administrative bits for two other needs.
  • to distinguish between regular arrows, rooted arrows, tag arrows and stuffed cells.
  • to count cell "crossovers". This other invention of mine consists in incrementing a short-range counter per cell each time the cell is included into a probe sequence. It is supposed to make cell freeing feasible as in linear probing. TBC!
So I was obliged to reuse the same bits to store either the offset factor or a cell type, depending on whether the cell is an arrow definition start or some of its subsequent cells. It was hard but I finally found a working use of these 8 administrative bits.

I've thought about this code for like 11 years. So I'm more than happy to see something concrete emerging. Now I must confess I need a break. I haven't even tried yet to compile or obviously to debug the whole thing. It also lacks for a few features, like a commit log to ensure transaction atomicity.

Then the next step will be to code the virtual machine for the Entrelacs language -a query language for the arrows space-. Sigh! If only I could be seconded in this task.

No comments:

Post a Comment

Blog Archive