File-System: Delayed Allocation, fsync() solution

September 19, 2009 Matteo Bertozzi | Filed Under Unix C | No Comments

Last week on LWN, Valerie Aurora posted a great article (as always): POSIX v. reality: A position on O_PONIES. http://lwn.net/Articles/351422/.

fsync() is often more expensive than it absolutely needs to be. The easiest way to implement it is to force out every outstanding write to the file system, regardless of whether it is a journaling file system, a COW file system, or a file system with no crash recovery mechanism whatsoever. This is because it is very difficult to map backward from a given file to the dirty file system blocks needing to be written to disk in order to create a consistent file system containing those changes. For example, the block containing the bitmap for newly allocated file data blocks may also have been changed by a later allocation for a different file, which then requires that we also write out the indirect blocks pointing to the data for that second file, which changes another bitmap block… When you solve the problem of tracing specific dependencies of any particular write, you end up with the complexity of soft updates. No surprise then, that most file systems take the brute force approach, with the result that fsync() commonly takes time proportional to all outstanding writes to the file system.

Thinking about it for a while… It’s not hard to implement an fsync() that flushes just one file when you use Delayed Allocation (Allocate on Flush). All the new data is in memory, and the old data stays in its blocks. So modification in place means that you just have to flush blocks, while append means that you need to allocate something.

RaleighFS in Memory Structure
The image above is a little bit old, but it’s the original idea of the RaleighFS in-memory structure. There is general information like the super-block, the bad blocks list, the free blocks list, the current cache size and some other things. But today I’m focusing on the Block Cache and the Write Items Cache.

When you open a file, you load its metadata into memory; then, when you need the file content, you load it into the Block Cache. Each RaleighFS data block contains the File Key, so you can easily find the blocks with a specified key, but you can also find your file’s blocks using pointers.

So, why is it easy to fsync() only the specified file with Delayed Allocation?

  • Modification in place requires just a scan of the block cache to find which blocks need to be flushed (and obviously the metadata).
  • Append to file has all the new data in memory, and there are no modifications to the B*Tree(s) or the free blocks list until you flush something.
  • Remove is just a command, and when fsync() is called all the delete operations on the B*Tree(s) and lists take place.

But remember, syncing just one file is not a good idea. Trust your File-System’s flush policy!
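
The "scan of the block cache" point can be sketched with a toy cache. This is only an illustration in Python, not RaleighFS code: the class, its fields and the dictionary standing in for the disk are all hypothetical names, and a real file system would also have to flush metadata and allocate blocks for appended data.

```python
class BlockCache:
    """Toy in-memory block cache; every name here is hypothetical."""

    def __init__(self):
        self.blocks = {}     # (file_key, block_nr) -> bytes
        self.dirty = set()   # keys of the blocks modified since the last flush
        self.disk = {}       # stand-in for the block device

    def write(self, file_key, block_nr, data):
        """Keep the new data in memory and remember that it is dirty."""
        key = (file_key, block_nr)
        self.blocks[key] = data
        self.dirty.add(key)

    def fsync(self, file_key):
        """Flush only the dirty blocks that belong to file_key."""
        to_flush = sorted(k for k in self.dirty if k[0] == file_key)
        for key in to_flush:
            # With delayed allocation, on-disk allocation would happen here.
            self.disk[key] = self.blocks[key]
            self.dirty.discard(key)
        return len(to_flush)


cache = BlockCache()
cache.write("a.txt", 0, b"hello")
cache.write("a.txt", 1, b"world")
cache.write("b.txt", 0, b"other")
flushed = cache.fsync("a.txt")
print(flushed, sorted(cache.dirty))
```

Because each cached block carries its file key, the flush touches only the blocks of the requested file and leaves the dirty blocks of other files in memory.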

Grand Central Dispatch: First Look

September 6, 2009 Matteo Bertozzi | Filed Under Apple | No Comments

Mac OS X Grand Central Dispatch
In the last few years I’ve always used a “parallel task” approach for the foreach loops in my code, not only to speed things up but also to clean up the code. How to do it? By wrapping threads and a Thread Pool, like in this C# Parallel Foreach Code.

Snow Leopard has introduced a new BSD-level infrastructure, with a simple and efficient API to do this job. Here is a little usage preview.

Block objects are a C-based language feature that you can use in your C, Objective-C, and C++ code. Blocks make it easy to define a self-contained unit of work. Blocks are something like Actions (delegate {}) in C#. They are very useful to embed functions in loops.

A block looks like a “private” function pointer, but it can access the variables of its “parent” scope. (If you’re a Python coder, you have exactly the same thing with nested functions.)

#include <stdio.h>

/* Blocks in Python...
 * def main():
 *    a = 10
 *    def test(k):
 *        print(a, k)
 *    test(128)
 */
int main (int argc, const char *argv[]) {
  int a = 12;

  /* The block captures a copy of 'a' at the point where it is defined */
  void (^test_block) (int) = ^(int k) {
    printf("A Block: PARENT(%d) ARG(%d)\n", a, k);
  };

  test_block(128);

  return 0;
}

The GCD queue API provides dispatch queues from which threads take tasks to be executed. Because the threads are managed by GCD, Mac OS X can optimize the number of threads based upon available memory, number of currently active CPU cores, and so on. This shifts a great deal of the burden of power and resource management to the operating system itself, freeing your application to focus on the actual work to be accomplished.

#include <dispatch/dispatch.h>
#include <stdlib.h>
#include <stdio.h>

#define ITEM_VMIN       (1)
#define ITEM_VMAX       (200)
#define NR_ITEMS        (100)

static void __fill_item (void *items, size_t n) {
  int *i_items = (int *)items;
  /* rand() is not guaranteed to be thread-safe: good enough for a demo */
  i_items[n] = (ITEM_VMIN + (int)(ITEM_VMAX * ((double)rand() / RAND_MAX)));
}

static void __work_on_item (void *items, size_t n) {
  int *i_items = (int *)items;
  i_items[n] *= 100;  /* Do some Computation on this Item */
}

int main (int argc, const char *argv[]) {
  dispatch_queue_t queue;
  int data[NR_ITEMS];

  /* Get Global Dispatch Queue */
  queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

  /* Initialize data Elements, and run computation on each element */
  dispatch_apply_f(NR_ITEMS, queue, data, __fill_item);
  dispatch_apply_f(NR_ITEMS, queue, data, __work_on_item);

  /* Brief review of the items.
   * A block cannot capture a local array, so capture a pointer to it. */
  int *items = data;
  dispatch_apply(NR_ITEMS, queue, ^(size_t n) {
    printf("Results: Item %zu = %d\n", n, items[n]);
  });

  return 0;
}

File-System and Data Block Back Reference

September 6, 2009 Matteo Bertozzi | Filed Under Algorithms | No Comments

While I’m thinking and waiting for suggestions on how to improve my file-system block cache algorithm, I’ve decided to apply some changes to the Raleigh File-System Format (source code is not published yet).

Following the ideas of Valerie Aurora on Repair-driven File System Design, I’ve decided to add to each block (B*Tree and Data blocks) a header that contains a Magic Number and a CRC sum of the block. In this way you can easily identify what kind of block you’ve picked up without scanning all the metadata. Another step is to add a back reference (or back pointer) to the data block, so that you can easily jump back to its extent block (and obviously to its OID). This tells you which Object owns the block, and lets you swap two blocks by reading at most 4 blocks (2 Data and 2 Extents).
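
A header like this can be sketched in a few lines of Python. The layout below is an assumption made only for illustration (it is not the real Raleigh File-System format): a 4-byte magic, a CRC32 computed over the payload only, and an 8-byte back reference. The magic value and the function names are made up.

```python
import struct
import zlib

# Hypothetical layout: magic (4 bytes), CRC32 of the payload (4 bytes),
# back reference to the owning extent block (8 bytes), little-endian.
HEADER = struct.Struct("<4sIQ")
MAGIC_DATA = b"RDAT"   # made-up magic number for data blocks


def pack_block(magic, back_ref, payload):
    """Prefix the payload with magic, checksum and back reference."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return HEADER.pack(magic, crc, back_ref) + payload


def unpack_block(raw):
    """Split a raw block and verify its checksum."""
    magic, crc, back_ref = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch: block is corrupted")
    return magic, back_ref, payload


raw = pack_block(MAGIC_DATA, 42, b"file content")
magic, back_ref, payload = unpack_block(raw)
print(magic, back_ref, payload)
```

With the magic in a fixed position, a repair tool can classify any block it reads, and the back reference tells it which extent block to look at next without walking the trees from the root.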

Another idea stolen from Valerie is to double the metadata blocks with a COW-like approach, as explained in the paper “Double the Metadata, Double the Fun: A COW-like Approach to FS Consistency”. It’s really useful for personal file-systems, but maybe less so in a distributed file-system. I’m working on it, adding it only as an mkfs option.

When will the source code be online? I don’t know... I have less time to work on it. Maybe at the end of this year I’ll publish the File-System and the Distributed System (explained some posts ago).


Mac OS X 10.6 Snow Leopard

September 5, 2009 Matteo Bertozzi | Filed Under Apple | No Comments

It’s finally here.  Mac OS X Snow Leopard, The world’s most advanced operating system. Finely tuned.

Mac OS X 10.6 Snow Leopard
OK, it’s one week late… but I’ve installed it just now. Maybe tomorrow morning I’ll post a usage example of Grand Central Dispatch.

iPhone: Voice Mill

August 30, 2009 Matteo Bertozzi | Filed Under iPhone | No Comments

Yesterday I played a bit with AVAudioRecorder, and this is a very small and fun example.
The main idea is to create something like a windmill that works with voice instead of wind.

The code below shows you how to record something. Then, with updateMeters and peakPowerForChannel:, you can extract the audio “noise” level.

NSURL *url = [NSURL fileURLWithPath:@"/dev/null"];
NSDictionary *settings = [NSDictionary dictionaryWithObjectsAndKeys:
    [NSNumber numberWithFloat:44100.0], AVSampleRateKey,
    [NSNumber numberWithInt:kAudioFormatAppleLossless], AVFormatIDKey,
    [NSNumber numberWithInt:1], AVNumberOfChannelsKey,
    [NSNumber numberWithInt:AVAudioQualityLow], AVEncoderAudioQualityKey,
    nil];

NSError *error = nil;
recorder = [[AVAudioRecorder alloc] initWithURL:url
            settings:settings error:&error];
if (recorder) {
    [recorder prepareToRecord];
    recorder.meteringEnabled = YES;
    [recorder record];
    ...
}


The SWF Video is available here Voice Mill Video.
The Source Code is available here Cocoa Voice Mill Source Code.

Block Cache Algorithm

August 25, 2009 Matteo Bertozzi | Filed Under Algorithms | 1 Comment

I need to replace my old filesystem cache algorithm with something newer and more efficient. The old one is based on an LRU/LFU algorithm. There’s a queue of cached blocks and a Hashtable to speed up block lookup.

struct blkcache_buf {
    struct blkcache_buf *  next;    /* Next Queue Item */
    struct blkcache_buf *  prev;    /* Prev Queue Item */
    struct blkcache_buf *  hash;    /* Next Item with the same hash */

    xuint16_t              count;   /* Retain count */
    xxx_block_t            block;   /* Cached Block */
};

typedef struct {
    struct blkcache_buf ** buf_hash;        /* Bufs Hashtable */
    xuint16_t              buf_hash_size;   /* Bufs Hashtable Size */
    xuint16_t              buf_used;        /* Bufs in use */

    struct blkcache_buf *  head;            /* Head of the Bufs Queue */
    struct blkcache_buf *  tail;            /* Tail of the Bufs Queue */

    xxx_device_t *         device;          /* Block Device used for I/O */
} xxx_blkcache_t;


Above, you can see the cache data structure and below the core of the cache Algorithm.

#define BUFHASH(cache, blocknr)     ((blocknr) % (cache)->buf_hash_size)

xxx_block_t *xxx_blkcache_read (xxx_blkcache_t *cache,
                                xxx_blkptr_t blocknr)
{
    struct blkcache_buf *buf;
    xuint16_t hash_index;

    /* Scan the hash chain for block */
    hash_index = BUFHASH(cache, blocknr);
    if ((buf = __blkcache_find(cache, blocknr, hash_index)) != NULL) {
        buf->count++;

        /* Move Buf far from head */
        __blkcache_buf_shift(cache, buf);

        return(&(buf->block));
    }

    /* Cache is Full, Remove one Item */
    if ((cache->buf_used + 1) > cache->buf_hash_size) {
        /* All buffers are in use */
        if (cache->head->count > 0)
            return(NULL);

        /* Remove Least-Frequently Used */
        buf = __blkcache_remove_lfu(cache);
        cache->buf_used--;
    }

    /* Desired block is not in the cache: allocate (or reuse) a buffer for it */
    if ((buf = __blkcache_buf_alloc(cache, buf, blocknr)) == NULL)
        return(NULL);

    /* Add One Use, Block cannot be removed */
    buf->count++;

    /* Go get the requested block from the device */
    __blkcache_readwrite(cache, buf, RFALSE);

    /* Update Cache Hash */
    cache->buf_used++;
    buf->hash = cache->buf_hash[hash_index];
    cache->buf_hash[hash_index] = buf;

    /* Update Cache Queue */
    if (cache->head == NULL) {
        cache->head = cache->tail = buf;
    } else {
        buf->prev = cache->tail;
        cache->tail->next = buf;
        cache->tail = buf;
    }

    return(&(buf->block));
}


You can download a demo implementation here: lru-block-cache.c, but I’m waiting for ideas or suggestions to improve (or radically change) the Cache Algorithm. Thanks in advance!

[TIP] Generic Binary Format

August 11, 2009 Matteo Bertozzi | Filed Under Networking, Tips | No Comments

I love using binary formats instead of XML and JSON. Here is my Generic Binary Format for data transmission or serialization. Data is composed of three blocks. The first one is 1 byte that describes the object, like “is a single Int object” or “is a list”, and also tells you the length of the second block. The second block contains the size of the third block (the Data-Block) or the number of items in the list.

Generic Binary Format
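
Here is a minimal Python sketch of a format in this spirit. The exact bit layout is my assumption for the example, not the layout in the image: the high 4 bits of the first byte hold a type id, the low 4 bits hold the length of the second block, and the second block is a big-endian unsigned integer. The type ids and function names are hypothetical.

```python
TYPE_BYTES = 0x1   # hypothetical type ids
TYPE_LIST = 0x2


def _uint_len(value):
    """Minimum number of bytes needed to store value (at least 1)."""
    return max(1, (value.bit_length() + 7) // 8)


def encode_bytes(data):
    """Block 1: type + size length, block 2: data size, block 3: data."""
    n = _uint_len(len(data))
    head = bytes([(TYPE_BYTES << 4) | n])
    return head + len(data).to_bytes(n, "big") + data


def encode_list(items):
    """Same header, but block 2 holds the number of items in the list."""
    n = _uint_len(len(items))
    head = bytes([(TYPE_LIST << 4) | n])
    body = b"".join(encode_bytes(item) for item in items)
    return head + len(items).to_bytes(n, "big") + body


def decode(buf, offset=0):
    """Decode one object and return (value, next_offset)."""
    type_id, n = buf[offset] >> 4, buf[offset] & 0x0F
    offset += 1
    size = int.from_bytes(buf[offset:offset + n], "big")
    offset += n
    if type_id == TYPE_BYTES:
        return buf[offset:offset + size], offset + size
    if type_id == TYPE_LIST:
        items = []
        for _ in range(size):
            item, offset = decode(buf, offset)
            items.append(item)
        return items, offset
    raise ValueError("unknown type id %d" % type_id)


encoded = encode_list([b"key", b"value"])
decoded, _ = decode(encoded)
print(encoded.hex(), decoded)
```

Small objects pay only two bytes of overhead, because the size field grows only when the data needs it.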

[TIP] Generic Key Comparer

August 10, 2009 Matteo Bertozzi | Filed Under Tips | No Comments

I’m back to coding on my “Cloud” FileSystem and distributed tools. And here is a little tip.

When you’re working on data, you probably store it as a Key-Value pair in a BTree or something similar, and maybe this key is an aggregation of information… Maybe you have one bit of flags, N bytes of ParentKey, and more…

Now, the problem is… How can a “foreign” server correctly sort my keys? The solution is to send the server the information on how to sort them, or a method to do it… Today I’m focusing on the first one.

Generic Key

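The code below shows a Python implementation of the Generic Key Comparer. The comparer string describes the key layout as a sequence of type letters, each followed by an optional length in bytes: for example, 's4u4ci2' means a 4-byte string, a 4-byte unsigned integer, one char and a 2-byte signed integer. At the end of the source code you can find a usage example. The full source code is available here: Generic Key Comparer Source Code.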

import struct

def __indexOfOne(data, tokens, offset):
   # Index of the first char of data (starting at offset) found in tokens, or -1
   for i in range(offset, len(data)):
      if data[i] in tokens:
         return i
   return -1

def rawComparer(data1, data2, comparer):
   typeIds = [ 's', 'u', 'c', 'i', 'f', 'x' ]
   pyBinMap = {
      ('u', 1): 'B', ('u', 2): 'H', ('u', 4):'L', ('u', 8):'Q',
      ('i', 1): 'b', ('i', 2): 'h', ('i', 4):'l', ('i', 8):'q',
      ('f', 4): 'f', ('f', 8): 'd'
   }

   p = i = 0
   while i < len(comparer):
      nextIdx = __indexOfOne(comparer, typeIds, i + 1)
      if (nextIdx < 0): nextIdx = len(comparer)

      format = None
      length = 1 if (i + 1) == nextIdx else int(comparer[i+1:nextIdx])
      if comparer[i] == 's':
         format = str(length) + 's'
      elif comparer[i] == 'c':
         format = 'c'
      elif (comparer[i], length) in pyBinMap:
         format = pyBinMap[(comparer[i], length)]

      # Fields without a format (like 'x') are skipped, not compared
      if format != None:
         # '=' means standard sizes and no padding, so offsets match the key layout
         d1 = struct.unpack('=' + format, data1[p:p+length])[0]
         d2 = struct.unpack('=' + format, data2[p:p+length])[0]

         if d1 < d2:
            return -1
         elif d1 > d2:
            return 1

      p += length
      i = nextIdx
   return 0

# Usage Example (keys are packed with '=' so that no alignment padding is added)
if __name__ == '__main__':
  data1 = struct.pack('=4sLch', b'test', 10, b'A', -3)
  data2 = struct.pack('=4sLch', b'test', 10, b'A', -3)
  print('Equal (test 10 A -3)', rawComparer(data1, data2, 's4u4ci2'))

  data1 = struct.pack('=4sLch', b'test', 10, b'A', 1)
  data2 = struct.pack('=4sLch', b'test', 10, b'A', 0)
  print('(test 10 A 1) > (test 10 A 0)', rawComparer(data1, data2, 's4u4ci2'))

  data1 = struct.pack('=4sLch', b'test', 10, b'A', 0)
  data2 = struct.pack('=4sLch', b'test', 10, b'A', 1)
  print('(test 10 A 0) < (test 10 A 1)', rawComparer(data1, data2, 's4u4ci2'))
