Friday, January 20, 2012

Talk: A guided tour through the bytestring library

Yesterday, I gave a talk about the bytestring library at the Zurich HaskellerZ meetup group. The goal of the talk was to enable the audience to write efficient code based on the bytestring library; i.e., to explain enough about the internals of the library and GHC that they can judge the cost of operations on bytestrings in terms of both time and space. The talk also covers the new bytestring builder (based on blaze-builder), which is currently under review for inclusion in the next release of the bytestring library.

Here are the slides and their corresponding handout version for interested readers.

Thursday, March 10, 2011

Efficient UTF-8 Encoding

Belgian beer and UTF-8 encoding: what a coincidence. Well, this is just a quick post about the results of UTF-8 encoding 100'000'000 characters to /dev/null using the various options that our current Haskell ecosystem offers. It happens to be written after a very nice Belgian evening spent in the awesome Minimum climbing gym in Zurich Wollishofen. Here's the code and the corresponding Makefile. The following are the wall-time results obtained on my Core 2 Duo laptop with 2GB RAM using GHC 7.0.2 on a 32bit Linux 2.6.32-29 installation.

text         1.08s
blaze        1.70s
base         2.63s
via-text     5.97s
utf8-string 47.99s
utf8-light  40.18s

The result for text is the time it takes to UTF-8 encode a nicely packed lazy text value of length 100'000'000. The times for blaze, base, via-text, and utf8-string measure how long it takes to UTF-8 encode a String of length 100'000'000 using blaze-builder, GHC 7.0.2's base library, packing to a lazy text value and then encoding it, and the utf8-string library, respectively. Sadly, utf8-light supports only strict ByteStrings and is hence not really usable for UTF-8 encoding a String of that length. The time given for utf8-light is therefore the time it takes to encode a String of length 1'000'000 a hundred times.

My conclusions after this experiment are the following:

  1. Using [Word8] list based encoding implementations is likely to result in suboptimal performance (see the sketch after this list).
  2. It is not worth packing a String first to a Text value, if it is encoded right away again.
  3. The current work I'm spending on integrating blaze-builder with the bytestring library is really worth the effort. Compared to the text benchmark, which uses a better (packed) representation of Char sequences, we are only a factor 1.6 slower. Moreover, it might even be worth a try to replace the String encoding functions in the base library by corresponding Builders to gain these additional 50 percent of speed. As an additional benefit, we could even think of executing Builders directly on the buffer associated with a Handle and, therefore, output byte streams denoted by Builders without any buffer allocations.
  4. It would be very interesting to see how well we fare against other languages. If a reader would implement the same benchmark in C, C++, Java, Python, ... I'd be very glad to publish the obtained results here.
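
To make conclusion 1 concrete, here is a minimal sketch of the per-character work every UTF-8 encoder has to perform; encodeCharUtf8 is a name invented for this post, and producing a [Word8] list like this is exactly the style of implementation that tends to be slow.

  import Data.Bits ((.&.), (.|.), shiftR)
  import Data.Char (ord)
  import Data.Word (Word8)

  -- UTF-8 encode a single character (surrogates ignored for brevity).
  encodeCharUtf8 :: Char -> [Word8]
  encodeCharUtf8 c = case ord c of
      x | x <= 0x7F   -> [fromIntegral x]
        | x <= 0x7FF  -> [0xC0 .|. sextet x 6,  cont x 0]
        | x <= 0xFFFF -> [0xE0 .|. sextet x 12, cont x 6, cont x 0]
        | otherwise   -> [0xF0 .|. sextet x 18, cont x 12, cont x 6, cont x 0]
    where
      sextet x n = fromIntegral ((x `shiftR` n) .&. 0x3F) :: Word8
      cont   x n = 0x80 .|. sextet x n

A fast encoder performs the same bit twiddling but pokes the bytes directly into an output buffer instead of materializing them as a list, which is essentially what the builder-based implementations above do.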

Thursday, November 18, 2010

On the design of an efficient builder monoid

Lately (here and here), I reported on promising results about the efficient construction of lazy bytestrings with a large average chunk size using the blaze-builder library. In the meantime, I have done some more research, which further confirmed the strength of the design underlying the blaze-builder library. However, before reporting on more results, I'd like to present its design in the hope that you can apply the underlying ideas to other performance-critical areas you are working on.

The Foreign module provides us with all the functions we require for working with pointers of various forms. This includes the ForeignPtr pointers used by strict bytestrings to reference their underlying buffer.

  import Foreign
  -- unsafeForeignPtrToPtr moved to its own module in later base versions;
  -- at the time of writing, 'import Foreign' was enough to bring it into scope.
  import Foreign.ForeignPtr.Unsafe (unsafeForeignPtrToPtr)
  import Data.Monoid
  import qualified Data.ByteString          as S
  import qualified Data.ByteString.Internal as S

A builder should represent a sequence of bytes such that the following two objectives are maximized:

  • the efficiency of writing the represented sequence of bytes to a sequence of buffers (e.g., the chunks of a lazy bytestring) and 
  • the efficiency of appending two builders.

In our design, the work of a builder is handled by build steps. Once we tell a build step where it can start to write and where the buffer ends, the build step will write all the bytes it represents and return a build signal that tells us how to proceed.

  • If the build signal is Done pf, then the build step has completed its work and the next free byte in the buffer is pointed to by pf.
  • If the build signal is BufferFull requiredSize pf nextStep, then the build step has filled the buffer up to pf and now requires a new buffer of size at least requiredSize. If we were creating a lazy bytestring, we would now ship off the full buffer as a chunk and allocate a new buffer that we can pass to nextStep.
  • The build signal InsertByteString pf bs nextStep tells us that the build step has filled the buffer up to pf and would now like to insert a bytestring directly into the output sequence of buffers. The idea behind this signal is that it allows us to avoid copying large bytestrings.


A builder is just a build step parametrized over the build step that should be executed after the builder has done its output.

  type BuildStep =  Ptr Word8      -- first free byte of buffer
                 -> Ptr Word8      -- first byte after buffer
                 -> IO BuildSignal


  data BuildSignal =        -- next free byte
      Done                  !(Ptr Word8) 
    | BufferFull       !Int !(Ptr Word8)               !BuildStep
    | InsertByteString      !(Ptr Word8) !S.ByteString !BuildStep
  
  newtype Builder = Builder (BuildStep -> BuildStep)

Based on these definitions, we can easily define a Monoid instance that models concatenation of builders. The empty builder just returns the build step to be executed afterwards. We append two builders by telling the first builder that it should call the second builder once it is done, and by telling the second builder that it should call the continuation build step k once it is done.

  instance Monoid Builder where
      mempty                            = Builder id
      mappend (Builder b1) (Builder b2) = Builder $ \k -> b1 (b2 k)

Writing an actual builder requires some care, as we are working in the IO monad and, hence, the safety belts are off. We describe the construction of builders that serialize values that need not be wrapped over buffer boundaries using the notion of a Write. A value Write size io denotes an atomic write of size bytes to a buffer, which can be executed by calling io with a pointer to the first byte to be written.

  data Write = Write Int (Ptr Word8 -> IO ())


  writeWord8 :: Word8 -> Write
  writeWord8 x = Write 1 (\pf -> poke pf x)

Defining writes is simple. Constructing a builder from a write is also not difficult, once you have understood how to define a function that passes a reference to itself in the BufferFull signal.

  fromWrite :: Write -> Builder
  fromWrite (Write size io) =
      Builder step
    where
      step !k !pf !pe
        | pf' <= pe = do io pf
                         k pf' pe
        | otherwise = return $ BufferFull size pf (step k)
        where
          pf' = pf `plusPtr` size
      
Once such basic builder constructors are defined, the code quickly loses its C-like style, while retaining good performance.

  fromWord8 :: Word8 -> Builder
  fromWord8 = fromWrite . writeWord8


  fromWord8s :: [Word8] -> Builder
  fromWord8s = mconcat . map fromWord8


  buildABC :: Builder
  buildABC = fromWord8s [65..90]

The missing piece is now a driver function that actually runs a builder. A very nice property of our builder design is that it completely decouples the allocation strategy for the output buffers from the actual writing to them. Hence, whatever buffer you have ready, you can tell a builder to fill it. The only caveat is that a build step may require a larger buffer than you can provide. However, all builders I have implemented up to now can be wrapped at almost every point. They would therefore even work with a very small output buffer, and the above caveat almost never applies; i.e., even expensive solutions to get rid of it are OK.

An additional advantage of separating the buffer allocation strategy from the buffer filling is that the whole state of the allocation strategy is kept out of the performance-critical functions, i.e., the concatenation of builders and the builders themselves. This is one of the core differences from the builder implementation provided by Data.Binary.Builder from the binary package.

The driver I implement here runs a builder and executes an IO action on each full buffer, which we represent by a strict bytestring. We can, for example, use this driver to send the bytes represented by a builder over the network using Network.Socket.ByteString. This driver is slightly simpler to implement than the generation of lazy bytestrings. Moreover, it has the nice property that no unsafePerformIO is involved, which means that there are no semantic pitfalls other than the ones already provided by the IO monad ;-)

  toByteStringIOWith :: Int -> (S.ByteString -> IO ()) -> Builder -> IO ()  
  toByteStringIOWith bufSize io (Builder b) = 
      fillBuffer bufSize (b finalStep)
    where
      finalStep pf _ = return $ Done pf
      fillBuffer !size step = do
          S.mallocByteString size >>= fill
        where
          fill fpbuf = do
              let !pf = unsafeForeignPtrToPtr fpbuf
                  -- safe due to later reference of fpbuf
                  -- BETTER than withForeignPtr, as we lose a tail call otherwise
              signal <- step pf (pf `plusPtr` size)
              case signal of
                  Done pf' -> io $ S.PS fpbuf 0 (pf' `minusPtr` pf)
                  BufferFull minSize pf' nextStep  -> do
                      io $ S.PS fpbuf 0 (pf' `minusPtr` pf)
                      fillBuffer (max bufSize minSize) nextStep
                  InsertByteString pf' bs nextStep  -> do
                      io $ S.PS fpbuf 0 (pf' `minusPtr` pf)
                      io bs
                      fillBuffer bufSize nextStep
                  
The implementation is rather straightforward. We construct a final step that we can hand over to the builder. Then we allocate a buffer and fill it using the given step. We handle the signals as described above and recurse with a tail call to fill the next buffer. The tail call is important for good performance and to avoid wasting stack space. A misplaced (outermost) withForeignPtr can easily destroy this property. Therefore, I switched to using unsafeForeignPtrToPtr and reasoning directly about when it is safe to extract the pointer of a foreign pointer.

Let's test our driver with an artificially small chunk size.

  testABC :: IO ()
  testABC = toByteStringIOWith 10 (putStrLn . show) buildABC


  *BuilderDesign> testABC
  "ABCDEFGHIJ"
  "KLMNOPQRST"
  "UVWXYZ"

Yihaa, nice :-) Now you have seen the core design of the builder monoid provided by the blaze-builder library. It gains a lot of its performance from its simplicity and from the option to safely break the abstraction and implement build steps directly when needed. What I was hiding in the above definitions are the {-# UNPACK #-} pragmas and a (not yet so judicious) use of {-# INLINE #-} pragmas. See the source and the accompanying benchmarks of the blaze-builder library for more information on them.

Let me give two last examples of how we improve performance without losing code maintainability. The key to these examples is the Write abstraction. Writing a list of elements calls for a typical optimization: moving loop-invariant data out of the inner loop. We can implement it once and for all as follows.

  fromWriteList :: (a -> Write) -> [a] -> Builder
  fromWriteList write = 
      \xs -> Builder $ step xs
    where
      step xs0 !k !pf0 !pe0 = go xs0 pf0
        where
          go []          !pf = k pf pe0
          go xs@(x':xs') !pf
            | pf' <= pe0  = io pf >> go xs' pf'
            | otherwise   = return $ BufferFull size pf (step xs k)
            where
              !pf' = pf `plusPtr` size
              Write size io = write x'

Using inlining, we can give the compiler enough information such that it can optimize the code for the actual implementations of calls like fromWriteList writeWord8.
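
Concretely, this might look as follows; fromWord8s' is a name invented here for the specialized variant of fromWord8s defined earlier, and the pragma placement is a sketch (the library source uses such pragmas, as mentioned above).

  -- Inlining fromWriteList lets GHC specialize the inner loop
  -- for the concrete Write passed to it, e.g. writeWord8.
  {-# INLINE fromWriteList #-}

  fromWord8s' :: [Word8] -> Builder
  fromWord8s' = fromWriteList writeWord8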

Another optimization often used in the blaze-builder library is to ensure that costly operations are amortized over enough output bytes. For example, this is why it is OK for buffer wrapping to be a bit more expensive. For consecutive writes, we can reduce the cost of appending them using the following Monoid instance.

  instance Monoid Write where
      mempty = Write 0 (const $ return ())
      mappend (Write l1 f1) (Write l2 f2) = Write (l1 + l2) $ \ptr -> do
          f1 ptr
          f2 (ptr `plusPtr` l1)

If the compiler knows all the implementations of the writes, then it will produce very nice straight-line code for builders constructed from mappend'ed writes. A typical example would be the serialization of a Word16 represented by two bytes b0 and b1, as fromWrite (writeWord8 b0 `mappend` writeWord8 b1).
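
As a concrete sketch of this pattern, here is a big-endian Word16 serializer; fromWord16BE is a name invented for this example.

  fromWord16BE :: Word16 -> Builder
  fromWord16BE w = fromWrite $
      writeWord8 (fromIntegral (w `shiftR` 8)) `mappend`  -- high byte b0
      writeWord8 (fromIntegral w)                         -- low byte b1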

Well, that's it for now. I hope you enjoyed this excursion into some not-always-Haskell-like code. For me, the development of this library was (and still is) very entertaining and inspiring, and I'd love to hear how it fares in practice.

Wednesday, November 10, 2010

Defragmenting lazy bytestrings

In the last post, I introduced the blaze-builder library and stressed the point that it is important to ensure a large average chunk size for the constructed lazy bytestrings. In this post, I'll give a concrete example of why ensuring a large average chunk size matters.

You probably know the nice zlib library that allows you to compress a lazy bytestring with a single call to

    compress :: L.ByteString -> L.ByteString

I will use compress to illustrate that small chunk sizes can be costly. Actually, they can be so costly that it is worth first "defragmenting" the lazy bytestring before compressing it. Using the blaze-builder library, defragmentation is easily defined as follows.

    defragment :: L.ByteString -> L.ByteString
    defragment = toLazyByteString . fromLazyByteString

The builder created using fromLazyByteString copies chunks of up to 8kb in size and inserts larger chunks directly into the output stream. This way, we can guarantee a minimal average chunk size of 4kb, no matter when the output buffer is flushed due to the direct insertion of a block of 8kb or more.
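
To make the threshold logic concrete, here is a hedged sketch in terms of the Builder design presented in the post above; copyThreshold, copyChunk, insertChunk, and fromChunk are names invented for this illustration, and the byte-by-byte copy is deliberately naive (the real fromLazyByteString copies whole chunks at once).

    copyThreshold :: Int
    copyThreshold = 8 * 1024  -- chunks up to 8kb are copied

    -- Insert a chunk directly into the output stream via the
    -- InsertByteString build signal; no bytes are copied.
    insertChunk :: S.ByteString -> Builder
    insertChunk bs = Builder $ \k pf _pe -> return $ InsertByteString pf bs k

    -- Naive copying via single-byte writes, for illustration only.
    copyChunk :: S.ByteString -> Builder
    copyChunk = fromWriteList writeWord8 . S.unpack

    fromChunk :: S.ByteString -> Builder
    fromChunk bs
      | S.length bs <= copyThreshold = copyChunk bs
      | otherwise                    = insertChunk bs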

The following plot shows the measured times in boxplot format for defragmenting ("compaction only"), direct compression, and compression with preceding compaction of 200kb of data represented as a lazy bytestring for different fixed chunk sizes.

As in my previous post, the benchmarks were measured on a Core2 Duo T7500 with 2GB RAM and Linux 2.6.32-24 i686 and GHC 6.12.3. The corresponding measurement log can be found here. A log-log plot exhibiting more information on the behaviour of defragment for larger chunk sizes can be found here.

The above plot shows that compress profits heavily from defragmentation. Sadly, I do not yet know the cause of the significant slowdown of compress for lazy bytestrings with a small average chunk size. I guess it is a combined effect of the cost of the FFI calls (how big is that actually?) and perhaps some implementation overhead stemming from the large amount of state being threaded through the implementation of compress. Comments and clarifications are very much welcome.

Note that, as with filesystems, an even better solution than regular defragmentation is to avoid fragmentation in the first place, which you can achieve by using a Builder to construct lazy bytestrings. Note also that fromLazyByteString currently does not wrap bytestrings around buffer boundaries, which results in some unnecessarily spilled memory for medium (1kb - 8kb) chunk sizes. I'll implement that in the near future and post the new measurements here.

Sunday, November 7, 2010

The blaze-builder library: faster construction of bytestrings

Hi, I am Simon Meier, a Swiss Haskell enthusiast currently pursuing his PhD in computer science at ETH Zurich. In this blog post, I'll introduce you to the blaze-builder library.

The blaze-builder library provides you with a Builder type that you can use to efficiently construct sequences of bytes represented in a packed form as a strict or lazy bytestring. Hence, typical use cases for a Builder are saving your application data in a space efficient binary form to a file or sending a response to some request over the network.

Probably you know about the binary package, which also provides a Builder type in the Data.Binary.Builder module, targeting exactly the same use case as our Builder. This is no coincidence. During this year's Google Summer of Code, Jasper Van der Jeugt and I developed the blaze-builder library to overcome performance shortcomings of Data.Binary.Builder with respect to the specific needs of the blaze-html HTML generation library. Since then, I have restructured the blaze-builder library to serve as a drop-in replacement for Data.Binary.Builder, which it improves upon in both speed and expressivity.

Usage example
We start by importing the necessary modules. We also define a convenient abbreviation for mappend, which, according to rumors I heard at this year's ZuriHac, will actually become part of the base library.
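
In sketch form, assuming blaze-builder's module layout, the setup looks like this:

  import Data.Monoid (Monoid, mappend)
  import Data.Word (Word8)
  import qualified Data.ByteString      as S
  import qualified Data.ByteString.Lazy as L
  import Blaze.ByteString.Builder           -- Builder, toByteString, toLazyByteString, fromWord8s, fromWord32le
  import Blaze.ByteString.Builder.Char.Utf8 (fromString)

  -- Abbreviation for mappend. (Recent base versions export (<>) from the Prelude.)
  infixr 6 <>
  (<>) :: Monoid m => m -> m -> m
  (<>) = mappend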

Our example is about serializing a very simple representation of a person to a sequence of bytes. As usual, this serialization also requires us to fix the encoding format. We encode strings using UTF-8 and prefix them with their length encoded as a 32bit little-endian integer to make parsing unambiguous. We also encode the age of a person as a 32bit little-endian integer. I guess the code speaks for itself.
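
A hedged sketch of such code follows; Person, fromPerson, and encodedPerson are names invented here, and the length prefix below counts characters, the byte count being the other plausible reading of the format.

  data Person = Person { personName :: String, personAge :: Int }

  -- Length-prefixed UTF-8 string: 32bit little-endian length, then the bytes.
  fromLengthPrefixedString :: String -> Builder
  fromLengthPrefixedString cs =
      fromWord32le (fromIntegral (length cs)) <> fromString cs

  fromPerson :: Person -> Builder
  fromPerson p =
      fromLengthPrefixedString (personName p) <>
      fromWord32le (fromIntegral (personAge p))

  -- Executing the recipe yields a concrete lazy bytestring.
  encodedPerson :: L.ByteString
  encodedPerson = toLazyByteString (fromPerson (Person "Haskell B. Curry" 42))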

The above code is typical for serialization code based on builders. One uses the predefined functions for creating builders with a fixed encoding format from standard Haskell values. These builders are then combined using the functions from the Monoid typeclass. Builders essentially store the recipe for building their corresponding sequence of bytes. Once one needs a concrete representation of this sequence of bytes, one just calls toLazyByteString or toByteString to execute that recipe.

The benefit of using builders to construct a bytestring is twofold: First, appending two builders is an O(1) operation, which is also efficient in absolute terms, as it corresponds to a single function call. Second, when constructing the resulting lazy bytestring, blaze-builder makes sure that the average chunk size is large. A large average chunk size is important to make good use of cache prefetching in later processing steps (e.g., compression), and it also reduces the system call overhead when writing the resulting lazy bytestring to a file or sending it over the network.

For example, the above code results in the following sequence of chunk sizes.

The 170001 bytes represented by lazyBinaryCloneVillage feature an average chunk size of ~24kb. The first buffer is only ~4kb large because, for short output sequences, the buffer allocation cost is significant. toLazyByteString compensates for this cost by allocating the first buffer with the minimal expected chunk size. Note that these chunk sizes reflect the default settings of toLazyByteString, which are optimized to yield efficient and well-chunked results for all lengths of output sequences. If you know more about your typical serialization tasks, then you can tune these settings in your favor.

Speaking of efficiency, I'm quite sure you would also like to see some benchmark figures. I'm not going to present the figures for the above example. Not because they are embarrassing; they are not. However, without good competition, the interpretation of benchmark figures is difficult, and currently I don't know of a good competitor for the above use case. However, we can also use builders to pack a [Word8] list into a strict or lazy bytestring; and there, we definitely do have good competitors.

Packing [Word8]
For our benchmark, we use the following implementations for packing [Word8] lists.
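
The two builder-based implementations plausibly look as follows; this is a sketch matching the description below, composing fromWord8s with the two extraction functions.

  packStrict :: [Word8] -> S.ByteString
  packStrict = toByteString . fromWord8s

  packLazy :: [Word8] -> L.ByteString
  packLazy = toLazyByteString . fromWord8s
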
The implementations S.pack, L.pack, declPackLazy, and binaryDeclPackLazy are trivial. The implementations packStrict and packLazy make use of fromWord8s :: [Word8] -> Builder, which is a very efficient function for serializing lists of bytes, as the following plot shows.

The plot is a log-log plot of the mean time for packing [Word8] lists using the above implementations when run on a Core2 Duo T7500 with 2GB RAM, Linux 2.6.32-24 i686, and GHC 6.12.3. I created this plot by adapting Bryan O'Sullivan's excellent Criterion benchmarking library to handle scaling benchmarks (cf. ScalingBenchmarks.hs). In the spirit of Criterion, I also generate a boxplot version of every scaling benchmark (using more transparent lines to draw the quartiles and whiskers), which allows us to judge the quality of the measurements. The boxplot version of the above plot shows that nothing went wrong during its measurement.

Note that the mean times are plotted with respect to a logarithmic scale. Hence, a constant difference between two graphs means a constant factor improvement. As you can see from the measurement log, using blaze-builder is a definitive win for output sequences longer than 1kb: packStrict beats S.pack by almost a factor 2 and packLazy beats L.pack by a factor 10 and binaryDeclPackLazy by a factor 92 (!).

The crucial ingredient for this improvement is the fromWord8s function. It is constructed using the Write abstraction Jasper introduced during his work on blaze-html. The function fromWrite8List forces and writes eight list elements at a time, which allows the compiler to bundle the actual writes to the output buffer.

For shorter output sequences, the improvement gained from using blaze-builder gets smaller and S.pack is even faster for very short sequences. The following plot, its boxplot version, and the measurement log give a more detailed comparison for such short sequences.

The results are not surprising when comparing the implementations: packStrict uses toByteString, which simply runs toLazyByteString and copies all chunks into a single buffer of the appropriate size. Hence, packStrict is always slightly slower than packLazy. The S.pack function from Data.ByteString works in two passes over the input list: first, it determines the length of the list, and then it copies all bytes to the allocated buffer. Traversing linked lists of bytes is costly and pays off only for very short lists, where the output buffer allocation cost is dominant. The peak of packLazy at 64 bytes stems from the fact that it first allocates a 64 byte buffer, which is copied to a 4kb buffer once it is clear that more than 64 bytes are output. This is done to compensate for the buffer allocation cost for very short output sequences. It can be switched off using toLazyByteStringWith, if required.

Conclusions
The blaze-builder library provides an expressive and efficient way to construct both lazy and strict bytestrings. The accompanying benchmarks show that it improves (often significantly) in all cases over Data.Binary.Builder from the binary package. The benchmarks presented in this post also show that the implementation of blaze-builder compares favorably against special-purpose functions for packing [Word8] lists; on a Core2 Duo T7500 with 2GB RAM, Linux 2.6.32-24 i686, and GHC 6.12.3. Yeah, that's what the benchmarks state ;-). However, I expect that the conclusions drawn from them also remain valid for most other settings. For example, the GHC-7.0.1 release candidate makes Data.Binary.Builder run a bit faster, but still not as fast as blaze-builder.

During the work on blaze-html, I learned from several benchmarks that ensuring a large average chunk size is very important for lazy bytestrings to be efficient. However, many encoding functions on Hackage produce strict or lazy bytestrings, so we have to copy their results again to guarantee large average chunk sizes, which is a waste of resources. Hence, I suggest that encoding functions produce a builder instead of a strict or lazy bytestring. Apart from guaranteeing a fast append and a large average chunk size, this change also simplifies and generalizes the encoding code, as it separates the buffer allocation strategy from the encoding function.

In order for such a change to be effective, I suggest that the bytestring library itself provide an implementation of Data.ByteString.Builder, which would offer a blessed way to incrementally create bytestrings. The blaze-builder library offers one possible implementation path for such a bytestring builder. If the community sees fit, I'd be happy to port the builder parts to the bytestring library. The string encodings currently provided by blaze-builder would then move into their own libraries.

Well, that's it for now. I will publish more of the experiments I have done during the work on blaze-builder once I find some more time. I'm also very much looking forward to your feedback.

Happy packing :-p