Bookmarks

Memory Barriers: a Hardware View for Software Hackers

URL: http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf

This paper gives a clear explanation of the techniques used to increase cache utilization, and justifies the existence of memory barriers as a necessary evil required to enable good performance and scalability.

Its general structure is as follows:

  1. Presents the structure of a cache;
  2. Explains how cache-coherence protocols keep the different per-CPU caches coordinated with each other;
  3. Describes a technique called the "store buffer", which eases the performance loss of waiting for invalidate-acknowledge messages;
  4. Gives an example of why write memory barriers are needed -- store buffers reorder the execution of stores to achieve better performance, so we need a way to ensure that certain critical orderings are not undermined (see the sketch after this list);
  5. Outlines another technique, the "invalidate queue", which lets invalidate-acknowledge messages be sent more quickly;
  6. Gives a corresponding example of why read memory barriers are needed -- invalidate queues cause another kind of reordering, which read memory barriers prevent.
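The two ordering examples in items 4 and 6 boil down to something like the following sketch, which roughly follows the paper's running foo()/bar() example. The barrier macros are stand-ins mapped to C11 fences only so the fragment is self-contained; treat the plain shared variables as a hardware-level illustration rather than portable C.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-ins for the Linux-kernel barriers the paper discusses. */
#define smp_wmb() atomic_thread_fence(memory_order_release)  /* write barrier */
#define smp_rmb() atomic_thread_fence(memory_order_acquire)  /* read barrier  */

int a = 0, b = 0;

void foo(void)          /* runs on CPU 0 */
{
    a = 1;
    smp_wmb();          /* order the store to a before the store to b
                           (otherwise a may linger in CPU 0's store buffer) */
    b = 1;
}

void bar(void)          /* runs on CPU 1 */
{
    while (b == 0)
        continue;       /* wait until CPU 0's store to b becomes visible */
    smp_rmb();          /* apply pending invalidations before reading a
                           (otherwise a stale copy may sit in the invalidate queue) */
    assert(a == 1);     /* can fail if either barrier is removed */
}
```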

The paper also includes many quizzes and discussions of real implementations (e.g., ARM and IA64).

Scalable Event Multiplexing: epoll vs. kqueue

URL: http://www.eecs.berkeley.edu/~sangjin/2012/12/21/epoll-vs-kqueue.html

This article gives a fairly good review of common event multiplexing techniques, especially the differences between epoll (in Linux) and kqueue (in BSD).

There are also some interesting anecdotes included in the post, such as:

  • Stateful event multiplexing techniques (epoll, kqueue) are derived from the paper by Banga et al. published at USENIX ATC 1999.
  • kqueue is technically superior to epoll.
  • It is often said that “In Unix, everything is a file”, but that is not always true.
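For concreteness, here is a minimal sketch (my own, not from the article) of how "watch this descriptor for reads" looks in the two APIs; the function names are illustrative and error handling is omitted.

```c
#ifdef __linux__
#include <sys/epoll.h>

int watch_read_epoll(int fd)
{
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);          /* register interest */
    struct epoll_event ready[16];
    return epoll_wait(ep, ready, 16, -1);           /* block for events */
}
#else /* BSD / macOS */
#include <sys/event.h>

int watch_read_kqueue(int fd)
{
    int kq = kqueue();
    struct kevent change, ready[16];
    EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, 0);
    return kevent(kq, &change, 1, ready, 16, NULL); /* register and wait */
}
#endif
```

The structural difference shows even in this toy: epoll keeps per-fd interest in the kernel and mutates it one epoll_ctl() call at a time, while kevent() can batch changes and wait in a single call, and its filters cover more than file descriptors.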

Pitfalls of Object Oriented Programming

URL: http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf

Since a single CPU cycle now takes far less time than a RAM access, it has become critical and profitable to make better use of the cache.

In this slide deck, Tony Albrecht argues that on modern hardware, excessive encapsulation in OO is BAD. Following OOP principles, an instantiated object generally contains all the data associated with it; but when a computation needs only a small portion of its fields, this layout can cause a lot of avoidable cache misses.
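A small illustration of the point (my own example, not taken from the slides): the "object" layout drags cold fields through the cache on every update, while a struct-of-arrays layout touches only the hot data.

```c
#include <stddef.h>

/* "OO" layout: every update loads the whole object, hot and cold fields alike. */
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    char  name[64];        /* cold data that still occupies cache lines */
};

void update_oo(struct Particle *p, size_t n, float dt)
{
    for (size_t i = 0; i < n; i++) {
        p[i].x += p[i].vx * dt;
        p[i].y += p[i].vy * dt;
        p[i].z += p[i].vz * dt;
    }
}

/* Data-oriented layout: positions and velocities are stored contiguously,
 * so each cache line fetched is filled with data the loop actually uses. */
struct Particles {
    float *x, *y, *z;
    float *vx, *vy, *vz;
    size_t n;
};

void update_dod(struct Particles *p, float dt)
{
    for (size_t i = 0; i < p->n; i++) {
        p->x[i] += p->vx[i] * dt;
        p->y[i] += p->vy[i] * dt;
        p->z[i] += p->vz[i] * dt;
    }
}
```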

Fast and accurate long-read alignment with Burrows–Wheeler transform @ BIOINFORMATICS Vol. 26 no. 5 2010

URL: http://dl.acm.org/citation.cfm?id=1741825

Since 2000, a new generation of faster methods for finding DNA sequence matches has been developed. These methods are tailored for aligning short reads (<100 bp) and are 10–1000 times faster than general-purpose algorithms such as BLAST. However, reads produced by the newer sequencing technologies are no longer short, which makes efficiently aligning long reads against a long reference sequence a new challenge for the development of alignment tools.

This paper presents Burrows-Wheeler Aligner’s Smith-Waterman Alignment (BWA-SW), a novel algorithm for aligning long sequences of up to 1 Mb against a large sequence database (e.g. the human genome) with only a few gigabytes of memory. The algorithm is as accurate as SSAHA2, more accurate than BLAT, and is several to tens of times faster than both.

As with other algorithm papers, if you are familiar with the background it is easy to grasp the core algorithm within a few minutes, so I will just recommend several pieces of supporting material here:

  • First, Heng Li et al.'s previous paper "Fast and accurate short read alignment with Burrows–Wheeler transform" gives a more detailed explanation of the building blocks of this algorithm, such as the prefix trie and the Burrows–Wheeler transform. It is a good idea to read that paper in advance.
  • Second, if you still have trouble understanding how the FM-index works, there is an excellent diagrammatic presentation in Alex's post "FM-Indexes and Backwards Search"; a toy backward-search sketch also follows this list.
  • Finally, if you have no idea what "read alignment" is but are still interested in this exquisite application of the FM-index, you can read these two slides [1, 2] first.
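To make the backward-search idea concrete, here is a toy sketch (mine, and a long way from BWA-SW itself): given the BWT of a text, it counts the occurrences of a pattern using only the C[] counts and an Occ() rank query, both computed naively here by scanning.

```c
#include <stdio.h>
#include <string.h>

/* Occ(c, i): occurrences of c in bwt[0..i-1]. Real indexes use sampled
 * rank structures instead of a linear scan. */
static int occ(const char *bwt, int n, char c, int i)
{
    int cnt = 0;
    for (int k = 0; k < i && k < n; k++)
        cnt += (bwt[k] == c);
    return cnt;
}

/* C[c]: number of characters in the text strictly smaller than c. */
static int smaller(const char *bwt, int n, char c)
{
    int cnt = 0;
    for (int k = 0; k < n; k++)
        cnt += (bwt[k] < c);
    return cnt;
}

/* Backward search: returns how many times pat occurs in the indexed text. */
static int backward_search(const char *bwt, const char *pat)
{
    int n = (int)strlen(bwt);
    int lo = 0, hi = n;                    /* half-open SA interval [lo, hi) */
    for (int i = (int)strlen(pat) - 1; i >= 0; i--) {
        char c = pat[i];
        lo = smaller(bwt, n, c) + occ(bwt, n, c, lo);
        hi = smaller(bwt, n, c) + occ(bwt, n, c, hi);
        if (lo >= hi)
            return 0;                      /* pattern not present */
    }
    return hi - lo;
}

int main(void)
{
    /* BWT of "abracadabra$" (with '$' smaller than every letter). */
    const char *bwt = "ard$rcaaaabb";
    printf("%d\n", backward_search(bwt, "abra"));  /* prints 2 */
    return 0;
}
```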

Production-Run Software Failure Diagnosis via Hardware Performance Counters @ ASPLOS'13

URL: http://dl.acm.org/citation.cfm?id=2451128

This paper presents PBI, a system that uses existing hardware performance counters to diagnose production-run failures caused by sequential and concurrency bugs with low overhead (< 10%).

Firstly, we must note that this tool is used to diagnose bugs, not to detect or prevent them during production runs. This purpose is what enables PBI to leverage statistical methods. As a consequence, you must collect enough failure runs before diagnosing a bug, which usually requires that you know how to trigger it.

Personally, I think the most important observation in this paper is that a wide variety of common software bugs are reflected in a small set of the events supported by hardware performance counters.

  • For concurrency bugs, those events are cache-coherence events (state changes in the MESI protocol). For example, the I-predicate and the S-predicate can differentiate failure runs from success runs for all four types of atomicity violations. More detailed discussion can be found in Sec. 3.1.2.
  • For sequential bugs, PBI uses branch-related events, because many semantic bugs manifest as wrong control flow.

This paper also proposes a statistical method to identify which events are highly correlated with failure runs.
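The ranking idea can be pictured with a small statistical-debugging-style sketch (my own illustration, not PBI's actual formula): each per-instruction event predicate is scored by how much more often it fires in failure runs than in success runs. The predicate names below are made up.

```c
#include <stdio.h>

struct predicate {
    const char *name;     /* hypothetical label, e.g. instruction address + event */
    int fail_hits;        /* failure runs in which the predicate fired */
    int fail_total;       /* total failure runs observed */
    int succ_hits;        /* success runs in which the predicate fired */
    int succ_total;       /* total success runs observed */
};

/* Higher score = fires much more often in failures than in successes. */
static double score(const struct predicate *p)
{
    double f = (double)p->fail_hits / p->fail_total;
    double s = (double)p->succ_hits / p->succ_total;
    return f - s;
}

int main(void)
{
    struct predicate preds[] = {
        { "I-state read at 0x4010a3", 95, 100,  3, 100 },  /* made-up data */
        { "branch taken at 0x400f20", 40, 100, 38, 100 },
    };
    for (int i = 0; i < 2; i++)
        printf("%-28s score=%.2f\n", preds[i].name, score(&preds[i]));
    return 0;
}
```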

All about Eve: Execute-Verify Replication for Multi-Core Servers @ OSDI'12

URL: http://dl.acm.org/citation.cfm?id=2387903

State machine replication (SMR) is a powerful fault-tolerance technique: it forces replicas to deterministically process the same sequence of requests so that they produce the same sequence of outputs.

But this technique does not fit parallel systems well, because if different servers interleave the requests’ instructions in different ways, the states and outputs of correct servers may diverge even when no faults occur. As a result, most SMR systems require servers to process requests sequentially.

In contrast, Eve partitions requests into batches and allows different replicas to execute the requests within each batch in parallel. After each batch, it verifies whether the results of these parallel executions (i.e. the system’s important state and output at each replica) match across enough replicas. If such a correct state/output can be identified, Eve makes the diverging replicas issue an incremental state-transfer request to the other replicas (by leveraging a Merkle tree). However, if too many replicas diverge and the correct state/output cannot be identified, Eve guarantees safety and liveness by rolling back and sequentially, deterministically re-executing the requests.

Since divergence seriously impairs Eve's performance, it uses a mixer stage that applies application-specific criteria to produce groups of requests that are unlikely to interfere.

As a side effect, Eve is especially useful in tolerating concurrency bugs. First, Eve’s mixer reduces the likelihood of triggering latent concurrency bugs by attempting to run only unlikely-to-interfere requests in parallel; second, Eve's execute-verify architecture allows it to detect and recover when concurrency causes executions to diverge.
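A toy, single-process simulation of the execute-verify idea (my own illustration; Eve's real protocol, Merkle trees and networking are far more involved): every "replica" applies a mixer-produced batch in its own interleaving, hashes the resulting state, and agreement on the hashes commits the batch while divergence would trigger rollback and sequential re-execution.

```c
#include <stdio.h>

#define REPLICAS 3
#define NREQ     4
#define NKEYS    3

/* Toy stand-in for the Merkle-tree state token. */
static unsigned long state_token(const int *state, int n)
{
    unsigned long h = 17;
    for (int i = 0; i < n; i++)
        h = h * 31 + (unsigned long)state[i];
    return h;
}

int main(void)
{
    /* A batch from the "mixer": (key, increment) pairs chosen to be
     * unlikely to interfere with one another. */
    int batch[NREQ][2] = { {0, 5}, {1, 3}, {0, 2}, {2, 7} };
    int state[REPLICAS][NKEYS] = {{0}};
    unsigned long token[REPLICAS];

    /* Execute stage: each replica applies the batch in a different
     * interleaving, simulating parallel, nondeterministic execution. */
    for (int r = 0; r < REPLICAS; r++) {
        for (int i = 0; i < NREQ; i++) {
            int j = (i + r) % NREQ;
            state[r][batch[j][0]] += batch[j][1];
        }
        token[r] = state_token(state[r], NKEYS);
    }

    /* Verify stage: do the replicas agree on the state token? */
    int agree = 1;
    for (int r = 1; r < REPLICAS; r++)
        agree = agree && (token[r] == token[0]);
    puts(agree ? "tokens match: commit the batch"
               : "divergence: roll back and re-execute sequentially");
    return 0;
}
```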

In their experiments, Eve achieves a 2.6x ~ 6.5x speedup over traditional sequentially executed replication and up to a 4.7x speedup over the Remus primary-backup system. It also found new concurrency bugs in the H2 database.

Consistency Models Explained Briefly

URL: http://coldattic.info/shvedsky/pro/blogs/a-foo-walks-into-a-bar/posts/88

The consistency model, which describes how far the behavior of your multi-threaded (or distributed) system may stray from the ideal "sequential behavior", is one of the most important concepts in concurrent systems. But unfortunately, descriptions of this topic are often incomplete and even inconsistent with one another.

This article, as its title indicates, clearly explains all the important consistency models with vivid diagrams, and it is well worth reading.

Its main contents are as follows (a small litmus test after the list illustrates the kind of outcome these models rule in or out):

  • "Multithreading" Consistency Models

    • Strong Consistency aka Linearizability
    • Sequential Consistency
    • Quiescent Consistency
  • "Distributed" Consistency Models

    • Eventual Consistency
    • Strict Consistency
    • Serializability
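As a tiny companion example (mine, not from the article), the classic store-buffering litmus test below shows the kind of outcome that separates the strong models from the weak ones: under sequential consistency the result r1 == 0 && r2 == 0 is impossible, yet weaker models, and real hardware, allow it. The data race is deliberate; it is the whole point of the test.

```c
#include <pthread.h>
#include <stdio.h>

int x = 0, y = 0;
int r1, r2;

void *t1(void *arg) { (void)arg; x = 1; r1 = y; return NULL; }
void *t2(void *arg) { (void)arg; y = 1; r2 = x; return NULL; }

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Sequential consistency forbids r1 == 0 && r2 == 0. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```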

Parallel Merge Sort

URL: http://coldattic.info/shvedsky/pro/blogs/a-foo-walks-into-a-bar/posts/49

I used to think that the "merge" step of merge sort could not be efficiently parallelized, because the fact that an element belongs to the tail of one sorted array does not even guarantee that it won't rank first in the other sorted array.

But this article illustrates that we can get around this obstacle by doing some binary searches in advance.

And even better, this algorithm preserves merge sort's ability to be used as an external sort.
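A sketch of that divide-and-conquer merge (my own code, not the article's): split the larger sorted run at its midpoint, binary-search that pivot in the other run, place the pivot at its final position, and merge the two halves independently. The two recursive calls are the part that could be spawned as parallel tasks.

```c
#include <stdio.h>

/* Index of the first element in a[lo..hi) that is >= key. */
static int lower_bound(const int *a, int lo, int hi, int key)
{
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < key) lo = mid + 1;
        else              hi = mid;
    }
    return lo;
}

/* Merge sorted runs a[a0..a1) and b[b0..b1) into out starting at index o. */
static void par_merge(const int *a, int a0, int a1,
                      const int *b, int b0, int b1, int *out, int o)
{
    int na = a1 - a0, nb = b1 - b0;
    if (na < nb) {                               /* keep a as the larger run */
        par_merge(b, b0, b1, a, a0, a1, out, o);
        return;
    }
    if (na == 0)
        return;
    int ma = a0 + na / 2;                        /* midpoint of the larger run */
    int mb = lower_bound(b, b0, b1, a[ma]);      /* split point in the other run */
    out[o + (ma - a0) + (mb - b0)] = a[ma];      /* pivot's final position */
    /* These two calls are independent and could be spawned as parallel tasks. */
    par_merge(a, a0, ma, b, b0, mb, out, o);
    par_merge(a, ma + 1, a1, b, mb, b1, out, o + (ma - a0) + (mb - b0) + 1);
}

int main(void)
{
    int a[] = {1, 3, 5, 7}, b[] = {2, 4, 6, 8, 10}, out[9];
    par_merge(a, 0, 4, b, 0, 5, out, 0);
    for (int i = 0; i < 9; i++)
        printf("%d ", out[i]);                   /* 1 2 3 4 5 6 7 8 10 */
    printf("\n");
    return 0;
}
```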