Articles by Mingxing Zhang:

Welcome!

In Pages by Mingxing Zhang

Hello, I'm ZHANG Mingxing, a.k.a. james0zan.

Currently, I am a fifth-year Ph.D. student in the MadSys Group at Tsinghua University. I'm interested in building efficient and reliable parallel systems. You can refer to my CV for more information.

Thank you for visiting. The blog is here.

Anticipating Invariant

In Pages by Mingxing Zhang

AI in Brief

Concurrency bugs (CBugs) are notoriously difficult to eradicate in the testing phase because of their non-deterministic nature, and the bug-fixing procedure is also time-consuming and error-prone.

Thus, tolerating concurrency bugs in the production phase emerges as an attractive complementary approach. Unfortunately, existing tolerating tools are usually 1) constrained in the types of bugs they can handle; or 2) dependent on a roll-back mechanism, which hitherto cannot be implemented efficiently without hardware support.

In contrast, the Anticipating Invariant (AI) can anticipate CBugs before any irreversible change has been made. Based on it, we implemented a software-only tool that tolerates concurrency bugs on the fly.

The tool restricts the program's interleaving space so that AI-violating (i.e., potentially failure-triggering) interleavings are avoided during production runs. Since AI detects the bugs beforehand, we are able to bypass suspicious interleavings by stalling, instead of resorting to roll-back.
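The stalling idea can be sketched for a simple order violation. This is a hypothetical illustration only: the names (`initializer`, `consumerRead`, the `initialized` flag) are invented here and are not AI's actual instrumentation.

```cpp
#include <atomic>
#include <chrono>
#include <string>
#include <thread>

// A toy order violation: the consumer may dereference `shared` before the
// initializer has run. A stall-based tolerance layer delays the risky
// access until the anticipated order is established, instead of rolling back.
std::string* shared = nullptr;
std::atomic<bool> initialized{false};

void initializer() {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    shared = new std::string("ready");
    initialized.store(true, std::memory_order_release);  // order now holds
}

// Instrumented read: detects the dangerous interleaving *before* the
// irreversible dereference and stalls until it is safe to proceed.
std::string consumerRead() {
    while (!initialized.load(std::memory_order_acquire))
        std::this_thread::sleep_for(std::chrono::milliseconds(1));  // stall
    return *shared;
}
```

The key point is that nothing irreversible happens before the check, so no roll-back machinery is ever needed.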

Experiments with 35 real-world concurrency bugs demonstrate that AI is capable of detecting and tolerating most types of concurrency bugs, including both atomicity and order violations.

We also evaluated AI with 6 representative parallel programs. Results show that AI incurs negligible overhead (< 1%) for many nontrivial desktop and server applications, and its slowdown on computation-intensive programs can be reduced to about 2x with bias instrumentation.

To the best of our knowledge, this is the first attempt to efficiently tolerate previously unknown order and atomicity violations at run time without using rollback.

Paper

Won SIGSOFT Distinguished Paper Award

Software

You can download and try AI here.

In the package, we present:

  1. the source code of our LLVM-based AI implementation;
  2. several demos demonstrating AI's ability to tolerate CBugs;
  3. applications from different categories (desktop, server, HPC) for evaluating AI's overhead;
  4. an example of the APIs' usage.

Documentation, screencasts and some auxiliary scripts are also provided.

DEF CON CTF Qualifier 2013 OMGACM 4 Writeup

In Blog by Mingxing Zhang

Crossposting

This article is also cross-posted on blue-lotus's official blog here.

Problem Description

Each problem gives you a circuit board, which has a dimension, a set of nulls (cells that traces cannot go through), a feed point, and a set of antenna points.

Your goal is to draw a trace for each antenna point that starts at the feed point, ends at the antenna point and does not intersect other traces.

The lengths of these traces must also be equal across all antenna points.

Our Solution

Since we intuitively classified this problem as NP-complete, we resorted to iterative deepening depth-first search (IDDFS) to solve it.

In our algorithm, we chose to "grow" the traces from the antenna points: all traces start simultaneously from their respective antenna points and advance one step at a time, taking turns.

This strategy guarantees that all traces have equal length at every moment, so many unnecessary states are avoided.
And if one trace encounters another trace before reaching the feed point, we can simply merge them into one to eschew intersection.

More specifically, in order to record the current state, we use:

  1. Nine bool variables for each point: one for the point itself and eight for the adjacent edges;
  2. One pair<int, int> variable for each point (counting unit steps and sqrt(2) diagonal steps) to store the distance from this point to the corresponding antenna point, if it has been covered by a trace;
  3. The current "growing" point of each trace;
  4. Which trace's turn it is to grow in this iteration;
  5. The final distance, once one of the traces has reached the feed point.
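The distance representation in item 2 works because every step is either axis-aligned (length 1) or diagonal (length sqrt(2)), so each path length has the form a + b*sqrt(2) with integer a and b, and lengths can be compared exactly:

```cpp
#include <utility>

// (unit steps, diagonal steps): the path length is first + second * sqrt(2).
using Dist = std::pair<int, int>;

// Advance the length after one step; this mirrors the dp[] table in the code.
Dist step(Dist d, bool diagonal) {
    if (diagonal) d.second++; else d.first++;
    return d;
}

// Because sqrt(2) is irrational, a1 + b1*sqrt(2) == a2 + b2*sqrt(2) iff the
// integer pairs match, so exact equality needs no floating point at all.
bool sameLength(Dist a, Dist b) { return a == b; }
```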

We also use a set of pruning tricks to accelerate the algorithm:

  1. Iterative deepening on the length of each path;
  2. Limiting the depth of the DFS to at most 30 (this ought to have been another iterative-deepening parameter, but we hard-coded an empirical number to simplify the code);
  3. Preprocessing the minimal distance from each point to the feed point, and combining this information with the iterative-deepening threshold for further pruning;
  4. If one of the traces has already reached the feed point, using its distance for pruning.

Here is another tip:

Intersections can happen not only at points but also inside the small unit squares (two diagonal segments crossing each other), so you may need to double-check for this.

Code

#include <stdio.h>
#include <stdlib.h>
#include <map>
#include <set>
#include <vector>
#include <queue>
#include <string.h>
using namespace std;

typedef unsigned long long llu;
#define MAX 100
#define mp make_pair
#define pb push_back

bool Null[MAX][MAX], Null2[MAX][MAX];    // blocked cells (nulls)
int XY[MAX][MAX];                        // minimal #steps from each cell to the feed point
int N, M, eN, fX, fY, P, UP;             // board size, #antennas, feed point, problem id, IDDFS bound
int dy[] = {1, 0, -1, 1, -1, 1, 0, -1};  // the 8 neighbour offsets
int dx[] = {1, 1, 1, 0, 0, -1, -1, -1};
int verse[] = {8, 7, 6, 5, 4, 3, 2, 1};  // verse[i]: edge slot of the opposite direction
set<vector<bool> > S;
// Length contribution of a step in direction i: (unit steps, sqrt(2) steps).
pair<int, int> dp[] = {
    mp(0, 1),
    mp(1, 0),
    mp(0, 1),
    mp(1, 0),
    mp(1, 0),
    mp(0, 1),
    mp(1, 0),
    mp(0, 1)
};

struct node {
    vector<bool> t;               // 9 flags per cell: [0] covered, [1..8] used edges
    vector<pair<int, int> > len;  // distance (unit, sqrt(2)) from the antenna to each cell
    int cnt;                      // how many traces have reached the feed point
    pair<int, int> ans;           // the common final distance, once one trace arrives
    vector<pair<int, int> > now;  // current "growing" endpoint of each trace
    vector<bool> used;            // is this trace still growing?
    int I;                        // whose turn it is
};
node Key;                         // the global search state

int get(int x, int y) {           // linear index of cell (x, y)
    return x*M + y;
}

pair<int, int> getLen(int j) {    // length of the current trace after a step in direction j
    return mp(Key.len[get(Key.now[Key.I].first, Key.now[Key.I].second)].first+dp[j].first, 
        Key.len[get(Key.now[Key.I].first, Key.now[Key.I].second)].second+dp[j].second);
}

void output(node ans) {
    printf("Solution %d\n", P);
    int kk=0, L=0;
    for (int i=0; i<N*M; i++) {
        kk++;
        int tx = i / M;
        int ty = i % M;
        for (int j=0; j<4; j++, kk++) {
            if (ans.t[kk])
                L++;
        }
        kk+=4;
    }
    printf("Line %d\n", L);
    int k=0;
    for (int i=0; i<N*M; i++) {
        k++;
        int tx = i / M;
        int ty = i % M;
        for (int j=0; j<4; j++, k++) {
            if (ans.t[k])
                printf("Segment %d %d %d %d\n", tx, ty, tx+dx[j], ty+dy[j]);
        }
        k+=4;
    }
}

bool none() {
    for (int i=0; i<eN; i++)
        if (Key.used[i]) return false;
    return true;
}

int myabs(int xx) {
    if (xx<0) return -xx;
    return xx;
}


int UPPER_BOUND = 28;  // empirical depth limit (pruning trick 2)
void dfs(int depth) {  // advance the trace whose turn it is (Key.I) by one step
    if (depth > UPPER_BOUND) return;

    int x = Key.now[Key.I].first;
    int y = Key.now[Key.I].second;

    int i = rand() % 8;  // try the 8 directions in a randomized order
    for (int iii=0; iii<8; iii++, i=(i+1)%8) {
        int newX = x + dx[i];
        int newY = y + dy[i];

        if (newX < 0 || newX >= N) continue;
        if (newY < 0 || newY >= M) continue;
        if (Null[newX][newY]) continue;
        if (XY[newX][newY] == -1 || getLen(i).first + getLen(i).second + XY[newX][newY] > UP) continue;
        if (Key.cnt > 0 &&
                            (Key.ans.first < getLen(i).first || Key.ans.second < getLen(i).second) ) continue;
        if (Key.cnt > 0 &&
                            (Key.ans.first == getLen(i).first && Key.ans.second == getLen(i).second) ) continue;


        // Reject a diagonal step that crosses the opposite diagonal of the
        // same unit square (the "small squares" tip above).
        if (i == 0) {
            if (Key.t[get(x+1, y) * 9 + 6]) continue;
        }
        if (i == 2) {
            if (Key.t[get(x+1, y) * 9 + 8]) continue;
        }
        if (i == 7) {
            if (Key.t[get(x, y-1) * 9 + 6]) continue;
        }
        if (i == 5) {
            if (Key.t[get(x, y+1) * 9 + 8]) continue;
        }


        if (newX == fX && newY == fY) {
            if (Key.cnt > 0 && Key.ans != getLen(i)) continue;
            Key.cnt++; Key.ans = getLen(i);
            Key.t[(newX*M + newY)*9 + verse[i]] = true;
            Key.t[get(x, y)*9 + i + 1] = true;
            Key.used[Key.I] = false;
            Key.now[Key.I] = mp(newX, newY);


            if (Key.cnt > 0 && none()) {
                output(Key);
                exit(0);
            }

            int II = Key.I, ttt = 0;
            for (Key.I=(Key.I+1)%eN; ttt<eN; ttt++, Key.I=(Key.I+1)%eN)
                if (Key.used[Key.I])
                    break;
            if (ttt < eN) {
                dfs(depth+1);
            }

            Key.t[(newX*M + newY)*9 + verse[i]] = false;
            Key.t[get(x, y)*9 + i + 1] = false;
            Key.I = II;
            Key.used[Key.I] = true;
            Key.now[Key.I] = mp(x, y);
            Key.cnt--;
        } else if (Key.t[get(newX, newY) * 9]) {
            if (Key.len[newX*M + newY] != getLen(i)) continue;

            Key.t[(newX*M + newY)*9 + verse[i]] = true;
            Key.t[get(x, y)*9 + i + 1] = true;
            Key.used[Key.I] = false;
            Key.now[Key.I] = mp(newX, newY);

            int II = Key.I, ttt = 0;
            for (Key.I=(Key.I+1)%eN; ttt<eN; ttt++, Key.I=(Key.I+1)%eN)
                if (Key.used[Key.I])
                    break;
            if (ttt < eN) {
                dfs(depth+1);
            }

            Key.t[(newX*M + newY)*9 + verse[i]] = false;
            Key.t[get(x, y)*9 + i + 1] = false;
            Key.I = II;
            Key.used[Key.I] = true;
            Key.now[Key.I] = mp(x, y);
        } else {
            Key.t[(newX*M + newY)*9] = true;
            Key.t[(newX*M + newY)*9 + verse[i]] = true;
            Key.t[get(x, y)*9 + i + 1] = true;
            Key.len[newX*M + newY] = getLen(i);
            Key.now[Key.I] = mp(newX, newY);

            int II = Key.I, ttt = 0;
            for (Key.I=(Key.I+1)%eN; ttt<eN; ttt++, Key.I=(Key.I+1)%eN)
                if (Key.used[Key.I])
                    break;
            if (ttt < eN) {
                dfs(depth+1);
            }

            Key.t[(newX*M + newY)*9] = false;
            Key.t[(newX*M + newY)*9 + verse[i]] = false;
            Key.t[get(x, y)*9 + i + 1] = false;
            Key.I = II;
            Key.now[Key.I] = mp(x, y);
        }
    }
}

void dfs2(int x, int y) {  // preprocess XY[][]: minimal #steps from the feed point to every cell
    for (int i=0; i<8; i++) {
        int newX = x + dx[i];
        int newY = y + dy[i];

        if (newX < 0 || newX >= N) continue;
        if (newY < 0 || newY >= M) continue;
        if (Null[newX][newY]) continue;

        if (XY[newX][newY] != -1 && XY[newX][newY] <= XY[x][y] + 1) continue;
        XY[newX][newY] = XY[x][y] + 1;
        dfs2(newX, newY);
    }
}

void solve() {
    scanf("%*s%d", &P);

    scanf("%*s%*d");
    scanf("%d%d", &N, &M);

    scanf("%*s%*d");
    scanf("%d%d", &fX, &fY);



    Key.t.resize(N * M * 9);
    Key.len.resize(N * M);

    scanf("%*s%d", &eN);
    for (int i=0; i<eN; i++) {
        int tx, ty;
        scanf("%d%d", &tx, &ty);

        Key.now.pb(mp(tx, ty));
        Key.t[get(tx, ty) * 9] = true;
        Key.used.pb(true);
    }

    int nN;
    scanf("%*s%d", &nN);
    memset(Null, 0, sizeof(Null));
    memset(Null2, 0, sizeof(Null2));
    for (int i=0; i<nN; i++) {
        int tx, ty;
        scanf("%d%d", &tx, &ty);
        Null[tx][ty] = true;
        Null2[tx][ty] = true;
    }



    memset(XY, -1, sizeof(XY));
    XY[fX][fY] = 0;
    dfs2(fX, fY);

    node base = Key;
    for (UP=6; ; UP++) {  // iterative deepening on the total trace length
        dfs(0);
        Key = base;
    }
}

int main() {
    solve();
}

Result

You can check out http://ascii.io/a/3644 for the result.

CAVEAT: We use many heuristics in the program, so it does not guarantee success on every run.

Memory Barriers: a Hardware View for Software Hackers

In Bookmark by Mingxing Zhang

URL: http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf

This paper gives a clear explanation of the techniques used to increase cache utilization, and justifies the existence of memory barriers as a necessary evil required to enable good performance and scalability.

Its general structure is as follows:

  1. Presents the structure of a cache;
  2. Explains how cache-coherence protocols ensure that the different per-CPU caches stay coordinated;
  3. Describes a technique called the "store buffer", which can be used to ease the performance loss caused by invalidate-acknowledgement message passing;
  4. Gives an example of why write memory barriers are needed -- store buffers reorder the execution of instructions for better performance, so we need a way to ensure that certain critical orderings are not undermined;
  5. Outlines another technique, the "invalidate queue", which makes invalidate-acknowledgement messages arrive more quickly;
  6. Gives a corresponding example of why read memory barriers are needed -- invalidate queues cause another kind of reordering, which read memory barriers prevent.
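The write/read barrier pairing in points 4 and 6 corresponds to release/acquire ordering in C++11 atomics. Here is a minimal sketch (my own illustration, not an example from the paper):

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // plain data with no ordering of its own
std::atomic<bool> ready{false};  // flag that carries the ordering

void writer() {
    payload = 42;                                  // ordinary store
    ready.store(true, std::memory_order_release);  // acts as the write barrier:
                                                   // payload is visible before ready
}

int reader() {
    while (!ready.load(std::memory_order_acquire)) // acts as the read barrier:
        ;                                          // later loads are not hoisted above it
    return payload;                                // guaranteed to observe 42
}
```

Without the release/acquire pair, a store buffer or invalidate queue would be free to make the reader observe `ready == true` while still seeing the stale `payload`.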

The paper also includes many quizzes and discussions of real implementations (e.g., ARM, IA64).

Scalable Event Multiplexing: epoll vs. kqueue

In Bookmark by Mingxing Zhang

URL: http://www.eecs.berkeley.edu/~sangjin/2012/12/21/epoll-vs-kqueue.html

This article gives a fairly good review of common event-multiplexing techniques, especially the differences between epoll (in Linux) and kqueue (in BSD).

There are also some interesting anecdotes in the post, such as:

  • Stateful event-multiplexing techniques (epoll, kqueue) are derived from the paper by Banga et al. published at USENIX ATC 1999.
  • kqueue is technically superior to epoll.
  • It is often quoted that “In Unix, everything is a file”, but it is not always true.

Pitfalls of Object Oriented Programming

In Bookmark by Mingxing Zhang

URL: http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf

Since a single CPU cycle nowadays takes far less time than a RAM access, it has become critical and profitable to make better use of the cache.

In this slide deck, Tony Albrecht states that, on modern hardware, excessive encapsulation in OO is BAD. Following OOP principles, an instantiated object generally contains all the data associated with it. But when a calculation only needs a small portion of its fields, this layout can cause many avoidable cache misses.
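A common way to see the cost (my own illustration, in the spirit of the slides) is to compare an array of "fat" objects against a structure-of-arrays layout:

```cpp
#include <vector>

// A "fat" object drags all of its fields through the cache even when
// only one of them is needed.
struct FatParticle {
    double x, y, z;     // position -- the only data the loop below uses
    double vx, vy, vz;  // velocity
    char name[64];      // cold metadata that still occupies cache lines
};

double sumXFat(const std::vector<FatParticle>& ps) {
    double s = 0;
    for (const auto& p : ps) s += p.x;  // loads ~112 bytes per object to use 8
    return s;
}

// Structure-of-arrays layout: the same loop walks a dense array of x's,
// so nearly every byte brought into the cache is useful.
struct Particles {
    std::vector<double> x, y, z;
};

double sumXSoA(const Particles& ps) {
    double s = 0;
    for (double v : ps.x) s += v;
    return s;
}
```

Both functions compute the same sum; the difference is purely in how many cache lines the traversal touches.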

Fast and accurate long-read alignment with Burrows–Wheeler transform @ BIOINFORMATICS Vol. 26 no. 5 2010

In Bookmark by Mingxing Zhang

URL: http://dl.acm.org/citation.cfm?id=1741825

A new generation of faster methods for finding DNA sequence matches has been developed since 2000. They are tailored for short-read (< 100 bp) alignment and are 10–1000 times faster than general algorithms such as BLAST. However, reads produced by the new sequencing technologies are no longer short, which makes efficiently aligning long reads against a long reference sequence a new challenge for the development of alignment tools.

This paper presents the Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW), a novel algorithm that aligns long sequences of up to 1 Mb against a large sequence database (e.g., the human genome) with only a few gigabytes of memory. The algorithm is as accurate as SSAHA2, more accurate than BLAST, and several to tens of times faster than both.

As with other algorithm papers, if you are familiar with the background it is easy to grasp the core algorithm within a few minutes. So I will just recommend some supporting material here:

  • First, Heng Li et al.'s previous paper "Fast and accurate short read alignment with Burrows–Wheeler transform" gives a more detailed explanation of the building blocks of this algorithm, such as the prefix trie and the Burrows–Wheeler transform. It is a good idea to read that paper in advance.
  • Second, if you still have trouble understanding how the FM-index works, there is an excellent diagrammatic presentation in Alex's post "FM-Indexes and Backwards Search".
  • Finally, if you have no idea what "read alignment" is but are still interested in this exquisite application of the FM-index, you can read these two slide decks [1, 2] first.
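To make the "backwards search" concrete, here is a toy FM-index counting query of my own (a naive O(n) rank scan stands in for the sampled Occ tables that real aligners use):

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Build the BWT of `s` (which must end in a unique sentinel '$') by sorting
// all rotations and taking the last column. Quadratic, fine for a toy.
std::string bwt(const std::string& s) {
    int n = s.size();
    std::vector<int> rot(n);
    for (int i = 0; i < n; i++) rot[i] = i;
    std::sort(rot.begin(), rot.end(), [&](int a, int b) {
        for (int k = 0; k < n; k++) {
            char ca = s[(a + k) % n], cb = s[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return false;
    });
    std::string last;
    for (int i : rot) last += s[(i + n - 1) % n];
    return last;
}

// Count occurrences of `pat` in the original text via backward search on
// the BWT. C[c] = #chars smaller than c; occ(c, i) = #c's in L[0, i).
int backwardSearch(const std::string& L, const std::string& pat) {
    std::map<char, int> C;
    for (char c : L) C[c]++;
    int sum = 0;
    for (auto& kv : C) { int t = kv.second; kv.second = sum; sum += t; }
    auto occ = [&](char c, int i) {
        return (int)std::count(L.begin(), L.begin() + i, c);
    };
    int lo = 0, hi = (int)L.size();
    for (int k = (int)pat.size() - 1; k >= 0; k--) {  // pattern right to left
        char c = pat[k];
        if (!C.count(c)) return 0;                    // char absent from text
        lo = C[c] + occ(c, lo);
        hi = C[c] + occ(c, hi);
        if (lo >= hi) return 0;                       // interval became empty
    }
    return hi - lo;
}
```

Each iteration narrows the suffix-array interval of rows prefixed by the remaining pattern suffix, which is exactly the picture drawn in Alex's post.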

Production-Run Software Failure Diagnosis via Hardware Performance Counters @ ASPLOS'13

In Bookmark by Mingxing Zhang

URL: http://dl.acm.org/citation.cfm?id=2451128

This paper presents PBI, a system that uses existing hardware performance counters to diagnose production-run failures caused by sequential and concurrency bugs with low overhead (< 10%).

First, note that this tool is used to diagnose bugs, not to detect or prevent them during production runs. This purpose allows PBI to leverage statistical methods. As a consequence, you must collect enough failure runs before diagnosing a bug, which usually requires knowing how to trigger it.

Personally, I think the most important observation in this paper is that a wide variety of common software bugs are reflected in a small set of hardware events supported by hardware performance counters.

  • For concurrency bugs, those events are cache-coherence events (state changes in the MESI protocol). For example, the I-predicate and S-predicate can differentiate failure runs from success runs for all four types of atomicity violations. More detailed discussion can be found in Sec. 3.1.2.
  • For sequential bugs, PBI uses branch-related events, because many semantic bugs involve wrong control flow.

This paper also proposes a statistical method to identify which events are highly correlated with failure runs.
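The statistical idea can be sketched as follows. This is a toy score of my own (in the spirit of statistical debugging, not PBI's exact formula): an event is suspicious if it shows up much more often in failure runs than in success runs.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

struct Run { bool failed; std::set<std::string> events; };

// Toy scoring: an event's score is the fraction of failure runs in which it
// occurred minus the fraction of success runs in which it occurred, so +1
// means "perfectly correlated with failure" and 0 means "uninformative".
std::map<std::string, double> scoreEvents(const std::vector<Run>& runs) {
    std::map<std::string, int> failHit, succHit;
    int nFail = 0, nSucc = 0;
    for (const auto& r : runs) {
        (r.failed ? nFail : nSucc)++;
        for (const auto& e : r.events) (r.failed ? failHit : succHit)[e]++;
    }
    std::map<std::string, double> score;
    std::set<std::string> all;
    for (const auto& kv : failHit) all.insert(kv.first);
    for (const auto& kv : succHit) all.insert(kv.first);
    for (const auto& e : all) {
        double f = nFail ? (double)failHit[e] / nFail : 0;
        double s = nSucc ? (double)succHit[e] / nSucc : 0;
        score[e] = f - s;
    }
    return score;
}
```

With enough runs, the top-scoring event (e.g., an I-state cache event at a particular instruction) points directly at the buggy access.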

All about Eve: Execute-Verify Replication for Multi-Core Servers @ OSDI'12

In Bookmark by Mingxing Zhang

URL: http://dl.acm.org/citation.cfm?id=2387903

State machine replication (SMR) is a powerful fault-tolerance technique: it forces replicas to deterministically process the same sequence of requests so that they produce the same sequence of outputs.

But this technique does not suit parallel systems, because if different servers interleave requests' instructions in different ways, the states and outputs of correct servers may diverge even when no faults occur. As a result, most SMR systems require servers to process requests sequentially.

In contrast, Eve partitions requests into batches and allows different replicas to execute the requests within each batch in parallel. After each batch, it verifies whether the results of these parallel executions (i.e., the system's important state and output at each replica) match across enough replicas. If such a correct state/output can be identified, Eve makes the incorrect replicas issue an incremental state-transfer request to the other replicas (by leveraging a Merkle tree). However, if too many replicas diverge and the correct state/output cannot be identified, Eve guarantees safety and liveness by rolling back and sequentially, deterministically re-executing the requests.
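The execute-verify round can be modeled with a toy sketch of my own (replica state is a string, a plain hash stands in for Eve's Merkle tree, and "state transfer" is a whole-state copy rather than an incremental one):

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy model: "executing a batch" mutates the replica's state string; a
// divergent replica applies the requests in a different order, mimicking
// a bad interleaving.
struct Replica { std::string state; };

// One execute-verify round: run the batch on every replica, then compare
// state hashes. If a majority agrees, divergent replicas repair themselves
// by copying the majority state; otherwise everyone rolls back and
// re-executes the batch deterministically.
void executeVerify(std::vector<Replica>& reps,
                   const std::vector<std::function<void(Replica&)>>& execs,
                   const std::function<void(Replica&)>& sequentialExec) {
    std::vector<std::string> before;
    for (const auto& r : reps) before.push_back(r.state);  // kept for rollback
    for (size_t i = 0; i < reps.size(); i++) execs[i](reps[i]);

    std::hash<std::string> h;
    std::map<size_t, int> votes;                           // state hash -> count
    for (const auto& r : reps) votes[h(r.state)]++;

    size_t majority = 0; int best = 0;
    for (const auto& kv : votes)
        if (kv.second > best) { best = kv.second; majority = kv.first; }

    if (best > (int)reps.size() / 2) {
        // Verify succeeded: repair the divergent replicas via state transfer.
        std::string good;
        for (const auto& r : reps) if (h(r.state) == majority) { good = r.state; break; }
        for (auto& r : reps) if (h(r.state) != majority) r.state = good;
    } else {
        // Too much divergence: roll back and re-execute sequentially.
        for (size_t i = 0; i < reps.size(); i++) {
            reps[i].state = before[i];
            sequentialExec(reps[i]);
        }
    }
}
```

The design point this illustrates is that speculation is safe precisely because the pre-batch state is retained until verification passes.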

Since divergence seriously impairs Eve's performance, it uses a mixer stage that applies application-specific criteria to produce groups of requests that are unlikely to interfere.

As a side effect, Eve is especially useful for tolerating concurrency bugs. First, Eve's mixer reduces the likelihood of triggering latent concurrency bugs by attempting to run only unlikely-to-interfere requests in parallel; second, Eve's execute-verify architecture allows it to detect and recover when concurrency causes executions to diverge.

In their experiments, Eve achieves a 2.6x–6.5x speedup over traditional sequentially executing replicas, and up to a 4.7x speedup over the Remus primary-backup system. It also found new concurrency bugs in the H2 database.