Sunday, April 28, 2013

Iterating visitables


A common design pattern in programming is the 'visitor design pattern' (see here). The idea is that, the caller passes a visitor object to the callee, and the callee uses this object to make calls as it visits some internal structure that it maintains, passing the info about the visited objects as a parameter to the calls made on the visitor object.

Here is a real-world use case I run across recently: I was using a persistent R-tree library written in C++. R-tree is a tree used for indexing spatial data, such as 2-d and 3-d objects. This particular implementation provided a visitor interface, where querying the R-tree index was done by passing a visitor object to its query method. Every time a node of the tree is visited during the answering of the query, the query method makes a call to the visitor object, passing in the node being visited as a parameter.

An alternative approach, typically used with indexes like this is iteration (cursors in DB lingo). An iterator can be used to iterate through the results. In my use case, I needed to provide an iterator interface over the R-tree index. Doing this kind of adaptation requires use of non-traditional abstractions, which we will get to soon. Take a few minutes to think about a solution, and you will see that it is not at all straightforward.

Btw, I should note that the problem of providing an iterator interface on top of a visitor one is quite real, and something you can run across in real life programming, as I did. See here for a StackOverflow question about the same issue (which I tried to answer). It is also worth noting that the reverse problem of providing a visitor interface over an iterator one is trivial.

Here I will give a simple example that involves a counter to illustrate the problem. Let us first introduce abstract classes that define the concepts of visitor and visitable.

template<typename Data>
class Visitor
{
public:
  virtual ~Visitor() { }
  virtual bool visit(Data const & data) = 0;
};

template <typename Data>
class Visitable
{
public:
    virtual ~Visitable() {}
    virtual void performVisits(Visitor<Data> & visitor) = 0;
};

Let us illustrate their use with a simple counter that is visitable:

class Counter : public Visitable<int>
{
public:
    Counter(int start, int end)
        : start_(start), end_(end) {}
    void performVisits(Visitor<int> & visitor)
    {
        bool terminated = false;
        for (int current=start_; !terminated && current<=end_; ++current)
            terminated = visitor.visit(current);
    }
private:
    int start_;
    int end_;
};

Here, we have a Counter class that provides a Visitable interface for visiting numbers between 1 and n. Using it is rather straightforward:

class CounterVisitor : public Visitor<int>
{
public:
    bool visit(int const & data)
    {
        std::cerr << data << std::endl;
        return false; // not terminated
    }
};

int main(void)
{
    Counter counter(1, 100);
    CounterVisitor visitor;
    counter.performVisits(visitor);
    return EXIT_SUCCESS;
}

How can we convert this code to use iteration, rather than visitation? Let's first define the concept of an iterator:

template<typename Data>
class Iterator
{
public:
    virtual ~Iterator() {}
    virtual bool isValid()=0;
    virtual void moveToNext()=0;
    virtual Data const & getData()=0;
};

It is worth noting that I am not using STL style code here, to make the exposition more general. I am following a mainly OO approach. Here is what we want to do ideally:

int main(void)
{
    Counter counter(1, 100);
    Iterator<int> & it = magicFunction(counter);
    for (; it.isValid(); it.moveToNext()) {
        int const & data = it.getData();
        std::cerr << data << std::endl;
    }
    return EXIT_SUCCESS;
}

Below are two ideas on how to achieve this, both with serious shortcomings.

Idea 1) We write a wrapper that collects all data items into a collection and then we can provide the usual iteration interfaces.

Unfortunately, this idea has a critical shortcoming. It requires space in the order of number of items visited. In some applications this is unacceptable. A good example is the R-tree example I gave earlier, where the number of query results might be too large. If the only thing you want to do is to compute some aggregate over the results, or perhaps just look at the first k of the results, this approach will be very costly in some cases. So we rule it out as a general solution.

Idea 2) Another idea is to use threads.

We can spawn a new thread that will execute the performVisits method.  We can then start waiting on a conditional variable until the spawned thread signals us about a result. When a visit call is made on the spawned thread, the current data item will be saved into a shared variable, the spawned thread will signal us about the readiness of the result, and it will start waiting on another conditional variable until we signal it for the next request. This way our iterator can return the item stored in the shared variable when the getData method is called. When the moveToNext is called for the next step of the iteration, we could signal the spawned thread's conditional variable, so that it can do the next visit. Meanwhile we can wait on our conditional variable for the results. This could continue until the the visit is completed.

Unfortunately, this switching between threads is really costly if we end up doing it on a per-item basis. Alternatively, we can buffer some data items to improve performance. In general, threads are somewhat heavyweight, and also this kind of implementation is somewhat dirty. It can be made to work, however.

The second idea is actually moving us in the right direction: Threads have their own stack! So what we need is a different stack. We do not need concurrent execution, however. Instead we need alternating execution between two contexts with their own stacks. An abstraction that can provide this is 'coroutines'. See here, if you are not familiar with coroutines. Coroutines enable two functions to suspend and resume each other. That's where the name 'co' comes from. I won't describe them in detail here.

C++, being the flexible language it is, can support coroutines via a library. boost::coroutines is one such library and surprisingly it is very easy to use (assuming 'easy' can be applied to anything in boost).

Below, I will provide the code for a class that can convert a visitable into an iterator. Fasten your seatbelt:


template<typename Data>
class VisitableIterator : public Iterator<Data>
{
private:
    typedef boost::coroutines::coroutine<void()> coro_t;
    typedef coro_t::caller_type caller_t;
public:
    VisitableIterator(Visitable<Data> & visitable)
        : valid_(true), visitable_(visitable)
    {
        coro_ = coro_t(boost::bind(&VisitableIterator::visitCoro, this, _1));
    }
    inline bool isValid()
    {
        return valid_;
    }
    inline Data const & getData()
    {
        return visitor_.getData();
    }
    inline void moveToNext()
    {
        if(valid_)
            coro_();
    }
private:
    class InternalVisitor : public Visitor<Data>
    {
    public:
        InternalVisitor() {}
        inline bool visit(Data const & data)
        {
            data_ = &data;
            (*caller_)(); // return back to caller
            return false;
        }
        inline void setCaller(caller_t & caller)
        {
            caller_ = &caller;
        }
        inline Data const & getData()
        {
            return *data_;
        }
    private:
        Data const * data_;
        caller_t * caller_;
    };
    void visitCoro(coro_t::caller_type & caller)
    {
        visitor_.setCaller(caller);
        visitable_.performVisits(static_cast<Visitor<Data> &>(visitor_));
        valid_ = false;
    }
private:
    bool valid_;
    coro_t coro_;
    InternalVisitor visitor_;
    Visitable<Data> & visitable_;
};

In the code above, we see a class called VisitableIterator. The class provides a constructor that takes a Visitable object as parameter. Since it extends an Iterator, it provides the appropriate methods to perform iteration over the visitable object.

The code makes use of a coroutine. The coroutine is stored in the variable coro_. The code for the coroutine is specified in the visitCoro function. The visitCoro function takes as a parameter a caller object. The caller object represents the caller of the coroutine and is used to transfer the control back to the caller when needed. The visitCoro function makes use of a visitor object of type InternalVisitor, named visitor_. This is the object that implements the visit function. After performing each visit, it saves the current data item and transfers the control back to the caller. Since it needs to perform this transfer, it needs access to the caller object, which is kept in the caller_ variable. There are a few other details that should be obvious from the code.

Here is how we use this:

int main(void)
{
    Counter counter(1, 100);
    VisitableIterator<int> iter(static_cast<Visitable<int>&>(counter));
    for (; iter.isValid(); iter.moveToNext()) {
        int data = iter.getData();
        std::cerr << data << std::endl;
    }
    return EXIT_SUCCESS;
}

That was not too hard!

I have not yet investigated performance implications. Would update once I do.