yurai blog

Wednesday, February 4, 2015

Smart Message Dispatcher

1. Introduction

Message dispatchers are central part of event-driven systems. There are many approaches to message dispatching. Starting from simple where handlers are part of code to complex beasts with possibility of adding handlers in run-time and crazy C++14 implementation[1]. Recently I discovered really smart and elegant implementation which I would like to demonstrate here.

2. Motivation

Suppose we have some message-driven system and want to transfer some message over transport layer. Sending is piece of cake. Receiving is not:)
We cannot write receiving code just as:

template<typename Message> Message receive()
{
 ...
}

SomeMsg msg;
msg.receive();
...

The reasons are simple. We have never known real type of received message. With a different type a different handler is associated so we must know the real message type.
Dispatching always ends with downcasting through message class hierarchy in shared memory system or adding some kind of getTypeId inside every message and therefore testing that with id from deserialized incoming message header.
As a result there is a need to create a class dispatching received message to right handler. We want to focus on case where number of handlers is relatively small so adding handlers in run-time will be unnecessary. We would like to have the ability to write following use case:

dispatcher
    .handle<messages::enter_pin>(
        [&](messages::enter_pin const& msg)
        {
            ...
        })
    .handle<messages::pin_correct>(
        [&](messages::pin_correct const& msg)
        {
            ...
        })
    .handle<messages::pin_incorrect>(
        [&](messages::pin_incorrect const& msg)
        {
            ...
        });

3. Solution

At the beginning I decided to show code which will be equivalent with code above and which should be actually generated after inlining dispatcher internals:

receiver.wait_on_msg();
    if (receiver.received_msg_with_type(enter_pin::id))
        dispatch<enter_pin>();
    else
    if (receiver.received_msg_with_type(pin_correct::id))
        dispatch<pin_correct>();
    else
    if (receiver.received_msg_with_type(pin_incorrect::id))
        dispatch<pin_incorrect>();

where dispatch is something like

Msg msg;
receiver.deserialize_from_buffer(buffer);
handler(msg);

OK. So the problem is well defined. How to transform multiple calls of handle with handlers in form of lambdas to nested conditionals choosing right handler looking on message id? Solution is interesting and clever and comes from excellent book "C++ Concurrency in Action" written by Anthony Williams. Implementation first, comments after.

template<typename previous_dispatcher, typename Msg, typename Func>
class template_dispatcher
{
    std::shared_ptr<remote_transport::receiver> receiver;
    previous_dispatcher* prev;
    Func handler;
    bool chained;

    template_dispatcher(template_dispatcher const&) = delete;
    template_dispatcher& operator=(template_dispatcher const&) = delete;

    template<typename Dispatcher, typename OtherMsg, typename OtherFunc> friend class template_dispatcher;

    void wait_and_dispatch()
    {
        receiver->wait_on_msg();
        dispatch();
    }

    bool dispatch()
    {
        char id = Msg::message_id();
        if (receiver->received_msg_with_type(id))
        {
            Msg msg;
            receiver->deserialize_from_buffer(msg);
            handler(msg);
            return true;
        }
        else
            return prev->dispatch();
    }

public:

    template_dispatcher(template_dispatcher&& other) :
        receiver(other.receiver),
        prev(other.prev),
        handler(std::move(other.handler)),
        chained(other.chained)
    {
        other.chained = true;
    }

    template_dispatcher(std::shared_ptr<remote_transport::receiver> preceiver,
                        previous_dispatcher* prev_, Func&& handler_) :
        receiver(preceiver),
        prev(prev_),
        handler(std::forward<Func>(handler_)),
        chained(false)
    {
        prev_->chained = true;
    }

    template<typename OtherMsg, typename OtherFunc>
    template_dispatcher<template_dispatcher, OtherMsg, OtherFunc> handle(OtherFunc&& handler_)
    {
        return template_dispatcher<template_dispatcher, OtherMsg, OtherFunc>(
            receiver, this, std::forward<OtherFunc>(handler_));
    }

    ~template_dispatcher()
    {
        if (!chained)
            wait_and_dispatch();
    }
};



class dispatcher
{
    std::shared_ptr<remote_transport::receiver> receiver;
    bool chained;

    dispatcher(dispatcher const&) = delete;
    dispatcher& operator=(dispatcher const&) = delete;

    template<typename Dispatcher, typename Msg, typename Func> friend class template_dispatcher;

    void wait_and_dispatch()
    {
        receiver->wait_on_msg();
        dispatch();
    }

    bool dispatch()
    {
        char id = last_msg::message_id();
        if (receiver->received_msg_with_type(id))
        {
            assert(false);
        }
        return false;
    }

public:
    dispatcher(dispatcher&& other) :
        receiver(other.receiver),
        chained(other.chained)
    {
        other.chained = true;
    }

    explicit dispatcher(std::shared_ptr<remote_transport::receiver> preceiver) :
        receiver(preceiver),
        chained(false)
    {
    }

    template<typename Message, typename Func>
    template_dispatcher<dispatcher, Message, Func> handle(Func&& handler)
    {
        return template_dispatcher<dispatcher, Message, Func>(
            receiver, this, std::forward<Func>(handler));
    }

    ~dispatcher()
    {
        if (!chained)
            wait_and_dispatch();
    }
};

Above message dispatcher implementation is interesting for many reasons. It's short and compact. Moreover many modern C++11 features are being involved. Let's discuss how does this code work and why it works:

We introduced main dispatcher class and helper class called template_dispatcher. The idea behind those two classes is very simple. We want to compare the type of received message from buffer with the type registered on last handler. If it matches then we execute deserialization and run handler. If not then we continue process comparing the type of received message with the type registered on one before last handler and so on. When we cannot match message with any handler then program is terminated by assert(false).
Method dispatcher::handle is the only one public method except constructors and destructor. This method responsibility is limited to creating temporary template_dispatcher object with forwarding handler. This trick let us call handle again (but now template_dispatcher::handle) in client code with different handler. Inside template_dispatcher::handle creating temporary template_dispatcher object takes place so we have come full circle.
Sooner or later flow in client code achieve last handler which causes a call to last template_dispatcher::handle on some temporary template_dispatcher. After that destructor must be called. This is the moment when actual dispatching is running. Finally. Only last temporary template_dispatcher is not chained so we have call to wait_and_dispatch.
Wait_and_dispatch of course waits for message and dispatches it to the right handler:) First step is to receive message inside receiver->wait_on_msg call. Second step is to iterate over dispatchers by prev pointers (they build one directly linked list) and testing the type of received message with the message type associated with current template_dispatcher/dispatcher. The first dispatcher is always object from dispatcher class (not template_dispatcher) so when message types doesn't match then assert(false) is called.
Friendship is used for access to chained field. Classes are not not copy able (proper constructors are disabled) but moveable. I'm not sure if this is important because both classes are very lightweight. Explicit keyword is used for preventing passing e.g char value as pointer to receiver.
perfect forwarding in dispatcher and template_dispatcher constructors by std::forward and by std::move. Using perfect forwarding here may be unnecessary because for dispatching hot spot will be I/O probably. Anyway I didn't make any measurement to prove the last statement.

Notice how versatile presented solution is. Receiver from remote_transport namespace may pull messages from remote machine by network or may work on the same machine as sender using IPC or even ITC (inter-thread communication) mechanism. We may choose the kind of transport layer we want and dispatcher will handle this in proper way. Another interesting thing about dispatchers is that they usually cooperate with objects like handlers or event queues. Our dispatcher may be part of greater pattern like e.g. Proactor. That's quite common pattern which lets e.g handle many network data streams without explicit concurrency in simple way.
If you are interested you can take a look on Boost Asio internals which was implemented with Proactor pattern in mind[2].

[1] see slides (in polish, but for code doesn't matter): http://cppwroclaw.pl/dokuwiki/_media/spotkania/007/03_dyspozytor_w_cpp11.pdf
[2] http://www.boost.org/doc/libs/1_47_0/doc/html/boost_asio/overview/core/async.html

Sunday, October 19, 2014

Clean Code Approach For Coffee Machine Example

"Hi, I'm Uncle Bob and THIS IS Clean Code". I think that some of us who are initiated already know what this post will be about :)
Recently during work when I was watching some Clean Code episodes (first sentence from this post is beginning sentence from every episode) I realized something.
I have noticed there is so little free information about Clean Code methodologies especially for C++ in the Internet. Lack of complete non-trivial examples about Google Test and Google Mock in web is issue as well. That's why I decided to show demonstration based on popular coffee machine example. The purpose of this post is to present basic and more advanced Google Framework features. I also want to focus on some deeper aspects of Clean Code like clean architecture and SOLID. Notice I'm not going to explain some details to minimize post size so be warned that this post is NOT Clean Code tutorial and NOT Gtest/Gmock tutorial.

1. Testing and mocking in a nutshell

Let's talk about some basic concepts associated with unit testing and mocking. Unit testing is testing on methods/functions level. Every single unit test usually contains part with preparing input data for method (Given), actual testing scenario (When) and checking results with assertion sequence (Then). Suppose we have some function foo with this prototype:

bool foo(Nam nam);

Test cases for foo may look like:

void test_foo_return_true()
{
   // Given:
   Nam nam(...); //prepare nam 

   // When:
   bool foo_result = foo(nam);

   // Then:
   ASSERT_TRUE(foo_result);
}

void test_foo_return_false()
{
   // Given:
   Nam nam(...); //prepare nam

   // When:
   bool foo_result = foo(nam);

   // Then:
   ASSERT_FALSE(foo_result);
}

Now we should explain why unit testing is important and extra effort put in creating tests will repay. Let's see advantages of unit testing:

set of well written unit tests is core concept of Clean Code
with set of working tests we can get rid of fear of changing/refactoring our project. Every refactoring iteration ends with production code state in which all tests pass
with set of working tests we can apply TDD (Test-driven development) and many short iteration cycles test-refactor-run
tests are form of documentation[1]

Now it's time on mocking. Mocking is replacing real objects by objects with the same interface but another functionality. Take a look on previous example with some details revealed and tests rewritten:

bool foo(Nam nam)
{
   if (nam.bar())
      return quk()
}
void test_foo_bar_return_true()
{
   EXPECT_CALL(mock_nam, bar)
      .Return(true);

   foo(mock_nam);
   ... 
}
void test_foo_bar_return_false()
{
   EXPECT_CALL(mock_nam, bar)
      .Return(false);

   foo(mock_nam);
   ... 
}

What's going on here? We have function foo again and we want to cover foo in 100% so we must force nam.bar() to return true in one test and false in another. It can be challenging in some cases. For example preparing appropriate nam state may be time consuming. Instead we will inject mock_nam to foo and will call mocked bar returning value explicitly set in EXPECT_CALL.

Why mock?

to simplify testing. As I said before sometimes creating real resource representing real operation like database/network connection or reading file may be painful and expensive. Replacing those objects by mocks may simplify testing and decrease its execution time
to speed up prototyping. If we are stuck with sub-optimal design changing complex class hierarchy in general is not easy. Adding mocked class instead full-featured is way easier task
to decrease dependencies so instead relying on external libraries/frameworks using hand-written mocked calls may better choice
to help designing system

Of course there is some disadvantage of mocking. It's not free. Mocking may be really time-consuming if tested logic is non-trivial. Mocking (in gmock) may force us to reimplementing/refactoring some parts of code only for introducing interfaces and allowing DI. You will see those issues in next paragraphs.

2. Clean Code concepts

I used "Clean Code" term many times without explaining what it is. Now it's the moment.
Definition of Clean Code is not easy. It is quite broad mix of philosophy and methodologies created by Robert C. Martin known as Uncle Bob and customized for object oriented design. Concepts related with Clean Code:

good names in code, short functions/methods, avoiding comments, simple classes/structures, good formatting. These are probably the most basic rules, according Uncle Bob: "Clean code is simple and direct. Clean code reads like well-written prose".
SOLID and design pattern. These are important tools preventing code smells and fragility in object oriented code
suit of tests and Test-driven development
clean architecture[2]. The most high level part of Clean Code says that every object oriented system should contain few layers. It' s recommended to have the top layers based on user use cases and domain concepts. Those layers should be independent of frameworks/libraries which are only pluggable tools

3. Google Test and Google Mock

Not much to say. Gtest and Gmock are probably the best testing frameworks for C++ because:

they have many features: from basic to very advanced[3]
they are portable and platform independent. Support for Linux, Mac OS X, Windows, and more. Moving coffee machine project from Windows 7 to Ubuntu 14.04 was immediate and painless
have good failures reporting with providing much information about problems in runtime
easy to start
good support and some community (http://googletesting.blogspot.com/)

4. Coffee machine design

In previous paragraphs we presented basic tools which will be used to solve our task. I believe this is the right moment when we can define it. We would like to create an application to simulate real coffee machine. It should provide functionalities like:

ingredients and menu settings
coffee choosing
coin insertion
coffee receiving
taking change

For simplicity I assumed machine was synchronous and one-threaded. Every object oriented system is use case driven in his nature, so good candidate for first step in design would be prepared with use case diagram[4]. In the name of time saving I will show only one use case. Use case diagram is great tool to show "big picture" of system without any details like modules or classes.

From coffee machine's point of view we have 2 actors (implementation doesn't reflect this separation actually) and 6 possible use cases which correspond to features posted above.With set of well defined use cases designing system class hierarchy is straightforward. In first iteration one use case may map on one method which produces following simple hierarchy expressed in UML:

Lets summarize our results. We have in this moment one big class called coffee_machine with well exposed interface. Unfortunately every single change in this class methods or inside some field will cause recompilation of entire class. Extending functionality and maintenance of such coffee machine will be a nightmare. It' s obvious that there are too many responsibilities for one class, which clearly breaks SRP[5]. The best what we can do now is splitting functionality into 2 categories: user interface functions and resource management functions, which leads to coffee_machine and coffee_machine_manager classes respectively.

Take a look again on current hierarchy on the left. Coffee making proccess takes place in coffee machine manager which have access to all ingredients and recipes.
Coffee_machine itself expose user interface and is responsible for cash managing. No inheritance and one aggregation relation (by manager field[10]). Not bad. Although this hierarchy is much better than previous but still have many design problems. I wouldn't be myself if I wouldn't mention about it.

lack of base class problem. What if we want to have many types of coffee machines with different work characteristics but common methods? To achieve that every separate coffee machine class will be copy of another with small changes. DRY[6] will be broken.
lack of "degrees of freedom" problem. What if we want to exchange coffee machine manager or switch to another ingredients storage data structure? Or hold user_credit data structure on another machine and transfer it by network connection? In those situations we can't rely on simple std::map. To achieve that we must rewrite our coffee machine with many switch statements with spaghetti fashion or add some kind of enum field with type. DRY will be broken. OCP will be as well. It's even worse because if client decide to use only one type of ingredients storage with one type of user credit then lots of code will be redundant and YAGNI will be broken. Life is hard.

All above issues originate from one important reason. It is fragility problem. Fragility is one of design smells. It occurs when tiny change of a system code in some place gives big changes in many other places. Everything is too rigid in this moment. As in first approach with one class hierarchy as here maintenance will be challenging.
Just to make it clear. In real life this design is not so bad but for complex system those violations would cause code bloat and in result fear of change because of fragility.

OK. There are many issues, is it possible to solve them all? Yes, it is. Clean Code is remedy.

as I said in second paragraph good object oriented software should have one or more abstract layers which exposes plugin system for concrete classes.
to overcome problem with rigidity of dependencies we introduce layer of interfaces (abstract_coffee_machine, abstract_coffee_machine_manager and abstract_storage. I know, I know. I should use word interface not abstract, language purists may complain). Thanks to dynamic polymorphism we will be able to inject concrete class with custom implementation.
notice that we have dependency inversion (DIP is applied) in this moment. Source code dependency is opposite to runtime dependency.
now our main classes (coffee_machine and coffee_machine_manager) derive from interfaces. There is an extra generic (template) interface - abstract_storage<Key, Value> for replacing concrete std::map.
it is worth to mention that introducing extra abstraction level by adding coffee_machine_manager helps to apply SLA[7]. Now coffee_machine may handle only use cases without operating on concrete data structures like map, set, vector etc.

Current hierarchy is very close to final. Anyway, if we want to test our implementation carefully we need mocking. Therefore we add one mock class per every class. In Google Mock mocks should derive from class we want to mock. Without abstract bases mocking wouldn't be possible because call may be mocked only if it is virtual[8].

5. Implementation

I haven't much to say about implementation because there isn't anything special:) Coffee_machine_manager code is trivial. The most interesting part is probably coffee_machine::compute_change_with_greedy_approach where I used this approach http://en.wikipedia.org/wiki/Change-making_problem. Notice how C++11 features like auto, std::initializer_list, nullptr and std::shared_ptr affect on code readability.
Take a look on some method implementation, for example get_available_coffees:

std::map<std::string,unsigned> coffee_machine::get_available_coffees() const
{
   const auto& ingredients = manager->get_ingredients();
   const auto& coffees = manager->get_coffees();
   std::map<std::string, unsigned> result;

   for (auto coffee : coffees)
   {
      auto& recipe_for_coffee = manager->get_recipe(coffee);
      bool available = check_ingredients_availability(recipe_for_coffee, ingredients);
      if (available)
         result.insert( { coffee, manager->get_price(coffee) } );
   }
   return result;
}

We can see here clean code techniques in action:

code is self-explaining. We have good, descriptive names of methods and variables like ingredients or get_recipe. With those names additional comments in code are unnecessary. With those names I haven't much to say about how does this code works ;)
short methods. Uncle Bob says that method longer than 3-4 lines is too long. I believe methods not bigger than my hand with no more than double nested statement (like if/for/while) are good enough. Every nested loop should be delegated to separated method (in this case I extracted for loop to check_ingredients_availability)
order of methods is important (like order public/private in class header). There are public methods on top and lower level methods with details on bottom. After opening source code file developer see right method implementation. If he want he can scroll down to take a look on details
const-correctness

Let's consider clean code influence on performance. Some people may think that organising code as set of many tiny methods may have bad impact on performance. Notice that today the most important thing about x86/x64 speed code is memory access politics not level of code indirection. Nowadays compilers perform aggressive inlining so many small functions are actually joined into few big one. Even performance gurus like Martin Thompson[9] confirms that clean code means better speed because we actually can see what happen. Anyway, benchmarking and generated code observation is always the best what we can do and if clean code really degrade performance in some parts of system we can rewrite those parts and violate clean code rules.

6. Tests

This paragraph is the most important part of whole post. Programming with Gmock is not piece of cake and lots of details are getting involved so I decided to split this part into 3 sections starting with simple unit test and ending with more advanced techniques. Also this is the right moment to opening coffee machine sources (link on very end of post) and following pieces of code which I demonstrate below.

6.1 Unit testing without Gmock

TEST(ut_coffee_machine_manager, add_recipe)
{
    // Given:
    coffee_machine_manager manager;
    std::set<std::string> coffees = {"cappucino"};

    // When:
    manager.add_recipe("cappucino", 250u, cappucino);

    // Then:
    EXPECT_EQ(coffees, manager.get_coffees());
    EXPECT_EQ(250u, manager.get_price("cappucino"));
    EXPECT_EQ(cappucino, manager.get_recipe("cappucino"));
}

Take a look on test case above. This test is quite simple that's why I will limit myself to short comments:

want to test coffee_machine_manager::add_recipe. We use approach Given/When/Then presented in paragraph 1. It is black box test, we set input and check output. Coffee_machine_manager::add_recipe implementation is very simple so mocking would be useless
EXPECT_EQ behaves like normal assertion but doesn't abort tests and therefore allow more than one failures to be reported
EXPECT_EQ is not symmetric what surprise many people. Google Test failures reporting works better if we put expected value first and actual value as second argument
in this case we just call add_recipe and check if all values were actually added to manager

6.2 Unit testing with Gmock

Until we don't use mocks everything is easy and boring. Coffee machine implementation is more complex than coffee machine manager so mocks are needed. Notice mocking require knowledge about tested class implementation so all coffee machine tests are actually white box tests.
Take a look on test case below:

TEST_F(coffee_machine_fixture, get_available_coffees_empty)
{
    // Given:
    reset_mocks();
    expect_index_operators_from_coffee_machine_constructor();

    const coffee_machine lcoffee_machine(mock_manager, user_credit_);
    const std::set<std::string> expected_coffees;
    const std::map<std::string unsigned> expected_coffees_and_price;

    // When:
    expect_getter_calls_from_get_available_coffees(expected_coffees);
    auto actual_coffees_and_price = lcoffee_machine.get_available_coffees();

    // Then:
    EXPECT_EQ(expected_coffees_and_price, actual_coffees_and_price);
}

Above test case tests coffee_machine::get_available_coffees. There is get_available_coffees implementation in previous chapter but I will put it again to see what actually is going on:

std::map<std::string,unsigned> coffee_machine::get_available_coffees() const
{
   const auto& ingredients = manager->get_ingredients();
   const auto& coffees = manager->get_coffees();
   std::map<std::string, unsigned> result;

   for (auto coffee : coffees)
   {
      auto& recipe_for_coffee = manager->get_recipe(coffee);
      bool available = check_ingredients_availability(recipe_for_coffee, ingredients);
      if (available)
         result.insert( { coffee, manager->get_price(coffee) } );
   }
   return result;
}

As get_available_coffees_empty says this ut tests case when there is no available coffees. Let's look on get_available_coffees code. That situation may occur e.g when every coffee recipe demands more ingredients than there are actually leave in storage inside coffee machine manager. Another case is when there is no coffees registered in coffee machine manager at all.
As I said before one of testing goal is covering implementation code in 100% (if possible). It means that we have to force program flow in get_available_coffees to get inside loop in one test case:

for (auto coffee : coffees)

To get full coverege we must achieve this statement as well:

if (available)

Now we have get_available_coffees_empty case. How to force get_available_coffees to return empty result? Forcing manager to get empty coffees from get_coffees is one approach. Easy.

Let's look on coffee_machine_fixture.get_available_coffees_empty again. We are going to force mock_manager to return empty expected_coffees. We will use EXPECT_CALL macros as in paragraph 1. Putting our expectations and mock calls behavior takes place inside expect_getter_calls_from_get_available_coffees which looks like:

void expect_getter_calls_from_get_available_coffees(
   const std::set<std::string> &expected_coffees)
{
   EXPECT_CALL(*mock_manager, get_ingredients())
      .WillOnce(ReturnRef(*mock_storage_));

   EXPECT_CALL(Const(*mock_manager), get_coffees())
      .WillOnce(ReturnRef(expected_coffees));
}

How does it works in short? We want to change behavior of this line:

 const auto& coffees = manager->get_coffees();

Take a look on class hierarchy in previous paragraph. Type of manager is abstract_coffee_machine_manager and abstract_coffee_machine_manager::get_coffees is (pure) virtual. Classes coffee_machine_manager and mock_coffee_machine_manager are derived from this abstract class. So behavior of get_coffees method may be changed in run-time by dynamic polymorphism. Indeed method get_coffees was overrided inside mock_coffee_machine_manager with this magic macro:

 MOCK_CONST_METHOD0(get_coffees, const std::set<std::string>&);

So now mock haves own get_coffees method which may be "programmed" by EXPECT_CALL-s from ut level. We did that by

EXPECT_CALL(Const(*mock_manager), get_coffees())
   .WillOnce(ReturnRef(expected_coffees));

We told our mocked get_coffees to return reference to expected_coffees which is empty set of strings. That's not end. We must dispatch in someway our mock to get_available_coffees method in run-time. We used dependency injection mechanism to inject mock_manager object to get_available_coffees method by constructor. Apart setting EXPECT_CALL-s we must check results by some kind of asserts. After calling tested code in When section by

lcoffee_machine.get_available_coffees();

we checked results by

EXPECT_EQ(expected_coffees_and_price, actual_coffees_and_price);

in Then section.
There is some extra code which may seems to be redundant. I'm talking about expect_index_operators_from_coffee_machine_constructor call. Before I will explain what does this call do I would like to ask you question. What if we comment all EXPECT_CALL-s from get_available_coffees_empty by commenting expect_index_operators_from_coffee_machine_constructor and expect_getter_calls_from_get_available_coffees ? Answers later.

6.3 More advanced unit testing

Over a dozen seconds ago you read a lot of information about how gmock handle mocking and how dependency injection and dynamic polymorphism help in that. It's great but what if we forget about mocking some call? In other words how gmock report lack of EXPECT_CALL on some method. Let's comment only one line with expect_index_operators_from_coffee_machine_constructor in coffee_machine_fixture.get_available_coffees_empty test case ad see what happen.

Gmock reports lack of EXPECT_CALL by "Uninteresting mock function call" message with name of call with arguments. In this case indexOperator method is not expected. Opposite situation take place when we add redundant EXPECT_CALL, e. g. we forget inject mock object but set EXPECT_CALL in test. Let's do that by duplicating following code in expect_getter_calls_from_get_available_coffees:

EXPECT_CALL(*mock_manager, get_ingredients())
   .WillOnce(ReturnRef(*mock_storage_));

Now we get something like that:

Gmock reports redundant EXPECT_CALL by "Unsatisfied and active" message
with name of call with arguments. In this case added before EXPECT_CALL is redundant. Nice. I persuade you to remember those 2 types of reports because there are very common and occur often.

Let's look at expect_index_operators_from_coffee_machine_constructor code:

void expect_index_operators_from_coffee_machine_constructor()
{
     static std::vector<std::pair<ECoin, unsigned>> coins_and_quantities =
         { { ECoin_10gr, 0 }, { ECoin_20gr, 0 }, { ECoin_50gr, 0 },
         { ECoin_1zl, 0 }, { ECoin_2zl, 0 }, { ECoin_5zl, 0 } };

     for (size_t i = 0;i < coins_and_quantities.size(); i++)
        EXPECT_CALL(*user_credit_, indexOperator(coins_and_quantities[i].first))
           .WillRepeatedly(ReturnRef(coins_and_quantities[i].second));
}

You may wonder why coins_and_quantities has static qualifier. There is some subtle issue associated with ReturnRef. This macro is required here because indexOperator returns reference not value[14]. Unfortunately indexOperator will be called outside expect_index_operators_from_coffee_machine_constructor (in lcoffee_machine constructor) so reference to coins_and_quantities[i].second will be used outside this function. So effect will be the same as returning by function reference to local object. Reference to destroyed object and crash...
Static is remedium because it makes lifetime of coins_and_quantities as long as lifetime of whole program.

Another interesting issue is class abstract_storage which was introduced in chapter 4 during design process. It's generic interface declared as

template<class Key, class Value>
class abstract_storage
{
public:
    virtual Value & operator[](const Key &key) = 0;
    virtual const std::shared_ptr<Value> find(const Key &key) const = 0;
    virtual std::map<Key, Value> convertToMap() const = 0;
    virtual ~abstract_storage() {}
};

and replacing concrete std::map. Motivation to writing this class was mocking operator[]. Unfortunately mocking C++ operator is not possible in gmock[11] but workaround exists[12]. I applied this solution in mock_storage.h file by adding indexOperator helper method:

MOCK_METHOD1_T(indexOperator, Value & (const Key &key));

Notice using MOCK_METHOD1_T macro for all methods because of templated type. Also take a look on public keyword in declaration line

class mock_storage : public abstract_storage<Key, Value>

Without this keyword strange things during compilation happen[13].

At the end of post I would like to put some guidelines regarding environment preparing (gtest, gmock and gcov).

gmock/gtest building under linux is trivial (configure + make) but be careful with with *.la objects located in lib directory. That's not static linked libraries (*.a)! That's libtool files and do not feed you linker those files as I did.
it may be surprising that you can control gtest execution behaviour by passing arguments to your binary e.g:
```
./coffee_machine --gtest_repeat=2 --gtest_shuffle --gtest_break_on_failure     
```
the most straightforward way to measure source code coverage is probably using tool called gcov. In first step you must pass following flags to gcc: -fprofile-arcs, -ftest-coverage. Also -fprofile-arcs for linker is required. Second step is running gcov and gathering reports with detailed code coverage informations per compilation unit. For example gcov coffee_machine.cpp will create detailed report in coffee_machine.cpp.gcov.

Link to coffee machine source: https://github.com/yurai007/coffee_machine

[1] consider situation with that you have project backup containing 2 CD - one with your source code of project and the second one with unit tests sources. Which CD do you prefer to lose?
[2] look at: http://blog.8thlight.com/uncle-bob/2012/08/13/the-clean-architecture.html
[3] https://code.google.com/p/googlemock/wiki/CookBook
[4] http://www.ivarjacobson.com/Use_Case_Driven_Development/
[5] SRP is first of SOLID principle. More information: http://en.wikipedia.org/wiki/SOLID_(object-oriented_design)
[6] DRY - Don't Repeat Yourself; many people understand that only in terms of avoiding duplication of code but DRY means also avoiding duplication of effort (e.g if every functionality in your system is only in one place but in case of bug you must fix this bug in many places you clearly break DRY). Notice that if you must change something in many places (production code, tests, scripts, configuration files, database schema etc) from the same reason you violate DRY.
[7] SLA - Single Level Of Abstraction
[8] That's not actually true. There are ugly hacks for mocking non-virtual methods:
https://code.google.com/p/googlemock/wiki/CookBook#Mocking_Nonvirtual_Methods
[9] http://www.infoq.com/presentations/top-10-performance-myths
[10] It's quite interesting how relations like aggregation or composition are handled by C++. If A compose B then it means that B is part of A and A is single owner of B. Both object are created and destroyed in the same time. In C++ it is expressed that B is field of A or B is nested class in A. Aggregation is handled in different way. When A aggregate B then B is some kind of pointer field or reference field. Lifetime of B is not limited by A lifetime and B may be used outside A scope as well by many other classes.
[11] see: http://stackoverflow.com/questions/6492664/how-to-create-a-mock-class-with-operator
[12] look at: https://code.google.com/p/googlemock/issues/detail?id=53
[13] I guess that without hint from SO about gcc message - "conversion from derived* to base* exists but is inaccessible" discovering source of real problem would be hard. Hint comes from http://stackoverflow.com/questions/10472848/conversion-from-derived-to-base-exists-but-is-inaccessible
[14] Using normal Return macro instead ReturnRef will cause compilation error with terrible long compiler message from gmock internals. BTW despite that gmock is really cool framework I must confirm that his source code is really nightmare. If you have ever heard statement that C++ is like seven legs cat ( http://yosefk.com/c++fqa/images/cat.png) but don't understand/believe take a look on gmock internals:)

Sunday, July 20, 2014

400K Aktywnych Połączeń TCP cz. I

Napisanie wydajnego, skalowalnego serwera od zawsze było nie lada wyzwaniem. Jeszcze kilkanaście lat temu programiści walczyli z rozwiązaniem tzw. C10K Problemu[1] czyli z napisaniem serwera skalowanego do 10 tysięcy klientów naraz. Aby wykorzystać maksymalnie maszynę trzeba było uciekać się do wspołbieżności, asynchroniczności i wsparcia ze strony OS-a. Wraz ze wzrostem mocy obliczeniowej komputerów oraz rozwojem internetu C10K przeobraził się w C1024K. W tym poście chcę pokazać jak stosunkowo niewielkim nakładem pracy napisać wydajny serwer mogący obsłużyć nawet 400K aktywnych klientów. I to wszystko na jednym boxie!

1. Ale to już było
Kilkanaście miesięcy temu na Stack Overflow nijaki Shen Feng opisał w jaki sposób udało mu się uzyskać 1.6M bezczynnych połączeń TCP na jednym komputerze[2]. Napisany przez niego serwer w C został odpalony pod linuksem na maszynie z 16 GB pamięci RAM. Niezły wynik, ale pojawia się kilka pytań. Czy da się uzyskać podobny wynik używając języka programowania wyższego poziomu? I jeszcze jedno. Czy można jakoś zaprząc ten serwer do nietrywialnej pracy (pomijam echo) tak by obsługiwał on przychodzące requesty i generował odpowiedzi każdemu z ponad miliona klientów w czasie rzeczywistym? 2x TAK!

2. Boost Asio i 400K bezczynnych połączeń TCP
Na samym początku chce zaznaczyć ze moja maszynka ma 4x mniej RAM-u od maszynki Shen Fenga dlatego zadowole się "jedynie" 400K klientami. Pozostałe dane techniczne są nieistotne, system to Ubuntu 12.04. Kluczem do sukcesu Shenga było wykorzystanie mechanizmu epoll dostępnego od wersji 2.6 Linuksa. Epoll jest wydajnym zamiennikiem select-a i w odróżnieniu od niego doskonale radzi sobie nawet przy bardzo duzej liczbie obserwowanych deskryptorów plików. My również go użyjemy ale nie bezpośrednio. Serwer oraz generator klientów zostaną zaimplementowane w C++ z użyciem biblioteki Boost Asio. Ten rozdział traktuje po części jako prosty tutorial Boost Asio.

2.1 Generator klientów
Wszystko zacznie się od przygotowania systemu, gdyż potrzebujemy wsparcia z jego strony. Przed napisaniem czegokolwiek utworzymy kilkadziesiąt wirtualnych interfejsów sieciowych oraz zwiększymy liczbę portów per interfejs.

for i in `seq 21 54`; do sudo ifconfig eth0:$i 192.168.1.$i up ; done

sudo sysctl -w net.ipv4.ip_local_port_range="1025 65535"

Dodatkowo zwiększymy limit na liczbę deskryptorów do niemal miliona. Trzeba zmodyfikować plik /etc/security/limits.conf:

* - nofile 999999

Statystyki:

cat /proc/net/sockstat

Nadszedł czas na Boost Asio. Programowanie z Asio jest nieco trikowe i wymaga obycia, ponieważ:

napisany kod sprowadza się do małego zbioru handler-ów odpalanych asynchronicznie. Mamy odwrócony flow co może utrudnić zrozumienie kodu.

łatwo jest popełnić subtelne błedy (póżniej pokaże jakie), trudno je wyłapać.

biblioteka intensywnie używa szablonów co w połączeniu z boost::bind i innymi wynalazkami może skutkować rozwlekłymi i nieczytelnymi komunikatami gcc podczas kompilacji.

Kod generatora jest krótki dlatego przedstawię go w całości i od razu skomentuje:

using namespace boost::asio;

const int connections_per_IP = 12500;
const int log_step = 10000;
const int interfaces_number = 32;

io_service m_io_service;
ip::tcp::resolver m_resolver(m_io_service);
std::vector<ip::tcp::socket> m_sockets;
std::array<char, 4096> m_buffer;

void read_handler(const boost::system::error_code &, std::size_t)
{
    std::cout << "Read_handler: We got sth from server\n";
    exit(1);
}

void connect_handler(const boost::system::error_code &p_error_code)
{
    static int clients_number = 0;
    if (!p_error_code)
    {
        ++clients_number;
        if (clients_number % log_step == 0)
            std::cout << clients_number << " clients are connected\n";

        m_sockets[clients_number-1].async_read_some( buffer(m_buffer), &read_handler);
    }
    else
    {
        std::cout << "Connect_handler: Some error occured: " << p_error_code.value() << "\n";
        exit(1);
    }
}

void resolve_handler(const boost::system::error_code &p_error_code,
                     ip::tcp::resolver::iterator p_endpoint_iterator,
                     int p_interface_id)
{
    if (!p_error_code)
    {
        for (int i = 0; i < connections_per_IP; i++)
        {
            int socket_index = connections_per_IP * p_interface_id + i;
            m_sockets[socket_index].async_connect(*p_endpoint_iterator, connect_handler);
        }
    }
    else
    {
        std::cout << "Resolve handler: Some error occured: " << p_error_code.value() << "\n";
        exit(1);
    }
}

void resolve_all_connections()
{
    for (int interface_id = 0; interface_id < interfaces_number; interface_id++)
    {
        for (int i = 0; i < connections_per_IP; i++)
            m_sockets.push_back(ip::tcp::socket(m_io_service));

        std::string host = "192.168.1." + std::to_string(21 + interface_id);
        ip::tcp::resolver::query query(ip::tcp::tcp::v4(), host, "9090");

        m_resolver.async_resolve(query, boost::bind( &resolve_handler,
             placeholders::error, placeholders::bytes_transferred, interface_id) );
    }
}

int main()
{
    try
    {
        resolve_all_connections();
        m_io_service.run();
    }
    catch (std::exception& e)
    {
        std::cerr << "Exception: " << e.what() << "\n";
    }
    return 0;
}

Idea jest prosta.Chcemy nawiązać po connections_per_IP połączeń TCP na każdym z interfaces_number interfejsów sieciowych. Przed wejściem do pętli głównej obsługi komunikatów zaszytej w io_service::run, w resolve_all_connections tworzymy po jednym gnieździe per połączenie oraz przygotowujemy query z danymi adresu hosta. Ostatecznie w resolverz-e rejestrujemy resolve_handler wraz z danymi hosta. Handler ten odpalony zostanie po rozwiązaniu adresu. Zadaniem resolve_handler-a jest obliczenie indeksu właściwego gniazda oraz wywołania na nim async_connect wraz z przekazaniem handlera do obsługi połączenia - connect_handler-a.

Connect_handler - jaki jest każdy widzi. Spełnia dwie funkcje tzn. zlicza i wypisuje liczbę połączonych klientów oraz odbiera potencjalne dane wysłane przez serwer. Stąd dodatkowe async_read_some. Moją intencją było jedynie podtrzymywanie połaczenia przy życiu dlatego założyłem że serwer nigdy nic nie wyśle i read_handler nigdy nie zostanie odpalony. Tutaj asynchroniczne I/O pokazuje swoją siłe. Użycie synchronicznego read-a natychmiast zablokowałoby połączenie, a co z tym idzie całą aplikacje bo mamy tutaj tylko jeden wątek. Z tych samych względów odpadają wszelkie sleep-y i inne cuda.

Teraz kilka detali.
Po pierwsze każda z operacji na sieci jest wykonywana asynchronicznie o czym świadczy prefiks async. Dodatkowo w Asio każda z takich operacji może sie nie udać (np. z powodu przerwania połączenia) stąd handler-y powinny ZAWSZE sprawdzać error_code np.

if (!error_code)
{
    ...
}

Po drugie należy zwracać baczną uwagę na czas życia obiektów przekazywanych do Asio. Wszelkie handlery są przekazywane przez wartość, ale już np. bufor w funkcji buffer przez referencje (tutaj m_buffer). Ostatnio straciłem mnóstwo czasu na wychwycenie błędu spowodowanego właśnie przekazaniem lokalnego bufora do środka jednej z funkcji asio.
Ostatni detal to pytanie do czytelnika.

Czy w ramach refactoringu do C++11 można zamienić wszystkie wywołania boost::bind na std::bind i tym samym pozbyć się zależności od nagłówka boost/bind.hpp?

Odpowiedź na końcu posta [3].

2.2 Skalowalny serwer echo
W poprzednim rozdziale pokazaliśmy w jaki sposób, bez uciekania się do socket-ów napisać program do nawiązywania połączeń z serwerem. Przy okazji zobaczylismy siłę drzemiącą w bibliotece Boost Asio. Okazuje sie ze stworzenie serwer-a jest tylko odrobine trudniejsze, co więcej nie będziemy musieli ręcznie zarządzać obserwowaniem deskryptor-ów za pomocą epoll-a. Klasa epoll_reactor[4] zrobi to za nas!
Hierarchia klas naszego serwera jest trywialna. Funkcjonalność została rozbita na 2 proste klasy: server i connection.

class server
{
public:
    server(short port);
    void run();
    void remove_connection(std::shared_ptr<connection> p_connection);
private:
    void start_accept();
    void handle_accept(std::shared_ptr new_connection,
                     const boost::system::error_code& error);
    io_service m_io_service;
    tcp::acceptor m_acceptor;
    std::set<std::shared_ptr<connection>> m_connections;
};

Server składa się z tcp::acceptor-a, io_service-u oraz dodatkowo zawiera zbiór aktywnych połaczeń w postaci obiektów klasy connection. Serwer nasłuchuje na porcie 9090 oraz za pomocą pary metod start_accept + handle_accept tworzy obiekt connection i oddelegowuje pracę do tego obiektu po czym kontynuuje nasłuch.

class connection : public std::enable_shared_from_this<connection>
{
public:
    connection(const connection&) = delete;
    connection& operator=(const connection&) = delete;
    connection(io_service& io_service, server &p_server);
    tcp::socket& socket();
    void start();
private:
    void handle_read(const boost::system::error_code& error, size_t bytes_transferred);
    void handle_write(const boost::system::error_code& error);
    server &m_server;
    tcp::socket m_socket;
    const static int max_length = 1024;
    static char data[max_length];
};

Klasa connection zawiera statyczny bufor na requesty i responsy, gniazdo oraz agreguje klasę server. OK. Póki co wszystko wydaje się być oczywiste i całkowicie naturalne. Serwer utrzymuje zbiór połączeń TCP, każde połączenie ma swój socket oraz wszystkie połączenia współdzielą jeden bufor wiadomości. Skąd jednak zależność klasy connection od klasy server? Czy server nie jest czasem właścicielem obiektów klasy connection?
Kluczem do zrozumienia są niewinnie wyglądające linijki z connection::handle_read i connection::handle_write:

m_server.remove_connection(shared_from_this());

O to w tym wszystkim chodzi. Tylko klasa connection zna moment przerwania połączenia, zatem tylko obiekt klasy connection wie kiedy zakończyć swój żywot. Obiekt klasy nadrzędnej - serwer nie ma tym pojęcia. Można by tutaj wykonać samobója przez delete this. Z drugiej strony jest to nieelegackie tym bardziej że serwer nie zostałby o tym poinformowany. O wiele rozsądniej dodać zależność do serwera, przekazać mu this-a i liczyć na to że serwer zajmie się usunięciem połączenia. Tak też zrobiono, przy czym zamiast przekazywania gołego this-a przekazujemy this-a opakowanego w shared_ptr aby ustrzec się przed niebezpieczeństwem podwójnego ususnięcia obiektu [5].
Spójrzmy jeszcze raz na deklaracje connection:

class connection : public std::enable_shared_from_this<connection>
{
   ...
};

Aby skorzystać z dobrodziejstw shared_from_this konieczne jest dziedziczenie po enable_shared_from_this. Warto zauważyć tutaj występowanie idiomu CRTP(Curiously recurring template pattern), który ma dość szerokie zastosowanie w C++[6].

Nasz prosty serwer jest w stanie utrzymać nawet 400K bezczynnych połączeń. Fajnie by było jednak wykonać jakąś, choćby minimalną pracę dla każdego z klientów. Uznałem, że całkiem prostym i jednocześnie ciekawym problemem będzie problem parsowania adresów URL. Połączony klient wyśle do serwera adres URL, ten zaś odeśle poszczególne fragmenty adresu takie jak scheme, userinfo, port, query itp. Oczywiście zadanie jest na tyle proste, że o wiele lepiej było by je wykonać po prostu lokalnie, jednak tak jak wspomniałem chce w ten sposób obciążyć serwer nietrywialną pracą jednocześnie utrzymując b. dużą liczbę połączeń.

3. Parser URL
Wielokrotnie zarówno w pracy jak i na różnej maści forach internetowych spotkałem się ze stwierdzeniem w stylu: "Nie ma sensu optymalizować tego kodu ponieważ obecnie kompilatory są tak dobre że zrobią to za Ciebie". Jest to bzdura totalna co udowodnie w dalszej części posta. W niniejszym rozdziale na przykładzie parsowania postaram się pokazać na czym polega i jak przebiega optymalizacja kodu oraz dlaczego diabeł tkwi w szczegółach.
Mamy adres URL o następującej gramatyce:

URL  = scheme "://" [ userinfo "@" ] host [ ":" port ] "/" path [ "?" query ] [ "#" fragment ]

Zadanie polega na wyciągnięciu wszystkich zaznaczonych wyżej części tj. scheme, userinfo, host, port, path, query i fragment. Dodatkowo chcemy to zrobić WYDAJNIE tzn. tak szybko jak to tylko możliwe. Wkrótce zobaczymy że z poziomu C++ nie jest to tak proste jak się wydaje...

3.1 Naiwne parsowanie
Całkowicie naturalnym podejściem dla programisty C++ jest wykorzystanie w tym miejscu kombinacji std::string + funkcji z <algorithm>. Dorzucając do tego keyword auto dostajemy całkiem zwięzły i elegancki kod:

inline void naive_parse(const string &str_in)
{
    input_str = str_in;
    scheme = userinfo = host = query = hash = port = "";

    auto scheme_it = search(input_str.begin(), input_str.end(),
                           separator_scheme.begin(), separator_scheme.end());
    scheme.reserve(distance(input_str.begin(), scheme_it));
    transform(input_str.begin(), scheme_it, back_inserter(scheme),ptr_fun(tolower));
        if( scheme_it == input_str.end() )
            return;
    advance(scheme_it, separator_scheme.length());

    auto userinfo_it = find(scheme_it, input_str.end(), '@');
    if (userinfo_it != input_str.end())
    {
        userinfo.assign(scheme_it, userinfo_it);
        scheme_it = ++userinfo_it;
    }

    auto host_end_it = find(scheme_it, input_str.end(), '/');
    host.reserve(distance(scheme_it, host_end_it));
    transform(scheme_it, host_end_it, back_inserter(host),ptr_fun(tolower));
    auto port_begin_it = find(host.begin(), host.end(), ':');

    if (port_begin_it != host.end())
    {
        port.assign(++port_begin_it, host.end());
    }

    auto query_begin_it = find(host_end_it, input_str.end(), '?');
    auto hash_begin_it = query_begin_it;
    if (query_begin_it != input_str.end())
    {
        hash_begin_it = find(query_begin_it, input_str.end(), '#');
        query.assign(++query_begin_it, hash_begin_it);
    }
    else
    {
        hash_begin_it = find(host_end_it, input_str.end(), '#');
    }

    if (hash_begin_it == input_str.end())
        return;
    hash.assign(++hash_begin_it,  input_str.end());
}

W skrócie. Wykorzystując find-a wielokrotnie skanujemy nasz napis wejściowy i wyszukujemy separatory rozdzielające interesujace nas fragmenty URL-a. Przy okazji w dwóch miejscach niezbędne jest użycie funkcji tolower tj. dla schematu i dla nazwy hosta. Warto zwrócić uwagę na możliwości jakie dają iteratory.

Wróćmy na chwilę do to_lower. W jaki sposób najefektywniej zaimplementować tę funkcję?

Odpowiedź na pytanie na końcu posta[7].
Czy powyższy kod jest szybki? Zanim odpowiemy na to pytanie musimy przygotować benchmark który pozwoli nam stwierdzić na ile kod jest szybki/wolny. Rozmowa o szybkości bez dobrego testu wydajnościowego nie ma sensu. Przygotowany test jest na tyle prosty by nie wprowadzać dodatkowych narzutów do wywołań metody parse i jednocześnie na tyle złożony by wyeliminować możliwość usunięcia przez kompilator fragmentów parse przy agresywnych optymalizacjach. Test kompilowany z flagą -O2.

void simple_performance_test()
{
    std::string str1("HTTP://stackoverflow.com/questions/2616011/parse-a.py?url=1");
    std::string str2("scheme://username:password@subdomain.domain.tld:port/path/file-name.suffix?query-string#hash");
    std::string str3("http://blogs.yandex.ru/cachedcopy.xml?f=3dadaafe9e5bfdb5fd2c22e7155441aa&i=91&m=http%3A"
                 "%2F%2Fmewild.livejournal.com%2F24304.html&cat=theme&id=12843#fooobar123");
    std::string str4("scheme://subdomain.domain.tld:0321/path/file-name.suffix?query-string");
    uri u;

    int empty_hash_number = 0;
    for (size_t iteration_number = 100000; iteration_number > 0; iteration_number--)
    {
        u.parse(str1);
        empty_hash_number += (u.hash.empty());
        u.parse(str2);
        empty_hash_number += (u.hash.empty());
        u.parse(str3);
        empty_hash_number += (u.hash.empty());
        u.parse(str4);
        empty_hash_number += (u.hash.empty());
    }
    assert(empty_hash_number == 200000);
}

Wróćmy do zasadniczego pytania. Czy powyższy kod jest szybki? Niekoniecznie.
Przyczyny:

główna przyczyna to ogromna liczba alokacji dynamicznych. Aby ją zmierzyć wystarczy użyć valgrinda:
```
sudo valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --num-callers=20 --track-fds=yes ./naive_uri_parser 
```
Dla powyższego scenariusza testowego (tj. 400000 wywołań parse) valgrind zaraportował 3000000 alokacji na stercie! OMG, to prawie 8 dla każdego pojedynczego URL-a.

liczba kopiowań też nie jest mała. Każde wywołanie transform wykonuje pod maską kod zbliżony do std::copy/memcpy + dodatkową pracę. Teoretycznie sytuację może nieco ratować mechanizm copy-on-write[8] który jest w stanie odwlec w czasie takie kopiowanie.

find-y nie pozwalają na przerwanie wyszukiwania po natrafieniu na jakiś dodatkowy znak, np. dla
```
auto userinfo_it = find(scheme_it, input_str.end(), '@'); 
```
może się zdarzyć że userinfo w ogóle nie występuje w URL-u (jest to opcjonalny fragment). Oznacza to, że nie znajdziemy znaku @ a zatem będziemy przeszukiwać URL do samego końca, chociaż pojawienie się znaku / mogłoby przerwać wyszukiwanie.

bardzo ważną kwestią jest to na ile kod jest cache-friendly. W tym momencie tego nie wiemy, zajmiemy się tym nieco poźniej po rozwiązaniu pierwszych 3 problemów.

Czas wykonania testu dla naive_parse to 270ms. Jak to przyspieszyć?
Przede wszystkim należy mocno ograniczyć liczbę alokacji oraz liczbę kopiowań czyli trzeba zająć się dwoma pierwszymi punktami. Oto co należy zrobić:

zrezygnować z std::string::operator= w []. Operacja tak prosta jak przypisanie pustego stringa wymusza zwolnienie pamięci starego bufora, zaalokowanie nowego bufora oraz skopiowanie "". Zamiast tego zaalokujmy pamięć jeden raz w konstruktorze a w naive_parse użyjmy clear/assign.
zrezygnować całkowicie z reserve
zastąpić transform + back_inserter przez assign + transform
usunąć "zmniejszanie znaków" dla hosta

Ulepszona wersja naive_parse znajduje się w źródłach. Ile udało się ugrać? Czas parsowania spadł do 135ms! Tak proste i subtelne zmiany przyspieszyły nasz program 2x. Dodatkowo według valgrinda liczba alokacji spadła z 3 milionów do 14. Oznacza to że pamięć dla napisów alokowana jest teraz TYLKO w konstruktorze.

Zatrzymaliśmy się na 135ms. Zastanówmy się czy jest to dobry czas. Ile zajmuje przetwarzanie pojedynczego URL-a? 0.135s/400000 = ~340ns. Jest to całkiem niezły czas. Pojedynczy rdzeń mojego CPU potrafi wykonać w tym czasie około 3700 instrukcji arytmetycznych w około 900 cyklach zegara. My jednak chcemy wiedzieć czy można jeszcze lepiej. Perf prawdę nam powie:

perf stat ./naive_uri_parser

sudo perf stat -e cycles,instructions,cache-misses ./naive_uri_parser

Output:

Performance counter stats for './naive_uri_parser':

        134.925923 task-clock                #    0.997 CPUs utilized
                14 context-switches          #    0.104 K/sec
                 2 cpu-migrations            #    0.015 K/sec
               318 page-faults               #    0.002 M/sec
       429,456,353 cycles                    #    3.183 GHz
        97,661,246 stalled-cycles-frontend   #   22.74% frontend cycles idle
        53,600,326 stalled-cycles-backend    #   12.48% backend  cycles idle
       850,989,218 instructions              #    1.98  insns per cycle
                                             #    0.11  stalled cycles per insn
       213,892,182 branches                  # 1585.256 M/sec
         4,366,554 branch-misses             #    2.04% of all branches

       0.135361927 seconds time elapsed

Oraz dodatkowo

       417,777,346 cycles                    #    0.000 GHz                    
       850,802,868 instructions              #    2.04  insns per cycle        
             7,648 cache-misses                                                

       0.134220801 seconds time elapsed

Z powyższych statystyk można wysnuć następujące wnioski:

ogólna wydajność jest całkiem niezła. Wykonywanych jest ~2 instrukcje/cykl. Z uwagi na sekwencyjny dostęp do pamięci (skanując URL pracujemy na kolejnych adresach) problemu nie sprawia również liczba cache miss-ów wynosząca jedynie ~7600. W obliczu 400000 wywołań parse jest to zaniedbywalna wartość.
wąskim gardłem jest tutaj frontend procesora (przestoje na poziomie ~23%) co wynika zapewne ze zjawiska branch misprediction. Perf raportuje aż 4 mln branch miss-ów co daje 10 na wywołanie parse. Średni koszt miss-a to nawet kilkanaście cyklów zegara co daje max. 200 cyklów per parse. 200/900 ~ 20% . Rzeczywiście jest to wartość odpowiadająca zaraportowanym wcześniej przestojom na poziomie 23%.

OK. Widzimy zatem, że niwelując negatywne zjawisko branch missprediction możemy teoretycznie ugrać ok. 20 % szybkości. Sęk w tym że z poziomu naive_parse niewiele możemy zrobić. Wszystkie branch-e pochowane są w wywołaniach z STL-a i nie mamy do nich dostępu. Cóż, w tym momencie wysokopoziomowe mechanizmy C++ nie ułatwiają nam zadania. A może by tak ...

3.2 Szybkie parsowanie
A może by tak spróbować nieco innego podejścia? Algorytm parsowania sam w sobie jest OK, trzeba jednak ulepszyć implementacje. Pokazaliśmy że eliminując branch misprediction możemu uzyskać ok. 20% zysku wydajnościowego.Czy to wszystko co możemy zrobić?
Nie od dzisiaj wiadomo, że std::string ma problemy z wydajnością, głównie ze względu na to że śmieci w pamięci poprzez tworzenie dużej liczby obiektów tymczasowych. Oprócz tego niektóre metody wprowadzają spore narzuty czasowe[9]. Z tych względów np. programiści gier zdecydowanie bardziej preferują stare dobre char stringi [10]. My również użyjemy ich w nowej implementacji parsera.
Kluczowe kwestie na które chcemy zwrócić uwagę to:

proste skanowanie URL-a polegające na sekwencyjnym wyszukiwaniu kolejnych fragmentów tzn. najpierw scheme, następnie userinfo, później host itd. CPU preferuje kilka krótkich iteracji z drobnymi if-ami od długiego for-a z długim switcho-casem (branch-misprediction). Nie jest to nowość - takiego podejścia użyliśmy już wcześniej. Nowością jest skanowanie napisu z ograniczonym "zawracaniem". Zawracania nie można całkowicie wyeliminować ponieważ niektóre fragmenty są opcjonalne i po prostu trzeba się cofnąć. Ważne żeby zrobić to jak najwcześniej a nie dopiero na końcu URL-a.
ograniczenie kopiowania. Chcemy to zrobić tylko raz na początku parsowania tak by stać się właścicielem wejściowego napisu. W naive_parse typem każdego fragmentu był std::string. Teraz będzie to char*. Chcielibyśmy trzymać wskaźniki na początki naszych fragmentów, co jednak z końcówkami? Wykorzystamy separatory. Struktura URL-a pozwala nam w miejscu separatora wstawić znak 0. Kopiowanie fragmentów nastąpi tylko na żądanie.

Aby wprowadzić w życie powyższe założenia musimy zrezygnować niemal całkowicie z STL-a. Nowy parser postanowiłem napisać w C. Główna funkcja fast_parse wygląda następująco:

char input_str[str_size];
const char *scheme, *userinfo, *host, *path, *query, *hash, *port;

inline void fast_parse(const char *str)
{
    strcpy(input_str, str);

    scheme = userinfo = host = path = query = hash = port = 0;
    int i = 0;
    int old_i = 0;

    i = parse_scheme(&old_i, i);
    i = parse_userinfo(&old_i, i);
    i = parse_host_and_port(&old_i, i);
    parse_query_and_hash(&old_i, i);
}

Link do pełnego kodu źródłowego fast_parse znajduje się na końcu posta. Spójrzmy zatem na osiągi naszego zoptymalizowanego parsera.

Performance counter stats for './fast_uri_parser':

         46.583753 task-clock                #    0.992 CPUs utilized          
                 7 context-switches          #    0.150 K/sec                  
                 2 cpu-migrations            #    0.043 K/sec                  
               120 page-faults               #    0.003 M/sec                  
       147,551,153 cycles                    #    3.167 GHz                    
        26,831,425 stalled-cycles-frontend   #   18.18% frontend cycles idle   
         9,712,240 stalled-cycles-backend    #    6.58% backend  cycles idle   
       390,444,647 instructions              #    2.65  insns per cycle        
                                             #    0.07  stalled cycles per insn
        95,951,406 branches                  # 2059.761 M/sec                  
           107,114 branch-misses             #    0.11% of all branches        

       0.046959240 seconds time elapsed

Jest moc! Czas testu spadł do 47s, średni czas fast_parse do ok. 120 ns. Uzyskaliśmy niemal trzykrotne przyspieszenie i to bez zmiany algorytmu! Jak to możliwe? Rezygnując z klasy std::string oraz STL-a przede wszystkim udało nam się dwukrotnie zmniejszyć liczbę instrukcji wysyłanych na CPU (z ~800 mln do ~400 mln). Dodatkowo bardzo mocno spadła liczba branch miss-ów.
Kod jest już obecnie mocno zoptymalizowany. 120 ns to czas porównywalny z czasem kopiowania przez memcpy 512B kawałka pamięci na moim komputerze.

4. Podsumowanie
Podsumowując. W paragrafie drugim utworzyliśmy prostą, skalowalną infrastrukturę klient - serwer z użyciem Boost Asio. Pokazaliśmy, że echo serwer skaluje się do 400 tysięcy połączeń. Nie obeszło się bez wsparcia ze strony systemu operacyjnego. Przy okazji zaprezentowaliśmy niektóre smaczki nowego standardu C++11.
W paragrafie trzecim napisaliśmy prosty parser URL a następnie zajeliśmy się jego tuningiem. Na początku pod wpływem drobnych modyfikacji proces parsowania udało nam się przyspieszyć 2x, a po przepisaniu na język C kolejne 3x. Ostatecznie parser przyspieszył aż sześciokrotnie. Przy okazji zaprezentowaliśmy linuksowe narzędzia pomocne przy optymalizacji.
Na razie to by było na tyle. W następnej części postaram się zebrać wszystko do kupy i zaprezentować działający serwer udostępniający parsowanie URL-ów. Nie zabraknie również statystyk oraz osiągów serwera takich jak latency i throughput.
C.D.N...

[link do boost_server + boost_client]
[link do naive_parse + fast_parse]

1. Dogłębna analiza tematu znajduje się tutaj: http://www.kegel.com/c10k.html
2. http://stackoverflow.com/questions/651665/how-many-socket-connections-possible/9676852#9676852
3. Nie. Nie można zamienić boost::bind na std::bind w linijce ?. Wynika to z ?. Więcej na temat różnic tu: http://stackoverflow.com/questions/10555566/is-there-any-difference-between-c11-stdbind-and-boostbind
4. http://www.boost.org/doc/libs/1_43_0/boost/asio/detail/epoll_reactor.hpp
5. Podwóje usunięcie obiektu może wystąpić np. w następującej sytuacji:
shared_ptr<Foo> foo(new Foo());
shared_ptr<Foo> foo2(foo.get());
Aby zapobiec takiej sytuacji wystarczy nigdy nie tworzyć więcej niż jednego shared_ptr-a z ustalonego raw pointer-a (zamiast tego używać konstruktora kopiującego itp.).
W przypadku this-a obiektu wskazywanego przez jakiś shared_ptr podwójne usunięcie nastąpi jeśli tylko przekażemy this-a do shared_ptr-a tzn:
shared_ptr<Foo> foo(this);
Rozwiązaniem jest użycie enable_shared_from_this, które opakowuje this w shared_ptr w bezpieczny sposób. Więcej: http://stackoverflow.com/questions/712279/what-is-the-usefulness-of-enable-shared-from-this?rq=1
6. CRTP znajduje zastosowanie m. in. w zliczaniu liczby obiektów klasy, w symulacji dynamicznego wiązania w czasie kompilacji oraz w implementacji polimorficznego clone-a. Więcej pod adresem
http://en.wikipedia.org/wiki/Curiously_recurring_template_pattern
7. Zamiana dużego znaku na mały sprowadza się jedynie do zapalenia 5-tego bitu co jest operacją natychmiastową. Można jednak całkowiecie wyeliminować obliczenia poprzez stablicowanie wyniku. Tak też zrobiono w implementacji std::tolower z libstdc++ na moim systemie. Proste testy wykazały, że stablicowanie przyspiesza fast_parse o ~5%. Wszystko co przed chwilą opisałem działa tylko dla napisu z kodowaniem ASCII, którym jest właśnie std::string
8. http://en.wikipedia.org/wiki/Copy-on-write.Z linku wynika, że w C++11 std::string z różnych powodów nie używa Copy-on-write.
9. O tym jak bolesne dla wydajności mogą być std::string-i przekonał się Fabian Giesen w swoim genialnym cyklu o tunigu demka technologicznego Intel-a. Link: http://fgiesen.wordpress.com/2013/01/30/a-string-processing-rant/
10. Przykładowy, efektywny zamiennik klasy std::string dostępny jest pod tym adresem https://home.comcast.net/~tom_forsyth/blog.wiki.html w sekcji "A sprintf that isn't as ugly"