HackDay 2018: Condensing Vapour

This post describes experimental features within the Caplin Platform. It is highly probable that something similar will be implemented and made available in the not-so-distant future.

The plan and idea

Here at Caplin we’ve been thinking about how to improve our cloud offering for a while. Our software stack works quite happily with any combination of virtualisation and containerisation you care to choose – indeed, we only use physical hardware ourselves for bare-metal benchmarking and stress testing.

It’s worthwhile thinking about some of the keywords that come up when people talk about “cloud” – this list is chosen from themes that arose during both internal and external workshops:

  • Save money
  • Lift and shift
  • Containerisation
  • Virtualisation
  • Outsourced infrastructure
  • Network resilience

It’s fair to say that most of those reasons are business-focussed and would make for a pretty poor technical demonstration, so a brief glance through meeting minutes led me to these more technical ones:

  • Autodiscovery of peers
  • Dynamic scaling
  • Less configuration

Then we looked at the current barriers to achieving this:

  • Static configuration
  • DataServices
  • Discovery using UDP/Multicast may not work in certain environments

Static configuration is usually a requirement for traditional deployments, where operators want to lock down production configuration so that it is known and proven. With a more flexible deployment, placeholders can be introduced into the configuration to allow components to connect; however, that may not allow for managing load spikes.

DataServices are Caplin’s way of defining the data routing rules. They are very powerful and allow configuration of load-balancing and failover (or combinations of both), affinity (for example, binding the trade channel and blotter to the same adapter) and minimum service levels. Given this level of flexibility, it’s worth keeping them and avoiding having to rewrite them in the future.

At present, the routing is defined by explicitly specifying every peer, for example:

add-data-service
   include-pattern ^/FX
   add-source-group
     add-priority
           label fxdata1
           label fxdata2
     end-priority
   end-source-group
end-data-service

defines a load-balanced configuration, with an equal number of requests going to fxdata1 and fxdata2. A logical step towards allowing peer auto-discovery would be to use a regex, for example:

add-data-service
   include-pattern ^/FX
   add-source-group
     add-priority
           label-regex ^fxdata
     end-priority
   end-source-group
end-data-service

This would then load-balance between all connected peers whose label starts with fxdata.
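
As a rough sketch of how that label matching might work (LabelMatcher and its methods are invented here purely for illustration – this is not the DSDK API):

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative sketch: pick out the connected peers whose label
// matches the configured label-regex.
public class LabelMatcher {

    private final Pattern labelPattern;

    public LabelMatcher(String labelRegex) {
        this.labelPattern = Pattern.compile(labelRegex);
    }

    // Given the labels of all currently connected peers, return those
    // the data-service should load-balance across.
    public List<String> matchingPeers(List<String> connectedPeerLabels) {
        return connectedPeerLabels.stream()
                .filter(label -> labelPattern.matcher(label).find())
                .collect(Collectors.toList());
    }
}

With the configuration above, matchingPeers would select fxdata1 and fxdata2 while skipping, say, a peer labelled rates1.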

Dynamically creating peers is then just a case of writing some code that, on receiving a trigger, creates a new peer on demand.

Handling the discovery – and having a resilient orchestrator to do this – ended up with a surprisingly easy choice. We’ve started to use Hazelcast in other parts of our stack to allow us to focus on implementing business functionality rather than getting bogged down in low-level protocols.

The implementation

With a plan decided up-front, at 2pm it was a case of putting down project work, switching code workspaces and starting on the hacking.

The key component to change was our underlying DSDK library. There were three big changes to make: adding new peer connections whilst the system was running, adjusting data-services to work with a regex for labels rather than explicit labels, and finally, integrating with Hazelcast to publish and retrieve peer connection details.

Being able to dynamically add new peers turned out not to be as tricky as I’d feared: we already had an array of peers, so all that was needed was to resize the array and add the new peer to the end. A strategic decision was made not to actually remove peers, but simply to mark them as disabled.
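
The real code lives in the C DSDK, but in spirit it amounts to something like this (a Java illustration of the approach, not the actual implementation):

import java.util.ArrayList;
import java.util.List;

// Illustrative only: peers are held in an array that is grown on
// demand, and a peer is disabled rather than removed.
class PeerTable {

    static class Peer {
        final String label;
        boolean disabled;
        Peer(String label) { this.label = label; }
    }

    private final List<Peer> peers = new ArrayList<>();

    void addPeer(String label) {
        peers.add(new Peer(label)); // "resize" the array and append
    }

    void disablePeer(String label) {
        // Never shrink the array – just flag the peer as disabled.
        for (Peer peer : peers) {
            if (peer.label.equals(label)) {
                peer.disabled = true;
            }
        }
    }
}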

Adjusting data-services was surprisingly awkward. The surprise was that most of the code needing adjustment was configuration validation code! The new watermark feature performed extensive sanity checks on the number of configured and required peers, which prevented the Liberator from starting up. This being a hackday, those checks were simply bypassed and the Liberator started up without too many complaints.

Next up was the auto-discovery component. I made the slightly unusual choice of using the Java Hazelcast API within a C-based project. I’ve spent the past couple of years writing Java/Kotlin applications using Hazelcast, so in the interest of getting things working quickly, I chose Java rather than the C++ API.
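
I won’t reproduce the hackday code here, but a minimal sketch of the approach might look like the following. It assumes a shared Hazelcast map holding label → host:port entries; the map name, the value format and the createPeer hook are all assumptions for illustration:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.map.listener.EntryAddedListener;

public class PeerDiscovery {

    public static void main(String[] args) {
        // Join (or form) the discovery cluster.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Shared map of peer label -> "host:port" connection details.
        IMap<String, String> peers = hz.getMap("datasource-peers");

        // The Liberator side reacts to newly advertised peers by
        // creating a peer connection on demand.
        peers.addEntryListener((EntryAddedListener<String, String>) event ->
                createPeer(event.getKey(), event.getValue()), true);

        // A providing DataSource advertises itself like this.
        peers.put("fxdata1", "10.0.0.5:25001");
    }

    private static void createPeer(String label, String hostPort) {
        // Hypothetical hook into the DSDK's new dynamic add-peer support.
        System.out.println("Creating peer " + label + " at " + hostPort);
    }
}

Because Hazelcast replicates map entries across the cluster members, the orchestrator itself is resilient: a member can fail and the advertised peer details remain available.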

The result

My usual aim with hackdays is to finish the project on the first day and then do final verifications on the second day. As a result of the up-front planning and thought, everything was completed by 5:30pm on the first day.

The demo script was pretty short:

  • Start up orchestration server
  • Start up Liberator
  • Show the missing add-peer configuration
  • Go to the status page, observe no peers connected and service down
  • Start up a providing datasource
  • Observe that the status page now shows a peer connected
  • Shutdown the Liberator
  • Observe that the datasource no longer attempts to connect to the Liberator

Having proven the basic idea, we’re now working on translating a quick three-hour POC into a production-ready feature that can cope with the complicated topologies of some of our customers’ deployments. If everything goes to plan we’ll be releasing a version of the Caplin Platform that does auto-discovery at some point during 2019.
