Seif Lotfy: Skizze: Behind the Scenes of Alpha 2 |
Based on the feedback we got for our initial alpha release, we worked on improving Skizze and moving the project forward.
To recap, Skizze is a sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
My old time hacking buddy Neil Patel, who is also Xamarin Insights Technical Lead and Architect, blogged about the latest release, and also provided some background on why Skizze exists, and how to get started.
This second alpha focuses mainly on improving development and operating experience. It is an early alpha so don't expect much, but hopefully now it is easier to run and experiment with.
In this post I wanted to highlight a couple of features and ideas we have for Skizze:
As Neil mentioned in his post, we have a clear definition of Domains now:
One of the first pieces of feedback we got with the initial release was: "My data needs all four Sketches and I'm sending the values four times to the server!". Which is a fair point, in most cases the stream of values you have, you almost always want all four kinds of questions answered. Domains help solve this problem.
When you create a Domain, behind the scenes, Skizze creates all four sketches of the same name. From then on, as you add values to that domain, they automatically get multiplexed into each Sketch that belongs to it.
Here is a small Mockup of how flow looks like.
One of the main new features is the introduction of an initial functioning persistence functionality, which is done via AOF (Append Only File) inspired by Redis:
In basic terms, append-only log files keep a record of data changes that occur by writing each change to the end of the file. In doing this, anyone could recover the entire dataset by replaying the append-only log from the beginning to the end -
In our case we record all incoming (write) requests (serialised protobuf) to the AOF. This is all done in a go routine.
Upon restart we replay them again. Currently this might be a bit slow, which is why we intend to add support for snapshotting to allow quicker startup/restoring.
While we do intend to add clustering in the future, for now we are focusing on making Skizze work reliably on one machine. All feedback or support on implementing better persistence is welcome.
We also focused on improving the throughput complexity/speed. Thanks to Damian Gryski's tips, this was achieved by using:
And although often when working in a high-throughput environment, beefier machines are expected, benchmarking on a standard low tier VM offering should be a good reference point. To benchmark our throughput speed we used the lowest Digital Ocean Tier (5$/month)
Memory | 512MB Ram |
Disk | 20GB SSD |
OS | Ubuntu 14.04 |
CPU | Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz (1Core) |
Our benchmark scenario is 1 Domain with 4 sketches, CML, BloomFilter, Top-K (tracking the top 100) and HyperLogLog++ (disregards the expected capacity). The error rate is set to 1% with an expected 1000000 values capacity. The domain is subject to 10 million insertions (unique: 349902) in a Zipf (1.1) distribution. The following are our insertions benchmark results.
batch size | elapsed time | values per second |
1000 | 34s | 285714 |
10000 | 29s | 333333 |
100000 | 26s | 370370 |
We will be attempting to improve the throughput with every release.
I will follow up with a blog post on the "querying" accuracy, speed and resource consumption soon.
There is still a lot to be done, and we appreciate every kind of support. For the next alpha #3 release we are focusing on:
For alpha #n we are playing with the following ideas:
A big thank you to this release's contributors:
Please feel free to open an issue, or join us for a chat in our channel on the Gophers Slack on #skizze.
Комментировать | « Пред. запись — К дневнику — След. запись » | Страницы: [1] [Новые] |