Seif Lotfy: Counting flows (Semi-evaluation of CMS, CML and PMC)

Вторник, 01 Сентября 2015 г. 01:25 + в цитатник

Assume we have a stream of events coming in one at a time, and we need to count the frequency of the different types of events in the stream.

In other words: We are receiving fruits one at a time in no given order, and at any given time we need to be able to answer how many of a specific fruit did we receive.

The most naive implementation is a dictionary in the form of , and is most accurate and suitable for streams with limited types of events.

Let us assume a unique item consists of 15 bytes and has a dedicated uint32 (4 bytes) counter assigned to it.

At 10 million unique items we end up using 19 MB which is a bit much, but on the plus side its as accurate as it gets.

But what if we don't have the 19 MB. Or what if we have to keep track of several streams?

Maybe saving to a DB? Well when querying the DB upon request, something in the lines of:

SELECT count(event) WHERE event = ?)

The more items we add, the more resource intensive the query becomes.

Thankfully solutions come in the form of Probabalistic datastructures (sketches).

I won't get into details but to solve this problem I semi-evaluated the following data structures:

Count-Min sketch (CMS) [2]
Count-Min-Log sketch (CML) [1][2]
Probabilistic Multiplicity Counting sketch (PMC) [1]

Test details:

For each sketch I linearly added a new flow with equivalently linear events. So the first flow got 1 event inserted. The second flow for 2 events inserted, all the way up to 10k-th flow with 10k events inserted.

flow 1: 1 event  
flow 2: 2 events  
...
flow 10000: 10000 events

All three data structures were configured to have a size of 217KB (exactly 1739712 bits).

A couple dozen runs yielded the following results (based on my unoptimized code esp. for PMC and CML)

CMS: 07s for 50005000 insertion (fill rate: 31%)  
CML: 42s for 50005000 insertion (fill rate: 09%)  
PMC: 18s for 50005000 insertion (fill rate: 54%)

CMS with

http://geekyogre.com/counting-flows-semi-evaluation-of-cms-cml-and-pmc/