How SmartNews Utilizes PipelineDB


Hello, I am @tamtam180 at SmartNews as Ads Enginner.

Today let me introduce how SmartNews utilizes PipelineDB, which is an RDB product supporting Continuous Query.
SmartNews adopted PipelineDB mainly because of its significant benefit in combination with Chartio, which is a popular visualizing dashboard tool that SmartNews has deeply exploited.

( About Continuous Query, readers can refer other alternatives such as Norikra, which is popular among other Japanese engineers, or Azure Stream Analytics, while we highly valued the high connectivity to Chartio. )

Good points

  • Applicability of an existing driver: PipelineDB is based on PostgreSQL and hence usage of PostgreSQL drivers eases our development.
  • Computational efficiency of distinct counting command, thanks to HyperLogLog (HLL) and its merging functionalities.
  • Merging multiple HLL enables efficient aggregation among different time scales, e.g., 1-hour statistics is efficiently computed as the aggregation of multiple 10-minute statistics.
  • The HyperLogLog’s intrinsic error, which stems from its probabilistic approximation nature, has been never problematic in our practice.
  • We are satisfied with the significant performance improvement in version 0.8.0, compared to the initial version 0.7.7.

Use case

  • Various flash report of advertising (It cannot be in batch processing)
    • For example: Time series metrics per advertiser, campaign, device, media, etc..
  • Anomaly detection

System configuration: an efficient combination of PipelineDB, Kinesis, and Chartio.

One key idea to simplify all of the required post processing as merely Continuous View: storing parts of logs into Kinesis.

  • One JSON Object for each record
  • Bulk insert of these records into PipelineDB
  • Storing the running status of each of the Kinesis Shared into LevelDB
  • One-column JSONB field in PipelineDB

pipelinedb01

 

CREATE STREAM imp_stream ( item JSONB );

Examples of using Continuous View

CREATE CONTINUOUS VIEW imp_count AS
SELECT 
  COUNT(*) as cnt, 
  MAX( (item->>'timestamp')::bigint ) as last_timestamp
FROM imp_stream
WHERE
  hour(arrival_timestamp) > clock_timestamp() - interval '30 days';

The following SQL example provides a good understanding about the benefit of hourly GROUP BY syntacs. Here we retrieve the smoothed amount of ad supply.

CREATE CONTINUOUS VIEW plus_imp_ymdh AS
SELECT
  (item->>'campaignId')::bigint as campaignId,
  (item->>'channelId')::bigint as channelId,
  to_char(to_timestamp((item->>'timestamp')::bigint + 3600*9), 'YYYY-MM-DD') as ymd_jst,
  date_part('hour', to_timestamp((item->>'timestamp')::bigint + 3600*9)) as h_jst,
  COUNT(*) as cnt
FROM imp_stream
WHERE
  hour(arrival_timestamp) > clock_timestamp() - interval '10 days'
  AND (item->>'productCode')::text = 'xxx'
GROUP BY
  campaignId, channelId, ymd_jst, h_jst;

Let us provide another SQL example as a usage of distinct count. The application is also in ad domain, where we compute the statistics of the assignment in A/B test.

CREATE CONTINUOUS VIEW abt_allocation AS
SELECT
  to_char(to_timestamp((item->>'timestamp')::bigint + 3600*9), 'YYYY-MM-DD') as ymd_jst,
  (item->>'abtExpLabel')::text as abt_exp_label,
  (item->>'abtGrpLabel')::text as abt_grp_label,
  count(distinct (item->>'uuid')::text) as ucnt
FROM ads_stream
WHERE hour(arrival_timestamp) > clock_timestamp() - interval '3 days'
GROUP BY ymd_jst, abt_exp_label, abt_grp_label
;

If you do not need the detail histogram, we recommend to use the minute with arrival_timestamp. By doing so, ****_mrel0(the actual data is stored) becomes compact.

Dashboard

Here the underlying dataset is artificial, due to confidentiality considerations.

pipelinedb02

 

pipelinedb03

pipelinedb04

Conclusion

We introduced how SmartNews uses PipelineDB and the rationale behind our adoption. We particularly noticed the time-series nature of our data, where both business and engineering decision making must be done based on the freshest information.

The primal discipline in such rational decision making is to check the newest indicators, which has been enabled through the high connectivity with dashboards including Chartio, and support of HyperLogLog functionality.

The second discipline, of which we now consider the introduction, is anomaly detection for improving service level.

Here PipelineDB’s high capacity to store lengthy time-series will greatly ease the application of advanced anomaly detection algorithms, without significant overhead in data processing.

We are hiring!!
If you have any questions, please contact me.