How SmartNews Utilizes PipelineDB
Hello, I am @tamtam180 at SmartNews as Ads Enginner.
Today let me introduce how SmartNews utilizes PipelineDB, which is an RDB product supporting Continuous Query.
SmartNews adopted PipelineDB mainly because of its significant benefit in combination with Chartio, which is a popular visualizing dashboard tool that SmartNews has deeply exploited.
( About Continuous Query, readers can refer other alternatives such as Norikra, which is popular among other Japanese engineers, or Azure Stream Analytics, while we highly valued the high connectivity to Chartio. )
- Applicability of an existing driver: PipelineDB is based on PostgreSQL and hence usage of PostgreSQL drivers eases our development.
- Computational efficiency of distinct counting command, thanks to HyperLogLog (HLL) and its merging functionalities.
- Merging multiple HLL enables efficient aggregation among different time scales, e.g., 1-hour statistics is efficiently computed as the aggregation of multiple 10-minute statistics.
- The HyperLogLog’s intrinsic error, which stems from its probabilistic approximation nature, has been never problematic in our practice.
- We are satisfied with the significant performance improvement in version 0.8.0, compared to the initial version 0.7.7.
- Various flash report of advertising (It cannot be in batch processing)
- For example: Time series metrics per advertiser, campaign, device, media, etc..
- Anomaly detection
System configuration: an efficient combination of PipelineDB, Kinesis, and Chartio.
One key idea to simplify all of the required post processing as merely Continuous View: storing parts of logs into Kinesis.
- One JSON Object for each record
- Bulk insert of these records into PipelineDB
- Storing the running status of each of the Kinesis Shared into LevelDB
- One-column JSONB field in PipelineDB
CREATE STREAM imp_stream ( item JSONB );
Examples of using Continuous View
CREATE CONTINUOUS VIEW imp_count AS SELECT COUNT(*) as cnt, MAX( (item->>'timestamp')::bigint ) as last_timestamp FROM imp_stream WHERE hour(arrival_timestamp) > clock_timestamp() - interval '30 days';
The following SQL example provides a good understanding about the benefit of hourly GROUP BY syntacs. Here we retrieve the smoothed amount of ad supply.
CREATE CONTINUOUS VIEW plus_imp_ymdh AS SELECT (item->>'campaignId')::bigint as campaignId, (item->>'channelId')::bigint as channelId, to_char(to_timestamp((item->>'timestamp')::bigint + 3600*9), 'YYYY-MM-DD') as ymd_jst, date_part('hour', to_timestamp((item->>'timestamp')::bigint + 3600*9)) as h_jst, COUNT(*) as cnt FROM imp_stream WHERE hour(arrival_timestamp) > clock_timestamp() - interval '10 days' AND (item->>'productCode')::text = 'xxx' GROUP BY campaignId, channelId, ymd_jst, h_jst;
Let us provide another SQL example as a usage of distinct count. The application is also in ad domain, where we compute the statistics of the assignment in A/B test.
CREATE CONTINUOUS VIEW abt_allocation AS SELECT to_char(to_timestamp((item->>'timestamp')::bigint + 3600*9), 'YYYY-MM-DD') as ymd_jst, (item->>'abtExpLabel')::text as abt_exp_label, (item->>'abtGrpLabel')::text as abt_grp_label, count(distinct (item->>'uuid')::text) as ucnt FROM ads_stream WHERE hour(arrival_timestamp) > clock_timestamp() - interval '3 days' GROUP BY ymd_jst, abt_exp_label, abt_grp_label ;
If you do not need the detail histogram, we recommend to use the
arrival_timestamp. By doing so,
****_mrel0(the actual data is stored) becomes compact.
Here the underlying dataset is artificial, due to confidentiality considerations.
We introduced how SmartNews uses PipelineDB and the rationale behind our adoption. We particularly noticed the time-series nature of our data, where both business and engineering decision making must be done based on the freshest information.
The primal discipline in such rational decision making is to check the newest indicators, which has been enabled through the high connectivity with dashboards including Chartio, and support of HyperLogLog functionality.
The second discipline, of which we now consider the introduction, is anomaly detection for improving service level.
Here PipelineDB’s high capacity to store lengthy time-series will greatly ease the application of advanced anomaly detection algorithms, without significant overhead in data processing.