I Introduction
Statistical divergence is widely applied in multimedia processing. Prevalent applications include multimedia event detection [1], content classification [2, 3] and qualification [4, 5]. It has been attracting more attention since the dawn of big data era, basically due to regularity and interpretable features displayed in the data. However, in a broader range of data realm, these advantages may not outstand (e.g. in online sales data records). It requires a more general approach.
Currently, there are more than 2.7ZB data in the digital universe [6] and the growing speed is doubling every two years. It has already been hard and will be much harder in the future to harness the exploding volume of data that has resulted in many problems in data management and engineering, threatening trustworthiness and reliability of data flows inside working systems. Data error rate in enterprises is approximately 1% to 5%, and for some, even above 30% [7]. Those data anomalies may arise due to both internal and external reasons.
On one hand, components inside systems may generate problematic source data. For example, in a sensor network, some sensors may generate erroneous data when it experiences power failure or other extreme conditions [8]. Data packages will be lost if sensor nodes fail to connect to network or some sensor hubs break down [9]. Also, human operators act as a heavily vulnerable part to bugs and mistakes. Malicious insiders even deliberately modify system configurations for fatal compromises [10]. A study shows that 65% of organizations state that human errors are the main cause of data problems [11] .
On the other hand, data manipulation [12] from outside hackers composes another potential threat of data quality and reliability. Data Manipulation
here, according to a NSA definition, refers to that “hackers can infiltrate networks via any attack vector, find their way into databases and applications and change information contained in those systems, rather than stealing data and holding it for ransom”. If data is compromised, it will severely affect mining and learning algorithms and further change the final decision driven by the data. In 2013, hackers from Syria put up fake reports via Associated Press’ Twitter account and caused a 150point drop in the Dow
[13].It is hard to detect a single record that is alerted but still remain in correct value scopes, but if sufficient data records are altered to change a final decision, we can still detect malicious data manipulation behaviours. According to our observation, typical manipulations on numerical data will lead to a drift or distortion of its original distribution. For measurable reshaping, we can enclose data collections with similar distribution patterns and filter out those strangely shaped ones. To address problems caused by data manipulation, we proposed a novel technique which sorts out manipulated data collections from normal ones by adopting statistical divergence. In this paper, we focus on a concrete data manipulation problem: click farming in online shops, and try to apply our technique to pick out those dishonest sellers. Our technique maps data collections to points in distribution spaces and reduce the problem to classical point anomaly detection. Optimizations estimate ground truth, mapping each data collection into a single real number within a definite interval. Then a Gaussian classifier can be applied to detect outliers derived from manipulated data. To automatically calculate adaptive threshold for the classifier, we keep two evidence sets for both normal points and anomalies, taking advantage of the property provided by statistical divergence. In the dynamic environments, these evidence sets are modified after every data collection is checked, in which manner they act intuitively as slide windows and keep up to the evolving features in dynamic scenarios. Our contribution includes: 1) A brief review on data anomaly detection and a study on the problem of click farming; 2) Detailed description of both basic and optimized framework of our technique, resolving several technical difficulties such as automated adaptive threshold; 3) Real world and synthetic data experiments that test efficiency of our technique and a comparison with a previous work on the same topic.
The rest of the paper is organised as follows: Section II states related work on data anomaly detection and describes a real world problem. Section III introduces statistical distance. Details of proposed technique are introduced in section IV. Then section V presents evaluation results and further findings of the algorithm. Finally, the paper is concluded in section VI.
Ii Related Work
Iia Data Anomaly Detection
Statistical divergence was applied mainly as classifiers on multimedia content [3], especially as kernels in SVMs [2]. As a similarity measurement, it can also be used in qualitative and quantitative analysis in image evaluation [4, 5]. [1] adopted divergence to detect events in multimedia streams.
Anomaly detection, also known as outlier detection, has been studied for a long time and discussed in diverse research domains, such as fraud detection, intrusion detection, system monitoring, fault detection and event detection in sensor networks. Anomaly detection algorithms deal with input data in the form of points (or records), sequences, graphs and spatial and geographical relationships.
[14] According to relationships within data records, outliers can be classified into point anomalies, contextual (or conditional) anomalies and collective anomalies. [15]Currently, distance based [16, 17] and feature evolving algorithms [18, 19, 20] catch most attention. Others adopted tree isolation [21], model based [22] and statistical methods [23] in certain applications.
To detect collective anomalies, [24] adopts the ART (Adoptive Resonance Theory)neural networks to detect timeseries anomalies. Box Modeling is proposed in [25]. Longest Common Subsequence was leveraged in [26] as similarity metric for symbolic sequence. Markovian modeling techniques are also popular in this domain[27, 28, 29]. [30] depicts groups in social media as combinations of different “roles” and compare groups according to the proportion of each role within each group.
Wang et al. proposed a technique, Multinomial GoodnessofFit
(MGoF), to analyze likelihood ratio of distributions via KullbackLeibler divergence, and is fundamentally a hypothesis test on distributions
[31]. MGoF divides the observed data sequence into several windows. It quantifies data in each window into a histogram and check these estimated distributions against several hypothesis. If the target distribution rejects all provided hypothesis, it is considered an anomaly and preserved as a new candidate of null hypothesis. If the target distribution failed to reject some hypothesis, then it is considered a supporting evidence of the one that yields most similarity. Furthermore, if the number of supporting evidence is larger than a threshold
, it is classified as nonanomaly.MGoF is the best competitor out of the similar techniques, and we use it as our baseline against our approach.
IiB Real World Problem: Click Farming Detection
Taobao possesses a market share of 50.6% to 56.2% in China by 2016 [32]. Currently, there are more than 9.4 million sellers in Taobao, providing more than 1 billion different products. Under the superpressure caused by massive competitors, a number of the sellers choose to use some cheating techniques to raise reputation and sale volumes, then improve rankings in search lists.
The most popular approach to manipulate transaction and reputation data is Click Farming, where sellers use a large number of customer accounts to create fake transaction records and give high remarks on products. Professional click farmers are usually well organized groups or companies containing thousands of people. Some companies even develop professional applications that can be deployed on common PCs to improve productivity [33].
There are two types of click farming behaviours: Centralized and Equalized. Centralized click farming refers to the scenarios that transactions are randomly generated throughout the day. A significant feature of this approach is that the cheating transactions usually assemble together in a short period of time since most workers work at the same time. Equalized click farming refers to the circumstances that click farms are arranged by some well programmed applications or teams carefully managed and strictly commit transactions according to a timetable. Thus the transaction distribution may not vary too much with and without click farming.
A research performed in China showed that 81.9% of investigated people had heard of the behaviour of click farming, 51.2% are aware of click farm and 18.9% of them had experience of click farming themselves [34]. American researchers reported in 2015 that over 11000 sellers on Taobao were detected to have click farmed records and only 2.2% of 4000 investigated dishonest sellers had been penalized because of the cheating attempts [35].
Current detection techniques for click farming mainly focus on user behaviours, such as browsing frequencies and periods, most common purchasing time, favourite products, remarks and whether they communicate with sellers [36]. Those techniques require the platform to keep lots of records and user features. However, the detection can be easily bypassed by trained workers and some well programmed applications.
Although it is hard to classify users as honest or malicious, we can still find clues from the sellers’ aspect. For normal sellers, their customers are usually similar since choices of products are seldom changed. Therefore, the distribution of transactions in a fixed period of time, say one day, is relatively stable. No matter how much alike between honest users and robots or the employed workers, the fake transaction records will always cause a bias or distortion of the original transaction distribution. To better observe the problem, we downloaded a real world data set containing Taobao online sellers’ transaction records and emulated the circumstances if it had been click farmed (see section VA). Fig. 1 shows the difference between normal and click farmed distributions of one day in the data set. Thus, if we can measure the similarity between different transaction distributions, there is still a chance for us to detect dishonest sellers.
Iii Preliminaries
Statistical divergence, also called statistical distance, measures the similarity between two or more distributions. Mathematically, statistical divergence is a function which describes the “distance” of one probability distribution to the other on a statistical manifold. Let
be a space of probability distributions, then a divergence is a function from to nonnegative real numbers:(1) 
Divergence between two distributions and , written as , satisfies:


, if and only if
For our purposes, we do not require the function to have the property: . But we do need it to be true that if is more similar with than , then . There are ways to calculate divergence, several frequently used divergence metrics are as follows:
Iiia KullbackLeibler Divergence
Let be discrete probability distributions, implies for , the KullbackLeibler Divergence from to is defined to be:
(2) 
For being continuous distributions:
(3) 
IiiB JensenShannon Divergence
Let be discrete probability distributions, JensenShannon Divergence between and is defined to be:
(4) 
where .
A more generalized form is defined to be:
(5) 
where is Shannon Entropy, and .
Especially, if , then:
(6) 
JensenShannon divergence has some fine properties:

.

. If a based algorithm is adopted.

To calculate , it need not necessarily to be true that implies .
IiiC Bhattacharyya Distance
Let be discrete probability distributions over same domain , Bhattacharyya Distance between and is defined to be:
(7) 
IiiD Hellinger Distance
Let be discrete probability distributions, Hellinger Distance between and is defined to be:
(8) 
IiiE KolmogorovSmirnov Statistic
Let be discrete onedimensional probability distributions, and are their cumulative probability functions respectively, KolmogorovSmirnov Statistic between and is defined to be:
(9) 
Iv Statistical Detection
Diverse data sets in the real world show certain structures caused by hidden patterns or relationships among records. For example, traffic volume in the highway and the business transaction records, they may show a relatively stable distribution in the daily scale. Manipulation on those data (e.g. Fig. 1) results in a drift or distortion of the distribution, which can be captured to trigger an alarm.
Iva Statistical Divergence Detection with Reference(SDDR)
From Section III we know that statistical divergence only provides a distance between two or more distributions. In a set of data collections, we can only draw a complete graph where nodes denote data collections and edges refer to the symmetric divergence between two connected nodes. From the graph we can find some points that have apparently larger distances with most of other points and return them as anomalies. This may work if anomalous nodes do not compose a large proportion. However the procedure will be too complicated to work out with large amounts of data. If it is assured that data collections form only one cluster, some optimizations can be applied to reduce complexity.
Alternatively we can provide a frame of reference that generates absolute coordinates rather than the relative ones. This optimization is feasible if data collections form one single cluster in distribution space. This is true in most reality scenarios given that distribution is adopted to depict a macro property which comes out as one universal conclusion. In other words , if multiple distributions are used to describe subgroups of entire sample space, then a conclusive one can be obtained by averaging all these subdistributions. Therefore, we can use an estimate cluster center as reference and test distances between the reference and each other data collections(Algorithm 1), yielding absolute distances.
Fig. 2 shows distribution of all divergences against the reference. It can be approximated as a Gaussian distribution though the true one may differ a little more from the standard Gaussian than the expected estimation error. That is due to the unknown randomness within real world data. Few assumptions can be applied in real world data sets, no mention that data volume is sometimes relatively low. This topic is out of the domain discussed in this paper and we here only introduce the technique instead of the specific distribution model. Certainly, if stronger assumptions can be included to provide a more precise model, this component in the framework can be replaced to give better results. For the simplicity of our proposal, we deem the distributions of divergences to be Gaussian.
By this approach, time complexity can be reduced from quadratic to linear. Fig. 8 in Section VB demonstrates the result of the above process. Red distribution refer to the distances calculated from normal data collections, blue and green ones are from clickfarmed data collections. Clearly, distances of normal data collections assembles together around a small value while anomalous ones lay around a larger distance value.
IvB Optimization: Statistical Divergence Detection with Evidence(SDDE)
It is possible to further optimize SDDR if we can provide this algorithm with evidence(Algorithm 2).
Evidences enables the algorithm to not only refine estimation of real distribution but also build knowledge of anomalous collections, which is similar to the parameter estimation within a certain sample set.
According to the property of statistical divergence, we can infer that the true distribution of divergences calculated from normal data collections are close to but not exactly a Gaussian distribution since for each point, there are both definite upper and lower bounds instead of infinities. Therefore, should be slightly larger than zero(, for real world data sets, this is highly unlikely). Time complexity for this algorithm is still linear but with a larger coefficient.
For certain divergence, it is possible to compare similarity from one distribution against multiple others, such as JensenShannon Divergence. Although it reduces time complexity, it sacrifices unaffordable accuracy because divergence among multiple distribution dilutes differences. Take JSD as an example, suppose and , then and .
This algorithm can be slightly modified to deal with concept drift(for example, trading trend changes over time for online shops as they are often in the process of expanding or dwindling) by turning the two evidence sets as sliding windows and adopting certain update strategies such as Least Recently Used(LRU). Time complexity for this optimization is , where denotes time complexity of divergence calculation.
IvC Threshold
One important factor in algorithm SDDE is the value of threshold. A lower threshold rejects more instances, improving the sensitivity of anomalous data while increasing the number of false alarms. A higher threshold provide higher true negative rates yet neglecting more possible threats.
A naive but prevalent approach is to set a fixed value as the threshold(As is shown in Algorithm 1). This approach is easy to implement and may give satisfying results in specific cases. However, a fixed threshold requires specific analysis in the certain scenario, manual observations and tuning of parameters, which involves lots of human labour. The rule of “” declares all instances outside to be anomalous. It can be used to automatically determine a threshold. But as a rigid metric, it is merely an estimation of a suitable boundary considering average situations, which is far from optimality when concrete data is provided. It would be either lower than the optimum if anomalous data lies far away from the normal cluster, or higher than the optimum if the anomalies sit close to the cluster centre.
Fortunately, applying divergence as the distance measurement among data collections provides a fine property. With a reference distribution, divergences of normal data collections form a quasiGaussian distribution as we have seen in section IVA. The same applies to those anomalous ones.
Moreover, as is verified in experiments, the metadistribution of anomalous data collections lies in the righthandside to the normal one on the real line. As shown in Fig 3, the black curve ( or in short
) displays the probability density function (PDF) fitting those divergences calculated from normal data collections; the blue curve (
or in short ) displays the PDF derived from anomalous data collections. Threshold is chosen to minimize total errors(both false negative and false positive).Suppose:
(10)  
(11) 
Then the optimal threshold is calculated by E.q.(IVC). The optimal threshold will minimize total errors and yield an optimal outcome. However, this is not accurate enough, since E.q.(IVC) implicates an assumption that chances are the same for a new data collection to be either anomalous or not. If we can determine the probability for a new data collection to be anomalous in any segment of data sequence, the equation should be modified as E.q.(IVC), where is the anomaly probability.
(12) 
Note: when , keep the root s.t.
(13) 
Note: when , keep the root s.t.
Moreover, with an estimated anomaly probability, SDDR can be also optimized by ranking all data collections according to their divergence value and select first ones with highest values as anomalies.
V Evaluation
Our algorithm was implemented and interpreted in Python 3.5.2. All experiments were tested on Ubuntu 16.04. In the following experiments, we figured out properties of real world data and performance of our technique against anomalous data collections. We also made a comparison among variations of SDD algorithms and MGoF.^{1}^{1}1All resources and more detailed experiment results can be retrieved online: https://github.com/TramsWang/StatisticalAnomalyDetection
Va Methodology
We adopted two data sets: 1) Koubei sellers’ transaction records^{2}^{2}2https://tianchi.aliyun.com/competition/information.htm?raceId=231591; 2) Synthetic random distribution data set. Koubei data set was provided by Alibaba Tian Chi big data competition where all records were collected from real world business scenarios. It contained information about seller features, user payments and browsing behaviour. We randomly chose one seller (ID: 1629) and extracted transaction history of this seller, records ranging from Nov. 11th 2015 to Oct. 31st 2016. Entire transaction set was then divided into 325 collections, each containing records in one day. Fig. 4 and 5 give an overview of it.
Two types of clickfarmed data was generated according to patterns described in section IIB. To emulate centralized click farming, we randomly inserted some Gaussiandistributed transactions in the chosen collection. As for emulating the equalized click farmers, we simply doubled each record in the chosen collection to make the new distribution exactly the same as the original one, which is harder for the online platform to discover. Usually, the clickfarmed transactions are several times more than the volume it originally has, if the seller hires a group of organized workers. In our experiments, we use to denote the magnitude coefficient of click farming. Hence . In the following experiments without extra illustration, we adopted .
One defect of this data set is that the detailed time stamp is aligned at each hour of the day due to desensitization. We constructed an enhanced data set by assigning every time stamp a random value for minutes and seconds. Therefore, the enhanced data set should be closer to the reality.
The synthetic data set was divided into four sections. First two sections contained sample sets drawn from a uniform and a Gaussian distribution respectively. The third section used a mixture of one uniform distribution and two Gaussian distributions to simulate a randomshaped distribution. Moreover, we made the randomshaped distribution drift slightly to form the last section of test data. Corresponding anomalies were drown from distributions with deviated parameters respectively.
We adopted histograms to depict distributions of any shape. Surely, the kernel density estimation approaches will give smooth and continuous estimations on any sampled data. But the computational cost will be too much to afford. Step size of histograms is chosen by:
(14) 
where is sample size, is a constant relative to the shape of distribution (e.g. for normal distribution, ) and
the standard deviation. For data sets with a large number of elements, a random sampling method, such as MonteCarlo method, can be applied to speed up the estimation procedure.
Divergence metric adopted in each SDD algorithms was JensenShannon divergence if no specific notation is made. However, MGoF used only KullbackLeibler divergence due to its special mechanism. We use a “+” to denote algorithms optimized by a given .
VB Experiments on Koubei Data Set
We first tested our algorithms on Koubei data set in order to see whether and why the algorithm works. Anomalies were random selected days replaced by corresponding click farmed version. To play the role of purchasing platform, we investigated two levels of transaction distribution. The first level is to simply draw a histogram aligned to time spans. The second level is to draw a histogram on the subvolumes in each time span(i.e. a histogram on frequencies in the first level histogram, as shown in Fig. 6).
On the raw data set, we had no choice but to set one hour a basket. While on the enhanced data set, we adopted E.q.(14) to determine step size automatically. To test SDDE, we randomly selected 30 correct days and 10 click farmed days as normal and anomalous evidence respectively. Here . The results are shown in Table II and II.
Centralized  Equalized  
1st Level  2nd Level  1st Level  2nd Level  
Pre(%)  Rec(%)  F1(%)  T(ms)  Pre(%)  Rec(%)  F1(%)  T(ms)  Pre(%)  Rec(%)  F1(%)  T(ms)  Pre(%)  Rec(%)  F1(%)  T(ms)  
SDDR  89.51  48.75  63.12  266.77  21.97  99.38  35.98  11.12  6.67  0.63  1.14  249.15  21.22  72.50  32.83  7.81 
SDDR+  91.25  91.25  91.25  265.96  61.88  61.88  61.88  10.08  9.38  9.38  9.38  247.31  44.38  44.38  44.38  6.92 
SDDE Static  92.46  68.75  78.86  292.50  36.55  98.13  53.26  5.75  6.67  0.63  1.14  271.45  36.07  86.25  50.86  5.64 
SDDE Static+  85.02  32.50  47.02  293.77  46.24  91.88  61.52  5.95  10.00  0.63  1.18  272.71  43.60  76.25  55.48  5.68 
SDDE Dynamic  49.11  99.38  65.73  699.97  23.01  99.38  37.37  245.65  10.36  18.13  13.18  681.09  22.09  93.13  35.71  242.85 
SDDE Dynamic+  73.21  98.75  84.09  701.06  48.02  96.25  64.07  255.43  8.15  6.88  7.46  681.89  40.79  78.13  53.59  253.03 
MGoF  14.08  21.88  17.13  292.14  13.01  4.38  6.55  3.64  12.50  3.13  5.00  250.42  12.50  3.13  5.00  3.71 
Centralized  Equalized  
1st Level  2nd Level  1st Level  2nd Level  
Pre(%)  Rec(%)  F1(%)  T(ms)  Pre(%)  Rec(%)  F1(%)  T(ms)  Pre(%)  Rec(%)  F1(%)  T(ms)  Pre(%)  Rec(%)  F1(%)  T(ms)  
SDDR  81.54  41.88  55.33  238.61  17.49  92.50  29.42  13.86  6.67  0.63  1.14  239.44  21.15  71.88  32.69  12.08 
SDDR+  91.25  91.25  91.25  236.94  36.88  36.88  36.88  12.98  8.13  8.13  8.13  236.47  38.75  38.75  38.75  11.03 
SDDE Static  88.31  67.50  76.52  257.60  31.99  92.50  47.54  9.74  6.67  0.63  1.14  257.65  34.49  91.25  50.06  9.37 
SDDE Static+  69.67  13.13  22.09  259.30  42.12  73.75  53.62  9.59  10.00  0.63  1.18  258.45  44.17  82.50  57.54  9.32 
SDDE Dynamic  42.93  99.38  59.96  1106.25  20.12  95.63  33.25  277.70  10.57  17.50  13.18  1108.93  20.30  94.38  33.42  271.74 
SDDE Dynamic+  69.18  98.13  81.15  1110.81  33.25  87.50  48.19  292.60  7.06  3.75  4.90  1118.00  37.21  80.63  50.92  283.50 
MGoF  13.97  17.50  15.54  294.08  14.66  8.13  10.45  8.01  12.50  3.13  5.00  250.10  18.42  8.75  11.86  7.55 
When classifying toward 1st level histograms, centralized click farming behaviours can be easily discovered. As displayed in the first two rows in Fig. 7, normal collections share a similar distribution while centralized clickfarmed ones abruptly violated the original shape. However, as a clever click farmer, equalized click farming did not in the least distort the distribution. Most of them escaped the check under perfect disguises. But when it came to 2nd level histograms, the “clever disguise” did not work any longer. It can be clearly seen in Fig. 8 that distribution of divergence of both click farming types shows an obvious deviation from the normal one.
The result showed that our technique outperformed MGoF in every real world cases. SDDE provided best performance, yet it consumed the most computing power. Comparison among SDDR revealed improvement of reference as well as the importance of threshold under this technique. Although dynamic SDDE consumes more computation power, it is clear that dynamic SDDE is capable of tracing the gradual shift of environment. MGoF turned out to be the worst since it always mark several false positive when had not been met and much more false negatives when similar errors occurred too many.
Parameter improved total accuracy of dynamic SDDE algorithm by 1020% as was supposed. It also increased its F1 by more than 20%. made a great difference in SDDR as well, which illustrated that divergence sorted almost all collections in correct order according to the averaged reference. However, static SDDE did not show the same improvement. Since environment drift took greater influence in the result. In comparison with , adaptive threshold given by evidence sets did not bring the most improvement. But this threshold can be applied together with other optimizations such as slide windows.
VC Test against Anomaly Proportion and Magnitude
In this experiment, we tested algorithm performance under various anomaly proportion and magnitude. ranged from 0.1 to 0.9 when and when , other settings remains the same.
Fig. 9 shows that our technique outperformed MGoF and was relatively stable when dealing with all proportions of 1st level centralized anomalies. SDDE performed even better since it maintains knowledge of both normal and anomalous distributions and calculates the threshold according to the best expectation. However, it relies on the accuracy of distribution estimation. When it came to 2nd level distributions, histograms became much coarser since data available was highly limited and thus its performance suffered dramatically.
MGoF tended to classify every distribution as anomaly, therefore benefited most by larger . It always classifies as anomalous the first distributions supporting every null hypothesis. Thus when increased, the proportion of misclassified normal collections also became larger, while those anomalies were still considered anomalous. And given that the total number of normal collections drops down, the overall accuracy tended to increase as more instances are correctly classified as anomalous. However, the right half shows a different trend. One reason is that MGoF uses KLD other than JSD. In the Koubei dataset, the discrete estimation of distributions oscillated in a wide range, leading to that the prerequisite of KLD is often unsatisfied. Thus the calculation of KLD may not give a correct measurement. Furthermore, the 2nd histogram provided fewer probability entries than the 1st level did. Thus it shows a more significant deviation from our expectation. For the classifiers of MGoF, they compromised to a high error rate. Because more anomalies gathered together and the algorithm recognized them as clusters of normal data.
From Fig. 10 we can conclude that our algorithms are still the best, given that they are most sensitive toward tiny anomalous variations. However, static SDDE did not rise until , this is because it suffered from fluctuation on the trade environment at the mean time. MGoF is not sensitive toward minor anomalies either. For a relatively small magnitude of click farming, the classifiers of MGoF quickly degrade to be trivial. The rigid threshold could not automatically rise up and was thus far from to optimal.
VD Results on Synthetic Data Set
In this experiment, we tested all seven algorithms on totally synthetic data sets. Results are shown in TABLE III. It shows that our technique can be applied towards any kind of distributions. And these techniques worked better under irregular distributions since difference were clearer among these. Comparison between SDDR and static SDDE shows that adaptive thresholds provided more flexible classifiers. Results under randomshape drifting proves the efficiency of sliding windows toward drifting context.
The last experiment was carried out on randomshaped distribution data set, with and rest parameters the same. Under different divergence metrics mentioned in section III, F1 scores were calculated among all SDD algorithms and ROC curves were recorded on SDDR and static SDDE. Given that MGoF was defined specifically on KullbackLeibler divergence, it cannot be tested in the same way. Results are shown in Fig. 11 and Fig. 12.
It is indicated that JensenShannon divergence is suited to all techniques due to its symmetry. KullbackLeibler divergence provides more evident differences when references were given. Bhattacharyya distance and Hellinger distance turned out almost as good as JensenShannon divergence, but they consumed less time. KolmogorovSmirnov Statistic performed relatively poor since it considers only the largest gap between two distributions, which provides little information.
VE Discussion
MGoF’s learning procedure of anomalous probability hypothesis is inefficient. To maintain a comprehensive knowledge of anomalies, MGoF has to reserve a single hypothesis entry for every type of them. But in reality, it is always the case that we face the heterogeneity of outliers. In the Koubei data set, there can be tens of anomalous distributions caused solely by centralized click farming. It takes a long time to discover every possible type of anomaly. Besides, if there happens to be more than anomalous distributions of the same type, later discovered collections will no longer declared to be anomalous any more.
However, in SDDR and SDDE, that is not a problem since it can map and gather all anomalies together and draw a universal boundary between them and all normal collections. These techniques are suitable to all typical divergence metrics and consume little computation power(except dynamic SDDE). The only drawback is that they require comprehensive estimation of target distributions. Although other parameters need estimation as well, they are naturally addressable under big data circumstances.
Vi Conclusion
This paper proposes a series of collective anomaly detection techniques, which helps detect data manipulations in modern data pipelines and data centres. Different from existing algorithms designed for collective anomalies, our approach employs statistical distance as the similarity measurement. We explored several technical points involved in the design of the algorithm and performed a thorough experiment to test its efficiency. The comparison experiment also illustrated the advantages of our technique. It can be concluded that the our technique can efficiently discover anomalies within the data collections and the classifier is sensitive enough toward real world data manipulations.
References

[1]
E. Amid, A. Mesaros, K. J. Palomäki, J. Laaksonen, and M. Kurimo, “Unsupervised feature extraction for multimedia event detection and ranking using audio content,” in
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 5939–5943.  [2] P. J. Moreno, P. P. Ho, and N. Vasconcelos, “A kullbackleibler divergence based kernel for svm classification in multimedia applications,” in Advances in neural information processing systems, 2004, pp. 1385–1392.
 [3] D.C. Park, D.H. Nguyen, S.H. Beack, and S. Park, “Classification of audio signals using gradientbased fuzzy cmeans algorithm with divergence measure,” in PacificRim Conference on Multimedia. Springer, 2005, pp. 698–708.
 [4] H. S. Pheng, S. M. Shamsuddin, W. Y. Leng, and R. Alwee, “Kullback leibler divergence for image quantitative evaluation,” in AIP Conference Proceedings, vol. 1750, no. 1. AIP Publishing, 2016, p. 020003.

[5]
J. Goldberger, S. Gordon, and H. Greenspan, “An efficient image similarity measure based on approximations of kldivergence between two gaussian mixtures,” in
null. IEEE, 2003, p. 487.  [6] (2017) Big data statistics and facts for 2017. [Online]. Available: https://www.waterfordtechnologies.com/bigdatainterestingfacts/
 [7] B. Saha and D. Srivastava, “Data quality: The other face of big data,” in Data Engineering (ICDE), 2014 IEEE 30th International Conference on. IEEE, 2014, pp. 1294–1297.
 [8] M. A. Rassam, M. A. Maarof, and A. Zainal, “Adaptive and online data anomaly detection for wireless sensor systems,” KnowledgeBased Systems, vol. 60, pp. 44–57, 2014.
 [9] H. Herodotou, B. Ding, S. Balakrishnan, G. Outhred, and P. Fitter, “Scalable near realtime failure localization of data center networks,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 1689–1698.
 [10] F. Schuster, M. Costa, C. Fournet, C. Gkantsidis, M. Peinado, G. MainarRuiz, and M. Russinovich, “Vc3: Trustworthy data analytics in the cloud using sgx,” in Security and Privacy (SP), 2015 IEEE Symposium on. IEEE, 2015, pp. 38–54.
 [11] TowerData. (2013) 4 steps to eliminating human error in big data. [Online]. Available: http://www.towerdata.com/blog/bid/113787/4StepstoEliminatingHumanErrorinBigData
 [12] Is data manipulation the next step in cyber crime. [Online]. Available: https://www.cloudmask.com/blog/isdatamanipulationthenextstepincybercrime
 [13] (2016). [Online]. Available: https://www.cnbc.com/2016/03/09/thenextbigthreatinhackingdatasabotage.html
 [14] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM computing surveys (CSUR), vol. 41, no. 3, p. 15, 2009.
 [15] A. L. Goldberger et al., “Components of a new research resource for complex physiologic signals, physiobank, physiotoolkit, and physionet, american heart association journals,” Circulation, vol. 101, no. 23, pp. 1–9, 2000.
 [16] L. Cao, D. Yang, Q. Wang, Y. Yu, J. Wang, and E. A. Rundensteiner, “Scalable distancebased outlier detection over highvolume data streams,” in Data Engineering (ICDE), 2014 IEEE 30th International Conference on. IEEE, 2014, pp. 76–87.
 [17] L. Cao, Y. Yan, C. Kuhlman, Q. Wang, E. A. Rundensteiner, and M. Eltabakh, “Multitactic distancebased outlier detection,” in Data Engineering (ICDE), 2017 IEEE 33rd International Conference on. IEEE, 2017, pp. 959–970.
 [18] M. M. Masud, Q. Chen, L. Khan, C. C. Aggarwal, J. Gao, J. Han, A. Srivastava, and N. C. Oza, “Classification and adaptive novel class detection of featureevolving data streams,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 7, pp. 1484–1497, 2013.
 [19] Y. Li, Q. Li, J. Gao, L. Su, B. Zhao, W. Fan, and J. Han, “On the discovery of evolving truth,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 675–684.
 [20] J. Shao, Z. Ahmadi, and S. Kramer, “Prototypebased learning on conceptdrifting data streams,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 412–421.
 [21] X. Zhang, W. Dou, Q. He, R. Zhou, C. Leckie, R. Kotagiri, and Z. Salcic, “Lshiforest: A generic framework for fast tree isolation based ensemble anomaly analysis,” in Data Engineering (ICDE), 2017 IEEE 33rd International Conference on. IEEE, 2017, pp. 983–994.
 [22] J. Yin and J. Wang, “A modelbased approach for text clustering with outlier detection,” in Data Engineering (ICDE), 2016 IEEE 32nd International Conference on. IEEE, 2016, pp. 625–636.
 [23] Y. Zhu and D. Shasha, “Statstream: Statistical monitoring of thousands of data streams in real time,” in Proceedings of the 28th international conference on Very Large Data Bases. VLDB Endowment, 2002, pp. 358–369.
 [24] T. Caudell and D. Newman, “An adaptive resonance architecture to define normality and detect novelties in time series and databases,” in IEEE World Congress on Neural Networks, Portland, Oregon, 1993, pp. 166–176.
 [25] P. K. Chan and M. V. Mahoney, “Modeling multiple time series for anomaly detection,” in Data Mining, Fifth IEEE International Conference on. IEEE, 2005, pp. 8–pp.
 [26] S. Budalakoti, A. N. Srivastava, R. Akella, and E. Turkov, “Anomaly detection in large sets of highdimensional symbol sequences,” 2006.

[27]
N. Ye et al.
, “A markov chain model of temporal behavior for anomaly detection,” in
Proceedings of the 2000 IEEE Systems, Man, and Cybernetics Information Assurance and Security Workshop, vol. 166. West Point, NY, 2000, p. 169.  [28] C. Warrender, S. Forrest, and B. Pearlmutter, “Detecting intrusions using system calls: Alternative data models,” in Security and Privacy, 1999. Proceedings of the 1999 IEEE Symposium on. IEEE, 1999, pp. 133–145.
 [29] D. Pavlov, “Sequence modeling with mixtures of conditional maximum entropy distributions,” in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on. IEEE, 2003, pp. 251–258.
 [30] R. Yu, X. He, and Y. Liu, “Glad: group anomaly detection in social media analysis,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 10, no. 2, p. 18, 2015.
 [31] C. Wang, K. Viswanathan, L. Choudur, V. Talwar, W. Satterfield, and K. Schwan, “Statistical techniques for online anomaly detection in data centers,” in Integrated Network Management (IM), 2011 IFIP/IEEE International Symposium on. IEEE, 2011, pp. 385–392.
 [32] Iresearch. (2016). [Online]. Available: http://report.iresearch.cn/content/2016/11/265551.shtml
 [33] M. Zhao, “On the phenomenon of click farming in taobao,” CHINA BUSINESS, no. 12, pp. 31–33, 2016.
 [34] Z. Yan, “Report of click farming phenomenon in taobao–protection for consumers’ right to know in online shopping,” FaZhi Yu JingJi, vol. 20, pp. 195–196, 2015.
 [35] T. L. Mirror. (2016). [Online]. Available: http://tech.163.com/15/0405/14/AMENOJ7D00094ODV.html
 [36] (2015). [Online]. Available: http://www.ebrun.com/20150906/147724.shtml
Comments
There are no comments yet.