A Docker Container Anomaly Monitoring System Based on Optimized Isolation Forest

Abstract— Container-based virtualization has gradually become a main solution in today's cloud computing environments. Detecting and analyzing anomalies in containers present a major challenge for cloud vendors and users. This paper proposes an online container anomaly detection system that monitors and analyzes multidimensional resource metrics of containers based on an optimized isolation forest algorithm. To improve detection accuracy, the system assigns each resource metric a weight and changes the random feature selection in the isolation forest algorithm to weighted feature selection according to the resource bias of the container. In addition, it can identify abnormal resource metrics and automatically adjust the monitoring period to reduce the monitoring delay and system overhead. Moreover, it can locate the cause of anomalies by analyzing and exploring the container logs. The experimental results demonstrate the performance and efficiency of the system in detecting typical container anomalies in both simulated and real cloud environments.

As containers continue to rise and fall, one of the challenges is how to monitor multiple resources at the same time in a dynamic environment with low overhead. Rule-based methods [6], [7], [8] detect abnormalities by setting a threshold for each metric. They assume that only one container is running on the host at the beginning, and set a fixed threshold for each resource metric of the container. When another container is created with a resource priority, the original resource thresholds of the first container are adjusted according to the resource usage of the second container. This adjustment becomes impractical when there are numerous and dynamically changing containers. The statistics-based method [9] assumes that the data obeys some standard distribution model and finds outliers that deviate from the distribution. Since most such models are based on univariate assumptions, they are not applicable to multidimensional data. To solve the above-mentioned problems, the academic community has proposed methods such as the Local Outlier Factor (LOF) [10] and Angle-Based Outlier Detection (ABOD) [11]. They identify outliers by estimating the density of local data or calculating angle changes. However, both incur a large computation overhead when the sample data size is large.
The existing monitoring systems (e.g., Ganglia [6], Nagios [8], Akshay [12], cAdvisor [13]) generally adopt a fixed monitoring period to query the abnormality of the system. When the monitoring period is very small, the monitoring system can quickly locate abnormalities. However, this results in a huge system overhead when there are too many monitoring objects. When the monitoring period is large, the monitoring delay also increases. Thus, it is necessary to adopt a proper monitoring period according to the system running state.
When an exception occurs in a container, it usually causes a change in the resource usage of the container. For example, an endless loop in a running program can eat all the CPU resources, and a memory leak will cause the memory usage to become higher. Therefore, it is necessary to identify the anomalies and analyze their cause [14].
Second, Docker has fewer layers of abstraction and does not require an additional Operating System (OS) or hypervisor support [15]. Thanks to this, Docker has better resource utilization. Typically, there can be thousands of Docker containers running on a single machine that can hold only a small number of virtual machines. Because Docker is lightweight, its startup time is only a few seconds, far faster than the several minutes a virtual machine needs.
Third, Docker can run on almost any platform, which gives Docker better mobility and scalability [16]. In addition, it is easy to deploy and maintain.
Due to the advantages of Docker over traditional virtual machines, more and more researchers have begun to use Docker instead of virtual machines [16], [17], [18], [19]. For instance, Tihfon et al. [16] implemented a multi-task PaaS (Platform as a Service) cloud infrastructure with Docker, achieving rapid deployment of applications, application optimization, and isolation. Nguyen et al. [18] implemented distributed Message Passing Interface (MPI) clustering for high-performance computing through Docker. Setting up MPI clusters was originally very time-consuming, but with Docker they made this work relatively easy. Julian et al. [19] optimized an auto-scaling network cluster with Docker, and they believe that Docker containers can be used more widely in larger production environments.

Classic Isolation Forest Algorithm
Unlike other algorithms, the Isolation Forest algorithm (i.e., iForest [20]) does not need to define a mathematical model, nor does it require training. It is somewhat similar to the dichotomy method. The iForest consists of a number of isolation trees (i.e., iTrees) whose leaf nodes each contain a single data item. The sooner a data item is isolated, the sparser it is in the data set, and therefore the more likely it is abnormal.
Assume that there are N data items in the data set. The steps of building an iTree are as follows: First, we get n samples from the N data items as the training samples for this tree. Second, we randomly select a feature, and randomly select a value p within the range of all values of this feature as the root node of the tree, then perform a binary division on the samples. Samples whose value is smaller than p are divided into the left side of the root node, and samples whose value is greater than p are divided into the right side. Third, we repeat the above process on the left and right branches until a termination condition is reached: either the data itself cannot be divided (only one sample remains or all samples are the same), or the height of the tree reaches log2(n).
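The three steps above can be sketched as a recursive split. This is a minimal illustration with uniformly random feature and split-value choices, not the authors' implementation; the dictionary-based tree layout is an assumption made for readability:

```python
import math
import random

def build_itree(samples, height=0, height_limit=None):
    """Recursively build an isolation tree over a list of feature tuples."""
    if height_limit is None:
        height_limit = math.ceil(math.log2(max(len(samples), 2)))
    # Termination: one sample, all samples identical, or height limit log2(n) reached.
    if len(samples) <= 1 or height >= height_limit or all(s == samples[0] for s in samples):
        return {"size": len(samples)}
    f = random.randrange(len(samples[0]))       # randomly select a feature
    lo = min(s[f] for s in samples)
    hi = max(s[f] for s in samples)
    if lo == hi:                                # this feature cannot split the samples
        return {"size": len(samples)}
    p = random.uniform(lo, hi)                  # randomly select a split value p
    left = [s for s in samples if s[f] < p]     # values smaller than p go left
    right = [s for s in samples if s[f] >= p]   # the rest go right
    return {"feature": f, "split": p,
            "left": build_itree(left, height + 1, height_limit),
            "right": build_itree(right, height + 1, height_limit)}
```

Sparse outliers tend to end up alone in a leaf after only a few splits, which is what the path-length-based score below exploits.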
To perform anomaly detection, we construct an iForest that consists of a number of iTrees. Assume the path length between a data item x and the root node is h(x), and the average of h(x) over all iTrees is E(h(x)). s(x, n) is the anomaly value of data x in the n samples of a data set. We compute it as follows: s(x, n) = 2^(−E(h(x))/c(n)), where c(n) = 2H(n−1) − 2(n−1)/n is the average path length of an unsuccessful search in a binary search tree and H(i) is the i-th harmonic number, estimated as ln(i) plus Euler's constant [20]. The closer s(x, n) is to 1, the more likely x is anomalous.
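A minimal sketch of the score computation s(x, n) = 2^(−E(h(x))/c(n)), using the normalizing factor c(n) from the iForest paper [20]; the function names are illustrative:

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, used in the harmonic-number estimate

def c(n):
    """Average path length of an unsuccessful BST search over n samples;
    normalizes path lengths so scores are comparable across sample sizes."""
    if n <= 1:
        return 0.0
    h = math.log(n - 1) + EULER_GAMMA          # harmonic number H(n-1) approximation
    return 2.0 * h - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)); values close to 1 indicate anomalies."""
    return 2.0 ** (-avg_path_length / c(n))
```

When the average path length equals c(n) the score is exactly 0.5, which is why 0.5 serves as the "normal" anomaly value later in the paper.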

Anomaly Detection Method
The mathematical statistics-based method [9] builds standard distribution models based on historical data, finds data points that deviate from the distribution, and judges them as anomalies. However, most of the models are based on the assumption of a single variable. When the monitoring metric is multidimensional, it is difficult to accurately identify the anomaly. In addition, these models are calculated using the original data, whose noise has a significant impact on the building of the distribution model [21].
The information entropy based method [22] detects anomalies by comparing the entropies of the same cluster at different times. If there is a large fluctuation, it indicates the occurrence of anomalies. However, this method is only suitable for a stable operating environment. A dynamically changing container cluster will result in inaccurate detection results.
The idea of the distance-based method [23] is to calculate the distance between different data items. When the distance between two data items is less than a neighbor distance D, they are regarded as "neighbors". If the number of neighbors of a data item is less than a threshold p, then it is judged to be anomalous. However, this method is not suitable for scenarios where the data distribution has a multi-cluster structure [24]. Typically, multiple continuous anomalous resource metric data points appear and cluster into neighbors when an anomaly occurs. However, they cannot be identified by this method.
The most representative of the density-based methods is the Local Outlier Factor [10], which measures the degree of abnormality of each data instance based on a density-based local outlier factor. The larger the local outlier factor, the more likely the instance is abnormal. However, the local density estimate can cause significant computational overhead when the sample data size is large [25]. Thus it is not suitable for a large number of containers.

SYSTEM DESIGN AND IMPLEMENTATION
3.1 Architecture
The monitoring system architecture is shown in Fig. 1. It mainly consists of four components: monitoring agent, monitoring data storage, anomaly detection, and anomaly analysis.
There is only one monitoring agent on each host machine.
It uses a non-invasive way to obtain the resource utilization of the containers. The monitoring data storage module receives the monitoring data from each host. Only the monitoring data in the most recent period of time is stored, and the data is organized into a specified format and sent to the anomaly detection module. The anomaly detection module examines the data received from the monitoring data storage module through an iForest-based abnormality evaluation method, and sends abnormal container information to the anomaly analysis module, which first obtains the log of the abnormal container from each host, then analyzes the log and locates the cause of the anomaly.

Monitoring Agent
The internal design of the monitoring agent is shown in Fig. 2. The monitoring agent collects the container data through the monitoring data collector, using each container's monitoring period to decide when to collect its information. When a container is found to be likely abnormal, its monitoring period is halved in order to identify the anomaly as soon as possible. In this case, the corresponding container information will be collected more frequently, so the container is moved toward the front of the collection queue. In contrast, if a container recovers to normal, its monitoring period is doubled and the container is moved toward the back of the queue.
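The period-halving and queue-reordering behavior above can be sketched with a priority queue keyed by each container's next collection time. This is an illustrative scheduler, not the paper's implementation; the class name, the 4-second initial period, and the 0.5-second floor are assumptions:

```python
import heapq

class CollectionQueue:
    """Schedules containers by next collection time = last time + period."""
    def __init__(self, initial_period=4.0):
        self.initial_period = initial_period
        self.periods = {}            # container_id -> current monitoring period
        self.heap = []               # (next_collection_time, container_id)

    def add(self, cid, now=0.0):
        self.periods[cid] = self.initial_period
        heapq.heappush(self.heap, (now + self.initial_period, cid))

    def pop_next(self):
        """Return the container due soonest and reschedule it by its current period."""
        t, cid = heapq.heappop(self.heap)
        while cid not in self.periods:           # skip containers that were removed
            t, cid = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (t + self.periods[cid], cid))
        return cid, t

    def mark_suspect(self, cid):
        # Halve the period so the container drifts toward the front of the queue.
        self.periods[cid] = max(self.periods[cid] / 2, 0.5)

    def mark_normal(self, cid):
        # Double the period on recovery, capped at the initial period.
        self.periods[cid] = min(self.periods[cid] * 2, self.initial_period)
```

Because rescheduling uses the current period, a suspect container naturally reappears at the head of the heap more often without any explicit queue reshuffling.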
Log Collection. Based on the log collection command from the monitoring server, this module collects logs for the specified container and passes them to the transmission module in the specified format.
Transmission. This module has two functions: on one hand, it accepts various commands from the monitoring server and forwards them to the corresponding modules; on the other hand, it transfers the monitoring data to the monitoring server.

Monitoring Data Storage
The monitoring data storage module is responsible for storing the data collected by the monitoring agent and transmitting the data to the anomaly detection module in a specified format.
It uses InfluxDB [26] to store the collected container information.
Here, a self-learning method for resource bias optimization is proposed. During the normal use of a container, the container's bias parameter M for each resource is calculated by formula (3), where W_0 is the initial weight value of the resource metric (its value is 1) and the remaining parameter is the resource threshold.
Algorithm 1. Weighted Random Algorithm
Output: i // A feature among the four features (CPU usage rate, Memory usage rate, IO rate, Network usage rate).
1: M_all is the sum of all weight values. R is a random value in the range of 0 to M_all, and the returned i is the index number of the resource selected as the feature to divide the data set.
Anomaly Resource Metric Judgement. The iForest algorithm can calculate the anomaly value of the multidimensional resource metrics, but cannot determine which metric causes the anomaly. For example, consider two exception cases: in one, the CPU usage is abnormally increased; in the other, the memory usage is abnormally increased. The anomaly values are similar in both cases under the iForest algorithm, so it is impossible to distinguish which kind of anomalous resource usage caused this. To solve this problem, this paper proposes a method to judge the anomalous metric.
1) When constructing an isolation tree, if a leaf node is generated when a division is performed, the feature selected by the division is called an isolation feature of the data on the leaf node, indicating that this data is isolated by this feature in the last division.
2) Set an isolation feature group for each data item, such as S(S_1, S_2, ..., S_n), where S_i represents the number of times the metric feature numbered i is used as an isolation feature. We can then compute the number of times that each resource metric is used as the isolation feature and identify the anomalous resource metric.
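Summarizing the isolation feature group to name the anomalous metric reduces to an argmax over the per-feature counts. A minimal sketch, with illustrative metric names:

```python
from collections import Counter

METRICS = ["cpu", "memory", "io", "network"]   # illustrative metric names

def judge_anomalous_metric(isolation_features):
    """Given, for one data item, the isolating feature recorded in each iTree,
    return the metric most often responsible for isolating the item."""
    counts = Counter(isolation_features)
    return max(METRICS, key=lambda m: counts.get(m, 0))
```

The metric that most frequently isolated the item across the forest is taken as the one driving its high anomaly value.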

Monitoring Period Adjustment
In order to improve the timeliness of monitoring, the monitoring period can be reduced in the case of possible anomalies, so that more monitoring data is collected and changes in the anomaly value are detected earlier. An anomaly sensitivity threshold f is set to determine whether an anomaly is likely to occur. The value of f is related to the anomaly detection threshold d and to p, the normal anomaly value originally set for the isolation forest, which is 0.5 by default. When the average anomaly value of the data in a period is between f and d, the criterion for judging an anomaly is not yet reached, but the high anomaly value indicates that the container may be abnormal. At this time, the container is set as an intensive monitoring object, and the monitoring server sends a message such as {"container_id": 100, "type": "intensive"} to the monitoring agent. The container_id is the ID of the container, and there are two types: intensive and extensive.
When the type is intensive, the corresponding monitoring period is set to half of the initial monitoring period. When the average anomaly value of the data drops below f, a command of type extensive is sent to the monitoring agent to restore the monitoring period to its initial value.
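The intensive/extensive decision above can be sketched as a small function. The message fields mirror the {"container_id": ..., "type": ...} example in the text; the function name and the None return for the already-detected case are assumptions:

```python
def monitoring_command(avg_anomaly_value, container_id, f, d):
    """Decide the period-adjustment message for a container.
    f is the anomaly sensitivity threshold, d the detection threshold, f < d."""
    if f <= avg_anomaly_value < d:
        # Suspiciously high but below the detection threshold: monitor intensively.
        return {"container_id": container_id, "type": "intensive"}
    if avg_anomaly_value < f:
        # Back to normal: restore the initial monitoring period.
        return {"container_id": container_id, "type": "extensive"}
    return None   # >= d: handled as a detected anomaly, not a period change
```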

Anomaly Analysis
The anomaly analysis module mainly analyzes the log of the abnormal container identified by the anomaly detection module, and determines the cause of the anomaly. The source data for the anomaly analysis are the logs collected by the log collection module in the monitoring agent. The anomaly analysis module mainly contains the following two parts.

Log Preprocessing
Before the log analysis, the first step is to perform log preprocessing. We extract only useful log events to reduce storage and analysis overhead. We demonstrate the monitoring system with two representative benchmarks in the cloud environment: one is Memcached, and the other is Web Search in CloudSuite.
Memcached is an open-source, high-performance, distributed memory object caching system intended to speed up dynamic web applications by alleviating database load [30]. CloudSuite is a benchmark suite for cloud services and consists of eight applications that have been selected based on their popularity in today's data centers [31]. Since there is no benchmark for container anomaly injection, we divided anomalies into four common categories that involve different resource metrics. They are shown and illustrated in Table 3.
Similar to the previous work [33], we use the following four cases to simulate the anomalies.
Endless Loop in CPU. We inject this fault into the application by inserting additional code that calls the stress tool [34], which can simulate an endless loop in the CPU and drive CPU utilization to 100 percent.
Memory Leak. The injected code allocates heap memory without releasing objects, which can gradually take up 100 percent of memory utilization.
Disk I/O Fault. We use FIO [35] to inject extra disk read and write operations and simulate a disk I/O fault.
Network Congestion.We simulate network congestion by using wondershaper [36] to limit the bandwidth of the specified network interface.

The Result Comparison of Anomaly Detection
We use the detection rate and false alarm rate to evaluate the result of anomaly detection. The above experiments assume that the injected malicious programs consume 100 percent of CPU through endless loops.
However, in practice, a malicious user who tries to compromise the performance of the whole system can use malicious programs that take not 100 percent of CPU but, for example, 60 percent of CPU for a long time. The anomalous resource metric needs to be located after detecting a container anomaly. We propose a method that calculates the ratio of isolation features in the anomalous phase to isolation features in the normal phase. Table 7 shows the ratio of isolation features when an endless loop in CPU and network congestion are injected. It can be seen that the ratios of isolation features for anomalous resource metrics are higher than the others, so this method can accurately locate the anomalous resource metric.

Detection Threshold d
The detection rate and false alarm rate are closely related to the detection threshold d. In order to find the optimal value, 200 tests were performed, covering the four typical anomalies mentioned above with each performed 50 times. Different detection thresholds were used for detection. The results are shown in Fig. 6.
Both the detection rate and false alarm rate decrease rapidly with the increase in d. We need to choose a value of d with a high anomaly detection rate and a low false alarm rate. According to Fig. 6, the optimal value of d is 0.54.

The Number of iTrees
The number of iTrees is an important parameter in the optimized iForest. In order to find its optimal value, we measure the detection rate, the false alarm rate, and the computation time under different numbers of iTrees. The detection threshold is set to 0.54. The results are shown in Fig. 7.
It can be seen that the detection rate increases and the false alarm rate decreases as the number of iTrees increases.
But the computation time still increases proportionally.
Increasing the number of iTrees does not improve the anomaly detection effect after the number of iTrees exceeds 100. So the optimal value of the number of iTrees is 100. The interval between when the anomaly is injected and when the anomaly is found is defined as the monitoring delay. Two anomalies are used for the log analysis cases: one is reading and writing the disk constantly using postmark to simulate a disk attack; the other is sending a large number of GET requests to the webpage to simulate a network attack.
Disk Attack. When postmark is running constantly, the disk read-write rate increases abnormally, and the container

We propose an optimized isolation forest algorithm that sets weights for different resource metrics and can locate the anomalous resource metric by taking into account the type of container application workload. We have implemented both the system and the algorithm and evaluated them in both simulated and real commercial cloud (AWS) environments on a wide variety of anomaly cases in terms of detection accuracy, monitoring delay, and log analysis.

2 BACKGROUND AND RELATED WORK

In this section, we first describe the background technologies on Docker and isolation forest. Then we elaborate the related work on monitoring systems and anomaly detection methods.

2.1 Docker Technology

Docker is a lightweight virtualization solution that is essentially a process on the host machine. Docker implements resource isolation through kernel-level namespaces. It allows process communications between hosts and containers without interfering with each other. Compared with virtual machines, Docker has the following advantages:
First, Docker has higher performance and efficiency than traditional virtualization methods. Unlike the hardware-layer virtualization of virtual machines, Docker does not perform hardware emulation, and implements virtualization at the operating system level.

host and its monitoring period. When receiving the monitoring period adjustment command sent by the server, the module changes the monitoring period and sends the changed results to the data collection control module.
Data Collection Control. This module is the control center of the monitoring agent and maintains a collection queue. It calculates the next monitored container based on the last collection time and monitoring period of each container, and sends this information to the monitoring data collection module. At the same time, the module also accepts the container start and stop information transmitted by the container information management module, thereby adding or deleting containers in the queue. The module can also adjust the monitoring sequence of the containers in the queue according to the monitoring period modification information transmitted by the monitoring period adjustment module. The monitoring period indicates the time interval
N_i is the usage rate of the resource at time i. p is the number of times the resource usage is measured. If x > 0, then f(x) = 1; otherwise f(x) = 0. If the value of the resource metric is always 0, the container does not use the resource, so we set its weight to 0. The larger the parameter M, the more the container is biased toward the resource. The bias parameter M is used as the weight value for each resource metric. By default, all resource metrics have a weight value of 1. Then we determine the period under which the weight value is updated; we specify every 10 minutes as a period. The bias parameter M is calculated from the usage data during this period, and the weight value is then replaced by M. Finally, a weighted random algorithm is used to select the features. The pseudocode of the algorithm is shown in Algorithm 1.
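Formula (3) itself did not survive extraction, so the sketch below encodes only one plausible reading of it — the fraction of the p samples whose usage rate exceeds the resource threshold, scaled by W_0. That reading is an assumption; only the roles of N_i, p, f, and the zero-usage rule come from the text:

```python
def bias_parameter(usage_rates, threshold, w0=1.0):
    """One plausible reading of formula (3) (ASSUMPTION -- the formula is not
    given in the text): the fraction of samples whose usage rate N_i exceeds
    the resource threshold, scaled by the initial weight W_0."""
    p = len(usage_rates)
    if all(n == 0 for n in usage_rates):
        return 0.0                        # resource unused -> weight 0, per the text
    f = lambda x: 1 if x > 0 else 0       # the indicator f described in the text
    return w0 * sum(f(n - threshold) for n in usage_rates) / p
```

Whatever the exact form of formula (3), the key property it must have is the one stated in the text: heavier use of a resource yields a larger M, and an entirely unused resource gets weight 0.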

end for
6: return i
M_1, M_2, M_3, and M_4 are the four resource weight values.
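Algorithm 1 as described — draw R uniformly in [0, M_all) and return the index of the weight interval R falls into — can be sketched as follows (the function name is illustrative):

```python
import random

def weighted_select(weights):
    """Weighted random feature selection: weights are M_1..M_4 for the four
    resource metrics; the returned index i picks the feature used to divide
    the data set."""
    m_all = sum(weights)                  # M_all: the sum of all weight values
    r = random.uniform(0, m_all)          # R: random value in [0, M_all)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r < cumulative:                # R falls into interval i
            return i
    return len(weights) - 1               # guard against floating-point edge cases
```

A metric with a higher weight owns a larger interval of [0, M_all) and is therefore proportionally more likely to be selected.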

3) Randomly select a value n from the range of the values of feature F;
4) According to the feature F, the data set is divided: data with the value of feature F less than n are divided into the left branch, and data with the value of feature F greater than or equal to n are divided into the right branch;
5) Repeat steps 2) through 4) recursively to construct the left and right branches of the iTree until one of the following conditions is met: a) there is only one data item in the data set to be split; b) the height of the tree reaches a predefined height.
As shown in Fig. 3, the construction of the isolation forest is somewhat similar to the random forest. Each part of the data set is randomly sampled to construct each tree. Then we calculate the average height of each data item in all the iTrees and compute the anomaly value of the data according to formulas (1) and (2).
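Steps 2) through 5) can be sketched with weighted feature selection in place of the classic uniform choice, also recording the isolation feature whenever a split leaves a single item in a branch. Function names and the dictionary data layout are illustrative assumptions, not the authors' code:

```python
import math
import random

def weighted_choice(weights):
    """Weighted random feature index, as in Algorithm 1 (sketch)."""
    r = random.uniform(0, sum(weights))
    c = 0.0
    for i, w in enumerate(weights):
        c += w
        if r < c:
            return i
    return len(weights) - 1

def build_weighted_itree(samples, weights, isolation_counts, height=0, limit=None):
    """Optimized iTree sketch: features are drawn by weight (step 2), and when a
    split isolates a single sample the split feature is added to that sample's
    isolation feature group (isolation_counts: sample tuple -> per-feature list)."""
    if limit is None:
        limit = math.ceil(math.log2(max(len(samples), 2)))
    if len(samples) <= 1 or height >= limit or all(s == samples[0] for s in samples):
        return {"size": len(samples)}
    f = weighted_choice(weights)                       # step 2: weighted feature pick
    lo, hi = min(s[f] for s in samples), max(s[f] for s in samples)
    if lo == hi:
        return {"size": len(samples)}
    v = random.uniform(lo, hi)                         # step 3: random split value
    left = [s for s in samples if s[f] < v]            # step 4: binary division
    right = [s for s in samples if s[f] >= v]
    for side in (left, right):
        if len(side) == 1:                             # record the isolation feature
            isolation_counts.setdefault(side[0], [0] * len(weights))[f] += 1
    return {"feature": f, "split": v,
            "left": build_weighted_itree(left, weights, isolation_counts, height + 1, limit),
            "right": build_weighted_itree(right, weights, isolation_counts, height + 1, limit)}
```

Building many such trees and summing isolation_counts per sample yields the isolation feature group S(S_1, ..., S_n) used later to judge the anomalous metric.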
The Web Search benchmark is one of them and relies on the Apache Solr search engine framework. It contains a 12 GB index generated by crawling a set of websites with Apache Nutch. For Memcached, we use Mutilate [32] as a workload generator, and for Web Search, we use the Faban client provided by CloudSuite.

Fig. 5. Anomaly values of the Memcached container at runtime. The red line shows the detection threshold.

4.7
Two sets of anomaly detection tests based on the optimized iForest are performed. One of the tests uses a fixed monitoring period of 4 seconds, i.e., we get a group of container data every 4 seconds. The other test adopts the method of dynamically adjusting the monitoring period, with an initial monitoring period of also 4 seconds. We inject the four typical anomalies mentioned above for each test. The comparison results are shown in Fig. 8. The monitoring delay of the dynamically adjusting period is significantly lower than that of the fixed monitoring period. When an anomaly is suspected, the monitoring period is halved; more monitoring data is collected per unit of time, making the anomaly detected earlier. When the container recovers to the normal status, the monitoring period is adjusted back to the initial value. Dynamically adjusting the period reduces the monitoring delay by an average of 13.5 percent. The average monitoring delays are between 40 and 55 seconds when the monitoring period is fixed at 4 seconds. The reason is as follows. The optimized iForest algorithm initially gets 100 groups of data to build an iForest. It has a window size of 100 and a sliding distance of 10. Whenever it gets 10 new groups of data, it uses the previous 90 groups of data and these 10 new groups to build a new iForest. If the average anomaly value of these 10 groups exceeds the detection threshold, an anomaly is identified. As it takes 4 seconds to get a group of data, it needs a total of 40 seconds to get these 10 groups of container data. Thus when the anomaly of these data is identified, the monitoring delay is at least 40 seconds. Comparatively, when the monitoring period can be dynamically adjusted, the monitoring period can drop below 4 seconds, so the monitoring delay can sometimes be lower than 40 seconds.

Cases for Log Analysis

Here are two examples showing how to analyze container logs. In order to locate the cause of an anomaly by analyzing logs, two
anomalies which leave traces in the logs are injected.
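The window-of-100 / slide-of-10 detection scheme described above can be sketched as follows; score_fn stands in for the iForest scoring step and the class name is an assumption:

```python
from collections import deque

class SlidingWindowDetector:
    """Window of 100 samples advancing 10 at a time: each new batch of 10 is
    scored against a forest built over the latest 100 samples; score_fn is any
    callable returning per-sample anomaly values for the newest batch."""
    def __init__(self, score_fn, window=100, slide=10, threshold=0.54):
        self.score_fn = score_fn
        self.window, self.slide, self.threshold = window, slide, threshold
        self.data = deque(maxlen=window)   # keeps only the latest `window` samples
        self.pending = []

    def add(self, sample):
        """Feed one monitoring sample; return True when the newest batch is anomalous."""
        self.pending.append(sample)
        if len(self.pending) < self.slide:
            return False
        batch, self.pending = self.pending, []
        self.data.extend(batch)
        if len(self.data) < self.window:
            return False                   # still filling the initial window
        scores = self.score_fn(list(self.data), batch)
        return sum(scores) / len(scores) > self.threshold
```

This layout makes the at-least-40-second delay bound visible: at a 4-second period, a full batch of 10 takes 40 seconds to accumulate before it can be scored.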

Fig. 6. Anomaly detection effect diagram in the case of different detection thresholds d.
Fig. 7.

TABLE 1
The first step is to delete the redundant data in the data set. Redundant data can affect the structure of isolation forests and reduce the accuracy of anomaly detection. When multiple identical records appear, the extra data must be deleted.
In addition, the integrity of the data set must be preserved. The absence of data often occurs in datasets and must therefore be handled appropriately, or else it will affect the structure and anomaly detection accuracy of isolation forests. Severe missing cases are defined as: a) missing more than 20 percent of monitoring points over a period of time; b) missing 5 or more consecutive monitoring points. If there is a serious loss of data in the data set, the data in that period is excluded from the detection range.

3.4.2 Optimization of Isolation Forest Algorithm

Introduction and Calculation of Resource Weight. The idea of the classic iForest algorithm is very concise and efficient, and can be directly applied to many application scenarios. However, there are still some problems when it is applied to the container environment. In container monitoring, there are four most commonly used monitoring metrics: CPU usage, memory usage, disk read and write rates, and network speed. When the iForest algorithm is applied to container monitoring, these four metrics become the features used to divide the data set. However, the classic iForest selects features purely at random.
The basic principle of this optimization is to set a weight value for each of the four resource metrics, and then to change the random selection to weighted randomness when selecting features during the construction of isolation trees. In this way, resource metrics with high weights are more likely to be selected for data classification than the other metrics. Therefore, anomalies in containers that are more dependent on and more sensitive to such resources are more likely to be found.
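The two severe-missing rules in the preprocessing step above can be checked as follows; representing a missing monitoring point as None is an assumption of this sketch:

```python
def severely_missing(points, missing_ratio=0.2, max_consecutive=5):
    """Apply the two 'severe missing' rules: more than 20 percent of monitoring
    points absent over the period, or 5 or more consecutive points absent.
    `points` is a list where None marks a missing sample (a representation
    chosen for this sketch)."""
    absent = [p is None for p in points]
    if sum(absent) > missing_ratio * len(points):    # rule a) > 20 percent missing
        return True
    run = 0
    for a in absent:
        run = run + 1 if a else 0
        if run >= max_consecutive:                   # rule b) 5+ consecutive missing
            return True
    return False
```

Periods for which this check returns True would be excluded from the detection range, as the text prescribes.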

When we repeatedly construct isolation trees and summarize the isolation feature group for each data item, the resource metric with a higher value in the isolation feature group is more likely to be anomalous than the resource metric with a lower value. Thus it can be judged which resource metric mainly caused the increase in the anomaly value of the monitoring data.
The method is based on a premise: if a feature value of a data item differs greatly from the value of this feature in other data items, then when dividing by this feature, this data item is more likely to be isolated separately. Therefore, it can be inferred that the isolation feature of a data item is also the feature that is most likely to have the most anomalous value.
When it is determined that the container is abnormal, the iTree construction steps are as follows:
1) Calculate the bias of each resource of the current container based on the monitoring data, and modify the corresponding feature weight;
2) Select a feature F among the four container resource features (i.e., CPU usage rate, Memory usage rate, IO read and write rate, Network rate) according to Algorithm 1;

TABLE 2
Configuration Information of the Experiment
In order to test the detection result of the proposed method, two other detection methods are used as comparisons. One is the original iForest-based anomaly detection method, and the other is based on the local outlier factor algorithm (i.e., LOF).

Table 5

4.3 A Case for Anomaly Detection

Here is an example showing how to detect an anomaly in the Memcached container. During the period of running Memcached

TABLE 4
The Result Comparison of Anomaly Detection on Memcached and Web Search

TABLE 6
The Weights of Resource Metrics
Fig. 4. Resource metrics monitored at the Memcached container's runtime. Note that in a Docker system with n cores, the total system CPU utilization can be 0 to n*100% [38], [39]. The value of n is 16 in this experiment.

TABLE 7
Ratio of Isolation Features when Endless Loop in CPU and Network Congestion are Injected