EPM

From MAGGIE

End-to-End Performance Monitoring - Fahad Ahmad

Architecture

Image:epm_general_arch.jpg

Description:

EPM comprises four levels and three paths.

Level 0 or Monitored Node:

Level 0, the Monitored Node, is the target machine for any network-related test. It can be any router, server or client machine capable of answering a request generated by a higher-level system. The Monitored Node runs the client-side part of whatever tool the Monitoring Host uses to conduct a test. For example, if the Monitoring Host only sends PING requests, the Monitored Node must have ICMP traffic enabled. Any machine can act as a Monitored Node as long as it satisfies the needs of the tools being used.

Level 1 or Monitoring Host:

The Monitoring Host is the part of the system that does the actual monitoring of the network. A "Path" is defined as the connection between the Monitoring Host and the Monitored Node. A Monitoring Host conducts its tests on a path: the Monitoring Host initiates a test and the Monitored Node replies to it. The term "end to end" implies that the Monitoring Host tests the path irrespective of its length. By "path" we mean the connection between the host and the target, not the route; i.e. we are not concerned with the machines and systems that lie between the host and the target.

In the case of PING, the Monitoring Host must have the script that conducts the ping test. Several other things are also required to turn an ordinary PC into a Monitoring Host: the server processes of the tools employed; a database that holds the information EPM needs, including tables for storing the network metrics; scripts that initiate the tests and extract information; scripts that manage the database and populate the summary tables; and scheduling data and other data critical to EPM.

The database on a Monitoring Host has summary tables that hold the raw metrics, i.e. all metrics extracted from the tests, and separate tables that hold hourly aggregations of these metrics. A Monitoring Host contains only the metrics it collected itself.

Level 2 or Archive Server:

Level 2, the Archive Server, is an optional but very important part of EPM. EPM can operate without an Archive Server, using just the Monitoring Hosts and the Monitored Nodes. However, that design does not allow data to be shared with other networks or between Monitoring Hosts. Hence, although it can be omitted, including an Archive Server is recommended in order to make good use of the data.

The Archive Server acts as the central storage for all the Monitoring Hosts falling under its domain. This means that an enterprise can have a single Archive Server, which it uses as its central storage. The Archive Server not only increases availability by adding redundancy but also provides an external interface for users to view the information. It is expected that, once the whole system is in place, users will have a public view of all the data placed on the Archive Server. This data will be the primary source for users; to get more detail, however, a user will need to connect to the Monitoring Host.

The Archive Server consists of a database containing raw metrics from all the Monitoring Hosts falling under its domain, together with summary tables covering those same Monitoring Hosts. Scripts will be in place to manage the database and populate the tables. The Archive Server will also host the external interface to the information in the database, which is expected to be a website.

Level 3 or User:

The user is the network analyst or administrator who is interested in the network metrics. The user views the information by connecting to the Archive Server or to a Monitoring Host, depending on the level of detail required.

Database Design

Schema

Data Insertion Tables

Image:ERD_for_Raw_Storage.jpg

Data Table:

The most prominent feature of our database is that network metrics are stored generically in one table rather than in separate tables per tool or per metric name. This saves the time of looking up a table name before making insertions. Each metric value in the Data table is uniquely identified by the id field and is associated with the test that produced it and the timestamp at which the test was performed. The Data table provides round-robin storage and is flushed after one day, so its size stays small and under control.

Test Table:

The Test table contains the test entries. A test is defined by the referenced Tool used to extract the referenced Metric along a referenced Path. For example, extracting RTT (Metric) using Ping (Tool) between xxx.xxx.xxx.xxx and yyy.yyy.yyy.yyy (Path) is one Test table entry. The table records how a Metric and a Tool are related and how to extract the metric from the result of a test conducted by the Tool.

The result string is the output of the test with line breaks replaced by spaces. The output can appear on the screen or in a file: "readResultFromFile" selects which of the two is read, and "FilePath" holds the location of the file.

There are two ways to extract a metric from the result string. The first applies a regular expression to the result string (this method simply searches the result string for a given string). The second uses a user-defined extraction method. The "isRegular" field selects the method: its value is 1 if the regular expression is used and 0 if an extractor class is used.

The user supplies an extraction method by implementing the abstract class "AbstractDataExtractor". This class has an abstract method "processdata" that holds the actual extraction logic: it receives a string containing the result of the test and returns the metric as a string value to be stored in the Data table. The name of the implementing class is placed in the field "InformationExtractorClass", in the format "packagename.classname", which enables Java to load the class file dynamically. EPM contains a package named "MetricExtractorUtil", which is built into a jar and deployed with the system. It contains the abstract class as well as the implemented classes for extracting metrics.

"gpID" defines the group this test belongs to, while "gpTestID" defines the execution sequence of tests: the test with the lowest gpTestID is executed first, and so on.

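The extractor mechanism described for the Test table can be sketched in Java roughly as follows. Only the names AbstractDataExtractor and processdata come from the text; the parameter and return types, the AvgRttExtractor class, the ExtractorLoader helper and the regular expression (which targets the summary line of Linux ping) are assumptions made for illustration. The sketch also uses the default package for brevity, whereas EPM expects a "packagename.classname" entry.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Assumed shape of the abstract class named in the text.
abstract class AbstractDataExtractor {
    /** Receives the flattened result string; returns the metric value as a string. */
    public abstract String processdata(String result);
}

// Hypothetical extractor: reads the average RTT from a ping summary such as
// "rtt min/avg/max/mdev = 0.021/0.233/0.512/0.100 ms".
class AvgRttExtractor extends AbstractDataExtractor {
    private static final Pattern SUMMARY =
        Pattern.compile("= [\\d.]+/([\\d.]+)/[\\d.]+/[\\d.]+ ms");

    @Override
    public String processdata(String result) {
        Matcher m = SUMMARY.matcher(result);
        return m.find() ? m.group(1) : null; // null when no summary line is present
    }
}

// Hypothetical helper showing dynamic loading by class name.
class ExtractorLoader {
    // Loads the class whose name would be stored in InformationExtractorClass.
    static AbstractDataExtractor load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return (AbstractDataExtractor) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load extractor " + className, e);
        }
    }

    // Replaces every line break with a space, as done for the result string.
    static String flatten(String rawOutput) {
        return rawOutput.replaceAll("\\R", " ");
    }
}
```

A caller would flatten the tool output, load the extractor by name, and store the string returned by processdata in the Data table.
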
TestGroup Table:

The TestGroup table holds data about a whole group. Each group is currently executed in a separate thread. The variable testinterval is intended to define when the tests should run, but this has not yet been implemented. The variable isAlone defines whether the test is to be run independently of the other tests.
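
The grouping and ordering rules can be illustrated with a small Java sketch. It assumes that gpTestID orders tests within their own group and that each group's tests run sequentially on that group's thread; the TestEntry and GroupRunner classes are hypothetical and are not part of EPM.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

// Hypothetical in-memory view of a Test row; gpID and gpTestID mirror the schema fields.
class TestEntry {
    final int id;
    final int gpID;
    final int gpTestID;

    TestEntry(int id, int gpID, int gpTestID) {
        this.id = id;
        this.gpID = gpID;
        this.gpTestID = gpTestID;
    }
}

class GroupRunner {
    // Groups tests by gpID and sorts each group by ascending gpTestID.
    static Map<Integer, List<TestEntry>> plan(List<TestEntry> tests) {
        Map<Integer, List<TestEntry>> groups = new TreeMap<>();
        for (TestEntry t : tests) {
            groups.computeIfAbsent(t.gpID, k -> new ArrayList<>()).add(t);
        }
        for (List<TestEntry> group : groups.values()) {
            group.sort(Comparator.comparingInt((TestEntry t) -> t.gpTestID));
        }
        return groups;
    }

    // Starts one thread per group; tests inside a group run one after another.
    static void runAll(List<TestEntry> tests, Consumer<TestEntry> runner) {
        List<Thread> threads = new ArrayList<>();
        for (List<TestEntry> group : plan(tests).values()) {
            Thread th = new Thread(() -> group.forEach(runner));
            threads.add(th);
            th.start();
        }
        for (Thread th : threads) {
            try {
                th.join(); // wait for every group to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

The runner passed to runAll stands in for whatever code launches the tool and stores the metric.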

Tool Table:

The Tool table describes a particular tool, such as the directory it is in and the arguments it is to be run with. For Ping, for example, the Tool table entry has the name "ping", its location can be "/bin/ping" and its arguments can be "-c 5". However, the arguments must also contain the string "clientip", separated from the other arguments by spaces (for example "-c 5 clientip"). The application that conducts the test replaces this string with the Monitored Node's IP.
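
A minimal Java sketch of the "clientip" substitution follows. The ToolCommand class and its build method are hypothetical, and the way EPM actually splits and launches the command is not stated in the source.

```java
import java.util.ArrayList;
import java.util.List;

class ToolCommand {
    // Builds the command line from the Tool row's location and arguments,
    // replacing the space-separated token "clientip" with the Monitored Node's IP.
    static List<String> build(String loc, String arguments, String nodeIp) {
        List<String> cmd = new ArrayList<>();
        cmd.add(loc);
        for (String token : arguments.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when arguments is blank
            }
            cmd.add(token.equals("clientip") ? nodeIp : token);
        }
        return cmd;
    }
}
```

The resulting list has the form expected by java.lang.ProcessBuilder, which is one way such a command could be launched.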

Metric Table:

The Metric table holds metadata about metrics: the name of the metric, its minimum value, its maximum value, its unit and its data type. The "metricValueType" field is an enumeration that takes one of the values "double", "string" or "int". This field is currently used for description only and is not used by any application.

Path Table:

The Path table holds references to the Monitoring Host and the Monitored Node.

Node Table:

The Node table contains general information about a node, whether it is a Monitoring Host or a Monitored Node. This information is used for testing. It contains the name of the host, its IP address and the port of its external interface. It also references a NodeDesc table entry, which holds more detailed information about the node.

NodeDesc Table:

This table contains more detailed information about the node: its latitude, longitude and hostname, in addition to the fields also held in the Node table. In future it will hold further details that users may require.

Summary Tables:

The summary tables hold data that has been processed and is ready for users to access. Each row contains the metric value and its most useful attributes. Summary tables exist on Monitoring Hosts and Archive Servers. The idea is to let the summary tables serve the SELECT queries that fetch metrics, while the other tables are used for temporary storage and for running tests.

Image:Summarytableformat.jpg

Summary Tables on Monitoring Hosts:

Each Monitoring Host has two kinds of summary tables:

Raw Summary Tables:

These tables are named "summarytable[month][year]". Each table contains the raw metrics for a one-month period and is updated every hour by the application named "LocalBackup". These tables provide the primary data storage for all metrics, since the Data table is flushed every day.
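
As an illustration of the naming format, the following Java sketch derives a raw summary table name from a date. It assumes that [month] is the lowercase three-letter English abbreviation and [year] the four-digit year; the SummaryTableNames class is hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class SummaryTableNames {
    // "MMMyyyy" renders e.g. "Jan2008" in the English locale.
    private static final DateTimeFormatter MONTH_YEAR =
        DateTimeFormatter.ofPattern("MMMyyyy", Locale.ENGLISH);

    // Returns the assumed monthly raw summary table name for the given date.
    static String rawTable(LocalDate date) {
        return "summarytable" + date.format(MONTH_YEAR).toLowerCase(Locale.ROOT);
    }
}
```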

Summary Tables with Hourly Aggregated Values:

These tables are named "summarytablebyhour[year]". They are formed by aggregating the metric values of the last hour, using the aggregation scheme provided by the user. This scheme is discussed in detail in the coming sections. The table is updated every hour by the application named "AggregatedSummarizationControlCenter". Because it contains fewer records, one table covers a complete year. The number of entries in one such table will be n x 24 x 365, where n is the number of distinct combinations of Tool, Metric and Path.
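
The hourly aggregation and the row estimate can be sketched in Java as follows. The aggregation scheme is user-provided in EPM, so the arithmetic mean used here is only an assumed example, and the HourlyAggregator class is hypothetical. Timestamps are taken to be Unix seconds.

```java
import java.util.Map;
import java.util.TreeMap;

class HourlyAggregator {
    // Accumulates the values of one hour for one Tool/Metric/Path combination.
    static final class Bucket {
        double sum;
        int testCount; // number of raw values that fell into this hour

        double mean() {
            return sum / testCount;
        }
    }

    // Groups raw values by the hour their timestamp falls in (keyed by hour start).
    static Map<Long, Bucket> aggregate(long[] timestamps, double[] values) {
        Map<Long, Bucket> buckets = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            long hourStart = timestamps[i] - Math.floorMod(timestamps[i], 3600L);
            Bucket b = buckets.computeIfAbsent(hourStart, k -> new Bucket());
            b.sum += values[i];
            b.testCount++;
        }
        return buckets;
    }

    // Upper bound on yearly rows: n x 24 x 365 for n Tool/Metric/Path combinations.
    static long maxRowsPerYear(int n) {
        return (long) n * 24 * 365;
    }
}
```

For three combinations the bound is 3 x 24 x 365 = 26280 rows per year.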

Summary Tables on Archive Servers:

The Archive Server has three types of summary tables:

Summary Tables with Raw Data:

This table holds the raw metric data collected from all the Monitoring Hosts falling under an Archive Server's domain. It is named "summarytable[month][year]" and is updated every hour by the application "BackupToArchiveServer". It holds the most entries of any table, since it contains raw data from multiple Monitoring Hosts.

Summary Table with Hourly Aggregated Data:

This table is again named "summarytablebyhour[year]". It is updated every hour by the application "AggregatedSummarizationControlCenter". It holds hourly aggregated values for all the Monitoring Hosts falling under an Archive Server.

Summary Table with Daily Aggregated Data:

This table is named "dailysummarytable[year]". It is also updated every hour by the application "AggregatedSummarizationControlCenter". It holds metric values aggregated over one-day intervals. The values are taken from the raw summary table on the Archive Server. Hence if, say, a test is conducted just three times a day, its value may be missed in the hourly aggregation, but its result will still be reflected in the daily aggregation table.


Data Input Tables On Mon Host

CREATE TABLE Data(
id int(32) AUTO_INCREMENT PRIMARY KEY,
TestID int(32),
timestamp VarChar(50),
value VarChar(250));
CREATE TABLE Node
(id int(32) AUTO_INCREMENT PRIMARY KEY,
Name VarChar(50),
IP VarChar(50),
port int(6),
NodeDescID int(32));
CREATE TABLE NodeDesc
(id int(32) AUTO_INCREMENT PRIMARY KEY,
Name VarChar(50),
IP VarChar(50),
port int(6),
Lat VarChar(7),
Lon VarChar(7),
Hostname VarChar(50));
CREATE TABLE Path
(id int(32) AUTO_INCREMENT PRIMARY KEY,
MonHostID int(32),
MondNodeID int(32));
CREATE TABLE Metric
(id int(32) AUTO_INCREMENT PRIMARY KEY,
metricName VarChar(50),
minValue int(50),
maxValue int(50),
metricValueType enum('int','string','double'),
metricUnit VarChar(50));
CREATE TABLE Tool
(id int(32) AUTO_INCREMENT PRIMARY KEY,
Name VarChar(50),
Loc VarChar(250),
arguments VarChar(250));
CREATE TABLE Test 
(id int(32) AUTO_INCREMENT PRIMARY KEY,
ToolID int(32), 
MetricID int(32), 
TestVar VarChar(50), 
Formula VarChar(50), 
RegularExpression VarChar(250), 
PathID int(32), 
InformationExtractorClass VarChar(250), 
isRegular TinyInt(1), 
gpTestID int(32) NOT NULL, 
gpID int(32) NOT NULL, 
readResultFromFile int(1) NOT NULL, 
FilePath varchar(50));
create TABLE TestGroup
(id int(32) PRIMARY KEY,
testinterval varchar(50),
isAlone int(1));

Summary Tables On Mon Host

These tables keep detailed testing information. They are of two types:

a. A summary table for every month, containing all the data down to every single test conducted. This table is populated every hour. Its format is:

CREATE TABLE summarytablejan2008
(id int(32) AUTO_INCREMENT PRIMARY KEY,
MetricName VarChar(50),
ToolName VarChar(50),
MonHostIP VarChar(20),
MondNodeIP VarChar(20),
value VarChar(250),
MetricUnit VarChar(15),
timestamp VarChar(50));

Similarly, for the next month:

CREATE TABLE summarytablefeb2008
(id int(32) AUTO_INCREMENT PRIMARY KEY,
MetricName VarChar(50),
ToolName VarChar(50),
MonHostIP VarChar(20),
MondNodeIP VarChar(20),
value VarChar(250),
MetricUnit VarChar(15),
timestamp VarChar(50));

b. A table of aggregated data. Test results are aggregated over one hour; that is, after every hour a single value for the whole previous hour is inserted into this table. Its format is:

CREATE TABLE summarytablebyhour2008
(id int(32) AUTO_INCREMENT PRIMARY KEY,
MetricName VarChar(50),
ToolName VarChar(50),
MonHostIP VarChar(20),
MondNodeIP VarChar(20),
value VarChar(250),
MetricUnit VarChar(15),
timestamp VarChar(50),
TestCount int(30));

Summary Tables On Archive Server

Information about all the MonHosts under this Archive Server is kept in this table:

CREATE TABLE monhosts
(id int(32) AUTO_INCREMENT PRIMARY KEY,
 monhostName VarChar(50),
 MonHostIP VarChar(20));

Note that the entries in this table define which MonHosts may upload data to this Archive Server. Additionally, the IP is used to provide a link to the MonHost from the Archive Server's website.

Every hour, data is collected from the MonHosts and stored on the Archive Server. This data is kept separately for every month in a monthly data table, which provides the redundancy that assures data availability. This table can grow very large, because it holds data from all MonHosts under one Archive Server.

CREATE TABLE summarytablejan2008
(id int(32) AUTO_INCREMENT PRIMARY KEY,
MetricName VarChar(50),
ToolName VarChar(50),
MonHostIP VarChar(20),
MondNodeIP VarChar(20),
value VarChar(250),
MetricUnit VarChar(15),
timestamp VarChar(50));
CREATE TABLE summarytablefeb2008
(id int(32) AUTO_INCREMENT PRIMARY KEY,
MetricName VarChar(50),
ToolName VarChar(50),
MonHostIP VarChar(20),
MondNodeIP VarChar(20),
value VarChar(250),
MetricUnit VarChar(15),
timestamp VarChar(50));

The data from these tables is aggregated on an hourly basis, which in turn provides redundancy for the hourly aggregated data. Note that each such table contains data for one year only.

CREATE TABLE summarytablebyhour2008
(id int(32) AUTO_INCREMENT PRIMARY KEY,
MetricName VarChar(50),
ToolName VarChar(50),
MonHostIP VarChar(20),
MondNodeIP VarChar(20),
value VarChar(250),
MetricUnit VarChar(15),
timestamp VarChar(50),
TestCount int(30));

The following table provides the primary interface for user requests. It contains data aggregated for every day of one year. The expected maximum number of entries in this table is 366.

CREATE TABLE dailysummarytable2008
(id int(32) AUTO_INCREMENT PRIMARY KEY,
MetricName VarChar(50),
ToolName VarChar(50),
MonHostIP VarChar(20),
MondNodeIP VarChar(20),
value VarChar(250),
MetricUnit VarChar(15),
timestamp VarChar(50),
TestCount int(30));

Database Stress Test

All times are in nanoseconds.


Test Case 1: time taken by selection queries covering one month (January), in order of increasing detail:

query: select * from dailysummarytable2008
where timestamp between '1199145600' and '1199145629'
initial Time = 39638726534517 ns	 finalTime = 39638727251927 ns
Total Time = 717410 ns


query: select * from summarytablebyhour2008
where timestamp between '1199145600' and '1199145629'
initial Time = 39638727387698 ns	 finalTime = 39638727940003 ns
Total Time = 552305 ns


query: select * from summarytablebyhour2008
where timestamp between '1199145600' and '1199145629'
initial Time = 39638728001184 ns	 finalTime = 39638730057870 ns
Total Time = 2056686 ns


query: select * from summarytablejantojun2008
where timestamp between '1199145600' and '1199145629'
initial Time = 39638730595368 ns	 finalTime = 39638731297972 ns
Total Time = 702604 ns

Test Case 2: time taken by selection queries for a single metric (metric1), in order of increasing detail:

query: select * from dailysummarytable2008
where timestamp between '1199145600' and '1199145629'
and metricName='metric1'
initial Time = 39638731436537 ns	 finalTime = 39638731901121 ns
Total Time = 464584 ns


query: select * from summarytablebyhour2008
where timestamp between '1199145600' and '1199145629'
and metricName='metric1'
initial Time = 39638731983534 ns	 finalTime = 39638732610429 ns
Total Time = 626895 ns


query: select * from summarytablebyhour2008
where timestamp between '1199145600' and '1199145629'
and metricName='metric1'
initial Time = 39638733103788 ns	 finalTime = 39638734345007 ns
Total Time = 1241219 ns


query: select * from summarytablejantojun2008
where timestamp between '1199145600' and '1199145629'
and metricName='metric1'
initial Time = 39638734489718 ns	 finalTime = 39638735004309 ns
Total Time = 514591 ns

Test Case 3: fetching data from the Data table in summary table format:

query: select d.value,d.timestamp,t.Name,m.metricName,m.metricUnit,n1.IP,n2.IP
from Data d, Test ts, Tool t, Metric m, Node n1, Node n2, Path p
where d.TestID=ts.ID and ts.ToolID=t.ID and ts.MetricID=m.ID and ts.PathID=p.ID
and p.MonHostID=n1.ID and p.MondNodeID=n2.ID
and d.timestamp between '1199145600' and '1199145629'
initial Time = 39638735245401 ns	 finalTime = 39638735734569 ns
Total Time = 489168 ns
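
The source does not say how the timings above were captured. The initial/final/total layout is consistent with taking two readings of a nanosecond clock around each query, which can be sketched in Java as follows. The QueryTimer class is hypothetical, and the use of System.nanoTime is an assumption.

```java
import java.util.function.Supplier;

class QueryTimer {
    // Holds the result of the timed call together with both clock readings.
    static final class Timed<T> {
        final T result;
        final long initialNs;
        final long finalNs;

        Timed(T result, long initialNs, long finalNs) {
            this.result = result;
            this.initialNs = initialNs;
            this.finalNs = finalNs;
        }

        // Total time is the difference between the two readings.
        long totalNs() {
            return finalNs - initialNs;
        }
    }

    // Runs the supplied query and records the clock before and after it.
    static <T> Timed<T> time(Supplier<T> query) {
        long start = System.nanoTime();
        T result = query.get();
        long end = System.nanoTime();
        return new Timed<>(result, start, end);
    }
}
```

In a real measurement the Supplier would execute the SQL statement through the application's database connection.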

List of expected queries

Node + Node Desc:

insert into NodeDesc(name,IP,port,lat,lon,hostname) Values 
('monhost1','xxx.xxx.xxx.xxx',80,'30.27','30.27','monhost1.com');


insert into Node(Name,IP,port,NodeDescID) Values
('monhost1','xxx.xxx.xxx.xxx',80,1);

Path

insert into Path(MonHostID,MondNodeID) Values (1,2);

Metric

insert into Metric(metricName,minValue,maxValue,metricValueType,metricUnit) 
Values('RTT',0,5000,'double','ms');


insert into Metric(metricName,minValue,maxValue,metricValueType,metricUnit) 
Values ('hops',1,30,'int',NULL);


insert into Metric(metricName,minValue,maxValue,metricValueType,metricUnit) 
Values ('availability',0,1,'int',NULL);

Tool

insert into Tool(Name,Loc,arguments) Values('Ping','/bin/ping','-c 5 clientip');


insert into Tool(Name,Loc,arguments) Values('Trace Route','/bin/tracert','clientip');

Test

insert into Test(ToolID, MetricID, TestVar, Formula, RegularExpression, PathID,
InformationExtractorClass, isRegular, gpTestID, gpID, readResultFromFile)
Values(1,1,'rtt',NULL,NULL,1,'./extractorClasses/rttExtrator',0,1,1,0);

Data

insert into Data (TestID, timestamp, value) values(1,unixtimestamp,value1);
insert into Data (TestID, timestamp, value) values(1,'1206389382','0.233');
insert into Data (TestID, timestamp, value) values(1,'1206389382','0.343');
select d.value,d.timestamp,t.Name,m.metricName,m.metricUnit,n1.IP,n2.IP from
Data d, Test ts, Tool t, Metric m, Node n1, Node n2, Path p where
d.TestID=ts.ID and ts.ToolID=t.ID and ts.MetricID=m.ID and ts.PathID=p.ID and
p.MonHostID=n1.ID and p.MondNodeID=n2.ID;


Selection of Summary Data

1. select * from archiveServer.dailysummarytable2008 where ToolName='tool1';
2. select * from archiveServer.dailysummarytable2008 where MetricName='metric1';
3. select * from archiveServer.dailysummarytable2008 where timestamp 
   between '1206389380' and '1206389382';
4. select * from archiveServer.summarytablebyhour2008 where timestamp 
   between '1206389000' and '1206389382' and ToolName='tool1';
5. select * from monhost1.summarytablebyhour2008 where timestamp 
   between '1206389000' and '1206389382' and ToolName='tool1';
6. select * from monhost1.summarytablejantojun2008 where timestamp 
   between '1206389000' and '1206389382' and MetricName='metric1';


Deployment

MonHost

1. Create the MonHost database using the queries given above. The monthly summary tables can be ignored.

2. Create a folder named epm and extract the rar file.

3. Copy the folder "epm" from epmMonhost into /etc/

4. Fill in the two configuration files with valid values.

5. Create a folder named logs.

6. Create a crontab using the command crontab -e and enter the following entries, each on a single line:

01 * * * * java -jar [fully qualified path]/Tester/Tester.jar >> [fully qualified path]/logs/Tester.Report

52 * * * * java -jar [fully qualified path]/LocalBackup/LocalBackup.jar > [fully qualified path]/logs/BackupReport

54 * * * * java -jar [fully qualified path]/AggSumControlCenter/AggregatedSummarizationControlCenter.jar > [fully qualified path]/logs/AggReport

56 * * * * java -jar [fully qualified path]/BackupToArchiveServer_Client/BackupToASClient.jar > [fully qualified path]/logs/ASClientReport


Archive Server

1. Create the Archive Server database. The monthly summary tables can be ignored.

2. Create a folder named epmServer and extract the rar file.

3. Copy the folder "epm" from epmArchiveServer into /etc/

4. Fill in the two configuration files with valid values.

5. Create a folder named logs.

6. Create a crontab using the command crontab -e and enter the following entries, each on a single line:

58 * * * * java -jar [fully qualified path]/AggSumControlCenter/AggregatedSummarizationControlCenter.jar > [fully qualified path]/logs/AggSum.Server.Report

50 * * * * java -jar [fully qualified path]/BackupToASServerDist/BackupToASServer.jar > [fully qualified path]/logs/ASServerReport