Test of MySQL database server

Date: Jan. 30, 2001

Introduction

The CDF production farm group uses MySQL database server for the internal bookkeeping. It is important to test and understand the MySQL performance and capability.

Bookkeeping of the production farm in operation

Each file being processed inside the farm causes several entries to the database along the procedure, such as Start, Staging, Staged, Dispatched, Transferring file, Processing start, Executing and Done. Although the farm might be requested to process files in a dataset or files in a data stream , i.e. multiple-file processing in parallel, due to the network-bandwidth control these files will be queued first and processed while the capacity allows. Figure 1 shows the activity of logging information into the database of processing a batch of 289 files collected in the CDF RUN II commissioning run. The peak in the left side of the upper plot is due to 'Start' and 'Staging' mostly and partly 'Staged'. It takes a while to get file from tape to disk. The files are then dispatched to worker nodes one by one. There the reconstructing work is started to process the file. The information logging get sparser and sort of uniformly distributed over the time period. The bottom plot shows more details of the left side of the upper plot. In case of processing a data stream it is estimated that about 4 requests per second will be made to the MySQL server.

Tests done to understand MySQL performance and capability

Since there will be multiple files being processed at the any moment, the first concern is the number of connections that MySQL can handle simultaneously. The result of our test shows that MySQL handles the requests well up to 40 processes requesting for information logging simultaneously. Cases of more than 40 processes are not tested.

The CDF RUN II data taking will continue for years. The size of the database table increases when there are more and more files being processed. The limitation of table size is 2 GB or so. However there are ways to increase the size of the table effectively. The concern here is the change of logging rate. It turned out that with indexing turned on, there was no sign of rate dropping up to 500 K entries pre-existing. Beyond that was not tested.

A test table was created with the following fields.

Field  Type 
fname  char(40) 
ndname  char(40) 
version  int(11) 
fdatime  datime 

where the first field, 'fname', is the file name which is changed file by file automatically if the entries are made by the same script. The field, 'ndname' is not really used in the test. The field, 'version', is used to identify the tests done with various different conditions. The field, 'fdatime', is the time when the entry is made and is logged automatically.

Scripts written in PYTHON were used to fill the table and to call a simplified version of 'bookkeeper' used in the farm operation. In this simplified 'bookkeeper', there is no check of duplicating entries also less error handling in parameter input.

Number of simultaneous processes requesting for information logging

The test process fills the table with different file names 100 time for each call. Each time the test process does the connecting to the database server, the information logging and the disconnecting from the database server. Multiple test processes can be started in the background. The version numbers of all test conditions are listed in the following table. The entries created during the filling procedure are identified with negative version numbers, so that they can be removed independently. Of course, there are entries made by the test processes so the '100 K' doesn't exactly means hundred thousand pre-existing entries. However the difference is small enough to neglect in terms of MySQL performance. The cases for indexing turned on are identified by adding 100 to the version number of the correspondent non-indexing cases.

indexing  # of test processes  no entry  filling  100 K  filling  200 K  filling  500 K 
No  1001  -1  2001  -2  3001  -5  4001 
No  1002  2002  3002  4002 
No  1003  2003  3003  4003 
No  1004  2004  3004  4004 
No  1005  2005  3005  4005 
No  10  1010  2010  3010  4010 
No  20  1020  2020  3020  4020 
No  40  1040  2040  3040  4040 
Yes  1105  2105  3105  4105 
Yes  20  1120  2120  3120  4120 

The result of the tests without indexing are listed in the following table. Detail plot can be obtained by clicking on the number. The error of each number is statistical only. In the plot, the x axis is the time after the test starts and is in unit of second. The y axis is the number of entries that MySQL gets in every second.

# of test processes  no entry  100 K  200 K  500 K 
3.48 +- 0.48 2.17 +- 0.27 1.43 +- 0.16 0.78 +- 0.61
6.06 +- 0.69 3.77 +- 0.38 2.08 +- 0.18 0.95 +- 0.064
6.98 +- 0.66 4.05 +- 0.34 2.11 +- 0.15 0.93 +- 0.030
7.14 +- 0.59 4.08 +- 0.30 2.12 +- 0.10 0.95 +- 0.022
6.85 +- 0.50 3.94 +- 0.25 2.10 +- 0.071 0.95 +- 0.015
10  7.19 +- 0.38 4.05 +- 0.13 2.10 +- 0.028 0.94 +- 0.057
20  7.30 +- 0.27 3.97 +- 0.11 2.11 +- 0.012 0.94 +- 0.0025
40  7.58 +- 0.22 3.67 +- 0.062 2.07 +- 0.013 0.93 +- 0.0038

In general the rate of inserting goes down while the table gets larger and larger. Increasing the number of processes up to 40 doesn't seem to have any effect on the rate. However, in case of no existing entry and with 40 processes making request in parallel, the MySQL server starts to have two different rates. It some time gets information logged slowly and other time it gets much more records done. However, no record was lost. With more and more records existing in the table this behavior shows up with less and less simultaneous processes. Still no record was lost even in the condition when there were 500 K records existing and 40 simultaneous processes. The server would have nothing done in a period of 10 to 20 seconds or so and gets lot of queued jobs done in the next second.

Roughly the MySQL takes about 7, 4, 2, and 1 requests per second when there are zero, 100 thousand, 200 thousand, and 500 thousand existing entries.

Rate changing with respect to the table size, with or without indexing

Indexing helps a lot in the performance. In the test, the indexing is done on 'fname' and 'version'. No primary key is defined. For a field to be a primary key, it must not have any identical entry. Since the farm could reprocess the same file several times, it take additional effort to create primary key. The performance without or without primary key doesn't change too much so it is preferable that the farm uses indexing without primary key defined. Similar tests were repeated for 5 processes and 20 processes. The results are shown in the following table.

# of test processes  no entry  100 K  200 K  500 K 
6.78 +- 0.55 7.94 +- 0.60 8.20 +- 0.62 8.06 +- 0.61
20  8.40 +- 0.33 8.33 +- 0.33 8.40 +- 0.33 8.37 +- 0.33

With indexing turned on the rate of information logging remain at 8 or so per second up to 500 thousand existing entries in our test. A comparison in case of 5 simultaneous processes is shown in this plot.

Conclusion

Since we estimate that in the operation there will be about 4 requests per second and it has been demonstrated that MySQL can handle 8 requests per second with indexing turned on, so it is understood that MySQL is capable of handling the request from the farm operation.

It is also estimated that the CDF RUN II data taking might get hundreds of thousand files per year. The MySQL server can handle a table of 500 thousand entries without any degradation in performance. So MySQL is capable of supporting the farm performance for year. If we separate the files according to the run number, for example, that will make the production farm reliable in the long run in terms of MySQL capability. This requires some change in the bookkeeping and will be done in the near future.

Based on the result from the tests, the farm group believes that MySQL is capable of supporting the farm activity. We have also turned on the indexing on the farm tables in operation. The indexing is on the fields of 'fname', 'exe_version', and the 'comment'. These fields exist in all file tables and are used to identify every entry.

The CDF production farm group is looking forward to having a smooth production process for the CDF experiment!


Posted at Feb. 2, 2001 by Yen-Chu Chen, TEL: (630) 840-2124.