The CDF RUN II data taking will continue for years. The size of the database table increases when there are more and more files being processed. The limitation of table size is 2 GB or so. However there are ways to increase the size of the table effectively. The concern here is the change of logging rate. It turned out that with indexing turned on, there was no sign of rate dropping up to 500 K entries pre-existing. Beyond that was not tested.
A test table was created with the following fields.
| Field | Type |
|---|---|
| fname | char(40) |
| ndname | char(40) |
| version | int(11) |
| fdatime | datime |
where the first field, 'fname', is the file name which is changed file by file automatically if the entries are made by the same script. The field, 'ndname' is not really used in the test. The field, 'version', is used to identify the tests done with various different conditions. The field, 'fdatime', is the time when the entry is made and is logged automatically.
Scripts written in PYTHON were used to fill the table and to call a simplified version of 'bookkeeper' used in the farm operation. In this simplified 'bookkeeper', there is no check of duplicating entries also less error handling in parameter input.
| indexing | # of test processes | no entry | filling | 100 K | filling | 200 K | filling | 500 K |
|---|---|---|---|---|---|---|---|---|
| No | 1 | 1001 | -1 | 2001 | -2 | 3001 | -5 | 4001 |
| No | 2 | 1002 | 2002 | 3002 | 4002 | |||
| No | 3 | 1003 | 2003 | 3003 | 4003 | |||
| No | 4 | 1004 | 2004 | 3004 | 4004 | |||
| No | 5 | 1005 | 2005 | 3005 | 4005 | |||
| No | 10 | 1010 | 2010 | 3010 | 4010 | |||
| No | 20 | 1020 | 2020 | 3020 | 4020 | |||
| No | 40 | 1040 | 2040 | 3040 | 4040 | |||
| Yes | 5 | 1105 | 2105 | 3105 | 4105 | |||
| Yes | 20 | 1120 | 2120 | 3120 | 4120 |
The result of the tests without indexing are listed in the following table. Detail plot can be obtained by clicking on the number. The error of each number is statistical only. In the plot, the x axis is the time after the test starts and is in unit of second. The y axis is the number of entries that MySQL gets in every second.
In general the rate of inserting goes down while the table gets larger and larger. Increasing the number of processes up to 40 doesn't seem to have any effect on the rate. However, in case of no existing entry and with 40 processes making request in parallel, the MySQL server starts to have two different rates. It some time gets information logged slowly and other time it gets much more records done. However, no record was lost. With more and more records existing in the table this behavior shows up with less and less simultaneous processes. Still no record was lost even in the condition when there were 500 K records existing and 40 simultaneous processes. The server would have nothing done in a period of 10 to 20 seconds or so and gets lot of queued jobs done in the next second.
Roughly the MySQL takes about 7, 4, 2, and 1 requests per second when there are zero, 100 thousand, 200 thousand, and 500 thousand existing entries.
| # of test processes | no entry | 100 K | 200 K | 500 K |
|---|---|---|---|---|
| 5 | 6.78 +- 0.55 | 7.94 +- 0.60 | 8.20 +- 0.62 | 8.06 +- 0.61 |
| 20 | 8.40 +- 0.33 | 8.33 +- 0.33 | 8.40 +- 0.33 | 8.37 +- 0.33 |
With indexing turned on the rate of information logging remain at 8 or so per second up to 500 thousand existing entries in our test. A comparison in case of 5 simultaneous processes is shown in this plot.
It is also estimated that the CDF RUN II data taking might get hundreds of thousand files per year. The MySQL server can handle a table of 500 thousand entries without any degradation in performance. So MySQL is capable of supporting the farm performance for year. If we separate the files according to the run number, for example, that will make the production farm reliable in the long run in terms of MySQL capability. This requires some change in the bookkeeping and will be done in the near future.
Based on the result from the tests, the farm group believes that MySQL is capable of supporting the farm activity. We have also turned on the indexing on the farm tables in operation. The indexing is on the fields of 'fname', 'exe_version', and the 'comment'. These fields exist in all file tables and are used to identify every entry.
The CDF production farm group is looking forward to having a smooth
production process for the CDF experiment!