From wolbers@fnal.gov Mon Mar 8 18:40:37 1999
Date: Fri, 04 Dec 1998 07:57:33 -0600
From: Stephen Wolbers
To: chenyc@fnal.gov
Subject: Some thoughts

Yen-Chu,

I wrote this up two days ago and then could not send it to you
because I got busy and I could not get my mail to work here at
Brookhaven. Anyway, it is not complete but I just wanted to send you
something to read.

Steve


What is needed to make the farm usable for Run II

Yen-Chu,

The CDF Run II farm requires hardware and software, as well as
procedures and ideas of how it will be run. This note is a short
description of the things which I can think of right now that we will
need to purchase or specify or create during the next year or so.

1. Worker nodes
2. Scripting language choice
3. I/O nodes
4. Tape I/O
5. Farm control
6. Networking
7. Staging disk
8. Concatenation disk

I will not necessarily address these in any proper order - that can be
done later. But I would like to get some of these ideas out.

Worker nodes

Clearly there will be a need to understand what kind of worker nodes
will be needed on the farm. The current default is PII duals running
LINUX. That may be the most effective, but it is not proven to be true,
and the market may change. Some investigation into quads should be done
before the final choice is made. The networking is almost certain to
be 100 Mbit ethernet. The memory requirement is not yet known, but it
would be surprising if it was less than 64 Mbyte. The amount of local
disk is also not known, but a minimum will be space needed for 2 files
(input) for each process which is running and some space for output.
On a dual there is likely to be 2 processes running. This leads to 4
input files and some number of output files. If an input file averages
1 Gbyte, the space required will be at least 4 Gbyte. 9 Gbytes should
be sufficient under these assumptions.

Scripting Language Choice

It is worth thinking about scripting languages soon.
The choices are cshell, bourne shell, Python, perl, or possibly a
programming language (C or C++). There are good arguments for many of
the choices. The group should investigate the pros and cons of the
various languages. After the investigation the choice should be made
by a discussion of the members of the group. As part of the discussion
the scripting language being used by others in CDF and by the farms
groups and other support groups in the Computing Division should be
investigated.

I/O nodes

I/O nodes for the farm are tightly coupled to the tape I/O. There are
a few possibilities which should be looked at:

1. Big SMP.
2. Two or more small SMP's.
3. Multiple small single or dual machines.

Some of the considerations that will come into play are the amount of
disk storage required for staging and concatenation. This space could
be quite substantial and requires computer systems which are capable of
attachment of large amounts of disk. A second consideration is CPU and
memory required to move the data around, both on and off the network or
on and off of disk. Finally there will be network connections required
for this machine or machines and they must be sufficient to move the
data.

A small group of people from the CDF production group should form to
study this issue and to report to the rest of the CDF production group,
the CDF Data Handling group, and to the Run II farms as a whole.

Tape I/O

Tape I/O is extremely important to the production. Files must be taken
from the mass storage system for processing on the farm at a fairly
rapid rate. This must be done in an efficient manner so as not to
"thrash" the robotics. The software interface should encourage this.
Output files from the farms need to be written onto tapes in the mass
storage system. This must occur in a manner that optimizes use of the
robot and packs the information on the tapes in a way which makes
physics analysis easier.

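One way to picture reading files without "thrashing" the robotics is to
group pending file requests by the tape they live on, so that each tape
is mounted only once. The following is only a sketch in Python (one of
the candidate scripting languages); the file names, tape labels, and the
function group_requests_by_tape are hypothetical and do not correspond
to any real mass storage interface.

```python
from collections import defaultdict


def group_requests_by_tape(requests):
    """Group (file_name, tape_label) requests so each tape is mounted once.

    Returns a list of (tape_label, [file_names]) in first-seen tape order.
    """
    by_tape = defaultdict(list)
    for file_name, tape_label in requests:
        by_tape[tape_label].append(file_name)
    return list(by_tape.items())


# Hypothetical request list: files interleaved across two tapes.
requests = [
    ("run1_a.dat", "TAPE01"),
    ("run2_a.dat", "TAPE02"),
    ("run1_b.dat", "TAPE01"),
    ("run2_b.dat", "TAPE02"),
]

for tape, files in group_requests_by_tape(requests):
    print(tape, files)
```

Served strictly in request order on a single drive, this list could need
up to four mounts; grouped, it needs two.
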
This is another area which a small subgroup from the production group
can potentially spend some effort on. That group could study the issue,
come up with proposals, and have those proposals be seen by others so
decisions can be made.

Farm Control

Overall farm control is envisioned to be via scripts which in turn get
information from the CDF production database. The control needs to be
tightly coupled to the I/O for the farm, to allow for efficient use of
the tape and robot resources. It also needs to "watch" the disk use of
input and output of the farms to allow for action to be taken if disks
are filling. This is an area which will require a plan, then a
prototype system (presumably using the farms batch system FBS) and then
tuning as the farm is used.

Networking

Networking to the farms needs to be understood. It looks likely that
Gigabit ethernet to the I/O nodes and 100 Mbit ethernet to the worker
nodes will be sufficient. The prototype farm may allow the production
group to measure enough of the performance characteristics of the
networks to decide this issue and allow for specification of the proper
switches, hubs, etc.

Staging Disk

Staging disk is required somewhere in the system to buffer files that
are being read from mass storage or are being saved for writing to mass
storage. The size and location of the staging disk will have to be
determined. This calculation requires some knowledge of the production
model, the size of the input files, the number of streams (input and
output), the rules for concatenating files, the packing rules for
writing tapes, as well as a detailed model of the location of files as
they move through the farms production.

Concatenation Disk

It is important to understand how much output disk space will be
required for concatenation of files. This is not a trivial amount of
disk storage and is different from the space required for staging.

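As an illustration of the "watch the disk use" idea under Farm Control,
a control script might periodically check how full each disk is and
flag the ones that need action. This is a minimal sketch only: the 90%
threshold, the list of paths, and the function names are assumptions,
and it does not use FBS or the production database.

```python
import shutil


def fill_fraction(used_bytes, total_bytes):
    """Fraction of a disk that is in use (0.0 to 1.0)."""
    if total_bytes == 0:
        return 0.0
    return used_bytes / total_bytes


def disks_needing_action(paths, threshold=0.90):
    """Return the paths whose disks are fuller than `threshold`."""
    flagged = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        if fill_fraction(usage.used, usage.total) > threshold:
            flagged.append(path)
    return flagged


# Assumed example: check the disk holding the current directory.
print(disks_needing_action(["."]))
```

In a real farm the list of paths would be the input and output areas,
and the "action" (pausing staging, starting a tape write) would be
decided by the control scripts.
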
From chenyc@fnal.gov Mon Mar 8 18:51:00 1999
Date: Fri, 11 Dec 1998 14:54:12 -0600 (CST)
From: Yenchu Chen
To: Stephen Wolbers
Subject: Re: Some thoughts

Hi Steve,

I am sorry for the late response to your mail. There are just too many
things loaded on me while I need to get a preliminary result out of my
analysis, which is the most essential part of my search for a faculty
position.

Back to the production farm ...

> 1. Worker nodes

As far as I know, the computers with the best performance/price are PCs
and Alphas. Yes, Alpha computers are very cheap now, but we don't have
enough experience with this new Alpha machine. If we do want to
consider machines other than PCs, we need to get samples of them now
and experiment with them.

I/O capability is my major concern in using PC Linux. We will have so
many files moving over the network of these computers that I always
worry when I think about it. In the case of E871, when there are 20
processes running, the CPU time used for each event is close to the
wall clock time. But when more than 40 processes are running and asking
for data from the I/O node, they are slowed down. Now that we have
close to 100 processes running together, the CPU time is only 2/3 of
the wall clock time. We are wasting CPU power by having them access the
same I/O node simultaneously.

Yes, in the case of CDF we will move data files to the local disk of
the worker node. But there will still be a few hundred files
transferring simultaneously. Say we have 400 worker nodes and 400
processes running (it might be better to run 800 processes to use the
CPU power better, but let's not worry about this at the moment): there
will be 400 files being transferred at each moment while 400 files are
being analyzed by the worker nodes! We should have enough bandwidth for
them, but I haven't seen a solid proof yet.

How does this relate to the PC itself? All the hard disks on the I/O
node are NFS mounted on the worker nodes. PC Linux doesn't seem to
support NFS very well.

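The E871 numbers above can be restated as a CPU efficiency (CPU time
divided by wall clock time). The short Python sketch below does this
arithmetic; it assumes one process per CPU, which the message does not
state, and the function names are made up for illustration.

```python
def cpu_efficiency(cpu_time, wall_time):
    """CPU time as a fraction of wall clock time."""
    return cpu_time / wall_time


def idle_cpu_equivalent(n_processes, efficiency):
    """Number of CPUs' worth of time lost to waiting (e.g. on I/O)."""
    return n_processes * (1.0 - efficiency)


# About 100 processes with CPU time at 2/3 of wall clock time.
eff = cpu_efficiency(2.0, 3.0)
print(round(idle_cpu_equivalent(100, eff), 1))
```

Under that assumption, roughly a third of the 100 CPUs' time is spent
waiting rather than computing.
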
If we run only two processes on each dual PC, yes, I agree that a 9 GB
disk is sufficient. We can double that if we decide to go to four
processes on each dual PC.

> 2. Scripting language choice

Using C or C++ (I would love to) together with shell scripts, one can
do almost everything one wants to do. But it might take us more time to
write and debug the code. Shell scripts provide 'convenience', but they
have their own limitations. I am personally not particularly good at
writing scripts. Ping is much better than I am. Anyway, as you said,
this issue needs to be studied and addressed.

> 3. I/O nodes and 4. Tape I/O

The major task of the I/O node(s) will be getting data files from mass
storage. Thus I agree with you that it is tightly coupled with tape
I/O. It is also related to the internal I/O capability of the PC
itself, as you have mentioned. This is actually the part that worries
me most. When there are hundreds of files moving around, what is going
to happen?

> 5. Farm control

We had several discussions on this, and Yeh Ping is very interested in
writing the script to do this control/monitoring task. We need to iron
out all the details, though.

> 6. Networking

As you said, we should have enough bandwidth, but I would still like to
see some results from a test of transferring multiple (hundreds of)
files simultaneously.

> 7. Staging disk

I would think that this staging area for input should be on the I/O
node(s). That is what they are there for, isn't it?

> 8. Concatenation disk

The concatenation worries me! I haven't seen anyone successfully
appending files on tape. Thus I tend to use the staging area to
accumulate data up to a full tape size. So if we are using, say, 25 GB
tapes and we have 60 output physics streams, plus the storage space
needed while the previous data sets are going to tape, we will need
3 TB right there! More streams will require more staging area. Of
course we can always put some streams together. Maybe we should limit
the number of streams. The upper limit can be decided later.

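The 3 TB figure follows from one tape-sized buffer per stream that is
filling plus one per stream that is being written to tape. A small
Python sketch of that arithmetic, where the function name and the
buffers_per_stream parameter are introduced here only for illustration:

```python
def concatenation_staging_gb(tape_size_gb, n_streams, buffers_per_stream=2):
    """Disk needed to accumulate full tapes for every output stream.

    buffers_per_stream=2 means one tape-sized area filling and one
    being written to tape at the same time.
    """
    return tape_size_gb * n_streams * buffers_per_stream


total_gb = concatenation_staging_gb(25, 60)
print(total_gb, "GB =", total_gb / 1000, "TB")
```

Halving the number of streams, or merging streams, reduces the total in
proportion.
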
I feel that there is a lot of information floating around, but I have
not yet put it together nicely to get a clear picture of the whole
thing. I will try to put everything down in a single plot and
distribute it to everyone. I will also try to generate a list of tasks
and send it out to everyone, hopefully before our next meeting on
Monday evening. We can talk about it at the meeting.

Best regards,

Yen-Chu Chen
chenyc@fnal.gov
(630) 840-8871 (experiment)
(886)-(2) 2789-9681 (Inst. of Phys., Academia Sinica)