Minutes of the Event Data Model Working Group Meeting
02 March 2000
Rob Kennedy, for the CDF Run II Event Data Model Working Group

Attending: Rob Kennedy, Philippe Canal, Rick Snider, Liz Sexton-Kennedy, Marge Shapiro, Rob Harris, Pierre Savard, Prem Singh, Peter Tamburello, Kirsten Tollefson
By Phone: Paolo Calafiura, Kevin McFarland
By Video: none

I) Keep/Drop Lists of Objects - Liz Sexton-Kennedy
--------------------------------------------------

We want to define the interface and use-cases for dropping objects from an event upon reading and upon writing. Liz led a discussion in order to begin this task.

In the past we could drop or keep banks with I/O lists according to their bank name or by a user-defined "set" of bank names. The banks read in were put into the input list, which by default was the set of banks that would be output. While module developers could provide hints as to which banks to keep or drop, ultimately it was the user of an A_C job who decided what was output.

This has a few problems if we naively apply the same approach to the objects in the new EDM. Some objects depend on others to store shared state information (CdfTrack and CdfTrackColl, for instance). This means it is possible for a user to drop the collection class but keep the element instances, in which case the elements may not be fully comprehensible. It was suggested we might consider a switch that the user can set which would select whether to keep/drop only this object or this and all dependent objects, and which would be used by an object's Streamer() method.

Liz proposed we extend the idea of specifying which bank names to drop to specifying pairs of (object_class_name, RCP_name), where the RCP_name is the parameter set name. It was mentioned that many algorithms would benefit from being able to specify dropping all but the "latest" (object_class_name, RCP_name) pairs.
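
As an illustration only, the pair-based drop list proposed above might be sketched as follows. The names ObjectKey, DropList and filter_kept are hypothetical and are not taken from the EDM code; the sketch assumes that an object can be identified for keep/drop purposes by its class name and RCP name alone, and it matches exact pairs only.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the identifying fields of a StorableObject.
struct ObjectKey {
    std::string class_name;  // object class name, e.g. "CdfTrackColl"
    std::string rcp_name;    // parameter set (RCP) name
};

// Hypothetical drop list keyed on (object_class_name, RCP_name) pairs.
class DropList {
public:
    // Request that every object with this exact pair be dropped.
    void drop(const std::string& class_name, const std::string& rcp_name) {
        pairs_.insert(std::make_pair(class_name, rcp_name));
    }

    // True if the object is not named in the drop list.
    bool keeps(const ObjectKey& key) const {
        return pairs_.count(std::make_pair(key.class_name, key.rcp_name)) == 0;
    }

private:
    std::set<std::pair<std::string, std::string> > pairs_;
};

// Return the objects that survive the drop list, in their original order.
std::vector<ObjectKey> filter_kept(const std::vector<ObjectKey>& in,
                                   const DropList& drops) {
    std::vector<ObjectKey> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (drops.keeps(in[i])) {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

This sketch does not attempt the "latest" selection or the dependent-object switch; both need the definitions discussed in these minutes before they can be written down.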
We need to carefully define what we mean by "latest" in this context, since (object_class_name, RCP_name) pairs may not be unique. A module may put out more than one instance of the same pair, which would confuse a naive algorithm that looks for the largest object_id with the same pair value. Also desired would be boolean operations on sets.

If we can implement an association of branches and I/O sets, then the mapping of I/O keep/drop sets is better aligned with ROOT concepts. We could then take advantage of the I/O list to avoid performing unnecessary post_read() calls, for instance, as well as using the ROOT I/O mechanism to skip over unrequested branches of events. If we did this, we would want to store the set_names in the event. These could go into a separate branch distinct from the event. Set names and object class names are also candidates for storage in the Data Catalog.

Time scale: Partial implementation by MDC-2. Keep/drop by object class name, perhaps sometime next week? Explore sets and branches, with the first being the input set. Rename the "description" string in StorableObject to "RCPName", though this may require complete rebuilds if archspec_rootcint.mk shortcomings are tripped.

II) Accessing Event Data Objects - Rob Kennedy
--------------------------------

Rob K. presented a brief outline of the note he has been working on, "Accessing Event Data Objects in the CDF Run II Event Data Model". The note is intended to gather the basic information together to allow folks with and without C++ training to access data in SeqRoot format files at some level.

A) Introduction to the new Event Data Model
   - much like CHEP 2000 paper and presentation, see EDM web page for URL

B) Tools to access data files not requiring programming knowledge
   - convert Ybos2SeqRoot
   - hex dump and caveats
   - object lister
   - object pretty printer

C) Programmatic access to event data
   - Basics of EventRecord
   - Basics of StorableObject
   - Basics of StorableBanks
   - Example of accessing data and creating a histogram which we will put into validation (note Young-Kee's web page)

D) Creating new objects
   - StorableObjects, StreamableObjects in existing packages
   - StorableBanks
   - Links
   - Creating a new package containing StorableObjects

E) Debugging code using the new EDM (with URLs to product support pages)
   - debuggers
   - purify
   - profiling

Comments not included in outline:

- We should stress that objects are not declared working until there is a test showing that they can be read back after having been written out.
- Suggest that new users work in an existing package before they venture into creating new packages.
- Rob K. would like to make this a collaborative effort as much as is reasonably possible. If you have a contribution you would like to make, or would like to edit/review a section or the entire note, please let him know.

The first draft should be out next week, as soon as some re-organization of the EdmObjects and EdmModules packages is done (this affects where example programs are located).

III) Ideas to Streamline I/O-Intensive Processes - Rob Kennedy
--------------------------------------------------------------

Rob K. reported on his profiling work, which was intended to uncover why the MDC-1 splitting program was running so much slower than his earlier ROOT I/O benchmarks would have indicated. He looked at ROOT v2.22.06 (debug and optimized) and v2.23.11 (optimized) on a Linux PC (200 MHz Pentium Pro), using converted YBOS files (all banks) as well as a file output by MDC-1 production (a mix of banks and generalized StorableObjects).
Preliminary profile results seem to indicate that the bulk of the additional CPU between the two sets of files is spent in postread() activities related to tracking detectors.

Reading a banks-only data file (root v2.22.06 optimized)
-----------------------------------
Each sample counts as 0.01 seconds.
     %     self                self    total
  time  seconds      calls  ms/call  ms/call  name
 10.44     2.13    5002138     0.00     0.00  DatarepConversion::copy_swap_bytes_03_12(int *, const int *, unsigned int)
  7.55     1.54     307208     0.01     0.03  StorableBank::Streamer( (TBuffer &))
  4.95     1.01      20216     0.05     0.05  DatarepConversion::copy_swap_bytes_01_23(int *, const int *, unsigned int)
  4.80     0.98    1060932     0.00     0.01  StorableBank::convert_body_after_input( (void))
  4.17     0.85     193829     0.00     0.01  TRY_Bank_Type::string2vector_mixed_block( const(std::basic_string, std::allocator> const &, int &, int &, int &, int, int &, TRY_Vector &, bool))
  3.48     0.71    5012687     0.00     0.00  convert_mono_block_data(int *, const int *, unsigned int, unsigned int)
  3.09     0.63     499380     0.00     0.01  convert_mixed_block_data(int *, const int *, const int *, unsigned int, unsigned int)
  2.94     0.60    2124778     0.00     0.00  Id::Streamer( (TBuffer &))
  2.89     0.59     731938     0.00     0.00  TRY_Bank_Type::parse_typestring_atom( const(std::basic_string, std::allocator> const &, int, int, int &, int &))
  2.84     0.58    1060932     0.00     0.00  StorableObject::Streamer( (TBuffer &))
  2.70     0.55    1060932     0.00     0.00  void std::list::push_back(const T1 &) [with T1=StorableObject *, T2=std::allocator]
  2.65     0.54     797700     0.00     0.01  convert_mixed_bank_data(int *, const int *, unsigned int, unsigned int)

All other entries deleted from listing....

Total cumulative CPU time  = 20.40
Total number of bytes read = 462929920 (YBOS equivalent data size)

Reading an MDC-1 data file (root v2.22.06 optimized)
-----------------------------------
Each sample counts as 0.01 seconds.
     %     self                self    total
  time  seconds      calls  ms/call  ms/call  name
  7.61     5.44   58348144     0.00     0.00  EventRecord::ConstIterator::p_object( const(void))
  6.69     4.78     192715     0.02     0.08  EventRecord::ConstIterator::__ct( (EventRecord const *, Id const &))
  6.16     4.40   58389136     0.00     0.00  std::list>::const_iterator::operator++(std::list::const_iterator &(void))
  5.64     4.03   52780971     0.00     0.00  StorableBank::type_size( const(void))
  5.04     3.60        456     7.89    49.63  void StoredSiClusterData::readBanks(const T1 *, const T2 *, const SiStripInfoSet *) [with T1=QSIC_StorableBank, T2=QSIP_StorableBank]
  4.31     3.08    3522188     0.00     0.00  SIXD_StorableBank::Data_Iter::nearest( (void))
  3.23     2.31    9872644     0.00     0.00  StorableBank::get_I4_element( const(int, int))
  2.74     1.96    2526980     0.00     0.00  __kai::rb_tree_node_base *__kai::rb_tree::__search(const T1 &, __kai::rb_tree_base::search_mode) const [with T1=SiDigiCode, T2=std::less, N3=(unsigned int)16]
  2.43     1.74    1776507     0.00     0.00  bool SiStripInfoSet::accumulate(const T1 &, const T2 &) [with T1=SiDigiCode, T2=SiStrip]
  2.38     1.70    1942047     0.00     0.00  DatarepConversion::copy_swap_bytes_03_12(int *, const int *, unsigned int)
  2.10     1.50     575133     0.00     0.01  SiHitSet::insertValue( (SiDigiCode const &, SiHit *))
  1.97     1.41    7215556     0.00     0.00  StorableBank::get_I2_element( const(int, int))
  1.92     1.37    1537256     0.00     0.00  ISLD_StorableBank::Data_Iter::nearest( (void))
  1.69     1.21        456     2.65    29.00  void TRYRun2SiStripSet::readSIXDBank(const T1 &, const T2 *) [with T1=SIXD_StorableBank, T2=MSVX_StorableBank]
  1.33     0.95    5061583     0.00     0.00  StorableBank::get_BY_element( const(int, int))
  1.30     0.93     752388     0.00     0.00  __kai::rb_tree_node_base *__kai::rb_tree::__modify(const T1 &, bool *, __kai::rb_tree_base::modify_mode) [with T1=SiDigiCode, T2=std::less, N3=(unsigned int)16]
  1.23     0.88    1150284     0.00     0.00  SiStripInfoSet::ConstIterator SiStripInfoSet::findStrip(const T1 &, int) const [with T1=SiDigiCode, T2=SiStrip]
  1.13     0.81     575133     0.00     0.00  void SiDigiCodeRefSet::append(const SiDigiCode &, T1 *) [with T1=SiCluster]
  1.08     0.77    5151579     0.00     0.00  SiStripInfoSet::ConstIterator::operator!=(bool const(SiStripInfoSet::ConstIterator const &))
  1.08     0.77     575133     0.00     0.00  SiCluster::updateClusterParam( (void))
  1.01     0.72     250151     0.00     0.01  SiDigiCodeRefSet::insert(SiDigiCodeRefSet::DigiIterator (SiDigiCode const &))
  0.98     0.70        456     1.54     9.54  void TRYRun2SiStripSet::readISLDBank(const T1 &, const T2 *) [with T1=ISLD_StorableBank, T2=MISL_StorableBank]
  0.88     0.63     571787     0.00     0.00  SiCluster::SiCluster(T1, T1) [with T1=SiStripInfoSet::ConstIterator]
  0.83     0.59        457     1.29     2.12  SiDigiCodeValueSet::__dt( (void))
  0.77     0.55          1   550.00   550.25  CT_HitSet::__ct( (void))

All other entries deleted from listing....

Total cumulative CPU time  = 71.48
Total number of bytes read = 180911338 (exact data file size)

Rob K. suggested that, in addition to examining the methods involved here for simple C++-level optimizations, we also look at three alternatives to streamline I/O-intensive programs like a data sample splitter.

1) Make postread() and prewrite() optional, or have them called on "touching" an object. This would be more like the intent of activate() and deactivate(), in that the work is done only if the object is touched. But we need to be very clear what we mean by touched, since many operations may request the StorableObject base class information of an object without needing to activate the object. There may be problems with objects containing pointers into other objects, since those pointers may not be valid until an activate() is performed. We may use a smart pointer for this purpose, which would know to activate the object it points into as well as the object it points at.

2) Split events into at least a "Header" branch and a "Data" branch. Only unpack the Header branch to get at trigger bits for splitting. Repack the Header branch, but just make a copy of the packed Data branch to write the event to the output file.
At the very least, we can skip the postread() call to the Data branch (which has implications for how we store objects coming from different branches in the event).

3) Related to (2), we want to request not only a write(TBuffer&) with one TBuffer per TBranch, but also a similar read operation. The idea is to minimize the work done by ROOT for branches where we just want to copy that branch's piece of the event from one disk file to another. While the write(TBuffer&) seems relatively easy to implement in ROOT, we need to consider the read(TBuffer&) request carefully.

Kevin pointed out that we need these optimizations by the end of March if the online, Level3, and farms groups are to have time to adapt to them and test them before MDC-2. This puts a high priority on understanding how much we will gain from each approach, and how soon we can implement them.

Philippe is looking into the write(TBuffer&) request, and now the read(TBuffer&) request. He may be able to develop a solution in March independently of the rest of the ROOT team.

IV) Any Other Business
----------------------

Marge brought up the point of data browsing using ROOT as a potentially very useful and visible tool to help users adapt to ROOT-based files. It is not yet known how useful data browsing at the ShowMembers() level would be, however. Rob K. pointed out that ShowMembers() for raw data banks would only print out the bank name, bank number, number of type words, number of data words, and an integer array containing the type and data words. No interpretation of the internal data structure would take place.

Marge asked that someone investigate how much work it would take to use cint as an interpreter to execute methods in a class which do know how to interpret the internal data structure. Since raw data and some other banks use templated data iterator classes, it is not clear how many bank classes of interest would be interpretable by cint.
Liz S-K and Philippe agreed to try loading our current ROOT dictionary .so's and see what functionality we might be able to achieve with reasonable effort.

.the end.