Comparison of performance of silicon clustering between single muon events and b-bbar events.

On average a single muon event took 0.014 seconds and a b-bbar event 0.135 seconds.
The table below shows how the time in the event step of the SiClusteringModule was spend 
for both types of events. Only the 5 most prominent functions and the entry for operator 
new is shown in the table. Notice that the percentage entry for new is not correct.
The entries for Incl.. CPU state how much time was spend in the function and all functions 
called from it in total. The Excl. CPU column shows how much time was spend in the 
function alone, excluding all of the functions it has called. The calls column indicates
how often during the program the function was called.
----------------------------------------------------------------------
           Incl.  Excl.   Calls 
 Attrib.    CPU    CPU            
    %     (secs) (secs)  
1000 single muons:
-----------------          
Immediate Parents:
   100.0% 14.019  0.010   1000    APPSequence::event(TRY_Abstract_Record*)
Selected function:
     0.2% 14.009  0.030   1000    SiClusteringModule::event(TRY_Abstract_Record*)
Immediate Children:
    61.4%  8.595  0.050   1000    TRYSiClusterData::writePersistent(void)
    16.7%  2.333  1.080   1000    TRYRun2SiStripSet::_readSIXDBank(void)
    12.3%  1.722  1.170   1000    TRYRun2SiStripSet::_readISLDBank(void)
     6.6%  0.924  0.020   1000    SiClusterFinder::findClusters(void)
     1.3%  0.184  0.000   1000    TRYSiClusterData::~TRYSiClusterData(void)
     0.0%  0.536  0.030 312683    operator new(unsigned int)
10 b-bbar:
----------
    46.3%  0.627  0.000     10    SiClusterFinder::findClusters(void)
    18.4%  0.249  0.020     10    TRYSiClusterData::writePersistent(void)
    18.3%  0.248  0.010     10    TRYRun2SiStripSet::_readSIXDBank(void)
    10.1%  0.137  0.000     10    TRYSiClusterData::~TRYSiClusterData(void)
     3.4%  0.047  0.010     10    TRYRun2SiStripSet::_readISLDBank(void)
     0.0%  0.383  0.050 182745    operator new(unsigned int)

For the single muons the by far dominant task is the persistent write of the cluster data. 
The persistent write is less than one third of the percentage in the b-bbar events, but is 
still the second largest time consumer. This indicates that the persistent write has a large
"once only" overhead. For b-bbar the write persistent and "read DBanks" functions are 
comparable in time consumption and make up approximately have of the total CPU time. In the 
muon events the total time spent on persistent I/O is approximately 90% of the time.

It is also apparent that the readISLDBank/ReadCVXDBank ratio is close to 1 in the muon case
indicating that these functions have similar overhead. However for the b-bbar events the 
ISLDBank seems to read much fast than the SVXD bank indicating that it contains either less
data or is more efficiently read.

The firsts level breakdown shows that in semi "realistic" events like b-bbar the cluster
finding is dominant and performance increase is most likely to be gained there.

Caveat: 
Note that these events do not have an underlying event nor do they have any noise or 
realistic hit generation simulations in them. It is therefore to be expected that 
"Real" events will have significantly more hits in them and increase the time spend in 
both the read DBank and findCluster functions.

In order to be able to compare the muon and b-bar events more directly with each other 
the following table shows the inclusive and exclusive CPU times spend in each function 
divided by the number of events in milli seconds for both types of events.

   single-mu   |     b-bbar   |     Function Name
 Incl.   Excl. |  Incl.  Excl.|
------------------------------------------------------------
 8.595   0.050 |  24.9   2.0  | TRYSiClusterData::writePersistent(void)
 2.333   1.080 |  24.8   1.0  | TRYRun2SiStripSet::_readSIXDBank(void)
 1.722   1.170 |   4.7   1.0  | TRYRun2SiStripSet::_readISLDBank(void)
 0.924   0.020 |  62.7   0.0  | SiClusterFinder::findClusters(void)
 0.184   0.000 |  13.7   0.0  | TRYSiClusterData::~TRYSiClusterData(void)
 0.536   0.030 |  38.3   5.0  | operator new(unsigned int)

Most noticeable in this table is the increase in the findClusters function. The 
findCluster function spends it time mainly in SiClusterFinder::_ClusterFindingLoop 
(70% for muons, 81% for b-bbar) and in SiClusterList::addDetector (21% mu, 16% b-bbar).

The SiClusterFinder::_ClusterFindingLoop is in both cases dominated by 
SiClusterFinder::_generateClusters(const SiDigiCode&) (78% mu, 72% b-bbar)
Inside the _generateClusters function the muon and b-bbar sample start to deviate as
can be seen in the table below.

--------------------------------------------------------------------
           Incl.  Excl.   Calls 
 Attrib.    CPU    CPU            
    %     (secs) (secs)            
1000 single muons:
-----------------          
Immediate Parents:
   100.0%  0.653  0.040   12470  SiClusterFinder::_clusterFindingLoop(const SiDigiCode&)
Selected function:
    11.7%  0.511  0.060   12474  SiClusterFinder::_generateClusters(const SiDigiCode&)
Immediate Children:
    27.8%  0.142  0.000    9590  SiLayerFinder::newHit(SiCluster*,const SvxIslWafer*)
    14.4%  0.074  0.000    2884  SiZLayerFinder::newHit(SiCluster*,const SvxIslWafer*)
    11.7%  0.536  0.030  312683  operator new(unsigned int)
     7.9%  0.041  0.000    9590  SiLayerFinder::newCluster(void)
     7.2%  0.037  0.000   12474  __insert_multi__Q2_5__kai72...(Abbreviated)
     5.9%  0.030  0.030    2884  SiZLayerFinder::findClusters(void)
     5.9%  0.030  0.010   12474  SiLayerFinder::reset(void)
     2.4%  0.012  0.000    2884  SiZLayerFinder::newCluster(void)
     2.0%  0.010  0.010    9590  SiLayerFinder::findClusters(void)
     1.2%  0.020  0.020   72778  SiStrip::SiStrip(const SiStrip&)
     1.1%  0.060  0.060  138255  __search__Q2_5__kai72...(Abbreviated)
     0.7%  0.010  0.010   37388  __upper_bound__Q2_5__kai72...(Abbreviated)
     0.2%  0.010  0.010  136151  SiDigiCode::SiDigiCode(const SiDigiCode&)
     0.2%  0.010  0.010  160768  SiDigiCode::~SiDigiCode(void)
     0.0%  0.010  0.010  112758  __kai::rb_tree_base::...(Abbreviated)
10 b-bbar:
----------
    22.9%  0.083  0.000    7372  SiLayerFinder::newHit(SiCluster*,const SvxIslWafer*)
    21.4%  0.383  0.050  182745  operator new(unsigned int)
    12.7%  0.046  0.000    7372  SiLayerFinder::newCluster(void)
     8.0%  0.029  0.000    9607  __insert_multi__Q2_5__kai72...(Abbreviated)
     6.9%  0.025  0.000    2235  SiZLayerFinder::newHit(SiCluster*,const SvxIslWafer*)
     6.5%  0.050  0.050   20395  __upper_bound__Q2_5__kai72...(Abbreviated)
     3.8%  0.014  0.000    2235  SiZLayerFinder::newCluster(void)
     3.3%  0.030  0.030   69014  SiStrip::SiStrip(const SiStrip&)
     2.8%  0.010  0.010    7351  SiLayerFinder::findClusters(void)
     2.3%  0.090  0.090  103218  __search__Q2_5__kai72...(Abbreviated)
     0.8%  0.030  0.030   80208  __kai::rb_tree_base::...(Abbreviated)
     0.4%  0.010  0.010   65559  SiDigiCode::~SiDigiCode(void)
     0.0%  0.000  0.000    2224  SiZLayerFinder::findClusters(void)
Note: The strange looking names ending in ....(abbreviated) are templated functions
      which are not correctly De-mangled by cvd. 

It is difficult to interpret the above table. It seems that, as expected from the number
of Z- and phi-Layers, the fraction of time spent in the making of ZLayer Clusters and Hits 
is always smaller than that of the Phi Layers. For the muons however the Z-Hits and 
Z-Clusters seem to take slightly more time than expected.
Footnote:
Hits are, as far as I understand the space points made from the clusters as is indicated 
by the fact that the newHit and newCluster functions are equally often called.  

  #ZHits  or ZClusters							#Z-Layers
 ---------------------- = 0.300 (for muons) 0.303 (for b-bbar) =approx. ----------   
 #PhiHits or PhiClusters						#Phi-Layers

 %time(ZHits and ZClusters)	   16.8		       10.7
 ------------------------------ = ------=0.47(muons)  ------=0.300(b-bbar)
 %time(PhiHits and PhiClusters)	   35.7		       35.6

However the time spend in operator new is taking a very large fraction of the total time.
This is in particular so for the b-bbar events. This may be a sign that the creation 
of individual objects for Clusters and Hits may be rather expensive and these objects 
could be merged, saving the creation of the Hits but possibly requiring a larger Object
like a cluster with additional space point information being used instead.



++++++++
Summary:
++++++++

In total the clustering of a bbar event takes 0.135 seconds which is 14% of the estimated 
time available for an event to pass the complete level3 trigger. This does seem to be 
a lot given the many other task the level3 trigger has to perform. 
46% of the time (for b-bbar events) is spend in the actual cluster finding and 
Hit making whereas 22% fall to the reading of the DBanks (SVXD=18.3 ISLD=3.4) and
18.4% to the persistent write of the results banks (clusters and hits).

During the cluster finding a significant fraction of the time is spent in operator new, 
creating Hit and Cluster Objects. Maybe a merge of these structures should be considered.
New as called in the cluster finding only takes 5% of the total time of the 
SiClusteringModule. In total new takes 28.3%, so nearly one third of the modules time.
The corresponding time spend in delete or automatic delete-like functions is not easy 
to determine but can be nearly as big. This may indicate that the module is spending 
over 50% of its time in memory management functions.

Regional unpacking of the DBanks may be a solution to this problem as it will limit the 
number of clusters and hits to be created. Even a rather complicated access mechanism 
may still prove profitable for this task.