HPC Benchmark Tests with FDMNES                                      28/03/2019
===============================

Rainer Wilcke, ESRF, Grenoble (wilcke@esrf.fr)

This writeup contains the results of the HPC benchmarking tests that were done
by the ESRF during the HNSci Cloud project in 2018. It is organized as follows:

1) FDMNES software used for the benchmark test
2) hardware of the three compute clusters used
3) benchmarking results
4) discussion of the results

1) The FDMNES Program
=====================

FDMNES is a program for the calculation of X-ray absorption spectra in
materials. For this, it processes a user-defined range of energies, calculates
the resulting spectra and writes them to output. Depending on the size of the
models investigated, it can be very compute and memory intensive - up to 100
cores with 100 GB of memory per parallel process for up to a week is not
uncommon. However, both its input and output data files are relatively small,
in total typically less than 1 GB. This makes it an interesting test case for
HPC calculations: a high demand on compute power and a low demand on data
transfer capacity.

FDMNES uses parallel processing for the calculations, with two levels of
parallelisation.

The principal parallelisation is over the range of energies. This can require
rather large amounts of memory, and it is often not possible to run as many
parallel processes on a node as there are cores, because there is not enough
memory to accommodate them. It is then necessary to "undersubscribe" the
nodes, i.e. to use fewer parallel processes than cores in order to give each
process more memory. This parallelisation scales well with an increasing
number of parallel processes.

The energy calculations involve the solution of a sparse matrix equation,
which in turn can be parallelised. Each energy calculation then uses a
user-defined number of parallel processes for the matrix calculation. This
parallelisation goes into saturation at about 4 parallel matrix processes per
energy.

In principle, the most efficient way to use the program is not to parallelise
the matrix calculation and to run the energy calculation on as many parallel
processes as there are cores. However, if this is not possible because of the
memory requirements, it can be advantageous to parallelise the matrix
calculation as well, because it needs only little additional memory. In this
way, some of the cores that would otherwise sit idle are used for the matrix
calculations, resulting in a moderate but still noticeable additional speedup.

The total number of parallel processes in an FDMNES run is given by

    ntotal = (# processes energy calculation) * (# processes matrix calculation)
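
To illustrate the undersubscription arithmetic, the following short Python
sketch (purely illustrative, not part of the benchmark setup; the numbers are
those of the ESRF "radius 7.0" run listed in section 3) computes how many
processes fit on a node and the total process count of the run:

    # Minimal sketch of the undersubscription arithmetic (illustration only).
    # Numbers are taken from the "radius 7.0" ESRF run listed in section 3.

    mem_per_node_gb = 252   # memory available per node
    cores_per_node = 28     # cores available per node
    mem_per_proc_gb = 50    # memory needed per parallel process
    nodes = 8               # nodes used for the run
    procs_energy = 4        # parallel processes for the energy loop
    procs_matrix = 8        # parallel matrix processes per energy

    # Here the memory, not the core count, limits the processes per node.
    fit_per_node = min(mem_per_node_gb // mem_per_proc_gb, cores_per_node)

    # Total number of parallel processes of the run (formula above).
    ntotal = procs_energy * procs_matrix

    print(f"memory allows at most {fit_per_node} of {cores_per_node} cores per node")
    print(f"ntotal = {procs_energy} * {procs_matrix} = {ntotal}"
          f" ({ntotal // nodes} processes per node)")
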
2) Hardware of the Compute Clusters
===================================

During the HNSci Cloud project, the program was run on both the OTC and the
RHEA (Exoscale) clusters, and also on the local compute cluster of the ESRF
for comparison. Models of different sizes were computed, identified in the
tables below by the increasing value of the "radius" parameter on top of each
table.

The principal characteristics of the computers used are listed below. Note
that the OTC "e1.8xlarge" node type was only used in the test case
"radius 7.0", to find out the effect of using fewer nodes but with a much
larger memory per node. These results are easily identified by having 940 GB
of memory per node.

RHEA:  Titan          128 GB   16 cores, 2.2 GHz
OTC:   m2.8xlarge.8   256 GB   28 cores, 2.6 GHz
OTC:   e1.8xlarge     940 GB   18 cores, 2.3 GHz
ESRF:  hpc3/hib3      252 GB   28 cores, 3.3 GHz

The ESRF cluster has some nodes (labeled "hib3" or "ib") that are connected by
the fast Infiniband network, whereas others (labeled "tcp") are not. The two
types can be mixed in a calculation, in which case the network uses only the
slower "tcp" connection type. This was used for some tests ("radius 7.0") to
find out the influence of the network speed on the calculations. If "both" or
just "ESRF" is stated, a mix of both node types was used for the calculation.

Just taking into account the clock rate of the processors, the speedup should
be as follows:

OTC e1.8xlarge   / RHEA Titan     1.05
OTC m2.8xlarge.8 / RHEA Titan     1.18
ESRF             / RHEA Titan     1.50
OTC m2.8xlarge.8 / e1.8xlarge     1.13
ESRF             / e1.8xlarge     1.43
ESRF             / m2.8xlarge.8   1.27

In principle, one would also have to take into account that some of the
clusters enable hyperthreading on the nodes and others do not. However, in all
of the cases investigated the nodes were heavily "undersubscribed", with
considerably less than half of the "virtual CPUs" active in the calculations.
Thus every process should in principle have run on a real core, and
hyperthreading effects should therefore not have played a role.

3) Results of the Benchmark Tests
=================================

All tests were run on homogeneous clusters, i.e. all nodes in a created
cluster were of the same type. With the exception of OTC for the "7.0" case,
the node type was always the same for a given cluster provider for all tests.

The labels in the tables below mean:

nodes            number of nodes available on the created compute cluster
cores total      total number of cores available on the created compute cluster
cores / node     number of cores available per node
mem / node       total memory available per node (GB)
used cor / nd    number of cores used per node for the calculation
procs            total number of parallel processes for the calculation
procs energ      number of parallel processes for the energy calculation
procs matrx      number of parallel processes for the matrix calculation
used Mem / Proc  amount of memory used per parallel process (GB)
elapsed time s   elapsed time (walltime) in seconds for the complete calculation

radius: 4.4

cluster      nodes  cores  cores  mem/  used    procs  procs  procs  used      elapsed
                    total  /node  node  cor/nd         energ  matrx  Mem/Proc  time s
ESRF             4    112     28   252     8       32      8      4       12     24112
RHEA             8    128     16   128     4       32      4      8       12     43268
OTC              4    128     32   256     8       32      8      4       12     21178

radius: 5.5

cluster      nodes  cores  cores  mem/  used    procs  procs  procs  used      elapsed
                    total  /node  node  cor/nd         energ  matrx  Mem/Proc  time s
ESRF (tcp)       4    112     28   252     8       32      8      4       24     57252
RHEA             8    128     16   128     4       32      4      8       24    102071
OTC              4    128     32   256     8       32      8      4       24     49957

radius: 6.0

cluster      nodes  cores  cores  mem/  used    procs  procs  procs  used      elapsed
                    total  /node  node  cor/nd         energ  matrx  Mem/Proc  time s
ESRF (tcp)       4    112     28   252     8       32      8      4       32     84041
RHEA             8    128     16   128     4       32      4      8       32    158502
OTC              8    256     32   256     4       32      4      8       32    129869

radius: 7.0

cluster      nodes  cores  cores  mem/  used    procs  procs  procs  used      elapsed
                    total  /node  node  cor/nd         energ  matrx  Mem/Proc  time s
ESRF (ib)        6    168     28   252     4       24      4      6       50    258845
ESRF (tcp)       8    224     28   252     4       32      4      8       50    229053
ESRF (ib)        8    224     28   252     4       32      4      8       50    212616
ESRF (both)      8    224     28   252     2       16     16      1      120    226428
RHEA             8    128     16   128     2       16      2      8       50    543056
OTC              8    256     32   256     4       32      4      8       50    252793
OTC              4    128     32   940     8       32      8      4       50    271005
OTC              4    128     32   940     8       32     32      1      120    271592
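
As a quick cross-check between these measurements and the clock-rate
expectation of section 2, a small helper along the following lines (purely
illustrative, not part of the benchmark setup) can be used; the numbers shown
are those of the "radius 4.4" case:

    # Illustration only: compare measured speedups (ratios of elapsed times)
    # with the speedups expected from the clock rates alone (section 2).
    # Elapsed times are the "radius 4.4" results from the table above.

    elapsed = {"ESRF": 24112, "RHEA": 43268, "OTC": 21178}   # seconds
    clock   = {"ESRF": 3.3,   "RHEA": 2.2,   "OTC": 2.6}     # GHz (m2.8xlarge.8)

    for fast, slow in [("ESRF", "RHEA"), ("OTC", "RHEA"), ("OTC", "ESRF")]:
        measured = elapsed[slow] / elapsed[fast]
        expected = clock[fast] / clock[slow]
        print(f"{fast} vs {slow}: measured {measured:.2f}, "
              f"expected from clock rate {expected:.2f}")
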
4) Discussion of the Results
============================

As expected, both the total elapsed time and the memory used per process
increase with the size of the problem calculated (indicated by the increasing
"radius" value in the tables).

It can also be seen that in the cases "radius 4.4" and "radius 5.5" the RHEA
cluster is considerably slower than the two other clusters, in spite of the
total number of processes being identical. It is also slower than one would
expect if this were simply due to the lower clock speed: it is about a factor
of 2 slower than the two others, whereas the clock speed would account for a
factor of 1.5 at most.

In the "radius 6.0" case, this difference is much less pronounced, and for the
"RHEA / OTC" combination it is compatible with the speedup expected from the
clock rates. However, in that case the ESRF elapsed time is considerably lower
than for the two other clusters, which was not the case for ESRF versus OTC in
the "radius 4.4" and "radius 5.5" cases.

The difference seems to be due to the fact that in the first two cases the
calculations on the RHEA cluster used 8 parallel processes for the matrix
calculation and only 4 for the energy loop, whereas the calculations on the
ESRF and OTC clusters used 4 for the matrix and 8 for the energies. In the
"radius 6.0" case, both RHEA and OTC used the "8 for matrix, 4 for energies"
combination, whereas the ESRF still used "4 for matrix, 8 for energies". This
indicates that the parallelisation of the matrix calculation reaches
saturation when going from 4 to 8 parallel processes, the 4 additional matrix
processes contributing essentially no further speedup. The larger number of
parallel energy processes, however, is not in saturation and makes the ESRF
calculations much faster.

If one assumes that the energy calculation indeed scales linearly with the
number of processes and that the matrix calculation goes completely into
saturation above 4 parallel processes, then multiplying the ESRF elapsed time
for the "6.0" case by 2 should correspond to a "4 for matrix, 4 for energy"
combination and be comparable to the "8 for matrix, 4 for energy" results of
the two other clusters. This, however, would give a value of about 168000
seconds for the ESRF cluster, which seems too high: in the "4.4" and "5.5"
cases the OTC cluster is about 10% faster than the ESRF one, whereas comparing
the measured OTC value of 129869 seconds for the "6.0" case with the
hypothetical 168000 seconds gives a difference of about 25%. It thus seems
that the matrix calculation does still contribute a bit to the speedup when
using more than 4 parallel processes.
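
Written out with the numbers from the "radius 6.0" table, this argument looks
as follows (purely illustrative sketch):

    # Assumption from the text: the energy loop scales linearly with the number
    # of processes, the matrix parallelisation is saturated above 4 processes.
    esrf_8e_4m = 84041     # measured ESRF time, "8 energy, 4 matrix" (s)
    otc_4e_8m  = 129869    # measured OTC time, "4 energy, 8 matrix" (s)

    # Halving the energy processes should then double the elapsed time, giving
    # a hypothetical ESRF "4 energy, 4 matrix" run comparable to the OTC one.
    esrf_4e_4m = 2 * esrf_8e_4m    # about 168000 s

    print(f"hypothetical ESRF '4 energy, 4 matrix' time: {esrf_4e_4m} s")
    print(f"measured OTC '4 energy, 8 matrix' time:      {otc_4e_8m} s")
    print(f"the OTC run is about {1 - otc_4e_8m / esrf_4e_4m:.0%} faster")
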
What remains unexplained is that for the "4.4" and "5.5" cases the ESRF
cluster is slower than the OTC cluster by about 10%, in spite of the fact that
from the clock rates one would expect it to be about 25% faster. As discussed
above, the faster ESRF execution for the "6.0" case can be explained by the
different parallelisation setups and is not proof of the better hardware
performance one would expect from the different clock rates. The "7.0" case
makes the picture even less clear: there the ESRF cluster with the "8 matrix,
4 energy" configuration is faster than the same OTC configuration by about
20%, approximately what would be expected from the difference in clock rates.
It should be noted that the "7.0" case is the only one where the ESRF cluster
ran with 8 parallel processes for the matrix calculation.

One can thus speculate that the attribution of cores to processes, and thus
the speed of memory access from the processes, is more favorable in certain
configurations on the ESRF cluster than on OTC and less so in others, but this
cannot be proven from the data. It would be interesting to re-run all these
test cases on all clusters with a "4 matrix, 8 energy" configuration to have
identical conditions, but as the clusters are no longer available this is not
possible.

Several more conclusions can be drawn from the benchmark results, in
particular from the "7.0" test case.

The "4 energy, 8 matrix" configuration was run at the ESRF both on a cluster
where all nodes are connected by the Infiniband network (labeled "ib") and on
one where none of the nodes is ("tcp"). As expected, the Infiniband
configuration is faster, but only by about 8%. This is understandable: FDMNES
uses very little input data and also creates relatively small (compared to the
compute effort) output data. There is thus not much need for interprocess
communication during the calculations, and the network speed contributes
little to the performance.

From the different configurations run for case "7.0", one can also see the
effect of the parallelisation of the matrix calculations. The first was a
"4 energy, 6 matrix, all Infiniband" configuration at the ESRF with 6 nodes
and 24 cores (processes) used; the third was very similar, except that it ran
on 8 nodes with 8 processes for the matrix calculations and thus 32 cores
(processes) used. The compute power thus increased by 33%, but the elapsed
time decreased only by 18%, a clear indication that the parallelisation of the
matrix calculation does not scale linearly with the number of processes.

On the other hand, at least up to 4 parallel processes for the matrix
calculation it scales as well as the energy parallelisation, as can be seen
from the last two cases with the OTC "940 GB memory" nodes. In both cases
4 nodes and 32 cores (processes) are used, but in one case the configuration
is "8 energy, 4 matrix" and in the other it is "32 energy, 1 matrix". The
elapsed times are essentially identical, indicating that the parallelisation
of the matrix calculation is as efficient as the one of the energy
calculation.

It can also be seen that parallelising only in energy is a quite efficient way
to run FDMNES: the ESRF configuration "16 energy, 1 matrix" with 16 used cores
(processes) takes about the same elapsed time as the "4 energy, 8 matrix"
configuration, which has twice the number of used cores. In other words, 16
cores do the job as fast in the first combination as 32 in the second.
However, because of the memory requirements of the energy calculation, this
first case uses only 2 of the 28 cores available on each node. The number of
nodes used for the calculation is identical, only more processors are now
sitting idle.

Given that the matrix calculation seems to scale well up to 4 parallel
processes, the best combination would probably be "8 energy, 4 matrix". This
would use 4 cores on each node and certainly be faster than the "4 energy,
8 matrix" combination. Of course, it is less efficient in compute performance
per used core, but the same number of nodes is used (and thus blocked) and the
elapsed time would be shorter.
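
To make the core-usage comparison of the two ESRF runs discussed above
explicit, here is a short sketch (purely illustrative; "core-seconds" is
simply the number of used processes times the elapsed time):

    # Illustration only: used processes times elapsed time for the two ESRF
    # "radius 7.0" runs compared above (values from the table in section 3).
    runs = {
        "16 energy, 1 matrix": (16, 226428),   # (processes used, elapsed s)
        "4 energy, 8 matrix":  (32, 212616),
    }

    for name, (procs, elapsed) in runs.items():
        print(f"{name}: {procs} processes, {elapsed} s elapsed, "
              f"{procs * elapsed} core-seconds")
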
Of course, the most efficient way to run FDMNES should be to have all cores
active in the calculation, but very often (as in this case) that is not
possible because of the memory requirements. Thus, if one takes nodes with the
same number of cores but more memory, more of the cores on those nodes can be
used, and this should improve the performance. That was tested with a node
type from OTC that has 940 GB and 32 cores, i.e. about 30 GB per core, as
compared to the other OTC node type (similar to the ESRF one), which has 8 GB
per core (256 GB for 32 cores).

Interestingly enough, the nodes with the bigger memory are not as efficient as
one might expect.

First, both "940 GB type" runs are slower than the OTC run on the 256 GB
nodes, with the same number of processes (32) running. Of course, the 256 GB
nodes are about 13% faster in clock speed, but on the other hand the 256 GB
run used the "4 energy, 8 matrix" combination, which is less efficient than
the combinations used for the 940 GB runs. Taking the different clock speeds
into account, all three combinations have about the same performance; the
bigger memory, and thus the higher fraction of used cores per node, does not
seem to help.

Further, if one compares the "32 energy, 1 matrix" combination on the OTC
nodes with 940 GB to the ESRF combination "16 energy, 1 matrix", again the
nodes with the bigger memory are slower in elapsed time. The ESRF nodes have a
1.43 times higher clock speed, but they run only half as many parallel energy
processes. The OTC 940 GB combination should thus be about 40% faster than the
ESRF run, but that is definitely not the case. The reason for this remains
unclear.
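
This last comparison can again be written out explicitly (purely illustrative
sketch; the assumption is that the energy loop scales linearly with the number
of processes):

    # OTC e1.8xlarge "32 energy, 1 matrix" versus ESRF "16 energy, 1 matrix",
    # values from the "radius 7.0" table in section 3.
    esrf_time, esrf_clock, esrf_procs = 226428, 3.3, 16
    otc_time,  otc_clock,  otc_procs  = 271592, 2.3, 32

    # Twice the energy processes at 1/1.43 of the clock speed: expect ~1.4.
    expected = (otc_procs / esrf_procs) / (esrf_clock / otc_clock)
    measured = esrf_time / otc_time     # < 1 means the OTC run is slower

    print(f"expected OTC/ESRF speedup: {expected:.2f}")   # about 1.4
    print(f"measured OTC/ESRF speedup: {measured:.2f}")   # about 0.83
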