Python API invocation of gretel fails on GPFS (or Lustre) filesystems at large processor counts
Summary
On GPFS filesystems, gretel tries to recombine the local .slf part files before they are entirely written to disk
Environment
- Operating System: Linux on HPC
- TELEMAC version: V9P0
Steps to reproduce
Run a Python-API-driven instance of Telemac at a high processor count (or an ensemble of simultaneous Telemac instances) on an HPC machine with a large number of cores per node.
First spotted while running ensemble data assimilation from the fishstick branch (but the Python API is unchanged with respect to the main branch).
What is the current bug behaviour?
GPFS (and Lustre) filesystems use a cache mechanism for queueing I/O.
In practice, even after the writes return and the file is closed, there is no guarantee that the data has actually been flushed to disk; it may still sit in the cache buffer.
Currently, upon termination of a Telemac run, the Python API finalize method invokes gretel via the concatenation_step to collect the results. If one of the part files has not yet been entirely flushed to disk, gretel fails when it detects that records are missing from some input files.
What is the expected correct behaviour?
The code should wait (up to a reasonable maximum amount of time) until all the local part files contain the expected number of records before invoking gretel.
Relevant logs and/or screenshots
ERROR IN res_Chinon_2024.slf00007-00004: RECORD: 9 IS NOT WITHIN [0,
8 ]
PLANTE: PROGRAM STOPPED AFTER AN ERROR
RETURNING EXIT CODE: 2...
Possible fixes
A possible workaround would be to add, in the finalize method, a loop (with a short pause between iterations) that checks the current number of records in the local output files before invoking gretel, with a safeguard on the maximum number of iterations.
In the simplest form this could be implemented on the master proc, looping over the part files just before calling the gretel wrapping API; more effectively, it could be done in parallel, each member checking the number of records of its own part file and verifying that the MPI.MIN and MPI.MAX reductions of those counts are equal (see the sketch below).
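A minimal sketch of the parallel variant, assuming mpi4py is available in the run environment and assuming a hypothetical count_records() helper that returns the number of time records currently readable in a local .slf part file (such a helper could be built on the SELAFIN readers shipped with the TELEMAC python scripts):

```python
# Sketch only: count_records is a hypothetical callable provided by the caller;
# mpi4py is assumed to be available alongside the Python API.
import time
from mpi4py import MPI


def wait_for_part_files(part_file, expected_records, count_records,
                        comm=MPI.COMM_WORLD, pause=1.0, max_iter=60):
    """Block until every rank sees its local part file fully written.

    Each rank polls the record count of its own part file; the loop exits once
    the MPI.MIN and MPI.MAX reductions of the counts both equal the expected
    number of records, or after max_iter iterations (the safeguard).
    Returns True if all part files are complete, False on timeout.
    """
    for _ in range(max_iter):
        nrec = count_records(part_file)
        nrec_min = comm.allreduce(nrec, op=MPI.MIN)
        nrec_max = comm.allreduce(nrec, op=MPI.MAX)
        if nrec_min == expected_records and nrec_max == expected_records:
            return True
        time.sleep(pause)  # give the filesystem cache time to flush
    return False
```

The master-proc variant mentioned above is essentially the same loop without the reductions, run on rank 0 over all the part files just before calling the gretel wrapper.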
Don't hesitate to get in touch if a discussion is needed.
References
GPFS is a “client-side cache” design. The cache is kept in a dedicated and pinned area of each application node’s memory called the pagepool and is typically around 50 Mbytes per node. This cache is managed with both read-ahead (prefetch) techniques and write-behind techniques. Consistency is maintained by the token manager server of the mmfsd daemon. There is one such copy of the mmfsd running within the entire SP parallel computer. The read-ahead algorithms are able to discover sequential access and constant-stride access. GPFS is multi-threaded. As soon as an application’s write buffer has been copied into the pagepool, the write is completed from an application thread’s point of view. GPFS schedules a worker thread to see the write through to completion by issuing calls to the VSD layer for communication to the I/O node. The amount of concurrency available for write-behind and read-ahead activities is determined by the system administrator when the file system is installed.