I/O Doctors

#1

What is iot?

iot is a Linux application program that inserts the I/O toolkit middleware into the environment of all of its child processes. iot has several options that allow instrumentation to be customized:


  • Wild card name selection of programs and files affected by iot middleware
  • Each child of iot generates its own ilz stream
  • Merging of all generated ilz streams into a single stream
  • Collection of kernel instrumentation such as diskstats, netstats, meminfo, and cpustat
  • Optional delay of iot intervention in programs until MPI_Initialized() returns true.

Benefits


  • No root permission needed
  • No recompiling or relinking
  • Minimal changes to run scripts
  • Compatible with MPI
  • Low overhead

    Examples

    The iot command is a prefix command, used in the same way as time:

                  

    % iot -f dd.icf dd if=/dev/zero of=/dev/null count=1024 bs=1M

    iot arguments

    -f <icf_file_name>
    -c <file>          collect all ilz streams in a single file
    -m <mpi flavor>    valid flavors are mpich, mpt, ompi

    Processes downstream of the iot process are selectable for instrumentation, based on the pathname of the program being executed, by directives in an iot configuration file (icf):

                  

    PROGRAMS.include={"*.exe:a.out"}.exclude={"sh:csh"}
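
The include/exclude selection described above can be sketched in a few lines. This is only an illustration, not iot's actual matching code: it assumes that patterns are separated by colons, that they behave like shell wildcards, and that a pattern without a "/" is compared against the program's basename. The function names are hypothetical.

```python
import os
from fnmatch import fnmatchcase


def parse_patterns(spec):
    """Split a colon-separated pattern string such as "*.exe:a.out"."""
    return [p for p in spec.split(":") if p]


def is_selected(program_path, include, exclude=""):
    """Return True if the program matches an include pattern and no exclude pattern.

    Assumption: a pattern containing "/" is compared with the full path,
    any other pattern is compared with the basename only.
    """
    def matches(pattern):
        if "/" in pattern:
            target = program_path
        else:
            target = os.path.basename(program_path)
        return fnmatchcase(target, pattern)

    included = any(matches(p) for p in parse_patterns(include))
    excluded = any(matches(p) for p in parse_patterns(exclude))
    return included and not excluded
```

With the directive above, `is_selected("/home/user/solver.exe", "*.exe:a.out", "sh:csh")` is true, while `/bin/sh` is rejected because it matches no include pattern.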

    icf example

    ilz.name={PROGRAM+"."+PID}
    PROGRAMS.include={"*.exe"}.exclude={"/bin/*"}
    diskstats.interval={500}.devices={"sd*"}
    meminfo.interval={500}
    FILES.include={"/tmp/**"}
    LAYERS.use={trc,psx}

    #2

    What is Pulse?

    Pulse is a Java graphical application that imports ilz streams and allows the user to view and manipulate the imported data. Pulse has several features that facilitate the analysis of data:


  • Built-in compiler/linker allowing customizable expressions, functions, and macros for data manipulation and plotting
  • Support for numerous data types: int, long, float, double, string, and boolean
  • A built-in library of plotting functions allowing access to Java's plotting capabilities
  • Tabular data with row counts limited only by the amount of heap memory available to the JRE (Java Runtime Environment)
  • NOTE: The largest ilz stream processed by Pulse consisted of 2 tables of 28.6M posix reads and writes, and required only 2.3GB of JRE memory




    #3

    How are Instrumentation Streams used?

    File I/O instrumentation for each program is selectable on a per-file basis, based on each file's pathname, by directives in an icf file:

    FILES.include={*.dat:/tmp/**}.exclude={/scratch/**}
    LAYERS.use={trc,psx}

    Resulting instrumentation is recorded in an ilz file.
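
Directives such as the FILES and LAYERS lines above share a regular shape, a name followed by one or more `.key={value}` settings. The sketch below reads that shape into a dictionary. The grammar is inferred from the examples in this document only, so treat it as an assumption rather than a description of the real icf parser. Values are kept as raw strings.

```python
import re

# Assumed grammar: NAME.key={value}[.key={value}...], one directive per line.
_DIRECTIVE = re.compile(r"^(?P<name>\w+)(?P<rest>(?:\.\w+=\{[^{}]*\})+)$")
_SETTING = re.compile(r"\.(\w+)=\{([^{}]*)\}")


def parse_icf(text):
    """Parse icf-style directives into {name: {key: raw_value}}."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = _DIRECTIVE.match(line)
        if match is None:
            raise ValueError(f"unrecognized directive: {line!r}")
        settings = config.setdefault(match.group("name"), {})
        for key, value in _SETTING.findall(match.group("rest")):
            settings[key] = value
    return config


sample = """
FILES.include={*.dat:/tmp/**}.exclude={/scratch/**}
LAYERS.use={trc,psx}
meminfo.interval={500}
"""
config = parse_icf(sample)
```

Here `config["FILES"]` holds the include and exclude pattern strings, and `config["LAYERS"]["use"]` is the string `trc,psx`.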

    ilz streams are:


  • Stored in a file for post-mortem analysis
  • Delivered via socket for run-time analysis
  • zlib compressed to minimize size
  • Mergeable: multiple ilz streams can be merged concurrently into a single ilz stream

    There are 3 levels of file I/O instrumentation:

    Statistics for each selected file are summed into a one-line summary for the process.

    trc.detail=0 produces the following for every program:

    # date time host user pid program bucket openCount closeCount write(count,bytes,wait) read(count,bytes,wait)
    03/12/15 12:31:26 shemp bauerj 30168 cc1 program_to_psx 192 192 113 547 0.00 191 1219977 0.00
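
As an illustration of how the one-line summary can be consumed by a script, the following sketch splits the sample line into named fields. It assumes the fields are whitespace separated and appear in the order given by the header comment. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoTotals:
    calls: int
    nbytes: int
    wait: float


@dataclass(frozen=True)
class ProcessSummary:
    date: str
    time: str
    host: str
    user: str
    pid: int
    program: str
    bucket: str
    open_count: int
    close_count: int
    write: IoTotals
    read: IoTotals


def parse_summary(line):
    """Parse one trc.detail=0 summary line (15 whitespace-separated fields)."""
    f = line.split()
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")
    return ProcessSummary(
        date=f[0], time=f[1], host=f[2], user=f[3], pid=int(f[4]),
        program=f[5], bucket=f[6],
        open_count=int(f[7]), close_count=int(f[8]),
        write=IoTotals(int(f[9]), int(f[10]), float(f[11])),
        read=IoTotals(int(f[12]), int(f[13]), float(f[14])),
    )


summary = parse_summary(
    "03/12/15 12:31:26 shemp bauerj 30168 cc1 program_to_psx "
    "192 192 113 547 0.00 191 1219977 0.00"
)
```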
    
                

    Statistics for each selected file are output to the ilz file.

    trc.detail=1 produces the following for every file:

    iosim.exe.21016.ilz trc program_to_cache /mnt/raid0/bauerj/iosim/20992/merged/kubota.SCR300 58224 7630618624 8.43
    iosim.exe.21016.ilz trc size(open,close,max)=0,0,460980224  fsType=EXT2 sector=8192
    iosim.exe.21016.ilz trc write         8818 1155792896     1.50
    iosim.exe.21016.ilz trc read         49399 6474825728     4.15
    iosim.exe.21016.ilz trc ftunc            5  948043776     2.79
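
The per-operation lines above can be turned into average request sizes and transfer rates. The sketch below assumes the three numeric columns are call count, bytes, and wait time, and it further assumes that the wait time is in seconds, which the output itself does not state. Function names are hypothetical.

```python
def parse_file_ops(text):
    """Return {operation: (calls, nbytes, wait)} from trc.detail=1 output."""
    ops = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # the header and size lines have a different shape
        _stream, _layer, op, calls, nbytes, wait = fields
        try:
            ops[op] = (int(calls), int(nbytes), float(wait))
        except ValueError:
            continue
    return ops


def describe(ops):
    """Derive average request size and transfer rate per operation."""
    report = {}
    for op, (calls, nbytes, wait) in ops.items():
        report[op] = {
            "avg_bytes_per_call": nbytes / calls if calls else 0.0,
            # Assumption: the wait column is measured in seconds.
            "bytes_per_second": nbytes / wait if wait else float("inf"),
        }
    return report


sample = """\
iosim.exe.21016.ilz trc program_to_cache /mnt/raid0/bauerj/iosim/20992/merged/kubota.SCR300 58224 7630618624 8.43
iosim.exe.21016.ilz trc size(open,close,max)=0,0,460980224  fsType=EXT2 sector=8192
iosim.exe.21016.ilz trc write         8818 1155792896     1.50
iosim.exe.21016.ilz trc read         49399 6474825728     4.15
iosim.exe.21016.ilz trc ftunc            5  948043776     2.79
"""
ops = parse_file_ops(sample)
report = describe(ops)
```

For this sample, reads and writes both average 131072 bytes (128 KiB) per call, and the read and write byte counts add up to the 7630618624 bytes shown on the first line.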
                

    Blow-by-blow details of each I/O event for each selected file are logged in the ilz file.

    trc.detail=2 provides tables such as the following for each POSIX I/O event for each file, in addition to the per-file summary information from trc.detail=1.

    POSIX

    For example, the details for a POSIX read are:

    Visualized in Pulse

    #4

    Global Probe Instrumentation

    meminfo

    Memory information for a program being instrumented can be enabled with a directive in an icf file:

    ​meminfo.interval={500}

    The above requests the following memory related details be logged to the ilz file at an interval of 500 milliseconds.

    Visualized in Pulse
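
To make the idea of interval sampling concrete, here is a minimal sketch that reads /proc/meminfo at a fixed interval. It is not iot's probe and does not claim to record the same fields. It simply parses every `Name: value` line the kernel reports, and the function names are hypothetical.

```python
import time


def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def sample_meminfo(samples, interval_ms=500, path="/proc/meminfo"):
    """Yield (elapsed_seconds, fields) once every interval_ms milliseconds."""
    start = time.monotonic()
    for i in range(samples):
        with open(path) as fh:
            yield time.monotonic() - start, parse_meminfo(fh.read())
        if i + 1 < samples:
            time.sleep(interval_ms / 1000)
```

On a Linux host, `for t, fields in sample_meminfo(3): print(t, fields["MemFree"])` prints three samples roughly 500 milliseconds apart.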

    diskstats

    Instrumentation of disk statistics, as reported in /proc/diskstats, can be enabled with a directive in an icf file:

    ​diskstats.interval={100}.devices={sd*}

    The above requests the following diskstats related details be logged to the ilz file at an interval of 100 milliseconds. For each selected device, both a read table and a write table are generated.

    Visualized in Pulse
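
The split into a read table and a write table per device can be sketched as follows. The field order follows the Linux kernel's documented /proc/diskstats layout (major, minor, device name, then four read counters and four write counters). The device filter uses shell-style wildcards, which is an assumption about how a pattern like `sd*` is interpreted. The function name is hypothetical.

```python
from fnmatch import fnmatchcase


def parse_diskstats(text, device_pattern="sd*"):
    """Return {device: {"read": {...}, "write": {...}}} for matching devices."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14 or not fnmatchcase(f[2], device_pattern):
            continue
        # f[3:11]: reads completed, reads merged, sectors read, ms reading,
        #          writes completed, writes merged, sectors written, ms writing
        n = [int(x) for x in f[3:11]]
        stats[f[2]] = {
            "read": {"ios": n[0], "merges": n[1], "sectors": n[2], "ms": n[3]},
            "write": {"ios": n[4], "merges": n[5], "sectors": n[6], "ms": n[7]},
        }
    return stats
```

Sampling the file twice and subtracting the counters gives the activity during the interval, since the kernel reports cumulative totals.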

    Lustre OSC statistics

    Lustre client OSC usage statistics, as reported in /proc/fs/lustre/osc, can be enabled with a directive in an icf file:

    ​lfsstats.interval={10}.devices={snx*}

    Visualized in Pulse

    Lustre fiemap

    llfie is an iot utility program that generates an ilz file containing the mapping information for a data file stored on a Lustre filesystem. This can be used to visualize how the file is mapped into each OST. Notice in the following table that as the file's allocation position increases, the physical position on the OST oscillates between deep into the OST (~15TB) and back out to the start of the OST (~170GB).

    The following plot has "extent number" on the horizontal axis and "physical position" on the vertical axis. This 30,000-foot view of the file allocation shows that the start of the file is generally deep in the OST, while the bulk of the file is at the start of the OST.

    #5

    IOT Optimization Layers

    cache layer

    The iot cache layer is a user memory cache. Key features are:


  • user selects which files are cached based on the file's pathname
  • all cached files share one global iot cache
  • configurable total cache size
  • configurable page size
  • asynchronous prefetching when sequential access is detected
  • can use either direct I/O or system buffered I/O
  • optional generation of instrumentation data to assist tuning
  • thread safe

    The following image depicts the iot cache in operation, both as a prefetch mechanism and as a data reuse mechanism. The grey shaded rectangles show when a given part of the file is cache resident. The vertical line at the left of each rectangle represents the asynchronous read that loads the page. The other red lines are the user program reading data out of the iot cache. At a later point the user program starts reading the file backwards, which has a steeper slope (a higher transfer rate) because the data is already cache resident.
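
The interplay of prefetching and data reuse can be illustrated with a toy model. The sketch below is not the iot cache: it only tracks which pages are resident, evicts the least recently used page, and reads ahead one page when it sees two consecutive pages requested. The read-ahead is done synchronously here for simplicity, whereas the real layer is described as asynchronous. All names are hypothetical.

```python
from collections import OrderedDict


class PageCache:
    """Toy LRU page cache with one-page read-ahead on sequential access."""

    def __init__(self, capacity_pages, page_size):
        self.capacity = capacity_pages
        self.page_size = page_size
        self.pages = OrderedDict()  # (path, page_no) -> True, kept in LRU order
        self.last_page = {}         # path -> page number of the previous read
        self.hits = self.misses = self.prefetches = 0

    def _load(self, key):
        self.pages[key] = True
        self.pages.move_to_end(key)
        while len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page

    def read(self, path, offset, length):
        first = offset // self.page_size
        last = (offset + length - 1) // self.page_size
        for page in range(first, last + 1):
            key = (path, page)
            if key in self.pages:
                self.hits += 1
                self.pages.move_to_end(key)
            else:
                self.misses += 1
                self._load(key)
            # Sequential access detected: read ahead the next page.
            if self.last_page.get(path) == page - 1 and (path, page + 1) not in self.pages:
                self.prefetches += 1
                self._load((path, page + 1))
            self.last_page[path] = page


cache = PageCache(capacity_pages=8, page_size=4096)
for page in (0, 1, 2, 3):          # forward pass: prefetch kicks in
    cache.read("data.bin", page * 4096, 4096)
for page in (3, 2, 1, 0):          # backward pass: data is already resident
    cache.read("data.bin", page * 4096, 4096)
```

In this run only the first two forward reads miss. Every later read, including the whole backward pass, is served from the cache.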

    paio

    The iot paio layer uses pthreads to handle asynchronous I/O requests issued by the iot cache layer. Key features are:


  • low overhead
  • the number of aio handling threads is user specified
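
The idea of servicing asynchronous requests with a user-specified number of threads can be sketched with a thread pool. This is an analogy in Python rather than the pthreads implementation used by paio, and the class name is hypothetical. `os.pread` is used because it takes an explicit offset, so worker threads never share a file position.

```python
import os
from concurrent.futures import ThreadPoolExecutor


class AsyncReader:
    """Serve positional read requests on a fixed pool of worker threads."""

    def __init__(self, num_threads):
        self._pool = ThreadPoolExecutor(max_workers=num_threads)

    def submit_read(self, fd, length, offset):
        """Queue a read and return a Future whose result is the bytes read."""
        return self._pool.submit(os.pread, fd, length, offset)

    def close(self):
        """Wait for outstanding requests and stop the worker threads."""
        self._pool.shutdown(wait=True)
```

A caller such as a cache layer would submit reads for the pages it expects to need, continue working, and call `result()` on each future only when the data is actually required.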

    proxy layer

    The iot proxy layer is a data transport mechanism. It transports I/O calls that would normally be made on the local node to an iot proxy server (ips) running on a different node. The ips then issues the I/O calls on its own node. The ips may also use any of the available iot layers, so it can have its own iot cache and can aggregate I/O calls coming from multiple client processes that use the iot proxy layer. The connection to the ips is secured via ssh: the user process on the client node must be able to ssh, without a password, into the server node (where the ips is to be started or is already running).


  • multiple ips servers can be running on a server node
  • multiple clients can connect to a single ips
  • a single client can connect to multiple ips (on one or more server nodes), so when there are multiple network connections between the client node and the server node, proxy/ips can utilize the available bandwidth
  • ipsctl, an iot utility program, can be used to monitor and control the ips.
  • proxy layer is thread safe and can be downstream of the iot cache and paio layers to asynchronously keep data moving across the proxy/ips connection(s)