

Run-time: Debugging, Performance, Long Jobs

Strategies for Greater Concurrency

Strategy 1: SEMI-INDEPENDENT TASKS:

    Define tasks so that most task outputs do not require any update.
    This is always the case for trivial parallelism (when tasks are
    independent of each other).  It is also the case for many
    search and enumeration problems.

Strategy 2: CACHE PARTIAL RESULTS:

    Inside DoTask() and UpdateSharedData(), save partial
    computations in global private variables.  Then, in the event of a
    REDO action, `TOP-C' guarantees that DoTask() is invoked again on
    the original slave process or slave thread.  That slave may then use
    the previously computed partial results to shorten the required
    computation.

Strategy 3: MERGE TASK OUTPUTS:

    Inside CheckTaskResult(), the master may merge two or more task
    outputs in an application independent way.  This may avoid the
    need for a REDO action, or it may reduce the number of required
    UPDATE actions.
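
As an illustration of Strategy 2 (caching partial results), the following is a minimal sketch and not `TOP-C' code.  It assumes a task whose costly part depends only on the task input, while the final answer also depends on shared data that may change before a REDO.  The names do_task_cached(), expensive_part() and shared_value are hypothetical, and the real DoTask() signature is not shown here.

```c
#include <stddef.h>

/* Hypothetical per-slave cache, kept in private (file-scope) variables. */
static int  cached_input;        /* input of the last task seen by this slave */
static long cached_partial;      /* partial result computed for that input    */
static int  cache_valid = 0;
static int  expensive_calls = 0; /* counts how often the slow path runs       */

/* Placeholder for the costly part that depends only on the task input. */
static long expensive_part(int input)
{
    long sum = 0;
    int i;
    expensive_calls++;
    for (i = 1; i <= input; i++)
        sum += (long)i * i;
    return sum;
}

/* Body of a DoTask()-like routine.  shared_value models shared data
   that may differ between the first attempt and a REDO. */
long do_task_cached(int input, long shared_value)
{
    if (!cache_valid || cached_input != input) {
        cached_partial = expensive_part(input);
        cached_input = input;
        cache_valid = 1;
    }
    return cached_partial + shared_value;
}
```

When the same input arrives a second time, as it would after a REDO, only the cheap final step is repeated.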

Improving Performance

If your application runs too slowly due to excessive time for communication, consider running multiple slave processes on a single processor. This allows one process to continue computing while another is communicating or idle waiting for a new task to be generated by the master.

If communication overhead or idle time is still too high, consider whether you can increase the granularity of your tasks -- perhaps by amalgamating several consecutive tasks into a single larger task to be performed by a single process. You can do some of this automatically. For example, if the statement:

  TOPC_agglom_count=5;

is executed before TOPC_master_slave(), then `TOP-C' will transparently bundle five task inputs as a single network message, and similarly for the corresponding task outputs.
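
Where bundling at the message level is not enough, the tasks themselves can be redefined so that one task carries several of the original inputs.  The following is a minimal sketch of that idea under stated assumptions: the struct layouts, BUNDLE_MAX, do_one_task() and do_bundled_task() are all hypothetical and are not part of `TOP-C'.

```c
#include <stddef.h>

#define BUNDLE_MAX 5   /* hypothetical bundle size, echoing the value 5 above */

/* Hypothetical buffers: several task inputs (or outputs) packed together,
   so that one message carries all of them. */
struct input_bundle  { size_t count; int  inputs[BUNDLE_MAX]; };
struct output_bundle { size_t count; long outputs[BUNDLE_MAX]; };

/* Placeholder for the work of one original, fine-grained task. */
static long do_one_task(int input)
{
    return (long)input * input;
}

/* One coarse-grained task: process every input in the bundle. */
struct output_bundle do_bundled_task(const struct input_bundle *in)
{
    struct output_bundle out = {0};
    size_t i;
    out.count = in->count;
    for (i = 0; i < in->count; i++)
        out.outputs[i] = do_one_task(in->inputs[i]);
    return out;
}
```

A larger bundle lowers the number of messages per unit of work, at the cost of coarser load balancing among the slaves.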

PERFORMANCE ISSUE FOR MPI:

If you have a more efficient version of `MPI' (perhaps a vendor version
tuned to your hardware), consider replacing LIBMPI in
`.../top-c/Makefile' by your vendor's `libmpi.a' or
`libmpi.so', and delete or modify the LIBMPI target in the
`Makefile'.

PERFORMANCE ISSUE FOR SMP:

Finally, under `SMP', there is an important performance issue
concerning the interaction of `TOP-C' with the operating system.
First, the vendor-supplied compiler, cc, is recommended over
gcc for `SMP', due to specialized vendor-specific
architectural issues.  Second, if a thread completes its work before
using its full scheduling quantum, the operating system may yield the
CPU of that thread to another thread -- potentially including a thread
belonging to a different process.  There are several ways to defend
against this.  One defense is to ensure that the time for a single task
is significantly longer than one quantum.  Another defense is to ask the
operating system to give you at least as many "run slots" as you have
threads (slaves plus master).  Some operating systems use
pthread_setconcurrency() to allow an application to declare this
information, and `TOP-C' invokes pthread_setconcurrency()
where it is available.  However, other operating systems may have
alternative ways of tuning the scheduling of threads, and it is
worthwhile to read the relevant manuals of your operating system.

Tracing and Other Debugging Techniques

If the difficulty is that the application fails to start in the distributed memory model (using topcc -mpi), then see section Invoking a TOP-C Application in Distributed Memory, for some debugging techniques. The rest of this section assumes that the application starts up correctly.

First, compile and link your code using topcc -seq -g, and make sure that your application works correctly sequentially. Begin to debug the parallel version only after you have confidence in the correctness of the sequential code.

If the application works correctly in sequential mode, one should debug in the context of a single slave. It is convenient to declare the remote slave to be localhost, in order to minimize network delays and so as not to disturb users of other machines. In this case, the code is "almost" sequential.

Next, one should test on two slaves, and finally all possible slaves.

If a bug appears as one moves to greater parallelism, one should trace messages between master and slaves (for any number of slaves). The following variables are provided to trace messages between master and slave.

Variable: int TOPC_trace
boolean - default is TRUE
         Should have value TRUE, FALSE, or NOSTATS (default is FALSE).
         If set to TRUE, will provide a trace of communication between
         master and slave.  If FALSE, provides only summary statistics
         at end.  If NOSTATS, no extra printing at all.  The statistics
         refer only to time in TOPC_master_slave().  Typically, about
         20 milliseconds in master_slave_stats().  (The value, NOSTATS,
         is experimental and not implemented here.)
void (*TOPC_trace_input)(void *input);
void (*TOPC_trace_output)(void *output);
void (*TOPC_trace_action)(TOPC_ACTION action);
         Global pointers to functions (default is NULL).  The user can
         set each one to his or her own trace function to print out
         data-specific tracing information in addition to the generic
         message tracing of TOPC_trace.  For example, if you pass
         integers, define TOPC_trace_input() as:

           { printf("%d",*(int *)input); }
/* NOT IMPLEMENTED IN THIS VERSION */
void master_slave_stats();
         Prints cumulative statistics from all invocations of
         master_slave().
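
The integer example for TOPC_trace_input can be written out as a complete function.  This is only a sketch: the pointer is redeclared here as a stand-in so that the fragment compiles by itself, whereas a real application would take the declaration from the `TOP-C' header.  The names trace_int_input() and install_int_tracing() are hypothetical.

```c
#include <stdio.h>

/* Stand-in for the TOP-C global described above; in a real application
   the declaration comes from the TOP-C header, not from this file. */
void (*TOPC_trace_input)(void *input) = NULL;

/* Hypothetical user trace function for tasks whose input is one int. */
static void trace_int_input(void *input)
{
    printf("%d", *(int *)input);
}

/* Install the trace function; call this before the parallel
   computation starts. */
void install_int_tracing(void)
{
    TOPC_trace_input = trace_int_input;
}
```

The same pattern would apply to TOPC_trace_output, with a function that casts the output buffer to the application's own output type.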

Note that tracing takes place entirely on the master. So, any print statements produced by a slave may be asynchronous with the trace printing and other printing on the master.

If you find the master hanging, waiting for a slave message, then the probable cause is that DoTask() is doing something bad (hanging, infinite loop, bus/segmentation error, etc.).

If you are really desperate, note that gdb (the GNU C debugger) includes an "attach" command (see section `Attach' in The GNU debugger), which allows you to attach to and debug a separate running process. This lets you debug a running slave, if it is running on the same processor. For this strategy, you will want the slave to delay executing, to give you time to start gdb and attach on the remote host or remote thread. To force the slave to wait 15 seconds, type TOPC_SLAVE_WAIT=15; export TOPC_SLAVE_WAIT (under the Bourne shell, `sh'), or setenv TOPC_SLAVE_WAIT 15 (under `csh') before executing your `TOP-C' application.

Other useful techniques that may improve performance of certain applications are:

  1. set up multiple slaves on each processor (if slave processors are sometimes idle)
  2. re-write the code to bundle a set of tasks as a single task (to improve the granularity of your parallelism)

Long Jobs and Courtesy to Others

It is easy for parallel jobs to demand excessive resources. Some simple UNIX system calls can help prevent this.

#include <unistd.h>
alarm(int SECS) - kill the job after SECS seconds.  Place in DoTask().
        Each call to alarm() replaces the previous setting, so
        resetting it around every task will kill slaves that are
        stuck in an infinite loop or on a hung connection.  For
        example, inserting alarm(3600) at the
        beginning of DoTask() and alarm(300) at the end of
        DoTask() allows an hour-long task and requires that the
        master generate a new task within 5 minutes of the last one
        completed by that slave.
#include <unistd.h>
#include <sys/resource.h>
setpriority(PRIO_PROCESS,getpid(),prio) - prio = 10 still
        gives you some CPU time.  prio = 19 means that any job of
        higher priority always runs before you.  Place in main().
#include <sys/resource.h>
struct rlimit rlp;
rlp.rlim_max = rlp.rlim_cur = SIZE;
setrlimit(RLIMIT_RSS, &rlp) - SIZE is the RAM limit (bytes).  If
        your system has
        significant paging, the system will prefer to keep your process
        from growing beyond SIZE bytes of resident RAM.  Even if you set
        nice to priority 20, this is still important.  Otherwise you may
        cause someone to page out much of his or her job in your favor during
        one of your infrequent quantum slices of CPU time.  Place in
        main().
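
The three calls above can be collected into small helpers.  The following is a sketch under stated assumptions: the function names are hypothetical, the task body is a placeholder, and the real DoTask() signature is not shown.  RLIMIT_RSS is not part of POSIX, and some kernels accept the limit without enforcing it, so its effect should be checked on the target system.

```c
#include <sys/types.h>
#include <sys/resource.h>
#include <unistd.h>

/* All function names below are hypothetical helpers, not TOP-C calls. */

/* For main(): run this process at nice value prio (10 and 19 are the
   values discussed above).  Returns 0 on success, -1 on error. */
int lower_priority(int prio)
{
    return setpriority(PRIO_PROCESS, (id_t)getpid(), prio);
}

/* For main(): ask the system to keep resident memory below size bytes.
   Returns 0 on success, -1 on error. */
int limit_resident_memory(rlim_t size)
{
    struct rlimit rlp;
    rlp.rlim_max = rlp.rlim_cur = size;
    return setrlimit(RLIMIT_RSS, &rlp);
}

/* For the body of DoTask(): input and result types are placeholders. */
long guarded_task(int input)
{
    long result;
    alarm(3600);            /* this task may run for at most one hour   */
    result = 2L * input;    /* placeholder for the real computation     */
    alarm(300);             /* expect the next task within five minutes */
    return result;
}
```

If an alarm expires and the application has not installed a handler for SIGALRM, the default action terminates the process, which is the intended effect here.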

