Strategy 1: SEMI-INDEPENDENT TASKS:
Define tasks so that most task outputs do not require any update.
This is always the case for trivial parallelism (when tasks are
independent of each other). It is also often the case for many
examples of search and enumeration problems.
Strategy 2: CACHE PARTIAL RESULTS:
Inside DoTask() and UpdateSharedData(), save partial
computations in global private variables. Then, in the event of a
REDO action, `TOP-C' guarantees to invoke DoTask() again on the
original slave process or slave thread. That slave may then use
previously computed partial results in order to shorten the required
computation.
Strategy 3: MERGE TASK OUTPUTS:
Inside CheckTaskResult(), the master may merge two or more task
outputs in an application-dependent way. This may avoid the
need for a REDO action, or it may reduce the number of required
UPDATE actions.
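As a rough illustration of Strategy 2, the following sketch keeps a partial result in file-scope private variables so that a repeated call with the same input skips the expensive part. It is not `TOP-C' code: do_task_body(), expensive_partial(), and the shared_value parameter are hypothetical stand-ins, and the real DoTask() callback has its own signature.

```c
#include <assert.h>

/* Private cache that survives between calls within one slave process. */
static int  cache_valid = 0;
static long cached_input = 0;
static long cached_partial = 0;
static int  expensive_calls = 0;   /* for illustration: counts real work */

/* Hypothetical expensive step that depends only on the task input. */
static long expensive_partial(long input)
{
    long i, sum = 0;
    expensive_calls++;
    for (i = 1; i <= input; i++)
        sum += i * i;
    return sum;
}

/* Hypothetical stand-in for the body of DoTask(). The result depends on the
   task input and on a value read from the shared data. */
long do_task_body(long input, long shared_value)
{
    assert(input >= 0);
    if (!cache_valid || cached_input != input) {
        /* First time this input is seen: do the expensive part and save it. */
        cached_partial = expensive_partial(input);
        cached_input = input;
        cache_valid = 1;
    }
    /* On a repeated call (as after a REDO), only this cheap step is redone. */
    return cached_partial + shared_value;
}
```

The cache is only safe for the part of the computation that does not depend on the shared data, since the shared data may have changed by the time the task is redone.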
If your application runs too slowly due to excessive time for communication, consider running multiple slave processes on a single processor. This allows one process to continue computing while another is communicating or idle waiting for a new task to be generated by the master.
If communication overhead or idle time is still too high, consider whether it is possible to increase the granularity of your task -- perhaps by amalgamating several consecutive tasks as a single larger task to be performed by a single process. You can do some of this automatically. For example, if the statement:
TOPC_agglom_count=5;
is executed before TOPC_master_slave(), then `TOP-C' will
transparently bundle five task inputs as a single network message,
and similarly for the corresponding task outputs.
PERFORMANCE ISSUE FOR MPI:
If you have a more efficient version of `MPI' (perhaps a vendor version tuned to your hardware), consider replacing LIBMPI in `.../top-c/Makefile' by your vendor's `libmpi.a' or `libmpi.so', and delete or modify the LIBMPI target in the `Makefile'.
PERFORMANCE ISSUE FOR SMP:
Finally, under `SMP', there is an important performance issue concerning the interaction of `TOP-C' with the operating system. First, the vendor-supplied compiler, cc, is recommended over gcc for `SMP', due to specialized vendor-specific architectural issues. Second, if a thread completes its work before using its full scheduling quantum, the operating system may yield the CPU of that thread to another thread -- potentially including a thread belonging to a different process. There are several ways to defend against this. One defense is to ensure that the time for a single task is significantly longer than one quantum. Another defense is to ask the operating system to give you at least as many "run slots" as you have threads (slaves plus master). Some operating systems use pthread_setconcurrency() to allow an application to declare this information, and `TOP-C' invokes pthread_setconcurrency() where it is available. However, other operating systems may have alternative ways of tuning the scheduling of threads, and it is worthwhile to read the relevant manuals of your operating system.
If the difficulty is that the application fails to start in the distributed
memory model (using topcc -mpi), then see section
Invoking a TOP-C Application in Distributed Memory, for some
debugging techniques. The rest of this section assumes that the
application starts up correctly.
First, compile and link your code using topcc -seq -g, and make
sure that your application works correctly sequentially. Only after you
have confidence in the correctness of the sequential code should you
begin to debug the parallel version.
If the application works correctly in sequential mode, one should debug
in the context of a single slave. It is convenient to declare the
remote slave to be localhost, in order to minimize network delays
and so as not to disturb users of other machines. In this case, the
code is "almost" sequential.
Next, one should test on two slaves, and finally all possible slaves.
If a bug appears as one moves to greater parallelism, one should trace messages between master and slaves (for any number of slaves). The following variables are provided to trace messages between master and slave.
TRUE, FALSE, or NOSTATS
(default is FALSE);
If set to TRUE, will provide a trace of communication between
master and slave. If FALSE, provides only summary statistics
at end. If NOSTATS, no extra printing at all.
The statistics refer only to time in TOPC_master_slave().
Typically, about 20 milliseconds in master_slave_stats().
(The value, NOSTATS, is experimental and not implemented here.)
void (*TOPC_trace_input)(void *input);
void (*TOPC_trace_output)(void *output);
void (*TOPC_trace_action)(TOPC_ACTION action);
Global pointers to functions (default is NULL). The user can
set each one to his or her own trace function to print out
data-specific tracing information in addition to the generic
message tracing of TOPC_trace. For example, if you pass
integers, set TOPC_trace_input to a function such as the following
(the name trace_int_input is only illustrative):
void trace_int_input(void *input) { printf("%d",*(int *)input); }
/* NOT IMPLEMENTED IN THIS VERSION */
void master_slave_stats();
Prints cumulative statistics from all invocations of
master_slave();
Note that tracing takes place entirely on the master. So, any print statements produced by a slave may be asynchronous with the trace printing and other printing on the master.
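The trace hooks described above might be installed along the following lines. This is only a sketch: the declaration of TOPC_trace_input is copied from the text above as a local stand-in so that the fragment compiles by itself (a real application would get it from the `TOP-C' header instead), and format_int_input(), trace_int_input(), and install_int_tracing() are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* Local stand-in for the global described above. In a real application,
   do not declare this yourself; use the declaration supplied by TOP-C. */
void (*TOPC_trace_input)(void *input) = NULL;

/* Hypothetical helper: render an int task input as text.
   Returns the number of characters snprintf() would write. */
int format_int_input(char *buf, size_t size, const void *input)
{
    assert(buf != NULL && input != NULL);
    return snprintf(buf, size, "%d", *(const int *)input);
}

/* Trace function with the signature required by TOPC_trace_input.
   Assumes every task input is a pointer to a single int. */
void trace_int_input(void *input)
{
    char buf[32];
    format_int_input(buf, sizeof buf, input);
    printf("%s", buf);
}

/* Point the hook at our function; intended to run before the
   master-slave computation starts. */
void install_int_tracing(void)
{
    TOPC_trace_input = trace_int_input;
}
```

Since tracing takes place on the master, a function like trace_int_input() would be called there, not on the slaves.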
If you find the master hanging, waiting for a slave message, then the
probable cause is that DoTask() is doing something bad (hanging,
infinite loop, bus/segmentation error, etc.).
If you are really desperate, note that gdb (the GNU C debugger) includes
an "attach" command (see section `Attach' in The GNU debugger),
which allows you to attach to and debug a separate running process. This
lets you debug a running slave, if it is running on the same processor.
For this strategy, you will want the slave to delay executing, to give
you time to execute gdb and attach on the remote host or remote thread.
To force the slave to wait 15 seconds, type
TOPC_SLAVE_WAIT=15; export TOPC_SLAVE_WAIT (under the Bourne shell, `sh'), or
setenv TOPC_SLAVE_WAIT 15 (under `csh') before executing
your `TOP-C' application.
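`TOP-C' handles TOPC_SLAVE_WAIT itself, so nothing below is needed for the procedure just described. If you want a similar pause in code of your own (for example, in a helper program started outside `TOP-C'), the idea can be sketched as follows. This is not how `TOP-C' implements its wait; parse_wait_seconds() and pause_for_debugger() are hypothetical names, and the one-hour cap is an arbitrary choice.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Parse a wait time in seconds. Returns 0 for NULL, empty, negative, or
   non-numeric text, and for values above one hour. */
unsigned int parse_wait_seconds(const char *text)
{
    char *end;
    long secs;
    if (text == NULL || *text == '\0')
        return 0;
    secs = strtol(text, &end, 10);
    if (*end != '\0' || secs < 0 || secs > 3600)
        return 0;
    return (unsigned int)secs;
}

/* Read the named environment variable; if it holds a positive wait time,
   print the process id and sleep so that a debugger can attach.
   Returns the number of seconds requested (0 if no pause was made). */
unsigned int pause_for_debugger(const char *env_name)
{
    unsigned int secs;
    assert(env_name != NULL);
    secs = parse_wait_seconds(getenv(env_name));
    if (secs > 0) {
        fprintf(stderr, "pid %ld: waiting %u seconds for a debugger\n",
                (long)getpid(), secs);
        sleep(secs);
    }
    return secs;
}
```

Printing the process id matters because the gdb "attach" command takes the id of the process to debug.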
Other useful techniques that may improve performance of certain applications are:
It is easy for parallel jobs to demand excessive resources. Some simple UNIX system calls prevent this.
#include <unistd.h>
alarm(int SECS)
- kill job after SECS. Place in DoTask(). The alarm resets to SECS after each task, so that it will kill slaves in infinite loop or hung connection. For example, inserting alarm(3600) at the beginning of DoTask() and alarm(300) at the end of DoTask() allows an hour-long task and requires that the master generate a new task within 5 minutes of the last one completed by that slave.

#include <unistd.h>
#include <sys/resource.h>
setpriority(PRIO_PROCESS,getpid(),prio)
- prio = 10 still gives you some CPU time. prio = 19 means that any job of higher priority always runs before you. Place in main().

#include <sys/resource.h>
struct rlimit rlp;
rlp.rlim_max = rlp.rlim_cur = SIZE;
setrlimit(RLIMIT_RSS, &rlp)
- SIZE is RAM limit (bytes). If your system has significant paging, the system will prefer to keep your process from growing beyond SIZE bytes of resident RAM. Even if you set nice to priority 20, this is still important. Otherwise you may cause someone to page out much of his or her job in your favor during one of your infrequent quantum slices of CPU time. Place in main().
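The three calls above could be wrapped as small helpers, as in the following sketch. The function names are hypothetical. Two additions go beyond the text above and are assumptions about what is convenient: the priority helper does nothing if the process is already at least that "nice", and the memory helper never tries to raise an existing hard limit. Note also that RLIMIT_RSS is not part of POSIX and that some systems accept it but do not enforce it, so check the setrlimit documentation of your operating system.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/resource.h>
#include <unistd.h>

/* Lower the scheduling priority of this process to nice value prio.
   Intended for main(). Returns 0 on success, -1 on failure. */
int lower_priority(int prio)
{
    int current;
    assert(prio >= 0 && prio <= 19);
    errno = 0;                      /* -1 is also a legal return value */
    current = getpriority(PRIO_PROCESS, (id_t)getpid());
    if (current == -1 && errno != 0)
        return -1;
    if (current >= prio)
        return 0;                   /* already at least this nice */
    return setpriority(PRIO_PROCESS, (id_t)getpid(), prio);
}

/* Ask the system to limit resident memory to size_bytes.
   Intended for main(). Returns 0 on success, -1 on failure. */
int limit_resident_memory(rlim_t size_bytes)
{
    struct rlimit rlp;
    if (getrlimit(RLIMIT_RSS, &rlp) != 0)
        return -1;
    /* An unprivileged process cannot raise its hard limit, so clamp. */
    if (rlp.rlim_max != RLIM_INFINITY && size_bytes > rlp.rlim_max)
        size_bytes = rlp.rlim_max;
    rlp.rlim_cur = rlp.rlim_max = size_bytes;
    return setrlimit(RLIMIT_RSS, &rlp);
}

/* (Re)start the watchdog: the process receives SIGALRM, which kills it by
   default, unless this is called again (or alarm(0)) within secs seconds.
   Call at the beginning and at the end of DoTask() as described above. */
void arm_task_watchdog(unsigned int secs)
{
    alarm(secs);
}
```

With these helpers, main() might call lower_priority(10) and limit_resident_memory() once at startup, while DoTask() would call arm_task_watchdog(3600) on entry and arm_task_watchdog(300) just before returning.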