Co-Array Fortran, formally called F--, is a small set of extensions to Fortran 95 for Single Program Multiple Data, SPMD, parallel processing.
Co-Array Fortran is a simple syntactic extension to Fortran 95 that converts it into a robust, efficient parallel language. It looks and feels like Fortran and requires Fortran programmers to learn only a few new rules. The few new rules are related to two fundamental issues that any parallel programming model must resolve, work distribution and data distribution.
First, consider work distribution. A single program is replicated a fixed number of times, each replication having its own set of data objects. Each replication of the program is called an image. Each image executes asynchronously and the normal rules of Fortran apply, so the execution path may differ from image to image. The programmer determines the actual path for the image with the help of a unique image index, by using normal Fortran control constructs and by explicit synchronizations. For code between synchronizations, the compiler is free to use all its normal optimation techniques, as if only one image is present.
Second, consider data distribution. The co-array extension to the language allows the programmer to express data distribution by specifying the relationship among memory images in a syntax very much like normal Fortran array syntax. One new object, the co-array, is added to the language. For example,
REAL, DIMENSION(N)[*] :: X,Y X(:) = Y(:)[Q]declares that each image has two real arrays of size N. If Q has the same value on each image, the effect of the assignment statement is that each image copies the array Y from image Q and makes a local copy in array X.
Array indices in parentheses follow the normal Fortran rules within one memory image. Array indices in square brackets provide an equally convenient notation for accessing objects across images and follow similar rules. Bounds in square brackets in co-array declarations follow the rules of assumed-size arrays since co-arrays are always spread over all the images. The programmer uses co-array syntax only where it is needed. A reference to a co-array with no square brackets attached to it is a reference to the object in the local memory of the executing image. Since most references to data objects in a parallel code should be to the local part, co-array syntax should appear only in isolated parts of the code. If not, the syntax acts as a visual flag to the programmer that too much communication among images may be taking place. It also acts as a flag to the compiler to generate code that avoids latency whenever possible.
Fortran 90 array syntax, extended to co-arrays, provides a very powerful and concise way of expressing remote memory operations. Here are some simple examples:
X = Y[PE] ! get from Y[PE] Y[PE] = X ! put into Y[PE] Y[:] = X ! broadcast X Y[LIST] = X ! broadcast X over subset of PE's in array LIST Z(:) = Y[:] ! collect all Y S = MINVAL(Y[:]) ! min (reduce) all Y B(1:M)[1:N] = S ! S scalar, promoted to array of shape (1:M,1:N)
Input/output has been a problem with previous SPMD programming models, such as MPI, because standard Fortran I/O assumes dedicated single-process access to an open file and this constraint is often violated when it is assumed that I/O from each image is completely independent. Co-Array Fortran includes only minor extensions to Fortran 95 I/O, but all the inconsistencies of earlier programming models have been avoided and there is explicit support for parallel I/O. In addition I/O is compatible with both process-based and thread-based implementations.
The only other additions to Fortran 95 are several intrinsics. For example: the integer function NUM_IMAGES() returns the number of images, the integer function THIS_IMAGE() returns this image's index between 1 and NUM_IMAGES(), and the subroutine SYNC_ALL() is a global barrier which requires all operations before the call on all images to be completed before any image advances beyond the call. In practice it is often sufficient, and faster, to only wait for the relevant images to arrive. SYNC_ALL(WAIT=LIST) provides this functionality. There is also SYNC_TEAM(TEAM=TEAM) and SYNC_TEAM(TEAM=TEAM,WAIT=LIST) for cases where only a subset, TEAM, of all images are involved in the synchronization. The intrinsics START_CRITICAL and END_CRITICAL provide a basic critical region capability. It is also possible to write your own synchronization routines, using the basic intrinsic SYNC_MEMORY. This routine forces the local image to both complete any outstanding co-array writes into ``global'' memory and refresh from global memory any local copies of co-array data it might be holding (in registers for example). A call to SYNC_MEMORY is rarely required in Co-Array Fortran, because there is an implicit call to this routine before and after virtually all procedure calls including Co-Array's built in image synchronization intrinsics. This allows the programmer to assume that image synchronization implies co-array synchronization.
Image and co-array synchronization is at the heart of the typical Co-Array Fortran program. For example, here is how to exchange an array with your north and south neighbors:
COMMON/XCTILB4/ B(N,4)[*] SAVE /XCTILB4/ C CALL SYNC_ALL( WAIT=(/IMG_S,IMG_N/) ) B(:,3) = B(:,1)[IMG_S] B(:,4) = B(:,2)[IMG_N] CALL SYNC_ALL( WAIT=(/IMG_S,IMG_N/) )The first SYNC_ALL waits until the remote B(:,1:2) is ready to be copied, and the second waits until it is safe to overwrite the local B(:,1:2). Only nearest neighbors are involved in the sync. It is always safe to replace SYNC_ALL(WAIT=LIST) calls with global SYNC_ALL() calls, but this will often be significantly slower. In some cases, either the preceeding or succeeding synchronization can be avoided. Communication load balancing can sometimes be important, but the majority of remote co-array access optimization consists of minimizing the frequency of synchronization and having synchronization cover the minimum number of images. If the program is likely to run on machines without global memory hardware, then array syntax (rather than DO loops) should always be used to express remote memory operations and copying co-array's into local temporary buffers well before they are required might be appropriate (although the compiler may do this for you).
In data parallel programs, each image is either performing the same operation or is idle. For example here is a data parallel fixed order cumulative sum:
REAL SUM[*] CALL SYNC_ALL( WAIT=1 ) DO IMG= 2,NUM_IMAGES() IF (IMG==THIS_IMAGE()) THEN SUM = SUM + SUM[IMG-1] ENDIF CALL SYNC_ALL( WAIT=IMG ) ENDDOHaving each SYNC_ALL wait on just the active image improves performance, but there are still NUM_IMAGES() global sync's. In this case a better alternative is probably to minimize synchronization by avoiding the data parallel overhead entirely:
REAL SUM[*] ME = THIS_IMAGE() IF (ME.GT.1) THEN CALL SYNC_TEAM( TEAM=(/ME-1,ME/) ) SUM = SUM + SUM[ME-1] ENDIF IF (ME.LT.NUM_IMAGES()) THEN CALL SYNC_TEAM( TEAM=(/ME,ME+1/) ) ENDIFNow each image is involved in at most two sync's, and only with the images just before and just after it in image order. Note that the first SYNC_TEAM call on one image is matched by the second SYNC_TEAM call on the previous image. This illustrates the power of the Co-Array Fortran synchronization intrinsics. They can improve the performance of data parallel algorithms, or provide implicit program execution control as an alternative to the data parallel approach.
Several non-trivial Co-Array Fortran programs are included as examples with the caf2omp translator, and with the Cray T3E intrinsics.