STREAMLINING YOUR PIPELINES
___________________________

Melinda Varian

Office of Computing and Information Technology
Princeton University
87 Prospect Avenue
Princeton, NJ 08544 USA

Email:      melinda@princeton.edu
Web:        http://pucc.princeton.edu/~melinda/
Telephone:  (609) 258-6016

SHARE 92 Session 9241
February 1999


INTRODUCTION
____________

At Princeton, by 1993 more and more of our applications were becoming
extensive refineries of pipelines (for example, Rick Troth's CMS Gopher
client and server and our own BITFTP server).  The CPU utilization of
heavily used applications is always a concern, so we wanted to try to
optimize some of these applications.  However, we found that we had no
tools that would help us tune applications that made extensive use of
CMS Pipelines.  The excellent traditional tools for analyzing the CPU
used by REXX programs, such as TRACEXEC (by Kent Fiala, of the SAS
Institute) and HISTOREX (by Brick Verser, of Kansas State University),
showed us only that most of the CPU used by these applications was due
to PIPE commands, which we knew already.

So, in 1993, my management instructed me to write a tool for analyzing
the performance of piped applications.  The result was a tool called
Rita (which was named for "Lovely Rita, Meter Maid", since it is a
program for reading meters).  That name was suggested by John Hartmann,
the author of CMS Pipelines, who also contributed much good advice and
other assistance to this project.

John suggested that I start by turning on the 8K bit in the CMS
Pipelines "msglevel" setting.  That would cause Pipes to generate an
accounting message each time a pipeline stage terminated:

   pipmod msglevel 8207
   Ready;
   pipe (name test) literal hello | console
   hello
   Spent .017 milliseconds in stage.
    ... Issued from stage 1 of pipeline 1 name "test".
    ... Running "literal hello".
   Spent .065 milliseconds in stage.
    ... Issued from stage 2 of pipeline 1 name "test".
    ... Running "console".
   Spent .047 milliseconds in gnl ohd.
   Spent .429 milliseconds in scanner.
   Spent .244 milliseconds in dispatch.
   Spent .451 milliseconds in wait.
   Spent 2.491 milliseconds in acnt ohd.
   Ready;

(You can find out about the various "msglevel" settings by issuing the
command "pipe ahelp pipmod".)

My first implementation was simply a REXX filter that could be used in a
pipeline that invoked the subject pipeline with a runpipe stage.
runpipe captured the Pipes accounting messages, and my REXX filter
processed them using the then brand-new IBM 407 emulation in the spec
stage:

   'PIPMOD MSGLEVEL 8207'
   'PIPE var pipe | runpipe | dobit8k | ...'
   'PIPMOD MSGLEVEL 15'

The result was a display of the CPU utilization of each stage of each
pipeline in the complex.  This example shows the output for one
subroutine pipeline, which had been invoked nine times from a loop in a
REXX filter in a higher-level pipeline:

   23.9 ms in stage 1 of pipeline 1 "test27a": test27b
    0.1 ms in stage 2 of pipeline 1 "test27a": literal abc
    0.1 ms in stage 3 of pipeline 1 "test27a": totarget nlocate 1
    0.1 ms in stage 4 of pipeline 1 "test27a": literal xyz
    0.1 ms in stage 5 of pipeline 1 "test27a": hole

   24.3 ms in pipeline "test27a" (9 invocations)          <=====

The millisecond values are the CPU used by each stage, exclusive of any
CPU used by subroutine pipelines invoked by the stage.

This particular example taught me a good lesson about Pipes performance.
I foolishly asserted that the Pipes accounting numbers must be wrong,
because it couldn't possibly be so expensive to invoke that REXX filter
nine times.  The Piper suggested that I try EXECLOADing the filter.
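Outside of CMS, the arithmetic such a filter performs can be sketched
briefly.  The following Python fragment is only an illustration (it is
not the dobit8k filter itself, and the function name is mine); it totals
the "Spent ... in stage." accounting messages in the format shown above:

```python
import re

# Accounting messages in the format shown above (msglevel 8K bit on).
messages = '''\
Spent .017 milliseconds in stage.
 ... Issued from stage 1 of pipeline 1 name "test".
 ... Running "literal hello".
Spent .065 milliseconds in stage.
 ... Issued from stage 2 of pipeline 1 name "test".
 ... Running "console".
'''

def total_stage_ms(text):
    """Sum the CPU reported by 'Spent n milliseconds in stage.' lines."""
    return sum(float(ms) for ms in
               re.findall(r'Spent (\S+) milliseconds in stage\.', text))

print('%.3f ms total' % total_stage_ms(messages))
```

The real filter, of course, also had to attribute each "Spent" message
to the right stage and pipeline from the "Issued from" lines that
follow it.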
This was the result:

   11.1 ms in stage 1 of pipeline 1 "test27a": test27b
    0.1 ms in stage 2 of pipeline 1 "test27a": literal abc
    0.1 ms in stage 3 of pipeline 1 "test27a": totarget nlocate 1
    0.1 ms in stage 4 of pipeline 1 "test27a": literal xyz
    0.1 ms in stage 5 of pipeline 1 "test27a": hole

   11.5 ms in pipeline "test27a" (9 invocations)          <=====

MORAL:  If you find yourself issuing a CALLPIPE command in a loop, first
ask yourself why the loop can't be absorbed into the pipeline, and if it
really can't, then remind yourself to EXECLOAD any REXX filters (or
other EXECs) used by the subroutine pipeline.

Although this prototype was useful, I soon found that I wanted a display
of the CPU use for each stage both inclusive and exclusive of the CPU
used by its subroutines, such as TRACEXEC provides for EXECs.  The Pipes
accounting records don't supply enough information to determine which
stage invoked a subroutine, so John suggested that I also process the
"event records" that runpipe creates when it is invoked with the EVENTS
option.  (The EVENTS option was added to Pipes in CMS 10 for use by
PIPEDEMO.  You can find out about the event records by issuing the
command "pipe ahelp pipevent".)  So, I threw out my prototype and
started over again to write what came to be called Rita.  While I was
writing Rita, John made several enhancements to the event records to
make it easier for me to navigate the calling sequences in the subject
pipeline.  That was very helpful, especially since Gopher, for example,
proved to be recursive.  John also added a MSGLEVEL option to runpipe to
provide finer control over the msglevel setting.  (Both of these
enhancements were included in CMS 11.)
The new tool consisted of two pieces: RITA EXEC, which contained a
pipeline that invoked the subject pipeline, and RITA REXX, a filter that
analyzed both the accounting messages and the event records as they were
created.  To use Rita, one replaced a PIPE command with EXEC RITA.
Thus, the subject pipeline was the argument to RITA EXEC, which ran it
with runpipe.

As soon as I began using Rita to measure our BITFTP server, however, I
ran into a problem: stem and var stages in subroutine pipelines that
referenced REXX variables in the main EXEC were now searching for those
variables in RITA EXEC, because using RITA EXEC had inserted an
additional EXECCOMM level below the main EXEC, between the level that
had contained the PIPE command and the level of any REXX filters in
subroutine pipelines.  John quickly solved that problem by extracting my
pipeline from RITA EXEC, encapsulating it in an Assembler program, and
building RITA MODULE.  After that, to use Rita, one simply replaced a
PIPE command with a RITA command.  The result was a display similar to
this one for each pipeline in the complex:

     36 (  36) ms in stage 1 of segment 1 "SendMail": var fromline
     22 (  22) ms in stage 2 of segment 1 "SendMail": locate 81
   1343 ( 972) ms in stage 3 of segment 1 "SendMail": 822cont
     14 (  14) ms in stage 4 of segment 1 "SendMail": faninany
    107 ( 107) ms in stage 5 of segment 1 "SendMail": stem fromline.

   1521 (1151) ms in pipeline "SendMail" (452 invocations)     <=====

In this example, sendmail is a subroutine pipeline consisting of five
stages, one of which, 822cont, is a REXX filter containing another
subroutine pipeline.  During this measurement run, the sendmail
subroutine pipeline was issued 452 times and used 1521 milliseconds of
CPU.
Of that, 1343 milliseconds were used by its 822cont stage, and 972
milliseconds of that were used by something other than the subroutine
pipeline in that stage.

A version of Rita that will work under CMS 10 and later has been
included on the "Samples and Examples" disk (usually MAINT's 193) since
VM/ESA 2.1.0.  An even more current version of Rita can be found on the
CMS Pipelines Web page (http://pucc.princeton.edu/~pipeline/) and at our
anonymous FTP site (cd to anonymous.376 on pucc.princeton.edu).


CASE STUDY--A UUENCODE STAGE
____________________________

Our BITFTP service machine runs at three sites, ten copies in all, so I
have been eager to reduce its CPU utilization.  BITFTP is a moderately
large refinery:

   o  1 PIPE
   o  143 CALLPIPEs
   o  2 ADDPIPEs
   o  800+ stages
   o  8000 lines of REXX

One of its real hotspots (accounting for twenty-five to thirty percent
of its CPU use on some days) was its uuencode filter.  However, after
being hacked at off and on for a year, uuencode accounted for less than
five percent of BITFTP's CPU use.  In getting it to that point, I
learned a few things about tuning a Pipes application, so I think it
might make an interesting case study.

UUencoding is a poorly-defined process for encoding files so that they
can be transmitted safely in email via hostile gateways.  Basically,
the process is:

   o  Expand each 6 bits to 8 (by inserting high-order 0s);
   o  Form the result into 60-byte records, each preceded by a byte
      specifying the length of the data in the record before expansion
      (i.e., 45);
   o  Add an ASCII blank (x'20') to each byte to make it printable;
   o  Do something to prevent the loss of trailing blanks; and
   o  Garnish the file with a header record and two trailer records.
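The steps above can be made concrete with a small sketch.  The
following Python fragment is only an illustration of the encoding
arithmetic (the function name is mine, and the EBCDIC-gateway
considerations that the rest of this section deals with are ignored
here):

```python
import binascii

def uuencode_line(chunk: bytes) -> str:
    """Encode one chunk of up to 45 bytes as one uuencoded line:
    a length byte, then each 6 bits expanded to 8 with x'20' added."""
    # The 6-to-8 expansion consumes whole 24-bit groups, so pad the
    # last group with zero bytes.
    padded = chunk + b'\x00' * (-len(chunk) % 3)
    out = [chr(32 + len(chunk))]      # length of the data before expansion
    for i in range(0, len(padded), 3):
        b1, b2, b3 = padded[i:i + 3]
        group = (b1 << 16) | (b2 << 8) | b3
        for shift in (18, 12, 6, 0):  # four 6-bit output characters
            out.append(chr(32 + ((group >> shift) & 0x3F)))
    return ''.join(out)

data = b'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS'   # 45 bytes
# chr(32 + 45) is "M", which is why full uuencoded lines start with "M".
# Cross-check against the standard library's uuencoder:
assert uuencode_line(data) + '\n' == binascii.b2a_uu(data).decode('ascii')
```

Note that this sketch works in ASCII throughout; on VM the same
arithmetic has to survive an EBCDIC-to-ASCII gateway, which is the
complication discussed next.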
Here are the input and output when uuencoding a one-record file, UUTEST
TEXT:

   abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

   begin 0600 uutest.text
   M86)C9&5F9VAI:FML;6YO<'%R<W1U=G=X>7I!0D-$149'2$E*2TQ-3D]045)3M
   '5%565UA96@  M
   M
   end

Because I was working on VM, the encoded output would be EBCDIC, and it
would be translated to ASCII when the mail went onto the net.  Adding a
blank (either EBCDIC or ASCII) to my EBCDIC characters would not be the
same operation as adding an ASCII blank to an ASCII character, even
after the resulting character had been translated to ASCII by the mail
gateway.  So, to add an ASCII blank to each of my EBCDIC bytes, I used
the approach that John Fisher, of Rensselaer Polytechnic Institute,
uses in his excellent ARCUTIL program.  After the 6-to-8 expansion,
each character is translated to the EBCDIC character that will assume
its value plus x'20' after being translated to ASCII when going through
the mail gateway.  To prevent the loss of trailing blanks, I use the
scheme of suffixing a "guard character" to each line.


UUencode Phase 1
________________

This was the first implementation of my uuencode filter:

   'CALLPIPE (endchar ? name uuencode)',
      '*: |',                     /* Connect input to pipeline.  */
      'xlate 1-* e2a |',          /* Translate to ASCII.         */
      'spec 1-* 1 x0A next |',    /* Append line-feed to each.   */
      'fblock 45 |',              /* Reblock to 45-byte chunks.  */
      'd: drop last 1 |',         /* Last record may be short.   */
      'vchar 6 8 |',              /* Recode, 45 bytes to 60.     */
      'change //M/ |',            /* Data length (45 bytes).     */
      'spec 1-* 1 /M/ 62 |',      /* Append "guard" character.   */
      'f: faninany |',            /* Pick up last data record.   */
      'xlate 1-*' addspace '|',   /* Make data printable EBCDIC. */
      'preface var beginline |',  /* Prepend canonical header.   */
      'append literal M|',        /* Indicate end of data.       */
      'append literal end|',      /* Indicate end of file.       */
      '*:',                       /* Connect output to pipeline. */
   '?',
      'd: |',                     /* Process last data record:   */
      'c: count bytes |',         /*   Calculate its size.       */
      'copy |',                   /*   (To unblock "count".)     */
      'spec 1-* 1 x0000 next |',  /*   Compensate for "vchar".   */
      'chop 45 |',                /*   Truncate compensation.    */
      'vchar 6 8 |',              /*   Recode, 3 bytes to 4.     */
      's: spec',                  /* Complete the formatting:    */
         'select 1',              /*   get data length;          */
         '1-* d2c 1.1 right',     /*   make it one byte binary;  */
         'select 0',              /*   get encoded data;         */
         '1-* 2',                 /*   put it next; and          */
         '/M/ next |',            /*   append guard character.   */
      'f:',                       /* Send back to mainline.      */
   '?',
      'c: |',                     /* Count of bytes in last rec. */
      's:'                        /* Feed it to "spec".          */

UUencoding is used to transmit a file by enclosing it in mail.  In the
production version of this program, the first two stages are used only
if the file is a text file, rather than binary.  In that case, the
recipient will expect the file to be in ASCII, not EBCDIC, once it has
been decoded, so it must be translated before encoding, and its record
structure must be preserved by appending a linend to each record before
encoding.  (Typically, the only reason to encode a text file is to
preserve the record structure when the records are longer than 80
bytes.)
The encoding proper starts with the fblock 45 stage, which flows the
records together and reblocks them into 45-byte chunks.  The last
record may be short, so drop last 1 is used to divert it to the second
pipeline segment for special handling.

All records except for the last one are then fed into the vchar 6 8
stage, which expands each 6 bits to 8, thus expanding each 45-byte
record to 60 bytes.  (vchar treats each of its input records as a bit
stream in which the size of an input "character" is defined by its
first argument and the size of an output character is defined by its
second argument.  It steps through the input record taking an input
character (6 bits in this case) and expanding or contracting it to an
output character (8 bits in this case), padding with high-order zeros
or truncating on the left, as required.  The output record is also
treated as a bit stream.)

The change stage prepends the character "M" to each record, because
the records will be translated to ASCII when they go through a mail
gateway, thus changing this "M" to x'4D'; the decoder subtracts the
ASCII blank, x'20', yielding x'2D' (decimal 45), which is the number
of bytes in the record before expansion.  The spec stage appends
another "M", to guard against the loss of any trailing blanks.
faninany gathers the records from both of its input streams.  xlate
translates each of the encoded characters to the character that will
be x'20' greater when translated to ASCII.  (That is, x'00' is
translated to an EBCDIC blank, which will become x'20', an ASCII
blank, when translated to ASCII; x'01' is translated to the EBCDIC
character that will become x'21' when translated to ASCII; and so
forth.)  The canonical uuencode header record and trailer records are
inserted into the pipeline, and the records flow out through the
ending connector to the calling pipeline.

The last record, which may be short, is handled in the second pipeline
segment.
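To make vchar's bit-stream behavior concrete, here is a sketch of its
recoding rule in Python (an illustration under my own function name,
not the built-in stage; argument checking is omitted):

```python
def vchar(record: bytes, m: int, n: int) -> bytes:
    """Recode 'record', treated as a stream of m-bit input characters,
    into n-bit output characters, padding with high-order zeros or
    truncating on the left as required.  Leftover bits that cannot
    form a whole character are discarded."""
    bits = ''.join(format(b, '08b') for b in record)
    out = ''
    for i in range(0, len(bits) - len(bits) % m, m):
        c = int(bits[i:i + m], 2) & ((1 << n) - 1)    # truncate on the left
        out += format(c, '0{}b'.format(n))            # pad with high-order 0s
    usable = len(out) - len(out) % 8                  # whole output bytes only
    return bytes(int(out[i:i + 8], 2) for i in range(0, usable, 8))

# 'vchar 6 8' expands each 45-byte record to 60 bytes:
assert len(vchar(bytes(45), 6, 8)) == 60
# The two leftover bits of a lone byte are discarded -- the quirk the
# second pipeline segment compensates for by appending x'0000':
assert vchar(b'\x61', 6, 8) == b'\x18'
```

The second assertion is exactly why the short last record needs special
handling, as described next.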
count bytes writes the length of that record to its secondary output.
The copy stage puts a quantum delay into the pipeline to prevent a
stall.  spec appends two bytes of zeros to the record to compensate for
a quirk of vchar, which discards any bits left over when it gets to the
end of a record and hasn't enough bits left to make another full output
character.  chop 45 pares the record down to 45 bytes in case appending
those two bytes of zeros made it longer than 45.  vchar 6 8 expands
each 6 bits of data to 8, thus transforming each 3 bytes into 4.  The
spec pulls in the count of data bytes from its secondary input,
converts that to a one-byte binary number, then follows that with the
expanded data and a guard character, after which this last record flows
into the faninany in the first pipeline segment.

That worked, but it was slow:

      846.1 ms in stage  2 of segment 1: xlate 1-* e2a
     1520.5 ms in stage  3 of segment 1: spec 1-* 1 x0A next
      937.5 ms in stage  4 of segment 1: fblock 45
     1245.1 ms in stage  5 of segment 1: drop last 1
     5165.0 ms in stage  6 of segment 1: vchar 6 8
   ! 1355.7 ms in stage  7 of segment 1: change //M/
   ! 2227.2 ms in stage  8 of segment 1: spec 1-* 1 /M/ 62
      838.0 ms in stage  9 of segment 1: faninany
     1217.5 ms in stage 10 of segment 1: xlate 1-* 00 40 01 5A 02 7F...
        2.0 ms in stage 11 of segment 1: preface var beginline
        0.9 ms in stage 12 of segment 1: append literal M
        0.9 ms in stage 13 of segment 1: append literal end
        1.1 ms in stage  2 of segment 2: count bytes
        0.6 ms in stage  3 of segment 2: copy
        5.2 ms in stage  4 of segment 2: spec 1-* 1 x0000 next
        0.8 ms in stage  5 of segment 2: chop 45
        1.1 ms in stage  6 of segment 2: vchar 6 8
        8.4 ms in stage  7 of segment 2: spec select 1 1-* d2c 1.1 right
                                              select 0 1-* 2 /M/ next

    15373.6 ms in pipeline "uuencode" (20 invocations)       <=====


UUencode Phase 2
________________

The first change to make was obvious.  The function of the change stage
that prefixed an "M" to each full line was merged into the spec stage
that followed it:

      847.5 ms in stage  2 of segment 1: xlate 1-* e2a
   : 1502.2 ms in stage  3 of segment 1: spec 1-* 1 x0A next
   :  930.4 ms in stage  4 of segment 1: fblock 45
     1222.0 ms in stage  5 of segment 1: drop last 1
     5155.3 ms in stage  6 of segment 1: vchar 6 8
   ! 2680.3 ms in stage  7 of segment 1: spec /M/ 1 1-* 2 /M/ 62
      836.7 ms in stage  8 of segment 1: faninany
     1231.4 ms in stage  9 of segment 1: xlate 1-* 00 40 01 5A 02 7F...
        2.0 ms in stage 10 of segment 1: preface var beginline
        0.9 ms in stage 11 of segment 1: append literal M
        0.9 ms in stage 12 of segment 1: append literal end
        1.1 ms in stage  2 of segment 2: count bytes
        0.6 ms in stage  3 of segment 2: copy
        5.1 ms in stage  4 of segment 2: spec 1-* 1 x0000 next
        0.8 ms in stage  5 of segment 2: chop 45
        1.1 ms in stage  6 of segment 2: vchar 6 8
        8.4 ms in stage  7 of segment 2: spec select 1 1-* d2c 1.1 right
                                              select 0 1-* 2 /M/ next

    14426.7 ms in pipeline "uuencode" (20 invocations)       <=====

MORAL:  spec is relatively expensive, but if you need to use it, then
you probably want to make it do as much as it can.


UUencode Phase 3
________________

The next improvement also merged the function of two stages.
block has an option C, which specifies that an EBCDIC linend is to be
appended to each input record before reblocking.  By using that option,
I was able to delete the spec stage that had been appending an ASCII
linend to each input record:

   : 1197.6 ms in stage  2 of segment 1: block 45 c
     1123.1 ms in stage  3 of segment 1: xlate 1-* e2a
     1133.4 ms in stage  4 of segment 1: drop last 1
     5102.6 ms in stage  5 of segment 1: vchar 6 8
     2723.9 ms in stage  6 of segment 1: spec /M/ 1 1-* 2 /M/ 62
   .  814.2 ms in stage  7 of segment 1: faninany
     1136.6 ms in stage  8 of segment 1: xlate 1-* 00 40 01 5A 02 7F...
        2.0 ms in stage  9 of segment 1: preface var beginline
        0.9 ms in stage 10 of segment 1: append literal M
        0.9 ms in stage 11 of segment 1: append literal end
        1.1 ms in stage  2 of segment 2: count bytes
        0.6 ms in stage  3 of segment 2: copy
        5.2 ms in stage  4 of segment 2: spec 1-* 1 x0000 next
        0.8 ms in stage  5 of segment 2: chop 45
        1.0 ms in stage  6 of segment 2: vchar 6 8
        8.4 ms in stage  7 of segment 2: spec select 1 1-* d2c 1.1 right
                                              select 0 1-* 2 /M/ next

    13252.3 ms in pipeline "uuencode" (20 invocations)       <=====

MORAL:  Again, merging the function of two stages into one paid off.
Note, however, that the xlate e2a stage has become significantly more
expensive, as a result of the reordering of the first few stages.  Most
of the records are originally longer than 45 bytes, so there are more
records after reblocking, and processing more records takes more time.
Because I hadn't yet written Rita when I made this change, I didn't
notice this effect at the time; I was simply going by the change in
total CPU utilization.


UUencode Phase 4
________________

The next change was to replace faninany with fanin.  There was no need
for faninany here.
I tend to use faninany, rather than fanin, out of habit, but all
records from the primary stream will be received before the single
record from the secondary stream, so fanin will do exactly the right
thing in this case and will be less expensive:

     1187.0 ms in stage  2 of segment 1: block 45 c
     1115.3 ms in stage  3 of segment 1: xlate 1-* e2a
   ! 1138.8 ms in stage  4 of segment 1: drop last 1
     5116.5 ms in stage  5 of segment 1: vchar 6 8
     2706.7 ms in stage  6 of segment 1: spec /M/ 1 1-* 2 /M/ 62
   .  598.6 ms in stage  7 of segment 1: fanin
     1117.8 ms in stage  8 of segment 1: xlate 1-* 00 40 01 5A 02 7F...
        2.1 ms in stage  9 of segment 1: preface var beginline
        0.9 ms in stage 10 of segment 1: append literal M
        0.9 ms in stage 11 of segment 1: append literal end
        1.1 ms in stage  2 of segment 2: count bytes
        0.6 ms in stage  3 of segment 2: copy
        5.2 ms in stage  4 of segment 2: spec 1-* 1 x0000 next
        0.8 ms in stage  5 of segment 2: chop 45
        1.1 ms in stage  6 of segment 2: vchar 6 8
        8.4 ms in stage  7 of segment 2: spec select 1 1-* d2c 1.1 right
                                              select 0 1-* 2 /M/ next

    13001.8 ms in pipeline "uuencode" (20 invocations)       <=====

MORAL:  Avoid unnecessary use of stages that use SELECT INPUT ANY.
Watching for input on multiple streams is inherently more expensive
than simply reading all records from one stream before switching to
another.


UUencode Phase 5
________________

It was becoming obvious that the drop last 1 was relatively expensive.
As there is a calculation of the number of output records earlier in
the program, it was trivial to replace drop last 1 with take n-1:

     1193.5 ms in stage  2 of segment 1: block 45 c
     1122.2 ms in stage  3 of segment 1: xlate 1-* e2a
   !  661.8 ms in stage  4 of segment 1: take 2999
   : 5124.7 ms in stage  5 of segment 1: vchar 6 8
     2777.4 ms in stage  6 of segment 1: spec /M/ 1 1-* 2 /M/ 62
      600.0 ms in stage  7 of segment 1: fanin
     1140.9 ms in stage  8 of segment 1: xlate 1-* 00 40 01 5A 02 7F...
        2.1 ms in stage  9 of segment 1: preface var beginline
        0.9 ms in stage 10 of segment 1: append literal M
        0.9 ms in stage 11 of segment 1: append literal end
        1.1 ms in stage  2 of segment 2: count bytes
        0.5 ms in stage  3 of segment 2: copy
        5.2 ms in stage  4 of segment 2: spec 1-* 1 x0000 next
        0.8 ms in stage  5 of segment 2: chop 45
        1.1 ms in stage  6 of segment 2: vchar 6 8
        8.4 ms in stage  7 of segment 2: spec select 1 1-* d2c 1.1 right
                                              select 0 1-* 2 /M/ next

    12641.5 ms in pipeline "uuencode" (20 invocations)       <=====

MORAL:  take last and drop last must maintain a revolving buffer of the
specified number of records.  This may be an unnecessary expense in
your application.


UUencode Phase 6
________________

As you can see, vchar was becoming more and more conspicuously the main
problem.  Fortunately, then, the Piper came to lunch.  At lunch, he
asked all of the other guests around the table for their Pipes
wishlists.  One of the guests asked for vchar to be speeded up.  He
soon got his wish:

   . 1157.9 ms in stage  2 of segment 1: block 45 c
   . 1214.8 ms in stage  3 of segment 1: xlate 1-* e2a
      728.8 ms in stage  4 of segment 1: take 2999
   : 1342.9 ms in stage  5 of segment 1: vchar 6 8
     2718.2 ms in stage  6 of segment 1: spec /M/ 1 1-* 2 /M/ 62
      595.9 ms in stage  7 of segment 1: fanin
     1130.7 ms in stage  8 of segment 1: xlate 1-* 00 40 01 5A 02 7F...
        2.1 ms in stage  9 of segment 1: preface var beginline
        0.9 ms in stage 10 of segment 1: append literal M
        0.9 ms in stage 11 of segment 1: append literal end
        1.1 ms in stage  2 of segment 2: count bytes
        0.5 ms in stage  3 of segment 2: copy
        5.2 ms in stage  4 of segment 2: spec 1-* 1 x0000 next
        0.9 ms in stage  5 of segment 2: chop 45
        0.7 ms in stage  6 of segment 2: vchar 6 8
        8.3 ms in stage  7 of segment 2: spec select 1 1-* d2c 1.1 right
                                              select 0 1-* 2 /M/ next

     8909.7 ms in pipeline "uuencode" (20 invocations)       <=====

This improvement was achieved by special-casing idioms that truncate
records at byte boundaries.  (It was included in CMS 11.)

MORAL:  Feed the Piper.


UUencode Phase 7
________________

By then, I had written Rita, so I did some measuring and finally
noticed the problem I had introduced when I switched the order of the
xlate and block stages.  Since block can readily be made to append
ASCII linends, this was easily remedied:

   .  861.8 ms in stage  2 of segment 1: xlate 1-* e2a
   . 1110.6 ms in stage  3 of segment 1: block 45 linend 0A
      726.5 ms in stage  4 of segment 1: take 2999
     1333.1 ms in stage  5 of segment 1: vchar 6 8
   ! 2730.0 ms in stage  6 of segment 1: spec /M/ 1 1-* 2 /M/ 62
      595.2 ms in stage  7 of segment 1: fanin
     1135.1 ms in stage  8 of segment 1: xlate 1-* 00 40 01 5A 02 7F...
        2.1 ms in stage  9 of segment 1: preface var beginline
        0.9 ms in stage 10 of segment 1: append literal M
        0.9 ms in stage 11 of segment 1: append literal end
        1.1 ms in stage  2 of segment 2: count bytes
        0.6 ms in stage  3 of segment 2: copy
        5.3 ms in stage  4 of segment 2: spec 1-* 1 x0000 next
        0.9 ms in stage  5 of segment 2: chop 45
        0.8 ms in stage  6 of segment 2: vchar 6 8
        8.4 ms in stage  7 of segment 2: spec select 1 1-* d2c 1.1 right
                                              select 0 1-* 2 /M/ next

     8513.1 ms in pipeline "uuencode" (20 invocations)       <=====

MORAL:  All else being equal, it is cheaper to process a smaller number
of larger records.
UUencode Phase 8
________________

The spec that prepended the record length ("M") and appended the guard
character ("M") was the obvious problem to tackle next.  My first
attempt was to replace it with two very simple stages, a change and a
pad:

      855.3 ms in stage  2 of segment 1: xlate 1-* e2a
     1146.1 ms in stage  3 of segment 1: block 45 linend 0A
      717.1 ms in stage  4 of segment 1: take 2999
     1341.2 ms in stage  5 of segment 1: vchar 6 8
   ! 1376.2 ms in stage  6 of segment 1: change //M/
   !  998.2 ms in stage  7 of segment 1: pad 62 M
      595.6 ms in stage  8 of segment 1: fanin
     1151.8 ms in stage  9 of segment 1: xlate 1-* 00 40 01 5A 02 7F...
        2.0 ms in stage 10 of segment 1: preface var beginline
        0.9 ms in stage 11 of segment 1: append literal M
        1.0 ms in stage 12 of segment 1: append literal end
        1.1 ms in stage  2 of segment 2: count bytes
        0.6 ms in stage  3 of segment 2: copy
        6.2 ms in stage  4 of segment 2: spec 1-* 1 x0000 next
        0.8 ms in stage  5 of segment 2: chop 45
        0.7 ms in stage  6 of segment 2: vchar 6 8
        8.5 ms in stage  7 of segment 2: spec select 1 1-* d2c 1.1 right
                                              select 0 1-* 2 /M/ next

     8203.1 ms in pipeline "uuencode" (20 invocations)       <=====

I was somewhat surprised that using pad to append a single character to
records of a known length would be so inexpensive.

MORAL:  It is sometimes cheaper to use two simple stages than one
complex stage.
UUencode Phase 9
________________

I needed a greater CPU reduction, however, so instead of using the
change and pad, I replaced the spec stage with guard, a trivial
Assembler filter whose only function was to copy 60-byte records from
its input to its output, prefixing an "M" and suffixing an "M" to each:

   ADDGUARD TITLE 'Pipeline stage to prefix length and suffix guard'
   ADDGUARD MODBEG FREETYPE=CMS        Begin module.
            PIPEPVR (R5)               Base the transfer vector.
   GUARD    PROC ,                     Define procedure.
   OBUFFER  DC    C'M',CL60' ',C'M'    Output buffer.
   GUARDED  EQU   OBUFFER,*-OBUFFER    Output record definition.
            PBEGIN ,                   Begin procedure.
            REPEAT ,                   Loop while there is input.
            PIPLOCX ,                  Examine next record (PEEKTO).
            MVC   OBUFFER+1(60),0(R1)  Copy record to output buffer.
            PIPOUTX GUARDED            Emit the record (OUTPUT).
            PIPINPUT (,0)              Consume input (READTO).
            UNTIL NOTZERO              Loop until no more input.
            PEXIT RC=(R15)             Exit.
            PROCEND ,                  End procedure.
            MODEND ,                   End module.

Although guard was a very special-purpose routine with no parameter
processing and no error checking, and was not at all something one
could be proud of, it had the advantage that it solved the problem with
sixteen lines of code and about five minutes of work:

      888.6 ms in stage  2 of segment 1: xlate 1-* e2a
     1120.1 ms in stage  3 of segment 1: block 45 linend 0A
      674.4 ms in stage  4 of segment 1: take 2999
     1268.4 ms in stage  5 of segment 1: vchar 6 8
   !  650.4 ms in stage  6 of segment 1: guard
      635.4 ms in stage  7 of segment 1: fanin
   : 1167.6 ms in stage  8 of segment 1: xlate 1-* 00 40 01 5A 02 7F...
        2.1 ms in stage  9 of segment 1: preface var beginline
        0.9 ms in stage 10 of segment 1: append literal M
        0.9 ms in stage 11 of segment 1: append literal end
        1.1 ms in stage  2 of segment 2: count bytes
        0.6 ms in stage  3 of segment 2: copy
        6.1 ms in stage  4 of segment 2: spec 1-* 1 x0000 next
        0.8 ms in stage  5 of segment 2: chop 45
        0.7 ms in stage  6 of segment 2: vchar 6 8
        8.5 ms in stage  7 of segment 2: spec select 1 1-* d2c 1.1 right
                                              select 0 1-* 2 /M/ next

     6426.6 ms in pipeline "uuencode" (20 invocations)       <=====

MORAL:  It may be quite easy to move the worst part of the problem into
Assembler.


UUencode Phase 10
_________________

It was obvious then to add a TR instruction and John Fisher's translate
table to this Assembler filter to perform the function of the second
xlate stage:

   UUFORM   TITLE 'Pipeline stage to complete uuencode formatting'
   UUFORM   MODBEG FREETYPE=CMS        Begin module.
            PIPEPVR (R5)               Base the transfer vector.
   UUFORMAT PROC ,                     Define procedure.
   OBUFFER  DC    C'M',CL60' ',C'M'    Output buffer.
   GUARDED  EQU   OBUFFER,*-OBUFFER    Output record definition.
            PBEGIN ,                   Begin procedure.
            REPEAT ,                   Loop while there is input.
            PIPLOCX ,                  Examine next record (PEEKTO).
            MVC   OBUFFER+1(60),0(R1)  Copy record to output buffer.
            TR    OBUFFER+1(60),UUETABLE  Add ASCII blank to each byte.
            PIPOUTX GUARDED            Emit the record (OUTPUT).
            PIPINPUT (,0)              Consume input (READTO).
            UNTIL NOTZERO              Loop until no more input.
            PEXIT RC=(R15)             Exit.
   ********************************************************************
   * The following translate table is designed to add the ASCII space *
   * character, while allowing for subsequent translation from EBCDIC *
   * to ASCII.  Thus, x'00' is translated to x'40', which will become *
   * x'20' when the uuencoded file passes through an EBCDIC-to-ASCII  *
   * gateway.  Similarly, x'01' is translated to x'5A', which will    *
   * become x'21' when it passes through such a gateway.  I am        *
   * indebted to John Fisher, of RPI, for this technique.             *
   ********************************************************************
   *                0 1 2 3 4 5 6 7 8 9 A B C D E F
   UUETABLE DC    X'405A7F7B5B6C507D4D5D5C4E6B604B61'    0
            DC    X'F0F1F2F3F4F5F6F7F8F97A5E4C7E6E6F'    1
            DC    X'7CC1C2C3C4C5C6C7C8C9D1D2D3D4D5D6'    2
            DC    X'D7D8D9E2E3E4E5E6E7E8E9ADE0BD5F6D'    3
            PROCEND ,                  End procedure.
            MODEND ,                   End module.

That made another significant reduction in the CPU use:

      835.4 ms in stage  2 of segment 1: xlate 1-* e2a
     1137.1 ms in stage  3 of segment 1: block 45 linend 0A
      714.6 ms in stage  4 of segment 1: take 2999
   . 1286.3 ms in stage  5 of segment 1: vchar 6 8
   :  849.9 ms in stage  6 of segment 1: uuformat
      591.5 ms in stage  7 of segment 1: fanin
        2.1 ms in stage  8 of segment 1: preface var beginline
        0.9 ms in stage  9 of segment 1: append literal M
        0.9 ms in stage 10 of segment 1: append literal end
        1.0 ms in stage  2 of segment 2: count bytes
        0.6 ms in stage  3 of segment 2: copy
        6.2 ms in stage  4 of segment 2: spec 1-* 1 x0000 next
        0.9 ms in stage  5 of segment 2: chop 45
        0.7 ms in stage  6 of segment 2: vchar 6 8
        8.6 ms in stage  7 of segment 2: spec select 1 1-* d2c 1.1 right
                                              select 0 1-* 2 /M/ next
        1.6 ms in stage  8 of segment 2: xlate 1-* 00 40 01 5A 02 7F...
0 5438.4 ms in pipeline "uuencode" (20 invocations) <===== 0 UUencode Phase 11 + UUencode Phase 11 + _________________ 0 Doing that made me see how simple it would be also to do the 6-to-8 expansion in the Assembler routine for this special case of having an input record always an integral multiple of 3 bytes in length and always exactly equivalent to an output record: 0 +----------------------------------------------------------------------+ | | | ENCODE TITLE 'Pipeline stage to uuencode 45-byte records' | | ENCODE MODBEG FREETYPE=CMS Begin module. | | PIPEPVR (R5) Base the transfer vector. | | ENCODEUU PROC , Define procedure. | | OBUFFER DC C'M',CL60' ',C'M' Output buffer. | | GUARDED EQU OBUFFER,*-OBUFFER Output record definition. | | PBEGIN , Begin procedure. | | REPEAT , Loop while there is input. | | PIPLOCX , Examine next record (PEEKTO). | . | LA R3,OBUFFER+1 Beginning of output buffer. | . | LA R4,45/3 Number of 3-byte chunks to do. | . | REPEAT , | . | ICM R6,B'0111',0(R1) Get three bytes to encode. | . | LA R1,3(,R1) Update input pointer. | . | STCM R6,B'0001',3(R3) Fourth output byte. | . | SLL R6,2 | . | STCM R6,B'0010',2(R3) Third output byte. | . | SLL R6,2 | . | STCM R6,B'0100',1(R3) Second output byte. | . | SLL R6,2 | . | STCM R6,B'1000',0(R3) First output byte. | . | UNTIL BCT=R4,INCR=(R3,4) Loop thru 45 bytes of input. | . | NC OBUFFER+1(60),=60X'3F' Clear high-order bits. | | TR OBUFFER+1(60),UUETABLE Add ASCII blank to each byte. | | PIPOUTX GUARDED Emit the record (OUTPUT). | | PIPINPUT (,0) Consume input (READTO). | | UNTIL NOTZERO Loop until no more input. | | PEXIT RC=(R15) Exit. | | . . . 
| | | +----------------------------------------------------------------------+ 1 0 Page 16 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 Thus, the vchar stage in the main portion of the pipeline could also be removed, again reducing CPU utilization: 0 860.4 ms in stage 2 of segment 1: xlate 1-* e2a 1100.0 ms in stage 3 of segment 1: block 45 linend 0A ! 639.6 ms in stage 4 of segment 1: take 2999 . 1176.4 ms in stage 5 of segment 1: encodeuu 608.5 ms in stage 6 of segment 1: fanin 2.0 ms in stage 7 of segment 1: preface var beginline 0.9 ms in stage 8 of segment 1: append literal M 0.9 ms in stage 9 of segment 1: append literal end 1.0 ms in stage 2 of segment 2: count bytes 0.6 ms in stage 3 of segment 2: copy 6.1 ms in stage 4 of segment 2: spec 1-* 1 x0000 next 0.8 ms in stage 5 of segment 2: chop 45 0.7 ms in stage 6 of segment 2: vchar 6 8 8.5 ms in stage 7 of segment 2: spec select 1 1-* d2c 1.1 right select 0 1-* 2 /M/ next 1.6 ms in stage 8 of segment 2: xlate 1-* 00 40 01 5A 02 7F... 0 4408.1 ms in pipeline "uuencode" (20 invocations) <===== 0 UUencode Phase 12 + UUencode Phase 12 + _________________ 0 But I was feeling insecure about the blind assumption of 45-byte input records. Worrying about that made me see that with one more small change, the Assembler stage could assume the function of the take stage and be made more robust at the same time: 1 0 Streamlining Your Pipelines Page 17 ------------------------------------------------------------------------ 0 +----------------------------------------------------------------------+ | | | ENCODE TITLE 'Pipeline stage to uuencode 45-byte records' | | ENCODE MODBEG FREETYPE=CMS Begin module. | | PIPEPVR (R5) Base the transfer vector. | | PIPUUE PROC , Define procedure. | | OBUFFER DC C'M',CL60' ',C'M' Output buffer. | | GUARDED EQU OBUFFER,*-OBUFFER Output record definition. | | PBEGIN , Begin procedure. | | REPEAT , Loop while there is input. 
| | PIPLOCX , Examine next record (PEEKTO). | ! | C R0,=F'45' Right chunk size? | ! | IF NOTEQUAL No, we can't handle it. | ! | PIPSEL OUTPUT,1,EXIT=NOTZERO Select secondary output. | ! | PIPSHORT , SHORT input to secondary output. | ! | GOTOEXIT , All done; exit. | ! | FI , Right chunk size; uuencode it. | | LA R3,OBUFFER+1 Beginning of output buffer. | | LA R4,45/3 Number of 3-byte chunks to do. | | REPEAT , | | ICM R6,B'0111',0(R1) Get three bytes to encode. | | . . . | | | +----------------------------------------------------------------------+ 0 With the Assembler filter shorting its input to its secondary output when it encountered a record not exactly 45 bytes long, there was no need for the take stage: 0 844.5 ms in stage 2 of segment 1: xlate 1-* e2a 1109.2 ms in stage 3 of segment 1: block 45 linend 0A ! 1156.1 ms in stage 4 of segment 1: pipuue 597.8 ms in stage 5 of segment 1: fanin 2.1 ms in stage 6 of segment 1: preface var beginline 0.9 ms in stage 7 of segment 1: append literal M 1.0 ms in stage 8 of segment 1: append literal end 1.1 ms in stage 2 of segment 2: count bytes 0.6 ms in stage 3 of segment 2: copy : 6.3 ms in stage 4 of segment 2: spec 1-* 1 x0000 next : 0.8 ms in stage 5 of segment 2: chop 45 : 0.8 ms in stage 6 of segment 2: vchar 6 8 : 8.7 ms in stage 7 of segment 2: spec select 1 1-* d2c 1.1 right : select 0 1-* 2 /M/ next 1.6 ms in stage 8 of segment 2: xlate 1-* 00 40 01 5A 02 7F... 
0 3731.7 ms in pipeline "uuencode" (20 invocations) <===== 0 MORAL: Assembler filters are really very easy to write, and there can + MORAL: be a considerable payoff.(1) 0 -------------------- 0 (1) Writing filters in Assembler was finally supported officially in 1 0 Page 18 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 UUencode Phase 13 + UUencode Phase 13 + _________________ 0 I was quite happy with my Assembler filter, despite the inelegance of its handling only the common case, and I was ready to leave uuencode alone for a while, but then I got a note from a user complaining that BITFTP's uuencoder was sometimes padding records excessively. He quoted the nearest thing there is to a standard for uuencode, a BSD man page, which said that the encoded portion of each record must be padded to a multiple of 4 bytes. My filter was sometimes padding the last, short record of a file further than that. So, I added a straightforward REXX filter to force the data portion of the last record to be an integral multiple of 4 bytes in length: 0 +----------------------------------------------------------------------+ | | | /* POSTUUE REXX: Position uuencode guard character properly */ | | | | Signal On Error | | | | 'PEEKTO record' /* Examine next record. */ | | Parse Var record datalen +1 /* Data length in binary. */ | | reclen = (((c2d(datalen) + 2) % 3) * 4) + 1 /* UUencoded + 1. */ | | 'OUTPUT' Left(record,reclen) || 'M' /* Append guard character. */ | | 'READTO' /* Consume processed record. */ | | | | Error: Exit RC*(RC<>12) /* RC = 0 if end-of-file. 
*/ | | | +----------------------------------------------------------------------+ 0 It was fortunate that this filter was executed only once per file, because even with it compiled and EXECLOADed, it was relatively expensive: 0 846.3 ms in stage 2 of segment 1: xlate 1-* e2a 1090.3 ms in stage 3 of segment 1: block 45 linend 0A 1135.7 ms in stage 4 of segment 1: pipuue 590.7 ms in stage 5 of segment 1: fanin 2.0 ms in stage 6 of segment 1: preface var beginline 0.9 ms in stage 7 of segment 1: append literal M 0.9 ms in stage 8 of segment 1: append literal end 1.1 ms in stage 2 of segment 2: count bytes 0.6 ms in stage 3 of segment 2: copy : 6.2 ms in stage 4 of segment 2: spec 1-* 1 x0000 next : 0.8 ms in stage 5 of segment 2: chop 45 : 0.8 ms in stage 6 of segment 2: vchar 6 8 : 6.7 ms in stage 7 of segment 2: spec select 1 1-* d2c 1.1 right 0 -------------------- 0 CMS 12, and a dumbed-down version of the interface was documented in the CMS 12 manuals. For an introduction to using the complete interface, see John P. Hartmann, Writing Assembler Filters for Use + _________________________________ with CMS Pipelines (available from the Pipelines Web page). + __________________ _________ 1 0 Streamlining Your Pipelines Page 19 ------------------------------------------------------------------------ 0 : select 0 1-* 2 : 26.9 ms in stage 8 of segment 2: postuue 1.7 ms in stage 9 of segment 2: xlate 1-* 00 40 01 5A 02 7F... 0 3711.5 ms in pipeline "uuencode" (20 invocations) <===== 0 UUencode Phase 14 + UUencode Phase 14 + _________________ 0 Pipes had recently acquired the ability to run a REXX filter read from + _____ the secondary input of a rexx stage, so I tried implementing the same REXX code that way as well: 0 +----------------------------------------------------------------------+ | | | postuue = "/* */;'PEEKTO record';", | | "datalen=((c2d(Left(record,1))+2)%3)*4;", | | "'OUTPUT' Left(record,datalen+1)'M';'READTO'" | | | | 'CALLPIPE (endchar ? 
name uuencode)', | | . . . | | 'r: rexx *.1: |', /* Append guard character. */ | | . . . | | '?', | | 'var postuue |', /* Small REXX stage to run. */ | | 'r:' /* Feed it to "rexx". */ | | | +----------------------------------------------------------------------+ 0 Not surprisingly, that was measurably slower: 0 853.6 ms in stage 2 of segment 1: xlate 1-* e2a 1097.2 ms in stage 3 of segment 1: block 45 linend 0A 1138.4 ms in stage 4 of segment 1: pipuue 592.2 ms in stage 5 of segment 1: fanin 2.1 ms in stage 6 of segment 1: preface var beginline 0.9 ms in stage 7 of segment 1: append literal M 0.9 ms in stage 8 of segment 1: append literal end 1.1 ms in stage 2 of segment 2: count bytes 0.5 ms in stage 3 of segment 2: copy : 6.2 ms in stage 4 of segment 2: spec 1-* 1 x0000 next : 0.8 ms in stage 5 of segment 2: chop 45 : 0.7 ms in stage 6 of segment 2: vchar 6 8 : 6.4 ms in stage 7 of segment 2: spec select 1 1-* d2c 1.1 right : select 0 1-* 2 : 29.7 ms in stage 8 of segment 2: rexx *.1: 1.7 ms in stage 9 of segment 2: xlate 1-* 00 40 01 5A 02 7F... : 1.5 ms in stage 1 of segment 4: var postuue 0 3734.1 ms in pipeline "uuencode" (20 invocations) <===== 0 MORAL: The compiler usually wins (but not always). 
+ MORAL: 1 0 Page 20 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 UUencode Phase 15 + UUencode Phase 15 + _________________ 0 I knew there had to be a way to do it with built-ins, so I tried again, this time replacing the spec that appended trailing zeros and the chop 45 with a 407 emulation spec that built a CALLPIPE command with a pad stage that would pad the input record to an integral multiple of 3 bytes, thus assuring that the output of vchar would be an integral multiple of 4 bytes: 0 +----------------------------------------------------------------------+ | | | /* The following pipeline segment issues a CALLPIPE of the form: | | CALLPIPE (escape " name UUPAD) literal €€""""""""| pad 12 00 |*: | | */ | | uuepad =, /* Length of encoded data must */ | | 'escape |', /* be even multiple of 4. */ | | 'p: spec', /* So, build CALLPIPE to make */ | | '/CALLPIPE (escape " name UUPAD)', | | 'literal/ 1 1-* nw /||', /* input even multiple of 3. */ | | 'pad/ next', | | 'select 1', /* Get count of data bytes. */ | | 'a: 1-* ignore', | | 'compute #0=((a+2)%3)*3',/* Round up to multiple of 3.*/ | | 'counter 0 picture z9 nw /00 ||', | | '*:/ nw |', | | 'pipcmd |' /* Issue the CALLPIPE command. */ | | | +----------------------------------------------------------------------+ 0 That was faster than the REXX filter: 0 856.4 ms in stage 2 of segment 1: xlate 1-* e2a 1094.1 ms in stage 3 of segment 1: block 45 linend 0A 1137.6 ms in stage 4 of segment 1: pipuue 591.4 ms in stage 5 of segment 1: fanin 2.0 ms in stage 6 of segment 1: preface var beginline 0.9 ms in stage 7 of segment 1: append literal M 0.9 ms in stage 8 of segment 1: append literal end 1.1 ms in stage 2 of segment 2: count bytes 0.6 ms in stage 3 of segment 2: copy : 0.6 ms in stage 4 of segment 2: escape : 21.1 ms in stage 5 of segment 2: spec /CALLPIPE (escape " name : UUPAD) literal/ 1 1-* nw /|... 
: 1.3 ms in stage 6 of segment 2: pipcmd : 0.7 ms in stage 7 of segment 2: vchar 6 8 : 8.3 ms in stage 8 of segment 2: spec select 1 1-* d2c 1.1 right : select 0 1-* 2 /M/ next 1.6 ms in stage 9 of segment 2: xlate 1-* 00 40 01 5A 02 7F... : 0.9 ms in stage 2 of segment 3: fanout : 0.7 ms in stage 3 of segment 3: copy 0 3720.3 ms in pipeline "uuencode" (20 invocations) <===== 1 0 Streamlining Your Pipelines Page 21 ------------------------------------------------------------------------ 0 MORAL: Even the most contorted solutions with built-ins are usually + MORAL: faster than REXX filters. 0 UUencode Phase 16 + UUencode Phase 16 + _________________ 0 Then, the Piper came back from holiday and reminded me that vchar 32 32 will truncate to an integral multiple of 4 bytes (that is, it will copy 32 bits from the input record to a 32-bit "character" in the output record, continuing until the end of the input record and discarding any leftover bits if there are not enough for a full output character): 0 845.7 ms in stage 2 of segment 1: xlate 1-* e2a 1091.5 ms in stage 3 of segment 1: block 45 linend 0A 1136.1 ms in stage 4 of segment 1: pipuue 590.3 ms in stage 5 of segment 1: fanin 2.0 ms in stage 6 of segment 1: preface var beginline 0.9 ms in stage 7 of segment 1: append literal M 0.9 ms in stage 8 of segment 1: append literal end 1.1 ms in stage 2 of segment 2: count bytes 0.6 ms in stage 3 of segment 2: copy 6.2 ms in stage 4 of segment 2: spec 1-* 1 x0000 next 0.7 ms in stage 5 of segment 2: vchar 6 8 : 0.8 ms in stage 6 of segment 2: vchar 32 32 : 8.6 ms in stage 7 of segment 2: spec select 1 1-* d2c 1.1 right : select 0 1-* 2 /M/ next 1.6 ms in stage 8 of segment 2: xlate 1-* 00 40 01 5A 02 7F... 0 3687.0 ms in pipeline "uuencode" (20 invocations) <===== 0 MORAL: vchar can be used to do almost anything. + MORAL: 0 UUencode Phase 17 + UUencode Phase 17 + _________________ 0 I sent the Piper my Assembler filter and asked him to critique my use of PL/j. 
When he returned my filter, I was not surprised to find that he had both improved the PL/j and rewritten the inner loop to use an entirely different algorithm. I won't show that, however, because Rita said that his filter was slower than mine. 0 His response to that challenge was to come up with a faster algorithm and to handle the short record along with the others, resulting in a full-function and very fast uuencode filter: 0 826.0 ms in stage 2 of segment 1: xlate 1-* e2a 1081.2 ms in stage 3 of segment 1: block 45 linend 0A 1284.9 ms in stage 4 of segment 1: pipuue 2.2 ms in stage 5 of segment 1: preface var beginline 0.9 ms in stage 6 of segment 1: append literal M 0.9 ms in stage 7 of segment 1: append literal end 0 3196.0 ms in pipeline "uuencode" (20 invocations) <===== 1 0 Page 22 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 MORAL: Challenge the Piper. + MORAL: 0 UUencode Postscript + UUencode Postscript + ___________________ 0 John's uuencode filter never made its way into the product, so a few months later I received a request from a friend for a REXX uuencode. I started to give him my most current straight-REXX implementation, but realized that it was not really general-purpose enough, because of the take n-1 stage requiring that the output record count be known in + _ advance. That led me to see whether I could find something that performed better than drop last 1 but did not require knowing the record count. After thinking about it a bit, I realized that what I really needed after the block 45 was not to special-case the last record, but + ____ rather to special-case any record shorter than 45 bytes. In those cases where the last record produced by block 45 is exactly 45 bytes long, there is no need for special processing at all. Thus, drop last 1 can be replaced by locate 45, which will divert any record shorter than 45 bytes to its secondary output for special processing. 
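The arithmetic that makes 45 the magic number is easy to check outside Pipes. Here is a small Python sketch (mine, purely for illustration; the original implementations are REXX and Assembler, and the function name is my invention) of how one uuencoded output line is built, showing why a full 45-byte chunk needs no special handling while a shorter final chunk must first be padded to a multiple of 3 bytes:

```python
def uu_line(chunk: bytes) -> str:
    """Encode one uuencode output line from a chunk of 1-45 input bytes."""
    line = [chr(len(chunk) + 32)]            # length character; 45 -> 'M'
    chunk += b"\x00" * (-len(chunk) % 3)     # pad short (final) chunks only
    for i in range(0, len(chunk), 3):
        n = int.from_bytes(chunk[i:i + 3], "big")
        for shift in (18, 12, 6, 0):         # four 6-bit groups per 3 bytes
            line.append(chr(((n >> shift) & 0x3F) + 32))
    return "".join(line)
```

A 45-byte chunk produces the length character "M" (45 + 32 = 77), which is exactly the C'M' that the Assembler filter's output buffer begins with, followed by 60 encoded characters; only chunks shorter than 45 bytes exercise the padding step.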
0 My measurements using a June, 1995, version of CMS Pipelines showed that + _____________ the CPU use of locate 45 fell about halfway between that of the drop and that of the take. Thinking I might use this as another phase in the UUencode case study, I repeated the measurements with a June, 1993, version of CMS Pipelines to match the measurements I have just been + _____________ showing you. What I found was that locate 45 used 22% less CPU in 1995 than it had two years earlier: 0 June '93 June '95 Change 0 drop last 1 1117 1079 -3% locate 45 1054 819 -22% take 2999 614 602 -2% 0 The improvement appears to be due to changes made when support for "inputRanges" was added to locate. This sort of thing happens in Pipes + _____ all the time. John frequently gets ideas for improving the performance of individual stages or of Pipes as a whole and quietly implements them. + _____ For example, in 1996 he shaved twelve instructions off of one of the heavily-used paths in the Pipes dispatcher. My measurements show that + _____ his change reduced the time spent in the dispatcher by 3-7%. This change was included in CMS 12, as were several optimizations of the performance of individual stages. 0 MORAL: Pipes always gets faster, so always use the newest version you + MORAL: _____ can get. The most current version of Pipes runs nicely all the way back + _____ to CMS 5, so consider porting it backwards if you are unable to move quickly to the most current level of CMS. 1 0 Streamlining Your Pipelines Page 23 ------------------------------------------------------------------------ 0 CASE STUDY--A NETDATA RECEIVE STAGE + CASE STUDY--A NETDATA RECEIVE STAGE + ___________________________________ 0 One of my next projects was a service machine that needed to receive several hundred thousand NETDATA-format files per day. 
My preliminary work on this server made it clear that whether I used the CMS RECEIVE EXEC or a pipeline to get these files from the virtual reader, the time to receive the files would be the limiting factor for my server. Since I needed the data in a pipeline in order to process them, the tradeoff was between using RECEIVE EXEC to receive the file to disk (and then loading it into the pipeline and erasing it from disk) and using a pipeline to read the file straight from the virtual reader. I found the latter to require 3% more CPU but only half as much elapsed time, due to its doing far fewer I/Os. 0 So, I decided to see if I could speed up the standard receive pipeline filter. Given that that filter was the work of the Master Plumber, I knew that my only hope to make it run faster would be to use new function that had been added to Pipes during the years since he wrote + _____ his version of receive. 0 Receive Phase 1 + Receive Phase 1 + _______________ 0 One of his first cuts at a receive filter had been this: 0 +----------------------------------------------------------------------+ | | | 'CALLPIPE (name Receive)', /* Read and deblock NETDATA: */ | | 'reader file' Arg(1) '|', /* Read specified file. */ | | 'find' "41"x'|', /* Just cards, please. */ | | 'spec 2-* 1 |', /* Remove opcode. */ | | 'pad 80 |', /* In case blanks stripped. */ | | 'deblock netdata |', /* Unravel DMSDDL damage. */ | | 'find {|', /* Just real data. */ | | 'spec 2-* 1 |', /* Delete control character. */ | | '*:' /* Here are the real records. */ | | | +----------------------------------------------------------------------+ 0 The reader stage moves the spool file specified in its argument from the virtual reader into the pipeline. The find stage selects records from that raw spool file based on the CCW opcode in the first byte. Selecting for records that start with x'41' has the effect of discarding the tag records from the spool file, which have no-op opcodes (x'03'). 
The spec stage then discards that opcode byte by copying the input to the output starting with column 2. The pad 80 makes legal card images by restoring trailing blanks that have been lost along the way. Then deblock netdata reblocks its input to write out complete NETDATA record segments, both control records and data records, each prefixed by a flag byte. The data records, which are marked with a flag of x'C0' ("{"), are selected by the second find stage. The second spec stage removes that flag byte, at which point the file has been restored to the format it had before being transmitted. 1 0 Page 24 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 That worked properly and was reasonably fast: 0 31 ms in stage 1 of segment 1: reader file 0961 55 ms in stage 2 of segment 1: find   147 ms in stage 3 of segment 1: spec 2-* 1 56 ms in stage 4 of segment 1: pad 80 114 ms in stage 5 of segment 1: deblock netdata 68 ms in stage 6 of segment 1: find { 177 ms in stage 7 of segment 1: spec 2-* 1 648 ms in pipeline "Receive0" (1 invocation) <===== 0 532 ms in dispatcher. 0 However, John soon saw how to remove the pad 80 stage by subsuming its function into the spec, and this version became the standard way of receiving a NETDATA file into a pipe: 0 30 ms in stage 1 of segment 1: reader file 0961 55 ms in stage 2 of segment 1: find   149 ms in stage 3 of segment 1: spec 2-* 1.80 110 ms in stage 4 of segment 1: deblock netdata 68 ms in stage 5 of segment 1: find { 173 ms in stage 6 of segment 1: spec 2-* 1 585 ms in pipeline "Receive1" (1 invocation) <===== 0 462 ms in dispatcher. 0 MORAL: We have already seen the benefits of letting spec do as much as + MORAL: it can. This particular trick of using it to pad records is a good one to remember. 
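For readers who do not speak Pipes fluently, the effect of each find | spec pair can be sketched in Python (an illustration of mine, not the actual implementation; the function name, and the use of the EBCDIC blank x'40' for padding, are my assumptions):

```python
def select_and_strip(records, key, width=80, blank=b"\x40"):
    """Mimic 'find <key> | spec 2-* 1.80': keep only records whose
    first byte matches key, drop that byte, and pad the remainder
    with (EBCDIC) blanks back out to the full card width."""
    for rec in records:
        if rec[:1] == key:                     # find: select on first byte
            yield rec[1:].ljust(width, blank)  # spec 2-* 1.80: strip and pad
```

Feeding it the byte x'41' keeps the punch-data records and drops the x'03' tag records, just as the first find does.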
0 Receive Phase 2 + Receive Phase 2 + _______________ 0 I suspected that I could make this filter faster by replacing the two spec stages with change stages, but of course that required me to put back the pad: 0 +----------------------------------------------------------------------+ | | | 'CALLPIPE (name Receive2)', /* Read and deblock NETDATA: */ | | 'reader file' Arg(1) '|', /* Read specified file. */ | | 'find' "41"x'|', /* Just cards, please. */ | | 'change 1.1 x41 // |', /* Remove opcode. */ | | 'pad 80 |', /* In case blanks stripped. */ | | 'deblock netdata|', /* Unravel DMSDDL damage. */ | | 'find {|', /* Just real data. */ | | 'change 1.1 /{// |', /* Delete control character. */ | | '*:' /* Here are the real records. */ | | | +----------------------------------------------------------------------+ 1 0 Streamlining Your Pipelines Page 25 ------------------------------------------------------------------------ 0 Rita showed that the combination of change and pad was slightly slower than the first spec, but that the second change was faster than the spec it replaced: 0 30 ms in stage 1 of segment 1: reader file 0961 55 ms in stage 2 of segment 1: find   109 ms in stage 3 of segment 1: change 1.1 x41 // 57 ms in stage 4 of segment 1: pad 80 110 ms in stage 5 of segment 1: deblock netdata 68 ms in stage 6 of segment 1: find { 130 ms in stage 7 of segment 1: change 1.1 /{// 559 ms in pipeline "Receive2" (1 invocation) <===== 0 532 ms in dispatcher. 0 Despite the fact that the time charged to the individual stages was less than before, the EXEC as a whole actually used more CPU time. The difference is explained by another number that Rita reports, the time spent in the Pipelines dispatcher. As a rough estimate, the dispatcher + _________ time is proportional to the number of times a record is handed to a stage. 
Thus, when the Piper removed pad from his first implementation, dispatcher time dropped from 532 ms to 462, and when I put pad back, dispatcher time went back up to 532, which was enough to swamp the savings from using change rather than spec. 0 MORAL: Don't overlook the time spent in the dispatcher and other + MORAL: Pipelines overhead. + _________ 0 CAVEAT: On versions of CMS Pipelines before CMS 12 (i.e., before + CAVEAT: ______________ ____ internal level 1.0109001A) or when using a version of Rita that does not support the MASK option of runpipe, the dispatcher time reported by Rita will be so dominated by the time to dispatch Rita itself as to be useless for this sort of analysis. (Using CMS 12 and the CMS 12 version of Rita, dispatching Rita amounts to less than 5% of the total dispatcher time.) If you need to examine the dispatcher time using an earlier version of Pipes, simply issue the command PIPMOD MSGLEVEL 8207 + _____ and run your pipeline without Rita. 0 Receive Phase 3 + Receive Phase 3 + _______________ 0 To get the improvement I was looking for, I obviously needed to handle the records fewer times. For CMS 10, John made a very powerful enhancement to change that allows it to have a secondary output stream. When that stream is defined, change writes the unchanged records to it. + _________ The stream need not actually be connected; if it is defined but not connected, the unchanged records are discarded. Using that capability, I was able to replace each of the find | spec pairs in the original implementation with a change stage: 1 0 Page 26 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 +----------------------------------------------------------------------+ | | | 'CALLPIPE (end ? name Receive3)',/* Read and deblock NETDATA: */ | | 'reader file' Arg(1) '|', /* Read specified file. */ | | 'c: change 1.1 x41 // |', /* Select data; remove opcode. */ | | 'pad 80 |', /* Pad for DEBLOCK. 
*/ | | 'deblock netdata |', /* Unravel DMSDDL damage. */ | | 'd: change 1.1 /{// |', /* Select data; remove ctl char.*/ | | '*:', /* Here are the real records. */ | | '?', | | 'c:', /* Discard tags, etc. */ | | '?', | | 'd:' /* Discard NETDATA control recs.*/ | | | +----------------------------------------------------------------------+ 0 The first change stage removes the x'41' opcode in column one and writes the changed records to its primary output. It also writes the unchanged records, the ones that did not have x'41' in column one, to its secondary output. That output is not connected to anything, so those records are discarded. Similarly, the second change selects the records that have a x'C0' in the first column and removes that byte, while discarding the records that don't begin with x'C0'. Although this implementation is a tad obscure, it is a definite win: 0 29 ms in stage 1 of segment 1: reader file 0961 108 ms in stage 2 of segment 1: change 1.1 x41 // 56 ms in stage 3 of segment 1: pad 80 109 ms in stage 4 of segment 1: deblock netdata 132 ms in stage 5 of segment 1: change 1.1 /{// 434 ms in pipeline "Receive3" (1 invocation) <===== 0 373 ms in dispatcher. 0 Both the time charged to individual stages and the dispatcher time are reduced significantly; overall, this solution uses half the CPU of the standard filter. 0 MORAL: Pipes always gets more powerful; check every new release for + MORAL: _____ new built-in programs that may help you cool down your hot spots. (Issue the command pipe ahelp menu to find out what really is in the new release, because it is not all documented in the standard help files.) 0 Receive Phase 4 + Receive Phase 4 + _______________ 0 Somebody asked for a receive filter on the PIPELINE CFORUM, so I proudly appended my shiny new one, but the universal opinion was that it was just too arcane for general use. 
Steve Hayes, of IBM UK, proposed another solution that is quite nifty: 1 0 Streamlining Your Pipelines Page 27 ------------------------------------------------------------------------ 0 +----------------------------------------------------------------------+ | | | 'CALLPIPE (name Receive4)', /* Read and deblock NETDATA: */ | | 'reader file' Arg(1) '|', /* Read specified file. */ | | 'find' "41"x'|', /* Just cards, please. */ | | 'not chop 1 |', /* Remove opcode. */ | | 'pad 80 |', /* Pad for DEBLOCK. */ | | 'deblock netdata |', /* Unravel DMSDDL damage. */ | | 'find {|', /* Just real data. */ | | 'not chop 1 |', /* Delete control character. */ | | '*:' /* Here are the real records. */ | | | +----------------------------------------------------------------------+ 0 He used not chop 1, rather than spec, to strip the first byte from the records. (chop 1, by itself, copies the first byte to its primary output stream and the remainder of the record to its secondary output stream (or discards it, if the secondary output is not connected). not chop 1 reverses that action; that is, it copies the first byte to its secondary output or discards it if the secondary output is not connected, as in this case, and writes the remainder of the record to its primary output.) 
Using not chop was a very good idea, but the time in the dispatcher was surprisingly high: 0 CPU utilization of pipeline "Receive4": 30 ms in stage 1 of segment 1: reader file 0961 55 ms in stage 2 of segment 1: find   91 ms in stage 3 of segment 1: not chop 1 57 ms in stage 4 of segment 1: pad 80 114 ms in stage 5 of segment 1: deblock netdata 69 ms in stage 6 of segment 1: find { 107 ms in stage 7 of segment 1: not chop 1 523 ms in pipeline "Receive4" (1 invocation) <===== 0 CPU utilization of pipeline "not": 91 ms in stage 2 of segment 1: chop 1 91 ms in pipeline "not" (1 invocation) <===== 0 CPU utilization of pipeline "not": 107 ms in stage 2 of segment 1: chop 1 107 ms in pipeline "not" (1 invocation) <===== 0 718 ms in dispatcher. 0 Receive Phase 5 + Receive Phase 5 + _______________ 0 When I asserted (without doing any measurements) that the dispatcher overhead must be due to the complexity of not (which sets up a subroutine pipeline to run its target stage), Glenn Knickerbocker, of IBM, countered with the same logic implemented without not: 1 0 Page 28 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 +----------------------------------------------------------------------+ | | | 'CALLPIPE (end ? name Receive5)', /* Read and deblock NETDATA: */ | | 'reader file' Arg(1) '|', /* Read specified file. */ | | 'find' "41"x'|', /* Just cards, please. */ | | 'c: chop 1', /* Remove opcode. */ | | '?', | | 'c: |', /* Here after opcode removed. */ | | 'pad 80 |', /* Pad for DEBLOCK. */ | | 'deblock netdata |', /* Unravel DMSDDL damage. */ | | 'find {|', /* Just real data. */ | | 'd: chop 1', /* Delete control character. */ | | '?', | | 'd: |', /* Here after ctl char removed.*/ | | '*:' /* Here are the real records. 
*/ | | | +----------------------------------------------------------------------+ 0 As you can see, not was not the problem: 0 30 ms in stage 1 of segment 1: reader file 0961 55 ms in stage 2 of segment 1: find   86 ms in stage 3 of segment 1: chop 1 57 ms in stage 2 of segment 2: pad 80 113 ms in stage 3 of segment 2: deblock netdata 69 ms in stage 4 of segment 2: find { 106 ms in stage 5 of segment 2: chop 1 516 ms in pipeline "Receive5" (1 invocation) <===== 0 715 ms in dispatcher. 0 The time in the dispatcher was essentially unchanged. 0 Receive Phase 6 + Receive Phase 6 + _______________ 0 I took another stab at replacing the spec stages, this time using strip leading with the count of the maximum number of bytes to strip set to 1: 0 +----------------------------------------------------------------------+ | | | 'CALLPIPE (name Receive6)', /* Read and deblock NETDATA: */ | | 'reader file' Arg(1) '|', /* Read specified file. */ | | 'find' "41"x'|', /* Just cards, please. */ | | 'strip leading x41 1 |', /* Remove opcode. */ | | 'pad 80 |', /* In case blanks stripped. */ | | 'deblock netdata |', /* Unravel DMSDDL damage. */ | | 'find {|', /* Just real data. */ | | 'strip leading { 1 |', /* Delete control character. */ | | '*:' /* Here are the real records. */ | | | +----------------------------------------------------------------------+ 1 0 Streamlining Your Pipelines Page 29 ------------------------------------------------------------------------ 0 That was the best non-arcane solution so far: 0 30 ms in stage 1 of segment 1: reader file 0961 55 ms in stage 2 of segment 1: find   54 ms in stage 3 of segment 1: strip leading x41 1 57 ms in stage 4 of segment 1: pad 80 111 ms in stage 5 of segment 1: deblock netdata 69 ms in stage 6 of segment 1: find { 66 ms in stage 7 of segment 1: strip leading { 1 441 ms in pipeline "Receive6" (1 invocation) <===== 0 527 ms in dispatcher. 0 MORAL: There is more than one way to skin a cat. 
Become familiar with + MORAL: the entire repertoire of CMS Pipelines built-in programs to give + _____________ yourself more ways of approaching problems. 0 Receive Phase 7 + Receive Phase 7 + _______________ 0 Then the Piper piped up with a small fix to chop that made it more efficient when it has only a secondary output stream connected, and that made the not chop and strip leading cases exactly equivalent: 0 CPU utilization of pipeline "Receive4": 30 ms in stage 1 of segment 1: reader file 0961 55 ms in stage 2 of segment 1: find   54 ms in stage 3 of segment 1: not chop 1 56 ms in stage 4 of segment 1: pad 80 112 ms in stage 5 of segment 1: deblock netdata 68 ms in stage 6 of segment 1: find { 67 ms in stage 7 of segment 1: not chop 1 443 ms in pipeline "Receive4" (1 invocation) <===== 0 CPU utilization of pipeline "not": 54 ms in stage 2 of segment 1: chop 1 54 ms in pipeline "not" (1 invocation) <===== 0 CPU utilization of pipeline "not": 67 ms in stage 2 of segment 1: chop 1 67 ms in pipeline "not" (1 invocation) <===== 0 536 ms in dispatcher. 0 MORAL: It really does always get faster. + MORAL: ______ 1 0 Page 30 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 CASE STUDY--TUNING RITA + CASE STUDY--TUNING RITA + _______________________ 0 Rita is a pipeline, of course, and one that has needed quite a lot of tuning. The process of streamlining Rita did a good job of illustrating one of the most important concepts in tuning a pipeline, so I will use Rita itself as another case study. 0 Basically, the pipeline in Rita looks like this: 0 'PIPE var pipeline | runpipe events | rita | console' 0 The rita filter is 900+ lines of REXX that accumulates the numbers from the event and accounting records and formats them for display. So, optimizing RITA REXX was no different from optimizing any other REXX program. 
Tools like HISTOREX give one all the information that is needed to find the high-use areas and pound them down. 0 Unfortunately, after I had optimized the REXX code to the best of my ability, I was faced with the fact that running a pipeline under Rita still used at least an order of magnitude more CPU than did running it standalone. The basic problem was that every time the subject pipeline sighed, or even thought about sighing, runpipe events wrote an event record, and that record had to be processed by my REXX filter. I actually needed to see only a few of the many different event record types, so the obvious approach was to discard the unneeded record types before they got to my REXX filter. After a good deal of experimenting, I concluded that this was the most efficient way of doing that: 0 +----------------------------------------------------------------------+ | | | 'PIPE (endchar ?)', | | 'var pipeline |', /* Load pipeline into pipe. */ | | 'runpipe events |', /* Run the subject pipeline. */ | | 'l: lookup 1', /* Discard unneeded events. */ | | '?', | | 'strliteral x010203050708090a0c0d0e11 |', /* Unused types. */ | | 'deblock 1 |', /* Get one type per record. */ | | 'l: |', /* Master file for "lookup". */ | | . . . /* More filtering. | | 'rita |', /* Analyze event records. */ | | 'console' /* Display results. */ | | | +----------------------------------------------------------------------+ 0 The event records have a one-byte record type field in the first column. The strliteral stage here lists the record types that are not of interest. deblock 1 breaks that list up into one-byte records that are fed into the secondary input of the lookup, to form its "master file". Each time runpipe produces an event record, that record is read into the primary input of the lookup. Since the primary output of lookup is not connected, records that have a type matching any of those in the master file are discarded immediately. 
Records of other types are written to 1 0 Streamlining Your Pipelines Page 31 ------------------------------------------------------------------------ 0 the secondary output of lookup, and are put through some sub-type filtering that isn't shown here before being fed into the rita stage. My measurements showed that using lookup to discard the unwanted records required only 18% as much CPU as doing it with a cascade of nfind filters. 0 Using this technique got Rita's overhead down far enough that I could begin sharing it with other people, but it was still terribly expensive to use. Then, however, John came to my rescue again and added the MASK option to runpipe. MASK allows one to specify the event record types that are to be suppressed. Using the MASK option greatly reduced the overhead of running a pipeline under Rita: 0 Bare EXEC: 12.4 secs Under Rita without MASK option: 56.2 secs Under Rita with MASK option: 16.3 secs 0 The versions of Pipes and Rita shipped in CMS 12 and later support the + _____ MASK option. 0 MORAL: If you know you will be discarding records, do it as early in + MORAL: your pipeline as possible. 0 ABSORBING LOOPS INTO THE PIPELINE + ABSORBING LOOPS INTO THE PIPELINE + _________________________________ 0 Whenever you see a PIPE or a CALLPIPE being issued in a loop, you have an opportunity to do some streamlining. Absorbing such a loop into the pipeline generally produces a dramatic reduction in CPU utilization. As an example of this process, I will discuss a very useful pipeline construct, the "sipping subroutine pipeline".(2) 0 Some of the Pipes selection filters divide their input stream into two + _____ pieces; that is, they select all the records up to some point and then discard the remainder. These are called "partitioning selection filters" (because they partition the input stream). Examples include take, tolabel, totarget, and whilelabel. 
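For readers who think more readily in conventional code, the behavior of a partitioning filter can be sketched in Python. This is only an analogue, not CMS Pipelines itself; the function name and the assumption that records arrive as an iterator are illustrative:

```python
def tolabel(records, label):
    """Rough Python analogue of the Pipes tolabel filter: pass records
    along until one begins with 'label', then quit.  (Unlike the real
    filter, this sketch consumes the boundary record from the iterator.)"""
    for record in records:
        if record.startswith(label):
            return
        yield record

# A "sip": the first group of records is taken; the rest stays behind
# in the iterator for whatever reads it next.
stream = iter(["alpha", "beta", "::end", "gamma", "delta"])
first_group = list(tolabel(stream, "::end"))   # ["alpha", "beta"]
remainder = list(stream)                       # ["gamma", "delta"]
```

Because the generator stops early, the records beyond the boundary remain available to the caller, which is the property that makes sipping possible.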
When a partitioning selection filter is used in a CALLPIPE, it can select some number of records from its input stream and then terminate when it reaches the point where it would start discarding the remaining records. The subroutine pipeline created by the CALLPIPE then terminates, and the records that would have been discarded are left in the pipeline, ready to be processed by other commands in the REXX filter that issued that CALLPIPE. Thus, the subroutine pipeline created by the CALLPIPE simply "sips" the records it needs from the beginning of the input stream and leaves the rest for further processing. 0 -------------------- 0 (2) Ben Chi, of the University of Albany, coined this delightful term. 1 0 Page 32 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 Word Spill Using a Sipping Subroutine Pipeline + Word Spill Using a Sipping Subroutine Pipeline + ______________________________________________ 0 SPILL1 REXX is an example of a REXX filter that performs a "word spill" operation by invoking a simple sipping subroutine pipeline in a loop. spill1 issues CALLPIPE commands in a loop to process one input record at a time. It uses PEEKTO and STREAMSTATE commands to decide when it should exit from its Do Forever loop. The PEEKTO gets RC 12 when there is no more input; the STREAMSTATE OUTPUT gets RC 12 when the output stream is no longer connected. Thus, in either case, the loop is terminated. 0 +----------------------------------------------------------------------+ | | | /* SPILL1 REXX: Perform a word spill at the specified length. */ | | | | Parse Arg length . /* Desired maximum lrecl. */ | | | | Do Forever /* Do until end-of-file. */ | | | | 'PEEKTO' /* Test for end-of-file. */ | | If RC /== 0 /* Exit if no more input. */ | | Then Exit RC*(RC<>12) | | | | 'CALLPIPE (end ? name Spill1)', /* Split up single record: */ | | '*: |', /* Input from pipeline. */ | | 'take 1 |', /* Sip one record to process. 
*/ | | 'split |', /* Split it into words. */ | | 'join * / /' length '|', /* Rejoin to desired length. */ | | '*:' /* Output records to pipeline. */ | | | | 'STREAMSTATE OUTPUT 0' /* Test for end-of-file. */ | | If RC == 12 Then Exit 0 /* Exit if output severed. */ | | | | End | | | +----------------------------------------------------------------------+ 0 The subroutine pipeline itself is very simple. What it does is sip one record, split that up into individual words, and then rejoin those words into one or more records that are no longer than the length specified as an argument. The idea is to split the records at a word boundary, rather than in the middle of a word. 0 The reason for having coded this subroutine pipeline to process only one input record at a time was to prevent it from putting words from different input records into the same output record. That is, the join * would join all the words from all the records in the input stream, if it were not prevented from doing that by the take 1. 0 This technique has great power and simplicity, but it is not always the cheapest way to do things. Rita produced these numbers when this REXX filter was used to process 5,000 records: 1 0 Streamlining Your Pipelines Page 33 ------------------------------------------------------------------------ 0 CPU utilization of pipeline specification "DoSpill1": 11 ( 11) ms in stage 1 of segment 1: < all notebook 16 ( 16) ms in stage 2 of segment 1: take 5000 932 (606) ms in stage 3 of segment 1: spill1 60 7 ( 7) ms in stage 4 of segment 1: hole 966 (640) ms total in "DoSpill1" (1 invocation) <===== 0 CPU utilization of pipeline specification "Spill1": 37 ( 37) ms in stage 2 of segment 1: take 1 125 (125) ms in stage 3 of segment 1: split 165 (165) ms in stage 4 of segment 1: join * / / 60 326 (326) ms total in "Spill1" (5000 invocations) <===== 0 966 ms attributed to stages; 6 virtual I/Os. 0 1124 ms in scanner. 191 ms in commands. 733 ms in dispatcher. 
0 The spill1 stage used 932 ms of CPU, 606 of which were not attributed to running its subroutine pipeline; that is, there was a lot happening in the REXX filter in addition to the running of the subroutine pipeline. Note also the fairly large Pipelines overhead numbers shown at the + _________ bottom of the Rita display, particularly the time spent in the pipeline scanner, which had to scan that CALLPIPE command 5,000 times. 0 Word Spill With Continuous Flow + Word Spill With Continuous Flow + _______________________________ 0 The reason for doing the 5,000 CALLPIPE commands was to avoid mixing together words from more than one of the input records. The CALLPIPE loop was a simple, straightforward solution to that problem. Any solution that subsumes that loop into a single pipeline is likely to be more complex. However, if this is a high-use filter, you may well judge the increase in complexity to be offset by the reduction in cost. 0 Here is a way to perform the same function with continuous flow through a single CALLPIPE: 1 0 Page 34 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 +----------------------------------------------------------------------+ | | | /* SPILL2 REXX: Perform a word spill at the specified length. */ | | | | Parse Arg length . /* Desired maximum lrecl. */ | | | | 'CALLPIPE (end ? name Spill2)', /* Split up multiple records: */ | | '*: |', /* Input from pipeline. */ | | 'o: fanout |', /* Divert second copy. */ | | 'spec number 1.10 |', /* Convert first to key. */ | | 'j: juxtapose |', /* Prefix keys to words. */ | | 'join keylength 10 / /' length+10 '|', /* Join as desired. */ | | | 'not chop 10 |', /* Remove the record key. */ | | '*:', /* Output to pipeline. */ | | '?', | | 'o: |', /* Second copy to here. */ | | 'split |', /* Split into words. */ | | 'j:' /* Send to be prefixed. 
*/ | | | +----------------------------------------------------------------------+ 0 As each record enters this subroutine pipeline through the input connector, it passes to a fanout stage that makes a second copy and sends that to the second pipeline segment, where it is split up into one-word records. The first copy of each input record goes from the fanout to the spec stage, which converts it into a record containing only a 10-digit sequence number. juxtapose then reads each of the one-word records from its secondary input and prefixes it with the corresponding sequence number record from its primary input, producing an output stream containing records that each contain a single word prefixed by the sequence number of the input record from which that word was derived. join keylength joins the records that begin with the same sequence number. It inserts a blank between each of the records it joins, and it joins only as many records as it can without exceeding the specified output length. An output record contains the key (the sequence number) only once, at the beginning, no matter how many input records were joined to form that output record. Thus, each of the original input records may be converted into one or more output records, | just as in the earlier implementation. not chop is used to strip off | the sequence number, and the records are sent through the output connector to the calling pipeline. 0 You will note that the PEEKTO and the STREAMSTATE are not required here, because there is no longer a loop to terminate. When either the input stream or the output stream goes to end of file, the CALLPIPE command will terminate, resulting in spill2 as a whole terminating. 0 No loop is required because the trick of attaching the same sequence number to each piece derived from a given input record allows us to maintain the record boundaries while pulling the records apart and putting them back together. 
Thus, all the processing can be done in a single subroutine pipeline. 1 0 Streamlining Your Pipelines Page 35 ------------------------------------------------------------------------ 0 Rita shows interesting results for this implementation: 0 CPU utilization of pipeline specification "DoSpill2": 7 ( 7) ms in stage 1 of segment 1: < all notebook 14 ( 14) ms in stage 2 of segment 1: take 5000 538 ( 1) ms in stage 3 of segment 1: spill2 60 6 ( 6) ms in stage 4 of segment 1: hole 565 ( 28) ms total in "DoSpill2" (1 invocation) <===== 0 CPU utilization of pipeline specification "Spill2": 27 ( 27) ms in stage 2 of segment 1: fanout 65 ( 65) ms in stage 3 of segment 1: spec number 1.10 178 (178) ms in stage 4 of segment 1: juxtapose 163 (163) ms in stage 5 of segment 1: join keylength 10 / / 70 19 ( 0) ms in stage 6 of segment 1: not chop 10 85 ( 85) ms in stage 2 of segment 2: split 537 (518) ms total in "Spill2" (1 invocation) <===== 0 CPU utilization of pipeline specification "not": 19 ( 19) ms in stage 2 of segment 1: chop 10 19 ( 19) ms total in "not" (1 invocation) <===== 0 565 ms attributed to stages; 6 virtual I/Os. 0 1 ms in scanner. 0 ms in commands. 490 ms in dispatcher. 0 Running the spill2 subroutine pipeline once to process 5,000 records is actually more costly than running the spill1 subroutine 5,000 times, because spill2 does much more work on each record it processes. However, both REXX and Pipes overhead drop dramatically in the spill2 + _____ case, making the total cost of spill2 only one-fourth that of spill1. (With no loop in spill2, REXX had so little to do that it used only 1 ms. Without the loop, the pipeline scanner had only one CALLPIPE to scan, and the pipeline command processor did not have 5,000 PEEKTOs and STREAMSTATEs to process.) 0 Of course, the real solution is to use the spill built-in program added in CMS 12. spill requires only 18 ms to do a much nicer job of word-spilling these 5,000 records. 
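To show the shape of the technique outside of Pipes, here is a Python sketch of the same keyed-join idea. The name word_spill is illustrative; enumerate supplies the sequence number that spec and juxtapose provide in Spill2, and grouping by that key keeps words from different input records apart while everything flows through in a single pass:

```python
from itertools import groupby

def word_spill(records, length):
    """Word-spill a stream of records to a maximum line length without
    mixing words from different input records.  (A sketch of the
    keyed-join idea, not the Pipes implementation.)"""
    # Tag every word with the number of the record it came from,
    # playing the role of the 10-digit key prefixed in Spill2.
    keyed = [(seq, word)
             for seq, record in enumerate(records)
             for word in record.split()]
    out = []
    # Like 'join keylength': join words sharing a key, up to 'length'.
    for _, group in groupby(keyed, key=lambda kw: kw[0]):
        line = ""
        for _, word in group:
            if not line:
                line = word
            elif len(line) + 1 + len(word) <= length:
                line = line + " " + word
            else:
                out.append(line)
                line = word
        if line:
            out.append(line)
    return out
```

For example, word_spill(["the quick brown fox", "jumps"], 10) produces ["the quick", "brown fox", "jumps"]; "jumps" is never joined onto "brown fox", because the two words carry different keys.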
0 Other Techniques for Achieving Continuous Flow + Other Techniques for Achieving Continuous Flow + ______________________________________________ 0 In general, one writes a sipping subroutine in order to operate on a single record or a single group of records without getting them mixed up with other records. In cases where performance considerations require a different approach, there are several useful techniques, in addition to the one we've just seen of numbering the records. Two techniques you should keep in mind are: 1 0 Page 36 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 1. Inserting a marker at the end of each group of input records to use later to recover the original record boundaries: 0 +---------------------------------------------------------------------+ | | | . . . | | 'spec 1-* 1 x00 next |', /* Delimit record for later. */ | | 'deblock 1 |', /* One record per character. */ | | 'unique first |', /* Don't want runs of the same.*/ | | 'deblock linend 00 terminate |', /* Restore record boundaries. */ | | . . . | | | +---------------------------------------------------------------------+ 0 | This fragment deletes all but the first occurrence of runs of the same | character. A character known not to occur in the data is inserted at | the end of each original record. The records are split into one-byte | records. Consecutive duplicates are discarded. And then deblock | restores the original record boundaries, deleting the marker | characters in the process. 0 2. Joining the records for each group together into a single record before processing them: 0 +---------------------------------------------------------------------+ | | | . . . | | 'join * / / |', /* Join all NAMES records. */ | | 'split before string /:NICK./ |', /* Split into NAMES entries. */ | | 'locate /:LIST./ |', /* Select those with lists. */ | | . . . 
| | | +---------------------------------------------------------------------+ 0 | This fragment processes records from a NAMES file, discarding those that | do not contain :LIST tags. All the records from the file are joined | into a single record with a blank inserted between the original | records. That single record is split before the :NICK tags, thus | producing one record per NAMES file entry, and entries not containing | :LIST tags are discarded. 0 OPTIMIZING UNABSORBED LOOPS + OPTIMIZING UNABSORBED LOOPS + ___________________________ 0 | It may often be the case that for reasons of simplicity you will prefer | to issue a subroutine pipeline repeatedly in a loop rather than | absorbing that loop into a single pipeline. In those cases, there are a | few simple techniques that can be used to reduce REXX and Pipelines + _________ | overhead. 1 0 Streamlining Your Pipelines Page 37 ------------------------------------------------------------------------ 0 | Spill1 Phase 2: Reducing REXX Overhead + Spill1 Phase 2: Reducing REXX Overhead + ______________________________________ 0 | When looking at the SPILL1 REXX example, we noticed that REXX was doing | a lot of work, 606 ms worth. Any of the usual methods for reducing REXX | overhead are likely to help. The most obvious is to compile SPILL1 | REXX: 0 | CPU utilization of pipeline specification "DoSpill1C": | 13 ( 13) ms in stage 1 of segment 1: < all notebook | 17 ( 17) ms in stage 2 of segment 1: take 5000 | 606 (274) ms in stage 3 of segment 1: rexx (spill1 crexx) 60 | 7 ( 7) ms in stage 4 of segment 1: hole | 643 (312) ms total in "DoSpill1C" (1 invocation) <===== 0 | CPU utilization of pipeline specification "Spill1": | 38 ( 38) ms in stage 2 of segment 1: take 1 | 126 (126) ms in stage 3 of segment 1: split | 168 (168) ms in stage 4 of segment 1: join * / / 60 | 332 (332) ms total in "Spill1" (5000 invocations) <===== 0 | 643 ms attributed to stages; 6 virtual I/Os. 0 | 1151 ms in scanner. 
| 196 ms in commands. | 745 ms in dispatcher. 0 | Although that reduces the time in REXX from 606 ms to 274 ms, it has no | impact on the Pipes overhead, since 5,000 CALLPIPE commands are still + _____ | issued and scanned, 5,000 PEEKTO commands are still issued, and 5,000 | STREAMSTATE OUTPUT commands are still issued. 0 | Spill1 Phase 3: Reducing Pipelines Overhead with EOFREPORT + Spill1 Phase 3: Reducing Pipelines Overhead with EOFREPORT + __________________________________________________________ 0 | The Pipelines overhead of SPILL1 REXX can be reduced by reducing the + _________ | number of Pipelines subcommands issued. CMS 14 introduced the EOFREPORT + _________ | subcommand, the purpose of which is to facilitate writing REXX filters | that propagate end-of-file "quickly". That is, EOFREPORT allows a | filter to notice end-of-file on its output stream while it is waiting to | read from its input stream and to notice end-of-file on its input stream | while it is waiting to write to its output stream. An additional | benefit of EOFREPORT is that it may provide a performance improvement by | allowing one to issue fewer Pipelines subcommands. In the case of + _________ | spill1, for example, we can insert an EOFREPORT ALL command before the | loop and remove the STREAMSTATE OUTPUT command from within the loop: 1 0 Page 38 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 | +----------------------------------------------------------------------+ | | | | | /* SPILL1X REXX: Perform a word spill at the specified length. */ | | | | | | Parse Arg length . /* Desired maximum lrecl. */ | | | | | | 'EOFREPORT ALL' /* Watch for severed streams. */ | | | | | | Do Forever /* Do until end-of-file. */ | | | | | | 'PEEKTO' /* Test for end-of-file. */ | | | If RC /== 0 /* Exit if no more input. */ | | | Then Exit RC*(RC<>12 & RC<>8) | | | | | | 'CALLPIPE (end ? 
name Spill1X)', /* Split up single record: */ | | | '*: |', /* Input from pipeline. */ | | | 'take 1 |', /* Sip one record to process. */ | | | 'split |', /* Split it into words. */ | | | 'join * / /' length '|', /* Rejoin to desired length. */ | | | '*:' /* Output records to pipeline. */ | | | | | | End | | | | | +----------------------------------------------------------------------+ 0 | When the output stream is severed as the result of the take 5000 in the | main pipeline, the PEEKTO will terminate with return code 8, giving us | the same results as before but requiring 4,999 fewer Pipelines + _________ | subcommands to be issued. 0 | That reduces the time in REXX from 274 ms to 186 and the | command-processing overhead from 196 ms to 115: 0 | CPU utilization of pipeline specification "DoSpill1X": | 14 ( 14) ms in stage 1 of segment 1: < all notebook | 16 ( 16) ms in stage 2 of segment 1: take 5000 | 516 (186) ms in stage 3 of segment 1: rexx (spill1x crexx) 60 | 7 ( 7) ms in stage 4 of segment 1: hole | 553 (223) ms total in "DoSpill1X" (1 invocation) <===== 0 | CPU utilization of pipeline specification "Spill1X": | 37 ( 37) ms in stage 2 of segment 1: take 1 | 126 (126) ms in stage 3 of segment 1: split | 167 (167) ms in stage 4 of segment 1: join * / / 60 | 330 (330) ms total in "Spill1X" (5000 invocations) <===== 0 | 553 ms attributed to stages; 7 virtual I/Os. 0 | 1127 ms in scanner. | 115 ms in commands. | 708 ms in dispatcher. 1 0 Streamlining Your Pipelines Page 39 ------------------------------------------------------------------------ 0 Moving Stages Out of the Loop + Moving Stages Out of the Loop + _____________________________ 0 If you are unable to absorb your loop into a single CALLPIPE command, you may still find it possible to move some of the stages out of the loop. 
Assume that you have a loop that issues a CALLPIPE something like this: 0 +----------------------------------------------------------------------+ | | | | 'EOFREPORT ALL' /* Propagate eof quickly. */ | | | | Do Forever /* Do until end-of-file. */ | | | | 'PEEKTO' /* Test for end-of-file. */ | | If RC /== 0 /* Exit if no more input. */ | | | Then Exit RC*(RC<>12 & RC<>8) | | | | 'CALLPIPE (name Glenn1)', | | '*: |', /* Input from pipeline. */ | | 'take 5 |', /* Sip a set of records. */ | | 'xlate |', /* Upper-case them. */ | | 'pick w1 >> w2 |', /* Select the ones we need. */ | | 'join * |', /* Join into one record. */ | | 'duplicate |', /* Make a second copy. */ | | '*:' /* Output to pipeline. */ | | | | End | | | +----------------------------------------------------------------------+ 0 CPU utilization of pipeline specification "Glenn1": 41 ms in stage 2 of segment 1: take 5 100 ms in stage 3 of segment 1: xlate 82 ms in stage 4 of segment 1: pick w1 >> w2 28 ms in stage 5 of segment 1: join * 12 ms in stage 6 of segment 1: duplicate 263 ms total in "Glenn1" (2358 invocations) <===== 0 670 ms in scanner. 56 ms in commands. 520 ms in dispatcher. 1 0 Page 40 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 You will note that there is no real need for the function performed by the xlate and duplicate stages to be done within the loop. All input records could be upper-cased and all output records could be duplicated. And this could be done by prefixing an ADDPIPE to the stage's input stream and suffixing an ADDPIPE to the stage's output stream: 0 +----------------------------------------------------------------------+ | | | | 'EOFREPORT ALL' /* Propagate eof quickly. */ | | | | 'ADDPIPE *.input: | xlate | *.input:' /* Upper-case all input. */ | | 'ADDPIPE *.output: | duplicate | *.output:' /* Duplicate output. */ | | | | Do Forever /* Do until end-of-file. */ | | | | 'PEEKTO' /* Test for end-of-file. 
*/ | | If RC /== 0 /* Exit if no more input. */ | | | Then Exit RC*(RC<>12 & RC<>8) | | | | 'CALLPIPE (name Glenn2)', | | '*: |', /* Input from pipeline. */ | | 'take 5 |', /* Sip a set of records. */ | | 'pick w1 >> w2 |', /* Select the ones we need. */ | | 'join * |', /* Join into one record. */ | | '*:' /* Output to pipeline. */ | | | | End | | | +----------------------------------------------------------------------+ 0 Even though only two stages are moved out of the loop in this case, total CPU utilization drops 25%, because the scanner and the dispatcher both have less work to do: 0 CPU utilization of pipeline specification "NoName001": 88 ms in stage 2 of segment 1: xlate 88 ms total in "NoName001" (1 invocation) <===== 0 CPU utilization of pipeline specification "NoName002": 8 ms in stage 2 of segment 1: duplicate 8 ms total in "NoName002" (1 invocation) <===== 0 CPU utilization of pipeline specification "Glenn2": 41 ms in stage 2 of segment 1: take 5 83 ms in stage 3 of segment 1: pick w1 >> w2 27 ms in stage 4 of segment 1: join * 151 ms total in "Glenn2" (2358 invocations) <===== 0 529 ms in scanner. 54 ms in commands. 435 ms in dispatcher. 1 0 Streamlining Your Pipelines Page 41 ------------------------------------------------------------------------ 0 | Subroutine Pipelines Invoked by Subroutine Pipelines in Loops + Subroutine Pipelines Invoked by Subroutine Pipelines in Loops + _____________________________________________________________ 0 | If you have a subroutine pipeline invoked repeatedly within a loop, it | is a good idea to check whether it uses any REXX filters and to take | some precautions if it does. 0 | Of course, the very first precaution is to make sure you are EXECLOADing | those REXX filters. I had a sipping subroutine that invoked Joachim | Becela's SPACE REXX in two places. 
space is the equivalent of the REXX | Space() function; it contains this subroutine pipeline: 0 | +----------------------------------------------------------------------+ | | | | | 'CALLPIPE (name Space)', | | | '*: |', /* Input from pipeline. */ | | | 'tokenize / / / / |', /* Split and delimit. */ | | | 'joincont not leading / / / / |', /* Rejoin Space'd. */ | | | 'strip leading |', /* Remove delimiting blanks. */ | | | 'locate 1 |', /* Discard last delimiter. */ | | | '*:' /* Output to pipeline. */ | | | | | +----------------------------------------------------------------------+ 0 | When my sipping subroutine was invoked 530 times, the times recorded for | its two invocations of space were 538 ms and 341 ms. After I added an | EXECLOAD SPACE REXX command to my EXEC, however, the time charged to the | first invocation was reduced to 313 ms (before, the first invocation had | also been charged for the time spent loading the REXX file). 0 | It is often the case that a REXX filter, such as this one, has no REXX | logic but is just a segment of pipeline that has been packaged | separately to hide complexity. If that is the case and if the REXX + __ | filter is being invoked from a sipping subroutine pipeline in a | production application, you may want to consider pulling the pipeline | segment up into the sipping subroutine to avoid the cost of repeatedly | launching a REXX filter. When my sipping subroutine was invoked 551 | times, the two invocations of space used about 700 ms (with SPACE REXX | EXECLOADed). When I moved the four stages from the pipeline in SPACE | REXX up into my sipping subroutine, the two occurrences of those four | stages used only 100 ms altogether. 0 | Near the end of a Rita report, you will see counts of the number of | different pipeline specifications used and of the total number of | pipeline specifications issued. These were the numbers reported when I | was invoking SPACE REXX as a separate filter: 0 | 15 pipeline specifications used (126 stages). 
| 4969 pipeline specifications issued. 0 | It is good to take note of the changes in these numbers when you are | changing a complex pipeline. This is what I saw when I moved the stages | from SPACE REXX up into the calling pipeline: 1 0 Page 42 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 | 14 pipeline specifications used (128 stages). | 3867 pipeline specifications issued. 0 | As expected, 1102 fewer pipeline specifications were issued, which makes | it not surprising that the time spent in the scanner and the dispatcher | was reduced by ten percent. 0 | Avoiding ALL in Loops + Avoiding ALL in Loops + _____________________ 0 | The all filter is a wonderful convenience in "quick-and-dirty" pipes, | but for performance reasons you should avoid using it in loops in | production applications. Originally, all simply generated the same | sequence of locate, nlocate, and faninany stages you would have written | yourself, so the only additional cost involved in using all was its time | to generate that sequence (which seems to be about 13 ms on our 9672 for | the all stage shown below). This was insignificant for one-time uses, | but could mount up if the all were part of a subroutine pipeline that | was being issued a large number of times in a loop. More recently, | there have been significant enhancements to locate that have not been | propagated to all, so you may now be able to generate a more efficient | search sequence yourself than you would get using all. 0 | For example, this stage in a subroutine pipeline that was invoked 619 | times used 8,349 ms: 0 | 'all /(h/ ! /home/ ! /(H/ ! /Home/ ! /HOME/ |' 0 | Because of recent enhancements to locate, that all stage could be | replaced by this sequence: 0 | +----------------------------------------------------------------------+ | | | | | . . . | | | 'h: locate anycase /(h/ |', /* Select "(h" or "(H". */ | | | 'H: faninany |', | | | . . . 
| | | '?', | | | 'h: |', | | | 'locate anycase /home/ |', /* Select home/Home/HOME/etc. */ | | | 'H:' | | | | | +----------------------------------------------------------------------+ 0 | These three stages used only 37 ms, a reduction of 99.6%. 1 0 Streamlining Your Pipelines Page 43 ------------------------------------------------------------------------ 0 CACHING THE RESULTS OF EXPENSIVE OPERATIONS + CACHING THE RESULTS OF EXPENSIVE OPERATIONS + ___________________________________________ 0 | One of the most powerful techniques for streamlining your pipelines is | to cache the results of expensive operations. This is usually easy to | do using the lookup stage, which has been greatly enhanced in recent | releases of CMS. For example, if you needed to do a DNS lookup on the | hostnames in records passing through a pipeline and there were likely to | be duplicates, you might do well to cache and reuse the output of the | hostbyname stage rather than doing the same DNS lookups repeatedly. | This is a simple filter to do DNS lookups and cache the responses (its | input is a stream of hostnames; its output records have the IP address | appended): 0 | +----------------------------------------------------------------------+ | | | | | 'CALLPIPE (listerr endchar ? name CachedDNSLookup)', | | | '*: |', /* Hostnames to look for. */ | | | 'l: lookup 1-* word 1 master |',/* Already know about it? */ | | | 'y: faninany |', /* Bring in new cases. */ | | | '*:', /* Names & IP addrs to caller. */ | | | '?', | | | 'l: |', /* Unmatched hostnames here. */ | | | 'o: fanout |', /* Divert copy for new master. */ | | | 'h: hostbyname |', /* Can we resolve it? */ | | | 'i: faninany |', /* Bring in ones we couldn't. */ | | | 'copy |', /* (Unblock FANOUT.) */ | | | 's: spec 1-* 1 select 1 1-* nw |', /* Build response/master. */ | | | 't: fanout |', /* Divert copy for response. */ | | | 'copy |', /* (Unblock LOOKUP.) */ | | | 'l:', /* Into LOOKUP's reference. 
*/ | | | '?', | | | 'o: |', /* Unmatched hostnames here. */ | | | 's:', /* Into SPEC for new master. */ | | | '?', | | | 'h: |', /* HOSTBYNAME errors here. */ | | | 'i:', /* Return error to caller. */ | | | '?', | | | 't: |', /* Copy of HOSTBYNAME result. */ | | | 'y:' /* Out to the caller. */ | | | | | +----------------------------------------------------------------------+ 0 | lookup's reference (its "master file") serves as the cache here. The | master records contain two words, the hostname and the corresponding IP | address. Each time a hostname record is read from the input, lookup | tries to match it with a record in its reference. When a match is | found, the record from the reference (the "master record") is written to | lookup's primary output and is sent from there out to the caller. 0 | The reference starts out empty, as lookup's secondary input stream is | unconnected. So, at least the first hostname will not be matched. When | a hostname is not matched, the detail record is written to the secondary 1 0 Page 44 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 | output of lookup, and passes from there to hostbyname, which determines | the IP address for that host. hostbyname's answer is fed into the | tertiary input of lookup (the stream for masters to be added to the | reference) and is also sent out to the caller. The next time that | hostname is encountered on the primary input, lookup will find a match | in its reference and will write the result directly to the caller. 0 | The originally-empty lookup reference accumulates a new record each time | another hostname is resolved, and no hostname need be resolved more than | once, no matter how often it occurs in the input stream. Depending on | the proportion of duplicates, using this technique can result in | dramatic reductions in both CPU and elapsed time. 
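| The flow of this lookup feedback loop is essentially memoization. As an
| illustrative analogue only (the paper's technique is the Pipes lookup
| reference itself, not Python), the same idea might be sketched like
| this, with a dict playing the role of lookup's reference and a
| hypothetical resolve() function standing in for the hostbyname stage:

```python
# A minimal sketch of the caching idea: resolve each unique hostname
# only once, however many times it appears in the input stream.
# The dict plays the role of LOOKUP's reference (its "master file");
# resolve() stands in for HOSTBYNAME (both names are illustrative).

def cached_dns_lookup(hostnames, resolve):
    cache = {}                          # starts empty, like the reference
    for name in hostnames:
        if name not in cache:           # "unmatched detail record"
            cache[name] = resolve(name) # add a new master to the cache
        yield name + ' ' + cache[name]  # hostname with IP appended

# resolve() is called once per unique name, regardless of duplicates:
calls = []
def fake_resolve(name):
    calls.append(name)
    return '192.0.2.1'

out = list(cached_dns_lookup(['a', 'b', 'a', 'a'], fake_resolve))
```

| As in the pipeline, the cache accumulates one entry per unique
| hostname, and repeated names are answered without another resolution.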
1 0 Streamlining Your Pipelines Page 45 ------------------------------------------------------------------------ 0 | If you are not yet comfortable building feedback loops for lookup | yourself, Rob van der Heij has provided a production-strength cache | stage called PIPCACHE REXX that is ready to be used right out of the | box. The pipcache stage requires a somewhat unusual topology: 0 | 'PIPE . . . | p: pipcache | expensive-operation | p: | . . .' 0 | +----------------------------------------------------------------------+ | | | | | +----------+ +-----------+ | | | . . . --------->|0 0|-->| expensive |--+ | | | | | +-----------+ | | | | | pipcache | | | | | | | | | | | +-->|1 1|------------------)----> . . . | | | | +----------+ | | | | | | | | | +---------------------------------+ | | | | | +----------------------------------------------------------------------+ 0 | pipcache assumes that the stages between its primary output and its | secondary input perform the operation whose results are to be cached. | When pipcache reads a record from its primary input, it searches for a | match in its cache. If a match is found, pipcache writes the result | value from its cache to its secondary output. If no match is found, + _________ | pipcache copies the input record to its primary output, where it is put + _______ | through the expensive operation. The result of that operation is fed | into the secondary input of pipcache, which caches the result and copies | it to its secondary output. Thus, whether or not there was a cache hit, | the result is written to the secondary output of pipcache. In the | process, the result is cached if it was not already in the cache. 0 | As an example of a case in which one would want to use pipcache, let's | look at this fragment that processes the output of a CP QUERY RDR | command with the ISODATE option: 0 | +----------------------------------------------------------------------+ | | | | | . . . 
| | | 'spec w8 1 1-* nw |', /* Prepend copy of isodate. */ | | | 'dateconvert w1 isodate rexxb |', /* Convert to Rexx Date('B').*/ | | | . . . | | | | | +----------------------------------------------------------------------+ 0 | Word 8 of the response (the date in isodate format) is copied to the | beginning of the record, so that the dateconvert stage can be used to | convert it to the REXX base date while leaving the original record | intact. 1 0 Page 46 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 | With 35,000 spool files, using dateconvert turned out to be extremely | expensive: 0 | 865 ms in stage 7 of segment 1: spec w8 1 1-* nw | 58606 ms in stage 8 of segment 1: dateconvert w1 isodate rexxb 0 | 1662 ms in dispatcher. | 63006 ms total. 0 | Since there were known to be many repeats of the dates, the obvious | solution was to cache the converted dates and execute dateconvert only | once for each unique date: 0 | +----------------------------------------------------------------------+ | | | | | . . . | | | 'd: spec w8 1 write', /* Just date to primary, and */ | | | 'outstream 1 1-* 1 |', /* whole record to secondary.*/ | | | 'p: pipcache |', /* Cache for DATECONVERT. */ | | | 'dateconvert isodate rexxb |', /* To Rexx Date('B'). */ | | | 'p: |', | | | 'copy |', /* (Unblock first SPEC.) */ | | | 's: spec 1-* 1 select 1 1-* nw |', /* Prefix Date('B'). */ | | | . . . | | | '?', | | | 'd: |', /* Original records to here. */ | | | 'copy |', /* (Unblock first SPEC.) */ | | | 's:' /* Feed into second SPEC. */ | | | | | +----------------------------------------------------------------------+ 0 | Because pipcache treats its entire input record as the key when looking | up values in its cache, we must give it records containing only the date | to be converted. So, the spec stage is modified to write only the date | to its primary output and to write the original record to its secondary | output. 
Once the isodate record has been converted to the base date, | another spec is used to reunite the two records, giving the same output | as before. 0 | As pipcache reads each isodate record from its primary input, it looks | for a match in its cache. If a match is found, the cached base date is | written to the secondary output. If no match is found, the isodate | record is copied to the primary output, where dateconvert reads it and | converts it to the base date. pipcache then reads the base date from | its secondary input, stores it in its cache, and copies it to its | secondary output. In either case, the second spec stage reads the base | date from its primary input and prefixes it to the original QUERY RDR | output record, which it reads from its secondary input. 0 | Although this more complex solution requires twice as much dispatcher | time, the total cost of the pipeline is reduced by 86%: 0 | 807 ms in stage 7 of segment 1: spec w8 1 write outstream 1 1-* 1 1 0 Streamlining Your Pipelines Page 47 ------------------------------------------------------------------------ 0 | 1300 ms in stage 8 of segment 1: pipcache | 400 ms in stage 9 of segment 1: dateconvert isodate rexxb | 171 ms in stage 11 of segment 1: copy | 798 ms in stage 12 of segment 1: spec 1-* 1 select 1 1-* nw | . . . | 160 ms in stage 2 of segment 3: copy 0 | 3480 ms in dispatcher. | 8787 ms total. 0 | Exactly the same scheme can be used anywhere your pipelines perform an | expensive or slow-running operation that doesn't really need to be | repeated when its input is the same. PIPCACHE REXX is shown on the next | page. 1 0 Page 48 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 /*-------------------------------------------------------------------*/ /* PIPCACHE REXX Rob vd Heij */ /* */ /* Records containing values to be converted by the subsequent */ /* stage are read from the primary input stream. 
Records with the */ /* results of the conversion are written to the secondary output. */ /* The stage that does the conversion gets its input from the */ /* primary output of this stage and writes its output to the */ /* secondary input of the stage. The conversion is bypassed (the */ /* record from the primary input is not written to the primary */ /* output) if an earlier record had the same value, in which case */ /* the result can be found in the cache maintained by this stage */ /* and can be written directly to the secondary output, saving the */ /* cost of doing the conversion again. */ /* The master records in the first LOOKUP stage below are input */ /* values that have not been "converted" by being fed through the */ /* (expensive) stage that follows this one. Each of these master */ /* records is preceded by a 10-character sequence number key. */ /* The master records in the second LOOKUP stage use that same */ /* sequence number key to denote the corresponding converted value. */ /* If you are running a version of CMS Pipelines that does not */ /* support the "propagateeof" global option, change the CALLPIPE */ /* to ADDPIPE. */ 0 'CALLPIPE (listerr endchar \ name PipCache propagateeof)', '\ *.input.0: ', /* Values to be converted. */ '| l1: lookup 1-* 11-* master ', /* Value already converted? */ '| l2: lookup 1.10 master', /* Yes, use key to find result.*/ '| not chop 10', /* Remove seq nr from result. */ '| f2: faninany', /* Either matched or converted.*/ '| *.output.1:', /* Output all results. */ '\ l1:', /* Here if value not in table. */ '| spec number 1 1-* n', /* Assign sequence number key. */ '| f1: fanout stop anyeof', /* Divert copy to be converted.*/ '| copy', /* (Unblock 1st LOOKUP/FANOUT.)*/ '| l1:', /* New master to first LOOKUP. */ '\ l2:', /* (Placeholder for 2ndry.) */ '\ f1:', /* Key + value to be converted.*/ '| nr: chop 10', /* Keep numeric key here. */ '| copy', /* (Unblock CHOP.) 
*/ '| s: spec 1.10 1 select 1 1-* n',/* Append result of conversion.*/ '| l2:', /* New master to second LOOKUP.*/ '\ nr:', /* Value to be converted. */ '| *.output.0:', /* To primary output so other */ , /* .. stage can convert it. */ '\ *.input.1:', /* We read result from here. */ '| f3: fanout', /* One copy of the result to */ '| f2:', /* .. go to secondary output. */ '\ f3:', /* Another copy to feed into */ '| s:' /* .. 2nd LOOKUP's reference. */ 0 error: return rc * ( rc <> 12 ) 1 0 Streamlining Your Pipelines Page 49 ------------------------------------------------------------------------ 0 OTHER TIPS FOR STREAMLINING YOUR PIPELINES + OTHER TIPS FOR STREAMLINING YOUR PIPELINES + __________________________________________ 0 1. Become familiar with the notes in the author's help files, which often mention performance ramifications. For example, pipe ahelp faninany displays this advice regarding the performance of the STRICT option: 0 3. Depending on the number of input streams, STRICT may add significant overhead. Use it only when you can prove from the topology that it really is needed. 0 2. Be sure to note differences in the virtual I/O count when comparing alternate implementations. Rita displays the I/O count near the end of its detail output: 0 | 480 ms attributed to stages; 11 virtual I/Os. 0 | 3. Note differences in virtual memory utilization when comparing | alternate implementations. Recent versions of Rita display the | highwatermark memory utilization for each stage in each pipeline: 0 | 2 ( 2) ms (65K) in stage 1 of segment 1: < all notebook l | 3 ( 3) ms (<1K) in stage 2 of segment 1: take 1000 | 141 ( 88) ms ( 3K) in stage 3 of segment 1: spill1x 60 0 | 4. Always keep an eye open for ways to reduce the number of records | flowing through your pipelines. One way to do that is to discard | unneeded records as early in your processing as feasible, but another | is to combine records as early as possible. 
For example, a sipping + _______ | subroutine that was invoked 537 times included these stages: 0 | +---------------------------------------------------------------------+ | | | | | . . . | | | 'H: fanin |', /* Bring in the mail body. */ | | | 'join * x0D25 |', /* Make single record w/CRLFs. */ | | | . . . | | | '?', | | | . . . | | | 'buffer |', /* Hold the mail body here. */ | | | 'H:' /* Mail body following headers. */ | | | | | +---------------------------------------------------------------------+ 0 | The buffer held the body of a mail file until fanin had read the mail | headers from its primary input. The streams were then merged and the | records were delimited by carriage return and linefeed and joined into | a single record. The times for those three stages were 14, 99, and 69 | ms, respectively. When the buffer was replaced with another | join * x0D25, which both buffered the records and combined them into a | single record, the times for the three stages became 14, 40, and 80 | ms, a reduction of 26%. 1 0 Page 50 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 5. Use xlate rather than change for single-character changes and look for ways to do more than one thing with a single xlate stage. For example, this: 0 +---------------------------------------------------------------------+ | | | . . . | | 'xlate upper |', /* Make sure it's uppercase. */ | | 'change //WHALE DISPLAY / |', /* Make it a WHALE command. */ | | 'change /./ / |', /* Tokenize. */ | | . . . | | | +---------------------------------------------------------------------+ 0 can be replaced by this: 0 +---------------------------------------------------------------------+ | | | . . . | | 'xlate upper . 40 |', /* Upper-case and tokenize. */ | | 'change //WHALE DISPLAY / |', /* Make it a WHALE command. */ | | . . . | | | +---------------------------------------------------------------------+ 0 6. 
Learn to use the arithmetic capability of spec (also known as the "407 emulation"). Using the 407 emulation to perform arithmetic operations seems to cost about half what it costs to do them in interpreted REXX, but more than it costs to do them in compiled REXX. (You can learn about using the 407 emulation by issuing the command pipe ahelp spectut, which will display a tutorial, or pipe ahelp specref, which will display reference material.) 0 7. Keep lookup in mind whenever you need to do a series of operations on the same field of your records. For example, lookup can often replace a cascade of locate or change stages. 0 8. Become familiar with vchar, which can be used to perform many functions that involve acting on every n-th byte. For example, to delete every + _ third byte, starting with the first one, use vchar 24 16. (In this case, vchar takes 24 bits from its input stream and places those in a 16-bit output "character", discarding the eight high-order bits, thus deleting every third byte.) The examples displayed by pipe ahelp vchar will give you some idea of the power of this stage. 0 9. If you need to read a minidisk file into two separate pipelines in the same EXEC, don't load the file into a REXX stem; read the file twice! The cost of using EXECCOMM to build and break down a large stem is almost always much greater than the cost of reading the file twice. 1 0 Streamlining Your Pipelines Page 51 ------------------------------------------------------------------------ 0 And if a file must be read multiple times, consider EXECLOADing it and using < with no filemode to read the in-storage copy. 0 Or, if you want to get fancy, you may be able to store the file in | memory using the instore stage.
This REXX filter (which was written | before dam became available) reads its input stream and passes all of the records to its output, if any of them contains the string specified in its argument: 0 +---------------------------------------------------------------------+ | | | /* Copy all input to output or none (if no matches found). */ | | | | 'ADDPIPE', /* Prefixed ADDPIPE. */ | | '*.input: |', /* Connect to our input stream.*/ | | 'instore |', /* Store all records. */ | | '*.input:' /* Write pointer to our input. */ | | | | 'PEEKTO pointer' /* Read but don't consume. */ | | | | 'CALLPIPE', /* Examine the passed file: */ | | 'var pointer |', /* Pick up copy of pointer. */ | | 'outstore |', /* Expand file into pipeline. */ | | 'locate /'Arg(1)'/ |', /* Any hits? */ | | 'take 1 |', /* Make answer Boolean. */ | | 'count lines |', /* Make answer numeric. */ | | 'var found' /* Store answer. */ | | | | If found Then /* If got any hits, ... */ | | 'CALLPIPE', /* ... output entire file. */ | | 'var pointer |', /* Load pointer again. */ | | 'outstore |', /* Expand file again. */ | | '*.output:' /* Write to output this time. */ | | | | 'READTO' /* Now consume the pointer. */ | | | | Exit | | | +---------------------------------------------------------------------+ 0 instore reads all the records from its input stream and stores them in memory. Once it has read all of its input, it writes a single output record that describes the in-storage file. This file token ("pointer") record produced by instore can be used repeatedly to load the file into a pipeline with an outstore stage, but once that record has been consumed, the in-storage copy of the file no longer exists. In this example, the ADDPIPE prefixed to the stage's input stream stores its input in memory and writes the file token record to the stage's input stream. The stage is careful not to consume that file token record until after the file has been read twice. 
1 0 Page 52 Streamlining Your Pipelines ------------------------------------------------------------------------ 0 10. When searching for a string at the beginning of records, use find or strfind rather than locate. find uses about 20% less CPU than locate 1.n uses. 0 11. Consider whether operations in a spec stage can be combined. These two spec stages produce the same output, but the second one uses 17% less CPU: 0 '. . . | spec 1.10 1 /overlay/ 11 18-* 18 | . . .' 0 '. . . | spec 1-* 1 /overlay/ 11 | . . .' 0 12. Become familiar with the possibilities for using tokenize to parse your records. For example, the tokenize stage below comes from a pipeline that removes strings enclosed in square brackets from a file containing text. It performs the same operations as the cascade of stages in the next line, but the cascade uses 4.5 times as much CPU as the tokenize: 0 '. . . | tokenise /<>/ / / | . . .' 0 '...| spec 1-* 1 write / / 1 | split before anyof / <>/ | split after anyof / <>/ | ... 0 TIPS FOR USING RITA + TIPS FOR USING RITA + ___________________ 0 1. Rita produces the best results when all CALLPIPEs and ADDPIPEs have the NAME option specified and when any specifications that have the same name are identical to one another. 0 2. If a user-written pipeline stage goes into a wait state outside of the control of Pipes, the wait time will be counted as CPU time. The + _____ most common case of this (one that can be very misleading) is a REXX stage that uses the REXX Say instruction. The time during which the user's terminal is in the "MORE" condition will be reported by Rita as CPU time used by the REXX stage. 0 3. The numbers displayed by Rita do not include the CPU utilization of other pipeline sets started with PIPE commands or runpipe stages from within the pipeline set Rita is measuring. 0 4. To use Rita with a pipeline that contains a ldrtbls stage, NUCXLOAD RITA MODULE so that invoking Rita doesn't alter the loader tables. 0 5. 
Rita is much less costly to use if RITA REXX has been compiled. 0 6. There is a bug in the version of Rita shipped with CMS 12 that causes it to loop when the subject pipeline uses an input console stage. If you need to measure such a pipeline, get a newer version of Rita from one of the archives. 1 0 Streamlining Your Pipelines Page 53 ------------------------------------------------------------------------ 0 CONCLUSION + CONCLUSION + __________ 0 Tuning a heavily-piped application is no different from tuning any other sort of application (or system). First you find out where the hot spots are and then you find ways to cool them down.