Source: wiretap.area.com/Gopher/Library/Techdoc/Cpu/coproc.txt
Author: Norbert Juffa
Written: Jan-1993
Reformatted in HTML: Jan-2017 by DougX.net
Application            Time w/o 387    Time w/387    Speedup
Art&Letters               87.0 sec      34.8 sec      150%
Quattro Pro                8.0 sec       4.0 sec      100%
Wingz                     17.9 sec       9.1 sec       97%
Mathematica              420.2 sec     337.0 sec       25%

The following table is an excerpt from [70]:

Application            Time w/o 387    Time w/387    Speedup
Corel Draw               471.0 sec     416.0 sec       13%
Freedom Of Press         163.0 sec      77.0 sec      112%
Lotus 1-2-3              257.0 sec      43.0 sec      597%

The following table is an excerpt from [25]:

Application            Time w/o 387    Time w/387    Speedup
Design CAD, Test1         98.1 sec      50.0 sec       96%
Design CAD, Test2         75.3 sec      35.0 sec      115%
Excel, Test 1              9.2 sec       6.8 sec       35%
Excel, Test 1             12.6 sec       9.3 sec       35%

Note that coprocessor performance also depends on the motherboard, or more specifically, the chipset used on the motherboard. In [34] and [35], identically configured motherboards using different 386 chipsets were tested. Among other tests, a coprocessor benchmark based on a fractal computation was run and its execution time recorded. The following tables, which show coprocessor performance varying with the chipset, have been copied from these articles in abridged form:
chip set               Cyrix 387+
Opti,  40 MHz          24.57 sec     97.0%
Elite, 40 MHz          24.46 sec     97.4%
ACT,   40 MHz          23.84 sec    100.0%
Forex, 40 MHz          23.84 sec    100.0%

chip set               Cyrix 83D87
PC-Chips, 33 MHz       26.97 sec     93.0%
UMC,      33 MHz       27.69 sec     90.5%
Headland, 33 MHz       25.08 sec    100.0%
Eteq,     33 MHz       27.38 sec     91.6%
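The percentage columns in the tables above appear consistent with two simple formulas: a speedup computed from the ratio of the two run times, and a chipset rating computed relative to the fastest entry in its group. The following Python sketch reproduces two of the entries under that assumption. The function names are illustrative, and the published figures may be truncated rather than rounded, so other rows can differ by a tenth of a percent.

```python
def speedup_percent(time_without, time_with):
    """Speedup in percent when going from time_without to time_with."""
    return (time_without / time_with - 1.0) * 100.0


def relative_performance(time, best_time):
    """Performance in percent relative to the fastest entry (best_time)."""
    return best_time / time * 100.0


# Art&Letters: 87.0 sec without a 387, 34.8 sec with one
print(round(speedup_percent(87.0, 34.8)))              # 150

# Opti chipset (24.57 sec) relative to the fastest 40 MHz entry (23.84 sec)
print(round(relative_performance(24.57, 23.84), 1))    # 97.0
```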
Relative execution times of coprocessor vs. software emulators for selected coprocessor instructions

                   Intel 387DX    TP 6.0 Emulator    EM87 Emulator
FADD ST, ST(0)          1               26                104
FDIV [DWord]            1               22                136
FXAM                    1               10                 73
FYL2X                   1               33                102
FPATAN                  1               36                110
F2XM1                   1               38                110

The following table is an excerpt from [44]:

                   Intel 80287    Intel E80287 Emulator
FADD ST, ST(0)          1               42
FDIV [DWord]            1              266
FXAM                    1              139
FYL2X                   1               99
FPATAN                  1              153
F2XM1                   1               41

The following has been adapted from [43] and merged with my own data:

                   Intel 8087    TP 6.0 Emul. (8086)    Intel Emul. (8086)
FADD ST, ST(0)          1               20                    94
FDIV [DWord]            1               22                    82
FPTAN                   1               18                   144
F2XM1                   1                6                   171
FSQRT                   1               44                   544
Intel 8087 [43]
This was the first coprocessor that Intel made available for the 80x86 family. It was introduced in 1980 and therefore does not have full compatibility with the IEEE-754 standard for floating-point arithmetic (which was finally released in 1985). It complements the 8088 and 8086 CPUs and can also be interfaced to the 80188 and 80186 processors.
The 8087 is implemented using NMOS. It comes in a 40-pin CERDIP (ceramic dual inline package). It is available in 5 MHz, 8 MHz (8087-2), and 10 MHz (8087-1) versions. Power consumption is rated at max. 2400 mW [42].
A neat trick to enhance the processing power of the 8087 for computations that use only the basic arithmetic operations (+,-,*,/) and do not require high precision is to set the precision control to single-precision. This gives one a performance increase of up to 20%. For details about programming the precision control, see program PCtrl in appendix A.
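As a rough illustration of what "setting the precision control" involves, the sketch below manipulates the precision-control (PC) field of an 80x87 control word. It assumes the standard layout with PC in bits 8-9 (00 = single, 10 = double, 11 = extended). It is not the PCtrl program from appendix A, and the helper names and example control word are chosen for illustration only. On real hardware the modified word would be loaded with FLDCW, which Python cannot do, so the code only shows the bit arithmetic.

```python
# Assumed layout: precision-control field in bits 8-9 of the control word.
PC_SHIFT = 8
PC_MASK = 0b11 << PC_SHIFT
PC_MODES = {"single": 0b00, "double": 0b10, "extended": 0b11}
PC_NAMES = {bits: name for name, bits in PC_MODES.items()}


def set_precision(control_word, mode):
    """Return control_word with its PC field replaced by the given mode."""
    cleared = control_word & ~PC_MASK & 0xFFFF
    return cleared | (PC_MODES[mode] << PC_SHIFT)


def get_precision(control_word):
    """Decode the PC field; the bit pattern 01 is reserved."""
    return PC_NAMES.get((control_word & PC_MASK) >> PC_SHIFT, "reserved")


example_cw = 0x037F  # example value with extended precision selected
print(get_precision(example_cw))                    # extended
print(hex(set_precision(example_cw, "single")))     # 0x7f
```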
With the help of an additional chip, the 8087 can in theory be interfaced to an 80186 CPU [36]. The 80186 was used in some PCs (e.g. from Philips, Siemens) in the 1982/1983 time frame, but with IBM's introduction of the 80286-based AT in 1984, it soon lost all significance for the PC market.
The 80187 is a rather new coprocessor designed to support the 80C186 embedded controller (a CMOS version of the 80186 CPU; see above). It was introduced in 1989 and implements the complete 80387 instruction set. It is available in a 40-pin CERDIP (ceramic dual inline package) and a 44-pin PLCC (plastic leaded chip carrier) for 12.5 and 16 MHz operation. Power consumption is rated at max. 675 mW for the 12.5 MHz version and max. 780 mW for the 16 MHz version [37].
Intel 80287 [44]
This is the original Intel coprocessor for the 80286, introduced in 1983. It uses the same internal execution unit as the 8087 and therefore has the same speed (actually, it is sometimes slower due to additional overhead in CPU-coprocessor communication). As with the 8087, it does not provide full compatibility with the IEEE-754 floating-point standard released in 1985.
The 80287 was manufactured in NMOS technology, and is packaged in a 40-pin CERDIP (ceramic dual inline package). There are 6 MHz, 8 MHz, and 10 MHz versions. Power consumption can be estimated to be the same as that for the 8087, which is 2400 mW max.
The 80287 has been replaced in the Intel 80x87 family with its faster successor, the CMOS-based Intel 287XL, which was introduced in 1990 (see below). There may still be a few of the old 80287 chips on the market, however.
Intel 80287XL
This chip is Intel's second-generation 287, first introduced in 1990. Since it is based on the 80387 coprocessor core, it features full IEEE-754 compatibility and faster instruction execution. Intel claims about 50% faster operation than the 80287 for typical benchmark tests such as Whetstone [45]. Comparison with benchmark results for the AMD 80C287, which is identical to the Intel 80287, supports this claim [1]: the Intel 287XL performed 66% faster than the AMD 80C287 on a fractal benchmark and 66% faster on the Whetstone benchmark in these tests. Whetstone results from [46] show the Intel 287XL at 12.5 MHz performing 552 kWhets/sec as opposed to the AMD 80C287's 289 kWhets/sec, a 91% performance increase. A benchmark using the MathPak program showed the Intel 287XL to be 59% faster than the Intel 80287 (6.9 sec. vs. 11.0 sec.) [26]. Since the 287XL has all the additional instructions and enhancements of a 387, most software automatically identifies it as an 80387-compatible coprocessor and can therefore make use of extra 387-only features, such as the FSIN and FCOS instructions.
The 287XL is manufactured in CMOS and therefore uses much less power than the older NMOS-based 80287. At 12.5 MHz, the power consumption is rated at max. 675 mW, about 1/4 of the 80287 power consumption. The 287XL is available in either a 40-pin CERDIP (ceramic dual inline package) or a 44-pin PLCC (plastic leaded chip carrier). (This latter version is called the 287XLT and is intended mainly for laptop use.) The 287XL is rated for speeds of up to 12.5 MHz.
AMD 80C287
This chip, manufactured by Advanced Micro Devices (AMD), is an exact clone of the old Intel 80287, and was first brought to market by AMD in 1989. It contains the original microcode of the 80287 and is therefore 100% compatible with it. However, as the name indicates, the 80C287 is manufactured in CMOS and therefore uses less power than an equivalent Intel 80287. At 12.5 MHz, its power consumption is rated at max. 625 mW, slightly less than that of the Intel 80287XL [27]. There is also another version called AMD 80EC287 that uses an 'intelligent' power save feature to reduce the power consumption below 80C287 levels. Tests at 10.7 MHz show typical power consumption for the 80EC287 to be 30 mW, compared to 150 mW for the AMD 80C287, 300 mW for the Intel 287XL and 1500 mW for the Intel 80287 [57]. The 80EC287 is therefore ideally suited for low power laptop systems.
The AMD 80C287 is available in speeds of 10, 12, and 16 MHz. (I have only seen it being offered in 10 MHz and 12 MHz versions, however.) At about US$ 50, it is currently the cheapest coprocessor available. Note that it provides less performance than the newer Intel 287XL (see above). The AMD 80C287 is available in 40-pin ceramic and plastic DIPs (dual inline package) and as a 44-pin PLCC (plastic leaded chip carrier).
Due to recent legal battles with Intel over the right to use the 287 microcode, which AMD lost, AMD may have to discontinue this product (disclaimer: I am not a legal expert).
Cyrix 82S87
This 80287-compatible chip was developed from the Cyrix 83D87 (Cyrix's 80387 'clone') and has been available since 1991. It complies completely with the IEEE-754 standard for floating-point arithmetic and features nearly total compatibility with Intel's coprocessors, including implementation of the full Intel 80387 instruction set. It implements the transcendental functions with the same degree of accuracy and the superior speed of the Cyrix 83D87. This makes the Cyrix 82S87 the fastest [1] and most accurate 287 compatible coprocessor available. Documentation by Cyrix [46] rates the 82S87 at 730 kWhets/sec for a 12.5 MHz system, while the Intel 287XL performs only 552 kWhets/sec. 82S87 chips manufactured after 1991 use the internals of the Cyrix 387+, which succeeds the original 83D87 [73].
The 82S87 is a fully static CMOS design with very low power requirements that can run at speeds of 6 to 20 MHz. Cyrix documentation shows the 82S87 to consume about the same amount of power as the AMD 80C287 (see above). The 82S87 comes in a 40-pin DIP or a 44-pin PLCC (plastic leaded chip carrier) compatible with the pinout of the Intel 287XLT and ideally suited for laptop use.
IIT 2C87
This chip was the first 80287 clone available, introduced to the market in 1989. It has about the same speed as the Intel 287XL [1]. The 2C87 implements the full 80387 instruction set [38]. Tests I ran on the 3C87 seem to indicate that it is not fully compatible with the IEEE-754 standard for floating-point arithmetic (see below for details), so it can be assumed that the 2C87 also fails these tests (as it presumably uses the same core as the 3C87).
The IIT 2C87 provides extra functions not available on any other 287 chip [38]. It has 24 user-accessible floating-point registers organized into three register banks. Three additional instructions (FSBP0, FSBP1, FSBP2) allow switching from one bank to another. (Transfers between registers in different banks are not supported, however, so this feature by itself is of limited usefulness. Also, there seems to be only one status register [containing the stack top pointer], so it has to be manually loaded and stored when switching between banks with a different number of registers in use [40].) The register banks' main purpose is to aid the fourth additional instruction the 2C87 has (F4X4), which does a full multiply of a 4x4 matrix by a 4x1 vector, an operation common in 3D-graphics applications [39]. According to the manufacturer, the built-in matrix multiply speeds this operation up by a factor of 6 to 8 when compared to a programmed solution [38]. Tests show the speed-up to be indeed in this range [40]. For the 3C87, I measured the execution time of F4X4 to be about 280 clock cycles; the execution time on the 2C87 should be somewhat larger - I estimate it to be around 310 clock cycles due to the higher CPU-NDP communication overhead in instruction execution in 286/287 systems (~45-50 clock cycles) compared with 386/387 systems (~16-20 clock cycles). As desirable as the F4X4 instruction may seem, however, there are very few applications that make use of it when an IIT coprocessor is detected at run time (among them Schroff Development's Silver Screen and Evolution Computing's Fast-CAD 3-D [25]).
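One way to read the 310-cycle estimate is as the 3C87 measurement with the 386/387 communication overhead swapped for the 286/287 overhead. The Python sketch below works through that arithmetic. It assumes the overhead is incurred exactly once per F4X4 instruction, which is an interpretation rather than something the measurements confirm, and the names are illustrative.

```python
F4X4_CYCLES_3C87 = 280      # measured on the 3C87 (386/387 system)
OVERHEAD_386 = (16, 20)     # CPU-NDP overhead range, 386/387 systems
OVERHEAD_286 = (45, 50)     # CPU-NDP overhead range, 286/287 systems


def estimate_2c87_cycles():
    """Return (low, high) bounds for F4X4 on a 2C87 under the stated assumption."""
    # Lowest estimate: remove the largest 386 overhead, add the smallest 286 one.
    low = F4X4_CYCLES_3C87 - OVERHEAD_386[1] + OVERHEAD_286[0]
    # Highest estimate: remove the smallest 386 overhead, add the largest 286 one.
    high = F4X4_CYCLES_3C87 - OVERHEAD_386[0] + OVERHEAD_286[1]
    return low, high


low, high = estimate_2c87_cycles()
print(low, high)            # 305 314
print((low + high) / 2)     # 309.5
```

The midpoint of this range is close to the figure of about 310 clock cycles quoted above.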
The 2C87 is available for speeds of up to 20 MHz. It is implemented in an advanced CMOS process and has therefore a low power consumption of typically about 500 mW [38].
Intel 80387
This chip was the first generation of coprocessors designed specifically for the Intel 80386 CPU. It was introduced in 1986, about one year after the 80386 was brought to market. Early 386 systems were therefore equipped with both an 80287 and an 80387 socket. The 80386 does work with an 80287, but the numerical performance is hardly adequate for such a system.
The 80387 has itself since been superseded by the Intel 387DX introduced by a quiet change in 1989 (see below). You might find it when acquiring an older 386 machine, though. The old 80387 is about 20% slower than the newer 387DX.
The 80387 is packaged in a 68-pin ceramic PGA, and was manufactured using Intel's older 1.5 micron CHMOS III technology, giving it moderate power requirements. Power consumption at 16 MHz is max. 1250 mW (750 mW typical), at 20 MHz max. 1550 mW (950 mW typical), and at 25 MHz max. 1950 mW (1250 mW typical) [60].
Intel 387DX
The 387DX is the second-generation Intel 387; it was quietly introduced to replace the original 80387 in 1989. This version is done in a more advanced CMOS process which enables the coprocessor to run at a maximum frequency of 33 MHz (the 80387 was limited to a maximum frequency of 25 MHz). The 387DX is also about 20% faster than the 80387 on average for the same clock frequency. For a 386/387 system operating at 29 MHz the Whetstone benchmark (compiled with the highly optimizing Metaware High-C V1.6) runs at 2377 kWhetstones/sec for the 80387 and at 2693 kWhetstones/sec for the 387DX, a 13% increase. In a fractal calculation programmed in assembly language, the 387DX performance was 28% higher than the performance of the 80387. The transcendental functions have also sped up from the 80387 to the 387DX. In the Savage benchmark (again, compiled with Metaware High-C V1.6 and running on a 29 MHz system), the 80387 evaluated 77600 function calls/second, while the 387DX evaluated 97800 function calls/second, a 26% increase [7]. Some instructions have been sped up a lot more than the average 20%. For example, the performance of the FBSTP instruction has increased by a factor of 3.64.
The Intel 387DX (and its predecessor 80387) are the only 387 coprocessors that support asynchronous operation of CPU and coprocessor. The 387 consists of a bus interface unit and a numerical execution unit. The bus interface unit always runs at the speed of the CPU clock (CPUCLK2). If the CKM (ClocK Mode) pin of the 387 is strapped to Vcc, the numerical execution unit runs at the same speed as the bus interface unit. If CKM is tied to ground, the numerical execution unit runs at the speed provided by the NUMCLK2 input. The ratio of NUMCLK2 (coprocessor clock) to CPUCLK2 (CPU clock) must lie within the range 10:16 to 14:10. For example, for a 20 MHz 386, the Intel 387DX could be clocked from 12.5 MHz to 28 MHz via the NUMCLK2 input. (On the Cyrix 83D87, Cyrix 387+, ULSI 83C87, and the IIT 387, the CKM pin is not connected. These coprocessors are therefore not capable of asynchronous operation and always run at the speed of the CPU.)
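The permitted clock range for asynchronous operation follows directly from the two ratio limits. The sketch below computes it for a given CPU frequency and checks whether a particular combination is allowed. The function names are illustrative, and the code works with the nominal frequencies used in the text rather than the doubled CPUCLK2/NUMCLK2 input clocks, which does not change the ratio.

```python
from fractions import Fraction

MIN_RATIO = Fraction(10, 16)   # lowest allowed coprocessor:CPU clock ratio
MAX_RATIO = Fraction(14, 10)   # highest allowed coprocessor:CPU clock ratio


def numclk_range(cpu_mhz):
    """Return (min, max) coprocessor clock in MHz for a given CPU clock."""
    return float(cpu_mhz * MIN_RATIO), float(cpu_mhz * MAX_RATIO)


def ratio_allowed(coprocessor_mhz, cpu_mhz):
    """True if the clock ratio lies within 10:16 .. 14:10 (inclusive)."""
    ratio = Fraction(coprocessor_mhz) / Fraction(cpu_mhz)
    return MIN_RATIO <= ratio <= MAX_RATIO


print(numclk_range(20))        # (12.5, 28.0)
print(ratio_allowed(25, 20))   # True
print(ratio_allowed(33, 20))   # False
```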
The Intel 387DX is manufactured using Intel's advanced low power CHMOS IV technology. Power consumption at 20 MHz is max. 900 mW (525 mW typical), at 25 MHz max. 1050 mW (625 mW typical), and at 33 MHz max. 1250 mW (750 mW typical) [59].
Intel 387SX
This is the coprocessor paired with the Intel 386SX CPU. The 386SX is an Intel 80386 with a 16-bit, rather than 32-bit, data path. This reduces (somewhat) the costs to build a 386SX system as compared to a full 32-bit design required by a 386DX. (The 386SX's main *marketing* purpose was to replace the 80286 CPU, which was being sold more cheaply by other manufacturers [such as AMD], and which Intel subsequently stopped producing.) Due to the 16-bit data path, the 386SX is slower than the 386DX and offers about the same speed as an 80286 at the same clock frequency for 16-bit applications. But as the 386SX is a complete 80386 internally, it also offers the possibility to run 32-bit applications and supports the virtual 8086 mode (used for example by Windows' 386 enhanced mode).
The 387SX has all the features of the Intel 80387, including the capability for asynchronous operation of CPU and coprocessor (see Intel 387DX information, above). Due to the 16-bit data path between the CPU and the coprocessor, the 387SX is a bit slower than an 80387 operating at the same frequency. In addition, the 387SX is based on the core of the original 80387, which executes instructions more slowly than the second-generation 387DX.
The 387SX comes in a 68-pin PLCC (plastic leaded chip carrier) package and is available in 16 MHz and 20 MHz versions. (Coprocessors for faster 386SX systems based on the Am386SX CPU are available from IIT, Cyrix, and ULSI.) Power consumption for the 387SX at 16 MHz is max. 1250 mW (740 mW typical); for the 20 MHz version it is max. 1500 mW (1000 mW typical) [62].
Intel 387SL
This coprocessor is designed for use in systems that contain an Intel 386SL as the CPU. The 386SL is directly derived from the 386SX. It is a static CHMOS IV design with very low power requirements that is intended to be used in notebook and laptop computers. It features an integrated cache controller, a programmable memory controller, and hardware support for expanded memory according to the LIM EMS 4.0 standard. The 387SL, introduced in early 1992, has been designed to accompany the 386SL in machines with low power consumption and to replace the 387SX for this purpose. It features advanced power saving mechanisms. It is based on the 387DX core, rather than on the older and slower 80387 core (which is used by the 387SX).
IIT 3C87
This IIT chip was introduced in 1989, about the same time as the Cyrix 83D87. Both coprocessors are faster than Intel's 387DX coprocessor. The IIT 3C87 also provides extra functions not available on any other 387 chip [38]. It has 24 user-accessible floating-point registers organized into three register banks. Three additional instructions (FSBP0, FSBP1, FSBP2) allow switching from one bank to another. (Transfers between registers in different banks are not supported, however, so this feature by itself is of limited usefulness. Also, there seems to be only one status register [containing the stack top pointer], so it has to be manually loaded and stored when switching between banks with a different number of registers in use [40].) The register banks' main purpose is to aid the fourth additional instruction the 3C87 has (F4X4), which does a full multiply of a 4x4 matrix by a 4x1 vector, an operation common in 3D-graphics applications [39]. According to the manufacturer, the built-in matrix multiply speeds this operation up by a factor of 6 to 8 when compared to a programmed solution [38]. Tests show the speed-up to be indeed in this range [40]. I measured the F4X4 to execute in about 280 clock cycles, during which time it executes 16 multiplications and 12 additions. The built-in matrix multiply speeds up the matrix-by-vector multiply by a factor of 3 compared with a programmed solution according to IIT [39]. The results for my own TRNSFORM benchmark support this claim (see results below), showing a performance increase by a factor of about 2.5. This makes matrix multiplies on the IIT 3C87 nearly as fast as on an Intel 486 at the same clock frequency. As desirable as the F4X4 instruction may seem, however, there are very few applications that make use of it when an IIT coprocessor is detected at run time (among them Schroff Development's Silver Screen and Evolution Computing's Fast-CAD 3-D [25]).
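To make the operation count concrete, the following Python sketch performs the same 4x4 matrix by 4x1 vector multiply in software and counts the arithmetic operations. It only illustrates the arithmetic that F4X4 replaces; it says nothing about how the IIT chip implements the instruction internally, and the function name and example matrix are made up for this illustration.

```python
def mat4_vec4(matrix, vector):
    """Multiply a 4x4 matrix by a 4x1 vector, counting multiplies and adds."""
    mults = adds = 0
    result = []
    for row in matrix:
        # First product of the row starts the accumulator (no addition needed).
        acc = row[0] * vector[0]
        mults += 1
        for a, x in zip(row[1:], vector[1:]):
            acc += a * x
            mults += 1
            adds += 1
        result.append(acc)
    return result, mults, adds


# Example: translate the point (1, 2, 3) by (5, -2, 3) in homogeneous coordinates.
translate = [
    [1, 0, 0, 5],
    [0, 1, 0, -2],
    [0, 0, 1, 3],
    [0, 0, 0, 1],
]
point = [1, 2, 3, 1]

print(mat4_vec4(translate, point))   # ([6, 0, 6, 1], 16, 12)
```

Each of the four result elements needs four multiplications and three additions, which gives the 16 multiplications and 12 additions quoted above.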
These IIT-specific instructions also work correctly when using a Chips & Technologies 38600DX or a Cyrix 486DLC CPU, which are both marketed as faster replacements for the Intel 386DX CPU.
Tests I ran with the IEEETEST program show that the 3C87 is not fully compatible with the IEEE-754 standard for floating-point arithmetic, although the manufacturer claims otherwise. It is indeed possible that the reported errors are due to personal interpretations of the standard by the program's author that have been incorporated into IEEETEST and that the standard also supports the different interpretation chosen by IIT. On the other hand, the IEEE test vectors incorporated into IEEETEST have become somewhat of an industry standard [66] and Intel's 387, 486, and RapidCAD chips pass the test without a single failure, so the fact that the IIT 3C87 fails some of the tests indicates that it is not fully compatible with the Intel 387 coprocessor. My tests also show that the IIT 3C87 does not support denormals for the double extended format. It is not entirely clear whether the IEEE standard mandates support for extended precision denormals, as the IEEE-754 document explicitly only mentions single and double-precision denormals. Missing support for denormals is not a critical issue for most applications, but there are some programs for which support of denormals is at the very least quite helpful [41]. In any case, failure of the 3C87 to support extended precision denormal numbers does represent an incompatibility with the Intel 387 and 486 chips.
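Why denormals can matter is easiest to see with a small subtraction near the underflow threshold. The sketch below uses Python floats, which are IEEE-754 double precision on common platforms (an assumption of this example), rather than the 80x87 double extended format discussed above. The `flush_to_zero` helper is a hypothetical model of arithmetic without denormal support, not a description of the 3C87's actual behavior.

```python
import sys

MIN_NORMAL = sys.float_info.min   # smallest positive normalized double


def flush_to_zero(x):
    """Model of arithmetic without denormals: tiny results become zero."""
    return 0.0 if abs(x) < MIN_NORMAL else x


a = 1.5 * MIN_NORMAL
b = MIN_NORMAL

gradual = a - b                   # denormal result, kept by gradual underflow
flushed = flush_to_zero(a - b)    # same result without denormal support

print(a != b)          # True
print(gradual > 0.0)   # True
print(flushed == 0.0)  # True
```

With denormals, two different numbers always have a nonzero difference. Without them, a test such as `a != b` no longer guarantees that dividing by `a - b` is safe.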
The 3C87 is implemented in an advanced CMOS process and has low power requirements, typically about 600 mW. Like the 387 'clones' from Cyrix and ULSI, the 3C87 does not support asynchronous operation of the CPU and the coprocessor, but always runs at the full speed of the CPU. It is available in 16, 20, 25, 33, and 40 MHz versions.
IIT 3C87SX
This is the version of the IIT 3C87 that is intended for use with Intel's 386SX or AMD's Am386SX CPU, and is functionally equivalent to the IIT 3C87. Due to the 16-bit data path between the CPU and the coprocessor in a 386SX-based system, coprocessor instructions will execute somewhat more slowly than on the 3C87. At present, the IIT 3C87SX is the only 387SX coprocessor that is offered at speeds of 16, 20, 25, and 33 MHz. (I have read that Cyrix has also announced an 83S87-33, but haven't seen it being offered yet.) The 3C87SX is packaged in a 68-pin PLCC.
Cyrix FasMath 83D87
This chip was introduced in 1989, only shortly after the coprocessors from IIT. It has been found to be the fastest 387-compatible coprocessor in several benchmark comparisons [1,7,68,69]. It also came out as the fastest coprocessor in my own tests (see benchmark results below). Although the Cyrix 83D87 provides up to 50% more performance than the Intel 387DX in benchmark comparisons, the speed advantage over other 387-compatible coprocessors in real applications is usually much smaller, because coprocessor instructions represent only a small part of the total application code. For example, in a test using the program 3D-Studio, the Cyrix 83D87 was 6% faster than the Intel 387DX [1].
Besides being the fastest 387 coprocessor, the 83D87 also offers the most accurate transcendental function results of all coprocessors tested (see test results below). The new "387+" version of the 83D87, available since November 1991, even surpasses the level of accuracy of the original 83D87 design. Note that the name 387+ is used in European distribution only. In other parts of the world, the new chip still goes by the name 83D87.
Unlike Intel's coprocessors, which use the CORDIC [18,19] algorithm to compute the transcendental functions, Cyrix uses polynomial and rational approximations to the functions. In the past the CORDIC method has been popular since it requires only shifts and adds, which made it relatively easy to implement a reasonably fast algorithm. Recently, the cost for the implementation of fast floating-point hardware multipliers has dropped significantly (due to the availability of VLSI), making the use of polynomial and rational approximations superior to CORDIC for the generation of transcendental functions [61]. The Cyrix 83D87 uses a fast array multiplier, making its transcendental functions faster than those of any other 387 compatible coprocessor. It also uses 75 bits for the mantissa in intermediate calculations (as opposed to 68 bits on other coprocessors), making its transcendental functions more accurate than those of any other coprocessor or FPU (see results below).
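The difference between the two approaches can be sketched in a few lines of Python. The first function is a textbook rotation-mode CORDIC, written in floating point for readability; in a fixed-point hardware implementation the multiplications by 2^-i would be shifts. The second evaluates a polynomial with Horner's scheme and needs one multiplication per coefficient. Neither is the actual Intel or Cyrix implementation: the iteration count is arbitrary, and plain Taylor coefficients stand in for whatever tuned coefficients a real design would use.

```python
import math


def cordic_sin_cos(theta, iterations=32):
    """Rotation-mode CORDIC for |theta| <= pi/2; returns (sin, cos)."""
    # Elementary rotation angles atan(2^-i); a table of constants in hardware.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # The pseudo-rotations stretch the vector; start with 1/gain to compensate.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0     # rotate toward a residual angle of zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x


def poly_sin(x):
    """sin(x) for small |x| via a degree-13 Taylor polynomial (Horner form)."""
    coeffs = [(-1) ** k / math.factorial(2 * k + 1) for k in range(7)]
    x2 = x * x
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x2 + c                # one multiply and one add per term
    return x * acc


s, c = cordic_sin_cos(0.5)
print(abs(s - math.sin(0.5)) < 1e-8)              # True
print(abs(poly_sin(0.5) - math.sin(0.5)) < 1e-12) # True
```

CORDIC gains roughly one bit of accuracy per iteration, while the polynomial reaches comparable accuracy with a handful of multiplications. That trade-off favors the polynomial once a fast multiplier is available.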
The 83D87 (and its successor, the 387+) are the 387 'clones' with the highest degree of compatibility to the Intel 387DX. A few minor software and hardware incompatibilities have been documented by Cyrix [12]. The software differences are caused by some bugs present in the 387DX that Cyrix fixed in the 83D87. Unlike the Intel 387DX, the 83D87 (and all other 387-compatible chips as well) does not support asynchronous operation of CPU and coprocessor. There were also problems in the past with the CPU-coprocessor communications, causing the 83D87 to occasionally hang on some machines. The reason behind this was that Cyrix shaved off a wait state in the communication protocol, which caused a communications breakdown between the CPU and the 83D87 for some systems running at 25 MHz or faster. (One notable example of this behavior was the Intel 302 board.) There were also problems with boards based on early revisions of the OPTI chipset. These problems are only rarely encountered with the current generation of 386 motherboards, and it is possible that they have been entirely eliminated in the 387+, the successor to the 83D87.
To reduce power consumption the 83D87 features advanced power saving features. Those portions of the coprocessor that are not needed are automatically shut down. If no coprocessor instructions are being executed, *all* parts except the bus interface unit are shut down [12]. Maximal power consumption of the Cyrix 83D87 at 33 MHz is 1900 mW, while typical power consumption at this clock frequency is 500 mW [15].
Cyrix EMC87
This coprocessor is basically a special version of the Cyrix 83D87, introduced in 1990. In addition to the normal 387 operating mode, in which coprocessor-CPU communication is handled through reserved IO ports, it also offers a memory-mapped mode of operation similar to the operation principle of the Weitek Abacus. Like the Weitek chip, the EMC87 occupies a block of memory starting at physical address C0000000h (the Abacus occupies a memory block of 64 KB, while the EMC87 uses only 4 KB [77]). It can therefore only be accessed in the protected or virtual modes of the 386 CPU. DOS programs can access the EMC87 with the help of DOS extenders or memory managers like EMM386 which run in protected/virtual mode themselves. To implement the memory-mapped interface, the usual 80x87 architecture has been slightly expanded with three additional registers and eleven additional instructions that can only be used if the memory-mapped mode is enabled.
Using this special mode of the EMC87 provides a significant speed advantage. The traditional 387 CPU-coprocessor interface via IO ports has an overhead of about 14-20 clock cycles. Since the Cyrix 83D87 executes some operations like addition and multiplication in much less time, its performance is actually limited by the CPU-coprocessor interface. Since the memory-mapped mode has much less overhead, it allows all coprocessor instructions to be executed at full speed with no penalty.
Originally, Cyrix claimed support for the fast memory-mapped mode of the EMC87 from a number of software vendors (including Borland and Microsoft). However, there are only very few applications that make use of it, among them Evolution Computing's FastCAD 3D, MicroWay Inc.'s NDP FORTRAN-386 compiler, Metaware's High-C compiler version 1.6 and newer, and Intusoft's Spice [63,73]. Part of the problem in supporting the memory-mapped mode is that the application must reserve one of the general purpose registers of the CPU to use memory-mapped mode instructions that access memory.
(Note that the EMC87 is *not* compatible with Weitek's Abacus coprocessor. They both use the same CPU interface technique [memory mapping], but while the EMC87 uses the standard 387 instruction set, the Weitek Abacus coprocessors use a different instruction set entirely its own.)
Since the EMC87 provides also the standard 386/387 CPU interface via IO ports, it can be used just like any other 387-compatible coprocessor and delivers the same performance as the Cyrix 83D87 in this mode. The EMC87 even allows mixed use of memory-mapped and traditional instructions in the same code. Cyrix has also implemented some additional instructions in the EMC87 that are also available in the 387-compatible mode: FRICHOP, FRINT2, and FRINEAR. These instructions enable rounding to integer without setting the rounding mode by manipulating the coprocessor control word, and are intended to make life easier for compiler writers.
In a test, the EMC87 at 33 MHz ran the single-precision Whetstone benchmark at 7608 kWhetstones/sec, while the Cyrix 83D87 at 33 MHz had a speed of only 5049 kWhetstones/sec, an increase of 50.6% [63]. In another test, the EMC87 ran a fractal computation at twice the speed of the Cyrix 83D87 and 2.6 times as fast as an Intel 387DX [64]. A third test found the EMC87's overall performance to be 20% higher than the performance of the Cyrix 83D87 [65].
The Cyrix FasMath EMC87 has also been marketed as Cyrix AutoMATH; the two chips are identical. Unlike the Cyrix 83D87, which fits into the 68-pin 387 coprocessor socket, the EMC87 comes in a 121-pin PGA and requires the 121-pin EMC (Extended Math Coprocessor) socket. Note that not all boards have such a socket (a notable exception being IBM's PS/2s, for example). The EMC87 is available in 25 and 33 MHz versions. Maximum power consumption at 33 MHz is 2000 mW.
Cyrix appears currently to be phasing out the EMC87.
Cyrix FasMath 387+
This chip is the second-generation successor to the Cyrix 83D87. (The name "387+" is only used for European distribution; in other parts of the world, it goes by the original 83D87 designation.) According to a source within Cyrix [73], the 387+ was designed to make a smaller (and thus cheaper to manufacture) coprocessor chip that could also be pushed to higher frequencies than the original chip: the 387+ is available in versions of up to 40 MHz, whereas the original 83D87 could go no faster than 33 MHz.
The Cyrix 387+ is ideally suited to be used with Cyrix's 486DLC CPU, which is a 486SX compatible replacement chip for the Intel 386DX. Indeed, Cyrix sells upgrade kits consisting of a 486DLC CPU and a Cyrix 387+.
In my tests, I found the Cyrix 387+ to be about 5 to 10 percent *slower* than the Cyrix 83D87. Some instructions are affected much more: the square root (FSQRT) now runs at only half the speed at which it ran on the 83D87, and most transcendental functions show about a 40% drop in performance compared to their 83D87 averages (see performance results, below). However, I did find the transcendental functions on the 387+ to be a bit *more* accurate than those implemented in the 83D87. The new design uses a slower hardware multiplier that needs 6 clock cycles to multiply the floating-point mantissa of an internal precision number, while the multiplier in the 83D87 takes only 4 clock cycles to accomplish the same task. Since the transcendental functions in Cyrix math coprocessors are generated by polynomial and rational approximations, this slows them down significantly.
The divide/square root logic has also been changed from the 83D87 design. The original design used an algorithm that could generate both the quotient and square root, so the execution times for these instructions were nearly identical. The algorithm chosen for the division in the 387+ doesn't allow the square root to be taken so easily, so it takes nearly twice as long.
In the 387+, the available argument range for the FYL2XP1 instruction has been extended, from the usual range -1+sqrt(2)/2..sqrt(2)/2 that is found on all 80x87 coprocessors, to include all floating-point numbers. Also, four additional instructions have been implemented: FRICHOP (opcode DD FC), FRINT2 (opcode DB FC), FRINEAR (opcode DF FC), and FTSTP (opcode D9 E6).
Cyrix FasMath 83S87
The 83S87 is the SX version of the Cyrix 83D87. Just as the 83D87 is the fastest 387-compatible coprocessor, the Cyrix 83S87 is the fastest of the 387SX compatible coprocessors [1], as well as providing the most accurate transcendental functions. 83S87 chips manufactured after 1991 use the internals of the Cyrix 387+, the successor to the original 83D87 [73] (above). The Cyrix 83S87 is ideally suited to be used with the Cyrix Cx486SLC CPU, a 486SX compatible CPU which is a replacement chip for the Intel 386SX CPU.
The 83S87 is packaged in a 68-pin PLCC and is available in 16, 20, and 25 MHz versions. Due to the advanced power saving features of the Cyrix coprocessor, the typical power consumption of the 20 MHz version is only about 350 mW [67].
ULSI Math*Co 83C87
The ULSI 83C87 is an 80387-compatible coprocessor first introduced in early 1991, well after the IIT 3C87 and Cyrix 83D87 appeared. Like other 387 clones, it is somewhat faster than the Intel 387DX, particularly in its basic arithmetic functions. The transcendental functions, however, show only a slight speed improvement over the Intel 387DX (see benchmark results below).
In my tests, the ULSI had the least accurate transcendental functions of all tested coprocessors. However, the maximum relative error is still within the limits set by Intel, so this is probably not an important issue for all but a very few applications. The ULSI 83C87 shows some minor flaws in the tests for IEEE 754 compatibility, but this, too, is probably unimportant under typical operating conditions. ULSI claims that the program IEEETEST, which was used to test for IEEE compatibility, contains many personal interpretations of the IEEE standard by the program's author and states that there is no ANSI-certified IEEE-754 compliance test. While this may be true, it is also a fact that the IEEE test vectors used in IEEETEST are a de facto industry standard, and that Intel's 387, 486, and RapidCAD chips pass it without a single failure, as do the coprocessors from Cyrix. Since the ULSI Math*Co 83C87 fails some of the tests, it is certainly less than 100% compatible with Intel's chips, although this will likely make little or no difference in typical operating conditions. (It is interesting to note that a ULSI 83S87 manufactured in 92/17 showed fewer errors in the IEEETEST test run [74] than the ULSI 83C87, manufactured in 91/48, that I used in my original test. This indicates that ULSI might have applied some quick fixes to newer revisions of their math coprocessors.)
The ULSI 83C87 fails to be compatible with IEEE-754 in that it does not implement the "precision control" feature. While all the internal operations of 80x87 coprocessors are usually performed with the maximum precision available (double-extended precision with 64 mantissa bits), the 80x87 architecture also offers the possibility of forcing lower precision to be used for the basic arithmetic functions (add, subtract, multiply, divide, and square root). This feature is required by IEEE-754 for all coprocessors that cannot store results *directly* to a single- or double-precision location. Since 80x87 coprocessors lack this storage capability, they all implement precision control to provide correctly rounded single- and double-precision results according to the floating-point standard, except the ULSI chips. For programs that make use of precision control (e.g., Interactive UNIX), correct implementation of the feature may be essential for correct arithmetic results.
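For reference, precision control is selected by a two-bit field (bits 8 and 9) of the 80x87 control word. The sketch below shows how that field can be decoded and encoded. The helper names are made up for illustration, and the code only manipulates an integer copy of the control word; it does not talk to a real coprocessor.

```python
# Precision-control (PC) field of the 80x87 control word: bits 8-9.
PC_SHIFT = 8
PC_MASK = 0x3 << PC_SHIFT

# Field value -> mantissa bits used for add, sub, mul, div, sqrt.
PC_BITS = {0b00: 24, 0b10: 53, 0b11: 64}   # 0b01 is reserved

def get_precision(control_word):
    """Return the mantissa width selected by a control word value."""
    field = (control_word & PC_MASK) >> PC_SHIFT
    if field not in PC_BITS:
        raise ValueError("reserved precision-control encoding")
    return PC_BITS[field]

def set_precision(control_word, mantissa_bits):
    """Return a copy of control_word with the PC field changed."""
    for field, bits in PC_BITS.items():
        if bits == mantissa_bits:
            return (control_word & ~PC_MASK) | (field << PC_SHIFT)
    raise ValueError("unsupported precision: %r" % mantissa_bits)

DEFAULT_CW = 0x037F                      # 387 control word after FINIT
print(get_precision(DEFAULT_CW))         # 64
double_cw = set_precision(DEFAULT_CW, 53)
print(hex(double_cw))                    # 0x27f
```

A program would load such a value with FLDCW. On a chip that ignores the field, the basic arithmetic instructions would presumably keep rounding to 64 mantissa bits regardless of the setting.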
Like other non-Intel 387 compatibles, the 83C87 does not support asynchronous operation of the CPU and the coprocessor. This means that the 83C87 always runs at the full speed of the CPU. It is available in 20, 25, 33, and 40 MHz versions. The ULSI is produced in low power CMOS; power consumption at 20 MHz is max. 800 mW (400 mW typical), at 25 MHz it is max. 1000 mW (500 mW typical), at 33 MHz it is max. 1250 mW (625 mW typical), and at 40 MHz it is max. 1500 mW (750 mW typical) [58]. The 83C87 is packaged in a 68-pin ceramic PGA.
ULSI coprocessors come with a lifetime warranty. ULSI Systems, Inc., will replace the coprocessor up to three times free of charge should it ever fail to function properly.
ULSI Math*Co 83S87
This chip is the SX version of the ULSI 83C87, for use in systems with an Intel 386SX or an AMD Am386SX CPU. It is functionally equivalent to the 83C87. To aid low-power laptop designs, the ULSI 83S87 features an advanced power saving design with a sleep mode and a standby mode with only minimal power requirements. Power consumption under normal operating conditions (dynamic mode) is max. 400 mW at 16 MHz (300 mW typical), max. 450 mW at 20 MHz (350 mW typical), and max. 500 mW at 25 MHz (400 mW typical) [58]. The ULSI 83S87 is packaged in a 68-pin PLCC.
C&T SuperMATH 38700DX
Produced by Chips&Technologies, this is the latest entry into the 387-compatible marketplace. Originally announced in October 1991, it has apparently not been available to end-users before the third quarter of 1992, at least here in Germany. My tests show that its compatibility with Intel products is very good, even for the more arcane features of the 387DX, and comparable to that of the coprocessors from Cyrix. Like these chips, it passes the IEEETEST program without a single failure. It passes, of course, all tests in Chips&Technologies' own compatibility test program, SMDIAG. However, some of the tests (the transcendental functions) in this program are selected in such a way that the C&T 38700 passes while the Cyrix 83D87 or Intel RapidCAD fail, so they are not very useful. (There is also a 'bug' in the test for FSCALE that hides a true bug in the C&T 38700.) My tests show the accuracy of the transcendental functions on the C&T 38700DX varies. Overall, accuracy of the transcendentals is slightly better than on the Intel 387DX.
In my own speed tests [see below] and those reported in [1], the C&T 38700DX showed performance at about 90-100% the level of the Cyrix 83D87, which is the 387 clone with the highest performance. For floating-point-intensive benchmarks, the C&T 38700DX provides up to 50% more computational performance than the Intel 387DX. However, as with all other 387 compatible coprocessors, the speed advantage over the Intel 387DX is far less significant in real applications.
The SuperMATH 38700DX is implemented in 1.2 micron CMOS with on-chip power management, which makes for low power consumption. The 38700DX is packaged in a 68-pin ceramic PGA (pin grid array) and is available in speeds of 16, 20, 25, 33, and 40 MHz.
C&T 38700SX
This chip is the SX version of the 38700DX and compatible with the Intel 387SX. It provides performance comparable to a Cyrix 83S87 [1], the 387SX clone with the highest performance. Compatibility with the Intel 387SX is very good and on par with the high degree of the compatibility found in the Cyrix 83S87.
The 38700SX has low power consumption. It is packaged in a 68-pin PLCC (plastic leaded chip carrier) and available in speeds of 16, 20, and 25 MHz.
Intel RapidCAD
The RapidCAD is not a coprocessor, strictly speaking, although it is marketed as one. Rather, it is a full replacement for an 80386 CPU: basically, an Intel 486DX CPU chip without the internal cache and with a standard 386 pinout. RapidCAD is delivered as a set of two chips. RapidCAD-1 goes into the 386 socket and contains the CPU and FPU. RapidCAD-2 goes into the coprocessor (387) socket and contains a simple PAL whose only purpose is to generate the FERR signal normally generated by a coprocessor. (This is needed by the motherboard circuitry to provide 287 compatible coprocessor exception handling in 386/387 systems.) The RapidCAD instruction set is compatible with the 386, so it doesn't have any newer, 486-specific instructions like BSWAP. However, since the RapidCAD CPU core is very similar to the 80486 CPU core, most of the register-to-register instructions execute in the same number of clock cycles as on the 486.
RapidCAD's use of the standard 386 bus interface causes instructions that access memory to execute at about the same speed as on the 386. The integer performance on the RapidCAD is definitely limited by the low memory bandwidth provided by this interface (2 clock cycles per bus cycle) and the lack of an internal cache. CPU instructions often execute faster than they can be fetched from memory, even with a big and fast external cache. Therefore, the integer performance of the RapidCAD exceeds that of a 386 by *at most* 35%. This value was derived by running some programs that use mostly register-to-register operations and few memory accesses, and is supported by the SPEC ratings that Intel reports for the 386-33 and the RapidCAD-33: while the 386-33 has a SPECint of 6.4, the RapidCAD has a SPECint of 7.3 [28], a 14% increase. (Note that these tests used the old [1989] SPEC benchmarks suite.)
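The bandwidth limit mentioned above can be put into numbers. The following back-of-the-envelope sketch assumes a 32-bit (4-byte) data bus and zero wait states, neither of which is stated in the text for any particular motherboard; the function name is hypothetical.

```python
def peak_bus_bandwidth(clock_hz, clocks_per_bus_cycle=2, bus_bytes=4):
    """Peak transfer rate in bytes/sec for a non-burst bus.

    Assumes every bus cycle moves one full bus width and that
    no wait states are inserted.
    """
    return clock_hz / clocks_per_bus_cycle * bus_bytes

for mhz in (25, 33):
    rate = peak_bus_bandwidth(mhz * 1_000_000)
    print(f"{mhz} MHz: {rate / 1e6:.0f} million bytes/sec peak")
```

Any wait states or cache misses reduce the real figure further, which is consistent with the observation that the core can execute instructions faster than they can be fetched.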
While CPU and integer instructions often execute in one clock cycle on the RapidCAD, floating-point operations always take more than seven clock cycles. They are therefore rarely slowed down by the low-bandwidth 386 bus interface. My tests show a 70%-100% performance increase for floating-point intensive benchmarks over a 386-based system using the Intel 387DX math coprocessor. This is consistent with the SPECfp rating reported by Intel. The 386/387 at 33 MHz is rated at 3.3 SPECfp, while the RapidCAD is rated at 6.1 SPECfp at the same frequency, an 85% increase. This means that a system that uses the RapidCAD is faster than *any* 386/387 combination, regardless of the type of 387 used, whether an Intel 387DX or a faster 387 clone. The diagnostic disk for the RapidCAD also gives some application performance data for the RapidCAD compared to the Intel 387DX:
Application              Time w/ 387DX   Time w/ RapidCAD   Speedup
AutoCAD 11                      52 sec             32 sec       63%
AutoShade/Renderman            180 sec            108 sec       67%
Mathematica (Windows)          139 sec            103 sec       35%
SPSS/PC+ 4.01                   17 sec             14 sec       21%
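The speedup column in these tables appears to be the ratio of the two run times minus one, expressed as a percentage. A small sketch using the RapidCAD figures above reproduces the published values; the function names are hypothetical, and rounding half up is an assumption made to match the 63% entry.

```python
import math

def speedup_percent(time_before, time_after):
    """Percent speedup: how much more work gets done per unit time."""
    return (time_before / time_after - 1.0) * 100.0

def round_half_up(x):
    """Round to the nearest integer, with .5 rounded upward."""
    return math.floor(x + 0.5)

# (application, seconds with 387DX, seconds with RapidCAD)
RESULTS = [
    ("AutoCAD 11", 52, 32),
    ("AutoShade/Renderman", 180, 108),
    ("Mathematica (Windows)", 139, 103),
    ("SPSS/PC+ 4.01", 17, 14),
]

for name, t_387, t_rapidcad in RESULTS:
    pct = round_half_up(speedup_percent(t_387, t_rapidcad))
    print(f"{name:24s}{pct:4d}%")
```

Note that a speedup of 100% under this definition means the run time was halved, not eliminated.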
RapidCAD is available in 25 MHz and 33 MHz versions. It is distributed through different channels than the other Intel math coprocessors, and I have therefore been unable to obtain a data sheet for it. [78] gives the typical power consumption of the 33 MHz RapidCAD as 3500 mW, which is the same as for the 33 MHz 486DX. The RapidCAD-1 chip gets quite hot when operating. Therefore, I recommend extra cooling for it (see the paragraph below on the 486 for details). The RapidCAD-1 is packaged in a 132-pin PGA, just like the 80386, and the RapidCAD-2 is packaged in a 68-pin PGA like an 80387 coprocessor.
Intel 486DX
The Intel 486DX is, of course, not solely a coprocessor. This chip, first introduced by Intel in 1989, functionally combines the CPU (a heavily-pipelined implementation of the 386 architecture) with an enhanced 387 (the chip's floating-point unit, FPU) and 8 KB of unified on-chip code/data cache. (This description is necessarily simplified; for a detailed hardware description, see [52].) The 486DX offers about two to three times the integer performance of a 386 at the same clock frequency, while floating-point performance is about three to four times as high as the Intel 387DX at the same clock rate [29]. Since the FPU is on the same chip as the CPU, the considerable communication overhead between CPU and coprocessor in a 386/387 system is omitted, letting FPU instructions run at the full speed permitted by the implementation. The FPU also takes advantage of the on-chip cache and the highly pipelined execution unit. The concurrent execution of CPU and coprocessor instructions typical for 80x86/80x87 systems is still in existence on the 486, but some FPU instructions like FSIN have nearly no concurrency with CPU instructions, indicating that they make heavy use of both CPU and FPU resources [53, 1].
Besides its higher performance, the 486 FPU provides more accurate transcendental functions than the 387DX coprocessor, according to my tests (see below). To achieve better interrupt latency, FPU instructions with long execution times have been made abortable if an interrupt occurs during their execution.
Due to the considerable amount of heat produced by these chips, and taking into consideration the slow air flow provided by the fan in garden-variety PC tower cases, I recommend an extra fan directly above the CPU for safer operation. If you measure the surface temperature of a 486DX after some time of operation in a normal tower case without extra cooling, you may well come up with something like 80-90 degrees Celsius (that is 175-195 degrees Fahrenheit for those not familiar with metric units) [54,55]. You don't need the well known (and expensive) IceCap[tm] to effectively cool your CPU; a simple fan mounted directly above the CPU can bring the temperature of the chip down to about 50-60 degrees Celsius (120-140 degrees Fahrenheit), depending on the room temperature and the temperature within the PC case (which depends on the total power dissipation of all the components and the cooling provided by the fan in the system's power supply). According to a simple rule known as Arrhenius' Law, lowering the temperature by 10 degrees Celsius slows down chemical reactions by a factor of two, so lowering the temperature of your CPU by 30 degrees should prolong the life of the device by a factor of eight, due to the slower ageing process. If you are reluctant to add a fan to your system because of the additional noise, settle for a low-noise fan like those available from the German manufacturer Papst (this is not meant to be an advertisement; I am just the happy owner of such a fan, and have no other connections to the firm).
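The rule of thumb above is easy to tabulate. The sketch below simply encodes the "factor of two per 10 degrees Celsius" rule as stated in the text; it is not a reliability model for any real device, and the function names are hypothetical. The temperature conversion is included to check the Fahrenheit figures.

```python
def lifetime_factor(delta_t_celsius, doubling_step=10.0):
    """Estimated lifetime multiplier for lowering the temperature by
    delta_t_celsius, per the 'factor of two per 10 degrees' rule."""
    return 2.0 ** (delta_t_celsius / doubling_step)

def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to Fahrenheit."""
    return c * 9.0 / 5.0 + 32.0

print(lifetime_factor(30))             # 8.0
for c in (50, 60, 80, 90):
    print(c, "C =", celsius_to_fahrenheit(c), "F")
```

The exact conversions (122, 140, 176, and 194 degrees Fahrenheit) agree with the rounded ranges quoted in the text.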
The 486DX comes in a 168 pin ceramic PGA (pin grid array). It is available in 25 MHz and 33 MHz versions. Since the end of 1991, a 50 MHz version has also been available, manufactured by a CHMOS V process (the 25 MHz and 33 MHz are produced using the CHMOS IV process). Maximum power consumption is 3500 mW for the 25 MHz 486 (2600 mW typical), 4500 mW for the 33 MHz version (3500 mW typical), and 5000 mW (3875 mW typical) for the 50 MHz chip.
Intel 486DX2
The 486DX2 represents the latest generation of Intel CPUs. The "DX2" suffix (instead of simply DX) is meant to be an indicator that these are clock-doubled versions of the basic CPU. A normal 486DX operates at the frequency provided by the incoming clock signal. A 486DX2 instead generates a new clock signal from the incoming clock by means of a PLL (phase locked loop). In the DX2, this clock signal has twice the frequency of the incoming clock, hence the name clock-doubler. All internal parts of the 486DX2 (cache, CPU core, and FPU) run at this higher frequency; only the bus interface runs at the normal (undoubled) speed. Using this technique, an Intel 486DX2-50 can run on an unmodified motherboard designed for 25 MHz operation. Since motherboards which run at 50 MHz are much harder to design and build than those for 25 MHz, this makes a 486DX2-50 system cheaper than an 'equivalent' 486DX-50 system.
For all operations that don't access off-chip resources (e.g., register operations), a 486DX2-50 provides exactly the same performance as a 486DX-50, and twice the performance of a 486DX-25. However, since the main memory in a 486DX2-50 system still operates at 25 MHz, all instructions involving memory accesses are potentially slower than in a 486DX-50 system, whose memory also (presumably) runs at 50 MHz. The internal cache of the 486 mitigates this problem a bit, but overall performance of a 486DX2-50 is still lower than that of a 486DX-50. Intel's documentation [32] shows this drop to be quite small, although it is highly dependent upon the particular application.
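A crude way to reason about this trade-off is an Amdahl-style model: split the run time on the undoubled chip into a part that depends only on the core clock and a part bound by off-chip accesses, and let only the first part speed up. This is a sketch under that assumption; the fractions used below are made-up inputs, not measurements, and the function name is hypothetical.

```python
def dx2_speedup(bus_bound_fraction, clock_multiplier=2.0):
    """Speedup over the undoubled chip when only the core runs faster.

    bus_bound_fraction: share of the original run time spent waiting
    on off-chip accesses (0.0 .. 1.0); that share is left unchanged.
    """
    if not 0.0 <= bus_bound_fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    core_fraction = 1.0 - bus_bound_fraction
    return 1.0 / (bus_bound_fraction + core_fraction / clock_multiplier)

for f in (0.0, 0.1, 0.25, 0.5):
    print(f"bus-bound {f:4.0%}: speedup {dx2_speedup(f):.2f}x")
```

With no off-chip accesses the model yields the full factor of two described above; the larger the bus-bound share, the further a clock-doubled chip falls behind a chip whose bus also runs at the doubled frequency.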
The truly wonderful thing about the 486DX2 is that it allows easy upgrading of 25 and 33 MHz 486 systems, since the 486DX2 is completely pin-compatible with the 486DX: you need just take out the 486DX and plug in the new 486DX2. Note that power consumption of the 486DX2-50 equals that of the 486DX-50 (4000 mW typical, 4750 mW max.), and that the 486DX2-66 exceeds this by about 25% (4875 mW typical, 6000 mW max.). These chips get *really* hot in a standard PC case with no extra cooling, even if they come with an attached heat sink by default. (See the discussion above for more detailed information on this problem and possible solutions).
Intel 487SX
The 487SX is the math coprocessor intended for use in 486SX systems. The 486SX is basically a 486DX without the floating-point unit (FPU) [48, 50]. (Originally Intel sold 486DXs with a defective FPU as 486SXs but it has now completely removed the FPU part from the 486SX mask for mass production.) The introduction of the 486SX in 1991 has been viewed by many as a marketing 'trick' by Intel to take market share from the 386 based systems once AMD became successful with their Am386. (AMD has taken as much as 40% of the 386 market due to some superior features such as higher clock frequency, lower power consumption, fully static design, and availability of a 3V version). A 486SX at 20 MHz delivers a bit less integer performance than a 40 MHz Am386.
To add floating-point capabilities to a 486SX based system, it would seem to be easiest to swap the 486SX for a 486DX, which includes the FPU on-chip. However, Intel has prevented this easy solution by giving the 486SX a slightly different pinout [48, 51]. Since only three pins are assigned differently, clever board manufacturers have come out with boards that accept anything from a 486SX-20 to a 486DX2-50 in their CPU socket and by doing so provide a clean upgrade path. A set of three jumpers ensures correct signal assignment to the changed pins for either CPU type. To upgrade 486SX systems without this feature, you are forced to buy a 487SX and install it in the "Performance Upgrade Socket" (present in most systems).
Once the 487SX was available, it was quickly found out that it is just a normal 486DX with a slightly different pinout [49]. Technically speaking, the solution Intel chose was the only practical way to provide a 486SX system with the high level of floating-point performance the 486DX offers. The CPU and FPU must be on the same chip; otherwise, the FPU cannot make use of the CPU's internal cache and there would be considerable overhead in CPU-FPU communication (similar to a 386/387 system), nullifying most of the arithmetic speedups over the 387. That the 486SX, 487SX, and 486DX are *not* pin-compatible seems to be purely for marketing reasons.
To upgrade a 486SX based system, Intel also offers the OverDrive chip, which is just the same as a 487SX with internal clock doubling. It also goes into the motherboard's "Performance Upgrade Socket". The OverDrive roughly doubles the performance of a 486SX/487SX based system. (For an explanation of clock doubling, see the description of the Intel 486DX2 above.)
Inserting the 487SX effectively shuts down the 486SX in the 486SX/487SX system, so the 486SX could be removed once the 487SX is installed. Since the shutdown is logical, not electrical, the 486SX still uses power if used with the 487SX, although it is inoperative. As with the 486SX, the 487SX is currently available in 20 MHz and 25 MHz versions. At 20 MHz, the 487SX has a power consumption of max. 4000 mW (3250 mW typical). It is available in a 169-pin ceramic PGA (pin grid array).
Weitek 1167
This math coprocessor was the predecessor of the Weitek Abacus 3167. It was actually a small printed circuit board with three chips mounted on it. In contrast to the Weitek 3167, the 1167 did not have a square root instruction; instead, the square root function was computed by means of a subroutine in the Weitek transcendental function library. However, the 1167 did have a mode in which it supported denormal numbers. (The Weitek 3167 and 4167 only implement the 'fast' mode, in which denormals are not supported.) Overall performance of the 1167 is slightly less than that of the Weitek 3167.
Weitek 3167
The 3167 was introduced by Weitek in 1989 and provided the fastest floating-point performance possible on a 386 based system at that time. The 3167 is not a real coprocessor, strictly speaking, but rather a memory-mapped peripheral device. The architecture of the 3167 was optimized for speed wherever possible. Besides using the faster memory mapped interface to the CPU (the 80x87 uses IO-ports), it does not support many of the features of the 80x87 coprocessors, allowing all of the chip's resources to be concentrated on the fast execution of the basic arithmetic operations. (For a more detailed description of the Weitek 3167, see the first chapter of this document.)
In benchmark comparisons, the Weitek 3167 provided up to 2.5 times the performance of an Intel 387DX coprocessor. For example, on a 33 MHz 3167 the Whetstone benchmark performed at 7574 kWhetstones/sec compared with 3743 kWhetstones/sec for the Intel 387DX. (Note, however, that these are single-precision results and that the Weitek 3167's performance would drop to about half the stated rate for double-precision, while the value for the Intel 387DX would change very little.) In any case, before the advent of the Intel RapidCAD, the Weitek 3167 usually outperformed all 387-compatible coprocessors, even for double-precision operations [63,65,69]. For typical applications, the advantage of the Weitek 3167 over the 387 clones is much smaller. In a benchmark test using AutoDesk's 3D-Studio, the Weitek 3167 performed at 123% of the Intel 387DX's performance, compared with 106% for the Cyrix FasMath 83D87 and 118% for the Intel RapidCAD.
The Weitek Abacus 3167 is packaged in a 121-pin PGA that fits into an EMC socket (provided in most 386-based systems). It does *not* fit into the normal 68-pin PGA socket intended for a 387 coprocessor.
To get the best of both worlds, one might want to use a Weitek 3167 and a 387 compatible coprocessor in the same system. These coprocessors can coexist in the same system without problems; however, most 386-based systems contain only one coprocessor socket, usually of the EMC (extended math coprocessor) type. Thus, you can install either a 387 coprocessor or a Weitek 3167, but not both at the same time. There *are* small daughter boards available that plug into the EMC socket and provide two sockets, an EMC and a standard coprocessor socket.
At 25 MHz, the Weitek 3167 has a power consumption of max. 1750 mW. At 33 MHz, max. power consumption is 2250 mW.
Weitek 4167
The 4167 is a memory-mapped coprocessor that has the same architecture as the 3167; it is designed to provide 486-based systems with the highest floating-point performance available. It executes coprocessor instructions at three to four times the speed of the Weitek 3167. Although it is up to 80% faster than the Intel 486 in some benchmarks [1,69], the performance advantage for real applications is probably more like 10%. The introduction of the 486DX2 processors has more or less obliterated the need for a Weitek 4167, since the DX2 CPUs provide the same performance as the Weitek, as well as the additional features the 80x87 architecture has that the Weitek does not.
The Weitek 4167 is packaged in a 142-pin PGA package that is only slightly smaller than the 486's package. At 25 MHz, it has a max. power consumption of 2500 mW [32].
33.3 MHz            PEAKFLOP  TRNSFORM       LLL   Linpack Whetstone    Savage
                      MFLOPS    MFLOPS    MFLOPS    MFLOPS kWhet/sec  Func/sec
Intel 386DX WITH:
EM87 emulator         0.0070    0.0040    0.0050    0.0050        26       418  ##
Franke387 emu.        0.0307    0.0246    0.0194    0.0179       137      3335  $$
TP/MS-FORT emu        0.0263    0.0227    0.0167    0.0158       133      3160  %%
Q387 emulator         0.0920    0.0664    0.0305    0.0304       251      4796  ((
Intel 387DX           0.7647    0.6004    0.3283    0.2676      2046     43860
ULSI 83C87            1.0097    0.6609    0.3239    0.2598      2089     47431
IIT 3C87              0.8455    0.5957    0.3198    0.2646      2203     49020
IIT 3C87,4X4          0.8455    1.4334    0.3198    0.2646      2203     49020  ??
C&T 38700             0.9455    0.6907    0.3338    0.2700      2376     62565
Cyrix 387+            0.9286    0.6806    0.3293    0.2669      2435     66890
Cyrix EMC87           1.0400    0.6628    0.3352    0.2808      2540     71685  //
Intel RapidCAD        1.8572    1.5798    0.6072    0.4533      3953     72464
Intel 486DX           2.0800    1.7779    0.9387    0.6682      5143     82192

40 MHz              PEAKFLOP  TRNSFORM       LLL   Linpack Whetstone    Savage
                      MFLOPS    MFLOPS    MFLOPS    MFLOPS kWhet/sec  Func/sec
Intel 386DX WITH:
EM87 emulator         0.0084    0.0080    0.0060    0.0060        31       502  ##
Franke387 emu.        0.0369    0.0295    0.0233    0.0215       164      4002  $$
TP/MS-FORT emu        0.0316    0.0273    0.0200    0.0190       160      3794  %%
Q387 emulator         0.1103    0.0798    0.0365    0.0364       301      5758  ((
Intel 387DX           0.9204    0.7212    0.3932    0.3211      2428     52677
ULSI 83C87            1.2093    0.7936    0.3890    0.3120      2528     56926
IIT 3C87              1.0196    0.7145    0.3834    0.3179      2663     58766
IIT 3C87,4X4          1.0196    1.7244    0.3834    0.3179      2663     58766  ??
C&T 38700             1.0722    0.7908    0.4007    0.3222      2837     74906
Cyrix 387+            1.1305    0.8162    0.3945    0.3208      2946     80322
Cyrix EMC87           1.2381    0.7963    0.4025    0.3324      3061     86083  //
Intel RapidCAD        2.2128    1.8931    0.7377    0.5432      4810     86957
Intel 486DX           2.4762    2.1335    1.1110    0.8204      6195     98522
33.3 MHz            PEAKFLOP  TRNSFORM       LLL   Linpack Whetstone    Savage
                      MFLOPS    MFLOPS    MFLOPS    MFLOPS kWhet/sec  Func/sec
Cyrix 486DLC (cache off) WITH:
EM87 emulator         0.0089    0.0082    0.0062    0.0063        31       472  ##
Franke387 emu.        0.0402    0.0324    0.0258    0.0240       184      4807  $$
TP/MS-FORT emu        0.0346    0.0288    0.0206    0.0212       173      4401  %%
Q387 emulator         0.1214    0.0810    0.0368    0.0382       320      6020  ((
Intel 387DX           0.8455    0.6552    0.3659    0.3033      2249     48780
ULSI 83C87            1.1818    0.7543    0.3752    0.3026      2381     53476
IIT 3C87              0.9541    0.6609    0.3653    0.3036      2476     55814
IIT 3C87,4X4          0.9541    1.4988    0.3653    0.3036      2476     55814  ??
C&T 38700             1.1183    0.7644    0.3796    0.3087      2703     73350
Cyrix 387+            1.1305    0.7445    0.3727    0.3060      2731     81967
Cyrix EMC87           1.2236    0.7593    0.3823    0.3144      2908     88889  //
Intel RapidCAD        1.8572    1.5798    0.6072    0.4533      3953     72464
Intel 486DX           2.0800    1.7779    0.9387    0.6682      5143     82192

40.0 MHz            PEAKFLOP  TRNSFORM       LLL   Linpack Whetstone    Savage
                      MFLOPS    MFLOPS    MFLOPS    MFLOPS kWhet/sec  Func/sec
Cyrix 486DLC (cache off) WITH:
EM87 emulator         0.0107    0.0098    0.0075    0.0075        37       567  ##
Franke387 emu.        0.0488    0.0392    0.0311    0.0288       223      5808  $$
TP/MS-FORT emu        0.0416    0.0345    0.0246    0.0253       208      5284  %%
Q387 emulator         0.1463    0.0973    0.0442    0.0458       384      7237  ((
Intel 387DX           1.0196    0.7880    0.4375    0.3644      2712     58479
ULSI 83C87            1.4247    0.9064    0.4506    0.3630      2868     64171
IIT 3C87              1.1556    0.7963    0.4399    0.3611      2988     66964
IIT 3C87,4X4          1.1556    1.7916    0.4399    0.3611      2988     66964  ??
C&T 38700             1.3333    0.9210    0.4548    0.3708      3254     88106
Cyrix 387+            1.3507    0.8958    0.4477    0.3754      3297     98361
Cyrix EMC87           1.4648    0.9136    0.4548    0.3773      3505    106572  //
Intel RapidCAD        2.2128    1.8931    0.7377    0.5432      4810     86957
Intel 486DX           2.4762    2.1335    1.1110    0.8204      6195     98522

33.3 MHz            PEAKFLOP  TRNSFORM       LLL   Linpack Whetstone    Savage
                      MFLOPS    MFLOPS    MFLOPS    MFLOPS kWhet/sec  Func/sec
Cyrix 486DLC (cache on) WITH:
EM87 emulator         0.0099    0.0089    0.0068    0.0069        35       550  ##
Franke387 emu.        0.0462    0.0362    0.0288    0.0265       205      5445  $$
TP/MS-FORT emu        0.0410    0.0330    0.0234    0.0241       198      5339  %%
Q387 emulator         0.1344    0.0902    0.0389    0.0403       339      6241  ((
Intel 387DX           0.8525    0.6552    0.3941    0.3279      2332     49834
ULSI 83C87            1.2093    0.7543    0.4068    0.3270      2478     57197
IIT 3C87              0.9720    0.6609    0.3959    0.3295      2579     57252
IIT 3C87,4X4          0.9720    1.5087    0.3959    0.3295      2579     57252  ??
C&T 38700             1.1305    0.7644    0.4126    0.3343      2839     75949
Cyrix 387+            1.1429    0.7445    0.4023    0.3310      2866     85349
Cyrix EMC87           1.2381    0.7593    0.4150    0.3412      3051     93897  //
Intel RapidCAD        1.8572    1.5798    0.6072    0.4533      3953     72464
Intel 486DX           2.0800    1.7779    0.9387    0.6682      5143     82192

40.0 MHz            PEAKFLOP  TRNSFORM       LLL   Linpack Whetstone    Savage
                      MFLOPS    MFLOPS    MFLOPS    MFLOPS kWhet/sec  Func/sec
Cyrix 486DLC (cache on) WITH:
EM87 emulator         0.0118    0.0107    0.0082    0.0082        42       659  ##
Franke387 emu.        0.0565    0.0438    0.0350    0.0313       248      6585  $$
TP/MS-FORT emu        0.0491    0.0395    0.0279    0.0296       238      6408  %%
Q387 emulator         0.1610    0.1084    0.0470    0.0484       407      7509  ((
Intel 387DX           1.0297    0.7880    0.4748    0.3937      2801     59821
ULSI 83C87            1.4445    0.9028    0.4891    0.3926      2976     65789
IIT 3C87              1.1686    0.7963    0.4734    0.3916      3096     68729
IIT 3C87,4X4          1.1686    1.8057    0.4734    0.3916      3096     68729  ??
C&T 38700             1.3685    0.9173    0.4958    0.4012      3401     91185
Cyrix 387+            1.3867    0.8958    0.4887    0.3962      3448    102564
Cyrix EMC87           1.4857    0.9100    0.4959    0.4091      3676    112360  //
Intel RapidCAD        2.2128    1.8931    0.7377    0.5432      4810     86957
Intel 486DX           2.4762    2.1335    1.1110    0.8204      6195     98522
33.3 MHz            PEAKFLOP  TRNSFORM       LLL   Linpack Whetstone    Savage
                      MFLOPS    MFLOPS    MFLOPS    MFLOPS kWhet/sec  Func/sec
C&T 38600DX WITH:
Intel 387DX           0.7376    0.5620    0.3337    0.2636      2066     45489
ULSI 83C87            0.5226    0.4690    0.3236    0.2654      2087     43228
IIT 3C87              0.7879    0.5762    0.3397    0.2674      2263     51195
IIT 3C87,4X4          0.7879    0.6181    0.3397    0.2674      2263     51195  $$
C&T 38700             0.5977    0.5572    0.3463    0.2681      2338     63966
Cyrix 387+            0.5896    0.5508    0.3438    0.2673      2375     66741
Intel RapidCAD        1.8572    1.5798    0.6072    0.4533      3953     72464
Intel 486             2.0800    1.7779    0.9387    0.6682      5143     82192

For comparison:     PEAKFLOP  TRNSFORM       LLL   Linpack Whetstone    Savage
                      MFLOPS    MFLOPS    MFLOPS    MFLOPS kWhet/sec  Func/sec
i486DX2-66            4.1601    3.4227    1.6531    1.3010     10655    163934
i486DX2-50            3.0589    2.6665    1.2537    0.9744      7962    123203
i387, 20 MHz          0.2253    0.3271    0.1434    0.1171       952     21739  ++
i387DX, 20 MHz        0.3567    0.4444    0.1484    0.1161      1034     24155  &&
i80287, 5 MHz         0.0281    0.0310    0.0242    0.0222       150      3261  !!
i8087, 9.54 MHz       0.0636    0.0705    0.0321    0.0219       234      5782  **
System A: Motherboard with Forex chip set, 128 KB CPU Cache, 8 MB RAM
Hardware configuration for test of 486 FPU (extra fan for 40 MHz operation):
System B: Motherboard with SIS chip set, 256 KB CPU Cache, 8 MB RAM
## -- EM87 V1.2 by Ron Kimball is a public domain coprocessor emulator that loads as a TSR. It relies on the INT 7 trap that 80286, 80386, and 486SX systems without a coprocessor raise upon encountering a coprocessor instruction, and emulates the trapped instruction. Whetstone and Savage benchmarks for this test were compiled with the original TP 6.0 library, as EM87 chokes on the 387 specific FSIN and FCOS instructions used in my own library if a 387 is detected. Evidently EM87 identifies itself as a 387, but it has no support for 387-specific instructions.
$$ -- Franke387 is a commercial 387 emulator that is also available in a shareware version. For this test, shareware version V2.4 was used. Franke387 unlike many other emulators supports all 387 instructions. It is loaded as a device driver and uses INT 7 to trap coprocessor instructions.
(( -- Q387 is an emulator that is distributed as a shareware program by Quickware of Austin, Texas. As the name implies, this emulator uses 386 specific code and supports the full 387 instruction set. The program is about 330 kByte in size and loads completely into extended memory, using absolutely no DOS memory. It is loaded as a TSR and requires an EMM (expanded memory manager) to be present. The emulation uses the INT 7 mechanism. The version of Q387 used was 3.0a.
%% -- These benchmarks were run using the built-in coprocessor emulators of the TP 6.0 (for Savage, LLL, Whetstone, TRNSFORM, PEAKFLOP) and the MS FORTRAN 5.0 (for Linpack) run-time libraries. The libraries were forced not to use a coprocessor by means of the environment settings NO87=NC and 87=N.
$$ -- The 3C87 specific F4X4 instruction was used in the vector transformation benchmark.
// -- The EMC87 was used in the 387-compatible mode only. The faster memory-mapped mode was *not* used. Times should therefore be identical to the Cyrix 83D87.
++ -- Older motherboard with no chip set (discrete logic), no CPU cache, 16 MB RAM
&& -- System A, CPU cache disabled via extended set-up, turbo-switch set to half speed (that is, 20 MHz)
!! -- 80386 @ 20 MHz / Intel 80287 @ 5 MHz, no CPU cache, 4 MB RAM. Due to the fast CPU used here, performance figures are somewhat higher than can be expected for an 80286/287 combination, except for the PEAKFLOP benchmark, which is basically coprocessor limited.
** -- 8086/8087 system with 640 KB RAM
                        Single Prec.      Double Prec.      Double Prec.
                        3167     4167     3167     4167      387      486
Linpack MFLOPS           1.8      5.0      0.8      3.2      0.4      1.6
Whetstone kWhet/sec     7470    22700     4900    14000     3290    12300

Note that for the Intel coprocessors, running programs in single vs. double-precision doesn't provide much of a performance advantage since all internal calculations are always done in extended precision. Using Weitek coprocessors, however, performance nearly doubles in single-precision mode. For double-precision calculations using only basic arithmetic, the Weitek Abacus can at most provide performance at twice the level of the respective Intel coprocessor (387/486) at the same clock speed.
Comparison of floating-point performance [30,32]

single-precision        Weitek 4167-33   Intel 486-33   Intel 486DX2-66
Linpack MFLOPS                     5.0            1.8               3.5
Whetstone kWhet/sec              22700          12700             25500

double-precision        Weitek 4167-33   Intel 486-33   Intel 486DX2-66
Linpack MFLOPS                     3.5            1.6               3.1
Whetstone kWhet/sec              14000          12300             24700
                      Intel    Intel    Cyrix    Cyrix      C&T     ULSI      IIT    Intel    Intel
                       i486 RapidCAD    83D87     387+    38700    83C87     3C87    387DX    80387
FLD1                      4        3       14       14       14       18       24       23       26
FLDZ                      4        3       14       14       14       18       24       23       31
FLDPI                     7        8       14       15       14       18       24       38       45
FLDLG2                    7        8       14       14       14       18       24       33       45
FLDL2T                    7        8       14       14       14       19       24       38       45
FLDL2E                    7        8       14       14       14       19       24       38       45
FLDLN2                    7        8       14       14       14       19       24       38       45
FLD ST(0)                 4        4       14       14       14       14       24       20       21
FST ST(1)                 3        4       14       14       14       14       19       18       22
FSTP ST(0)                4        4       14       14       14       15       19       19       22
FSTP ST(1)                4        4       15       15       14       15       19       20       22
FLD ST(1)                 4        4       14       14       14       14       24       18       21
FXCH ST(1)                4        4       14       20       14       19       24       24       27
FILD [Word]              12       16       33       37       32       42       38       47       62
FILD [DWord]              8       11       26       26       21       32       28       35       45
FILD [QWord]              9       15       30       30       25       36       32       34       54
FLD [DWord]               3        5       26       26       21       23       28       20       25
FLD [QWord]               3        7       30       30       25       27       32       24       35
FLD [TByte]               5       11       46       46       46       46       47       46       57
FBLD [TByte]             83       90       66       86      106      146      197       71      278
FIST [Word]              31       31       37       40       37       42       51       69       90
FIST [DWord]             29       30       35       40       35       40       49       66       84
FST [DWord]               7        7       35       37       32       40       33       37       40
FST [QWord]               8        9       43       43       39       47       40       45       51
FISTP [Word]             32       32       42       40       37       43       46       70       90
FISTP [DWord]            31       31       40       40       35       41       50       67       87
FISTP [QWord]            29       29       44       44       42       48       56       73       92
FSTP [DWord]              8        8       38       36       32       41       35       38       43
FSTP [QWord]              9        9       46       43       39       48       42       46       49
FSTP [TByte]              8        8       50       45       49       50       48       53       58
FBSTP [TByte]           170      172       98       98      114      129      218      144      533
FINIT                    17       31       15       16       15       15       16       16       25
FCLEX                     7       20       15       16       16       16       16       16       25
FCHS                      7        8       14       15       14       14       19       30       33
FABS                      5        5       14       15       14       14       19       30       33
FXAM                     12       13       14       15       14       14       19       39       43
FTST                      5        5       19       25       14       24       24       34       38
FSTENV                   67       82      125      125      124      132      124      159      165
FLDENV                   44       59      106      106      112      120      106      119      129
FSAVE                   181      169      355      355      374      361      376      469      511
FRSTOR                  130      203      358      358      385      372      371      420      456
FSTSW [mem]               4        5       14       14       14       14       14       14       17
FSTSW AX                  3        4       12       12       11       11       11       11       14
FSTCW [mem]               4        5       14       14       13       13       13       14       18
FLDCW [mem]               4       11       26       26       31       32       27       32       36
FADD ST,ST(0)             8        9       19       20       19       19       24       24       32
FADD ST,ST(1)             9        9       19       20       19       18       24       20       32
FADD ST(1),ST            10       10       19       20       19       18       24       24       37
FADDP ST(1),ST           11       11       19       19       19       16       24       25       37
FADD [DWord]              9       10       25       28       22       23       23       21       34
FADD [QWord]              9       10       32       32       26       27       27       25       38
FIADD [Word]             20       21       34       34       33       40       40       52       80
FIADD [DWord]            20       21       27       28       27       30       30       37       61
FSUB ST(1),ST            10       10       19       20       19       19       24       24       38
FSUBR ST(1),ST            9       10       19       22       19       19       24       27       38
FSUBRP ST(1),ST          10       10       19       19       22       20       24       25       38
FSUB [DWord]             11       12       27       28       27       23       29       27       32
FSUB [QWord]             11       12       32       32       31       27       33       26       44
FISUB [Word]             21       21       34       34       34       40       40       52       80
FISUB [DWord]            21       22       27       28       27       29       30       40       60
FMUL ST,ST(1)            16       17       19       25       24       24       29       38       57
FMUL ST(1),ST            16       17       19       24       24       24       29       40       62
FMULP ST(1),ST           17       17       19       24       24       25       29       40       58
FIMUL [Word]             22       23       40       40       37       46       46       52       80
FIMUL [DWord]            22       23       27       28       27       36       35       45       68
FMUL [DWord]             11       12       27       28       27       28       29       25       45
FMUL [QWord]             14       15       32       32       31       32       33       37       61
FDIV ST,ST(0)            73       74       26       40       59       54       54       89      100
FDIV ST,ST(1)            73       74       36       45       59       54       54       77      100
FDIV ST(1),ST            73       74       36       45       59       55       54       78      102
FDIVR ST(1),ST           73       74       36       45       59       54       54       77      102
FDIVRP ST(1),ST          73       74       36       44       59       55       54       76      106
FIDIV [Word]             84       85       52       58       75       76       76      105      141
FIDIV [DWord]            84       85       45       46       65       65       65      101      123
FDIV [DWord]             73       74       45       46       63       56       59       77      101
FDIV [QWord]             73       74       50       50       67       60       63       78      103
FSQRT (0.0)              25       25       19       19       14       19       24       29       37
FSQRT (1.0)              83       84       36       74       54       89       59      109      132
FSQRT (L2T)              86       87       36       74       54       89       59      104      137
FXTRACT (L2T)            17       17       19       19       19       28       79       53       72
FSCALE (PI,5)            30       30       36       24       24       49       79       59       82
FRNDINT (PI)             31       31       19       29       24       34       29       49       82
FPREM (99,PI)            58       59       54       99       44       54       49       79       96
FPREM1 (99,PI)           90       91       54       99       44       59       54      104      121
FCOM                      5        6       15       20       19       25       19       29       32
FCOMP                     6        6       15       19       19       25       19       30       33
FCOMPP                    7        7       15       19       19       25       19       31       40
FICOM [Word]             16       17       34       34       33       46       34       58       76
FICOM [DWord]            16       16       21       28       21       35       23       45       57
FCOM [DWord]              5        6       21       28       22       23       23       27       34
FCOM [QWord]              5        8       27       32       25       27       27       31       39
FSIN (0.0)               24       24       14       99       14       19       24       39       43
FSIN (1.0)              310      313      114      164      144      494      219      509      596
FSIN (PI)                88       89      118      189       64       64      214      134      152
FSIN (LG2)              292      295       72       89      139      454      184      449      531
FSIN (L2T)              299      302      123      179      164      469      214      454      536
FCOS (0.0)               24       24       19      159       14       19       24       34       42
FCOS (1.0)              302      305       84      104      139      489      214      459      547
FCOS (PI)                88       89      154      254       64       64      224      199      232
FCOS (LG2)              300      303      108      149      139      454      194      504      583
FCOS (L2T)              307      310      159      239      164      469      224      509      601
FSINCOS (0.0)            25       25       14       19       19       18       34       38       55
FSINCOS (1.0)           353      356      124      174      254      493      419      538      636
FSINCOS (PI)            105      106      162      263       79       68      424      228      277
FSINCOS (LG2)           340      343      119      159      249      458      359      533      627
FSINCOS (L2T)           347      350      168      248      274      473      424      538      646
FPTAN (0.0)              25       25       14       19       19       18       29       38       46
FPTAN (1.0)             266      269      119      149      184      538      309      323      396
FPTAN (PI)              145      146      134      228      104      108      304      168      211
FPTAN (LG2)             244      246       94      129      179      498      274      298      363
FPTAN (L2T)             247      249      139      219      204      513      304      298      365
FPATAN (0.0)             38       39       19       24       19       20       29       95       93
FPATAN (1.0)            294      298      124      159       29      375      604      360      433
FPATAN (PI)             304      308      139      188      279      360      424      375      472
FPATAN (LG2)            290      293      128      154      269      365      379      375      448
FPATAN (L2T)            304      308      144      189      274      359      424      375      468
F2XM1 (0.0)              25       25       14       14       14       19       24       34       37
F2XM1 (LN2)             209      211       89      119      169      394      284      299      348
F2XM1 (LG2)             204      206       78      104      159      379      284      294      337
FYL2X (1.0)              60       61       36       39       24       75       94      115      127
FYL2X (PI)              294      297      108      163      249      450      359      395      504
FYL2X (LG2)             311      314      108      159      249      460      339      410      518
FYL2X (L2T)             293      296      108      164      249      439      359      390      501
FYL2XP1 (LG2)           334      337       99      169      234      460      284      435      538

                        Intel      Intel    80386 +    80386 +    80386 +    80386 +
                         8087      80287       Q387  Franke387     TP 6.0       EM87
                                           Emulator   Emulator   Emulator   Emulator
FLD1                       26         55         51        481        422       1626
FLDZ                       21         53         39        480        416       1646
FLDPI                      26         55         51        486        443       1626
FLDLG2                     26         56         51        486        423       1626
FLDL2T                     26         55         51        486        440       1626
FLDL2E                     26         53         52        486        423       1626
FLDLN2                     26         55         52        486        441       1626
FLD ST(0)                  31         55         57        493        362       1851
FST ST(1)                  26         54         61        489        355       1931
FSTP ST(0)                 26         54         46        507        358       2115
FSTP ST(1)                 21         55         66        507        356       2116
FLD ST(1)                  26         55         54        493        362       1852
FXCH ST(1)                 21         57         80        497        486       2187
FILD [Word]                58         90        122        667        712       2259
FILD [DWord]               64         74        121        608        812       2164
FILD [QWord]               74         93        179        652        707       2971
FLD [DWord]                49         44        106        633        473       2077
FLD [QWord]                54         57        118        641        524       2336
FLD [TByte]                59         45        102        607        492       2063
FBLD [TByte]              309        310        736       2019       1512      17827
FIST [Word]                79         72        143        854        766       2418
FIST [DWord]               84         80        136        865        518       2325
FST [DWord]                89         85        124        686        441       2200
FST [QWord]                99         92        135        703
516 2481 FISTP [Word] 79 80 154 864 794 2620 FISTP [DWord] 79 81 144 879 541 2523 FISTP [QWord] 88 75 184 904 916 3226 FSTP [DWord] 89 75 133 713 467 2400 FSTP [QWord] 93 72 142 732 538 2678 FSTP [TByte] 49 21 111 685 467 2124 FBSTP [TByte] 528 472 1124 3305 1555 27013 FINIT 11 10 1079 742 641 1369 FCLEX 11 10 48 440 323 912 FCHS 21 54 45 460 354 1744 FABS 21 54 43 456 349 1738 FXAM 21 54 72 481 380 1551 FTST 51 75 70 585 386 2721 FSTENV 54 57 827 928 519 2104 FLDENV 48 50 780 1125 450 1631 FSAVE 214 244 3929 1949 976 2749 FRSTOR 209 227 2901 2182 657 2225 FSTSW [mem] 28 10 87 516 401 1189 FSTSW AX N/A 55 57 451 N/A N/A FSTCW [mem] 28 10 74 506 359 1167 FLDCW [mem] 19 47 91 524 437 1584 FADD ST,ST(0) 86 128 136 643 706 2805 FADD ST,ST(1) 85 116 146 707 808 3093 FADD ST(1),ST 92 131 157 664 812 3146 FADDP ST(1),ST 92 129 164 704 799 3143 FADD [DWord] 105 122 221 874 969 3139 FADD [QWord] 115 122 232 888 1021 3396 FIADD [Word] 115 122 238 940 1211 3330 FIADD [DWord] 125 122 239 882 1297 3215 FSUB ST(1),ST 88 130 171 738 817 3156 FSUBR ST(1),ST 96 132 181 740 868 3004 FSUBRP ST(1),ST 99 132 193 733 805 3301 FSUB [DWord] 119 122 230 918 1018 3127 FSUB [QWord] 129 123 242 932 1070 3632 FISUB [Word] 115 123 268 977 1081 3802 FISUB [DWord] 125 125 289 940 980 4161 FMUL ST,ST(1) 145 151 297 810 1368 3924 FMUL ST(1),ST 145 151 296 817 1377 3962 FMULP ST(1),ST 148 168 304 840 1365 4164 FIMUL [Word] 132 151 384 1039 1517 4039 FIMUL [DWord] 141 151 383 980 1643 3976 FMUL [DWord] 125 123 345 948 1480 3445 FMUL [QWord] 175 192 387 991 1602 4416 FDIV ST,ST(0) 201 207 274 726 1536 9789 FDIV ST,ST(1) 203 218 299 808 1658 10332 FDIV ST(1),ST 207 214 299 825 1655 10342 FDIVR ST(1),ST 201 206 302 819 1806 10213 FDIVRP ST(1),ST 201 205 309 845 1803 10409 FIDIV [Word] 237 227 390 980 1779 11225 FIDIV [DWord] 246 227 411 944 1680 11572 FDIV [DWord] 229 226 352 893 1722 10577 FDIV [QWord] 236 227 391 993 1777 10829 FSQRT (0.0) 21 57 60 512 382 1755 FSQRT (1.0) 186 206 294 1106 2504 37836 
FSQRT (L2T) 186 207 295 1398 2467 37925 FXTRACT (L2T) 51 56 155 726 571 3326 FSCALE (PI,5) 41 56 95 817 443 3194 FRNDINT (PI) 51 58 136 808 800 7092 FPREM (99,PI) 81 131 322 1696 941 4098 FPREM1(99,PI) N/A N/A 384 1625 N/A N/A FCOM 56 75 155 582 483 2799 FCOMP 61 92 160 616 485 2983 FCOMPP 61 90 149 661 476 3198 FICOM [Word] 79 77 231 808 861 3654 FICOM [DWord] 89 77 231 750 964 3684 FCOM [DWord] 74 75 214 741 625 3643 FCOM [QWord] 74 76 205 754 667 3771 FSIN (0.0) N/A N/A 137 639 N/A N/A FSIN (1.0) N/A N/A 997 4640 N/A N/A FSIN (PI) N/A N/A 322 2488 N/A N/A FSIN (LG2) N/A N/A 978 3911 N/A N/A FSIN (L2T) N/A N/A 1005 3767 N/A N/A FCOS (0.0) N/A N/A 182 740 N/A N/A FCOS (1.0) N/A N/A 988 4777 N/A N/A FCOS (PI) N/A N/A 337 2557 N/A N/A FCOS (LG2) N/A N/A 976 4176 N/A N/A FCOS (L2T) N/A N/A 1001 3905 N/A N/A FSINCOS (0.0) N/A N/A 225 714 N/A N/A FSINCOS (1.0) N/A N/A 1841 6049 N/A N/A FSINCOS (PI) N/A N/A 1167 4091 N/A N/A FSINCOS (LG2) N/A N/A 1525 5640 N/A N/A FSINCOS (L2T) N/A N/A 1552 5405 N/A N/A FPTAN (0.0) 41 58 90 752 8381 2324 FPTAN (1.0) 581 582 1182 6366 10817 29824 FPTAN (PI) 606 587 292 4388 12410 2300 FPTAN (LG2) 516 513 883 5939 12502 26770 FPTAN (L2T) 576 586 954 5723 12483 2301 FPATAN (0.0) 41 55 123 616 1208 10578 FPATAN (1.0) 736 736 171 1426 13446 34208 FPATAN (PI) 206 207 11115 2835 13305 46903 FPATAN (LG2) 756 736 11077 2490 13319 41312 FPATAN (L2T) 206 204 11117 2922 13364 50149 F2XM1 (0.0) 16 56 102 563 723 1722 F2XM1 (LN2) 631 624 905 4178 11070 33823 F2XM1 (LG2) 611 585 890 4798 11116 32163 FYL2X (1.0) 56 57 136 961 1214 4327 FYL2X (PI) 946 961 1008 8987 12858 40148 FYL2X (LG2) 1081 1038 1035 8933 12748 46821 FYL2X (L2T) 926 886 1089 8982 12712 38986 FYL2XP1 (LG2) 1026 1037 1154 10485 11867 44708Clock-cycle timings for floating-point operations on Weitek coprocessors
                Single-precision    Double-precision
                 3167     4167       3167     4167

ABS                 3        2          3        2
NEG                 6        2          6        2
ADD                 6        2          6        2
SUB                 6        2          6        2
SUBR                6        2          6        2
MUL                 6        2         10        3
DIVR               38       17         66       31
SQRT               60       17        118       31
SIN               146      ~50        292     ~100
COS               140      ~50        285     ~100
TAN               188      ~60        340     ~110
EXP               179      ~60        401     ~130
LOG               171      ~60        365     ~120
F->ASCII         1000      N/A       1700      N/A   //
ASCII->F         1100      N/A       1800      N/A   //

// Rough average of the timings given by Weitek for different numeric formats. Note that these conversion routines do much more work than the FBLD and FBSTP instructions provided by the 80x87 coprocessors. FBLD and FBSTP are useful building blocks for conversion routines, but quite a bit of additional code is needed for this purpose.
JT Coonen's 1984 UC Berkeley Ph.D. thesis centers around his activities as a member of the floating-point working group that defined the IEEE 754-1985 Standard for Binary Floating-Point Arithmetic. Appendix C of his thesis presents FPTEST, a Pascal program written by J Thomas and JT Coonen. IEEETEST is a port of FPTEST and runs on PCs whose math coprocessor accepts 80387-compatible floating-point instructions.
IEEETEST reads test vectors from the file TESTVECS and compares the answer returned by the math coprocessor with the answer listed in the test vector. If these answers differ, an 'F' is displayed; otherwise a '.' is displayed. Answers can differ due to two types of failures: numeric failures and flag failures. Numeric failures occur when the computed answer has the wrong value. Flag failures occur when the status (invalid operation, divide by zero, underflow, overflow, inexact) is incorrectly identified.
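The pass/fail decision described above can be sketched in a few lines of Python. This is only an illustration of the comparison logic, not the code of IEEETEST or FPTEST: the `Answer` type, the hex-string encoding of results, and the flag names are assumptions made for the example.

```python
from dataclasses import dataclass

# Status flags named in the text; the identifiers themselves are illustrative.
FLAGS = frozenset({"invalid", "divide_by_zero", "underflow", "overflow", "inexact"})


@dataclass(frozen=True)
class Answer:
    bits: str          # result as a hex string of its bit pattern (assumed encoding)
    flags: frozenset   # status flags raised, a subset of FLAGS


def check(expected: Answer, computed: Answer):
    """Return the display mark ('.' or 'F') and the list of failure types."""
    failures = []
    if computed.bits.lower() != expected.bits.lower():
        failures.append("numeric")   # wrong value
    if computed.flags != expected.flags:
        failures.append("flag")      # wrong status flags
    mark = "F" if failures else "."
    return mark, failures
```

A single test vector can fail in both ways at once, so numeric and flag failures are tallied independently of each other.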
TESTVECS is the concatenation of unmodified versions of all the test vectors distributed by UC Berkeley. The test data base is copyrighted by UC Berkeley (1985) and is being distributed with their permission. FPTEST and the test data base can be obtained by asking for 'IEEE-754 Test Vector' from UC Berkeley, Electrical Engineering and Computer Science, Industrial Liaison Program, 479 Cory Hall, Berkeley, CA, 94720 (415)643-6687.
The initial version of this test data base for the proposed IEEE 754 binary floating-point standard (draft 8.0) was developed for Zilog, Inc. and was donated to the floating-point working group for dissemination. Errors in or additions to the distributed data base should be reported to the agency of distribution, with copies to Zilog, Inc., 1315 Dell Avenue, Campbell, CA, 95008.
IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended

                        |     TESTS     |         TYPE OF FAILURE
                        |               |     numeric    |      flag
Operation          Code | Passed Failed |    S    D    E |    S    D    E
--------------------------------------------------------------------------
Absolute Value      A   |    216      0 |    0    0    0 |    0    0    0
Addition            +   |   3528      0 |    0    0    0 |    0    0    0
Comparison          C   |   4320      0 |    0    0    0 |    0    0    0
Copy Sign           @   |   1488      0 |    0    0    0 |    0    0    0
Division            /   |   4311      0 |    0    0    0 |    0    0    0
Fraction Part       F   |    624      0 |    0    0    0 |    0    0    0
Logb                L   |    960      0 |    0    0    0 |    0    0    0
Multiplication      *   |   3978      0 |    0    0    0 |    0    0    0
Negation            -   |    216      0 |    0    0    0 |    0    0    0
Next After          N   |   2832      0 |    0    0    0 |    0    0    0
Round to Integer    I   |    558      0 |    0    0    0 |    0    0    0
Scalb               S   |    948      0 |    0    0    0 |    0    0    0
Square Root         V   |    744      0 |    0    0    0 |    0    0    0
Subtraction         -   |   3528      0 |    0    0    0 |    0    0    0
Remainder           %   |   2984      0 |    0    0    0 |    0    0    0
Totals                  |  31235      0 |
IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended

                        |     TESTS     |         TYPE OF FAILURE
                        |               |     numeric    |      flag
Operation          Code | Passed Failed |    S    D    E |    S    D    E
--------------------------------------------------------------------------
Absolute Value      A   |    216      0 |    0    0    0 |    0    0    0
Addition            +   |   3528      0 |    0    0    0 |    0    0    0
Comparison          C   |   4312      8 |    0    0    0 |    0    0    8
Copy Sign           @   |   1488      0 |    0    0    0 |    0    0    0
Division            /   |   4250     61 |    0    0    0 |   28   28    5
Fraction Part       F   |    624      0 |    0    0    0 |    0    0    0
Logb                L   |    960      0 |    0    0    0 |    0    0    0
Multiplication      *   |   3936     42 |    0    0    0 |   19   19    4
Negation            -   |    216      0 |    0    0    0 |    0    0    0
Next After          N   |   2828      4 |    0    0    0 |    0    0    4
Round to Integer    I   |    558      0 |    0    0    0 |    0    0    0
Scalb               S   |    930     18 |    0    0    0 |    6    6    6
Square Root         V   |    744      0 |    0    0    0 |    0    0    0
Subtraction         -   |   3528      0 |    0    0    0 |    0    0    0
Remainder           %   |   2984      0 |    0    0    0 |    0    0    0
Totals                  |  31102    133 |
IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended

                        |     TESTS     |         TYPE OF FAILURE
                        |               |     numeric    |      flag
Operation          Code | Passed Failed |    S    D    E |    S    D    E
--------------------------------------------------------------------------
Absolute Value      A   |    216      0 |    0    0    0 |    0    0    0
Addition            +   |   3528      0 |    0    0    0 |    0    0    0
Comparison          C   |   4320      0 |    0    0    0 |    0    0    0
Copy Sign           @   |   1488      0 |    0    0    0 |    0    0    0
Division            /   |   4296     15 |    0    0    0 |    5    5    5
Fraction Part       F   |    624      0 |    0    0    0 |    0    0    0
Logb                L   |    960      0 |    0    0    0 |    0    0    0
Multiplication      *   |   3966     12 |    0    0    0 |    4    4    4
Negation            -   |    216      0 |    0    0    0 |    0    0    0
Next After          N   |   2828      4 |    0    0    0 |    0    0    4
Round to Integer    I   |    558      0 |    0    0    0 |    0    0    0
Scalb               S   |    930     18 |    0    0    0 |    6    6    6
Square Root         V   |    744      0 |    0    0    0 |    0    0    0
Subtraction         -   |   3528      0 |    0    0    0 |    0    0    0
Remainder           %   |   2984      0 |    0    0    0 |    0    0    0
Totals                  |  31102     45 |
IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended

                        |     TESTS     |         TYPE OF FAILURE
                        |               |     numeric    |      flag
Operation          Code | Passed Failed |    S    D    E |    S    D    E
--------------------------------------------------------------------------
Absolute Value      A   |    200     16 |    0    0   16 |    0    0    0
Addition            +   |   3336    192 |    0    0  128 |    0    0   96
Comparison          C   |   4224     96 |    0    0   96 |    0    0    0
Copy Sign           @   |   1488      0 |    0    0    0 |    0    0    0
Division            /   |   4159    152 |    0    0  124 |    0    0  116
Fraction Part       F   |    600     24 |    0    0   24 |    0    0   24
Logb                L   |    960      0 |    0    0    0 |    0    0    0
Multiplication      *   |   3702    276 |    0    0  248 |    0    0  100
Negation            -   |    200     16 |    0    0   16 |    0    0    0
Next After          N   |   2248    584 |    0    0  584 |    0    0  168
Round to Integer    I   |    542     16 |    0    0    4 |    0    0   16
Scalb               S   |    874     74 |    5    5   44 |    8    8   20
Square Root         V   |    688     56 |    0    0   56 |    0    0   56
Subtraction         -   |   3336    192 |    0    0  128 |    0    0   96
Remainder           %   |   2844    140 |    0    0  140 |    0    0  116
Totals                  |  29401   1834 |
IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended

                        |     TESTS     |         TYPE OF FAILURE
                        |               |     numeric    |      flag
Operation          Code | Passed Failed |    S    D    E |    S    D    E
--------------------------------------------------------------------------
Absolute Value      A   |    216      0 |    0    0    0 |    0    0    0
Addition            +   |   2886    642 |   16   16  112 |  174  174  174
Comparison          C   |   3612    708 |  136  136  136 |  228  228  228
Copy Sign           @   |   1488      0 |    0    0    0 |    0    0    0
Division            /   |   3777    534 |   18   18   37 |  169  169  165
Fraction Part       F   |    552     72 |   24   24   24 |   24   24   24
Logb                L   |    900     60 |   12   12   12 |   20   20   20
Multiplication      *   |   2944   1034 |  105  105  197 |  303  303  231
Negation            -   |    216      0 |    0    0    0 |    0    0    0
Next After          N   |    516   2316 |  168  168  332 |  764  764  764
Round to Integer    I   |    546     12 |    0    0    0 |    4    4    4
Scalb               S   |    663    285 |   45   43   26 |  102   98   46
Square Root         V   |    720     24 |    4    4    4 |    8    8    8
Subtraction         -   |   2886    642 |   16   16  112 |  174  174  174
Remainder           %   |   1490   1494 |  432  432  288 |  342  342  230
Totals                  |  23412   7823 |
IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended

                        |     TESTS     |         TYPE OF FAILURE
                        |               |     numeric    |      flag
Operation          Code | Passed Failed |    S    D    E |    S    D    E
--------------------------------------------------------------------------
Absolute Value      A   |    216      0 |    0    0    0 |    0    0    0
Addition            +   |   2886    642 |   16   16  112 |  174  174  174
Comparison          C   |      0   4320 | 1324 1324 1324 | 1332 1332 1332
Copy Sign           @   |   1488      0 |    0    0    0 |    0    0    0
Division            /   |   3777    534 |   18   18   37 |  169  169  165
Fraction Part       F   |    552     72 |   24   24   24 |   24   24   24
Logb                L   |    900     60 |   12   12   12 |   20   20   20
Multiplication      *   |   2944   1034 |  105  105  197 |  303  303  231
Negation            -   |    216      0 |    0    0    0 |    0    0    0
Next After          N   |    348   2484 |  768  768  768 |  504  504  526
Round to Integer    I   |    546     12 |    0    0    0 |    4    4    4
Scalb               S   |    663    285 |   45   43   26 |  102   98   46
Square Root         V   |    720     24 |    4    4    4 |    8    8    8
Subtraction         -   |   2886    642 |   16   16  112 |  174  174  174
Remainder           %   | ######## not run since machine hangs #######
IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended

                        |     TESTS     |         TYPE OF FAILURE
                        |               |     numeric    |      flag
Operation          Code | Passed Failed |    S    D    E |    S    D    E
--------------------------------------------------------------------------
Absolute Value      A   |    152     64 |    0    0    8 |   24   24    8
Addition            +   |   1587   1941 |  178  178  722 |  508  508  616
Comparison          C   |   3696    624 |  208  208  208 |    4    4  108
Copy Sign           @   |   1200    288 |    0    0    0 |  144  144    0
Division            /   | ######## not run since machine hangs #######
Fraction Part       F   |    624      0 |    0    0    0 |    0    0    0
Logb                L   |    908     52 |    0    0   16 |   16   16    4
Multiplication      *   | ######## not run since machine hangs #######
Negation            -   |    152     64 |    0    0    8 |   24   24    8
Next After          N   |   1404   1420 |  404  404  596 |   80   80  172
Round to Integer    I   |    514     44 |    4    4   20 |    8    8   16
Scalb               S   | ######## not run since machine hangs #######
Square Root         V   |    569    175 |   14   31   54 |   28   48   72
Subtraction         -   |   1827   1701 |   98   98  642 |  452  452  576
Remainder           %   | ######## not run since machine hangs #######
IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended

                        |     TESTS     |         TYPE OF FAILURE
                        |               |     numeric    |      flag
Operation          Code | Passed Failed |    S    D    E |    S    D    E
--------------------------------------------------------------------------
Absolute Value      A   |    104    112 |   42   38   16 |   24   24    0
Addition            +   |    911   2617 |  746  637  637 |  672  672  380
Comparison          C   |   3180   1140 |  380  380  380 |  108  108  108
Copy Sign           @   |    696    792 |  320  280    0 |  288  288    0
Division            /   |    900   3411 |  673  574  814 |  977  977  821
Fraction Part       F   |    348    276 |  154   82   40 |   24   24   24
Logb                L   |    656    304 |  136  100   36 |   24   24   12
Multiplication      *   |   1023   2955 |  759  663  857 |  670  670  442
Negation            -   |     86    130 |   44   38   32 |   24   24    0
Next After          N   |    464   2368 |  780  780  796 |  344  344  320
Round to Integer    I   |    273    285 |   95   74   52 |   72   72   68
Scalb               S   |    254    694 |  217  192  137 |  176  168  136
Square Root         V   |    128    616 |  192  180  147 |  196  196  188
Subtraction         -   |    911   2617 |  746  637  637 |  672  672  372
Remainder           %   |    558   2426 |  903  859  664 |  508  508  220
Totals                  |  10492  20743 |
IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended

                        |     TESTS     |         TYPE OF FAILURE
                        |               |     numeric    |      flag
Operation          Code | Passed Failed |    S    D    E |    S    D    E
--------------------------------------------------------------------------
Absolute Value      A   |    168     48 |   16   16   16 |   16    8    0
Addition            +   |   1877   1651 |  294  290  336 |  496  456  416
Comparison          C   | ## not run - program aborts with div-by-0 ##
Copy Sign           @   |   1392     96 |   48   48    0 |   48    0    0
Division            /   | ## not run - program aborts with div-by-0 ##
Fraction Part       F   |    588     36 |   12    0   24 |    0    0    0
Logb                L   |    888     72 |   24   24   24 |   12   12   12
Multiplication      *   |   2148   1830 |  332  310  528 |  520  360  352
Negation            -   |    160     48 |   16   16   16 |   16    8    0
Next After          N   | ## not run - program aborts with div-by-0 ##
Round to Integer    I   |    318    240 |    0    0    4 |   80   80   80
Scalb               S   |    564    384 |  108  100   76 |  112   88   56
Square Root         V   |    180    564 |  143  157  169 |   72   72  128
Subtraction         -   |   1877   1651 |  294  290  336 |  496  456  416
Remainder           %   |   1072   1912 |  652  672  524 |  336  288  216
Precision Control
   SINGLE     1.13311278820037842E+0000
   DOUBLE     1.23456789006442125E+0000
   EXTENDED   1.23456789012337585E+0000

Rounding Control
   NEAREST   -1.23427629010100635E+0100
   DOWN      -1.23427623555772409E+0100
   UP        -1.23457760966801097E+0100
   CHOP      -1.23397493540770643E+0100

Denormal support
   SINGLE denormals supported
   SINGLE denormal prints as:       4.60943116855005E-0041
   Denormal should be printed as    4.60943...E-0041
   DOUBLE denormals supported
   DOUBLE denormal prints as:       8.75000000000016E-0311
   Denormal should be printed as    8.75...E-0311
   EXTENDED denormals supported
   EXTENDED denormal prints as:     1.31640625000000E-4934
   Denormal should be printed as    1.3164...E-4934
Precision Control
   SINGLE     1.23456789012337585E+0000
   DOUBLE     1.23456789012337585E+0000
   EXTENDED   1.23456789012337585E+0000

Rounding Control
   NEAREST   -1.23427629010100635E+0100
   DOWN      -1.23427623555772409E+0100
   UP        -1.23457760966801097E+0100
   CHOP      -1.23397493540770643E+0100

Denormal support
   SINGLE denormals supported
   SINGLE denormal prints as:       4.60943116855005E-0041
   Denormal should be printed as    4.60943...E-0041
   DOUBLE denormals supported
   DOUBLE denormal prints as:       8.75000000000016E-0311
   Denormal should be printed as    8.75...E-0311
   EXTENDED denormals supported
   EXTENDED denormal prints as:     1.31640625000000E-4934
   Denormal should be printed as    1.3164...E-4934
Precision Control
   SINGLE     1.13311278820037842E+0000
   DOUBLE     1.23456789006442125E+0000
   EXTENDED   1.23456789012337585E+0000

Rounding Control
   NEAREST   -1.23427629010100635E+0100
   DOWN      -1.23427623555772409E+0100
   UP        -1.23457760966801097E+0100
   CHOP      -1.23397493540770643E+0100

Denormal support
   SINGLE denormals supported
   SINGLE denormal prints as:       4.60943116855005E-0041
   Denormal should be printed as    4.60943...E-0041
   DOUBLE denormals supported
   DOUBLE denormal prints as:       8.75000000000016E-0311
   Denormal should be printed as    8.75...E-0311
   EXTENDED denormals not supported
Precision Control
   SINGLE     1.23456789012351396E+0000
   DOUBLE     1.23456789012351396E+0000
   EXTENDED   1.23456789012351396E+0000

Rounding Control
   NEAREST   -1.23457766383395931E+0100
   DOWN      -1.23457766383395931E+0100
   UP        -1.23457766383395931E+0100
   CHOP      -1.23457766383395931E+0100

Denormal support
   SINGLE denormals not supported
   DOUBLE denormals not supported
   EXTENDED denormals not supported
Precision Control
   SINGLE     1.23456789012337614E+0000
   DOUBLE     1.23456789012337614E+0000
   EXTENDED   1.23456789012337614E+0000

Rounding Control
   NEAREST   -1.23427621117212139E+0100
   DOWN      -1.23427621117212139E+0100
   UP        -1.23427621117212139E+0100
   CHOP      -1.23427621117212139E+0100

Denormal support
   SINGLE denormals not supported
   DOUBLE denormals not supported
   EXTENDED denormals not supported
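A unit that implements precision control and rounding control returns different values for the three precision settings and the four rounding modes. A unit that ignores these control-word fields prints the same value in every row, as in some of the listings above. The following Python sketch illustrates the effect with exact rational arithmetic. It is an illustration only and does not reproduce the computation performed by the test program; the function name and the dictionary of significand widths (24, 53 and 64 bits) are assumptions made for the example.

```python
from fractions import Fraction
import math

# Significand widths selected by the 80x87 precision-control field.
PRECISION_BITS = {"SINGLE": 24, "DOUBLE": 53, "EXTENDED": 64}


def round_significand(x: Fraction, bits: int, mode: str) -> Fraction:
    """Round the exact value x to `bits` significand bits using one of the
    rounding modes NEAREST, DOWN, UP or CHOP (exponent range is ignored)."""
    if x == 0:
        return x
    a = abs(x)
    # e = floor(log2(|x|)), computed exactly from the bit lengths
    e = a.numerator.bit_length() - a.denominator.bit_length()
    if Fraction(2) ** e > a:
        e -= 1
    quantum = Fraction(2) ** (e - bits + 1)   # weight of the last significand bit
    scaled = x / quantum
    if mode == "DOWN":        # toward minus infinity
        q = math.floor(scaled)
    elif mode == "UP":        # toward plus infinity
        q = math.ceil(scaled)
    elif mode == "CHOP":      # toward zero
        q = math.trunc(scaled)
    elif mode == "NEAREST":   # round to nearest, ties to even
        q = round(scaled)
    else:
        raise ValueError("unknown rounding mode: " + mode)
    return q * quantum
```

For a negative value such as -1/3, DOWN and UP bracket the exact result and CHOP coincides with UP, so a correctly working unit shows at least two distinct values among the four rounding-control rows.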
%wrong      is the percentage of results that differ from the 'exact' result (the infinitely precise result rounded to 64 bits).

ULP_hi      is the number of results where the returned result was greater than the 'exact' (correctly rounded) result by one ULP (the numeric weight of the last mantissa bit, 2**-63 to 2**-64 depending on the size of the number).

ULPs_hi     is the number of results where the returned result was greater than the 'exact' result by two or more ULPs.

ULP_lo      is the number of results where the returned result was smaller than the 'exact' (correctly rounded) result by one ULP.

ULPs_lo     is the number of results where the returned result was smaller than the 'exact' result by two or more ULPs.

max ULP err is the maximum deviation of a returned result from the 'exact' answer, expressed in ULPs.

Test results for accuracy of transcendental functions for double extended precision as returned by the program TRANCK. 100,000 trials per function:
Franke387 V2.4 emulator

funct.   interval       %wrong   ULP_hi  ULPs_hi   ULP_lo  ULPs_lo  max ULP err
SIN      0,pi/4         39.042    25301      708    13029        4            2
COS      0,pi/4         75.714    49827    25887        0        0            3
TAN      0,pi/4         76.976    14230    10029    24323    28394            9
ATAN     0,1            55.826    26028     1529    24044     4225            4
2XM1     0,0.5          96.717        0        0    47910    48807            5
YL2XP1   0,sqrt(2)-1    93.007      578        9    27416    65004            8
YL2X     0.1,10         62.252    16817     4712    37082     3641         2953

Microsoft's coprocessor emulator (part of MS-C and MS-Fortran libraries)

funct.   interval       %wrong   ULP_hi  ULPs_hi   ULP_lo  ULPs_lo  max ULP err
SIN      0,pi/4            N/A      N/A      N/A      N/A      N/A          N/A
COS      0,pi/4            N/A      N/A      N/A      N/A      N/A          N/A
TAN      0,pi/4         40.828    27764     1520    11445       99            2
ATAN     0,1            32.307    18893      485    12530      299            2
2XM1     0,0.5          52.163     8585      189    37745     5644            3
YL2XP1   0,sqrt(2)-1    88.801     4714      916    14239    68932           11
YL2X     0.1,10         36.598    13813     3272    13866     5647           11

INTEL 8087, 80287

funct.   interval       %wrong   ULP_hi  ULPs_hi   ULP_lo  ULPs_lo  max ULP err
SIN      0,pi/4            N/A      N/A      N/A      N/A      N/A          N/A
COS      0,pi/4            N/A      N/A      N/A      N/A      N/A          N/A
TAN      0,pi/4         37.001    18756      524    17405      316            2
ATAN     0,1             9.666     6065        0     3601        0            1
2XM1     0,0.5          19.920        0        0    19920        0            1
YL2XP1   0,sqrt(2)-1     7.780      868        0     6912        0            1
YL2X     0.1,10          1.287      723        0      564        0            1

INTEL 80387

funct.   interval       %wrong   ULP_hi  ULPs_hi   ULP_lo  ULPs_lo  max ULP err
SIN      0,pi/4         28.872     2467        0    26392       13            2
COS      0,pi/4         27.213    27169       35        9        0            2
TAN      0,pi/4         10.532      441        0    10091        0            1
ATAN     0,1             7.088     2386        0     4691        1            2
2XM1     0,0.5          32.024        0        0    32024        0            1
YL2XP1   0,sqrt(2)-1    22.611     3461        0    19150        0            1
YL2X     0.1,10         13.020     6508        0     6512        0            1

INTEL 387DX

funct.   interval       %wrong   ULP_hi  ULPs_hi   ULP_lo  ULPs_lo  max ULP err
SIN      0,pi/4         28.873     2467        0    26393       13            2
COS      0,pi/4         27.121    27090       22        9        0            2
TAN      0,pi/4         10.711      457        0    10254        0            1
ATAN     0,1             7.088     2386        0     4691        1            2
2XM1     0,0.5          32.024        0        0    32024        0            1
YL2XP1   0,sqrt(2)-1    22.611     3461        0    19150        0            1
YL2X     0.1,10         13.020     6508        0     6512        0            1

ULSI 83C87

funct.   interval       %wrong   ULP_hi  ULPs_hi   ULP_lo  ULPs_lo  max ULP err
SIN      0,pi/4         35.530     4989        6    30238      297            2
COS      0,pi/4         43.989    11193      675    31393      728            2
TAN      0,pi/4         48.539    18880     1015    26349     2295            3
ATAN     0,1            20.858       62        0    20796        0            1
2XM1     0,0.5          21.257        4        0    21253        0            1
YL2XP1   0,sqrt(2)-1    27.893     9446        0    18213      234            2
YL2X     0.1,10         13.603     9816        0     3787        0            1

IIT 3C87

funct.   interval       %wrong   ULP_hi  ULPs_hi   ULP_lo  ULPs_lo  max ULP err
SIN      0,pi/4         18.650    11171        0     7479        0            1
COS      0,pi/4          7.700     3024        0     4676        0            1
TAN      0,pi/4         20.973     9681        0    11291        1            2
ATAN     0,1            19.280    13186        0     6094        0            1
2XM1     0,0.5          25.660    17570        0     8090        0            1
YL2XP1   0,sqrt(2)-1    45.830    23503     1896    19654      777            3
YL2X     0.1,10         10.888     5638      357     4845       48            3

C&T 38700DX

funct.   interval       %wrong   ULP_hi  ULPs_hi   ULP_lo  ULPs_lo  max ULP err
SIN      0,pi/4          1.821     1272        0      549        0            1
COS      0,pi/4         23.358    12458        0    10901        0            1
TAN      0,pi/4         17.178    10725        0     6453        0            1
ATAN     0,1             9.359     7082        0     2277        0            1
2XM1     0,0.5          15.188     3039        0    12149        0            1
YL2XP1   0,sqrt(2)-1    19.497    12109        0     7388        0            1
YL2X     0.1,10         46.868      261        0    46607        0            1

CYRIX 83D87

funct.   interval       %wrong   ULP_hi  ULPs_hi   ULP_lo  ULPs_lo  max ULP err
SIN      0,pi/4          1.554     1015        0      539        0            1
COS      0,pi/4          0.925      143        0      782        0            1
TAN      0,pi/4          4.147      881        0     3266        0            1
ATAN     0,1             0.656      229        0      427        0            1
2XM1     0,0.5           2.628     1433        0     1194        0            1
YL2XP1   0,sqrt(2)-1     3.242      825        0     2417        0            1
YL2X     0.1,10          0.931      256        0      675        0            1

CYRIX 387+

funct.   interval       %wrong   ULP_hi  ULPs_hi   ULP_lo  ULPs_lo  max ULP err
SIN      0,pi/4          1.486      864        0      622        0            1
COS      0,pi/4          2.072       12        0     2060        0            1
TAN      0,pi/4          0.602       63        0      539        0            1
ATAN     0,1             0.384       12        0      372        0            1
2XM1     0,0.5           1.985       27        0     1958        0            1
YL2XP1   0,sqrt(2)-1     3.662     1705        0     1957        0            1
YL2X     0.1,10          0.764      367        0      397        0            1

INTEL RapidCAD, Intel 486

funct.   interval       %wrong   ULP_hi  ULPs_hi   ULP_lo  ULPs_lo  max ULP err
SIN      0,pi/4         16.991     1517        0    15474        0            1
COS      0,pi/4          9.003     7603        0     1400        0            1
TAN      0,pi/4         10.532      441        0    10091        0            1
ATAN     0,1             7.078     2386        0     4691        1            2
2XM1     0,0.5          32.025        0        0    32025        0            1
YL2XP1   0,sqrt(2)-1    21.800      533        0    21267        0            1
YL2X     0.1,10          3.894     1879        0     2015        0            1

Discussion of the transcendental function tests
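The error categories used in the TRANCK tables can be computed as in the following Python sketch. It is not the TRANCK source: the function names are hypothetical, the 'exact' value is assumed to be nonzero and already rounded correctly to 64 bits, and any nonzero error smaller than two ULPs is counted in the one-ULP columns.

```python
from fractions import Fraction


def floor_log2(x: Fraction) -> int:
    """Largest e with 2**e <= x, for x > 0, computed exactly."""
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e - 1 if Fraction(2) ** e > x else e


def ulp(x: Fraction, bits: int = 64) -> Fraction:
    """Weight of the last significand bit of nonzero x in a `bits`-bit format."""
    return Fraction(2) ** (floor_log2(abs(x)) - (bits - 1))


def tally(pairs, bits: int = 64):
    """pairs: iterable of (returned, exact) values given as exact rationals."""
    counts = {"ULP_hi": 0, "ULPs_hi": 0, "ULP_lo": 0, "ULPs_lo": 0}
    max_err = Fraction(0)
    total = 0
    for returned, exact in pairs:
        total += 1
        err = (Fraction(returned) - Fraction(exact)) / ulp(Fraction(exact), bits)
        if err == 0:
            continue                       # correctly rounded result
        max_err = max(max_err, abs(err))
        if err > 0:
            counts["ULP_hi" if err < 2 else "ULPs_hi"] += 1
        else:
            counts["ULP_lo" if err > -2 else "ULPs_lo"] += 1
    wrong = sum(counts.values())
    pct = 100 * wrong / total if total else 0.0
    return {"%wrong": pct, **counts, "max ULP err": max_err}
```

With a 64-bit significand, `ulp` yields 2**-63 for values in [1,2) and 2**-64 for values in [0.5,1), which matches the range quoted in the column definitions.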
Results of running the SMDIAG program on 387-compatible coprocessors (p = passed, f = failed)

                    Intel     Intel  Intel  Cyrix  Cyrix  IIT    ULSI   C&T
Test                RapidCAD  387DX  80387  387+   83D87  3C87   83C87  38700

 1 (fstore)           f         p      p      p      f      f      f      p    ##,%%
 2 (fiall)            p         p      p      p      p      p      f      p
 3 (faddsub)          p         p      p      p      p      p      p      p
 4 (faddsub_nr)       p         p      p      p      f      f      f      p    %%
 5 (faddsub_cp)       p         p      p      p      f      f      f      p    %%
 6 (faddsub_dn)       p         p      p      p      f      f      f      p    %%
 7 (faddsub_up)       p         p      p      p      f      f      f      p    %%,&&
 8 (fmul)             p         p      p      p      p      f      f      p
 9 (fdivn)            p         p      p      p      p      p      p      p
10 (fdiv)             p         p      p      p      p      p      f      p
11 (fxch)             p         p      p      p      p      p      p      p
12 (fyl2x)            p         p      p      f      f      f      f      p    ++
13 (fyl2xp1)          f         p      p      f      f      f      f      p    ++
14 (fsqrt)            p         p      p      p      p      p      p      p
15 (fsincos)          f         p      p      f      f      f      f      p    ++
16 (fptan)            p         p      p      f      p      f      f      p    ++
17 (fpatan)           p         p      p      f      f      f      f      p    ++
18 (f2xm1)            p         p      p      f      f      f      f      p    ++
19 (fscale)           f         f      p      f      f      f      f      p    **
20 (fcom1)            p         p      p      p      p      f      f      p
21 (fprem)            p         p      p      p      p      p      p      p
22 (misc1)            p         p      p      p      p      f      f      p
23 (misc3)            p         p      p      p      p      p      p      p
24 (misc4)            p         p      p      p      f      f      p      p    %%

failed modules:       4         1      0      7     12     16     17      0

## The failure of the Intel RapidCAD is caused by the fact that it stores the value of BCD INDEFINITE differently from the Intel 387DX. It uses FFFFC000000000000000, while the 387DX uses FFFF8000000000000000. However, both encodings are valid according to Intel's documentation, which defines the BCD INDEFINITE as FFFFUUUUUUUUUUUUUUUU, where U is undefined. So failure of the RapidCAD to deliver the same answer as the 387DX is not an "error", just a very slight incompatibility.

** The FSCALE errors reported for the Intel 387DX, Intel RapidCAD, Cyrix 83D87, Cyrix 387+, and ULSI 83C87 are due to a single 'wrong' result each returned by one of the FSCALE computations. SMDIAG expects the result returned by the first generation Intel 80387 (and, of course, the C&T 38700DX). However, this result is wrong according to Intel's documentation and the behavior was corrected in the second generation Intel 387DX. Therefore, the Intel RapidCAD, Cyrix 83D87, Cyrix 387+, and ULSI 83C87 return the correct result compatible with the Intel 387DX.

%% Failures reported for the Cyrix 83D87 are due to the fact that it converts pseudodenormals contained in its registers to normalized numbers upon storing them to memory with the FSTP TBYTE PTR instruction. Intel's processors store pseudodenormals without 'normalizing' them. This is an incompatibility, but not an error, because both encodings will evaluate to the same value should they be reused in a calculation.

&& Two of the failures reported for the Cyrix 83D87 are actual errors where the Cyrix 83D87 fails to deliver the correct result.

   1) control word = 0A7F (closure=proj., round=up, precision=53bit)
      ST(0) = 0001 ABCEF9876542101
      ST(1) = 0001 800000000345FFF
      instruction: FSUBRP ST(1), ST
      result should be: 0000 2BCEF987650EC800, status word = 3A30
      83D87 returns:    0000 3BCEF987650EC000, status word = 3830

   2) control word = 0A7F (closure=proj., round=up, precision=53bit)
      ST(0) = 0001 ABCEF9876542101
      ST(1) = 0001 800000000000000
      instruction: FSUB ST, ST(1)
      result should be: 0000 2BCEF98765432800, status word = 3A30
      83D87 returns:    0000 3BCEF98765432000, status word = 3830

++ The failures for the test of transcendental functions are caused by the tested coprocessor returning results that differ from the ones returned by the Intel 387DX. On the Cyrix 83D87, Cyrix 387+, and Intel RapidCAD, this is simply due to the improved accuracy these coprocessors provide over the Intel 387DX. The failures of the IIT 3C87 and ULSI 83C87 are mainly due to the lesser accuracy in the transcendental functions of these coprocessors, but for the IIT 3C87 an additional source of failures is its inability to handle extended-precision denormals.
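The remark that a pseudodenormal and its normalized counterpart evaluate to the same value can be checked with a small decoder for the 80-bit extended format. The Python sketch below is an illustration under stated assumptions: the function name is hypothetical, the number is passed as a 16-bit sign/exponent word plus a 64-bit significand with an explicit integer bit, and infinities and NaNs are rejected rather than decoded.

```python
from fractions import Fraction

BIAS = 16383  # exponent bias of the 80-bit extended format


def decode_extended(sign_exp: int, significand: int) -> Fraction:
    """Exact value of a finite 80-bit extended-precision number."""
    sign = -1 if sign_exp & 0x8000 else 1
    exp = sign_exp & 0x7FFF
    if exp == 0x7FFF:
        raise ValueError("infinity or NaN")
    # A biased exponent of 0 (denormals and pseudodenormals) is scaled
    # like a biased exponent of 1.
    scale = max(exp, 1) - BIAS - 63
    return sign * Fraction(significand) * Fraction(2) ** scale


# Pseudodenormal: biased exponent 0 with the integer bit (bit 63) set.
pseudo = decode_extended(0x0000, 0x8000000000000001)
# The same significand stored with biased exponent 1 is a normalized number.
normal = decode_extended(0x0001, 0x8000000000000001)
print(pseudo == normal)
```

Because both encodings map to the same rational value, a later computation cannot distinguish them; only a byte-for-byte comparison of the stored operand, as performed by SMDIAG, shows the difference.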
[1] Schnurer, G.: Zahlenknacker im Vormarsch. c't 1992, Heft 4, Seiten 170-186
[2] Curnow, H.J.; Wichmann, B.A.: A synthetic benchmark. Computer Journal, Vol. 19, No. 1, 1976, pp. 43-49
[3] Wichmann, B.A.: Validation code for the Whetstone benchmark. NPL Report DITC 107/88, National Physical Laboratory, UK, March 1988
[4] Curnow, H.J.: Whither Whetstone? The Synthetic Benchmark after 15 Years. In: Aad van der Steen (ed.): Evaluating Supercomputers. London: Chapman and Hall 1990
[5] Dongarra, J.J.: The Linpack Benchmark: An Explanation. In: Aad van der Steen (ed.): Evaluating Supercomputers. London: Chapman and Hall 1990
[6] Dongarra, J.J.: Performance of Various Computers Using Standard Linear Equations Software. Report CS-89-85, Computer Science Department, University of Tennessee, March 11, 1992
[7] Huth, N.: Dichtung und Wahrheit oder Datenblatt und Test. Design & Elektronik 1990, Heft 13, Seiten 105-110
[8] Ungerer, B.: Sockelfolger. c't 1990, Heft 4, Seiten 162-163
[9] Coonen, J.T.: Contributions to a Proposed Standard for Binary Floating-Point Arithmetic. Ph.D. thesis, University of California, Berkeley, 1984
[10] IEEE: IEEE Standard for Binary Floating-Point Arithmetic. SIGPLAN Notices, Vol. 22, No. 2, 1985, pp. 9-25
[11] IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std 754-1985. New York, NY: Institute of Electrical and Electronics Engineers 1985
[12] FasMath 83D87 Compatibility Report. Cyrix Corporation, Nov. 1989. Order No. B2004
[13] FasMath 83D87 Accuracy Report. Cyrix Corporation, July 1990. Order No. B2002
[14] FasMath 83D87 Benchmark Report. Cyrix Corporation, June 1990. Order No. B2004
[15] FasMath 83D87 User's Manual. Cyrix Corporation, June 1990. Order No. L2001-003
[16] Brent, R.P.: A FORTRAN multiple-precision arithmetic package. ACM Transactions on Mathematical Software, Vol. 4, No. 1, March 1978, pp. 57-70
[17] 387DX User's Manual, Programmer's Reference. Intel Corporation, 1989. Order No. 231917-002
[18] Volder, J.E.: The CORDIC Trigonometric Computing Technique. IRE Transactions on Electronic Computers, Vol. EC-8, No. 5, September 1959, pp. 330-334
[19] Walther, J.S.: A unified algorithm for elementary functions. AFIPS Conference Proceedings, Vol. 38, SJCC 1971, pp. 379-385
[20] Esser, R.; Kremer, F.; Schmidt, W.G.: Testrechnungen auf der IBM 3090E mit Vektoreinrichtung. Arbeitsbericht RRZK-8803, Regionales Rechenzentrum an der Universität zu Köln, Februar 1988
[21] McMahon, H.H.: The Livermore Fortran Kernels: A test of the numerical performance range. Technical Report UCRL-53745, Lawrence Livermore National Laboratory, USA, December 1986
[22] Nave, R.: Implementation of Transcendental Functions on a Numerics Processor. Microprocessing and Microprogramming, Vol. 11, No. 3-4, March-April 1983, pp. 221-225
[23] Yuen, A.K.: Intel's Floating-Point Processors. Electro/88 Conference Record, Boston, MA, USA, 10-12 May 1988, pp. 48/5-1 - 48/5-7
[24] Stiller, A.; Ungerer, B.: Ausgerechnet. c't 1990, Heft 1, Seiten 90-92
[25] Rosch, W.L.: Handfeste Hilfe oder Seifenblase? PC Professionell, Juni 1991, Seiten 214-237
[26] Intel 80286 Hardware Reference Manual. Intel Corporation, 1987. Order No. 210760-002
[27] AMD 80C287 80-bit CMOS Numeric Processor. Advanced Micro Devices, June 1989. Order No. 11671B/0
[28] Intel RapidCAD(tm) Engineering CoProcessor Performance Brief. Intel Corporation, 1992
[29] i486(tm) Microprocessor Performance Report. Intel Corporation, April 1990. Order No. 240734-001
[30] Intel486(tm) DX2 Microprocessor Performance Brief. Intel Corporation, March 1992. Order No. 241254-001
[31] Abacus 3167 Floating-Point Coprocessor Data Book. Weitek Corporation, July 1990. DOC No. 9030
[32] WTL 4167 Floating-Point Coprocessor Data Book. Weitek Corporation, July 1989. DOC No. 8943
[33] Abacus Software Designer's Guide. Weitek Corporation, September 1989. DOC No. 8967
[34] Stiller, A.: Cache & Carry. c't 1992, Heft 6, Seiten 118-130
[35] Stiller, A.: Cache & Carry, Teil 2. c't 1992, Heft 7, Seiten 28-34
[36] Palmer, J.F.; Morse, S.P.: Die mathematischen Grundlagen der Numerik-Prozessoren 8087/80287. München: tewi 1985
[37] 80C187 80-bit Math Coprocessor Data Sheet. Intel Corporation, September 1989. Order No. 270640-003
[38] IIT-2C87 80-bit Numeric Co-Processor Data Sheet. IIT, May 1990
[39] Engineering note 4x4 matrix multiply transformation. IIT, 1989
[40] Tscheuschner, E.: 4 mal 4 auf einen Streich. c't 1990, Heft 3, Seiten 266-276
[41] Goldberg, D.: Computer Arithmetic. In: Hennessy, J.L.; Patterson, D.A.: Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann 1990
[42] 8087 Math Coprocessor Data Sheet. Intel Corporation, October 1989. Order No. 205835-007
[43] 8086/8088 User's Manual, Programmer's and Hardware Reference. Intel Corporation, 1989. Order No. 240487-001
[44] 80286 and 80287 Programmer's Reference Manual. Intel Corporation, 1987. Order No. 210498-005
[45] 80287XL/XLT CHMOS III Math Coprocessor Data Sheet. Intel Corporation, May 1990. Order No. 290376-001
[46] Cyrix FasMath(tm) 82S87 Coprocessor Data Sheet. Cyrix Corporation, 1991. Document 94018-00 Rev. 1.0
[47] IIT-3C87 80-bit Numeric Co-Processor Data Sheet. IIT, May 1990
[48] 486(tm)SX(tm) Microprocessor/487(tm)SX(tm) Math CoProcessor Data Sheet. Intel Corporation, April 1991. Order No. 240950-001
[49] Schnurer, G.: Die große Verlade. c't 1991, Heft 7, Seiten 55-57
[50] Schnurer, G.: Eine 4 für alle. c't 1991, Heft 6, Seite 25
[51] Intel486(tm)DX Microprocessor Data Book. Intel Corporation, June 1991. Order No. 240440-004
[52] i486(tm) Microprocessor Hardware Reference Manual. Intel Corporation, 1990. Order No. 240552-001
[53] i486(tm) Microprocessor Programmer's Reference Manual. Intel Corporation, 1990. Order No. 240486-001
[54] Ungerer, B.: Kalte Hüte. c't 1992, Heft 8, Seiten 140-144
[55] Ungerer, B.: Heiße Sache. c't 1991, Heft 4, Seiten 104-108
[56] Rosch, W.L.: Handfeste Hilfe oder Seifenblase? PC Professionell, Juni 1991, Seiten 214-237
[57] Niederkrüger, W.: Lebendige Vergangenheit. c't 1990, Heft 12, Seiten 114-116
[58] ULSI Math*Co Advanced Math Coprocessor Technical Specification. ULSI System, 5/92, Rev. E
[59] 387(tm)DX Math CoProcessor Data Sheet. Intel Corporation, September 1990. Order No. 240448-003
[60] 387(tm) Numerics Coprocessor Extension Data Sheet. Intel Corporation, February 1989. Order No. 231920-005
[61] Koren, I.; Zinaty, O.: Evaluating Elementary Functions in a Numerical Coprocessor Based on Rational Approximations. IEEE Transactions on Computers, Vol. C-39, No. 8, August 1990, pp. 1030-1037
[62] 387(tm) SX Math CoProcessor Data Sheet. Intel Corporation, November 1989. Order No. 240225-005
[63] Frenkel, G.: Coprocessors Speed Numeric Operations. PC-Week, August 27, 1990
[64] Schnurer, G.; Stiller, A.: Auto-Matt. c't 1991, Heft 10, Seiten 94-96
[65] Grehan, R.: FPU Face-Off. Byte, November 1990, pp. 194-200
[66] Tang, P.T.P.: Testing Computer Arithmetic by Elementary Number Theory. Preprint MCS-P84-0889, Mathematics and Computer Science Division, Argonne National Laboratory, August 1989
[67] Ferguson, W.E.: Selecting math coprocessors. IEEE Spectrum, July 1991, pp. 38-41
[68] Schnabel, J.: Viermal 387. Computer Persönlich 1991, Heft 22, Seiten 153-156
[69] Hofmann, J.: Starke Rechenknechte. mc 1990, Heft 7, Seiten 64-67
[70] Woerrlein, H.; Hinnenberg, R.: Die Lust an der Power.
Computer Live 1991, Heft 10, Seiten 138-149 [71] email from Peter Forsberg (peterf@vnet.ibm.com), email from Alan Brown (abrown@Reston.ICL.COM) [72] email from Eric Johnson (johnsone%camax01@uunet.UU.NET), email from Jerry Whelan (guru@stasi.bradley.edu), email from Arto Viitanen (av@cs.uta.fi), email from Richard Krehbiel (richk@grebyn.com) [73] email from Fred Dunlap (cyrix!fred@texsun.Central.Sun.COM) [74] correspondence with Bengt Ask (f89ba@efd.lth.se) [75] email from Thomas Hoberg (tmh@prosun.first.gmd.de) [76] Microsoft Macro Assembler Programmer's Guide Version 6.0, Microsoft Corporation, 1991. Document No. LN06556-0291 [77] FasMath EMC87 User's Manual, Rev. 2. Cyrix Corporation, February 1991 Order No. 90018-00 [78] Persson, C.: Die 32-Bit-Parade c't 1992, Heft 9, Seiten 150-156 [79] email from Duncan Murdoch (dmurdoch@mast.QueensU.CA)
Intel Corporation
3065 Bowers Avenue
Santa Clara, CA 95051
USA

IIT
Integrated Information Technology, Inc.
2540 Mission College Blvd.
Santa Clara, CA 95054
USA

ULSI Systems, Inc.
58 Daggett Drive
San Jose, CA 95134
USA

Chips & Technologies, Inc.
3050 Zanker Road
San Jose, CA 95134
USA

Weitek Corporation
1060 East Arques Avenue
Sunnyvale, CA 94086
USA

AMD
Advanced Micro Devices, Inc.
901 Thompson Place
P.O.B. 3453
Sunnyvale, CA 94088-3453
USA

Cyrix Corporation
P.O.B. 850118
Richardson, TX 75085
USA
{$N+,E+}

PROGRAM PCtrl;

VAR B,c: EXTENDED;
    Precision, L: WORD;

PROCEDURE SetPrecisionControl (Precision: WORD);
(* This procedure sets the internal precision of the NDP. Available *)
(* precision values: 0 - 24 bits (SINGLE)                           *)
(*                   1 - n.a. (mapped to single)                    *)
(*                   2 - 53 bits (DOUBLE)                           *)
(*                   3 - 64 bits (EXTENDED)                         *)

VAR CtrlWord: WORD;

BEGIN {SetPrecisionCtrl}
   IF Precision = 1 THEN
      Precision := 0;
   Precision := Precision SHL 8;  { make mask for PC field in ctrl word}
   ASM
      FSTCW    [CtrlWord]         { store NDP control word             }
      MOV      AX, [CtrlWord]     { load control word into CPU         }
      AND      AX, 0FCFFh         { mask out precision control field   }
      OR       AX, [Precision]    { set desired precision in PC field  }
      MOV      [CtrlWord], AX     { store new control word             }
      FLDCW    [CtrlWord]         { set new precision control in NDP   }
   END;
END; {SetPrecisionCtrl}

BEGIN {main}
   FOR Precision := 1 TO 3 DO BEGIN
      B := 1.2345678901234567890;
      SetPrecisionControl (Precision);
      FOR L := 1 TO 20 DO BEGIN
         B := Sqrt (B);
      END;
      FOR L := 1 TO 20 DO BEGIN
         B := B*B;
      END;
      SetPrecisionControl (3);    { full precision for printout }
      WriteLn (Precision, B:28);
   END;
END.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$N+,E+}

PROGRAM RCtrl;

VAR B,c: EXTENDED;
    RoundingMode, L: WORD;

PROCEDURE SetRoundingMode (RCMode: WORD);
(* This procedure selects one of four available rounding modes      *)
(* 0 - Round to nearest (default)                                   *)
(* 1 - Round down (towards negative infinity)                       *)
(* 2 - Round up (towards positive infinity)                         *)
(* 3 - Chop (truncate, round towards zero)                          *)

VAR CtrlWord: WORD;

BEGIN
   RCMode := RCMode SHL 10;       { make mask for RC field in control word}
   ASM
      FSTCW    [CtrlWord]         { store NDP control word                }
      MOV      AX, [CtrlWord]     { load control word into CPU            }
      AND      AX, 0F3FFh         { mask out rounding control field       }
      OR       AX, [RCMode]       { set desired rounding mode in RC field }
      MOV      [CtrlWord], AX     { store new control word                }
      FLDCW    [CtrlWord]         { set new rounding control in NDP       }
   END;
END;

BEGIN
   FOR RoundingMode := 0 TO 3 DO BEGIN
      B := 1.2345678901234567890e100;
      SetRoundingMode (RoundingMode);
      FOR L := 1 TO 51 DO BEGIN
         B := Sqrt (B);
      END;
      FOR L := 1 TO 51 DO BEGIN
         B := -B*B;
      END;
      SetRoundingMode (0);        { round to nearest for printout }
      WriteLn (RoundingMode, B:28);
   END;
END.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$N+,E+}

PROGRAM DenormTs;

VAR E: EXTENDED;
    D: DOUBLE;
    S: SINGLE;

BEGIN
   WriteLn ('Testing support and printing of denormals');
   WriteLn;
   Write ('Coprocessor is: ');
   CASE Test8087 OF
      0: WriteLn ('Emulator');
      1: WriteLn ('8087 or compatible');
      2: WriteLn ('80287 or compatible');
      3: WriteLn ('80387 or compatible');
   END;
   WriteLn;

   S := 1.18e-38;
   S := S * 3.90625e-3;
   IF S = 0 THEN
      WriteLn ('SINGLE denormals not supported')
   ELSE BEGIN
      WriteLn ('SINGLE denormals supported');
      WriteLn ('SINGLE denormal prints as:   ', S);
      WriteLn ('Denormal should be printed as 4.60943...E-0041');
   END;
   WriteLn;

   D := 2.24e-308;
   D := D * 3.90625e-3;
   IF D = 0 THEN
      WriteLn ('DOUBLE denormals not supported')
   ELSE BEGIN
      WriteLn ('DOUBLE denormals supported');
      WriteLn ('DOUBLE denormal prints as:   ', D);
      WriteLn ('Denormal should be printed as 8.75...E-0311');
   END;
   WriteLn;

   E := 3.37e-4932;
   E := E * 3.90625e-3;
   IF E = 0 THEN
      WriteLn ('EXTENDED denormals not supported')
   ELSE BEGIN
      WriteLn ('EXTENDED denormals supported');
      WriteLn ('EXTENDED denormal prints as: ', E);
      WriteLn ('Denormal should be printed as 1.3164...E-4934');
   END;
END.
; FILE: APFELM4.ASM
; assemble with MASM /e APFELM4 or TASM /e APFELM4

CODE        SEGMENT BYTE PUBLIC 'CODE'

            ASSUME  CS: CODE

            PAGE    ,120

            PUBLIC  APPLE87;

APPLE87     PROC    NEAR
            PUSH    BP                    ; save caller's base pointer
            MOV     BP, SP                ; make new frame pointer
            PUSH    DS                    ; save caller's data segment
            PUSH    SI                    ; save register
            PUSH    DI                    ;  variables
            LDS     BX, [BP+04]           ; pointer to parameter record
            FINIT                         ; init 80x87          FSP->R0
            FILD    WORD PTR  [BX+02]     ; maxrad              FSP->R7
            FLD     QWORD PTR [BX+08]     ; qmax                FSP->R6
            FSUB    QWORD PTR [BX+16]     ; qmax-qmin           FSP->R6
            DEC     WORD PTR  [BX+04]     ; ymax-1
            FIDIV   WORD PTR  [BX+04]     ; (qmax-qmin)/(ymax-1)FSP->R6
            FSTP    QWORD PTR [BX+16]     ; save delta_q        FSP->R7
            FLD     QWORD PTR [BX+24]     ; pmax                FSP->R6
            FSUB    QWORD PTR [BX+32]     ; pmax-pmin           FSP->R6
            DEC     WORD PTR  [BX+06]     ; xmax-1
            FIDIV   WORD PTR  [BX+06]     ; delta_p             FSP->R6
            MOV     AX, [BX]              ; save maxiter,[BX] needed for
            MOV     [BX+2], AX            ;  80x87 status now
            XOR     BP, BP                ; y=0
            FLD     QWORD PTR [BX+08]     ; qmax                FSP->R5
            CMP     WORD PTR  [BX+40], 0  ; fast mode on 8087 desired ?
            JE      yloop                 ; no, normal mode
            FSTCW   [BX]                  ; save NDP control word
            AND     WORD PTR [BX], 0FCFFh ; set PCTRL = single-precision
            FLDCW   [BX]                  ; get back NDP control word
yloop:      XOR     DI, DI                ; x=0
            FLD     QWORD PTR [BX+32]     ; pmin                FSP->R4
xloop:      FLDZ                          ; j**2= 0             FSP->R3
            FLDZ                          ; 2ij = 0             FSP->R2
            FLDZ                          ; i**2= 0             FSP->R1
            MOV     CX, [BX+2]            ; maxiter
            MOV     DL, 41h               ; mask for C0 and C3 cond.bits
iteration:  FSUB    ST, ST(2)             ; i**2-j**2           FSP->R1
            FADD    ST, ST(3)             ; i**2-j**2+p = i     FSP->R1
            FLD     ST(0)                 ; duplicate i         FSP->R0
            FMUL    ST(1), ST             ; i**2                FSP->R0
            FADD    ST, ST(0)             ; 2i                  FSP->R0
            FXCH    ST(2)                 ; 2*i*j               FSP->R0
            FADD    ST, ST(5)             ; 2*i*j+q = j         FSP->R0
            FMUL    ST(2), ST             ; 2*i*j               FSP->R0
            FMUL    ST, ST(0)             ; j**2                FSP->R0
            FST     ST(3)                 ; save j**2           FSP->R0
            FADD    ST, ST(1)             ; i**2+j**2           FSP->R0
            FCOMP   ST(7)                 ; i**2+j**2 > maxrad? FSP->R1
            FSTSW   [BX]                  ; save 80x87 cond.codeFSP->R1
            TEST    BYTE PTR [BX+1], DL   ; test carry and zero flags
            LOOPNZ  iteration             ; until maxiter if not diverg.
            MOV     DX, CX                ; number of loops executed
            NEG     CX                    ; carry set if CX <> 0
            ADC     DX, 0                 ; adjust DX if no. of loops<>0

            ; plot point here (DI = X, BP = y, DX has the color)

            FSTP    ST(0)                 ; pop i**2            FSP->R2
            FSTP    ST(0)                 ; pop 2ij             FSP->R3
            FSTP    ST(0)                 ; pop j**2            FSP->R4
            FADD    ST,ST(2)              ; p=p+delta_p         FSP->R4
            INC     DI                    ; x:=x+1
            CMP     DI, [BX+6]            ; x > xmax ?
            JBE     xloop                 ; no, continue on same line
            FSTP    ST(0)                 ; pop p               FSP->R5
            FSUB    QWORD PTR [BX+16]     ; q=q-delta_q         FSP->R5
            INC     BP                    ; y:=y+1
            CMP     BP, [BX+4]            ; y > ymax ?
            JBE     yloop                 ; no, picture not done yet
groesser:   POP     DI                    ; restore
            POP     SI                    ;  register variables
            POP     DS                    ; restore caller's data segm.
            POP     BP                    ; restore caller's base pointer
            RET     4                     ; pop parameters and return
APPLE87     ENDP
CODE        ENDS
            END

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

UNIT Time;

INTERFACE

FUNCTION Clock: LONGINT;          { same as VMS; time in milliseconds }

IMPLEMENTATION

FUNCTION Clock: LONGINT; ASSEMBLER;
ASM
           PUSH    DS             { save caller's data segment       }
           XOR     DX, DX         { initialize data segment to       }
           MOV     DS, DX         {  access ticker counter           }
           MOV     BX, 46Ch       { offset of ticker counter in segm.}
           MOV     DX, 43h        { timer chip control port          }
           MOV     AL, 4          { freeze timer 0                   }
           PUSHF                  { save caller's int flag setting   }
           STI                    { allow update of ticker counter   }
           LES     DI, DS:[BX]    { read BIOS ticker counter         }
           OUT     DX, AL         { latch timer 0                    }
           LDS     SI, DS:[BX]    { read BIOS ticker counter         }
           IN      AL, 40h        { read latched timer 0 lo-byte     }
           MOV     AH, AL         { save lo-byte                     }
           IN      AL, 40h        { read latched timer 0 hi-byte     }
           POPF                   { restore caller's int flag        }
           XCHG    AL, AH         { correct order of hi and lo       }
           MOV     CX, ES         { ticker counter 1 in CX:DI:AX     }
           CMP     DI, SI         { ticker counter updated ?         }
           JE      @no_update     { no                               }
           OR      AX, AX         { update before timer freeze ?     }
           JNS     @no_update     { no                               }
           MOV     DI, SI         { use second                       }
           MOV     CX, DS         {  ticker counter                  }
@no_update:NOT     AX             { counter counts down              }
           MOV     BX, 36EDh      { load multiplier                  }
           MUL     BX             { W1 * M                           }
           MOV     SI, DX         { save W1 * M (hi)                 }
           MOV     AX, BX         { get M                            }
           MUL     DI             { W2 * M                           }
           XCHG    BX, AX         { AX = M, BX = W2 * M (lo)         }
           MOV     DI, DX         { DI = W2 * M (hi)                 }
           ADD     BX, SI         { accumulate                       }
           ADC     DI, 0          {  result                          }
           XOR     SI, SI         { load zero                        }
           MUL     CX             { W3 * M                           }
           ADD     AX, DI         { accumulate                       }
           ADC     DX, SI         {  result in DX:AX:BX              }
           MOV     DH, DL         { move result                      }
           MOV     DL, AH         {  from DL:AX:BX                   }
           MOV     AH, AL         {   to                             }
           MOV     AL, BH         {    DX:AX:BH                      }
           MOV     DI, DX         { save result                      }
           MOV     CX, AX         {  in DI:CX                        }
           MOV     AX, 25110      { calculate correction             }
           MUL     DX             {  factor                          }
           SUB     CX, DX         { subtract correction              }
           SBB     DI, SI         {  factor                          }
           XCHG    AX, CX         { result back                      }
           MOV     DX, DI         {  to DX:AX                        }
           POP     DS             { restore caller's data segment    }
END;

BEGIN
   Port [$43] := $34;    { need rate generator, not square wave}
   Port [$40] := 0;      { generator as prog. by some BIOSes   }
   Port [$40] := 0;      { for timer 0                         }
END. { Time }

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$A+,B-,R-,I-,V-,N+,E+}

PROGRAM PeakFlop;

USES Time;

TYPE ParamRec = RECORD
                   MaxIter, MaxRad, YMax, XMax: WORD;
                   Qmax, Qmin, Pmax, Pmin: DOUBLE;
                   FastMod: WORD;
                   PlotFkt: POINTER;
                   FLOPS:LONGINT;
                END;

VAR Param: ParamRec;
    Start: LONGINT;

{$L APFELM4.OBJ}

PROCEDURE Apple87 (VAR Param: ParamRec); EXTERNAL;

BEGIN
   WITH Param DO BEGIN
      MaxIter:= 50;
      MaxRad := 30;
      YMax   := 30;
      XMax   := 30;
      Pmin   :=-2.1;
      Pmax   := 1.1;
      Qmin   :=-1.2;
      Qmax   := 1.2;
      FastMod:= Word (FALSE);
      PlotFkt:= NIL;
      Flops  := 0;
   END;
   Start := Clock;
   Apple87 (Param);          { executes 104002 FLOP }
   Start := Clock - Start;   { elapsed time in milliseconds }
   WriteLn ('Peak-MFLOPS: ', 104.002 / Start);
END.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

; FILE: M4X4.ASM
;
; assemble with TASM /e M4X4 or MASM /e M4X4

CODE      SEGMENT BYTE PUBLIC 'CODE'

          ASSUME  CS:CODE

          PUBLIC  MUL_4x4
          PUBLIC  IIT_MUL_4x4

FSBP0     EQU     DB 0DBh, 0E8h         ; declare special IIT
FSBP1     EQU     DB 0DBh, 0EBh         ;  instructions
FSBP2     EQU     DB 0DBh, 0EAh
F4X4      EQU     DB 0DBh, 0F1h

;---------------------------------------------------------------------
;
; MUL_4x4 multiplies a four-by-four matrix by an array of four
; dimensional vectors. This operation is needed for 3D transformations
; in graphics data processing. There are arrays for each component of
; a vector. Thus there is an array containing all the x components,
; another containing all the y components and so on. Each component is
; an 8 byte IEEE floating-point number. Two indices into the array of
; vectors are given. The first is the index of the vector that will be
; processed first, the second is the index of the vector processed
; last.
;
;---------------------------------------------------------------------

MUL_4x4   PROC    NEAR

AddrX     EQU     DWORD PTR [BP+24]     ; address of X component array
AddrY     EQU     DWORD PTR [BP+20]     ; address of Y component array
AddrZ     EQU     DWORD PTR [BP+16]     ; address of Z component array
AddrW     EQU     DWORD PTR [BP+12]     ; address of W component array
AddrT     EQU     DWORD PTR [BP+8]      ; addr. of 4x4 transform. mat.
F         EQU     WORD  PTR [BP+6]      ; first vector to process
K         EQU     WORD  PTR [BP+4]      ; last vector to process
RetAddr   EQU     WORD  PTR [BP+2]      ; return address saved by call
SavdBP    EQU     WORD  PTR [BP+0]      ; saved frame pointer
SavdDS    EQU     WORD  PTR [BP-2]      ; caller's data segment

          PUSH    BP                    ; save TURBO-Pascal frame ptr
          MOV     BP, SP                ; new frame pointer
          PUSH    DS                    ; save TURBO-Pascal data segmnt

          MOV     CX, K                 ; final index
          SUB     CX, F                 ; final index - start index
          JNC     $ok                   ; must not
          JMP     $nothing              ;  be negative
$ok:      INC     CX                    ; number of elements

          MOV     SI, F                 ; init offset into arrays
          SHL     SI, 1                 ; each
          SHL     SI, 1                 ;  element
          SHL     SI, 1                 ;   has 8 bytes

          LDS     DI, AddrT             ; addr. of transformation mat.
          FLD     QWORD PTR [DI]        ; load a[0,0]  = R7
          FLD     QWORD PTR [DI+8]      ; load a[0,1]  = R6

$mat_mul: LES     BX, AddrX             ; addr. of x component array
          FLD     QWORD PTR ES:[BX+SI]  ; load x[a]    = R5
          LES     BX, AddrY             ; addr. of y component array
          FLD     QWORD PTR ES:[BX+SI]  ; load y[a]    = R4
          LES     BX, AddrZ             ; addr. of z component array
          FLD     QWORD PTR ES:[BX+SI]  ; load z[a]    = R3
          LES     BX, AddrW             ; addr. of w component array
          FLD     QWORD PTR ES:[BX+SI]  ; load w[a]    = R2

          FLD     ST(5)                 ; load a[0,0]  = R1
          FMUL    ST, ST(4)             ; a[0,0] * x[a] = R1
          FLD     ST(5)                 ; load a[0,1]  = R0
          FMUL    ST, ST(4)             ; a[0,1] * y[a] = R0
          FADDP   ST(1), ST             ; a[0,0]*x[a]+a[0,1]*y[a]=R1
          FLD     QWORD PTR [DI+16]     ; load a[0,2]  = R0
          FMUL    ST, ST(3)             ; a[0,2] * z[a] = R0
          FADDP   ST(1), ST             ; a[0,0]*x[a]...a[0,2]*z[a]=R1
          FLD     QWORD PTR [DI+24]     ; load a[0,3]  = R0
          FMUL    ST, ST(2)             ; a[0,3] * w[a] = R0
          FADDP   ST(1), ST             ; a[0,0]*x[a]...a[0,3]*w[a]=R1
          LES     BX, AddrX             ; get address of x vector
          FSTP    QWORD PTR ES:[BX+SI]  ; write new x[a]

          FLD     QWORD PTR [DI+32]     ; load a[1,0]  = R1
          FMUL    ST, ST(4)             ; a[1,0] * x[a] = R1
          FLD     QWORD PTR [DI+40]     ; load a[1,1]  = R0
          FMUL    ST, ST(4)             ; a[1,1] * y[a] = R0
          FADDP   ST(1), ST             ; a[1,0]*x[a]+a[1,1]*y[a]=R1
          FLD     QWORD PTR [DI+48]     ; load a[1,2]  = R0
          FMUL    ST, ST(3)             ; a[1,2] * z[a] = R0
          FADDP   ST(1), ST             ; a[1,0]*x[a]...a[1,2]*z[a]=R1
          FLD     QWORD PTR [DI+56]     ; load a[1,3]  = R0
          FMUL    ST, ST(2)             ; a[1,3] * w[a] = R0
          FADDP   ST(1), ST             ; a[1,0]*x[a]...a[1,3]*w[a]=R1
          LES     BX, AddrY             ; get address of y vector
          FSTP    QWORD PTR ES:[BX+SI]  ; write new y[a]

          FLD     QWORD PTR [DI+64]     ; load a[2,0]  = R1
          FMUL    ST, ST(4)             ; a[2,0] * x[a] = R1
          FLD     QWORD PTR [DI+72]     ; load a[2,1]  = R0
          FMUL    ST, ST(4)             ; a[2,1] * y[a] = R0
          FADDP   ST(1), ST             ; a[2,0]*x[a]+a[2,1]*y[a]=R1
          FLD     QWORD PTR [DI+80]     ; load a[2,2]  = R0
          FMUL    ST, ST(3)             ; a[2,2] * z[a] = R0
          FADDP   ST(1), ST             ; a[2,0]*x[a]...a[2,2]*z[a]=R1
          FLD     QWORD PTR [DI+88]     ; load a[2,3]  = R0
          FMUL    ST, ST(2)             ; a[2,3] * w[a] = R0
          FADDP   ST(1), ST             ; a[2,0]*x[a]...a[2,3]*w[a]=R1
          LES     BX, AddrZ             ; get address of z vector
          FSTP    QWORD PTR ES:[BX+SI]  ; write new z[a]

          FLD     QWORD PTR [DI+96]     ; load a[3,0]  = R1
          FMULP   ST(4), ST             ; a[3,0] * x[a] = R5
          FLD     QWORD PTR [DI+104]    ; load a[3,1]  = R1
          FMULP   ST(3), ST             ; a[3,1] * y[a] = R4
          FLD     QWORD PTR [DI+112]    ; load a[3,2]  = R1
          FMULP   ST(2), ST             ; a[3,2] * z[a] = R3
          FLD     QWORD PTR [DI+120]    ; load a[3,3]  = R1
          FMULP   ST(1), ST             ; a[3,3] * w[a] = R2
          FADDP   ST(1), ST             ; a[3,3]*w[a]+a[3,2]*z[a]=R3
          FADDP   ST(1), ST             ; a[3,3]*w[a]...a[3,1]*y[a]=R4
          FADDP   ST(1), ST             ; a[3,3]*w[a]...a[3,0]*x[a]=R5
          LES     BX, AddrW             ; get address of w vector
          FSTP    QWORD PTR ES:[BX+SI]  ; write new w[a]

          ADD     SI, 8                 ; new offset into arrays
          DEC     CX                    ; decrement element counter
          JZ      $done                 ; no elements left, done
          JMP     $mat_mul              ; transform next vector

$done:    FSTP    ST(0)                 ; clear
          FSTP    ST(0)                 ;  FPU stack
$nothing: POP     DS                    ; restore TP data segment
          POP     BP                    ; restore TP frame pointer
          RET     24                    ; pop parameters and return
MUL_4x4   ENDP

;---------------------------------------------------------------------
;
; IIT_MUL_4x4 multiplies a four-by-four matrix by an array of four
; dimensional vectors. This operation is needed for 3D transformations
; in graphics data processing. There are arrays for each component of
; a vector. Thus there is an array containing all the x components,
; another containing all the y components and so on. Each component is
; an 8 byte IEEE floating-point number. Two indices into the array of
; vectors are given. The first is the index of the vector that will be
; processed first, the second is the index of the vector processed
; last. This subroutine uses the special instructions only available
; on IIT coprocessors to provide fast matrix multiply capabilities.
; So make sure to use it only on IIT coprocessors.
;
;---------------------------------------------------------------------

IIT_MUL_4x4 PROC  NEAR

AddrX     EQU     DWORD PTR [BP+24]     ; address of X component array
AddrY     EQU     DWORD PTR [BP+20]     ; address of Y component array
AddrZ     EQU     DWORD PTR [BP+16]     ; address of Z component array
AddrW     EQU     DWORD PTR [BP+12]     ; address of W component array
AddrT     EQU     DWORD PTR [BP+8]      ; addr. of 4x4 transf. matrix
F         EQU     WORD  PTR [BP+6]      ; first vector to process
K         EQU     WORD  PTR [BP+4]      ; last vector to process
RetAddr   EQU     WORD  PTR [BP+2]      ; return address saved by call
SavdBP    EQU     WORD  PTR [BP+0]      ; saved frame pointer
SavdDS    EQU     WORD  PTR [BP-2]      ; caller's data segment
Ctrl87    EQU     WORD  PTR [BP-4]      ; caller's 80x87 control word

          PUSH    BP                    ; save TURBO-Pascal frame ptr
          MOV     BP, SP                ; new frame pointer
          PUSH    DS                    ; save TURBO-Pascal data seg.
          SUB     SP, 2                 ; make local variable
          FSTCW   [Ctrl87]              ; save 80x87 ctrl word
          LES     SI, AddrT             ; ptr to transformation matrix

          FINIT                         ; initialize coprocessor
          FSBP2                         ; set register bank 2
          FLD     QWORD PTR ES:[SI]     ; load a[0,0]
          FLD     QWORD PTR ES:[SI+32]  ; load a[1,0]
          FLD     QWORD PTR ES:[SI+64]  ; load a[2,0]
          FLD     QWORD PTR ES:[SI+96]  ; load a[3,0]
          FLD     QWORD PTR ES:[SI+8]   ; load a[0,1]
          FLD     QWORD PTR ES:[SI+40]  ; load a[1,1]
          FLD     QWORD PTR ES:[SI+72]  ; load a[2,1]
          FLD     QWORD PTR ES:[SI+104] ; load a[3,1]

          FINIT                         ; initialize coprocessor
          FSBP1                         ; set register bank 1
          FLD     QWORD PTR ES:[SI+16]  ; load a[0,2]
          FLD     QWORD PTR ES:[SI+48]  ; load a[1,2]
          FLD     QWORD PTR ES:[SI+80]  ; load a[2,2]
          FLD     QWORD PTR ES:[SI+112] ; load a[3,2]
          FLD     QWORD PTR ES:[SI+24]  ; load a[0,3]
          FLD     QWORD PTR ES:[SI+56]  ; load a[1,3]
          FLD     QWORD PTR ES:[SI+88]  ; load a[2,3]
          FLD     QWORD PTR ES:[SI+120] ; load a[3,3]

                                        ; transformation matrix loaded

          MOV     AX, F                 ; index of first vector
          MOV     DX, K                 ; index of last vector
          MOV     BX, AX                ; index 1st vector to process
          MOV     CL, 3                 ; component has 8 (2**3) bytes
          SHL     BX, CL                ; compute offset into arrays

          FINIT                         ; initialize coprocessor
          FSBP0                         ; set register bank 0

$mat_loop:LES     SI, AddrW             ; addr. of W component array
          FLD     QWORD PTR ES:[SI+BX]  ; W component current vector
          LES     SI, AddrZ             ; addr. of Z component array
          FLD     QWORD PTR ES:[SI+BX]  ; Z component current vector
          LES     SI, AddrY             ; addr. of Y component array
          FLD     QWORD PTR ES:[SI+BX]  ; Y component current vector
          LES     SI, AddrX             ; addr. of X component array
          FLD     QWORD PTR ES:[SI+BX]  ; X component current vector
          F4X4                          ; mul 4x4 matrix by 4x1 vector
          INC     AX                    ; next vector
          MOV     DI, AX                ; next vector
          SHL     DI, CL                ; offset of vector into arrays

          FSTP    QWORD PTR ES:[SI+BX]  ; store X comp. of curr. vect.
          LES     SI, AddrY             ; address of Y component array
          FSTP    QWORD PTR ES:[SI+BX]  ; store Y comp. of curr. vect.
          LES     SI, AddrZ             ; address of Z component array
          FSTP    QWORD PTR ES:[SI+BX]  ; store Z comp. of curr. vect.
          LES     SI, AddrW             ; address of W component array
          FSTP    QWORD PTR ES:[SI+BX]  ; store W comp. of curr. vect.

          MOV     BX, DI                ; ofs nxt vect. in comp. arrays
          CMP     AX, DX                ; nxt vector past upper bound?
          JLE     $mat_loop             ; no, transform next vector

          FLDCW   [Ctrl87]              ; restore orig 80x87 ctrl word

          ADD     SP, 2                 ; get rid of local variable
          POP     DS                    ; restore TP data segment
          POP     BP                    ; restore TP frame pointer
          RET     24                    ; pop parameters and return
IIT_MUL_4x4 ENDP

CODE      ENDS

          END

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$N+,E+}

PROGRAM Trnsform;

USES Time;

CONST VectorLen = 8190;

TYPE  Vector    = ARRAY [0..VectorLen] OF DOUBLE;
      VectorPtr = ^Vector;
      Mat4      = ARRAY [1..4, 1..4] OF DOUBLE;

VAR   X: VectorPtr;
      Y: VectorPtr;
      Z: VectorPtr;
      W: VectorPtr;
      T: Mat4;
      K: INTEGER;
      L: INTEGER;
      First: INTEGER;
      Last:  INTEGER;
      Start: LONGINT;
      Elapsed:LONGINT;

PROCEDURE MUL_4X4     (X, Y, Z, W: VectorPtr;
                       VAR T: Mat4;
                       First, Last: INTEGER); EXTERNAL;
PROCEDURE IIT_MUL_4X4 (X, Y, Z, W: VectorPtr;
                       VAR T: Mat4;
                       First, Last: INTEGER); EXTERNAL;

{$L M4X4.OBJ}

BEGIN
   WriteLn ('Test8087 = ', Test8087);
   New (X);
   New (Y);
   New (Z);
   New (W);
   FOR L := 1 TO VectorLen DO BEGIN
      X^ [L] := Random;
      Y^ [L] := Random;
      Z^ [L] := Random;
      W^ [L] := Random;
   END;
   X^ [0] := 1;
   Y^ [0] := 1;
   Z^ [0] := 1;
   W^ [0] := 1;
   FOR K := 1 TO 4 DO BEGIN
      FOR L := 1 TO 4 DO BEGIN
         T [K, L] := (K-1)*4 + L;
      END;
   END;
   First := 0;
   Last  := 8190;
   Start := Clock;
   MUL_4X4 (X, Y, Z, W, T, First, Last);
   { IIT_MUL_4X4 (X, Y, Z, W, T, First, Last); }
   Elapsed := Clock - Start;
   WriteLn ('Number of vectors: ', Last-First+1);
   WriteLn ('Time: ', Elapsed, ' ms');
   WriteLn ('Equivalent to ', (28.0*(Last-First+1)/1e6)/
            (Elapsed*1e-3):0:4, ' MFLOPS');
   WriteLn;
   WriteLn ('Last vector:');
   WriteLn;
   WriteLn (X^[Last]);
   WriteLn (Y^[Last]);
   WriteLn (Z^[Last]);
   WriteLn (W^[Last]);
END.

"This document has been created to provide the net.community with some detailed information about mathematical coprocessors for the Intel 80x86 CPU family. It may also help to answer some of the FAQs (frequently asked questions) about this topic. The primary focus of this document is on 80387-compatible chips, but there is also some information on the other chips in the 80x87 family and the Weitek family of coprocessors. Care was taken to make the information included as accurate as possible. If you think you have discovered erroneous information in this text, or think that a certain detail needs to be clarified, or want to suggest additions, feel free to contact me at:
S_JUFFA@IRAVCL.IRA.UKA.DE

or at my SnailMail address:

Norbert Juffa
Wielandtstr. 14
7500 Karlsruhe 1
Germany

This is the fifth version of this document (dated 01-13-93) and I'd like
to thank those who have helped improve it by commenting on the previous
versions:
Fred Dunlap (cyrix!fred@texsun.Central.Sun.COM), Peter Forsberg
(peter@vnet.ibm.com), Richard Krehbiel (richk@grevyn.com), Arto
Viitanen (av@cs.uta.fi), Jerry Whelan (guru@stasi.bradley.edu),
Eric Johnson (johnson%camax01@uunet.UU.NET), Warren Ferguson
(ferguson@seas.smu.edu), Bengt Ask (f89ba@efd.lth.se), Thomas Hoberg
(tmh@prosun.first.gmd.de), Nhuan Doduc (ndoduc@framentec.fr), John
Levine (johnl@iecc.cambridge.ma.us), David Hough (dgh@validgh.com),
Duncan Murdoch (dmurdoch@mast.QueensU.CA), Benjamin Eitan
(benny.iil.intel.com)
A very special thanks goes to David Ruggiero (osiris@halcyon.halcyon.com),
who did a great job editing and formatting this article. Thanks David!"