The FPREM and FPREM1 instructions are slow on all processors. You may replace them with the following algorithm: multiply by the reciprocal of the divisor, get the fractional part by subtracting the truncated value, then multiply by the divisor (see chapter 27.5 on how to truncate).
Some documents say that these instructions may give incomplete reductions and that it is therefore necessary to repeat the FPREM or FPREM1 instruction until the reduction is complete. I have tested this on several processors, beginning with the old 8087, and I have found no situation where a repetition of FPREM or FPREM1 was needed.