4. Debugging and verifying
Debugging assembly code can be hard and frustrating, as you have probably already discovered. I recommend that you start by writing the piece of code you want to optimize as a subroutine in a high-level language. Next, write a test program that tests your subroutine thoroughly. Make sure the test program exercises all branches and boundary cases.
When your high-level language subroutine works with your test program, you are ready to translate the code to assembly language.
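As an illustration of such a subroutine and its test program, here is a minimal sketch in C. The routine find_max_ref and its behavior (return the index of the largest element, -1 for empty input) are hypothetical and chosen only for the example. The test takes a function pointer, on the assumption that the later assembly version will have the same calling interface and can be passed to the same test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine: returns the index of the largest
   element, or -1 if the array is empty. Ties return the first one. */
int find_max_ref(const int *a, int n)
{
    int i, best;
    if (a == NULL || n <= 0)
        return -1;
    best = 0;
    for (i = 1; i < n; i++) {
        if (a[i] > a[best])
            best = i;
    }
    return best;
}

/* Test program body: one case per branch and boundary condition.
   f is the implementation under test (reference or translation). */
void test_find_max(int (*f)(const int *, int))
{
    static const int one[]   = { 7 };
    static const int first[] = { 9, 3, 1 };
    static const int last[]  = { 1, 3, 9 };
    static const int ties[]  = { 4, 8, 8, 2 };
    static const int neg[]   = { -5, -2, -9 };

    assert(f(NULL, 0) == -1);   /* no array at all */
    assert(f(one, 0) == -1);    /* valid pointer, zero length */
    assert(f(one, 1) == 0);     /* loop body never executes */
    assert(f(first, 3) == 0);   /* comparison never succeeds */
    assert(f(last, 3) == 2);    /* comparison always succeeds */
    assert(f(ties, 4) == 1);    /* equal values keep the first index */
    assert(f(neg, 3) == 1);     /* negative values */
}
```

Calling test_find_max(find_max_ref) from main checks the reference version. The same call with the assembly routine's address would check the translation.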
Now you can start to optimize. Each time you make a modification, run it on the test program to see if it still works correctly. Number all your versions and save them so that you can go back and test them again in case you discover an error that the test program didn't catch (such as writing to a wrong address).
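One way to make the test program more likely to catch a write to a wrong address is to compare each new version against the saved reference version on a buffer that is larger than the output, so that any byte changed outside the intended range shows up as a difference. The sketch below assumes a hypothetical routine that adds one byte array to another. The names add_bytes_ref, add_bytes_opt, add_bytes_bad and compare_versions are invented for the example, and the C functions only stand in for successive assembly versions.

```c
#include <assert.h>
#include <string.h>

#define GUARD 16   /* untouched bytes kept on each side of the output */
#define MAXN  64   /* longest input length tested */

typedef void (*add_fn)(unsigned char *dst, const unsigned char *src, size_t n);

/* Trusted reference version: dst[i] += src[i] for n bytes. */
void add_bytes_ref(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = (unsigned char)(dst[i] + src[i]);
}

/* Stand-in for a correct optimized version: unrolled by four. */
void add_bytes_opt(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = (unsigned char)(dst[i]     + src[i]);
        dst[i + 1] = (unsigned char)(dst[i + 1] + src[i + 1]);
        dst[i + 2] = (unsigned char)(dst[i + 2] + src[i + 2]);
        dst[i + 3] = (unsigned char)(dst[i + 3] + src[i + 3]);
    }
    for (; i < n; i++)
        dst[i] = (unsigned char)(dst[i] + src[i]);
}

/* Deliberately faulty version: rounds n up to a multiple of four,
   so it modifies bytes past the end when n is not a multiple of four. */
void add_bytes_bad(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i, padded = (n + 3) & ~(size_t)3;
    for (i = 0; i < padded; i++)
        dst[i] = (unsigned char)(dst[i] + src[i]);
}

/* Runs ref and cand on identical buffers for every length 0..MAXN and
   returns the number of lengths where the buffers differ afterwards.
   The whole buffer is compared, so stray writes are detected too. */
int compare_versions(add_fn ref, add_fn cand)
{
    unsigned char src[MAXN];
    unsigned char expect[MAXN + 2 * GUARD], actual[MAXN + 2 * GUARD];
    size_t n, i;
    int failures = 0;

    for (i = 0; i < MAXN; i++)
        src[i] = (unsigned char)(i * 37 + 11);

    for (n = 0; n <= MAXN; n++) {
        for (i = 0; i < sizeof expect; i++)
            expect[i] = actual[i] = (unsigned char)(0xA5 ^ i);
        ref(expect + GUARD, src, n);
        cand(actual + GUARD, src, n);
        if (memcmp(expect, actual, sizeof expect) != 0)
            failures++;
    }
    return failures;
}

/* Run after every modification; aborts if the new version misbehaves. */
void verify_current_version(void)
{
    assert(compare_versions(add_bytes_ref, add_bytes_opt) == 0);
}
```

With these definitions, compare_versions(add_bytes_ref, add_bytes_bad) returns a nonzero count, because the faulty version changes bytes beyond the requested length. A check of this kind cannot detect writes that land outside the test buffer altogether, which is one more reason to keep all numbered versions.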
Test the speed of the most critical part of your program with the method described in chapter 30 or with a test program. If the code is significantly slower than expected, then the most probable causes are: cache misses (chapter 7), misaligned operands (chapter 6), first time penalty (chapter 8), branch mispredictions (chapter 22), instruction fetch problems (chapter 15), register read stalls (chapter 16), or long dependency chains (chapter 20).
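If you use a test program for timing, it helps to time several runs separately, because the first run typically includes the first time penalty and cache misses, while later runs show the steady-state speed. The following is only a rough sketch of that idea. It uses the standard C function clock(), which is far coarser than a cycle counter and may well report zero ticks for a short routine, so it is not a substitute for the method of chapter 30. The names routine_under_test and time_min_ticks are hypothetical.

```c
#include <assert.h>
#include <time.h>

/* Result is stored here so the compiler cannot discard the work. */
volatile unsigned long sink;

/* Hypothetical routine under test; replace with the critical code. */
void routine_under_test(void)
{
    unsigned long s = 0;
    int i;
    for (i = 0; i < 100000; i++)
        s += (unsigned long)i * 3u;
    sink = s;
}

/* Times `runs` separate calls of f with clock().
   Stores the time of the first call in *first (if first is not NULL)
   and returns the shortest time seen, both in clock ticks. */
long time_min_ticks(void (*f)(void), int runs, long *first)
{
    long best = -1, dt;
    clock_t t0, t1;
    int r;

    assert(f != NULL && runs > 0);
    for (r = 0; r < runs; r++) {
        t0 = clock();
        f();
        t1 = clock();
        dt = (long)(t1 - t0);
        if (r == 0 && first != NULL)
            *first = dt;          /* includes first time effects */
        if (best < 0 || dt < best)
            best = dt;            /* best case over all runs */
    }
    return best;
}
```

A first-run time that is much larger than the minimum points toward cache misses or the first time penalty. A minimum that is itself larger than expected points toward the other causes listed above.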
Highly optimized code tends to be very difficult for others to read and understand, and even for yourself when you get back to it after some time. To keep the code maintainable, it is important that you organize it into small logical units (procedures or macros) with well-defined interfaces and appropriate comments. The more complicated the code is to read, the more important good documentation becomes.