=====================================================================
7. Debugging
=====================================================================
Debugging is a generic term for the art of finding problems in code, or validating that code is working correctly.
Programmers debug their code while developing applications - they might do this by inserting additional "debug code" to log or display internal actions and data so that what the process is doing can be verified.
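As a minimal sketch of what such "debug code" can look like, the Python below adds logging calls to a small function so its internal steps can be checked. The function name average and the logger name are hypothetical, chosen only for illustration.

```python
import logging

# Show debug-level messages; in a release build this would typically be raised to WARNING.
logging.basicConfig(level=logging.DEBUG,
                    format="%(levelname)s %(funcName)s: %(message)s")
log = logging.getLogger("demo")  # hypothetical logger name


def average(values):
    """Return the mean of values, logging each internal step."""
    log.debug("called with %d values: %r", len(values), values)
    if not values:
        log.debug("empty input, returning 0.0")
        return 0.0
    total = sum(values)
    log.debug("total=%r", total)
    result = total / len(values)
    log.debug("result=%r", result)
    return result
```

The debug lines cost little to leave in place, and the log level can be changed without touching the function itself.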
Another way to locate bugs is to use a debugger.
Despite the implication of its name, a debugger does not actually do any kind of bug removal - that's still your job - but it can be used to "wrap" a process so that at any point you can pause it and browse the data, stacks, threads and handles the process has at that point in time.
You can also use breakpoints to mark a section of code and let the process run normally until it hits that point, at which time the process is paused and control passes to the debugger - useful if you know which function is called last when an error occurs but the error is difficult or time-consuming to reproduce.
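The idea of a breakpoint can be sketched in Python using the interpreter's trace hook. This is only an illustration and not how a Windows debugger is implemented: instead of pausing, it takes a snapshot of the arguments each time a named function is entered. The helper name run_with_breakpoint is hypothetical.

```python
import sys


def run_with_breakpoint(target, break_on, *args):
    """Run target(*args) and snapshot the arguments of every call to a
    function named break_on (a stand-in for pausing at a breakpoint)."""
    captured = []

    def tracer(frame, event, arg):
        # The global trace function is invoked with "call" for each new frame.
        if event == "call" and frame.f_code.co_name == break_on:
            captured.append(dict(frame.f_locals))
        return None  # no line-by-line tracing inside the frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = target(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, captured


def parse(text):
    return int(text)


def main():
    return parse("7") + parse("8")
```

Calling run_with_breakpoint(main, "parse") lets main run normally while recording what parse was given on each call, which is the information you would inspect when a real breakpoint fires.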
When a process in Windows tries to do something it should not (most often accessing memory it does not own or freeing memory twice), a "first chance" exception is raised.
This is where the OS checks to see if there is any debugger attached to the process - if there is then it is given control and details of the exception.
If there is no debugger attached, or the debugger tells the OS to let the exception pass, then control is passed back to the process to see how it behaves.
If the application is well written, it will have an exception handling routine and can either gracefully deal with it if it is not critical (possibly make an entry in a log ready to send back to the author) or close down in a more friendly manner than crashing.
A lot of applications are not capable of handling exceptions they raise (complexity of code, size of program, time to develop, etc.) and so the exception remains "unhandled".
Windows then checks (again) to see if there is a debugger attached to the process and, if so, passes it the "second chance" exception.
If there is no debugger present then the default "post-mortem" debugger kicks in - Dr Watson.
Dr Watson adds an entry to its log and, if configured to do so, creates a dump of the process memory space.
The settings for Dr Watson can be seen and changed by running drwtsn32.exe.
The information you can see is:
Log file path
Crash dump path and filename
Sound effect to play
Number of instructions to put in the log from the crashed process
Number of crashes to record
Crash dump type (full, mini, "NT compatible")
There are also a number of options to toggle (including whether application crashes create dump files at all) and a summary of the most recent log entries.
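The first chance / second chance sequence described above can be modelled as a small decision function. This is a simplified sketch of the order of events as the text describes it, not the actual Windows dispatcher; the function and parameter names are hypothetical.

```python
def dispatch_exception(debugger_attached, debugger_handles_first_chance,
                       app_has_handler):
    """Return the ordered steps an exception passes through (simplified model)."""
    steps = []
    if debugger_attached:
        steps.append("first chance -> debugger")
        if debugger_handles_first_chance:
            steps.append("handled by debugger")
            return steps
    # No debugger, or the debugger let the exception pass.
    steps.append("exception -> application")
    if app_has_handler:
        steps.append("handled by application")
        return steps
    # The exception is now unhandled.
    if debugger_attached:
        steps.append("second chance -> debugger")
    else:
        steps.append("post-mortem debugger (Dr Watson)")
    return steps
```

For example, dispatch_exception(False, False, False) models an application with no handler and no debugger attached, which is the case where Dr Watson takes over.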
So user mode applications can be "live debugged" or have a crash dump analysed to find the root cause - but what about kernel mode exceptions?
We know that exceptions in kernel mode cause bugchecks to protect the integrity of data, but the kernel consists of many processes, threads, handles, objects, memory pools, stacks and they can corrupt each other - so how to "debug" the entire kernel?
If we run a debugger as a user mode process and try to freeze the kernel, we have just frozen the entire OS - so the debugger would freeze too...
Recall I mentioned the "/DEBUG" BOOT.INI switch earlier - with this option enabled (along with a DEBUGPORT to use), we can attach an external debugger running on another machine to the system and debug Windows itself.
Now you can manually break into a running system, freeze it and analyse all the kernel mode data structures, the currently running processes and threads, and memory limits and usage - while frozen, the system being debugged does not update its screen and the mouse pointer does not move.
If you want, at any time you can tell the debugger to release and let the debugged system continue on its way - other than a clock change it won't notice any difference.
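As a rough sketch of what enabling the "/DEBUG" switch involves, the helper below appends /DEBUG and a /DEBUGPORT setting to a BOOT.INI entry line if they are not already there. The sample entry, the COM1 default and the helper name are assumptions for illustration; check the correct port for your own debug cable.

```python
def add_debug_switches(boot_entry, port="COM1"):
    """Append /DEBUG and /DEBUGPORT=<port> to a BOOT.INI entry if missing.

    Assumes switches are whitespace-separated tokens starting with "/".
    """
    entry = boot_entry.rstrip()
    switches = [tok.lower() for tok in entry.split() if tok.startswith("/")]
    if "/debug" not in switches:
        entry += " /DEBUG"
    if not any(s.startswith("/debugport=") for s in switches):
        entry += f" /DEBUGPORT={port}"
    return entry


# Hypothetical sample entry, not taken from a real system.
SAMPLE = r'multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows XP" /fastdetect'
```

Calling the helper on a line that already carries both switches leaves it unchanged, so it is safe to run more than once.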
If you have a debugger attached when a kernel mode exception occurs, the debugger is notified and given the opportunity to see the system in its broken state without (or before) creating a memory dump file.
This is a live kernel debug, and we have the same post-mortem debug option by loading the memory.dmp file created when Windows bugchecks into the debugger.
However, if the dump options have been disabled then there is nothing to analyse.
If the dump option is set to "mini dump" then only a very small amount of data is stored.
If a "kernel mode dump" is selected, then the code and data stored in physical memory for the kernel is dumped to a file - this is the most common dump file that actually contains useful data, but the size of the dump cannot be known beforehand as the amount of physical memory used by the kernel will vary.
For a kernel dump file to be created a swap file is required, and there are minimum page file sizes dependent on the amount of physical memory installed:
<128MiB -> 50MiB swap file
128MiB-4GiB -> 200MiB swap file
4GiB-8GiB -> 400MiB swap file
8GiB+ -> 800MiB swap file
The only way to guarantee a kernel dump can be stored would be to set the page file size at 2GiB+1MiB, as on a 32-bit system the kernel cannot be larger than 2GiB (and there is a little overhead for the dump file header).
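The minimum page file sizes above can be expressed as a simple lookup. This sketch assumes each range includes its lower bound (so exactly 128MiB falls in the 200MiB band), which the table does not state explicitly; the names are hypothetical.

```python
GIB_IN_MIB = 1024

# Page file size that guarantees a kernel dump on a 32-bit system:
# 2GiB maximum kernel size plus 1MiB for the dump header.
GUARANTEED_KERNEL_DUMP_MIB = 2 * GIB_IN_MIB + 1


def min_page_file_mib(physical_mib):
    """Minimum page file size (MiB) for a kernel dump, per the table above.

    Assumption: the lower bound of each range is inclusive.
    """
    if physical_mib < 128:
        return 50
    if physical_mib < 4 * GIB_IN_MIB:
        return 200
    if physical_mib < 8 * GIB_IN_MIB:
        return 400
    return 800
```

Note that these are minimums for a dump to be attempted at all; only the 2GiB+1MiB figure guarantees that the dump will fit.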
There must also be enough free disk space, equal to the size of the dump created, on the system volume and on the volume where memory.dmp is specified.
If a "complete memory dump" is selected, then all physical memory is dumped to a file, and the swap file and free disk space on the system volume must be at least as big as physical memory plus 1MiB each to guarantee a working dump.
The reason for the swap file being required for kernel or complete memory dumps is that this is where the memory is dumped to initially, and the reason for the free disk space requirement is that this dump is then copied to memory.dmp before the swap file is cleared and the system restarted.
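The complete memory dump requirement described above reduces to one comparison per resource. The sketch below checks both the swap file and the free space on the system volume against physical memory plus 1MiB; the function names are hypothetical and all sizes are in MiB.

```python
def complete_dump_requirement_mib(physical_mib):
    """Size needed for the swap file, and again as free disk space: RAM + 1MiB."""
    return physical_mib + 1


def can_guarantee_complete_dump(physical_mib, page_file_mib, free_system_volume_mib):
    """True if both the swap file and the free disk space are large enough."""
    needed = complete_dump_requirement_mib(physical_mib)
    return page_file_mib >= needed and free_system_volume_mib >= needed
```

Both conditions must hold because the dump is written to the swap file first and only then copied to memory.dmp.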
A complete dump is not often necessary as bugchecks cannot be caused by user mode processes.
A bugcheck can also be instigated manually by setting the registry value CrashOnCtrlScroll to 1 and using a key combination on a (non-USB) locally-connected keyboard.
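As a sketch of how the CrashOnCtrlScroll value might be set, the Python below generates the text of a .reg file. The key path under the i8042prt (PS/2 keyboard) service is not given in the text above and is an assumption to be verified against Microsoft's documentation before use; the function name is hypothetical.

```python
# Assumed location of the value for PS/2 keyboards - verify before importing.
I8042PRT_KEY = (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet"
                r"\Services\i8042prt\Parameters")


def crash_on_ctrl_scroll_reg(enabled=True):
    """Return .reg file text that sets CrashOnCtrlScroll to 1 (or 0)."""
    value = 1 if enabled else 0
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{I8042PRT_KEY}]",
        f'"CrashOnCtrlScroll"=dword:{value:08x}',
        "",
    ]
    return "\r\n".join(lines)
```

The generated text can be saved to a file and reviewed before being imported; a restart is normally needed before a driver setting like this takes effect.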
Why would you want to bluescreen your machine on purpose?
Most likely because you have a memory leak in kernel mode.
Memory leaks do not generate bugchecks - a poorly written kernel mode process might just consume and consume and consume until there is nothing left for other Windows processes and no new processes can spawn, deadlocking the system but not actually causing an exception.
In this situation (where you most likely have event IDs 2019 and 2020 logged), you could manually crash the system to produce a memory dump, and then analyse it to see which process consumed all the memory.
User mode application crashes are the responsibility of the vendor (remember that due to the "extension" model of Explorer and Internet Explorer, 3rd party plugins can be the cause of crashes in core components).
Kernel mode exceptions are invariably the result of a problem with a 3rd party driver (device or filter).
In either case, the first check would be to ensure the latest version of the software is installed (the likely faulting component being identified by analysing the dump).
If the software is up to date, or the problem persists, then remove the software entirely and see if it still occurs (not always practical in the case of storage or display drivers).
If all the software is up to date and no individual piece of software can be identified as the cause, then a live debug may be required.