An introductory tour of the GNU Compiler Collection and the GNU Project Debugger.
Overview
GCC originally stood for the GNU C Compiler when Richard Stallman first released it in 1987.
It has since evolved into the GNU Compiler Collection,
which supports many programming languages, hardware architectures, and operating systems.
GCC includes compiler frontends for C, C++, Objective-C, Fortran, Go, and other languages,
along with the standard libraries for these languages.
Its middle-end optimizes the intermediate representation produced by the frontend and lowers it into a register-transfer language (RTL) for the target architecture,
and its backends generate machine code for different architectures.
GCC is 100% free software from the GNU Operating System project,
and is a cornerstone of the world of free software.
In this post we take a brief tour of the C compiler gcc in GCC
(we use the lowercase gcc and gdb to refer to the executables installed on your machine),
and look at how the GCC compilers work.
We also introduce the GNU Project Debugger (GDB) and briefly explain how a debugger works.
Install gcc and gdb
On Ubuntu, gcc is included in the build-essential package, while gdb is packaged separately,
so to install both, just run sudo apt install build-essential gdb.
The compiler gcc
Prepare a simple C program
Let's prepare a simple C program with two source files to use in our examples.
File: hello.c
#include <stdio.h>
void print_hello() {
    printf("Hello \n");
    printf("World\n");
}
File: include/hello.h
void print_hello();
File: main.c
#include <stdio.h>
#include <stdlib.h>
#include <hello.h>
int main(int argc, char *argv[]) {
    print_hello();
    int dividend, divisor;
    float result;
    dividend = atoi(argv[1]);
    divisor = atoi(argv[2]);
    result = dividend / divisor;
    printf("%d/%d = %f\n", dividend, divisor, result);
    return 0;
}
The basic usage of gcc
To compile and run our program, just pass the file names to the gcc executable:
gcc -o main main.c hello.c -I./include
./main 4 2
It prints:
Hello
World
4/2 = 2.000000
Internal steps of gcc
Internally, gcc goes through several steps to transform your C source code
into the binary program main that the machine can load.
It is a multistage process involving several tools in GCC,
and the gcc command you typed acts as a wrapper around these tools.
Overall, compiling the simple C program above involves preprocessing, compiling, assembling, and linking.
Preprocessing
The first step is preprocessing.
In this step, gcc uses the C preprocessor (cpp) to expand the macros and copy the contents of the included header files into your source code.
We can use gcc to run the steps one at a time; the -E option tells gcc to stop after preprocessing.
gcc -E hello.c > hello.i
# or
cpp hello.c > hello.i
The gcc -E command is equivalent to the cpp command, as both run the same C preprocessor.
By convention, the .i and .ii extensions are used for preprocessed C and C++ source files respectively,
but they are really just expanded C/C++ source code.
Compiling
gcc compiles the C code into assembly code for the target machine architecture.
gcc -S hello.c
# or
gcc -S hello.i
This command generates a file hello.s, which contains the assembly code.
Assembling
Assembly code is just a human-readable representation of the machine code instructions.
The assembler program, as, assembles it into a binary object file.
gcc -c hello.c
# or
gcc -c hello.s
# or
as -o hello.o hello.s
This generates the file hello.o, called an object file.
In this example, hello.c does not contain a main function, so it cannot run on its own.
But even if you compile a standalone C program that contains main into an object file, it is still not ready to run.
We can get some hints from the file type of hello.o:
file hello.o
It returns the following information on my machine:
hello.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
We can see that it is a 64-bit ELF file, least-significant byte first (LSB), and relocatable.
In .o object files, the memory addresses of functions and variables remain unresolved.
We can use the readelf utility to check the symbol tables of the object file.
(Also check out the objdump utility to examine an object file in detail.)
readelf --symbols hello.o | grep print_hello
It will print:
10: 0000000000000000 35 FUNC GLOBAL DEFAULT 1 print_hello
You can see that the Value column for the function print_hello and the other symbols is all zeros.
It is the linker's job to merge the object files together, relocate the functions and variables, and resolve the addresses.
Linking
Let's first have our main.c compiled and assembled too:
gcc -c main.c -I./include
Then link the two object files into an executable:
gcc -o main hello.o main.o
gcc uses the standalone ld program to link the objects together,
but using ld directly requires a number of extra parameters; check the manual for details.
Now if you check the symbol table of the main executable:
readelf --symbols main | grep print_hello
The result:
67: 0000000000001207 35 FUNC GLOBAL DEFAULT 16 print_hello
The symbol print_hello now has an address.
Other frequently used gcc options
- -D name, -D name=value: Define a macro name.
- -I: Extra search path for headers.
- -On: Optimization level n.
- -Wall: Enable the most commonly used warnings (despite the name, not every warning).
- -L: Extra search path for libraries.
- -l: Link with a library.
Related environment variables:
- C_INCLUDE_PATH, CPLUS_INCLUDE_PATH: Search path for C/C++ headers.
- LIBRARY_PATH: Search path for libraries at link time. The linker (invoked by gcc) uses it to look for libraries.
- LD_LIBRARY_PATH: Search path for dynamic libraries at run time. Not related to GCC itself; it is used by the loader (ld-linux.so) to find dynamic libraries for your program when it runs.
Related topics
The compiler process
With GCC it is relatively hard to examine the compilation process in detail without diving into the source code.
GCC now uses a complex hand-written lexer and parser for performance.
For beginners, the GCC Tiny project is a good starting point for learning about the compiler frontend.
It is also worth using the LLVM project to learn the internals of a modern compiler.
LLVM has a well-designed compiler frontend API and provides excellent compiler backend implementations,
so that you can create a programming language easily and quickly.
In this post we only use the LLVM-based C compiler clang to show the intermediate results of the lexer and parser:
clang -fsyntax-only -Xclang -dump-tokens hello.c
clang -fsyntax-only -Xclang -ast-dump hello.c
Dynamic linking and the loader
Modern programs are mostly dynamically linked, so when you inspect the ELF file you generated, not all symbols are resolved statically.
When the program needs to resolve symbols from dynamic libraries (libxxx.so), it relies on a loader;
on Linux, this is ld-linux.so (/lib64/ld-linux-x86-64.so.2 on my machine).
This article explains well how an ELF file is loaded by the operating system,
and how the dynamic loader ld-linux.so works.
The debugger gdb
A debugger runs the target program under controlled conditions. It usually lets you set breakpoints, step into or over machine-level instructions or high-level language statements, examine variable values, and even modify instructions or data.
Compile the executable with debugging information
The debugger needs extra information, including the source code mapping, to let you debug in terms of the high-level language instead of machine-level instructions. To compile the executable and produce debugging information:
gcc -g3 -O0 -o main main.c hello.c -I./include
The option -g3 adds debugging information for the macros in your code.
If you do not have or do not care about debugging macros, using -g is just fine.
The option -O0 is used to disable any optimization that might cause surprises while debugging.
A tour of gdb commands
Let's debug the executable main:
gdb ./main
A list of frequently used GDB commands:
- list: List the source code.
- run: Run the program.
  - run arg1 arg2 ... to pass arguments.
- break: Set a breakpoint.
  - break main.c:n to set a breakpoint at line n of main.c.
  - break function_name to set a breakpoint at the function function_name.
- condition: Set a breakpoint condition.
  - condition n variable == m to let breakpoint n be hit only if variable equals m.
- print: Print a value.
  - print variable to print the value of variable.
  - print expression to print the value of an expression.
  - print $rax or print $eax to print the value in register rax or eax.
- step: Step into the next line.
- stepi: Step into the next instruction.
- finish: Step out of the current function.
- next: Step over the next line.
- reverse-step, reverse-next, reverse-finish: Step/next/finish backwards (execution must be recorded first, for example with the record command).
- continue: Continue to run the program.
- backtrace: Print the call stack.
- info: Print information about the program.
  - info breakpoints: Print the breakpoints.
  - info registers: Print all registers.
- show: Print information about the debugger.
  - show args: Print the arguments passed to the program.
GDB also supports a text-based UI, called TUI mode:
- Ctrl-x a: Switch the TUI mode on or off.
- Ctrl-x 1: Use a layout with only one window (source or assembly).
- Ctrl-x 2: Use a layout with at least two windows (two from source, assembly, register).
- layout next/prev/name: Switch to the next/previous layout, or the layout name.
- Ctrl-x o: Switch focus between windows.
- Ctrl-x s: Toggle the single key mode.
Use gdb in VS Code
A section on setting up the VS Code debugging experience is included in the post Linux kernel overview.
How a debugger works
Internally, the debugger uses the system call ptrace() to
'control the execution of another process (the "tracee"), and examine and change the tracee's memory and registers'.
Briefly speaking, the debugger:
- Calls fork() to create a child process.
- In the child process, calls ptrace() with the argument PTRACE_TRACEME. This lets the OS stop the child process whenever it receives a signal (except SIGKILL), and notify the parent process via wait().
- The child process then uses the exec() system call to load the tracee program. The PTRACE_TRACEME request causes a SIGTRAP to be sent to the child process whenever exec() is called, so the child process stops there and the parent gets notified.
- The parent process can call wait() in a loop and take action whenever the child stops. For example, it can call ptrace() on the child process with PTRACE_SINGLESTEP to let it run one instruction and then stop, or use PTRACE_CONT to let the child process continue running.
- The parent process can also use ptrace() to peek and poke the child process. For example, it can use PTRACE_POKETEXT to replace the first byte of any instruction with the INT 3 instruction, so that the child program stops at that instruction. It can then restore the original instruction, use PTRACE_SETREGS to move eip or rip back by one, and restart the child process. This is basically how a breakpoint is implemented.
To understand more of the details and experiment with ptrace() in an example,
I recommend reading the How debugger works articles by Eli Bendersky.
References
- Tutorials by Professor S. Weiss from the City University of New York.
- Official manual Debugging with GDB.
- LLVM for Grad Students by Adrian Sampson.
- How debugger works by Eli Bendersky.
- Playing with ptrace by Pradeep Padala.
- How debugger works by Alexander Sandler.