Computing knowledge base
Shells
Shells are programs that interpret commands. They act as your interface to the system by allowing you to run other programs. When you type on your computer’s command line, you are using a shell in interactive mode. You can also write shell scripts to be batch processed.
There are many different shell command languages and shells that understand them. Most operating systems have multiple options, and you can choose which ones to use for scripting and your interactive shell.
sh is the POSIX-specified shell command language. Nearly every operating system has a shell located at /bin/sh that supports it. Modern shells that interpret languages with more features and better syntax than sh often have compatibility modes to interpret sh scripts.
bash is a command language and corresponding shell implementation. It is derived from sh with a number of extensions that make it nicer to use. It is also extremely widespread, but less so than sh. FreeBSD, for example, does not ship with a bash shell by default. You can find a description of the language and shell in the Bash manual.
zsh is a command language and shell, also derived from sh, that is more modern and friendly than bash. It is present on fewer systems than sh or bash, but it is gaining popularity. It has the best interactive mode of the three. You can find a description of the language and shell in the Zsh manual.
Choosing a shell
I recommend using zsh for your interactive shell, where concerns about cross-platform support don’t apply.
When you need to write a script, you should choose the language based on where the script needs to work.
If the script is non-trivial and only needs to work on a small set of machines that you control, I recommend using a real programming language. They are much nicer to use, and all major languages have libraries for doing things that shells normally do, like executing subprocesses and setting up pipelines. These libraries also often support invoking a shell directly. You can simply install an interpreter for your language of choice on all the machines that need to run your script, and you’re off to the races.
If the script needs to work absolutely everywhere, then use sh. Otherwise, bash’s improved syntax is likely worth the reduced compatibility.
In the following notes on shell scripting, I assume the use of bash. For correct information about a particular shell or shell command language, read the appropriate manual.
Environment and shell variables
Like all processes, shells operate with a set of environment variables that are key-value pairs. These environment variables are passed to child processes that the shell creates. Shells also have a set of key-value pairs, known as shell variables, that are not passed to child processes. The set of shell variables is a superset of the set of environment variables. You can control environment variables, shell variables, and how they are passed to child processes using the shell command language.
To set a shell variable that later commands and built-in shell functions will see but child processes will not, you can use the syntax varname=value. To set an environment variable that will also be seen by child processes, you can use the export builtin: export varname=value. You can also define variables that are passed to a child process’ environment but not set as shell or environment variables in the shell’s process by defining them and executing a child process on the same line: varname=value program.
Here is an example script that shows all of these variable-setting methods in action:
# creates a shell variable
$ MYSHELLVAR=hello
# creates an environment variable
$ export MYENVVAR=goodbye
# prints "hello" because echo is a builtin
# and can see the shell variable
$ echo ${MYSHELLVAR}
# prints nothing because the shell variable
# is not exported to the bash child process
$ bash -c 'echo ${MYSHELLVAR}'
# prints "goodbye" because echo is a builtin
# and can see the environment variable
$ echo ${MYENVVAR}
# prints "goodbye" because the environment
# variable is exported to the bash child process
$ bash -c 'echo ${MYENVVAR}'
# prints "ciao" because MYCHILDVAR is passed
# to the child process' environment
$ MYCHILDVAR="ciao" bash -c 'echo ${MYCHILDVAR}'
# prints nothing because MYCHILDVAR was only set
# for the previous command's child process
$ echo ${MYCHILDVAR}
You can list all shell variables with the set builtin by typing set at the command line with no arguments. You can list all environment variables passed to a child process with the standalone env program by typing env at the command line with no arguments.
Shell variables are used to store information private to the shell, including options that configure shell behavior and user-defined variables. Environment variables are used by all kinds of programs. Here are some canonical ones that are useful to know:
- SHELL: path to the binary of the currently running shell
- SHLVL: nesting level of the current shell
- PATH: ordered colon-separated list of directories in which to search for commands by name
- PWD: the current working directory
- HOME: the current user’s home directory
- EDITOR: the user’s preferred command for editing files
- PAGER: the user’s preferred command for viewing files
Commands and paths
The first part of a shell command is typically the name of a program to run. When processing the command, the shell creates a child process and uses it to execute the specified program with the specified arguments, input file, and output files.
If the name of the program is given as an absolute or relative path, then the shell executes the program at the specified path.
If the program is given as a name alone, then the shell searches the directories listed in the PATH environment variable, in order, for an executable file with a name that matches the provided one. It executes the first one it finds. The which program, when given a program name as an argument, searches the directories in the PATH variable and prints the absolute path of the first program it finds with a name that matches the provided one. In other words, it tells you which program the shell would execute for a command started with the given program name.
The env program allows you to modify environment variables before executing a given program. Like the shell and which, it searches the PATH variable to determine which program to execute. It is often used in shebangs at the top of executable script files as /usr/bin/env <interpreter> so that the script will be executed by the appropriate interpreter without its author having to know the interpreter’s exact path. Systems aren’t required by POSIX to have env located at /usr/bin/env, but it’s the most portable solution for script shebangs in almost all scenarios. If you have a sh script that you really want to run everywhere, then using /bin/sh might be better.
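To see how this fits together, here is a minimal sketch of an env-based shebang and the commands used to inspect and run it; the script name hello.sh is hypothetical:
#!/usr/bin/env bash
# hello.sh: env searches the PATH for bash, so the script
# does not hard-code the interpreter's exact location
echo "hello from bash ${BASH_VERSION}"

chmod +x hello.sh   # make the script executable
which bash          # print the bash that the PATH search finds
./hello.sh          # run the script via its shebang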
Input and output
All processes have access to three special files: standard input, standard output, and standard error. By default, when the shell executes a program, it sets up the standard input file to receive keyboard input from the terminal emulator hosting the shell, and it sets up the standard output and error files so that their output is displayed in the terminal emulator.
You can use redirections to change where the input to these special files comes from and where their output goes.
You can also link multiple processes together using pipes, which hook up the standard output of each program in a pipeline to the standard input of the following one.
You can create more sophisticated inter-process communication hookups using named pipes, which are created with the mkfifo command.
You can use braces to group a list of commands together so that their combined output can be redirected as a single unit:
# outfile contains hello and world
{ echo hello; echo world; } > /path/to/outfile
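Here is a short sketch of redirections, pipes, and a named pipe in action; the file paths are placeholders:
# redirect standard output and standard error to files
ls /tmp > listing.txt 2> errors.txt
# read standard input from a file
sort < listing.txt
# connect programs with a pipeline
ls /tmp | grep log | wc -l
# create a named pipe and use it to connect two unrelated commands
mkfifo /tmp/mypipe
gzip -c < /tmp/mypipe > listing.txt.gz &
cat listing.txt > /tmp/mypipe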
Tips and tricks
- Use the ShellCheck static analysis tool or website to lint scripts.
- bash -n will read commands but not execute them. It is useful for finding syntax errors.
- bash -x prints commands and arguments before they are executed. Defining the PS4 variable in the following way allows you to profile bash scripts by timing all executed commands: PS4='$(date "+%T.%N ($LINENO) ")' bash -x <scriptname>
- You probably want the following as the first line in your scripts: set -o errexit -o pipefail -o noclobber -o nounset
- Use the help builtin like man for other shell builtins.
- Use the read builtin to read text from files into variables.
- Putting text in single quotes preserves its literal representation; using double quotes allows for expansions.
- Put quotes around variables and space-delimited text that should be treated as a single argument or entity. It’s good practice to put all variables and strings in quotes, even if they don’t contain or expand to contain spaces.
- Use braces for all variable expansions. (See the example after this list for the quoting and brace rules in action.)
- Check the list of conditional expressions for the test builtin.
- Prepending \ to a command disables alias resolution.
- The BASH_SOURCE variable contains the path of the currently executing script.
- Use the ulimit builtin to check and set resource limits for processes started by the shell.
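The following sketch illustrates the quoting and brace-expansion tips above; the variable names are invented for illustration:
$ GREETING="hello world"
# unquoted: the shell splits the value into two arguments
$ printf '[%s]\n' ${GREETING}
[hello]
[world]
# double quotes: the value is passed as a single argument
$ printf '[%s]\n' "${GREETING}"
[hello world]
# single quotes: no expansion at all
$ printf '[%s]\n' '${GREETING}'
[${GREETING}]
# braces delimit the variable name when text follows it
$ SUFFIX=script
$ echo "my_${SUFFIX}_name"
my_script_name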
Shell utilities
cd
- change the working directory
- to go back to the previous working directory: cd -
pushd
- change the working directory and push the current directory onto the directory stack
popd
- pop a directory off the directory stack and change the working directory to it
echo
- output arguments simply
printf
- output arguments with more control
chsh
- change a user’s interactive shell
which
- locate a program in the user’s path
env
- print environment variables, or modify environment and execute a program
xargs
- read whitespace-delimited strings from standard input and execute a program with those strings as arguments
tee
- copy standard input to standard output and a list of files
Terminal emulators
Terminfo and Termcap are libraries and corresponding databases that let programs use terminals and terminal emulators in a device-independent way.
They look up the capabilities of the terminal they are running on, as described by the TERM environment variable, in the databases, and allow programs to alter their behavior accordingly.
You can add or modify entries in the databases to control how programs behave with your terminal.
ANSI escape sequences are standardized in-band byte sequences that control terminal behavior.
UNIX tools
In the philosophy of UNIX-like operating systems, programs are meant to be simple tools that do a small set of things well. As the user, you can configure and combine them to achieve your goals. In this section I discuss some important kinds of software in detail and maintain categorized lists of other useful programs.
The system manual is your first port of call for figuring out how software works.
You can search the manual for information about a particular tool, system call, or library function by typing man <name> at the command line. Type man man for more information about using the manual.
Some of the below tools are specific to particular operating systems. Check your system’s manual to see whether you have them. You can also use the UNIX Rosetta Stone to translate between OS-level tools across different systems.
Many of these tools aren’t included in default OS installations.
You can install them from the command line using your system’s package manager.
It’s apt on Ubuntu, brew on macOS, and pkg on FreeBSD.
File basics
On Linux distributions, many software tools for basic file manipulation are from the GNU core utilities project. On flavors of BSD they are maintained and distributed as part of the base system. This means that these tools can have slightly different behaviors across systems.
You may occasionally want to move or remove filenames containing characters that are difficult to type.
In these cases, the easiest way to proceed is by opening the directory containing the filename in an interactive editor like vim or emacs.
For directories that don’t have many entries, you can also use rm -i -- * to be prompted whether to delete each filename in the directory.
You can alternatively use ls -li to determine the filename’s inode number, and then use find . -inum <inode number> with the -delete flag to remove the filename or the -exec flag to otherwise interact with it. However, note that the find command will apply to all filenames in the current directory hierarchy with the specified inode number, so it may have unintended consequences if other filenames share the inode number you are interested in.
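For example, assuming the problematic filename’s inode number turns out to be 12345, removal might look like this; the -maxdepth flag, supported by GNU and BSD find, limits the search to the current directory:
# list filenames with their inode numbers
ls -li
# delete the filename with the matching inode number
find . -maxdepth 1 -inum 12345 -delete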
touch
- update access and modified times; create empty file if it doesn’t exist
mkdir
- create a directory
ls
- list files in a directory
rm
- remove a file or directory
mv
- move a file or directory
cp
- copy a file or directory
cat
- write file contents to standard output
chmod
- change file permissions
chown
- change file owner
chflags
- change file flags
ln
- create hard links and symbolic links
- to create a symbolic link: ln -s </path/to/symlink/target> </path/to/new/symlink>
rsync
- copy files to local or remote destination
File searching, viewing, and editing
lf
- terminal file manager
grep
- find lines in a file with contents that match a regular expression
- to find matching lines: grep <regular expression> </file/to/search>
- -c prints the number of matching lines
- -C[=num] prints lines of leading and trailing context
- -e is useful to specify multiple regular expressions
- -E enables extended regular expressions (special meanings for characters like ? and +)
- -n each match is preceded by its line number in the file
- -r recursively search subdirectories of a directory
- -v select lines that don’t match the given expressions
rg (ripgrep)
- faster and more powerful version of grep
find
- search for files in a file hierarchy
- to search for files with a sh extension: find </path/to/directory> -name "*.sh"
- -name search by name
- -type search by file type
- -mtime search by modification time
fd
- faster and simpler version of find
fzf
- fast generic fuzzy finder, good integrations with vim and shell history
tail
- display the last part of a file
- to wait for and display additional data as it is appended to a file: tail -f </path/to/file>
lsof
- list open files
vim
- text editor
nvim (neovim)
- more modern, mostly backwards-compatible version of vim
xxd
- create a hex dump of a binary file, or create a binary file from a hex dump
hexedit
- view and edit binary files in hexadecimal and ASCII
open
- open a file in the corresponding default application on macOS
xdg-open
- open a file in the corresponding default application on Linux or FreeBSD
File processing
tar
- create and manipulate archive files
- to create a tar archive with gzip compression: tar -czvf </path/to/output.tar.gz> </path/to/input/files>
- to extract a tar archive with gzip compression, so that the archive contents will be placed into an existing output directory: tar -xzvf </path/to/input.tar.gz> -C </path/to/output/directory>
ffmpeg
- video and audio converter
magick (imagemagick)
- convert and edit images
pandoc
- universal document converter
sed
- stream edit files
awk
- pattern-directed file processing
cut
- print selected portions of each line of a file
paste
- merge corresponding lines of input files
uniq
- report or filter out repeated lines in a file
sort
- sort files by lines
System administration
passwd
- change a user’s password
groups
- list user groups
useradd
- add a user on Linux
usermod
- modify a user on Linux
groupadd
- add a group on Linux
pw
- manage users and groups on FreeBSD
shutdown
- cleanly halt, shutdown, or reboot a machine
exit
- exit an interactive shell
mail
- send email from a machine
- to send a simple message: mail -s <subject> <someone@example.com>
cron
- execute commands on a schedule
service
- control daemons on Linux and FreeBSD
launchctl
- control daemons on macOS
Processes
ps
- list processes
kill
- send signals to processes
top
- interactive display about processes
htop
- improved top
strace
- trace system calls on Linux
ktrace
- trace system calls on macOS and FreeBSD
dtruss
- trace system calls on macOS and FreeBSD
vmstat
- kernel statistics about processes, virtual memory, traps, and CPU usage on FreeBSD
Networking
curl
- transfer data to or from a server; more flexible than wget
- to download a single file: curl -o </path/to/output.file> <http://domain.com/remote.file>
wget
- download files from a network
- to download a single file: wget -O </path/to/output.file> <http://domain.com/remote.file>
telnet
- open TCP connections
ping
- send ICMP packets to check whether a host is online
traceroute
- print the route taken by packets to a host
dig
- look up DNS information
ncat (nmap’s version of netcat)
- scriptable TCP and UDP toolbox
netstat
- show network-related data structures
sockstat
- information about open sockets on FreeBSD
tcpdump
- capture and print packet contents
ifconfig
- list and configure network interfaces on FreeBSD
route
- manipulate network routing tables on FreeBSD
ip
- manage network interfaces and routing tables on Linux
nftables
- configure firewall rules on Linux
ufw
- simple front end to nftables
pfctl (packet filter)
- configure firewall rules on BSDs
netmap
- framework that bypasses the kernel to enable fast packet I/O
Disks
df
- show free space for mounted filesystems
du
- show disk usage for directories
- to show the size of a particular directory, where -h means human-readable size and -d is the depth of subdirectory sizes to display: du -h -d 0 </path/to/directory>
mount
- mount filesystems or list mounted filesystems
- to mount a filesystem: mount </path/to/device> </path/to/mount/point>
umount
- unmount filesystems
gpart
- partition disks on FreeBSD
parted
- partition disks on Linux
newfs
- create UFS filesystems on FreeBSD
zpool
- create and manage ZFS storage pools on FreeBSD
mkfs
- create filesystems on Linux
makefs
- create a file system image from files on FreeBSD
mkimg
- combine file system images into a partitioned disk image on FreeBSD
hdiutil
- work with disk images on macOS
dd
- copy files
- to write a disk image to a storage device: dd if=</path/to/disk.img> of=</path/to/device> bs=8M status=progress
iostat
- statistics about disk use on FreeBSD
fuse
- kernel interface that allows userspace programs to export a virtual filesystem
Peripherals
devinfo
- information about peripheral devices on FreeBSD
lspci
- list PCI devices on FreeBSD
pciconf
- configure PCI devices on FreeBSD
acpidump
- analyze ACPI tables on FreeBSD
lsblk
- list block devices on Linux
udev
- dynamic peripheral management and naming on Linux
devd
- dynamic peripheral management and naming on FreeBSD
picocom
- terminal emulator for communicating over serial connections
- to open a terminal session using a serial device: picocom -b <baud rate> </path/to/serial/device>
bpf (eBPF, extended Berkeley Packet Filter)
- write arbitrary programs that run on a virtual machine within the kernel
Security analysis
afl-fuzz (American Fuzzy Lop plus plus)
- general-purpose fuzzer
syzkaller
- kernel fuzzer
nmap
- network scanner
wireshark
- network packet analyzer
squid
- web proxy
aircrack-ng
- wifi security tools
Burp
- intercepting web proxy
Frida
- dynamic binary instrumentation toolkit
ghidra
- binary reverse engineering tool
radare2
- binary reverse engineering tool with command-line interface
binwalk
- identify files and code in binary firmware images
john (John the Ripper)
- password cracker
hashcat
- password cracker with good GPU support
auditd
- event auditing for Unix-like operating systems
SSH
The Secure Shell Protocol (SSH) is the most common way to get secure remote shell access to a machine. It supports a wide range of use cases, including port forwarding, X display forwarding, and SOCKS proxying. The most popular implementation is OpenSSH, which I describe here.
The primary components of OpenSSH are sshd, the SSH server daemon that runs on the machine you want to access remotely, and ssh, the client application that runs on your local machine.
Global configuration files for ssh and sshd can be found in /etc/ssh: /etc/ssh/ssh_config is used to configure ssh and /etc/ssh/sshd_config is used to configure sshd.
Per-user configuration is in the ~/.ssh directory, which must have restrictive permissions (700) for its contents to be used. ~/.ssh/config overrides the global /etc/ssh/ssh_config.
Key-based authentication
SSH can use various forms of authentication, including the password for the user on the remote machine, public-private keypairs, and Kerberos. Using passwords exposes you to brute-force attacks. You should configure your servers to accept only key-based authentication by adding the lines PubkeyAuthentication yes and PasswordAuthentication no to sshd_config.
You can generate a public-private keypair using the interactive ssh-keygen command. It puts both a public and a private key in the ~/.ssh directory. The private key, called id_rsa by default, stays on your local machine and is used to prove your identity. It must have permissions 600 for the programs to work correctly. You should protect your private key with a passphrase. Otherwise, someone who obtains your private key or gains access to your local user account automatically gains access to all of the machines you can SSH into.
The public key, called id_rsa.pub by default, is placed onto machines that you want to access using your private key. More specifically, the contents of id_rsa.pub are appended as a line in the file ~/.ssh/authorized_keys in the home directory of the user that you want to log in as on the remote machine that you want to access. The authorized_keys file must have permissions 600 for the programs to work correctly. You can use the ssh-copy-id program to automatically add your public key to the appropriate authorized_keys file on a remote machine.
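For instance, generating an Ed25519 keypair and installing its public key on a remote machine might look like the following; the username and host are placeholders:
# generate a keypair; -t selects the key type and -f the output path
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
# append the public key to ~/.ssh/authorized_keys on the remote machine
ssh-copy-id -i ~/.ssh/id_ed25519.pub <username>@<remote host>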
With your keypair set up in this way, you can SSH into the remote machine and get an interactive shell without using the remote user’s password:
ssh -i <path/to/private/key> <username>@<remote host>
In your ~/.ssh/config file, you can specify usernames and keys to use with particular hosts:
Host <remote host>
User <username>
IdentityFile <path/to/private/key>
IdentitiesOnly yes # don't try default private keys
Then you can ssh into the machine with the simple command:
ssh <remote host>
SSH agents
If you set a passphrase on your private key, you will be prompted for this passphrase each time you want to use the key.
You can use the ssh-agent and ssh-add programs to remember the private key passphrase for a certain amount of time:
eval `ssh-agent`
ssh-add -t <lifetime> <private key>
You can check which keys have been added to the agent as follows:
ssh-add -l
You can configure the command involving ssh-agent to be run every time you log in to your machine, so you only have to run ssh-add to store your private key passphrase.
ssh-agent can also be used to implement single sign-on across multiple remote machines, so that the passphrase for your private key only has to be entered on your local machine. This requires the ForwardAgent option to be enabled in the ssh_config file on clients and the AllowAgentForwarding option to be enabled in the sshd_config file on servers.
You can then forward your agent connection with the -A flag:
ssh -A <user>@<remote host>
SOCKS proxying
SSH can be used to create an encrypted SOCKS proxy connection. A SOCKS proxy connection is similar to a virtual private network (VPN) connection. It is an encrypted channel between a local and remote host. The local host sends packets across the channel; the remote host receives the packets and then forwards them to their final destinations. The final destinations can vary across packets and do not need to be specified ahead of time. The below command opens a SOCKS tunnel to the remote host on the specified local port number:
ssh -D <local port> <user>@<remote host>
You can configure your operating system to forward all network traffic over the SOCKS tunnel by specifying the appropriate local port number. You can also configure web browsers to forward web-based traffic.
Local port forwarding
SSH features local port forwarding, also known as tunneling. It allows you to specify a port on your local machine such that connections to that port are forwarded to a given host and port via the remote machine. Data you send to the local port are passed through the encrypted SSH tunnel to the remote machine and then sent by the remote machine to the destination you specify. This is useful for getting access to a service behind a firewall from your local machine:
ssh -L <local port>:<destination host>:<destination port> <user>@<remote host>
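For example, if a database host that is only reachable from the remote machine runs PostgreSQL on port 5432, you could forward it to local port 5433; the hostnames and ports here are illustrative:
# connections to localhost:5433 go through the SSH tunnel and are
# forwarded by the remote host to port 5432 on the database host
ssh -L 5433:database.internal:5432 <user>@<remote host>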
Remote port forwarding
SSH features remote port forwarding, also known as reverse tunneling.
It allows you to specify a port on the remote machine such that connections to that port are forwarded to a given host and port via your local machine.
Data sent to the remote port are passed through the encrypted SSH tunnel to your local machine and then sent by your local machine to the destination you specify.
By default, the remote port is only accessible from the remote host itself. You can open it to the wider Internet by enabling the GatewayPorts option in the sshd_config file on the remote machine.
This can be run on a machine that’s behind a firewall or NAT to enable other machines to access it:
ssh -R <remote port>:<destination host>:<destination port> <user>@<remote host>
X forwarding
SSH can also forward X graphical applications from a remote host to your local machine.
The X11Forwarding option must be enabled in the sshd_config file on the remote machine. You must be running an X server on your local machine, and the DISPLAY environment variable must be set correctly in your local machine’s shell.
DISPLAY tells an X application where to send its graphical output. Its format is <hostname>:<display>.<screen>. A display is a collection of monitors that share a mouse and keyboard, and most contemporary computers only have one display. A screen is a particular monitor. The hostname is the name of the host running the X server. It can be omitted, in which case localhost will be used. The display number must always be included, and numbering starts from 0. The screen number can be omitted, in which case screen 0 will be used. For example, a DISPLAY value of :0 means that the X server is running on localhost, and graphical output should be rendered on the first screen of the first display.
You can then run ssh -X to enable X forwarding. SSH should set DISPLAY in your shell session on the remote host to localhost:10.0, and it will tunnel traffic sent there to the X server on your local machine. With the -X flag, SSH will subject X forwarding to security restrictions, which for some default configurations include a timeout after which new X connections cannot be made. One way to bypass these security restrictions is to use the -Y flag instead of -X.
File transfers
SSH can transfer files to and from remote machines with the scp and sftp commands. The program sshfs, which is not part of OpenSSH, can mount directories on a remote machine using SSH.
Debugging
To debug issues with SSH, you can run ssh -vvv and sshd -ddd for verbose debugging output.
If you are using SSH to access a server and your Internet connection is spotty, dropped connections can be frustrating.
One way to address this is by running tmux on the remote machine, so that you can reattach to sessions if you get dropped.
If you are mobile or have a truly terrible Internet connection, mosh is a less featureful alternative to SSH that provides a better experience.
Encryption
GPG
GNU Privacy Guard (GPG) is a good way to encrypt and decrypt individual files. It supports symmetric (passphrase-based) and asymmetric (keypair-based) encryption.
For GPG commands that produce binary output, the -a flag encodes the binary output as ASCII text for ease of transmission.
Symmetric encryption
Use the following command for symmetric encryption:
gpg -o <encrypted output file> -c --cipher-algo AES256 <plaintext input file>
You will be prompted to choose a passphrase.
Asymmetric encryption
Asymmetric encryption requires working with GPG keypairs. GPG keypairs are distinct from SSH keypairs.
To create a GPG keypair, run gpg --generate-key, which will prompt you to provide a name and email address to associate with the created keypair and then to enter a passphrase for the private key. You can reference a keypair by its id, name, or email address in GPG commands. You can list keys in GPG’s keyring with gpg --list-keys and edit keys with gpg --edit-key <key>.
To export and import public keys, use gpg -o <output key file> --export <key> and gpg --import <input key file>. The --export-secret-key and --allow-secret-key-import flags do the same thing for private keys.
With asymmetric encryption, you encrypt a file for a given public key that is present in your GPG keyring, and the corresponding private key is required to decrypt it:
gpg -o <encrypted output file> -e -r <recipient key> <plaintext input file>
Decryption
Use the following command to decrypt a file encrypted by GPG:
gpg -o <plaintext output file> -d <encrypted input file>
You will be prompted to enter the passphrase for either the symmetric encryption or the appropriate private key.
You can configure gpg-agent to reduce the number of times you have to enter a private key’s passphrase.
rclone
rclone is an excellent way to perform encrypted cloud backups.
In ~/.config/rclone/rclone.conf, set up a crypt backend over a backend for your cloud provider. You can then use rclone sync to make an encrypted cloud backup match the contents of a local folder:
rclone sync --links --fast-list <path/to/local/folder> <crypt-backend:>
You can use the rclone bisync command to make an encrypted cloud backup sync bidirectionally with multiple clients.
You can also mount an encrypted cloud drive as a local filesystem:
rclone mount --vfs-cache-mode full <crypt-backend:> <path/to/local/mount/point>
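As a sketch, assuming a Google Drive remote named gdrive and a crypt backend named encrypted (the remote names, bucket path, and obscured passwords are placeholders), the relevant parts of ~/.config/rclone/rclone.conf might look like this:
[gdrive]
type = drive

[encrypted]
type = crypt
remote = gdrive:backups
password = <password obscured by rclone config>
password2 = <salt obscured by rclone config>
With this configuration, encrypted: is the crypt backend to reference in the rclone sync, rclone bisync, and rclone mount commands above.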
Git
Version control software keeps track of how files change over time and allows multiple people to work on the same set of files concurrently. Git is a popular program for version control. It’s a complicated and flexible tool: some ways of using it make working in teams easy, while others make it painful. This section describes some key Git concepts and suggests a good workflow for collaborative projects.
Concepts
Commits
The files for a Git-controlled project exist in a directory on your filesystem known as the work tree. As version control software, Git keeps track of how files change over time. But changes that you make in the work tree aren’t automatically reflected in Git’s historical record.
Git records changes in terms of commits, which are snapshots of the work tree. Each commit contains a set of modifications that affect one or more files relative to the previous commit. These modifications are colloquially called changes or diffs, for differences.
To create a commit, first make some changes in your work tree.
You can run git status to see which files have changed in your work tree and git diff to see exactly what the changes are. Commits are prepared in the staging area. Run git add <file> to add changes affecting a file in the work tree to the staging area. You can check exactly what’s in staging with git diff --staged. You can create a commit by running git commit and adding a message when prompted. Changes are then taken from the staging area and added to Git’s historical record as a commit. Each commit is assigned a unique hash identifier. git show <commit identifier> shows the changes associated with a particular commit.
To remove a file from the staging area, run git restore --staged <file>. To discard changes to a file in your work tree, run git restore <file>.
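A minimal commit session, using a hypothetical file name and commit message, might look like this:
git status                    # see which files have changed
git diff                      # inspect the unstaged changes
git add server.py             # stage the changes to one file
git diff --staged             # review what is about to be committed
git commit -m "server: handle empty requests"
git show HEAD                 # display the new commit's changes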
Branches
It’s common to work with multiple versions of the same project. For example, you may want to add a big button to your website, but you aren’t sure whether red or blue would be a better color. You decide to create different versions of the site to check.
Git supports different project versions via branches. A branch is a named sequence of commits, where each commit except the first one points to a parent commit that happened before it. A ref is a human-readable name that references a particular commit. Each branch name is a ref that points to the most recent commit, or tip, of the corresponding branch.
With Git, you are always working on some branch. Most projects start with a branch called main, which holds the main version of the project. When you make a commit, the current branch’s ref is updated to point to the new commit, and the new commit’s parent is set to the previous tip of the branch, i.e. the previous value of the branch’s ref. You can list branches and determine your current branch with git branch. You can switch branches with git switch <branch name>. HEAD is a special ref that always points to the tip of the current branch. You can run git log to see the sequence of commits that makes up the current branch, starting from the tip. git log -p shows the commits and their corresponding diffs.
To create a new branch and switch to it, run git switch -c <new branch name>. The ref of the new branch then points to the same commit as the ref of the branch you switched from. When you make commits in this new branch, they only change the new branch’s ref. They do not modify the ref of the original branch or the commit sequence that is shared between the new and original branches. The new and original branches are said to diverge at the most recent commit they share.
Returning to our button example, say you evaluate the different colors by creating new branches off of your website’s main branch. One is called red-button and the other blue-button. In each branch, you make a commit that adds the appropriately colored button. You decide you like red best. Now, you want to get the changes you made in the red-button branch into the website’s main branch. There are two main ways to do so.
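The branch setup for this example might look like the following sketch; the file name site.css is hypothetical:
git switch main               # start from the main branch
git switch -c red-button      # create and switch to the red-button branch
# ...edit files to add the red button...
git add site.css
git commit -m "add red button"
git switch main
git switch -c blue-button     # create the blue-button branch off of main
# ...edit files, commit the blue button, and compare...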
Merge
One way to integrate changes from one branch into another is to perform a merge.
You can merge changes from the red-button branch into main by switching to main and running git merge red-button.
For this discussion, we’ll call the branch that the changes are coming from the source branch and the branch being merged into the target branch.
In a merge, Git takes the changes that have been made in the source branch since it diverged with the target branch and applies them to the target all at once.
If the target branch hasn’t experienced any commits since it diverged with the source branch, then the source branch is just the target branch with additional commits added. In this case, which is known as a fast-forward merge, the merge operation simply sets the target branch’s ref to the source branch’s ref.
If both the target and the source branches have experienced commits since they diverged, then the merge operation adds a new commit to the target branch, known as a merge commit, that contains all of the changes from the source branch since divergence. Merge commits have two parents: the refs of the source and target branches from before the merge.
After a merge, the two branch histories are joined.
Commits that were made in the source branch are displayed in the git log of the target branch, interleaved with commits that were made in the target branch according to creation time, even though they do not affect the target branch directly. You can run git log --graph to see a graphical representation of previously merged branches. Running git log --first-parent shows only commits that were made in the target branch.
If a target and source branch have made different changes to the same part of a file since divergence, then the merge may not be able to happen automatically.
This is known as a merge conflict.
In this case, Git will pause in the middle of creating the merge commit and allow you to manually edit conflicting files to decide which changes should be preserved.
When you are done resolving the merge conflict, add the conflicting files to the staging area and run git merge --continue. Running git merge --abort cancels a paused merge.
When you run git show on a merge commit, it will only show changes in files that have been modified relative to both parents. This means that it typically only shows files in which you manually merged changes as part of a conflict resolution. It also doesn’t include the context for those conflict resolutions. You can run git show --first-parent to see all changes made by a merge commit relative to the target branch, and git show --remerge-diff to see the context for merge conflict resolutions.
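Continuing the button example, a merge that hits a conflict might proceed like this; the conflicting file name is hypothetical:
git switch main
git merge red-button          # Git reports a conflict in site.css
# edit site.css to keep the desired changes, then mark it resolved
git add site.css
git merge --continue          # create the merge commit
# or, to give up and return to the pre-merge state:
# git merge --abort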
Rebase
An alternative to merging changes from one branch into another is to rebase them.
You can rebase changes from the red-button branch onto main by switching to red-button and running git rebase main.
For this discussion, we’ll call the branch that the changes are coming from the source branch and the branch that the changes are being rebased onto the target branch.
In a rebase, Git first determines the changes associated with each commit made in the source branch since it diverged with the target branch. It saves these sets of changes to temporary storage. It then sets the ref of the source branch to point to the ref of the target branch. Finally, it creates new commits in the source branch that apply the saved changes one set at a time.
The overall effect of a rebase is that changes made in the source branch are re-applied as if they were made on top of the target branch rather than the point of divergence. Rebases result in a linear history that is simpler than the joined history after a merge. Unlike a merge, which modifies the target branch, a rebase modifies the source branch and leaves the target unchanged.
If a target and source branch have made different changes to the same part of a file since divergence, then the rebase may not be able to happen automatically.
This is known as a rebase conflict.
In this case, Git will pause in the middle of creating the first commit whose changes do not automatically apply.
Like in a merge conflict, you can manually edit conflicting files to decide which changes should be preserved in the commit, add them to the staging area, and run git rebase --continue to continue with the rebase. Note that resolving a conflict for a particular rebased commit may prevent subsequent changes from being applied automatically. This can result in a painful cascading rebase-conflict scenario that should be avoided. Running git rebase --abort cancels a paused rebase.
Rather than adding new commits to the tip of a branch, rebasing rewrites the branch’s commit history. As discussed further below, this means that rebases should not be used on branches that are being actively worked on by more than one collaborator.
You can perform more complex rebases with the --onto flag. git rebase --onto <new-base> <end-point> <start-point> gathers changes corresponding to commits that have been made in the branch specified by the <start-point> ref going back until the <end-point> ref, sets the current branch to <new-base>, then applies the changes. The git cherry-pick command is an easy way to take the changes associated with a commit or range of commits and apply them to the tip of the current branch, one at a time. It is effectively shorthand for git rebase --onto HEAD <end-point> <start-point>.
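For instance, to apply a single commit from elsewhere onto the current branch, or to move one branch's commits onto main with the --onto form (the commit hash and branch names are placeholders):
# apply one commit to the tip of the current branch
git cherry-pick 1a2b3c4d
# replay the commits of feature-part-2 that are not in feature-part-1 onto main
git rebase --onto main feature-part-1 feature-part-2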
Remotes
Git keeps track of work trees, branches, commits, and other project information in a repository.
You can create a new git repository on your local filesystem by running git init.
Git is a distributed version control system, which means that there can be many repositories for a single project. These repositories can also be in different locations. For example, a collaborative project might exist in repositories on your local machine, on a server belonging to a Git hosting service, and on the machine of another developer. Working effectively with others requires sharing information between these repositories.
Repositories distinct from the one you are currently working in are called remotes.
You can list the names and URLs of remotes for your current repository by running git remote -v, and see information about a particular remote by running git remote show <remote name>. You can add remotes with git remote add <remote name> <remote URL> and rename remotes with git remote rename <old remote name> <new remote name>.
To download data from remote repositories, you can run git fetch <remote name>. This takes branches from the remote repository and creates or updates corresponding remote branches in your local repository. Remote branches are named with the format <remote name>/<branch name>.
You can’t work on remote branches directly, but a local branch can be configured to have a direct relationship with a remote branch. Such local branches are called tracking branches. The remote branch associated with a tracking branch is called its upstream branch.
To create a new tracking branch based on a remote upstream branch, you can run git switch -c <new branch name> <remote name>/<branch name>. You can also set up an existing branch to track a remote upstream branch by switching to the existing branch and running git branch --set-upstream-to <remote name>/<branch name>.
The git pull command integrates changes from a particular remote branch into the current local branch. If you run git pull with no other arguments from a tracking branch, it will automatically use the upstream remote branch. git pull <remote name> <branch name> specifies a particular remote branch to use. If the local branch ref is an ancestor of the remote branch ref, git pull will fast-forward the local branch to the remote. If the two branches have diverged, you want git pull to rebase the local branch on top of the remote one; make sure you pass the --rebase option or set pull.rebase to true in your Git config so that it does.
The git push command uploads the current local branch to a remote branch. If you run git push with no other arguments from a tracking branch, it will automatically update the upstream remote branch. git push <remote name> <branch name> specifies a particular remote branch to update. If you have created a new local branch, you can use the following command to create a new remote branch with the same name as your new local branch, update the remote branch with the contents of your local branch, and establish a tracking relationship between them: git push --set-upstream <remote name> <current local branch name>. You can delete a remote branch by running git push -d <remote name> <branch name>.
By default, git push will only succeed if the remote branch ref is an ancestor of the local branch ref. In other words, it will only succeed if it can fast-forward the remote branch ref to the local branch ref. If you want to rewrite the history of the remote branch, you can use git push --force-with-lease. However, the --force-with-lease flag will still result in failure if a commit has been made in the remote branch since the last time you pulled it. To overwrite the remote branch with your local branch in all scenarios, you can use git push --force. You should rarely need to use the --force flag, and you should never rewrite the history of a remote branch that other developers may be working on.
When you want to start collaborating on a new project that is already in progress, a common thing to do is clone that project’s repository from some Git hosting service.
The command git clone <remote URL> creates a new directory with the name of the remote repository, copies the remote repository into the new directory, sets up the cloned repository as a remote with the name origin, creates remote branches for each branch in origin, and creates a local tracking branch for origin’s primary branch.
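Putting the remote commands together, a typical session against a hosted repository might look like the following; the URL and branch names are placeholders:
git clone https://example.com/project.git   # sets up the origin remote
cd project
git fetch origin                            # update remote branches
git switch -c feature origin/main           # tracking branch based on origin/main
# ...work and commit...
git push --set-upstream origin feature      # publish the branch and set its upstream
git pull --rebase                           # later, rebase local work onto upstream changes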
Workflow for collaboration
This workflow outlines how to use Git for collaborative projects in the most painless way possible. It describes the process of getting a new feature added into the project from start to finish.
- Clone the project’s repository from wherever it is hosted.
- Create a feature branch to work in. This branch may be created off of your local copy of the repository’s main branch or a different feature branch.
Do the required development work in your feature branch. While working, you should maintain a small number of commits, often only one, at the tip of the branch. The commits should be semantically distinct and the sets of files they modify should often be disjoint. They should also have short, descriptive commit messages.
For example, if we wanted to add support for user accounts to a web app, the commits in our feature branch, displayed with
git log --oneline
, might look like this:ff725997b backend: add support for user accounts b54b004df frontend: add login page f7f7769b4 frontend: main page: display logged in user's name
The following tips will help you to maintain your commits:
- Use interactive rebasing (git rebase -i) to order and squash commits. git commit --fixup <commit-to-fixup> and the --autosquash argument to git rebase -i are helpful for this.
- Use git add -p to add individual chunks of files to the staging area.
- Use git commit --amend to fold changes from the staging area into the previous commit.
Maintaining a small number of semantically distinct commits at the tip of your feature branch makes your branch easier to maintain, understand, and review. It also makes rebasing easy.
During the course of development, you often have to incorporate changes from a mainline branch into your feature branch. You should not use a merge in this scenario. For one thing, merging pollutes your feature branch’s history and makes it hard to identify which commits are actually relevant to the feature. But more importantly, merges that involve conflict resolution end up splitting changes to the same region of code across multiple commits: the feature branch commit that originally introduced the changes and the potentially massive merge commits that resolved conflicts. This makes it hard for collaborators and even yourself to work with your branch.
To incorporate changes from a mainline branch, you should rebase the feature branch onto it. Doing so results in a clean linear history that is easy to understand and work with. And having a small number of semantically distinct commits in the feature branch guarantees you won’t experience cascading rebase conflicts.
- When your feature is done and tests are passing locally, push your feature branch to a new corresponding remote branch for review. Address any feedback by making changes to your local branch and using the tips mentioned above to maintain your commits. You likely won’t need to create any new semantically distinct commits in response to reviews. Use git push --force-with-lease to push versions of your feature branch with rewritten history back up to the remote. This is fine to do as long as no other developers are working on the remote copy of your feature branch.
- To incorporate your feature branch into the mainline branch, first rebase it on the mainline branch a final time. Then fast-forward the mainline branch to the feature branch. This results in a clean linear history for the mainline branch.
With this workflow, you almost never perform an actual git merge.
However, merges are useful in certain scenarios.
Imagine that you maintain a fork of some upstream project and want to incorporate changes from a new version of upstream.
In this case, rebasing all the changes you’ve made in your fork onto a new upstream version is impractical, and the split of changes between your original commits that introduced them and merge commits for new versions of upstream is useful.
Stacked branches
Sometimes, when working on a large feature, you may want to make the changes in distinct parts that can be reviewed and integrated separately.
One way to do this is to create a separate branch for each part of the feature such that each part’s branch is an extension of the previous part’s.
More specifically, you would create a part-1 feature branch off of main, a part-2 feature branch off of part-1, and so on. Each part’s branch should contain the tip of the previous part’s branch (or main) in its history, so that the parts of the feature all apply cleanly on top of one another.
This scenario is often referred to as having stacked branches.
While stacked branches can make the review process for large features simpler and more effective, they also involve additional management work.
For example, when the main branch changes, you have to rebase the entire branch stack on top of a new commit such that part-1 is based on the new main, part-2 is based on the new part-1, and so on.
Similarly, when one part of the feature is changed in response to review feedback, the subsequent parts must all be rebased.
Git’s --update-refs argument to the rebase command handles the extra work involved in rebasing stacked branches automatically. It performs the specified rebase command as normal, but for any local branches whose refs point to commits affected by the rebase, those refs are updated to the post-rebase commits. So, if you need to rebase a stack of branches on a new version of main, you can check out the last part in the stack, rebase it on main with the --update-refs argument, and all of the previous parts in the stack will also be rebased on top of each other and the new main automatically. You can then push all of the updated branches to the remote server with a single command.
The --update-refs argument also works with interactive rebases, which is useful for incorporating review feedback. You can check out the last part in the stack, make changes as needed, then perform an interactive rebase with the --update-refs flag to move and squash your commits. There will be update-ref lines in the list of commands that allow you to control exactly where the refs of each branch in the stack will point after the rebase.
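Using the part-1 and part-2 naming from above, rebasing the whole stack onto a new main and pushing the results might look like this:
git switch part-2                           # check out the last branch in the stack
git rebase --update-refs main               # part-1's ref is updated as well
git push --force-with-lease origin part-1 part-2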
Tips
- Documentation for Git is at git-scm.com/docs/.
- Use Git hooks to run scripts when certain events occur.
- Set merge.conflictStyle to diff3 or zdiff3 in your gitconfig to make conflict resolution sane.
- Use git worktree commands to set up multiple work trees for the same repository. The work trees use different branches and directories on the file system.
- You can create aliases for git commands in your gitconfig.
- git blame will tell you the last commit that modified a region of code.
- git bisect is useful for identifying when a bug was introduced.
- You can tag commits using git tag.
- The --autostash option is useful if you need to perform potentially destructive operations but don’t have a clean work tree.
- You can use a gitignore file to prevent files on your filesystem from being tracked by Git.
- git clean removes untracked files from the work tree.
- git reset <commit> resets the current branch ref as well as HEAD to a given commit without modifying the work tree.
- git reset --hard <commit> resets the current branch ref, HEAD, and the work tree to a given commit.
- git reset is useful in conjunction with git rebase -i for splitting up one commit into multiple.
- You can use partial or shallow clones to make cloning a large remote much faster at the expense of functionality.
- Use git diff and git apply to create and apply patches based on changes in the work tree or staging area. Use git format-patch and git am to create and apply patches based on commits.
Build tools
Build systems
For languages that require compilation, build systems handle invoking the compiler. They typically let you write configuration files that specify the command line arguments, libraries, input files, and output files that the compiler will use. Many allow you to make these specifications in a cross-platform way, so that your code can be both built on different platforms and built to execute on different platforms.
Build systems may support incremental builds, which only recompile files that have been modified since the previous compilation, and build caches, which store compilation outputs, for efficiency.
The right build system to use depends on the language being compiled and project requirements. Build systems are sometimes integrated into compilers or package management systems. Useful build systems to know about for C and C++ compilation include CMake, GNU Autotools, and Make.
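As a small illustration, the commands below configure and build a project with CMake; they assume a project with a CMakeLists.txt at its root and CMake 3.13 or newer:
# configure the build into a separate directory, then compile
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
# rerunning the build is incremental: only files whose inputs changed are recompiled
cmake --build build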
Package managers
Package managers allow you to split a codebase into packages, where a package is a single library or application. They also allow you to manage dependencies, which are the packages that your packages require in order to be built or run. Build-tool package managers typically are specific to a particular programming language and integrate with a build system. Unlike operating-system package managers, which install packages at a system-wide level, build-tool package managers typically only install dependencies in a particular build environment.
A first-party package is a library or application that is a part of your codebase. It is defined by a package manifest file that allows you to specify a package name, a version number, a set of files to include in the package, metadata, and a dependency list that references other packages. You can use a package manager to upload first-party packages to a package registry server.
A third-party package is a one that has been defined outside of your codebase and published to a registry. For each first-party package in a codebase, the package manager can automatically download and cache all missing dependencies from the package registry. The downloaded dependencies can then be used during development, compilation, and execution.
To specify a dependency in a first-party package’s manifest file, you must include a range of version numbers to indicate which versions of the dependency your package is compatible with. When you modify one of these ranges or add or remove a dependency, you should use the package manager to generate a lockfile for the first-party package.
A lockfile is a file that names a specific version for each of a package’s dependencies, whether they are listed directly in the package manifest or present in the dependency graph as a dependency of a dependency. In generating a lockfile, the package manager chooses specific dependency versions from the allowed ranges to minimize the overall number of dependencies required. For example, a third-party package that appears twice in the dependency graph with version ranges that overlap can be represented by a single entry in the lockfile. A third-party package that appears twice in the dependency graph with disjoint version ranges must be represented by two entries.
Once you have a lockfile, you can use the package manager to install the dependencies listed in it. A particular lockfile will always install exactly the same versions of exactly the same dependencies. Lockfiles are thus an efficient way to facilitate reproducible builds. Rather than adding all of a package’s dependencies into version control, you can simply add a lockfile.
Package managers should prevent the direct use of phantom dependencies. Phantom dependencies are dependencies of a package’s dependencies; they are present in the package’s dependency graph but not its manifest file. A package’s phantom dependencies can change versions or be added or removed unexpectedly as its explicit dependencies change over time. Because of this, the direct use of phantom dependencies can lead to hard-to-diagnose bugs, and attempts to use phantom dependencies should break a package’s build.
For simple projects, a codebase might contain only a single first-party package. However, more complex projects may have multiple first-party packages in a single codebase. Package managers that support such codebases should have the following capabilities:
- To allow first-party packages in the codebase to be dependencies of other first-party packages.
- To generate a combined lockfile that lists dependencies for multiple packages in the codebase, which allows for the selection of dependency versions that minimize the total number of dependencies required.
- To audit whether multiple packages in the codebase have disjoint version ranges of the same dependency.
- To identify which packages in the codebase are affected by changes to a lockfile, which facilitates efficient rebuilding.
- To install dependencies for a given list of first-party packages only.
- To bundle a first-party package with all of its dependencies to facilitate deployment.
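Cargo, Rust’s package manager, is one example of a tool with several of these capabilities: the crates in a workspace share a single lockfile at the workspace root, and you can operate on individual members. The crate name below is a placeholder.
# build a single first-party package (and its dependencies) within a workspace
cargo build -p my_crate
# audit the dependency graph for packages that appear with multiple versions
cargo tree -d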
Task runners
Task runners orchestrate shell commands related to your codebase. They help to manage the complexity and computational work associated with building, testing, and deploying software projects.
One feature of task runners is that they allow you to invoke arbitrary shell commands or scripts via simple shorthands.
For example, running taskrunner lint might invoke the linter using a complicated list of configuration arguments.
Another feature is that they keep track of dependencies between commands.
For example, if you run taskrunner test, the task runner might run taskrunner build before the test command so that your tests run against the most recent version of your project.
Task runners track dependencies via a user-defined task dependency graph. Conceptually, each task definition comprises a command, a set of tasks that it depends on, a set of inputs, and a set of outputs. Before the task runner executes a task, it makes sure that all of its dependencies have been executed first.
A task’s inputs and outputs might include files, environment variables, and other system state. They allow the task runner to perform caching. More specifically, the runner can hash the contents of the input set and check that hash against a cache. If the hash hits, the inputs haven’t changed, so the outputs from a previous run of the shell command can be restored directly from the cache without running it again. If the hash misses, the task runner runs the task’s shell command and stores the resulting outputs in the cache. The cache can be stored locally or on a remote server.
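The following is a toy shell sketch of this input-hashing scheme, not the implementation of any particular task runner; the input set, output file, and build command are made up:
#!/bin/bash
# Toy task cache: re-run a build command only when its inputs change.
set -euo pipefail

cache_dir=".task-cache"
output="build/app"
mkdir -p "$cache_dir" build

# hash the contents of the declared input set
hash=$(cat src/*.c Makefile | sha256sum | cut -d' ' -f1)

if [ -f "$cache_dir/$hash" ]; then
    # cache hit: the inputs have not changed, so restore the cached output
    cp "$cache_dir/$hash" "$output"
else
    # cache miss: run the command, then store its output under the input hash
    make app
    cp "$output" "$cache_dir/$hash"
fi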
For this caching to work correctly, a task must always yield the same output when given the same input. Such tasks are called hermetic; hermetic tasks are amenable to parallelization as well as caching. To promote hermeticity, some task runners do not allow a task to access any system state not explicitly declared in its input set. Some task runners also restrict the types of commands that tasks can perform to limit their ability to violate hermeticity.
Task inputs and outputs also enable task runners to identify the set of tasks that have been affected by a change in system state. For example, if some files on the system have been changed, the set of tasks affected by the change contains each task that has a changed file in its input set, along with each task that depends on such a task, directly or transitively.
Task runners might allow you to implement special handling for changes to particular types of files. For example, when a package manager’s lockfile changes, the task runner might invoke the package manager to determine which packages have been affected by the change, and only consider the lockfile to be changed in the input sets for tasks that correspond to the affected packages. Similarly, task runners may be able to integrate with build systems or package managers to generate sets of input files for certain types of tasks.
Current popular task runners include moonrepo and Turborepo.
Virtual environments
In the course of developing software, you’ll run into a lot of virtual environments. Below I describe the main kinds, how they’re useful, and some tools that can help you use them effectively.
Virtualization
Virtualization is when a piece of software known as a hypervisor uses features of real computer hardware to create and manage a set of virtual machines that can run guest operating systems as if each guest OS were running on an isolated instance of the underlying hardware. Sometimes the guest OSes are modified to be aware that they’re running on a virtual machine and to cooperate with the hypervisor – this is known as paravirtualization – and sometimes they run unmodified.
Virtualization is useful for setting up isolated development environments without polluting your primary operating system. It’s also useful for testing software against a wide range of operating systems or getting access to software that isn’t available on your primary OS.
A virtual machine is represented by a disk-image file, which contains the guest OS, together with hypervisor configuration files. Using identical images and configuration files results in identical instantiated VMs. These files can be distributed to share development environments, to package pre-configured applications with their dependencies, and to guarantee that networked applications are tested and deployed in identical environments.
Because virtualization uses hardware directly, it’s fast relative to emulation. However, it comes with the limitation that each virtual machine has the same architecture as the underlying physical hardware.
Popular hypervisors are kvm, bhyve, Hyper-V, the macOS Virtualization Framework, and Xen.
Emulation
Emulation is when a software emulator mimics the behavior of hardware. It’s broadly similar to virtualization, except that because an emulator works purely in software, it can emulate any kind of hardware. Operating purely in software also makes emulation slower than virtualization.
Emulation can be used for the same things as virtualization, but its worse performance makes it less likely to be used for distributing or running production applications. It is most useful for testing software across a wide range of hardware architectures, peripheral devices, and operating systems. You can also run nearly any piece of software, even ancient relics, using emulation.
The most popular general-purpose emulator is QEMU.
Simulation
The distinction between emulators and simulators is subtle. While emulators emulate the behavior of an entity, simulators simulate the entity’s behavior as well as some aspect of its internal operation. For example, an emulator for a particular CPU could take a set of instructions and execute them in any way, as long as the externally observable results are the same as they would be for the CPU being emulated. On the other hand, a simulator might execute the set of instructions in the same way that the real CPU would, taking the same number of simulated cycles and using simulated versions of its microarchitectural components.
In some sense, whether a piece of software is an emulator or simulator depends on the level of detail you’re interested in. Generally, though, simulators are much slower than emulators. They are typically used to do high-fidelity modeling of hardware before investing the resources to produce a physical version. A popular hardware simulator is gem5.
Containerization
Containers are lightweight virtual environments within a particular operating system. They isolate applications or groups of applications by limiting unnecessary access to system resources.
Some containerization systems represent containers as image files that can be instantiated by a container runtime. These kinds of containers are similar to virtual machines, but because they work within a particular operating system, they are both more efficient and less flexible. They can be used to package and distribute individual applications, networked or non-networked, with all of their dependencies or to set up isolated development environments.
Image-based containers also help to ensure applications are tested and used in identical environments, and they make it easy to deploy and scale networked applications. Production networked applications should always be run in some kind of container to limit damage in case of compromise.
Popular containerization systems include FreeBSD’s jails and Docker on Linux.
Using chroot on UNIX-like systems changes the apparent root directory for a process, but it is not a containerization system.
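For example, the following starts a shell whose apparent root is a different directory tree; the path is illustrative and must already contain a working /bin/sh.
sudo chroot /srv/newroot /bin/sh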
Compatibility layers
Compatibility layers are interfaces that allow programs designed for some target system to run on a different host system. There are many kinds of compatibility layers, but typical ones work by implementing target system library function calls in terms of functions that are available on the host system. Some compatibility layers require recompiling the program, and others work on unmodified target system binaries.
Notable compatibility layers include:
- WINE for running Windows executables on UNIX-like operating systems
- Winelib, part of WINE, for compiling Windows programs into native executables on UNIX-like operating systems
- Windows Subsystem for Linux for running Linux executables on Windows
- Cygwin for compiling programs for UNIX-like operating systems into native Windows executables
- Rosetta, a dynamic binary translator for running executables for different architectures on macOS
- Linuxulator for running Linux executables on FreeBSD
- linuxkpi for compiling Linux kernel drivers as part of the FreeBSD kernel
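As a quick illustration of the unmodified-binary approach, WINE runs a Windows executable directly from a UNIX-like shell; the program name here is a placeholder.
wine program.exe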
Docker
Docker is a tool for creating and running containers. Containers are instantiated from images, and you can instantiate multiple disposable containers from a single image. You can specify how an image should be built, including which packages and files to include, in a Dockerfile. Base images for use as starting points in Dockerfiles can be pulled from Docker Hub or custom-made.
Docker builds images in terms of layers according to a specified Dockerfile and build context, i.e. a set of files that the build can access. Each command line in the Dockerfile creates a new layer in the final image, and created layers are stored in a build cache. For command lines whose inputs, including all previous commands in the Dockerfile, have been seen before, the corresponding layers are simply restored from the build cache. When a command line’s input does change, its corresponding layer and the layers of all subsequent command lines will be rebuilt. You should organize your Dockerfiles to use the build cache efficiently and be aware of situations in which you might have to clear the build cache to force a layer to be rebuilt.
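The relevant commands are sketched below; the image name is a placeholder.
# build an image from the Dockerfile in the current directory,
# reusing cached layers where their inputs have not changed
docker build -t myimage .
# ignore the build cache and rebuild every layer
docker build --no-cache -t myimage .
# clear the build cache entirely
docker builder prune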
Dockerfiles also support multi-stage builds, which allow you to specify multiple build stages based on different base images. They are useful for producing minimal images without unnecessary build-time dependencies and for writing Dockerfiles with good build-cache properties that are relatively easy to read and maintain.
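Here is a minimal sketch of a multi-stage build, assuming a Go project; the base images, paths, and names are illustrative.
# write a two-stage Dockerfile: compile in a full toolchain image,
# then copy only the resulting binary into a small runtime image
cat > Dockerfile <<'EOF'
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN go build -o /app .

FROM debian:stable-slim
COPY --from=build /app /usr/local/bin/app
CMD ["/usr/local/bin/app"]
EOF
docker build -t myapp .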
Docker creates images in the Open Container Initiative (OCI) format. It uses the containerd daemon to manage images and containers at a high level. To actually instantiate containers, containerd uses runc.
Useful commands
docker build
- Create a new image from a Dockerfile.
docker commit <container> <tag>
- Create a new image from a container’s current state.
docker run <image>
- Start a container.
docker stop <container>
- Stop a container.
docker ps -a
- Show containers.
docker images
- Show downloaded images.
docker rm <container>
- Remove a container.
docker rmi <image>
- Remove an image.
docker buildx ls
- View cross-platform image builders.
docker buildx create --platform <platform>
- Create a cross-platform image builder for the given target platform(s) and print its name.
docker buildx use <builder>
- Use a cross-platform image builder for future builds.
docker buildx inspect --bootstrap
- Initialize and print information about the currently used cross-platform image builder.
docker buildx build --platform <platform>
- Create a new image for the target platform(s) from a Dockerfile; the currently used cross-platform image builder must support the specified platform(s).
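Putting the buildx commands together, one possible cross-platform workflow looks like this; the target platform and image name are placeholders, and the builder name is whatever the create command prints.
docker buildx create --platform linux/arm64      # prints the new builder's name
docker buildx use <builder>                      # substitute the printed name
docker buildx inspect --bootstrap
docker buildx build --platform linux/arm64 -t myimage .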
QEMU
QEMU is a full-system emulator. It’s useful for testing system-level software across a wide range of hardware platforms and architectures. It has device models that emulate real peripheral devices, and it also supports VirtIO. When QEMU is emulating a target architecture that matches the architecture of the host machine, it can use hypervisors such as kvm on the host machine to achieve performance on par with that of virtualization.
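A minimal sketch of a full-system invocation, assuming an x86-64 Linux host with kvm available and an existing disk image; the file name and resource sizes are arbitrary.
qemu-system-x86_64 \
    -enable-kvm \
    -m 4G \
    -smp 2 \
    -drive file=disk.qcow2,format=qcow2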
A notable feature of QEMU is user-mode emulation, which is supported on Linux and BSDs. It supports running binaries compiled for the same operating system but a different architecture, and it’s lighter weight than doing full-system emulation. It’s useful for testing and debugging cross-compiled applications.
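For example, on an x86-64 Linux host with the QEMU user-mode binaries installed, you can run an AArch64 Linux executable like this; the binary name is a placeholder, and the -L path, which points at the target architecture’s libraries, varies by system.
qemu-aarch64 -L /usr/aarch64-linux-gnu ./hello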