How to install ArchiveBox to preserve websites you care about

Install ArchiveBox, an open-source, self-hosted web archive, to preserve websites you care about.


Installation and configuration

Install required dependencies.

You can skip youtube-dl if you do not plan to use it. Expect these packages to need at least 1 GB of free disk space.
$ sudo apt install python3 python3-pip git curl wget youtube-dl chromium
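
To confirm that the installation succeeded, a small check like the following can list any command that is still missing from PATH. This is only a sketch: the helper name missing_tools is hypothetical, and the binary names are assumptions (the chromium package may provide chromium or chromium-browser, depending on the release).

```shell
# Print the name of every command given as an argument that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Binary names are assumptions; adjust them to match your distribution.
missing_tools python3 pip3 git curl wget youtube-dl chromium
```

No output means every listed command was found.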

Clone the source code to the /srv/archivebox/ directory.

$ sudo git clone https://github.com/pirate/ArchiveBox.git /srv/archivebox --depth 1

Ensure that the output directory exists.

$ sudo mkdir -p /srv/archivebox/output

Create the /srv/archivebox/etc/ArchiveBox.conf configuration file.

These settings are documented on the Configuration wiki page.
# Example config file for ArchiveBox: The self-hosted internet archive.
# Copy this file to ~/.ArchiveBox.conf before editing it.
# Config file is in both Python and .env syntax (all strings must be quoted).
# For documentation, see:
#    https://github.com/pirate/ArchiveBox/wiki/Configuration

################################################################################
## General Settings
################################################################################

#OUTPUT_DIR="output"
#OUTPUT_PERMISSIONS=755
ONLY_NEW=True
TIMEOUT=3600
MEDIA_TIMEOUT=7200
#TEMPLATES_DIR="archivebox/templates"
#FOOTER_INFO="Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests."


################################################################################
## Archive Method Toggles
################################################################################

FETCH_TITLE=True
FETCH_FAVICON=True
FETCH_WGET=True
FETCH_WARC=True
FETCH_PDF=True
FETCH_SCREENSHOT=True
FETCH_DOM=True
FETCH_GIT=True
FETCH_MEDIA=False
SUBMIT_ARCHIVE_DOT_ORG=True


################################################################################
## Archive Method Options
################################################################################

CHECK_SSL_VALIDITY=True
FETCH_WGET_REQUISITES=True
#RESOLUTION="1440,900"
#WGET_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"
#CHROME_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"
#GIT_DOMAINS="github.com,bitbucket.org,gitlab.com"
#COOKIES_FILE="path/to/cookies.txt"
#CHROME_USER_DATA_DIR="~/.config/google-chrome/Default"


################################################################################
## Shell Options
################################################################################

USE_COLOR=False
SHOW_PROGRESS=False
LC_ALL=C.UTF-8

################################################################################
## Dependency Options
################################################################################

#CURL_BINARY="curl"
#GIT_BINARY="git"
#WGET_BINARY="wget"
#YOUTUBEDL_BINARY="youtube-dl"
#CHROME_BINARY="chromium-browser"
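
Since most of the file consists of commented-out defaults, it can help to list only the settings that are actually active. The following is a minimal sketch; the function name active_settings is hypothetical, and the path in the usage comment is the one used in this guide.

```shell
# Print only the active (uncommented, non-empty) lines of a config file.
active_settings() {
  grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Usage, assuming the config file created above:
# active_settings /srv/archivebox/etc/ArchiveBox.conf
```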

Change the owner and group to www-data.

$ sudo chown -R www-data:www-data /srv/archivebox

Ensure that the application can store data in the output directory.

$ sudo chmod 770 /srv/archivebox/output
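
The result of the ownership and permission changes can be checked with stat. This sketch assumes GNU stat (the default on Debian and Ubuntu); the helper name owner_and_mode is hypothetical.

```shell
# Print "owner:group mode" for a path, using GNU stat format sequences.
owner_and_mode() {
  stat -c '%U:%G %a' "$1"
}

# Usage; after the commands above this should report www-data:www-data 770.
# owner_and_mode /srv/archivebox/output
```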

Web-server configuration

Install the nginx web server.

$ sudo apt install nginx

Disable default configuration.

$ sudo unlink /etc/nginx/sites-enabled/default

Create the /etc/nginx/sites-available/archivebox configuration file.

server {
  listen 80;
  server_name _;

  root /srv/archivebox/output/; 
  index index.html;

  location / {
    try_files $uri $uri/ =404;
  }

  location /archive/ {
    autoindex on;
  }
}

Enable this specific configuration.

$ sudo ln -s /etc/nginx/sites-available/archivebox /etc/nginx/sites-enabled/

Reload the nginx service.

$ sudo systemctl reload nginx
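
One way to make this step safer is to let nginx test the configuration before reloading, and then confirm that the site answers. The sketch below defines two hypothetical helpers (reload_nginx_if_valid and http_status); it assumes nginx listens on localhost port 80, as in the configuration above.

```shell
# Reload nginx only if the configuration passes its built-in syntax test.
reload_nginx_if_valid() {
  sudo nginx -t && sudo systemctl reload nginx
}

# Print the HTTP status code returned for a URL.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Usage:
# reload_nginx_if_valid
# http_status http://localhost/
```

Until the first URL has been archived there may be no index.html in the output directory, so a 403 or 404 status at this point does not necessarily indicate a broken configuration.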

Archive URL

Finally, use the following code snippet to archive a specific URL.

$ URL="http://lwn.net"; sudo -u www-data bash -c "cd /srv/archivebox/; set -a; source etc/ArchiveBox.conf; echo $URL | /srv/archivebox/archive"
[*] [2019-06-16 21:35:13] Parsing new links from output/sources/stdin-1560720913.txt...
    > Adding 1 new links to index (parsed import as Plain Text)
[*] [2019-06-16 21:35:13] Saving main index files...
    √ output/index.json
    √ output/index.html
[▶] [2019-06-16 21:35:13] Updating content for 1 pages in archive...

[+] [2019-06-16 21:35:13] "http://lwn.net"
    http://lwn.net
    > output/archive/1560720913
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > archive_org
[√] [2019-06-16 21:35:32] Update of 1 pages complete (19.55 sec)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors
    To view your archive, open: output/index.html
[*] [2019-06-16 21:35:32] Saving main index files...
    √ output/index.json
    √ output/index.html

This may look more complicated than necessary; you will see next week why I perform the archiving process in this particular way.
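
The set -a step in that command is easy to overlook. It marks every variable assigned afterwards for export, so the values sourced from the configuration file reach child processes, which is presumably how the archive script picks them up. The following self-contained sketch illustrates the difference with a throwaway file; the function name child_sees_timeout is hypothetical.

```shell
# Report whether a child process sees TIMEOUT after sourcing a config file.
# $1 is the config path; pass "export" as $2 to enable set -a first.
child_sees_timeout() {
  (
    unset TIMEOUT
    if [ "$2" = "export" ]; then set -a; fi
    . "$1"
    sh -c 'echo "${TIMEOUT:-unset}"'
  )
}

conf=$(mktemp)
printf 'TIMEOUT=3600\n' > "$conf"
child_sees_timeout "$conf" plain    # prints: unset
child_sees_timeout "$conf" export   # prints: 3600
rm -f "$conf"
```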

Additional notes

Do not forget to create and configure an SSL certificate.