Changing preferred content

Every git-annex repository has a “preferred content” expression that defines which part of the main git-annex repository will be downloaded.

For example, by default we set up new media players with the following preferred content:

include=* and (exclude=video/original/*)

This means the media player will fetch all files except those in the video/original directory. This is implemented through a group setting called mediaplayer. New groups can be defined; for example, here is how the mediaplayer group was created:

git annex groupwanted mediaplayer 'include=* and (exclude=video/original/*)'


Groups are global across different media players and cannot be erased once created, so make sure the name is good before creating it. It should be a singular, descriptive name.

This creates the new mediaplayer group, which can be used like the standard groups. The repository can then be added to that group and configured to follow the group's preferred content expression, as is done during the installation process:

git annex group . mediaplayer
git annex wanted . groupwanted

You can also assign any other media player to a given group. So say you have created an audio group with:

git annex groupwanted audio 'include=audio/*'

You could assign the git-annex repository to the group with:

git annex group . audio
git annex wanted . groupwanted

Group configuration is also available to remote operators through the web interface.

Currently, only the mediaplayer group is defined; when new groups are created, they should be documented here.
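
As a rough illustration (this is a sketch, not how git-annex evaluates expressions internally), the exclusion in the mediaplayer expression behaves like a shell glob match on file paths:

```shell
#!/bin/sh
# Illustrative sketch only: emulate the effect of the expression
# 'include=* and (exclude=video/original/*)' with a shell glob, to show
# which files a repository in the mediaplayer group would want.
# The file names below are hypothetical examples.
wanted() {
    case "$1" in
        video/original/*) echo skip ;;   # excluded from media players
        *)                echo fetch ;;  # everything else is wanted
    esac
}

wanted video/mp4_sd/779.mp4     # prints "fetch"
wanted video/original/779.dv    # prints "skip"
```

Real evaluation is done by git-annex itself against the configured groupwanted expression; this just makes the glob semantics concrete.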

Unused and deleted files

When a file is deleted from the git repository, git-annex still has a copy. This is also the case when a file is modified with git annex edit: the previous version stays around for a while. Those are called unused files.


Those files should be scheduled for removal automatically, but for safety reasons, this is not currently enabled. See Redmine issue #17493 for followup.

Unused files can be inspected with:

git annex unused

Those unused files can then be completely destroyed with:

git annex drop --unused

If, however, an unused file is the last remaining copy anywhere, git-annex will refuse to remove it. In that case, you need to use --force:

git annex drop --unused --force

Unused files may also exist on the S3 repository. Add --from s3 to the above commands to operate on the S3 remote from the main website.

Finally, in some cases, files remain in the current repository when they are supposed to have been moved to a different repository. For example, this can happen with transfers to S3 that failed to complete. In that case, this command will drop the local files that have already been transferred:

git annex drop --auto .


The media players communicate various metadata about their status through two different channels: a Custom metadata script and Puppet facts.

Generic Puppet facts

Puppet provides a number of built-in facts with information about each machine. The most relevant ones are:

hostname
short name of the machine (mpYYYYMMDD-N, see the Naming convention for more information)
domain
domain name of the machine (see the Naming convention for more information)
fqdn
concatenation of the hostname and domain
memoryfree, memoryfree_mb
the amount of free RAM on the machine, in a variable unit or in MiB, e.g. 2.15 GB or 2198.61
memorysize, memorysize_mb
the total amount of RAM on the machine, in a variable unit or in MiB, e.g. 3.48 GB or 3562.93
operatingsystem, operatingsystemrelease, os

those describe the operating system of the machine, including the version and code name. Example:

operatingsystem => Debian
operatingsystemmajrelease => 8
operatingsystemrelease => 8.0
os => {"name"=>"Debian", "family"=>"Debian", "release"=>{"major"=>"8", "minor"=>"0", "full"=>"8.0"}, "lsb"=>{"distcodename"=>"jessie", "distid"=>"Debian", "distdescription"=>"Debian GNU/Linux 8.0 (jessie)", "distrelease"=>"8.0", "majdistrelease"=>"8", "minordistrelease"=>"0"}}
osfamily => Debian

blockdevices, blockdevice_&lt;device&gt;_model, blockdevice_&lt;device&gt;_size, blockdevice_&lt;device&gt;_vendor
those facts describe the disks installed in the machine. Example:

blockdevice_sda_model => M4-CT128M4SSD2
blockdevice_sda_size => 128035676160
blockdevice_sda_vendor => ATA
blockdevices => sda

the sizes are in bytes.
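
Converting the raw byte count into a human-readable size is a simple division; a quick sketch with awk, using the blockdevice_sda_size value from the example above:

```shell
# Convert blockdevice_sda_size (reported in bytes) to decimal gigabytes.
bytes=128035676160
awk -v b="$bytes" 'BEGIN { printf "%.1f GB\n", b / 1e9 }'
# prints "128.0 GB"
```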

uptime, uptime_seconds, uptime_hours, uptime_days
the time since the last reboot of the machine, as a human-readable duration (e.g. 50 days), in seconds (4343990), hours (1206) and days (50).
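
The different uptime units are simple integer divisions of uptime_seconds, as this sketch shows using the example values above:

```shell
# How the uptime facts relate to each other.
uptime_seconds=4343990
uptime_hours=$(( uptime_seconds / 3600 ))    # 1206
uptime_days=$(( uptime_seconds / 86400 ))    # 50
echo "$uptime_hours hours, $uptime_days days"
```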

processor0, processor1, ...
those fields describe the processors (or CPUs) installed on this machine. Example:

processor0 => AMD E-350 Processor
processor1 => AMD E-350 Processor
ipaddress
the private IP address of this media player. we do not currently ship the public IP address through Puppet.
boardmanufacturer, boardproductname, boardserialnumber
hardware information about the motherboard of the machine

The last check-in time is not a fact per se, but is kept separately by the Puppet dashboard. It is expressed in UTC: we use a single, universal time zone to avoid confusion, so timestamps stay uniform even if media players are deployed across multiple time zones.

Custom Puppet facts

The following facts have been implemented (see Redmine issue #16706) to provide us with a better overview of the situation in the Puppetmaster dashboard.

The port to use when connecting to the central server to reach a media player over SSH.
The port to use when connecting to the central server to reach a media player’s VNC server.

The following facts are generated by the script in the gitannex Puppet module.

gitannex_disk_space_available
the amount of disk space available for git-annex to download new files. units vary (gigabytes, megabytes, etc.) depending on the space available.
the list of files currently being transferred. if no transfer is in progress, an empty list ([]) is shown; otherwise, a list of file names, for example: ['video/mp4_sd/779.mp4','video/mp4_sd/781.mp4']
gitannex_files_present_count, gitannex_files_present_size
the number and size of files already present in the git-annex repository. this is equivalent to the local annex keys and local annex size fields in the git annex info output.
gitannex_files_total_count, gitannex_files_total_size
the total number and size of files, missing or present, in the git-annex repository. this is equivalent to the annexed files in working tree and size of annexed files in working tree fields in the git annex info output.


The missing and total counts currently exclude the video/original directory to avoid confusion, because the media players do not sync that content. The actual total repository size is therefore larger.

gitannex_files_missing_count, gitannex_files_missing_size
the number and size of files missing from the git-annex repository. this is calculated as the difference between the total and present facts.
this holds the relative date (e.g. “one week ago”) of the last commit on the master branch of the git-annex repository. this branch holds the latest changes to the file repository (adding, renaming, removing files) that the local git-annex repository is aware of, so it is a good indication of how up to date the media player is. this is equivalent to git log -1 --format=%cr.
gitannex_master_age_days, gitannex_master_age_hours, gitannex_master_age_minutes, gitannex_master_age_seconds
same as the above, but rounded to days, hours, minutes or seconds. that is, if the commit is 2 days and 3 hours old, gitannex_master_age_hours is 51.
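
The rounding described above amounts to integer division on the age in seconds, as this sketch shows:

```shell
# A commit that is 2 days and 3 hours old, expressed in the various units.
age_seconds=$(( 2 * 86400 + 3 * 3600 ))      # 183600
age_minutes=$(( age_seconds / 60 ))          # 3060
age_hours=$(( age_seconds / 3600 ))          # 51
age_days=$(( age_seconds / 86400 ))          # 2
echo "$age_hours hours, $age_days days"
```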

The following facts are generated from the script in the vnstat Puppet module.

vnstat_bandwidth_usage_up_5_seconds, vnstat_bandwidth_usage_up_day, vnstat_bandwidth_usage_up_yesterday, vnstat_bandwidth_usage_up_month, vnstat_bandwidth_usage_up_year
the upload bandwidth usage in the last 5 seconds, the current and previous day, the current month and year.
vnstat_bandwidth_usage_down_5_seconds, vnstat_bandwidth_usage_down_day, vnstat_bandwidth_usage_down_yesterday, vnstat_bandwidth_usage_down_month, vnstat_bandwidth_usage_down_year
same for download bandwidth

Note that the total amount of disk space allocated for downloading files can be roughly deduced by adding gitannex_files_present_size and gitannex_disk_space_available. This is accurate insofar as all the files in the partition are managed by git-annex.
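
For example, assuming all files on the partition are managed by git-annex, the allocated space is approximately the sum of the two facts (the byte values below are hypothetical):

```shell
# Hypothetical values, in bytes; the real facts may use other units.
gitannex_files_present_size=70000000000      # 70 GB already downloaded
gitannex_disk_space_available=929960000000   # ~930 GB still free
allocated=$(( gitannex_files_present_size + gitannex_disk_space_available ))
echo "$allocated"   # 999960000000, i.e. roughly a 1 TB partition
```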


There are various “settings” available in the Puppet dashboard. They are arbitrary key/value pairs that get passed down in the Puppet configurations and can affect (or not) the behavior of media players.

Do not use fields that are not explicitly documented here, as doing so may make a media player unreachable or unusable.

Configuration settings

Those are settings that control various operations of git-annex on the media player.

gitannex_sync_start_hour, gitannex_sync_start_minute, gitannex_sync_stop_hour, gitannex_sync_stop_minute
time (hour and minute) at which the git-annex assistant should start and stop syncing on the media player. either all or none of these fields must be specified; if none are specified, syncing is always enabled.
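
To illustrate the intended semantics of the window (a sketch only, not the actual assistant scheduling code), a start/stop pair can be checked like this, including a window that wraps past midnight:

```shell
# in_window NOW START STOP -- times as HHMM integers, e.g. 1330 for 13:30.
# Succeeds if NOW falls inside the syncing window.
in_window() {
    now=$1; start=$2; stop=$3
    if [ "$start" -le "$stop" ]; then
        # normal window, e.g. start 0800 stop 1800
        [ "$now" -ge "$start" ] && [ "$now" -lt "$stop" ]
    else
        # window wraps past midnight, e.g. start 2200 stop 0600
        [ "$now" -ge "$start" ] || [ "$now" -lt "$stop" ]
    fi
}

in_window 1300 0800 1800 && echo "syncing"      # inside the window
in_window 0300 2200 0600 && echo "syncing"      # wrapped window
```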

Upload bandwidth limit. If no unit is specified, the number is interpreted in kibibytes per second (1024 bytes per second). A unit should be provided to avoid confusion.

This is passed verbatim to the --bwlimit option of rsync. Here’s an excerpt of the rsync manual explaining how the units are interpreted and how the limit is implemented:

The RATE value can be suffixed with a string to indicate a size
multiplier, and may be a fractional value (e.g.
"--bwlimit=1.5m").  If no suffix is specified, the value will be
assumed to be in units of 1024 bytes (as if "K" or "KiB" had
been appended).

For backward-compatibility reasons, the rate limit will be
rounded to the nearest KiB unit, so no rate smaller than 1024
bytes per second is possible.

Rsync writes data over the socket in blocks, and this option
both limits the size of the blocks that rsync writes, and tries
to keep the average transfer rate at the requested limit.  Some
"burstiness" may be seen where rsync writes out a block of data
and then sleeps to bring the average rate into compliance.


The suffixes are as follows: "K" (or "KiB") is a kibibyte
(1024), "M" (or "MiB") is a mebibyte (1024*1024), and "G" (or
"GiB") is a gibibyte (1024*1024*1024).  If you want the
multiplier to be 1000 instead of 1024, use "KB", "MB", or "GB".
(Note: lower-case is also accepted for all values.)  Finally, if
the suffix ends in either "+1" or "-1", the value will be offset
by one byte in the indicated direction.

Examples: [...] 1.5mb-1 is 1499999 bytes, and [...] 2g+1 is
2147483649 bytes.

Download bandwidth limit. If no units are specified, the provided number is in bytes per second. A unit should be provided to avoid confusion.

This is passed verbatim to the --limit-rate option of wget. Here’s an excerpt of the wget manual explaining how the units are interpreted and how the limit is implemented:

Limit the download speed to amount bytes per second.  Amount may
be expressed in bytes, kilobytes with the k suffix, or megabytes
with the m suffix.  For example, --limit-rate=20k will limit the
retrieval rate to 20KB/s.  This is useful when, for whatever
reason, you don't want Wget to consume the entire available
bandwidth.

This option allows the use of decimal numbers, usually in
conjunction with power suffixes; for example, --limit-rate=2.5k
is a legal value.

Note that Wget implements the limiting by sleeping the
appropriate amount of time after a network read that took less
time than specified by the rate.  Eventually this strategy
causes the TCP transfer to slow down to approximately the
specified rate.  However, it may take some time for this balance
to be achieved, so don't be surprised if limiting the rate
doesn't work well with very small files.

Informative settings

Those fields are not necessarily used by Puppet itself; they are used by Isuma operators to record information about the machine. Fields may or may not be filled in.

Location (address, street, city, country) of this media player.
Free-form description of the site where the media player is (e.g. “Isuma Office, Cara’s desk”)
Geographic coordinates of this media player, if address is missing or irrelevant (e.g. 45° 30’ 0” N, 73° 34’ 0” W)
isuma_mp_operator_name, isuma_mp_operator_phone, isuma_mp_operator_email, isuma_mp_operator_address
name, phone number, email and address of the last known local operator of the media player.
random notes about the media player.

New fields may be added, but they must have the prefix isuma_mp_.

External synchronisation drives

Content can be synchronised to media players using an external synchronisation drive. That drive, when connected to a media player, will add all the missing content to the media player, and all content only on the media player will be added to the drive as well.

Syncing a media player

This is the standard procedure to synchronise a media player with an external synchronisation drive.

  1. connect the drive
  2. observe the LED start flashing
  3. wait for the LED to stop flashing
  4. disconnect the drive

The media player should now be synced with the drive, and the drive should have the latest content from the media player.

Debugging information is sent to syslog, in /var/log/daemon.log. Here’s an example logfile excerpt:

Jun 24 15:52:12 koumbit-mp-test logger: starting mediaplayers /lib/udev/mediaplayers-syncdrive on sdc, looking for label isuma_sneakernet
Jun 24 15:52:13 koumbit-mp-test logger: starting mediaplayers /lib/udev/mediaplayers-syncdrive on sdc1, looking for label isuma_sneakernet
Jun 24 15:52:13 koumbit-mp-test logger: mounting sdc1 on /media/isuma_sneakernet
Jun 24 15:52:13 koumbit-mp-test logger: synchronizing git-annex repository /var/isuma/git-annex with remote sneakernet as www-data

Updating a synchronisation drive

Just connecting a synchronisation drive on a media player should download all the content from the media player and update the synchronisation drive to that content.

To see which transfers are in progress, you can use the following command:

antoine@koumbit-mp-test:/var/isuma/git-annex$ sudo -u www-data -H git annex info --fast
repository mode: indirect
trusted repositories: 0
semitrusted repositories: 7
        00000000-0000-0000-0000-000000000002 -- bittorrent
        2d61a8de-a24e-44e3-9aa0-54f033fec1e9 -- [here]
        36d2cb94-e0a2-446a-87c9-02f73135b302 -- anarcat@desktop008:~/src/isuma/isuma-files
        9401d7b3-44d2-48ab-a9f1-c77fac469a1a -- [s3]
        c510ddad-24cd-4353-b5f4-03581f6f9dca -- [origin]
        d2a7d4ff-1dbf-4bfa-bb97-ae593626daf6 -- [sneakernet]
        e747d5c8-ea47-480f-8c5d-2986ce65ed89 --
untrusted repositories: 1
        00000000-0000-0000-0000-000000000001 -- web
transfers in progress:
        uploading video/mp4_sd/strata_may15_hd.mp4.mp4 to sneakernet
available local disk space: 929.96 gigabytes (+1 megabyte reserved)

Above we see that a video file (video/mp4_sd/strata_may15_hd.mp4.mp4) is being uploaded from the media player to the sneakernet, that is, the synchronisation drive. This file was downloaded on the media player after the synchronisation drive was created, so git-annex is updating the drive.

Once synchronisation is complete, you will see transfers in progress: none.


Note that git-annex may wait a little between two transfers, so you may want to run the command multiple times to make sure the transfer is complete.

To make sure no content is missing, compared to a media player, you can use:

git annex find --not --in sneakernet --in here

Manual updates of synchronisation drives

If the media player isn’t up to date, it is still possible to synchronise the drive by hand in one shot, with:

cd /media/isuma_sneakernet/git-annex
git annex sync
git annex get --exclude 'video/original/*'

There may be bandwidth limits on the sync drive. Use the annex.web-download-command setting to control that. For example, to disable bandwidth limits by hand, use:

git config --unset annex.web-download-command

To see the current setting, use:

git config --get annex.web-download-command
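
The get/unset cycle can be tried safely in a throwaway repository. This sketch only assumes git is installed; the download command value shown is illustrative, not the one deployed on media players:

```shell
# Set, read and clear annex.web-download-command in a scratch repository.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config annex.web-download-command 'wget --limit-rate=500k -c -O %file %url'
git -C "$repo" config --get annex.web-download-command    # prints the command
git -C "$repo" config --unset annex.web-download-command
git -C "$repo" config --get annex.web-download-command || echo "no limit set"
```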

Design notes

This was originally implemented using rsync in the 2.x series (see Redmine issue #181 for background), but we now use the git annex sync command with the --content argument to synchronise the contents. This is implemented in the /lib/udev/mediaplayers-syncdrive script, deployed through Puppet in the mediaplayers module. It is configured to start automatically when a properly formatted hard drive is connected, through /etc/udev/rules.d/010_mediaplayers_syncdrive.rules.

The synchronisation script will automatically mount (and then unmount, when finished) the drive on /media/isuma_sneakernet then run the git annex sync --content command. Puppet is assumed to have already properly configured the remote in the main git repository for that sync to work properly.

External drive format

A drive is identified as carrying Isuma content if it has the isuma_sneakernet filesystem label (all lowercase). The git-annex repository should be in a git-annex subdirectory on the filesystem (all lowercase).

This folder is further subdivided by content type and file format (for example, video/original, video/mp4_sd and audio directories).

Some of those files may not be synced with the media player, based on preferred content settings. For example, the video/original content is usually not synced to the media players.

Creating a new synchronisation drive

This is usually done from an existing media player, but it can actually be done from anywhere with a network connection. Note, however, that a lot of data will be downloaded over the wire, which will be slow or, worse, may impose extra bandwidth costs with your Internet Service Provider.

  1. connect the drive

  2. find the drive identifier:

    $ dmesg | tail
    [1373209.300987] scsi 18:0:0:0: Direct-Access     OEM      Ext Hard Disk    0000 PQ: 0 ANSI: 5
    [1373209.301912] sd 18:0:0:0: Attached scsi generic sg3 type 0
    [1373209.427839] sd 18:0:0:0: [sdd] Spinning up disk...
    [1373211.584051] .ready
    [1373211.615951] sd 18:0:0:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
    [1373211.640576] sd 18:0:0:0: [sdd] Write Protect is off
    [1373211.640580] sd 18:0:0:0: [sdd] Mode Sense: 10 00 00 00
    [1373211.664833] sd 18:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [1373211.776849]  sdd: sdd1
    [1373211.926204] sd 18:0:0:0: [sdd] Attached SCSI disk


    $ cat /proc/partitions
    major minor  #blocks  name
       8        0  488386584 sda
       8        1     248832 sda1
       8        2          1 sda2
       8        5  488134656 sda5
     254        0  479952896 dm-0
     254        1    8179712 dm-1
       8       48 1953514584 sdd
       8       49 1953512001 sdd1

    In both examples above, the new partition discovered is /dev/sdd1.


The following can destroy data if not followed properly. In particular, we use the device /dev/sdd1 from here on; if that device is in use by some other filesystem, it will be destroyed at the next step. You can use the df command to see mounted filesystems.

  3. format it with an ext4 filesystem with the magic label:

    mkfs -t ext4 -L isuma_sneakernet /dev/sdd1
  4. mount the drive:

    mkdir /media/isuma_sneakernet
    mount /dev/sdd1 /media/isuma_sneakernet
  5. clone the git-annex repository:

    git clone /var/isuma/git-annex /media/isuma_sneakernet/git-annex


    If you are not on a media player, the /var/isuma/git-annex repository will not be available. Not all is lost however! You can still clone from any other git-annex repo, including the one on the central server. For example, this should also work:

    git clone /media/isuma_sneakernet/git-annex

    You may need to create a new SSH key pair and install it on the central server. Since it is running an old version of Monkeysphere, you will also need to run:

    monkeysphere-authentication u antoine

    for the change to take effect.

    Once the repository is cloned, however, you will likely want to ensure the synchronisation drive doesn’t require SSH keys to synchronise the metadata on media players. So change the URL to the internal repository, even if it doesn’t exist yet:

    git remote set-url origin /var/isuma/git-annex
  6. make sure repository is readable by the webserver, for uploads:

    chown -R www-data /media/isuma_sneakernet/git-annex


    If you are running this on a non-Debian system, this user may not exist. For the record, the current UID for www-data is 33, so this would be equivalent:

    chown -R 33 /media/isuma_sneakernet/git-annex

    See also the Debian base-passwd package for more information about those identifiers.

  7. launch the sync script:

    umount /dev/sdd1
    /lib/udev/mediaplayers-syncdrive sdd1


    If you are not running this on a media player, the above will fail because it will not find the git-annex repository. You can still synchronise data directly from S3 using the following commands:

    sudo -u www-data git annex enableremote s3
    sudo -u www-data git annex get

This can take up to 24 hours at the time of writing (June 2015), depending on the size of the data set.


Creating a completely new sync drive from scratch, at the Koumbit datacenter, took around 30 hours with the connection rate-limited to 5MB/s. The dataset was about 860GB in June 2015; see Redmine issue #17834 for details.

Creating user accounts

Access to the media player is granted on a per-user basis. Users need to be created in Puppet, in the user::admins class. For example, this grants access to the antoine user:

user { 'antoine':
    ensure     => present,
    gid        => $gid,
    groups     => $groups,
    comment    => 'Antoine Beaupre',
    managehome => false,
    shell      => '/bin/bash',
    password   => '$6$[...censored..]',
}

A new block like this needs to be added to the site/user/manifests/admins.pp file for every user we want to give access to.


Note that this grants access to all machines managed through Puppet, including sudo access. Some rearchitecturing of the Puppet classes would need to be performed to have access specific to the media players, but this was not a requirement at first.


The previous access system was based on the root account, which is now locked down.

Upgrading git-annex

Depending on the git-annex installation method, there are various ways of updating git-annex when a new release comes out.

As a rule of thumb, as long as the first part of the git-annex version number doesn’t change, upgrades are non-destructive and will be forward- and backward-compatible. For example, right now the version number is 5.20150409, which means it is basically Git annex 5. A major upgrade including a data migration would come if the next release is something like 6.x.
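The rule of thumb can be checked mechanically: everything before the first dot in the version string is the major version. A quick shell sketch:

```shell
# Extract the major version from a git-annex version string.
version=5.20150409
major=${version%%.*}    # strip everything from the first dot on
echo "git-annex major version: $major"
```

As long as that first component stays the same across an upgrade, no data migration is expected.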

Those changes are documented upstream in the upgrades page and are not very common. Keep in mind that git-annex “will always support upgrades from all past versions”, so upgrading is usually a painless process, which only requires running git annex upgrade after deploying the new codebase.

Since we are usually deploying with Debian packages from NeuroDebian, only that method is documented here. You can see the latest versions available from NeuroDebian in their package page. To perform an upgrade by hand, you can simply do:

apt-get install git-annex-standalone

With Puppet, it is also possible to specify the desired version with:

class { 'gitannex':
  method => 'package',
  ensure => '5.20150819+gitgc587698-1~ndall+1',
}

The downside of using a specific version in Puppet is that it needs to be updated every time a new release comes up.


Hopefully, git-annex will eventually be part of the standard Debian backports; bug #760787 was opened about this. That way, git-annex would be upgraded through the regular unattended-upgrades process.