#acl LcnGroup:read,write,delete,revert All:read
== THIS PAGE HAS BEEN MOVED TO SHAREPOINT! ==
Please refer to, and make edits on, this site for the most up-to-date information: https://partnershealthcare.sharepoint.com/sites/LCN/SitePages/Git-Annex.aspx
----
= Git Annex =
The freesurfer repository contains many large binary files that can't be stored directly on GitHub. Instead, they are kept in a separate location and retrieved only when needed, for example when updating test data or performing a local installation. Think of ''source code'' as text files, and think of ''data files'' as binary files not required for compilation. This page details how to work with git-annex, the software used for storing and retrieving these data files in the freesurfer git repo, and is mostly meant as a guide for internal maintainers. For basic instructions on retrieving annex data, visit the [[BuildGuide|build guide]]. Additional documentation can be found on the [[https://git-annex.branchable.com/walkthrough|git-annex website]].
== General Concept ==
Git-annex is a tool for storing large files outside of the main repository. When you commit a new data file with git-annex, the main repo is aware of the file but stores it as a symlink rather than as the file's contents. For example, if you were to create a fresh clone of the freesurfer repository from GitHub, you would see that `mri_convert/testdata.tar.gz` is actually a broken relative symlink to a hashed file in the `.git/annex` directory of your repo. It's broken because you have yet to pull the actual data files from the separate annex repository in `/space/freesurfer/repo/annex.git`. The sections below describe how to do this.
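To see this on a fresh clone, you can test whether an annexed path is a symlink whose target is missing. The helper below is a minimal sketch; the function name `is_broken_symlink` is an illustrative assumption and not part of git-annex.

```shell
# Hypothetical helper (not part of git-annex): succeeds when the path is a
# symlink whose target does not exist, which is how an annexed file looks
# before its content has been retrieved.
is_broken_symlink() {
    [ -L "$1" ] && [ ! -e "$1" ]
}
```

For example, `is_broken_symlink mri_convert/testdata.tar.gz && echo "content not retrieved yet"` would print the message on a fresh clone and stay silent once the data has been fetched.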
== Setup ==
As described in the [[BuildGuide|build guide]], the annex data source must be set up as a remote repository. For those developing on the Martinos filesystem, cd into your repository and run:
{{{
git remote add datasrc file:///space/freesurfer/repo/annex.git
}}}
For those developing outside of Martinos, run:
{{{
git remote add datasrc https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git
}}}
Afterwards, the output of `git remote -v` should look something like this:
{{{
datasrc file:///space/freesurfer/repo/annex.git (fetch)
datasrc file:///space/freesurfer/repo/annex.git (push)
origin git@github.com:ahoopes/freesurfer.git (fetch)
origin git@github.com:ahoopes/freesurfer.git (push)
upstream git@github.com:freesurfer/freesurfer.git (fetch)
upstream git@github.com:freesurfer/freesurfer.git (push)
}}}
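If the same setup script has to work both inside and outside Martinos, the choice between the two URLs above could be automated roughly as follows. This is only a sketch: the function name `datasrc_url` and the idea of detecting the local annex by checking for its directory are assumptions, not an established convention of the repository.

```shell
# Hypothetical helper: print the datasrc URL to use, preferring the local
# annex when its directory is visible on this machine. The optional first
# argument overrides the local path (useful for testing).
datasrc_url() {
    local_annex="${1:-/space/freesurfer/repo/annex.git}"
    if [ -d "$local_annex" ]; then
        echo "file://$local_annex"
    else
        echo "https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git"
    fi
}
```

It could then be used as `git remote add datasrc "$(datasrc_url)"`.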
== Adding a File ==
Generally, only the freesurfer source code administrators should add a file to the annex - especially since only users at the Martinos Center will have write access to the filesystem. The following example assumes we want to add an example test script and test data tarball called 'testdata.tar.gz' to a subdirectory. '''NOTE:''' Please make sure the file you are adding has group-writable and world-readable permissions; otherwise, people will not be able to pull your new file from the server.
{{{
git add test.sh
chmod 664 testdata.tar.gz # optional - just make sure it's world readable
git-annex add testdata.tar.gz
git commit -m "added test to subdirectory"
git push
git-annex copy --to datasrc
}}}
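Because a file with the wrong permissions cannot be pulled by others, it may help to check the mode before running `git-annex add`. The following is a minimal sketch; `has_annex_perms` is a hypothetical name, and the check assumes a `find` that supports `-maxdepth` (GNU and BSD versions do).

```shell
# Hypothetical pre-flight check: succeeds only if the file is both
# group-writable (mode bit 020) and world-readable (mode bit 004).
has_annex_perms() {
    [ -n "$(find "$1" -maxdepth 0 -perm -024 2>/dev/null)" ]
}
```

For example, `has_annex_perms testdata.tar.gz || chmod 664 testdata.tar.gz` fixes the mode only when needed.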
== Getting a File ==
First, fetch the state of the remote data source. This must be done every time you want to download ''new'' annex data:
{{{
git fetch datasrc
}}}
Then, to retrieve the contents of a data file:
{{{
git-annex get mri_convert/testdata.tar.gz
}}}
Or to retrieve everything under the current directory:
{{{
git-annex get .
}}}
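Since the fetch has to come before every download of new data, the two steps can be wrapped together. The function below is a sketch under that assumption; the name `annex_fetch_and_get` and the `DRY_RUN` variable are illustrative and do not exist in the repository.

```shell
# Hypothetical wrapper: refresh the datasrc state, then get the given paths
# (default: everything under the current directory).
# Setting DRY_RUN to a non-empty value prints the commands instead of running them.
annex_fetch_and_get() {
    if [ "$#" -eq 0 ]; then
        set -- .
    fi
    if [ -n "${DRY_RUN:-}" ]; then
        echo "git fetch datasrc"
        echo "git-annex get $*"
    else
        git fetch datasrc && git-annex get "$@"
    fi
}
```

Running `annex_fetch_and_get mri_convert/testdata.tar.gz` retrieves a single file; with no arguments it behaves like `git-annex get .` preceded by a fetch.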
== Modifying a File ==
To modify the contents of a data file, first unlock it, which eliminates the symlink:
{{{
git-annex unlock mri_convert/testdata.tar.gz
}}}
Then, after making modifications, re-add it to the annex:
{{{
chmod 664 mri_convert/testdata.tar.gz
git-annex add mri_convert/testdata.tar.gz
git commit -m "updated the test data"
git push
git-annex copy --to datasrc
}}}
== Tagging ==
Git-annex provides the ability to tag data files. Freesurfer uses tags so that subsets of the data can be retrieved without having to download everything. The data files have been broken down into the following two categories:
1. Files required for build time checks - '''makecheck'''
1. Files required for a local installation - '''makeinstall'''
It is essential that data files get tagged properly so that our servers and disk space are not overwhelmed when only a known subset of the data is required.
==== Get Tagged Files ====
To get only the data files required for installation:
{{{
git fetch datasrc
git-annex get --metadata fstags=makeinstall .
}}}
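A mistyped tag matches no files, so a script that takes the tag as a parameter may want to validate it first. The guard below is a minimal sketch; `is_known_fstag` is a hypothetical name, and it accepts only the two tags listed above.

```shell
# Hypothetical guard: succeeds only for the two tags defined on this page.
is_known_fstag() {
    case "$1" in
        makecheck|makeinstall) return 0 ;;
        *) return 1 ;;
    esac
}
```

A script could then run `is_known_fstag "$tag" && git-annex get --metadata fstags="$tag" .` and report an error otherwise.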
==== Display Metadata ====
To show all the metadata associated with a file:
{{{
git-annex metadata mri_convert/testdata.tar.gz
}}}
==== Assign Metadata ====
Assigning metadata to a data file is the job of a source code administrator, similar to adding a data file. When adding metadata to an annex file, it is best to start with a clean checkout of the repository and be on the 'dev' branch. Then add the tag as follows:
{{{
git-annex metadata mri_convert/testdata.tar.gz -s fstags=makecheck
git-annex copy --to datasrc
}}}
We can also append tags:
{{{
git-annex metadata mri_convert/testdata.tar.gz -s fstags+=makeinstall
git-annex copy --to datasrc
}}}
There is no need to perform any commits, pushes, or pull requests after this is done.
==== Listing Files with a Given Tag ====
{{{
git-annex find --metadata fstags=makeinstall
}}}
== Mirroring ==
The git annex repository exists on the local file system in the following directory:
{{{
/space/freesurfer/repo/annex.git
}}}
The public-facing git annex repository exists on the local file system in the following directory (mounted by our server):
{{{
/cluster/pubftp/dist/freesurfer/repo/annex.git
}}}
Currently, we ''mirror'' the two repositories daily using the following commands:
{{{
ssh pinto
rsync -av /space/freesurfer/repo/annex.git/* /cluster/pubftp/dist/freesurfer/repo/annex.git
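cd /cluster/pubftp/dist/freesurfer/repo/annex.git  # update-server-info must run inside the repository
git update-server-info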
}}}