= THIS PAGE HAS BEEN MOVED TO SHAREPOINT! =
Please refer to this site/make edits here for the most updated information: 
https://partnershealthcare.sharepoint.com/sites/LCN/SitePages/Samseg-Testing.aspx
----
<<BR>>
<<BR>>
<<BR>>


## page was renamed from SamsegAsegTesting
#acl LcnGroup:read,write,delete,revert CmetGroup:read,write,delete,revert All:
''This page is readable only by those in the LcnGroup and CmetGroup.''

''Author(s): Nick Schmansky, Andrew Hoopes''

'''See: [[Samseg]]'''

= Samseg Testing =

This page documents one front of the work on [[Samseg]]: testing a version of Samseg that operates on T1-weighted input and uses the existing RB FS atlas, in order to compare its performance against the existing FreeSurfer v6.0 subcortical segmentation processing stream and against manual segmentations.  Initial work was conducted in an evaluation task: [[SamsegEvaluationMarch2016]].  The work described on this page extends that evaluation by:

 * Creation of a test cycle that compares a subject run of Samseg (which is under algorithmic development) to the FS v6 aseg results for that subject (or to a manual segmentation).  The test cycle should generate reports on each run to support debugging and performance evaluation, initially against the aims of the CorticoMetrics grant (i.e., an accelerated FS based on T1 input only).

 * Creation of a data set that includes enough subjects to cover the wide variety of possible inputs.  This starts with Buckner40 and ADNI60 and continues with many others, particularly subjects with a lot of neck in the FOV and 'tough cases' in terms of both registration and segmentation, plus a variety of scanner models, ages, and disease states.

 * Documentation of design decisions.

Those working on this project include: Doug Greve (DG), Nick Schmansky (NS), Andrew Hoopes (AH), Christian Larsen (CH) and Lee Tirrell (LT)

== Test Data ==

=== structure: ===

The test subjects and scripts are located here: {{{/cluster/fsm/users/samseg}}}

{{{
samseg
├── scripts
├── subjects
│   ├── ADNI60
│   ├── Buckner40
│   └── ...
└── tests
    ├──testdate_ADNI60
    ├──testdate_Buckner40
    └── ...
}}}

Test sets used to compare samseg results will include:

 * have manual labels:
  * Buckner39 - 39 manually labeled subjects used to create the FS subcortical atlas
  * Siemens13 - 13 manually labeled subjects scanned on Siemens Sonata (currently used in testing the FS atlas: [[AsegTestNotes]])
  * GE14 - 14 manually labeled subjects scanned on GE Signa (currently used in testing the FS atlas: [[AsegTestNotes]])
  * ADNI-Hippo - [[http://www.hippocampal-protocol.net/SOPs/index.php|135 ADNI subjects with manual segmentations of hippocampus]]
  * LPBA40 - [[http://loni.usc.edu/atlases/Atlas_Detail.php?atlas_id=12|40 subjects with manual segmentations of gm, wm and csf]]
  * OASIS-TRT-20 - [[https://osf.io/zevma/|20 OASIS subjects with both cortical and subcortical manual labels]]

 * have freesurfer asegs (either v5.3 or v6.0):
  * Buckner40 - public buckner40, processed by FS v6, where aseg's have been manually inspected as part of FS release testing
  * ADNI60 - processed by FS v6, where aseg's have been manually inspected as part of FS release testing
  * ADNI_714_1.5T - 714 ADNI data subjects scanned at 1.5T, processed by FS 5.3, and QA'd by Tian Ge (asegs not each manually inspected)
  * ADNI_1150_3T - 1150 ADNI data subjects scanned at 3T, processed by FS 5.3, and QA'd by Tian Ge (asegs not each manually inspected)
  * IXI_79 - 79 [[http://brain-development.org/ixi-dataset/|IXI]] subjects (all healthy controls), scanners: Philips 1.5T & 3T and GE 1.5T, processed by FS 5.3
  * !ThreeScanners14 - 14 subjects each scanned on GE 1.5T, Siemens 1.5T and Siemens 3T
  * Miriad20 - [[http://www.ucl.ac.uk/drc/research/methods/miriad-scan-database|20 subjects scanned at multiple timepoints (including same-day) from a set of 40+]]
  * Outliers - consisting of subject scans known to be problematic (big ventricles, lots of neck, skewed placement in scanner, etc.)

 * other:
  * ADNI-1.5T-all - 4800 ADNI 1.5T scans, 001.mgz files only; many are the same subject at different timepoints
  * ADNI-3T-all - 7976 ADNI 3T scans, 001.mgz files only; many are the same subject at different timepoints

 * sets awaiting inclusion:
  * poor quality sets - images that have artifacts or generally poor contrast, hence un-recon'able
  * disorder sets - data from various disorder studies (outside of AD), eg. ms, schizophrenia, autism, tumor, child, addiction, etc.
  * defaced set - subjects run through mri_deface, compared to non-defaced, testing for identical results
  * fcd set - subjects with large and small cortical dysplasias, and also follow-ups with a cortical area (fcd) resected
  * HCP, OASIS, HABS, BRAINS, Mindboggle-101, BGSP - all publicly available datasets where the intent is to process by FS v6 to include in the aseg test set.
  * ADNI-Long - test-retest set
  * Bammer120 - test-retest set

=== running a test: ===

To run a test, use the {{{runtest}}} script located in {{{/cluster/fsm/users/samseg/scripts}}}. To test samseg with a multi-subject set, use the {{{-set}}} flag and indicate the test set directory (located in {{{/cluster/fsm/users/samseg/subjects}}}) as well as a test results output dir:

{{{
./runtest -all -set <subjects dir> <test outdir>
}}}
example: {{{scripts/runtest -all -set subjects/ADNI60 tests/newADNItest}}}


{{{-all}}} runs each test step, including running samseg, computing dice scores, and creating charts. To specify which steps to run, use {{{-samseg}}}, {{{-dice}}}, and/or {{{-chart}}} instead.

To test samseg with an individual subject, use the {{{-ind}}} flag and indicate the path to the subject as well as a test results output dir:

{{{
./runtest -all -ind <subject> <test outdir>
}}}
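
For reference, the overlap measure reported by the {{{-dice}}} step can be sketched as follows. This is a minimal illustration of the Dice coefficient on flat lists of voxel labels, not the code used by {{{runtest}}} or {{{mri_compute_seg_overlap}}}. The function name {{{dice}}} and the tiny example segmentations are assumptions made for illustration only.

```python
def dice(labels_a, labels_b, label):
    """Dice overlap of one label between two segmentations.

    Both inputs are flat sequences of integer voxel labels of equal length.
    Dice = 2 * |A and B| / (|A| + |B|), where A and B are the voxel sets
    carrying `label` in each segmentation.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("segmentations must have the same number of voxels")
    in_a = sum(1 for v in labels_a if v == label)
    in_b = sum(1 for v in labels_b if v == label)
    both = sum(1 for a, b in zip(labels_a, labels_b) if a == label and b == label)
    if in_a + in_b == 0:
        return float("nan")  # label absent from both segmentations
    return 2.0 * both / (in_a + in_b)


# Hypothetical 8-voxel "volumes"; the labels 17 and 53 are just example IDs.
samseg_seg = [0, 17, 17, 17, 53, 53, 0, 0]
manual_seg = [0, 17, 17, 0, 53, 53, 53, 0]

for label in (17, 53):
    print(label, dice(samseg_seg, manual_seg, label))
```

A score of 1.0 means the two segmentations agree exactly on that label, and 0.0 means they share no voxels.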

NOTE: matplotlib (the Python plotting package) does not appear to be installed center-wide, so the {{{runtest}}} shebang currently points to a personal Anaconda install on topaz.

=== results: ===

In the test output directory, the script creates a dir for each subject and an {{{analysis}}} dir (not produced for an individual subject test). Each subject dir contains its samseg output as well as {{{dice.dat}}} and {{{dice.log}}}, produced by {{{mri_compute_seg_overlap}}}. In the analysis dir, {{{subjs_dice.log}}} is a summary file of the mean overlap for each subject, and {{{labels_dice.log}}} is a summary file of the mean overlap for each brain structure across all subjects. The runtest script also creates {{{labels_dice.no_outliers.log}}}, which ignores any subject with a mean overlap below a threshold (default is 0.2; can be changed with {{{-outlier}}}); the ignored subjects are written to {{{outliers}}}. An associated chart is created for each of these three summary files and saved as .png.
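
The outlier handling described above can be sketched roughly as follows. This is an assumption-laden illustration, not the {{{runtest}}} implementation: the on-disk format of the log files is not documented here, so the sketch works on an in-memory dict, and the names {{{split_outliers}}} and {{{label_means}}}, the subject IDs, and the scores are all hypothetical.

```python
from statistics import mean


def split_outliers(subject_scores, threshold=0.2):
    """Separate subjects whose mean dice falls below `threshold`.

    subject_scores maps subject name -> {structure name: dice score}.
    Returns (kept subjects dict, list of outlier subject names).
    """
    kept, outliers = {}, []
    for subject, scores in subject_scores.items():
        if mean(scores.values()) < threshold:
            outliers.append(subject)
        else:
            kept[subject] = scores
    return kept, outliers


def label_means(subject_scores):
    """Mean dice per structure across all given subjects."""
    labels = sorted({lab for s in subject_scores.values() for lab in s})
    return {
        lab: mean(s[lab] for s in subject_scores.values() if lab in s)
        for lab in labels
    }


# Hypothetical scores, for illustration only (not real test results).
scores = {
    "subj_a": {"Left-Hippocampus": 0.85, "Right-Hippocampus": 0.83},
    "subj_b": {"Left-Hippocampus": 0.10, "Right-Hippocampus": 0.06},
    "subj_c": {"Left-Hippocampus": 0.81, "Right-Hippocampus": 0.87},
}

kept, outliers = split_outliers(scores)  # default threshold 0.2, as above
print("outliers:", outliers)
print("per-label means without outliers:", label_means(kept))
```

In this sketch {{{subj_b}}} is dropped because its mean overlap (0.08) is under the default threshold, so it no longer drags down the per-structure means.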

== Test Runs ==

||<#c5e0de> '''Robust, 2244-subject test:'''||
||06.22.2017 - Buckner39, Buckner40, ADNI60, ADNI_714_1.5T, ADNI_1150_3T, IXI_79, GE14, Siemens13 - [[SamsegTesting/JuneTest|JuneTest]]||
||<#c5e0de> '''Kvlreg vs Elastix:'''||
||06.18.2017 - Buckner39 and ADNI60 - multi-resolution parameter estimation and tune optimization convergence criteria - [[SamsegTesting/2017-06-18|2017-06-18]]||
||<#c5e0de> '''Kvlreg vs Elastix:'''||
||06.05.2017 - comparison of initial affine registration techniques - [[SamsegTesting/elastix_vs_kvlreg|elastix_vs_kvlreg]]||
||<#c5e0de> '''Kvlreg:'''||
|| 06.08.2017 - Siemens13, GE14, Buckner40, and ADNI-HIPPO - [[SamsegTesting/2017-06-08|2017-06-08]]<<BR>>06.01.2017 - Buckner39 - all subjects registered - (mean: 0.83, overall: 0.86) - [[SamsegTesting/2017-06-01_Buckner39|2017-06-01_Buckner39]]<<BR>> 06.01.2017 - ADNI60 - all subjects registered - (mean: 0.846, overall: 0.88) - [[SamsegTesting/2017-06-01_ADNI60|2017-06-01_ADNI60]]||
||<#c5e0de> '''Elastix test:'''||
|| 05.26.2017 - Buckner39 - all subjects registered - (mean: 0.83, overall: 0.86) - [[SamsegTesting/2017-05-26_Buckner39|2017-05-26_Buckner39]]<<BR>> 05.26.2017 - ADNI60 - all subjects registered - (mean: 0.82, overall: 0.87) - [[SamsegTesting/2017-05-26_ADNI60|2017-05-26_ADNI60]]||
||<#c5e0de> '''Code clean-up:'''||
|| 05.23.2017 - GE14 - '''no change''' since the last test - (mean: 0.77, overall: 0.83)<<BR>> 05.21.2017 - Buckner39 - '''no change''' since the last test - still 21 failed registrations, and an average overall dice of 0.86 for successful subjects<<BR>> 05.21.2017 - ADNI60 - basically '''no change''' since the last test - 18 failed registrations, and still an average overall dice of 0.84 for successful subjects||
||<#c5e0de> '''ADNI135 hippocampus:'''||
|| 03.25.2017 - 135 ADNI subjects with manual hippocampus labels - mean hippocampus overlap (0.96) - [[SamsegTesting/2017-04-14_ADNIHIPPO|2017-04-14_ADNIHIPPO]]||
||<#c5e0de> '''Timing test:'''||
|| 03.25.2017 - 2 Buckner39 subjects - average elapsed time 1256 s (20.9 min) - [[SamsegTesting/2017-03-25_timing|2017-03-25_timing]] ||
||<#c5e0de> '''After adding Rician noise, includes comparison to v6 aseg:'''||
|| 03.25.2017 - Siemens13 - all subjects successful (mean: 0.83, overall: 0.87) - [[SamsegTesting/2017-03-25_Siemens13|2017-03-25_Siemens13]]<<BR>> 03.25.2017 - GE14 - registration fixed - all subjects successful (mean: 0.77, overall: 0.83) - [[SamsegTesting/2017-03-25_GE14|2017-03-25_GE14]]<<BR>> 03.25.2017 - Buckner39 - '''no change''' - 21 subjects with unsuccessful registration - [[SamsegTesting/2017-03-25_Buckner39|2017-03-25_Buckner39]]<<BR>> 03.25.2017 - Buckner40 - registration fixed - all subjects successful (mean: 0.86, overall: 0.89) - [[SamsegTesting/2017-03-25_Buckner40|2017-03-25_Buckner40]]<<BR>> 03.25.2017 - ADNI60 - '''no change''' - 19 subjects with unsuccessful registration ||
||<#c5e0de> '''V6 aseg:'''||
||Siemens13 - V6.0 aseg - overlap between v6 aseg and manualseg - [[SamsegTesting/Siemens13_V60_aseg|Siemens13_V60_aseg]]<<BR>> Buckner39 - V6.0 aseg - overlap between v6 aseg and seg_edited - [[SamsegTesting/Buckner39_V60_aseg|Buckner39_V60_aseg]]<<BR>> GE14 - V6.0 aseg - overlap between v6 aseg and manualseg - [[SamsegTesting/GE14_V60_aseg|GE14_V60_aseg]] ||
||<#c5e0de> '''March 2017 initial test:'''||
|| 03.19.2017 - Siemens13 - all subjects produced dice scores above 0.8! - [[SamsegTesting/2017-03-19_Siemens13|2017-03-19_Siemens13]]<<BR>> 03.19.2017 - GE14 - 2 subjects had unsuccessful registration - [[SamsegTesting/2017-03-19_GE14|2017-03-19_GE14]]<<BR>> 03.19.2017 - Buckner39 - 21 subjects had unsuccessful registration! - [[SamsegTesting/2017-03-19_Buckner39|2017-03-19_Buckner39]]<<BR>> 03.17.2017 - Buckner40 - 2 subjects with low dice due to poor reg, and 1 subject failed - [[SamsegTesting/2017-03-17_Buckner40|2017-03-17_Buckner40]]<<BR>> 03.06.2017 - ADNI60 - 19 subjects with low dice due to poor registration - [[SamsegTesting/2017-03-06_ADNI60|2017-03-06_ADNI60]] ||

== Outlier Subjects ==

A list of subjects that are known to fail or produce poor results.

'''expected failures'''<<BR>>
||Buckner40    ||128||
||GE14         ||1333||
||ADNI_1150_3T  ||388031||
||ADNI_1150_3T  ||119525||
||ADNI_1150_3T  ||120411||
||ADNI_714_1.5T ||39519||
||ADNI_714_1.5T ||51533||
||ADNI_714_1.5T ||68581||
||ADNI_714_1.5T ||143520||
||ADNI_714_1.5T ||149771||

'''subject set outliers''' (subjects that do poorly relative to other subjects in their test set)<<BR>>
||ADNI60  ||0057||
||ADNI60  ||0889||

== Tasks ==

 * create new MIRIAD dataset (for test-retest, for longitudinal, for AD/normals compare, and for GE 1.5T scanner coverage):
  * decide on which subset of subjects (40?)
  * decide on which follow-up timepoints to include

 * create a test-retest test:
  * to compare the structure differences of a subject scanned twice (aka test-retest)
  * to see how the FS differences compare to the Samseg differences (hopefully, samseg has smaller test-retest differences)
  * can use !ThreeScanners14 and MIRIAD datasets
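
One possible shape for the test-retest comparison is sketched below. It assumes the quantity compared is a structure volume from two scans of the same subject and uses a symmetrized absolute percent difference; the metric choice, the function names, and all the numbers are assumptions for illustration, not an agreed design or real results.

```python
def retest_percent_diff(vol_test, vol_retest):
    """Symmetrized absolute percent difference between two measurements.

    The difference is normalized by the mean of the two values, so the
    result does not depend on which scan is called "test" or "retest".
    """
    midpoint = (vol_test + vol_retest) / 2.0
    if midpoint == 0:
        raise ValueError("both volumes are zero; percent difference undefined")
    return 100.0 * abs(vol_test - vol_retest) / midpoint


def mean_retest_diff(pairs):
    """Average test-retest percent difference over (test, retest) pairs."""
    if not pairs:
        raise ValueError("need at least one (test, retest) pair")
    return sum(retest_percent_diff(a, b) for a, b in pairs) / len(pairs)


# Hypothetical volumes (mm^3) of one structure for two subjects,
# each scanned twice. These are made-up numbers, not measurements.
fs_pairs = [(4000.0, 4100.0), (3900.0, 3800.0)]
samseg_pairs = [(4000.0, 4040.0), (3900.0, 3880.0)]

print("FS mean test-retest diff (%):", mean_retest_diff(fs_pairs))
print("Samseg mean test-retest diff (%):", mean_retest_diff(samseg_pairs))
```

Running the same function over FS and Samseg outputs for the same scan pairs gives two directly comparable numbers per structure.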

 * create an asegstats significance test:
  * using aseg.stats files from FS, samseg, and from manual label sets, create test scripts which show:
   * absolute percent difference
   * t-test significance
  * plot these
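
The two statistics named in this task could be computed along these lines. This is a sketch under assumptions: it takes per-subject volumes that have already been read out of the {{{aseg.stats}}} files (parsing is not shown), treats the comparison as a paired test, and uses hypothetical function names and made-up numbers. Only the t statistic is computed here; a p-value would come from a t distribution, for example via {{{scipy.stats.ttest_rel}}} if SciPy is available.

```python
import math
from statistics import mean, stdev


def abs_percent_diff(value, reference):
    """Absolute percent difference of `value` relative to `reference`."""
    if reference == 0:
        raise ValueError("reference volume is zero")
    return 100.0 * abs(value - reference) / abs(reference)


def paired_t(xs, ys):
    """Paired t statistic and degrees of freedom for matched samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must be paired (same length)")
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Hypothetical volumes (mm^3) of one structure for four subjects.
manual_vols = [4000.0, 4200.0, 3900.0, 4100.0]
samseg_vols = [4100.0, 4250.0, 4000.0, 4150.0]

pct = [abs_percent_diff(s, m) for s, m in zip(samseg_vols, manual_vols)]
t_stat, dof = paired_t(samseg_vols, manual_vols)
print("mean absolute percent difference:", mean(pct))
print("paired t statistic:", t_stat, "with", dof, "degrees of freedom")
```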

 * create test coverage matrix:
  * one dimension is the set of things we want to cover (age, disease, scanner manufacturer, scanner strength, outlier/artifact/failures) and the other dimension is each dataset (Buckner40, ADNI60, IXI_79, GE14, Siemens13, etc.)
  * we want to shoot for complete coverage in the matrix
  * but using only enough subjects to provide enough 'statistical power'
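
A coverage matrix of this kind could be prototyped as below. This is a sketch only: the tags are limited to scanner details stated in the Test Data section of this page (so a missing tag means "not stated here", not "absent from the data"), and the function names and the tag vocabulary are assumptions.

```python
def coverage_matrix(datasets, wanted):
    """Map each dataset name to a row of booleans, one per wanted item."""
    return {
        name: [item in covered for item in wanted]
        for name, covered in datasets.items()
    }


def uncovered(datasets, wanted):
    """Wanted items that no dataset covers."""
    seen = set().union(*datasets.values())
    return [item for item in wanted if item not in seen]


# Tags taken from the dataset descriptions on this page.
datasets = {
    "GE14": {"GE"},
    "Siemens13": {"Siemens"},
    "IXI_79": {"Philips", "GE", "1.5T", "3T"},
    "ThreeScanners14": {"GE", "Siemens", "1.5T", "3T"},
}
wanted = ["GE", "Siemens", "Philips", "1.5T", "3T"]

print(" " * 16 + " ".join(wanted))
for name, row in coverage_matrix(datasets, wanted).items():
    cells = [("x" if hit else "-").center(len(w)) for w, hit in zip(wanted, row)]
    print(name.ljust(16) + " ".join(cells))

print("not covered by any dataset:", uncovered(datasets, wanted))
```

Adding rows for the remaining datasets and columns for age, disease, and outlier categories would turn the same structure into the full matrix.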