Creating a Validator Mapping File
The JSON mapping file is the most important piece of configuration for the Fragment Validator. It acts as a "Rosetta Stone," telling the validator how to interpret your source CSV file and transform it into a standardized format that IDhub can understand.
This guide breaks down each section of the mapping file with explanations and examples.
Section-by-Section Explanation
field_mapping
- Purpose: To map columns from your source CSV file to their target columns in the database.
- Format: A dictionary where the "key" is the target database column name and the "value" is the header name of the source column in your CSV.
- Example:
This tells the validator: "For the database's sample_id field, get the data from my CSV's collaborator_sample_id column."
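The mapping described above can be sketched in a few lines of Python. This is only an illustration of the key/value direction (target column on the left, source header on the right), not the validator's actual code; the function name `apply_field_mapping` and the inlined CSV content are hypothetical.

```python
import csv
import io

# Assumed mapping fragment: target database column -> source CSV header.
field_mapping = {"sample_id": "collaborator_sample_id"}

# Hypothetical source file, inlined so the sketch is self-contained.
source_csv = "collaborator_sample_id,notes\nS-001,first draw\nS-002,second draw\n"


def apply_field_mapping(row, mapping):
    """Build a record keyed by target column names from one source CSV row."""
    return {target: row.get(source) for target, source in mapping.items()}


rows = csv.DictReader(io.StringIO(source_csv))
records = [apply_field_mapping(row, field_mapping) for row in rows]
print(records)
```

Source columns that are not named in the mapping (here, `notes`) are simply not carried over in this sketch.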
static_fields
- Purpose: To assign a fixed, constant value to a database field for every row in your file. This is useful when a value is the same for all records in a batch (e.g., the project name or sample type).
- Format: A dictionary where the "key" is the target database column name and the "value" is the static value you want to assign.
- Example:
This will set the project field to "IBDGC-PRO" and the sample_type field to "bge" for all records processed with this mapping.
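As a rough illustration of the behavior described above, the following Python sketch stamps the same constant values onto every record. The function name `apply_static_fields` is hypothetical, and the choice to let a static value override a mapped value of the same name is an assumption, not documented validator behavior.

```python
# Static values taken from the explanation above.
static_fields = {"project": "IBDGC-PRO", "sample_type": "bge"}


def apply_static_fields(record, static):
    """Return a copy of the record with every static field set to its constant."""
    # Assumption: a static value wins if the record already has that key.
    return {**record, **static}


records = [{"sample_id": "S-001"}, {"sample_id": "S-002"}]
loaded = [apply_static_fields(record, static_fields) for record in records]
print(loaded)
```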
subject_id_candidates
- Purpose: To tell the validator which column(s) to use to identify the subject for each row. The validator will check these in order. This is the most flexible and powerful feature for subject resolution.
- Format: A dictionary where the "key" is the header name of the source column in your CSV, and the "value" is the identifier_type that corresponds to that ID.
- Example:
This tells the validator: "For each row, first look in the consortium_id column. If you find a value, treat it as a consortium_id. If that column is empty, look in the niddk_no column and treat that value as a local_id."
!!! note "Backward Compatibility"
    The validator also supports an older format where this field is a simple list of strings (e.g., ["consortium_id", "subject_id"]). In that case, the validator will use the column name itself as the identifier_type. The dictionary format is preferred for clarity and flexibility.
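The ordered lookup and the two accepted formats can be sketched as follows. This is a minimal illustration under assumptions, not the validator's implementation: the helper names `normalize_candidates` and `resolve_subject` are hypothetical, and treating whitespace-only cells as empty is an assumption.

```python
def normalize_candidates(candidates):
    """Turn either accepted format into a list of (column, identifier_type) pairs."""
    if isinstance(candidates, dict):
        return list(candidates.items())
    # Older list format: the column name doubles as the identifier_type.
    return [(name, name) for name in candidates]


def resolve_subject(row, candidates):
    """Return (value, identifier_type) from the first non-empty candidate column."""
    for column, id_type in normalize_candidates(candidates):
        value = (row.get(column) or "").strip()
        if value:
            return value, id_type
    return None  # No candidate column had a value for this row.


candidates = {"consortium_id": "consortium_id", "niddk_no": "local_id"}
print(resolve_subject({"consortium_id": "", "niddk_no": "NIDDK-3333"}, candidates))
```

Because Python dictionaries (and `json.load`) preserve key order, the order in which candidates appear in the mapping file is the order in which this sketch checks them.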
subject_id_type_field
- Purpose: An alternative way to specify the identifier_type. If this field is set, the validator will look for a column in your CSV with this name and use its value as the identifier_type for all candidates.
- Format: A string containing a column name from your CSV.
- Example:
If your CSV has a type_of_id column, the validator will use the value in that column for each row (e.g., "consortium_id" or "local_id"). This is generally less flexible than the dictionary format for subject_id_candidates and should only be used in specific cases. Set it to null if you are using the dictionary format.
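A per-row type lookup of this kind might look like the sketch below. The function name `identifier_type_for` is hypothetical, and the fallback to the candidate's own type when the cell is blank is an assumption; the validator may handle that case differently.

```python
def identifier_type_for(row, candidate_type, subject_id_type_field=None):
    """Pick the identifier_type for one row."""
    if subject_id_type_field is None:
        # null in the mapping file: use the type given in subject_id_candidates.
        return candidate_type
    row_type = (row.get(subject_id_type_field) or "").strip()
    # Assumption: fall back to the candidate's type when the cell is blank.
    return row_type or candidate_type


row = {"subject_id": "LOCAL-999", "type_of_id": "local_id"}
print(identifier_type_for(row, "consortium_id", "type_of_id"))
```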
center_id_field
- Purpose: To specify which column in your CSV contains the name of the center associated with the record.
- Format: A string containing a column name.
- Example: The validator will take the value from this column (e.g., "MSSM", "Cedars-Sinai") and use its fuzzy-matching and alias logic to find the correct numeric center ID.
default_center_id
- Purpose: A fallback numeric ID to use if the center_id_field is not provided in the mapping, or if the column is empty for a given row.
- Format: An integer.
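The interplay between center_id_field and default_center_id can be sketched as below. Everything specific here is an assumption for illustration: the alias table and its numeric IDs are invented, the fuzzy step uses Python's standard `difflib.get_close_matches` as a stand-in for whatever matching the validator really uses, and returning `None` for an unrecognized name is a guess at the unmatched case.

```python
import difflib

# Hypothetical alias table: normalized center name -> numeric center ID.
CENTER_ALIASES = {
    "mssm": 10, "sinai": 10, "mount_sinai": 10,
    "cedars-sinai": 20,
    "emory": 30,
}


def resolve_center_id(row, center_id_field, default_center_id, cutoff=0.8):
    """Resolve a center name to a numeric ID, falling back to the default."""
    if not center_id_field:
        return default_center_id  # Field not provided in the mapping.
    name = (row.get(center_id_field) or "").strip().lower()
    if not name:
        return default_center_id  # Column empty for this row.
    if name in CENTER_ALIASES:
        return CENTER_ALIASES[name]
    close = difflib.get_close_matches(name, list(CENTER_ALIASES), n=1, cutoff=cutoff)
    # Assumption: an unrecognized name yields None rather than the default.
    return CENTER_ALIASES[close[0]] if close else None


print(resolve_center_id({"center_name": "Cedars Sinai"}, "center_name", 1))
```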
exclude_from_load
- Purpose: To list any columns from your source CSV that are needed for validation (like consortium_id) but should not be loaded into the final data table itself. This prevents metadata used for mapping from being incorrectly inserted as data.
- Format: A list of strings.
- Example:
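A minimal Python sketch of the exclusion step, assuming it runs after subject resolution so that the excluded columns are still available when they are needed. The function name `drop_excluded` and the sample record are hypothetical.

```python
exclude_from_load = ["consortium_id"]


def drop_excluded(record, excluded_columns):
    """Return the record without the columns that must not reach the data table."""
    excluded = set(excluded_columns)
    return {key: value for key, value in record.items() if key not in excluded}


record = {"consortium_id": "IDG-001-A", "sample_id": "S-001", "project": "IBDGC-PRO"}
print(drop_excluded(record, exclude_from_load))
```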
Table-Specific Templates
Here is a complete example of a mapping file that uses all available features.
{
  "field_mapping": {
    "sample_id": "collaborator_sample_id",
    "knumber": "k_number"
  },
  "static_fields": {
    "project": "IBDGC-MAIN",
    "sample_type": "bge"
  },
  "subject_id_candidates": {
    "consortium_id": "consortium_id",
    "niddk_no": "local_id"
  },
  "subject_id_type_field": null,
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["consortium_id"]
}
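Before handing a mapping file to the validator, it can help to check that each section has the shape described in this guide. The sketch below is a hypothetical pre-flight check, not part of the validator: the `EXPECTED_TYPES` table is derived from the "Format" notes above, and the function name `check_mapping` is invented for illustration.

```python
import json

# Expected Python type for each documented section (from the Format notes).
EXPECTED_TYPES = {
    "field_mapping": dict,
    "static_fields": dict,
    "subject_id_candidates": (dict, list),  # list = older, still supported format
    "subject_id_type_field": (str, type(None)),
    "center_id_field": str,
    "default_center_id": int,
    "exclude_from_load": list,
}


def check_mapping(mapping):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key, value in mapping.items():
        if key not in EXPECTED_TYPES:
            problems.append(f"unknown key: {key}")
        elif not isinstance(value, EXPECTED_TYPES[key]):
            problems.append(f"{key}: unexpected type {type(value).__name__}")
    return problems


complete_example = json.loads("""
{
  "field_mapping": {"sample_id": "collaborator_sample_id", "knumber": "k_number"},
  "static_fields": {"project": "IBDGC-MAIN", "sample_type": "bge"},
  "subject_id_candidates": {"consortium_id": "consortium_id", "niddk_no": "local_id"},
  "subject_id_type_field": null,
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["consortium_id"]
}
""")
print(check_mapping(complete_example))
```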
This template provides an example of how to structure your CSV file for submitting Lymphoblastoid Cell Line (LCL) data to the fragment-validator.
Mapping Configuration Example
This is the JSON mapping configuration (lcl_mapping_example.json) that corresponds to this data template.
{
  "field_mapping": {
    "knumber": "knumber",
    "niddk_no": "niddk_no"
  },
  "subject_id_candidates": {
    "consortium_id": "consortium_id",
    "niddk_no": "niddk_no"
  },
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["consortium_id", "center_id", "identifier_type"]
}
CSV Data Example
consortium_id,niddk_no,knumber,center_name
IDG-001-A,NIDDK-1111,K1111,MSSM
IDG-002-B,,K2222,Cedars-Sinai
,NIDDK-3333,K3333,Emory
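To make the row-by-row behavior concrete, the sketch below traces the three CSV rows above through the subject_id_candidates of the LCL mapping. It only illustrates the "first non-empty candidate wins" rule under the assumption that blank cells count as empty; `first_identifier` is a hypothetical helper, not a validator function.

```python
import csv
import io

# subject_id_candidates from the LCL mapping: source column -> identifier_type.
candidates = {"consortium_id": "consortium_id", "niddk_no": "niddk_no"}

lcl_csv = """consortium_id,niddk_no,knumber,center_name
IDG-001-A,NIDDK-1111,K1111,MSSM
IDG-002-B,,K2222,Cedars-Sinai
,NIDDK-3333,K3333,Emory
"""


def first_identifier(row, candidate_columns):
    """Return (value, identifier_type) for the first candidate column with a value."""
    for column, id_type in candidate_columns.items():
        value = (row.get(column) or "").strip()
        if value:
            return value, id_type
    return None


resolved = [first_identifier(row, candidates) for row in csv.DictReader(io.StringIO(lcl_csv))]
for knumber, result in zip(["K1111", "K2222", "K3333"], resolved):
    print(knumber, result)
```

The first two rows resolve through consortium_id; only the third, where consortium_id is blank, falls through to niddk_no.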
Column Annotations
Below is a description of each column in the template and its purpose.
Subject Identification
The validator uses one or more columns to find the correct subject in the database. For the LCL table, it will try the following columns in order. At least one of these must have a value for each row.
- subject_id
  - Purpose: Contains the value of the subject's identifier. This is the primary column used to find the subject in the database.
- consortium_id
  - Purpose: The primary IBDGC identifier for a subject.
  - Identifier Type: consortium_id
- niddk_no
  - Purpose: An alternative subject identifier (the NIDDK number). This column also contains the data for the niddk_no field in the LCL table itself.
  - Identifier Type: niddk_no
Center Identification
- center_name
  - Purpose: The name of the center where the data originated. This name is used to look up the correct center ID.
  - Notes: The system uses fuzzy matching and a list of aliases to find the correct center. For example, "MSSM", "Sinai", and "mount_sinai" will all resolve to the same center.
LCL Data Fields
These columns map directly to the fields in the lcl table in ID Hub.
- knumber
  - Purpose: The "K-number" identifier for the cell line.
- niddk_no
  - Purpose: The NIDDK number associated with the cell line. Note that this field serves a dual purpose: it's used as a subject ID candidate and as data for the LCL record.
This template provides an example of how to structure your CSV file for submitting enteroid data to the fragment-validator.
Mapping Configuration Example
This is the JSON mapping configuration (enteroid_mapping_example.json) that corresponds to this data template.
{
  "field_mapping": {
    "sample_id": "sample_id"
  },
  "subject_id_candidates": {
    "subject_id": "consortium_id"
  },
  "subject_id_type_field": "identifier_type",
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["subject_id", "center_id", "identifier_type"]
}
CSV Data Example
subject_id,identifier_type,center_name,sample_id
IDG-001-A,consortium_id,MSSM,ENT-001
IDG-002-B,consortium_id,Cedars-Sinai,ENT-002
LOCAL-999,local_id,Emory,ENT-999
Column Annotations
Below is a description of each column in the template and its purpose.
Subject Identification
- subject_id
  - Purpose: Contains the value of the subject's identifier. This is the primary column used to find the subject in the database.
- identifier_type
  - Purpose: Specifies what type of ID is in the subject_id column for that row (e.g., consortium_id, local_id).
Center Identification
- center_name
  - Purpose: The name of the center where the data originated. This name is used to look up the correct center ID.
Enteroid Data Fields
These columns map directly to the fields in the enteroid table in ID Hub.
- sample_id
  - Purpose: The unique identifier for this specific enteroid sample.
This template provides an example of how to structure your CSV file for submitting genotype data to the fragment-validator.
Mapping Configuration Example
This is the JSON mapping configuration (genotype_mapping_example.json) that corresponds to this data template.
{
  "field_mapping": {
    "genotype_id": "id",
    "genotyping_project": "project",
    "genotyping_barcode": "barcode"
  },
  "subject_id_candidates": {
    "consortium_id": "consortium_id"
  },
  "subject_id_type_field": "identifier_type",
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["consortium_id", "center_id", "identifier_type"]
}
CSV Data Example
consortium_id,identifier_type,center_name,id,project,barcode
IDG-001-A,consortium_id,MSSM,GENO-001,GSA-Array-v1,987654321
IDG-002-B,consortium_id,Cedars-Sinai,GENO-002,GSA-Array-v1,987654322
IDG-003-C,consortium_id,Emory,GENO-003,GSA-Array-v2,987654323
Column Annotations
Below is a description of each column in the template and its purpose.
Subject Identification
- consortium_id
  - Purpose: The primary IBDGC identifier for a subject.
- identifier_type
  - Purpose: Specifies what type of ID is in the consortium_id column for that row. For this mapping, it will typically be consortium_id.
Center Identification
- center_name
  - Purpose: The name of the center where the data originated. Although the current genotype_mapping.json does not specify a center_id_field, providing this column allows for future-proofing and consistency.
Genotype Data Fields
These columns map directly to the fields in the genotype table in ID Hub. The header names here (id, project, barcode) are the expected source column names as defined in the field_mapping.
- id
  - Purpose: The unique identifier for this specific genotype record. This will be mapped to the genotype_id column in the database.
- project
  - Purpose: The name of the genotyping project (e.g., GSA-Array-v1). This maps to the genotyping_project column.
- barcode
  - Purpose: The barcode of the genotyping array. This maps to the genotyping_barcode column.
This template provides an example of how to structure your CSV file for submitting Olink proteomics data to the fragment-validator.
Mapping Configuration Example
This is the JSON mapping configuration (olink_mapping_example.json) that corresponds to this data template.
{
  "field_mapping": {
    "sample_id": "sample_id"
  },
  "subject_id_candidates": {
    "subject_id": "consortium_id"
  },
  "subject_id_type_field": "identifier_type",
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["subject_id", "center_id", "identifier_type"]
}
CSV Data Example
subject_id,identifier_type,center_name,sample_id
IDG-001-A,consortium_id,MSSM,OLINK-001
IDG-002-B,consortium_id,Cedars-Sinai,OLINK-002
LOCAL-999,local_id,Emory,OLINK-999
Column Annotations
Below is a description of each column in the template and its purpose.
Subject Identification
- subject_id
  - Purpose: Contains the value of the subject's identifier. This is the primary column used to find the subject in the database.
- identifier_type
  - Purpose: Specifies what type of ID is in the subject_id column for that row (e.g., consortium_id, local_id).
Center Identification
- center_name
  - Purpose: The name of the center where the data originated. This name is used to look up the correct center ID.
Olink Data Fields
These columns map directly to the fields in the olink table in ID Hub.
- sample_id
  - Purpose: The unique identifier for this specific Olink sample.
This template provides an example of how to structure your CSV file for submitting sequencing data to the fragment-validator.
Mapping Configuration Example
This is the JSON mapping configuration (sequence_mapping_example.json) that corresponds to this data template.
{
  "field_mapping": {
    "sample_id": "sample_id",
    "sample_type": "sample_type",
    "vcf_sample_id": "vcf_sample_id"
  },
  "subject_id_candidates": {
    "consortium_id": "consortium_id"
  },
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["consortium_id", "center_id"]
}
CSV Data Example
consortium_id,center_name,sample_id,sample_type,vcf_sample_id
IDG-001-A,MSSM,SEQ-001,WGS,SAM-001A
IDG-002-B,Cedars-Sinai,SEQ-002,RNA-Seq,SAM-002B
IDG-003-C,Emory,SEQ-003,16S,SAM-003C
Column Annotations
Below is a description of each column in the template and its purpose.
Subject Identification
- consortium_id
  - Purpose: The primary IBDGC identifier for a subject. This is used to find the correct subject in the database.
Center Identification
- center_name
  - Purpose: The name of the center where the data originated. This name is used to look up the correct center ID.
Sequence Data Fields
These columns map directly to the fields in the sequence table in ID Hub. The header names here are examples of common source column names.
- sample_id
  - Purpose: The unique identifier for this specific sequencing sample.
- sample_type
  - Purpose: The type of sequencing performed (e.g., WGS, RNA-Seq, 16S).
- vcf_sample_id
  - Purpose: The sample identifier found within the VCF file, if applicable.
This template provides an example of how to structure your CSV file for submitting general specimen data to the fragment-validator.
Mapping Configuration Example
This is the JSON mapping configuration (specimen_mapping_example.json) that corresponds to this data template.
{
  "field_mapping": {
    "sample_id": "sample_id",
    "sample_type": "sample_type",
    "year_collected": "year_collected",
    "redcap_event": "redcap_event",
    "region_location": "region_location",
    "sample_available": "sample_available",
    "project": "project",
    "identifier_type": "identifier_type"
  },
  "subject_id_candidates": {
    "subject_id": "consortium_id"
  },
  "subject_id_type_field": "identifier_type",
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["subject_id", "center_id", "identifier_type"]
}
CSV Data Example
subject_id,identifier_type,center_name,sample_id,sample_type,year_collected,project
IDG-001-A,consortium_id,MSSM,SPEC-001A,Plasma,2023,IBDGC-PRO
IDG-002-B,consortium_id,Cedars-Sinai,SPEC-002B,Serum,2024,IBDGC-PRO
LOCAL-999,local_id,Emory,SPEC-999Z,Stool,2024,Immuno-Chip
Column Annotations
Below is a description of each column in the template and its purpose.
Subject Identification
- subject_id
  - Purpose: Contains the value of the subject's identifier. This is the primary column used to find the subject in the database.
  - Note: The header name subject_id is defined by the subject_id_candidates entry in the mapping configuration. If you used a different column name in your source file (e.g., participant_id), you would update subject_id_candidates to use that header instead: {"participant_id": "consortium_id"} in the dictionary format, or ["participant_id"] in the older list format.
- identifier_type
  - Purpose: Specifies what type of ID is in the subject_id column for that row.
  - Notes: This single type is applied to all fields listed in subject_id_candidates. Common values are consortium_id or local_id. The name of this column itself (identifier_type) is defined by the subject_id_type_field in the mapping configuration.
Center Identification
- center_name
  - Purpose: The name of the center where the data originated.
  - Notes: The system uses this name to find the correct numeric center ID. The header name center_name is defined by the center_id_field in the mapping configuration.
Specimen Data Fields
These columns map directly to the fields in the specimen table in ID Hub. The header names here (sample_id, sample_type, etc.) are the expected source column names as defined in the field_mapping section of the config.
- sample_id
  - Purpose: The unique identifier for this specific specimen.
- sample_type
  - Purpose: The type of specimen (e.g., Plasma, Serum, Stool).
- year_collected
  - Purpose: The year the specimen was collected.
- project
  - Purpose: The name of the project associated with this specimen.