Creating a Validator Mapping File
The JSON mapping file is the most important piece of configuration for the Fragment Validator. It acts as a "Rosetta Stone," telling the validator how to interpret your source CSV file and transform it into a standardized format that IDhub can understand.
This guide breaks down each section of the mapping file with explanations and examples.
Configuration Schema
Schema: field_mapping
- Purpose: To map columns from your source CSV file to their target columns in the database.
- Format: A dictionary where the "key" is the target database column name and the "value" is the header name of the source column in your CSV.
- Example:
"field_mapping": {
    "sample_id": "collaborator_sample_id"
}
This tells the validator: "For the database table's sample_id field, get the data from my CSV's collaborator_sample_id column."
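To make the direction of the mapping concrete, here is a minimal Python sketch of how one CSV row could be renamed with field_mapping. The function name apply_field_mapping and the sample row are hypothetical and are not taken from the validator's code.

```python
def apply_field_mapping(row, field_mapping):
    """Build a record keyed by target column from a row keyed by CSV header."""
    record = {}
    for target_column, source_header in field_mapping.items():
        # Columns that are not listed in field_mapping are simply not copied.
        record[target_column] = row.get(source_header)
    return record


field_mapping = {"sample_id": "collaborator_sample_id"}
csv_row = {"collaborator_sample_id": "S-0001", "notes": "not mapped"}  # made-up row
record = apply_field_mapping(csv_row, field_mapping)
print(record)
```

Note that the lookup goes from the target column (the key) to the CSV header (the value), which is the opposite of what many people expect when they first write a mapping file.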
Schema: static_fields
- Purpose: To assign a fixed, constant value to a database field for every row in your file. This is useful when a value is the same for all records in a batch (e.g., the project name or sample type).
- Format: A dictionary where the "key" is the target database column name and the "value" is the static value you want to assign.
- Example:
"static_fields": {
    "project": "cd_ileal",
    "sample_type": "bge"
}
This will set the IDhub table's project field to "cd_ileal" and the sample_type field to "bge" for all records processed with this mapping.
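The effect of static_fields can be sketched in a few lines of Python. This is only an illustration: apply_static_fields and the two sample records are hypothetical, and the assumption that a static value overrides a value already present in the record is not confirmed by this guide.

```python
def apply_static_fields(record, static_fields):
    """Return a copy of the record with every static field added."""
    merged = dict(record)
    # Assumption: a static value wins if the record already has that column.
    merged.update(static_fields)
    return merged


static_fields = {"project": "cd_ileal", "sample_type": "bge"}
records = [{"sample_id": "S-0001"}, {"sample_id": "S-0002"}]  # made-up rows
loaded = [apply_static_fields(r, static_fields) for r in records]
print(loaded)
```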
Schema: subject_id_candidates and subject_id_type_field
There are two primary methods for telling the validator how to find the subject associated with each row in your CSV. You should choose one method.
Method 1: Different ID Types in Different Columns
This is the most common and flexible method. You use the subject_id_candidates dictionary to specify multiple columns, each with its own pre-defined identifier type.
- subject_id_candidates: A dictionary where the key is the CSV column name and the value is its corresponding identifier_type. The validator checks these columns in the order they appear in your mapping file.
- subject_id_type_field: Must be set to null.
Example:
Your CSV has separate columns for consortium IDs and local IDs:
Your mapping file would define the type for each column and the search order:
"subject_id_candidates": {
    "consortium_id": "consortium_id",
    "niddk_no": "local_id"
},
"subject_id_type_field": null
This tells the validator: "For each row, first look in the consortium_id column and treat any value as a consortium_id. Then look in the niddk_no column and treat that value as a local_id."
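The ordered lookup described above can be sketched as follows. The function candidate_identifiers is hypothetical, and the sketch only collects the non-empty candidates in mapping order; how the validator then matches them against the database is not covered by this guide.

```python
def candidate_identifiers(row, subject_id_candidates):
    """Collect (value, identifier_type) pairs in the order given by the mapping."""
    found = []
    for column, identifier_type in subject_id_candidates.items():
        value = (row.get(column) or "").strip()
        if value:  # empty cells are skipped
            found.append((value, identifier_type))
    return found


subject_id_candidates = {"consortium_id": "consortium_id", "niddk_no": "local_id"}
row_both = {"consortium_id": "IDG-001-A", "niddk_no": "NIDDK-1111"}
row_local_only = {"consortium_id": "", "niddk_no": "NIDDK-3333"}

print(candidate_identifiers(row_both, subject_id_candidates))
print(candidate_identifiers(row_local_only, subject_id_candidates))
```

Python dictionaries, including those produced by json.load, keep their insertion order, so the order of keys in the mapping file is the order in which the columns are visited here.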
Method 2: ID Value and Type in Separate Columns
This method is used when your CSV file has one column for the subject identifier and another column that specifies the type of identifier for that row.
- subject_id_candidates: A dictionary or list specifying the column(s) that contain the identifier value.
- subject_id_type_field: The name of the CSV column that contains the identifier type (e.g., "consortium_id", "local_id").
Example:
Your CSV has a generic subject_id column and an identifier_type column:
Your mapping file would point to these two columns:
"subject_id_candidates": {
    "subject_id": "consortium_id"
},
"subject_id_type_field": "identifier_type"
This configuration tells the validator: "For each row, get the identifier value from the subject_id column. Then, get the identifier's type from the identifier_type column for that same row." The type given in subject_id_candidates (here, consortium_id) is only a fallback and is overridden by the value in the identifier_type column.
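A minimal Python sketch of this per-row type lookup is shown below. The function identifiers_with_row_type is hypothetical, and falling back to the type from subject_id_candidates when the type cell is empty is an assumption based on the description above.

```python
def identifiers_with_row_type(row, subject_id_candidates, subject_id_type_field):
    """Pair each identifier value with the type named in the row's type column."""
    results = []
    for column, fallback_type in subject_id_candidates.items():
        value = (row.get(column) or "").strip()
        if not value:
            continue
        row_type = ""
        if subject_id_type_field is not None:
            row_type = (row.get(subject_id_type_field) or "").strip()
        # Assumption: an empty type cell falls back to the type in the mapping.
        results.append((value, row_type or fallback_type))
    return results


candidates = {"subject_id": "consortium_id"}
local_row = {"subject_id": "LOCAL-999", "identifier_type": "local_id"}
untyped_row = {"subject_id": "IDG-001-A", "identifier_type": ""}

print(identifiers_with_row_type(local_row, candidates, "identifier_type"))
print(identifiers_with_row_type(untyped_row, candidates, "identifier_type"))
```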
Do Not Mix Methods
Do not mix these two methods. If you use the recommended dictionary format in subject_id_candidates to define types for each column, you must set subject_id_type_field to null.
Backward Compatibility
The subject_id_candidates field also supports a simple list of strings (e.g., ["consortium_id", "subject_id"]). In that case, the validator will use the column name itself as the identifier_type. The dictionary format is preferred for clarity and flexibility.
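One way to picture the relationship between the two formats is a small normalization step that turns the list form into the dictionary form. The helper normalize_candidates below is hypothetical and only illustrates the rule that a list entry uses its own column name as the identifier type.

```python
def normalize_candidates(subject_id_candidates):
    """Return the dictionary form of subject_id_candidates."""
    if isinstance(subject_id_candidates, dict):
        return dict(subject_id_candidates)
    if isinstance(subject_id_candidates, list):
        # List form: the column name doubles as the identifier type.
        return {column: column for column in subject_id_candidates}
    raise TypeError("subject_id_candidates must be a dictionary or a list")


print(normalize_candidates(["consortium_id", "subject_id"]))
print(normalize_candidates({"niddk_no": "local_id"}))
```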
Schema: center_id_field
- Purpose: To specify which column in your CSV contains the name of the center associated with the record.
- Format: A string containing a column name.
- Example:
"center_id_field": "center_name"
The validator will take the value from this column (e.g., "MSSM", "Cedars-Sinai") and use its fuzzy-matching and alias logic to find the correct numeric center ID.
Schema: default_center_id
- Purpose: A fallback numeric ID to use if center_id_field is not provided in the mapping, or if the column is empty for a given row. In IDhub, center_id = 1 is the Unknown center.
- Format: An integer.
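The fallback rule can be sketched like this. The function pick_center_id, the lookup callable, and the numeric ID 2 for MSSM are all hypothetical; only the value 1 for the Unknown center comes from this guide.

```python
def pick_center_id(row, center_id_field, default_center_id, lookup_center_id):
    """Use the default when no center column is configured or the cell is empty."""
    if center_id_field is None:
        return default_center_id
    name = (row.get(center_id_field) or "").strip()
    if not name:
        return default_center_id
    return lookup_center_id(name)


known_centers = {"MSSM": 2}  # hypothetical name-to-ID table


def lookup(name):
    return known_centers[name]


print(pick_center_id({"center_name": "MSSM"}, "center_name", 1, lookup))
print(pick_center_id({"center_name": ""}, "center_name", 1, lookup))
```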
Schema: exclude_from_load
- Purpose: To list any columns from your source CSV that are needed for validation (like consortium_id) but should not be loaded into the final sample table itself. This prevents metadata used for mapping from being incorrectly inserted as data.
- Format: A list of strings.
- Example:
"exclude_from_load": ["consortium_id"]
Clarifying Exclusions
Excluding some of these fields may seem counterintuitive, since their values still end up in the database. The exclusion is necessary because a single load typically fills multiple tables: a value such as center_id may be written to the subjects table, but it must not be loaded into the sample table that receives the rest of the data.
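As a rough sketch of the idea, the excluded columns stay available for subject and center resolution but are dropped from the record that goes to the sample table. The function drop_excluded and the sample row are hypothetical.

```python
def drop_excluded(row, exclude_from_load):
    """Return the part of the row that is loaded into the sample table."""
    excluded = set(exclude_from_load)
    return {column: value for column, value in row.items() if column not in excluded}


row = {"consortium_id": "IDG-001-A", "sample_id": "SPEC-001A", "sample_type": "Plasma"}
sample_record = drop_excluded(row, ["consortium_id"])
print(sample_record)
```

The original row is left untouched, so consortium_id can still be used to find the subject before the sample record is written.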
Table-Specific Templates
Here is a complete example of a mapping file. It demonstrates the recommended method for subject identification (Method 1), where different ID types are in different columns in the source file. Note how subject_id_candidates defines the types, and subject_id_type_field is set to null.
{
    "field_mapping": {
        "sample_id": "collaborator_sample_id",
        "knumber": "k_number"
    },
    "static_fields": {
        "project": "IBDGC-MAIN",
        "sample_type": "bge"
    },
    "subject_id_candidates": {
        "consortium_id": "consortium_id",
        "niddk_no": "local_id"
    },
    "subject_id_type_field": null,
    "center_id_field": "center_name",
    "default_center_id": 1,
    "exclude_from_load": ["consortium_id"]
}
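Before running the validator, it can help to check that a mapping file parses as JSON and that each section has the type described in the schema sections above. The sketch below is a hypothetical pre-flight check, not part of the validator; it does not decide which keys are required, because this guide does not state that.

```python
import json

# Expected Python types per key, taken from the "Format" notes in this guide.
EXPECTED_TYPES = {
    "field_mapping": (dict,),
    "static_fields": (dict,),
    "subject_id_candidates": (dict, list),
    "subject_id_type_field": (str, type(None)),
    "center_id_field": (str,),
    "default_center_id": (int,),
    "exclude_from_load": (list,),
}


def check_mapping(mapping):
    """Return a list of human-readable problems; an empty list means no type errors."""
    problems = []
    for key, allowed in EXPECTED_TYPES.items():
        if key not in mapping:
            continue  # which keys are mandatory is not specified here
        value = mapping[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, allowed):
            problems.append(f"{key}: unexpected type {type(value).__name__}")
    return problems


mapping_text = """
{
    "field_mapping": {"sample_id": "collaborator_sample_id", "knumber": "k_number"},
    "static_fields": {"project": "IBDGC-MAIN", "sample_type": "bge"},
    "subject_id_candidates": {"consortium_id": "consortium_id", "niddk_no": "local_id"},
    "subject_id_type_field": null,
    "center_id_field": "center_name",
    "default_center_id": 1,
    "exclude_from_load": ["consortium_id"]
}
"""
problems = check_mapping(json.loads(mapping_text))
print(problems)
```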
This template provides an example of how to structure your CSV file for submitting Lymphoblastoid Cell Line (LCL) data to the fragment-validator.
Mapping Configuration Example
This is the JSON mapping configuration (lcl_mapping_example.json) that corresponds to this data template.
{
    "field_mapping": {
        "knumber": "knumber",
        "niddk_no": "niddk_no"
    },
    "subject_id_candidates": {
        "consortium_id": "consortium_id",
        "niddk_no": "niddk_no"
    },
    "center_id_field": "center_name",
    "default_center_id": 1,
    "exclude_from_load": ["consortium_id", "center_id", "identifier_type"]
}
CSV Data Example
consortium_id,niddk_no,knumber,center_name
IDG-001-A,NIDDK-1111,K1111,MSSM
IDG-002-B,,K2222,Cedars-Sinai
,NIDDK-3333,K3333,Emory
Column Annotations
Below is a description of each column in the template and its purpose.
Subject Identification
The validator uses one or more columns to find the correct subject in the database. For the LCL table, it will try the following columns in order. At least one of these must have a value for each row.
- subject_id
    - Purpose: Contains the value of the subject's identifier. This is the primary column used to find the subject in the database.
- consortium_id
    - Purpose: The primary IBDGC identifier for a subject.
    - Identifier Type: consortium_id
- niddk_no
    - Purpose: An alternative subject identifier (the NIDDK number). This column also contains the data for the niddk_no field in the LCL table itself.
    - Identifier Type: niddk_no
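The rule that every row needs at least one subject identifier can be checked before submission. The sketch below is a hypothetical pre-check using Python's csv module; the first three data rows are the LCL example rows from this guide, and the last row (K4444) is made up to show a failure.

```python
import csv
import io


def rows_missing_subject_id(csv_text, candidate_columns):
    """Return the file line numbers of rows where every candidate column is empty."""
    missing = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        if not any((row.get(column) or "").strip() for column in candidate_columns):
            missing.append(line_number)
    return missing


lcl_csv = """consortium_id,niddk_no,knumber,center_name
IDG-001-A,NIDDK-1111,K1111,MSSM
IDG-002-B,,K2222,Cedars-Sinai
,NIDDK-3333,K3333,Emory
,,K4444,MSSM
"""
print(rows_missing_subject_id(lcl_csv, ["consortium_id", "niddk_no"]))
```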
Center Identification
- center_name
    - Purpose: The name of the center where the data originated. This name is used to look up the correct center ID.
    - Notes: The system uses fuzzy matching and a list of aliases to find the correct center. For example, "MSSM", "Sinai", and "mount_sinai" will all resolve to the same center.
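To give a feel for alias and fuzzy matching, here is a minimal sketch built on Python's standard difflib module. It is not the validator's actual matching logic: the alias table, the numeric IDs 2 and 3, the normalization rules, and the 0.8 cutoff are all assumptions chosen for illustration.

```python
import difflib

# Hypothetical alias table: normalized alias -> numeric center ID.
CENTER_ALIASES = {
    "mssm": 2,
    "sinai": 2,
    "mount_sinai": 2,
    "cedars_sinai": 3,
}


def normalize(name):
    """Lowercase the name and unify separators so aliases compare cleanly."""
    return name.strip().lower().replace("-", "_").replace(" ", "_")


def resolve_center(name, aliases=CENTER_ALIASES, default_center_id=1, cutoff=0.8):
    """Try an exact alias first, then the closest fuzzy match, then the default."""
    key = normalize(name)
    if key in aliases:
        return aliases[key]
    close = difflib.get_close_matches(key, list(aliases), n=1, cutoff=cutoff)
    return aliases[close[0]] if close else default_center_id


for raw_name in ["MSSM", "Sinai", "mount_sinai", "Cedars-Sinai", "Unknown Clinic"]:
    print(raw_name, resolve_center(raw_name))
```

A high cutoff keeps a misspelled name from silently matching the wrong center; names that match nothing fall back to the default ID.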
LCL Data Fields
These columns map directly to the fields in the lcl table in IDhub.
- knumber
    - Purpose: The "K-number" identifier for the cell line.
- niddk_no
    - Purpose: The NIDDK number associated with the cell line. Note that this field serves a dual purpose: it is used as a subject ID candidate and as data for the LCL record.
This template provides an example of how to structure your CSV file for submitting enteroid data to the fragment-validator.
Mapping Configuration Example
This is the JSON mapping configuration (enteroid_mapping_example.json) that corresponds to this data template.
{
    "field_mapping": {
        "sample_id": "sample_id"
    },
    "subject_id_candidates": {
        "subject_id": "consortium_id"
    },
    "subject_id_type_field": "identifier_type",
    "center_id_field": "center_name",
    "default_center_id": 1,
    "exclude_from_load": ["subject_id", "center_id", "identifier_type"]
}
CSV Data Example
subject_id,identifier_type,center_name,sample_id
IDG-001-A,consortium_id,MSSM,ENT-001
IDG-002-B,consortium_id,Cedars-Sinai,ENT-002
LOCAL-999,local_id,Emory,ENT-999
Column Annotations
Below is a description of each column in the template and its purpose.
Subject Identification
- subject_id
    - Purpose: Contains the value of the subject's identifier. This is the primary column used to find the subject in the database.
- identifier_type
    - Purpose: Specifies what type of ID is in the subject_id column for that row (e.g., consortium_id, local_id).
Center Identification
- center_name
    - Purpose: The name of the center where the data originated. This name is used to look up the correct center ID.
Enteroid Data Fields
These columns map directly to the fields in the enteroid table in IDhub.
- sample_id
    - Purpose: The unique identifier for this specific enteroid sample.
This template provides an example of how to structure your CSV file for submitting genotype data to the fragment-validator.
Mapping Configuration Example
This is the JSON mapping configuration (genotype_mapping_example.json) that corresponds to this data template.
{
    "field_mapping": {
        "genotype_id": "id",
        "genotyping_project": "project",
        "genotyping_barcode": "barcode"
    },
    "subject_id_candidates": {
        "consortium_id": "consortium_id"
    },
    "subject_id_type_field": "identifier_type",
    "center_id_field": "center_name",
    "default_center_id": 1,
    "exclude_from_load": ["consortium_id", "center_id", "identifier_type"]
}
CSV Data Example
consortium_id,identifier_type,center_name,id,project,barcode
IDG-001-A,consortium_id,MSSM,GENO-001,GSA-Array-v1,987654321
IDG-002-B,consortium_id,Cedars-Sinai,GENO-002,GSA-Array-v1,987654322
IDG-003-C,consortium_id,Emory,GENO-003,GSA-Array-v2,987654323
Column Annotations
Below is a description of each column in the template and its purpose.
Subject Identification
- consortium_id
    - Purpose: The primary IBDGC identifier for a subject.
- identifier_type
    - Purpose: Specifies what type of ID is in the consortium_id column for that row. For this mapping, it will typically be consortium_id.
Center Identification
- center_name
    - Purpose: The name of the center where the data originated. Although the current genotype_mapping.json does not specify a center_id_field, providing this column allows for future-proofing and consistency.
Genotype Data Fields
These columns map directly to the fields in the genotype table in IDhub. The header names here (id, project, barcode) are the expected source column names as defined in the field_mapping.
- id
    - Purpose: The unique identifier for this specific genotype record. This will be mapped to the genotype_id column in the database.
- project
    - Purpose: The name of the genotyping project (e.g., GSA-Array-v1). This maps to the genotyping_project column.
- barcode
    - Purpose: The barcode of the genotyping array. This maps to the genotyping_barcode column.
This template provides an example of how to structure your CSV file for submitting Olink proteomics data to the fragment-validator.
Mapping Configuration Example
This is the JSON mapping configuration (olink_mapping_example.json) that corresponds to this data template.
{
    "field_mapping": {
        "sample_id": "sample_id"
    },
    "subject_id_candidates": {
        "subject_id": "consortium_id"
    },
    "subject_id_type_field": "identifier_type",
    "center_id_field": "center_name",
    "default_center_id": 1,
    "exclude_from_load": ["subject_id", "center_id", "identifier_type"]
}
CSV Data Example
subject_id,identifier_type,center_name,sample_id
IDG-001-A,consortium_id,MSSM,OLINK-001
IDG-002-B,consortium_id,Cedars-Sinai,OLINK-002
LOCAL-999,local_id,Emory,OLINK-999
Column Annotations
Below is a description of each column in the template and its purpose.
Subject Identification
- subject_id
    - Purpose: Contains the value of the subject's identifier. This is the primary column used to find the subject in the database.
- identifier_type
    - Purpose: Specifies what type of ID is in the subject_id column for that row (e.g., consortium_id, local_id).
Center Identification
- center_name
    - Purpose: The name of the center where the data originated. This name is used to look up the correct center ID.
Olink Data Fields
These columns map directly to the fields in the olink table in IDhub.
- sample_id
    - Purpose: The unique identifier for this specific Olink sample.
This template provides an example of how to structure your CSV file for submitting sequencing data to the fragment-validator.
Mapping Configuration Example
This is the JSON mapping configuration (sequence_mapping_example.json) that corresponds to this data template.
{
    "field_mapping": {
        "sample_id": "sample_id",
        "sample_type": "sample_type",
        "vcf_sample_id": "vcf_sample_id"
    },
    "subject_id_candidates": {
        "consortium_id": "consortium_id"
    },
    "center_id_field": "center_name",
    "default_center_id": 1,
    "exclude_from_load": ["consortium_id", "center_id"]
}
CSV Data Example
consortium_id,center_name,sample_id,sample_type,vcf_sample_id
IDG-001-A,MSSM,SEQ-001,WGS,SAM-001A
IDG-002-B,Cedars-Sinai,SEQ-002,RNA-Seq,SAM-002B
IDG-003-C,Emory,SEQ-003,16S,SAM-003C
Column Annotations
Below is a description of each column in the template and its purpose.
Subject Identification
- consortium_id
    - Purpose: The primary IBDGC identifier for a subject. This is used to find the correct subject in the database.
Center Identification
- center_name
    - Purpose: The name of the center where the data originated. This name is used to look up the correct center ID.
Sequence Data Fields
These columns map directly to the fields in the sequence table in IDhub. The header names here are examples of common source column names.
- sample_id
    - Purpose: The unique identifier for this specific sequencing sample.
- sample_type
    - Purpose: The type of sequencing performed (e.g., WGS, RNA-Seq, 16S).
- vcf_sample_id
    - Purpose: The sample identifier found within the VCF file, if applicable.
This template provides an example of how to structure your CSV file for submitting general specimen data to the fragment-validator.
Mapping Configuration Example
This is the JSON mapping configuration (specimen_mapping_example.json) that corresponds to this data template.
{
    "field_mapping": {
        "sample_id": "sample_id",
        "sample_type": "sample_type",
        "year_collected": "year_collected",
        "redcap_event": "redcap_event",
        "region_location": "region_location",
        "sample_available": "sample_available",
        "project": "project",
        "identifier_type": "identifier_type"
    },
    "subject_id_candidates": {
        "subject_id": "consortium_id"
    },
    "subject_id_type_field": "identifier_type",
    "center_id_field": "center_name",
    "default_center_id": 1,
    "exclude_from_load": ["subject_id", "center_id", "identifier_type"]
}
CSV Data Example
subject_id,identifier_type,center_name,sample_id,sample_type,year_collected,project
IDG-001-A,consortium_id,MSSM,SPEC-001A,Plasma,2023,IBDGC-PRO
IDG-002-B,consortium_id,Cedars-Sinai,SPEC-002B,Serum,2024,IBDGC-PRO
LOCAL-999,local_id,Emory,SPEC-999Z,Stool,2024,Immuno-Chip
Column Annotations
Below is a description of each column in the template and its purpose.
Subject Identification
- subject_id
    - Purpose: Contains the value of the subject's identifier. This is the primary column used to find the subject in the database.
    - Note: The header name subject_id is defined by subject_id_candidates in the mapping configuration. If your source file uses a different column name (e.g., participant_id), update subject_id_candidates to use that name instead, for example {"participant_id": "consortium_id"}.
- identifier_type
    - Purpose: Specifies what type of ID is in the subject_id column for that row.
    - Notes: This single type is applied to all fields listed in subject_id_candidates. Common values are consortium_id or local_id. The name of this column itself (identifier_type) is defined by the subject_id_type_field in the mapping configuration.
Center Identification
- center_name
    - Purpose: The name of the center where the data originated.
    - Notes: The system uses this name to find the correct numeric center ID. The header name center_name is defined by the center_id_field in the mapping configuration.
Specimen Data Fields
These columns map directly to the fields in the specimen table in IDhub. The header names here (sample_id, sample_type, etc.) are the expected source column names as defined in the field_mapping section of the config.
- sample_id
    - Purpose: The unique identifier for this specific specimen.
- sample_type
    - Purpose: The type of specimen (e.g., Plasma, Serum, Stool).
- year_collected
    - Purpose: The year the specimen was collected.
- project
    - Purpose: The name of the project associated with this specimen.