Creating a Validator Mapping File

The JSON mapping file is the most important piece of configuration for the Fragment Validator. It acts as a "Rosetta Stone," telling the validator how to interpret your source CSV file and transform it into a standardized format that IDhub can understand.

This guide breaks down each section of the mapping file with explanations and examples.

Configuration Schema

Schema: field_mapping

Purpose: To map columns from your source CSV file to their target columns in the database.
Format: A dictionary where the "key" is the target database column name and the "value" is the header name of the source column in your CSV.
Example:
```
"field_mapping": {
  "sample_id": "collaborator_sample_id"
}
```
This tells the validator: "For the database table's sample_id field, get the data from my CSV's collaborator_sample_id column."
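
As a rough illustration of that lookup, the sketch below applies a field_mapping to one CSV row in Python. The helper name apply_field_mapping and the sample value S-001 are hypothetical and not part of the validator; the code only shows the direction of the mapping (target name as key, source header as value).

```python
import csv
import io

def apply_field_mapping(row, field_mapping):
    """Return a record keyed by target column, read from the mapped source header."""
    # Hypothetical helper: for each target name, fetch the value under the source header.
    return {target: row.get(source) for target, source in field_mapping.items()}

field_mapping = {"sample_id": "collaborator_sample_id"}

# Made-up one-row CSV, used only for this illustration.
csv_text = "collaborator_sample_id,other_data\nS-001,x\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

record = apply_field_mapping(rows[0], field_mapping)
print(record)
```

Columns that are not listed in field_mapping (other_data here) are simply not copied by this sketch.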

Schema: static_fields

Purpose: To assign a fixed, constant value to a database field for every row in your file. This is useful when a value is the same for all records in a batch (e.g., the project name or sample type).
Format: A dictionary where the "key" is the target database column name and the "value" is the static value you want to assign.
Example:
```
"static_fields": {
  "project": "cd_ileal",
  "sample_type": "bge"
}
```
This will set the IDhub table project field to "cd_ileal" and the sample_type field to "bge" for all records processed with this mapping.
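
A minimal Python sketch of the same idea, assuming records are plain dictionaries. The helper apply_static_fields is hypothetical, and the choice to let static values win over same-named keys is an assumption, since the precedence is not described here.

```python
def apply_static_fields(records, static_fields):
    """Return copies of the records with every static field added."""
    # Assumption: a static value overwrites a same-named key in the record.
    return [{**record, **static_fields} for record in records]

static_fields = {"project": "cd_ileal", "sample_type": "bge"}

# Made-up records standing in for rows that were already mapped.
records = [{"sample_id": "S-001"}, {"sample_id": "S-002"}]

stamped = apply_static_fields(records, static_fields)
for record in stamped:
    print(record)
```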

Schema: subject_id_candidates and subject_id_type_field

There are two primary methods for telling the validator how to find the subject associated with each row in your CSV. You should choose one method.

Method 1: Different ID Types in Different Columns

This is the most common and flexible method. You use the subject_id_candidates dictionary to specify multiple columns, each with its own pre-defined identifier type.

subject_id_candidates: A dictionary where the key is the CSV column name, and the value is its corresponding identifier_type. The validator checks these columns in the order they appear in your mapping file.
subject_id_type_field: Must be set to null.

Example:

Your CSV has separate columns for consortium IDs and local IDs:

consortium_id,niddk_no,other_data
IDG-001-A,,...
,NIDDK-3333,...

Your mapping file would define the type for each column and the search order:

"subject_id_candidates": {
  "consortium_id": "consortium_id",
  "niddk_no": "local_id"
},
"subject_id_type_field": null

This tells the validator: "For each row, first look in the consortium_id column and treat any value as a consortium_id. Then look in the niddk_no column and treat that value as a local_id."
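
The ordered lookup can be sketched in Python as follows. The helper collect_candidates is hypothetical. It returns every non-empty candidate in mapping order; how the validator chooses among several hits is not stated here, so the sketch does not pick one.

```python
import csv
import io

def collect_candidates(row, subject_id_candidates):
    """Return (value, identifier_type) pairs for non-empty candidate columns, in mapping order."""
    found = []
    # Python dicts preserve insertion order, so this follows the order in the mapping file.
    for column, id_type in subject_id_candidates.items():
        value = (row.get(column) or "").strip()
        if value:
            found.append((value, id_type))
    return found

subject_id_candidates = {"consortium_id": "consortium_id", "niddk_no": "local_id"}

csv_text = "consortium_id,niddk_no\nIDG-001-A,\n,NIDDK-3333\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    print(collect_candidates(row, subject_id_candidates))
```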

Method 2: ID Value and Type in Separate Columns

This method is used when your CSV file has one column for the subject identifier and another column that specifies the type of identifier for that row.

subject_id_candidates: A dictionary or list specifying the column(s) that contain the identifier value.
subject_id_type_field: The name of the CSV column that contains the identifier type (e.g., "consortium_id", "local_id").

Example:

Your CSV has a generic subject_id column and an identifier_type column:

subject_id,identifier_type,other_data
IDG-001-A,consortium_id,...
LOCAL-999,local_id,...

Your mapping file would point to these two columns:

"subject_id_candidates": {
  "subject_id": "consortium_id"
},
"subject_id_type_field": "identifier_type"

This configuration tells the validator: "For each row, get the identifier value from the subject_id column. Then, get the identifier's type from the identifier_type column for that same row." The type given in subject_id_candidates ("consortium_id" here) is only a fallback; the value in the identifier_type column overrides it. JSON does not allow comments, so keep notes like this outside the mapping file.
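
A small Python sketch of this per-row type lookup, under the assumption that the type from subject_id_candidates is used only when the type column is empty. The helper resolve_typed_id is hypothetical and not the validator's actual code.

```python
def resolve_typed_id(row, subject_id_candidates, subject_id_type_field):
    """Return (value, identifier_type) for the first non-empty candidate column, or None."""
    for column, fallback_type in subject_id_candidates.items():
        value = (row.get(column) or "").strip()
        if not value:
            continue
        row_type = ""
        if subject_id_type_field:
            row_type = (row.get(subject_id_type_field) or "").strip()
        # The per-row type wins; the mapping's type is only a fallback.
        return value, (row_type or fallback_type)
    return None

candidates = {"subject_id": "consortium_id"}
rows = [
    {"subject_id": "IDG-001-A", "identifier_type": "consortium_id"},
    {"subject_id": "LOCAL-999", "identifier_type": "local_id"},
]
for row in rows:
    print(resolve_typed_id(row, candidates, "identifier_type"))
```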

Do Not Mix Methods

Do not mix these two methods. If you rely on the per-column types defined in subject_id_candidates (Method 1), you must set subject_id_type_field to null. If you set subject_id_type_field (Method 2), the types in subject_id_candidates act only as a fallback.

Backward Compatibility

The subject_id_candidates field also supports a simple list of strings (e.g., ["consortium_id", "subject_id"]). In that case, the validator will use the column name itself as the identifier_type. The dictionary format is preferred for clarity and flexibility.
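
One way to picture the relationship between the two formats is a normalization step that turns the list form into the dictionary form. This is a sketch in Python; normalize_candidates is a hypothetical helper, not a function the validator is known to expose.

```python
def normalize_candidates(subject_id_candidates):
    """Return the dictionary form of subject_id_candidates."""
    if isinstance(subject_id_candidates, dict):
        return dict(subject_id_candidates)
    # List form: each column name doubles as its own identifier_type.
    return {column: column for column in subject_id_candidates}

print(normalize_candidates(["consortium_id", "subject_id"]))
print(normalize_candidates({"niddk_no": "local_id"}))
```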

Schema: center_id_field

Purpose: To specify which column in your CSV contains the name of the center associated with the record.
Format: A string containing a column name.
Example:
```
"center_id_field": "center_name"
```
The validator will take the value from this column (e.g., "MSSM", "Cedars-Sinai") and use its fuzzy-matching and alias logic to find the correct numeric center ID.
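
The validator's actual matching logic is not shown in this guide. The Python sketch below only illustrates the general idea of alias lookup followed by fuzzy matching, using the standard library's difflib. The alias table, the numeric IDs (101, 102, 103), the normalization rules, and the 0.8 cutoff are all made up for the example.

```python
import difflib

# Hypothetical alias table: these IDs and aliases are illustrative, not real IDhub values.
CENTER_ALIASES = {
    "mssm": 101,
    "sinai": 101,
    "mount_sinai": 101,
    "cedars_sinai": 102,
    "emory": 103,
}

def normalize(name):
    """Lowercase and unify separators so 'Cedars-Sinai' and 'cedars sinai' compare equal."""
    return name.strip().lower().replace("-", "_").replace(" ", "_")

def lookup_center_id(name, cutoff=0.8):
    """Return a center ID for an exact alias or a close fuzzy match, else None."""
    key = normalize(name)
    if key in CENTER_ALIASES:
        return CENTER_ALIASES[key]
    close = difflib.get_close_matches(key, list(CENTER_ALIASES), n=1, cutoff=cutoff)
    return CENTER_ALIASES[close[0]] if close else None

for name in ["MSSM", "Cedars-Sinai", "mount_sinia", "Unknown Lab"]:
    print(name, "->", lookup_center_id(name))
```

Trying exact aliases first keeps common spellings fast and predictable; the fuzzy step only catches near misses such as transposed letters.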

Schema: default_center_id

Purpose: A fallback numeric ID to use if center_id_field is not provided in the mapping, or if that column is empty for a given row. In IDhub, center_id 1 means "Unknown".
Format: An integer.
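
The fallback rule can be sketched in Python as below. The helper resolve_center_id and the lookup callable are hypothetical, and the ID 7 for MSSM is invented for the example. What happens when a name is present but cannot be matched is not specified in this guide, so the sketch just returns whatever the lookup returns.

```python
def resolve_center_id(row, mapping, lookup):
    """Return default_center_id when no center name is available, else the lookup result."""
    field = mapping.get("center_id_field")
    name = (row.get(field) or "").strip() if field else ""
    if not name:
        # Covers both cases from the text: no center_id_field, or an empty cell.
        return mapping["default_center_id"]
    return lookup(name)

mapping = {"center_id_field": "center_name", "default_center_id": 1}
lookup = {"MSSM": 7}.get  # stand-in for the real center lookup; the ID is made up

print(resolve_center_id({"center_name": "MSSM"}, mapping, lookup))
print(resolve_center_id({"center_name": ""}, mapping, lookup))
print(resolve_center_id({"center_name": "MSSM"}, {"default_center_id": 1}, lookup))
```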

Schema: exclude_from_load

Purpose: To list any columns from your source CSV that are needed for validation (like consortium_id) but should not be loaded into the final sample table itself. This prevents metadata used for mapping from being incorrectly inserted as data.
Format: A list of strings.

Example:

"exclude_from_load": ["consortium_id", "center_id"]

Clarifying Exclusions

Excluding some of these fields may seem counterintuitive, since their values still end up in the database. The reason is that a single load typically fills multiple tables: a value such as center_id may be written to the subjects table, but it is not loaded into the sample table that receives the rest of the data.
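
In Python terms, the effect on the sample-table record is roughly a key filter. This is a sketch with a hypothetical helper strip_excluded and made-up values; it does not show how the excluded values reach their own tables.

```python
def strip_excluded(record, exclude_from_load):
    """Return a copy of the record without the excluded columns."""
    excluded = set(exclude_from_load)
    return {key: value for key, value in record.items() if key not in excluded}

exclude_from_load = ["consortium_id", "center_id"]

# Made-up record: the identifiers were needed for validation, not for the sample table.
record = {"consortium_id": "IDG-001-A", "center_id": 1, "sample_id": "S-001"}

print(strip_excluded(record, exclude_from_load))
```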

Table-Specific Templates

Full Template

Here is a complete example of a mapping file. It demonstrates the recommended method for subject identification (Method 1), where different ID types are in different columns in the source file. Note how subject_id_candidates defines the types, and subject_id_type_field is set to null.

{
  "field_mapping": {
    "sample_id": "collaborator_sample_id",
    "knumber": "k_number"
  },
  "static_fields": {
    "project": "IBDGC-MAIN",
    "sample_type": "bge"
  },
  "subject_id_candidates": {
    "consortium_id": "consortium_id",
    "niddk_no": "local_id"
  },
  "subject_id_type_field": null,
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["consortium_id"]
}
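
Before running the validator, it can help to check that a mapping file parses as JSON and that each section has the shape described in the schema sections above. The Python sketch below does that for the template; check_mapping is a hypothetical helper, and because this guide does not say which keys are required, it only checks the types of keys that are present.

```python
import json

EXPECTED_TYPES = {
    "field_mapping": dict,
    "static_fields": dict,
    "subject_id_candidates": (dict, list),
    "center_id_field": str,
    "default_center_id": int,
    "exclude_from_load": list,
}

def check_mapping(mapping):
    """Return a list of human-readable problems found in a parsed mapping."""
    problems = []
    for key, types in EXPECTED_TYPES.items():
        if key in mapping and not isinstance(mapping[key], types):
            problems.append(f"{key} has unexpected type {type(mapping[key]).__name__}")
    type_field = mapping.get("subject_id_type_field")
    if type_field is not None and not isinstance(type_field, str):
        problems.append("subject_id_type_field must be a string or null")
    return problems

template_text = """
{
  "field_mapping": {"sample_id": "collaborator_sample_id", "knumber": "k_number"},
  "static_fields": {"project": "IBDGC-MAIN", "sample_type": "bge"},
  "subject_id_candidates": {"consortium_id": "consortium_id", "niddk_no": "local_id"},
  "subject_id_type_field": null,
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["consortium_id"]
}
"""
template = json.loads(template_text)  # raises json.JSONDecodeError on invalid JSON
print(check_mapping(template))
print(check_mapping({"default_center_id": "1"}))
```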

LCL

This template provides an example of how to structure your CSV file for submitting Lymphoblastoid Cell Line (LCL) data to the fragment-validator.

Mapping Configuration Example

This is the JSON mapping configuration (lcl_mapping_example.json) that corresponds to this data template.

{
  "field_mapping": {
    "knumber": "knumber",
    "niddk_no": "niddk_no"
  },
  "subject_id_candidates": {
    "consortium_id": "consortium_id",
    "niddk_no": "niddk_no"
  },
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["consortium_id", "center_id", "identifier_type"]
}

CSV Data Example

consortium_id,niddk_no,knumber,center_name
IDG-001-A,NIDDK-1111,K1111,MSSM
IDG-002-B,,K2222,Cedars-Sinai
,NIDDK-3333,K3333,Emory

Column Annotations

Below is a description of each column in the template and its purpose.

Subject Identification

The validator uses one or more columns to find the correct subject in the database. For the LCL table, it tries the following columns in order. At least one of them must have a value in each row.

consortium_id
- Purpose: The primary IBDGC identifier for a subject.
- Identifier Type: consortium_id
niddk_no
- Purpose: An alternative subject identifier (the NIDDK number). This column also contains the data for the niddk_no field in the LCL table itself.
- Identifier Type: niddk_no
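
The "at least one value per row" rule is easy to check before submitting a file. The Python sketch below uses the candidate columns from the LCL mapping above; the helper rows_missing_subject_id is hypothetical, and the last CSV row (K4444) was added here as a deliberately invalid example.

```python
import csv
import io

def rows_missing_subject_id(rows, candidate_columns):
    """Return 1-based data-row numbers where every candidate column is empty."""
    missing = []
    for number, row in enumerate(rows, start=1):
        if not any((row.get(column) or "").strip() for column in candidate_columns):
            missing.append(number)
    return missing

csv_text = """consortium_id,niddk_no,knumber,center_name
IDG-001-A,NIDDK-1111,K1111,MSSM
IDG-002-B,,K2222,Cedars-Sinai
,NIDDK-3333,K3333,Emory
,,K4444,Emory
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows_missing_subject_id(rows, ["consortium_id", "niddk_no"]))
```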

Center Identification

center_name
- Purpose: The name of the center where the data originated. This name is used to look up the correct center ID.
- Notes: The system uses fuzzy matching and a list of aliases to find the correct center. For example, "MSSM", "Sinai", and "mount_sinai" will all resolve to the same center.

LCL Data Fields

These columns map directly to the fields in the lcl table in IDhub.

knumber
- Purpose: The "K-number" identifier for the cell line.
niddk_no
- Purpose: The NIDDK number associated with the cell line. Note that this field serves a dual purpose: it's used as a subject ID candidate and as data for the LCL record.

Enteroid

This template provides an example of how to structure your CSV file for submitting enteroid data to the fragment-validator.

Mapping Configuration Example

This is the JSON mapping configuration (enteroid_mapping_example.json) that corresponds to this data template.

{
  "field_mapping": {
    "sample_id": "sample_id"
  },
  "subject_id_candidates": {
    "subject_id": "consortium_id"
  },
  "subject_id_type_field": "identifier_type",
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["subject_id", "center_id", "identifier_type"]
}

CSV Data Example

subject_id,identifier_type,center_name,sample_id
IDG-001-A,consortium_id,MSSM,ENT-001
IDG-002-B,consortium_id,Cedars-Sinai,ENT-002
LOCAL-999,local_id,Emory,ENT-999

Column Annotations

Below is a description of each column in the template and its purpose.

Subject Identification

subject_id
- Purpose: Contains the value of the subject's identifier. This is the primary column used to find the subject in the database.
identifier_type
- Purpose: Specifies what type of ID is in the subject_id column for that row (e.g., consortium_id, local_id).

Center Identification

center_name
- Purpose: The name of the center where the data originated. This name is used to look up the correct center ID.

Enteroid Data Fields

These columns map directly to the fields in the enteroid table in IDhub.

sample_id
- Purpose: The unique identifier for this specific enteroid sample.

Genotype

This template provides an example of how to structure your CSV file for submitting genotype data to the fragment-validator.

Mapping Configuration Example

This is the JSON mapping configuration (genotype_mapping_example.json) that corresponds to this data template.

{
  "field_mapping": {
    "genotype_id": "id",
    "genotyping_project": "project",
    "genotyping_barcode": "barcode"
  },
  "subject_id_candidates": {
    "consortium_id": "consortium_id"
  },
  "subject_id_type_field": "identifier_type",
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["consortium_id", "center_id", "identifier_type"]
}

CSV Data Example

consortium_id,identifier_type,center_name,id,project,barcode
IDG-001-A,consortium_id,MSSM,GENO-001,GSA-Array-v1,987654321
IDG-002-B,consortium_id,Cedars-Sinai,GENO-002,GSA-Array-v1,987654322
IDG-003-C,consortium_id,Emory,GENO-003,GSA-Array-v2,987654323

Column Annotations

Below is a description of each column in the template and its purpose.

Subject Identification

consortium_id
- Purpose: The primary IBDGC identifier for a subject.
identifier_type
- Purpose: Specifies what type of ID is in the consortium_id column for that row. For this mapping, it will typically be consortium_id.

Center Identification

center_name
- Purpose: The name of the center where the data originated. The example mapping above points center_id_field at this column. The current genotype_mapping.json (as distinct from this example) does not specify a center_id_field, but providing the column keeps submissions consistent and future-proof.

Genotype Data Fields

These columns map directly to the fields in the genotype table in IDhub. The header names here (id, project, barcode) are the expected source column names as defined in the field_mapping.

id
- Purpose: The unique identifier for this specific genotype record. This will be mapped to the genotype_id column in the database.
project
- Purpose: The name of the genotyping project (e.g., GSA-Array-v1). This maps to the genotyping_project column.
barcode
- Purpose: The barcode of the genotyping array. This maps to the genotyping_barcode column.

Olink

This template provides an example of how to structure your CSV file for submitting Olink proteomics data to the fragment-validator.

Mapping Configuration Example

This is the JSON mapping configuration (olink_mapping_example.json) that corresponds to this data template.

{
  "field_mapping": {
    "sample_id": "sample_id"
  },
  "subject_id_candidates": {
    "subject_id": "consortium_id"
  },
  "subject_id_type_field": "identifier_type",
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["subject_id", "center_id", "identifier_type"]
}

CSV Data Example

subject_id,identifier_type,center_name,sample_id
IDG-001-A,consortium_id,MSSM,OLINK-001
IDG-002-B,consortium_id,Cedars-Sinai,OLINK-002
LOCAL-999,local_id,Emory,OLINK-999

Column Annotations

Below is a description of each column in the template and its purpose.

Subject Identification

subject_id
- Purpose: Contains the value of the subject's identifier. This is the primary column used to find the subject in the database.
identifier_type
- Purpose: Specifies what type of ID is in the subject_id column for that row (e.g., consortium_id, local_id).

Center Identification

center_name
- Purpose: The name of the center where the data originated. This name is used to look up the correct center ID.

Olink Data Fields

These columns map directly to the fields in the olink table in IDhub.

sample_id
- Purpose: The unique identifier for this specific Olink sample.

Sequence

This template provides an example of how to structure your CSV file for submitting sequencing data to the fragment-validator.

Mapping Configuration Example

This is the JSON mapping configuration (sequence_mapping_example.json) that corresponds to this data template.

{
  "field_mapping": {
    "sample_id": "sample_id",
    "sample_type": "sample_type",
    "vcf_sample_id": "vcf_sample_id"
  },
  "subject_id_candidates": {
    "consortium_id": "consortium_id"
  },
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["consortium_id", "center_id"]
}

CSV Data Example

consortium_id,center_name,sample_id,sample_type,vcf_sample_id
IDG-001-A,MSSM,SEQ-001,WGS,SAM-001A
IDG-002-B,Cedars-Sinai,SEQ-002,RNA-Seq,SAM-002B
IDG-003-C,Emory,SEQ-003,16S,SAM-003C

Column Annotations

Below is a description of each column in the template and its purpose.

Subject Identification

consortium_id
- Purpose: The primary IBDGC identifier for a subject. This is used to find the correct subject in the database.

Center Identification

center_name
- Purpose: The name of the center where the data originated. This name is used to look up the correct center ID.

Sequence Data Fields

These columns map directly to the fields in the sequence table in IDhub. The header names here are examples of common source column names.

sample_id
- Purpose: The unique identifier for this specific sequencing sample.
sample_type
- Purpose: The type of sequencing performed (e.g., WGS, RNA-Seq, 16S).
vcf_sample_id
- Purpose: The sample identifier found within the VCF file, if applicable.

Specimen

This template provides an example of how to structure your CSV file for submitting general specimen data to the fragment-validator.

Mapping Configuration Example

This is the JSON mapping configuration (specimen_mapping_example.json) that corresponds to this data template.

{
  "field_mapping": {
    "sample_id": "sample_id",
    "sample_type": "sample_type",
    "year_collected": "year_collected",
    "redcap_event": "redcap_event",
    "region_location": "region_location",
    "sample_available": "sample_available",
    "project": "project",
    "identifier_type": "identifier_type"
  },
  "subject_id_candidates": {
    "subject_id": "consortium_id"
  },
  "subject_id_type_field": "identifier_type",
  "center_id_field": "center_name",
  "default_center_id": 1,
  "exclude_from_load": ["subject_id", "center_id", "identifier_type"]
}

CSV Data Example

subject_id,identifier_type,center_name,sample_id,sample_type,year_collected,project
IDG-001-A,consortium_id,MSSM,SPEC-001A,Plasma,2023,IBDGC-PRO
IDG-002-B,consortium_id,Cedars-Sinai,SPEC-002B,Serum,2024,IBDGC-PRO
LOCAL-999,local_id,Emory,SPEC-999Z,Stool,2024,Immuno-Chip
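
Note that the specimen mapping above names more source columns than this example CSV contains. Whether the validator requires every mapped column to be present is not stated in this guide, so the Python sketch below only reports the difference; missing_source_columns is a hypothetical helper.

```python
def missing_source_columns(header, field_mapping):
    """Return mapped source columns that do not appear in the CSV header."""
    present = set(header)
    return [source for source in field_mapping.values() if source not in present]

field_mapping = {
    "sample_id": "sample_id",
    "sample_type": "sample_type",
    "year_collected": "year_collected",
    "redcap_event": "redcap_event",
    "region_location": "region_location",
    "sample_available": "sample_available",
    "project": "project",
    "identifier_type": "identifier_type",
}
header = "subject_id,identifier_type,center_name,sample_id,sample_type,year_collected,project".split(",")

print(missing_source_columns(header, field_mapping))
```

Running a check like this against your own file makes it obvious which mapped columns you have not supplied.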

Column Annotations

Below is a description of each column in the template and its purpose.

Subject Identification

subject_id
- Purpose: Contains the value of the subject's identifier. This is the primary column used to find the subject in the database.
- Note: The header name subject_id is defined by the subject_id_candidates entry in the mapping configuration. If your source file used a different column name (e.g., participant_id), you would change that entry to {"participant_id": "consortium_id"}.
identifier_type
- Purpose: Specifies what type of ID is in the subject_id column for that row.
- Notes: This single type is applied to all fields listed in subject_id_candidates. Common values are consortium_id or local_id. The name of this column itself (identifier_type) is defined by the subject_id_type_field in the mapping configuration.

Center Identification

center_name
- Purpose: The name of the center where the data originated.
- Notes: The system uses this name to find the correct numeric center ID. The header name center_name is defined by the center_id_field in the mapping configuration.

Specimen Data Fields

These columns map directly to the fields in the specimen table in IDhub. The header names here (sample_id, sample_type, etc.) are the expected source column names as defined in the field_mapping section of the config.

sample_id
- Purpose: The unique identifier for this specific specimen.
sample_type
- Purpose: The type of specimen (e.g., Plasma, Serum, Stool).
year_collected
- Purpose: The year the specimen was collected.
project
- Purpose: The name of the project associated with this specimen.