Manual Data Ingestion Guide
This guide provides a step-by-step process for curators to manually prepare, validate, and load data into IDhub. This workflow is used for data that does not come from an automated source like the REDCap pipeline.
Local Environment Setup
Before you can run the validator script on your local machine, you need to set up your environment with the correct tools and credentials.
Conda Environment Setup
The project uses Conda to manage Python and its dependencies, ensuring everyone runs the same version of the tools.
- Install Conda: If you don't have it, install Miniconda or Anaconda.
- Create the Environment: From the root directory of the `idhub` project, run the command shown below. This will create a new environment named `idhub-dev` using the project's `environment.yml` file.
- Activate the Environment: Before running any scripts, you must activate the environment each time you open a new terminal:
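A typical sequence, assuming the environment name `idhub-dev` is defined in the project's `environment.yml` (both commands are standard Conda usage):

# Create the environment from the project's environment.yml
conda env create -f environment.yml

# Activate it (repeat this in every new terminal session)
conda activate idhub-dev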
Your terminal prompt should now show `(idhub-dev)` at the beginning.
Obtain and Use API Keys
You will need two keys to run the validator. It is critical to use the correct key for the environment (qa or production) you are targeting.
- `NOCODB_TOKEN`: This key allows the validator to connect to the NocoDB API.
  - How to get it: You can generate this key yourself from the NocoDB web interface. Log in, click your user icon in the bottom-left corner, and go to the "API Tokens" section.
  - QA vs. Production: You will need a separate token for each environment.
    - For validating against QA, log into qa.idhub.ibdgc.org and generate a token there.
    - For validating against Production, log into idhub.ibdgc.org and generate a token there.
- `GSID_API_KEY`: This key allows the validator to communicate with the subject identity service.
  - How to get it: This key is managed by the system administrators. Please contact them to obtain the key for the environment you need to work with.
Once you have the keys, paste them into your .env file as the values for the corresponding variables. The script will automatically load them when you run it.
Create a Secure .env File
The validator requires secret API keys to communicate with other IDhub services. These are managed in a .env file, which is a plain text file you create in the root of the idhub project directory.
Do Not Commit This File
The .env file contains sensitive credentials and is listed in .gitignore to prevent it from ever being saved to the git repository. Never share this file or commit it.
- Create the file: In the root of the `idhub` project, create a new file named `.env`.
- Restrict `.env` permissions: The `.env` file contains highly sensitive keys, so it should only be accessible by your user.
- Add content: Copy and paste the following template into your `.env` file.
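A minimal template, assuming the validator reads the two variables described above; replace the placeholders with your real keys, and keep the keys matched to the environment you are targeting:

# .env — placeholder values only; never commit this file
NOCODB_TOKEN=<your-nocodb-api-token>
GSID_API_KEY=<your-gsid-api-key>

On Linux or macOS, `chmod 600 .env` is one common way to restrict the file to your user, and `git check-ignore -v .env` will confirm that git is ignoring it.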
The fragment ingestion process involves two main services, which you can trigger via the GitHub Actions interface:
- Fragment Validator: Validates your data file and converts it into standardized "fragments".
- Table Loader: Loads the validated fragments into the database.
Step 1: Prepare Your Data File
Before you can ingest data, you must prepare your file according to IDhub's requirements.
- Format: Use CSV
- Header: The first row of your file must be a header row with clear column names.
- Content: Ensure your file contains all required fields for the table you are loading, especially a subject identifier (like `consortium_id`) and a unique identifier for the record (like `sample_id`). There can be multiple candidate subject IDs. See the illustrative example after this list.
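A purely hypothetical example with a header row, one data row, a subject identifier column, and a record identifier column; your actual column names and values depend on the table you are loading and its mapping configuration (see Step 2):

consortium_id,sample_id,sample_type,batch
ABC-0001,SAMP-0001,LCL,batch_01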
See more on data preparation →
Step 2: Configure Table Mappings
The validation process uses configuration files to understand how to process your data. Specifically, mapping.json files tell the validator how columns in your file map to fields in the database.
- When is this needed?: You only need to worry about this if you are submitting a new type of file with a different structure or new columns that the system has not seen before.
- What to do: If you have a file with a new structure, you must work with the developer team to create or update a `mapping.json` file.
- Example Mapping: The configuration below pairs each column in your CSV with the corresponding field in the database (for example, `sample_id` with `collaborator_sample_id`), and names the column (`subject_id`) that should be used to find the subject's GSID.
{
  "field_mapping": {
    "sample_id": "collaborator_sample_id",
    "sample_type": "sample_type",
    "batch": "batch",
    "identifier_type": "identifier_type"
  },
  "subject_id_candidates": ["subject_id"],
  "subject_id_type_field": "identifier_type",
  "center_id_field": "center_id",
  "default_center_id": 1,
  "exclude_from_load": ["subject_id", "center_id", "identifier_type"]
}
In this example, `field_mapping` covers the content fields, `subject_id_candidates` may list more than one column, `identifier_type` records which kind of subject ID is provided (e.g. `consortium_id`), and the columns listed in `exclude_from_load` are excluded from the sample table itself.
For existing, known file formats, these configurations will already be in place.
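To see which mapping configurations already exist, you can list the validator's config directory; the path below is assumed from the Step 3 example and may differ in your checkout:

# List existing mapping configurations (path assumed; adjust as needed)
ls idhub/fragment-validator/config/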
Step 3: Validate the Data with the Fragment Validator
Once your file is ready, you will use the Fragment Validator to check your data and create validated fragments. If you are running the environment locally, you can run the validator from the command line:
# Navigate to the fragment-validator directory
cd idhub/fragment-validator/

# Run the validator
python main.py \
  --table-name lcl \
  --input-file /path/to/your/data.csv \
  --mapping-config config/lcl_mapping.json \
  --source "name_of_source_file.csv" \
  --env production   # defaults to qa
Step 4: Load the Validated Fragments
After the Fragment Validator runs successfully and generates a Batch ID, you can use the Table Loader to load this batch into the database.
Using GitHub Actions GUI (Recommended)
- Go to the Actions tab in the IDhub GitHub repository.
- Find the "Fragment Ingestion Pipeline" workflow in the list on the left.
- Click the "Run workflow" button on the right side of the navigation.
- Fill out the form:
  - `environment`: Choose the same environment you used for validation (`qa` or `production`).
  - `batch_id`: Paste the Batch ID you copied from the successful Fragment Validator run.
  - `dry_run`: `true` or `false`.
- Run the workflow. Consider running with `dry_run` checked initially. Review the output log to ensure the changes are what you expect. If everything looks correct, run the workflow again with `dry_run` unchecked to perform the live load.
Using the CLI
You can also run the Table Loader from the command line for local development.
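A sketch of what a local run might look like. The directory name, entry point, and flag names below are assumptions modeled on the workflow inputs (`environment`, `batch_id`, `dry_run`) and on the validator's layout; check the Table Loader's own README or `--help` output for the real interface.

# Hypothetical example — path and flag names are assumptions, verify before use
cd idhub/table-loader/
python main.py \
  --batch-id <BATCH_ID_FROM_VALIDATOR> \
  --env qa \
  --dry-run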