September 30, 2024

CMPS6790 Milestone1

📓 Read Jupyter Notebook for This Project

T Cell Receptor Modeling Based on VDJdb

30-09-2024 13:07:44

Introduction

Author: Jiarui Li
Email: jli78@tulane.edu
Github Page Link: https://jiarui-li.com/
Course: CMPS 6790 Data Science
Course Link: https://nmattei.github.io/cmps6790/projects/FinalTutorial/FinalTutorial/

This is the coursework for Milestone 01 of CMPS 6790 Data Science.
For this project, I use the VDJdb dataset and the task of T cell receptor modeling
to practice ETL (Extract, Transform, and Load) skills.
The final work will be uploaded to my GitHub page.

Workflow And Collaboration Plan

This project is completed independently by Jiarui Li.
Therefore, this section mainly describes the workflow.

Code and version management
The code is kept in a git repository to track
versions and sync all updates.
Code organization
For ease of submission, all code lives in one Jupyter
notebook, and no extra files are required.
The package requirements are attached to this notebook.
Document organization
The document is managed with docflow, which lets us
show everything in the Jupyter notebook and also generate a
markdown file for the GitHub page. The code used to generate
the page is shown below.
Development procedure
Development of this project follows a rapid application
development cycle. First, we decide on the development
problem and target. Then, we plan the TODO list for the
current cycle. Finally, we develop the code, test it, and
identify new issues and requirements.

# Assumes the docflow package is available under the name `doc`
# (for example via `import docflow as doc`).
def generate_github_page(doc_dict: dict, title: str = '', save_path: str = 'page.md') -> None:
    '''
    Generate the final GitHub page.

    Parameters
    ----------
    doc_dict  (dict) : The sections of this page (doc.Document objects as values).
    title     (str)  : The title of this page.
    save_path (str)  : The path where the markdown file is saved.
    '''
    _doc_title = doc.Title(title, level=1)
    # Extract the sections as a list
    _doc_sections = list(doc_dict.values())
    # Prepend the title to the section list for catalog generation
    _doc_sections = [_doc_title] + _doc_sections
    _final_doc = doc.Document(
        _doc_title,
        doc.DateTimeStamp('%d-%m-%Y %H:%M:%S'),    # Add page generation timestamp
        doc.Text('\n\n'),
        doc.Catalog(doc.Document(*_doc_sections)), # Add catalog for entire document
        doc.Text('\n\n'),
        *_doc_sections[1:]                         # Add all sections (title already added above)
    )
    _final_doc.save(save_path, format='markdown')

Environment And Packages

The environment and packages used for this project are documented below:

	Name	Version	Install	Description
0	pandas	2.0.3	`pip install pandas==2.0.3`	pandas - a powerful data analysis and manipulation library for Python. pandas is a Python package providing fast, flexible, a…
1	numpy	1.24.4	`pip install numpy==1.24.4`	NumPy provides: (1) an array object of arbitrary homogeneous items, (2) fast mathematical operations over arrays, (3) linear algebra, Fourier transforms, random number generation. How to use t…
2	matplotlib	3.7.2	`pip install matplotlib==3.7.2`	An object-oriented plotting library. A procedural interface is provided by the companion pyplot module, which may be imported directly, e.g. `import matplotlib.pyplot as plt`, or using ipython:…
3	requests	2.32.3	`pip install requests==2.32.3`	Requests HTTP Library. Requests is an HTTP library, written in Python, for human beings. Basic GET usage: >>> import requests >>> r = requests.get('https://www.python.org…
4	json	2.0.9	(standard library, no install needed)	JSON (JavaScript Object Notation) http://json.org is a subset of JavaScript syntax (ECMA-262 3rd edition) used as a lightweight data interchange format. `json` exposes an API familiar to users…
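
A table like the one above can be assembled from each module's own metadata. The following is only a minimal sketch of that idea, not the notebook's actual code: the helper name `describe_module` and the `max_chars` cut-off are assumptions made for illustration.

```python
import importlib


def describe_module(name: str, max_chars: int = 200) -> dict:
    """Summarise an importable module as one row of a requirements table."""
    module = importlib.import_module(name)
    # Not every module defines __version__, so fall back to a placeholder.
    version = getattr(module, "__version__", "unknown")
    # Collapse the module docstring onto one line, then truncate it.
    description = " ".join((module.__doc__ or "").split())
    if len(description) > max_chars:
        description = description[:max_chars] + "…"
    return {"Name": name, "Version": version, "Description": description}


# `json` ships with Python, so this runs without installing anything.
row = describe_module("json")
print(row["Name"], row["Version"])
```

Because `json` is part of the standard library, a row for it needs no `pip install` command.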

Dataset

VDJdb

Dataset Link: https://vdjdb.cdr3.net/

Dataset Introduction

VDJdb is a curated database of T-cell receptor (TCR) sequences
with known antigen specificities. The primary goal of VDJdb is
to facilitate access to existing information on T-cell receptor
antigen specificities.

Reason for Choosing This Dataset

T cells are a significant part of our immune system, protecting the
body from cancer, viruses, and other antigens. Therefore,
identifying the rules by which T cells are activated can help us
understand how to develop vaccines for diseases. To understand
how T cells work, the T cell receptor (TCR) is the key factor.
Therefore, we chose this dataset and try to model the TCR.

Questions for This Dataset

Which TCR fragments are strongly associated with which epitope (processed antigen) fragments?
Is there a relationship between different TCRs? (For example, does the presence of one fragment lead to the appearance of another?)

(ETL) Extraction, Transform, and Load

Extract

Following the VDJdb documentation (https://vdjdb-web.readthedocs.io/en/latest/api.html),
we used the requests package to fetch the entire
VDJdb database and then converted it to a pandas DataFrame.

Data URL: https://vdjdb.cdr3.net/api/database/
Number of Records: 102990
Number of Columns: 16
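
The extract step can be sketched roughly as follows. This is a hedged outline rather than the notebook's actual code: the helper names `fetch_json` and `records_to_frame` are hypothetical, and this post does not state whether the endpoint expects GET or POST or how its JSON is laid out, so the response would have to be reshaped into a list of per-record dicts according to the API documentation before the last step applies.

```python
import pandas as pd
import requests

DATA_URL = "https://vdjdb.cdr3.net/api/database/"  # Data URL quoted above


def fetch_json(url: str, timeout: float = 30.0):
    """GET a URL and return the decoded JSON body, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()


def records_to_frame(records: list) -> pd.DataFrame:
    """Turn a list of {column: value} dicts into a DataFrame."""
    return pd.DataFrame.from_records(records)


# Offline demonstration with two rows copied from the example table in this post.
sample = [
    {"gene": "TRA", "cdr3": "CAVRDGGTGFQKLVF", "antigen.epitope": "HPVGEADYFEY"},
    {"gene": "TRB", "cdr3": "CASRQDRDYQETQYF", "antigen.epitope": "HPVGEADYFEY"},
]
frame = records_to_frame(sample)
print(frame.shape)  # (2, 3)
```
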

Transform

Filter Columns

Keep only the columns we are interested in and drop the rest.

gene: A TCR is composed of two chains; this denotes which chain the record describes.
cdr3: The CDR3 sequence of the TCR chain.
v.segm: The V region of the TCR sequence.
j.segm: The J region of the TCR sequence.
antigen.epitope: The epitope this TCR can bind to.
species: The species for this record.
vdjdb.score: Experimental confidence. (Can be considered a measure of data quality.)

Example

	gene	cdr3	v.segm	j.segm	antigen.epitope	species	vdjdb.score
10	TRA	CAVRDGGTGFQKLVF	TRAV3*01	TRAJ8*01	HPVGEADYFEY	HomoSapiens	1
11	TRB	CASRQDRDYQETQYF	TRBV5-1*01	TRBJ2-5*01	HPVGEADYFEY	HomoSapiens	1
12	TRA	CAARGIGSGTYKYIF	TRAV13-1*01	TRAJ40*01	HPVGEADYFEY	HomoSapiens	1
13	TRB	CASSARSGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
14	TRA	CAARGIGSGTYKYIF	TRAV13-1*01	TRAJ40*01	HPVGEADYFEY	HomoSapiens	1
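
A minimal pandas sketch of this column filter, assuming the raw data is already in a DataFrame. The helper `select_columns` and the extra column `unused.column` are hypothetical names used only for illustration.

```python
import pandas as pd

# Column names as they appear in the example table of this post.
KEEP_COLUMNS = ["gene", "cdr3", "v.segm", "j.segm",
                "antigen.epitope", "species", "vdjdb.score"]


def select_columns(df: pd.DataFrame, columns=KEEP_COLUMNS) -> pd.DataFrame:
    """Return a copy of df restricted to the given columns, in that order."""
    missing = [name for name in columns if name not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return df.loc[:, list(columns)].copy()


# One row from the example table plus a hypothetical extra column.
raw = pd.DataFrame([{
    "gene": "TRB", "cdr3": "CASSARSGELFF", "v.segm": "TRBV9*01",
    "j.segm": "TRBJ2-2*01", "antigen.epitope": "HPVGEADYFEY",
    "species": "HomoSapiens", "vdjdb.score": 1,
    "unused.column": "placeholder",  # hypothetical, not a real VDJdb field
}])
selected = select_columns(raw)
print(list(selected.columns))
```

Raising on a missing column makes a renamed or dropped upstream field fail loudly instead of silently shrinking the table.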

Filter TRB

Typical studies focus on the TRB chain, which
largely determines the behaviour of the TCR.
Therefore, we only keep the TRB records.

Example

	gene	cdr3	v.segm	j.segm	antigen.epitope	species	vdjdb.score
1	TRB	CASSARSGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
3	TRB	CASSARSGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
5	TRB	CASSAPTGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
7	TRB	CASSARTGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
8	TRB	CASSPRRYNEQFF	TRBV9*01	TRBJ2-1*01	HPVGEADYFEY	HomoSapiens	1
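
The TRB filter amounts to a boolean mask on the gene column. A small sketch under that assumption, where `keep_chain` is a hypothetical helper name and the sample rows are taken from the first example table:

```python
import pandas as pd


def keep_chain(df: pd.DataFrame, chain: str = "TRB") -> pd.DataFrame:
    """Keep only the records whose gene column matches the requested chain."""
    return df[df["gene"] == chain].copy()


records = pd.DataFrame(
    {
        "gene": ["TRA", "TRB", "TRA", "TRB"],
        "cdr3": ["CAVRDGGTGFQKLVF", "CASRQDRDYQETQYF",
                 "CAARGIGSGTYKYIF", "CASSARSGELFF"],
    },
    index=[10, 11, 12, 13],
)
trb = keep_chain(records)
print(trb)
```

The original row labels survive the filter, which is why the index of the filtered table is no longer consecutive.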

Clean N/A Data

Remove all records with N/A values in the v.segm or j.segm columns.

Example

	gene	cdr3	v.segm	j.segm	antigen.epitope	species	vdjdb.score
1	TRB	CASSARSGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
3	TRB	CASSARSGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
5	TRB	CASSAPTGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
7	TRB	CASSARTGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
8	TRB	CASSPRRYNEQFF	TRBV9*01	TRBJ2-1*01	HPVGEADYFEY	HomoSapiens	1
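
One way to express this cleaning step in pandas is `dropna` restricted to the two segment columns. The sketch below assumes missing values are stored as `None`/`NaN`; the missing entries in the sample are introduced artificially for the demonstration, and `drop_missing_segments` is a hypothetical helper name.

```python
import pandas as pd

SEGMENT_COLUMNS = ["v.segm", "j.segm"]


def drop_missing_segments(df: pd.DataFrame) -> pd.DataFrame:
    """Drop records where either the V or the J segment is missing."""
    return df.dropna(subset=SEGMENT_COLUMNS).copy()


records = pd.DataFrame({
    "cdr3": ["CASSARSGELFF", "CASSAPTGELFF", "CASSPRRYNEQFF"],
    "v.segm": ["TRBV9*01", None, "TRBV9*01"],      # None added for illustration
    "j.segm": ["TRBJ2-2*01", "TRBJ2-2*01", None],  # None added for illustration
})
clean = drop_missing_segments(records)
print(clean)
```

Note that `dropna` does not treat empty strings as missing, so such values would need to be converted first if the source uses them.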

Filter High Quality Data

vdjdb.score reflects the confidence of each record.
Therefore, to make sure our analysis depends on high-quality data,
we only keep records with vdjdb.score >= 1.

Example

	gene	cdr3	v.segm	j.segm	antigen.epitope	species	vdjdb.score
1	TRB	CASSARSGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
3	TRB	CASSARSGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
5	TRB	CASSAPTGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
7	TRB	CASSARTGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
8	TRB	CASSPRRYNEQFF	TRBV9*01	TRBJ2-1*01	HPVGEADYFEY	HomoSapiens	1
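
A sketch of the score threshold, assuming the score column may arrive as text and therefore needs a numeric conversion first. The helper name `keep_confident` and the score values in the sample are illustrative only.

```python
import pandas as pd


def keep_confident(df: pd.DataFrame, min_score: int = 1) -> pd.DataFrame:
    """Keep records whose vdjdb.score is at least min_score."""
    # Unparseable scores become NaN, and NaN never passes the comparison.
    scores = pd.to_numeric(df["vdjdb.score"], errors="coerce")
    return df[scores >= min_score].copy()


records = pd.DataFrame({
    "cdr3": ["CASSARSGELFF", "CASSAPTGELFF", "CASSARTGELFF"],
    "vdjdb.score": ["1", "0", "2"],  # illustrative scores stored as strings
})
confident = keep_confident(records)
print(confident)
```
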

Load

We currently use a CSV file to manage the data.
Therefore, we store the preprocessed
data back to a CSV file.
When we need it, we can read it
with pandas.

Read file example

	Unnamed: 0	gene	cdr3	v.segm	j.segm	antigen.epitope	species	vdjdb.score
0	1	TRB	CASSARSGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
1	3	TRB	CASSARSGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
2	5	TRB	CASSAPTGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
3	7	TRB	CASSARTGELFF	TRBV9*01	TRBJ2-2*01	HPVGEADYFEY	HomoSapiens	1
4	8	TRB	CASSPRRYNEQFF	TRBV9*01	TRBJ2-1*01	HPVGEADYFEY	HomoSapiens	1
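
The `Unnamed: 0` column in the table above is what pandas produces when a DataFrame is written with its index and read back without declaring it. The sketch below shows that round trip next to one that omits the index; it uses in-memory buffers instead of a real file, which is an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

frame = pd.DataFrame(
    {"gene": ["TRB", "TRB"], "cdr3": ["CASSARSGELFF", "CASSARSGELFF"]},
    index=[1, 3],
)

# Written with the index, read back without declaring it:
# the old index comes back as an extra "Unnamed: 0" column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# Written without the index: only the data columns come back.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'gene', 'cdr3']
print(list(reloaded_clean.columns))       # ['gene', 'cdr3']
```

If the original row labels matter, passing `index_col=0` to `pd.read_csv` restores them as the index instead of a data column.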

Analysis

Stat Species

Because TCR patterns might differ between species,
we want to know how many species are present and how the records
are distributed among them.

species	count
HomoSapiens	8185
MusMusculus	1198
MacacaMulatta	659
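
A count table of this shape can be produced with `value_counts`. The sketch below uses a tiny made-up sample, so its counts are not the real ones listed above, and `count_species` is a hypothetical helper name.

```python
import pandas as pd


def count_species(df: pd.DataFrame) -> pd.DataFrame:
    """Count records per species, most frequent first."""
    counts = df["species"].value_counts()
    # Name the index and the values explicitly so the result has the
    # same two columns regardless of the pandas version.
    return counts.rename_axis("species").reset_index(name="count")


# Tiny illustrative sample, not the real distribution.
records = pd.DataFrame({
    "species": ["HomoSapiens", "MusMusculus", "HomoSapiens",
                "MacacaMulatta", "HomoSapiens", "MusMusculus"],
})
species_counts = count_species(records)
print(species_counts)
```
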

Visualization

Analyze V/J Region Relationship

The V and J regions largely determine the function
of the TCR. Therefore, it is interesting to analyze
their distribution.

Visualization
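
One possible way to tabulate the V/J relationship is a contingency table of segment pairs. This is a sketch of that idea, not necessarily the analysis behind the original figure; `vj_usage` is a hypothetical helper name, and the sample rows are taken from the example tables in this post.

```python
import pandas as pd


def vj_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Count how often each V segment co-occurs with each J segment."""
    return pd.crosstab(df["v.segm"], df["j.segm"])


# TRB rows copied from the example tables above.
records = pd.DataFrame({
    "v.segm": ["TRBV9*01", "TRBV9*01", "TRBV9*01", "TRBV9*01",
               "TRBV9*01", "TRBV5-1*01"],
    "j.segm": ["TRBJ2-2*01", "TRBJ2-2*01", "TRBJ2-2*01", "TRBJ2-2*01",
               "TRBJ2-1*01", "TRBJ2-5*01"],
})
usage = vj_usage(records)
print(usage)
```

Pairs that never occur appear as zeros, so the resulting table can be passed directly to a heatmap plot.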

This post is written by Jiarui LI, licensed under CC BY-NC 4.0.
