SUPPORTING FILES
Bioinformatics methodologies for celiac disease and its comorbidities
This page provides the scripts developed for the article “Bioinformatics methodologies for celiac disease and its comorbidities” by Eugenio Del Prete, Angelo Facchiano and Pietro Liò.
The work describes a pipeline for the integration and analysis of microarray
datasets on coeliac disease and some of its comorbidities. Getting and
cleaning the data (dataset selection) is left to the user; some hints on the selection are given in the publication. This semi-automated
pipeline is divided into two R scripts.
The first script, 'coeliac_disease_example.R', must
be run first, because the TXT file it generates is part of the input for the
second script, 'semantic_similarity_example.R'. As
the names suggest, the scripts are an example of how to perform the
analysis described in the publication on only two
datasets. Practical details and possible issues are
reported in the two README (TXT) files.
FIRST SCRIPT
README_ScriptOne and coeliac_disease_example.R
The general idea
is to select some coeliac disease microarray datasets and extract the
differentially expressed genes in a two-state comparison: patients with coeliac
disease vs healthy controls, or patients with coeliac disease vs gluten-free
diet controls. From the intersection of these results, the most important
functional annotations from biological process (i.e. one of the main Gene
Ontology annotations) are reported. The same process is performed for the selected autoimmune comorbidities, in
order to find the functional annotations that are shared with those
found for coeliac disease. The main statistical method for this part is
Gene Set Enrichment Analysis. The algorithm of the first script is shown below.
SCRIPT I: DATASET EVALUATION (GENE SET ENRICHMENT ANALYSIS)
INPUT: GEO Microarray Data
OUTPUT: Differentially expressed genes, GO terms tree, correspondence genes-GO terms
1: Set work folder
2: Load libraries for process
3: Download GEO dataset
4: Convert GEO dataset in ExpressionSet class
5: Create design matrix
6: Calculate differential expression
7: Create statistic table and sub-table
8: Save sub-table (Excel format)
9: Choose logFC threshold
10: Create topGO class with annotation
11: Store candidate genes and related GO terms
12: Perform Fisher’s test (and Kolmogorov-Smirnov’s test)
13: (Compare the tests)
14: Create and plot GO terms tree
15: Save GO terms tree (Pdf format)
16: Create correspondence genes-GO terms
17: Save correspondence genes-GO terms (Text format)
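Step 12 of the algorithm above relies on Fisher's exact test. As a minimal sketch in base R, assuming a simple over-representation setup for a single GO term, the test reduces to a 2x2 contingency table. The counts and variable names below are invented for illustration and do not come from any GEO dataset; the script itself works through topGO objects, which this sketch does not reproduce.

```r
# Toy counts (hypothetical, for illustration only)
n_universe   <- 1000  # annotated genes measured on the array
n_candidates <- 50    # genes passing the chosen logFC threshold
n_in_term    <- 40    # genes annotated to the GO term of interest
n_overlap    <- 10    # candidate genes annotated to that term

# Rows: gene in the GO term or not; columns: candidate gene or not
contingency <- matrix(
  c(n_overlap,
    n_candidates - n_overlap,
    n_in_term - n_overlap,
    n_universe - n_candidates - n_in_term + n_overlap),
  nrow = 2,
  dimnames = list(term = c("in_term", "not_in_term"),
                  candidate = c("yes", "no"))
)

# One-sided test: is the term over-represented among the candidates?
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

# The same tail probability written as a hypergeometric distribution
p_hyper <- phyper(n_overlap - 1, n_in_term, n_universe - n_in_term,
                  n_candidates, lower.tail = FALSE)

print(contingency)
print(p_fisher)
```

With these toy counts about 2 candidate genes would be expected in the term by chance, so an overlap of 10 gives a small p-value. In a real analysis the test is repeated for every GO term, which is why the resulting p-values are usually compared or corrected across terms.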
EXAMPLE OF OUTPUT FROM FIRST SCRIPT (Dataset GSE84729a, coeliac disease)
- Subset of important statistics for differentially expressed genes: GSE84729a_table.csv
- Gene Ontology terms tree: CD1_GSE84729a_classic_5_all.pdf
- Correspondence genes-GO terms: CD1_GSE84729a_correspondence.txt
SECOND SCRIPT
README_ScriptTwo and semantic_similarity_example.R
The set of
correspondences between genes and Gene Ontology terms is the input for the
second part of the analysis. The next step is the creation of similarity
matrices, which quantify the similarity among all the selected datasets
(coeliac disease and comorbidities) in terms of genes, Gene Ontology
terms and Disease Ontology terms. The Disease Ontology IDs are already
included in the second script for the selected autoimmune diseases, i.e.
Alopecia Areata, Arteritis, Autoimmune Thyroid Disease,
Dermatomyositis, Primary Biliary Cirrhosis, Peripheral Neuropathy, Rheumatoid
Arthritis, and Vitiligo. Finally, a further enrichment of the candidate genes is performed using KEGG annotation. The algorithm of the
second script is shown below.
SCRIPT II: PATHWAY AND MATRIX (SEMANTIC SIMILARITY)
INPUT: Correspondence genes-GO terms, DO terms
OUTPUT: Gene lists, semantic similarity matrices, KEGG enrichment graph
1: Set work folder
2: Load libraries for process
3: Load correspondence genes-GO terms files
4: List genes and GO terms
5: Save gene lists (Rda format)
6: Select GO field and prepare annotation
7: Create genes semantic similarity matrix
8: Create GO terms semantic similarity matrix
9: Save genes and GO terms semantic similarity matrices (Rda format)
10: Plot and save semantic similarity matrices (Pdf format)
11: Load DO terms
12: Create DO terms semantic similarity matrix
13: Save DO terms semantic similarity matrix (Rda format)
14: Create KEGG Enrichment graph
15: Plot and save KEGG Enrichment graph (Pdf format)
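Steps 7, 8 and 12 above each produce a square matrix with one row and one column per item being compared. As a rough illustration of that matrix shape only, the base R sketch below builds a dataset-by-dataset matrix using the Jaccard index on gene lists. This is a plain set-overlap measure swapped in for simplicity, not the ontology-based semantic similarity computed by the script, and the gene and dataset labels are placeholders.

```r
# Placeholder gene lists, one per dataset (hypothetical labels)
gene_lists <- list(
  CD1 = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  CD2 = c("GENE_B", "GENE_C", "GENE_E"),
  RA  = c("GENE_C", "GENE_F")
)

# Jaccard index: shared genes divided by all distinct genes in the pair
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Fill the pairwise matrix over all datasets
idx <- seq_along(gene_lists)
sim_mat <- outer(
  idx, idx,
  Vectorize(function(i, j) jaccard(gene_lists[[i]], gene_lists[[j]]))
)
dimnames(sim_mat) <- list(names(gene_lists), names(gene_lists))

print(round(sim_mat, 2))
```

The result is symmetric with ones on the diagonal, which is also the shape to expect from a semantic similarity matrix. Such a matrix can be stored with `save(sim_mat, file = ...)`, matching the Rda format mentioned in the steps above.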
EXAMPLE OF OUTPUT FROM SECOND SCRIPT
The output depends on all the 'xxx_correspondence.txt' files
generated with the first script. For publication
reasons, only one (non-graphical) example is provided:
do_sem_sim_mat.Rda. To load the semantic similarity matrix for
Disease Ontology terms, type at the R command line: load("do_sem_sim_mat.Rda").
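Note that `load()` restores the saved object into the workspace without printing it. The name of the object stored in do_sem_sim_mat.Rda is not stated on this page, so the sketch below uses a toy matrix written to a temporary file (both the matrix and its labels are hypothetical) to show how the name returned by `load()` can be used to display the content.

```r
# Toy stand-in for a saved similarity matrix (hypothetical values and labels)
toy_matrix <- matrix(
  c(1.0, 0.3,
    0.3, 1.0),
  nrow = 2,
  dimnames = list(c("disease_1", "disease_2"),
                  c("disease_1", "disease_2"))
)

# Save it in Rda format, then remove it from the workspace
rda_path <- file.path(tempdir(), "toy_sem_sim_mat.Rda")
save(toy_matrix, file = rda_path)
rm(toy_matrix)

# load() returns, invisibly, the names of the objects it restored
loaded_names <- load(rda_path)
print(loaded_names)

# Fetch the restored object by name and display it
sim <- get(loaded_names[1])
print(sim)
```

For the real file, replacing `rda_path` with "do_sem_sim_mat.Rda" (with the working directory set to the folder containing it) should work the same way.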
SOURCES TO INSTALL R, RSTUDIO AND BIOCONDUCTOR
The latest release
of R is available from this (Italian) mirror:
http://cran.mirror.garr.it/mirrors/CRAN/
As a second step,
we suggest downloading the RStudio IDE from this link:
https://www.rstudio.com/products/RStudio/
The RStudio IDE is not compulsory, but we
encourage using it because of its features (especially for beginners).
Finally,
Bioconductor must be installed beforehand, following the instructions at:
https://www.bioconductor.org/install/.
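Once R and Bioconductor are set up, it can help to check which packages are still missing before running the scripts. The sketch below is only a guess at the requirements: topGO is named in the algorithm of the first script, while the other package names are assumptions based on the steps described on this page. The README files remain the authoritative list.

```r
# Hypothetical package list: only topGO is named on this page,
# the remaining names are assumptions and may not match the scripts
needed <- c("GEOquery", "limma", "topGO", "GOSemSim", "DOSE", "clusterProfiler")

# requireNamespace() returns TRUE if a package is installed and loadable
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Print the install command instead of running it
  message("Install with: BiocManager::install(c(",
          paste(sprintf('"%s"', missing_pkgs), collapse = ", "), "))")
} else {
  message("All listed packages are available.")
}
```

The check only reports what is missing and prints a suggested command, so nothing is installed without the user's decision.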
The reference to the article and further information will be added after the article is published.