SUPPORTING FILES

Bioinformatics methodologies for celiac disease and its comorbidities

This page presents the scripts developed for the article “Bioinformatics methodologies for celiac disease and its comorbidities” by Eugenio Del Prete, Angelo Facchiano and Pietro Liò.

The work describes a pipeline for the integration and analysis of microarray datasets on coeliac disease and some of its comorbidities. The data getting and cleaning step (dataset selection) is left to the user; some hints on the selection are given in the publication. This semi-automated pipeline is divided into two R scripts. The first script, 'coeliac_disease_example.R', must be run first, because the TXT file it generates is part of the input for the second script, 'semantic_similarity_example.R'. As the names suggest, the scripts are an example of how to perform the analysis described in the publication for only two datasets. Practical details and possible issues are reported in the two README (TXT) files.

FIRST SCRIPT

README_ScriptOne and coeliac_disease_example.R

The general idea is to select some coeliac disease microarray datasets and extract the differentially expressed genes in a two-state comparison: patients with coeliac disease vs healthy controls, or patients with coeliac disease vs gluten-free diet controls. From the intersection of these results, the most important functional annotations from the biological process category (one of the main Gene Ontology categories) are reported. The same process is performed for the selected autoimmune comorbidities, in order to find the functional annotations shared with those previously obtained for coeliac disease. The main statistical method for this part is Gene Set Enrichment Analysis. The algorithm of the first script is shown below.

SCRIPT I: DATASET EVALUATION (GENE SET ENRICHMENT ANALYSIS)

INPUT: GEO Microarray Data

OUTPUT: Differentially expressed genes, GO terms tree, correspondence genes-GO terms

1: Set work folder

2: Load libraries for process

3: Download GEO dataset

4: Convert GEO dataset to ExpressionSet class

5: Create design matrix

6: Calculate differential expression

7: Create statistic table and sub-table

8: Save sub-table (Excel format)

9: Choose logFC threshold

10: Create topGO class with annotation

11: Store candidate genes and related GO terms

12: Perform Fisher’s exact test (and Kolmogorov-Smirnov test)

13: (Compare the tests)

14: Create and plot GO terms tree

15: Save GO terms tree (PDF format)

16: Create correspondence genes-GO terms

17: Save correspondence genes-GO terms (Text format)
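As a rough illustration of steps 9 and 12, the sketch below applies a logFC threshold to a toy statistics table and runs a one-sided Fisher's exact test for a single GO term in base R. It is not the code of 'coeliac_disease_example.R': the gene identifiers, logFC values, threshold and annotation set are all invented for the example, and the real script works on topGO objects rather than on a hand-built contingency table.

```r
# Toy statistics table (invented values), one row per gene, shaped like a
# differential expression result.
stats_table <- data.frame(
  gene  = paste0("G", 1:10),
  logFC = c(2.1, -1.8, 0.2, 1.5, -0.1, 0.4, -2.3, 0.05, 1.2, -0.3),
  stringsAsFactors = FALSE
)

# Step 9: choose a logFC threshold (1 is an arbitrary illustrative value).
logfc_threshold <- 1
candidates <- stats_table$gene[abs(stats_table$logFC) >= logfc_threshold]

# Hypothetical annotation: the genes assigned to one GO term.
go_term_genes <- c("G1", "G2", "G4", "G6")

# Step 12: build the 2x2 table (candidate or not, annotated or not).
is_candidate <- stats_table$gene %in% candidates
is_annotated <- stats_table$gene %in% go_term_genes
contingency <- table(
  candidate = factor(is_candidate, levels = c(TRUE, FALSE)),
  annotated = factor(is_annotated, levels = c(TRUE, FALSE))
)

# One-sided Fisher's exact test for over-representation of the GO term
# among the candidate genes.
fisher_result <- fisher.test(contingency, alternative = "greater")

print(contingency)
print(fisher_result$p.value)
```

With real data the same test is repeated for every GO term, which is why the resulting p-values are usually interpreted with some correction or ranking rather than one at a time.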

EXAMPLE OF OUTPUT FROM FIRST SCRIPT (Dataset GSE84729a, coeliac disease)

- Subset of important statistics for differentially expressed genes: GSE84729a_table.csv

- Gene Ontology terms tree: CD1_GSE84729a_classic_5_all.pdf

- Correspondence genes-GO terms: CD1_GSE84729a_correspondence.txt

SECOND SCRIPT

README_ScriptTwo and semantic_similarity_example.R

The set of correspondences between genes and Gene Ontology terms is the input for the second part of the analysis. The next step is the creation of similarity matrices, which quantify the similarity among all the selected datasets (coeliac disease and comorbidities) in terms of genes, Gene Ontology terms and Disease Ontology terms. The Disease Ontology IDs are already included in the second script for the selected autoimmune diseases, i.e. Alopecia Areata, Arteritis, Autoimmune Thyroid Disease, Dermatomyositis, Primary Biliary Cirrhosis, Peripheral Neuropathy, Rheumatoid Arthritis, and Vitiligo. Finally, a further enrichment of the candidate genes is performed using KEGG annotation. The algorithm of the second script is shown below.

SCRIPT II: PATHWAY AND MATRIX (SEMANTIC SIMILARITY)

INPUT: Correspondence genes-GO terms, DO terms

OUTPUT: Gene lists, semantic similarity matrices, KEGG enrichment graph

1: Set work folder

2: Load libraries for process

3: Load correspondence genes-GO terms files

4: List genes and GO terms

5: Save gene lists (Rda format)

6: Select GO field and prepare annotation

7: Create genes semantic similarity matrix

8: Create GO terms semantic similarity matrix

9: Save genes and GO terms semantic similarity matrices (Rda format)

10: Plot and save semantic similarity matrices (PDF format)

11: Load DO terms

12: Create DO terms semantic similarity matrix

13: Save DO terms semantic similarity matrix (Rda format)

14: Create KEGG Enrichment graph

15: Plot and save KEGG Enrichment graph (PDF format)
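To give an idea of what a dataset-by-dataset similarity matrix looks like (steps 4, 7 and 9), the base R sketch below builds one from toy gene lists. Note the substitution: it uses the Jaccard index on gene overlap, which is much simpler than the ontology-based semantic similarity computed in 'semantic_similarity_example.R'. The dataset labels, gene identifiers and output file name are hypothetical.

```r
# Hypothetical gene lists, one per dataset (labels and genes are invented).
gene_lists <- list(
  CD1 = c("geneA", "geneB", "geneC", "geneD"),
  CD2 = c("geneA", "geneB", "geneE"),
  RA1 = c("geneB", "geneF", "geneG")
)

# Jaccard index: size of the intersection divided by size of the union.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Fill a symmetric dataset-by-dataset matrix.
n <- length(gene_lists)
sim_mat <- matrix(0, nrow = n, ncol = n,
                  dimnames = list(names(gene_lists), names(gene_lists)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim_mat[i, j] <- jaccard(gene_lists[[i]], gene_lists[[j]])
  }
}

print(round(sim_mat, 2))

# Save in Rda format, as in step 9 (written to a temporary folder here).
save(sim_mat, file = file.path(tempdir(), "toy_sim_mat.Rda"))
```

A semantic similarity measure differs from this overlap score in that two datasets can be similar even when they share no gene, provided their genes are annotated with closely related ontology terms.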

EXAMPLE OF OUTPUT FROM SECOND SCRIPT

The output depends on all the 'xxx_correspondence.txt' files generated with the first script. For publication reasons, only one (non-graphical) example is reported: do_sem_sim_mat.Rda. To inspect the semantic similarity matrix for Disease Ontology terms, type load("do_sem_sim_mat.Rda") in the R command line and then print the object restored in the workspace.
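Since load() restores objects without printing them, a short sketch of how an Rda file can be inspected may help. The name of the object stored in 'do_sem_sim_mat.Rda' is not stated here, so the example uses a stand-in file with an invented 2x2 matrix and placeholder row names; with the real file, only the path passed to load() would change.

```r
# Stand-in file: a toy matrix saved to a temporary folder, so the example
# runs without the real 'do_sem_sim_mat.Rda' (values and names are invented).
toy_mat <- matrix(c(1, 0.3, 0.3, 1), nrow = 2,
                  dimnames = list(c("DOID:a", "DOID:b"),
                                  c("DOID:a", "DOID:b")))
rda_path <- file.path(tempdir(), "toy_do_sem_sim_mat.Rda")
save(toy_mat, file = rda_path)
rm(toy_mat)

# load() restores the saved objects and invisibly returns their names.
loaded_names <- load(rda_path)
print(loaded_names)

# Retrieve the matrix by its name and inspect it.
do_mat <- get(loaded_names[1])
print(dim(do_mat))
print(do_mat)
```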

SOURCES TO INSTALL R, RSTUDIO AND BIOCONDUCTOR

The latest release of R is available from this (Italian) CRAN mirror:

http://cran.mirror.garr.it/mirrors/CRAN/

As a second step, we suggest downloading the RStudio IDE from this link: https://www.rstudio.com/products/RStudio/

The RStudio IDE is not compulsory, but we encourage its use because of its many features (especially helpful for beginners).

Finally, Bioconductor must be installed beforehand, following the instructions at:

https://www.bioconductor.org/install/.
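For recent R versions, Bioconductor packages are typically installed through the BiocManager package from CRAN. The sketch below shows that route; the package list is an assumption (topGO is suggested by step 10 of the first script, while GEOquery and limma are guesses based on the GEO download and differential expression steps), so the README files remain the reference for what the scripts actually need. It requires an internet connection.

```r
# Install BiocManager from CRAN if it is not already available.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Assumed package list: adjust it to the packages named in the README files.
needed <- c("GEOquery", "limma", "topGO")

# Install only the packages that are not yet present.
to_install <- needed[!vapply(needed, requireNamespace, logical(1),
                             quietly = TRUE)]
if (length(to_install) > 0) {
  BiocManager::install(to_install)
}
```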

A reference to the article and further information will be added after its publication.