BCHS3201: Microarray Paper


You will be working with data generated using Affymetrix Arabidopsis thaliana (ATH1) full genome chips.  Please watch the microarray lecture posted in Blackboard for information on how the chips are constructed and how they are used. Step-by-step instructions are provided here for managing the data. While I have provided details here, keep in mind that in a real research lab, you would have to decide for yourself how to organize the data and make sense of it.

Arabidopsis thaliana

Arabidopsis thaliana is a small flowering plant found all over the world. It is commonly considered a weed in the United States and can be found in the Midwest (Texas is too hot; the plant likes temperatures around 68°F).  Arabidopsis serves as a model plant because it has a number of characteristics that make it amenable to study. The plant is small, reaching only 30 cm in height when fully grown. It grows well in both soil and nutrient media, making it easier to develop carefully controlled studies (Meyerowitz, 1989). It is easily grown indoors in a laboratory, whereas crop plants require much larger facilities and land to study.  The life cycle of Arabidopsis is only 6 weeks from seed to seed-producing plant. This allows a much faster pace for experiments than most crop plants, where only one generation of plants can be grown in a calendar year (unless your university is fortunate enough to have land in two hemispheres so you can get two growing seasons in).  Arabidopsis plants produce thousands of seeds per plant, and these seeds are tiny, making them easy to store in microcentrifuge tubes in the freezer (Meyerowitz, 1989).

Arabidopsis has a haploid genome of 5 chromosomes consisting of approximately 125 megabases (The Arabidopsis Genome Initiative, 2000).  This is a very small genome compared to those of crop species. The maize genome, for example, is around 2,500 megabases in size (Adam, 2000). Most genes in Arabidopsis exist at a single locus in the genome. Crop plant genomes are large in part because they contain large sections that are duplicated, which makes creating complete knock-outs of a particular gene difficult.  Arabidopsis is amenable to genetic manipulation either through traditional cross-breeding techniques or more modern genetic modification techniques (mutation through T-DNA inserts, chemical agents, or CRISPR-Cas9). Studies conducted in Arabidopsis are often directly transferable to crop species, as many of the genes have homologues in crop plants. Studying them first in Arabidopsis is easier, cheaper, and faster.

Sugar and Phytohormone Signaling Pathways

Sugars have a role in basic plant metabolism as a carbon source and also play a role as signaling molecules, contributing to the regulation of a number of pathways in plants.  The expression of genes involved in mobilization of starch and lipid reserves is usually repressed by the presence of high sugar levels in the plant while genes involved in storage of carbohydrates are upregulated (Jang & Sheen, 1997; Yu, 1999). Soluble sugar levels in plants also play a role in a number of developmental processes including time to flowering (Bernier et al., 1993), shoot to root ratios (Wilson, 1988), and senescence (cells stop dividing and normal biological processes begin to deteriorate) (Dai et al., 1999). The DNA chip data you will be analyzing for class is part of a larger study to elucidate the full impact of sugar signaling in Arabidopsis and to identify potential components of signaling pathways for future study.

Phytohormones are involved in a wide array of plant responses. The phytohormones ethylene and abscisic acid are also intertwined with the sugar response signaling pathways.

Ethylene plays a role in a plant’s development as well as its response to environmental conditions. Ethylene has a role in shoot and root elongation, sex determination, petal senescence, and fruit ripening. It also is involved in the plant’s response to flooding and pathogens.

Abscisic acid is involved in preventing premature germination of seeds, in root elongation, and in stomatal closure.  Stomata are pores in the leaf epidermis which control the rate of gas exchange. The pore is surrounded by two bean-shaped guard cells that regulate the size of the pore opening. Abscisic acid plays a critical role in the closure of the guard cells.  Plants with mutations in the abscisic acid biosynthesis pathway have a “wilty” phenotype because they are unable to close their stomata during the day when loss of water to evaporative processes is high. The mutant aba2 has been found to be allelic to the glucose insensitive 1 (gin1) mutant (meaning the mutations for both aba2 and gin1 lie in the same gene).

Signaling pathways often work together to fine-tune plant development and responses. Seed germination, for example, is finely controlled by antagonistic interactions between sugar and abscisic acid, which inhibit germination, and gibberellin and ethylene, which promote germination (figure 1).

Figure 1. Seed germination is controlled by a combination of signals from sugar levels, abscisic acid, gibberellin, and ethylene.

The sugar-insensitive 6 (sis6) mutant is slightly resistant to the inhibitory effects of abscisic acid on germination (Pattison, 2004). When seeds are grown in a petri plate with nutrient medium supplemented with abscisic acid, germination is delayed in wild-type plants. The sugar-insensitive 3 (sis3) mutant is slightly resistant to the effect of abscisic acid in comparison to wild-type (Columbia ecotype) seeds.  The abscisic acid insensitive 4-1 (abi4-1) mutant displays precocious seed germination in the presence of abscisic acid, germinating despite the presence of exogenous ABA which should significantly delay germination (figure 2).

Figure 2.  The sis6 mutant is insensitive to the inhibitory effects of ABA on germination.  Seeds were sown on the indicated media and grown in continuous white fluorescent light. Germination was scored every 12 hours for four days and then every 24 hours thereafter. Error bars represent the mean ± standard deviation (n=3).  This experiment was conducted three times with similar results. From Pattison, 2004.

How the Data Were Collected for This Set of Experiments

In order to conduct a chip experiment, RNA must be collected from the samples. In our experiments, Arabidopsis seeds were surface sterilized, cold treated at 4°C in the dark for three days, and then plated on Nytex mesh screens placed in petri dishes containing minimal nutrient media.  After 20 hours under continuous light at 21°C, the Nytex meshes were transferred to plates containing either minimal media or minimal media supplemented with 100 mM sorbitol, 100 mM glucose, 10 µM abscisic acid, or 50 µM ACC (an ethylene precursor). Seeds were grown on the new media for 12.5 hours and then frozen in liquid nitrogen.  RNA was extracted using a phenol/chloroform extraction (Verwoerd et al., 1989). RNA samples were sent to the Molecular Genomics Core Facility at the University of Texas Medical Branch in Galveston for processing.

Part 1.  Selecting your experimental conditions

To begin your work on the microarray project, you need to select your topic of study. You need to decide what you would like to examine and then select the appropriate control condition. Your options are in Table 1 below.

Table 1. Select your topic of study for the microarray project. Choose one option. Each row represents one possible option. Because the control must be appropriately matched to the experimental condition, you may not mix and match between rows.

Part 2. Identifying differences in gene regulation between control and experimental conditions.

1.  Download the spreadsheet corresponding to your selected control and experimental conditions to your computer.

2.  Take a few minutes to familiarize yourself with the spreadsheet layout.

Column A:  AGI#.  AGI stands for Arabidopsis Genome Initiative. Every gene in Arabidopsis was assigned a unique identifier during the genome sequencing project. The Affymetrix DNA chip contains probes for over 22,000 genes, representing nearly every known gene in the genome of Arabidopsis.

Column B: Affy Probe Index #. The Affymetrix probe index # refers to the probe array that corresponds to each gene.  Each probe array contains 11 pairs of probes for the same gene. One probe in each pair is a perfect match to the gene and the other contains a mismatch in the center of the probe. The software uses the data from the perfect match probes and the mismatch probes to subtract out signal that may have arisen from near (but not quite perfect) matches. The names of the probe sets are based on what was known about the gene sequence at the time the chip was created.

Names ending in     means
_at                 all probes match one known transcript
_a                  all probes match alternate transcripts from the same gene
_s                  all probes match transcripts from different genes
_x                  some probes match transcripts from different genes

Notice that rows 2 through 65 do not have AGI#’s and the Probe Index #’s all begin with AFFX. These are the quality control probe arrays for the chip. They are included so that researchers can confirm that there were no technical issues with the chip or samples.  A mix of probes that will result in present and absent calls is included.

Signal Columns: Each experiment in this data set was conducted between 3 and 6 times. The columns that contain the word “Signal” in the header represent the value for the signal reads.

Detection Columns: The column to the right of each signal column is the Detection Column.

P = present, A = absent, M = marginal

Present means the gene was expressed in the sample, resulting in a measurable signal above a minimal detection threshold. Absent means the gene was not expressed under the experimental conditions.  Marginal means the expression was very near the detection threshold. Marginal calls require further investigation and experimentation to confirm.

Converted Detection Columns:  The column to the right of each Detection Column is the Converted Detection Column. The PMA calls are converted to a numeric value which allows the researcher to average the detection calls and decide whether or not to include a particular gene in the data set.




Descriptions: what was known about the gene’s identity or function at the time the chip was created.

3.  Open a new Excel file and name it as follows: Lastname_firstname_microarray.

4.  Change the name of Sheet 1 to “control” by right-clicking on the tab and selecting “rename” from the pop-up menu. Copy and paste all the data from your control sheet into the “control” tab.

5.  Click the “+” sign to add another tab at the bottom of the Excel sheet. Rename the new sheet “experimental”.  Copy and paste all the data from your experimental sheet into the “experimental” tab.

6.  For both experimental and control conditions, delete the rows containing the controls. These will be the rows at the top (that lack an AGI#).
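If you prefer to check step 6 outside of Excel, the same filter can be sketched in a few lines of Python. This is only an illustration, not part of the required workflow. It assumes each spreadsheet row has been read into a dictionary; the key names "agi" and "probe" and the sample rows are made up for the example.

```python
def drop_control_rows(rows):
    """Keep only rows that carry an AGI#; the quality-control rows lack one."""
    kept = []
    for row in rows:
        agi = (row.get("agi") or "").strip()
        if agi:
            kept.append(row)
    return kept


# Made-up example rows; the key names "agi" and "probe" are assumptions.
example_rows = [
    {"agi": "", "probe": "AFFX-BioB-5_at"},
    {"agi": "AT1G01010", "probe": "261585_at"},
    {"agi": "AT1G01030", "probe": "261568_at"},
]

filtered_rows = drop_control_rows(example_rows)
print([row["agi"] for row in filtered_rows])
```

Filtering on the missing AGI# mirrors the instruction above; checking that the probe name begins with AFFX would be an equivalent test for this data set.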

7.  Scroll to the right. Skip a column after the “Descriptions” column. Label the next column to the right “AVG control PMA” or “AVG experimental PMA”. Calculate the average PMA call for each gene using the converted detection column values for each condition. For example, if converted PMA detection calls are located in cells E2, I2, M2, and Q2, the formula you enter into the cell would be “=(E2+I2+M2+Q2)/4”.  Do this for both your control and experimental sheets. Enter the formula and copy/paste it down the column. The row numbers will change automatically.
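The averaging in step 7 is simple arithmetic, and a short Python sketch may make it easier to see what the Excel formula is doing. The function name and the numeric values below are placeholders chosen for illustration; use the converted detection values actually present in your spreadsheet.

```python
def average_pma(converted_calls):
    """Average the converted detection values for one gene across replicates."""
    if not converted_calls:
        raise ValueError("at least one converted detection value is required")
    return sum(converted_calls) / len(converted_calls)


# Placeholder values standing in for cells E2, I2, M2, and Q2.
print(average_pma([1, 1, 0, 1]))  # prints 0.75
```

Because the function divides by the number of values it receives, it works whether a condition has 3 replicates or 6, just as you would change the divisor in the Excel formula.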

8.  Click the “+” sign to add another tab to the bottom of the Excel sheet. Rename the new sheet “combined”.

9. Copy the following columns into the “combined” data sheet.  You will need to paste “values” for any columns containing formulas. It’s under paste options.

a. AGI#

b. Signal columns for the control

c. Leave a blank column

d. Signal columns for the experimental

e. Leave a blank column

f. AVG control PMA column

g. AVG experimental PMA column

10.  In the combined data sheet, add another column to the right of your AVG control PMA and AVG experimental PMA columns. Label this one “final PMA call”. Type in the formula “=MAX(XX2:XY2)”, where XX is the column labeled “AVG control PMA” and XY is the column labeled “AVG experimental PMA” (substitute your actual column letters for XX and XY). This formula will transfer the maximum value for the two columns to the new “final PMA call” column. The point of doing this is to preserve genes in the data set where there was signal in one of the two conditions. For example, you would not want to delete a gene from the data set because it had an absent call in the control but was upregulated 15-fold in the experimental conditions. By looking at the results using the final column, we eliminate only those genes where the signal was not detected in either condition.

11.  In the combined spreadsheet, highlight your entire data set. Make sure you pick up all the cells with data. Click “Sort & Filter” in the toolbar. Click custom sort. Check the box on the right in the pop-up box that says “My data has headers”. Sort by the “final PMA call” column from smallest to largest. Delete all rows that have a value of zero for final PMA call. This will eliminate from the data set all genes that were not expressed in either the control or the experimental condition.
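The logic of steps 10 and 11 can be summarized in a short Python sketch: take the larger of the two average PMA values, then drop any gene for which that value is zero. This is an illustration only. The dictionary keys, gene labels, and numbers are hypothetical and are not taken from the course data.

```python
def final_pma_call(avg_control_pma, avg_experimental_pma):
    """Return the larger of the two average PMA values (Excel's MAX)."""
    return max(avg_control_pma, avg_experimental_pma)


def keep_detected(genes):
    """Drop genes whose final PMA call is zero in both conditions."""
    kept = []
    for gene in genes:
        call = final_pma_call(gene["avg_control_pma"], gene["avg_experimental_pma"])
        if call != 0:
            kept.append(gene)
    return kept


# Hypothetical genes: the labels and values are invented for illustration.
example_genes = [
    {"agi": "geneA", "avg_control_pma": 0, "avg_experimental_pma": 1},
    {"agi": "geneB", "avg_control_pma": 0, "avg_experimental_pma": 0},
    {"agi": "geneC", "avg_control_pma": 1, "avg_experimental_pma": 0.75},
]

print([gene["agi"] for gene in keep_detected(example_genes)])  # ['geneA', 'geneC']
```

Note that geneA is kept even though its control value is zero, which is exactly the case the MAX formula is meant to protect.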

12. Add a column to the right of the “Final PMA call” column labeled “AVG control signal” in your combined spreadsheet. Average the values for the signal columns in your control data set. Use the formula =AVERAGE(X2:Y2) where X is the first column with the control signal data and Y is the last column of control signal data. Copy and paste the formula from row 2 all the way down the column. The row numbers will automatically change in the formula.

13.  Add a column to the right of the “AVG control signal” column labeled “AVG experimental signal” in your combined spreadsheet. Average the values for the signal columns in your experimental data set. Use the formula =AVERAGE(X2:Y2) where X is the first column with the experimental signal data and Y is the last column of experimental signal data. Copy and paste the formula from row 2 all the way down the column. The row numbers will automatically change in the formula.

14. Add a column to the right of the “AVG experimental signal” column labeled “AVG control/AVG experimental”.  You will divide the average control signal value by the average experimental signal value using the formula “=XX2/XY2” [where XX is your AVG control signal column (row 2) and XY is your AVG experimental signal column (row 2)]. Copy the formula down the column.
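Steps 12 through 14 amount to two averages and one division. The Python sketch below shows the calculation for a single gene; the function name and the signal values are invented for illustration and do not come from the course data.

```python
from statistics import mean


def control_to_experimental_ratio(control_signals, experimental_signals):
    """Divide the average control signal by the average experimental signal."""
    return mean(control_signals) / mean(experimental_signals)


# Invented signal values for one gene with three replicates per condition.
ratio = control_to_experimental_ratio([100.0, 120.0, 110.0], [220.0, 240.0, 200.0])
print(ratio)  # prints 0.5
```

Here the experimental average (220) is twice the control average (110), so the ratio is 0.5. Keep in mind that the control is in the numerator, so a small ratio means higher expression in the experimental condition.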

15.  Add a column to the right of the “AVG control/AVG experimental” column labeled T-test. You will calculate whether there is a statistically significant difference between the two conditions. The syntax for this formula is T.TEST(array1, array2, tails, type). Array 1 will be the cells containing the signal values for the control. Array 2 will be the cells containing the signal values for the experimental samples. These are NOT the averaged signals but the original values on the left-hand side of your spreadsheet. We will use a 2-tailed T-test. The type will be a two-sample equal variance test, which Excel designates as “2”.

For example, if the control signal columns were B, C, and D and the experimental signal columns were E, F, and G, then the formula to set up in row 2 for the T-test would be “=T.TEST(B2:D2,E2:G2,2,2)”. Copy the formula down the column to calculate the p-value for the T-test for each gene.
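To see what a two-tailed, two-sample equal variance T-test computes, here is a minimal Python sketch using only the standard library. It pools the variance of the two groups, forms the t statistic, and obtains the p-value by numerically integrating the t distribution with Simpson's rule. The function names and the sample values are assumptions made for illustration; this is a sketch of the textbook calculation, not a description of how Excel implements it internally.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value: 1 minus the area between -|t| and |t| (Simpson's rule)."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3
    return max(0.0, 1 - central_area)


def equal_variance_t_test(control, experimental):
    """Return (t statistic, two-tailed p-value) for a pooled-variance t-test."""
    n1, n2 = len(control), len(experimental)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(control) + (n2 - 1) * variance(experimental)) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    if std_err == 0:
        raise ValueError("both groups have zero variance; the test is undefined")
    t_stat = (mean(control) - mean(experimental)) / std_err
    return t_stat, two_tailed_p(t_stat, df)


# Invented replicate values for one gene (three control, three experimental).
t_stat, p_value = equal_variance_t_test([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t_stat, 3), round(p_value, 3))
```

For these invented values the p-value is about 0.021, which is below 0.05, so this gene would be kept in step 17.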

16.  Click the “+” sign to add another tab to the bottom of the Excel sheet. Rename the new sheet “final”. Copy all the data from the “combined” spreadsheet into your “final” spreadsheet using the copy and paste value option. This will allow you to go back to the combined sheet to relax the stringency of your data selection if you find you end up with no genes at all in your data set when you complete the following steps.

17. Highlight your entire spreadsheet. Click “Sort & Filter” in the toolbar. Click custom sort. Click the “My data has headers” box on the right of the pop-up box. Sort by T-test value from largest to smallest. Delete all genes that have a p-value greater than 0.05. The expression of these genes is not significantly different between the control and experimental conditions, so they can be eliminated from the data set.

18.  Highlight your entire spreadsheet again. Click “Sort & Filter” in the toolbar. Click custom sort. Click the “My data has headers” box on the right in the pop-up box. Sort by AVG control/AVG experimental from smallest to largest.  Delete all genes whose AVG control/AVG experimental value is greater than 0.5 and less than 2.

What you are looking for are genes where the expression is at least two-fold above or below the level for the control condition.  You want to keep genes in the data set where the AVG control/AVG experimental value is 0.5 or lower. These are genes that are UPREGULATED in the experimental condition compared to the control.  The larger number is in the denominator, so the values are less than 1.

You also want to keep genes in the data set where the AVG control/AVG experimental value is 2 or higher. In this case, the genes are DOWNREGULATED in the experimental condition compared to the control condition. Since the larger number is in the numerator, the value is greater than 1. If you do not have any genes with at least a two-fold difference in expression between control and experimental, relax your conditions and keep genes with values of 1.5 or higher or 0.66 or lower.

19.  Change the font color for all of the down-regulated genes to red [AVG control/AVG experimental values of 2 or higher (or 1.5 or higher if you relaxed the conditions)].

20.  Change the font color for all of the up-regulated genes to green [AVG control/AVG experimental values of 0.5 or lower (or 0.66 or lower if you relaxed the conditions)].

21.  Determine how many genes were up-regulated and how many were down-regulated.
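The filtering and counting in steps 17 through 21 can be double-checked with a short Python sketch. It applies the p-value cutoff, labels each remaining gene by its AVG control/AVG experimental value, and counts the labels. The function names, dictionary keys, and gene values are hypothetical and are used only to illustrate the logic.

```python
def classify(ratio, p_value, low=0.5, high=2.0, alpha=0.05):
    """Label a gene 'up' or 'down' from its control/experimental ratio, else None."""
    if p_value > alpha:
        return None  # not significantly different (step 17)
    if ratio <= low:
        return "up"  # experimental signal is higher than control
    if ratio >= high:
        return "down"  # experimental signal is lower than control
    return None  # change is smaller than the fold-change cutoff (step 18)


def count_regulated(genes, low=0.5, high=2.0):
    """Count up- and down-regulated genes using the given ratio cutoffs."""
    counts = {"up": 0, "down": 0}
    for gene in genes:
        label = classify(gene["ratio"], gene["p_value"], low, high)
        if label is not None:
            counts[label] += 1
    return counts


# Invented ratios and p-values for five hypothetical genes.
example_genes = [
    {"ratio": 0.4, "p_value": 0.01},
    {"ratio": 2.5, "p_value": 0.03},
    {"ratio": 0.6, "p_value": 0.02},
    {"ratio": 3.0, "p_value": 0.20},
    {"ratio": 1.1, "p_value": 0.001},
]

print(count_regulated(example_genes))  # {'up': 1, 'down': 1}
print(count_regulated(example_genes, low=0.66, high=1.5))  # {'up': 2, 'down': 1}
```

The second call shows the relaxed cutoffs: the gene with a ratio of 0.6 is counted as up-regulated only when the lower cutoff is raised to 0.66.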