# What is PinnacleZ?
PinnacleZ is a tool for classifying gene expression profiles by integrating
gene expression data and protein networks.
It is an implementation for Cytoscape of the searching and scoring
algorithms specified in Chuang, H. Y. and Lee, E., et al.,
"Network-based classification of breast cancer metastasis,"
<i>Molecular Systems Biology</i> 3:140 (2007). By applying a protein
network-based approach, indicators of a phenotype, or <i>markers</i>, are not
just genes, but subnetworks of the given protein network. This approach assumes
a direct correspondence between a gene in expression data and a protein in
a protein network. In other words, gene X in the given expression data
is related to protein X in the given protein network.
# Terms used in this page
* <b>Adjacent Node:</b> this is typically used in a sentence like
<i>X is the adjacent node to Y</i>. This means there is an edge between
nodes X and Y. In biological terms, protein X and Y interact.
* <b>Edge:</b> a line in Cytoscape between two nodes. In biological terms,
this indicates an interaction between two proteins.
* <b>Gene Expression Matrix:</b> a table of numbers, where each row
represents a gene and each column represents a unique molecular condition
or state of a biological cell.
Each cell in the table specifies the level of expression of a given gene under
a given condition.
* <b>Gene Expression Vector:</b> a row in the gene expression matrix.
Each vector corresponds to a protein in the protein network.
* <b>Protein Network:</b> a network in Cytoscape representing
protein interactions. For example, assume a protein network in
Cytoscape has nodes X and Y, and there is a connection between X and Y.
This means protein X interacts with protein Y, or vice versa.
In graph theory jargon, a network is called a <i>graph</i>.
* <b>Node:</b> a protein in the protein network. In graph theory parlance,
this is also called a <i>vertex</i>.
# The Overall Process of PinnacleZ
1. PinnacleZ calculates a set of <i>modules</i>. A module is merely a
subnetwork of the given protein network. A module is calculated by starting out
only with a <i>starting node</i>. A starting node can be any node in the
protein network. Nodes are then added to the module.
A node is added to the module only if
* the node is adjacent to any node already in the module, and
* the node improves the overall score of the module. (Scoring is defined
in the next step.)
If no node can be added that meets the two criteria above, the process of
building a module stops.
PinnacleZ goes through each node in the protein network and calculates its
module. It collects all of these modules together. These are called
<i>real modules</i>.
1. PinnacleZ scores each real module. A score is a numerical quantity that measures
how "good" a module is. PinnacleZ gives the user a choice between two scoring
methods: mutual information and T test. The score depends on the
gene expression vectors contained in the module.
1. Many of the real modules may have arisen by mere chance and are statistically
insignificant. These modules must be removed. PinnacleZ filters
out insignificant modules by passing them through statistical tests.
In order to do this, PinnacleZ first:
   1. randomly reassigns the gene expression vectors among the proteins, breaking the original gene-to-protein correspondence;
   1. recalculates all the modules now that the associations between
   gene expression vectors and nodes have been randomized;
   1. scores the modules; these modules are collected together and are called <i>random modules</i>.
1. This randomized process is repeated many times. The number of random trials
is determined by the user. The more random trials, the better the results,
but the computation time becomes longer.
1. <i>Statistical Test 1</i>: The scores of all random modules are collected
together and are placed in a null distribution. If a real module's score is
insignificant when compared against the null distribution, it is discarded.
1. <i>Statistical Test 2</i>:
The random module scores are used to estimate the parameters of a distribution.
If the user selected mutual information for a scoring method,
the gamma distribution is used. If the user selected T test, the normal
distribution is used. If a real module has an insignificant score compared to
the distribution, it is discarded.
1. <i>Statistical Test 3</i>:
The gene expression vectors of a real module are combined into one vector.
A score is calculated based on this vector. The order of the vector's columns
is then randomized. Another score is calculated from this randomized vector.
This randomization process is repeated many times. The randomized scores are
then placed in a null distribution. If the real score is insignificant
compared to the null distribution, the module is discarded.
1. The real modules that passed the three statistical tests are presented
to the user.
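
The module-building step and Statistical Test 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PinnacleZ's actual code: the names `grow_module` and `empirical_p_value` are hypothetical, the score is supplied by the caller, the sketch adds the single best-scoring adjacent node in each round (the text above only requires that an added node be adjacent and improve the score), and the P-value is taken as the fraction of random scores at least as large as the real score.

```python
from typing import Callable, Dict, Iterable, Set


def grow_module(adjacency: Dict[str, Set[str]], start: str,
                score: Callable[[Set[str]], float]) -> Set[str]:
    """Greedily grow a module from `start` (hypothetical helper)."""
    module = {start}
    best = score(module)
    while True:
        # Candidates: nodes adjacent to any node already in the module.
        neighbors = set().union(*(adjacency.get(n, set()) for n in module))
        candidates = neighbors - module
        if not candidates:
            break
        # Assumption: try every candidate and keep the best-scoring one.
        top_score, top_node = max(
            (score(module | {c}), c) for c in sorted(candidates)
        )
        # Stop when no candidate improves the module's score.
        if top_score <= best:
            break
        module.add(top_node)
        best = top_score
    return module


def empirical_p_value(real_score: float, null_scores: Iterable[float]) -> float:
    """Fraction of random-module scores that are >= the real score."""
    null = list(null_scores)
    if not null:
        raise ValueError("the null distribution is empty")
    return sum(s >= real_score for s in null) / len(null)
```

A real module would then be discarded when `empirical_p_value` exceeds the cutoff chosen for Statistical Test 1.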
# Input for PinnacleZ
PinnacleZ requires three sources of input: a gene expression matrix,
a class file, and a protein network.
<b>Note</b>: `\ws` indicates white space, which is a tab or a space.
## The Gene Expression Matrix
The gene expression matrix is a text file with the following format:
* The first line describes the names of the columns of the matrix.
It follows this format:
`names \ws condition1 \ws condition2 \ws` ... `conditionN`
* Subsequent lines describe gene expression vectors, one per line. Each follows this format:
`gene1 \ws number1 \ws number2 \ws` ... `numberN`
`gene2 \ws number1 \ws number2 \ws` ... `numberN`
...
`geneM \ws number1 \ws number2 \ws` ... `numberN`
<b>Note:</b> The number of expression values in each row must exactly equal
the number of conditions given in the first line. Otherwise,
PinnacleZ will not accept the gene expression matrix file.
The following is an example of a valid gene expression matrix file:
names +Glucose -Glucose +Succinate
AT_Gene_01 1.0 2 3e-9
AT_Gene_02 7.0 26 10e10
AT_Gene_03 2.0 22 62e10
AT_Gene_04 9.0 12 6e12
In the above example, there are three conditions and four genes.
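
To make the format concrete, here is a minimal Python sketch of a reader for such a file. It is not PinnacleZ's own parser; the function name `parse_expression_matrix` is hypothetical, and the sketch assumes that the first token of the header line is only a label for the gene-name column and that blank lines can be skipped.

```python
from typing import Dict, List, Tuple


def parse_expression_matrix(text: str) -> Tuple[List[str], Dict[str, List[float]]]:
    """Return (condition names, {gene name: expression vector})."""
    # Assumption: blank lines carry no data and are ignored.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty gene expression matrix")
    # split() with no argument splits on runs of tabs and spaces (\ws).
    conditions = lines[0].split()[1:]
    vectors: Dict[str, List[float]] = {}
    for ln in lines[1:]:
        gene, *values = ln.split()
        # Each row must have exactly one number per condition.
        if len(values) != len(conditions):
            raise ValueError(
                f"{gene}: expected {len(conditions)} values, got {len(values)}"
            )
        vectors[gene] = [float(v) for v in values]
    return conditions, vectors
```

Applied to the example file above, this would return the three condition names and a dictionary with four gene expression vectors.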
## The Class File
The class file specifies the classification of each condition specified in
the gene expression matrix. It has the following format:
* Each line specifies the classification of a condition. It has this format: `condition-name \ws classification`
* Each condition specified in the gene expression matrix must have a
classification. If it does not, PinnacleZ will not accept the class file.
* The classification must be a positive integer.
* If T test is used for the scoring method, <i>only two classes are
allowed</i>: `1` and `2`.
* If mutual information is used, any number of classes is allowed.
The following is an example of a valid class file based on the example
gene expression matrix given above:
+Glucose 1
-Glucose 2
+Succinate 1
In the above example, `+Glucose` and `+Succinate` are in one
class, and `-Glucose` is in another.
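
The rules above can be expressed as a small Python check. This is a sketch under assumptions rather than PinnacleZ's actual validation: the function name `parse_class_file` and the `two_class_only` flag (standing in for the T test restriction) are hypothetical.

```python
from typing import Dict, Iterable


def parse_class_file(text: str, conditions: Iterable[str],
                     two_class_only: bool = False) -> Dict[str, int]:
    """Return {condition name: class}, enforcing the rules listed above."""
    classes: Dict[str, int] = {}
    for ln in text.splitlines():
        if not ln.strip():
            continue  # assumption: blank lines are ignored
        # Each line is: condition-name \ws classification
        name, label = ln.split()
        value = int(label)
        if value <= 0:
            raise ValueError(f"{name}: class must be a positive integer")
        classes[name] = value
    # Every condition in the expression matrix needs a classification.
    missing = [c for c in conditions if c not in classes]
    if missing:
        raise ValueError(f"conditions without a class: {missing}")
    # With the T test scoring method, only classes 1 and 2 are allowed.
    if two_class_only and not set(classes.values()) <= {1, 2}:
        raise ValueError("T test scoring allows only classes 1 and 2")
    return classes
```

For the example class file above, the result would map `+Glucose` and `+Succinate` to `1` and `-Glucose` to `2`.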
## The Protein Network
* Protein networks must be loaded in Cytoscape, but do not need a view.
* The <i>ID</i> property of nodes must match the gene names specified in
the gene expression matrix. If a node's <i>ID</i> property does not match
any of the gene names in the expression matrix, it will be ignored.
# Options
<ul>
<li><b>Score Model:</b> the score model to use (mutual information or T test); see the scoring step and Statistical Test 2 in the Overall Process.</li>
<li><b>Number of Random Trials:</b> the number of random trials to calculate; see step 4 in the Overall Process.</li>
<li><b>ST 1 P-value cutoff:</b> the P-value cutoff to determine if a module is
significant for Statistical Test 1.</li>
<li><b>ST 2 P-value cutoff:</b> the P-value cutoff to determine if a module is
significant for Statistical Test 2.</li>
<li><b>ST 3 P-value cutoff:</b> the P-value cutoff to determine if a module is
significant for Statistical Test 3.</li>
<li><b>Number of ST 3 Trials:</b> the number of randomizations to perform for
Statistical Test 3.</li>
<li><b>Max Node Degree:</b> the maximum number of adjacent nodes a node can
have and still be added to a module. This is useful to prevent <i>hubs</i>, or
nodes with a lot of edges, from being over-represented in the results.</li>
<li><b>Min Improvement:</b> the minimum percentage by which a node must improve
a module's score in order to be added to the module.</li>
<li><b>Max Module Size:</b> the maximum number of nodes a module can contain.</li>
<li><b>Max Radius:</b> only allow nodes that are within the specified distance of
the starting node.</li>
</ul>
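
As an illustration of how <b>Max Node Degree</b> and <b>Max Radius</b> could restrict which nodes may join a module, consider the following Python sketch. It rests on assumptions not stated above: distance is counted as the number of edges on the shortest path from the starting node, the starting node itself is never excluded, and the function names are hypothetical rather than part of PinnacleZ.

```python
from collections import deque
from typing import Dict, Set


def nodes_within_radius(adjacency: Dict[str, Set[str]], start: str,
                        max_radius: int) -> Set[str]:
    """Breadth-first search: nodes at most `max_radius` edges from `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_radius:
            continue  # do not expand beyond the radius
        for nb in adjacency.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return set(dist)


def eligible_nodes(adjacency: Dict[str, Set[str]], start: str,
                   max_degree: int, max_radius: int) -> Set[str]:
    """Nodes inside the radius whose degree does not exceed `max_degree`."""
    reachable = nodes_within_radius(adjacency, start, max_radius)
    return {
        n for n in reachable
        # Assumption: the starting node is kept regardless of its degree.
        if n == start or len(adjacency.get(n, ())) <= max_degree
    }
```

In this sketch a hub still counts as a step when measuring distance; it is only barred from being added to the module.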