About the Programmers Guide

GenePattern provides access to a broad array of computational methods used to analyze genomic data. Its extendable architecture makes it easy for computational biologists to add analysis and visualization modules, which ensures that GenePattern users have access to new computational methods on a regular basis.

If you are new to GenePattern, begin with the basics:

Concepts provides a brief introduction to GenePattern and its primary objects (modules, pipelines, suites). All other GenePattern documentation assumes that you are familiar with these concepts.
Quick Start provides a brief hands-on introduction to GenePattern.

This guide assumes that you are a programmer and familiar with GenePattern. As a programmer, you generally work with GenePattern in one of two ways:

Creating GenePattern modules. Each GenePattern module invokes a program that executes a desired function. You can use any language to write a program that can then be invoked as a GenePattern module. For more information, see the following sections of this guide:
- Creating Modules describes how to create and edit a GenePattern module.
- Writing Modules provides tips for writing code that will be invoked as a GenePattern module.
Accessing GenePattern from Java, MATLAB, R, or Python. GenePattern libraries for these four programming environments make it easy for your applications to run GenePattern modules and retrieve analysis results. Each library supports arbitrary scripting, access to GenePattern modules via function calls, and development of new methodologies that combine modules in arbitrarily complex combinations. For more information, see the following sections of this guide:

Creating Modules: The Module Integrator

Only the GenePattern team can create, edit, or install modules on the GenePattern public server. Therefore, to create a module, you must have a local GenePattern server installed (see Starting Your Own GenePattern Server).

Creating a GenePattern module is a multi-step process:

Find or write a program that executes the desired function. Any program that can be executed from the command line can be run as a GenePattern module. Programs encapsulated in Docker containers are preferred as Docker handles the management of code dependencies. If you are writing the program and creating the Docker container, you can use any programming language; for example, you can use a compiled language, such as C, to create an executable or use a scripting language, such as Perl, to create a script that is run by an interpreter. For more information, see Writing Modules for GenePattern.
Use GenePattern to create a module that invokes the program that you have written. It takes just a few minutes to enter the necessary information. You can decide which parameters from the algorithm to expose to the user and can replace command line parameter names that are hard to remember with names that are self-explanatory. You can also create drop-down list choices for parameters to reduce the possibility of invoking the module with incorrect values. This section provides information on how to create and edit modules in GenePattern.
Run the module several times, testing it thoroughly before making it available to other GenePattern users.

Malicious code: By adding a module, a user can execute arbitrary code on the GenePattern server. Because arbitrary code may include malicious code, take precautions to protect your server: for example, employ virus scanner software and restrict access to appropriately privileged (non-root) users. For more information about securing your server, see Securing the Server.

Creating a Module

To create a module in GenePattern:

Click Modules & Pipelines > New Module. GenePattern displays the module integrator.
Enter values for the module properties. For more information, click the Help icon.
Click Save to create the module. GenePattern checks the following:
- Every parameter not marked as optional is included in the command line.
- Every parameter in the command line is a parameter, environment variable, or system property.
- Module parameter names are valid.
If no errors are found, GenePattern copies the support files to the server and makes the module available to the GenePattern clients.
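
The three checks listed above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not the server's actual implementation; the function name check_command_line, the shape of its arguments, and the naming pattern are assumptions made for the example, and derived names such as <input.filename_basename> are not handled.

```python
import re

# Matches <name> placeholders such as <input.filename> or <java_flags>.
PLACEHOLDER = re.compile(r"<([A-Za-z0-9_.]+)>")
# Assumed naming rule: letters, numbers, and periods only.
VALID_NAME = re.compile(r"[A-Za-z0-9.]+")


def check_command_line(command_line, parameters, known_properties):
    """Return a list of problems; an empty list means every check passed.

    parameters: dict mapping parameter name -> True if the parameter is optional.
    known_properties: names that resolve as environment variables or system
    properties (a hypothetical stand-in for the server's own lookup).
    """
    used = set(PLACEHOLDER.findall(command_line))
    problems = []
    for name, optional in parameters.items():
        if not VALID_NAME.fullmatch(name):
            problems.append(f"parameter name {name!r} is not valid")
        if not optional and name not in used:
            problems.append(f"required parameter <{name}> is missing from the command line")
    for name in sorted(used):
        if name not in parameters and name not in known_properties:
            problems.append(f"<{name}> is not a parameter or a known property")
    return problems


command = "<java> -cp <libdir>mymodule.jar com.foo.MyModule <arg1>"
print(check_command_line(command, {"arg1": False}, {"java", "libdir"}))
```

An empty list corresponds to the case where GenePattern finds no errors and saves the module.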

The in-depth article Creating a GenePattern Module provides a hands-on tutorial that walks you through the process of creating a GenePattern module.

Licensed Modules

Modules can have end-user license agreements (EULAs) attached to them. In order to run a licensed module (either standalone or within a pipeline), users must view and accept the license terms; GenePattern will display the license and prompt the user to accept it. After a user accepts a licensed module's terms, that user can run the module and GenePattern will not prompt the user to re-accept the license on subsequent runs. However, a user will need to re-accept the license whenever the licensed module is updated to a new version. The GenePattern Server records each acceptance of a licensed module's terms and reports these records to a database application hosted at the Broad Institute. If this module will only be on your GenePattern server and you need to maintain your own database for the terms agreements, contact the GenePattern development team for assistance.

Academic users can agree to the license and use the module. Commercial users, however, need to contact the GenePattern team to work out terms for use, particularly if commercial users are establishing their own GenePattern server on which to host the licensed module.

The EULA terms need to be in a text document (.txt). You can add a EULA to your module in the module integrator by clicking Add License File.

Editing Modules

To edit a module:

Click Modules & Pipelines to display the GenePattern home page.
Open the module integrator in one of the following ways:
- Select a module that you created. When GenePattern displays the module parameters, click Edit.
- Select a public module. When GenePattern displays the module parameters, click Properties. When GenePattern displays the module properties, click Clone to create a copy of the module. Enter a name for the new copy of the module. When GenePattern displays the properties of the new module, click Edit.
Enter values for the module properties. For more information, click the Help icon.
Click Save to create a new version of the module.

Module Properties

When you create or edit a module, GenePattern displays its properties in the module integrator. Click the Help icon to display descriptions of each property and its valid values. The help text is provided here for your convenience.

Module Integrator Help Text

When you create or edit a module, GenePattern displays its properties in the module integrator. Click the Help icon to display the following descriptions of each property and its valid values.

Creating and Editing Modules

Note: Only the GenePattern team can create, edit or install modules on the GenePattern public server. Therefore, to create a module, you must have a local GenePattern server installed.

Creating a GenePattern module is a multi-step process:

Find or write a program that executes the desired function. Any program that can be executed from the command line can be run as a GenePattern module. If you are writing the program, you can use any programming language; for example, you can use a compiled language, such as C, to create an executable or use a scripting language, such as Perl, to create a script that is run by an interpreter.
Use GenePattern to create a module that invokes the program that you have written. It takes just a few minutes to enter the necessary information in the module integrator. You can decide which parameters from the algorithm to expose to the user and can replace command line parameter names that are hard to remember with names that are self-explanatory. You can also create drop-down list choices for parameters to reduce the possibility of invoking the module with incorrect values.
Run the module several times, testing it thoroughly before making it available to other GenePattern users.

When you save your changes, the module properties that you have entered are validated as follows:

Every parameter you have not marked as optional must be listed in the command line.
Every command line parameter must be either a parameter, environment variable, or system property.
The module name and parameter names must be legal - in general, you should avoid punctuation marks and other special characters.

If everything checks out, the uploaded files are saved in the GenePattern module library and the module registered in the module database. The module and its uploaded files are indexed in the background so that they are available for searching. You can run the module immediately and can share it with others.

The following sections describe each module property in detail:

Title Bar
Details
Support Files
Command Line
Parameters

An example for each property is given based on the Consensus Clustering module, which may be found in the module repository.

Title Bar

Name

The name of the module is used in the drop-down module catalog lists and as the name of the module's directory on the server. It should be short but descriptive, without spaces or punctuation, and may be mixed upper- and lower-case.

ConsensusClustering example: ConsensusClustering

Version

Each time you update a module, you create a new version of the module. Typically, you want to edit the most recent version of a module. If you want to edit an earlier version, select that version from the drop-down list of versions.

Help

Click Help to display this text.

Save

Click Save to save your changes, creating a new version of the module, and remain in the module integrator.

Save and Run

Click Save and Run to save your changes, creating a new version of the module, exit from the module integrator and run the module.

Details

LSID

The Life Science Identifier (LSID) used to uniquely identify a GenePattern module. You cannot create or edit LSIDs. They are created automatically by the GenePattern server when a module is saved.

ConsensusClustering example: urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00030:5
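
If a script needs to take an LSID apart, a minimal sketch follows. It assumes the general LSID layout urn:lsid:authority:namespace:object:revision; the field names and the parse_lsid helper are illustrative and are not part of any GenePattern library.

```python
from typing import NamedTuple


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: str


def parse_lsid(text):
    """Split a six-part LSID string into its fields (assumed layout)."""
    parts = text.split(":")
    if len(parts) != 6 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not a six-part LSID: {text!r}")
    return Lsid(*parts[2:])


lsid = parse_lsid(
    "urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00030:5"
)
print(lsid.authority, lsid.object_id, lsid.revision)
```

In the ConsensusClustering example the last field is 5, which appears to correspond to the module version, since a new LSID is issued each time the module is saved.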

Description

Use the description to explain what your module does and why someone would want to use it. It can be anywhere from a sentence to a short paragraph in length. The description, sometimes in abridged form, is displayed in the pipeline designer module choice list, in generated code when creating scripts from pipelines, and in the web client. It is a good place to document succinctly why your module exists.

ConsensusClustering example: Resampling-based clustering method

Author

Enter the author's name. If you share this module with others, they will know how to give the author credit and whom to contact with questions, suggestions, or enhancement ideas.

ConsensusClustering example: Stefano Monti

Organization

Enter the author's affiliation (company or academic institution). If you share this module with others, they will know how to give the author credit and whom to contact with questions, suggestions, or enhancement ideas.

ConsensusClustering example: Broad Institute

License

Upload a text file containing the End-User license agreement. Users will be prompted to accept this license before running the module.

Version Comment

Enter a brief description of the changes that you have made to the module. When GenePattern clients display a drop-down list of versions, the comments for each version are visible in the drop-down list.

ConsensusClustering example: Added ability to create heatmap images of clusters

Module Category

On the GenePattern home page, modules and pipelines are organized by categories. Pipelines are always assigned to the category name pipeline. When you create/update a module, you can choose an existing category name or create a new category name. If your module fits into an existing category, such as Preprocess & Utilities, select that category from the drop-down list; otherwise, click the New button to add a new category. GenePattern creates the drop-down list of categories dynamically based on the categories of the modules installed on your GenePattern server. If you delete the last module in a given category, that category is removed from the drop-down list.

ConsensusClustering example: Clustering

Privacy

Modules may be marked as either public or private. When a module is first created, the default is to mark it private.

Public modules are accessible to everyone who uses the server on which they reside.
Private modules may be accessed only by the module's owner, which is the username that the user logged in with. Private modules are not visible to others building pipelines or running modules.

ConsensusClustering example: public

Quality Level

The quality level is a simple three-level classification that lets the user know what level of confidence the author has in the robustness of the module. In increasing order of quality expectations, the levels are "development", "preproduction", and "production". Although these terms have no strict definitions, they are useful for setting user expectations. If you make this module public, set the quality level appropriately.

ConsensusClustering example: production

CPU Type

If your module is compiled for a specific platform (Intel, Alpha, PowerPC, etc.), indicate that here. CPU requirements are enforced when the module is run.

ConsensusClustering example: any

Operating System

If your module requires a specific operating system (Windows, Linux, MacOS, etc.), indicate that here. Operating system requirements are enforced when the module is run.

ConsensusClustering example: any

Language

There is no specific language support or requirement enforcement at this time. However, by describing the primary language that a module is implemented in, you give some hints to the prospective user about their system requirements.

ConsensusClustering example: Java

min. language level

If your module requires at least a certain revision of the language runtime environment (e.g., 1.3.1_07), indicate that here. This is not currently enforced, but provides useful information to the prospective module user.

ConsensusClustering example: none specified

Docker Image

Each GenePattern job runs in a Docker container. Each module must declare a Docker image by setting the job.docker.image property in the manifest file. This value is passed as the image argument to the docker run command. For production modules, this should be a tagged version of an image from a public repository.
Docker image specification format: IMAGE[:TAG|@DIGEST]
e.g. genepattern/docker-java17:0.12

When a module is run, the command line is passed into the container via a "docker run" command. This is similar to the following pattern for running the program on its own (without GenePattern) in Docker:
$ docker run [OPTIONS] <job.docker.image> <commandLine>
Note: you should NOT include "docker run [OPTIONS]" or the Docker image in the module command line. The GenePattern server will format the docker run command and options.
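
To make the division of labor concrete, the following Python sketch shows how a command of the shape docker run [OPTIONS] <job.docker.image> <commandLine> could be assembled from a module's image and command line, and how an IMAGE[:TAG|@DIGEST] string can be split. The helper names are hypothetical, and the --rm option is used only as an example; the options the GenePattern server actually passes are not described in this guide.

```python
import shlex


def split_image_ref(ref):
    """Split IMAGE[:TAG|@DIGEST] into (image, tag, digest); absent parts are None."""
    if "@" in ref:
        image, digest = ref.split("@", 1)
        return image, None, digest
    # Only a colon after the last slash separates a tag; an earlier colon
    # belongs to a registry host:port prefix.
    slash = ref.rfind("/")
    colon = ref.rfind(":")
    if colon > slash:
        return ref[:colon], ref[colon + 1:], None
    return ref, None, None


def build_docker_run(image, command_line, options=()):
    """Compose an argument list: docker run [OPTIONS] <image> <commandLine>."""
    return ["docker", "run", *options, image, *shlex.split(command_line)]


argv = build_docker_run(
    "genepattern/docker-java17:0.12",
    "java -cp mymodule.jar com.foo.MyModule input.gct",
    options=["--rm"],
)
print(shlex.join(argv))
print(split_image_ref("genepattern/docker-java17:0.12"))
```

Only the second argument in this sketch corresponds to what a module author enters in the command line field.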

Output File Formats

Select the file formats of the output files generated by your module. If your module generates an output file format not included in the list, click New to add that format to the list.

Support Files [obsolete]

Note on support file obsolescence: This section on Support Files reflects older (pre-Docker) versions of GenePattern. This section's behavior is still supported (as of 3.9.11b377) for older modules, but the preferred method is now to have all support files pre-installed in the Docker container.

Any files required by your module, such as scripts, libraries, property files, DLLs, executable programs, etc. must be uploaded to the server. These files may be referenced in the command line field using the <libdir>filename nomenclature. There is no upper limit on the number of files which may be uploaded, assuming there is enough space.

To add a file, click Add Files and select the file to add. When you save the module, GenePattern copies the file to the server and adds it to the Current Files list.
To remove a file, select the check box next to the file in the Current Files list. When you save the module, GenePattern removes the file from the server and the Current Files list.

Files that have been uploaded appear as links in this section. You may view or download them by clicking appropriately in your browser.

Help Files: Public modules should always include a help file that provides instructions for using the module, a detailed description of each input parameter, a detailed description of each output file (both its format and content), and either an explanation of the algorithm or a reference to the paper, journal or book that explains it.

To add a help file to your module, include the appropriate text file as the first text file in the list of Support Files.

When a user selects your module, GenePattern displays a form that includes the module parameters and a Help button. When the user clicks the Help button, GenePattern examines the list of support files for the module and displays the first file that has a standard documentation file extension. If no documentation file was provided, GenePattern displays a message indicating that no information is available. (By default, the standard documentation file extensions are html, htm, xhtml, pdf, rtf, and txt. You can modify this list of extensions by editing the files.doc property in the GenePattern /resources/genepattern.properties file.)
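
The selection rule described above (the first support file with a standard documentation extension wins) can be sketched as follows. The function name and the case-insensitive comparison are assumptions for illustration; the extension list is the default list given above.

```python
import os

# Default documentation extensions, as listed for the files.doc property.
DOC_EXTENSIONS = ("html", "htm", "xhtml", "pdf", "rtf", "txt")


def find_help_file(support_files, doc_extensions=DOC_EXTENSIONS):
    """Return the first support file with a documentation extension, or None."""
    for name in support_files:
        extension = os.path.splitext(name)[1].lstrip(".").lower()
        if extension in doc_extensions:
            return name
    return None


files = ["Acme.jar", "archiver.jar", "ConsensusClustering.pdf", "version.txt"]
print(find_help_file(files))
```

With this ordering the PDF is chosen even though version.txt is also a documentation-type file, which is why the help file should come first among the text files in the list.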

ConsensusClustering example: Current files: Acme.jar archiver.jar common_cmdline.jar ConsensusClustering.pdf file_support.jar geneweaver.jar gp-common.jar ineq_0.2-2.tar.gz ineq_0.2-2.tgz jaxb-rt-1.0-ea.jar my.local.install.r RunSomAlg.jar trove.jar version.txt

Command Line

The crux of adding a module to the GenePattern server is to provide the command line that will be used to launch the module, including substitutions for settings that will be specified differently for each invocation. In the command line field, you will provide a combination of the fixed text and the dynamically-changed text which together constitute the command line for an invocation of the module.

Perhaps the trickiest thing about specifying a command line is making it truly platform-independent. To accomplish this we now use Docker container images to encapsulate the runtime environment for a module. Older versions of GenePattern would dynamically modify the command line for the server they are on.

Parameters: Parameters that require substitution should be enclosed in angle brackets (i.e., <filename>). Every parameter listed in the parameters section must be mentioned in the command line unless its optional field is checked. A default value may be provided and will be used if the user fails to specify a value when invoking the module.

Click the View Argument List button to display a list of the parameters mentioned in the command line. You can change the order of the parameters by dragging them to a new position in the list or by editing the text of the command line.

Substitution properties: In addition to parameter names, you may also use environment variables, Java system properties, and any properties defined in the %GenePatternInstallDir%/resources/genepattern.properties file. In particular, there are predefined values for <java>, <perl>, and <R>, three languages that are used within various modules that may be downloaded from the module catalog at the public GenePattern website. Useful substitution properties include:

<java>	path to Java, the same one running the GenePattern server. For Docker images java is assumed to be on the path within the container and will substitute as just "java".
<perl>	path to Perl, installed with GenePattern server on Windows, otherwise the one already installed on your system. For Docker images perl is assumed to be on the path within the container and will substitute as just "perl".
<R>	path to a program that runs R and takes as input a script of R commands. R is installed with GenePattern server on Windows and MacOS. For Docker images R is assumed to be on the path within the container and will substitute as just "Rscript".
<java_flags>	memory size and other Java JVM settings from the GenePattern/resources/genepattern.properties file
<libdir>	directory where the module's support files are stored
<job_id>	job number
<name>	name of the module being run
<filename_basename>	for each input file parameter, the filename without directory or extension
<filename_extension>	for each input file parameter, the extension without filename or directory
<filename_file>	for each input file parameter, the input filename without directory
<path.separator>	Java classpath delimiters (: or ;), useful for specifying a classpath for Java-based modules
<file.separator>	/ or \ for directory delimiter
<line.separator>	newline, carriage return, or both for line endings
<user.dir>	current directory where the job is executing
<user.home>	user's home directory

Rather than having to customize your module's command line for the exact location of the language runtime on each computer, you can use the substitution properties. For example,

<java> -cp <libdir>mymodule.jar com.foo.MyModule <arg1>

GenePattern will then take care of locating the Java runtime, asking it to begin execution at the MyModule class using code from the uploaded file mymodule.jar.
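
As a rough illustration of how substitution works, the sketch below replaces each <name> placeholder in a command line with a value from a dictionary. It is not the server's implementation; the substitute function, the example values, and the trailing separator on the libdir value (inferred from the <libdir>mymodule.jar usage) are assumptions.

```python
import re

# Matches <name> placeholders such as <java>, <libdir>, or <input.filename>.
PLACEHOLDER = re.compile(r"<([A-Za-z0-9_.]+)>")


def substitute(command_line, values):
    """Replace every <name> in command_line with values[name]."""

    def replace(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for <{name}>")
        return values[name]

    return PLACEHOLDER.sub(replace, command_line)


values = {
    "java": "java",                 # inside a Docker image, just "java"
    "libdir": "/opt/gp/MyModule/",  # hypothetical location, with trailing separator
    "arg1": "input.gct",            # value supplied by the user
}
print(substitute("<java> -cp <libdir>mymodule.jar com.foo.MyModule <arg1>", values))
```

Because the placeholders are resolved at run time, the same module definition can be installed on servers where the runtime and support files live in different locations.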

Standard input/output: If your module is designed to accept a standard input stream and/or write to a standard output stream, you can use redirection syntax when describing the command line. To redirect a file to the input stream, enter the text \< followed by the input file parameter. To redirect the standard output or standard error streams to a named file, enter the text \> or \\>& followed by the name of the output file. In the following example, the LogTransform module reads its input from the standard input stream and writes its output to the standard output stream:

<perl> <libdir>log_transform.pl \< <input.filename> \> <output.file>
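
For comparison, a program written for this kind of redirection only needs to read standard input and write standard output. The Python sketch below is a hypothetical stand-in for such a script (the contents of log_transform.pl are not shown in this guide); it assumes tab-delimited input, takes log base 2 of numeric fields, and passes other fields through unchanged.

```python
import io
import math
import sys


def log_transform(source, sink, log=sys.stderr):
    """Copy tab-delimited lines from source to sink, log2-transforming numbers."""
    for line_number, line in enumerate(source, start=1):
        out = []
        for field in line.rstrip("\n").split("\t"):
            try:
                value = float(field)
            except ValueError:
                out.append(field)  # non-numeric field, e.g. a row label
                continue
            if value <= 0:
                # Problems go to the error stream, never to the data stream.
                print(f"line {line_number}: cannot take log of {field}", file=log)
                out.append("NaN")
            else:
                out.append(f"{math.log2(value):.4f}")
        sink.write("\t".join(out) + "\n")


# Self-contained demonstration; a real script would call
# log_transform(sys.stdin, sys.stdout) instead.
demo_out = io.StringIO()
log_transform(io.StringIO("geneA\t8\t2\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Keeping diagnostics on standard error means the redirected output file contains only data.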

ConsensusClustering example (actually all on one line):
<java> <java_flags> -DR_HOME=<R_HOME> -cp <libdir>geneweaver.jar edu.mit.wi.genome.geneweaver.clustering.ConsensusClustering <input.filename> <kmax> <niter> <normalize.type> -N <norm.iter> -S <resample> -t <algo> -L <merge.type> -i <descent.iter> -o <out.stub> -s -d <create.heat.map> -z <heat.map.size> -l1 -v

Parameters

The input parameters section of the form may appear to be the most daunting, yet little is required to make a working module declaration. Each parameter in the command line that comes from user input must have an entry in this section. Otherwise, the clients would not know how to prompt the user for input, nor could they explain to the user what type of input is expected.

To add one or more parameters, enter the number of parameters to add and click the Add Parameter button.

Name

Each parameter has a name, which can be whatever you like, using letters, numbers, and period as a separator character between "words". It can be of mixed upper- and lower-case. The name is used inside <brackets> within the command line to indicate that the value of that variable should be substituted at that position within the command line. The name is also used as a label within the web client to prompt the user for the value for that field. And the name is used as a way of identifying which parameter is which for the scripting clients.

ConsensusClustering examples: kmax, input.filename

Optional

Some parameters are not required on the command line. These parameters, when left blank by the user when the module is invoked, result in nothing being added to the command line for that parameter.

Description

The description field is optional, but is very useful. It allows the module author to provide a more detailed description than the name itself. What is the "kmax" parameter used for? Does it interact with any other parameters? Do you have any advice about what is a reasonable range of settings for it? The description is displayed by the GenePattern clients when they prompt for input for each field.

ConsensusClustering example: Type of clustering algorithm

Default Value

Some parameters should have a default value which will be supplied on the module's command line if no setting is supplied by the user when invoking the module. This is not the same as the defaults defined in the program invoked by the module. Instead, this allows the module author to create a default, even when none exists in the program being invoked by the module.

The default value may use substitution variables, just like the rest of the command line. So a valid default for an output file might be <input.filename_basename>.foo, meaning that the output file will have the same stem as the input.filename parameter, but will have a .foo extension.
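
The file-derived substitutions can be pictured with a short Python sketch. It assumes that _basename is the file name without directory or extension (matching the "same stem" description above), that _file is the file name without directory, and that _extension omits the leading period; the helper name and the example path are hypothetical.

```python
import os


def file_substitutions(param_name, path):
    """Return the substitution values derived from one input file parameter."""
    filename = os.path.basename(path)
    stem, extension = os.path.splitext(filename)
    return {
        param_name: path,
        f"{param_name}_file": filename,
        f"{param_name}_basename": stem,
        f"{param_name}_extension": extension.lstrip("."),  # assumed: no leading period
    }


values = file_substitutions("input.filename", "/data/all_aml_test.gct")
# A default of <input.filename_basename>.foo would then resolve to:
print(values["input.filename_basename"] + ".foo")
```

In this sketch an input of all_aml_test.gct yields a default output name of all_aml_test.foo.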

Default values for parameters that have a choice list must be either blank or one of the values from the choice list. Any other setting will result in an error message. If no default for a choice list is provided, the first entry on the list will be the default.

ConsensusClustering examples: NMF, 5, <input.filename_basename>

Flag

Some parameters need to have extra text prefixing them on the command line when they are specified. For example, you might need to write "-F filename" to pass in a filename. The prefix text "-F" or "-F " would be specified here. To insert a space between the flag and the parameter, add the space to the prefix text.

example (with space): -F inputfile
example (without space): -Finputfile
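
The interaction between the Flag and Optional settings can be sketched as a small function that returns the text one parameter contributes to the command line. The function name and its error handling are illustrative assumptions rather than GenePattern behavior.

```python
def render_parameter(value, flag="", optional=False):
    """Return the command line text contributed by one parameter."""
    if value is None or value == "":
        if optional:
            return ""  # a blank optional parameter adds nothing, not even its flag
        raise ValueError("a required parameter needs a value")
    # The flag is prepended verbatim, so any space must be part of the flag.
    return f"{flag}{value}"


print(render_parameter("inputfile", flag="-F "))  # -F inputfile
print(render_parameter("inputfile", flag="-F"))   # -Finputfile
print(repr(render_parameter("", flag="-F ", optional=True)))
```

Attaching the flag to the parameter, rather than typing it into the command line as fixed text, is what allows the flag to disappear together with a blank optional value.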

Type

Declaration of the type of an input parameter allows the client to make a smarter presentation of the input to the user. (As of GenePattern 1.2, all parameters are being treated as either text or input file types). Parameter type choices are:

Text
Input File
When you select a parameter type of input file, a drop-down list of file formats appears in the file format column. Select the valid file format(s) for this parameter. If your module requires an input file format not included in the list, scroll back to the Output Description field and click New to add that format to the list. For this type of parameter, when the user enters the name of the file, the GenePattern clients pass along the entire file rather than just the file name.
Choice
Some parameters are best represented as a drop-down list of choices. By constraining input to those from the list, the user is saved typing and cannot make a mistake by choosing an invalid setting (unless there is a dependency on some other parameter). To enter the choices, click the Edit Choices link and enter the choices in the Edit Choice List window.

For each choice enter the value required by the program (Value) and, optionally, a more human-readable value (Display Value). When you exit from the Edit Choice List window, the choices you entered are displayed as a semi-colon delimited set of choices. For example:

hierarchical=Hierarchical clustering;SOM=Self-organizing map;NMF=Non-negative Matrix Factorization;3.14159265=pi

would create a drop-down list that displays the choices Hierarchical clustering, Self-organizing map, Non-negative Matrix Factorization, and pi.
Integer
Floating Point
Directory
Password
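
The semicolon-delimited choice list shown under Choice, together with the default-value rule for choice lists described earlier, can be sketched as follows. The parsing rules (an entry without "=" uses its value as the display value) and both helper names are assumptions made for the example.

```python
def parse_choices(spec):
    """Parse 'value=Display;value=Display' into a list of (value, display) pairs."""
    choices = []
    for entry in spec.split(";"):
        value, separator, display = entry.partition("=")
        # Assumed: when no display value is given, the value itself is shown.
        choices.append((value, display if separator else value))
    return choices


def resolve_default(choices, default=""):
    """Apply the rule: blank means the first entry; otherwise it must be a listed value."""
    values = [value for value, _ in choices]
    if default == "":
        return values[0]
    if default not in values:
        raise ValueError(f"default {default!r} is not in the choice list")
    return default


spec = ("hierarchical=Hierarchical clustering;SOM=Self-organizing map;"
        "NMF=Non-negative Matrix Factorization;3.14159265=pi")
choices = parse_choices(spec)
print([display for _, display in choices])
print(resolve_default(choices), resolve_default(choices, "NMF"))
```

The value on the left of each pair is what reaches the program; the display value is only what the user sees.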

Writing Modules: Coding for GenePattern

For a simple example of a GenePattern module, please refer to https://github.com/genepattern/ExampleModule.

Creating a GenePattern module is a two-step process:

Use the guidelines provided below to write a program that executes the desired function.
Create a GenePattern module that invokes the program that you have written. For more information, see Creating Modules in GenePattern.

When writing a program that will be run as a GenePattern analysis module, keep in mind the following:

Use the programming language of your choice. You can use a compiled language, such as C, to create an executable or you can use a scripting language, such as Perl, to create a script that is run by an interpreter.
Write messages to standard error and standard output. GenePattern modules are run on the server. The user provides arguments and retrieves results, but does not interact during module execution. If necessary, write normal output to standard output (stdout) and error messages to standard error (stderr); avoid writing error messages to standard output. GenePattern captures stdout and stderr in log files, which can be retrieved by the user.
Write output files to the current working directory. When a module completes, GenePattern displays the output files that are in the current working directory. Files written to other locations are not displayed as module output files (otherwise known as analysis result files).
Read module data files from <libdir>. If your module needs to read from any data files which are part of the module (rather than user input), it will need to know the directory where the module lives on the server; that is, <libdir>. Note that if using a Docker container you may instead specify the exact path within the container to the support file.
Read and write standard GenePattern file formats. When reading and writing data files, you generally want to use the standard GenePattern file formats. This makes it easier for users to analyze their data using a combination of GenePattern modules. If you choose to use your own unique file formats, be aware that other GenePattern modules will not be able to read those files.
For Java, MATLAB, and R, GenePattern provides libraries that include methods for reading and writing GenePattern files (such as res, gct, and odf files). These libraries are designed for accessing GenePattern from the Java, MATLAB, and R environments, but are also useful when writing modules to be invoked by GenePattern. For instructions on downloading the libraries, see Using GenePattern from Java, Using GenePattern from MATLAB, or Using GenePattern from R.
Use parameter flags. When designing the program and its command line, use parameter flags (for example, -f input_file) rather than relying on parameter positions. Parameter flags allow users to build command lines with variable numbers of arguments, which makes it easy to omit optional parameters.
Process all parameters as strings. All command line parameters are passed to your code as strings, even if a parameter is apparently numeric. If your code expects a numeric argument, explicitly convert the string argument to a number; for example, as.integer(arg).
Avoid absolute pathnames. When writing code to be used with GenePattern, avoid absolute pathnames unless you know they are present within your module's Docker container. For example, in Perl, specify the interpreter on the command line rather than embedding the interpreter in the script; that is, use the command line "perl myscript" rather than including "#!/usr/bin/perl" as the first line of the myscript.pl file, unless you are certain of the path within the Docker container of your module.
Avoid Windows forbidden filenames. Machines running Windows cannot accept files with the following names, regardless of the file extension: con, prn, aux, nul, com1, com2, com3, com4, lpt1, lpt2, lpt3. For cross-platform compatibility, avoid files with these names.
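
Several of the guidelines above can be seen together in a small Python sketch of a program intended to be wrapped as a module. Everything in it is hypothetical: the flags -f, -n, and -o, the run function, and the task itself (copying the first N lines of a file) were chosen only to illustrate parameter flags, string-to-number conversion, error reporting on standard error, and writing output to the current working directory.

```python
import argparse
import os
import sys

# Names from the list above that Windows rejects regardless of extension.
WINDOWS_RESERVED = {"con", "prn", "aux", "nul", "com1", "com2", "com3",
                    "com4", "lpt1", "lpt2", "lpt3"}


def run(argv):
    """Copy the first N lines of the input file to an output file; return an exit code."""
    parser = argparse.ArgumentParser(description="Hypothetical example module")
    parser.add_argument("-f", "--input-file", required=True)
    parser.add_argument("-n", "--top-n", default="10")  # arrives as a string
    parser.add_argument("-o", "--output-file", default="result.txt")
    args = parser.parse_args(argv)

    try:
        top_n = int(args.top_n)  # convert explicitly; never assume a number
    except ValueError:
        print(f"top-n must be an integer, got {args.top_n!r}", file=sys.stderr)
        return 1

    # Keep only the file name so the output lands in the working directory.
    output_name = os.path.basename(args.output_file)
    if output_name.split(".")[0].lower() in WINDOWS_RESERVED:
        print(f"{output_name!r} is a reserved file name on Windows", file=sys.stderr)
        return 1

    with open(args.input_file) as source:
        lines = source.readlines()[:top_n]
    with open(output_name, "w") as sink:
        sink.writelines(lines)

    print(f"wrote {len(lines)} lines to {output_name}")  # normal output on stdout
    return 0
```

In a script file, the last line would typically be sys.exit(run(sys.argv[1:])), and the matching module command line might look like <python> <libdir>first_lines.py -f <input.filename> -n <top.n> -o <output.file>, where the <python> substitution, the script name, and the parameter names are likewise assumptions.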

Visualization modules are similar to analysis modules. The only difference between analysis and visualization modules is that analysis modules run on the server machine and visualization modules run on the client machine. Each module is launched in a separate process. An applet is used to launch the visualization module.

Writing R Modules

If you are writing R code to be invoked as a GenePattern module, follow the guidelines in Writing Modules for GenePattern. This section provides additional information for modules written in R:

Adapting R Code
Using R Packages
Supported R Versions
Using Parameter Flags

Adapting R Code

When you create a module in GenePattern you specify the command line that invokes the program that performs the desired function. Generally the command line includes arguments such as the parameters for the algorithm and the data file to analyze. This is roughly equivalent to the structure of a UNIX command line call. Since GenePattern on its own has no concept of an R session, your command line must create one to execute your code with its arguments. In the R world, the recommended way to accomplish this is the Rscript utility.

GenePattern simply launches Rscript and hands off the parameters, then waits for the results. It does nothing more than an equivalent UNIX command line call, so your R script must be structured to receive and process these parameters as string values and execute any necessary R functions. One way to do this is to declare any functions, constants, etc. up front in the script, with any processing of command line arguments near the end along with the main R function call.

A full tutorial on Rscript is beyond the scope of this Guide. For more details, see the official Rscript documentation at the link above, or - for your particular version - by typing '?Rscript' within R. Beyond that, please see our Tutorial module for example code and usage.

GenePattern provides substitutions to allow you to call Rscript. As an example, the following command line invokes R (version 3.1 in this case) on the myscript.R file, passing input.filename as a single parameter:

commandLine=<R3.1_Rscript> <libdir>myscript.R <input.filename>

Using R Packages

The GenePattern team maintains a list of R packages to facilitate their use in GenePattern modules. As of GenePattern 3.9.6 you can list these packages in a descriptor file named 'r.package.info' to be included in your module ZIP bundle. When installing your module, GenePattern will automatically attempt to install these packages if they are not already present. Once installed, the packages will be available to any other module using that version of R. To use them, simply include the usual 'library' calls in your R code. See our Tutorial module for an example.

The r.package.info file is a simple text file in comma-separated value format that specifies the name and (optionally) some other attributes of the package for GenePattern to use to identify, download, and install it.

In general, specifying only the first column (package name) is sufficient for basic use as that is enough for GenePattern to identify the package in both its installed package library and in the supported repositories (CRAN and Bioconductor at present). There are times when it's necessary to pin down a specific package version (due to errors in a CRAN package dependency spec, for example) or even a particular URL (for packages not hosted in one of these two repositories), which is where the other columns come into the picture. These are usually not needed until you are ready to distribute your module to others or run into version issues.

The section below gives a short example along with details for the file's columns and GenePattern's rules to process them. All package dependencies should be listed, not just the top-level ones, as GenePattern requires a complete list of what needs to be installed and managed. They should be listed in the order they are required, with earlier dependencies listed first. Remember to add the header line, as it is required.

Here's a short example based on the declaration in our Tutorial module:

package,requested_version,archive_name,src_URL,Mac_URL,Windows_URL
getopt,1.20.0,CRAN
optparse,1.0.2,CRAN
# Comments are added with the '#' character. Blank lines are ignored
limma
edgeR

The columns have the following meanings:

package	The unadorned package name, e.g. optparse or colorspace, as would be used in a library() or install.packages() call. This is required; the line will be skipped with a non-fatal warning if it's missing.
requested_version	The requested version string of the package, e.g. 1.0.2 or 1.2-4. This should exactly match the package’s declared version string as found in CRAN or Bioconductor (e.g. from optparse_1.0.2 or colorspace_1.2-4). This is optional, and its use is not necessary or even recommended except in conjunction with archive_name; see the notes below for details.
archive_name	A tag indicating which central package repository hosts this package. You can use CRAN or BIOC to indicate CRAN or Bioconductor respectively, though only the CRAN tag has any special meaning. Any other tag is silently ignored. This is also optional but has special meaning in conjunction with requested_version; see the notes below for details.
src_URL Mac_URL Windows_URL	The URL of a source package to be used on those platforms which install from source (i.e. Linux). This is optional and its use is not recommended unless you need a package not available in CRAN or Bioconductor. It will be ignored on other platforms. The Mac_URL and Windows_URL columns have analogous meanings for the Mac OS X and Windows platforms, though these should point to binary packages for the appropriate platform. NOTE: the Windows platform is no longer officially supported.

GenePattern will check its installed package library and attempt to install based on the following rules:

If a package with that name is already present then the entry is skipped and nothing is installed, regardless of any specified version.
Otherwise, if a URL is specified in the column appropriate for the current platform, that will be used for installation regardless of any other specifications. Currently only ftp and http URLs are supported.
Otherwise, if no such URL is specified but (a) the CRAN archive name AND (b) a version string are specified AND (c) we're installing from source (i.e. on Linux), then GenePattern will construct a CRAN URL to use for installation. The version string should exactly match the way it is specified in the package file name.
NOTE: there are two possible URL conventions, one for the Archived packages and one for the Current packages. Both URLs are built and tried in turn, starting with the Archived URL since pinned packages in general will not be Current.
Otherwise, the package name will be fed into the biocLite installer method which will try to install it from either CRAN or Bioconductor as appropriate. In this case, all other info on the line is considered to be "informational-only" (that is, for documentation purposes) and will not affect installation.
NOTE: for CRAN, this approach gives a "floating" version which will track their latest available at the time of install. Beware as this can be the source of future issues if/when the package is updated in CRAN.
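
The file format and the rules above can be sketched as follows. This is a minimal illustration in Python of the decision order only, not GenePattern's actual installer: the function names, the action labels ("skip", "url", "cran_pinned", "default_installer"), and the platform labels are hypothetical, and the sketch assumes that comments occupy whole lines and that the first remaining line is the header. It does not build real download URLs or install anything.

```python
import csv
import io

HEADER = ["package", "requested_version", "archive_name",
          "src_URL", "Mac_URL", "Windows_URL"]

# Which URL column applies on which platform (platform labels are illustrative).
URL_COLUMN = {"linux": "src_URL", "mac": "Mac_URL", "windows": "Windows_URL"}


def parse_package_info(text):
    """Parse r.package.info text into a list of dicts, one per package line."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    if not lines:
        return []
    rows = csv.reader(io.StringIO("\n".join(lines[1:])))  # lines[0] is the header
    entries = []
    for row in rows:
        # Pad short rows so that omitted trailing columns read as empty strings.
        padded = [cell.strip() for cell in row] + [""] * len(HEADER)
        entry = dict(zip(HEADER, padded))
        if entry["package"]:  # a line without a package name is skipped
            entries.append(entry)
    return entries


def choose_install_action(entry, installed, platform):
    """Apply the four rules in order and return an (action, detail) tuple."""
    if entry["package"] in installed:
        return ("skip", "")
    url = entry[URL_COLUMN[platform]]
    if url:
        return ("url", url)
    if (entry["archive_name"] == "CRAN" and entry["requested_version"]
            and platform == "linux"):
        # Detail mirrors the package file name stem, e.g. optparse_1.0.2.
        return ("cran_pinned", entry["package"] + "_" + entry["requested_version"])
    return ("default_installer", entry["package"])
```

With the example file shown earlier, getopt resolves to a pinned CRAN install on Linux, while limma and edgeR fall through to the default installer with a floating version.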

Note that the packages are shared across all modules for that version of R, so choosing a different version may not be possible (we are considering support for this in the future). In practice, the GenePattern team will pin CRAN packages to a specific version to prevent after-the-fact version drift and conflicts across packages. If you submit a module for use on one of our public servers, we will likely change the r.package.info file to make sure it plays well with existing modules.

What Are the Benefits of This Method?

With this method of installing R packages there is no need to bundle them in the module ZIP, nor do you need to write extra code to install them when the module is first run. It also means that any package installation errors will be detected during the installation of the module, rather than having it seem to install correctly only to fail at run time. Furthermore, GenePattern checks the packages ahead of each job run, pre-emptively catching any inadvertent changes in the back-end file system. Last but not least, this allows packages to be installed only once per server and then shared between modules rather than being installed for each individual module.

How to Include an R Package

The easiest way is to use the entries in the r.package.info file from one or more of our existing modules and/or use our Tutorial module as a template for your own. Note that choosing the correct mix of packages is usually dependent on the version of R you are using, so make sure the r.package.info files come from modules matching that version.

For a list of available R libraries from the GenePattern team, please contact us at gp-help AT broadinstitute DOT org.

Note that in order to use this example module, you will need to install R-3.1.3. For information on how to do so please see Using Different Versions of R in GenePattern.

Supported R Versions

We recommend you develop modules with the latest version of R supported by GenePattern (R-3.1.3 at the time of this writing, April 2016). Note that for purposes of stability this will usually lag the current version available from CRAN. Older versions of R are also available in GenePattern to support existing modules, but it's best to avoid these unless you have particular compatibility needs (e.g. porting old code).

Also be aware that there are known issues running multiple versions of R on the Mac platform that have nothing to do with GenePattern. See our Administrators Guide for details.

The automatic package installer works with Rscript modules for R >= version 2.14. As of this writing (April 2016):

<R2.14_Rscript>, <R2.15_Rscript>, <R3.0_Rscript>, <R3.1_Rscript>, and <R3.2_Rscript> are supported on our public servers.

As a note for Administrators: Rscript modules use the built-in run-rscript.sh wrapper script; site-specific customization is defined in the GenePatternServer/resources/custom.properties file.

# example custom.properties entry

R3.1_Rscript=<run-rscript> -v 3.1 --

To reverse-engineer for your GP instance, look at the Admin > Server Settings > Custom properties page (which is a view of the GenePatternServer/resources/custom.properties file). Additionally, the wrapper.properties file is a template for manually configuring your server:

GenePatternServer/resources/wrapper_scripts/wrapper.properties (also available on GitHub)
GenePatternServer/resources/wrapper_scripts/run-rscript.sh

Circa the 3.9.7 release, these scripts are in use on Broad hosted production servers.

We target the final point-version for each R major.minor line. This is for stability; it can affect the version of Bioconductor, for example. As of this writing, those versions are:

R-2.14.2, R-2.15.3, R-3.0.3, R-3.1.3, and R-3.2.5

Using Parameter Flags

When designing the program and its command line, use parameter flags (for example, -f input_file) rather than relying on parameter positions. Parameter flags allow users to build command lines with variable numbers of arguments, which makes it easy to omit optional parameters. When writing R code, if you have optional input parameters on the command line, you must use named rather than positional parameters in the command line definition.

For example, to write an R function that takes a filename as input, the main function might be:

myfunction <- function(...)
{
   args <- list(...)
   for(i in seq_along(args)) {
      # each argument has the form <flag><value>: a two-character flag followed by its value
      flag <- substring(args[[i]], 1, 2)
      value <- substring(args[[i]], 3, nchar(args[[i]]))
      if(flag=='-i')
      {
        # code to set variables, etc...
      }
   }
}

There are good alternatives available such as the optparse package. See our Tutorial module for an example.

Writing MATLAB Modules

If you are writing MATLAB code to be invoked as a GenePattern module, follow the guidelines in Writing Modules for GenePattern. In addition, for MATLAB code, you must address licensing and distribution issues, as described in this section:

Two Approaches: Direct and Compiled
MATLAB Versions
Adapting Your MATLAB Code
Compiling Your MATLAB Code
Distributing Your MATLAB Code
Example: Deploying a Compiled MATLAB Application

Two Approaches: Direct and Compiled

You can invoke a MATLAB executable from a GenePattern module using one of two approaches: the direct approach or the compiled approach. Following are brief descriptions of each approach, including its advantages and disadvantages:

Direct approach. In the direct approach, the GenePattern module directly invokes the MATLAB executable, which executes your M-code. This approach is best suited for use on a standalone GenePattern server, where you already have a MATLAB license and you will not redistribute the MATLAB-based GenePattern modules to other users who do not have their own MATLAB licenses. The advantages of this approach are: it is the simplest way to get your M-code running on GenePattern, it can be used on any platform supported by both MATLAB and GenePattern, and it allows for easier modification of the M-code files as you modify your analysis. The disadvantages of this approach are: it requires a MATLAB license for each concurrent user on the GenePattern server machine and, if you change platforms, you must change the command line because different platforms have different methods of passing arguments to MATLAB.
Compiled approach. In the compiled approach, you use the MATLAB Compiler to generate a standalone executable; the GenePattern module then invokes that executable. The advantages to this approach are: it allows redistribution of the module to other GenePattern users who do not have their own MATLAB license and it can be run on a shared server without needing to get a MATLAB license for each concurrent user. The disadvantages are: it requires you to have a MATLAB Compiler license, it can be used only on platforms supported by the MATLAB Compiler, it must be compiled separately for each platform, and the GenePattern server must have the MATLAB Component Runtime (MCR) installed before it can run the compiler-generated executable.

If you are simply using your M-code on your standalone GenePattern server, the direct approach is simpler; however, if you want to give copies of your M-code to other people or deploy your M-code on a shared GenePattern server, the compiled approach is preferred. The compiled approach may provide slightly better performance for fast running modules since the startup delay will be shorter, but the actual execution time will be approximately the same for either approach.

MATLAB Versions

The instructions in this section are based on the following MATLAB versions:

MATLAB 7.1 (part of Release 14). Earlier versions of MATLAB use a different (deprecated) mechanism for deployment, which is not compatible with the instructions provided in this guide. For Mac OS X, these instructions were tested using MATLAB 7.2.
MATLAB Compiler 4.0 or later. The MATLAB Compiler is currently available on Windows, Unix, and Mac OS X; therefore, these are the only platforms on which you may deploy a compiled MATLAB-based module. For Mac OS X, these instructions were tested using the MATLAB Compiler 4.4.
The MATLAB Compiler generates executables only for the platform on which it is executing. For example, if you create a MATLAB executable on Windows, the MATLAB-based module that invokes that executable can only be deployed on a GenePattern server running under Windows. Therefore, you need a MATLAB (and MATLAB Compiler) license for each platform on which you wish to deploy your executable.

Adapting Your MATLAB Code

When you create a module in GenePattern, you specify the command line that invokes the program that performs the desired function. Generally, the command line includes arguments, such as the parameters for the algorithm and the data file to analyze.

Calling script M-code from a command line is possible, but generally not useful because you cannot pass arguments to the script. To pass arguments to your M-code, create a no-return entry function to serve as the top level call into MATLAB. The following example defines a no-return entry function that accepts two parameters:

function analyzeThis ( filename, whatToWrite )

... (your M-code here)

Writing Modules for GenePattern provides additional guidelines for writing code that will run as a GenePattern module.

Compiling Your MATLAB Code

If you do not plan to use the compiled M-code approach, skip this section and continue with Distributing Your MATLAB Code.

Compiling your MATLAB M-code into a standalone executable is described in the MATLAB Compiler Documentation. Please refer to this documentation to understand all of the options available to you. To summarize the simplest case, from within MATLAB, at the MATLAB prompt, execute the following command:

mcc -m analyzeThis

where analyzeThis is the name of your entry function. This command generates the following files in your $MATLAB_ROOT/work directory:

analyzeThis (Linux, Mac OS X) or analyzeThis.exe (Windows)	Executable file
analyzeThis.ctf	Component Framework file
analyzeThis.c (Linux, Windows)	C language Source Code
analyzeThis.h (Linux, Windows)	C language Header file
analyzeThis_main.c	C language Source Code
analyzeThis_mcc_component_data.c	C language Source Code

Note: To use the MATLAB compiler on Mac OS X, you must have Xcode 2.2 installed; minimally, the Developer Tools, gcc 4.0, gcc 3.3, Mac OS X SDK, and BSD SDK. These instructions were tested using Xcode 2.2.1.

Distributing Your MATLAB Code

After writing your MATLAB code, create a GenePattern module that invokes the code that you have written. Creating Modules in GenePattern describes how to create a GenePattern module. This section provides supplemental information for MATLAB:

Direct Approach Distribution
Compiled Approach Distribution

Direct Approach Distribution

Creating Modules in GenePattern describes how to create a GenePattern module that invokes the code that you have written. This section provides additional information that applies when you are directly calling the MATLAB executable from the GenePattern module:

Windows Command Line
Preferred Command Line

Windows Command Line

On Windows, your GenePattern module definition form can contain a simple command line that calls MATLAB with the -r flag to execute your function; for example:

matlab -nosplash -r "analyzeThis <p1> <p2>"

This example invokes MATLAB without the splash screen (-nosplash) and directs it to execute the quoted command, where p1 and p2 are parameters that you specify in the GenePattern module definition form and that are passed to the MATLAB command line as strings. MATLAB looks for the function analyzeThis on the MATLAB path; therefore, it is not necessary to upload the function as a support file, although it is recommended.

To ensure that the GenePattern server can call the MATLAB executable, you typically add the MATLAB directory to your PATH system environment variable. (Alternatively, you can enter the full path to the MATLAB executable on the command line, but this makes it more difficult to deploy the module on other GenePattern servers.)

To check that MATLAB is on your path:

Open a DOS window.
Type matlab and press Enter.

If the MATLAB application starts, MATLAB is on your path.

If MATLAB is not on your path, add it:

Select Start>Settings>Control Panel.
Double-click System.
Select the Advanced tab.
Click the Environment Variables button.
Select or create the PATH variable.
Add the $MATLAB_ROOT/bin directory to the path.

Open a new DOS window and check again that MATLAB is on your path.

Preferred Command Line

On platforms other than Windows, the execution of the command line differs slightly due to variations in the Java Virtual Machines (VMs) that GenePattern is running. If you use the simple matlab command, as described for Windows, the Java VMs on these platforms attempt to parse and quote the command line resulting in MATLAB generating errors in its eval function.

On these platforms, you must use a wrapper Java class to launch MATLAB. This wrapper class also works on Windows and does not rely on the PATH variable, which makes it the preferred method for implementing the direct approach on any platform.

To use the wrapper Java class:

On the GenePattern module definition form, add the runmatlab.jar file as a support file. To request a copy of this file, send e-mail to gp-help (at) broadinstitute.org; alternatively, the java source code for the RunMatlab wrapper class is included here: RunMatlab.java.
Write your command line as follows:
<java> -cp <libdir>runmatlab.jar RunMatlab <libdir> analyzeThis <p1> <p2>

Where analyzeThis is the name of your MATLAB entry function and <p1> and <p2> are the arguments to the function. The RunMatlab class ensures that the arguments are correctly written out and calls MATLAB with the -nosplash and -nodisplay arguments.

Compiled Approach Distribution

Creating Modules in GenePattern describes how to create a GenePattern module that invokes the code that you have written. This section provides additional information that applies when you are compiling your M-code into a standalone executable and invoking that executable from the GenePattern module:

Preparing the GenePattern Server
Writing the Launcher Script
Writing the Module Command Line
Adding Support Files
Distribution Licensing

Preparing the GenePattern Server

To run a standalone executable generated by the MATLAB Compiler, the GenePattern server must have the MATLAB Component Runtime (MCR) installed. This is a collection of shared libraries, which contains the runtime code for MATLAB, that is used by the standalone application. If the GenePattern server has MATLAB installed, you do not need to install the MCR; it is already installed.

Full details for installing the MCR can be found in the MATLAB Compiler documentation, in the section titled "Deploying Components to Other Machines". To summarize this documentation, on the GenePattern server machine, you need to run the MCRInstaller:

On Windows, to run the MCRInstaller:

Copy <matlabroot>\toolbox\compiler\deploy\win32\MCRInstaller.exe to the server machine.
Run MCRInstaller.exe.

On Linux, to run the MCRInstaller:

In MATLAB, at the MATLAB prompt, execute the command buildmcr.
Copy <matlabroot>/toolbox/compiler/deploy/MCRInstaller.zip to the server machine.
On the server machine, unzip MCRInstaller.zip into a directory (<mcr_root>).
Update the dynamic library path for the user running the GenePattern server:
setenv LD_LIBRARY_PATH <mcr_root>/runtime/glnx86:<mcr_root>/sys/os/glnx86:<mcr_root>/sys/java/jre/glnx86/jre1.4.2/lib/i386/client:<mcr_root>/sys/java/jre/glnx86/jre1.4.2/lib/i386:<mcr_root>/sys/opengl/lib/glnx86:${LD_LIBRARY_PATH}

On Mac OS X, to run the MCRInstaller:

In MATLAB, at the MATLAB prompt, execute the command buildmcr.
Copy <matlabroot>/toolbox/compiler/deploy/MCRInstaller.zip to the server machine.
On the server machine, unzip MCRInstaller.zip into a directory (<mcr_root>).
Update the library path for the user running the GenePattern server:
setenv DYLD_LIBRARY_PATH <mcr_root>/<ver>/runtime/mac:<mcr_root>/<ver>/sys/os/mac:<mcr_root>/<ver>/bin/mac:/System/Library/Frameworks/JavaVM.framework/JavaVM:/System/Library/Frameworks/JavaEmbedding.framework/JavaEmbedding:/System/Library/Frameworks/JavaVM.framework/Libraries
setenv XAPPLRESDIR <mcr_root>/<ver>/X11/app-defaults

Writing the Launcher Script

When the MATLAB Compiler generates a standalone executable, it also generates a Component Framework (.ctf) file. The .ctf file must be on the path when you run the standalone executable. The easiest way to address this requirement is to create a launcher script (.bat or .sh file) that adds the .ctf file to the PATH or LIBPATH and then runs the standalone executable.

On Windows, for example, to launch the MATLAB executable analyzeThis.exe, create a launcher script, mllaunch.bat, that contains the following lines:

set LIBDIR=%1

set PATH=%LIBDIR%;%PATH%

analyzeThis %2 %3

On Linux, for example, to launch the MATLAB executable analyzeThis, create a launcher script, mllaunch.sh, that contains the following lines:

#!/bin/sh
export MCR_ROOT=<path where you installed the files from MCRInstaller.zip>
export LD_LIBRARY_PATH=$1:$MCR_ROOT/runtime/glnx86:$MCR_ROOT/sys/os/glnx86:\
$MCR_ROOT/sys/java/jre/glnx86/jre1.4.2/lib/i386/client:\
$MCR_ROOT/sys/java/jre/glnx86/jre1.4.2/lib/i386:\
$MCR_ROOT/sys/opengl/lib/glnx86
export PATH=$1:$PATH
chmod a+x $1/analyzeThis
analyzeThis $2 $3

The chmod line sets the executable permission on the executable file; by default, the GenePattern server does not set this permission for uploaded files.

On Mac OS X, for example, to launch the MATLAB executable analyzeThis, create a launcher script, mllaunch.sh, that contains the following lines:

#!/bin/sh
export MCR_ROOT=/Volumes/os9/gpserv
export LD_LIBRARY_PATH=$1:/Volumes/os9/matlab7.2/sys/os/mac:/Volumes/os9/matlab7.2/bin/mac/
export DYLD_LIBRARY_PATH=$LD_LIBRARY_PATH
export PATH=$1:$PATH
chmod a+x $1/analyzeThis
analyzeThis $2 "$3"

The chmod line sets the executable permission on the executable file; by default, the GenePattern server does not set this permission for uploaded files.
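
The launcher scripts above all perform the same steps: put the module's library directory and the MCR libraries on the search paths, mark the uploaded executable as runnable, and run it with the remaining arguments. As an illustration only, the same logic can be sketched in Python. The function names are hypothetical, the library subdirectories are copied from the Linux script above, and, like that script, the sketch replaces LD_LIBRARY_PATH rather than appending to any existing value.

```python
import os
import stat
import subprocess

# Library subdirectories copied from the Linux launcher script above.
MCR_SUBDIRS = [
    "runtime/glnx86",
    "sys/os/glnx86",
    "sys/java/jre/glnx86/jre1.4.2/lib/i386/client",
    "sys/java/jre/glnx86/jre1.4.2/lib/i386",
    "sys/opengl/lib/glnx86",
]


def build_launch_env(libdir, mcr_root, base_env):
    """Return a copy of base_env with libdir and the MCR libraries on the search paths."""
    env = dict(base_env)
    lib_dirs = [libdir] + [os.path.join(mcr_root, sub) for sub in MCR_SUBDIRS]
    env["LD_LIBRARY_PATH"] = os.pathsep.join(lib_dirs)
    old_path = base_env.get("PATH", "")
    env["PATH"] = libdir + os.pathsep + old_path if old_path else libdir
    return env


def launch(libdir, mcr_root, executable, args):
    """Mark the uploaded executable as runnable, then run it with the module arguments."""
    exe_path = os.path.join(libdir, executable)
    mode = os.stat(exe_path).st_mode
    # Equivalent of chmod a+x: add execute permission for user, group, and others.
    os.chmod(exe_path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    env = build_launch_env(libdir, mcr_root, os.environ)
    return subprocess.call([exe_path] + list(args), env=env)
```

Passing the arguments as a list avoids a second round of shell quoting, which is why the text argument does not need the explicit quotes used in the Mac OS X script.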

Writing the Module Command Line

On the GenePattern module definition form, write a command line that calls the launcher script, passing the <libdir> parameter as the first argument (so that it can be added to the path).

On Windows, the following command line calls the launcher script, mllaunch.bat:

<libdir>mllaunch.bat <libdir> <param1> <param2>

On Linux or Mac OS X, the following command line calls the launcher script, mllaunch.sh:

sh <libdir>mllaunch.sh <libdir> <param1> <param2>

In both command lines, the first <libdir> sets the path to the mllaunch script. The second <libdir> is passed as the first argument to the script so that the script can add this directory to the appropriate environment variables. The <param1> and <param2> variables are parameters to the MATLAB application, which you define in the module definition form and specify in the command line as usual.
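
To make the role of the two <libdir> occurrences concrete, the following Python sketch expands a command-line template by plain text substitution. It is an illustration, not GenePattern's own substitution code: the function name and the directory and parameter values are hypothetical, and the trailing separator on the <libdir> value is inferred from the fact that the command lines above write <libdir>mllaunch.sh with nothing in between.

```python
import re


def expand_command_line(template, values):
    """Replace each <name> placeholder with its value; unknown names raise KeyError."""
    return re.sub(r"<([^<>\s]+)>", lambda m: values[m.group(1)], template)


# Hypothetical values for illustration.
values = {"libdir": "/gp/taskLib/MyModule.1/", "param1": "out.txt", "param2": "hello"}
print(expand_command_line("sh <libdir>mllaunch.sh <libdir> <param1> <param2>", values))
# prints: sh /gp/taskLib/MyModule.1/mllaunch.sh /gp/taskLib/MyModule.1/ out.txt hello
```

The first expansion locates the script; the second arrives in the script as $1 (or %1 on Windows).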

Adding Support Files

For the compiled approach, you must specify at least two support files for the MATLAB application: the executable file and .ctf file. If your application requires additional files for its execution, also add those files as support files.

Distribution Licensing

Should you choose to distribute your MATLAB based module to others, you must ensure you are in compliance with the MATLAB licensing agreement:

http://www.mathworks.com/company/aboutus/policies_statements/agreement.pdf

Following are a few key points for GenePattern developers:

You may not distribute code that uses MATLAB and that competes with any of The MathWorks products.
You may not modify or remove any license file included with the MCR Libraries.
Users of your GenePattern modules must be made aware of the MATLAB license agreement in documentation and accept it before installing your modules.
Your MATLAB application must have an about box or equivalent "visible" location that includes the legend "MATLAB copyright 1984-yyyy the MathWorks Inc.", where yyyy is the year you released your module.

Please refer to the MATLAB licensing agreement for exact details. You are responsible for reviewing and complying with the MATLAB software license. The above summary does not exempt you from this responsibility.

Example: Deploying a Compiled MATLAB Application

This section provides a step-by-step example of deploying a simple M-file application as a GenePattern module on a GenePattern server. Where the instructions are platform specific, the example shows instructions for Windows, Linux, and Mac OS X.

Writing the M-file

The first step is writing the MATLAB M-file that you want to share. For this example, write a simple application that takes a filename and a String and writes the String out to a file with the given name. This application consists of the following lines:

% write the variable whatToWrite to a file called filename in the current directory

fid = fopen(filename,'w');

fprintf(fid,'#writing to a file\n\n');

fprintf(fid,whatToWrite);

fclose(fid);

Adapting the M-file

To call the M-file from the command line and pass it parameters, you must turn this script into a no-return function. To do this, add a function definition line at the start of the M-file and save the file using the name of the function (for example, writeToFile.m).

function writeToFile( filename, whatToWrite)

% write the parameter whatToWrite to a file called filename in the current directory

fid = fopen(filename,'w');

fprintf(fid,'#writing to a file\n\n');

fprintf(fid,whatToWrite);

fclose(fid);

Compile the M-file

Within the MATLAB environment, call the MATLAB Compiler to convert this function into an application:

>> mcc -m writeToFile

Within the current working directory, this creates a number of files, including the following:

writeToFile.exe (Windows) or
writeToFile (Linux, Mac OS X)
writeToFile.ctf

Prepare the GenePattern Server

Install the MATLAB Component Runtime (MCR) on the GenePattern server, if you have not done so already. If the GenePattern server has MATLAB installed, it also has the MCR installed.

Windows

To install the MCR:

Copy <matlabroot>\toolbox\compiler\deploy\win32\MCRInstaller.exe to the GenePattern server machine.
At the DOS prompt, or from Windows Explorer, run the following:
MCRInstaller.exe

Linux or Mac OS X

To install the MCR:

Within the MATLAB environment, create the MCRInstaller zip file:
>> buildmcr mcrdir
This creates a directory, mcrdir, beneath the current working directory and creates a file within that directory called MCRInstaller.zip.
Copy the zip file to your GenePattern server (if it is a different machine) and install it into a directory. For example, add a directory, matlab, under the GenePattern server directory and install the library files in MCRInstaller.zip into that directory:

cd GenePatternServer
mkdir matlab
cd matlab
cp <path to mcrinstaller.zip>MCRInstaller.zip .
unzip MCRInstaller.zip

Create the Launcher Script

Create the launcher script that sets the environment variables and then calls the MATLAB application.

Windows

Create the launcher script as a batch file that sets the PATH variable for the environment and then calls the MATLAB application. To do so, in a text editor, create the following mllaunch.bat file:

set LIBDIR=%1

set PATH=%LIBDIR%;%PATH%

writeToFile %2 %3

Linux

Create the launcher script as an .sh file that sets the PATH and LD_LIBRARY_PATH variables for the environment, ensures that the application is executable, and then calls the MATLAB application. To do so, in a text editor, create the following mllaunch.sh file:

#!/bin/sh
export MCR_ROOT=/home/username/GenePatternServer/matlab/v70
export LD_LIBRARY_PATH=$1:$MCR_ROOT/runtime/glnx86:$MCR_ROOT/sys/os/glnx86:$MCR_ROOT/sys/java/jre/glnx86/jre1.4.2/lib/i386/client:$MCR_ROOT/sys/java/jre/glnx86/jre1.4.2/lib/i386:$MCR_ROOT/sys/opengl/lib/glnx86
export PATH=$1:$PATH
chmod a+x $1/writeToFile
writeToFile $2 $3

Note that the MCR_ROOT variable is set to the v70 directory, which you created by unzipping MCRInstaller.zip.

Mac OS X

Create the launcher script as an .sh file that sets the LD_LIBRARY_PATH and DYLD_LIBRARY_PATH variables for the environment, ensures that the application is executable, and then calls the MATLAB application. To do so, in a text editor, create the following mllaunch.sh file:

#!/bin/sh

export MCR_ROOT=/Volumes/os9/gpserv

export LD_LIBRARY_PATH=$1:/Volumes/os9/matlab7.2/sys/os/mac:/Volumes/os9/matlab7.2/bin/mac/

export DYLD_LIBRARY_PATH=$LD_LIBRARY_PATH

export PATH=$1:$PATH

chmod a+x $1/writeToFile

writeToFile $2 "$3"

Create the GenePattern Module

Use GenePattern to create a module that executes the launcher script.

Windows

For the command line, enter the following:
<libdir>mllaunch.bat <libdir> <fname> <txt>
Define two parameters:
- <fname> for the output file name
- <txt> for the text to write to the file
Include the following support files:
- mllaunch.bat
- whatToWrite.exe
- whatToWrite.ctf

Linux or Mac OS X

For the command line, enter the following:
sh <libdir>mllaunch.sh <libdir> <fname> <txt>
Define two parameters:
- <fname> for the output file name
- <txt> for the text to write to the file
Include the following support files:
- mllaunch.sh
- whatToWrite
- whatToWrite.ctf
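
To illustrate how this command line relates to the $1, $2, and $3 arguments used in the launcher script, the following Python sketch expands the placeholders by plain text substitution. It is not GenePattern's own substitution code. The function name expand_command_line and all directory and parameter values are hypothetical, and the sketch assumes that the <libdir> value ends with a path separator, because the command line places mllaunch.sh directly after it.

```python
import re


def expand_command_line(template, values):
    """Replace each <name> placeholder in template with values[name].

    Raises KeyError if the template names a placeholder that has no value.
    """
    def substitute(match):
        return values[match.group(1)]

    return re.sub(r"<([^<>\s]+)>", substitute, template)


# Hypothetical values, for illustration only.
values = {
    "libdir": "/opt/gp/modules/writeToFile/",
    "fname": "out.txt",
    "txt": "hello",
}

command = expand_command_line("sh <libdir>mllaunch.sh <libdir> <fname> <txt>", values)
print(command)

# Everything after "sh" and the script path is what the script sees as $1, $2, $3.
script_args = command.split()[2:]
print(script_args)
```

In this sketch the three script arguments are the library directory, the output file name, and the text to write, in the order in which the launcher script expects them.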

Save the Module and Test It

Save the module and execute it. The module should create two files:

A stdout file that contains execution information.
A file with the name and text that you specified.

Debugging (Linux Only)

If the following error appears in the stdout file, you have not correctly set the path to the libraries that you installed from MCRInstaller.zip:

error while loading shared libraries: libmwmclmcrrt.so.7.0: cannot open shared object file: No such file or directory

Double-check the path. If it is correct, your system may be using a different Unix shell than the one used in this example. Check that the mllaunch.sh file uses the command that is correct for your shell (export in this example) to set PATH and LD_LIBRARY_PATH.
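
To check the path outside of GenePattern, a small script can report which directories in LD_LIBRARY_PATH actually contain the library named in the error message. The following Python sketch is one way to do this. The helper find_library is hypothetical (it is not part of GenePattern or MATLAB), and it only tests whether a file with that name exists in each directory.

```python
import os


def find_library(library_name, search_path):
    """Return the directories in search_path that contain a file named library_name.

    search_path uses the platform's path separator (":" on Linux and Mac OS X).
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, library_name)):
            hits.append(directory)
    return hits


# Library name taken from the error message above.
library = "libmwmclmcrrt.so.7.0"
matches = find_library(library, os.environ.get("LD_LIBRARY_PATH", ""))

if matches:
    print("Found %s in: %s" % (library, ", ".join(matches)))
else:
    print("%s was not found in any LD_LIBRARY_PATH directory" % library)
```

Run the script in an environment that has the same LD_LIBRARY_PATH value that mllaunch.sh exports; otherwise it reports on a different search path than the one the module uses.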

Using GenePattern from Java

Using Java as a GenePattern client allows you to run GenePattern modules and visualizers from within a Java application. This section describes how you can use the GenePattern Java library to run GenePattern analyses as easily as calling a routine. It contains the following topics:

Getting Started in Java
GenePattern Java Library
Running a Java Program
Using LSIDs from Java

Getting Started in Java

If you are not familiar with Java, see the http://java.sun.com website, which provides downloadable programs, samples, tutorials, and book suggestions.

GenePattern Java Library

The GenePattern Java library allows you to invoke a GenePattern module as if it were a local Java method running on your client and to get back from the module a list of result files. A zip file containing the Java library (and Javadoc that describes the API for accessing the server and running modules) is available on your GenePattern server.

To download the GenePattern Java library to your computer:

Start GenePattern.
Select Downloads>Programming Libraries.
Under Java, click zip to download the zip file for the GenePattern Java library.
After downloading the zip file, unzip it into the directory where you will be doing your Java development.

Running a Java Program

This section explores a simple Java application that preprocesses a dataset and displays it using the HeatMapViewer. The included code can be copied and pasted into your Java program so that you can try it out, modify it, and create your own solutions. The full source code of the sample application is available here.

import org.genepattern.matrix.Dataset;
import org.genepattern.client.GPClient;
import org.genepattern.webservice.JobResult;
import org.genepattern.webservice.Parameter;
import org.genepattern.io.IOUtil;

import java.io.File;

public class MyProgram 
{

   public static void main(String[] args) throws Exception 
   {
      GPClient gpClient = new GPClient("http://localhost:8080", "your email address");

After initializing the required settings, the application runs the PreprocessDataset module to preprocess a dataset. This example references the dataset using a publicly-accessible URL, but a filename would be equally valid. When you invoke the runAnalysis method, the GenePattern library invokes the appropriate module on the server, passing all of the input parameters and input files. Control returns to your application when the module completes. (To run a module asynchronously, invoke the runAnalysisNoWait method or use the runAnalysis method in a separate thread.)

 String inputDataset = "ftp://ftp.broadinstitute.org/pub/genepattern/all_aml/all_aml_train.res";

 JobResult preprocess = gpClient.runAnalysis("PreprocessDataset",

    new Parameter[] {

       new Parameter("input.filename", inputDataset)

});

When the module completes, you can query the JobResult object for an array of filenames that are the output from the module. You can download the result files or leave them on the server and refer to them by URL. Referring to result files by URL is especially useful for intermediate results. In this example, the JobResult object named preprocess contains a list of filenames (of length 1, in this case), which the application displays in a heat map:

 // view results in a HeatMapViewer visualizer

 gpClient.runVisualizer("HeatMapViewer",

    new Parameter[] {

       new Parameter("dataset", preprocess.getURL(0).toString())
 });

The last statements in the application download the preprocessed data and load it into a matrix for further analysis:

 String downloadDirName = String.valueOf(preprocess.getJobNumber());

 // download result files

 File[] outputFiles = preprocess.downloadFiles(downloadDirName);

 // load data into matrix for further manipulation

 Dataset dataset= IOUtil.readDataset(outputFiles[0].getPath());

   }
}

You can combine GenePattern analyses with any capabilities that the Java environment has to offer. Use Java's 2-D and 3-D graphics libraries to create graphic output, or summarize and report on the data using your own code. The basic idea to remember is that GenePattern modules create result files and those files are available to the Java application for processing.

For more information:

See the Modules page for a list of the GenePattern modules, with links to their documentation.
Use GenePattern to generate the Java code required to run a module or pipeline:
1. Select a module (or pipeline). GenePattern displays the parameters for the module (pipeline).
2. Optionally, enter the parameter values that you want to use.
3. Use the View Code or Generate Code field (at the bottom of the form) to display the Java code required to execute this module/pipeline with these parameters.

Using LSIDs from Java

Life Science Identifiers (LSIDs) can be used instead of module names to identify modules for GenePattern to run. An LSID may be submitted in place of the module name in the methods runAnalysis and runVisualizer. When an LSID is provided that does not include a version, the latest available version of the module identified by the LSID will be used. If a module name is supplied, the latest version of the module with the nearest authority is selected. The nearest authority is the first match in the sequence: local authority, Broad authority, other authority.
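
The version rule can be pictured with a short sketch, written here in Python for brevity. It is an illustration only, not the GenePattern server's resolution logic. It assumes that an LSID has the form urn:lsid:authority:namespace:object:version with an optional version, and that versions are dot-separated integers. It does not model the nearest-authority rule for module names. The names parse_lsid, version_key, and resolve, and the list of installed versions, are hypothetical.

```python
from collections import namedtuple

Lsid = namedtuple("Lsid", "authority namespace object_id version")


def parse_lsid(text):
    """Split an LSID string into its parts; version is None when omitted."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError("not an LSID: %r" % text)
    version = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], version)


def version_key(version):
    """Turn '1.10' into (1, 10) so that versions compare numerically."""
    return tuple(int(piece) for piece in version.split("."))


def resolve(requested, installed):
    """Return the installed Lsid that matches requested, or None.

    If requested has no version, the highest installed version is chosen.
    """
    want = parse_lsid(requested)
    candidates = [parse_lsid(item) for item in installed]
    matches = [c for c in candidates
               if c[:3] == want[:3] and c.version is not None]
    if want.version is not None:
        matches = [c for c in matches if c.version == want.version]
    if not matches:
        return None
    return max(matches, key=lambda c: version_key(c.version))


# Hypothetical set of installed versions of one module.
base = "urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00020"
installed = [base + ":3", base + ":5", base + ":4"]

print(resolve(base, installed).version)         # no version given: latest is used
print(resolve(base + ":3", installed).version)  # explicit version is honored
```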

If you are unfamiliar with LSIDs and GenePattern versioning, see Concepts.

Using GenePattern from MATLAB

Using MATLAB as a GenePattern client allows you to run GenePattern modules and to manipulate and visualize the results in a powerful, commercial technical computing application that works on most major platforms. Using GenePattern allows you to invoke methods written in many other languages without having to worry about how to launch them. This section describes how you can use the GenePattern MATLAB library to run GenePattern analyses:

Getting Started in MATLAB
GenePattern MATLAB Library
Running a MATLAB Program
Using LSIDs from MATLAB

Getting Started in MATLAB

Resources and documentation for MATLAB are available at http://www.mathworks.com/.

GenePattern MATLAB Library

The GenePattern MATLAB library allows you to invoke a GenePattern module as if it were a local MATLAB function running on your client and to get back from the module a list of result files. A zip file containing the MATLAB library is available on your GenePattern server.

To download the GenePattern MATLAB library to your computer:

Start GenePattern.
Select Downloads>Programming Libraries.
Under MATLAB, click zip to download the zip file for the GenePattern MATLAB library.
After downloading the zip file, unzip it into your MATLAB7/toolboxes directory. If you do not have permission to put files in that directory, unzip into any other directory.
After downloading and unzipping the files, add the directories to your MATLAB path:
1. At a MATLAB prompt, open the pathtool:
  >>pathtool
2. Use the MATLAB pathtool to add the GenePatternServer and GenePatternFileSupport directories, with subfolders, to the MATLAB search path.

Note: MATLAB 7.0.4 (R14SP2) and later use Java Virtual Machine (JVM) 1.5. If you are using an earlier version of MATLAB, you must change the JVM that MATLAB is using to JVM 1.5. For instructions, see http://www.mathworks.com/support/solutions/data/1-1812J.html?solution=1-1812J.

Running a MATLAB Program

This section explores a simple MATLAB program that runs a module, displays the resulting output, and loads it into a MATLAB matrix for further analysis. The included code can be copied and pasted into your MATLAB client so that you can try it out, modify it, and create your own solutions.

The first statements in the application initialize various settings, which you must do once in every application that accesses GenePattern. You will need to replace the GenePattern server URL, GenePattern user name (typically, your e-mail address), and password (if required) with values appropriate for your GenePattern server.

% Create a GenePattern server proxy instance

gp = GenePatternServer('http://localhost:8080','my.email@my.domain', 'mypassword');

After initializing the required settings, the application runs the TransposeDataset module to transpose a dataset. This example references the dataset using a publicly-accessible URL, but a filename would be equally valid. As shown below, you can call the GenePattern methods directly or by calling the runAnalysis method. When you call a GenePattern method, such as TransposeDataset, the GenePattern library invokes the module on the server, passing all of the input parameters and input files. Control returns to your application when the module completes. (To run a module asynchronously, invoke the method in a separate thread.)

% input dataset for transpose operation

params.output_file_name = 'transposed.out'

params.input_filename='http://www.broadinstitute.org/mpr/publications/projects/Leukemia/ALL_vs_AML_train_set_38_sorted.res'

 

% transpose the dataset

transposeResult = gp.TransposeDataset(params)

% alternate call to transpose the dataset

transposeResult = runAnalysis(gp, 'TransposeDataset', params)

When the module completes, it returns a MATLAB structure that contains a list of filenames that are the output from the module. In this example, transposeResult is a structure with a list of filenames (of length 1, in this case). The application displays the results in a file viewer window and also loads them into a matrix so that further manipulation can be performed:

% display the transposed results

edit 'transposed.out.gct'

 

% now read the output into a matrix

% so we can do further manipulation in MATLAB

myData = loadGenePatternExpressionFile('transposed.out.gct')

You can combine GenePattern analyses with all of the rich functionality of MATLAB. For example, you can use MATLAB's plotting methods to create graphic output, save modified matrices to files using save, or summarize and report on the data using your own code. The basic idea to remember is that GenePattern modules create result files and those files are available to the MATLAB client for processing.

For a list of the GenePattern modules available on your server, run the listMethods function on your GenePatternServer object. To view the names of the input parameters for a module, use the describeMethod function on your GenePatternServer object, passing it the module name.

% display the available GenePattern modules

listMethods(gp)

 

% now look at the parameters for the TransposeDataset module

describeMethod(gp, 'TransposeDataset')

Alternatively, to get the parameters with their default values filled in, use the getMethodParameters function of the GenePatternServer object. This returns a MATLAB structure with named elements for each parameter, filled in with the default value if one exists. After filling in the missing parameters and overriding defaults if desired, this structure can then be passed on to the runAnalysis method.

% get the parameters for the TransposeDataset module, with default values filled in

params2 = getMethodParameters(gp, 'TransposeDataset')

params2.input_filename='http://www.broadinstitute.org/mpr/publications/projects/Leukemia/ALL_vs_AML_train_set_38_sorted.res'

 

% transpose the dataset

transposeResult = gp.TransposeDataset(params2)

The GenePattern MATLAB library also has convenience methods to read and write GenePattern files (such as res, gct, and odf files). Even if you choose not to look in the library, you can extend the techniques shown above to implement your own analyses.

For more information:

See the Modules page for a list of the GenePattern modules, with links to their documentation.
Use GenePattern to generate the MATLAB code required to run a module or pipeline:
1. Select a module (or pipeline). GenePattern displays the parameters for the module (pipeline).
2. Optionally, enter the parameter values that you want to use.
3. Use the View Code or Generate Code field (at the bottom of the form) to display the MATLAB code required to execute this module/pipeline with these parameters.

Using LSIDs from MATLAB

You can use Life Science Identifiers (LSIDs) to identify a module when executing GenePattern code in MATLAB. An LSID may be submitted in place of the module name to getMethodParameters or runAnalysis. When providing an LSID to a method in addition to a module name, the LSID alone is used to determine what module to run. When an LSID is provided that does not include a version, the latest available version of the module identified by the LSID will be used. If you are unfamiliar with LSIDs and GenePattern versioning, see Concepts.

% Example using LSIDs from MATLAB

params = getMethodParameters(gp, 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00026:3');

params.output_file_name = 'transposed.out'

params.input_filename='http://www.broadinstitute.org/mpr/publications/projects/Leukemia/ALL_vs_AML_train_set_38_sorted.res'

% transpose the dataset

transposeResult = runAnalysis(gp, 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00026:3', params)

Using GenePattern from R

Using R as a GenePattern client allows you to run GenePattern modules and to manipulate and visualize the results in a powerful, free statistical desktop package that works on most major platforms. Using GenePattern allows you to invoke methods written in many other languages without having to worry about how to launch them or whether you are passing incorrect parameters. This section describes how you can use the GenePattern R library to run GenePattern analyses:

Getting Started in R
GenePattern R Library
Running an R Program
Using LSIDs from R

Getting Started in R

If you are not familiar with R, see the following resources on the www.r-project.org website:

An Introduction to R (PDF, approx. 100 pages, 650kB), based on the former "Notes on R", gives an introduction to the language and how to use R for doing statistical analysis and graphics.
A draft of the R language definition (PDF, approx. 60 pages, 400kB) documents the language; that is, the objects that it works on, and the details of the expression evaluation process, which are useful to know when programming R functions.
Writing R Extensions (PDF, approx. 85 pages, 500kB) covers how to create your own packages, write R help files, and the foreign language (C, C++, Fortran, ...) interfaces.
R Data Import/Export (PDF, approx. 35 pages, 270kB) describes the import and export facilities available either in R itself or via packages which are available from CRAN.
R Installation and Administration (PDF, approx. 30 pages, 200kB).
The R Reference Index (PDF, approx. 2200 pages, 12MB) contains all help files of the R standard and recommended packages in printable form.

GenePattern R Library

The GenePattern R package allows you to invoke a GenePattern module as if it were a local R method running on your client and to get back from the module a list of result files. The package requires R version 2.4.1 or greater and the rJava package. The package can be downloaded from your GenePattern server in Windows (.zip), source (.tar.gz), and Mac OS X (.tgz) formats.

To download the GenePattern R package to your computer:

Start GenePattern.
Within GenePattern, select Downloads>Programming Libraries.
Click the appropriate link to download the GenePattern R package for your operating system.
Install the package into your R environment by using the install.packages command:
install.packages("full-path-to-GenePattern-R-package", type="source", repos=NULL)

Note: If you are using a version of R which you cannot modify (because it is a publicly-shared version and you do not have appropriate privilege), you can have it load the GenePattern library by setting the environment variable R_LIBS=<GenePattern install directory>/R/library in your autoexec.bat, .cshrc, .bashrc or other shell startup file. R will then load from its usual location, but will also search for and find the GenePattern library from your installation.

Running an R Program

This section explores a simple R program that runs a module, displays the resulting output, and loads it into an R matrix for further analysis. The included code can be copied and pasted into your R environment so that you can try it out, modify it, and create your own solutions.

The first statements in the application initialize various settings, which you must do once in every application that accesses GenePattern. You will need to replace the GenePattern server URL, GenePattern user name (typically, your e-mail address), and password with values appropriate for your GenePattern server. The gp.login method returns a GPClient object that contains the information required for running modules on a GenePattern server.

# Load GenePattern package

library(GenePattern)

username <- "your email address"

password <- "your password"

servername <- "http://localhost:8080"

 

# Obtain a GPClient object which references a specific server and user

gp.client <- gp.login(servername, username, password)

After initializing the required settings, the application runs the PreprocessDataset module to preprocess a dataset. This example references the dataset using a publicly-accessible URL, but a filename would be equally valid. When you call an R method, such as run.analysis, the GenePattern package invokes the appropriate module on the server, passing all of the input parameters and input files. Control returns to your application when the module completes. (To run a module asynchronously, use the method runAnalysisNoWait.)

# input dataset for preprocess operation

input.ds <- "ftp://ftp.broadinstitute.org/pub/genepattern/all_aml/all_aml_train.res"

 

# preprocess the dataset

preprocess.jobresult <- run.analysis(gp.client, "PreprocessDataset", input.filename=input.ds)

When the module completes, it returns a JobResult object with which you can execute various methods. For example, you can call a method using a JobResult object to get an R list of the filenames that are the output of the module. Afterwards, you can download the files or leave them on the server and refer to them by URL. In this example, we view the results in a heat map:

# Obtain the url location of the result and run the visualizer

preprocess.out.file.url <- job.result.get.url(preprocess.jobresult, 0)

run.visualizer(gp.client, "HeatMapViewer", dataset=preprocess.out.file.url)

In this example, the application downloads the result file and displays the results in a file viewer window, then also loads the data into a matrix so that further manipulation can be performed in R:

# download result files

download.directory <- job.result.get.job.number(preprocess.jobresult)

download.directory <- as.character(download.directory)

preprocess.out.files <- job.result.download.files(preprocess.jobresult, download.directory)

 

# display the preprocessed result

preprocessed.out.file <- as.character(preprocess.out.files[1])

file.show(preprocessed.out.file)

 

# now read the output into a matrix

# so we can do further manipulation in R

data <- read.dataset(preprocessed.out.file)

You can combine GenePattern analyses with all of the rich statistical functionality of R. For example, you can use R's plot and legend methods to create graphic output, output JPEGs of your visualized data using savePlot, save modified matrices to files using save, or summarize and report on the data using your own code. Just remember: GenePattern modules create JobResult objects and those objects are available to the R client for processing.

The GenePattern R package also has methods to read and write GenePattern files (such as res, gct, and cls files), to enable running of multiple modules in parallel, to run modules with input from files that were output from previous modules without moving them from the server, and other utilities. Even if you choose not to look in the library, you can extend the techniques shown above to implement your own analyses.

For more information:

See the Modules page for a list of the GenePattern modules, with links to their documentation.
Use GenePattern to generate the R code required to run a module or pipeline:
1. Select a module (or pipeline). GenePattern displays the parameters for the module (pipeline).
2. Optionally, enter the parameter values that you want to use.
3. Use the View Code or Generate Code field (at the bottom of the form) to display the R code required to execute this module/pipeline with these parameters.

Using LSIDs from R

You can use Life Science Identifiers (LSIDs) instead of module names to identify modules for GenePattern to run. For R, this is primarily useful when you want to specify a particular version of a module for GenePattern to run. The easiest way to specify a particular version of a module is to specify the LSID as an argument to an R method such as run.analysis in place of the GenePattern module name. For example, the following statement invokes version 5 of the PreprocessDataset module, whether or not it is the latest version:

preprocess.jobresult <- run.analysis(gp.client, "urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00020:5", input.filename=input.ds)

If you are unfamiliar with LSIDs and GenePattern versioning, see Concepts.

Using GenePattern from Python

Using Python as a GenePattern client allows you to run GenePattern modules from a Python script or interactive prompt. This section describes how you can use the GenePattern Python library to run GenePattern analyses as easily as calling a function. It contains the following topics:

Getting Started in Python
GenePattern Python Library
Running a Python Program
Using LSIDs from Python

Getting Started in Python

If you are not familiar with Python, see the following resources on the https://www.python.org/ website:

Python Beginner's Guide: A set of tutorials and frequently asked questions regarding Python.
Python Documentation: Complete documentation for the Python language and common libraries.

GenePattern Python Library

The GenePattern Python package allows you to invoke a GenePattern module as if it were a Python method call. The GenePattern Python package supports both Python 2.7 and Python 3.4 or later. The package can be downloaded from your GenePattern server.

Method 1: Download from GenePattern

To download the GenePattern Python package to your computer:

Start GenePattern.
Within GenePattern, select Resources > Programming Libraries.
Click the appropriate link to download the GenePattern Python package.
Extract the ZIP file that was just downloaded.
Install the package into your Python environment by running the following command in the same directory in which the files were just unzipped:

python setup.py install

Note: If you do not have the appropriate permissions to install a Python module, you may need to run this command as an administrator (Windows) or using the sudo command (Linux, Mac).

Method 2: Install with pip

The GenePattern Python package can be installed from the Python Package Index (PyPI) by running the following pip command:

pip install genepattern-python

Note: To install a package you may need administrator permissions, such as running the above command using sudo.

Running a Python Program

This section explores a simple Python program that connects to a GenePattern server, runs a module, and loads the resulting files for further analysis. The included code can be copied and pasted into your Python script or interactive terminal so that you can try it out, modify it, or create your own solutions.

The first step is to import the GenePattern library into your script, as shown below. All methods provided by the GenePattern library can then be accessed from the gp namespace.

import gp

The next step in using the GenePattern Python library is to connect to an existing GenePattern server. This requires the address of your GenePattern server, as well as your username and password. The code below shows an example that connects to a GenePattern server running on the same computer as your Python terminal. Note that the address must end in /gp for the library to connect to GenePattern successfully. Replace myusername and mypassword with your own credentials.

# Create a GenePattern server proxy instance

gpserver = gp.GPServer('http://localhost:8080/gp','myusername', 'mypassword')

If you do not know which modules are available on the GenePattern server, you can list them programmatically by running the code shown below. This returns a list of GPTask objects, each representing a different module. Each GPTask object provides the module name, LSID, description, and version number.

task_list = gpserver.get_task_list()

If you already know the name or LSID of the module you want to call, you can obtain a GPTask object for it directly by running the code below. This example code obtains a GPTask object for the PreprocessDataset module.

module = gp.GPTask(gpserver, "PreprocessDataset")  # Obtaining GPTask by module name

module = gp.GPTask(gpserver, "urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00020:5")  # Obtaining GPTask by LSID

However you obtain a GPTask object, before this task can be used to run GenePattern jobs its parameters must first be loaded from the GenePattern server. This can be accomplished by running the code shown below.

module.param_load()

After loading the parameters they can be explored by calling the code below. This will return a list of GPParam objects. Each GPParam object contains a parameter name, description, type, whether it's optional and other metadata, as shown below.

params_list = module.get_parameters()  # Get the list of GPParam objects



for param in params_list:  # Loop through each parameter

    print( param.get_name() )  # Print the parameter's name

    print( param.get_description() )  # Print the parameter's description

    print( param.get_default_value() )  # Print the parameter's default value

    print( param.is_optional() )  # Print whether the parameter is optional

In order to run a GenePattern job from Python you must first obtain a GPJobSpec object from the correct GPTask object and then set the appropriate parameters for the job. Files can be uploaded by calling GPServer.upload_file(). An example for the PreprocessDataset module is shown below.

job_spec = module.make_job_spec()  # Create the GPJobSpec

uploaded_file = gpserver.upload_file("file_name", "/path/to/the/file/file_name")  # Upload the input file

job_spec.set_parameter("input.filename", uploaded_file.get_url())  # Attach the input file to the correct parameter

for param in module.get_parameters():  # Loop through all the other parameters and set their default values

    if param.get_name() != "input.filename" and param.get_default_value() is not None:  # If it's not the parameter we just set, and if it has a default value

        job_spec.set_parameter( param.get_name(), param.get_default_value() )  # Set the default value

Once the GPJobSpec is ready, it can be used to launch a GenePattern job as shown below. The call returns a GPJob object representing the job. By default, the call blocks until the job has finished running on GenePattern, which may not be desirable for long-running jobs. If you pass False as the second argument, the call returns immediately with a GPJob object that represents the pending or still-running job, and your program continues.

job = gpserver.run_job(job_spec)  # This will halt execution until the job is complete

job = gpserver.run_job(job_spec, False)  # This will return the job object and continue execution even if the job isn't finished

If the latter option is used, the status of the job can be queried programmatically by calling the following code:

job.is_finished()  # Queries the server and returns True if the job is complete, False otherwise

job.get_info()  # Returns a brief description of the job's current state
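
When a job is started without waiting, a program typically polls is_finished() at intervals until the job completes or a time limit is reached. The following is a minimal sketch of such a loop. The helper wait_for_job and its poll_seconds and timeout_seconds arguments are hypothetical and not part of the GenePattern library; the sketch relies only on the is_finished() method shown above.

```python
import time


def wait_for_job(job, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll job.is_finished() until it returns True or the timeout elapses.

    Returns True if the job finished in time, False if the timeout was reached.
    """
    deadline = time.time() + timeout_seconds
    while True:
        if job.is_finished():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(poll_seconds)


# Intended use with the objects created earlier (requires a running server):
#   job = gpserver.run_job(job_spec, False)
#   if wait_for_job(job, poll_seconds=10):
#       output_list = job.get_output_files()
```

Each call to is_finished() queries the server, so a poll interval of several seconds keeps the load on the server low.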

Finally, once the job is complete its output files can be obtained by making the call shown below. This will return a list of GPFile objects, each containing methods to download or read the contents of the file.

output_list = job.get_output_files()  # Get a list of output files

for file in output_list:  # Loop through each output file

    print( file.get_url() )  # Print the URL to the file

    data = file.read()  # Read the data in the file
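
If an output file is in GCT format, the text returned by read() can be turned into Python lists for further analysis. The following sketch assumes that the data is a text string in GCT 1.2 layout: a #1.2 line, a line with the row and column counts, a header line beginning with Name and Description, and then one tab-separated line per row. It also assumes that every value is numeric. The function parse_gct and the example data are hypothetical and not part of the GenePattern library.

```python
def parse_gct(text):
    """Parse GCT 1.2 text into (sample_names, {row_name: [values]})."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 3 or not lines[0].startswith("#1.2"):
        raise ValueError("not a GCT 1.2 file")
    n_rows, n_cols = [int(v) for v in lines[1].split("\t")[:2]]
    samples = lines[2].split("\t")[2:]  # skip the Name and Description columns
    rows = {}
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        rows[fields[0]] = [float(v) for v in fields[2:]]
    if len(samples) != n_cols or len(rows) != n_rows:
        raise ValueError("dimensions do not match the header line")
    return samples, rows


# Small made-up example in place of the text returned by file.read().
example_text = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "geneA\tfirst gene\t1.0\t2.5\t3\n"
    "geneB\tsecond gene\t4\t5\t6.5\n"
)

samples, rows = parse_gct(example_text)
print(samples)
print(rows["geneA"])
```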

Using LSIDs from Python

You can use Life Science Identifiers (LSIDs) instead of module names to identify modules for GenePattern to run. For Python, this is primarily useful when you want to specify a particular version of a module. The easiest way to specify a particular version is to specify the LSID as an argument to a Python method, such as the GPJobSpec constructor, in place of the GenePattern module name. For example, the following statement invokes version 5 of the PreprocessDataset module, whether or not it is the latest version:

job_spec = gp.GPJobSpec(gpserver, "urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00020:5")

If you are unfamiliar with LSIDs and GenePattern versioning, see Concepts.

GenePattern Python Tutorial

A tutorial for using the GenePattern Python library, as well as a tutorial for using GenePattern in conjunction with common Python libraries for scientific computing and plotting, are available in the GenePattern Notebook Environment. These notebook tutorials can be downloaded from the Example Notebooks page.

Accessing GenePattern from the REST API

This section is currently under construction.

We have documented our current method for downloading a filtered set of jobs from GenePattern in our Google forum, and we will continue to build out this section.

Please contact us with any questions.