Hawaii Notes Day 2

 

RNAO meeting notes

01/09/10

Attendees: Rob, Jesse, Jerry, Dave, Lauren, Alain, Stas, Amanda


Stas and Alain added 17 types of chemical probing—mostly able to characterize them using yesterday’s model.  Rhodinium was hard to categorize … provides per-nucleotide info about magnesium binding.  Could add ligand to the list of structural restraints—then could add Fe-BABE probes as well (tethered hydroxyl radical).  Already added some PubMed IDs in there.  Need Dave to check; he will do so when he’s back to two monitors.


“Is used to estimate secondary structure” or “is used to assist in estimation of secondary structure” rather than “is used to determine secondary structure”

AB would like to discuss merging the OBI approach (which basically describes an assay, with inputs and outputs) with what we produced, which pretty much describes a reagent.


Dave has outlined at high level what should go into practical approach for determining secondary structure (1.5 pages long).

RNA Secondary Structure Determination: A Practical Approach

Target journal: RNA

Introduction

RNA important

Lots of non-coding RNAs

Secondary structure is important and prediction is commonly performed

No tertiary structure has ever been solved without having a secondary structure ahead of time (used both to design constructs and to interpret the data)

ACTION ITEM: Rob will check with Norm Pace to confirm this

RK asks if any structure has been solved w/o comparative sequence analysis?

This contribution provides guidance on determining a secondary structure and on evaluating a secondary structure

Secondary structure is best determined by comparative analysis

Cheaper to get more data for comparative analysis by trolling through homologs from genomics data than to solve structure without comparative analysis

[Describe principles for comparative analysis]

Caveats

Requires many sequences

Takes user insight and time

Requires accepting consensus

Compensating base pair changes are the proof: if you can find a compensating base pair change per base pair, you’re in the league of the best analyses; otherwise, if you have one compensating base pair change per helix, that’s probably acceptable.  You have hypotheses (helices); compensating base pair changes may support or fail to support one or more such hypotheses.
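The covariation test above can be sketched in code.  This is a minimal illustration only: it assumes an alignment of equal-length sequences and a candidate helix given as paired column indices; the function name and the wobble-inclusive pair set are choices made for the example, not taken from any published package.

```python
# Sketch: count compensating base-pair changes supporting a candidate helix.
# Assumes an alignment of equal-length strings and a helix given as
# (i, j) column pairs; all names here are illustrative.

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
                ("G", "U"), ("U", "G")}  # canonical plus wobble pairs

def compensating_changes(alignment, helix_pairs):
    """Number of helix positions where some sequence shows a different
    base pair than the first sequence while still pairing validly."""
    ref = alignment[0]
    count = 0
    for i, j in helix_pairs:
        ref_pair = (ref[i], ref[j])
        for seq in alignment[1:]:
            pair = (seq[i], seq[j])
            if pair != ref_pair and pair in WATSON_CRICK:
                count += 1
                break  # one compensatory change is enough for this position
    return count

alignment = ["GGCAAGCC",   # reference
             "AGCAAGCU",   # G-C -> A-U at columns (0, 7)
             "GACAAGUC"]   # G-C -> A-U at columns (1, 6)
helix = [(0, 7), (1, 6), (2, 5)]
```

Here two of the three helix positions show a compensatory change, which by the per-helix criterion above would count as acceptable support.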

Software can be helpful

There is software to aid with the alignment

Alignment step is nontrivial; should not expect an automated alignment method to just give what is required

Quite accurate

Ex: RNase P, group I introns, the ribosome

Acceptable positive predictive value and specificity depend on your application

Secondary Structure Prediction

Many excellent packages facilitate prediction {like to collaborate on this paper with a number of computational groups} {maybe describe using the taxonomy in Paul’s paper}

Pseudoknot/pseudoknot-free methods

Thermodynamic/machine learning

Single/pairwise/multiple

This is a hypothesis

Ways to improve the accuracy {might have to go first in this section}

Use a method that takes multiple sequences that you believe to have the same structure (homologs)

[Provide guidance on (these three are important for selecting a software package)

How similar the sequence is

How many sequences there are

How long they are]

Cite examples

If you were trying to solve a particular sequence, and you used a given software package with multiple sequences, you’d do X amount better than if you used just one sequence

Use experimental data

Types

SHAPE

Enzymes

Chemical agents

NMR

Emerging technologies: microarray

Care must be taken

Many forms of data do not guarantee the right answer

Pseudoknots are still hard to predict

There are probably many incorrect structures that are compatible with the data set

Chemical modification data keeps you from totally bombing but doesn’t get you to the right structure

You need to input the chemical modification data into a program and let it be promiscuous

Chemical modification can be wrong (RNA mapped without being properly annealed, etc)

{ACTION ITEM: Would be nice to collect chemical mapping data and use same kind of secondary structure optimization Dave did for SHAPE; Alain says he will do the mapping if he gets the constructs, Rob says he will send the constructs if he can pick the species; Dave suggested 16S RNA}

Deletion/compensating change experiments coupled with a functional assay

Discussion

Value of secondary structure

Allows quite a lot of insight into function by itself

Mechanistic—this bulge nucleotide does catalysis

Conservation of secondary structure is often evidence for function, can be used to search

Use to design constructs for 3D

Useful for understanding data

Has to be evaluated carefully


Make paper short and exciting rather than long and exhaustive; cite reviews of programs rather than all the programs themselves.


Should be brief and relatively undetailed so we can get consensus from a long author list.  Everyone agrees comparative structure analysis is good with some caveats, but list of caveats differs by person, etc. 


Would be an interesting research project to determine whether probes other than SHAPE ones are relevant any more, and publishable no matter which answer you get.


Distribute to proposed author list as outline at about this level of detail, ask if they want to add anything, evaluate contributions, add one point from everyone, then write.  Cast medium-wide net for input:

Experimentalists:

Doug

Kevin

Mike Yarus

Norm Pace

Tom Cech

Art Pardi?

Rob Batey

Ron Breaker

Jack Szostak

Rick Russell?

Harry Noller?

Dan Herschlag

Jim Brown

Michael Brenowitz

Sarah Woodson

Scott Strobel

Dave Bartel

Anna Pyle

Jerry Joyce


Computational:

Eric Westhof

Neocles

Craig

Ivo Hofacker

Michael Zuker

Francois Major

Robin Gutell?

Gary Stormo

Anne Condon?

Russ Altman

Bruce Shapiro

Rolf Backofen

Jan Gorodkin

Peter Stadler

Sean Eddy

Paul Gardner

Gerhard Steger

Robert Giegerich

Cenk Sanliap (?)

Ye Ding?

Chip Lawrence

(?) Asai


Concern: there will be no one left to review it.

If include Ron Breaker, will have to mention in-line probing technique.

Response rate will prob. be 50-60% at best.

How to deal with contention?  If someone suggests total rewrite, say “thanks for your feedback, we’ve got consensus from a lot of other people, we’ve made these changes that we hope address your concerns; do they, or would you prefer not to be an author on the paper?”


“Letter to the Editor” type paper … contact Tim Nilsen or Eric Westhof first?  Eric; do the same thing we did for the alignment ontology paper.


Say all software packages are broadly similar except that package X is better for certain kind of data?  List 5 or 6 top programs (all from people on the list)?


---break---

First invite nucleus of people we know and are easy to work with, then invite others to sign on.


Invite some Japanese colleagues?  Dr. Asai (first name?)

ACTION ITEM: Alain, Rob, and Dave will identify the people they know well and want to invite

ACTION ITEM: Dave will ask Eric if he can think of anyone who was left off; Rob will do the same for Mike


Send it to the ROC list?  No.


ISA Tools (report from Jesse and Lauren)

ISACreator: used by experimentalist to put data in

ISAConfigurator: used by superuser to develop elements to put in tables. 

Once configurator file is made up, you input it into the creator.

There’s a locking feature so experimentalists can’t screw up the template.


Front page of ISACreator asks for study name, description, dates, publication, protocol you used (can add multiple).  If you’re doing a lot of experiments following the same routine, don’t have to fill that in over and over.


Excel table where column headers are defined by your protocol and the minimum information required for it.

In a lab, would probably only do this when getting ready to submit an experiment for publication.

Once protocols are in OBI, can attach to ISACreator



ACTION ITEM: Colin needs to put RNAO in OBI

ACTION ITEM: implement whatever MI standard we come up with here in ISAConfigurator, try using ISACreator to generate ISA-TAB files for some existing experimental data (Lauren, Jesse, Amanda)


How is Kevin storing his data?  Excel files.  Dave puts them in a CSV repository.


Preliminary MI discussion

ISA-TAB is not an MI standard, but it is a fairly frequently used format for describing MI standards, so we’d like to come up with one that fits into it.


Already gives us all the basic stuff about investigation, study, assay, who did it, what the title was, what the description was, who the contact person is, etc, so we should definitely not reinvent that/represent it differently.


As a user community, we will probably be making sure that the relevant study design types, study factors, study endpoints, study measurements, study technologies, protocols, protocol parameters, sample parameters, etc.  are defined for this community and are input into the relevant ontologies.


Avoid adding additional top-level descriptors to the study (like “number of genes tested”) because they won’t fit easily into ISA-TAB.


Proposed Workflow/Object Breakout

May be worth saving the traces, but then we have to store the control/background with them, plus ladders for the alignment: + reagent, − reagent, and at least two sequencing ladders (ddNTPs).



RNA

Fragment

Construct

Fragment-in-Construct

AssayInstance → Assay Type

Traces

PI (peak intensity) instance → PI Type

Raw Reactivity

DA (data analysis) instance → DA Type



“Signal processing” rather than “Data analysis”?


Process according to Alain:

Some RNA—do chemical mapping (plus reagent, minus reagent, two ddNTPs (only two rather than all four because in capillary you only have four colors)).  Should we just stick to capturing the data from our lab after the processing?


User probably cares about the construct and the final intensity data, but nothing between those two.  Dave can imagine that there are some cases where he’d like to get back to the raw data.  There are some human interventions in the intermediate steps such that they are not completely reproducible.


Do you need to store the intermediates in a structured db you can query, or do you just need to store those files somewhere (offline)?  Offline would be ok. 


Construct + optional primer → plus/minus/2 ddNTPs (sequencing ladders) → 4 traces → align traces → correct for RT biases (signal processing) → peak integration → get raw peak intensities (aligned) → do intensity background subtraction → do normalization → this gives you per-base reactivity data for one construct + one primer → combine a lot of these into per-base reactivity for the whole molecule.
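The background-subtraction and normalization steps at the end of this workflow can be sketched numerically.  This is only an illustration: the outlier-trimmed normalization rule below is one common convention used as a placeholder, and the function name is invented for the example.

```python
# Sketch of the last two numeric steps in the workflow: background
# subtraction and normalization of per-base peak intensities.
# The "drop top ~2%, average next ~10%" scaling rule is an assumed
# placeholder, not a prescribed protocol.

def per_base_reactivity(plus_reagent, minus_reagent):
    """Subtract the no-reagent background, floor at zero, then scale so
    that the average of high-but-not-outlier values is 1.0."""
    raw = [max(p - m, 0.0) for p, m in zip(plus_reagent, minus_reagent)]
    ranked = sorted(raw, reverse=True)
    n = len(ranked)
    outliers = max(1, n // 50)                       # discard top ~2%
    top = ranked[outliers:outliers + max(1, n // 10)] or ranked[:1]
    scale = sum(top) / len(top) or 1.0               # guard all-zero data
    return [r / scale for r in raw]
```

On a toy four-base example with uniform background 1, e.g. `per_base_reactivity([5, 3, 2, 1], [1, 1, 1, 1])`, the highest value is treated as the outlier and the next value sets the scale.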


Where the primer anneals is important; do tiling primers so can get overlap between them.


If you do something with SAFA with direct labeling, you don’t have a primer, although you have a labeling site.  If you had a tRNA and did cleavage and end-labeled it and ran it on a gel—no primer.


Should support data being captured by traditional gels, or just data being captured by high-throughput experiments?  Do we want to do only quantitative data where did peak integration, or also qualitative data?


SAFA collapses the gel into a bunch of traces, so could handle it with a trace model.  Dave would prefer not to do that, but won’t make perfect the enemy of getting something captured.


GEO tells you to upload raw data file, but the individual spots are not searchable.  So have people upload raw gel image and searchable SAFA traces?  Most people just care about numbers in the end.


Agreed: store construct + optional probe, per-base reactivity for 1 primer + construct, optionally allow people to upload a blob of all the intermediate steps and record whether such intermediate data was uploaded but don’t record what KIND of intermediate data was uploaded.


Store data before doing reconstruction of overlapping primers because (a) techniques for this are more in their infancy, and (b) reconstruction can be (re)done entirely computationally.


Traces/Gel images can be gigabytes (but can probably be compressed); may take a long time to upload over network.



RNA

Fragment

Construct

Fragment-in-Construct

AssayInstance → Assay Type

Raw Reactivity

Normalization + assembly instance

Assembled Reactivity


Record normalized vs. not in the assay instance?  Take one or more raw reactivity vectors, do normalization and assembly, and get something that covers a whole bunch of constructs (?)


Most people will query: for this RNA, give me all the assembled reactivities.  For someone developing a new method for secondary structure prediction: for this RNA, what is the best assembled reactivity for (say) 16S?  What about 16S from E. coli?  If skeptical: what other assembled reactivities are there?  What raw reactivities did they come from?


So: analyze data, put it in ISA-TAB files, upload the ISA-TAB files + a zip file containing the extra data.


What is needed to do normalization and assembly step?  The raw reactivity, which is one number per nucleotide.  Normalized, assembled data comes from many assays together.  So have assembly treated as a different kind of assay with outputs from the other assays as inputs.


So what do we want to record about each of the objects we’ve identified?

RNA Sequence:

Name (free text, e.g., group I intron)

Species (from NCBI taxonomy; there is an NCBI taxonomy id for artificial sequence)

RFAM id (may be null)

Function (can we get this automatically through RFAM as an ontology term, or from GO?)

Accession (from NCBI)

Sequence

Strand?  No—store the positive and negative strands separately as two entries.  Want people to include strand info in the name for viruses.


Sequence is the sequence of the single-stranded RNA that was in your test tube.  Preparation info goes elsewhere (eg, in vitro or in vivo, etc).


What about modified nucleotides?  Not captured at this stage.  Assume plain RNA all the way to the construct level, put something between construct and assay that captures modifications.  Know if in vitro transcribed, know it is unmodified; if in vivo, don’t know … probably modified.


Fragment:

Link to parent RNA sequence

Start (index of first nucleotide from RNA that is included)

Stop (index of last nucleotide from RNA that is included)


Do people actually care about how the fragments within constructs map to the parent RNA?  You can figure it out from the sequence … lots of complexity, lots of variability.  Thus the group prefers to store just the construct and its sequence, with notes that parts of it came from a biological sequence.  So: get rid of Fragment and Fragment-in-Construct.  Now we just have a many-to-many relationship between RNA and Construct (which isn’t nice, so we’ll revisit it later—what about a join table, SeqInConstruct?).


Revised object/workflow:

RNASequence

Construct

SeqInConstruct

Assay Instance → Assay Type, Primer

Raw Reactivity

Normalization and Assembly Instance

Assembled Reactivity



SeqInConstruct:

Link to RNASeq

Link to Construct

Start_seq

End_seq

Start_construct

End_construct

Changed (bool)


Construct:

Name

Sequence

Seq_checked (bool)?
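The revised objects above could be encoded as simple records.  A sketch in Python dataclasses follows; the field names come from the notes, but the types, defaults, and 0-based end-exclusive coordinate convention are assumptions made for illustration.

```python
# Illustrative dataclasses for the revised object model: RNASequence,
# Construct, and the SeqInConstruct join table. Types, defaults, and the
# 0-based, end-exclusive coordinates are assumptions, not a fixed schema.

from dataclasses import dataclass
from typing import Optional

@dataclass
class RNASequence:
    name: str                       # free text, e.g. "group I intron"
    species_taxid: int              # NCBI taxonomy id (artificial has one too)
    sequence: str                   # the single-stranded RNA in the tube
    rfam_id: Optional[str] = None   # may be null
    accession: Optional[str] = None # from NCBI

@dataclass
class Construct:
    name: str
    sequence: str
    seq_checked: bool = False       # was the construct sequence verified?

@dataclass
class SeqInConstruct:
    """Join table resolving the many-to-many RNASequence<->Construct link."""
    rna_seq: RNASequence
    construct: Construct
    start_seq: int                  # coordinates in the parent RNA sequence
    end_seq: int
    start_construct: int            # coordinates in the construct
    end_construct: int
    changed: bool = False           # True if the copied region was mutated
```

A construct carrying an unmodified copy of a parent sequence would then satisfy `construct.sequence[start_construct:end_construct] == rna_seq.sequence[start_seq:end_seq]`.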


Sometimes we run multiple probes on the exact same prep … do we need to record that it was the exact same prep?  No.

Remember that this is not your LIMS—doesn’t have to record every experiment, including those that don’t work.


Could be a constraint in the database: if the user says Changed is false but the relevant subsequence of the construct doesn’t exactly match the relevant subsequence of the parent RNASequence, then the user is notified and chastised.
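That consistency check might look like the sketch below, assuming 0-based, end-exclusive coordinates (the notes do not specify a convention); the function name and the dict encoding of the link record are invented for the example.

```python
# Sketch of the Changed-flag consistency check: if the user claims the
# region is unchanged, the construct subsequence must exactly match the
# corresponding subsequence of the parent RNA. Coordinates are assumed
# 0-based and end-exclusive; names here are illustrative.

def check_unchanged(rna_seq, construct_seq, link):
    """link holds start_seq/end_seq/start_construct/end_construct and a
    'changed' flag. Returns True if the record is internally consistent."""
    if link["changed"]:
        return True  # user declared a modification; nothing to verify
    parent = rna_seq[link["start_seq"]:link["end_seq"]]
    insert = construct_seq[link["start_construct"]:link["end_construct"]]
    return parent == insert

link = {"start_seq": 2, "end_seq": 6,
        "start_construct": 0, "end_construct": 4, "changed": False}
```

A database would run this on insert and flag (and chastise) the user when it returns False.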


Do we have to record details like where we ordered the oligo from, was it HPLC purified, etc?  No, but would be good to know if you sequenced it yourself or are trusting someone else—ie, an evidence code.


Propose that same construct can go into many assays and that labeling or modification is property of assay, not the construct.  Agreed.


AssayType:

What’s in the spreadsheet



AssayInstance:

In vivo, in vitro transcribed, in vitro chemical synthesis, cell extract

Date

Who

Where

Link to AssayType

Link to a Folding Protocol

Link to Sequence Prep, purification method

Solvent concentrations: (sum of all) divalent concentration, (sum of all) monovalent concentration, pH, temperature, and concentration of ligand if there’s an exogenous molecule (zero/null otherwise?).  (Sometimes you might want to know which divalent it was.  Do you really need to search for all assays done with 5 mM Mg, or do you just want to look at individual assays, see their solvent concentrations, and decide whether you believe them or not?)  Will also be free text to handle weirdnesses and additional details.

Concentration of modifying agent

Duration of exposure to modifying agent

Concentration of RNA

Sequencing technology used: gel / capillary / 454 / Solexa (Illumina GA) / SOLiD / Helicos / other (specify manufacturer)

Readout method: labeling/RT/other (direct vs indirect)
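As a sanity check that the AssayInstance fields fit a flat record, here is a sketch as a Python dataclass.  The field names track the list above, but the types, units, and defaults are guesses made for illustration, not a fixed schema.

```python
# Illustrative dataclass for the AssayInstance fields listed above.
# Types, units (mM, nM, minutes), and defaults are assumptions.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AssayInstance:
    assay_type: str                       # link to AssayType
    preparation: str                      # in vivo / in vitro transcribed / ...
    date: str
    who: List[str] = field(default_factory=list)  # may be more than one person
    where: str = ""
    folding_protocol: str = ""            # link to a Folding Protocol
    divalent_mM: float = 0.0              # sum of all divalent concentrations
    monovalent_mM: float = 0.0            # sum of all monovalent concentrations
    pH: float = 7.0
    temperature_C: float = 25.0
    ligand_mM: Optional[float] = None     # None when no exogenous molecule
    reagent_mM: float = 0.0               # concentration of modifying agent
    exposure_min: float = 0.0             # duration of exposure to agent
    rna_nM: float = 0.0                   # concentration of RNA
    sequencing: str = "capillary"         # gel/capillary/454/...
    readout: str = "RT"                   # labeling/RT/other
    notes: str = ""                       # free text for weirdnesses
```

With drop-downs pre-filling most of these, loading an existing record and modifying it (as Lauren suggests) is just copying a dataclass instance and changing a few fields.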


Lauren thinks filling this in is doable, especially if there are drop-downs, and if can load an existing experiment and modify it.


Going back to things we left out earlier about all the steps from construct up to peak intensity:

Do we need to store which lanes you sequenced?  (E.g., do you need to record that it was ddA and ddC in the database, or should that just be recorded in the protocol?)


We decided not to handle imaginary use cases.

For the traces, do we need to record anything other than the sequencing method?  No.

Person who does the alignment may be different from the person who does the experiment; should we record both?  Just allow more than one person to have done the assay, but don’t record who did what.


For alignment, record:

Protocol

Manual/program

If a program, record the name, version number, UsedDefault (bool), and free text for non-default params used (can’t just record the command line because some programs only have GUIs).


For signal processing, record:

Decay correction

Baseline correction

---lunch---


For peak integration, record:

Which software used: ShapeFinder, CAFA, ImageQuant, SAFA, NIH Image, Other (specify in protocol)


People should record the software they use in their free-text protocol that includes info on data analysis, but will not require/verify any info on this.

There are 3-4 labs in the world that do this sort of data analysis.


We are leaving out a lot of information some MI people might want for “reproducibility”; is this going to cause us problems?  Need to address this when publishing: say we want minimal information, including protocols, because these are needed to reproduce your experiment, but we’re not capturing the protocols in machine-readable form because we don’t believe they need to be searchable.  Enough info to run a new experiment that would yield the same structural hypothesis within the noise of the approach.


Normalized per-base reactivity: is this sufficient info for the downstream steps where you combine info from lots of different fragments?  Yes.  Raw peak intensities are for all four lanes and per-base reactivity is not … so you lose info about the background subtraction; is that a problem?  Might want to know if someone had a bad background—subtracting a big number from a big number to get a small number.


Store two vectors per construct + optional primer/label:

per base reactivity w/o background subtraction (with reagent)

per base background


Will NOT store background subtracted per-base reactivity per construct

To get the reactivity for a construct, subtract the background vector from the with-reagent vector

Per-position data for the whole RNA AND what constructs it comes from, with a free-text protocol on how they got from per-construct info to whole-RNA info
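The storage scheme above implies a purely computational assembly step: per-construct (with-reagent, background) vectors plus each construct's offset in the full RNA yield whole-molecule reactivity.  A sketch follows; the averaging rule for overlaps is an assumption, since the notes leave the merge protocol as free text.

```python
# Sketch: assemble per-construct reactivity vectors into whole-molecule
# reactivity. Each construct contributes (offset, with_reagent, background);
# background is subtracted per position, overlaps are averaged. Offsets are
# assumed to fit within the stated RNA length.

def assemble(length, construct_data):
    """construct_data: list of (offset, with_reagent, background) tuples.
    Returns a per-position reactivity list for the whole RNA, with None
    where no construct covers the position."""
    sums = [0.0] * length
    counts = [0] * length
    for offset, plus, minus in construct_data:
        for k, (p, m) in enumerate(zip(plus, minus)):
            pos = offset + k
            sums[pos] += max(p - m, 0.0)   # subtract background, floor at 0
            counts[pos] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Because this step is deterministic given the stored vectors, it can be re-run at any time—which is exactly the argument for storing the data before reconstruction.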


Should put internal IDs (e.g., from your lab’s LIMS system) into the free-text protocol or into one of the identifier fields (investigation identifier, study identifier) in the ISA fields.


ACTION ITEM: someone (suggest student) will need to identify units, etc, of the fields that need to be recorded.


What are the next steps in moving this MI project forward?

One sub-aim of multi-PI project grant, for instance one run by Kevin?

What about an R01 between Kevin & Alain?  Could also look at reproducibility, and whether chemical modification adds anything to SHAPE


Need at least one professional-level DBA who will be there long-term and possibly additional support staff.  Not a good idea to do this out of start-up funding or with a grad student, because eventually grad student will leave.

Note that current MI practices do not generally allow MI standards and databases, or MI standards and formats, in the same paper: MI standards must be distinct from any practical usage, etc.


How about adding “compliance with MI standards” as an aim for CAFA upkeep, and using that funding to make the DB?

Work together on writing MI standard first.

Address any concerns about community buy-in with statement that this is an emerging field, and the leaders in the field are paper authors.  Aim for NAR or RNA for publication.


Might still choose to do a proof-of-concept implementation of a first draft of the MI standard for publication; say that it’s an example of how such a standard can be compliant with ISA tools and MIBBI.


Action item: group will work on authoring MI standards draft publication


Next work item: Classification of non-single-nucleotide-resolution structure mapping methods (used for evidence codes)

Info about a structure:

From sequence alone:

Predictions from:

1 sequence

2 sequences

Many sequences


Many-sequence predictions can use covariation, split into phylogenetic (like EvoFold) or non-phylogenetic, or common folding (like Sankoff)


Can also be broken down as thermodynamic, machine learning (SCFG, neural net—such as KnetFold), base-pair matching, manual, cycles, or theoretically predicting 3D and reading off base pairs


Attribute of a prediction is: pseudoknots

Another is: noncanonicals


Best approach is to collect a whole lot of sequences and use covariation, but could also use these other methods that don’t work for one or another reason.


Info about structure:

From experiments alone:

3D—all atom (X-ray, NMR, Cryo-EM)

Distance constraints (pairs of residues) (NMR just for distance constraints, Fe-BABE, MOHCA, FRET)

Whether a region is paired (microarray, in-line probing, force microscopy, RNase H)

For the case where a region is a single nucleotide, SNRNASM


But force microscopy tells you how many pairs there are, not where they are!  Maybe it could tell you which pairs there are if you did substitutions …


Propose that each of these things gets its own evidence code, arranged in a hierarchy.  Each base pair will have evidence codes from the methods used to address it directly and from those that were used to determine the structure but didn’t address that particular pair.


Levels of evidence:

Used to support this base pair by direct observation

Used to support fact that this base is paired (but not its specific partner) by direct observation

Used to produce structure that supports this base pair

Used to produce structure that supports that this base is paired (but not its specific partner)

Used to support that this base is unpaired by direct observation

Used to support a structure in which base is unpaired

….


Basically, have all combinations of

Single base / pair (number of bases involved) | supported / denied (whether this input is paired) | direct / indirect
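The cross product above can be enumerated directly; the labels below are illustrative shorthand, not proposed evidence-code names.

```python
# Sketch: enumerate candidate evidence codes as the cross product of the
# three axes described above. Labels are illustrative shorthand only.

from itertools import product

SCOPE = ["single_base", "base_pair"]      # number of bases involved
VERDICT = ["supported", "denied"]         # whether pairing is supported
OBSERVATION = ["direct", "indirect"]      # direct observation vs inferred

codes = ["_".join(combo)
         for combo in product(SCOPE, VERDICT, OBSERVATION)]
# 2 x 2 x 2 = 8 candidate evidence codes
```

Any hierarchy layered on top of these (as proposed for the OBI contribution) would group the eight leaves, e.g. all direct codes under one parent.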


For the whole structure, we will be able to say what evidence was brought in to support it, but we want this at the base-pair level too, to help decide between local structures.  But what if the active site was at the 3’ end of the molecule, there were only two possible secondary structures, and DMS mapping through the rest of the molecule eliminated one of them: doesn’t that mean the DMS mapping provides the structure at the 3’ end even though it didn’t directly observe it?  RK says no; DMS would support that structure but not show it by direct observation.


For each of the types of evidence mapped out above, we can say which level of evidence it provides.


Can imagine providing some annotation to the structure right away, e.g., a color code for the evidence code … could be put into a user interface for assigning pseudo-energies only to pairs you believe in … maybe this should be a PPG proposal—bring in a viewer, a unified presentation framework.


Hierarchy of methods can be contributed to OBI in same way that SNRNASMs are contributed, as a taxonomy of methods with inputs and outputs—outline it as work to be done in the future, only contribute it to OBI if there seems to be definite value.


ACTION ITEM:  Rob will go through, make up short acronyms for each of these, send them to Michelle Giglio (U Maryland), who has an evidence code ontology.


---break---

Are MI and OBI two papers or one?  What about SIGS (Standards in Genomic Sciences)?  What if we send MI and OBI to SIGS, and the database paper to RNA, drawing on OBI and MI?  How about writing one paper with the MI and OBI material, carefully partitioned, and aiming for RNA?  Don’t include the database, though.  Will this also include the evidence code concept?


UNASSIGNED ACTION ITEM: Will be necessary to give this MI standard a name and contribute it to MIBBI.

MISNRNASM?

MISME (minimum information for a structure mapping experiment)?


Alain will be in charge of the MI/OBI paper.

ACTION ITEM: Alain will outline by Jan. 31.

ACTION ITEM: AB and RK will structure meeting notes by end of week

ACTION ITEM: RK will send Alain either photos of the trees he’s drawn during the meeting or computer-readable renderings of them


Author list: workgroup attendees, Neocles, Craig, Kevin, Philippe, Susanna, maybe Doug


Moving on to Dave’s paper:

ACTION ITEM: Dave will clean up outline, send to everyone in workshop group to comment

Previously captured Action item: pick core group


Evidence code stuff belongs in Alain’s paper rather than Dave’s.


Alain will be in charge of the database aspect.

For later, Alain will have the programmer he has access to through his startup funds come out to CU to learn how to set up the database for his lab.  But he will probably wait to finish the HTP sequencing work first (about 1 month) so he knows whether his process—and thus what he needs to store—will change; after that we should have another discussion about the data model.  Both Alain and Dave would be delighted if there were a database within a year—no screaming rush.  Given this amount of lead time, RK suggests putting together a SAFA/CAFA grant incorporating DB development.


ACTION ITEM: RK/AB write a paragraph on what goes where (OBI, ChEBI, RNAO) and run it by Philippe and Susanna