Minutes of the Cambridge ROC meeting


Notes by CLZ:



NL: Running through the recently-submitted RNAO paper.

MD: Submitted main RNA residues to ChEBI, so we can refer to them there.  There are many modified nucleotides; not sure this was resolved. 

DH: We can look at the ChEBI definition later.

MD: Let's use a name like "Covalently_Connected_To" instead of Q1_Connected_To.

CB: We might want to distinguish between connections that can change when the molecule no longer is the same type of molecule.

MD: We're not doing spatial reasoning in the RNAO paper.  Why not have an object called a *bond* and use that to describe covalent connections? 

CB/NL: Bonds are an abstraction.  In aromatic rings, they don't really exist, there is an electron shared all around.

MD: We talk about bonds like they exist, not like they are just abstractions.

RH: Definition D4 refers to bonds.  Q1_connected_to is a primitive relation, not something to define.  Domain and range are atoms.

CB: ChEBI needs an atom-level description of molecules that includes covalent connections and/or bonds.

PG: (Referring to Figure 3) Secondary structure of RNA and prediction, reference early articles about ...

      2D description: cite early 2D "unique decomposition" work, e.g. Waterman's book, Nussinov, Zuker,...

CB: Reading through the definition of a molecule.

RH: Some problems with quantification, chains of arbitrary length.  Just writing down this definition in formal logic is problematic.  This approaches/uses second-order logic (talking about properties of subsets) and causes difficulty with reasoning.

MD: Can we just make a transitive relation that reflects covalent bonding?

RH: First-order logic (FOL) is semi-decidable. 

NL: Can we formulate this with first-order logic?

MD: If you have bonds, it can be better.  You don't have to quantify over all relations, the bonds are all listed explicitly.

CB: This shifts the problem one along.

RH: Using bonds doesn't avoid the fundamental problem.

MD: You always know whether a collection of atoms is a single molecule or more than one molecule, so you don't need your reasoner to check a definition to determine whether something is a molecule.
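A minimal sketch of MD's point, assuming covalent connections are listed explicitly as atom pairs: grouping atoms into molecules is then a connected-components computation (the transitive closure of the listed connections), with no quantification over relations. The atom labels and the function name `molecules` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def molecules(atoms, covalent_connections):
    """Group atoms into connected components of the covalent-connection graph."""
    neighbors = defaultdict(set)
    for a, b in covalent_connections:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    components = []
    for start in atoms:
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this atom.
        component = set()
        stack = [start]
        while stack:
            atom = stack.pop()
            if atom in component:
                continue
            component.add(atom)
            stack.extend(neighbors[atom] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical labels: three covalently connected atoms plus one unconnected atom.
atoms = ["O1", "H1", "H2", "Ar1"]
connections = [("O1", "H1"), ("O1", "H2")]
groups = molecules(atoms, connections)
print(len(groups))  # 2
```

This only answers "which atoms hang together" for an explicit list; it does not address RH's concern about stating the definition itself in first-order logic.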

RH: The first-order logic definitions would probably not make it into an application.  But it may be important to be able to declare that some collection of atoms is a molecule.  Maybe there is an "infinite schema" method that would work.


MD: If we spend time making a representation at this level, how do we take steps later to, for example, reify bonds, or make a DL (description logic) simplification of what is described in this ontology?  What is the path to that level description?  Chemists talk about and use bonds.  We display them on the screen and people click on them.

CB: A bond is a region of space, a relative space, in between the nuclei of atoms that are Q1 or covalently connected.

RH: This seems to be fine, users want them.


NL: Submolecule, and what we call the submolecules relevant to RNA.

RE: Hasn't seen this before, it's nice to be able to refer to something like a nucleotide as a complete entity.

CB: There have been "molecular parts".

RE: Is there a distinction between "residue" and "nucleotide"?  "Residue" is used for things that have external bonds, "nucleotide" would mean a free molecule.

CZ: It would be nice to have a word besides "residue" since amino acids are also called residues.

NL: "Nucleotide residue" would work for a part of a larger RNA molecule.

ME: "Ribose" is problematic.

CB: "Phosphate" is also ambiguous.

MD: The word "group" is also something to use carefully.

DH: What software artifact corresponds to what appears in the RNAO paper?

CB: Nothing is typed in yet.

RH: CLIF could be used ... there is also SPAS, maybe not as standard.

DH: The OBO file is then derived from the CLIF?

CB: Currently it comes from an OWL file; hacked together.

MD: IUPAC says a moiety is a part of a molecule, a group is ....  So we can say "phosphate moiety".

CB: I prefer submolecule.

RH: In the CLIF file, there are some relations that refer to others that don't exist.

CB: It's a kind of drive-by axiomatization.

ME: The term "nucleobase" seems unclear.

MD: "Nucleobase" seems to have been specifically coined to describe part of a molecule.


NL: Non-covalent connections.

CB: Again these seem to be primitives as RH said, like covalent bonds.

NL: The main difference here is that domain and range are different.  These connections can be between larger entities, like part of a base, phosphate moiety, ...

MD: Why?  Why not do the same for covalent bonds too, a carbon is bonded to an aromatic ring, for example?

CB: Covalent bonds really are between nuclei, an interaction with an aromatic ring is more vague.

NL: Nucleotide stacking is not an atom to atom relation.

RH: What are the implications of, say, one atom binding to a ring?

NL: An argon atom could be Q2 bonded to a ring, for example.

CB: If a gecko is Q2 connected to the ceiling, it is also Q2 connected to the building and to England.  It extends like that.
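CB's gecko example can be sketched as propagation of a connection up a parthood chain. This is only an illustration, assuming each entity has at most one direct whole; the `part_of` table and function names are hypothetical.

```python
# Hypothetical parthood table: each entity maps to the whole it is directly part of.
part_of = {"ceiling": "building", "building": "England"}


def wholes(entity):
    """Yield the entity itself and everything it is transitively part of."""
    while entity is not None:
        yield entity
        entity = part_of.get(entity)


def q2_connected_targets(asserted_target):
    """Everything a thing counts as Q2 connected to, given one asserted target."""
    return list(wholes(asserted_target))


print(q2_connected_targets("ceiling"))  # ['ceiling', 'building', 'England']
```

Under this reading, a single asserted connection yields one derived connection per enclosing whole, which bears on RH's later question about counting Q2 connections between two entities.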


NL: Boundaries.

MD: What is the purpose of boundaries?

CB: Originally, we were thinking of (base) edges.  Then thinking about the boundaries between atoms, and they are very fuzzy.  Even worse than the boundary between states that is demarcated by a river, which shifts back and forth.  Boundaries between atoms, or between molecules, are intrinsically fuzzy.  These are fiat boundaries.

NL: There is a difference between the fiat boundary between covalently connected nucleotides and the boundary between nucleotides that are basepairing.

CB: Thomas Bittner has been saying recently that all boundaries are fiat boundaries.



DH: ChEBI guided tour.  Cysteine.  www.ebi.ac.uk/chebi search for cysteine.  Outgoing and Incoming relations.  Relations between different molecular compounds.  SDF files tell which atoms are connected to which.  (Wikipedia "mol file").  Connections to other databases like PRIDE, Rhea.

DH: Search for RNA.  Applet view of each molecule, or image view.  Uses JChemPaint?

MD: ChEBI ID 50299 for canonical RNA nucleotide residues.  5' end residues, 3' end residues, differ in inclusion of an oxygen atom, have a single connection point.  GMP residue has two connection points. 

MD: Ribonucleic acid, 33697.

ME: We recently went through to be careful with plurals; phenol versus phenols. 

CB: Sequence ontology (SO) talks about certain molecules.  tRNA, mRNA, etc. can take these over from ChEBI.


CB: What he's working on in ChEBI for Rob Knight's need for atom names in connection with RNA structure mapping.

CB: Things like Watson-Crick edges are RNA-specific ...

CB: He's added "purine Hoogsteen edge" and the like.

CB: ChEBI 25805 is oxygen.

MD: Do we distinguish between a bound oxygen atom and a free oxygen atom?

CB: What is the ontological status of bound atoms?  Maybe they are regions.

CB: Pyrimidine O2 atom is an oxygen atom and is part of some pyrimidine edge.

MD: What is an edge?

CB: A continuant part, rather than a boundary.

NL: Some of the entries have labels like RNAO_0000015, not human-readable names. 

MD: Let's use rdfs:label in addition; this is standard.

CB: I agree.  It's a little extra overhead.

NL: For modified bases, does this have to be done all over?

CB: The reason for purine versus pyrimidine is that we can start somewhere and modify as needed.

NL: Can we say a modified base is the same as XXX except for these other atoms?

CB: Not totally easy.

NL: We'd like to incorporate Janusz Bujnicki's database of something like 300 modified bases.

ME: Could we do this with nitrogens?

MD: Top class would be nucleobase N atom, etc.  This makes it easier to represent the structure, to represent basepairs, search for hydrogen bonds, etc.

NL: Janusz Bujnicki, Modomics, a database of modified RNA nucleotides.


NL: RNA edges.  Q2 interactions, noncovalent.  You can think of RNA bases as being made up of two faces and three edges, like a triangle.  Bases interact electrostatically between their edges, making hydrogen bonds.  cis (meaning together) versus trans (meaning across) refers to whether the glycosidic bond is on the same side of the line through the basepair or on the other.  18 different classes of edge to edge interactions.  They are disjoint, pretty much exhaust the types of basepairing that we see.  Different groups have different systems for annotating these.
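One way to arrive at the count of 18, sketched under two assumptions that NL does not spell out here: the three edges are the Watson-Crick, Hoogsteen and sugar edges (names that appear elsewhere in these notes), and edge pairs are counted in order (first base's edge, second base's edge), each in a cis or trans orientation.

```python
from itertools import product

# Assumed edge names; the notes mention Watson-Crick, Hoogsteen and sugar edges.
EDGES = ("Watson-Crick", "Hoogsteen", "Sugar")
ORIENTATIONS = ("cis", "trans")

# Ordered edge pairs (3 x 3 = 9), each in two orientations.
pairing_classes = [
    (orientation, edge1, edge2)
    for orientation in ORIENTATIONS
    for edge1, edge2 in product(EDGES, repeat=2)
]

print(len(pairing_classes))  # 18
```

If edge pairs were instead counted without regard to order, the same enumeration would give 6 pairs and 12 classes, so the figure depends on that choice.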

CZ: Annotating basepairs can be a matter of overall geometry or a matter of individual hydrogen bonds.  Is a basepair a geometrical arrangement or an assemblage of hydrogen bonds?

NL: The presence or absence of individual hydrogen bonds can be due to low resolution of the NMR or crystal structure.

PG: Not yet reading in annotations of basepairs.  Still just working on the connections between Rfam and PDB/NDB.  PDB doesn't seem to cover a huge number of families.  Only 20 or so.  tRNA, rRNA, riboswitches.  So we can't spend a lot of time, it only covers less than 1% of the database.  There are more than 1 million tRNA sequences.  Maybe some day people will systematically crystallize families of RNAs; this would add a lot of value for Rfam.  Not certain whether they'll bring in the basepair annotations of the 3D structures that are solved; their pipeline is already slow enough for what they are already doing just with sequences.  This is on the wish list, but not high up on the wish list.

NL: So you wouldn't object to us annotating alignments.

PG: We talked in Benasque about setting up some sort of DAS file; slurp in a PDB sequence, the annotations could be connected to that, and then attached to the alignment in Rfam.  They'd like to get this from EMBL, but they haven't started importing PDB sequences yet.

NL: Which version of the molecule do you use from PDB?  Some portion might be missing, some drug may be bound, etc.

PG: Dealing with truncated sequences (as one finds in PDB files of RNA molecules) is hard for SCFG covariance models.  Can't incorporate into Rfam yet.

NL: Couldn't we just fill in the missing segment of sequence.

PG: Where is that data?  Is it consistently available in PDB?  Even mapping PDB onto EMBL or GenBank would be useful.

NL: The structures in PDB/NDB are from organisms for which the sequence is available somewhere. 

MD: NCBI MMDB does this.  PDB doesn't always have a full sequence part of the record, especially in older data files.

NL: E. coli and Thermus thermophilus, two bacteria.  The bases that are paired in the two structures can be aligned, and the basepair family is conserved between them. 

PG: We have 300,000 small subunit RNA sequences.  How do you display them all and display basepair annotation information as well?  Rfam is more about annotating the families, rather than providing detailed annotations about individual families; more specialized databases do these things.

CB: There is one subpredicate for each of the 18 pairing relations. 

NL: Let's look at definition D21.

MD: The domain of the relation would allow the nucleotide itself or just an edge.  These are domain relations.

RH: Can X be Q2 connected to Y, and also Q2 connected to something of which Y is a part?  Are you interested in finding out how many Q2 connections there are between two entities?



CB: Edges of a base are submolecules, so they can be Q2 connected to other edges; this is what makes a basepair.  What about stacking?  Does this occur between submolecules or between boundaries?  Can Q2 connections be between fiat parts of molecules? 

CZ: What do we need boundaries for?  They are questionable from a quantum mechanics point of view.

CB: Bonds and boundaries differ (although in the Q1 case, they are both things that happen between atoms).  A bond is a 3-dimensional region somewhere between two atoms.  A boundary is a 2-dimensional region cutting the bond in two.

CB: What about the 3' and 5' face of a nucleobase?

NL: The interaction between two faces in stacking is between two boundaries, two faces.  The electron cloud dies off quickly ....

CB: The boundary on one face of the base only comes into being when you bring two bases near enough to stack.

NL: The boundaries are important.  Edges are boundaries, and faces are boundaries.

CB: What is the domain and range of Q2 connection?  Is Q2 connection between atoms and submolecules, or is it between boundaries?

NL: Thomas Bittner said to avoid defining boundaries in the absence of a relation.  The boundary is dependent on the relation.  He was comfortable with the definition of the boundary

NL: Submolecules can be things that cut through atoms. 

CZ: I think that's goofy!


CZ: Sequence alignment ontology.  We want to assert more sophisticated correspondences between parts of RNAs, nucleotide to nucleotide, region to region, etc.

PG: Pfam is assigning probabilities to the assertions made in their alignments.  Rfam could do that at some point.

NL: Definition 29 is about correspondence. 

CZ: It is similar to Definition 4, it is a primitive, not a definition you would check with some reasoner, because it's hard to check that things occupy equivalent positions when superposed.

CB: We could make those more basic things the primitives:  NT_NT_share_common_descent or NT_NT_occupy_equivalent_positions_when_superposed, then we can define the correspondence from that.

CB: David Osumi-Sutherland in the Cambridge genetics department has worked on ontology of homology.  He had some tentative definitions for homology.  He works on anatomical structures, the fly brain. 


PG: Why not use the word homologous, rather than corresponds?  Or words like analogous, paralogous, etc.?

NL: We'd like to avoid using the word "homologous," using the more broad term "correspond", because it covers

MD: Imagine an insertion event. There is no longer correspondence between the inserted nucleotides in one organism and anything in a different organism that lacks the insertion.  Certain parts could be aligned, other parts not.  But if two bits correspond ...

MD: You're using relations, whereas he'd like to see these as a type (correspondence as a type), and you can add annotation to that object saying why you believe in the correspondence.  Do you use predicates or objects for the relation?  Minimize the number of primitives. 

CB: That's a good recommendation for putting this into OWL.

MD: The relation is minimal, and useful.  But if we add it as an object, you can add provenance information in OWL.  You can't add provenance information to a relation in OWL. 

CZ: Same deal with basepairs?  You can then annotate a basepair as being from MC-Annotate, FR3D, structure mapping, or covariation analysis.

CB: You should only have relations which are subsumption,

MD: When there are domain entities in the name of the relation, I get suspicious ...

CB: Kisses is a bad relation, kissing is an action which has participants and other properties.

CB: Illustrates:  Nucleotide N1 in structure A, N2 in structure B.  The correspondence C between N1 and N2 is a dependent entity, an information content entity.  C can have properties (homologous, analogous, etc.) and provenance information. 

CZ: Someone could say that 2:12 of one molecule correspond to 5:15 of another.  They might think that this implies nucleotide to nucleotide correspondence.
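CZ's example can be made concrete with a small sketch, assuming the ranges are inclusive position ranges. A region-to-region assertion stays a single statement; reading it position by position is a separate, explicit step that only makes sense when the regions have equal length. The function name `expand_region` is hypothetical.

```python
def expand_region(start_a, end_a, start_b, end_b):
    """Read a region-to-region correspondence as position-by-position pairs.

    Ranges are inclusive. Raises ValueError when the regions differ in length,
    since no one-to-one reading exists in that case.
    """
    length_a = end_a - start_a + 1
    length_b = end_b - start_b + 1
    if length_a != length_b:
        raise ValueError("regions differ in length; no one-to-one reading")
    return list(zip(range(start_a, end_a + 1), range(start_b, end_b + 1)))


pairs = expand_region(2, 12, 5, 15)
print(pairs[0], pairs[-1], len(pairs))  # (2, 5) (12, 15) 11
```

Whether the author of the region assertion actually intends the expanded pairs is exactly the ambiguity CZ raises; the code cannot settle that.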

MD: You add this as a property of the class. 

MD: We would have a class of things called nucleotide correspondence, and another called basepair correspondence.  We would have subclasses.  Correspondence is a class, and it has subclasses, nucleotide, basepair, region, etc. 
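A rough sketch of the class hierarchy MD describes, written as Python dataclasses rather than OWL purely for illustration. The field names (`first`, `second`, `kind`, `provenance`) and the identifiers used in the example are hypothetical; nothing here is taken from an existing RNAO file.

```python
from dataclasses import dataclass, field


@dataclass
class Correspondence:
    """A correspondence treated as an object, so it can carry annotations."""
    first: str
    second: str
    kind: str = "unspecified"  # e.g. homologous, analogous
    provenance: list = field(default_factory=list)


@dataclass
class NucleotideCorrespondence(Correspondence):
    pass


@dataclass
class BasepairCorrespondence(Correspondence):
    pass


# Hypothetical identifiers for nucleotide N1 in structure A and N2 in structure B.
c = NucleotideCorrespondence("A:N1", "B:N2", kind="homologous")
c.provenance.append("covariation analysis")
print(isinstance(c, Correspondence), c.provenance)
```

Because the correspondence is an object rather than a bare relation between N1 and N2, the reason for believing it can be attached directly, which is the provenance argument made above.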

CB: States of affairs.

RH: I don't see where aboutness comes in here.  Intent.

MD: You need another relation ... something more focused.

RH: I can see how they would be generically dependent continuants. 

MD: Information content entities have an understood relation that is that the information content entity is about something.

RH: Aboutness is created through intentionality, what distinguishes humans from other things.  We can have something stand for something else.  Information artifact ontology.  Generic dependence, having some property in common.  Whiteness, for example, is OK as long as there are white things.  Aboutness, making something stand for something.  There is no intentionality here.

CB: We can take the assertion of correspondence between NT1 and NT2.  Or the assertion that there is no correspondence.  How do you tell the difference?

RH: ... We have an intent of why we represent this part of reality and this other part, not.

NL: In molecule A and B, there will be individual nucleotides in A that correspond to individual nucleotides in B, basepairs in A that correspond to basepairs in B.  We have four kinds of correspondence that we want to keep distinct.

RH: Then why is it generically dependent?  Generic dependent continuant exists as long as there are instances of type T. 

CB: Here it's specific dependence.  Correspondence is a relational quality.  It can't be transferred.  Printing, for example.  You can copy off a sheet of text and destroy the original. 

MD: Yes, in the BFO world this is how you do things.

MD: The issue is how you get this information content entity.

NL: Now, people publish alignments, and no one knows where they came from, and no one really knows what they mean.  We want people to be able to make very specific lists of alignments and, in the paper or even in the alignment itself, indicate where the alignment comes from and how confident ...

MD: ... questions about relational qualities ...

MD: Your alignment ontology, have you actually formalized the ontology, say in OWL?

MD: What do you want to do? 

NL: Allow people to tell all the correspondences they mean, and to be clear about it.

CZ: The data format will be a difficult issue; we may not find a format that efficiently encodes the most complicated set of correspondences that someone might like to use, but we may find a format that allows for some improvement over.

CB: Before there were people, were there correspondences between the RNAs of Haloarcula marismortui and Thermus thermophilus?

MD: To me, it's clearly an artifact of our ability to reason about things.

CZ: Does common descent exist separately from having a human there to think about it?


RH: There's always some intentionality involved when making an ontology. 

NL: When we don't mean to make an assertion, we won't.  Parts of the molecules that don't correspond we won't say they do.

Anything else we should talk about today?

CZ: Motifs and submotifs that they share in common?  Similar to ChEBI.

CB: Sounds like simple parthood.

NL: ... amide group ... is it part of the chain, or part of a particular residue of the protein?  In RNA, you have riboses and phos--- group.  Those four oxygens that are attached to the phosphorus work as a group.  Hydrogen bonds can be made with them.  But, one oxygen is thought of as part of one nucleotide, while the other three belong to another!

MD: It's arbitrary, you just have to pick something.

NL: When you say "the phosphate group of one nucleotide interacts with some other thing" and the interaction is made by the O3' (which is part of one nucleotide's phosphate and another nucleotide's sugar).

CB: It's quite OK that my right eye is part of my right-hand side and part of my front.

NL: So if I have a nucleotidal residue ... what are the parts?

ME: Nucleobase, ribosyl group,

MD: We should include the O3' that's attached to the phosphorus of the nucleotide.

NL: Good, we can have overlapping things. Let's do that. 

MD: On the board:

nucleotide residue (5'+3' connection includes O5' O3')

The nucleotide residue (connected at the 5' end and the 3' end), that should include O3', the nucleotide's phosphate, and up to the O5'.

(further discussion)

NL: So each nucleotidal residue will contain two O3' atoms.

NL: In the submitted paper, we gave each nucleotide only one O3', the one connected to the phosphorus of the nucleotide.  But PDB gives three oxygens to each phosphorus.
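The two conventions can be compared with a small sketch over backbone atoms only. Atoms are labeled by (residue number, atom name), where the residue number is the one carried by the atom's sugar. The non-bridging oxygen names OP1 and OP2 are an assumption borrowed from current PDB naming, the residue numbers are arbitrary, and sugar-ring and base atoms are left out.

```python
def overlapping_residue(i):
    """Backbone atoms of residue i under the whiteboard proposal (two O3' atoms)."""
    return {
        (i - 1, "O3'"),  # O3' attached to this residue's phosphorus
        (i, "P"), (i, "OP1"), (i, "OP2"), (i, "O5'"),
        (i, "C5'"), (i, "C4'"), (i, "C3'"),
        (i, "O3'"),      # O3' on this residue's own sugar
    }


def pdb_style_residue(i):
    """Backbone atoms of residue i when each O3' is kept only with its sugar."""
    return {
        (i, "P"), (i, "OP1"), (i, "OP2"), (i, "O5'"),
        (i, "C5'"), (i, "C4'"), (i, "C3'"), (i, "O3'"),
    }


print(overlapping_residue(5) & overlapping_residue(6))  # {(5, "O3'")}
print(pdb_style_residue(5) & pdb_style_residue(6))      # set()
```

The first convention gives adjacent residues one shared atom; the second keeps residues disjoint at the cost of leaving each phosphorus with three oxygens in its own residue.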

CB: The issue is when you want to talk about repeating elements, then you don't want them to have overlaps.

MD: This is a good issue.

NL and CB: Divide into nucleosides and phosphates?

(time spent looking in ChEBI for entities that we could map to if we dice up the nucleotides in a specific way)

NL: We could have a 3' adenosyl group that's phosphorylated.  Would that be a 3' adenosyl phosphate group?   ChEBI ID 22261 is the 5' adenosyl group, which we don't want.  So we can add a 3' adenosyl group ....  3' adenosyl 5' phosphate?

ME: Not according to IUPAC rules.

CB: 5'-phosphoadenosin-3'-yl (ME thinks this is named OK)

NL: google it, there is a paper on it.



(working individually, small discussions)


JR: All databases number nucleotides and the atoms in them according to the same scheme; they split off one oxygen from the phosphate and assign it to the adjacent nucleotide.

NL: We can say what phosphate the O3' belongs to, but the O3' belongs to a different nucleotide than the phosphorus belongs to.

JR: The phosphate is not a proper part of the nucleotide.

CB: GMP residue ...

(further discussion of standards between different databases for naming of nucleotides, etc., display of hydrogens, stereochemistry, ...)

NL: People study RNAs at different pH's.  Protonations are important.

NL: GMP specifies a whole family of different protonated states.  We would put all the protons that keep the charge minimal.  But we wouldn't add protons that would make the charge positive.

MD: But if you're going to underspecify it ... have GMP

(discussing things to put into ChEBI concerning protonation of nucleic acid residues)

NL: List in order of pKa (???)

NL: The form of GMP residue that you display is what we'll show.  GMP residue IS the superclass that subsumes all the protonated forms.

JH: Yes.

NL: Let's get back to Jane's question.  The RNA chain with residues linked together by phosphate parts.

CB: The nucleosides are linked by phosphates.

NL: Each phosphate is a proper part of the RNA molecule, not of a nucleotide.  Phosphoryl might be a proper part of a nucleotide.

JR: But we don't need that.

JR: The sugar edge is not a proper part of the base, we don't worry about that.

NL: Did we get it right in the paper where a suite starts and ends?

JR: The suite breaks in the middle of the ribose ring.  It's defined by dihedrals, you can't define them unless it's a continuing ... it goes from delta to delta, including delta at both ends.  The conformation of the suite depends on delta.  It's more a conformational thing than the set of atoms.
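JR's "delta to delta" description can be sketched as follows, assuming the standard backbone dihedral names (alpha through zeta) and that each residue's angles are already available. The angle values in the example are made up for illustration only.

```python
def suites(residues):
    """Build suites from consecutive residues.

    Each residue is a dict with keys alpha, beta, gamma, delta, epsilon, zeta.
    A suite runs from delta of one residue to delta of the next, inclusive:
    (delta, epsilon, zeta) of residue i-1, then (alpha, beta, gamma, delta) of i.
    """
    result = []
    for prev, cur in zip(residues, residues[1:]):
        result.append((
            prev["delta"], prev["epsilon"], prev["zeta"],
            cur["alpha"], cur["beta"], cur["gamma"], cur["delta"],
        ))
    return result


# Made-up angles in degrees, three residues in chain order.
chain = [
    {"alpha": -60, "beta": 180, "gamma": 55, "delta": 80, "epsilon": -150, "zeta": -70},
    {"alpha": -65, "beta": 175, "gamma": 50, "delta": 82, "epsilon": -155, "zeta": -72},
    {"alpha": -62, "beta": 178, "gamma": 52, "delta": 145, "epsilon": -100, "zeta": -80},
]
s = suites(chain)
print(len(s), s[0][-1] == s[1][0])  # 2 True
```

Adjacent suites share the delta of the residue between them, which is one way to see why suites are not disjoint pieces of the molecule.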

JR: We have a class that subsumes all of the different protonated states of the residue.  If you just think about backbone conformation, you can just look at the chain of backbone atoms.  Informally, we think about it including the backbone chain AND the ribose rings and the two bases.  Some extra things help to define the torsions.  The suite is not a sensible way to divide the molecule into submolecules. 

NL: They're not disjoint submolecules.

JR: If we're going to define a suite at all in terms of atoms, we should define it as all of the backbone, ribose, and base atoms.  Overlaps badly.  We will later want to understand what backbone conformations can go with bases stacking, etc.  ... also base-backbone interactions. 

NL: We'll have to make some modifications to the article once it gets back from the reviewers to clarify this phosphate business.


CB: One thought about backbones ...

JR: Backbone doesn't mean just one thing.  People talk about the bases and the backbone.  The backbone then contains the ribose, phosphorus, and bridging oxygens.  Backbone hydrogen bonds.  Lots of those interactions go to the O2' and the non-bridging oxygens.  Talking about the backbone as the connected chain with no branches is not so useful.  ... In proteins, people often don't worry about omega.  Omega is a torsion angle between a C and N in the protein backbone.  Normally you ignore omega and assume it's 180 degrees, but occasionally it can be 0 degrees.
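For readers unfamiliar with torsion angles, here is a minimal sketch of computing a dihedral from four atom positions and labeling an omega value as near 180 degrees (trans) or near 0 degrees (cis). The coordinates are toy values, and the 90-degree cutoff in `omega_label` is a hypothetical choice, not something stated in the discussion.

```python
import math


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)  # normals of the two planes
    norm_b1 = math.sqrt(dot(b1, b1))
    m1 = cross(n1, tuple(x / norm_b1 for x in b1))
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))


def omega_label(angle, cutoff=90.0):
    """Label a torsion as 'trans' (near 180) or 'cis' (near 0); cutoff is hypothetical."""
    return "trans" if abs(angle) > cutoff else "cis"


# Toy planar coordinates: end atoms on opposite sides, then on the same side.
trans_angle = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
cis_angle = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
print(omega_label(trans_angle), omega_label(cis_angle))  # trans cis
```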

JH: Chemically, we have coordinates and that's it, the angles are not specified.  We don't deal with conformations.

CB: Right, conformations are out of the scope of ChEBI.

JR: Ranges on each individual angle don't work well.

NL: They're correlated, they have ranges in a multidimensional space.


JR: We haven't defined stacking very well.  The interaction is not a specific atom atom interaction.  And it should say that the bases should be somewhat coplanar ... we don't want to be terribly specific about it, but we should indicate something.  Just to say that you have some kind of Q2 interaction seems too vague.

CB: Let's just say that the bases are roughly parallel.

NL: It's a face to face Q2 interaction.  "Stacking is a Q2 interaction between faces where the bases are roughly parallel."
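The "roughly parallel" condition could be checked from the normal vectors of the two base planes. This is only a sketch of that one criterion: the 30-degree cutoff is a hypothetical value, and nothing here tests whether the faces actually overlap or are close enough to interact.

```python
import math


def angle_between_normals(n1, n2):
    """Angle in degrees between two plane normals, ignoring their sign."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norm = math.sqrt(sum(a * a for a in n1)) * math.sqrt(sum(b * b for b in n2))
    # abs() treats a normal and its opposite as the same plane orientation.
    cosine = max(-1.0, min(1.0, abs(dot) / norm))
    return math.degrees(math.acos(cosine))


def roughly_parallel(n1, n2, max_angle=30.0):
    """True when the planes are within max_angle degrees (hypothetical cutoff)."""
    return angle_between_normals(n1, n2) <= max_angle


print(roughly_parallel((0, 0, 1), (0, 0, -1)))  # True
print(roughly_parallel((0, 0, 1), (1, 0, 0)))   # False
```

How tight the cutoff is set changes how often a marginal stack gets annotated, which is the sensitivity discussed for the GNRA case.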

JR: That comes in because in the description of the GNRA we say that bases 5 and 6 are stacked.  It's a marginal stack at best, often not stacked at all.  Other hydrogen bonds are more important to the motif than this stacking interaction.

CZ: The presence of that stacking interaction depends on how tight or loose the annotation system is; FR3D annotation may be looser, and so makes the stacking appear to occur a high percentage of the time.

JR: Describing the sequence, basepairs, stacking,

JR: Add the purine hydrogen bond, between the G and the R.  N7 of the base to the O2' of the ribose.  Sometimes FR3D calls it a basepair, but it isn't.

JR: We could say even more clearly that it is a fuzzy definition.  You need to decide case by case what the features are that constitute it.  A motif really is ... depends on long experience of what is important to the motif.

JR: There are three ways they are vague and unknown.  Flexibility over time, dynamics, are real but the least important problem.  We're trying to say what's there in a given structure.  The other two are crystallography/fitting errors, especially in low resolution structures.  The real one is that the real conformations are NOT always the same, due to the environment in which the motif exists.  Induced fit, etc.  They may not all truly look the same.

NL: UNCG's maybe, because they're often isolated.

JR: We should talk about this.


CB: Do all motifs, in the right circumstances, have something to catalyze a particular reaction, or do they help make the RNA have the right shape?

NL: All of the above.  Some are structural, some catalytic.

JR: They're a locally favorable arrangement.  Energetically favorable.

NL: They happen during folding, maybe, form part of a module.

JH: Are they predicted by some program, or community decision? 

JR: Even with 3D structures, people don't agree about where the end of an alpha helix is (in proteins).

CB: There is disagreement about whether there is one species of giraffe or six species.  They're isolated populations, so they never get together.

JH: Protein or RNA motifs exist as a kind of similarity measure, a report of shared features.  What can we say about that?  Do those shared features lead to anything in the real world.

CB: The hammerhead ribozyme is a motif.

JR: No, it's a fold, it has motifs.

CZ: Motifs are the same when they are expected to have the same pattern of sequence variability.  Helices, for instance, have a well identified pattern of sequence variability.

JR: Few things are going to have such a nice situation.  The motif might not depend so much on what the particular sequence is, as on the context, the surrounding helices and junctions and such that create that motif.  An A-minor motif, for example, depends on the helices coming into the right place.

NL: Which local motifs don't have much sequence similarity, lots of variability.

JR: Some motifs have nice basepairs that constrain the sequence variability, but others depend on other interactions, stackings, hydrogen bonds, etc.  Some might be interaction motifs with a protein which (interacts with the backbone so it doesn't depend much on the basepairs?). 

CZ: If you look at basepairs, you'll see certain motifs.  If you look at backbones, you'll see other motifs, ones that might not depend so much on what the bases are doing.


NL: A sarcin motif is much more constrained than a helix.  Lots less base variability.  The stacking is playing a large role.  (basepairing?)

JR: The S-motif is a little strained, it seems.  It may not be typical.

NL: It's very rigid.  Jiri Sponer has simulated it, it's the most rigid thing he's simulated.

JR: It may well be strained.  It has to work to get into that conformation.


JH: How is RNAO dealing with motifs now?

NL: It's hard to say that every one has to have these interactions, or those.  We don't even want to say that it needs to have 80% of these interactions.

CB: It's a chemical approach.

JR: Atomic-level interactions.

CB: An approach which might be better is to treat it more like anatomy.  Don't say that a heart is something composed of lots of aorta cells.

JR: There you have function and evolution in a more clear sense.  With RNA motifs we don't know the function.

CB: Why are RNA motifs conserved?

NL: You CAN tell the function in many cases.

CB: Folding into the right shape IS a function.

NL: GNRA.  In the ribosome, 22 out of 23 are making a loop-receptor interaction.  UNCG does this only 1 out of 20, and it's a very weird interaction.

JR: There usually won't be one answer.  Not all motifs will have a nice answer like that.  There is a cis proline, cis-Pro touch-turn in proteins that is a bit unstable but appears at active sites, always doing something different at that active site.  Beta strand is an ill-defined motif, but it's very useful.  But no defined function.

CB: That sounds like a function to me.

JR: They all make a reasonably decent structure.  Or they're there to prevent something else.

NL: So a heart is more architectural.  Eric Westhof works at this level.  But it's very complicated to do in ontology.  A tRNA has an architecture ....  They all bind with a similar free energy to the ribosome, despite sequence variability.  If the amino acid part binds tightly, something else binds less tightly.

JR: Some things make things more stable.  Lots of things do.  Not anything you can predict clearly, and some things make things less stable.

JR: There is a point talking about sequence motifs.  Those we can search for.  A lot of protein-RNA interactions destroy the local RNA structure (unfold the RNA) and look just for particular sequence.

CB: Motifs are parts, parts of architectures.

JH: Motifs are things that range over a collection of structures.

NL: They're universals.

JH: A particular RNA can have a part that exemplifies the motif, rather than being the motif. 

CB: Exemplifies the motif shape, instantiates the motif ...

JR: Not a natural class.

CB: It must be natural, because it's conserved.  We might not know what the conserved thing at the moment is, but you're looking at it.

JR: Different features of the motif may be conserved, but we don't know yet what the conservation pattern is.

NL: You mean the sequence pattern?

JR: We look around in all these structures, we don't get a clear picture of the motif and its variability.  Protein people don't call it a fold or superfold unless they see it in three separate lineages.

JH: A motif is a quality or pattern that inheres in molecules ... it's existentially dependent on a collection.

CB: If there were only one duckbilled platypus in the world ... and it has thick fur ...

JH: If there's only one instance, it's not a motif.

JR: But in the context of all mammals, ...

CB: You could have all kinds of functional motifs in a given RNA. 

JR: You don't say that this 5-residue stretch forms a motif.  They don't always fold the same way outside of the context in which they occur.  The S-motif is a submotif.  Even with the whole hairpin, it doesn't always fold up that way.  Take the 5S S-motif (Loop E), it doesn't form the S-motif by itself, if you crystallize it by itself. 

NL: Tinoco did that, if you crystallize it with helices on either end, ....

JR: The whole 5S doesn't form a clear structure if you make it apart from the ribosome.


NL: RNA motifs change shape depending on context.

CB: They're like a lap.  They are conformations.  A kind of super conformation.  The RNA molecule as a whole has a conformation, and it has conformation parts.  Conformations are qualities, hurrah.

MD: What a way to avoid what it actually is.  Not very meaningful.

JH: Some motifs are sequence-only and some are shape-oriented.

(discussion during break:)

JR: We used to try to describe general families of folds in proteins.  It turns out that this was very artificial.  There was a whole continuum of folds, not tight categories.

MD: Motifs are largely informational.  From the realist perspective

JR: We should have a name for a stretch of stacked pairs whose backbone forms an approximate A-form helix, but which may have non-canonical or non-Watson-Crick basepairs in the middle, without those being as much as an internal loop.

JR: We should also have a vaguer idea of stacking, where a base can stack on a basepair, but maybe not always on the base on one side or the other of the pair.

(varied discussion)



MD: Guided tour of the Google Code RNAO pages.  The Downloads tab is a good place to post the version that you announce in a paper.  You can deprecate a previous version, especially things that are not version releases.  Notes from a meeting, if we want them to be open and public, can go in the wiki.  You can post documents, papers, etc.  Issues is for flagging issues that need to be fixed:  bugs, catastrophic mistakes, feature requests.  Then we can have a discussion.  Source, with versioning, check-out, etc.

NL: Do you have to log in?

MD: The source is there for anyone to download, but only project owners and committers can check it out, make changes, and check it back in.  Owners can also create and delete pages.  Full control.  A committer has limited control.

MD: Source, browse.  Trunk is the main code repository.  (There is a .obo and .xml file.)

NL: If we want to update the .obo file, do we do that here?

MD: It should be generated from the .owl file.

JH: The owl to obo conversion doesn't work well.  The .obo format is nicer to look at than the .xml file.  .obo is more human readable.

NL: How do we create a new .obo file?

CB: I can write a style sheet

JH: To convert OWL to OBO, you need to be aware of what properties you've used in your OWL file.  The default doesn't know ...

MD: The syntax for OWL 2 is fairly sophisticated

CB: The current OBO doesn't have some of the more complicated logical ...

JH: Chris Mungall has put more into OBO 1.3; however, this doesn't yet exist in tools, etc ... vaporware.

MD: So we're editing the .xml file and generating the .obo file in some automated way.

JR: Where are we with motifs and cycles?  Can they be in the definitions?  Every motif needs them.

JH: At some point, you have to not include in OWL the full extent of the logical definition available in first order logic.

MD: You can have cycles in OWL.  The problem is specification of a structural cycle.  There are no variables to do that.  In OWL 2, you can make reference back to the origin itself.  If you start with one atom, you can refer back to it.  That's the extent of the language now.

JR: So we won't be defining motifs inside the formal system.

MD: You can have a class of motifs X, describe it.  Use the vocabulary defined in the ontology.  SWRL (?) rules.  OWL allows for DL (description logic) safe rules. 

JH: DL rules are restricted to known individuals.

MD: Known individuals in the knowledge base.  Not an anonymous restriction.  You can't have "I have an atom that is connected to some C atom".  You have to put it in the A box.

JR: GNRA motif.  six residues. 

MD: Yes, what you have here you can represent.  You can represent them by rules, because you're binding your variables.

JH: These things are hard to do.

RH: How will the three versions relate?

CB: Smuggle the first-order logic definitions into the OWL code, in an annotations.  Keep the CLIF as an annotation.

RH: How do you ensure that the axioms in this paper comply with the axioms you write in OWL?

JH: You can ensure ... OBO 1.3 doesn't exist yet ... it's good.

RH: It exists on paper.

JH: So the paper describes what will be in the first order logic version, and OWL will do its best to cover that.

RH: It would be good to have proofs for the axioms in OWL being implied by the axioms in first-order logic.  Automatically generating OWL from the axioms in first-order logic.  It would take a logician to work it out.

JH: That would be cool!  Very useful for many ontologies.

RH: I propose that we see whether or not the axioms in OWL are compatible, follow from the first-order theory.  Verification.

JH: That would be great!


MD: Look at rnao.xml.  Should be .rdf.  It's an RDF/XML file.  Click it, then view raw file.  Right click ...  This will always be the same link to get the current version and view it in Protege:  http://rnao.googlecode.com/svn/trunk/rnao.xml  View it in Protege 4.02.  Open Protege, click Open OWL Ontology from URI, and paste the URL in.

(all installing Protege 4.02 and getting rnao.xml loaded in it)


(discussion of granular parts, whether things that have the same parts are the same thing)

CB: Diachronic is change over time.  A protein is the same molecule regardless of its phosphorylation state.

JH: Same thing but different it.

CB: Like if I'm turned into a frog.

JH: Barry refers to this distinction ... granular part versus determinate part was crucial ...

RH: That was 1984.

JR: What are we talking about when we say a molecule?  A single individual one, we don't know anything about it.  There are a few cases we can, ....  The arguments about the definitions are peculiar, because we often are talking about the molecule in general.

CB: Like tRNA.  A class. 

JH: The higher up in the hierarchy, the less you can say.

JR: Describing the hierarchy in terms of the classes is fine, talking about individuals is harder.

CB: Anything true about the class is true for the individual.

(two discussions going on)

JR: There are no ontologies that allow probabilistic statements, or statements that quantify the level at which we really know something.

JH: Only mathematics is so clear ... and even then not decidable.

JR: A question for later.  In the paper, the proposal for dealing with backbone conformations was awkward.

CB: One proposal was to do it with relations.

JR: Each conformation is a child of ...

CB: In first-order logic, ... predicate ...

JH: has_conformation, the range of which is child of quality ...

JR: Can it be a property of a suite, rather than of a bond between residues?  Suites overlap in many ways.

CB: Yes.

JR: OK, that's perfectly straightforward and reasonable.

JH: Where does the conformation go in the hierarchy?

CB: In the A box.

JH: The A box is the assertion about individuals which belong to a class.

JR: We want to be able to say that it could be a defining property of a motif.

CZ: What do we do about !! conformations?

JR: There are suites that don't fall into one of our named categories.


(still discussing, lunch is delayed)

MD: To get labels to display nicely, File, Preferences, Renderer, check Render entities using annotation values.  You can also make sure the font is something nice like Arial.


MD: Looking under material entity in RNAO.  Thing, entity, continuant, independent_continuant, material_entity. 


MD: It is a bit dangerous to have too many people download and install the SVN and be able to commit changes to the ontology.  Versioning is not supported, and is difficult to do with an ontology anyway. 

(more discussions)


To add new entities, File, Preferences, New Entities, set as shown below:

MD: (Editing backbone conformations in Protege.)


Looking for a definition of molecule already in ChEBI or IUPAC.  "Polyatomic entity" might be what we need ....  We'll put our own definition of molecule.


Submolecules are defined in the RNAO paper to be covalently connected.

What is an edge of an RNA base?  Not a submolecule, because they're not covalently connected within themselves.

What is a face of an RNA base?

The left side of a human is all cells on the left of a plane through the body.  The plane is a fiat boundary.

The top face of a mathematical cube is a lower-dimensional object than the cube. 

The top surface of a cube would seem to include at least a layer of atoms.  It's something you can touch, something you can interact with chemically.

Might the "top" face of a base be a region?  It would seem that it should be two-dimensional.

CZ: I think the top face of a base is the same as the base itself.


Atoms that are directly covalently bonded (sharing electrons and all) are "covalently_bonded_to".  If there is a chain of such connections between two atoms, they are "covalently_connected_to".  Atoms that are hydrogen bonded, or have some dispersion interaction are "weakly_interacting_with", while atoms connected by a chain of "covalently_bonded_to" or "weakly_interacting_with" are "weakly_connected_with".  A ribosome RNA chain is held together by weak connections.  A ribosome with its proteins is a whole conglomeration of things held together with weak connections.
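The distinction above between the "bonded/interacting" relations and the "connected" relations can be illustrated with a small Python sketch.  It assumes that the direct relations are asserted as symmetric pairs and that the "connected" relations are simply reachability along chains of those pairs.  The atom names, the toy data, and the helpers connected_groups and connected are hypothetical, made up for the example; this is not how RNAO or a reasoner computes these relations.

```python
from collections import defaultdict


def connected_groups(atoms, *edge_sets):
    """Group atoms that are linked by a chain of edges drawn from the given sets."""
    neighbours = defaultdict(set)
    for edges in edge_sets:
        for a, b in edges:
            neighbours[a].add(b)
            neighbours[b].add(a)  # both direct relations are treated as symmetric
    groups, seen = [], set()
    for start in atoms:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk along the asserted edges
            atom = stack.pop()
            if atom in group:
                continue
            group.add(atom)
            stack.extend(neighbours[atom] - group)
        seen |= group
        groups.append(group)
    return groups


def connected(groups, a, b):
    """True if atoms a and b fall in the same group."""
    return any(a in g and b in g for g in groups)


# Hypothetical toy data: two covalent fragments joined by one hydrogen bond,
# plus an isolated atom "X".
atoms = ["N1", "C2", "O2", "N3", "H3", "X"]
covalently_bonded_to = [("N1", "C2"), ("C2", "O2"), ("N3", "H3")]
weakly_interacting_with = [("O2", "H3")]

# covalently_connected_to: chains of covalent bonds only.
covalent = connected_groups(atoms, covalently_bonded_to)
# weakly_connected_with: chains of covalent bonds or weak interactions.
weak = connected_groups(atoms, covalently_bonded_to, weakly_interacting_with)

print(connected(covalent, "N1", "N3"))
print(connected(weak, "N1", "N3"))
```

In this toy data N1 and N3 are not covalently connected, but they are weakly connected through the O2...H3 hydrogen bond, which mirrors the remark that a ribosome is held together by weak connections.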


Cleaning up:  collection of nucleotides under part of molecule.  Boundary of a nucleic acid.


Rules.  There is zero documentation, you have to remember the syntax.  Press tab for auto-completion to start.

It seems that it's not necessary to instantiate the variables when it is clear from the usage what types they need to be:

'nucleotide residue' (?nt1), 'nucleotide residue' (?nt2), 'nucleotide residue' (?nt3), 'nucleotide residue' (?nt4),

'nucleotide residue' (?nt5), 'nucleotide residue' (?nt6),

Instead you can just start up with this:











->'GNRA motif'(m)

MD: The conclusion of this set of rules needs to refer to a specific individual; that's what m is.  It's listed under the "Individuals" tab.  Search results will appear under the Individuals tab.  Do one rule at a time, see the results.  Unload it, then load another rule, see the results.  Depending on the reasoner, it may take quite some time to calculate.  As it reads your set of assertions (see below for data format), it fills in all of the inferences for each entity that it can make from the hierarchy in the ontology.  This is a general-purpose reasoner, not a reasoner customized to your search task.  The idea is to output your annotations in this generic format, so that you and everyone else can use the data.  SPARQL is a language that can query data in this format.

MD: The conclusion line needs to refer to an already-existing object; this is not the place for creating a new name.

JH: Having the definition adds information, even though we are not going to use the definition for classification with a reasoner.

Michel Dumontier says:  Put this in a file:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.

@prefix rnao: <http://purl.obolibrary.org/obo/>.

@prefix rna: <http://purl.obolibrary.org/obo/>.

@prefix ro: <http://purl.obolibrary.org/obo/>.

@prefix : <http://example.org/>.

:m1 rdfs:label "".

:m1 :pdb_code "1A50".    # these assertions concern PDB file 1A50

:m1 rdf:type rnao:RNAO_0000168.    # m1 is a molecule or something

:m1 rnao:has_proper_part :nt1.    # m1 contains something called nt1

:m1 rnao:has_proper_part :nt2.    # m1 contains something called nt2

:nt1 rdf:type rnao:RNAO_0000102.    # nt1 is a cytosine residue

:nt2 rdf:type rnao:RNAO_0000105.    # nt2 is a guanine residue

:nt1 rnao:three_prime_to_five_prime_to :nt2.    # covalent backbone connection

:nt1 rnao:pairs_with_CSW :nt2.    # these two make a cSW basepair

MD: This makes it easy for everyone to read the data. It makes it easier.  This is WAY more standard than RNAML.

JR: How to indicate provenance or level of confidence in assertions?

:a rdf:type owl:AnnotationAxiom.

:a owl:subject :nt1.

:a owl:predicate rnao:three_Prime_five.

:a owl:object :nt2.

:a rdfs:comment "michel dumontier asserted this".


RH: Typed in the first 9 definitions in the RNAO paper.  CLIF.  There are various predicates, like q1conn.  It has two arguments. 

functions[(a,0),(b,0),(c,0)]                                 # constants; no arguments, their value is constant

predicates[(po,2), (ppo,2), .... , (q1conn,2)]


Definition 3:

formula(forall([X,Y],equiv(overlaps(X,Y),exists([Z],and(po(Z,X),po(Z,Y))))))    # defines "overlaps(X,Y)"; reasoner will substitute
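To make Definition 3 concrete, here is a minimal Python sketch that evaluates the same condition over a small, finite set of facts.  It assumes that po(Z,X) is read as "Z is part of X", with the part as the first argument; that reading, the entity names, and the helper overlaps are assumptions for illustration only.  A theorem prover works with the formula symbolically, not by enumerating facts like this.

```python
def overlaps(x, y, po):
    """overlaps(X, Y) holds iff some Z satisfies po(Z, X) and po(Z, Y)."""
    parts_of_x = {part for (part, whole) in po if whole == x}
    parts_of_y = {part for (part, whole) in po if whole == y}
    return bool(parts_of_x & parts_of_y)


# Hypothetical parthood facts, written as (part, whole) pairs.
po = {
    ("base1", "residue1"),
    ("phosphate2", "residue2"),
    ("sugar1", "residue1"),
    ("sugar1", "suite12"),      # the sugar is shared by residue1 and the suite
    ("phosphate2", "suite12"),  # the phosphate is shared by residue2 and the suite
}

print(overlaps("residue1", "suite12", po))
print(overlaps("residue1", "residue2", po))
```

With these facts, residue1 and suite12 overlap because they share sugar1, while residue1 and residue2 share no part and so do not overlap.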

Definition 4:  covalently_connected_sum (?)

RH: Open world reasoning:  You assert certain covalent bonds.  On the basis of those, you can tell atoms that are definitely covalently connected.  But in the open world setting, unless you specifically assert that other atoms have NO covalent bond between them, the reasoner will not assume that they are not covalently connected.  Putting in all of these negations leads to combinatorial explosions.  If you put an upper limit on how many atoms you will ever work with, then the possibilities are limited, but you cannot exceed that number. 
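A small Python sketch of the bookkeeping behind this point, under the simplifying assumption that only pairwise "no direct bond" statements are counted.  The atom names and the helpers unstated_pairs and pair_count are hypothetical.  It lists the atom pairs about which nothing has been asserted; in an open-world setting each would need an explicit negative assertion before a reasoner could conclude that the bond is absent.

```python
from itertools import combinations


def unstated_pairs(atoms, asserted_bonds):
    """Return the atom pairs for which no covalent bond has been asserted."""
    bonded = {frozenset(pair) for pair in asserted_bonds}
    return [pair for pair in combinations(atoms, 2) if frozenset(pair) not in bonded]


def pair_count(n):
    """Number of unordered pairs among n atoms."""
    return n * (n - 1) // 2


atoms = ["a", "b", "c", "d"]
asserted_bonds = [("a", "b"), ("b", "c")]

unstated = unstated_pairs(atoms, asserted_bonds)
print(len(unstated), "pairs would need an explicit negation")
print(pair_count(1000), "pairs in total among 1000 atoms")
```

Even for direct bonds alone, the number of pairs grows quadratically with the number of atoms, so nearly all of the required statements are negations.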

Next steps

Modify the RNAO paper as written, hopefully the reviewer/editor won't mind substantial changes.  Some things shouldn't appear in print!

Genome biology paper - next paper about RNAO.

CB: Composition tables for relations. 

Robert will continue to think through the idea of expressing the definition of a molecule in first- and second-order logic.

More motifs defined; rules written out.

Knowledge base: put in examples.  Common knowledge base. How to represent all of the PDB data?  We can do as much as the RNAO allows.

Good synonyms for three_prime_to_five_prime_to.  Maybe "just_before" and "just_after".

Accommodate interactions with metal ions, water bridging interactions.

RNA sequence alignment stuff.

FOL-OWL alignment.

MD: Types for basepairing, base stacking.

JR: Have several different groups try to populate the knowledge base.  Compare the output.

MD: This shows how the ontology is useful.

JR: It would be useful to have a consistency check, a list of assertions agreed to by the annotation programs. 

MD: Find out what people want to do with this data, discover each other's data. 

Produce something concrete to present to the RNA Society in June.  The existence of the RNAO is a start, but what does it do for us?  Is there ANY concrete example of using the ontology and a reasoner to do something interesting?  What examples are there now of people writing out data in RDF/RNAO format, and what can practitioners do with the data that is provided in those formats, and how?  Michel can almost certainly help us cobble together some useful examples of this.  It might be nice if, for example, FR3D could read annotations of interactions in RDF format for symbolic searches.  Check on similar things with MC-search; does it only read RNAML?  Provide an example of a database search (using SPARQL) that is facilitated by writing the data in RDF format, perhaps to compare the annotations provided by MC-annotate, FR3D, and RNAView.  Provide a converter that can take dot-bracket output such as in mfold and convert it to RDF format assertions about basepairs, hairpins, internal loops, junctions.  Define more motifs using RNAO terms as an example of what the community can soon expect to do.  Have FR3D (and other programs) output motif annotations in an RDF format.  It would be golden to be able to match up the sequences of internal loops predicted by mfold to known internal loops with the same sequence from, say, FR3D.
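As a starting point for the dot-bracket converter mentioned above, here is a minimal Python sketch that handles basepairs only.  It assumes plain dot-bracket notation with "(", ")" and "." and no pseudoknot symbols.  The predicate name :pairs_with and the :ntN naming scheme are placeholders invented for this example, not RNAO terms; dot-bracket notation does not say which basepair family a pair belongs to, so a real converter would have to decide which RNAO relation to use.  Hairpins, internal loops and junctions are not handled.

```python
def dot_bracket_pairs(structure):
    """Return sorted 1-based (i, j) positions of matching parentheses."""
    stack, pairs = [], []
    for position, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {position}")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {position}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def pairs_to_turtle(pairs, predicate=":pairs_with"):
    """Write one Turtle-style assertion per basepair (placeholder predicate)."""
    return [f":nt{i} {predicate} :nt{j} ." for i, j in pairs]


# A six-nucleotide hairpin: two basepairs closing a two-nucleotide loop.
for line in pairs_to_turtle(dot_bracket_pairs("((..))")):
    print(line)
```

The output lines have the same subject-predicate-object shape as the assertions in the example file above, so they could be appended to such a file once the prefix and the predicate are chosen.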

Question:  How do we assert that a particular collection of nucleotides in a 3D structure file forms a motif?  We want to tell what role each nucleotide in the 3D structure file corresponds to nucleotides nt1, nt2, ..., nt6 in the RNA rules for the mot