ELMs (Eukaryotic Linear Motifs) are found in many eukaryotic proteins and are important for many regulatory protein-protein interactions.

Viruses are thought to commonly use ELMs, taking advantage of their simple nature and rapid evolvability.

However, because ELMs are so simple, it is hard to distinguish real functional ELMs from sequences that match ELM patterns by chance.

We developed a method that compares the occurrence of ELMs in viral sequences with their occurrence in 100,000 randomly shuffled sequences.

By comparing the observed occurrence in the real sequence with the occurrence in the shuffled sequences, we can estimate how easily each ELM arises by chance, and so assess which ELMs in viral proteins are harder to evolve.
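The shuffling idea can be illustrated with a minimal sketch. This is not the code used in the study: the function name, the whole-sequence shuffle, the "match anywhere in the shuffled sequence" criterion, and the fixed random seed are all assumptions made for illustration, and the pattern in the usage note is a toy example.

```python
import random
import re


def count_shuffles_with_motif(sequence, pattern, n_shuffles=100_000, seed=0):
    """Count how many shuffled versions of `sequence` contain `pattern`.

    Assumption (not from the study): a shuffled sequence counts as a hit
    if the regular expression matches anywhere in it.
    """
    rng = random.Random(seed)        # fixed seed so the sketch is reproducible
    motif = re.compile(pattern)
    residues = list(sequence)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)        # in-place random permutation of the residues
        if motif.search("".join(residues)):
            hits += 1
    return hits
```

For example, `count_shuffles_with_motif(protein_seq, "RGD", n_shuffles=100_000)` returns the number of shuffled sequences, out of 100,000, in which the toy pattern appears. A low count means the pattern rarely arises by chance in a sequence of that composition.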

Assuming that ELMs that are harder to evolve by chance are more likely to be functional allows us to infer which ELMs are more likely to be physiologically relevant.

We validated this assumption by comparing ELM occurrences that are rare in shuffled sequences with those that appear in numerous shuffled sequences, in sets of eukaryotic and prokaryotic viruses (the latter act as a negative control, since ELMs are almost exclusively found in eukaryotes).

Indeed, we found that ELMs that are rare in shuffled sequences (rarely-shuffled ELMs) are significantly more enriched in eukaryotic viruses, and in particular in animal viruses, than in prokaryotic viruses.

In addition, we found that rarely-shuffled ELMs are enriched in an experimentally-validated set of viral proteins and are more evolutionarily conserved.

Thus, the lower the number of shuffled sequences in which an ELM occurs, the less likely it is that this ELM occurrence arises easily by chance, and the more likely it is that it is functional.

On the previous page we provide links to the data for all the eukaryotic viruses used in this study.

For each virus, we give a link to each of its proteins and a link to their sequences at NCBI.

For each viral protein, we provide a list of all the regions that match ELM patterns, with the ELM type, the location along the sequence, and the occurrence in the shuffled sequences (rarely-shuffled ELMs are highlighted).

The last two values for each ELM type give its occurrence in the 100,000 shuffled sequences for two sets of shuffled sequences: the first is its occurrence when we shuffled the sequences between all the viral proteins, and the second is its occurrence when we shuffled disordered content only, within the proteins of the specific virus.

As stated above, lower numbers indicate a higher likelihood that the ELM instance is functional.
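To make the per-ELM values easier to work with, the following sketch shows one possible way to hold a record and rank instances by rarity. It is an illustration only: the class name, field names, 1-based coordinates, and the choice of sort key are assumptions, not the format of the data files.

```python
from dataclasses import dataclass

N_SHUFFLES = 100_000  # shuffled sequences per set, as stated in the text


@dataclass(frozen=True)
class ElmInstance:
    protein: str             # hypothetical protein identifier
    elm_type: str            # ELM class name
    start: int               # assumed 1-based start position
    end: int                 # assumed inclusive end position
    count_all_proteins: int  # occurrences in shuffles between all viral proteins
    count_within_virus: int  # occurrences in shuffles of disordered content within the virus

    def frequencies(self):
        """Return both counts as fractions of the shuffled sequences."""
        return (self.count_all_proteins / N_SHUFFLES,
                self.count_within_virus / N_SHUFFLES)


def rank_by_rarity(instances):
    """Sort so that instances with the lowest shuffle counts come first.

    Sorting on the within-virus count first is an arbitrary choice for
    this sketch; either value could serve as the primary key.
    """
    return sorted(instances,
                  key=lambda i: (i.count_within_virus, i.count_all_proteins))
```

Instances at the top of the ranked list are those that occur in the fewest shuffled sequences, and so are the ones inferred to be most likely functional.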

ELM analysis is based on the ELM database that provides a well-annotated list of functional motifs:

The ELM database

In addition, we provide a list of domains that are inferred to exist in each viral protein.

Domain analysis was done using Pfam:

The Pfam database

For additional information regarding the method and the dataset, please contact Tzachi Hagai: tzachinj4 at gmail.