Example#

This Jupyter notebook shows how to use the library with common examples.

[1]:

import lightmotif
lightmotif.__version__

[1]:

'0.10.0'

[2]:

from urllib.request import urlopen

Creating a motif#

A Motif can be created from several sequences of the same length using the lightmotif.create function. This first builds a CountMatrix from each sequence position, and then creates a WeightMatrix and a ScoringMatrix.

[3]:

motif = lightmotif.create(["AATTGTGGTTA", "ATCTGTGGTTA", "TTCTGCGGTTA"])

Loading a motif#

The lightmotif.load function can be used to load the motifs found in a given file. Because it supports any file-like object, we can immediately download a motif from the JASPAR database and parse it on the fly:

[4]:

url = "https://jaspar.elixir.no/api/v1/matrix/MA0002.1.jaspar"
with urlopen(url) as response:
    motif = next(lightmotif.load(response, format="jaspar16"))
    print(f"Loaded motif {motif.name} of length {len(motif.counts)}")

Loaded motif MA0002.1 of length 11

Adding pseudo-counts#

By default, the loaded scoring matrix is built with zero pseudo-counts and a uniform background, which may not be ideal. Using the CountMatrix.normalize and WeightMatrix.log_odds methods, we can build a new ScoringMatrix with pseudo-counts of 0.1:

[5]:

pssm = motif.counts.normalize(0.1).log_odds()

Preparing a sequence#

Since the motif we loaded is a human transcription factor binding site, it makes sense to use a human sequence. As an example, we can load a contig from the human chromosome 22 (NT_167212.2).

[6]:

url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&report=fasta&id=568801992"
with urlopen(url) as response:
    response.readline()
    sequence = ''.join(line.strip().decode() for line in response)

To score a sequence with lightmotif, if must be first encoded and stored with a particular memory layout. This is taken care of by the lightmotif.stripe function.

[7]:

striped = lightmotif.stripe(sequence)

Calculate scores#

Once the sequence has been prepared, it can be used with the different functions and methods of lightmotif to compute scores for each position. The most most basic functionality is to compute the PSSM scores for every position of the sequence. This can be done with the ScoringMatrix.calculate method:

[8]:

scores = pssm.calculate(striped)

The scores are computed in an efficient column-major matrix which can be used to further extract high scoring positions:

The argmax method returns the smallest index with the highest score
The max method returns the highest score
The threshold method returns a list of positions with a score above the given score

[9]:

print(f"Highest score: {scores.max():.3f}")
print(f"Position with highest score: {scores.argmax()}")
print(f"Position with score above 14: {scores.threshold(13.0)}")

Highest score: 14.844
Position with highest score: 17203
Position with score above 14: [17203, 240997, 43974, 9785, 156007, 36679, 157496, 133872, 263825, 25220]

Otherwise, the resulting array can be accessed by index, and flattened into a list (or an array):

[10]:

print("Score at position 90517:", scores[156007])

Score at position 90517: 14.600027084350586

Using p-value thresholds#

LightMotif features a re-implementation of the TFP-PVALUE algorithm which can convert between a bitscore and a p-value for a given scoring matrix. Use the ScoringMatrix.score method to compute the score threshold for a p-value:

[11]:

print(f"Score threshold for p=1e-5: {pssm.score(1e-5):.3f}")

Score threshold for p=1e-5: 13.820

The ScoringMatrix.pvalue method can compute the p-value for a score, allowing to compute them for scores obtained by the scoring pipeline:

[12]:

for index in scores.threshold(13.0):
    print(f"Hit at position {index:6}: score={scores[index]:.3f} p={pssm.pvalue(scores[index]):.3g}")

Hit at position  17203: score=14.844 p=2.15e-06
Hit at position 240997: score=14.707 p=4.05e-06
Hit at position  43974: score=13.871 p=8.34e-06
Hit at position   9785: score=13.050 p=1.81e-05
Hit at position 156007: score=14.600 p=4.77e-06
Hit at position  36679: score=13.915 p=7.39e-06
Hit at position 157496: score=13.062 p=1.72e-05
Hit at position 133872: score=13.934 p=7.15e-06
Hit at position 263825: score=13.185 p=1.5e-05
Hit at position  25220: score=13.716 p=1.22e-05

Scanning algorithm#

For cases where a long sequence is being processed, and only a handful of significative hits is expected, using a scanner will be much more efficient. A Scanner can be created with the lightmotif.scan function, and yields Hit objects for every position above the threshold parameter:

[13]:

scanner = lightmotif.scan(pssm, striped, threshold=13.0)
for hit in scanner:
    print(f"Hit at position {hit.position:6}: score={hit.score:.3f}")

Hit at position  17203: score=14.844
Hit at position 240997: score=14.707
Hit at position   9785: score=13.050
Hit at position  43974: score=13.871
Hit at position 156007: score=14.600
Hit at position  36679: score=13.915
Hit at position 157496: score=13.062
Hit at position 133872: score=13.934
Hit at position 263825: score=13.185
Hit at position  25220: score=13.716

Although it gives equivalent results to the calculate example above, the scan implementation uses less memory and is generally faster for higher threshold values.

Reverse-complement#

All the examples above are showing how to calculate the hits for the direct strand. To process the reverse-strand, one could reverse-complement the sequence; however, it is much more efficient to reverse-complement the ScoringMatrix, as it is usually much smaller in memory.

[14]:

psmm_rc = pssm.reverse_complement()
scanner_rc = lightmotif.scan(psmm_rc, striped, threshold=13.0)