## Who am I

I am a PhD. student of computer science at the University of Helsinki. I do combinatorial pattern matching for bioinformatics on strings and labeled graphs. I'm also interested in data compression, machine learning with neural networks and 3d-graphics.

## Projects

### Real time graphics demo using raymarching

A 3d-graphics demo for Skrolli party 2019.

### Basic 3d-graphics engine

A 3d-graphics demo for Assembly 2018.

### Shortest common superstring approximation in compact working space

The shortest common superstring problem is a classical problem in string processing. It is defined as follows. Let $\mathcal S = S_1,\ldots,S_m$ be a set of strings from an alphabet $\Sigma$ of size $\sigma$. The problem is to find the shortest string that contains every string in $\mathcal S$ as a substring. If there are multiple solutions, it is enough to return one. We have a solution that computer a greedy approximation in $O(n \log \sigma)$ bits of working space.

Github Paper### Themisto: A pseudoaligner for metagenomics based on a succinct colored de Bruijn graph

A metanenomic sample is a set of sequences of reads from microbial life living in a particular environment. Standard analysis involves estimating the species composition of the environment by aligning the reads against a reference database. Since the age of pangenomics, alignment is preferentially done against a variation graph encompassing all variation within a species.

Themisto is a space-efficient tool for indexing such variation graphs. The Themisto index is a compressed colored de-bruijn graph of order $k$, where each node has a set of colors representing the reference sequences that contain the $k$-mer (substring of length $k$) corresponding to the node. Reads are pseudoaligned to the index using a method similar to the one used by the tool Kallisto: all $k$-mers of the read are located in the de-bruijn graph and the intersection of the color sets of the nodes is returned.

Github Website### Variable-order Markov models in compact space

Markov models are a basic tool in biological sequence analysis. They model sequences as random processes following simple rules where the key assumption is that the process has *limited memory*, i.e. the distribution of the next symbol in the sequence depends only on a few of the previous symbols. This is called the *Markov property*. The previous symbols that determine the distribution of the next symbol are called the current *state* or the current *context* of the process. The model is composed of a list of transition probabilities between all possible states. The transition probabilities are usually learned automatically from training data.

In *fixed-order models*, the state of the chain is the last $k$ characters of the current sequence, for some constant $k$. In *variable-order models*, the state can have variable length. We have implemented and designed a space-efficient framework for representing and running large variable-order Markov models.

### Finnish language article generation with an LSTM neural network

I trained a textgenrnn-LSTM neural network that wrote an article to the Finnish computer culture magazine Skrolli. The article was published along with a 5-page writeup on how it was done.

Published in: Skrolli-magazine issue 2019.1. (Finnish)### Space-efficient clustering of metagenomic reads

We present a framework for implementing a clustering pipeline for metagenomic reads. The pipeline works in two parts. First, reads with long shared $k$-mers are grouped together and second, groups sharing similar sequence statistics are merged.

Github Paper