My research interests span several fields that deal with large volumes of
information, including database systems, data mining, bioinformatics,
information retrieval, and machine learning. I am especially interested in
designing efficient methods, both theoretical and experimental, that reduce
the processing time and storage requirements of information.
Designing efficient algorithms for Bioinformatics and Computational Biology problems
- I am interested in designing efficient
algorithms for biological sequence analysis problems, such as multiple
sequence alignment, constrained multiple sequence alignment, discovery of
repetitive patterns in DNA sequences, and discovery of patterns with
wildcards in strings. I have published several conference and journal
papers on these problems. Techniques such as dynamic programming, greedy
algorithms, heuristic search, suffix trees, and parallel algorithms have
been customized and applied to these problems.
Biomedical Literature Classification
- Unlike conventional classification methods,
we map biomedical documents into new feature spaces using the MeSH ontology.
We then develop classification methods that exploit the properties of the
domain ontology to improve classification accuracy. The project draws on a
range of machine learning and statistical techniques. One paper has been
published on this topic.
One-To-Some shortest path algorithm
- In this project, state-of-the-art shortest
path algorithms are investigated and adapted to the one-to-some problem.
One paper has been published on this topic.
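For readers unfamiliar with the setting, the sketch below illustrates one common reading of the one-to-some problem: find shortest distances from a single source to a given subset of destination nodes. It is a textbook Dijkstra search that stops once every requested target has been settled, not the algorithm from the published paper. The function name, the adjacency-list format, and the sample graph are assumptions made for illustration, and edge weights are assumed to be non-negative.

```python
import heapq


def one_to_some_shortest_paths(graph, source, targets):
    """Return {target: distance} for every reachable target.

    graph is a hypothetical adjacency list: node -> list of (neighbor, weight),
    with non-negative weights. Unreachable targets are omitted from the result.
    """
    remaining = set(targets)
    best = {source: 0}          # best tentative distance seen so far
    settled = set()             # nodes whose distance is final
    result = {}
    heap = [(0, source)]

    # Stop early: once all targets are settled, the rest of the graph is irrelevant.
    while heap and remaining:
        dist, node = heapq.heappop(heap)
        if node in settled:
            continue            # stale heap entry
        settled.add(node)
        if node in remaining:
            result[node] = dist
            remaining.discard(node)
        for neighbor, weight in graph.get(node, []):
            candidate = dist + weight
            if neighbor not in settled and candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return result


if __name__ == "__main__":
    sample_graph = {
        "a": [("b", 1), ("c", 4)],
        "b": [("c", 2), ("d", 5)],
        "c": [("d", 1)],
        "d": [],
    }
    print(one_to_some_shortest_paths(sample_graph, "a", ["c", "d"]))
```

The early exit is the only change from the one-to-all version: when the targets are few and close to the source, most of the graph is never visited.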
Mining frequent patterns with wildcards in data streams (Supported by NSF)
- This data mining project aims to efficiently
find frequent patterns with wildcards in data streams, especially when the
alphabet size of the data stream is large. Two papers have been published
from this project, and one more is in preparation.
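As a rough illustration of the basic operation behind this kind of mining, the sketch below counts occurrences of a single pattern containing wildcards in a stream read one symbol at a time. It is a deliberately simplified assumption, not the method of the published papers: each `?` matches exactly one arbitrary symbol, the pattern is given in advance, and only the last `len(pattern)` symbols are held in memory. The names `WILDCARD` and `count_pattern_in_stream` are hypothetical.

```python
from collections import deque

WILDCARD = "?"  # assumed convention: matches exactly one arbitrary symbol


def count_pattern_in_stream(stream, pattern):
    """Count (possibly overlapping) occurrences of pattern in a symbol stream.

    stream may be any iterable; symbols are consumed once, in order.
    """
    if not pattern:
        return 0
    window = deque(maxlen=len(pattern))  # sliding buffer over the stream
    count = 0
    for symbol in stream:
        window.append(symbol)            # the oldest symbol drops out automatically
        if len(window) == len(pattern) and all(
            p == WILDCARD or p == s for p, s in zip(pattern, window)
        ):
            count += 1
    return count


if __name__ == "__main__":
    print(count_pattern_in_stream("ACGATGAAG", "A?G"))
```

A frequent-pattern miner would compare such counts against a support threshold for many candidate patterns at once, which is where a large alphabet makes the candidate space expensive to enumerate.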
Optimize classifiers for biased data sets in data streams