DGST

DGST (Distributed Generalized Suffix Tree) is an efficient and scalable generalized suffix tree construction algorithm on distributed data-parallel platforms. DGST supports indexing all suffixes for a variety of strings, ranging from a single long string to multiple strings of varied lengths. In addition, DGST is built on the widely-used distributed data-parallel platform Apache Spark.

News

DGST won the first place in the 3rd National University Cloud Computing Application Innovation Competition Big Data Skills Challenge held in 2016-2017, China.

Prerequisites

Apache Spark: As DGST is built on top of Spark, you need to get the Spark installed first. If you are not clear how to setup Spark, please refer to the guidelines here. Currently, DGST is developed on the APIs of Spark 1.6.3 version.
Apache HDFS: DGST uses HDFS as the distributed file system. If you are not clear how to setup HDFS, please refer to the guidelines here. The HDFS version is 2.6.5.
Java: The JDK version is 1.7.

Compile

DGST is built using Apache Maven. To build DGST, Run mvn scala:compile compile package in the root directory.

The compiled jar is available at target/DGST-1.0-SNAPSHOT.jar

Note: A precompiled jar is provided in the lib directory.

Run

The entry of DGST is defined in the scala class GST.Main.

Run DGST with the following command:

spark-submit --master [SPARK_MASTER_ADDRESS] \
--executor-memory 2G \
--driver-memory 20G \
--total-executor-cores [Total cores in the Spark cluster] \ 
--conf spark.memory.fraction=0.9 \
--conf spark.memory.storageFraction=0.1 \
--driver-class-path [Configuration directory] \
--class GST.Main \
DGST-1.0-SNAPSHOT.jar \
<Input path> <Output path> <Temporary path> \
<Range size> <Maximum sub-tree size> <Parallelism>

The run command contains the following parameters:

<Input path>: The input data path on HDFS or local file system.
<Output path>: The output data path on HDFS or local file system.
<Temporary path>: The tempoaray data path on HDFS or local file system.
<Range size>: The size of range in the LCP-Range structure
<Maximum sub-tree size>: The Maximum sub-tree size (i.e., the maximum S-prefix frequency)
<Parallelism>: The computation parallelism on Spark

Note: --driver-class-path is the directory where the configuration file is. The parameters configured in the run command have higher priority than those in the configuration file. To find more information about the configuration file, please refer to the Configuration Guide.

Note: We also port the state-of-the-art ERa algorithm to Spark. Please replace the main class with GST.EraMain to run the ERa algorithm. The performance comparison on Spark is here.

Demo

The dataset in demo directory contais two strings. Each file represents a string.

The generalized suffix tree for the two demo strings can be found here.

Licence

Please contact the authors for the licence info of the source code.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
conf		conf
demo		demo
lib		lib
src/main		src/main
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DGST

News

Prerequisites

Compile

Run

Demo

Licence

About

Releases

Packages

Languages

PasaLab/DGST

Folders and files

Latest commit

History

Repository files navigation

DGST

News

Prerequisites

Compile

Run

Demo

Licence

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages