Skip to contents

The basic workflow of the rPHG2 package is as follows:

  1. Create a connection object
  2. Read data into the R environment
  3. Analyze and visualize data retrieval

This document introduces you to rPHG2’s methods and grammar, and shows you how to apply them to the previously mentioned workflow.

Creating connection objects

rPHG objects can be created through two primary sources: * local data * server connections

Creating initial “connection” objects helps unify downstream reading and evaluation steps for PHGv2 data. In the next couple of sections, we will show you how to create either local or server connnection objects.

Local data

Local connections are for local TileDB instances or direct locations of hVCF files on a local disk. To create a local connection, we will create a PHGLocalCon object using the following constructor function:

For the above example, we will create a connection to some example local hVCF files provided with the rPHG2 package:

  • LineA.h.vcf
  • LineB.h.vcf

Since the full paths to these files will differ between each user, we can use the system.file() function to get the full path:

system.file("extdata", "LineA.h.vcf", package = "rPHG2")

This will get the full path from the extdata directory for the file, LineA.h.vcf, found in the rPHG2 source code.

We can further build on this to create a collection of full file paths to our hVCF data:

hVcfFiles <- system.file(
    "extdata", 
    c("LineA.h.vcf", "LineB.h.vcf"), 
    package = "rPHG2"
)

Now that we have a collection of hVCF files, we can use the PHGLocalCon() constructor function to create a PHGLocalCon object:

localCon <- hVcfFiles |> PHGLocalCon()

localCon
## A PHGLocalCon connection object
##   DB URI....   : 
##   hVCF Files...: 

From here, we can move to the next section to create a HaplotypeGraph object interface with the JVM.

Server connections

Note

We are still actively working on this section and will not have the same compatibility as rPHG. Stay tuned for further details!

Conversely, server locations are for databases served on publicly available web services leveraging the Breeding API (BrAPI) endpoints. Since this is a connection to a server, a URL path instead of a local file path will be needed. We will use the PHGServerCon() constructor function to create a PHGServerCon object:

srvCon <- "phg.maizegdb.org" |> PHGServerCon()

srvCon
## A PHGServerCon connection object
##   Host............: phg.maizegdb.org
##   Server Status...: 200 (OK)

In the above example, we have made the assumptions that this URL:

  1. Uses a secure transfer protocol (https)
  2. Uses default ports for data serving
  3. Has BrAPI specified endpoints

For points 1 and 2, if the URL uses non-secure protocols (“http”) and/or has a modified port number, you will need to specify these with the protocol and port parameters in the constructor function. For example:

"www.my-unsecure-phg.org" |> 
    PHGServerCon(
        port     = 5300, 
        protocol = "http"
    )

For point 3, if the constructor cannot resolve mandatory endpoints, an exception will occur.

Creating JVM objects

Note

This will require the rJava package to be installed along with a modern version of Java (e.g., \geq v17) to work properly!

Now that we have either a local or server-based connection object, we can convert the raw hVCF data into a HaplotypeGraph JVM object and bridge the Java reference pointer to R. Before we build the JVM graph object, we need to initialize the JVM and add the JAR files to our environment that are found within the latest distribution of PHGv2. If you have not downloaded this, please see instructions here before you continue! To initialize, run the following command:

initPhg("phg/lib/path")

…where phg/lib/path is the lib directory found within the decompressed release of PHGv2. Now that the JVM has been initialized, we can build the JVM graph using buildHaplotypeGraph() using the local connection object as input:

graph <- localCon |> buildHaplotypeGraph()

graph
## A HaplotypeGraph object @ 7d7758be
##   # of ref ranges....: 38
##   # of samples.......: 2
##   # of chromosomes...: 2

In the above example, we took the local connection object and passed it into a HaplotypeGraph constructor. Here, we have a basic class that contains a pointer object where we can direct data from Java to R:

graph |> javaRefObj()
## [1] "Java-Object{net.maizegenetics.phgv2.api.HaplotypeGraph@7d7758be}"

Reading data

Now that we have created a HaplotypeGraph object, we can begin reading data using the read* family of rPHG2 functions.

Sample IDs

To return a vector of sample IDs from the graph object, we can use the readSamples() function:

graph |> readSamples()
## [1] "LineA" "LineB"

Reference ranges

To return information about all reference ranges found within the graph object, we can use the readRefRanges() function. This will return a GRanges object which is a common data class in the GenomicRanges package:

graph |> readRefRanges()
## GRanges object with 38 ranges and 1 metadata column:
##        seqnames      ranges strand |         rr_id
##           <Rle>   <IRanges>  <Rle> |   <character>
##    [1]        1      1-1000      * |      1:1-1000
##    [2]        1   1001-5500      * |   1:1001-5500
##    [3]        1   5501-6500      * |   1:5501-6500
##    [4]        1  6501-11000      * |  1:6501-11000
##    [5]        1 11001-12000      * | 1:11001-12000
##    ...      ...         ...    ... .           ...
##   [34]        2 38501-39500      * | 2:38501-39500
##   [35]        2 39501-44000      * | 2:39501-44000
##   [36]        2 44001-45000      * | 2:44001-45000
##   [37]        2 45001-49500      * | 2:45001-49500
##   [38]        2 49501-50500      * | 2:49501-50500
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Haplotype IDs

To return all haplotype IDs as a “sample ×\times reference range” matrix object, we can use the readHapIds() function:

m <- graph |> readHapIds()

# Show only first 3 columns
m[, 1:3]
##          1:1-1000                           1:1001-5500                       
## LineA_G1 "12f0cec9102e84a161866e37072443b7" "3149b3144f93134eb29661bade697fc6"
## LineB_G1 "4fc7b8af32ddd74e07cb49d147ef1938" "8967fabf10e55d881caa6fe192e7d4ca"
##          1:5501-6500                       
## LineA_G1 "1b568197f6f329ec5b71f66e49a732fb"
## LineB_G1 "05efe15d97db33185b64821791b01b0f"

Haplotype ID metadata

To return metadata for each haplotype ID as a tibble object, we can use the readHapIdMetaData() function:

## # A tibble: 76 × 6
##    hap_id                 sample_name description source checksum ref_range_hash
##    <chr>                  <chr>       <chr>       <chr>  <chr>    <chr>         
##  1 0eb9029f3896313aebc69… LineA       haplotype … /User… Md5      39f96726321b3…
##  2 12f0cec9102e84a161866… LineA       haplotype … /User… Md5      546d1839623a5…
##  3 13417ecbb38b9a159e3ca… LineA       haplotype … /User… Md5      5812acb1aff74…
##  4 18498579d89483ac270e8… LineA       haplotype … /User… Md5      e07d04d2fc96b…
##  5 184a72815a2ba5949635c… LineA       haplotype … /User… Md5      cb86faf105b19…
##  6 1b568197f6f329ec5b71f… LineA       haplotype … /User… Md5      d896e9cc56e74…
##  7 1e38bd82670c3f10982f7… LineA       haplotype … /User… Md5      db22dfc14799b…
##  8 3149b3144f93134eb2966… LineA       haplotype … /User… Md5      57705b1e2541c…
##  9 369464a8743d2e40ad83d… LineA       haplotype … /User… Md5      66465399052d8…
## 10 3ec680649615da0685b8c… LineA       haplotype … /User… Md5      347f0478b1a55…
## # ℹ 66 more rows

Haplotype ID metadata (positions)

To return positional information for each haplotype ID as another tibble object, we can use the readHapIdPosMetaData() function:

## # A tibble: 76 × 5
##    hap_id                           contig_start contig_end start   end
##    <chr>                            <chr>        <chr>      <int> <int>
##  1 0eb9029f3896313aebc69c8489923141 2            2          49501 50300
##  2 12f0cec9102e84a161866e37072443b7 1            1              1  1000
##  3 13417ecbb38b9a159e3ca8c9dade7088 2            2              1  1000
##  4 18498579d89483ac270e8cca57f34752 1            1          16501 17500
##  5 184a72815a2ba5949635cc38769cedd0 2            2          12001 16500
##  6 1b568197f6f329ec5b71f66e49a732fb 1            1           5501  6500
##  7 1e38bd82670c3f10982f70390c599a8d 2            2          16501 17500
##  8 3149b3144f93134eb29661bade697fc6 1            1           1001  5500
##  9 369464a8743d2e40ad83d1375c196bdd 1            1           6501 11000
## 10 3ec680649615da0685b8c245e0f196e2 2            2          11001 12000
## # ℹ 66 more rows

All hVCF data

In a majority of cases, we may need more than one piece of hVCF data. We can read all of the prior data simultaneously as PHGDataSet object which is an in-R-memory representation of all hVCF data:

phgDs <- graph |> readPhgDataSet()
phgDs
## A PHGDataSet object
##   # of ref ranges....: 38
##   # of samples.......: 2
## ---
##   # of hap IDs.......: 76
##   # of asm regions...: 76

If this object is created, we can use the prior read* methods to instantaneously pull out the previously mentioned R data objects:

Filter data

In some cases, you may want to query information and focus on one specific reference range and/or sample(s). We can filter our PHGDataSet objects using the filter* family of functions.

For the following examples, let’s picture our primary data as a 2-dimensional matrix of:

  • samples (rows)
  • reference ranges (columns)
  • hap ID data (elements)

If we were to represent this as an object in R called phgDs, it would look like the following diagram:

phgDs
## A PHGDataSet object
##   # of ref ranges....: 38
##   # of samples.......: 2
## ---
##   # of hap IDs.......: 76
##   # of asm regions...: 76

…where each individual colored cell is a haplotype ID.

Filter by sample

If we want to filter based on sample ID, we can use the filterSamples() function. Simply add the sample ID or collection of sample IDs as a character string or a vector of character strings, respectively. Using the prior diagram as a reference, I will filter out anything that is not the following:

  • B73
  • Ky21
  • Mo17
phgDs |> filterSamples(c("B73", "Ky21", "Mo17"))

Note: If samples are added to the filter collection, but are not present in the data, they will be discarded. If no samples are found, an exception will be thrown stating that no samples could be found.

Filter by reference range

If we want to filter based on sample ID, we can use the filterRefRanges() function. Currently, this takes a GRanges object where we can specify integer-based ranges located in genomic regions by chromosome (i.e., seqnames). For example, if I want to return all ranges that intersect with the following genomic regions:

Chromosome Start (bp) End (bp)
"1" 100 400
"2" 400 900

I can create the following GRanges object and pass that to the filter method:

gr <- GRanges(
    seqnames = c("1", "2"),
    ranges = IRanges(
        c(100, 800),
        c(400, 900)
    )
)


phgDs |> filterRefRanges(gr)

Chaining methods

Since we have two dimensions, we can filter simultaneously by “piping” or combining filter methods in one pass:

gr <- GRanges(
    seqnames = c("1", "2"),
    ranges = IRanges(
        c(100, 800),
        c(400, 900)
    )
)

phgDs |> 
    filterSamples(c("B73", "Ky21", "Mo17")) |> 
    filterRefRanges(gr)

Summarize and visualize data

In addition to parsing multiple hVCF files into R data objects via a PHGDataSet, rPHG2 also provides functions to summarize and visualize data.

Global values

To begin, we can return general values on the number of observations within our dataset using the numberOf* family of functions:

phgDs
## A PHGDataSet object
##   # of ref ranges....: 38
##   # of samples.......: 2
## ---
##   # of hap IDs.......: 76
##   # of asm regions...: 76
# Get number of samples/taxa
phgDs |> numberOfSamples()
## [1] 2
# Get number of chromosomes
phgDs |> numberOfChromosomes()
## [1] 2
# Get number of reference ranges
phgDs |> numberOfRefRanges()
## [1] 38
# Get number of haplotype IDs
phgDs |> numberOfHaplotypes()
## [1] 76

Get counts of unique haplotype IDs

To get the unique number of haplotypes per reference range in our PHGDataSet object, we can use numberOfHaplotypes() again, but set the internal parameter byRefRange to TRUE:

phgDs |> numberOfHaplotypes(byRefRange = TRUE)
## # A tibble: 38 × 7
##    rr_id         n_haplo seqnames start   end width strand
##    <chr>           <int> <fct>    <int> <int> <int> <fct> 
##  1 1:1-1000            2 1            1  1000  1000 *     
##  2 1:1001-5500         2 1         1001  5500  4500 *     
##  3 1:11001-12000       2 1        11001 12000  1000 *     
##  4 1:12001-16500       2 1        12001 16500  4500 *     
##  5 1:16501-17500       2 1        16501 17500  1000 *     
##  6 1:17501-22000       2 1        17501 22000  4500 *     
##  7 1:22001-23000       2 1        22001 23000  1000 *     
##  8 1:23001-27500       2 1        23001 27500  4500 *     
##  9 1:27501-28500       2 1        27501 28500  1000 *     
## 10 1:28501-33000       2 1        28501 33000  4500 *     
## # ℹ 28 more rows

Visualize counts of unique haplotype IDs

We can visualize the prior section using a member of the plot* family of functions, plotHaploCounts(). By default, this will produce a scatter plot of data found across all reference ranges in the reference genome (similar to a Manhattan plot):

phgDs |> plotHaploCounts()

To get more “granular” views of specific regions within our data, we can pass a GRanges object from the package GenomicRanges to the gr parameter:

# library(GenomicRanges)

query <- GRanges(
    seqnames = "1", 
    ranges = IRanges(start = 50, end = 10000)
)

phgDs |> plotHaploCounts(gr = query)

We can also query mutltiple regions of interest simultaneously by adding more observations to the query object:

# library(GenomicRanges)

query <- GRanges(
    seqnames = c("1", "2"), 
    ranges = IRanges(start = c(50, 50), end = c(10000, 7500))
)

phgDs |> plotHaploCounts(gr = query)

Finally, we provide 3 separate “geometry” options for plotting using the geom parameter: * "l" - line plotting (e.g., geom_line()) (default) * "b" - bar plotting (e.g., geom_bar()) * "p" - point plotting (e.g., geom_point())

For example, if we want to switch this from a line plot to a bar plot, we can specify geom = "b":

phgDs |> plotHaploCounts(gr = query, geom = "b")

Visualize distributions of unique haplotype IDs

To get a more “global” estimate of unique haplotype ID distribution across the genome, we can us the function plotHaploDist():

phgDs |> plotHaploDist()