Database¶

We use a PostgreSQL database (recommended version 13) for working with beacon data. For more information on setting up the database consult Database Setup.

Attention

We recommend https://pgtune.leopard.in.ua/ for establishing PostgreSQL configuration parameters, in order to optimised database setup.

e.g. for PostgreSQL running on 8GB of RAM setting shared_buffers = 2GB effective_cache_size = 6GB can improve query performance.

Hint

Additional indexes can be added to improve query time performance

CREATE INDEX data_needed ON beacon_data_table
  USING btree(datasetId, chromosome, start, reference, alternate, varianttype,
              frequency, "end");

We use the DB schema below as a means for providing making data accessible as described by the Beacon API specification.

When designing the DB schema we also took into consideration information contained in *.vcf files.

Information for the metadata table currently needs to be provided by the data submitter, as such information cannot be extracted from *.vcf files.

The Data Provider can use the beacon_init utility to load data, but if Data Provider has a previous Database we recommend to skip the beacon_init utility and set up the database using one of the solutions:

creating a DB view that matches the DB schema for the beacon python server as described below;
migrate the database to match the DB schema;
keep own database, but modify the queries in beacon_api.utils.data_query().

CREATE TYPE access_levels AS enum('CONTROLLED', 'REGISTERED', 'PUBLIC');

CREATE TABLE IF NOT EXISTS beacon_dataset_table (
    index SERIAL,
    name VARCHAR(128),
    datasetId VARCHAR(128),
    description VARCHAR(512),
    assemblyId VARCHAR(16),
    createDateTime TIMESTAMP WITH TIME ZONE,
    updateDateTime TIMESTAMP WITH TIME ZONE,
    version VARCHAR(8),
    sampleCount INTEGER,
    externalUrl VARCHAR(256),
    accessType access_levels,
    PRIMARY KEY (index)
);

/*
The values in this table take a long time to compute on large datasets
Most likely these values do not change once the dataset is loaded so one
could compute the values and UPDATE the table once the all dataset is inserted

callcount: SELECT count(*) FROM (SELECT distinct(datasetId, chromosome, reference, start)
                                 FROM beacon_data_table) t;

variantcount: SELECT count(*) FROM beacon_data_table;
*/

CREATE TABLE IF NOT EXISTS beacon_dataset_counts_table (
    datasetId VARCHAR(128),
    callCount INTEGER DEFAULT NULL,
    variantCount BIGINT DEFAULT NULL
);

CREATE TABLE IF NOT EXISTS beacon_data_table (
    index SERIAL,
    datasetId VARCHAR(128),
    start INTEGER,
    chromosome VARCHAR(2),
    reference VARCHAR(8192),
    alternate VARCHAR(8192),
    "end" INTEGER,
    aggregatedVariantType VARCHAR(16),
    alleleCount INTEGER,
    callCount INTEGER,
    frequency REAL,
    variantType VARCHAR(16),
    PRIMARY KEY (index)
);

CREATE TABLE IF NOT EXISTS beacon_mate_table (
    index SERIAL,
    datasetId VARCHAR(128),
    chromosome VARCHAR(2), 
    chromosomeStart INTEGER,
    chromosomePos VARCHAR(128), /*for working with MATEID*/
    mate VARCHAR(2), 
    mateStart INTEGER,
    matePos VARCHAR(128), /*for working with MATEID*/
    reference VARCHAR(8192),
    alternate VARCHAR(8192),
    alleleCount INTEGER,
    callCount INTEGER,
    frequency REAL,
    "end" INTEGER,
    PRIMARY KEY (index)
);

CREATE UNIQUE INDEX data_conflict ON beacon_data_table (datasetId, chromosome, start, reference, alternate);
CREATE UNIQUE INDEX metadata_conflict ON beacon_dataset_table (name, datasetId);
CREATE UNIQUE INDEX mate_conflict ON beacon_mate_table (datasetId, chromosome, mate, chromosomePos, matePos);


CREATE OR REPLACE VIEW dataset_metadata(name, datasetId, description, assemblyId,
                                        createDateTime, updateDateTime, version,
                                        callCount, variantCount, sampleCount, externalUrl, accessType)
AS SELECT a.name, a.datasetId, a.description, a.assemblyId, a.createDateTime,
          a.updateDateTime, a.version, b.callCount,
          b.variantCount,
          a.sampleCount, a.externalUrl, a.accessType
FROM beacon_dataset_table a, beacon_dataset_counts_table b
WHERE a.datasetId=b.datasetId

Note

In order to retrieve HIT and MISS as per to the API specification, we make use of the queries exemplified below.

For HIT results, example query below searches in all datasets:

SELECT a.datasetId as "datasetId", b.accessType as "accessType", a.chromosome as "referenceName",
a.reference as "referenceBases", a.alternate as "alternateBases", a.start as "start", a.end as "end",
b.externalUrl as "externalUrl", b.description as "note",
a.alleleCount as "variantCount", a.variantType as "variantType",
a.callCount as "callCount", b.sampleCount as "sampleCount",
a.frequency, "TRUE" as "exists"
FROM beacon_data_table a, beacon_dataset_table b
WHERE a.datasetId=b.datasetId
AND b.assemblyId='GRCh38'
AND (a.start=3056601
AND NULL
AND NULL AND NULL
AND NULL AND NULL
AND coalesce(a.reference LIKE any('T'), true)
AND coalesce(a.variantType=NULL, true)
AND coalesce(a.alternate LIKE any('C'), true))
AND a.chromosome='Y'
AND coalesce(b.accessType = any('REGISTERED', 'PUBLIC'), true)
AND coalesce(a.datasetId = any('DATASET2'), true) ;

For MISS results, we would add:

SELECT DISTINCT ON (datasetId)
datasetId as "datasetId", accessType as "accessType",
'Y' as "referenceName", "FALSE" as "exists"
FROM beacon_dataset_table
WHERE AND coalesce(b.accessType = any('REGISTERED', 'PUBLIC'), true)
AND assemblyId='GRCh38'
AND coalesce(a.datasetId = any('DATASET2'), true) ;