Instructions

Note

To run the beacon-python Web Server, the requirements are as specified below:

Environment Setup

The application requires some environment variables in order to run properly; these are listed in the table below.

ENV                Default                       Description
DATABASE_URL       localhost                     The URL for the PostgreSQL server.
DATABASE_PORT      5432                          The port for the PostgreSQL server.
DATABASE_NAME      beacondb                      Name of the database.
DATABASE_USER      beacon                        Database username.
DATABASE_PASSWORD  beacon                        Database password.
DATABASE_SCHEMA    -                             Database schema, if used. Comma separated if multiple are used.
HOST               0.0.0.0                       Default host for the Web Server.
PORT               5050                          Default port for the Web Server.
DEBUG              True                          If set to True, debug logging is printed to standard output.
PUBLIC_KEY         -                             Public key, armored, for validating the token.
CONFIG_FILE        ./beacon_api/conf/config.ini  Path to the file that provides the Beacon information.
TABLES_SCHEMA      data/init.sql                 Fallback SQL schema used by beacon_init.
JWT_AUD            -                             JWT audiences. Overrides the audience variable in the configuration file.

The necessary environment variables can be set, for example, from the command line:

$ export DATABASE_URL=localhost
$ export DATABASE_PORT=5432
$ export DATABASE_NAME=beacondb
$ export DATABASE_USER=beacon
$ export DATABASE_PASSWORD=beacon
$ export HOST=0.0.0.0
$ export PORT=5050
$ export DEBUG=True
$ export PUBLIC_KEY=armored_key
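
As a rough illustration of how these variables and their defaults fit together, the following Python sketch reads a subset of them from a mapping such as os.environ. The names load_settings and DEFAULTS are hypothetical and are not part of beacon-python; the default values are copied from the table above, and the handling of DEBUG is an assumption.

```python
import os

# Defaults for a subset of the variables, copied from the table above.
DEFAULTS = {
    "DATABASE_URL": "localhost",
    "DATABASE_PORT": "5432",
    "DATABASE_NAME": "beacondb",
    "DATABASE_USER": "beacon",
    "DATABASE_PASSWORD": "beacon",
    "HOST": "0.0.0.0",
    "PORT": "5050",
    "DEBUG": "True",
}


def load_settings(env=None):
    """Return the settings, preferring values from `env` over the defaults."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    # Values read from the environment are strings; convert the ports to integers.
    settings["DATABASE_PORT"] = int(settings["DATABASE_PORT"])
    settings["PORT"] = int(settings["PORT"])
    # Assumption: only the literal string "True" enables debug output.
    settings["DEBUG"] = settings["DEBUG"] == "True"
    return settings
```

Calling load_settings({"PORT": "8080"}) overrides only the web server port and leaves every other value at its default.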

Beacon Information

By default the beacon ships with information about the beacon service. This information can be changed by editing the configuration file, which has the structure shown below, or by pointing to a different file with the CONFIG_FILE environment variable.

# This file is used to configure the Beacon `/info` API endpoint
# This file's default location is /beacon-python/beacon_api/conf/config.ini


[beacon_general_info]
# Name of the Beacon service
title=GA4GHBeacon at CSC

# Version of the Beacon implementation
version=1.8.0

# Author of this software
author=CSC developers

# Software license for this distribution
license=Apache 2.0

# Copyright holder for this software
copyright=CSC - IT Center for Science

# Documentation url for GA4GH Discovery
docs_url=https://beacon-python.readthedocs.io/en/latest/


[beacon_api_info]
# Version of the Beacon API specification this implementation adheres to
apiVersion=1.1.0

# Globally unique identifier for this Beacon instance
beaconId=fi.csc.beacon

# Description of this Beacon service
description=Beacon API Web Server based on the GA4GH Beacon API

# Homepage for Beacon service
url=https://beaconpy-elixirbeacon.rahtiapp.fi/

# Alternative URL for Beacon service for e.g. internal use cases
alturl=

# Datetime when this Beacon was created
createtime=2018-07-25T00:00:00Z

# GA4GH Discovery type `groupId` and `artifactId`, joined in /service-info with apiVersion
# See https://github.com/ga4gh-discovery/ga4gh-service-info for more information and possible values
service_group=org.ga4gh
service_artifact=beacon

# GA4GH Discovery server environment, possible values: prod, dev, test
environment=prod


[organisation_info]
# Globally unique identifier for organisation that hosts this Beacon service
org_id=fi.csc

# Name of organisation that hosts this Beacon service
org_name=CSC - IT Center for Science

# Description for organisation
org_description=Finnish expertise in ICT for research, education, culture and public administration

# Visit address of organisation
org_address=Keilaranta 14, Espoo, Finland

# Homepage of organisation
org_welcomeUrl=https://www.csc.fi/

# URL for contacting organisation
org_contactUrl=https://www.csc.fi/contact-info

# URL for organisation logo
org_logoUrl=https://www.csc.fi/documents/10180/161914/CSC_2012_LOGO_RGB_72dpi.jpg

# Other organisational information
org_info=CSC represents Finland in the ELIXIR partner nodes
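
A file with this structure can be read with Python's standard configparser module. The sketch below is only an illustration, not necessarily how beacon-python loads its configuration: the function name read_beacon_info is hypothetical, and SAMPLE_CONFIG is a shortened copy of the file above.

```python
import configparser

# A shortened copy of the configuration shown above, embedded for illustration.
SAMPLE_CONFIG = """
[beacon_general_info]
title=GA4GHBeacon at CSC
version=1.8.0

[beacon_api_info]
apiVersion=1.1.0
beaconId=fi.csc.beacon
alturl=

[organisation_info]
org_id=fi.csc
"""


def read_beacon_info(text):
    """Parse INI text into a nested dict of {section: {key: value}}."""
    # Interpolation is disabled so that a literal '%' in a value is kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    # Keep keys such as `apiVersion` case-sensitive; by default
    # configparser lowercases option names.
    parser.optionxform = str
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


info = read_beacon_info(SAMPLE_CONFIG)
print(info["beacon_api_info"]["apiVersion"])  # prints 1.1.0
```

Keys that are left empty in the file, such as alturl, come back as empty strings.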

OAuth2 Configuration

Beacon utilises OAuth2 (JWT) Bearer tokens to authenticate users when they access registered datasets. The configuration variables reside in the oauth2 section of the same CONFIG_FILE described above.

[oauth2]
# OAuth2 server that returns public key for JWT Bearer token validation
server=https://login.elixir-czech.org/oidc/jwk

# Authenticated Bearer token issuers, separated by commas if multiple
issuers=https://login.elixir-czech.org/oidc/

# Where to send access token to view user data (permissions, statuses, ...)
userinfo=https://login.elixir-czech.org/oidc/userinfo

# What the value of `AcceptedTermsAndPolicies` and `ResearcherStatus` must be in order
# to be recognised as a Bona Fide researcher
bona_fide_value=https://doi.org/10.1038/s41431-018-0219-y

# String or URI to state the intended recipient of the token.
# If your application is part of a larger network,
# the network administrator should supply you with their `aud` identifier
# in other cases, leave this empty or use the personal identifier given to you from your AAI
# For multiple values, separate values with commas, e.g. aud1,aud2,aud3
audience=

# Verify `aud` claim of token.
# If you want to validate the intended audience of a token, set this value to True.
# This option requires you to also set a value for the `audience` key above.
# If your service is not part of any network or AAI, but you still want to use tokens
# produced by other AAI parties, set this value to False to skip the audience validation step
verify_aud=False
  • server should point to an API that returns a public key, which can be used to validate the received JWT Bearer token.
  • issuers is a string of comma separated values without spaces, e.g. one,two,three. It should list the entities that are viewed as trusted organisations.
  • userinfo should point to an API that returns user data for a given access token, and bona_fide_value is the value that marks a user as a Bona Fide researcher; the Bona Fide status is ELIXIR AAI specific.
  • audience is a string of comma separated values, e.g. aud1,aud2,aud3, of intended audiences. Audience is a value in the JWT that describes which service(s) the token is intended for.

The audience hash or URI from the AAI service can be used or, if the service is part of a Beacon Network, the key provided by the Beacon Network administrator.

Leave audience empty if the service does not need to check the intended audience.

verify_aud can be set to either True or False. If enabled, Beacon verifies the audience(s) in the supplied token. If disabled, the audience(s) of a token are not validated.

Disabling verification can be a good solution for standalone Beacons that want to accept tokens generated by any authority. If verify_aud=True is set, also provide value(s) for the audience key; otherwise validation is still attempted and, because no audiences are listed, it fails.
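
The interaction between audience and verify_aud can be sketched in plain Python. This is an illustration of the behaviour described above, not the Beacon's actual validation code: the function names parse_csv and audience_accepted are hypothetical, and the rule that a single matching audience is enough is an assumption.

```python
def parse_csv(value):
    """Split a comma separated setting such as `aud1,aud2,aud3` into a list."""
    return [item for item in value.split(",") if item]


def audience_accepted(token_aud, audience, verify_aud):
    """Decide whether a token's `aud` claim passes the configured check.

    token_aud:  the `aud` claim, a string, a list of strings, or None.
    audience:   the raw `audience` setting from the [oauth2] section.
    verify_aud: the `verify_aud` setting as a boolean.
    """
    if not verify_aud:
        # Audience validation is skipped entirely.
        return True
    accepted = set(parse_csv(audience))
    if token_aud is None:
        claimed = set()
    elif isinstance(token_aud, str):
        claimed = {token_aud}
    else:
        claimed = set(token_aud)
    # Assumption: one matching audience is enough. With verify_aud enabled
    # and no audiences configured nothing can match, so validation fails.
    return bool(accepted & claimed)
```

For example, audience_accepted("aud1", "", True) is False because no audiences are configured, while the same call with verify_aud set to False returns True.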

Note

For implementing CONTROLLED dataset permissions see Handling Permissions.

beacon-python Setup

For installing beacon-python do the following:

$ git clone https://github.com/CSCfi/beacon-python
$ cd beacon-python
$ pip install -r requirements.txt
$ pip install .

Hint

Before running the application, complete the Environment Setup and Database Setup steps.

To run the application from the command line use:

$ beacon

For advanced setup see Gunicorn Setup below.

Gunicorn Setup

By default the application runs a simple aiohttp web server, which is the best solution in most cases. For other options see aiohttp Server Deployment; we recommend gunicorn.

$ gunicorn beacon_api.app:init --bind $THE_HOST:$THE_PORT \
                               --worker-class aiohttp.GunicornUVLoopWebWorker \
                               --workers 4

Database Setup

Full information about the database schema and the queries performed against it is available at: Database.

Starting PostgreSQL using Docker:

cd beacon-python
docker run -d \
           -e POSTGRES_USER=beacon \
           -e POSTGRES_PASSWORD=beacon \
           -e POSTGRES_DB=beacondb \
           -v "$PWD/data":/docker-entrypoint-initdb.d \
           -p 5432:5432 postgres:13
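
The values passed to the container correspond to the DATABASE_* environment variables from Environment Setup. As a sketch of how they relate, the hypothetical helper below (not part of beacon-python, and not necessarily how the application connects) combines them into a connection URI in the postgresql:// format accepted by PostgreSQL client libraries.

```python
from urllib.parse import quote


def postgres_dsn(env):
    """Build a postgresql:// URI from the DATABASE_* variables in `env`."""
    # Percent-encode the parts that may contain reserved characters.
    user = quote(env.get("DATABASE_USER", "beacon"), safe="")
    password = quote(env.get("DATABASE_PASSWORD", "beacon"), safe="")
    host = env.get("DATABASE_URL", "localhost")
    port = env.get("DATABASE_PORT", "5432")
    name = quote(env.get("DATABASE_NAME", "beacondb"), safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


print(postgres_dsn({}))  # prints postgresql://beacon:beacon@localhost:5432/beacondb
```

In practice the mapping passed in would be os.environ; an empty dict is used here so that the defaults are visible.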

Hint

If one has their own database, the beacon_init utility can be skipped and the existing database used instead by:

  • creating a DB View that matches the DB schema for the beacon-python server (see Database for information on the database schema and queries);
  • migrating the database to match the Database schema;
  • modifying the queries in beacon_api.utils.data_query() to fit one’s own database.

Loading data (Optional)

For loading datasets into the database we provide the beacon_init utility:

$ beacon_init --help
usage: beacon_init [-h] [--samples SAMPLES]
                  [--min_allele_count MIN_ALLELE_COUNT]
                  datafile metadata

Load datafiles with associated metadata into the beacon database. See example
data and metadata files in the /data directory.

positional arguments:
  datafile              .vcf file containing variant information
  metadata              .json file containing metadata associated to datafile

optional arguments:
  -h, --help            show this help message and exit
  --samples SAMPLES     comma separated string of samples to process.
                        EXPERIMENTAL
  --min_allele_count MIN_ALLELE_COUNT
                        minimum allele count can be raised to ignore rare
                        variants. Default value is 1

As an example, the metadata for a dataset could be:

{
    "name": "1000 genome",
    "datasetId": "urn:hg:1000genome",
    "description": "Data from 1000 genome project",
    "assemblyId": "GRCh38",
    "createDateTime": "2013-05-02 12:00:00",
    "updateDateTime": "2013-05-02 12:00:00",
    "version": "v0.4",
    "externalUrl": "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/",
    "accessType": "PUBLIC",
    "callCount": 3892,
    "variantCount": 4242
}
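
Before loading, a metadata file of this shape can be sanity-checked with a few lines of Python. The sketch below is hypothetical and not part of beacon_init: the key list is taken from the example above, and whether beacon_init requires every one of these keys is an assumption.

```python
import json

# Keys taken from the example metadata above; treating all of them as
# required is an assumption.
EXPECTED_KEYS = {
    "name", "datasetId", "description", "assemblyId", "createDateTime",
    "updateDateTime", "version", "externalUrl", "accessType",
    "callCount", "variantCount",
}


def check_metadata(text):
    """Parse metadata JSON and return a list of problems found."""
    metadata = json.loads(text)
    missing = sorted(EXPECTED_KEYS - set(metadata))
    problems = [f"missing key: {key}" for key in missing]
    for key in ("callCount", "variantCount"):
        if key not in metadata:
            continue
        value = metadata[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_count = isinstance(value, int) and not isinstance(value, bool)
        if not is_count or value < 0:
            problems.append(f"{key} must be a non-negative integer")
    return problems
```

An empty list means the file has every expected key and both counts are non-negative integers.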

To load data into the database, proceed as follows:

$ beacon_init data/ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz data/example_metadata.json

(EXPERIMENTAL) To load data into the database from selected samples only, proceed as follows:

$ beacon_init data/ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz data/example_metadata.json --samples HG0001,HG0002,HG0003

To ignore rare alleles, set a minimum allele count with --min_allele_count:

$ beacon_init data/ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz data/example_metadata.json --min_allele_count 20
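
The command-line interface shown in the help text can be approximated with argparse. This is a sketch reconstructed from that help output, not the actual beacon_init source: the function name build_parser, the integer type of --min_allele_count, and the file names in the example call are assumptions.

```python
import argparse


def build_parser():
    """Recreate the beacon_init interface shown in the help text above."""
    parser = argparse.ArgumentParser(prog="beacon_init")
    parser.add_argument("datafile", help=".vcf file containing variant information")
    parser.add_argument("metadata", help=".json file containing metadata associated to datafile")
    parser.add_argument("--samples", default=None,
                        help="comma separated string of samples to process. EXPERIMENTAL")
    # Assumption: the count is parsed as an integer; the default of 1
    # comes from the help text.
    parser.add_argument("--min_allele_count", type=int, default=1,
                        help="minimum allele count can be raised to ignore rare variants")
    return parser


# Hypothetical file names, used only to exercise the parser.
args = build_parser().parse_args(
    ["data.vcf.gz", "meta.json", "--samples", "HG0001,HG0002", "--min_allele_count", "20"]
)
# The samples option is a single comma separated string; split it into a list.
samples = args.samples.split(",") if args.samples else []
print(samples, args.min_allele_count)  # prints ['HG0001', 'HG0002'] 20
```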

Note

One dataset can have multiple files. To add more files to a dataset, repeat the command above. The parameters callCount and variantCount from the metadata file reflect values for the entire dataset. If these values are not known, they can be initialised with 0 and updated later in the beacon_dataset_counts_table table. At the moment we do not provide an option for bulk upload of files from a dataset.

Note

For loading the 1000 genome dataset see: 1000 Genome Loader instructions.