Genetic Links to Health

A data-centric approach to drug development


Rising drug prices are a serious concern for many Americans.

Little relief is expected as drug development costs continue to climb. As of 2016, an FDA-approved drug takes an average of $2.6 billion and 15 years to develop.

A large percentage of this effort is wasted on failed drugs. Drug candidates that make it out of the laboratory and into clinical trials have a very low success rate of around 10%.

A 2015 study suggests that the strength of association between a particular gene and a medical condition can help predict the probability of developing a successful therapy.

Genetic Links to Health & Disease

Some diseases may be caused by multiple genes, and a single gene may influence more than one disease.


Understanding the relationships between genes and diseases can help make drug development more efficient, reducing costs, development time, and consumer drug prices.

Data Processing


Gene-disease links retrieved from academic databases and literature.



Data organized and loaded into a centralized database.



Metrics created to evaluate gene-disease link quality.



Comparing gene-disease links based on two key metrics.


Therapeutic potential of gene-disease associations



Gene-disease association by disease area


Prevalence of individual diseases

Data collection

Data is sourced primarily from academic researchers who share their results through a variety of public databases.


Data organization

| Source | Format | Extraction | Content |
|---|---|---|---|
| NIH Genetics Reference | HTML | Web scrape with Beautiful Soup | Disease categories, few GDAs |
| DisGeNet | CSV | Download and load into pandas DataFrame | >420,000 GDAs |
| Human Phenotype Ontology | SQL & TSV | Query through sqlite3, IPython, and pandas | >115,000 GDAs |
| DISEASES | TSV | Download and load into pandas DataFrame | >470,000 GDAs |
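
The sources listed above arrive in different formats and need to end up in one place. As a minimal sketch of that step, the following loads two tab-separated exports into a single SQLite table. It uses only Python's standard `csv` and `sqlite3` modules instead of pandas to stay dependency-free. The table name `gda`, its columns, the helper `load_source`, and the sample rows are all hypothetical; the real exports have their own column layouts.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows standing in for downloaded TSV files.
# Gene and disease names are placeholders, not real data.
DISGENET_TSV = (
    "gene\tdisease\tscore\n"
    "GENE_A\tDisease X\t0.62\n"
    "GENE_B\tDisease X\t0.21\n"
)
DISEASES_TSV = (
    "gene\tdisease\tscore\n"
    "GENE_A\tDisease Y\t0.40\n"
)


def load_source(conn, source_name, tsv_text):
    """Parse one tab-separated export and append its rows to the shared table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [
        (source_name, r["gene"], r["disease"], float(r["score"]))
        for r in reader
    ]
    conn.executemany(
        "INSERT INTO gda (source, gene, disease, score) VALUES (?, ?, ?, ?)",
        rows,
    )
    return len(rows)


def build_database():
    """Create an in-memory database holding every source in one table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gda (source TEXT, gene TEXT, disease TEXT, score REAL)"
    )
    load_source(conn, "DisGeNet", DISGENET_TSV)
    load_source(conn, "DISEASES", DISEASES_TSV)
    conn.commit()
    return conn


conn = build_database()
genes_per_disease = conn.execute(
    "SELECT disease, COUNT(DISTINCT gene) FROM gda "
    "GROUP BY disease ORDER BY disease"
).fetchall()
print(genes_per_disease)  # [('Disease X', 2), ('Disease Y', 1)]
```

Keeping a `source` column on every row means the origin of each gene-disease association stays traceable after the sources are merged.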


Data analysis

Distribution of GDA Association Scores

Each gene-disease association (GDA) is evaluated with two metrics: one based on the strength of the association, the other on the number of associations for each individual gene and each individual disease.

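  1. Association score:

    $$\text{Association score} = (W_{\text{UniProt}} + W_{\text{CTDhuman}} + W_{\text{ClinVar}}) + (W_{\text{Rat}} + W_{\text{Mouse}}) + (W_{\text{GAD}} + W_{\text{LHGDN}} + W_{\text{BeFree}}) + W_{\text{HPO}} + W_{\text{DISEASES}}$$

    $$0 < \text{Strength of association} < 1$$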
  2. Specificity score is the inverse of the frequency, where the frequency describes the number of genes associated with a given disease and the number of diseases associated with a given gene:

    0 < Gene and disease frequency < 1
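
The text does not give the exact specificity formula, so the following is only one possible reading, sketched under stated assumptions. It takes a disease's frequency to be the share of all genes linked to that disease, and a gene's frequency to be the share of all diseases linked to that gene. It reads "inversion" as `1 - frequency` and averages the gene and disease frequencies for a pair. Taking `1 / frequency` instead would be another plausible reading. The function names and the sample pairs are hypothetical.

```python
from collections import defaultdict

# Hypothetical gene-disease pairs; the names are placeholders, not real data.
PAIRS = [
    ("GENE_A", "Disease X"),
    ("GENE_B", "Disease X"),
    ("GENE_A", "Disease Y"),
    ("GENE_C", "Disease Z"),
]


def frequencies(pairs):
    """Return (gene_freq, disease_freq), each mapping a name to a value in (0, 1]."""
    genes_by_disease = defaultdict(set)
    diseases_by_gene = defaultdict(set)
    for gene, disease in pairs:
        genes_by_disease[disease].add(gene)
        diseases_by_gene[gene].add(disease)
    n_genes = len(diseases_by_gene)
    n_diseases = len(genes_by_disease)
    # Share of all genes linked to each disease.
    disease_freq = {d: len(g) / n_genes for d, g in genes_by_disease.items()}
    # Share of all diseases linked to each gene.
    gene_freq = {g: len(d) / n_diseases for g, d in diseases_by_gene.items()}
    return gene_freq, disease_freq


def specificity(gene, disease, gene_freq, disease_freq):
    """ASSUMPTION: 'inversion' is read as 1 - mean(gene frequency, disease frequency)."""
    return 1.0 - (gene_freq[gene] + disease_freq[disease]) / 2.0


gene_freq, disease_freq = frequencies(PAIRS)
broad = specificity("GENE_A", "Disease X", gene_freq, disease_freq)
narrow = specificity("GENE_C", "Disease Z", gene_freq, disease_freq)
print(broad < narrow)  # True: the pair with fewer other links scores as more specific
```

Under this reading, a gene linked to many diseases, or a disease linked to many genes, pulls the score of its pairs down.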

The DisGeNet About Page has additional information concerning the strength of association metric.
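
The association score sums one weight per evidence source, which can be sketched as follows. The weight values are hypothetical placeholders, chosen only so the total stays within 1; they are not the values used by DisGeNet or by this project. The sketch also assumes that a source contributes its full weight when it reports the pair and nothing otherwise, which is a simplification.

```python
# Hypothetical placeholder weights, NOT the values used by DisGeNet or this project.
WEIGHTS = {
    "UniProt": 0.2, "CTDhuman": 0.2, "ClinVar": 0.1,  # first group in the formula
    "Rat": 0.05, "Mouse": 0.05,                        # second group
    "GAD": 0.05, "LHGDN": 0.05, "BeFree": 0.1,         # third group
    "HPO": 0.1,
    "DISEASES": 0.1,
}


def association_score(reporting_sources, weights=WEIGHTS):
    """Sum the weights of the sources that report a given gene-disease pair.

    ASSUMPTION: each source adds its full weight if it reports the pair, else 0.
    """
    sources = set(reporting_sources)
    unknown = sources - set(weights)
    if unknown:
        raise ValueError(f"no weight defined for: {sorted(unknown)}")
    return sum(weights[s] for s in sources)


# A pair reported by three of the ten sources.
score = association_score(["UniProt", "ClinVar", "HPO"])
print(round(score, 2))  # 0.4
```

With these placeholder weights, a pair reported by every source would reach the upper bound of the score range.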