Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques.

Overview

abstract

Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets are often independently curated and commonly stored in unstructured spreadsheets. Standardized annotations are essential for synthesis studies that span investigators, but they are often not used in practice. As a result, accurately combining spreadsheet records from different studies requires tedious and error-prone human curation, which imposes a significant time and cost barrier on synthesis research. We propose Synthesize, an algorithm inspired by information retrieval that automatically merges unstructured data based on both column labels and column values. Applied to cancer and ecological datasets, the Synthesize algorithm achieved high accuracy (on the order of 85-100%). We further implement Synthesize in an open-source web application, Synthesizer (https://github.com/lisagandy/synthesizer). The software accepts spreadsheets in comma-separated values (CSV) format as input, visualizes the merged data, and outputs the results as a new spreadsheet. Synthesizer includes an easy-to-use graphical user interface through which the user can finish combining the data and reach perfect accuracy. Future work will add detection of units, so that continuous data can be merged automatically, and will extend the algorithm to other data formats, including databases.
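
The idea of matching columns on both labels and values can be illustrated with a minimal sketch. This is not the published Synthesize algorithm: the function names (read_columns, label_similarity, value_similarity, match_columns), the equal weighting, the 0.5 threshold, and the choice of similarity measures (a difflib string ratio for labels, Jaccard overlap for values) are all assumptions made for illustration only.

```python
import csv
import io
from difflib import SequenceMatcher


def read_columns(csv_text):
    """Parse CSV text into {column label: list of cell values}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    labels = list(rows[0].keys()) if rows else []
    return {label: [row[label] for row in rows] for label in labels}


def label_similarity(a, b):
    """Case-insensitive string similarity between two column labels (0 to 1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_similarity(xs, ys):
    """Jaccard overlap between the distinct, normalized values of two columns."""
    sx = {x.strip().lower() for x in xs}
    sy = {y.strip().lower() for y in ys}
    union = sx | sy
    return len(sx & sy) / len(union) if union else 0.0


def match_columns(left, right, weight=0.5, threshold=0.5):
    """Map each left column to its best-scoring right column, if any.

    The score is a weighted mix of label and value similarity (assumed
    weighting); columns scoring below the threshold stay unmatched.
    """
    matches = {}
    for l_label, l_vals in left.items():
        best, best_score = None, threshold
        for r_label, r_vals in right.items():
            score = (weight * label_similarity(l_label, r_label)
                     + (1 - weight) * value_similarity(l_vals, r_vals))
            if score >= best_score:
                best, best_score = r_label, score
        if best is not None:
            matches[l_label] = best
    return matches
```

In this sketch, a column labeled "Gender" can still be paired with one labeled "sex" because the two share the same values, even though the labels barely resemble each other. Columns left unmatched would be the ones a user resolves by hand.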

authors

Marchionni, Luigi
Xia, Xiaoxin
Shankrit, Shambhavi
Fertig, Elana J

publication date

April 24, 2017

published in

PLoS ONE Journal

Research

keywords

Database Management Systems
Information Storage and Retrieval
Software

Identity

PubMed Central ID

PMC5402950

Scopus Document Identifier

85018585958

Digital Object Identifier (DOI)

10.1093/bioinformatics/btu375

PubMed ID

28437440

Additional Document Info

has global citation frequency

1

volume

12

issue

4

VIVO Weill Cornell Medical College

Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques. Academic Article
