NCBI Molecular Biology Resources & Tools

Workshop description

A challenge facing researchers today is piecing together and analyzing the wealth of data generated by genome projects and other large-scale biological data projects. Finding the data, both in their original submitted form and in curated or annotated forms that reflect their integration into reference resources, is difficult. The web site of the National Center for Biotechnology Information (NCBI) serves as an integrated, one-stop genomic information infrastructure for biomedical researchers around the world, so that they may use these data in their research. These biological databases and tools are now an important part of biomedical research, and life scientists need to keep learning and updating their skills. To contribute to these efforts, a three-day training course entitled “A Field Guide to NCBI Molecular Biology Resources and Tools” will be held on October 27-29, 2015 in Rabat, Morocco, to train national and regional biomedical and life scientists in using these data and tools.

The National Center for Biotechnology Information (NCBI) at the National Institutes of Health was created in 1988 to develop information systems for molecular biology. In addition to maintaining the GenBank biological sequence database, NCBI provides data retrieval systems and computational resources for the analysis of GenBank data and many other biological data through the NCBI web site. NCBI resources and tools include PubMed, BLAST, OrfFinder, Reference Sequence, UniGene, dbMHC, dbSNP, Map Viewer, Sequence Read Archive (SRA), Gene Expression Omnibus (GEO), Online Mendelian Inheritance in Man (OMIM), Conserved Domain Database (CDD), Protein Clusters, and many other databases and tools. In most cases, the data underlying these resources, and executables for many of the software tools, are also available for download at ftp.ncbi.nih.gov. For the purposes of this workshop, the NCBI suite of resources is grouped into a number of categories; a few of them will be discussed in detail, and recent developments will be summarized.

This workshop is intended to help participants gain hands-on experience with the resources and tools at the NCBI site and learn how to explore these resources and apply these tools in their own research area. The course will provide a broad overview of how large-scale data projects are received and processed by NCBI. NCBI data resources and data analysis tools will be presented, along with the most challenging aspect, data integration, which creates an integrated information infrastructure for efficient navigation. Attendees will be shown the most efficient ways of navigating NCBI resources, how to keep up with what’s new, and how to use NCBI help pages and tutorials. In particular, the training will introduce and give an update on some of the most successful tools, such as the sequence alignment package that includes the family of BLAST programs, with its latest improvements. Specialized genome resources and the new Metagenomics and Epigenomics projects will also be presented. Human genetic variation resources, including types of variations, disease markers, and classification of SNPs, will be described. Participants will also learn how to use PubMed accurately and efficiently to search MEDLINE and the NCBI Bookshelf. The objectives include helping participants understand how data sets are generated, deposited and managed at NCBI, learn how to find and download data sets of interest, and gain experience in using genome-scale data and high-level analysis tools to display and compare these data and extract useful, pertinent information.

The workshop sessions combine lectures, demonstrations and hands-on exercises, possibly drawing on actual user questions, with each session emphasizing a different set of NCBI resources. Each session uses specific examples to highlight important features of the resources and tools under study and to demonstrate how to accomplish common tasks. The tutorials describe specific tasks that can be completed using combinations of these tools and databases. The hands-on exercises will strengthen participants’ understanding of common bioinformatics tasks using NCBI resources and tools. In particular, these hands-on sessions will target computational genomics case studies that illustrate current methods and strategies in bioinformatics and their application in biomedical research. These interactive sessions will also give participants the opportunity to offer comments and suggestions on NCBI services to the NCBI trainers. Detailed handouts will provide step-by-step instructions and additional information about each example illustrated. The handouts and presentation materials will be made available online before the workshop. Some of the topics that will be covered in this event are detailed below.

Genome databases: There will be a presentation of the NCBI Genome database: past, present, and future development. In September 1995 NCBI created the Genome database to handle the data obtained from large-scale sequencing projects for genomes and chromosomes. This coincided with the release of the complete Haemophilus influenzae genome sequence, marking the start of a new era of megabase sequence generation and analysis. The first version of the Genome database contained data from small viral and organelle genomes and from complete and near-complete genomes of bacteria and lower eukaryotes. From the very beginning, NCBI genome resources were designed to represent the genome data for a given organism, and the viewers and tools in Genome were presented in a taxonomic context. With the rapid advances in genome sequencing technologies, genome data have become more complex and enriched with additional information. An organism is no longer represented by a single reference genome but rather by many variant genomes of multiple individuals or isolates. The Genome database is becoming an aggregated collection of genome sequence assemblies and annotated genes obtained from biological samples representing natural variation within the population. Future developments include better navigation through thousands of genome assemblies and improved visualization and browsing tools based on the new NCBI Tree Viewer tool and a Solr-based table generator.

Genomes and GenBank: Sequencing the genomes of prokaryotes and eukaryotes has become an almost commonplace research activity, and the rate of genome submissions to GenBank has been increasing, with a potential avalanche looming in the near future. Currently, most of these genomes are in multiple pieces and not assembled into complete genomes; these are processed as WGS (whole genome shotgun) submissions. Various NCBI groups have been building or redesigning multiple databases to handle the submission, storage and display of WGS and complete genomes, and NCBI is nearing the holy grail of having all of these separate entities closely integrated. The speakers will describe how all of these pieces are coming together and the current status of genomes in GenBank.

ClinVar: Improving Variant Interpretation through Data-sharing. The widespread use of next-generation sequencing technology has allowed genetic testers to identify many more variants than ever before. However, many of the variants identified through clinical testing are uncommon and may be novel to the testing lab, which makes it challenging for clinical lab directors to interpret the clinical significance of those variants. The ClinVar database is part of NCBI’s suite of medical genetics resources, along with the Genetic Testing Registry and MedGen, and it serves as a centralized archive of variant interpretations. It is an important forum for data-sharing, as many of the interpretations are submitted by clinical testing labs that would not otherwise publish the data. Additionally, interpretations from other submitters may be public in other forms, but the data are not always standardized. Some of the issues involved in storing data related to variant interpretations, standardizing the data, and dealing with a diverse group of users will be discussed.

dbSNP Windfalls and Pitfalls: Annotation for Improved Data Quality. From its inception in 1998, dbSNP was designed to serve as a repository of molecular variation that includes single nucleotide polymorphisms (SNPs) and multiple small-scale variations such as insertions/deletions, microsatellites, and non-polymorphic variants. Since submission to dbSNP is not restricted to common or neutral polymorphisms, submissions from all classes of simple molecular variation can be accepted, including rare variations of germline or somatic origin that are clinically significant. dbSNP is a useful tool for the rapidly growing areas of somatic mutation identification and association studies, because it accepts all variation classes and aggregates submissions from multiple sources to augment NCBI’s human Reference SNP (rs) collection, which now numbers over 50 million records. Users often employ dbSNP to identify a disease-causing variation by filtering their experimental results against dbSNP records. Many users, however, may misinterpret their filtering results and misidentify a causal variation when they filter for common germline polymorphisms without knowing about the many other classes of rare and somatic variations that exist in dbSNP. In this workshop, the steps that the dbSNP team has taken to make users aware of the many classes of variation archived in dbSNP, and so prevent misinterpretation of experimental results, will be discussed. These steps include a name change from the database of “Single Nucleotide Polymorphism” to the database of “Short Genetic Variation”; the addition of new curated attributes to Reference SNP (rs) records; and the creation of Variant Call Format (VCF) files that will aid in the identification of novel and causal variations by allowing users to identify the best approximate “polymorphic” variations in their studies.

dbSNP currently houses more than 1.5 billion Submitted SNPs (SS) and 0.7 billion Reference SNPs (RS) from over 350 organisms. Through NCBI mapping and annotation processing of these variants on genome and RefSeq sequences, dbSNP now has a total of 250 million RS with predicted functional consequences (synonymous, non-synonymous, etc.) that are linked to 550 thousand genes. The functional consequence data in dbSNP can be used to interpret how sequence variants give rise to individual traits and phenotypes such as hair color, eye color and height, whether a person is likely to develop a certain disease, and how a person’s variant make-up may affect their response to treatment. The results of NCBI’s recent analysis of dbSNP’s genome and RefSeq sequence annotation, which will be used to develop QA metrics and testing for improved dbSNP data quality and accuracy, will be discussed.

The NCBI Epigenomics database: The NCBI Epigenomics resource has been created to serve as a comprehensive public repository for whole-genome epigenetic data sets. By selecting the subset of epigenetics-specific data from GEO and then subjecting them to further review and annotation, the NCBI Epigenomics group has developed a concise and accessible resource. Epigenomic data tracks can be viewed using popular genome browsers or downloaded for local analysis. New features and improvements are continuously being implemented based on users’ feedback. Substantial usability improvements have been made to the user interfaces, functionality has been enhanced, data tracks of interest have been made easier to identify, and new tools for preliminary data analyses have been created. Additionally, efforts have been made to enhance the integration between the Epigenomics resource and other NCBI databases, including NCBI Gene and PubMed. These aspects of the resource will be highlighted.

New BLAST reports: BLAST (Basic Local Alignment Search Tool) is a bioinformatics workhorse for sequence search and alignment, with over 200,000 interactive searches per day performed at the NCBI web site alone. In spite of this heavy usage, the standard BLAST report has changed little since it was originally conceived as a text report for a stand-alone program. Among other issues, the current report is difficult to navigate, may present data the user does not want, may omit data the user does want, and does not make good use of modern web technology. A new BLAST report that fixes these issues has been designed and implemented. First, an overview of the current BLAST report will be presented and its problems discussed. Second, alternate BLAST report formats will be discussed. Finally, a completely redesigned BLAST report will be presented and discussed. It has improved navigation, offers formatting of alignments on demand, and allows better access to other NCBI resources.

Pathogen Outbreak Analysis: With a growing world population, much greater mobility around the world, and the shipping of food over long distances, outbreaks of disease are an increasing risk for humanity. NCBI has carried out a number of projects in this area, including the influenza resource and the development of viral and bacterial reference genomes. With the advent of high-throughput sequencing, not only are more genomes being sequenced, but microbial genome sequence is also becoming a diagnostic tool. Some background work by several NCBI groups will be discussed, along with how this work and NCBI’s general infrastructure for sequence data are being applied to new collaborations with the FDA and others on foodborne outbreak surveillance, and how this affects many NCBI projects.

The Pathogen Detection Pipeline: Recent news has featured deadly outbreaks of foodborne illness, cholera, and antibiotic-resistant hospital infection in military environments. Despite the great strides in epidemiology, we are only now beginning to see the use of genome sequencing on a large scale to meet the challenges of disease outbreaks. In principle it should be possible to accept sequencing data from submitters and produce an analysis result that identifies a close (or even clonal) analog already in NCBI’s database, measures the distance of each isolate to other isolates or reference genomes, and provides clues that can aid attribution of a sample to a particular source. NCBI can play a role in outbreak analysis at the molecular level by leveraging its stock of more than 5000 complete bacterial genomes and its extensive computational infrastructure. Firstly, distance trees based on protein markers, k-mer spectra, and alternate taxonomy are now being computed on a periodic basis. Secondly, a robust pipeline for assembling and analyzing outbreak genomes has been implemented. Finally, data storage and visualization inspired by the 1000 Genomes browser and its infrastructure are now being considered. The Pathogen Detection Pipeline will produce a report that includes a list of variations, the context around each variation, evidence that supports the variation call, and a phylogenetic tree computed from the set of variations. Development of this resource has touched and will touch many corners of NCBI. An overview of the pathogen detection pipeline under development will be given.
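
To give a feel for what a k-mer based distance between isolates means, here is a minimal Python sketch. It is not NCBI's pipeline: the Jaccard distance on k-mer sets, the function names, the choice of k and the toy sequences are all assumptions made for illustration only.

```python
def kmer_set(sequence, k=4):
    """Return the set of all overlapping k-mers in a DNA sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=4):
    """Distance in [0, 1]: 0 means identical k-mer sets, 1 means no shared k-mer."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = a | b
    if not union:
        # Both sequences are shorter than k: treat them as indistinguishable.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def closest_isolate(query, isolates, k=4):
    """Return the name of the stored isolate with the smallest distance to the query."""
    return min(isolates, key=lambda name: jaccard_distance(query, isolates[name], k))


# Hypothetical toy sequences standing in for assembled isolate genomes.
isolates = {
    "isolate_A": "ATGGCGTACGTTAGCCTAGGCTA",
    "isolate_B": "ATGGCGTACGTTAGCCTAGGCTT",
    "isolate_C": "TTTTCCCCGGGGAAAATTTTCCC",
}
query = "ATGGCGTACGTTAGCCTAGGCTA"

for name, seq in isolates.items():
    print(name, round(jaccard_distance(query, seq), 3))
print("closest:", closest_isolate(query, isolates))
```

Real genomes are millions of bases long, so practical tools use longer k-mers and sampling or sketching techniques rather than full k-mer sets; the sketch only shows the idea of ranking stored isolates by distance to a new sample.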

The NCBI SRA, an evolving resource: As biology continues down the path of supporting discovery with systematic sequencing efforts based on next-generation technologies, NIH has directed NCBI’s Sequence Read Archive (SRA) to continue to serve as the central repository for sequence results emanating from these studies. The NCBI instance of the Sequence Read Archive, already the biggest biological database in the world, continued to show huge expansion in 2011 with the acquisition of read placements in addition to raw sequencing data. The definition of “primary data” has been augmented to include read placements in addition to raw reads. Archiving of reference genome alignments and de novo contig multiple alignments will allow users of the archive to “drill down” to primary reads as evidence. The introduction of reference alignments will also help compress the footprint of this enormous archive, making it more tractable for long-term service. Since its inception in 2007 the SRA has accumulated over 300 terabases of raw sequencing data from human and non-human sources, and it is currently growing by 20 terabases per month. Reducing the footprint of the existing archive and of new deposits is of critical importance to both NCBI and its users. Using a fully indexed columnar database design, the SRA toolkit has brought lossless compression of the SRA from 8 bytes per base in 2008 to under 1 byte per base today, with data from the latest sequencing platforms under 0.5 bytes per base. The toolkit also manages alignment properties of reads such as those delivered with BAM files. An update on the content of the SRA will be provided.
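
The effect of these compression rates on storage can be worked out with simple arithmetic. The short Python sketch below uses only the figures quoted above (300 terabases, 20 terabases per month, and 8, 1 and 0.5 bytes per base). Since the current rates are quoted as "under" 1 and 0.5 bytes per base, the last two results are upper bounds; the function name is hypothetical and decimal terabytes (10^12 bytes) are assumed.

```python
TERA = 10**12  # one terabase = 10^12 bases; one (decimal) terabyte = 10^12 bytes


def footprint_terabytes(total_bases, bytes_per_base):
    """Storage needed, in terabytes, for a number of bases at a given compression rate."""
    return total_bases * bytes_per_base / TERA


archive_bases = 300 * TERA        # "over 300 terabases" accumulated
monthly_growth_bases = 20 * TERA  # "20 terabases per month"

rates = [
    ("2008 rate (8 bytes/base)", 8.0),
    ("current rate (upper bound, 1 byte/base)", 1.0),
    ("latest platforms (upper bound, 0.5 bytes/base)", 0.5),
]

for label, rate in rates:
    archive_tb = footprint_terabytes(archive_bases, rate)
    growth_tb = footprint_terabytes(monthly_growth_bases, rate)
    print(f"{label}: archive {archive_tb:.0f} TB, growth {growth_tb:.0f} TB/month")
```

At the 2008 rate the same 300 terabases would occupy 2400 TB, against at most 300 TB at 1 byte per base, which is why reducing bytes per base matters as much as adding disks.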

NCBI Submission Portal: Submission Portal is a new project at NCBI. Its ambitious goal is to offer the external world a single, easy-to-navigate resource for data submissions to NCBI. Internally, it is a platform that includes all the required infrastructure and a framework for easy creation of individual submission apps. The challenges, lessons learned and some interesting opportunities in the world of submissions and curation at NCBI will be discussed.

Medical Information: Finding, organizing, and using medical information is the gateway to improving knowledge and skills in accessing health information. NLM’s objective in supporting this workshop is to assist local students and scientists in creating an approach for strengthening libraries through outreach and training in Africa. The course will be divided into four modules: 1-Information Sources, 2-Searching Tools, 3-Electronic Information Searching Techniques, and 4-Management and Evaluation of Information. Information literacy has become an essential requirement in today’s world of information and technology. The web has revolutionized information retrieval, and in today’s information world the word literacy goes beyond the traditional definition of knowing how to read and write. The world henceforth will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely. This cannot be overemphasized when it comes to health, as information literacy is crucial in ensuring that all those involved with health, including biomedical students and researchers, healthcare providers, medical librarians and consumers, know how to find, use and manage health information. Access to correct and accurate information is a prerequisite for good health.

No advanced computer skills are needed, but familiarity with NCBI tools will be an advantage. Although this Field Guide to NCBI Resources is primarily designed as an introduction to and survey of the NCBI suite for researchers, educators and students, experienced users will also find the course useful, as many updates and new tools will be presented.

In this edition, local PhD students will serve as assistant trainers for the hands-on sessions under the supervision of NCBI instructors. Indeed, one of our objectives is to increase the share of local and regional instructors in such workshops, as an indicator of a real capacity-building process. There will also be a session for students to present their work related to the topic of this workshop, namely genomics and genomic variation. Students are invited to submit an abstract. Students whose abstracts are selected will have their costs covered.

Workshop Objectives

This training is designed for all NCBI users who want to increase their efficiency in searching, navigating and analyzing biological data and results. Participants will learn tips and tricks for locating, searching and moving between the information and data in NCBI’s collection of databases, resources and tools. Medical applications of these NCBI resources and tools will be emphasized and discussed. The objectives also include building teaching and research capacity in bioinformatics in Morocco and the region, and paving the way for technology and knowledge transfer into academia and health care institutions.

Course Outline

The following is a brief description of the topics that the workshop will cover:

Part 1 : NCBI Structure and Mission

Part 2 : NCBI Discovery

Part 3 : Data integration at NCBI

Part 4 : What’s new at NCBI and What’s next

Part 5 : Genome assembly, Genome annotation

Part 6 : Genome databases and GenBank

Part 7 : Genome Variation : ClinVar and Variant Interpretation

Part 8 : The NCBI Epigenomics database

Part 9 : New BLAST reports

Part 10 : Pathogen Outbreak Analysis

Part 11 : The NCBI SRA, an evolving resource

Part 12 : NCBI Submission Portal

Part 13 : NLM Medical Information suite

Language

Lectures and hands-on sessions will be given in English.

Target Audience

This course is designed for individuals, particularly those based in biomedical and biological institutions, who are interested in improving or updating their knowledge of NCBI tools and resources or who provide bioinformatics support to their colleagues. It will provide an overview of a wide range of molecular biology resources that research communities need and use. The target audience includes scientists from different fields, including biologists, geneticists, physicists, mathematicians, statisticians, and software developers, at various levels, including researchers, educators, graduate students, and other scientific staff who either work with biomedical and biological data or are interested in understanding how to incorporate such data into their own research. No prior experience with bioinformatics is required. Familiarity with basic computer operation and common web browsers is assumed. Knowledge of molecular biology and basic experience with NCBI resources such as BLAST are preferred. Both experienced and novice users of NCBI tools and resources will benefit, as the latest tools and methods will be taught. This workshop is not intended for computer scientists or programmers wishing to learn about programmatic access to data; rather, it is focused on biomedical researchers. The focus will be on demonstrating NCBI web and client-side applications that can be used for obtaining, managing and analyzing biological or genome-scale data. Some aspects of the presentations may be of more interest to those working with large data sets, while other aspects will benefit occasional users.

Participant profile and Selection criteria

The subject matter is suited to researchers, master’s students in bioinformatics, PhD students and final-year students in a broad range of disciplines: computer science, biology, agricultural engineering, medicine, pharmacy. Candidates will be asked to provide information on their own work and field of research. Evidence of involvement in research in computational biology or related fields needs to be provided. Good knowledge of the English language is a prerequisite.

The number of candidates for the theoretical course is limited to 50 persons. Places will be allocated on a first-come, first-served basis, provided the selection committee accepts the candidate’s application. Those accepted will be informed by October 23rd, 2015. Priority will be given to Moroccan applicants working in national universities and research centers. Applicants from Africa and the Middle East are also encouraged to apply. Applications will be accepted until October 20th, 2015.

Workshop materials

Lecture notes and an information package (hard and digital copies) will be provided. The attendees will be given a copy of the PowerPoint slides. Participants are encouraged to bring their laptops for tutorials and hands-on sessions. A certificate of attendance will be provided however the presence in

Travel fellowships

Travel fellowship funds will be provided to help students travelling from distant Moroccan cities attend the workshop. The number of these grants is limited and will depend on sponsors’ donations. Those who are interested are asked to send their request and justification by email to contact@biomedicalintelligence.org. Applicants from Africa and the Middle East are also encouraged to apply; their local costs will be covered provided they are willing to pay for their own travel. Interested candidates are invited to apply and send their CV at the time of online registration.

Partners

From USA:

National Library of Medicine (NLM, NIH, Bethesda, MD, USA)
National Center for Biotechnology Information (NCBI, NLM, NIH, Bethesda, MD, USA)

From Morocco:

National Center for Scientific and Technological Research, CNRST, Rabat
University Mohammed First, Oujda/Nador
Faculty of Sciences and Technology of Tangier
Pasteur Institute of Casablanca