Defining Cancer, Gene by Gene
Reported by Bob Kuska
April 30, 2001
When Robert Strausberg, Ph.D., became director of the NCI's Cancer Genome Anatomy Project (CGAP) in 1997, he admittedly faced a huge challenge. He had been asked to lead a brand-new program, whose initial project was to create the first index of genes expressed in human cancers - a feat that many said was more ambitious than feasible.
Yet, four years later, the mission has been accomplished. Strausberg said he and his collaborators are close to wrapping up CGAP's tumor gene indexes, having identified over a million gene transcripts in over 40 tissues. Meanwhile, Strausberg said related projects, such as the Mammalian Gene Collection and the Genetic Annotation Initiative, have emerged as important tools to explore the molecular causes of cancer.
Strausberg said CGAP's success means scientists can now click on the CGAP web site and, within seconds, access free of charge a vast database of genes, chromosomal changes, and other biological information relevant to the study of cancer. "When you consider that just a decade ago, entire laboratories spent 10 years searching for a single gene that might be involved in cancer, you can see just how far the field has come in pursuing the molecular underpinnings of cancer," he said.
In a recent interview with Behind The News, Strausberg offered his perspective on the success of CGAP, the challenges it faces, and the future of molecular-based cancer care.
Q: Over the past four years, CGAP has identified over a million transcripts. When will the indexes be complete?
A: I think that the human gene indexes, while not complete, are in a very mature state right now. What we are doing now is filling in gaps in the database, since some tumor types have greater coverage than others. We are carefully evaluating the gaps that still remain and how to reach closure using various technological approaches.
Q: In addition to gene transcripts, does CGAP have any plans down the road to explore proteomics?
A: The full CGAP vision that was put forward several years ago was not just one of finding transcripts, but of uncovering all of the molecular information in a cancer cell and its component parts, including proteins. So, the vision is to have molecular databases where you have information about all of the changes during cancer development. From that complete catalogue, one could find the most informative features for various aspects of cancer research.
Q: How difficult has data management been for the CGAP database, and what have been some of the lessons learned in creating such a vast biological database?
A: To my mind, it's not the computing power that is limiting. It is really our ability to carefully capture the biology of cancer and to link different types of information so that there is a seamless interface. One issue, at a very basic level, is having common terminology for genes and proteins, such that one can link different kinds of databases. With that in place, there is the opportunity to link data sets in a manner that was not possible just a few years ago. For example, the emphasis on gene expression technology as a basic feature of cancer research means that we have the opportunity to link the basic information from CGAP with information about intervention strategies coming from the NCI Developmental Therapeutics Program, the Director's Challenge, and the Early Detection Research Network, all gaining various perspectives of molecular changes associated with cancer development and progression. Moreover, the ability to link human gene data with that from model organisms provides an opportunity to experimentally study functions of genes related to cancer development. What's needed is terminology that will provide a foundation for linking all of these databases. So, at a very basic level, human genes are named differently than mouse genes. New nomenclature, based on specific DNA sequence information, will provide the necessary foundation for these efforts.
Q: So, the major issues are annotation and volume of information?
A: Yes. Everybody now is confronted with an enormous volume of data. The key is to build effective tools for mining the data sets such that the CGAP investment is used most effectively. Toward that end, CGAP has built, and will continue to build, a variety of bioinformatics tools that allow data mining from various perspectives. It's really a matter of building a panel of tools that allow one to move seamlessly...to ask the question that you would like to ask scientifically and then be taken through a series of databases without necessarily having a priori knowledge of all the data sets that might provide key information. For example, if you find a transcript that appears to be uniquely expressed in the prostate, you'd like to know right away: Do we have information about the corresponding protein? What is the function of this protein? What else do we know about that gene from the biomedical literature? And most importantly, do we have information that suggests this might be a good target for intervention?