ISWC2005 Notes: Day 2
First up is Professor Carole Goble from the University of Manchester, speaking about e-science and the semantic web. What is e-science? It uses all the standard science tools (microscopes etc.), but they are connected and can be operated remotely. Global collaboration is the key.
Science has moved beyond hypothesis and experiment to a new era of collection-based experiments: gathering huge sets of data and analysing them afterwards. There are around 600 large databases used by the life sciences community - all different. We are gathering more data in 5 years than we have ever had in all of history. Huge amounts of text to mine. It's a descriptive science, much more so than physics.
The Web has revolutionised science. The problem is that the information was originally designed for humans only and is not interlinked. The cheapest way to link it up is to get a load of PhD students and sit them in a room together - but they get bored. Another way is to scrape it with Perl scripts - but since each system is autonomous and changes independently, the scripts often break. The science grid has adopted a workflow approach predicated on the data being available via web services.
The WWW's nursery was physics. TBL (Tim Berners-Lee): "The need for the web arose from geographical dispersion of knowledge and rapid turnover of students and researchers."
Semantic Web nursery - the problem is not in representing and sharing knowledge but in why. Life sciences has the need right now, since there's so much disorganised information.
Reasons why semantic web for life sciences: problem matches up; community matches up; culture of controlled sharing, curating and connecting; content.
Gene Ontology - a controlled vocabulary for life sciences. Far more than 17,000 concepts. Changes multiple times a day. Could form the basis of sharing information. Doesn't matter that it's scruffy or even wrong. What matters is that it is being used widely.
BioPAX is a new ontology under development to describe biological pathways. Over 180 databases contain this kind of information.
5 examples of semweb for life sciences:
- FOAF for scientists.
- FOAF for proteins [!]
- "I don't have to nail down my schema if I use RDF", "I can always extend it", "I don't have to close my research". You don't have to fix everything a priori.
- RDF 4 Proteomics
- UniProt - major protein sequence database - 90% of it is descriptions of the sequences. Very volatile: structure and schema change all the time. Using RDF would ease the problems with changing schemas.
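The "I can always extend it" point can be sketched in a few lines. This is a toy in-memory triple store in plain Python, not a real RDF library, and every identifier in it (`ex:P12345`, `ex:organism`, `ex:interactsWith` and so on) is made up for illustration. It only aims to show that a new kind of statement can be added without touching a fixed schema.

```python
# Illustrative only: a tiny in-memory triple store, not a real RDF library.
# All identifiers (ex:P12345, ex:organism, ...) are hypothetical.

triples = set()


def add(subject, predicate, obj):
    """Record one (subject, predicate, object) statement."""
    triples.add((subject, predicate, obj))


def describe(subject):
    """Return every (predicate, object) pair known about a subject."""
    return {(p, o) for (s, p, o) in triples if s == subject}


# Initial description of a hypothetical protein entry.
add("ex:P12345", "ex:name", "Example kinase")
add("ex:P12345", "ex:organism", "ex:Human")

# Later a new kind of annotation turns up. No table is altered and no
# existing statement is rewritten: we just add a triple with a new predicate.
add("ex:P12345", "ex:interactsWith", "ex:P67890")

for predicate, obj in sorted(describe("ex:P12345")):
    print(predicate, obj)
```

In a relational layout the third statement would typically mean a new column or table; here it is just one more row of the same shape, which is roughly the appeal for a database whose schema keeps changing.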
A global naming scheme was voted the biggest boon in scientific data management - basically a URI with a resolution scheme.
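A rough sketch of what "a URI with a resolution scheme" could look like in practice. The identifier layout (`urn:example:<authority>:<object_id>`), the registry and the endpoints below are all assumptions invented for this example, not the scheme described in the talk.

```python
# Illustrative only: the identifier layout and the registry are assumptions
# made for this sketch, not the naming scheme from the talk.

RESOLVERS = {
    # authority -> URL template (hypothetical endpoints)
    "proteins.example.org": "https://proteins.example.org/entry/{object_id}",
    "pathways.example.org": "https://pathways.example.org/id/{object_id}",
}


def parse(identifier):
    """Split 'urn:example:<authority>:<object_id>' into (authority, object_id)."""
    parts = identifier.split(":")
    if len(parts) != 4 or parts[0] != "urn" or parts[1] != "example":
        raise ValueError(f"not a recognised identifier: {identifier!r}")
    return parts[2], parts[3]


def resolve(identifier):
    """Map a global name to the URL where its data can be fetched."""
    authority, object_id = parse(identifier)
    if authority not in RESOLVERS:
        raise LookupError(f"no resolver registered for {authority!r}")
    return RESOLVERS[authority].format(object_id=object_id)


print(resolve("urn:example:proteins.example.org:P12345"))
# prints https://proteins.example.org/entry/P12345
```

The name itself stays stable and location-independent; only the registry needs updating if a database moves.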
5 Reasons why semweb irritates life science practitioners:
- Purpose and expectation mismatch. Means to an end, not the end.
- Legacy - services, content and practice. They've already got ontologies coming out of their ears.
- Practicality mismatches
- Technical and feature mismatches (e.g. qualified cardinality constraints such as "a hand usually has four fingers plus a thumb")
- Being used, undervalued and hyped. It's a partnership - use it.
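To make the qualified cardinality example from the list above concrete, here is a small sketch in plain Python (not an ontology language). A plain cardinality constraint can only say "a hand has five parts"; a qualified one says how many parts of each type. The names `HAND_RULE` and `satisfies` are hypothetical, and the check is strict: it does not capture the "usually" in the example, which is a further difficulty.

```python
from collections import Counter

# Hypothetical rule: exactly four parts of type Finger and one of type Thumb.
HAND_RULE = {"Finger": 4, "Thumb": 1}


def satisfies(parts, required):
    """True if parts holds exactly the required number of each listed type."""
    counts = Counter(part_type for _, part_type in parts)
    return all(counts[t] == n for t, n in required.items())


typical_hand = [
    ("index", "Finger"), ("middle", "Finger"),
    ("ring", "Finger"), ("little", "Finger"),
    ("thumb", "Thumb"),
]

# Five parts, but all of them fingers.
odd_hand = [(f"digit{i}", "Finger") for i in range(5)]

# Unqualified check (just count the parts) vs qualified check (count by type).
print(len(typical_hand) == 5, satisfies(typical_hand, HAND_RULE))  # True True
print(len(odd_hand) == 5, satisfies(odd_hand, HAND_RULE))          # True False
```

Both hands pass the unqualified "five parts" test; only the qualified rule tells them apart.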