
"You can have data without information, but you cannot have information without data." ~ Daniel K. Moran
"Data is not information, Information is not knowledge, Knowledge is not understanding, Understanding is not wisdom." ~ Clifford Stoll
"Institutional repositories are now clearly and broadly being
recognized as essential infrastructure for scholarship in the digital
world."
"The scientist wants to do science, not be a clerk. And besides, who
cares? Most data is never looked at again anyway." ~ Szalay, Thakar,
Stoughton, and VandenBerg
"Computer science is poised to become as fundamental to biology as
mathematics is to physics." ~ Microsoft Research Cambridge
"A general problem amongst [scientific] disciplines is the
relatively low priority attached to data management and archiving."
Science is about observation and measurement. Each object
measured creates values for a particular variable. These values
and variables constitute a data set for a particular project or study.
Data collection yields
sets of data that may then be analyzed using instruments and/or
statistical software. Some research projects generate enormous amounts
of data. For example, the remote sensing instruments in space vehicles
generate such a large amount of data that it is difficult to collect,
analyze, and preserve.
What might information professionals and scientists do to
handle enormous amounts of data?
Thus, each research project is a data generator. Ideally, that data
would be "published" and preserved for future use by the scientific
community. With preservation, replication is possible as is new
research generated from existing data. Without preservation,
replication is not possible and new research will require that the data
be regenerated at some notable cost and effort.
What does it mean to talk about "publishing data"?
Data that has been collected but has not been processed for use is raw data or source data. A distinction is sometimes made between data and information. Information is the result of data analysis and summarization. After collection and analysis, the data rests in a database where it is available for further analysis or validation.
If a data archive is the "main repository of the organization's
historical data, its corporate memory," then the scientific data
repository collects the historical data of science, and becomes the
scientific memory. Unhappily, the scientific community has
not been as
focused on data longevity as it has been on current and future research
projects.
What might be done to encourage the scientific community to be
more concerned with the data associated with past science? What role
might information professionals have in this process?
Although the term is only slowly coming into use, e-science describes
how the rapid growth of information technology has changed, and
continues to change, the way that science is conducted. It includes new
and better instrumentation, such as remote sensing devices, as well as
the declining cost of ever more powerful computing, which allows more
and better modeling of natural phenomena. Information technology also
affects scientific communication: authors in different countries now
work on the same research project, using various Internet tools to
collaborate. Instrumentation in one
country may be used by scientists in another with the data flowing
easily to the researcher's laboratory or office. Collections of
datasets are more widely available, which encourages both replication
and new research using previously gathered data. As the cost of
acquiring some scientific data rises notably [high energy physics is
an example], the importance of access to
previously gathered data also increases. This is particularly true of
physical science data, which is usually not dependent upon time and
place. The explosion
of data sets associated with e-science creates dramatic opportunities
and challenges for those in the scientific information business.
Notable data repositories typically include content developed by
scientists working in different countries. However, this may not be
true for data collections created to meet the needs of governmental
agencies. Still, data repositories in a particular field or research
area should be open to accomplished researchers regardless of where
they live and work. High energy physics is a good example.
How might e-science impact information professionals and
information provision?
Data may be classified by collection attributes. For example, data
may be:
Raw data [data may be both singular and plural] is collected but not
yet
processed. Processed data is usually considered to be information. Data
may be found in a wide variety of formats, both analog and digital,
text, audio, and video. The number of content formats is substantial.
Here are examples:
Typically, data sets are segregated by format category
although not all categories are exclusive. Multivariate data
[a number of variables and a number of observations yielding
different values] is probably the most common with most current
collections
limited to digital data. Image data may be digital [including
scanned content] or analog. It may consist of single images or a
series of continuous images. The latter are growing in popularity and
amount. Some
collections will also include relational data [data in the form of
multiple related tables] and spatio-temporal data
[positions of objects over time].
As you might imagine, there are many formats designed specifically
for digital content. These range from CDF [common data format] to DLG-3
[Digital Line Graph]. Data repositories must deal with data in
different formats and sometimes provide translation services from one
format to another, in the same way that users convert TIFF files to JPG
or PNG files. Among the challenges in managing a data repository are
handling a variety of data types and formats, translating
from one format to another, and "refreshing"
data as formats change.
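A format translation of the kind described above can be illustrated with a minimal sketch. It assumes a simple tabular dataset held as CSV text and translates it to JSON using only the Python standard library. The function name csv_to_json and the sample station readings are hypothetical; real scientific formats such as CDF would need their own specialized tools.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Translate a CSV table into a JSON array of records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [dict(row) for row in reader]
    return json.dumps(rows, indent=2)


# Hypothetical sample data, used only to exercise the function.
sample = "station,temp_c\nA,12.5\nB,14.1\n"
print(csv_to_json(sample))
```

Every value arrives as a string after this translation, which is one reason a repository would record the intended types and units in the accompanying metadata.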
How might data repositories handle all of these formats?
"Data acquisition [today] is the sampling of the real world to
generate data
that can be manipulated by the computer." A more specific definition is
"the automatic collection of data from sensors, instruments and
devices."
Typically, sensors convert a
measurement into an electrical signal which can be acquired by
appropriate hardware. This automatic data collection reduces
manual recording error and makes it much easier to acquire and process large
samples. How data are acquired is an important element in the metadata
accompanying datasets.
These steps are necessary if the repository is to continue to be
useful and relevant. Although not difficult, each of these steps
involves expense and effort. Often, the repository will contain
datasets from a variety of applications and operating systems,
including some legacy systems. Data archiving may require
considerable
preparation and packaging of data and supplementary documentation. It
is much more than simply transferring files. This is particularly true
of the metadata.
Since the quality of experimental or observational data may vary,
data should be evaluated and corrected as needed. This is one aspect of
"cleaning" data files. The quality control step may be more or
less difficult depending on the data collection, the nature of the
observation or experiment, and the knowledge and experience of the
researcher. Ideally, each file in the repository would include
corrected data.
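One simple form of the quality control step can be sketched as follows. The example assumes that a plausible range is known for the measured variable and flags each value rather than silently deleting it, so that the correction remains visible to later users. The names Reading and quality_check, the flag labels, and the range limits are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    value: Optional[float]
    flag: str  # "ok", "missing", or "out_of_range"


def quality_check(values, low, high):
    """Flag each value as ok, missing, or outside the plausible range."""
    checked = []
    for v in values:
        if v is None:
            checked.append(Reading(None, "missing"))
        elif low <= v <= high:
            checked.append(Reading(v, "ok"))
        else:
            checked.append(Reading(v, "out_of_range"))
    return checked


# Hypothetical air temperatures in degrees Celsius.
for reading in quality_check([12.5, None, 999.0], low=-90.0, high=60.0):
    print(reading)
```

Whether a flagged value is corrected, estimated, or left as is remains a judgment for the researcher who knows the observation or experiment.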
Note that databases are often not static, but change
over time. New
data may be added to the datasets. This may be called continuous
revision. The data may be analyzed again [becoming secondary data]
using different approaches and thus yield new results. Metadata may be
revised or enlarged. Scientists may add notes or annotations to
particular items in the database. Thus, we can speak of "editions"
for
a dataset or database. Research results are based upon a particular
edition of that database. Since results vary according to
edition, one edition does not necessarily replace another.
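The idea that editions accumulate rather than replace one another can be sketched in a few lines. This is only an illustration under stated assumptions: the class EditionedDataset, its methods, and the edition labels are hypothetical, and a production repository would use a database or a version control system rather than an in-memory dictionary.

```python
import copy
import hashlib
import json


class EditionedDataset:
    """Keeps every published edition of a dataset (hypothetical design)."""

    def __init__(self):
        self._editions = {}  # label -> (digest, private copy of the records)

    def publish(self, label, records):
        """Store a new edition and return a digest identifying its content."""
        if label in self._editions:
            raise ValueError(f"edition {label!r} already exists")
        payload = json.dumps(records, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        self._editions[label] = (digest, copy.deepcopy(records))
        return digest

    def get(self, label):
        """Return a copy of the records exactly as published in that edition."""
        return copy.deepcopy(self._editions[label][1])

    def labels(self):
        """Return the edition labels in the order they were published."""
        return list(self._editions)


archive = EditionedDataset()
first = [{"site": "A", "ph": 7.1}]
archive.publish("edition-1", first)
archive.publish("edition-2", first + [{"site": "B", "ph": 6.8}])
print(archive.labels())
print(len(archive.get("edition-1")))
```

Because earlier editions are never overwritten, a published result can cite the edition it was based on, and that edition can still be retrieved after the dataset has grown.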
A continuing problem is that many government agencies do not
consider their scientific data or the scientific data resulting from
their funded research to be official governmental records. A national
policy is needed that considers scientific research data to be a
national resource.
Obviously, there are important security issues with any data
repository to ensure that the datasets are true copies, authentic,
complete, and tamper-proof. There may also be ownership or intellectual
property issues that need to be settled.
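One common way to check that a stored dataset is still a true copy is to record a cryptographic checksum at deposit and compare it later. The sketch below assumes SHA-256 from the Python standard library; the function names and the sample bytes are hypothetical. A checksum detects alteration but does not by itself prove who produced the data, which would require digital signatures or similar measures.

```python
import hashlib
import hmac


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def is_true_copy(data: bytes, recorded_digest: str) -> bool:
    """Check a copy against the digest recorded when it was deposited."""
    return hmac.compare_digest(checksum(data), recorded_digest)


# Hypothetical deposited content.
deposited = b"site,ph\nA,7.1\nB,6.8\n"
recorded = checksum(deposited)  # stored with the dataset's metadata

print(is_true_copy(deposited, recorded))               # unchanged copy
print(is_true_copy(deposited + b"C,9.9\n", recorded))  # altered copy
```

The same comparison can be run periodically at each storage site, so that a damaged copy at one site can be replaced from an intact copy at another.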
The minimum standard for successful preservation would be for the
database and its complementary documentation to be stored at two widely
separated sites by two independent organizations [ideally in two different
countries]. How likely is this to happen?
Repositories add value to datasets by adding appropriate metadata. Such metadata might include:
Metadata should allow scientists to find individual datasets within
an experiment, all datasets associated with an experiment, and links to
derived eprints and published literature. Increasingly, large databases
are constructed so that subsets, rather than the whole dataset, may be
retrieved and delivered to researchers.
Again, this shows the importance of the metadata that allows this
slicing and dicing, especially of a large dataset. Knowledge [or
information] extraction tools will increasingly be
used to generate new results from repository content.
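The two uses of metadata described above, finding all datasets for an experiment and delivering only a subset of a dataset, can be sketched as follows. The record fields, identifiers, and sample rows are hypothetical and stand in for whatever metadata standard a repository actually adopts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    dataset_id: str
    experiment: str
    variables: tuple          # names of the measured variables
    publications: tuple = ()  # identifiers of derived eprints or articles


def datasets_for_experiment(catalog, experiment):
    """Return every catalog record that belongs to one experiment."""
    return [rec for rec in catalog if rec.experiment == experiment]


def subset(rows, variables, keep=lambda row: True):
    """Return only the requested variables from the rows that pass `keep`."""
    return [{name: row[name] for name in variables} for row in rows if keep(row)]


# Hypothetical catalog entries and data rows.
catalog = [
    DatasetRecord("ds-001", "exp-A", ("depth_m", "temp_c"), ("eprint-17",)),
    DatasetRecord("ds-002", "exp-A", ("depth_m", "salinity")),
    DatasetRecord("ds-003", "exp-B", ("temp_c",)),
]
rows = [
    {"depth_m": 10, "temp_c": 14.2},
    {"depth_m": 200, "temp_c": 6.5},
]

print([rec.dataset_id for rec in datasets_for_experiment(catalog, "exp-A")])
print(subset(rows, ["temp_c"], keep=lambda row: row["depth_m"] > 100))
```

The subset function can only work because the metadata names the variables in each dataset; without that description, the repository could deliver the whole file or nothing.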
What role might information professionals have in metadata
creation and standards?
With more powerful computing and more sophisticated sensors attached to satellites, aircraft, telescopes, and a wide variety of scientific instrumentation, increasingly large datasets are generated. "Scientific data is getting not only larger and larger -- multi-petabyte archives are routinely discussed -- but its complexity is increasing, meaning that the extraction of meaningful knowledge requires more and more computing resources." "Just as simulation was added to the two traditional scientific cultures -- theory and experiment -- so collection-based research will be added. The experimentalists take data, catalogue it, calibrate it, and make it available to others on the Internet; the collection-based scientist will reduce, mine, and sift that data to make or break a hypothesis from the theorist. In this new paradigm, the raw or nearly-raw data is published in a way that would be impossible with only paper journals: not just tables and graphs, but rather ramified palaces of logic. These would be subsequently reduced and interpreted by other researchers: thus the person who takes the data may be different from the person who reduces it to small, palatable representations that can be printed on paper." [European-United States joint workshop on Large Scientific Databases].
Computer scientists are developing software
applications that optimize the storage, retrieval and processing of
these datasets. Still, this is a challenging and expensive business.
What challenges might large scientific databases pose for
information professionals?
Some costs for data repositories are dropping, especially the costs
associated with storage. However, the costs of documentation and
curation [preservation in particular requires "refreshing" content] may
be substantial. While some believe that all scientific data ought to be
preserved, others distinguish between ephemeral data and
stable data. Ephemeral data is time specific and is lost
forever if not captured. Today's weather is a good example. Stable
data is not time or location specific, so it can be recreated without
too much effort. Computer simulations are a good example.
Costs may be reduced as workable, tested standards are adopted and
as software, especially open source software, becomes widely available
to simplify documentation and curation activities. Data documentation
can be costly in time and effort. Costs vary according to the nature of
the data, who will
publish the data and who will be responsible for it.
Some government agencies have taken the lead.
Research funding agencies are increasingly concerned with data
management and archiving. NIH requires a data management plan for all
grants over $500,000. NSF is considering a similar requirement.
Some multi-national organizations have also created portals to
provide access to scientific data in a particular discipline or
disciplinary area. For example, OceanPortal "is a high
level directory of ocean data" hosted by a UNESCO commission.
In general, however, there is a lack of high-level
strategic thinking and planning by most governments and their leading
scientific agencies. Research funding remains based on short-term
horizons, often tied to the three-year grant or a similar cycle. Too little
attention is given to the construction and maintenance of a scientific
infrastructure that includes data repositories.
While "repository" is the term most often used for accessible, organized data collections, the alternatives above are also encountered. A repository is a place where "deposited content collections," usually digital, may be found. Some object to "archive" because an archive may be seen as a "data cemetery" rather than a place where data is used and reused. A repository may be a data repository, an archival repository, a content repository, or an institutional repository. Such depositories may be a service of a library, a computing center, or even a stand-alone agency.
The data management center is typically focused on the reduction of
data management overhead associated with scientific workload and data,
including the support of data mining. One concern here is the ability
of the place holding data collections to provide the software/hardware
and the support needed for re-analysis of the data and then to capture
that re-analysis and add it to the repository.
Many research-extensive universities now have library-led
institutional repositories. They often include student work such as
theses and dissertations as well as instructional materials. Some also
include faculty research products. Relatively
few contain meaningful collections of datasets. Most faculty are
reluctant to participate in institutional repositories. There are about
260 doctoral research universities in the U.S.
Nearly all have a scientific research program so
there is considerable potential for institutional repositories that
include scientific datasets.
Institutional norms and practices vary notably in regard to encouraging or even requiring researchers to deposit datasets in a depository when the research project is complete. In most cases, participation is voluntary and often quite low. Researchers may be uncertain about the consistency and quality of access, preservation, and sharing for content that is the result of considerable time and effort. Also, while libraries are the likely leaders in this area, many libraries are not equipped for, or interested in, archival collections of this type. Computer centers are typically stretched thin and lack the knowledge needed to provide intellectual access.
There are several good examples of the university data repository. The University of California at Irvine has its UCI Knowledge Discovery in Databases Archive. The Archive is devoted to large data sets. It is a permanent repository for publicly accessible datasets. Intellectual access is provided by:
Some data repositories are hosted by scholarly societies. The Geological Society of America hosts the GSA Data Repository, established in 1974. It is limited to data and supporting content created by authors whose work was published in a GSA periodical.
As a general rule, data should be stored at its most elemental level because that provides the most useful basis for further analysis and validation.
Intellectual access is simplified with fewer and more inclusive collections. Academic datasets may be found on the researcher's website, on a department website, on a college website, or on a university website. In some countries, including the U.S. and Canada, there are national repositories for disciplinary data. For example, Natural Resources Canada hosts the Geoscience Data Repository. This repository includes digital maps and map images as well as data from geophysical surveys, geochemistry, marine expeditions, and other geoscience data. If properly funded, there are many advantages to a national repository.
Discipline-specific depositories may be found at the departmental level where intellectual access and preservation issues may not be well understood. At the same time, many university libraries lack the interest and the knowledge to create and maintain data repositories. Economies of scale suggest that an institutional repository is a more cost-effective solution to providing affordable access and preservation than departmental or personal websites. Better are repositories hosted by national and international scholarly organizations.
Some repositories allow for data manipulation with the results
exported to the user. Others do not provide data manipulation, but do
allow datasets to be exported for manipulation by the user at her
location.
What do you see as the major challenges facing the library
responsible for a digital data repository?
While curation is defined differently in different sources, the
ordinary meaning includes:
