2. Collecting and using data

Collecting and using research data involves an interplay between research methods, research questions and data management practices. In this chapter, we examine different types of data, their quality and origins, as well as research ethics and legislation from the perspective of data management.

Particular attention is given to personal and sensitive data, as their processing requires special care. Even if the research does not explicitly involve human participants or data collected directly from individuals, it is still important to consider whether personal data may be involved. For instance, data related to inanimate objects, substances, animals, celestial bodies, or weather phenomena may still include information about the individuals observing them or may otherwise be linked to identifiable people.

Understanding what constitutes personal data and how it should be processed is thus essential across all fields. For this reason, these topics are emphasized as fundamental knowledge in this learning material.

Types of data and data quality

The types of data collected for a study are primarily determined by the research questions. Data may consist of, for example, images, interviews, statistical datasets, audio recordings, or numerical measurements. In addition, the nature of the data influences how it can be generated and processed.

For instance, interview data typically involve personal data, which means that particular care must be taken when selecting devices and software for data collection and handling. Conversely, data produced by measurement instruments may be large in volume, placing specific demands on storage capacity and data-sharing solutions. It is therefore important to consider in advance which file formats and software are most appropriate for different types of research data and research settings. These choices are also shaped by discipline-specific practices.

In the context of data management, data quality primarily refers to ensuring that data are not accidentally modified, damaged, or lost. Researchers must identify situations and processing steps where the content of the data could be unintentionally altered. Such risks may arise, for example, during file format conversions or when transcribing interviews. Ensuring data quality is therefore closely linked to effective risk management, supported by systematic backup and versioning.
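One way to notice accidental changes is to record a checksum for every file before a risky step (such as a backup, transfer, or format conversion) and compare the checksums afterwards. The following is a minimal sketch in Python using only the standard library. The function names and the idea of a "manifest" are illustrative assumptions, not a tool prescribed by this material.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to folder) to its checksum."""
    return {
        p.relative_to(folder).as_posix(): sha256_of_file(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List files that were modified, added, or removed between two manifests."""
    names = set(old) | set(new)
    return sorted(name for name in names if old.get(name) != new.get(name))
```

A manifest built before the step can be stored alongside the data and compared with a manifest built afterwards. An empty result from changed_files suggests that file contents were not altered. Note that a checksum only detects changes; recovering the original still depends on backups and versioning.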

Assessing the quality of data collection is inherently connected to research methods, but it also relates to data management practices. Measures that support quality during data collection include calibrating instruments, taking repeated measurements, and using standardised methods and protocols. In addition, thorough and accurate documentation of the data collection process is a key indicator of high-quality data.

Data origin: new, pre-existing, produced

Research data can be classified based on how they are obtained. Data may be collected specifically for a given study (e.g. samples, surveys, measurements, or imaging), drawn from pre-existing sources and reused (e.g. text corpora, image repositories, biobanks, or registers), or produced during the research process (e.g. data generated through analysis or software development).

The origin of the data is important because it determines what can be done with the data and who has the authority to make those decisions. When researchers collect data themselves, they can define the terms of use within the limits set by legislation and contractual agreements. When using pre-existing data, however, the researcher must carefully consider the applicable terms of use. For open-access data, these are typically defined through licences, such as Creative Commons (CC) licences. This topic is discussed in more detail in Chapter 5.

Existing data can also be reused when collecting research materials. Data reuse refers to using already available data for new purposes, such as research, teaching, studying, or commercial applications. Such data can be found in data repositories and search services (see, for example, the sections Research data services and Reuse of open research data on the UEF Open science website).

Reusing data can save both time and resources, as it reduces the need to collect data from scratch. This option is worth considering whenever suitable data are available.

When reusing pre-existing data, it is essential to cite them in the same way as books or scientific articles. A proper data citation should include the following elements:

creator
title
host institution (repository or archive)
publication date or time
persistent identifier.

Additional useful elements may include:

version
resource type
license
ORCID (persistent identifier for researchers)
possible embargo information.

Data repositories and archives often provide specific guidelines for citing datasets. In addition, publishers may have their own requirements for referencing data in academic publications.
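As an illustration of how the citation elements listed above fit together, the following Python sketch checks that the required elements are present and assembles them into one reference string. The class name, the field names, and the output layout are assumptions made for this example. The values shown are placeholders rather than a real dataset, and the repository's or publisher's own citation format always takes precedence.

```python
from dataclasses import dataclass


@dataclass
class DataCitation:
    creator: str
    title: str
    host: str          # host institution (repository or archive)
    published: str     # publication date or year
    identifier: str    # persistent identifier, e.g. a DOI or URN
    version: str = ""  # optional additional element

    def missing(self) -> list[str]:
        """Return the names of required elements that are empty."""
        required = ["creator", "title", "host", "published", "identifier"]
        return [name for name in required if not getattr(self, name).strip()]

    def format(self) -> str:
        """Assemble the elements into one reference string (illustrative layout)."""
        if self.missing():
            raise ValueError("Missing elements: " + ", ".join(self.missing()))
        version = f" Version {self.version}." if self.version else ""
        return (
            f"{self.creator} ({self.published}). {self.title}.{version} "
            f"{self.host}. {self.identifier}"
        )


# Placeholder values only; this is not a real dataset.
example = DataCitation(
    creator="Example, A.",
    title="Example survey dataset",
    host="Example Data Archive",
    published="2024",
    identifier="https://doi.org/10.1234/example",
    version="1.0",
)
print(example.format())
```

Keeping the elements as separate fields, rather than as one free-text string, makes it easier to reformat the same citation for different publishers.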

Example of listing different types of data with additional information.

Research ethics and data management

Research ethics is relevant to all scientific research, although it is often associated only with research involving personal data. If research material contains personal data or information on, for example, endangered animals, national defence, or trade secrets (i.e., sensitive or confidential information), ethical and legal considerations must be taken into account from the very beginning of the research process and in the design of the study. When using social media data, ethical and legal aspects also require broad and careful consideration. In addition, AI and research security must be taken into account in data management.

It is important that researchers themselves identify any ethical, contractual, and legal issues, as well as any restrictions, related to their research data and address them throughout the data management process. The primary responsibility for implementing good scientific practice in one’s own work lies with each researcher, but researchers are not alone in this responsibility. It is shared by the entire scientific community, including research teams and their principal investigators (PI), as well as research organisations and their management.

In the context of research ethics, research integrity is a key concept. It is based on the principles of reliability, honesty, respect, and accountability. Research integrity guides actions throughout the different stages of the research process. In Finland, the Finnish National Board on Research Integrity (TENK) provides guidance on research ethics in its guidelines for research integrity, which state the following about good research practices in data management. All partners:

agree in advance about the ownership of the research data and about the rights to its use, its processing, storage and possible reuse
if necessary, revisit the agreements later during the course of the research, for potential amendments
comply with current data protection legislation and obligations related to non-disclosure, confidentiality and secrecy
promote openness and further use of the data to the extent possible.

To consider

Are you processing personal data in your research?
What kind of personal data do you process?
What happens to the personal data after the research project?
Do you need to anonymize or pseudonymize the data? Make sure to understand the difference between the two.

Personal data

Personal data may be used in scientific research when such use is appropriate, planned, and justified, and when there is a legal basis for processing the personal data. When collecting data on individuals for research purposes, all ethical and legal requirements applicable to the research in question must be observed. These requirements must be taken into account throughout the entire lifecycle of the research, from design and implementation to the post-research storage of data. As a general rule, the applicability of the General Data Protection Regulation (GDPR) should always be considered unless it is absolutely certain that the research data contain no information relating to an identifiable person.

Personal data is any information relating to an identified or identifiable natural person:

direct identifiers that alone are enough to identify a person (e.g. full name, social security number, email address containing the person's name, biometric identifiers such as facial image, voice patterns, fingerprints, iris, handwritten signature)
strong indirect identifiers with which a person can be identified with reasonable effort (e.g. postal address, phone number, IP address, student number, insurance number, bank account number, exact annual income, vehicle registration number, unusual job title, rare disease)
indirect identifiers, i.e. information that on its own is not enough to identify someone but that, when linked with other available information, could be used to deduce the identity of the person (e.g. sex, age, principal abode, profession, workplace, education, school, dates such as date of birth, death, accident).
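When planning a dataset, it can help to go through the variable list and mark which of the three categories above each variable falls into. The sketch below does this with simple lookup sets in Python. The variable names are hypothetical examples based on the lists above, and an automatic lookup like this is only a starting point: anything it cannot place still needs human judgement.

```python
# Hypothetical variable names, grouped according to the categories above.
DIRECT = {"full_name", "social_security_number", "email", "voice_recording"}
STRONG_INDIRECT = {"postal_address", "phone_number", "ip_address", "student_number"}
INDIRECT = {"sex", "age", "profession", "workplace", "date_of_birth"}


def classify_variables(variables):
    """Group variable names by identifier category; unknown names need review."""
    groups = {"direct": [], "strong indirect": [], "indirect": [], "needs review": []}
    for name in variables:
        key = name.strip().lower()
        if key in DIRECT:
            groups["direct"].append(name)
        elif key in STRONG_INDIRECT:
            groups["strong indirect"].append(name)
        elif key in INDIRECT:
            groups["indirect"].append(name)
        else:
            # Not in any list: a person has to decide whether it can identify someone.
            groups["needs review"].append(name)
    return groups


result = classify_variables(["full_name", "age", "ip_address", "favourite_colour"])
for category, names in result.items():
    print(category, names)
```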

When processing personal data, it is important to identify any special categories of personal data, often referred to in everyday language as sensitive personal data. These include, for example, religious beliefs, health, and trade union membership. It is also important to understand the difference between pseudonymous and anonymous research data, which are discussed in the next section (Processing personal data).

Watch the video

Research in the human sciences may require an ethical review. This short video clip provides guidelines for ethical review in Finland. The video is produced by TENK, the Finnish National Board on Research Integrity (2:30).

Processing personal data

When handling personal data, appropriate data security measures must be in place. This helps safeguard research participants’ right to the protection of their personal data. One important general principle is data minimisation: only personal data that is necessary for the research should be collected.

In accordance with research ethics principles, the research participants should be informed about what data will be collected, how it will be processed during the research, and what will happen to the data after the research has ended. It is therefore essential to plan the data management both during and after the research right from the very beginning. Research involving human participants may also require an ethical review.

Research organisations provide instructions on practices for handling personal data in accordance with the GDPR. UEF Intranet provides guidelines and templates for the processing of personal data in scientific research (requires UEF credentials). The templates can help in preparing the necessary documents, which may include:

privacy statement
participant information
ethical consent to participate
consent to process personal data
data protection risk assessment or a data protection impact assessment (check your own organisation’s instructions, processes, and templates)

In data management, special attention must be paid to the secure storage, processing, and transfer of data (see Chapter 4), so that access is limited to those who have a legitimate need to use the data. It is also important to plan carefully how research participants will be informed about data processing and data management, for example, how data security will be ensured, how long the data will be retained, and whether, where and how the data will be shared for reuse.

Pseudonymized or anonymized?

For research purposes, data may be processed either with direct identifiers or in pseudonymized form. Pseudonymization means removing identifiers or replacing them with pseudonyms or codes, which are stored separately. Pseudonymized data is still personal data.
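The replacement of identifiers with codes can be sketched as follows. This is a minimal Python illustration, assuming records are held as dictionaries; the function name, the field names, and the code format ("P" followed by random characters) are hypothetical choices for the example. The function returns two things: the pseudonymized records and the key table that links codes back to people.

```python
import secrets


def pseudonymize(records, id_field, drop_fields=()):
    """Replace id_field with a random code and drop other listed identifiers.

    Returns (pseudonymized_records, key_table), where key_table maps the
    original identifier to its code and must be stored separately.
    """
    key_table = {}
    used_codes = set()
    output = []
    for record in records:
        original = record[id_field]
        if original not in key_table:
            code = "P" + secrets.token_hex(4)
            while code in used_codes:  # regenerate in the unlikely case of a clash
                code = "P" + secrets.token_hex(4)
            used_codes.add(code)
            key_table[original] = code
        cleaned = {
            field: value
            for field, value in record.items()
            if field != id_field and field not in drop_fields
        }
        cleaned["participant_code"] = key_table[original]
        output.append(cleaned)
    return output, key_table


# Hypothetical interview metadata.
records = [
    {"name": "Participant A", "email": "a@example.org", "answer": 1},
    {"name": "Participant B", "email": "b@example.org", "answer": 2},
    {"name": "Participant A", "email": "a@example.org", "answer": 3},
]
pseudonymized, key_table = pseudonymize(records, "name", drop_fields=("email",))
```

Random codes are used here instead of codes derived from the name, so that the code itself reveals nothing about the person. Because the key table allows re-identification, the output is still personal data and the key table needs at least the same protection as the original data.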

Anonymization means permanently removing all identifiers. Various techniques and tools can be used for anonymization. As a general rule, anonymization is necessary if the data is to be shared openly or stored for reuse. It should be remembered that anonymization as a process involves the processing of personal data, but anonymized research data no longer contain personal data. At that stage, the data can be treated like any other data that do not contain personal information. However, it is important to note that producing fully anonymized data may not always be possible.
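Removing direct identifiers is often not enough, because combinations of indirect identifiers can still single out a person. One commonly used check, offered here as an assumption rather than as a method named in this material, is to count how many records share each combination of indirect identifiers (the idea behind k-anonymity). The Python sketch below lists combinations shared by fewer records than a chosen threshold. The field names and the threshold of 3 are hypothetical.

```python
from collections import Counter


def rare_combinations(records, indirect_fields, min_group_size=3):
    """Return combinations of indirect identifiers shared by too few records."""
    counts = Counter(
        tuple(record[field] for field in indirect_fields) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < min_group_size}


# Hypothetical records in which direct identifiers have already been removed.
records = [
    {"age_group": "30-39", "profession": "nurse"},
    {"age_group": "30-39", "profession": "nurse"},
    {"age_group": "30-39", "profession": "nurse"},
    {"age_group": "60-69", "profession": "judge"},
]
risky = rare_combinations(records, ["age_group", "profession"])
print(risky)
```

Combinations that appear in the result may need to be generalised (for example, wider age groups) or removed before the data can be considered anonymous. Passing such a check does not by itself guarantee anonymity, which is consistent with the point above that full anonymization may not always be possible.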

For further information on personal data and anonymization, see the guidelines provided by the Finnish Social Science Data Archive (FSD).

^(2026-06)
