2. Collecting and using data
The research questions largely determine what kind of data is collected for the research. The data can be, for example, images, interview recordings, statistical data or numerical measurements. The data types, in turn, influence how the data can be created, processed and shared. Interviews, for example, require the use of a collection tool and usually involve processing personal data. Data produced by measuring devices may require a lot of storage space and special solutions for data sharing.
The quality of the data collection methods strongly influences data quality. Detailed documentation of these methods provides evidence of that quality. Quality control measures during data collection may include, e.g., calibration of instruments, taking multiple measurements and using standardized methods and protocols.
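As an illustration of one quality control measure mentioned above, taking multiple measurements, the following minimal Python sketch summarizes repeated measurements and flags a series whose spread is unexpectedly large. The function name, the example values and the `max_spread` threshold are assumptions made for this example only; a suitable threshold depends on the instrument and the discipline.

```python
from statistics import mean, stdev


def summarize_measurements(values, max_spread):
    """Summarize repeated measurements of the same quantity.

    max_spread is a hypothetical tolerance for the sample standard
    deviation; choose it based on your instrument and protocol.
    """
    if len(values) < 2:
        raise ValueError("at least two repeated measurements are needed")
    spread = stdev(values)  # sample standard deviation
    return {
        "mean": mean(values),
        "stdev": spread,
        "needs_review": spread > max_spread,
    }


# Made-up example values
stable = summarize_measurements([10.1, 10.2, 10.0], max_spread=0.5)
noisy = summarize_measurements([10.1, 14.8, 7.3], max_spread=0.5)
print(stable["needs_review"], noisy["needs_review"])
```

Storing such summaries together with the raw measurements is one way to document the quality of the collection method.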
The format and software in which research data are created usually depend on how researchers plan to analyze the data, the hardware used and the availability of software, or they are determined by discipline-specific standards and customs. Therefore, consider in advance which file formats and software are suitable for your research data.
When collecting the material, existing data can also be reused. Data reuse means the use of existing data for new purposes such as research, teaching, studying, or commercial purposes. Existing data can be searched for in data repositories and portals (see, e.g., the sections Research data services and Reuse of open research data in the UEF Open science website).
Reusing data helps to save time and money because you don’t have to do everything from the beginning. This option is worth considering if useful data is available.
When using data produced by others, the terms of use for the data must be confirmed. Terms of use are usually defined by a license (e.g., CC license). Licenses are discussed in chapter 5 of the learning material.
If you reuse existing data, it must be cited just like books and scientific articles. A data reference should consist of the following elements:
- creator
- title
- host organization
- publication time and/or date
- persistent identifier.
Useful additional elements are:
- version
- resource type
- license
- ORCID (persistent identifier for researchers)
- possible embargo information
- repository.
Data repositories and archives usually have guidelines for data citation. Also, publishers can have their own guidelines on how to refer to data in journals.
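The citation elements listed above can be collected into a simple structure and turned into a reference string. The following is a minimal Python sketch; the class name, the field names, the ordering of the elements and all example values are assumptions for illustration, and the citation guidelines of the repository or publisher take precedence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    # Core elements of a data reference
    creator: str
    title: str
    host_organization: str
    publication_date: str
    persistent_identifier: str
    # Useful additional elements (optional)
    version: Optional[str] = None
    resource_type: Optional[str] = None

    def to_text(self) -> str:
        """Build a reference string in a generic, hypothetical style."""
        title = self.title
        if self.version:
            title += f" (version {self.version})"
        if self.resource_type:
            title += f" [{self.resource_type}]"
        return (
            f"{self.creator} ({self.publication_date}). {title}. "
            f"{self.host_organization}. {self.persistent_identifier}"
        )


# Made-up example values, not a real dataset
citation = DataCitation(
    creator="Example, Researcher",
    title="Example survey data",
    host_organization="Example Data Archive",
    publication_date="2023",
    persistent_identifier="https://doi.org/10.0000/example",
    version="1.0",
    resource_type="dataset",
)
print(citation.to_text())
```

Keeping the elements as separate fields makes it easier to reformat the same reference for different citation styles.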
There are several ethical and juridical aspects that need to be considered when collecting and processing data. Research data can include
- personal data
- personal data that belongs to the special categories of personal data (e.g., religion, health)
- sensitive species data (e.g., related to endangered animals and plants) or
- otherwise confidential data (e.g., data related to patents, national defense or trade secrets).
It is important that the researcher identifies the juridical and ethical aspects and restrictions related to the research data and acknowledges them throughout the data management process.
Personal data can be used for scientific research when the use is appropriate, planned and justified, and when there is a legal basis for processing the personal data. As a rule, the applicability of EU data protection law, i.e., the General Data Protection Regulation (GDPR), should always be considered, unless you are absolutely sure that there is no human-related information in the research data.
In any case, it is a good idea to go through the research data to see if it contains personal information. For example, data on inanimate objects, substances, animals, celestial bodies, or weather phenomena may contain information about the observers or be otherwise related to a natural person.
In the next section, we will discuss personal data in a little more detail.
To consider
- Are you processing personal data in your research?
- What kind of personal data do you process?
- What happens to the personal data after the research project?
- Do you need to anonymize or pseudonymize the data?
- Is there any data related to scientific research that does not contain personal information?
Personal data must always be collected in compliance with the ethical guidelines of professional bodies, institutions and funding organizations. The lifecycle of personal data processing must be planned from start to finish.
Research subjects must be informed of what kind of personal data will be collected, how their personal data will be processed during the research project (e.g., how it is protected and whether it is shared with third parties), and what will happen to the data after the research project. It is essential to plan beforehand how to manage data during and after the research. The researcher may need an ethical review statement from a human sciences ethics committee if conducting research with human participants.
Personal data is any information relating to an identified or identifiable natural person:
- direct identifiers that alone are enough to identify a person (e.g., full name, social security number, email address containing the personal name, biometric identifiers such as facial image, voice patterns, fingerprints, iris, handwritten signature)
- strong indirect identifiers with which a person can be identified with reasonable effort (e.g., postal address, phone number, IP-address, student number, insurance number, bank account number, exact annual income, vehicle registration number, unusual job title, rare disease)
- indirect identifiers, i.e., information that on its own is not enough to identify someone but that, when linked with other available information, could be used to deduce the identity of the person (e.g., sex, age, principal abode, profession, workplace, education, school, dates such as date of birth, death, accident).
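The three identifier categories above can be used to screen the variables of a dataset before processing. The following minimal Python sketch assigns each column name to a category. The column names and the category sets are hypothetical examples, and a lookup of this kind does not replace a manual review of the data.

```python
# Hypothetical column names grouped by the identifier categories above
DIRECT = {"full_name", "social_security_number", "email_address"}
STRONG_INDIRECT = {"postal_address", "phone_number", "ip_address", "student_number"}
INDIRECT = {"sex", "age", "profession", "workplace", "date_of_birth"}


def classify_columns(columns):
    """Map each column name to an identifier category."""
    categories = {}
    for name in columns:
        if name in DIRECT:
            categories[name] = "direct identifier"
        elif name in STRONG_INDIRECT:
            categories[name] = "strong indirect identifier"
        elif name in INDIRECT:
            categories[name] = "indirect identifier"
        else:
            # Unknown columns still need a human decision
            categories[name] = "not listed, review manually"
    return categories


report = classify_columns(["full_name", "age", "phone_number", "blood_pressure"])
for column, category in report.items():
    print(f"{column}: {category}")
```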
Watch the video
You may need an ethical review in the human sciences. This short video clip provides guidelines for ethical review in Finland. The video is produced by TENK, i.e., the Advisory Board on Research Ethics (2:30).
Personal data processing must be planned and executed carefully according to the GDPR. Research institutions provide guidance on how to follow the GDPR in practice. For example, UEF instructions about personal data in scientific research, with templates, can be found in Heimo services (requires UEF login).
As a general rule, personal data should be minimized when collecting research data. This means that only such personal data necessary for the research purpose should be collected.
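The minimization principle can be illustrated with a minimal Python sketch that keeps only the fields needed for the research purpose. The function name, the field names and the example record are assumptions for illustration. Ideally, unnecessary personal data is never collected in the first place; the sketch only shows the same principle applied to records that already exist.

```python
def minimize(records, needed_fields):
    """Return copies of the records that contain only the needed fields."""
    return [
        {field: record[field] for field in needed_fields if field in record}
        for record in records
    ]


# Made-up example record
collected = [
    {
        "full_name": "Example Person",
        "phone_number": "000 0000000",
        "age": 34,
        "answer": "agree",
    }
]
minimized = minimize(collected, needed_fields=["age", "answer"])
print(minimized)
```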
In data management, special attention must be paid to secure storing, processing, and transferring of data, so that only those who have justified reasons to access the data can access it. It is good to plan carefully how the research participants are informed about data processing and data management, for example, how long the data will be preserved and where and how the data will be shared for reuse. For further information on processing personal data, see, e.g.:
- Data Management Guidelines (Finnish Social Science Data Archive)
- Data protection (UEF) in Heimo services (requires UEF login)
For research purposes, data can be processed with identifiers or in pseudonymized form. Pseudonymization means removing identifiers or replacing them with pseudonyms or codes, while the key linking the codes to the identifiers is kept separately. Pseudonymized data is still personal data.
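Pseudonymization as described above can be sketched in a few lines of Python: each identifier is replaced with a random code, and the table linking codes to identifiers is returned separately so that it can be stored apart from the data. The function name, the `P-` code format and the example records are assumptions for illustration, not a prescribed method.

```python
import secrets


def pseudonymize(records, id_field):
    """Replace the values of id_field with random codes.

    Returns the pseudonymized records and the key table
    (original identifier -> code). The key table must be stored
    separately from the data and protected.
    """
    key_table = {}
    pseudonymized = []
    for record in records:
        original = record[id_field]
        if original not in key_table:
            code = "P-" + secrets.token_hex(4)
            while code in key_table.values():  # avoid duplicate codes
                code = "P-" + secrets.token_hex(4)
            key_table[original] = code
        new_record = dict(record)  # leave the input untouched
        new_record[id_field] = key_table[original]
        pseudonymized.append(new_record)
    return pseudonymized, key_table


# Made-up example records
interviews = [
    {"name": "Example Person A", "answer": "agree"},
    {"name": "Example Person B", "answer": "disagree"},
    {"name": "Example Person A", "answer": "neutral"},
]
coded, key_table = pseudonymize(interviews, id_field="name")
print(coded)
```

Because the key table allows re-identification, the coded data remains personal data for as long as the table exists.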
Anonymization means removing all identifiers permanently. Anonymization is necessary if the data is shared openly or preserved for reuse. Note that anonymization is itself processing of personal data. Several techniques and tools can be used to achieve anonymity. Read more about personal data and anonymization in the FSD guidelines.
Laine, Heidi (ed.) 2018. Tracing data: Data citation roadmap for Finland. Finnish Committee for Research Data (FCRD).
Finnish advisory board on research integrity (TENK)
(2023-08)