This lesson teaches researchers who are just starting out how to manage their data and files.
- Masters/PhD/Postdoc researchers at the beginning of their projects.
- Basic digital skills required (e.g., file management, Excel, some version control exposure).
- No programming experience necessary.
- Basic Excel use (open/save tables)
- File/folder management on a computer
- A research project or dataset in progress
After completing this course, learners should be able to:
- Define research data and distinguish between different data types.
- Structure research materials using clear file naming conventions and a logical folder hierarchy
- Describe methods of data collection that make data cleaner and easier to analyse
- Detect inconsistencies and errors in a tabular dataset ("dirty data")
- Use a set of basic techniques to remove/correct errors and inconsistencies in tabular data ("cleaning data")
- Use version control to track different versions of files and switch between them.
- Victoria Yorke-Edwards (@vyorkeedwards)
- Kimberly Meechan (@K-Meech)
- Katie Buntic (@katiebuntic)
- Size:
- Types:
- Requires noise/messiness injection for teaching
- Licensing:
A fictional researcher, Alex, inherits disorganised MET data. Learners help clean and structure it.
- Unclear file naming (e.g. final_final_v3.csv)
- Scattered/misplaced files
- Dirty data: duplicates, missing values, format errors, inconsistent naming
1. What is Research Data?
- Data types
- Sources of data
- What is research data management (collection, storage, organisation, sharing, etc.)?
Need to write objectives
2. Structuring research materials
- Naming conventions
- Folder structures
- Version Control
- Introduction to version control software (Git/GitHub)
Objectives
After following this episode, learners will be able to:
- Organise their research data into a standard folder structure
- Name files with a consistent naming convention
- Explain why version control is important and how to incorporate it into their naming conventions
- Explain why version control software such as Git/GitHub can be useful for certain types of data.
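
As an illustration of what a consistent naming convention and a standard folder structure could look like in practice, here is a minimal Python sketch. It is not part of the lesson material (which requires no programming), and everything specific in it is an assumption: the folder names in `FOLDERS`, the `YYYY-MM-DD_project_description_vNN.ext` pattern, and the helper names `make_project_folders` and `build_filename` are hypothetical choices, not a convention the lesson prescribes.

```python
import re
from datetime import date
from pathlib import Path

# Hypothetical folder layout; the lesson does not prescribe these names.
FOLDERS = ["data/raw", "data/processed", "docs", "results"]


def make_project_folders(root: Path) -> list:
    """Create the folder hierarchy under root and return the folder paths."""
    created = []
    for name in FOLDERS:
        folder = root / name
        # parents=True creates intermediate folders; exist_ok=True makes reruns safe.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


def slug(text: str) -> str:
    """Lower-case the text and replace runs of non-alphanumerics with a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_filename(day: date, project: str, description: str, version: int, ext: str) -> str:
    """Build a name such as 2024-03-01_weather_station-readings_v02.csv."""
    return f"{day.isoformat()}_{slug(project)}_{slug(description)}_v{version:02d}.{ext}"
```

Putting the ISO date first means files sort chronologically, and a zero-padded version number keeps `v02` ahead of `v10`. A name produced this way replaces ambiguous labels such as "final" with an explicit version.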
3. Tabular data collection
- Have a look at a 'dirty' dataset
- Is there a standard set of responses?
- Is it free text?
- How do you control what data is being collected?
- Asking the right questions
- Data dictionaries
Objectives
After following this episode, learners will be able to:
- List variable types and formats
- Identify inconsistencies in data that can cause problems during analysis
- Describe methods that can be used during data collection and data entry that can prevent inconsistencies
- Write guidance for how to collect and enter data
- Create a data dictionary describing a dataset
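
To make the idea of a data dictionary concrete, the sketch below expresses one as a small Python structure and uses it to check a row of data. This is only an illustration under stated assumptions: the column names (`sample_id`, `site`, `count`), the allowed values, and the `check_row` helper are all hypothetical and are not taken from the lesson's dataset. In the lesson itself a data dictionary would more likely be a table in a document or spreadsheet.

```python
# A hypothetical data dictionary: one entry per column in the dataset.
DATA_DICTIONARY = {
    "sample_id": {"type": str, "description": "Unique identifier for the sample", "allowed": None},
    "site": {"type": str, "description": "Collection site", "allowed": {"north", "south"}},
    "count": {"type": int, "description": "Number of items counted", "allowed": None},
}


def check_row(row: dict) -> list:
    """Return a list of human-readable problems found in one row."""
    problems = []
    for column, spec in DATA_DICTIONARY.items():
        # A column that is absent, empty, or None counts as missing.
        if column not in row or row[column] in ("", None):
            problems.append(f"{column}: missing value")
            continue
        value = row[column]
        if not isinstance(value, spec["type"]):
            problems.append(
                f"{column}: expected {spec['type'].__name__}, got {type(value).__name__}"
            )
        elif spec["allowed"] is not None and value not in spec["allowed"]:
            problems.append(f"{column}: '{value}' is not an allowed value")
    return problems
```

For example, `check_row({"sample_id": "S2", "site": "North", "count": "3"})` reports that `North` is not an allowed value (the dictionary expects lower case) and that `count` was entered as text rather than a number, which are the kinds of inconsistency a standard set of responses is meant to prevent.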
4. How to clean a tabular dataset (using Excel)
- Finding inconsistencies
- Missing data
- Capitalisation
- Spelling mistakes
- Pros and cons of Excel
Objectives
After following this episode, learners will be able to:
- Describe what data cleaning is and why it is important
- Find and resolve inconsistencies within a tabular dataset programmatically (e.g. datetime, numeric precision)
- Identify missing values within a tabular dataset using filters
- Correct spelling mistakes using spell check tools and find and replace
- Standardise text formats using spreadsheet functions
- Describe the pros and cons of using spreadsheets for data collection and cleaning
- [Note: update for using R?]
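
The episode itself is planned around Excel, but the same cleaning steps can be sketched in code to show what they amount to. The Python below is a minimal illustration, not the lesson's method: the helper names (`standardise_text`, `find_missing`, `drop_duplicates`) and the set of placeholder strings treated as missing are assumptions made for this example.

```python
def standardise_text(value: str) -> str:
    """Trim surrounding whitespace, collapse inner runs of spaces, and lower-case."""
    return " ".join(value.split()).lower()


def find_missing(rows: list, column: str) -> list:
    """Return the indexes of rows whose value in column is empty or a placeholder."""
    placeholders = {"", "na", "n/a", "null", "-"}  # assumed placeholders for missing data
    return [
        i
        for i, row in enumerate(rows)
        if standardise_text(row.get(column) or "") in placeholders
    ]


def drop_duplicates(rows: list) -> list:
    """Keep the first occurrence of each row, comparing standardised values."""
    seen = set()
    unique = []
    for row in rows:
        # Sort by column name so that key order in the dict does not matter.
        key = tuple((k, standardise_text(v or "")) for k, v in sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Given rows such as `{"site": "North ", "species": "Oak"}` and `{"site": "north", "species": "oak"}`, `drop_duplicates` treats them as the same record because capitalisation and stray spaces are standardised before comparison. Note that it returns the original rows unchanged; it only decides which ones to keep.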
5. Introduction to R
Need to write objectives