-
Run the code/notebooks in the cloud via Binder
-
Course materials for "Data Science for Public Policy", a course at the University of Tokyo's Graduate School of Public Policy (GraSPP)
-
Instructor: Cory Baird
Module 1: How to Run Statistical Software (3 weeks)
- Week 1 (Apr. 7): The Easy Way to Code and Useful Tools
- Week 2 (Apr. 14): Acquiring Data through APIs
- Week 3 (Apr. 21): Downloading and transforming with tools (functions)
Module 2: Visualization (3 weeks)
- Week 4 (Apr. 28): Introduction to Data Visualization
- Week 5 (May 12): More visualization and mapping libraries
- Week 6 (May 19): Data pipeline and regression
Module 3: Regression, ML, AI
- Week 7 (May 26): Regression & Machine Learning
- Week 8 (June 2): ML & Neural Networks (A.I.)
Module 4: AI, LLM and Text analysis
- Week 9 (June 9): Scraping
- Week 10 (June 16): Reading PDF, NLP basics (Bag-of-words)
- Week 11 (June 23): Using LLMs
- Week 12 (June 30): Fine-tuning/training LLMs
Final Presentations
- Week 13 (July 7): Final presentations
-
Milestone 1: Data selection and research question
- Weight: 20% of final grade
- Task: Import and manipulate the data and show descriptive statistics in tables or graphs.
- Due: by Week 4 (Apr. 28)
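A minimal sketch of the kind of work Milestone 1 asks for, using pandas. The dataset and column names below are invented for illustration; in practice you would load your own data, e.g. with `pd.read_csv("my_data.csv")`.

```python
import pandas as pd

# Hypothetical policy dataset (illustration only; use your own data)
df = pd.DataFrame({
    "prefecture": ["Tokyo", "Osaka", "Aichi", "Hokkaido"],
    "population_m": [14.0, 8.8, 7.5, 5.2],        # population, millions
    "gdp_trillion_yen": [115.7, 41.1, 40.9, 19.8],  # illustrative figures
})

# Simple manipulation: GDP per capita in million yen
# (trillion yen / million people = million yen per person)
df["gdp_per_capita_m_yen"] = df["gdp_trillion_yen"] / df["population_m"]

# Descriptive statistics as a table
print(df.describe())
```

`describe()` gives count, mean, standard deviation, and quartiles for each numeric column, which already satisfies the "table" half of the task.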
-
Milestone 2: Data Visualization and Interpretation
- Weight: 20% of final grade
- Task: Create at least 5 different visualizations (including charts) of the dataset.
- Due: by Week 7 (May 26)
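One of the five required visualizations might look like the sketch below, using matplotlib. The spending figures are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; omit this line to display plots interactively
import matplotlib.pyplot as plt

# Invented example data for illustration
years = [2019, 2020, 2021, 2022, 2023]
spending = [5.2, 6.8, 6.5, 6.1, 6.3]  # hypothetical public spending, trillion yen

fig, ax = plt.subplots()
ax.plot(years, spending, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Spending (trillion yen)")
ax.set_title("Hypothetical public spending over time")
fig.savefig("spending.png")  # save the chart for your milestone write-up
```

Saving each chart to a file makes it easy to drop them into slides or a report; each milestone visualization should also carry a sentence or two of interpretation.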
-
Milestone 3: Analytical Presentation
- Weight: 20% of final grade
- Task: Present your analysis in a whitepaper, slide deck, or dashboard
- Due: by Week 11 (June 23)
- Use Python to collect, clean, and analyze policy-relevant data.
- Design and implement reproducible research workflows to effectively manage and utilize public data.
- Apply statistical and machine learning methods to analyze policy problems.
- Process and analyze text data using traditional NLP and modern LLMs (ChatGPT) to extract meaningful insights.
- Develop visualizations to communicate research findings effectively to both technical and non-technical audiences.
- Collaborate effectively using professional data science tools like GitHub, Overleaf, and Google Colab.
-
Code version control: Git/Github
- GitHub Account: Create an account, then "star" the class page
- GitHub Desktop: For collaboration on code/notebooks
- Git software: https://git-scm.com/downloads
- Git is installed automatically with GitHub Desktop on macOS, but may not be on Windows
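Once Git is installed, the core workflow covered in class looks roughly like this (run in a terminal; the file name and commit message are placeholders):

```shell
mkdir demo-repo && cd demo-repo
git init                       # start tracking this directory
echo "# Course notes" > README.md
git add README.md              # stage the new file
git commit -m "First commit"   # record a snapshot (set user.name/user.email once via git config)
git log --oneline              # view the commit history
```

GitHub Desktop wraps these same commands in a point-and-click interface, so either route is fine for the course.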
-
Running code AND notebooks
- VSCode: For running notebooks and code (Download Link)
- Sublime/PyCharm also acceptable
- uv: Python version and package management, and running notebooks (Download Link)
-
If you have issues running the software above
- The easiest option is GitHub Codespaces, which launches VS Code in the cloud
- Other solutions:
- Anaconda: https://www.anaconda.com/
- Jupyter.org Try: https://jupyter.org/try
- Google Colab: https://colab.research.google.com/
- pip: The standard tool for installing and managing extra Python libraries that provide specialized functions for data analysis, machine learning, and more.
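For example, installing two libraries commonly used in this kind of course (the package names here are typical choices, not an official course list):

```shell
# Check that pip is available (it ships with recent Python installers)
python -m pip --version

# Install third-party libraries into the active environment
python -m pip install pandas matplotlib
```

Running pip as `python -m pip` ensures the packages are installed for the same Python interpreter you use to run your notebooks.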