Use OpenAI to assess OpenAI?

After
- #1 

change the question list from a flat list to something like
```yaml
-
  q: What is 2 + 2?
  a:
    - should be 4
```
and have OpenAI grade its own answers against the criteria listed under `a`. (We'll want to check the accuracy of this self-grading before doing experiments at a larger scale.)
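
A rough sketch of how the grading step could look, just to make the idea concrete. Everything here is an assumption rather than existing code: `build_grading_prompt`, `grade`, `agreement_rate`, and the PASS/FAIL reply convention are hypothetical, and `ask_model` stands in for whatever function actually sends a prompt to OpenAI and returns the text reply. `QUESTIONS` is the entry above written as the data it would parse to.

```python
from typing import Callable, Dict, List

# The YAML entry above, written as the Python data it would parse to.
QUESTIONS: List[Dict] = [
    {"q": "What is 2 + 2?", "a": ["should be 4"]},
]


def build_grading_prompt(question: str, criteria: List[str], answer: str) -> str:
    """Build a grading prompt (hypothetical wording) from one q/a entry."""
    lines = [
        "You are grading an answer to a question.",
        f"Question: {question}",
        f"Answer: {answer}",
        "Criteria:",
    ]
    lines += [f"- {c}" for c in criteria]
    lines.append("Reply with exactly PASS if the answer meets every criterion, otherwise FAIL.")
    return "\n".join(lines)


def grade(entry: Dict, answer: str, ask_model: Callable[[str], str]) -> bool:
    """Ask the model to grade `answer`; `ask_model` is a placeholder for the real API call."""
    prompt = build_grading_prompt(entry["q"], entry["a"], answer)
    return ask_model(prompt).strip().upper() == "PASS"


def agreement_rate(model_verdicts: List[bool], human_verdicts: List[bool]) -> float:
    """Fraction of items where the model's verdict matches a hand-assigned one."""
    if len(model_verdicts) != len(human_verdicts) or not model_verdicts:
        raise ValueError("verdict lists must be non-empty and the same length")
    matches = sum(m == h for m, h in zip(model_verdicts, human_verdicts))
    return matches / len(model_verdicts)


def fake_model(prompt: str) -> str:
    """Offline stand-in for the model so the sketch runs without network access."""
    return "PASS" if "Answer: 4\n" in prompt else "FAIL"


if __name__ == "__main__":
    answers = ["4", "5"]
    model_verdicts = [grade(QUESTIONS[0], a, fake_model) for a in answers]
    human_verdicts = [True, False]  # hand-labelled for this toy example
    print(model_verdicts, agreement_rate(model_verdicts, human_verdicts))
```

Comparing the model's verdicts to a small hand-graded sample with something like `agreement_rate` would be one way to do the accuracy check before scaling up.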