perf: do not store redundant feature strings for each candidate #217

@lukehsiao

The feature table is very large because we store the string of every feature for every candidate, even though many of these strings are shared between candidates. This also slows down queries. We should add a separate table that maps an identifier to each string, so that each string is stored only once.

This will take some consideration, though. Right now it is very easy to inspect a particular candidate's features or labels simply through cand.features, and we would lose that convenience if we refactored the storage this way without providing an equivalent accessor.
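To make the trade-off concrete, here is a minimal in-memory sketch of the idea, not the project's actual schema. Every name in it (FeatureKeyTable, Candidate, add_feature, the KEY_A/KEY_B strings) is hypothetical. It stores each feature string once, keeps only integer ids per candidate, and shows one possible way to keep a cand.features style accessor by resolving ids back to strings on read.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class FeatureKeyTable:
    """Hypothetical lookup table: each distinct feature string is stored once."""

    def __init__(self) -> None:
        self._id_by_name: Dict[str, int] = {}
        self._name_by_id: List[str] = []

    def intern(self, name: str) -> int:
        """Return the id for `name`, assigning a new id the first time it is seen."""
        key_id = self._id_by_name.get(name)
        if key_id is None:
            key_id = len(self._name_by_id)
            self._id_by_name[name] = key_id
            self._name_by_id.append(name)
        return key_id

    def name(self, key_id: int) -> str:
        return self._name_by_id[key_id]

    def __len__(self) -> int:
        return len(self._name_by_id)


@dataclass
class Candidate:
    """Hypothetical candidate that stores integer key ids instead of strings."""

    key_table: FeatureKeyTable
    key_ids: List[int] = field(default_factory=list)
    values: List[float] = field(default_factory=list)

    def add_feature(self, name: str, value: float) -> None:
        self.key_ids.append(self.key_table.intern(name))
        self.values.append(value)

    @property
    def features(self) -> Dict[str, float]:
        """Resolve ids back to strings so inspection stays a one-liner."""
        return {self.key_table.name(k): v for k, v in zip(self.key_ids, self.values)}


# Two candidates sharing one feature string (placeholder names).
table = FeatureKeyTable()
a = Candidate(table)
b = Candidate(table)
a.add_feature("KEY_A", 1.0)
a.add_feature("KEY_B", 1.0)
b.add_feature("KEY_A", 1.0)

print(len(table))   # number of distinct strings stored
print(a.features)   # strings are recovered on read
```

In a database-backed version the same lookup would presumably become a join or a cached key map, so the cost of the accessor moves from storage to read time.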

As a rough analysis, with a dataset of 253,524 candidates, we see the following.

Total number of unique feature keys:

# select count(*) from feature_key;
 count 
-------
 17705

Total number of stored feature entries (keys summed over all candidates):

# select count(*) from (select unnest(keys) from feature) as temp;
  count   
----------
 43317015
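As a quick back-of-the-envelope check on how much repetition these counts imply, the following sketch uses only the three numbers reported above. The variable names are mine, and the averages say nothing about how evenly keys are actually distributed.

```python
# Counts taken from the queries above.
num_candidates = 253_524
num_unique_keys = 17_705
num_feature_entries = 43_317_015

# Average number of feature entries stored per candidate.
avg_features_per_candidate = num_feature_entries / num_candidates

# Average number of times each distinct key string is stored.
avg_repeats_per_key = num_feature_entries / num_unique_keys

print(f"{avg_features_per_candidate:.1f} feature entries per candidate on average")
print(f"{avg_repeats_per_key:.1f} stored copies of each unique key on average")
```

This works out to roughly 171 entries per candidate and roughly 2,400 stored copies of each unique key, which is the redundancy a key mapping table would remove.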

The feature table is by far the largest table: 598 MB, versus 87 MB for the next largest (sentence).

# \d+
                               List of relations
 Schema |         Name          |   Type   |  Owner  |    Size    | Description
--------+-----------------------+----------+---------+------------+-------------
 public | candidate             | table    | user    | 14 MB      |
 public | candidate_id_seq      | sequence | user    | 8192 bytes |
 public | caption               | table    | user    | 8192 bytes |
 public | caption_mention       | table    | user    | 0 bytes    |
 public | ce_v_max              | table    | user    | 48 kB      |
 public | cell                  | table    | user    | 8128 kB    |
 public | cell_mention          | table    | user    | 0 bytes    |
 public | context               | table    | user    | 30 MB      |
 public | context_id_seq        | sequence | user    | 8192 bytes |
 public | document              | table    | user    | 3208 kB    |
 public | document_mention      | table    | user    | 0 bytes    |
 public | feature               | table    | user    | 598 MB     |
 public | feature_key           | table    | user    | 3256 kB    |
 public | figure                | table    | user    | 600 kB     |
 public | figure_mention        | table    | user    | 0 bytes    |
 public | gold_label            | table    | user    | 23 MB      |
 public | gold_label_key        | table    | user    | 16 kB      |
 public | implicit_span_mention | table    | user    | 1272 kB    |
 public | label                 | table    | user    | 40 MB      |
 public | label_key             | table    | user    | 24 kB      |
 public | marginal              | table    | user    | 0 bytes    |
 public | marginal_id_seq       | sequence | user    | 8192 bytes |
 public | mention               | table    | user    | 560 kB     |
 public | mention_id_seq        | sequence | user    | 8192 bytes |
 public | paragraph             | table    | user    | 4936 kB    |
 public | paragraph_mention     | table    | user    | 0 bytes    |
 public | part                  | table    | user    | 216 kB     |
 public | part_ce_v_max         | table    | user    | 272 kB     |
 public | part_polarity         | table    | user    | 2440 kB    |
 public | part_stg_temp_max     | table    | user    | 4216 kB    |
 public | part_stg_temp_min     | table    | user    | 4224 kB    |
 public | polarity              | table    | user    | 88 kB      |
 public | prediction            | table    | user    | 8192 bytes |
 public | prediction_key        | table    | user    | 8192 bytes |
 public | section               | table    | user    | 48 kB      |
 public | section_mention       | table    | user    | 0 bytes    |
 public | sentence              | table    | user    | 87 MB      |
 public | span_mention          | table    | user    | 264 kB     |
 public | stable_label          | table    | user    | 8192 bytes |
 public | stg_temp_max          | table    | user    | 152 kB     |
 public | stg_temp_min          | table    | user    | 152 kB     |
 public | table                 | table    | user    | 112 kB     |
 public | table_mention         | table    | user    | 0 bytes    |
 public | webpage               | table    | user    | 8192 bytes |
(44 rows)

Labels: discussion, help wanted