Commit 341434b

📝 Add CONTRIBUTING.md
1 parent f0e71a5 commit 341434b

4 files changed

+112
-150
lines changed

CONTRIBUTING.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
1+
# Contributing to pySBD
2+
Thanks for your interest in contributing to pySBD 🎉. The project is maintained by [@nipunsadvilkar](https://github.com/nipunsadvilkar), and I'll do my best to help you get started. This page gives you a quick overview of how things are organised and, most importantly, how to get involved.
3+
4+
## Table of contents
5+
6+
1. [Issues and bug reports](#issues-and-bug-reports)</br>
7+
a. [Submitting issues](#submitting-issues)</br>
8+
b. [Issue labels](#issue-labels)
9+
2. [Contributing to the code base](#contribute-to-the-code-base)</br>
10+
a. [Getting started](#getting-started)</br>
11+
b. [Add a new rule to existing *Golden Rules Set* (GRS)](#add-a-new-rule-to-existing-golden-rules-set-grs)</br>
12+
c. [Add new language support](#add-new-language-support)</br>
13+
d. [Add tests](#add-tests)</br>
14+
e. [Fix bugs](#fix-bugs)
15+
16+
## Issues and bug reports
17+
First, do a [quick search](https://github.com/nipunsadvilkar/pySBD/issues?q=is%3Aissue+sort%3Aupdated-desc+is%3Aclosed+is%3Aopen) to see if the issue has already been reported or is already open. If so, it's often better to leave a comment on the existing issue rather than create a new one. Old issues also often include helpful tips and solutions to common problems.
18+
19+
Please understand that the author won't be able to provide individual support via
20+
email. The author also believes that help is much more valuable if it's **shared publicly**,
21+
so that more people can benefit from it.
22+
23+
### Submitting issues
24+
25+
When opening an issue, use an **appropriate and descriptive title** and include your
26+
**environment** (operating system, Python version, pySBD version). Choose the report type [from here](https://github.com/nipunsadvilkar/pySBD/issues/new/choose); if no suitable type is available, open a [blank issue](https://github.com/nipunsadvilkar/pySBD/issues/new). The
27+
[issue template](https://github.com/nipunsadvilkar/pySBD/issues/new?assignees=&labels=&template=bug_report.md&title=%3CAppropriate+title%3E) helps you
28+
remember the most important details to include. If you've discovered a bug, you
29+
can also submit a [regression test](#fix-bugs) straight away. When you're
30+
opening an issue to report the bug, simply refer to your pull request in the
31+
issue body. A few more tips:
32+
33+
- **Describing your issue:** Try to provide as many details as possible. What
34+
exactly goes wrong? _How_ is it failing? Is there an error?
35+
"XY doesn't work" usually isn't that helpful for tracking down problems. Always
36+
remember to include the code you ran and if possible, extract only the relevant
37+
parts and don't just dump your entire script. Also, state the expected output for the given input. This will make it easier for contributors to
38+
reproduce the error.
39+
40+
- **Getting info about your pySBD installation and environment:** You can use the command line to print the installed pySBD version, then copy-paste it along with your Python version into the GitHub issue:
41+
`pip freeze|grep pysbd`.
42+
43+
- **Sharing long blocks of code/logs/tracebacks:** If you need to include long code,
44+
logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
45+
[collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
46+
so it only becomes visible on click, making the issue easier to read and follow.
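
For a fuller environment report than `pip freeze` gives, a short script can gather the operating system, Python version and installed pySBD version in one go. This is a minimal sketch using only the standard library (`importlib.metadata` needs Python 3.8+); the helper name `environment_report` is hypothetical and not part of pySBD.

```python
import platform
from importlib import metadata


def environment_report(package="pysbd"):
    """Collect the details an issue report should mention."""
    try:
        pkg_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        pkg_version = "not installed"
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        package: pkg_version,
    }


if __name__ == "__main__":
    # Print one "key: value" line per detail, ready to paste into an issue.
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

The printed lines can be pasted directly into the environment section of an issue.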
47+
48+
### Issue labels
49+
50+
[See this page](https://github.com/nipunsadvilkar/pySBD/labels) for an overview of
51+
the system the author uses to tag issues and pull requests.
52+
53+
## Contribute to the code base
54+
55+
Happy to see you contribute to the pySBD codebase. To get started and understand the internals of pySBD, a good place to begin is the implementation section of the pySBD research paper (link to be added soon). Another great reference is the list of [merged pull requests](https://github.com/nipunsadvilkar/pySBD/pulls?q=is%3Apr+sort%3Aupdated-desc+is%3Amerged). Depending on the type of your contribution, refer to the assigned labels.
56+
57+
### Getting started
58+
To make changes to pySBD's code base, you need to fork the GitHub repository and then clone it to your local machine. Make sure you have a development environment with Python 3+, pipenv and git installed.
59+
60+
```bash
61+
python -m pip install -U pip
62+
git clone https://github.com/nipunsadvilkar/pySBD
63+
cd pySBD
64+
pipenv install --dev
65+
```
66+
Since pySBD is lightweight, it requires only Python's built-in modules to function, specifically the `re` module. Development package requirements are provided by the `[dev-packages]` section of the `Pipfile`. If you want to use pySBD as a spaCy component, install spaCy in your environment.
67+
68+
### Add a new rule to existing *Golden Rules Set* (GRS)
69+
The language-specific *Golden Rules Sets* are hand-constructed rules designed to cover sentence boundaries across a variety of domains. The sets are by no means complete and will evolve and expand over time. If you would like to report an issue with an existing rule or propose a new rule, please [open an issue.](#submitting-issues) If you want to contribute the change yourself, go ahead and send a pull request, referring to the [add tests](#add-tests) section.
70+
71+
### Add new language support
72+
Great to see you adding new language support to pySBD ✨.</br>
73+
Follow the steps below to add new language support:
74+
75+
^^ Please use the commits for already supported languages - [Marathi](https://github.com/nipunsadvilkar/pySBD/commit/ab39442ece525285e5e83a80e2d2672bba467db7), [Spanish](https://github.com/nipunsadvilkar/pySBD/commit/ed6fb8672e30521e6e5d55bc86b779b2b4cf47dd), [Chinese](https://github.com/nipunsadvilkar/pySBD/commit/092764f896911bb97259720998b636f18980bb62) - as a frame of reference as you go through each step below.
76+
77+
1. **New Language Specific *Golden Rules Set***</br>
78+
You need to create a *Golden Rules Set* that covers basic to complex sentence boundary variations and serves as a test set. Assuming you know the language, its sentence syntax and its other intricacies, create a new file at `tests/lang/test_<language_name>.py` and list input texts and expected outputs in the same way the author has done for the existing^^ languages. Refer to the [add tests](#add-tests) section for details on how to add and run tests and how to add a language fixture. Next, run the tests using `pytest` and confirm that they fail, as expected at this stage.
79+
80+
2. **Add your language module**</br>
81+
Create a new file at `pysbd/lang/<language_name>.py` and define a new class `LanguageName` that inherits from two base classes, `Common` and `Standard`, which hold basic rules common to the majority of languages. Run the tests to see whether your GRS passes. If it fails, override the `SENTENCE_BOUNDARY_REGEX` and `Punctuations` class variables and the `AbbreviationReplacer` class to support your language-specific punctuation and sentence boundaries.
82+
83+
> Illustration: As you can see in the [`Marathi`](https://github.com/nipunsadvilkar/pySBD/blob/master/pysbd/lang/marathi.py) language module, the `SENTENCE_STARTERS` of its `AbbreviationReplacer` are left empty to override `Standard`'s [`SENTENCE_STARTERS`](https://github.com/nipunsadvilkar/pySBD/blob/master/pysbd/lang/common/standard.py#L111). Next, `Punctuations` is limited to `['.', '!', '?']`, and `SENTENCE_BOUNDARY_REGEX` is constructed accordingly so that the [Marathi GRS](https://github.com/nipunsadvilkar/pySBD/blob/master/tests/lang/test_marathi.py) passes. As with these class variables, if any rule does not apply to your language, you can override it in your language class.
84+
85+
3. **Add language code**<br>
86+
Your language module and language GRS should be in place by now. The next step is to make the language available through pySBD's [`languages`](https://github.com/nipunsadvilkar/pySBD/blob/master/pysbd/languages.py) module: import your language module there and add an entry to the `LANGUAGE_CODES` dictionary, using your language's [ISO 639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) code as the key and the imported language class as the value.
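
Steps 2 and 3 can be sketched as follows. So that it runs without pySBD installed, the sketch declares empty stand-ins for `Common`, `Standard` and `AbbreviationReplacer`; in a real language module these would be imported from pySBD instead. The language name `Examplish`, its code `xx`, and the regex and punctuation values are placeholders chosen for illustration, so take the real values from the reference commits.

```python
import re


# Stand-ins so the sketch is self-contained. In a real language module,
# import Common, Standard and AbbreviationReplacer from pySBD instead.
class Common:
    pass


class Standard:
    pass


class AbbreviationReplacer:
    SENTENCE_STARTERS = ["A", "The"]  # placeholder defaults


class Examplish(Common, Standard):
    """Hypothetical language class; would live in pysbd/lang/examplish.py."""

    # Restrict sentence-ending punctuation to what the language uses.
    Punctuations = [".", "!", "?"]
    # Build the boundary regex from the same punctuation marks.
    SENTENCE_BOUNDARY_REGEX = r".*?[.!?]|.*?$"

    class AbbreviationReplacer(AbbreviationReplacer):
        # Left empty to override the inherited sentence starters.
        SENTENCE_STARTERS = []


# Step 3: register the class under its language code
# ("xx" is a placeholder, not an assigned ISO 639-1 code).
LANGUAGE_CODES = {"xx": Examplish}

language = LANGUAGE_CODES["xx"]
pieces = re.findall(language.SENTENCE_BOUNDARY_REGEX, "One. Two! Three?")
sentences = [piece.strip() for piece in pieces if piece.strip()]
print(sentences)
```

The last few lines only demonstrate that the registered class and its regex can be looked up by language code; the actual segmentation logic lives in pySBD itself.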
87+
88+
### Add tests
89+
The author emphasizes a Test-Driven Development [(TDD)](https://testdriven.io/test-driven-development/) approach to ensure the robustness of the pySBD module. You will follow a "<span style="color: red;">Red</span>-<span style="color: green;">Green</span>-<span style="color: orange;">Refactor</span>" cycle.
90+
91+
1. Make sure you have a proper development environment [set up](#getting-started).
92+
2. Depending on the type of contribution, your test script will be either [feature-specific](#add-new-language-support) or [bugfix-specific](#fix-bugs).
93+
3. (<span style="color: red;">Red</span>) Once you have added those tests, run `pytest` and confirm that they fail, as expected at this stage.
94+
4. (<span style="color: green;">Green</span>) Write just enough code in the relevant Python script to pass the specific test that you added and that failed earlier.
95+
5. Once it passes, run all the tests to check that your change doesn't break existing code.
96+
6. (<span style="color: orange;">Refactor</span>) Do necessary refactoring & cleaning to keep tests green.
97+
7. Repeat 🔁
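
One turn of the cycle might look like the sketch below. To stay self-contained it uses a hypothetical `segment` function in place of pySBD's segmenter, and a plain `test_` function, which `pytest` collects when it lives in a `test_*.py` file. The test cases are invented examples, not entries from an actual Golden Rules Set.

```python
import re

# Each case pairs an input text with the sentences expected from it.
GOLDEN_RULES_TEST_CASES = [
    ("Hello world. How are you?", ["Hello world.", "How are you?"]),
    ("Stop! Who goes there?", ["Stop!", "Who goes there?"]),
]


def segment(text):
    """Hypothetical stand-in for the code under test.

    Red: start with `return [text]` and watch the test fail.
    Green: write just enough logic, as here, to make it pass.
    """
    return [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]


def test_golden_rules():
    for text, expected_sents in GOLDEN_RULES_TEST_CASES:
        assert segment(text) == expected_sents


if __name__ == "__main__":
    test_golden_rules()
    print("all golden rules pass")
```

Keeping the cases in a module-level list makes it cheap to add one more failing case before each new change.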
98+
99+
### Fix bug(s)
100+
101+
When fixing a bug, first create an
102+
[issue](https://github.com/nipunsadvilkar/pySBD/issues) if one does not already exist.
103+
The description can be very short; it doesn't need to be verbose.
104+
105+
Next, depending on your type of issue, add your test in `TEST_ISSUE_DATA` / `TEST_ISSUE_DATA_CHAR_SPANS` with a tuple `("#ISSUE_NUMBER", "<input_text>", <expected_output>)` in the
106+
[`tests/regression`](tests/regression) folder. Test for the bug
107+
you're fixing, and make sure the test fails. Next, add and commit your test file
108+
referencing the issue number in the commit message. Finally, fix the bug, make
109+
sure your test passes and reference the issue in your commit message.
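
A regression entry following the tuple shape above could look like this sketch. The issue number `#0`, the sample text and the `run_issue_cases` helper are all hypothetical; the helper takes the segmenting function as an argument so that the example runs on its own, whereas the real test module would call pySBD's segmenter.

```python
# Tuple layout from this guide: (issue number, input text, expected output).
TEST_ISSUE_DATA = [
    ("#0", "First sentence. Second sentence.",
     ["First sentence.", "Second sentence."]),
]


def run_issue_cases(segment, cases):
    """Return the issue numbers whose output differs from the expectation."""
    failing = []
    for issue_no, text, expected_sents in cases:
        if segment(text) != expected_sents:
            failing.append(issue_no)
    return failing


def no_split(text):
    """Stand-in for the buggy behaviour: never splits."""
    return [text]


def naive_split(text):
    """Stand-in for the fixed behaviour: splits after each full stop."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


print(run_issue_cases(no_split, TEST_ISSUE_DATA))     # before the fix
print(run_issue_cases(naive_split, TEST_ISSUE_DATA))  # after the fix
```

The first call reports the issue as failing and the second reports nothing, which mirrors the order described above: commit the failing test first, then the fix.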
110+
111+
Thank you for contributing! ✨ 🍰 ✨

README.md

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ print(list(doc.sents))
5151

5252
## Contributing
5353

54-
If you find a text that is incorrectly segmented using pySBD, please submit an issue.
54+
If you want to contribute a new feature or language support, or have found text that pySBD segments incorrectly, please head to [CONTRIBUTING.md](https://github.com/nipunsadvilkar/pySBD/blob/master/CONTRIBUTING.md) to learn more, then follow these steps.
5555

5656
1. Fork it ( https://github.com/nipunsadvilkar/pySBD/fork )
5757
2. Create your feature branch (`git checkout -b my-new-feature`)

tests/sample.txt

Lines changed: 0 additions & 106 deletions
This file was deleted.

tests/sample2.txt

Lines changed: 0 additions & 43 deletions
This file was deleted.
