Text extraction #138

ajstanley · 2019-08-21T14:24:27Z

What does this Pull Request do?

Adds text extraction module to playbook.

What's new?

With text extraction in place, images that go through OCR will have their text extracted and stored in an editable media.
Any PDF media tagged as Original File will also have its text extracted into an editable media.

How should this be tested?

After the playbook is spun up, create a node tagged as both Image and Digital Document.
Add an image media (containing text) tagged as Original File.
An Extracted Text media should be created and attached to the node.

Create another node and attach a media tagged with Original File and with a media type of application/pdf.
Extracted Text media should be created and attached to node.

Interested parties

@Islandora-Devops/committers

dannylamb · 2019-08-28T16:27:15Z

text_extraction.yml

+        update_cache: yes
+      when: ansible_os_family == "Redhat"
+
+    - name: Download Islandora Text Extraction module


You can just add this and the next task to variables in our inventory.

https://github.com/Islandora-Devops/claw-playbook/blob/dev/inventory/vagrant/group_vars/webserver/drupal.yml#L6

https://github.com/Islandora-Devops/claw-playbook/blob/dev/inventory/vagrant/group_vars/webserver/drupal.yml#L30
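
For reference, a minimal sketch of what that suggestion could look like in the inventory file, assuming the two linked lines are the `drupal_composer_dependencies` and `drupal_enable_modules` lists (the standard `geerlingguy.drupal` role variables). The Composer package name and version constraint are illustrative assumptions, not taken from this PR.

```yaml
# inventory/vagrant/group_vars/webserver/drupal.yml (sketch; existing entries omitted)

# Assumption: the package name and version constraint are hypothetical.
# Use whatever name the module is actually published under.
drupal_composer_dependencies:
  - "islandora/islandora_text_extraction:dev-8.x-1.x"

# Machine name of the module as it is referred to in this thread.
drupal_enable_modules:
  - islandora_text_extraction
```

With the module listed in both places, the separate download and enable tasks in `text_extraction.yml` would no longer be needed.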

dannylamb · 2019-08-28T19:11:27Z

vagrant uping to test. I added the text-extraction play to the main play for this to trigger.
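
A sketch of one way to wire that in, assuming `text_extraction.yml` (the file under review) is a standalone playbook and that the main playbook is named `playbook.yml`. Both are assumptions about the repository layout, not a record of the change that was actually made for this test.

```yaml
# playbook.yml (assumed name of the main playbook; existing plays omitted)

# Run the text extraction play as part of a normal provision.
- import_playbook: text_extraction.yml
```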

dannylamb · 2019-08-29T13:32:23Z

Box provisions OK, and I can confirm the listeners are deploying, as is the islandora_text_extraction module. But when uploading a PDF, I'm getting a white screen with this in the logs:

Notice: Undefined offset: 0 in Drupal\islandora_text_extraction\Plugin\Condition\GeneratePdfTextCondition->evaluateEntity() (line 46 of /var/www/html/drupal/web/modules/contrib/islandora_text_extraction/src/Plugin/Condition/GeneratePdfTextCondition.php)

Natkeeran · 2019-08-30T18:14:44Z

I was looking to test this as well. Please modify the instructions/playbook to include the text-extraction play. I ran vagrant up, then realized that I did not have the text-extraction play. I will hold off until Danny's comments are addressed.

Currently, only one model can be chosen. Thus, not sure how to tag a node "as both Image and Digital document."

Questions

I assume tagging something as "Image and Digital Document" will trigger text extraction, which is OK. But if we want to skip text extraction for an individual image or document, how can we do that?
It would be better to let the user choose whether to run OCR, and then select which service to use. We can provide tesseract-ocr as the default, but an option to add other endpoints would be useful, as tesseract-ocr does not support many languages well.
How does this info get indexed in Solr? That is, if we search for a word found in the extracted text, how would the search bring back the node?

Thank you.

ajstanley · 2019-08-30T18:37:09Z

@Natkeeran - Good questions!
You have to change the field definitions for the Repository Item content type to allow more than one tag. We had to do this because some images will be OCR'ed and some will not, so double tagging lets a node pick up the behaviours of both images and digital documents.
(Ultimately I'd like to see a list of microservices as a taxonomy, and on any given ingest you'd select the ones you want, but for now we're relying on media use tags.)
If we come up with other text extraction methods or microservices, we'll just have to create actions/contexts to trigger those instead of these.
Both OCR and PDF text extraction create a media with an 'editable_text' field. Index that field like any other in the solr configuration and it becomes solr searchable.

dannylamb · 2019-09-03T19:15:47Z

Linking to Islandora/documentation#932

dannylamb · 2019-09-03T19:16:02Z

Testing and review of this PR to be done as part of our paged content sprint!

dbernstein · 2019-09-04T20:31:50Z

@ajstanley, @dannylamb: I'm testing the PR and running into some issues. I tried to run the test, but after modifying the "Model" field to allow the user to check both Image and Digital Document, Drupal said that it could not make the change to the database. I tried to walk through the steps anyway, but the OCR file was not generated. So I'm guessing Drupal was being serious about not being able to update the database.

A few requests/questions:

Can you update the PR so that it includes the necessary changes to the Content model?
If for some reason that is not possible, can you detail the steps for bringing the database into alignment to enable this to work?
Is it necessary for me to modify any of the playbooks or vars to get this to work? If so, shouldn't the PR be updated first before trying to test it?

ajstanley · 2019-09-05T11:50:26Z

@dbernstein - You should be able to navigate to
http://localhost:8000/admin/structure/types/manage/islandora_object/fields/node.islandora_object.field_model/storage
and change allowed number of values to unlimited.
(You might get a message saying the field can't be altered, but just ignore that. It doesn't know what it's talking about).
You might want to change the widget on the form display to checkbox, but it's not (strictly speaking) necessary.
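
The same change could also be captured in exported configuration, so testers don't have to click through the UI. A minimal sketch, assuming the field storage is exported as `field.storage.node.field_model.yml`; only the relevant keys are shown. In Drupal field storage config, a `cardinality` of `-1` means unlimited.

```yaml
# field.storage.node.field_model.yml (fragment; other keys omitted)
id: node.field_model
field_name: field_model
entity_type: node
# -1 is Drupal's value for "unlimited" allowed values.
cardinality: -1
```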

Natkeeran · 2019-09-06T19:25:12Z

Getting the following error:
The website encountered an unexpected error. Please try again later.

Notice: Undefined offset: 0 in Drupal\islandora_text_extraction\Plugin\Condition\GeneratePdfTextCondition->evaluateEntity() (line 46 of /var/www/html/drupal/web/modules/contrib/islandora_text_extraction/src/Plugin/Condition/GeneratePdfTextCondition.php)
#0 /var/www/html/drupal/web/core/includes/bootstrap.inc(587): _drupal_error_handler_real(8, 'Undefined offse...', '/var/www/html/d...', 46, Array)
#1 /var/www/html/drupal/web/modules/contrib/islandora_text_extraction/src/Plugin/Condition/GeneratePdfTextCondition.php(46): _drupal_error_handler(8, 'Undefined offse...', '/var/www/html/d...', 46, Array)
#2 /var/www/html/drupal/web/modules/contrib/islandora_text_extraction/src/Plugin/Condition/GeneratePdfTextCondition.php(33): Drupal\islandora_text_extraction\Plugin\Condition\GeneratePdfTextCondition->evaluateEntity(Object(Drupal\media\Entity\Media))
#3 /var/www/html/drupal/web/core/lib/Drupal/Core/Condition/ConditionManager.php(74): Drupal\islandora_text_extraction\Plugin\Condition\GeneratePdfTextCondition->evaluate()
#4 /var/www/html/drupal/web/core/lib/Drupal/Core/Condition/ConditionPluginBase.php(84): Drupal\Core\Condition\ConditionManager->execute(Object(Drupal\islandora_text_extraction\Plugin\Condition\GeneratePdfTextCondition))
#5 /var/www/html/drupal/web/core/lib/Drupal/Core/Condition/ConditionAccessResolverTrait.php(26): Drupal\Core\Condition\ConditionPluginBase->execute()
#6 /var/www/html/drupal/web/modules/contrib/islandora/src/IslandoraContextManager.php(64): Drupal\context\ContextManager->resolveConditions(Object(Drupal\Core\Condition\ConditionPluginCollection), 'and')
#7 /var/www/html/drupal/web/modules/contrib/islandora/src/IslandoraContextManager.php(28): Drupal\islandora\IslandoraContextManager->evaluateContextConditions(Object(Drupal\context\Entity\Context), Array)
#8 /var/www/html/drupal/web/modules/contrib/context/src/ContextManager.php(189): Drupal\islandora\IslandoraContextManager->evaluateContexts()
#9 /var/www/html/drupal/web/modules/contrib/context/src/ContextManager.php(220): Drupal\context\ContextManager->getActiveContexts()
#10 /var/www/html/drupal/web/modules/contrib/islandora/islandora.module(315): Drupal\context\ContextManager->getActiveReactions('\\Drupal\\islando...')
#11 /var/www/html/drupal/web/core/lib/Drupal/Core/Extension/ModuleHandler.php(539): islandora_entity_form_display_alter(Object(Drupal\Core\Entity\Entity\EntityFormDisplay), Array, NULL)
#12 /var/www/html/drupal/web/core/lib/Drupal/Core/Entity/Entity/EntityFormDisplay.php(124): Drupal\Core\Extension\ModuleHandler->alter('entity_form_dis...', Object(Drupal\Core\Entity\Entity\EntityFormDisplay), Array)
#13 /var/www/html/drupal/web/core/lib/Drupal/Core/Entity/ContentEntityForm.php(288): Drupal\Core\Entity\Entity\EntityFormDisplay::collectRenderDisplay(Object(Drupal\media\Entity\Media), 'add')
#14 /var/www/html/drupal/web/core/lib/Drupal/Core/Entity/EntityForm.php(107): Drupal\Core\Entity\ContentEntityForm->init(Object(Drupal\Core\Form\FormState))
#15 [internal function]: Drupal\Core\Entity\EntityForm->buildForm(Array, Object(Drupal\Core\Form\FormState))
#16 /var/www/html/drupal/web/core/lib/Drupal/Core/Form/FormBuilder.php(519): call_user_func_array(Array, Array)
#17 /var/www/html/drupal/web/core/lib/Drupal/Core/Form/FormBuilder.php(276): Drupal\Core\Form\FormBuilder->retrieveForm('media_file_add_...', Object(Drupal\Core\Form\FormState))
#18 /var/www/html/drupal/web/core/lib/Drupal/Core/Controller/FormController.php(93): Drupal\Core\Form\FormBuilder->buildForm('media_file_add_...', Object(Drupal\Core\Form\FormState))
#19 [internal function]: Drupal\Core\Controller\FormController->getContentResult(Object(Symfony\Component\HttpFoundation\Request), Object(Drupal\Core\Routing\RouteMatch))
#20 /var/www/html/drupal/web/core/lib/Drupal/Core/EventSubscriber/EarlyRenderingControllerWrapperSubscriber.php(123): call_user_func_array(Array, Array)
#21 /var/www/html/drupal/web/core/lib/Drupal/Core/Render/Renderer.php(582): Drupal\Core\EventSubscriber\EarlyRenderingControllerWrapperSubscriber->Drupal\Core\EventSubscriber\{closure}()
#22 /var/www/html/drupal/web/core/lib/Drupal/Core/EventSubscriber/EarlyRenderingControllerWrapperSubscriber.php(124): Drupal\Core\Render\Renderer->executeInRenderContext(Object(Drupal\Core\Render\RenderContext), Object(Closure))
#23 /var/www/html/drupal/web/core/lib/Drupal/Core/EventSubscriber/EarlyRenderingControllerWrapperSubscriber.php(97): Drupal\Core\EventSubscriber\EarlyRenderingControllerWrapperSubscriber->wrapControllerExecutionInRenderContext(Array, Array)
#24 /var/www/html/drupal/vendor/symfony/http-kernel/HttpKernel.php(151): Drupal\Core\EventSubscriber\EarlyRenderingControllerWrapperSubscriber->Drupal\Core\EventSubscriber\{closure}()
#25 /var/www/html/drupal/vendor/symfony/http-kernel/HttpKernel.php(68): Symfony\Component\HttpKernel\HttpKernel->handleRaw(Object(Symfony\Component\HttpFoundation\Request), 1)
#26 /var/www/html/drupal/web/core/lib/Drupal/Core/StackMiddleware/Session.php(57): Symfony\Component\HttpKernel\HttpKernel->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#27 /var/www/html/drupal/web/core/lib/Drupal/Core/StackMiddleware/KernelPreHandle.php(47): Drupal\Core\StackMiddleware\Session->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#28 /var/www/html/drupal/web/core/modules/page_cache/src/StackMiddleware/PageCache.php(106): Drupal\Core\StackMiddleware\KernelPreHandle->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#29 /var/www/html/drupal/web/core/modules/page_cache/src/StackMiddleware/PageCache.php(85): Drupal\page_cache\StackMiddleware\PageCache->pass(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#30 /var/www/html/drupal/web/core/lib/Drupal/Core/StackMiddleware/ReverseProxyMiddleware.php(47): Drupal\page_cache\StackMiddleware\PageCache->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#31 /var/www/html/drupal/web/core/lib/Drupal/Core/StackMiddleware/NegotiationMiddleware.php(52): Drupal\Core\StackMiddleware\ReverseProxyMiddleware->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#32 /var/www/html/drupal/vendor/stack/builder/src/Stack/StackedHttpKernel.php(23): Drupal\Core\StackMiddleware\NegotiationMiddleware->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#33 /var/www/html/drupal/web/core/lib/Drupal/Core/DrupalKernel.php(693): Stack\StackedHttpKernel->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#34 /var/www/html/drupal/web/index.php(19): Drupal\Core\DrupalKernel->handle(Object(Symfony\Component\HttpFoundation\Request))
#35 {main}.

Also, wondering if it is passing any language parameters to tesseract.

dannylamb · 2019-09-19T17:10:02Z

Superseded by #140

DonRichards and others added 5 commits June 20, 2019 10:12

Removes vm's audio (#113)

555cf97

Update readme and beginners docs to de-CLAW (#116)

234642d

* Update README.md De-claw * Update forBeginners.md de-CLAW * Update README.md

merge conflicts

9e4f15b

added two new variables

d923c02

text extraction

d6daf54

dannylamb reviewed Aug 28, 2019

dannylamb closed this Sep 19, 2019
