@ajstanley
Contributor

What does this Pull Request do?

Adds text extraction module to playbook.

What's new?

With text extraction in place, images run through OCR will have their text extracted and put into an editable media.
Any PDF media tagged as Original File will also have its text extracted into an editable media.

How should this be tested?

After the playbook is spun up, create a node tagged as both Image and Digital Document.
Add an image media (containing text) tagged as Original File.
An Extracted Text media should be created and attached to the node.

Create another node and attach a media tagged as Original File with a media type of application/pdf.
An Extracted Text media should be created and attached to the node.

Interested parties

@Islandora-Devops/committers

update_cache: yes
when: ansible_os_family == "Redhat"

- name: Download Islandora Text Extraction module

@dannylamb
Member

Running vagrant up to test. I added the text-extraction play to the main play so that it gets triggered.
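
For other testers, a minimal sketch of what adding the play to the main playbook could look like. This is an assumption rather than the change that was actually made: the file name text-extraction.yml is hypothetical, and the real play in this PR may be wired in differently.

```yaml
# playbook.yml (sketch only; text-extraction.yml is a hypothetical file name)
# import_playbook pulls another playbook into the main run, so its plays
# execute as part of normal provisioning.
- import_playbook: text-extraction.yml
```

With a line like this in place, a fresh vagrant up should provision the text-extraction pieces along with everything else.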

@dannylamb
Member

The box provisions OK, and I can confirm that the listeners are deploying, as is the islandora_text_extraction module. But when uploading a PDF, I get a white screen with this in the logs:

Notice: Undefined offset: 0 in Drupal\islandora_text_extraction\Plugin\Condition\GeneratePdfTextCondition->evaluateEntity() (line 46 of /var/www/html/drupal/web/modules/contrib/islandora_text_extraction/src/Plugin/Condition/GeneratePdfTextCondition.php)

@Natkeeran
Contributor

Natkeeran commented Aug 30, 2019

I was looking to test this as well. Please modify the instructions/playbook to include the text-extraction play. I ran vagrant up, then realized that I did not have text extraction. I will hold off until Danny's comments are addressed.

Currently, only one model can be chosen, so I am not sure how to tag a node "as both Image and Digital document."

Questions

  • I assume tagging something as "Image and Digital Document" will trigger text extraction, which is OK. But if we want to skip text extraction for an individual image or document, how can we do that?

  • It would be better to let the user choose whether to run OCR, and then select which service to use. We can provide tesseract-ocr as the default, but an option to add other endpoints would be useful, as tesseract-ocr does not support many languages well.

  • How does this info get indexed in Solr? That is, if we search for a word found in the extracted text, how would the search bring back the node?

Thank you.

@ajstanley
Contributor Author

@Natkeeran - Good questions!
You have to change the field definitions for the repository item content type to allow more than one tag. We had to do this because some images will be OCR'ed and some will not, so double tagging lets us apply the behaviours we want to both images and digital documents.
(Ultimately I'd like to see a list of microservices as a taxonomy, and on any given ingest you'd select the ones you want, but for now we're relying on media use tags.)
If we come up with other text extraction methods or microservices, we'll just have to create actions and contexts to trigger those instead of these.
Both OCR and PDF text extraction create a media with an 'editable_text' field. Index that field like any other in the Solr configuration and it becomes searchable in Solr.

@dannylamb
Member

Linking to Islandora/documentation#932

@dannylamb
Member

Testing and review of this PR to be done as part of our paged content sprint!

@dbernstein
Contributor

@ajstanley, @dannylamb: I'm testing the PR and running into some issues. I tried to run the test, but after I modified the "Model" field to allow the user to check both Image and Digital Document, Drupal said that it could not make the change to the database. I tried to walk through the steps anyway, but the OCR file was not generated. So I'm guessing Drupal was being serious about not being able to update the database.

A few requests/questions:

  1. Can you update the PR so that it includes the necessary changes to the Content model?
  2. If for some reason that is not possible, can you detail the steps for bringing the database into alignment to enable this to work?
  3. Is it necessary for me to modify any of the playbooks or vars to get this to work? If so, shouldn't the PR be updated first before trying to test it?

@ajstanley
Contributor Author

ajstanley commented Sep 5, 2019

@dbernstein - You should be able to navigate to
http://localhost:8000/admin/structure/types/manage/islandora_object/fields/node.islandora_object.field_model/storage
and change allowed number of values to unlimited.
(You might get a message saying the field can't be altered, but just ignore that. It doesn't know what it's talking about).
You might want to change the widget on the form display to checkboxes, but it's not (strictly speaking) necessary.
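
In exported configuration, the UI change described above would presumably show up as the cardinality key of the field's storage config. The fragment below is a sketch under that assumption; the file name follows Drupal's field.storage.ENTITY_TYPE.FIELD_NAME pattern and is inferred from the URL above, not taken from this PR.

```yaml
# field.storage.node.field_model.yml (fragment; only the relevant key is shown)
# In Drupal field storage config, -1 means "unlimited" values;
# a single-value field has cardinality: 1.
cardinality: -1
```

Shipping a change like this with the PR would address the request to include the content model changes, though whether Drupal accepts the update on a field that already holds data is a separate question.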

@Natkeeran
Contributor

Getting the following error:
The website encountered an unexpected error. Please try again later.

Notice: Undefined offset: 0 in Drupal\islandora_text_extraction\Plugin\Condition\GeneratePdfTextCondition->evaluateEntity() (line 46 of /var/www/html/drupal/web/modules/contrib/islandora_text_extraction/src/Plugin/Condition/GeneratePdfTextCondition.php)
#0 /var/www/html/drupal/web/core/includes/bootstrap.inc(587): _drupal_error_handler_real(8, 'Undefined offse...', '/var/www/html/d...', 46, Array)
#1 /var/www/html/drupal/web/modules/contrib/islandora_text_extraction/src/Plugin/Condition/GeneratePdfTextCondition.php(46): _drupal_error_handler(8, 'Undefined offse...', '/var/www/html/d...', 46, Array)
#2 /var/www/html/drupal/web/modules/contrib/islandora_text_extraction/src/Plugin/Condition/GeneratePdfTextCondition.php(33): Drupal\islandora_text_extraction\Plugin\Condition\GeneratePdfTextCondition->evaluateEntity(Object(Drupal\media\Entity\Media))
#3 /var/www/html/drupal/web/core/lib/Drupal/Core/Condition/ConditionManager.php(74): Drupal\islandora_text_extraction\Plugin\Condition\GeneratePdfTextCondition->evaluate()
#4 /var/www/html/drupal/web/core/lib/Drupal/Core/Condition/ConditionPluginBase.php(84): Drupal\Core\Condition\ConditionManager->execute(Object(Drupal\islandora_text_extraction\Plugin\Condition\GeneratePdfTextCondition))
#5 /var/www/html/drupal/web/core/lib/Drupal/Core/Condition/ConditionAccessResolverTrait.php(26): Drupal\Core\Condition\ConditionPluginBase->execute()
#6 /var/www/html/drupal/web/modules/contrib/islandora/src/IslandoraContextManager.php(64): Drupal\context\ContextManager->resolveConditions(Object(Drupal\Core\Condition\ConditionPluginCollection), 'and')
#7 /var/www/html/drupal/web/modules/contrib/islandora/src/IslandoraContextManager.php(28): Drupal\islandora\IslandoraContextManager->evaluateContextConditions(Object(Drupal\context\Entity\Context), Array)
#8 /var/www/html/drupal/web/modules/contrib/context/src/ContextManager.php(189): Drupal\islandora\IslandoraContextManager->evaluateContexts()
#9 /var/www/html/drupal/web/modules/contrib/context/src/ContextManager.php(220): Drupal\context\ContextManager->getActiveContexts()
#10 /var/www/html/drupal/web/modules/contrib/islandora/islandora.module(315): Drupal\context\ContextManager->getActiveReactions('\\Drupal\\islando...')
#11 /var/www/html/drupal/web/core/lib/Drupal/Core/Extension/ModuleHandler.php(539): islandora_entity_form_display_alter(Object(Drupal\Core\Entity\Entity\EntityFormDisplay), Array, NULL)
#12 /var/www/html/drupal/web/core/lib/Drupal/Core/Entity/Entity/EntityFormDisplay.php(124): Drupal\Core\Extension\ModuleHandler->alter('entity_form_dis...', Object(Drupal\Core\Entity\Entity\EntityFormDisplay), Array)
#13 /var/www/html/drupal/web/core/lib/Drupal/Core/Entity/ContentEntityForm.php(288): Drupal\Core\Entity\Entity\EntityFormDisplay::collectRenderDisplay(Object(Drupal\media\Entity\Media), 'add')
#14 /var/www/html/drupal/web/core/lib/Drupal/Core/Entity/EntityForm.php(107): Drupal\Core\Entity\ContentEntityForm->init(Object(Drupal\Core\Form\FormState))
#15 [internal function]: Drupal\Core\Entity\EntityForm->buildForm(Array, Object(Drupal\Core\Form\FormState))
#16 /var/www/html/drupal/web/core/lib/Drupal/Core/Form/FormBuilder.php(519): call_user_func_array(Array, Array)
#17 /var/www/html/drupal/web/core/lib/Drupal/Core/Form/FormBuilder.php(276): Drupal\Core\Form\FormBuilder->retrieveForm('media_file_add_...', Object(Drupal\Core\Form\FormState))
#18 /var/www/html/drupal/web/core/lib/Drupal/Core/Controller/FormController.php(93): Drupal\Core\Form\FormBuilder->buildForm('media_file_add_...', Object(Drupal\Core\Form\FormState))
#19 [internal function]: Drupal\Core\Controller\FormController->getContentResult(Object(Symfony\Component\HttpFoundation\Request), Object(Drupal\Core\Routing\RouteMatch))
#20 /var/www/html/drupal/web/core/lib/Drupal/Core/EventSubscriber/EarlyRenderingControllerWrapperSubscriber.php(123): call_user_func_array(Array, Array)
#21 /var/www/html/drupal/web/core/lib/Drupal/Core/Render/Renderer.php(582): Drupal\Core\EventSubscriber\EarlyRenderingControllerWrapperSubscriber->Drupal\Core\EventSubscriber\{closure}()
#22 /var/www/html/drupal/web/core/lib/Drupal/Core/EventSubscriber/EarlyRenderingControllerWrapperSubscriber.php(124): Drupal\Core\Render\Renderer->executeInRenderContext(Object(Drupal\Core\Render\RenderContext), Object(Closure))
#23 /var/www/html/drupal/web/core/lib/Drupal/Core/EventSubscriber/EarlyRenderingControllerWrapperSubscriber.php(97): Drupal\Core\EventSubscriber\EarlyRenderingControllerWrapperSubscriber->wrapControllerExecutionInRenderContext(Array, Array)
#24 /var/www/html/drupal/vendor/symfony/http-kernel/HttpKernel.php(151): Drupal\Core\EventSubscriber\EarlyRenderingControllerWrapperSubscriber->Drupal\Core\EventSubscriber\{closure}()
#25 /var/www/html/drupal/vendor/symfony/http-kernel/HttpKernel.php(68): Symfony\Component\HttpKernel\HttpKernel->handleRaw(Object(Symfony\Component\HttpFoundation\Request), 1)
#26 /var/www/html/drupal/web/core/lib/Drupal/Core/StackMiddleware/Session.php(57): Symfony\Component\HttpKernel\HttpKernel->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#27 /var/www/html/drupal/web/core/lib/Drupal/Core/StackMiddleware/KernelPreHandle.php(47): Drupal\Core\StackMiddleware\Session->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#28 /var/www/html/drupal/web/core/modules/page_cache/src/StackMiddleware/PageCache.php(106): Drupal\Core\StackMiddleware\KernelPreHandle->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#29 /var/www/html/drupal/web/core/modules/page_cache/src/StackMiddleware/PageCache.php(85): Drupal\page_cache\StackMiddleware\PageCache->pass(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#30 /var/www/html/drupal/web/core/lib/Drupal/Core/StackMiddleware/ReverseProxyMiddleware.php(47): Drupal\page_cache\StackMiddleware\PageCache->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#31 /var/www/html/drupal/web/core/lib/Drupal/Core/StackMiddleware/NegotiationMiddleware.php(52): Drupal\Core\StackMiddleware\ReverseProxyMiddleware->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#32 /var/www/html/drupal/vendor/stack/builder/src/Stack/StackedHttpKernel.php(23): Drupal\Core\StackMiddleware\NegotiationMiddleware->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#33 /var/www/html/drupal/web/core/lib/Drupal/Core/DrupalKernel.php(693): Stack\StackedHttpKernel->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#34 /var/www/html/drupal/web/index.php(19): Drupal\Core\DrupalKernel->handle(Object(Symfony\Component\HttpFoundation\Request))
#35 {main}.

Also, wondering if it is passing any language parameters to tesseract.

@dannylamb
Member

Superseded by #140

@dannylamb closed this Sep 19, 2019