Provided seed files are updated (the more the job is repeated, the more they are modified) #558

@cgr71ii

Hi!

I'm running several crawls with the same seed file, but I noticed that Heritrix appends lines to this file, modifying it in place. Could this be avoided, for example by copying the seed file to the job directory and modifying that copy instead?

Seed-related configuration:

 <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="textSource">
   <bean class="org.archive.spring.ConfigFile">
    <property name="path" value="/path/to/seeds" />
   </bean>
  </property>
  <property name='sourceTagSeeds' value='false'/>
  <property name='blockAwaitingSeedLines' value='-1'/>
 </bean>

The problem is that the file started with 51451 lines and now has 1083095 after being reused maybe 20 times. This slows down initialization, but even worse, initialization is different after each crawl. Some of my seeds redirect to another website, or to a specific resource on the same website (not only the common redirect from http to https, which I guess is the reason this feature was implemented). Each redirect target is recorded in the seed file, and in the next crawl job that target may redirect again to yet another location. So, in the end, my seed file keeps gaining new seeds that I hadn't noticed before.
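In the meantime, the copy-based workaround mentioned above could be done outside Heritrix. Below is a minimal sketch, assuming the job's `textSource` path is pointed at a per-job copy rather than at the master file. The function names (`file_digest`, `stage_seed_copy`) and the file names `master-seeds.txt` and `job-1` are hypothetical and only for illustration; nothing here is part of Heritrix itself.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stage_seed_copy(master: Path, job_dir: Path, name: str = "seeds.txt") -> Path:
    """Copy the pristine seed list into job_dir and return the copy's path.

    The crawler would then be configured to read (and append to) the copy,
    so the master file is never touched.
    """
    job_dir.mkdir(parents=True, exist_ok=True)
    working = job_dir / name
    shutil.copyfile(master, working)
    return working


if __name__ == "__main__":
    # Demo in a throwaway directory; the paths are placeholders.
    with tempfile.TemporaryDirectory() as tmp:
        master = Path(tmp) / "master-seeds.txt"
        master.write_text("http://example.com/\n")
        before = file_digest(master)

        working = stage_seed_copy(master, Path(tmp) / "job-1")

        # Simulate the crawler appending a redirect target to its seed file.
        with working.open("a") as fh:
            fh.write("https://example.com/\n")

        # The master list is unchanged; only the per-job copy grew.
        assert file_digest(master) == before
        print("master unchanged, working copy at", working)
```

Running this before each job launch would give every crawl the same starting seed list, at the cost of one extra file per job.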

Thank you!
