feat: working parallel processing

ronaldtse · ronaldtse · commit 7aeda310c719 · 2025-05-07T09:23:10.000+08:00
diff --git a/README.adoc b/README.adoc
@@ -83,8 +83,8 @@ Available subcommands:
 ** `--manifest-path PATH` - Path to schemas.yml manifest file
 ** `--cache-dir DIR` - Directory for caching downloaded tools
 ** `--log-dir DIR` - Directory for storing log files
-** `--parallel` / `--no-parallel` - Enable/disable parallel processing with Ractors
-** `--ractors NUM` - Number of parallel ractors to use (default: auto-configured)
+** `--parallel` / `--no-parallel` - Enable/disable parallel processing with Fractor (default: enabled)
+** `--workers NUM` - Number of parallel workers to use (default: auto-configured)
 * `clean` - Remove generated documentation
 * `distclean` - Remove generated documentation and downloaded tools
 ** `--global-cache` - Also clean the global cache directory
@@ -99,8 +99,8 @@ bundle exec hrma build documentation
 # Generate documentation with custom manifest file
 bundle exec hrma build documentation --manifest-path=custom-schemas.yml
 
-# Generate documentation with 4 ractors
-bundle exec hrma build documentation --ractors=4
+# Generate documentation with 4 workers
+bundle exec hrma build documentation --workers=4
 
 # Generate documentation without parallel processing
 bundle exec hrma build documentation --no-parallel
@@ -158,36 +158,35 @@ bundle exec hrma config set cache_dir /path/to/cache
 
 === Parallel processing
 
-The tool supports parallel processing using Ruby's Ractor feature. This
-significantly speeds up documentation generation for large numbers of schema
-files.
+The tool supports parallel processing using Ruby's Ractor feature, accessed through the Fractor framework.
+This can significantly speed up documentation generation when there are many schema files.
 
-By default, the tool automatically determines the optimal number of ractors to
+By default, the tool automatically determines the optimal number of workers to
 use based on your system resources:
 
-* In "auto" mode (default), the number of ractors is determined by:
+* In "auto" mode (default), the number of workers is determined by:
 ** Using half of your CPU cores (rounded down)
 ** Ensuring at least 2 cores are left free for system processes
-** Using at least 1 ractor
-** Using one ractor per file when possible (up to the calculated maximum)
+** Using at least 1 worker
+** Using one worker per file when possible (up to the calculated maximum)
 
 This auto-configuration provides a good balance between performance and system
 responsiveness.
 
 [example]
 ====
-* With 4 files on a 4-core system: 1 ractor would be used (half cores = 2, but ensuring 2 cores are free = 1)
-* With 4 files on an 8-core system: 4 ractors would be used (half cores = 4, which leaves enough free cores)
-* With 4 files on a 16-core system: 4 ractors would be used (one per file, even though 8 ractors would be available)
-* With 10 files on a 16-core system: 8 ractors would be used (half cores = 8, which is less than file count)
+* With 4 files on a 4-core system: 1 worker would be used (half cores = 2, but ensuring 2 cores are free = 1)
+* With 4 files on an 8-core system: 4 workers would be used (half cores = 4, which leaves enough free cores)
+* With 4 files on a 16-core system: 4 workers would be used (one per file, even though 8 workers would be available)
+* With 10 files on a 16-core system: 8 workers would be used (half cores = 8, which is less than file count)
 ====
 
-You can manually specify the number of ractors:
+You can manually specify the number of workers:
 
 [source,sh]
 ----
-# Use 4 ractors for parallel processing
-bundle exec hrma build documentation --ractors=4
+# Use 4 workers for parallel processing
+bundle exec hrma build documentation --workers=4
 ----
 
 To disable parallel processing entirely:
@@ -196,9 +195,6 @@ To disable parallel processing entirely:
 ----
 # Disable parallel processing
 bundle exec hrma build documentation --no-parallel
-
-# Alternative method
-HRMA_DISABLE_RACTORS=1 bundle exec hrma build documentation
 ----
 
 
@@ -215,8 +211,9 @@ The `hrma` tool is organized into several components:
 === Build system
 
 * `lib/hrma/build/document_generator.rb` - Main class for generating documentation
-* `lib/hrma/build/ractor_document_processor.rb` - Processor for XSD files that can run within a Ractor
-* `lib/hrma/build/documentation.rb` - Module with documentation generation utilities
+* `lib/hrma/build/schema_processor.rb` - Processes individual schema files
+* `lib/hrma/build/schema_work.rb` - Work item representation for parallel processing
+* `lib/hrma/build/schema_worker.rb` - Worker implementation for parallel processing
 * `lib/hrma/build/tools.rb` - Handles downloading and setting up external tools
 * `lib/hrma/build/cleaner.rb` - Handles cleaning generated files
 
diff --git a/hrma.gemspec b/hrma.gemspec
@@ -19,4 +19,5 @@ Gem::Specification.new do |spec|
   spec.add_dependency "thor", "~> 1.2"
   spec.add_dependency "rake", "~> 13.0"
   spec.add_dependency "ruby-progressbar", "~> 1.13"
+  spec.add_dependency "fractor", "~> 0.1"
 end
diff --git a/lib/hrma/README.adoc b/lib/hrma/README.adoc
@@ -0,0 +1,77 @@
+= HRMA Library Documentation
+
+== Overview
+
+The HRMA (Harmonized Resources Maintenance Agency) library provides tools for managing ISO/TC 211 schemas and generating documentation. This document describes the internal structure and components of the library.
+
+== Directory Structure
+
+* `lib/hrma/` - Root directory for the HRMA library
+** `build/` - Documentation generation components
+** `commands/` - Command implementations for the CLI
+** `cli.rb` - Main CLI class
+** `config.rb` - Configuration management
+** `version.rb` - Version information
+
+== Key Components
+
+=== Build System
+
+The build system is responsible for generating documentation from schema files:
+
+* `build/document_generator.rb` - Main class for generating documentation
+* `build/schema_processor.rb` - Processes individual schema files
+* `build/schema_work.rb` - Work item representation for parallel processing
+* `build/schema_worker.rb` - Worker implementation for parallel processing
+* `build/tools.rb` - Handles downloading and setting up external tools
+* `build/cleaner.rb` - Handles cleaning generated files
+
+=== Parallel Processing
+
+The library uses Ruby's Ractor feature for parallel processing of schema files:
+
+* Work is distributed across multiple Ractors using the Fractor framework
+* Each schema file is a separate work item, processed by a worker running in its own Ractor
+* Results are collected and aggregated
+* The number of Ractors is configurable or auto-detected based on system resources
+
+=== Commands
+
+The command system provides the command-line interface:
+
+* `commands/build.rb` - Commands for building documentation
+* `commands/schemas.rb` - Commands for managing schemas
+* `commands/config.rb` - Commands for managing configuration
+
+=== Configuration
+
+Configuration is managed through:
+
+* Command-line options
+* Environment variables
+* Configuration file (`~/.hrma/config.yml`)
+
+== Development
+
+=== Adding New Features
+
+When adding new features:
+
+1. Identify the appropriate component to modify
+2. Update tests to cover the new functionality
+3. Update documentation (including this README)
+4. Update the main README.adoc if the feature affects user-facing functionality
+
+=== Parallel Processing
+
+The parallel processing system uses the Fractor framework to distribute work across multiple Ractors:
+
+1. `SchemaWork` objects represent individual schema files to process
+2. `SchemaWorker` processes each work item in a separate Ractor
+3. `DocumentGenerator` coordinates the workers and collects results
+
+When modifying the parallel processing system:
+
+* Ensure all objects passed between Ractors are shareable
+* Handle errors appropriately to prevent worker crashes
+* Consider the impact on memory usage and system resources
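
One way to check the shareability guideline above is Ruby's own `Ractor.shareable?` and `Ractor.make_shareable` (Ruby 3.0 or later). The sketch below is illustrative only: `shareable_payload` is a hypothetical helper, not something defined in this commit, and the payload keys simply mirror the hash passed to `SchemaWork.new`.

```ruby
# Hypothetical helper, not part of the commit: build a work payload that is
# safe to hand to another Ractor.
def shareable_payload(schema_path:, log_file: nil)
  # Ractor.make_shareable deep-freezes the hash and the strings inside it.
  Ractor.make_shareable({ schema_path: schema_path.dup, log_file: log_file&.dup })
end

mutable = { schema_path: String.new("schemas/example.xsd"), log_file: nil }
puts Ractor.shareable?(mutable)  # prints false: the hash and string are mutable

payload = shareable_payload(schema_path: "schemas/example.xsd")
puts Ractor.shareable?(payload)  # prints true: everything is deep-frozen
```

Ruby deep-copies unshareable objects when they are sent to another Ractor, so plain strings and hashes still cross the boundary; freezing them up front avoids the copy and makes accidental mutation impossible.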
diff --git a/lib/hrma/build/document_generator.rb b/lib/hrma/build/document_generator.rb
@@ -4,8 +4,12 @@
 require "fileutils"
 require "ruby-progressbar"
 require "logger"
+require "etc"
+require "fractor"
 require_relative "../config"
 require_relative "schema_processor"
+require_relative "schema_work"
+require_relative "schema_worker"
 
 module Hrma
   module Build
@@ -35,11 +39,80 @@ def generate
         puts "Found #{xsd_files.size} XSD files to process"
         @progressbar = create_progressbar(xsd_files.size)
 
-        generate_sequential(xsd_files)
+        if options[:parallel] == false
+          generate_sequential(xsd_files)
+        else
+          generate_parallel(xsd_files)
+        end
 
         puts "\nDocumentation generation complete. See _site/ directory."
       end
 
+      # Generate documentation in parallel using Fractor
+      #
+      # @param xsd_files [Array<String>] List of XSD files to process
+      # @return [void]
+      def generate_parallel(xsd_files)
+        puts "Generating documentation in parallel..."
+
+        # Determine number of workers: the explicit :workers option, or one per file capped at the CPU count
+        num_workers = options[:workers] || [xsd_files.size, Etc.nprocessors].min
+        puts "Using #{num_workers} parallel workers"
+
+        # Create work items - each item contains just the basic string path and log file path
+        work_items = xsd_files.map do |xsd_file|
+          # Create log file path for this file if log_dir is specified
+          log_file = nil
+          if @log_dir
+            log_file_name = "#{File.basename(xsd_file, '.xsd')}.log"
+            log_file = File.join(@log_dir, log_file_name)
+            FileUtils.mkdir_p(File.dirname(log_file))
+          end
+
+          # Use the original string path directly - no nested objects
+          SchemaWork.new({
+            schema_path: xsd_file,  # This is just a string
+            log_file: log_file
+          })
+        end
+
+        # Create supervisor with worker pools
+        supervisor = Fractor::Supervisor.new(
+          worker_pools: [
+            { worker_class: SchemaWorker, num_workers: num_workers }
+          ]
+        )
+
+        # Add work items
+        supervisor.add_work_items(work_items)
+
+        # Run processing
+        supervisor.run
+
+        # Process results
+        process_results(supervisor.results)
+      end
+
+      # Process results from parallel processing
+      #
+      # @param aggregator [Fractor::ResultAggregator] Result aggregator
+      # @return [void]
+      def process_results(aggregator)
+        # Handle successful results
+        aggregator.results.each do |result|
+          schema_path = result.work.input[:schema_path]
+          puts "Successfully processed #{schema_path}"
+          progressbar.increment
+        end
+
+        # Handle errors
+        aggregator.errors.each do |error_result|
+          schema_path = error_result.work.input[:schema_path]
+          puts "Error processing #{schema_path}: #{error_result.error}"
+          progressbar.increment
+        end
+      end
+
       private
 
       # Load XSD files from schemas.yml
@@ -59,38 +132,20 @@ def load_xsd_files
         xsd_files
       end
 
-      # Generate documentation sequentially
+      # Generate documentation sequentially (using parallel processing with 1 worker)
       #
       # @param xsd_files [Array<String>] List of XSD files to process
       # @return [void]
       def generate_sequential(xsd_files)
-        puts "Generating documentation sequentially..."
-
-        # Create a schema processor
-        processor = SchemaProcessor.new
+        puts "Generating documentation sequentially (single worker)..."
 
-        # Process each file
-        xsd_files.each do |xsd_file|
-          puts "Processing: #{xsd_file}"
+        # Force a single worker by overriding the :workers option
+        options_with_one_worker = options.dup
+        options_with_one_worker[:workers] = 1
+        @options = options_with_one_worker
 
-          # Create a logger for this file if log_dir is specified
-          logger = create_logger(xsd_file) if @log_dir
-
-          # Process the file
-          result = processor.process(schema_path: xsd_file, logger: logger)
-
-          # Close the logger if it was created
-          logger&.close
-
-          # Handle the result
-          if result
-            puts "Successfully processed #{xsd_file}"
-          else
-            puts "Error processing #{xsd_file}"
-          end
-
-          progressbar.increment
-        end
+        # Use the parallel implementation with 1 worker
+        generate_parallel(xsd_files)
       end
 
       # Create a logger for a specific file
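
The worker-count logic in this file is split between `generate`, `generate_parallel`, and the `@options` reassignment in `generate_sequential`. As a hedged illustration, the same decisions can be expressed as one pure function. `resolve_worker_count` and its `cores:` keyword are hypothetical names, not part of this commit, and the clamp to a minimum of 1 is an addition of this sketch.

```ruby
require "etc"

# Hypothetical helper, not part of the commit: resolve the worker count from
# the options hash without reassigning @options.
def resolve_worker_count(options, file_count, cores: Etc.nprocessors)
  return 1 if options[:parallel] == false        # --no-parallel: single worker
  return options[:workers] if options[:workers]  # explicit --workers value wins

  # Default used in generate_parallel: one worker per file, capped at CPU count.
  # The outer max is an addition here so an empty file list still yields 1.
  [[file_count, cores].min, 1].max
end

puts resolve_worker_count({ workers: 4 }, 10, cores: 8)
puts resolve_worker_count({}, 10, cores: 8)
```

With a helper like this, `generate_parallel` could take the count as an argument, for example `resolve_worker_count(options, xsd_files.size)`, and the sequential path would not need to mutate instance state.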
diff --git a/lib/hrma/build/schema_work.rb b/lib/hrma/build/schema_work.rb
@@ -0,0 +1,30 @@
+# frozen_string_literal: true
+
+require 'fractor'
+
+module Hrma
+  module Build
+    # Class representing a work item for schema processing
+    class SchemaWork < Fractor::Work
+      attr_reader :schema_path, :log_file
+
+      # Initialize a new SchemaWork
+      #
+      # @param data [Hash] Hash containing schema_path and log_file
+      # @option data [String] :schema_path Path to the schema file
+      # @option data [String, nil] :log_file Path to log file, if any
+      def initialize(data)
+        @schema_path = data[:schema_path]
+        @log_file = data[:log_file]
+        super(data)
+      end
+
+      # Provide a readable representation of this work item
+      #
+      # @return [String] String representation
+      def to_s
+        "SchemaWork: #{@schema_path}"
+      end
+    end
+  end
+end
diff --git a/lib/hrma/build/schema_worker.rb b/lib/hrma/build/schema_worker.rb
diff --git a/lib/hrma/commands/build.rb b/lib/hrma/commands/build.rb