Ultimate Patch V2: Stabilizing CI/CD And Deployment

Alex Johnson

This article details the implementation of the "Ultimate Patch v2," a critical refactor designed to resolve significant issues within our repository. The primary goal is to consolidate all necessary fixes for CI/CD, logic, and deployment processes, ultimately achieving a stable make itest pipeline. This patch addresses a state of deadlock caused by multiple independent fixes blocking each other, resolving issues from network hangs during make setup to cascading infrastructure and logic bugs during make deploy.

Context and Genesis of the Ultimate Patch

Our HomeOps repository faced a critical juncture where various fixes, represented by pull requests such as #79, #81, #83, and #89, were mutually blocking progress. The make setup command was failing due to network issues, and the make deploy command was plagued by numerous infrastructure and logic errors, including missing Loki user configurations, missing kernel headers, SSH timeouts, and invalid validation logic. These issues collectively prevented the successful completion of the make itest pipeline, which encompasses both deployment and verification stages.

The "Ultimate Patch v2" was conceived as a "Genesis Task" to supersede all previous, fragmented attempts at resolving these issues. It aims to implement every known fix in a single, atomic, and correct sequence. This comprehensive approach is intended to break the existing "chicken-and-egg" paradoxes and enable our full make itest pipeline to pass consistently for the first time. The task focuses on the Golden Path (i.e., make itest) to ensure end-to-end functionality. Success is determined by adherence to the specifications outlined in docs/verification-spec.md. The patch requires a self-hosted runner environment with Ansible and hardware access. While the initial setup may require network access for the AI to create the patch, the final CI run is designed to be offline-capable, ensuring reliable execution in isolated environments.

The scope of the "Ultimate Patch v2" encompasses and supersedes all open pull requests related to various fixes, including VENDORIZE, FIX-LOKI-GROUP, FIX-WS01-KERNEL-HEADERS, FIX-WS01-SSH, REDEFINE-GATE2-ITEST, FIX-CICD-SYNTAX, FIX-LOKI-CONFIG, and FIX-CICD-SECURITY. This consolidation ensures that all known issues are addressed in a coordinated manner, preventing conflicts and ensuring a cohesive solution.

Requirements: Achieving a Holistic Refactor

The successful implementation of the Ultimate Patch v2 requires a holistic refactoring of the repository, integrating all identified fixes into a single, coherent pull request. This section details the specific requirements across various components of the system.

1. Core Process Refactor ("Constitutional" Amendment)

This aspect focuses on redefining the fundamental processes and workflows within the repository to ensure a streamlined and reliable CI/CD pipeline.

  • Makefile Fixes: The make itest target must be redefined as a composite action. This involves executing make setup first, followed by playbooks/deploy-observability-stack.yml to deploy the observability stack, and finally, playbooks/tests/verify_observability.yml to verify the deployment.
  • CI Workflow Adjustments: The CI workflows, specifically pr-quality-check.yml and auto-deploy...yml (or main-pipeline.yml), need significant adjustments.
    • pr-quality-check.yml should only run on pull_request events. Its Gate 2 job should be simplified to execute a single command: run: make itest. This ensures that pull requests are thoroughly tested before merging.
    • auto-deploy...yml (or main-pipeline.yml) should only run on push events to the main branch. Similarly, it must be simplified to execute run: make itest. This ensures that deployments are triggered automatically upon code merges to the main branch.
  • Documentation Updates: All references to the Golden Path in AGENTS.md and README.md must be updated to reflect the new, correct definition of make itest (Deploy + Verify). This ensures that contributors and users have accurate information about the intended workflow.
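
The two workflow adjustments described above might look roughly like the following sketch. The workflow names, job ids, the needs dependency, and the use of actions/checkout@v4 are assumptions made for illustration; the Gate 1 commands are taken from the acceptance criteria, and the second file name is left elided exactly as the requirements give it.

```yaml
# .github/workflows/pr-quality-check.yml (sketch, not the repository's actual file)
name: PR Quality Check          # assumed name
on:
  pull_request:                 # run on pull_request events only

jobs:
  gate1:                        # assumed job id
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make setup
      - run: make lint
      - run: make test

  gate2:                        # assumed job id
    needs: gate1                # assumption: Gate 2 waits for Gate 1
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: make itest         # the single Gate 2 command
---
# auto-deploy...yml (or main-pipeline.yml) (sketch)
name: Main Pipeline             # assumed name
on:
  push:
    branches: [main]            # run on pushes to main only

jobs:
  itest:                        # assumed job id
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: make itest
```

Keeping both workflows down to run: make itest means the Makefile is the only place where the Deploy + Verify sequence is defined, so the CI files and local runs cannot drift apart.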

2. Offline Setup Fix ("Bring Your Own Provisions")

This part addresses the need for an offline setup capability, allowing the system to function even without a direct internet connection during certain phases.

  • Vendor Directory Creation: A vendor/ directory must be created at the root of the repository. This directory will house all necessary dependencies for offline operation.
  • Collection Addition: The .tar.gz files for community.general and ansible.windows must be downloaded and placed in the vendor/ directory. These collections are essential for Ansible to function correctly.
  • requirements.yml Modification: The requirements.yml file must be modified to point to the local file paths within the vendor/ directory. This is achieved by specifying source: ./vendor/... and type: file for each collection. This ensures that Ansible uses the locally vendored collections instead of attempting to download them from the internet.
  • Makefile Enhancement: The $(COLLECTIONS_MARKER) target in the Makefile must be updated to pass the --offline flag to the ansible-galaxy collection install command. This forces installation from the vendored tarballs and prevents ansible-galaxy from contacting any remote server for dependencies.
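
As a rough illustration of the vendoring steps above, a requirements.yml that points at local tarballs could look like this. The tarball file names are placeholders: <version> stands for whatever versions are actually committed under vendor/, and the namespace-name-version.tar.gz pattern is assumed from how ansible-galaxy normally names downloaded collection artifacts.

```yaml
# requirements.yml (sketch)
# <version> is a placeholder; use the exact file names present in vendor/.
collections:
  - source: ./vendor/community-general-<version>.tar.gz
    type: file
  - source: ./vendor/ansible-windows-<version>.tar.gz
    type: file

# The Makefile target would then install with something along the lines of:
#   ansible-galaxy collection install -r requirements.yml --offline
# (--offline is assumed to be available, i.e. a recent ansible-core release.)
```

Because every entry is of type file, a missing or misnamed tarball shows up as an immediate local error rather than a network hang.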

3. Deploy Playbook Fixes

This section focuses on addressing specific issues within the playbooks/deploy-observability-stack.yml playbook to ensure reliable and correct deployment of the observability stack.

  • Loki Configuration: The invalid option enable_multi_variant_querier must be removed from the templates/loki-config.yaml.j2 template. This ensures that Loki is configured correctly without encountering errors due to unsupported options.
  • Loki User/Group Management: Before the Ensure Loki data directories exist task, new tasks must be added to idempotently create the loki system group (using ansible.builtin.group) and the loki system user (using ansible.builtin.user). These tasks must ensure that the loki user and group exist with the correct permissions before proceeding with the deployment.
  • dpkg Lock Contention Resolution: This addresses issues related to lock contention during package installations, which can cause deployment failures.
    • The apt_lockdown_units_common list must be expanded to include all potential lock sources, such as apt-daily*, unattended-upgrades, snapd*, packagekit*, and dkms_autoinstal*. This ensures that all potential sources of lock contention are accounted for.
    • The Fail if any apt lockdown unit error... task must use a robust when condition that ignores all known "not found/not loaded" errors (as seen in our logs) and only fails on genuine errors. This prevents the task from failing due to transient or irrelevant errors.
    • All ansible.builtin.apt tasks must include the lock_timeout: 600 parameter. This increases the timeout duration for acquiring locks, reducing the likelihood of failures due to lock contention.
  • ws-01-linux SSH Timeout Fix: The PLAY [Deploy Grafana Alloy on Linux hosts] must have ignore_unreachable: true added to it. This prevents the playbook from failing if the ws-01-linux host is temporarily unreachable via SSH.
  • ws-01-linux DKMS Bug Fix: In the PLAY [Deploy Grafana Alloy on Linux hosts], a new task must be added (before Install Grafana Alloy) to install the correct kernel headers (using name: "linux-headers-{{ ansible_kernel }}") with full apt retries and lock_timeout. This ensures that the necessary kernel headers are installed before attempting to install Grafana Alloy, resolving potential DKMS-related issues.
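
The playbook changes listed above can be sketched as follows. This is not the actual content of playbooks/deploy-observability-stack.yml: the first play's name, both hosts values, the placement of apt_lockdown_units_common under vars, the user's shell, and the retry counts are assumptions chosen for illustration.

```yaml
# Sketch of the relevant parts of playbooks/deploy-observability-stack.yml

- name: Deploy Loki                 # hypothetical play name
  hosts: loki_hosts                 # hypothetical inventory group
  become: true
  vars:
    # Unit globs named in the requirements; where this list really lives is an assumption.
    apt_lockdown_units_common:
      - "apt-daily*"
      - "unattended-upgrades"
      - "snapd*"
      - "packagekit*"
      - "dkms_autoinstal*"
  tasks:
    - name: Ensure loki system group exists
      ansible.builtin.group:
        name: loki
        system: true
        state: present

    - name: Ensure loki system user exists
      ansible.builtin.user:
        name: loki
        group: loki
        system: true
        create_home: false
        shell: /usr/sbin/nologin    # assumed shell for a service account
        state: present

    # The existing "Ensure Loki data directories exist" task follows here.

- name: Deploy Grafana Alloy on Linux hosts
  hosts: linux_hosts                # hypothetical inventory group
  become: true
  ignore_unreachable: true          # do not abort the run if ws-01-linux is unreachable
  tasks:
    - name: Install kernel headers for the running kernel
      ansible.builtin.apt:
        name: "linux-headers-{{ ansible_kernel }}"
        state: present
        update_cache: true
        lock_timeout: 600           # wait up to 10 minutes for the dpkg/apt lock
      register: kernel_headers_result
      retries: 5                    # assumed retry budget
      delay: 30
      until: kernel_headers_result is succeeded

    # The existing "Install Grafana Alloy" task follows here, also with lock_timeout: 600.
```

Both the group and user modules are idempotent with state: present, so rerunning the play on a host where loki already exists reports no change.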

4. Verification Playbook Fixes ("Acceptance Playbook" Fixes)

This section addresses a security leak in playbooks/tests/verify_observability.yml.

  • Security Leak Mitigation: The task Read Grafana provisioning datasources file must be modified. It should not slurp the entire file (which leaks secrets). Instead, it must use a safer method (e.g., ansible.builtin.stat to check existence, or ansible.builtin.command: grep "type: loki" to check content) to verify the Loki datasource is configured. This prevents sensitive information from being exposed during the verification process.
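
One possible shape for the safer check, combining both suggested methods, is sketched below. The variable grafana_datasources_file and the task names are hypothetical; the real playbook may locate the provisioning file differently.

```yaml
# Sketch for playbooks/tests/verify_observability.yml
# grafana_datasources_file is a hypothetical variable holding the provisioning file path.

- name: Check that the Grafana datasources provisioning file exists
  ansible.builtin.stat:
    path: "{{ grafana_datasources_file }}"
  register: datasources_stat

- name: Assert the provisioning file is present
  ansible.builtin.assert:
    that:
      - datasources_stat.stat.exists

- name: Check that a Loki datasource is configured
  ansible.builtin.command:
    argv:
      - grep
      - "-q"                        # quiet: exit status only, no file content in output
      - "type: loki"
      - "{{ grafana_datasources_file }}"
  changed_when: false               # read-only check
```

With grep -q the task result carries only the exit status, so no line of the file, and therefore no secret, is written to the CI log; a missing datasource makes grep exit non-zero and the task fail.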

Deliverables: A Unified Pull Request

The primary deliverable is a single, new Pull Request that supersedes all previous attempts (e.g., #79, #81, #83, #89). This PR must contain all the file modifications listed above, including changes to Makefile, requirements.yml, both CI workflow files, deploy-observability-stack.yml, verify_observability.yml, loki-config.yaml.j2, AGENTS.md, and README.md. All changes must be included in a single, atomic commit to ensure consistency and ease of review.

The PR description must include a completed "Testing Done" section (Phase R), detailing the testing procedures and results to demonstrate the effectiveness of the changes.

Acceptance Criteria: Ensuring Quality and Stability

The acceptance of the Ultimate Patch v2 is contingent upon meeting specific criteria across different stages of the CI/CD pipeline.

  • Gate 1 (ubuntu-latest): The following commands must complete with an exit code of 0:
    • make setup
    • make lint
    • make test
  • Gate 2 (self-hosted) - Final Verdict: The entire make itest command must complete with an exit code of 0, indicating that all tests have passed successfully (all green ✅).
  • Gate 2 (self-hosted) - Evidence: The CI log must provide evidence of the following:
    • ansible-galaxy installing from ./vendor/, indicating that Ansible is correctly installing collections from the local vendor directory.
    • The deploy playbook (run via make deploy) executing without dpkg lock errors, demonstrating that the lock contention issues have been resolved.
    • The deploy playbook successfully creating the loki user and group, confirming that the user and group management tasks are functioning correctly.
    • The deploy playbook not failing on ws-01-linux, thanks to ignore_unreachable: true and the kernel header fix.
    • The verification playbook (run via make verify) executing after the deploy playbook and passing all checks (A, B, C, D) from docs/verification-spec.md, verifying that the observability stack is deployed and functioning as expected.

By adhering to these requirements and acceptance criteria, the Ultimate Patch v2 aims to bring stability and reliability to our CI/CD pipeline, ensuring that the make itest command consistently passes and that our observability stack is deployed and verified correctly.
