Ultimate Patch V2: Stabilizing CI/CD And Deployment
This article details the implementation of the "Ultimate Patch v2," a critical refactor designed to resolve significant issues within our repository. The primary goal is to consolidate all necessary fixes for CI/CD, logic, and deployment processes, ultimately achieving a stable make itest pipeline. This patch addresses a state of deadlock caused by multiple independent fixes blocking each other, resolving issues from network hangs during make setup to cascading infrastructure and logic bugs during make deploy.
Context and Genesis of the Ultimate Patch
Our HomeOps repository faced a critical juncture where various fixes, represented by pull requests such as #79, #81, #83, and #89, were mutually blocking progress. The make setup command was failing due to network issues, and the make deploy command was plagued by numerous infrastructure and logic errors, including missing Loki user configurations, missing kernel headers, SSH timeouts, and invalid validation logic. These issues collectively prevented the successful completion of the make itest pipeline, which encompasses both deployment and verification stages.
The "Ultimate Patch v2" was conceived as a "Genesis Task" to supersede all previous, fragmented attempts at resolving these issues. It aims to implement every known fix in a single, atomic, and correct sequence. This comprehensive approach is intended to break the existing "chicken-and-egg" paradoxes and enable our full make itest pipeline to pass consistently for the first time. The task focuses on the Golden Path (i.e., make itest) to ensure end-to-end functionality. Success is determined by adherence to the specifications outlined in docs/verification-spec.md. The patch requires a self-hosted runner environment with Ansible and hardware access. While the initial setup may require network access for the AI to create the patch, the final CI run is designed to be offline-capable, ensuring reliable execution in isolated environments.
The scope of the "Ultimate Patch v2" encompasses and supersedes all open pull requests related to various fixes, including VENDORIZE, FIX-LOKI-GROUP, FIX-WS01-KERNEL-HEADERS, FIX-WS01-SSH, REDEFINE-GATE2-ITEST, FIX-CICD-SYNTAX, FIX-LOKI-CONFIG, and FIX-CICD-SECURITY. This consolidation ensures that all known issues are addressed in a coordinated manner, preventing conflicts and ensuring a cohesive solution.
Requirements: Achieving a Holistic Refactor
The successful implementation of the Ultimate Patch v2 requires a holistic refactoring of the repository, integrating all identified fixes into a single, coherent pull request. This section details the specific requirements across various components of the system.
1. Core Process Refactor ("Constitution" Amendment)
This aspect focuses on redefining the fundamental processes and workflows within the repository to ensure a streamlined and reliable CI/CD pipeline.
- Makefile Fixes: The make itest target must be redefined as a composite action. This involves executing make setup first, followed by playbooks/deploy-observability-stack.yml to deploy the observability stack, and finally playbooks/tests/verify_observability.yml to verify the deployment.
- CI Workflow Adjustments: The CI workflows, specifically pr-quality-check.yml and auto-deploy...yml (or main-pipeline.yml), need significant adjustments.
  - pr-quality-check.yml should only run on pull_request events. Its Gate 2 job should be simplified to execute a single command: run: make itest. This ensures that pull requests are thoroughly tested before merging.
  - auto-deploy...yml (or main-pipeline.yml) should only run on push events to the main branch. Similarly, it must be simplified to execute run: make itest. This ensures that deployments are triggered automatically upon code merges to the main branch.
- Documentation Updates: All references to the Golden Path in AGENTS.md and README.md must be updated to reflect the new, correct definition of make itest (Deploy + Verify). This ensures that contributors and users have accurate information about the intended workflow.
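As a rough illustration of the simplified pull request workflow, the sketch below assumes GitHub Actions and a workflow stored under .github/workflows/. The workflow name, job id, and checkout step are assumptions made for illustration; only the pull_request trigger, the self-hosted runner, and the single make itest command come from the requirements above.

```yaml
# .github/workflows/pr-quality-check.yml (illustrative sketch, not the repository's actual file)
name: PR Quality Check   # assumed display name

on:
  pull_request:          # run only for pull request events

jobs:
  gate-2-itest:          # hypothetical job id for Gate 2
    runs-on: self-hosted # Gate 2 needs Ansible and hardware access
    steps:
      - uses: actions/checkout@v4   # assumed checkout step
      - name: Run the Golden Path (setup, deploy, verify)
        run: make itest
```

The main-branch workflow would presumably look the same apart from its trigger, which would be a push event filtered to the main branch.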
2. Offline Setup Fix ("Bring Your Own Provisions")
This part addresses the need for an offline setup capability, allowing the system to function even without a direct internet connection during certain phases.
- Vendor Directory Creation: A vendor/ directory must be created at the root of the repository. This directory will house all necessary dependencies for offline operation.
- Collection Addition: The .tar.gz files for community.general and ansible.windows must be downloaded and placed in the vendor/ directory. These collections are essential for Ansible to function correctly.
- requirements.yml Modification: The requirements.yml file must be modified to point to the local file paths within the vendor/ directory. This is achieved by specifying source: ./vendor/... and type: file for each collection. This ensures that Ansible uses the locally vendored collections instead of attempting to download them from the internet.
- Makefile Enhancement: The $(COLLECTIONS_MARKER) target in the Makefile must be updated to include the --offline flag in the ansible-galaxy install command. This enforces offline installation, preventing Ansible from attempting to access the internet for dependencies.
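A minimal sketch of what the vendored requirements.yml could look like is shown below. It follows the source and type: file keys named above. The tarball file names are placeholders, since the actual file names and versions are not specified here.

```yaml
# requirements.yml (illustrative sketch; file names below are placeholders)
collections:
  - source: ./vendor/community-general-X.Y.Z.tar.gz   # hypothetical file name
    type: file
  - source: ./vendor/ansible-windows-X.Y.Z.tar.gz     # hypothetical file name
    type: file
```

With entries like these, the $(COLLECTIONS_MARKER) recipe would run something along the lines of ansible-galaxy collection install -r requirements.yml --offline, assuming an ansible-core release that supports the --offline flag.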
3. Deploy Playbook Fixes ("Deployment Playbook" Fixes)
This section focuses on addressing specific issues within the playbooks/deploy-observability-stack.yml playbook to ensure reliable and correct deployment of the observability stack.
- Loki Configuration: The invalid option enable_multi_variant_querier must be removed from the templates/loki-config.yaml.j2 template. This ensures that Loki is configured correctly without encountering errors due to unsupported options.
- Loki User/Group Management: Before the "Ensure Loki data directories exist" task, new tasks must be added to idempotently create the loki system group (using ansible.builtin.group) and the loki system user (using ansible.builtin.user). These tasks must ensure that the loki user and group exist with the correct permissions before proceeding with the deployment.
- dpkg Lock Contention Resolution: This addresses issues related to lock contention during package installations, which can cause deployment failures.
  - The apt_lockdown_units_common list must be expanded to include all potential lock sources, such as apt-daily*, unattended-upgrades, snapd*, packagekit*, and dkms_autoinstal*. This ensures that all potential sources of lock contention are accounted for.
  - The "Fail if any apt lockdown unit error..." task must use a robust when condition that ignores all known "not found/not loaded" errors (as seen in our logs) and only fails on genuine errors. This prevents the task from failing due to transient or irrelevant errors.
  - All ansible.builtin.apt tasks must include the lock_timeout: 600 parameter. This increases the timeout duration for acquiring locks, reducing the likelihood of failures due to lock contention.
- ws-01-linux SSH Timeout Fix: The PLAY [Deploy Grafana Alloy on Linux hosts] must have ignore_unreachable: true added to it. This prevents the playbook from failing if the ws-01-linux host is temporarily unreachable via SSH.
- ws-01-linux DKMS Bug Fix: In the PLAY [Deploy Grafana Alloy on Linux hosts], a new task must be added (before "Install Grafana Alloy") to install the correct kernel headers (using name: "linux-headers-{{ ansible_kernel }}") with full apt retries and lock_timeout. This ensures that the necessary kernel headers are installed before attempting to install Grafana Alloy, resolving potential DKMS-related issues.
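The Loki user/group tasks and the kernel header fix might be sketched roughly as follows. The Loki play name, the host patterns, the login shell, and the retry and delay values are all assumptions made for illustration. The module names, lock_timeout: 600, ignore_unreachable: true, and the linux-headers package name come from the requirements above.

```yaml
# Fragment of playbooks/deploy-observability-stack.yml (illustrative sketch)
- name: Deploy Loki                      # hypothetical play name
  hosts: loki_hosts                      # hypothetical inventory group
  become: true
  tasks:
    - name: Ensure loki system group exists
      ansible.builtin.group:
        name: loki
        system: true
        state: present

    - name: Ensure loki system user exists
      ansible.builtin.user:
        name: loki
        group: loki
        system: true
        shell: /usr/sbin/nologin         # assumed: service account without a login shell
        create_home: false
        state: present

    # The existing "Ensure Loki data directories exist" task would follow here.

- name: Deploy Grafana Alloy on Linux hosts
  hosts: linux_hosts                     # hypothetical inventory group
  become: true
  ignore_unreachable: true               # do not fail the run if ws-01-linux is unreachable
  tasks:
    - name: Install kernel headers matching the running kernel
      ansible.builtin.apt:
        name: "linux-headers-{{ ansible_kernel }}"
        state: present
        update_cache: true
        lock_timeout: 600                # wait up to 10 minutes for the dpkg lock
      register: headers_result
      retries: 5                         # assumed retry policy
      delay: 30
      until: headers_result is succeeded

    # The existing "Install Grafana Alloy" task would follow here.
```

Both ansible.builtin.group and ansible.builtin.user are idempotent with state: present, so rerunning the play leaves an existing loki account untouched.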
4. Verification Playbook Fixes ("Acceptance Playbook" Fixes)
This section addresses a security leak in playbooks/tests/verify_observability.yml.
- Security Leak Mitigation: The task "Read Grafana provisioning datasources file" must be modified. It should not slurp the entire file (which leaks secrets). Instead, it must use a safer method (e.g., ansible.builtin.stat to check existence, or ansible.builtin.command: grep "type: loki" to check content) to verify the Loki datasource is configured. This prevents sensitive information from being exposed during the verification process.
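One possible shape for the safer check is sketched below, combining both suggested methods. The play name, host pattern, and datasource file path are assumptions. The idea is that stat confirms the file exists and grep -q reports only a match status through its exit code, so no file content reaches the log.

```yaml
# Fragment of playbooks/tests/verify_observability.yml (illustrative sketch)
- name: Verify Grafana Loki datasource   # hypothetical play name
  hosts: grafana_hosts                   # hypothetical inventory group
  become: true
  vars:
    # Hypothetical path; adjust to wherever the datasources file is provisioned.
    grafana_datasources_file: /etc/grafana/provisioning/datasources/datasources.yml
  tasks:
    - name: Check that the datasources file exists
      ansible.builtin.stat:
        path: "{{ grafana_datasources_file }}"
      register: grafana_ds_stat

    - name: Fail if the datasources file is missing
      ansible.builtin.assert:
        that:
          - grafana_ds_stat.stat.exists
        fail_msg: Grafana datasources provisioning file not found

    - name: Confirm a Loki datasource is configured without printing the file
      ansible.builtin.command:
        argv:
          - grep
          - -q                           # quiet: exit code only, no matched lines
          - "type: loki"
          - "{{ grafana_datasources_file }}"
      changed_when: false                # read-only check
```

If the pattern is absent, grep exits with a non-zero status and the task fails, which is the desired verification outcome.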
Deliverables: A Unified Pull Request
The primary deliverable is a single, new Pull Request that supersedes all previous attempts (e.g., #79, #81, #83, #89). This PR must contain all the file modifications listed above, including changes to Makefile, requirements.yml, both CI workflow files, deploy-observability-stack.yml, verify_observability.yml, loki-config.yaml.j2, AGENTS.md, and README.md. All changes must be included in a single, atomic commit to ensure consistency and ease of review.
The PR description must include a completed "Testing Done" section (Phase R), detailing the testing procedures and results to demonstrate the effectiveness of the changes.
Acceptance Criteria: Ensuring Quality and Stability
The acceptance of the Ultimate Patch v2 is contingent upon meeting specific criteria across different stages of the CI/CD pipeline.
- Gate 1 (ubuntu-latest): The following commands must complete with an exit code of 0:
  - make setup
  - make lint
  - make test
- Gate 2 (self-hosted) - Final Verdict: The entire make itest command must complete with an exit code of 0, indicating that all tests have passed successfully (all green ✅).
- Gate 2 (self-hosted) - Evidence: The CI log must provide evidence of the following:
  - ansible-galaxy installing from ./vendor/, indicating that Ansible is correctly installing collections from the local vendor directory.
  - The make deploy playbook executing without dpkg lock errors, demonstrating that the lock contention issues have been resolved.
  - The make deploy playbook successfully creating the loki user/group, confirming that the user and group management tasks are functioning correctly.
  - The make deploy playbook not failing on ws-01-linux due to the implementation of ignore_unreachable: true and the kernel header fix.
  - The make verify playbook executing after make deploy and passing all checks (A, B, C, D) from docs/verification-spec.md, verifying that the observability stack is deployed and functioning as expected.
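For reference, the Gate 1 criteria could map onto a workflow job along these lines, assuming GitHub Actions. The job id and checkout step are assumptions. Each run step fails the job on a non-zero exit code, which matches the exit-code-0 requirement.

```yaml
# Gate 1 job (illustrative sketch)
jobs:
  gate-1-static:                    # hypothetical job id
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # assumed checkout step
      - run: make setup
      - run: make lint
      - run: make test
```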
By adhering to these requirements and acceptance criteria, the Ultimate Patch v2 aims to bring stability and reliability to our CI/CD pipeline, ensuring that the make itest command consistently passes and that our observability stack is deployed and verified correctly.