Setting Up Bare-Metal Infrastructure Provider
Tutorial for setting up Bare-Metal Infrastructure Provider for Omni
In this tutorial, we will set up a Bare-Metal Infrastructure Provider for our Omni instance so that Omni can provision bare-metal machines.
Requirements
An Omni instance (either managed or self-hosted) with Admin access
Access to the Image Factory (either the public one or self-hosted)
Some bare-metal machines with:
BMC power management capabilities via one of the following:
IPMI
Redfish
Outbound access to the Omni instance and the Image Factory
A machine, container, or cluster in the same subnet as the bare-metal machines to run the infrastructure provider service
In this tutorial, we will assume that we use:
Our managed Omni instance running at my-instance.omni.siderolabs.io
The public Image Factory at factory.talos.dev
Bare-metal machines with IPMI support
An additional server with Docker installed to run the infrastructure provider service, with the IP address 172.16.0.42 within the subnet 172.16.0.0/24, reachable by the bare-metal machines
Two bare-metal servers within the subnet 172.16.0.0/24 with access to the infrastructure provider, our Omni instance, and the Image Factory
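Because the provider must sit in the same subnet as the bare-metal machines, it can help to confirm the addressing plan before continuing. The following is a minimal sketch using Python's standard ipaddress module; the helper name in_subnet is hypothetical, and the values are the ones assumed in this tutorial.

```python
import ipaddress

def in_subnet(address: str, subnet: str) -> bool:
    """Return True if the IP address belongs to the given CIDR subnet."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(subnet)

# Values assumed in this tutorial; replace them with your own.
SUBNET = "172.16.0.0/24"
PROVIDER_IP = "172.16.0.42"

print(f"{PROVIDER_IP} in {SUBNET}: {in_subnet(PROVIDER_IP, SUBNET)}")
```

The same check can be repeated for each bare-metal machine's address once it is known.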
1. Creating an Omni service account
We start by creating an Omni service account that the infrastructure provider will use to authenticate to Omni.
The service account must have the same ID as the provider instance we are going to run. The provider ID defaults to bare-metal, so we use that as the name. If you plan to use a different ID (passed via --id) or to run multiple provider instances, name the service account accordingly.
Navigate to the Settings - Service Accounts tab in the Omni web UI, and create a service account with the ID bare-metal.
Store the displayed service account key securely for later use.
2. Starting the provider
We will run the provider in a Docker container on our server with the IP address 172.16.0.42.
The provider requires the following ports to be accessible:
50042: HTTP and gRPC API port (customizable via --api-port)
69: TFTP port used to provide iPXE binaries to PXE-booted machines
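Once the provider is running, the API port can be probed from another machine in the subnet. The sketch below uses only the Python standard library; the helper name tcp_port_open is hypothetical. It covers the TCP API port only: TFTP on port 69 uses UDP, which a simple connection attempt cannot verify.

```python
import socket

def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False

# Example call, using the address and default API port from this tutorial:
#   tcp_port_open("172.16.0.42", 50042)
```

A False result does not say why the connection failed, so check firewall rules and the container status separately.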
Start by getting the image reference of the latest version of the provider from its packages page.
At the time of writing, it is ghcr.io/siderolabs/omni-infra-provider-bare-metal:v0.1.0-alpha.1, and we are going to use it in this tutorial.
Set the required environment variables, using the service account key you got in the previous step:
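The variable names are not shown in the text here. The sketch below assumes the names commonly used by Omni tooling, OMNI_ENDPOINT (for example https://my-instance.omni.siderolabs.io, with the https scheme also assumed) and OMNI_SERVICE_ACCOUNT_KEY, and reports whether they are set in the current environment. Treat both names as assumptions and confirm them against the provider's own documentation.

```python
import os

# Assumed variable names; they are not confirmed by the text of this tutorial.
REQUIRED_VARS = ("OMNI_ENDPOINT", "OMNI_SERVICE_ACCOUNT_KEY")

def missing_vars(env=os.environ):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_vars()
print("missing:", ", ".join(missing) if missing else "none")
```

Running this in the shell that will start the provider catches a forgotten export before the container is launched.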
Run the following command to start the provider service:
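As an illustration of what such a command could look like, the sketch below assembles a docker run invocation and prints it without executing it. Only the image reference and the --api-advertise-address flag come from this tutorial. The container name, the restart policy, host networking, and the environment variable names are assumptions.

```python
import shlex

IMAGE = "ghcr.io/siderolabs/omni-infra-provider-bare-metal:v0.1.0-alpha.1"

def provider_command(advertise_address: str, image: str = IMAGE) -> list[str]:
    """Build a `docker run` argument list for the provider (layout is assumed)."""
    return [
        "docker", "run", "-d",
        "--name", "omni-bare-metal-infra-provider",  # hypothetical container name
        "--restart", "always",                       # assumed restart policy
        "--network", "host",                         # assumed: exposes DHCP proxy and TFTP on the host
        "-e", "OMNI_ENDPOINT",                       # passed through from the shell (assumed name)
        "-e", "OMNI_SERVICE_ACCOUNT_KEY",            # passed through from the shell (assumed name)
        image,
        f"--api-advertise-address={advertise_address}",
    ]

# Print the command for review instead of running it.
print(shlex.join(provider_command("172.16.0.42")))
```

Printing first makes it easy to review the flags; the printed line can then be pasted into a shell on the provider host.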
Make sure it is started and running by checking its status:
Sample output:
And start tailing its logs in a separate shell:
Sample output:
At this point, the provider has started and established a connection to our Omni instance.
The provider will start a DHCP proxy server that responds to DHCP requests on the interface to which the --api-advertise-address belongs. This DHCP proxy server is only responsible for generating PXE boot responses for machines configured to PXE boot. It does not otherwise affect the existing DHCP server.
If you need to run this DHCP proxy on a different interface (so that the responses are broadcast to the correct network), you can pass the --dhcp-proxy-iface-or-ip flag to the provider, specifying either the name of the network interface or an IP address on that machine that belongs to the desired interface.
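To illustrate how the two flags relate, the sketch below builds the provider's argument list with and without the DHCP proxy override. The helper name provider_flags, the interface name eth1, and the --flag=value spelling are assumptions. The flag names themselves are the ones mentioned above.

```python
def provider_flags(advertise_address, dhcp_proxy_iface_or_ip=None):
    """Return provider flags; the DHCP proxy override is added only when given."""
    flags = [f"--api-advertise-address={advertise_address}"]
    if dhcp_proxy_iface_or_ip:
        # Either an interface name or an IP address on the desired interface.
        flags.append(f"--dhcp-proxy-iface-or-ip={dhcp_proxy_iface_or_ip}")
    return flags

print(provider_flags("172.16.0.42"))          # proxy follows the advertise address
print(provider_flags("172.16.0.42", "eth1"))  # proxy pinned to a hypothetical interface
```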
3. Starting the Bare-Metal Machines
At this point, we can boot our bare-metal machines. Before we start, make sure that they are configured to boot over the network via PXE on the next boot, so that they can be booted by the provider.
We recommend using the default boot order of first disk, then network.
Power cycle the two machines. When they attempt to boot via PXE, the provider logs will show that the provider PXE booted them, similar to the lines below:
At this point, these machines are booted into a special mode of Talos called "Agent Mode". In this mode, Talos
does not detect any existing Talos installation on the disk, nor attempt to boot from it
runs only the required services
does not allow a configuration to be applied to it
establishes a secure SideroLink connection to the Omni instance
runs the Metal Agent extension, which establishes a connection to the provider
runs only the services required for it to be further provisioned by the provider
4. Accepting the Machines
At this point, the machines should be booted into Agent Mode and have established a SideroLink connection to our Omni instance. Let's verify this:
Navigate to the Machines - Pending tab in the Omni web UI. You should see the machines pending acceptance:
Our machines have the following IDs:
33313750-3538-5a43-4a44-315430304c46
33313750-3538-5a43-4a44-315430304c47
For security reasons, the machines cannot be used before they are "Accepted". We will accept these machines in the Omni web UI.
The following step will wipe the disks of these machines, so proceed with caution!
Click the "Accept" button for each machine under the Machines - Pending tab in the Omni web UI, and confirm the action:
When you do this, the provider will do the following under the hood:
Ask the Talos Agent service on the machines to configure their IPMI credentials
Retrieve these credentials and store them
Wipe the disks of these machines, since they are newly discovered
Power off these machines over IPMI
Additionally, Omni will create a Machine and an InfraMachineStatus resource for each machine. You can verify this by:
Output will be similar to:
5. Adding Machines to a Cluster
We can now create a cluster using these machines. For this, simply follow the guide for creating a cluster.
When you add these machines to a cluster, the following will happen under the hood.
The provider will:
Power these machines on, marking their next boot to be a PXE boot
PXE boot them into Talos maintenance mode
Then Omni will proceed with the regular flow of:
Applying a configuration to them, causing Talos to be installed to the disk
Rebooting (possibly using kexec)
The cluster will be provisioned as normal and will reach the Ready status.
6. Removing Machines from a Cluster
When you delete a cluster and/or remove some bare-metal machines from a cluster, the following will happen:
Omni does the regular de-allocation flow:
Remove the nodes from the cluster (leave etcd membership for control planes)
Reset the machines
Afterwards, the provider will follow it up with these additional steps:
PXE boot the machine into Agent Mode (to be able to wipe its disks)
Wipe its disks
Power off the machine
At this point, these machines will again be ready to be allocated to a different cluster.
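The lifecycle described in steps 4 to 6 can be summarized as a small state machine. The sketch below is an illustrative model only, not the provider's implementation: the state names, event names, and the apply helper are hypothetical, while the step descriptions paraphrase this tutorial.

```python
from enum import Enum

class State(Enum):
    PENDING = "pending acceptance, in Agent Mode"
    AVAILABLE = "accepted, disks wiped, powered off"
    IN_CLUSTER = "Talos installed, part of a cluster"

# (current state, event) -> (next state, steps described in this tutorial)
TRANSITIONS = {
    (State.PENDING, "accept"): (State.AVAILABLE, [
        "configure and store IPMI credentials",
        "wipe disks",
        "power off over IPMI",
    ]),
    (State.AVAILABLE, "add_to_cluster"): (State.IN_CLUSTER, [
        "power on with next boot set to PXE",
        "PXE boot into Talos maintenance mode",
        "apply configuration, installing Talos to disk",
        "reboot (possibly using kexec)",
    ]),
    (State.IN_CLUSTER, "remove_from_cluster"): (State.AVAILABLE, [
        "remove node from cluster and reset the machine",
        "PXE boot into Agent Mode",
        "wipe disks",
        "power off",
    ]),
}

def apply(state, event):
    """Return (next_state, steps); raise ValueError for an invalid event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"cannot {event!r} a machine that is {state.name}") from None

# Walk one machine through the full cycle.
state = State.PENDING
for event in ("accept", "add_to_cluster", "remove_from_cluster"):
    state, steps = apply(state, event)
    print(f"{event} -> {state.name}: {'; '.join(steps)}")
```

The model makes one property of the flow explicit: a machine returns to the same powered-off, wiped state after removal that it reached after acceptance, which is why it can be allocated again.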