===============================================================================
              Peer memory support over Mellanox OFED README

                                Dec 2013
===============================================================================

Table of Contents
===============================================================================
1. Overview
2. Peer memory API

===============================================================================
1. Overview
===============================================================================
In General:
------------
MLNX_OFED 2.1 introduced an API between IB CORE and peer memory clients
(e.g. GPU cards) that gives the HCA read/write access to peer memory for data
buffers. As a result, RDMA-based (over InfiniBand/RoCE) applications can use
the peer device's computing power and the RDMA interconnect at the same time,
without copying the data between the P2P devices.

This README describes the steps required to develop a peer memory client over
Mellanox OFED. Specifically, it focuses on the flow and the API needed to
achieve this task.

===============================================================================
2. Peer memory API
===============================================================================
Flow:
------
Each peer memory client should register itself with the IB CORE (ib_core)
module and provide a set of callbacks that manage its basic memory
functionality, such as get/put pages, get_page_size and dma map/unmap. These
callbacks are quite similar to the HOST memory ones; each one is described in
detail below.

Peer client structure:
-------------------------------------------------------------------------------
struct peer_memory_client {
    char name[IB_PEER_MEMORY_NAME_MAX];
    char version[IB_PEER_MEMORY_VER_MAX];
    int (*acquire) (unsigned long addr, size_t size,
                    void *peer_mem_private_data,
                    char *peer_mem_name, void **client_context);
    int (*get_pages) (unsigned long addr, size_t size, int write, int force,
                      struct sg_table *sg_head,
                      void *client_context, void *core_context);
    int (*dma_map) (struct sg_table *sg_head, void *client_context,
                    struct device *dma_device, int dmasync, int *nmap);
    int (*dma_unmap) (struct sg_table *sg_head, void *client_context,
                      struct device *dma_device);
    void (*put_pages) (struct sg_table *sg_head, void *client_context);
    unsigned long (*get_page_size) (void *client_context);
    void (*release) (void *client_context);
};

APIs:
-------------------------------------------------------------------------------
void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
                                     invalidate_peer_memory *invalidate_callback);

Description:
    Each peer driver registers its callbacks upon loading. The callbacks
    provided as part of peer_client may be used later by the IB core when
    processing peer memory.

Parameters:
    peer_client [IN] - structure filled with the peer information:
        name/version/callbacks.
    invalidate_callback [OUT] - IB core callback function, to be called by
        the peer once some allocation should be invalidated.

Return value:
    A valid registration handle pointer on success, NULL otherwise.

Notes:
    name must be unique among the registered peers.
    version and some statistics for the RDMA operations performed by the peer
    are exposed by IB CORE via sysfs entries under:
    /sys/kernel/mm/memory_peers//

-------------------------------------------------------------------------------
void ib_unregister_peer_memory_client(void *reg_handle);

Description:
    On unload, the peer client should unregister itself from IB CORE.

Parameters:
    reg_handle [IN] - registration handle previously returned by the
        registration call.

-------------------------------------------------------------------------------
int (*acquire) (unsigned long addr, size_t size, void *peer_mem_private_data,
                char *peer_mem_name, void **client_context);

Description:
    Given a virtual address, the peer driver should be able to identify
    whether this address resides in its physical memory. If the answer is
    positive, further memory management calls for that address are routed to
    that peer's callbacks.

Parameters:
    addr [IN] - virtual address to be checked for ownership by the peer.
    size [IN] - size of the memory area starting at addr.
    peer_mem_private_data [IN] - private data that the peer already set on
        ib_ucontext; NULL if it was not set. This parameter can help peers
        that run in kernel space and have access to ib_ucontext: they can set
        some private data there and use it later to identify their memory.
    peer_mem_name [IN] - peer name, in case it was already set on
        ib_ucontext; NULL if it was not set. This parameter is normally used
        along with peer_mem_private_data.
    client_context [OUT] - opaque peer data that holds the peer context for
        that address range; it is passed in on further calls for that memory.

Return value:
    1 - the virtual address belongs to the peer device, otherwise 0.

-------------------------------------------------------------------------------
int (*get_pages) (unsigned long addr, size_t size, int write, int force,
                  struct sg_table *sg_head,
                  void *client_context, void *core_context);

Description:
    This function is called for the peer device that owns the virtual address
    range. The peer is expected to pin the physical pages of the given address
    range (if not already pinned) and to fill the sg_table with the
    information of the physical pages associated with that range. This
    function is equivalent to the Host memory get_user_pages() method.

Parameters:
    addr [IN] - start virtual address of the given allocation.
    size [IN] - size of the memory area starting at addr.
    write [IN] - indicates whether the pages will be written to by the
        caller. Same meaning as in the kernel API get_user_pages(); can be
        ignored if not relevant.
    force [IN] - indicates whether to force write access even if the user
        mapping is read-only. Same meaning as in the kernel API
        get_user_pages(); can be ignored if not relevant.
    sg_head [OUT] - pointer to the head of a struct sg_table. The peer should
        allocate the entries required for its page information, then fill
        them with its physical addresses and sizes. The following APIs are
        expected to be used: sg_alloc_table, sg_set_page.
    client_context [IN] - peer context for the given allocation, as returned
        by the acquire call.
    core_context [IN] - opaque IB core context, to be used with the
        invalidate callback in case this address range should be freed.

Return value:
    0 on success, otherwise an errno return code.

-------------------------------------------------------------------------------
int (*dma_map) (struct sg_table *sg_head, void *client_context,
                struct device *dma_device, int dmasync, int *nmap);

Description:
    This function is called to let the peer driver fill the sg_table with its
    DMA information for that address range.
    The parameters have the same meaning as the host kernel ones used when
    calling dma_map_sg.

Parameters:
    sg_head [OUT] - pointer to the head of a struct sg_table. The peer should
        set its dma_address and dma_length on each scatter-gather entry.
    client_context [IN] - peer context for the given allocation.
    dma_device [IN] - current device for that operation.
    dmasync [IN] - flush in-flight DMA when the memory region is written.
        Same meaning as with host memory mapping; can be ignored if not
        relevant.
    nmap [OUT] - number of mapped/set entries.

Return value:
    0 on success, otherwise an errno return code.

-------------------------------------------------------------------------------
int (*dma_unmap) (struct sg_table *sg_head, void *client_context,
                  struct device *dma_device);

Description:
    This API is the opposite of the dma_map API; it should take the relevant
    actions to unmap the range.

Parameters:
    sg_head [IN] - pointer to the head of a struct sg_table.
    client_context [IN] - peer context for the given allocation.
    dma_device [IN] - current device for that operation.

Return value:
    0 on success, otherwise an errno return code.

-------------------------------------------------------------------------------
void (*put_pages) (struct sg_table *sg_head, void *client_context);

Description:
    This API is the opposite of the get_pages API; it should free the pages,
    as is done for Host memory when calling put_page.

Parameters:
    sg_head [IN] - pointer to the head of a struct sg_table.
    client_context [IN] - peer context for the given allocation.

-------------------------------------------------------------------------------
unsigned long (*get_page_size) (void *client_context);

Description:
    This API returns the page size for the given allocation.

Parameters:
    client_context [IN] - peer context for the given allocation.

Return value:
    Page size in bytes.

-------------------------------------------------------------------------------
void (*release) (void *client_context);

Description:
    This API is the opposite of the acquire call; it lets the peer release
    all of its resources.

Parameters:
    client_context [IN] - peer context for the given allocation.

-------------------------------------------------------------------------------
typedef int (*invalidate_peer_memory)(void *reg_handle, void *core_context);

Description:
    Function pointer returned from the IB core as part of a successful peer
    registration. The peer should call it when a given memory allocation,
    represented by core_context, should be deleted.

Parameters:
    reg_handle [IN] - peer handle.
    core_context [IN] - core context that represents a given allocation, as
        it was passed in the get_pages call.

Return value:
    0 on success, otherwise an errno return code.

-------------------------------------------------------------------------------
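
Illustrative sketch (not part of MLNX_OFED): the callback descriptions above
suggest a per-allocation lifecycle of acquire, get_pages, dma_map, dma_unmap,
put_pages and release. The block below is a minimal userspace mock of that
ordering, which is inferred from the descriptions rather than taken from the
IB core sources. It is not kernel code: struct sg_table and struct device are
simplified stand-ins, the IB_PEER_MEMORY_*_MAX values are assumed, and every
demo_* name, the fixed address range and peer_flow_demo() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for kernel types and constants; the real ones differ. */
#define IB_PEER_MEMORY_NAME_MAX 64              /* assumed value */
#define IB_PEER_MEMORY_VER_MAX  16              /* assumed value */
struct sg_table { unsigned int nents; };
struct device   { int unused; };

struct peer_memory_client {                     /* layout as in this README */
    char name[IB_PEER_MEMORY_NAME_MAX];
    char version[IB_PEER_MEMORY_VER_MAX];
    int (*acquire)(unsigned long addr, size_t size, void *peer_mem_private_data,
                   char *peer_mem_name, void **client_context);
    int (*get_pages)(unsigned long addr, size_t size, int write, int force,
                     struct sg_table *sg_head, void *client_context,
                     void *core_context);
    int (*dma_map)(struct sg_table *sg_head, void *client_context,
                   struct device *dma_device, int dmasync, int *nmap);
    int (*dma_unmap)(struct sg_table *sg_head, void *client_context,
                     struct device *dma_device);
    void (*put_pages)(struct sg_table *sg_head, void *client_context);
    unsigned long (*get_page_size)(void *client_context);
    void (*release)(void *client_context);
};

/* Hypothetical peer that owns one fixed virtual range. */
#define DEMO_BASE 0x10000UL
#define DEMO_SIZE 0x4000UL
#define DEMO_PAGE 0x1000UL
static int demo_pinned;                         /* pages currently pinned */

static int demo_acquire(unsigned long addr, size_t size, void *priv,
                        char *name, void **ctx)
{
    (void)priv; (void)name;
    if (addr < DEMO_BASE || addr + size > DEMO_BASE + DEMO_SIZE)
        return 0;                               /* not our memory */
    *ctx = &demo_pinned;                        /* opaque per-range context */
    return 1;
}
static int demo_get_pages(unsigned long addr, size_t size, int write, int force,
                          struct sg_table *sg, void *ctx, void *core_ctx)
{
    int *pinned = ctx;
    (void)write; (void)force; (void)core_ctx;
    /* One entry per page touched by [addr, addr + size). */
    sg->nents = (unsigned int)((addr + size - 1) / DEMO_PAGE - addr / DEMO_PAGE + 1);
    *pinned = (int)sg->nents;
    return 0;
}
static int demo_dma_map(struct sg_table *sg, void *ctx, struct device *dev,
                        int dmasync, int *nmap)
{
    (void)ctx; (void)dev; (void)dmasync;
    *nmap = (int)sg->nents;                     /* every entry is "mapped" */
    return 0;
}
static int demo_dma_unmap(struct sg_table *sg, void *ctx, struct device *dev)
{
    (void)sg; (void)ctx; (void)dev;
    return 0;
}
static void demo_put_pages(struct sg_table *sg, void *ctx)
{
    *(int *)ctx = 0;                            /* unpin everything */
    sg->nents = 0;
}
static unsigned long demo_get_page_size(void *ctx) { (void)ctx; return DEMO_PAGE; }
static void demo_release(void *ctx) { (void)ctx; }

struct peer_memory_client demo_client = {
    .name = "demo_peer", .version = "1.0",
    .acquire = demo_acquire, .get_pages = demo_get_pages,
    .dma_map = demo_dma_map, .dma_unmap = demo_dma_unmap,
    .put_pages = demo_put_pages, .get_page_size = demo_get_page_size,
    .release = demo_release,
};

/* Plays the role of the core for one range. Returns the number of mapped
 * entries, or -1 if the peer does not own the range or a callback fails. */
int peer_flow_demo(struct peer_memory_client *c, unsigned long addr, size_t size)
{
    struct sg_table sg = { 0 };
    struct device dev = { 0 };
    void *ctx = NULL;
    int nmap = 0;

    if (!c->acquire(addr, size, NULL, NULL, &ctx))
        return -1;
    assert(ctx != NULL);
    if (c->get_pages(addr, size, 1, 0, &sg, ctx, NULL)) {
        c->release(ctx);
        return -1;
    }
    if (c->dma_map(&sg, ctx, &dev, 0, &nmap)) {
        c->put_pages(&sg, ctx);
        c->release(ctx);
        return -1;
    }
    /* The HCA would access the peer memory here. */
    c->dma_unmap(&sg, ctx, &dev);
    c->put_pages(&sg, ctx);
    c->release(ctx);
    return nmap;
}
```

With the assumed 4 KiB demo page size, peer_flow_demo(&demo_client, 0x10000UL,
0x2000) returns 2, and an address outside the demo range returns -1 because
acquire reports that the peer does not own it. A real peer driver would
instead pass its filled struct peer_memory_client to
ib_register_peer_memory_client() and let the IB core drive the callbacks.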