Commit 985d4d7

wkozaczuk authored and nyh committed
socket: implement Linux flavor of SO_REUSE_PORT option
This patch is a manual back-port of the original FreeBSD patch https://reviews.freebsd.org/rS334719. The FreeBSD patch adds support for the SO_REUSEPORT_LB socket option, whereas the patch below implements the Linux flavor of SO_REUSEPORT, which in effect borrows a good chunk of the FreeBSD implementation.

Please note that the FreeBSD committers decided to retain support for the original SO_REUSEPORT option and add a new one, SO_REUSEPORT_LB. The new option exhibits the same behavior as the older one but adds an important new feature: load balancing across listener sockets sharing the same port. The FreeBSD manual states this:

"SO_REUSEPORT_LB allows completely duplicate bindings by multiple processes if they all set SO_REUSEPORT_LB before binding the port. Incoming TCP and UDP connections are distributed among the sharing processes based on a hash function of local port number, foreign IP address and port number. A maximum of 256 processes can share one socket."

So most of the original patch was back-ported as-is, except for the conditional logic that accounts for both SO_REUSEPORT and SO_REUSEPORT_LB. That logic is unnecessary for OSv because it implements Linux, which only supports the SO_REUSEPORT option. In addition, in some places I had to change C code to use C++ constructs, just like in other places in in_pcb.cc.
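The distribution described in the quoted FreeBSD manual text can be sketched in a few lines. This is only an illustration under stated assumptions, not code from the patch: `pkt_hash()` is a hypothetical stand-in for the real packet hash macro (whose definition is not shown here), and `pick_member()` is a name invented for this example. The point is that a given (foreign address, local port, foreign port) tuple always maps to the same listener, while different tuples spread across the group.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the kernel's packet hash: mixes the foreign
// address, local port and foreign port into one 32-bit value.
uint32_t pkt_hash(uint32_t faddr, uint16_t lport, uint16_t fport)
{
    uint32_t h = faddr;
    h ^= (static_cast<uint32_t>(lport) << 16) | fport;
    h *= 2654435761u;          // arbitrary multiplicative mixing constant
    return h ^ (h >> 16);
}

// Pick the index of the listener that should receive this connection.
// 'count' is the number of sockets sharing the port and must be non-zero.
size_t pick_member(uint32_t faddr, uint16_t lport, uint16_t fport,
                   size_t count)
{
    return pkt_hash(faddr, lport, fport) % count;
}
```

Because the index depends only on the tuple and the current member count, packets of one connection keep landing on the same socket as long as the group membership does not change.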
The bulk of the patch below adds the definition of struct inpcblbgroup and the functions that allocate, deallocate and manipulate it in order to manage load balancing groups, including adding and removing member sockets or, more specifically, their PCBs (Protocol Control Blocks):

(Internal API)
- struct inpcblbgroup *in_pcblbgroup_alloc() - allocates a new LB group
- void in_pcblbgroup_free(struct inpcblbgroup *grp) - frees an existing LB group
- struct inpcblbgroup *in_pcblbgroup_resize(struct inpcblbgrouphead *hdr, struct inpcblbgroup *old_grp, int size) - creates a new LB group that is a resized version of the old one
- in_pcblbgroup_reorder(struct inpcblbgrouphead *hdr, struct inpcblbgroup **grpp, int i) - removes the PCB at index 'i' from the group, pulls up the ones below il_inp[i] and shrinks the group if possible

(Public API)
- int in_pcbinslbgrouphash(struct inpcb *inp) - adds a PCB member to the LB group for the SO_REUSEPORT option (allocates a new LB group if necessary)
- void in_pcbremlbgrouphash(struct inpcb *inp) - removes a PCB from its load balance group (frees the LB group if it was the last member)
- struct inpcb *in_pcblookup_lbgroup(const struct inpcbinfo *pcbinfo, const struct in_addr *laddr, uint16_t lport, const struct in_addr *faddr, uint16_t fport, int lookupflags) - looks up an inpcb in a load balancing group

The remaining part of the patch modifies the relevant parts of in_pcb.cc to:
1) add and remove inpcb members to/from LB groups by delegating to in_pcbinslbgrouphash() and in_pcbremlbgrouphash() during setup and teardown of sockets and their PCBs
2) look up PCBs (and the relevant sockets) by delegating to in_pcblookup_lbgroup()

This patch does not add any new locking, apart from some places that verify that certain locks are held when functions are called.
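The grow and shrink policy that the group functions listed above apply can be sketched in isolation. This is a simplified model, not the patch's code: `LbGroup`, `lb_insert()` and `lb_remove()` are hypothetical names, members are plain ints rather than PCB pointers, and capacity is only tracked as a number instead of reallocating memory. The constants mirror INPCBLBGROUP_SIZMIN and INPCBLBGROUP_SIZMAX from the diff.

```cpp
#include <cstddef>
#include <vector>

constexpr size_t kSizMin = 8;     // mirrors INPCBLBGROUP_SIZMIN
constexpr size_t kSizMax = 256;   // mirrors INPCBLBGROUP_SIZMAX

// Simplified stand-in for struct inpcblbgroup.
struct LbGroup {
    size_t capacity = kSizMin;
    std::vector<int> members;
};

// Add a member. Capacity doubles when the group is full; once the cap
// is reached the member is refused.
bool lb_insert(LbGroup &g, int member)
{
    if (g.members.size() == g.capacity) {
        if (g.capacity >= kSizMax)
            return false;         // limit reached
        g.capacity *= 2;          // expand
    }
    g.members.push_back(member);
    return true;
}

// Remove a member and keep the order of the rest. Capacity halves when
// the group is at most a quarter full and still above the minimum.
bool lb_remove(LbGroup &g, int member)
{
    for (size_t i = 0; i < g.members.size(); ++i) {
        if (g.members[i] != member)
            continue;
        g.members.erase(g.members.begin() + i);   // pull up later entries
        if (g.capacity > kSizMin && g.members.size() <= g.capacity / 4)
            g.capacity /= 2;      // shrink
        return true;
    }
    return false;
}
```

One difference worth noting: when the 256-member limit is hit, the code in the diff logs a message once and returns 0, so the socket is still bound but not load balanced. The sketch reports that case as `false` to make it visible. It also leaves out freeing the group when its last member is removed.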
Please note that at some point during the review process the original version of the FreeBSD patch contained logic originating from DragonFlyBSD (DragonFlyBSD/DragonFlyBSD@02ad2f0) to handle a drawback where a crash of a process or thread using SO_REUSEPORT would cause some pending sockets in the completion and incompletion queues to be dropped. Due to concerns about the locking logic, this part was removed from the patch (https://reviews.freebsd.org/D11003#326149) and is therefore also absent from the patch below. I believe Linux does not handle this drawback correctly either as of now.

From a practical standpoint, this patch greatly improves the throughput of applications using SO_REUSEPORT. More specifically, this HTTP server example implemented in Rust - https://gist.github.com/alexcrichton/7b97beda66d5e9b10321207cd69afbbc - yields much better performance in SMP mode (the difference is largest with 4 CPUs):

Req/sec
BEFORE this patch:
    2 CPU - 82199.52
    4 CPU - 97982.16
AFTER this patch:
    2 CPU - 82361.77
    4 CPU - 147389.79

Finally, note that this patch does not change any non-load-balancing aspects of the SO_REUSEPORT option that were already in place before this patch, but inactive. More specifically, these relate to how the SO_REUSEADDR and/or SO_REUSEPORT flags drive the same-address and/or same-port collision logic.

Some articles about SO_REUSEPORT:
- https://lwn.net/Articles/542629/
- https://linuxjedi.co.uk/2020/04/25/socket-so_reuseport-and-kernel-implementations/

V2: Compared to V1, this patch slightly changes the expression that calculates the size of the allocated memory in in_pcblbgroup_alloc() in order to make it compile with GCC 11 (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95942). So I changed this:

    bytes = __offsetof(struct inpcblbgroup, il_inp[size]);

to:

    bytes = __offsetof(struct inpcblbgroup, il_inp) + sizeof(inpcblbgroup::il_inp[0]) * size;

Fixes #1170

Signed-off-by: Waldemar Kozaczuk <[email protected]>
Message-Id: <[email protected]>
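The V2 size computation can be illustrated with a small stand-alone sketch. The `Group` type and the `group_bytes()` and `group_alloc()` helpers are hypothetical and only mimic the shape of a header followed by a trailing array. The real struct inpcblbgroup layout is not shown in this commit view. The sketch keeps the offset a compile-time constant and multiplies the element size by the run-time count separately, which is the form the V2 note switches to.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in: a fixed header followed by a trailing array
// that is over-allocated at run time.
struct Group {
    int count;
    int size;
    void *slots[1];   // real length is chosen at allocation time
};

// Bytes needed for a Group with 'n' slots: constant offset of the array
// plus element size times a run-time count.
size_t group_bytes(size_t n)
{
    return offsetof(Group, slots) + sizeof(Group::slots[0]) * n;
}

// Allocate a group with room for 'n' slots (n >= 1 assumed).
Group *group_alloc(size_t n)
{
    // calloc zeroes the header, so 'count' starts at 0.
    Group *g = static_cast<Group *>(std::calloc(1, group_bytes(n)));
    if (g)
        g->size = static_cast<int>(n);
    return g;
}
```

Using `calloc` here is a choice made for the sketch so that the member count starts at zero. The function in the diff calls plain `malloc`.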
1 parent 8f1f36c commit 985d4d7

4 files changed, +320 -0 lines changed

bsd/sys/compat/linux/linux.h

Lines changed: 1 addition & 0 deletions
@@ -89,6 +89,7 @@ typedef struct {
 #define LINUX_SO_NO_CHECK 11
 #define LINUX_SO_PRIORITY 12
 #define LINUX_SO_LINGER 13
+#define LINUX_SO_REUSEPORT 15
 #define LINUX_SO_PEERCRED 17
 #define LINUX_SO_RCVLOWAT 18
 #define LINUX_SO_SNDLOWAT 19

bsd/sys/compat/linux/linux_socket.cc

Lines changed: 2 additions & 0 deletions
@@ -340,6 +340,8 @@ linux_to_bsd_so_sockopt(int opt)
 		return (SO_OOBINLINE);
 	case LINUX_SO_LINGER:
 		return (SO_LINGER);
+	case LINUX_SO_REUSEPORT:
+		return (SO_REUSEPORT);
 	case LINUX_SO_RCVLOWAT:
 		return (SO_RCVLOWAT);
 	case LINUX_SO_SNDLOWAT:

bsd/sys/netinet/in_pcb.cc

Lines changed: 285 additions & 0 deletions
@@ -85,6 +85,9 @@
 
 #include <osv/trace.hh>
 
+#define INPCBLBGROUP_SIZMIN 8
+#define INPCBLBGROUP_SIZMAX 256
+
 TRACEPOINT(trace_inpcb_ref, "inp=%x", struct inpcb *);
 TRACEPOINT(trace_inpcb_rele, "inp=%x", struct inpcb *);
 TRACEPOINT(trace_inpcb_free, "inp=%x", struct inpcb *);
@@ -199,6 +202,202 @@ SYSCTL_VNET_INT(_net_inet_ip_portrange, OID_AUTO, randomtime, CTLFLAG_RW,
  * functions often modify hash chains or addresses in pcbs.
  */
 
+static struct inpcblbgroup *
+in_pcblbgroup_alloc(struct inpcblbgrouphead *hdr, u_char vflag,
+    uint16_t port, const union in_dependaddr *addr, int size)
+{
+	struct inpcblbgroup *grp;
+	size_t bytes;
+
+	bytes = __offsetof(struct inpcblbgroup, il_inp) + sizeof(inpcblbgroup::il_inp[0]) * size;
+	grp = (struct inpcblbgroup *)malloc(bytes);
+	if (!grp)
+		return (NULL);
+	grp->il_vflag = vflag;
+	grp->il_lport = port;
+	grp->il_dependladdr = *addr;
+	grp->il_inpsiz = size;
+	LIST_INSERT_HEAD(hdr, grp, il_list);
+	return (grp);
+}
+
+static void
+in_pcblbgroup_free(struct inpcblbgroup *grp)
+{
+
+	LIST_REMOVE(grp, il_list);
+	free(grp);
+}
+
+static struct inpcblbgroup *
+in_pcblbgroup_resize(struct inpcblbgrouphead *hdr,
+    struct inpcblbgroup *old_grp, int size)
+{
+	struct inpcblbgroup *grp;
+	int i;
+
+	grp = in_pcblbgroup_alloc(hdr, old_grp->il_vflag,
+	    old_grp->il_lport, &old_grp->il_dependladdr, size);
+	if (!grp)
+		return (NULL);
+
+	KASSERT(old_grp->il_inpcnt < grp->il_inpsiz,
+	    ("invalid new local group size %d and old local group count %d",
+	    grp->il_inpsiz, old_grp->il_inpcnt));
+
+	for (i = 0; i < old_grp->il_inpcnt; ++i)
+		grp->il_inp[i] = old_grp->il_inp[i];
+	grp->il_inpcnt = old_grp->il_inpcnt;
+	in_pcblbgroup_free(old_grp);
+	return (grp);
+}
+
+/*
+ * PCB at index 'i' is removed from the group. Pull up the ones below il_inp[i]
+ * and shrink group if possible.
+ */
+static void
+in_pcblbgroup_reorder(struct inpcblbgrouphead *hdr, struct inpcblbgroup **grpp,
+    int i)
+{
+	struct inpcblbgroup *grp = *grpp;
+
+	for (; i + 1 < grp->il_inpcnt; ++i)
+		grp->il_inp[i] = grp->il_inp[i + 1];
+	grp->il_inpcnt--;
+
+	if (grp->il_inpsiz > INPCBLBGROUP_SIZMIN &&
+	    grp->il_inpcnt <= (grp->il_inpsiz / 4)) {
+		/* Shrink this group. */
+		struct inpcblbgroup *new_grp =
+		    in_pcblbgroup_resize(hdr, grp, grp->il_inpsiz / 2);
+		if (new_grp)
+			*grpp = new_grp;
+	}
+	return;
+}
+
+/*
+ * Add PCB to load balance group for SO_REUSEPORT option.
+ */
+static int
+in_pcbinslbgrouphash(struct inpcb *inp)
+{
+	struct inpcbinfo *pcbinfo;
+	struct inpcblbgrouphead *hdr;
+	struct inpcblbgroup *grp;
+	uint16_t hashmask, lport;
+	uint32_t group_index;
+	static int limit_logged = 0;
+
+	pcbinfo = inp->inp_pcbinfo;
+
+	INP_LOCK_ASSERT(inp);
+	INP_HASH_WLOCK_ASSERT(pcbinfo);
+
+	if (pcbinfo->ipi_lbgrouphashbase == NULL)
+		return (0);
+
+	hashmask = pcbinfo->ipi_lbgrouphashmask;
+	lport = inp->inp_lport;
+	group_index = INP_PCBLBGROUP_PORTHASH(lport, hashmask);
+	hdr = &pcbinfo->ipi_lbgrouphashbase[group_index];
+
+#ifdef INET6
+	/*
+	 * Don't allow IPv4 mapped INET6 wild socket.
+	 */
+	if ((inp->inp_vflag & INP_IPV4) &&
+	    inp->inp_laddr.s_addr == INADDR_ANY &&
+	    INP_CHECK_SOCKAF(inp->inp_socket, AF_INET6)) {
+		return (0);
+	}
+#endif
+
+	hdr = &pcbinfo->ipi_lbgrouphashbase[
+	    INP_PCBLBGROUP_PORTHASH(inp->inp_lport,
+	    pcbinfo->ipi_lbgrouphashmask)];
+	LIST_FOREACH(grp, hdr, il_list) {
+		if (grp->il_vflag == inp->inp_vflag &&
+		    grp->il_lport == inp->inp_lport &&
+		    memcmp(&grp->il_dependladdr,
+		    &inp->inp_inc.inc_ie.ie_dependladdr,
+		    sizeof(grp->il_dependladdr)) == 0) {
+			break;
+		}
+	}
+	if (grp == NULL) {
+		/* Create new load balance group. */
+		grp = in_pcblbgroup_alloc(hdr, inp->inp_vflag,
+		    inp->inp_lport, &inp->inp_inc.inc_ie.ie_dependladdr,
+		    INPCBLBGROUP_SIZMIN);
+		if (!grp)
+			return (ENOBUFS);
+	} else if (grp->il_inpcnt == grp->il_inpsiz) {
+		if (grp->il_inpsiz >= INPCBLBGROUP_SIZMAX) {
+			if (!limit_logged) {
+				limit_logged = 1;
+				printf("lb group port %d, limit reached\n",
+				    ntohs(grp->il_lport));
+			}
+			return (0);
+		}
+
+		/* Expand this local group. */
+		grp = in_pcblbgroup_resize(hdr, grp, grp->il_inpsiz * 2);
+		if (!grp)
+			return (ENOBUFS);
+	}
+
+	KASSERT(grp->il_inpcnt < grp->il_inpsiz,
+	    ("invalid local group size %d and count %d",
+	    grp->il_inpsiz, grp->il_inpcnt));
+
+	grp->il_inp[grp->il_inpcnt] = inp;
+	grp->il_inpcnt++;
+	return (0);
+}
+
+/*
+ * Remove PCB from load balance group.
+ */
+static void
+in_pcbremlbgrouphash(struct inpcb *inp)
+{
+	struct inpcbinfo *pcbinfo;
+	struct inpcblbgrouphead *hdr;
+	struct inpcblbgroup *grp;
+	int i;
+
+	pcbinfo = inp->inp_pcbinfo;
+
+	INP_LOCK_ASSERT(inp);
+	INP_HASH_WLOCK_ASSERT(pcbinfo);
+
+	if (pcbinfo->ipi_lbgrouphashbase == NULL)
+		return;
+
+	hdr = &pcbinfo->ipi_lbgrouphashbase[
+	    INP_PCBLBGROUP_PORTHASH(inp->inp_lport,
+	    pcbinfo->ipi_lbgrouphashmask)];
+
+	LIST_FOREACH(grp, hdr, il_list) {
+		for (i = 0; i < grp->il_inpcnt; ++i) {
+			if (grp->il_inp[i] != inp)
+				continue;
+
+			if (grp->il_inpcnt == 1) {
+				/* We are the last, free this local group. */
+				in_pcblbgroup_free(grp);
+			} else {
+				/* Pull up inpcbs, shrink group if possible. */
+				in_pcblbgroup_reorder(hdr, &grp, i);
+			}
+			return;
+		}
+	}
+}
+
 /*
  * Initialize an inpcbinfo -- we should be able to reduce the number of
  * arguments in time.
@@ -221,6 +420,8 @@ in_pcbinfo_init(struct inpcbinfo *pcbinfo, const char *name,
 	    &pcbinfo->ipi_hashmask);
 	pcbinfo->ipi_porthashbase = (inpcbporthead *)hashinit(porthash_nelements, 0,
 	    &pcbinfo->ipi_porthashmask);
+	pcbinfo->ipi_lbgrouphashbase = (inpcblbgrouphead *)hashinit(hash_nelements, 0,
+	    &pcbinfo->ipi_lbgrouphashmask);
 	// FIXME: uma_zone_set_max(pcbinfo->ipi_zone, maxsockets);
 }
 
@@ -1090,6 +1291,7 @@ in_pcbdrop(struct inpcb *inp)
 		struct inpcbport *phd = inp->inp_phd;
 
 		INP_HASH_WLOCK(inp->inp_pcbinfo);
+		in_pcbremlbgrouphash(inp);
 		LIST_REMOVE(inp, inp_hash);
 		LIST_REMOVE(inp, inp_portlist);
 		if (LIST_FIRST(&phd->phd_pcblist) == NULL) {
@@ -1340,6 +1542,61 @@ in_pcblookup_local(struct inpcbinfo *pcbinfo, struct in_addr laddr,
 }
 #undef INP_LOOKUP_MAPPED_PCB_COST
 
+static struct inpcb *
+in_pcblookup_lbgroup(const struct inpcbinfo *pcbinfo,
+    const struct in_addr *laddr, uint16_t lport, const struct in_addr *faddr,
+    uint16_t fport, int lookupflags)
+{
+	struct inpcb *local_wild = NULL;
+	const struct inpcblbgrouphead *hdr;
+	struct inpcblbgroup *grp;
+	struct inpcblbgroup *grp_local_wild;
+
+	INP_HASH_LOCK_ASSERT(pcbinfo);
+
+	hdr = &pcbinfo->ipi_lbgrouphashbase[
+	    INP_PCBLBGROUP_PORTHASH(lport, pcbinfo->ipi_lbgrouphashmask)];
+
+	/*
+	 * Order of socket selection:
+	 * 1. non-wild.
+	 * 2. wild (if lookupflags contains INPLOOKUP_WILDCARD).
+	 *
+	 * NOTE:
+	 * - Load balanced group does not contain jailed sockets
+	 * - Load balanced group does not contain IPv4 mapped INET6 wild sockets
+	 */
+	LIST_FOREACH(grp, hdr, il_list) {
+#ifdef INET6
+		if (!(grp->il_vflag & INP_IPV4))
+			continue;
+#endif
+
+		if (grp->il_lport == lport) {
+
+			uint32_t idx = 0;
+			int pkt_hash = INP_PCBLBGROUP_PKTHASH(faddr->s_addr,
+			    lport, fport);
+
+			idx = pkt_hash % grp->il_inpcnt;
+
+			if (grp->il_laddr.s_addr == laddr->s_addr) {
+				return (grp->il_inp[idx]);
+			} else {
+				if (grp->il_laddr.s_addr == INADDR_ANY &&
+				    (lookupflags & INPLOOKUP_WILDCARD)) {
+					local_wild = grp->il_inp[idx];
+					grp_local_wild = grp;
+				}
+			}
+		}
+	}
+	if (local_wild != NULL) {
+		return (local_wild);
+	}
+	return (NULL);
+}
+
 /*
  * Lookup PCB in hash list, using pcbinfo tables. This variation assumes
  * that the caller has locked the hash list, and will not perform any further
@@ -1387,6 +1644,18 @@ in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo, struct in_addr faddr,
 	if (tmpinp != NULL)
 		return (tmpinp);
 
+	/*
+	 * Then look in lb group (for wildcard match).
+	 */
+	if (pcbinfo->ipi_lbgrouphashbase != NULL &&
+	    (lookupflags & INPLOOKUP_WILDCARD)) {
+		inp = in_pcblookup_lbgroup(pcbinfo, &laddr, lport, &faddr,
+		    fport, lookupflags);
+		if (inp != NULL) {
+			return (inp);
+		}
+	}
+
 	/*
 	 * Then look for a wildcard match, if requested.
 	 */
@@ -1552,6 +1821,18 @@ in_pcbinshash_internal(struct inpcb *inp)
 	pcbporthash = &pcbinfo->ipi_porthashbase[
 	    INP_PCBPORTHASH(inp->inp_lport, pcbinfo->ipi_porthashmask)];
 
+	/*
+	 * Add entry to load balance group.
+	 * Only do this if INP_REUSEPORT is set.
+	 */
+	if (inp->inp_flags2 & INP_REUSEPORT) {
+		int ret = in_pcbinslbgrouphash(inp);
+		if (ret) {
+			/* pcb lb group malloc fail (ret=ENOBUFS). */
+			return (ret);
+		}
+	}
+
 	/*
 	 * Go through port list and look for a head for this lport.
 	 */
@@ -1642,6 +1923,10 @@ in_pcbremlists(struct inpcb *inp)
 		struct inpcbport *phd = inp->inp_phd;
 
 		INP_HASH_WLOCK(pcbinfo);
+
+		/* XXX: Only do if SO_REUSEPORT set? */
+		in_pcbremlbgrouphash(inp);
+
 		LIST_REMOVE(inp, inp_hash);
 		LIST_REMOVE(inp, inp_portlist);
 		if (LIST_FIRST(&phd->phd_pcblist) == NULL) {
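The selection order documented in in_pcblookup_lbgroup() (an exact local address first, a wildcard group only as a fallback) can be modeled outside the kernel. The sketch below is an illustration under assumptions, not the patch's code: `LookupGroup` and `lookup()` are hypothetical names, sockets are represented by integer ids, and the packet hash is passed in as a precomputed value.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kAddrAny = 0;   // plays the role of INADDR_ANY

// Simplified group: one local address/port and the sockets bound to it.
struct LookupGroup {
    uint32_t laddr;
    uint16_t lport;
    std::vector<int> members;
};

// Returns a member id or -1. An exact local-address match wins at once;
// a wildcard group is remembered and used only if no exact match exists.
int lookup(const std::vector<LookupGroup> &groups, uint32_t laddr,
           uint16_t lport, uint32_t pkt_hash, bool allow_wildcard)
{
    int local_wild = -1;
    for (const LookupGroup &g : groups) {
        if (g.lport != lport || g.members.empty())
            continue;
        int candidate = g.members[pkt_hash % g.members.size()];
        if (g.laddr == laddr)
            return candidate;              // 1. non-wild
        if (g.laddr == kAddrAny && allow_wildcard)
            local_wild = candidate;        // 2. wild, kept as fallback
    }
    return local_wild;
}
```

The empty-group check is an extra guard for the sketch. In the diff a group is freed when its last member leaves, so the lookup there never sees an empty group.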
