Opened 8 years ago

Closed 4 years ago

Last modified 4 years ago

#53117 closed defect (fixed)

Braeburn POST requests (Trac, Mailman) are unreliable

Reported by: mf2k (Frank Schima) Owned by: admin@…
Priority: Normal Milestone:
Component: server/hosting Version:
Keywords: Cc: ryandesign (Ryan Carsten Schmidt)
Port:

Description (last modified by mf2k (Frank Schima))

As a frequent user of the new Trac, I see it fail to accept my changes about 10-20% of the time. What happens is that I press the "Preview" or "Update" button and get a timeout page. I will update this ticket with more precise details the next time it happens to me.

Attachments (1)

Screen Shot 2016-12-20 at 1.25.25 PM.png (31.3 KB) - added by mf2k (Frank Schima) 8 years ago.

Download all attachments as: .zip

Change History (26)

comment:1 Changed 8 years ago by neverpanic (Clemens Lang)

I don't see anything out of the ordinary in the statistics. If you have a timestamp I can look into the logs.

comment:2 Changed 8 years ago by mf2k (Frank Schima)

Last time it happened to me was approximately 8:15 am MST (about 18 minutes before this post).

comment:3 Changed 8 years ago by mf2k (Frank Schima)

It happened just now when trying to update a ticket. Attaching a screenshot of the error from Safari.

Changed 8 years ago by mf2k (Frank Schima)

comment:4 Changed 8 years ago by ryandesign (Ryan Carsten Schmidt)

Cc: ryandesign added
Summary: Trac is unreliableBraeburn POST requests (Trac, Mailman) are unreliable

This has been happening to me intermittently ever since we moved to Braeburn, as I've previously reported by email. It's not limited to Trac either; I've seen the same problem with Mailman. It doesn't matter what network I'm on. It seems to be limited to POST requests.

comment:5 Changed 8 years ago by neverpanic (Clemens Lang)

On the trac side, we see errors such as

[Tue Dec 20 15:16:42.208829 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119] mod_wsgi (pid=24819): Exception occurred processing WSGI script '/var/www/trac/cgi-bin/trac.wsgi'.
[Tue Dec 20 15:16:42.208961 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119] Traceback (most recent call last):
[Tue Dec 20 15:16:42.208980 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]   File "/var/www/trac/cgi-bin/trac.wsgi", line 31, in application
[Tue Dec 20 15:16:42.209005 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]     return dispatch_request(environ, start_request)
[Tue Dec 20 15:16:42.209014 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]   File "/var/www/trac/virtualenv/lib/python2.7/site-packages/trac/web/main.py", line 530, in dispatch_request
[Tue Dec 20 15:16:42.209028 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]     return _dispatch_request(req, env, env_error)
[Tue Dec 20 15:16:42.209035 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]   File "/var/www/trac/virtualenv/lib/python2.7/site-packages/trac/web/main.py", line 569, in _dispatch_request
[Tue Dec 20 15:16:42.209045 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]     send_internal_error(env, req, sys.exc_info())
[Tue Dec 20 15:16:42.209052 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]   File "/var/www/trac/virtualenv/lib/python2.7/site-packages/trac/web/main.py", line 688, in send_internal_error
[Tue Dec 20 15:16:42.209062 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]     req.send_error(exc_info, status=500, env=env, data=data)
[Tue Dec 20 15:16:42.209069 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]   File "/var/www/trac/virtualenv/lib/python2.7/site-packages/trac/web/api.py", line 597, in send_error
[Tue Dec 20 15:16:42.209080 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]     exc_info)
[Tue Dec 20 15:16:42.209086 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]   File "/var/www/trac/virtualenv/lib/python2.7/site-packages/trac/web/main.py", line 562, in _dispatch_request
[Tue Dec 20 15:16:42.209096 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]     dispatcher.dispatch(req)
[Tue Dec 20 15:16:42.209103 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]   File "/var/www/trac/virtualenv/lib/python2.7/site-packages/trac/web/main.py", line 249, in dispatch
[Tue Dec 20 15:16:42.209113 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]     resp = chosen_handler.process_request(req)
[Tue Dec 20 15:16:42.209120 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]   File "/var/www/trac/virtualenv/lib/python2.7/site-packages/trac/versioncontrol/web_ui/browser.py", line 404, in process_request
[Tue Dec 20 15:16:42.209131 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]     self._render_zip(req, context, repos, node, rev)
[Tue Dec 20 15:16:42.209205 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]   File "/var/www/trac/virtualenv/lib/python2.7/site-packages/trac/versioncontrol/web_ui/browser.py", line 661, in _render_zip
[Tue Dec 20 15:16:42.209231 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]     render_zip(req, filename, repos, root_node, self._iter_nodes)
[Tue Dec 20 15:16:42.209243 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]   File "/var/www/trac/virtualenv/lib/python2.7/site-packages/trac/versioncontrol/web_ui/util.py", line 231, in render_zip
[Tue Dec 20 15:16:42.209368 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]     req.write(zip_str)
[Tue Dec 20 15:16:42.209375 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]   File "/var/www/trac/virtualenv/lib/python2.7/site-packages/trac/web/api.py", line 698, in write
[Tue Dec 20 15:16:42.209394 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119]     self._write(''.join(buf))
[Tue Dec 20 15:16:42.209410 2016] [wsgi:error] [pid 24819] [remote x.y.z.a:64119] IOError: Apache/mod_wsgi failed to write response data: Broken pipe

Unfortunately that's not very helpful in telling us why the pipe was broken (i.e. the connection was closed)

comment:6 Changed 8 years ago by neverpanic (Clemens Lang)

I can also see the requests sent to the ticket you were trying to update, but unfortunately no entry in access.log with a 500 error code for this ticket (weird, that should have been there).

As to the claim that this affects only post requests, that's an interesting observation. I've run some statistics on our logs to check which requests fail more often. We had 664993 GET requests on 2016-12-20, out of which 9973 failed with error 500, which is about 1.5 %. In the same time, we had 970 POST requests, 6 of which failed, i.e. 0.62 %.

Ryan: Do you happen to have a specific request to mailman that you saw fail? That would be interesting to analyze to find out whether the problem is on Apache-side or in mod_wsgi.

Last edited 8 years ago by neverpanic (Clemens Lang) (previous) (diff)

comment:7 Changed 8 years ago by ryandesign (Ryan Carsten Schmidt)

I don't recall specific mailman failures, just that I experienced the problem several times when submitting forms on the mailman web interface on the day when we moved the lists from macosforge to Braeburn.

comment:8 Changed 8 years ago by neverpanic (Clemens Lang)

I have updated mod_wsgi to the latest version to see if that helps fix the issue. For the record, the resources to build a newer version are

comment:9 Changed 8 years ago by mf2k (Frank Schima)

The problem happened again to me today.

comment:10 Changed 8 years ago by neverpanic (Clemens Lang)

Do you have IPv6 connectivity when this happens?

comment:11 Changed 8 years ago by mf2k (Frank Schima)

I do not believe so.

comment:12 Changed 8 years ago by ryandesign (Ryan Carsten Schmidt)

Just now, I saw the problem when pressing the "Let me in..." button to get in to mailman's admindb.

I'm not aware of ever having had IPv6 access to anything.

comment:13 in reply to:  12 Changed 8 years ago by neverpanic (Clemens Lang)

Replying to ryandesign:

Just now, I saw the problem when pressing the "Let me in..." button to get in to mailman's admindb.

That would have been Thu Jan 5 02:58:04 UTC 2017. The error log for the mailman virtual host has no entry for that time. The access log contains some of the requests that I assume have been sent by you, but all of those requests succeeded.

Since this rules out mod_wsgi (which isn't used for the virtual host at all), this leaves us with potential problems in Apache MPM ITK, the network stack, the network connection or the client. Syslog and dmesg don't have anything for this period. Statistics look normal as well.

I think we may have to go down to wireshark level with this and capture the traffic when it fails to debug it further. Is anybody willing to record their network traffic to braeburn for a while until they can reproduce and then extract the relevant parts?

comment:14 Changed 8 years ago by mf2k (Frank Schima)

I have to say that I rarely see this problem anymore.

comment:15 Changed 8 years ago by neverpanic (Clemens Lang)

Resolution: fixed
Status: newclosed

comment:16 Changed 8 years ago by ryandesign (Ryan Carsten Schmidt)

It still happens occasionally.

comment:17 Changed 8 years ago by neverpanic (Clemens Lang)

I can see a spike in our statistics around 13:00 UTC on Saturday April 8. It seems like somebody issued a large number of requests to our HTTP server.

Could this be the issue you were seeing?

comment:18 Changed 6 years ago by mf2k (Frank Schima)

This still happens to me occasionally, including today.

comment:19 Changed 6 years ago by ryandesign (Ryan Carsten Schmidt)

I also still see the problem occasionally, maybe once or twice every few days. It's annoying when it happens after I've just written a new ticket, because I can't just click the Back button and click Submit again, because the first time I clicked Submit, it was disabled with JavaScript (presumably to prevent an errant double–click from causing two tickets to be created, something we used to see before Trac had this feature).

It surprises me that the error occurs instantly. When other servers are busy, the behavior I usually see is that my request is attempted for several seconds at least before the server drops the connection.

Would some sort of per-IP rate-limiting configuration be another solution? Start delaying requests if too many come from the same IP address? Or would that be ineffective for spiders—do they send requests from a large number of IP addresses simultaneously?

comment:20 in reply to:  19 Changed 6 years ago by ryandesign (Ryan Carsten Schmidt)

Resolution: fixed
Status: closedreopened

Replying to ryandesign:

I also still see the problem occasionally, maybe once or twice every few days.

This still happens often enough to be annoying. I think a contributing factor might be that while I'm composing a ticket, many http requests are being sent to Trac to render the live preview (everytime I stop typing for a few seconds) or to look up a GitHub handle from an email address, such that by the time I press the Submit button, the web server (or whatever component contains this apparent DoS protection) has counted so many requests that it decides that this one is one too many and blocks it. I'm starting to try to train myself to wait 30 seconds after I finish composing a ticket before clicking the Submit button.

It's annoying when it happens after I've just written a new ticket, because I can't just click the Back button and click Submit again, because the first time I clicked Submit, it was disabled with JavaScript (presumably to prevent an errant double–click from causing two tickets to be created, something we used to see before Trac had this feature).

Can we at least fix this specific issue with Trac? That would make it much less annoying. Can we add custom JavaScript to our template to cause the Submit button to become re-enabled if the page is displayed in response to clicking the Back button? It probably affects not only the new-ticket form but also the add-comment form.

comment:21 Changed 6 years ago by mf2k (Frank Schima)

Description: modified (diff)

comment:22 Changed 6 years ago by raimue (Rainer Müller)

As far as I am aware, there is no kind of DoS-protection per IP address configured on braeburn for HTTP. Not in apache2 and also not in Trac. I also checked TracSpamFilter, but there is also nothing configured for IP throttling. There is also nothing in the logs related to this kind of behavior.

I have never experienced the issue, so I really don't know what else to look for. As @neverpanic said above, we would need a network traffic capture showing the problem, but it does not necessarily have to be a wireshark dump. If the Web Developer Tools in your browser also show the problem, please export a HAR archive from there. Probably better to share it privately to macports-infra@ to avoid exposing private information such as cookies.

comment:23 Changed 4 years ago by neverpanic (Clemens Lang)

Resolution: fixed
Status: reopenedclosed

I think this is no longer happening, we've got the load issues on trac that initially caused this solved.

comment:24 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

It still happens to me several times a week, I would say.

Last edited 4 years ago by ryandesign (Ryan Carsten Schmidt) (previous) (diff)

comment:25 Changed 4 years ago by neverpanic (Clemens Lang)

OK. Can you let me know when exactly and which URL, so I can take another look at the logs?

Note: See TracTickets for help on using tickets.