eXist app through Apache proxy goes compute-bound

Craig A. Berry
We have an eXist app (a now somewhat distant cousin of TEI Publisher) that has been running reasonably well for a couple of months on an AWS Linux instance, but last week it started hanging.  Access from the outside world is via an Apache proxy.  

One of the eXist processes, according to top, now goes into a state shortly after start-up in which it consumes 180%-200% CPU on this 2-CPU instance.  Initially the application still works, albeit slowly, but somewhere between a couple of minutes and a couple of hours later it stops responding entirely.  If I start Monitoring and Profiling immediately upon start-up, it runs briefly before getting disconnected, and it shows no running jobs, no running queries, no recent queries, no waiting threads, and no active threads.

The problem only happens when the Apache proxy is running.  If I don't start Apache and only access the application on port 8443, everything seems fine.  I changed the proxy timeout from 60 seconds to 20 minutes and it had no effect; the problem started well within 20 minutes of start-up.  I don't know whether the problem has anything to do with Apache per se or with something arriving from the outside world via Apache.  That said, the only thing I see in the Apache access log is some robots following our links, and at only two or three requests per minute that doesn't seem like enough to overwhelm anything.
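To put a number on "two or three requests per minute" and see which clients dominate, a short script can bucket the Apache access log by minute and by client address. This is a minimal sketch, assuming Apache's default Common or Combined Log Format; the function name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Matches the start of a Common/Combined Log Format line, e.g.
#   66.249.66.1 - - [08/Jun/2017:04:12:55 +0000] "GET / HTTP/1.1" ...
# Assumes Apache's default LogFormat; adjust the pattern for custom formats.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[(?P<day>[^:]+):(?P<hh>\d{2}):(?P<mm>\d{2}):\d{2} [^\]]+\]'
)


def summarize(lines):
    """Count requests per minute and per client host; skip unparseable lines."""
    per_minute, per_host = Counter(), Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue
        per_minute[f"{m['day']} {m['hh']}:{m['mm']}"] += 1
        per_host[m["host"]] += 1
    return per_minute, per_host
```

Usage: open the access log, pass the file object to `summarize`, and print `per_minute.most_common(5)` and `per_host.most_common(5)`. A single burst from one address would stand out even if the overall average looks low.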

We upgraded eXist from 3.1.0 to 3.2.0 and observed no differences.

Has anyone seen anything like this or have any suggestions on how to debug it?

________________________________________
Craig A. Berry

"... getting out of a sonnet is much more
 difficult than getting in."
                 Brad Leithauser


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Exist-open mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/exist-open

Re: eXist app through Apache proxy goes compute-bound

Joe Wicentowski
Hi Craig,

Are there any clues in exist.log? Do the requests in the Apache access log line up 1:1 with the requests in Jetty's access logs ($EXIST_HOME/tools/jetty/logs)?
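The 1:1 check can be scripted. The sketch below assumes both Apache and Jetty write NCSA-style lines containing a quoted request line ("GET /path HTTP/1.1"); it counts (method, path) pairs in each log and reports what appears in one but not the other. Because a proxy mapping usually adds a path prefix on the Jetty side, a hypothetical `jetty_prefix` (for example /exist/apps/myapp) is stripped before comparing. Feed both functions the same time window, since the two logs may rotate differently.

```python
import re
from collections import Counter

# The quoted request line, e.g. "GET /index.html HTTP/1.1"
REQUEST = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*"')


def request_counts(lines, strip_prefix=""):
    """Count (method, path) pairs, optionally removing a proxy path prefix."""
    counts = Counter()
    for line in lines:
        m = REQUEST.search(line)
        if m is None:
            continue
        path = m["path"]
        if strip_prefix and path.startswith(strip_prefix):
            path = path[len(strip_prefix):] or "/"
        counts[(m["method"], path)] += 1
    return counts


def compare_logs(apache_lines, jetty_lines, jetty_prefix=""):
    """Return (requests seen only by Apache, requests seen only by Jetty)."""
    apache = request_counts(apache_lines)
    jetty = request_counts(jetty_lines, strip_prefix=jetty_prefix)
    # Counter subtraction keeps only positive differences
    return apache - jetty, jetty - apache
```

Empty results on both sides mean the logs match request for request; anything left over points at requests that Apache swallowed or that reached Jetty by some other route.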

You might also try to grab JMX status snapshots when eXist is seizing up by visiting the "/status" page directly and saving it. You will probably need to append "?token=" followed by the token stored in $EXIST_HOME/webapp/WEB-INF/data/jmxservlet.token (if memory serves). This is what monex polls, and if you can post the one you get just before eXist becomes completely unresponsive, it may contain clues about what's going on.
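A minimal sketch for saving timestamped /status snapshots in a loop while eXist is seizing up. Assumptions: the base URL is hypothetical and should match your own host, port, and context; the token file is the one Joe recalls; and because that file may be a Java properties file with a `token=` line rather than a bare token, `parse_token` accepts either. Snapshots get an .xml extension on the assumption that the servlet returns XML.

```python
import time
import urllib.parse
import urllib.request
from pathlib import Path


def parse_token(text):
    """Return the token from a properties-style file (token=...) or a bare-token file."""
    bare = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # blank line or properties-file comment
        if line.startswith("token="):
            return line[len("token="):].strip()
        if bare is None:
            bare = line
    if bare is None:
        raise ValueError("no token found")
    return bare


def status_url(base_url, token):
    """Append ?token=... to the /status URL, URL-encoding the token."""
    return f"{base_url}?{urllib.parse.urlencode({'token': token})}"


def save_snapshots(base_url, token_file, count=5, interval=30.0, out_dir="."):
    """Fetch /status `count` times, `interval` seconds apart, into timestamped files."""
    token = parse_token(Path(token_file).read_text(encoding="utf-8"))
    url = status_url(base_url, token)
    saved = []
    for i in range(count):
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read()
        out = Path(out_dir) / time.strftime("status-%Y%m%d-%H%M%S.xml")
        out.write_bytes(body)
        saved.append(out)
        if i < count - 1:
            time.sleep(interval)
    return saved
```

For example, `save_snapshots("http://localhost:8080/exist/status", "/opt/exist/webapp/WEB-INF/data/jmxservlet.token")`, where both arguments are placeholders. Once eXist stops answering, `urlopen` raises a timeout; the files written before that point are the interesting ones.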

Joe

On Wed, Jun 7, 2017 at 11:40 PM Craig A. Berry <[hidden email]> wrote:
--
Sent from my iPhone


Re: eXist app through Apache proxy goes compute-bound

Adam Retter
In reply to this post by Craig A. Berry
Hi Craig,

Two things come to mind:

1) Some sort of runaway process in eXist. You can use the 'jstack' tool, which ships with the JDK, to get a point-in-time trace of exactly what eXist is doing. You might want to take a few of these to compare its slow and locked-up states.

2) Apache overwhelming eXist with network requests, either for genuine user reasons or perhaps due to a bad config causing some sort of feedback loop. You can use tools like Wireshark or tcpdump to capture the network traffic between Apache and eXist, which should help you understand the interactions between start-up and the point where it all seems to stop responding.
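A small wrapper can make the capture repeatable. This sketch assumes Apache and eXist run on the same host, so proxied traffic crosses the loopback interface `lo`, and that you pass the port Apache proxies to (the 8080 below is only a placeholder). If Apache proxies over HTTPS, the payload in the capture is encrypted and only connection timing and volume are visible.

```python
import signal
import subprocess
import time


def tcpdump_argv(port, pcap_path, interface="lo", max_packets=None):
    """Build a tcpdump command that writes full packets on `port` to a pcap file."""
    argv = ["tcpdump", "-i", interface, "-n", "-s", "0", "-w", pcap_path]
    if max_packets is not None:
        argv += ["-c", str(max_packets)]  # stop after this many packets
    argv += ["tcp", "port", str(port)]  # capture filter: Apache <-> eXist traffic only
    return argv


def capture(port, pcap_path, seconds, interface="lo"):
    """Capture for `seconds`, then stop tcpdump with SIGINT so it flushes the file."""
    proc = subprocess.Popen(tcpdump_argv(port, pcap_path, interface))
    try:
        time.sleep(seconds)
    finally:
        proc.send_signal(signal.SIGINT)
        proc.wait()
    return proc.returncode
```

For example, `capture(8080, "apache-exist.pcap", 600)` run as root from start-up, then open the file in Wireshark and look for a surge of connections or a request pattern that repeats itself.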


Re: eXist app through Apache proxy goes compute-bound

Craig A. Berry
In reply to this post by Joe Wicentowski

> On Jun 8, 2017, at 6:10 AM, Joe Wicentowski <[hidden email]> wrote:

Thanks for the reply.

> Are there any clues in exist.log?

As far as I can tell, only victims, not perpetrators.  So, for example, I've seen a broken pipe here and there, but it seemed to be from after things went crazy.
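One way to check the "victims, not perpetrators" impression more systematically is to split exist.log's warnings and errors at the moment the CPU spike began, as noted from top. A minimal sketch, assuming a log layout like `2017-06-08 10:12:34,567 [thread] LEVEL ...`; check the pattern in your own log4j2.xml and adjust `LINE` if it differs.

```python
import re
from datetime import datetime

# Assumed layout: "yyyy-MM-dd HH:mm:ss,SSS [thread] LEVEL (...) - message"
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} "
    r"\[(?P<thread>[^\]]*)\] (?P<level>[A-Z]+)\s"
)


def split_around_onset(lines, onset, levels=("WARN", "ERROR")):
    """Return (entries logged before `onset`, entries logged at or after it)."""
    before, after = [], []
    for line in lines:
        m = LINE.match(line)
        if m is None or m["level"] not in levels:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        (before if ts < onset else after).append(line.rstrip("\n"))
    return before, after
```

Anything in the "before" list is a better perpetrator candidate than a broken pipe that only shows up once the JVM is already saturated.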

> Do the requests in the Apache access log line up 1:1 with the requests in Jetty's access logs ($EXIST_HOME/tools/jetty/logs)?

Yes.

> You might also try to grab JMX status snapshots when eXist is seizing up by visiting the "/status" page directly and saving it. You will probably need to append "?token=" followed by the token stored in $EXIST_HOME/webapp/WEB-INF/data/jmxservlet.token (if memory serves). This is what monex polls, and if you can post the one you get just before eXist becomes completely unresponsive, it may contain clues about what's going on.

That's a good tip.  It took me a couple of minutes to find jmxservlet.token because we keep our data in an alternate location, but following the trail from the configuration got me there.
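For anyone else whose data directory has moved, the trail through the configuration can be scripted. This sketch assumes the data directory is given by the `files` attribute of `<db-connection>` in $EXIST_HOME/conf.xml, that a relative value is resolved against $EXIST_HOME, and that the token file keeps the name jmxservlet.token; verify each of these against your own installation.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def data_dir(conf_xml, exist_home):
    """Data directory from <db-connection files="..."> (assumed relative to EXIST_HOME)."""
    root = ET.fromstring(conf_xml)
    conn = root.find(".//db-connection")
    if conn is None or "files" not in conn.attrib:
        raise ValueError('conf.xml has no <db-connection files="...">')
    files = Path(conn.attrib["files"])
    return files if files.is_absolute() else Path(exist_home) / files


def token_path(conf_xml, exist_home):
    return data_dir(conf_xml, exist_home) / "jmxservlet.token"


def locate_token(exist_home):
    """Read $EXIST_HOME/conf.xml as bytes so any encoding declaration is honoured."""
    conf = Path(exist_home) / "conf.xml"
    return token_path(conf.read_bytes(), exist_home)
```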

This particular bug decided to go underground as soon as I announced its presence in public.  We started things up again so we could try the suggested debugging steps, but everything has been working fine for some hours now, for the first time in over a week.  Quite a puzzle, but I now have some things to try if it shows up again.


Re: eXist app through Apache proxy goes compute-bound

Craig A. Berry
In reply to this post by Adam Retter

> On Jun 8, 2017, at 6:21 AM, Adam Retter <[hidden email]> wrote:
>
> Hi Craig,
>
> Two things come to mind:
>
> 1) Some sort of runaway process in eXist. You can use the 'jstack' tool, which ships with the JDK, to get a point-in-time trace of exactly what eXist is doing. You might want to take a few of these to compare its slow and locked-up states.

Thanks.  I'm not much of a Java person and had been thinking there must be some way to identify what's stuck in a loop, the way you would with dtrace, a debugger, or a profiling tool in other languages.  Now I know.
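A common way to connect top's CPU figure to a jstack trace: run `top -H -p <pid>` to show per-thread CPU for the eXist JVM, note the thread ID of the hottest thread, and find the jstack entry whose `nid=` matches it (jstack usually prints the nid in hex). The helper below is a minimal sketch of that lookup; it splits the dump on blank lines and, in case a JDK prints the nid in decimal, accepts either form.

```python
def nid_candidates(tid):
    """top -H shows decimal thread IDs; jstack's nid is usually hex (0x...)."""
    tid = int(tid)
    return {f"nid={hex(tid)}", f"nid={tid}"}


def find_thread(dump_text, tid):
    """Return the jstack block whose header carries the given thread ID, or None."""
    wanted = nid_candidates(tid)
    for block in dump_text.split("\n\n"):
        block = block.strip("\n")
        if not block.startswith('"'):
            continue  # preamble, lock summaries, JNI counts, ...
        header = block.splitlines()[0]
        if wanted & set(header.split()):
            return block
    return None
```

The returned block's frames show what the hot thread was executing at that instant; repeating the lookup across a few dumps shows whether it keeps spinning in the same place.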

> 2) Apache overwhelming eXist with network requests, either for genuine user reasons or perhaps due to a bad config causing some sort of feedback loop. You can use tools like Wireshark or tcpdump to capture the network traffic between Apache and eXist, which should help you understand the interactions between start-up and the point where it all seems to stop responding.

We installed Wireshark and tcpdump, but as I mentioned in my reply to Joe, the problem went away as soon as we started things up again to observe it in action.  I've used Wireshark or tcpdump once or twice in the past, and the problem is that you tend to need a lot of knowledge about networking primitives to know what they are telling you.  Still, they're good to have available, and I will keep them in mind if this crops up again.
