Android系统ANR浅析

Android系统ANR机制分析与简单的日志排查
Views: 201
1 0
Read Time:7 Minute, 7 Second

ANR全称为Application Not Response,是由于App主线程长时间卡滞导致的。ANR并不代表我们的应用代码一定存在编码问题,通常不会导致FATAL EXECPTION。直观上来讲,ANR发生时应用会无法及时响应用户操作,这会极大影响用户体验。

Android APP开发中的四大基础组件中Service、BroadcastReceiver与ContentProvider以及Input都会去检测ANR超时,这里以Service为例,简述ANR产生的原因:

当我们的应用组件通过startService启动一个Service时,会通过binder通知AMS启动Service,AMS在经过一系列逻辑处理后判断可以启动,就会通过binder通知应用所在进程创建一个Service并执行Service的onCreate生命周期,与此同时AMS会通过handler发送一个延时任务,此延时任务就是ANR任务。在应用进程Service生命周期执行完后就会通过binder告知AMS解除ANR任务;如果应用进程Service生命周期超时,AMS内的ANR任务就会执行,从而记录ANR报错并dump当前系统状态。

机制解析

这里的机制解析,我们分别从应用APP以及AMS的角度来进行,并且以Android14的代码进行佐证分析。首先是应用App调用startService启动服务:

Intent serviceIntent = new Intent(context, MyService.class);
context.startService(serviceIntent);

在通过startService启动服务后,程序调用会进入到ContextImpl.java内:

//frameworks/base/core/java/android/app/ContextImpl.java
@Override
public ComponentName startService(Intent service) {
        warnIfCallingFromSystemProcess();
        return startServiceCommon(service, false, mUser);
}

private ComponentName startServiceCommon(Intent service, boolean requireForeground,
         UserHandle user) {
	   //startService核心
	    validateServiceIntent(service);
            service.prepareToLeaveProcess(this);
            ComponentName cn = ActivityManager.getService().startService(
               mMainThread.getApplicationThread(), service,
               service.resolveTypeIfNeeded(getContentResolver()),requireForeground,
               getOpPackageName(), getAttributionTag(), user.getIdentifier());
}

接下来我们进入ActivityManagerService,也就是AMS内部:

//frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
public ComponentName startService(IApplicationThread caller, Intent service,
            String resolvedType, boolean requireForeground, String callingPackage,
            String callingFeatureId, int userId)
            throws TransactionTooLargeException {
            return startService(caller, service, resolvedType, 
            requireForeground,callingPackage,callingFeatureId, userId, false /* 
            isSdkSandboxService */, INVALID_UID, null, null);
}

private ComponentName startService(IApplicationThread caller, Intent service,
    String resolvedType, boolean requireForeground, String callingPackage,
    String callingFeatureId, int userId, boolean isSdkSandboxService,
    int sdkSandboxClientAppUid, String sdkSandboxClientAppPackage, String instanceName)
{
	//mServices属于ActiveServices对象
        res = mServices.startServiceLocked(caller, service,
                 resolvedType, callingPid, callingUid,
                 requireForeground, callingPackage, callingFeatureId, userId,
                 isSdkSandboxService,
                 sdkSandboxClientAppUid,sdkSandboxClientAppPackage,
                 instanceName);
         
}

紧接着我们进入ActiveServices类:

//frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
ComponentName startServiceLocked(IApplicationThread caller, Intent service, String resolvedType,int callingPid, int callingUid, boolean fgRequired,String callingPackage, @Nullable String callingFeatureId, final int userId,BackgroundStartPrivileges backgroundStartPrivileges, boolean isSdkSandboxService,
int sdkSandboxClientAppUid, String sdkSandboxClientAppPackage, String instanceName)
{

final ComponentName realResult =
                startServiceInnerLocked(r, service, callingUid, callingPid,
                callingProcessName, callingProcessState,
                fgRequired, callerFg,
                backgroundStartPrivileges, callingPackage);
	……
}

//frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
ComponentName startServiceInnerLocked(ServiceMap smap, Intent service, ServiceRecord r,boolean callerFg, boolean addToStarting, int callingUid, String callingProcessName,int callingProcessState, boolean wasStartRequested, String callingPackage)
{
     final int uid = r.appInfo.uid;
     final String packageName = r.name.getPackageName();
     final String serviceName = r.name.getClassName();
     FrameworkStatsLog.write(FrameworkStatsLog.SERVICE_STATE_CHANGED, uid, 
     packageName,serviceName,FrameworkStatsLog.SERVICE_STATE_CHANGED__STATE__START);
     mAm.mBatteryStatsService.noteServiceStartRunning(uid, packageName, serviceName);
     //启动Service的核心
     String error = bringUpServiceLocked(r, service.getFlags(), callerFg,
                false /* whileRestarting */,
                false /* permissionsReviewRequired */,
                false /* packageFrozen */,
                true /* enqueueOomAdj */);
}

在startServiceInnerLocked中会进入到bringUpServiceInnerLocked,这里面会分两种情况进行下一步处理。

一种情况是如果Service所在的App进程已经启动,则会调用realStartServiceLocked方法:

//frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
if (app != null) {
    app.addPackage(r.appInfo.packageName, r.appInfo.longVersionCode,	mAm.mProcessStats);
    //当APP已经启动时,启动service
    realStartServiceLocked(r, app, thread, pid, uidRecord, execInFg,enqueueOomAdj);
}

//进入realStartServiceLocked内通过scheduleCreateService创建service
private void realStartServiceLocked(ServiceRecord r, ProcessRecord app,
      IApplicationThread thread, int pid, UidRecord uidRecord, boolean execInFg,
      boolean enqueueOomAdj) throws RemoteException {
       ……
	boolean created = false;
        try {
	//这里的thread对象代表应用进程的ApplicationThread
        thread.scheduleCreateService(r, r.serviceInfo,
        null /* compatInfo (unused but need to keep method signature) */,
        app.mState.getReportedProcState());
        r.postNotification(false);
        created = true;
     }
}

另一种情况是如果App进程未启动,又可细分为两类:

//frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
if (app == null && !permissionsReviewRequired && !packageFrozen) {	  
     if (r.isSdkSandbox) {
                final int uid = Process.toSdkSandboxUid(r.sdkSandboxClientAppUid);
                app = mAm.startSdkSandboxProcessLocked(procName, r.appInfo, true, 
                intentFlags,hostingRecord, ZYGOTE_POLICY_FLAG_EMPTY, uid, 
                r.sdkSandboxClientAppPackage);
                r.isolationHostProc = app;
            } else {
                app = mAm.startProcessLocked(procName, r.appInfo, true, intentFlags,
                        hostingRecord, ZYGOTE_POLICY_FLAG_EMPTY, false, isolated);
            }

}

这里我们以App已启动为例,通过realStartServiceLocked启动Service,最终会进入到scheduleServiceTimeoutLocked方法:

//frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
void scheduleServiceTimeoutLocked(ProcessRecord proc) {
   if (proc.mServices.numberOfExecutingServices() == 0 || proc.getThread() == null) {
            return;
     }
    Message msg = mAm.mHandler.obtainMessage(
ActivityManagerService.SERVICE_TIMEOUT_MSG);
        msg.obj = proc;
        mAm.mHandler.sendMessageDelayed(msg, proc.mServices.shouldExecServicesFg()
                ? mAm.mConstants.SERVICE_TIMEOUT : mAm.mConstants.SERVICE_BACKGROUND_TIMEOUT);
}

这里我们可以看到,在scheduleServiceTimeoutLocked中通过handler为Service启动设定了超时任务,这里的超时时间有两个值:针对前台服务是20s,针对后台服务是200s.

DEFAULT_SERVICE_TIMEOUT = 20 * 1000 * Build.HW_TIMEOUT_MULTIPLIER;
DEFAULT_SERVICE_BACKGROUND_TIMEOUT = DEFAULT_SERVICE_TIMEOUT * 10;

至此在AMS中的任务已经告一段落,接着我们回到应用App自身,前面讲到在realStartServiceLocked方法中通过App的ApplicationThread的scheduleCreateService创建service,这里最终进入的是ActivityThread.java中scheduleCreateService:

public final void scheduleCreateService(IBinder token,
      ServiceInfo info, CompatibilityInfo compatInfo, int processState) {
      updateProcessState(processState, false);
      CreateServiceData s = new CreateServiceData();
       s.token = token;
       s.info = info;

        sendMessage(H.CREATE_SERVICE, s);
 }

沿着调用链继续追踪,最终回到了ActiveServices.java内的serviceDoneExecutingLocked方法:

//frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying,
     boolean finishing, boolean enqueueOomAdj, @OomAdjReason int oomAdjReason) {
final ProcessServiceRecord psr = r.app.mServices;
          if (DEBUG_SERVICE) Slog.v(TAG_SERVICE,"Nesting at 0 of " + r.shortInstanceName);
                psr.setExecServicesFg(false);
                psr.stopExecutingService(r);
                if (psr.numberOfExecutingServices() == 0) {
                    if (DEBUG_SERVICE || DEBUG_SERVICE_EXECUTING) Slog.v(TAG_SERVICE_EXECUTING, "No more executingServices of " + r.shortInstanceName);
mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app);
} else if (r.executeFg) {
    ……
  }
}

这里我们可以看到,在Service创建完成之后,会通过handler将之前的ANR任务解除。

日志解析

Android原生记录的ANR日志都位于/data/anr文件夹内,当应用发生ANR时,系统可能会存在问题从而导致众多应用都会发生ANR,针对ANR日志我们重点是需要学会如何分析日志,找到ANR的root cause。

当我们通过adb打开/data/anr文件夹时,我们能够看到如下所示内容:

anr_2024-03-31-02-16-39-479
anr_2024-03-31-02-17-21-731
anr_2024-03-31-02-17-58-405
…

所有的anr日志都以anr_xxxx开头,日志文件会携带时间信息,用于表明该anr捕获的时间;

我们打开anr日志文件,有一些关键字段的信息是极其重要的。

subject字段开头的日志表明anr发生的问题原因,如下所示:

Subject: Input dispatching timed out (141c2c9 com.blacksesame.video/com.example.video.ui.activity.VideoPlayActivity (server) is not responding. Waited 5000ms for MotionEvent(deviceId=4, eventTime=225178371585000, source=TOUCHSCREEN, displayId=2, action=DOWN, actionButton=0x00000000, flags=0x00000000, metaState=0x00000000, buttonState=0x00000000, classification=NONE, edgeFlags=0x00000000, xPrecision=8.5, yPrecision=13.3, xCursorPosition=nan, yCursorPosition=nan, pointers=[0: (756.5, 645.2)]), policyFlags=0x62000001).

上述错误表明在com.example.video应用中的VideoPlayActivity发生了anr,anr的原因是由于Input dispatch超时,表明Input分发存在卡滞导致没有响应用户输入。

events字段用于记录该ANR发生之前的ANR记录,如下所示:

events <
  timestamp_ms: 1711851399478
  anr <
    pid: 2564
    process: "com.example.video"
    process_class: 1
    subject: "Input dispatching timed out (141c2c9 com.blacksesame.video/com.example.video.ui.activity.VideoPlayActivity (server) is not responding. Waited 5000ms for MotionEvent(deviceId=4, eventTime=225178371[...]"
    uid: 10114
  >
>
events <
  timestamp_ms: 1711851441487
  half_watchdog <
    subject: "Blocked in monitor com.android.server.am.ActivityManagerService on monitor thread (watchdog.monitor) for 30s, Blocked in handler on  on foreground thread (android.fg) for 30s, Blocked in handler on  o[...]"
  >
>

这里的events记录了两次ANR,这些ANR都发生在本次ANR之前,可以作为ANR问题发生时的辅助信息。

上述内容是一些信息总览,用于我们快速获取信息,具体的报错日志需要通过dumping pid字段进行分析,如下所示:

----- dumping pid: 710 at 225225702

----- pid 710 at 2024-03-31 02:17:21.735883895+0000 -----
Cmd line: system_server
Build fingerprint: 'c1200_aaos/c1200:14/UQ1A.231205.015.A1/V1.1.3.0:userdebug/test-keys'
ABI: 'arm64'
Build type: optimized
suspend all histogram:	Sum: 11.916ms 99% C.I. 1us-1476.159us Avg: 85.726us Max: 2919us
DALVIK THREADS (181):
//“main”与tid=1表明属于主线程,Blocked表明主线程属于Block状态
"main" prio=5 tid=1 Blocked
  | group="main" sCount=1 ucsCount=0 flags=1 obj=0x71e344c8 self=0xb400f270432d4be0
  | sysTid=710 nice=-2 cgrp=foreground sched=0/0 handle=0xf271325994f8
  | state=S schedstat=( 13723566710 13467097570 113753 ) utm=1082 stm=289 core=3 HZ=100
  | stack=0xffffc7177000-0xffffc7179000 stackSize=8188KB
//表明发生死锁
  | held mutexes=
  at com.android.server.am.ActivityManagerService.broadcastIntentWithFeature(ActivityManagerService.java:15389)
//关键信息,对应lock被thread 14占有
  - waiting to lock <0x0d8bc079> (a com.android.server.am.ActivityManagerService) held by thread 14
  at android.app.ContextImpl.sendBroadcastAsUser(ContextImpl.java:1478)
  at com.android.server.alarm.AlarmManagerService$3.lambda$doAlarm$0(AlarmManagerService.java:1995)
  at com.android.server.alarm.AlarmManagerService$3.$r8$lambda$o_OrcmiwSegk0Cl6DW0Phad6Mm8(AlarmManagerService.java:0)
  at com.android.server.alarm.AlarmManagerService$3$$ExternalSyntheticLambda0.run(R8$$SyntheticClass:0)
  at android.os.Handler.handleCallback(Handler.java:958)
  at android.os.Handler.dispatchMessage(Handler.java:99)
  at android.os.Looper.loopOnce(Looper.java:205)
  at android.os.Looper.loop(Looper.java:294)
  at com.android.server.SystemServer.run(SystemServer.java:973)
  at com.android.server.SystemServer.main(SystemServer.java:667)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:552)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:949)

这里我们可以看到发生的问题在于dead lock,且lock被同一进程内的其他线程(Thread 14)持有,我们看看此时Thread 14此时的状态:

"android.ui" prio=5 tid=14 Native
  | group="main" sCount=1 ucsCount=0 flags=1 obj=0x14c01c28 self=0xb400f2704330c5e0
  | sysTid=729 nice=-10 cgrp=top-app sched=0/0 handle=0xf26ddf04bcb0
  | state=S schedstat=( 754225290 619216305 2951 ) utm=55 stm=20 core=0 HZ=100
  | stack=0xf26ddef48000-0xf26ddef4a000 stackSize=1039KB
  | held mutexes=
  native: #00 pc 000aa698  /apex/com.android.runtime/lib64/bionic/libc.so (__ioctl+8) (BuildId: c74277f481a383c87215b672f6465e24)
  native: #01 pc 00063668  /apex/com.android.runtime/lib64/bionic/libc.so (ioctl+156) (BuildId: c74277f481a383c87215b672f6465e24)
  native: #02 pc 0005e294  /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver+280) (BuildId: 1570a59fb47f18bdbd3b7aa21d8ceedc)
  native: #03 pc 0005f5f0  /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse+60) (BuildId: 1570a59fb47f18bdbd3b7aa21d8ceedc)
  native: #04 pc 0005f330  /system/lib64/libbinder.so (android::IPCThreadState::transact+216) (BuildId: 1570a59fb47f18bdbd3b7aa21d8ceedc)
  native: #05 pc 00076640  /system/lib64/libbinder.so (android::BpBinder::transact+188) (BuildId: 1570a59fb47f18bdbd3b7aa21d8ceedc)
  native: #06 pc 00114d1c  /system/lib64/libgui.so (android::gui::BpSurfaceComposer::captureLayers+436) (BuildId: d6c3eaedda5518d14dd35993bc419095)
  native: #07 pc 000ff578  /system/lib64/libgui.so (android::ScreenshotClient::captureLayers+88) (BuildId: d6c3eaedda5518d14dd35993bc419095)
  native: #08 pc 00205fd8  /system/lib64/libandroid_runtime.so (android::nativeCaptureLayers+340) (BuildId: e18d4023419fc50e2dfbce708dce5ae6)
  at android.window.ScreenCapture.;(Native method)
  at android.window.ScreenCapture.captureLayers(ScreenCapture.java:173)
  at android.window.ScreenCapture.captureLayers(ScreenCapture.java:136)
  at android.window.ScreenCapture.captureLayersExcluding(ScreenCapture.java:163)

这里我们可以看到,在名为android.ui的线程内,通过jni调用进入到了libgui所提供的captureLayers方法,其作为bp端,对应的实现在SurfaceFlinger进程。此时我们应当去观察SurfaceFlinger及其相关组件(Gralloc Mapper,Grolloc Allocator,HWC等),从而进行下一步追查。

以上就是一个ANR日志分析的大致步骤。

规避措施

大多数ANR发生都是由于应用编码不规范导致的,少部分ANR是由于Native FWK的阻塞引起(binder或者jni调用),也有部分ANR是由于系统资源紧张而导致的。作为应用开发者,我们应当遵循一定的准则从而尽可能规避ANR的产生:

1.避免在主线程上执行涉及 复杂I/O 的操作。

2.避免在主线程上进行长时间的CPU计算。

3.避免在主线程在对另一个进程进行同步 binder 调用,而后者需要很长时间才能返回。

4.避免主线程因长时间等待其他线程状态同步而处于阻塞状态。

5.避免主线程在进程中或通过 binder 调用与另一个线程之间发生死锁。

Happy
Happy
100 %
Sad
Sad
0 %
Excited
Excited
0 %
Sleepy
Sleepy
0 %
Angry
Angry
0 %
Surprise
Surprise
0 %
FranzKafka95
FranzKafka95

极客,文学爱好者。如果你也喜欢我,那你大可不必害羞。

文章: 88

留下评论

您的电子邮箱地址不会被公开。 必填项已用*标注

zh_CNCN