Sentaurus Device
6. Three-Dimensional Device Simulation

6.1 Overview
6.2 Meshing Controls
6.3 Numeric Controls
6.4 Linear Solvers
6.5 Parallelization
6.6 Understanding the Log File

Objectives

6.1 Overview

In addition to its 2D simulation capabilities, Sentaurus Device can simulate the thermoelectrical behavior of devices with native 3D and cylindrical device geometries.

Typically, 3D device simulations are considerably more computationally expensive than 2D ones, so special solutions are required to make them run efficiently.

Some of these solutions are:

The corresponding example, where most of the solutions described in this section are applied, can be found in the Applications_Library/GettingStarted/sdevice/3Ddiode_demo directory.

6.2 Meshing Controls

Mesh generation that produces grids suitable for 3D device simulations is not always an easy task. See Section 2. Using the Axis-Aligned Mesh Generator for detailed descriptions of the different aspects of mesh generation using Sentaurus Mesh.

This section summarizes some recommendations for those who are starting with 3D mesh generation:

Figure 1. Suggested Delaunizer settings for 3D mesh generation.

The high-level Sentaurus Structure Editor command to produce the corresponding Delaunizer section is shown here:

(sdesnmesh:delaunizer 
 	"maxPoints" 1e6 
 	"maxConnectivity" 100
 	"delaunayTolerance" 0. 
 	"sliverRemovalAlgorithm" 2
 	"storeDelaunayWeight" #t
)

6.3 Numeric Controls

Numeric controls can be split into different groups:

The first two groups are discussed here. The last two are discussed in Section 6.4 Linear Solvers and Section 6.5 Parallelization.

The full list of controls is shown in the sdevice_des.cmd file from the Applications_Library/GettingStarted/sdevice/3Ddiode_demo project.

6.3.1 General Controls

Math {
  
  Transient = BE
  eMobilityAveraging = ElementEdge
  hMobilityAveraging = ElementEdge
  
  CDensityMin = 0.
  ElementVolumeAvalanche
  AvalFlatElementExclusion = 1.
  
  ParallelToInterfaceInBoundaryLayer(FullLayer -ExternalBoundary)
  ComputeGradQuasiFermiAtContacts= UseQuasiFermi
  
  BM_ExtendedPrecision
  WeightedVoronoiBox
  
  AutoCNPMinStepFactor = 0
  AutoNPMinStepFactor = 0
  -PlotLoadable
  SimStats   
  
  ExitOnFailure
  ...
}

The general controls are:

6.3.2 Nonlinear Solver Controls

Math {
  ...
  Digits = 5
  ErrRef(electron) = 1e8
  ErrRef(hole) = 1e8
  
  Iterations = 10
  NotDamped = 100
  RHSMin = 1e-8
  EquilibriumSolution(Iterations=100)
  
  Extrapolate
  
  RefDens_eGradQuasiFermi_ElectricField = 1e8
  RefDens_hGradQuasiFermi_ElectricField = 1e8
}

For more details about the nonlinear Newton solver, see Section 9. Nonlinear Newton Solver. The most important controls are discussed here:

6.3.3 Incremental Step Strategy

While running a simulation on a real or virtual timescale, Sentaurus Device uses a dedicated strategy to control the timescale resolution. Two scaling factors are used for this, Increment and Decrement, as shown in Figure 7.

Figure 7. Schematics of the time-step incremental strategy.

It is recommended to use a smaller value for Increment than for Decrement, to avoid the time step jumping back and forth in the case of nonconvergence.

For transient simulations, the Increment and Decrement factor values are used exactly as specified in the Transient section:

Transient (
  InitialTime=0 FinalTime=100e-11
  InitialStep=1e-11 Increment= 1.4 Decrement= 2
  MaxStep=5e-11 MinStep=1e-15
){ Coupled{ Poisson Electron Hole } }

By default, the same value (2) is used for both Increment and Decrement.

For steady-state simulations, the effective Increment value also depends on the number of Newton iterations taken in the previous step: the fewer iterations the previous step needed, the more aggressively the step is increased.

Therefore, you might observe time-step values reported in the log file that differ from those specified in the command file.
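As an illustration of the virtual-timescale case, a minimal Quasistationary ramp using the same Increment and Decrement controls might look as follows (this is a sketch; the contact name "anode" and the 1 V goal are hypothetical):

Quasistationary (
  InitialStep=1e-3 Increment=1.3 Decrement=2
  MaxStep=0.05 MinStep=1e-6
  Goal { Name="anode" Voltage=1.0 }
){ Coupled{ Poisson Electron Hole } }

Here, the virtual time variable runs from 0 to 1, and the goal bias is reached when it equals 1.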

6.4 Linear Solvers

Different linear solvers are available in Sentaurus Device:

The keyword Method specifies which solver Sentaurus Device will use. By default, the SUPER linear solver is used. Table 1 summarizes different aspects of each linear solver.

Table 1. Linear solvers available in Sentaurus Device.
Solver | Type | Thread parallelism | Extended precision support | Recommended for | Speciality
Blocked | Not a real solver, but a strategy to solve complex matrices in mixed mode | no | – | Circuit and mixed-mode device simulations | Mixed-mode circuit and device simulations
SUPER | Direct, sparse matrices | no | yes | Medium 2D tasks, wide-bandgap devices | Robust, very accurate, no pivoting
PARDISO | Direct, sparse matrices | yes | yes* | Large 2D devices, small- and medium-sized 3D silicon devices | ~2 times faster than SUPER, well parallelized
ILS | Iterative, sparse matrices | yes | yes | Large 2D and 3D tasks, wide-bandgap devices | Adjustable behavior, slight superlinear parallel scaling

*Combining extended-precision arithmetic and parallelization degrades the performance of PARDISO because the fast BLAS3 matrix multiplication library (available only for standard-precision arithmetic) is not used in this case.

6.4.1 Blocked Decomposition Solver

Math {
  Method= Blocked
  SubMethod= ILS(set=1)
  ACMethod= Blocked
  ACSubMethod= ILS(set=2)
  PrintLinearSolver
}

Use the Blocked solver (Method=Blocked) when one or more numerically simulated 2D or 3D devices are combined within a circuit. This is called mixed-mode operation, which is described in Section 3. Mixed Mode.

The concept of block decomposition is illustrated in Figure 8. Here, two devices (Device 1 and Device 2) are simulated numerically and coupled using the circuit. The full matrix is decomposed into blocks, which include the sparse matrices of the numerically simulated devices ("1" and "2") coupled to the circuit matrix "c" using the "a" and "b" matrices. Each "1" and "2" matrix is solved using the method specified by the SubMethod keyword, and the circuit matrix "c" is solved by the internal direct solver.
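Schematically, and using the notation of Figure 8, the coupled linear system solved at each Newton step has the following block structure (a sketch; A1 and A2 denote the sparse device matrices, and xc the circuit unknowns):

\[
\begin{pmatrix}
A_1 & 0   & a_1 \\
0   & A_2 & a_2 \\
b_1 & b_2 & c
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_c \end{pmatrix}
=
\begin{pmatrix} r_1 \\ r_2 \\ r_c \end{pmatrix}
\]

The diagonal device blocks A1 and A2 are handled by the SubMethod solver, while the comparatively small circuit block c is handled by the internal direct solver.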

Figure 8. Schematics of the blocked decomposition solver. Linear systems for two numerically simulated devices are solved using the method specified by the SubMethod keyword. In the case of small-signal AC analysis, a different linear solver can be selected using the ACSubMethod command.

Use the option PrintLinearSolver of the Math section to obtain additional information regarding the linear solvers in the log file.

6.4.2 SUPER Linear Solver

To specify the SUPER direct linear solver for single-device simulations, use:

Math {
  Method= Super
}

To specify SUPER for mixed-mode device simulations, use:

Math {
  Method= Blocked
  SubMethod= Super
}

If no solver is specified in the Math section, the SUPER solver will be used. The default solver settings for SUPER are optimized to address most standard simulation needs. Modifying the direct solver settings is not recommended.

For direct linear solvers, the approximate memory footprint M(N) grows with the number of nodes N between ~ c x N x log(N) (ideal case) and ~ c x N^(3/2) (realistic cases). Correspondingly, the number of arithmetic operations A(N) grows with N between ~ c x N^(3/2) (ideal case) and ~ c x N^(5/2) (realistic cases). The proportionality constant c reflects specific properties of the mesh structure and equations (that is, the sparsity pattern, or portrait, of the linear system matrix).
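As a rough worked illustration of these scaling laws (using the realistic-case exponents above; the constant c cancels out), increasing the node count by a factor of 10 gives:

\[
% realistic-case exponents 3/2 and 5/2 from the scaling laws above
\frac{M(10N)}{M(N)} \approx 10^{3/2} \approx 32, \qquad
\frac{A(10N)}{A(N)} \approx 10^{5/2} \approx 316
\]

This is why direct solvers quickly become impractical as 3D meshes grow.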

For 2D tasks, especially wide-bandgap device simulations, SUPER provides the best overall solution performance. It is generally not suitable for 3D device simulations because it lacks parallelization and its memory consumption increases drastically with the node count.

6.4.3 PARDISO Linear Solver

To specify the Parallel Direct Solver (PARDISO) for single-device simulations, use:

Math {
  Method= ParDiSo[(options)]
}

To specify PARDISO for mixed-mode device simulations, use:

Math {
  Method= Blocked
  SubMethod= ParDiSo[(options)]
}

Unlike SUPER, PARDISO takes full advantage of shared-memory multiprocessor computers when run on multiple CPUs in parallel. The default solver settings for PARDISO are optimized to address most standard simulation needs.

Several customizations are available for PARDISO (see Table 2).

Table 2. Options available for PARDISO customization.
Option | Description | Default
NonsymmetricPermutation | Computes initial nonsymmetric matrix permutation and scaling, which places large matrix entries on the diagonal. | on
IterativeRefinement | Performs up to two iterative refinement steps to improve the accuracy of the solution. | off
MultipleRHS | Solves linear systems with multiple right-hand sides. This option applies to AC analysis only. | off
RecomputeNonsymmetricPermutation | Computes nonsymmetric matrix permutation and scaling before each factorization. | off

Using -NonsymmetricPermutation results in better speed but less accuracy and is not recommended.

Switching on IterativeRefinement and RecomputeNonsymmetricPermutation targets better accuracy, but typically slows down the system solve.
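For example, a sketch of an accuracy-oriented PARDISO configuration that combines these two options from Table 2 (whether the extra cost pays off depends on the task):

Math {
  Method= ParDiSo (IterativeRefinement RecomputeNonsymmetricPermutation)
}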

Running in single-CPU mode, PARDISO outperforms SUPER with respect to runtime and memory requirements, so you can consider it for standard 2D simulation tasks. For tasks that require higher accuracy (for example, wide-bandgap device simulations), these customization options might be required, which makes SUPER a more favorable choice.

With its nearly linear parallel scaling, PARDISO is also well suited for medium-sized 3D device simulations. You can expect a speedup factor of up to ~12 with PARDISO when running 3D simulations in parallel on a machine equipped with 16 or more CPUs.

6.4.4 ILS Linear Solver

To specify the Iterative Linear Solver (ILS) for single-device simulations, use:

Math {
  Method= ILS[(set=<n>)]
  [ILSrc= "set(<n>) {...};"]
}

To specify ILS for mixed-mode device simulations, use:

Math {
  Method= Blocked
  SubMethod= ILS[(set=<n>)]
  [ILSrc= "set(<n>) {...};"]
}

Here, <n> indicates the set number, which is referred to in the subsequent ILSrc statement.

Compared to the direct method, the iterative approach reduces memory requirements for large simulation tasks dramatically. Therefore, ILS is the preferred solver for device simulations where the mesh node count exceeds the limit suitable for direct linear solvers.

ILS operation consists of the following steps:

  1. Compute nonsymmetric ordering to improve the matrix condition.
  2. Determine symmetric ordering to reduce fill-in in the preconditioner.
  3. Create a preconditioner to accelerate convergence.
  4. Call an iterative method.

Each step is controlled by different parameters, which are combined within groups (sets). Some dedicated parameter sets, targeting different applications, are hardcoded and do not need an explicit set definition, as shown in Figure 9. To activate one of these sets, specify the set number, for example:

Math {
  Method= ILS(set=2)
}

Figure 9. Built-in ILS parameter sets and their parameter values.

In addition, you can define custom sets using the ILSrc keyword. Table 3 refers to several ILS sets, which were established for different applications. To use these sets, they must be defined explicitly using the ILSrc command in the Math section.

Table 3. ILS sets customized for different applications.
Application | ILSrc set
3D device off-state breakdown | ILSrc= "set (5) {
  iterative(gmres(150), tolrel=1e-10, maxit=300);
  preconditioning(ilut(1e-5,-1), left);
  ordering(symmetric=nd, nonsymmetric=mpsilst);
  options(verbose=0, refineresidual=10);
};"
SiC device simulations | ILSrc= "set (6) {
  iterative(gmres(100), tolrel=1e-10, tolunprec=1e-4, tolabs=0, maxit=200);
  preconditioning(ilut(1.5e-6,-1), left);
  ordering(symmetric=nd, nonsymmetric=mpsilst);
  options(compact=yes, linscale=0, refineresidual=10, verbose=0);
};"
III–N device simulations | ILSrc= "set (7) {
  iterative(gmres(100), tolrel=1e-12, tolunprec=1e-8, tolabs=0, maxit=200);
  preconditioning(ilut(1e-8,-1), left);
  ordering(symmetric=nd, nonsymmetric=mpsilst);
  options(compact=yes, linscale=0, refineresidual=10, verbose=0);
};"

The most relevant parameters for adjusting ILS behavior are:

6.5 Parallelization

Math {
  NumberOfThreads= <n>
  ParallelLicense (Wait)
  Wallclock
}

Sentaurus Device uses thread parallelism to accelerate simulations on shared-memory computers. Table 4 lists the parallelized components of Sentaurus Device.

Table 4. Tasks involved in parallelization in Sentaurus Device.
Task | Description
Box discretization | Computation of box method coefficients.
Matrix assembly | Computation of mobility, avalanche, current density, energy flux; assembly of Poisson, continuity, lattice, and temperature equations.
Linear system solve | PARDISO and ILS solvers contribute to thread parallelism.

The value of the keyword NumberOfThreads declares the number of threads to be allocated and used for parallelization. It must not exceed the physical number of CPUs available on a machine or a cluster node where the simulation is run.

Do not use NumberOfThreads=maximum, which allocates all CPUs on a machine. At least one CPU must remain free for system tasks; otherwise, performance degradation will be observed instead of a speedup.

The ParallelLicense (Wait) command instructs the simulator to wait until a parallel license becomes available.

To display wallclock times (the real time between process invocation and termination) rather than CPU times, use the Wallclock option.

One parallel license supports parallelization on four CPUs. Therefore, to run on 16 CPUs, for example, four parallel licenses are required.

To investigate different task contributions to the overall speedup, their parallelization can be independently enabled using the following commands:

Math {
  NumberOfAssemblyThreads = <n>
  NumberOfSolverThreads   = <n>
}

To see how parallelization contributes to a simulation speedup, run a simulation in single-thread mode first, and then rerun it in parallel mode with NumberOfThreads set to a value higher than one. Then, feed both log files to the sdevicestat stand-alone tool, which produces cumulative reports showing the accumulated time spent on each particular task, for example:

> sdevicestat n123_des.log
...
Total CPU-time        : 118.04 s
Rhs-time              : 46.92 % ( 55.39 s )
Jacobian-time         : 8.38 % ( 9.89 s )
Solve-time            : 43.63 % ( 51.50 s )
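The per-task speedups are then obtained by dividing the single-thread times by the corresponding times from the parallel run. For instance, using the Solve-time above (51.50 s) and a hypothetical 4-thread Solve-time of 16 s:

\[
% the 16 s value is a hypothetical parallel-run result used only for illustration
S_{\text{solver}} = \frac{T_{\text{solve}}(1\ \text{thread})}{T_{\text{solve}}(n\ \text{threads})} = \frac{51.5\ \text{s}}{16\ \text{s}} \approx 3.2
\]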

Figure 10 explains the meaning of the speedup estimations shown in Figure 11.

Figure 10. Explanation of fields shown in Figure 11:
– Wallclock Time: The real time a simulation took from invocation until termination.
– Solver Speedup: The speedup of the linear solver due to parallelization, compared to a single-CPU run (Solve-time in the log file).
– Assembly Speedup: The speedup of the matrix assembly (Rhs-time and Jacobian-time in the log file) due to parallelization, compared to a single-CPU run.

Figure 11 shows the comparison of 3D simulation results for a structure with ~1M mesh nodes, running in parallel on different numbers of CPUs using PARDISO and ILS.

Figure 11. Comparison of the simulation speedup factors while running large 3D device simulations on different numbers of CPUs using PARDISO and ILS.

To run Sentaurus Device in parallel mode inside Sentaurus Workbench, an automatic allocation feature is available that synchronizes the number of requested parallel slots between a simulation job and the scheduler.

To activate this feature, the following command must be placed in a user tooldb.tcl file:

set WB_tool(sdevice,parallel,mode) tool

For more details about this Sentaurus Workbench feature, see Section 2.3 Running Projects.

6.6 Understanding the Log File

When starting your 3D simulation for the first time, it is useful to look in the log file to better understand the details of the mesh on which you run your device simulations.

All the commands you specified in the Math section of the command file should be reflected in the log file:

Model: Math  Switched on
  Parameter: WeightedVoronoiBox = on
  Parameter: BM_ExtendedPrecision = on
  Parameter: AutoNPMinStepFactor = 0
  Parameter: AutoCNPMinStepFactor = 0
  Model: SimStats  Switched on
  Parameter: ElementVolumeAvalanche = on
  Parameter: AvalFlatElementExclusion = 1 deg
  Parameter: ComputeGradQuasiFermiAtContacts = UseQuasiFermi
  Parameter: RefDens_eGradQuasiFermi_ElectricField = 100000000 cm^-3
  Parameter: RefDens_hGradQuasiFermi_ElectricField = 100000000 cm^-3

At the beginning of the simulation, the BoxMethod analyzes the input grid and prints out detailed information about it:

           NumberOfVertices =  10935 
           NumberOfElements =  50688 
       NumberOfTetrahedrons =  50688  (100.0 %)
     NumberOfObtuseElements =  16896  (33.33 %)
NumberOfNonDelaunayElements =  0  ( 0.00 %)

The most important indication of 3D mesh quality is how well the generated mesh fulfills the Delaunay criterion. The key quantities are the DeltaVolume, which indicates the difference between the real physical device volume and the volume computed by the box method, and the number of non-Delaunay mesh elements, which is reported for each region of the device independently. The DeltaVolume of semiconductor regions and the number of non-Delaunay elements should both be as small as possible.

Figure 12. Sentaurus Device report on non-Delaunay mesh statistics.

For more details, see Section 7. Sentaurus Device at Runtime.
