path: root/chroma/cuda
2021-05-09  Fix warning about redefinition of INFINITY.  [Stan Seibert]
2021-05-09  Fix bug in intersection calculation that prevented axis-aligned photons from hitting triangles.  [Stan Seibert]
2021-05-09  Mark one of the propagate functions as __noinline__ to work around a bug where cicc hangs during compilation.  [Stan Seibert]
2021-05-09  Remove unused function to read a word bypassing the L1 cache. It seems to use incorrect asm syntax.  [Stan Seibert]
2021-05-09  Fix interpolation for a uniformly sampled function.  [Anthony LaTorre]
2021-05-09  Add GPU interpolation functions; fix the sample_cdf functions, which were interpolating on the wrong side of the interval between two points.  [Anthony LaTorre]
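The sample_cdf fix above concerns which side of the bracketing interval the linear interpolation uses. A minimal CPU sketch of inverse-CDF sampling with the interpolation done inside the interval that actually contains `u` (function and parameter names are illustrative, not Chroma's):

```c
#include <assert.h>
#include <stddef.h>

/* Draw a value from a tabulated CDF: binary-search for the first
   entry with cdf[i] >= u, then interpolate linearly inside the
   bracketing interval [i-1, i]. */
static float sample_cdf(float u, const float *x, const float *cdf, size_t n)
{
    size_t lo = 0, hi = n - 1;
    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (cdf[mid] < u) lo = mid + 1; else hi = mid;
    }
    if (lo == 0) return x[0];
    /* interpolate on the correct side: between points lo-1 and lo */
    float t = (u - cdf[lo - 1]) / (cdf[lo] - cdf[lo - 1]);
    return x[lo - 1] + t * (x[lo] - x[lo - 1]);
}
```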
2021-05-09  Add an abs() function for float3s.  [Anthony LaTorre]
2021-05-09  rotate() does not need to create a matrix; update the test and remove a superfluous function from the last commit.  [Anthony LaTorre]
2021-05-09  Add a norm() function to get the length of vectors, and shorten the normalize() function.  [Anthony LaTorre]
2021-05-09  Remove some placeholder DAQ properties.  [Stan Seibert]
2021-05-09  Diffuse reflection is Lambertian, not isotropic.  [Stan Seibert]
2021-05-09  Photons tagged with NAN_ABORT should not continue to be propagated.  [Stan Seibert]
2021-05-09  Change surface re-emission simulation to not use the diffuse reflection function.  [Stan Seibert]
    This allows the photon to reemit on either side of the surface, and also removes a spurious diffuse reflection bit in the history.
2021-05-09  Refactor the saving and loading of packed BVH nodes to fully abstract away the split storage.  [Stan Seibert]
    Also fixes a bug discovered by Mike Jewell that crashed BVH creation after the last commit.
2021-05-09  GPU geometry modification to permit the BVH node storage to be split between GPU and CPU.  [Stan Seibert]
    This allows much more complex geometries to be run on CUDA devices with less memory. The GPUGeometry object now takes a min_free_gpu_mem parameter giving the minimum number of bytes that must remain free on the GPU after the BVH is loaded; by default, this number is 300 MB. Cards with sufficient memory will have the entire BVH on card, but those without enough memory will have the BVH split such that the top of the hierarchy (the most frequently traversed part) is on the GPU.
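The split decision described above amounts to a simple capacity calculation. A hypothetical sketch (the function, parameters, and byte counts here are illustrative; only min_free_gpu_mem and the 300 MB default come from the commit):

```c
#include <assert.h>
#include <stddef.h>

/* Given the free GPU bytes, the minimum that must stay free after
   loading, and the per-node size, compute how many nodes from the
   top of the breadth-first BVH node list fit on the GPU; the rest
   of the hierarchy stays in host memory. */
static size_t nodes_on_gpu(size_t free_bytes, size_t min_free_bytes,
                           size_t node_bytes, size_t total_nodes)
{
    if (free_bytes <= min_free_bytes)
        return 0;                       /* nothing fits on card */
    size_t fit = (free_bytes - min_free_bytes) / node_bytes;
    return fit < total_nodes ? fit : total_nodes;
}
```

Putting the top of the tree on the GPU is the right split because upper nodes are traversed by nearly every ray, while deep nodes are touched rarely.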
2021-05-09  Make bulk reemission isotropic.  [Andy Mastbaum]
2021-05-09  Add simple bulk reemission.  [Andy Mastbaum]
    The ``Material`` struct now includes two new arrays: ``reemission_prob`` and ``reemission_cdf``. The former is sampled only when a photon is absorbed, and should be normalized accordingly. The latter defines the distribution from which the reemitted photon wavelength is drawn. This process changes the photon wavelength in place, and is not capable of producing multiple secondaries. It also does not enforce energy conservation; the reemission spectrum is not itself wavelength-dependent.
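The two-array scheme described above can be sketched on the CPU as follows. This is a hedged stand-in for the CUDA kernel logic: the function name, signature, and uniform random inputs are illustrative, while the roles of ``reemission_prob`` and ``reemission_cdf`` follow the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* On absorption: u_prob decides reemission against the sampled
   reemission probability; on success, the new wavelength is drawn
   from the tabulated CDF by linear inverse-CDF sampling and the
   photon's wavelength is changed in place. Returns 1 if the photon
   was reemitted, 0 if it stays absorbed. */
static int maybe_reemit(float u_prob, float reemission_prob,
                        float u_wl, const float *wl,
                        const float *reemission_cdf, size_t n,
                        float *wavelength)
{
    if (u_prob >= reemission_prob)
        return 0;                        /* photon stays absorbed */
    size_t i = 1;                        /* find bracketing interval */
    while (i < n - 1 && reemission_cdf[i] < u_wl)
        i++;
    float t = (u_wl - reemission_cdf[i - 1])
            / (reemission_cdf[i] - reemission_cdf[i - 1]);
    *wavelength = wl[i - 1] + t * (wl[i] - wl[i - 1]);
    return 1;                            /* single photon, no secondaries */
}
```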
2021-05-09  Simplify surface models.  [Andy Mastbaum]
    Remove the ``SURFACE_SPECULAR`` and ``SURFACE_DIFFUSE`` models, since their functionality is available using the more general ``SURFACE_DEFAULT``. Also allow the user to specify the reflection type (specular/diffuse) for the complex and WLS models, and change WLS so the normalization of properties is more consistent with the default.
2021-05-09  Fixes and tweaks for surface models.  [Andy Mastbaum]
    All surface models, including ``SURFACE_COMPLEX`` and ``SURFACE_WLS``, are now working. Note that WLS won't work right in hybrid rendering mode, since that mode relies on matching up incoming and outgoing photon wavelengths in a lookup table.
2021-05-09  Update Python-side GPU structs to reflect CUDA changes.  [Andy Mastbaum]
    This fixes hybrid rendering mode.
2021-05-09  Add surface model documentation.  [Andy Mastbaum]
2021-05-09  Generalize surface models and add a thin-film model.  [Andy Mastbaum]
    Reduce the models to the following:
        SURFACE_DEFAULT,  // specular + diffuse + absorption + detection
        SURFACE_SPECULAR, // perfect specular reflector
        SURFACE_DIFFUSE,  // perfect diffuse reflector
        SURFACE_COMPLEX,  // use complex index of refraction
        SURFACE_WLS       // wavelength-shifting reemission
    SURFACE_COMPLEX uses the complex index of refraction (`eta' and `k') to compute reflection, absorption, and transmission. This model comes from the SNO+ RAT PMT optical model.
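For intuition about what a complex index of refraction buys, here is the textbook normal-incidence Fresnel reflectance off a medium with complex index eta - i*k, viewed from a medium with real index n1. This is a hedged illustration of the physics, not Chroma's SURFACE_COMPLEX code, which must handle arbitrary incidence angles and transmission/absorption splitting:

```c
#include <assert.h>

/* Normal-incidence reflectance: R = ((n1-eta)^2 + k^2)
                                    / ((n1+eta)^2 + k^2).
   A nonzero extinction coefficient k raises the reflectance and
   makes the surface absorbing, which is why eta and k together
   determine reflection, absorption, and transmission. */
static float reflectance_normal(float n1, float eta, float k)
{
    float num = (n1 - eta) * (n1 - eta) + k * k;
    float den = (n1 + eta) * (n1 + eta) + k * k;
    return num / den;
}
```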
2021-05-09  Towards a more flexible surface model.  [Andy Mastbaum]
    Surfaces now have an associated model which defines how photons are propagated. Currently, these include specular, diffuse, mirror, photocathode (not implemented), and TPB. The default is the old behavior, where surfaces do some weighted combination of detection, absorption, and specular and diffuse reflection. `struct Surface` contains as members the superset of all model parameters; not all are used by all models. Documentation (forthcoming) will make clear what each model looks at.
2021-05-09  Remove the unneeded Node.kind member from the struct. Speeds up the benchmark further, but gives no improvement to the actual simulation.  [Stan Seibert]
2021-05-09  Add an argsort_direction() function to chroma.tools and use it to group photons so that they take similar paths on the GPU.  [Stan Seibert]
    argsort_direction() Morton-orders an array of normalized direction vectors according to their spherical coordinates. Photons sorted in this way tend to follow similar paths through a detector geometry, which enhances cache locality. As a result, get_node() uses the GPU L1 cache again, with good results.
2021-05-09  For paranoia reasons, add some padding on the low corner of the BVH leaf nodes.  [Stan Seibert]
2021-05-09  BVH optimization to sort child nodes by area. Only has a small effect.  [Stan Seibert]
2021-05-09  Collapse chains of BVH nodes with single children.  [Stan Seibert]
2021-05-09  New BVH algorithm: Recursive Grid.  [Stan Seibert]
    This is an adaptation of the original Chroma BVH construction algorithm. The generation stage is very slow, but can be fixed.
2021-05-09  Bugfixes to BVH traversal and generation code.  [Stan Seibert]
2021-05-09  Redo node format to include the number of children, rather than just a leaf bit.  [Stan Seibert]
2021-05-09  Skip the L1 cache when loading nodes.  [Stan Seibert]
    Node access is very irregular as each thread descends the BVH tree. Each node is only 16 bytes, so the 128-byte cache line size of the L1 cache means that a lot of useless data is often fetched. Using some embedded PTX, we can force the L1 cache to be skipped, going directly to L2. The L2 cache line is 32 bytes long, which means that both children in a binary tree will be cached at the same time. This improves the speed on the default generated binary trees, but does not help an optimized tree yet.
2021-05-09  Fix the rendering code to use the new geometry and BVH definition.  [Stan Seibert]
2021-05-09  Speed up node intersection by 2.5x using tips from "Optimizing ray tracing for CUDA" by Hannu Saransaari.  [Stan Seibert]
    The intersect_box() function has been rewritten to be much shorter and to use the min() and max() functions, which map directly to hardware instructions. Additionally, the calculations inside intersect_box() have been reorganized to allow the compiler to use the combined multiply-add instruction, instead of doing a subtraction followed by a division (which is way slower).
2021-05-09  Add more BVH manipulation commands:  [Stan Seibert]
    * chroma-bvh create [name] [degree] - Creates a new BVH with the specified branching degree.
    * chroma-bvh node_swap [name] [layer] - Optimizes a BVH layer with a "greedy, short-sighted" algorithm that swaps around nodes to minimize the surface area of the immediate parent layer. Rebuilds the tree above the modified layer when finished.
    Also modified the chroma-bvh stat command to print the sum of the logarithms of the areas of each layer. It seems to be a rough predictor of the simulation speed of the BVH.
2021-05-09  Simple BVH generator using new infrastructure.  [Stan Seibert]
2021-05-09  Pull the BVH calculation out of the GPUGeometry class and make the world scale factor the same in all dimensions, so that distance and area calculations are easy to do.  [Stan Seibert]
2021-05-09  Replace the use of pointers in intersect_mesh with array indices.  [Stan Seibert]
2021-05-09  First steps toward an alternative BVH implementation.  [Stan Seibert]
    In this implementation, the BVH is represented in a different way. Unlike the standard implementation, this BVH is constrained to be very uniform. Every node has the same number of children, B, except the leaf layer, where each leaf node wraps exactly one triangle. All leaves are at the same depth, and dummy nodes are inserted into the tree in order to meet the requirement that all nodes have B children.
    By requiring each triangle be wrapped by a bounding box, we increase the number of nodes required to represent a geometry quite dramatically. To compensate, we cut the storage requirement per node by storing the following per-node properties:
    1) 6 * 16 bits: x, y, z upper and lower bounds in a 16-bit fixed-point representation. The Geometry struct contains the global offset and scale factor required to transform the fixed-point values back into floats. The bounding values are rounded to ensure they still fully contain the triangle vertices, even though the vertices are represented with greater precision than the bounding boxes.
    2) 1 bit: Leaf bit. If not set, this node has child nodes; otherwise this node is a leaf node wrapping a single triangle.
    3) 31 bits: The index of the first child of this node in the node list. All other children of the node immediately follow the first child, and there are always exactly B of them. If the leaf bit is set, then the child index refers to the triangle list, rather than the node list.
    A dummy node is defined to have the x lower bound = x upper bound = 0. All dummy children of a node will be consecutive at the end of the list of children, so once the first dummy is encountered, the remaining children can be skipped.
    To build this BVH, we use the GPU for assistance. One kernel computes the boundaries of the leaf nodes and the Morton codes from the triangle list. Since GPU storage is severely strained for large detectors, we then pull back these arrays to the CPU, sort the leaf nodes in Morton order, delete the GPU arrays, and then allocate the full node list on the GPU. The BVH is then built in place on the GPU layer by layer, moving bottom to top. The BVH for LBNE can be constructed in about 20 seconds in this fashion, which is actually faster than the standard BVH can be deserialized from disk.
    Unfortunately, the benchmark geometry requires so many nodes (due to one leaf per triangle) that the BVH leaves almost no room on the GPU for the actual triangles or vertices. As a result, the current code actually stores these arrays on the CPU and *maps them into the GPU address space*. That sounds kind of crazy, but given the bounding of each triangle in a box, the number of triangles that actually need to be fetched over the PCI-E bus is fairly low. For a branching factor of 3, the BVH needs to test ~400 nodes and only 3-7 triangles to project a single ray. If the off-board storage of the triangles and vertices can be preserved, then it might be possible in the future to squeeze LBNE onto a standard 1.5 GB GTX 580.
    The current mesh intersection code has some severe performance issues at the moment. It is about 5-10x slower than the standard BVH code for reasons that are not immediately clear, although there are plenty of candidates. Some investigation is in order, but I will probably just skip right over to the ultimate goal of this effort: parallel BVH search. Instead of assigning one ray per thread, I would like to assign one ray per 32-thread block, and use the 32 threads to cooperatively search the BVH. This requires a BVH with branch factor of 32, and a very regular layout. That wide of a tree requires more nodes to be evaluated (~1200 instead of 400), but will hopefully have more streamlined memory access patterns and less warp divergence. It is not clear whether this experiment will actually speed things up...
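The 16-byte node layout enumerated above (six 16-bit fixed-point bounds, a leaf bit, and a 31-bit first-child index) can be sketched directly in C. Field and helper names are illustrative; only the bit budget comes from the commit:

```c
#include <assert.h>
#include <stdint.h>

/* One packed BVH node: 6 * 16 bits of bounds + 32 bits of
   leaf-flag/index = 16 bytes total. */
typedef struct {
    uint16_t xlo, xhi, ylo, yhi, zlo, zhi;
    uint32_t child;  /* bit 31 = leaf flag, bits 0-30 = index */
} PackedNode;

#define LEAF_BIT 0x80000000u

static uint32_t node_index(PackedNode n) { return n.child & 0x7FFFFFFFu; }
static int node_is_leaf(PackedNode n)    { return (n.child & LEAF_BIT) != 0; }

/* dummy nodes: x lower bound == x upper bound == 0 */
static int node_is_dummy(PackedNode n)   { return n.xlo == 0 && n.xhi == 0; }

/* fixed-point -> world coordinate, using the global offset and
   scale factor the Geometry struct is said to carry */
static float unquantize(uint16_t q, float offset, float scale)
{
    return offset + (float)q * scale;
}
```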
2021-05-09  Lower the weight threshold cutoff to propagate even more improbable photons.  [Stan Seibert]
2021-05-09  Split photon propagation in the likelihood calculation into a "forced scatter" and a "forced no-scatter" pass.  [Stan Seibert]
    Since we want to include contributions from two populations of weighted photons, we have to break up the DAQ simulation into three functions:
    * begin_acquire()
    * acquire()
    * end_acquire()
    The first function resets the channel states, the second function accumulates photoelectrons (and can be called multiple times), and the last function returns the hit information. A global weight has also been added to the DAQ simulation in case a particular set of weighted photons needs to have an overall penalty.
    The forced scattering pass can be repeated many times on the same photons (with the photons individually deweighted to compensate). This reduces the variance on the final likelihoods quite a bit.
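The three-phase accumulation pattern above can be illustrated with a plain-C stand-in for the GPU implementation. Everything here (the struct, the fixed four channels, per-photon weight arrays) is a simplifying assumption; only the three function names, the multiple-call semantics of acquire(), and the global weight come from the commit:

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float pe[4]; } ChannelState;  /* 4 channels for brevity */

/* Phase 1: reset channel states. */
static void begin_acquire(ChannelState *s)
{
    for (int i = 0; i < 4; i++) s->pe[i] = 0.0f;
}

/* Phase 2: accumulate photoelectrons; callable once per photon
   population, each with its own global weight (overall penalty). */
static void acquire(ChannelState *s, const int *chan,
                    const float *weight, size_t n, float global_weight)
{
    for (size_t i = 0; i < n; i++)
        s->pe[chan[i]] += weight[i] * global_weight;
}

/* Phase 3: read out the accumulated hit information. */
static float end_acquire(const ChannelState *s, int channel)
{
    return s->pe[channel];
}
```

Splitting the reset out of the accumulation step is what lets the forced-scatter and forced-no-scatter populations land in the same channel totals.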
2021-05-09  Use the sincosf() function when computing both the sin() and cos() of the same angle, for greater efficiency.  [Stan Seibert]
2021-05-09  Allow for the option of forced scattering or forced non-scattering on the first step during photon propagation.  [Stan Seibert]
2021-05-09  Add an optional weight to each photon being propagated.  [Stan Seibert]
    For consistency, weights must be less than or equal to one at all times. When weight calculation is enabled by the likelihood calculator, photons are prevented from being absorbed, and instead their weight is decreased to reflect their probability of survival. Once the photon's weight drops below a given threshold (0.01 for now), the weighting procedure is stopped and the photon can be extinguished.
    With weighting enabled, the detection efficiency of surfaces is also applied to the weight, and the photon is terminated with the DETECT bit set in the history. This is not completely accurate, as a photon could pass through the surface, reflect, and reintersect the surface later (or some other PMT) and be detected. As a result, weighting will slightly underestimate PMT efficiency compared to the true Monte Carlo. This is not intrinsic to the weighting procedure, but only comes about because of the inclusion of detection efficiency in the weight.
    Without the detection efficiency included, weighting cuts in half the number of evaluations required to achieve a given likelihood uncertainty (at least for the hit probabilities). Add in the detection efficiency, and that factor becomes 1/5 or 1/6!
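The core of the weighting procedure described above is a one-line substitution: instead of sampling absorption, multiply the weight by the survival probability, and stop weighting once the weight falls below the 0.01 threshold the commit mentions. A minimal sketch, with illustrative names:

```c
#include <assert.h>

#define WEIGHT_LOWER_THRESHOLD 0.01f  /* "0.01 for now" per the commit */

/* Returns the photon's new weight. *keep_weighting is cleared once
   the weight drops below threshold, after which the photon may be
   extinguished by ordinary sampling. */
static float propagate_weight(float weight, float survival_prob,
                              int *keep_weighting)
{
    weight *= survival_prob;  /* absorption becomes a deweight */
    *keep_weighting = (weight >= WEIGHT_LOWER_THRESHOLD);
    return weight;
}
```

Because weights only ever shrink (survival probabilities are at most one), the invariant that weights stay less than or equal to one holds automatically.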
2021-05-09  Major overhaul to the way that DAQ and PDF information is accumulated, to speed up likelihood evaluation.  [Stan Seibert]
    When generating a likelihood, the DAQ can be run many times in parallel by the GPU, creating a large block of channel hit information in memory. The PDF accumulator processes that entire block in two passes:
    * First, update the channel hit count and the count of channel hits falling into the bin around the channel hit time being evaluated. Add any channel hits that should also be included in the n-th nearest neighbor calculation to a channel-specific work queue.
    * Process all the work queues for each channel and update the list of nearest neighbors.
    This is hugely faster than what we were doing before. Kernel estimation (or some kind of orthogonal function expansion of the PDF) should be better ultimately, but for now the nearest neighbor approach to PDF estimation seems to be working the best.
2021-05-09  Add a select function to GPUPhotons to extract a reduced list of photons that all have a particular interaction process code set. Handy for selecting just the detected photons from a large list of photons.  [Stan Seibert]
2021-05-09  Minor fixes to doing time & charge PDFs via kernel estimation. Things still look off, but this is an improvement.  [Stan Seibert]
2021-05-09  Fix returning from the PDF accumulator too soon.  [Stan Seibert]
2011-10-07  Create a Detector class to hold information about the PMTs in a geometry, like the mapping from solid IDs to channels, and the time and charge distributions.  [Stan Seibert]
    Detector is a subclass of Geometry, so that a Detector can be used wherever a Geometry is used. Only code (like the DAQ stuff) that needs to know how PMT solids map to channels should look for a Detector object. There is a corresponding GPUDetector class as well, with its own device-side struct to hold PMT channel information. The GPU code can now sample an arbitrary time and charge PDF, but on the host side, the only interface exposed right now creates a Gaussian distribution.
2011-10-05  Epic port of Chroma from units of meters/seconds/MeV to millimeters/nanoseconds/MeV in order to match GEANT4, and also to avoid huge discrepancies in magnitude caused by values like 10e-9 sec.  [Stan Seibert]
    Along the way, cleaned up a few things:
    * Switch the PI and SPEED_OF_LIGHT constants from double to single precision. This avoids some unnecessary double-precision calculations in the GPU code.
    * Fixed a silly problem in the definition of the spherical spiral. Now the demo detector looks totally awesome. Also wrapped it in a black surface. The demo detector now has 10055 PMTs.
    * Updated the test_ray_intersection data file to reflect the new units.
    * Fix a missing import in chroma.gpu.tools.