chroma - Chroma is a high performance optical photon simulation for particle physics detectors

Age	Commit message (Collapse)	Author
2021-05-09	Remove unneeded Node.kind member from struct. Speeds up benchmark further, ↵	Stan Seibert
	but no improvement to actual simulation.
2021-05-09	BVH optimization to sort child nodes by area. Only has a small effect.	Stan Seibert

2021-05-09	Redo node format to include number of children, rather than just leaf bit.	Stan Seibert

2021-05-09	Speed up node intersection by 2.5x using tips from "Optimizing ray	Stan Seibert
	tracing for CUDA" by Hannu Saransaari. The intersect_box() function has been rewritten to be much shorter and use the min() and max() functions, which map directly to hardware instructions. Additionally, the calculations inside intersect_box() have been reorganized to allow the compiler to use the combined multiply-add instruction, instead of doing a subtraction followed by a division (which is way slower).
2021-05-09	Replace the use of pointers in the intersect_mesh with array indices.	Stan Seibert

2021-05-09	First steps toward an alternative BVH implementation.	Stan Seibert
	In this implementation, the BVH is represented in a different way. Unlike the standard implementation, this BVH is constrained to be very uniform. Every node has the same number of children, B, except the leaf layer, where each leaf node wraps exactly one triangle. All leaves are at the same depth, and dummy nodes are inserted into the tree in order to meet the requirement that all nodes have B children. By requiring each triangle be wrapped by a bounding box, we increase the number of nodes required to represent a geometry quite dramatically. To compensate, we cut the storage requirement per node by storing the following per-node properties: 1) 6 * 16 bits: x, y, z upper and lower bounds in a 16-bit fixed-point representation. The Geometry struct contains the global offset and scale factor required to transform the fixed point values back into floats. The bounding values are rounded to ensure they still fully contain the triangle vertices, even though the vertices are represented with greater precision than the bounding boxes. 2) 1 bit: Leaf bit. If not set, this node has child nodes, otherwise this node is a leaf node wrapping a single triangle. 3) 31 bits: The index of the first child of this node in the node list. All other children of the node immediately follow the first child, and there are always exactly B of them. If the leaf bit is set, then the child index refers to the triangle list, rather than the node list. A dummy node is defined to have the x lower bound = x upper bound = 0. All dummy children of a node will be consecutive at the end of the list of children, so once the first dummy is encountered, the remaining children can be skipped. To build this BVH, we use the GPU for assistance. One kernel computes the boundaries of the leaf nodes and the Morton codes from the triangle list. Since GPU storage is severely strained for large detectors, we then pull back these arrays to the CPU, sort the leaf nodes in Morton-order, delete the GPU arrays, and then allocate the full node list on the GPU. The BVH is then built in-place on the GPU layer by layer, moving bottom to top. The BVH for LBNE can be constructed in about 20 seconds in this fashion, which is actually faster than the standard BVH can be deserialized from disk. Unfortunately, the benchmark geometry requires so many nodes (due to one leaf per triangle), that the BVH leaves almost no room on the GPU for the actual triangles or vertices. As a result, the current code actually stores these arrays on the CPU and maps them into the GPU address space. That sounds kind of crazy, but given the bounding of each triangle in a box, the number of triangles that actually need to be fetched over the PCI-E bus is fairly low. For a branching factor of 3, the BVH needs to test ~400 nodes and only 3-7 triangles to project a single ray. If the off-board storage of the triangles and vertices can be preserved, then it might be possible in the future to squeeze LBNE onto a standard 1.5 GB GTX 580. The current mesh intersection code has some severe performance issues at the moment. It is about 5-10x slower than the standard BVH code for reasons that are not immediately clear, although there are plenty of candidates. Some investigation is in order, but I will probably just skip right over to the ultimate goal of this effort: parallel BVH search. Instead of assigning one ray per thread, I would like to assign one ray per 32-thread block, and use the 32-threads to cooperatively search the BVH. This requires a BVH with branch factor of 32, and a very regular layout. That wide of a tree requires more nodes to be evaluated (~1200 instead of 400), but will hopefully have more streamlined memory access patterns and less warp divergence. It is not clear whether this experiment will actually speed things up...
2011-09-16	Move CUDA source inside chroma package, rename tests directory to test	Stan Seibert