Age | Commit message (Collapse) | Author |
|
background.
|
|
|
|
|
|
|
|
about 60-90 seconds.
|
|
This is an adaptation of the original Chroma BVH construction algorithm.
The generation stage is very slow, but can be fixed.
|
|
old location in newer pycuda releases.
|
|
parent nodes if combining them would result in a parent node that is
excessively large compared to the surface area of the children.
This doesn't help as much as you might imagine.
|
|
|
|
|
|
|
|
areas of the BVH nodes in a particular layer of the tree.
|
|
|
|
Node access is very irregular as each thread descends the BVH tree.
Each node is only 16 bytes, so the 128 byte cache line size in the L1
cache means that a lot of useless data is often fetched. Using some
embedded PTX, we can force the L1 cache to be skipped, going directly
to L2. The L2 cache line is 32 bytes long, which means that both
children in a binary tree will be cached at the same time.
This improves the speed on the default generated binary trees, but
does not help an optimized tree yet.
|
|
|
|
|
|
using the Camera class.
|
|
|
|
tracing for CUDA" by Hannu Saransaari.
The intersect_box() function has been rewritten to be much shorter and
use the min() and max() functions, which map directly to hardware
instructions. Additionally, the calculations inside intersect_box()
have been reorganized to allow the compiler to use the combined
multiply-add instruction, instead of doing a subtraction followed by a
division (which is way slower).
|
|
consistency everywhere.
|
|
minimization
|
|
* chroma-bvh create [name] [degree] - Creates a new BVH with the specified
branching degree.
* chroma-bvh node_swap [name] [layer] - Optimizes a BVH layer with a
"greedy, short-sighted" algorithm that swaps around nodes to minimize
the surface area of the immediate parent layer. Rebuilds the tree
above the modified layer when finished.
Also modified the chroma-bvh stat command to print the sum of the
logarithms of the areas of each layer. It seems to be a rough
predictor of the simulation speed of the BVH.
|
|
|
|
|
|
|
|
Geometry.
|
|
Note that rendering is still broken by the new BVH format.
|
|
searching through files, named geometries in the cache, and geometry
creation functions.
The loader function also is responsible for fetching or creating a BVH
to go with the geometry.
This commit also removes some code that has been replaced by the new
system. Other bits will come back in future commits.
|
|
|
|
|
|
|
|
|
|
These classes will be the data types used by the new BVH generation code.
|
|
|
|
The storage format is changing relative to the old format, so all
geometry files will be saved in the ~/.chroma/geo directory. For now,
the format is just a simple pickle. We know this is not optimal for
space or speed, but the Geometry class will be changing soon, and we
can optimize further after that.
This Cache class will also soon manage the separate BVH cache.
|
|
|
|
factor the same in all dimensions so that distance and area calculations are easy to do.
|
|
algorithms.
|
|
|
|
|
|
In this implementation, the BVH is represented in a different way.
Unlike the standard implementation, this BVH is constrained to be very
uniform. Every node has the same number of children, B, except the
leaf layer, where each leaf node wraps exactly one triangle. All
leaves are at the same depth, and dummy nodes are inserted into the
tree in order to meet the requirement that all nodes have B children.
By requiring each triangle be wrapped by a bounding box, we increase
the number of nodes required to represent a geometry quite
dramatically. To compensate, we cut the storage requirement per node
by storing the following per-node properties:
1) 6 * 16 bits: x, y, z upper and lower bounds in a 16-bit fixed-point
representation. The Geometry struct contains the global offset and
scale factor required to transform the fixed point values back into
floats. The bounding values are rounded to ensure they still fully
contain the triangle vertices, even though the vertices are
represented with greater precision than the bounding boxes.
2) 1 bit: Leaf bit. If not set, this node has child nodes, otherwise
this node is a leaf node wrapping a single triangle.
3) 31 bits: The index of the first child of this node in the node
list. All other children of the node immediately follow the first
child, and there are always exactly B of them. If the leaf bit is
set, then the child index refers to the triangle list, rather than the
node list.
A dummy node is defined to have the x lower bound = x upper bound = 0.
All dummy children of a node will be consecutive at the end of the
list of children, so once the first dummy is encountered, the
remaining children can be skipped.
To build this BVH, we use the GPU for assistance. One kernel computes
the boundaries of the leaf nodes and the Morton codes from the
triangle list. Since GPU storage is severely strained for large
detectors, we then pull back these arrays to the CPU, sort the leaf
nodes in Morton-order, delete the GPU arrays, and then allocate the
full node list on the GPU. The BVH is then built in-place on the GPU
layer by layer, moving bottom to top.
The BVH for LBNE can be constructed in about 20 seconds in this
fashion, which is actually faster than the standard BVH can be
deserialized from disk.
Unfortunately, the benchmark geometry requires so many nodes (due to
one leaf per triangle), that the BVH leaves almost no room on the GPU
for the actual triangles or vertices. As a result, the current code
actually stores these arrays on the CPU and *maps them into the GPU
address space*. That sounds kind of crazy, but given the bounding of
each triangle in a box, the number of triangles that actually need to
be fetched over the PCI-E bus is fairly low. For a branching factor
of 3, the BVH needs to test ~400 nodes and only 3-7 triangles to
project a single ray. If the off-board storage of the triangles and
vertices can be preserved, then it might be possible in the future to
squeeze LBNE onto a standard 1.5 GB GTX 580.
The current mesh intersection code has some severe performance issues
at the moment. It is about 5-10x slower than the standard BVH code
for reasons that are not immediately clear, although there are plenty
of candidates.
Some investigation is in order, but I will probably just skip right
over to the ultimate goal of this effort: parallel BVH search.
Instead of assigning one ray per thread, I would like to assign one
ray per 32-thread block, and use the 32-threads to cooperatively
search the BVH. This requires a BVH with branch factor of 32, and a
very regular layout. That wide of a tree requires more nodes to be
evaluated (~1200 instead of 400), but will hopefully have more
streamlined memory access patterns and less warp divergence.
It is not clear whether this experiment will actually speed things
up...
|
|
|
|
global import of RootReader in the chroma.camera module.
Bringing in ROOT adds 14 seconds to the import time, which really
sucks if you don't need it.
|
|
function. This makes it easier to experiment with external BVH construction implementations, which only need the flat list of triangles.
|
|
|
|
|
|
|
|
|
|
|
|
|